Open-source · Apache-2.0 · Self-hostable
Point. Pick. Extract.
No fabricated values.
Visual field-picker. Human-in-the-loop review. Every extracted value links back to its source instead of being hallucinated by an LLM. Self-host it on your own infrastructure or use the hosted version for free.
50 documents/month on the free tier. No credit card. No data leaves your environment when you self-host.
Template fields
- invoice_number · text · 1.00
- invoice_date · date · 0.99
- due_date · date · 0.96
- vendor_name · text · 1.00
- bill_to · text · 0.98
- subtotal · currency · 0.99
- tax · currency · 0.97
- total · currency · 0.97
Every value links back to a region in the source — and to the model + version that read it.
How it works
Three steps. One promise: nothing fabricated.
- 01
Define a template
Upload a sample document. Draw bounding boxes around the fields you want. Name them, set types (text, number, date, currency, table). Save the template.
- 02
Run a batch
Upload many similar documents, such as invoices, intake forms, or receipts. The engine runs each one through the OCR cascade, finds the values, and grounds them to source regions.
- 03
Review uncertainties
Low-confidence fields surface in a review queue with the highlighted source region. Approve or correct. Export the rest as CSV or JSON — every value with its provenance.
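The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not DocuExtract's actual API: the names `FieldSpec`, `Extracted`, `split_for_review`, and `export_json`, the bounding-box convention (page fractions), and the `0.98` review threshold are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # x0, y0, x1, y1 as page fractions (assumed convention)


@dataclass(frozen=True)
class FieldSpec:
    """One template field: a name, a type, and the box drawn on the sample."""
    name: str
    type: str  # text, number, date, currency, or table
    bbox: BBox


@dataclass(frozen=True)
class Extracted:
    """One extracted value together with where it was read from."""
    field: str
    value: str
    confidence: float
    page: int
    bbox: BBox


REVIEW_THRESHOLD = 0.98  # hypothetical cut-off, not a documented default


def split_for_review(results: List[Extracted], threshold: float = REVIEW_THRESHOLD):
    """Return (auto_approved, needs_review) based on the confidence score."""
    approved = [r for r in results if r.confidence >= threshold]
    review = [r for r in results if r.confidence < threshold]
    return approved, review


def export_json(results: List[Extracted]) -> str:
    """Serialize values with their provenance (page and source region)."""
    return json.dumps([asdict(r) for r in results], indent=2)
```

With a threshold like this, a field read at 1.00 would export directly, while fields read at 0.96 or 0.97 would wait in the review queue until someone approves or corrects them.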
Why it's different
What the document actually says. Not what the model guessed.
Most extraction tools either hallucinate confident-but-wrong values, charge enterprise prices for what should be commodity work, or ship as a black box with no self-host option. DocuExtract is the opposite of all three.
Verbatim grounding
Every extracted value must trace back to a source span in the original document. Values that can’t be grounded are dropped or routed to review — never fabricated by the model.
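As a rough sketch of what a grounding check can look like, the Python below accepts a candidate value only if it appears verbatim in the source text, and routes everything else to review. The function names and the exact-substring rule are assumptions for illustration; the real engine may match against OCR regions rather than a flat string.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Grounded:
    """A value plus the character span it was found at in the source text."""
    value: str
    start: int
    end: int


def ground(value: str, source_text: str) -> Optional[Grounded]:
    """Return the source span for value, or None if it is not in the source."""
    value = value.strip()
    if not value:
        return None
    start = source_text.find(value)
    if start == -1:
        return None  # not present verbatim: never accept it as-is
    return Grounded(value, start, start + len(value))


def filter_grounded(
    candidates: Dict[str, str], source_text: str
) -> Tuple[Dict[str, Grounded], List[str]]:
    """Split candidate fields into grounded values and names needing review."""
    grounded: Dict[str, Grounded] = {}
    review: List[str] = []
    for name, value in candidates.items():
        hit = ground(value, source_text)
        if hit is None:
            review.append(name)
        else:
            grounded[name] = hit
    return grounded, review
```

The point of the design is that a value the model "remembers" but the page does not contain has nowhere to point, so it cannot pass silently.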
Visual field-picker
Draw bounding boxes on a sample. No JSON schemas to write, no regex to maintain. Anchors handle position drift between similar documents.
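One common way to make a box survive position drift is to store it relative to a nearby landmark (for example, a printed label) instead of as an absolute page position. The sketch below shows that general idea in Python; it is an assumption about the technique, not a description of how DocuExtract's anchors are implemented, and `Box`, `relative_offset`, and `relocate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: top-left corner plus width and height, in pixels."""
    x: float
    y: float
    w: float
    h: float


def relative_offset(anchor: Box, field: Box) -> Tuple[float, float]:
    """Offset of the field's corner from the anchor's corner on the sample."""
    return (field.x - anchor.x, field.y - anchor.y)


def relocate(field: Box, offset: Tuple[float, float], anchor_in_new_doc: Box) -> Box:
    """Place the field box on a new document using the anchor found there."""
    dx, dy = offset
    return Box(anchor_in_new_doc.x + dx, anchor_in_new_doc.y + dy, field.w, field.h)
```

If the label sits 10 pixels right and 30 pixels lower on the next invoice, the field box moves by the same amount, so the template does not need redrawing.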
Honest multilingual tiering
Stable: English, Spanish. Beta: 8 widely used languages, including Arabic (RTL). Experimental: Indic scripts (Hindi, Punjabi, Tamil, Telugu, Bengali). We label by maturity, not marketing.
Human-in-the-loop, by default
Low-confidence fields surface for review with the source region highlighted. Reviewer corrections feed back to improve future extractions on that template.
Self-hostable
docker compose up. Postgres + Redis + MinIO + Ollama, all in one stack. No paid API required. Apache-2.0. Run it on your own hardware, your own models, your own data.
Auditable
Every field carries: which OCR tier ran, which model + version, the confidence score, any human correction. Append-only audit log. Forensic-grade provenance.
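To make the provenance fields concrete, here is a small Python sketch of an audit record and an append-only log. The class and attribute names are hypothetical, and the integer tier number is an assumed representation. The sketch records a human correction as a new entry so that the original machine reading is never overwritten.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AuditEntry:
    """Immutable record of one read (or one correction) of a field."""
    field: str
    ocr_tier: int
    model: str
    model_version: str
    confidence: float
    human_correction: Optional[str] = None


class AuditLog:
    """Append-only: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def append(self, entry: AuditEntry) -> None:
        self._entries.append(entry)

    def entries(self) -> Tuple[AuditEntry, ...]:
        # Return a tuple so callers cannot mutate the underlying list.
        return tuple(self._entries)

    def current(self, field: str) -> Optional[AuditEntry]:
        """Most recent entry for a field, or None if it was never logged."""
        for entry in reversed(self._entries):
            if entry.field == field:
                return entry
        return None
```

Reading the log top to bottom then tells the full story of a value: which tier and model produced it, how confident the read was, and what a reviewer changed afterwards.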
What we read
Handles the documents others can't.
Printed text and born-digital PDFs are table stakes. We also read handwriting, multi-column tables, degraded scans, and 15 languages across three honesty tiers. We label by maturity instead of claiming "100+ languages" like most vendors do.
Document types
Printed text & PDFs
Born-digital PDFs read directly from the embedded text layer. Scanned printed text runs through Tesseract / PaddleOCR depending on script.
Handwriting
Signatures, short field values, and printed-form handwriting are handled by the Tier 3 vision-LLM (Qwen 2.5-VL). Expect 70–85% accuracy on clean handwriting; freeform cursive falls back to human review.
Tables & multi-column
Single-page tables, key/value forms, multi-column layouts. Complex multi-page tables are on the roadmap (tracked in KNOWN_ISSUES).
Degraded scans
Low-resolution photos, faxed documents, stained scans. The cascade escalates to the vision-LLM automatically, and uncertain values route to the review queue.
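The escalation described above can be pictured as a loop over readers, cheapest first. The Python sketch below uses stand-in callables rather than real OCR libraries; the tier names, their order, the `0.9` cut-off, and the `run_cascade` function are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A tier takes document bytes and returns (text, confidence), or None if it
# cannot read the document at all (e.g. no embedded text layer in a scan).
TierFn = Callable[[bytes], Optional[Tuple[str, float]]]


def run_cascade(
    doc: bytes,
    tiers: List[Tuple[str, TierFn]],
    min_confidence: float = 0.9,  # hypothetical threshold
) -> Dict[str, object]:
    """Try each tier in order; stop at the first confident read."""
    best: Optional[Dict[str, object]] = None
    for name, tier in tiers:
        result = tier(doc)
        if result is None:
            continue  # this tier had nothing to offer, escalate
        text, confidence = result
        if best is None or confidence > best["confidence"]:
            best = {"tier": name, "text": text, "confidence": confidence}
        if confidence >= min_confidence:
            return {**best, "needs_review": False}
    if best is None:
        return {"tier": None, "text": "", "confidence": 0.0, "needs_review": True}
    # No tier was confident enough: keep the best read, but flag it for review.
    return {**best, "needs_review": True}
```

A clean born-digital PDF would stop at the first tier, while a stained fax would fall through to the vision-LLM and, if that read is still uncertain, land in the review queue with its best guess attached.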
Languages (honestly tiered)
Stable: Production-ready. Validated against a comprehensive fixture set.
- English
- Spanish
Beta: Works on clean documents. Validate on yours.
- Chinese (Simplified)
- Chinese (Traditional)
- French
- Vietnamese
- Korean
- Tagalog
- Portuguese
- Arabic (RTL)
Experimental: Active development. Accuracy varies.
- Hindi
- Punjabi
- Tamil
- Telugu
- Bengali
Join the waitlist. Self-host anytime. Bring us in when it matters.
Hosted invites are rolling out in batches. Drop your email and we’ll let you know when you can sign up. Self-host the OSS version today for unlimited volume.