Features

Built for documents that matter.

Most extraction tools optimize for the demo. We optimize for the production requirement: every value traceable to source, uncertainty surfaced honestly, no third-party data exfiltration. Here's what that actually means.

Signature feature

Visual field-picker

Upload a sample of your document. Draw a bounding box around each field you want extracted. Name it, set its type (text, number, date, currency, enum, table), and optionally anchor it to a nearby label so the engine finds it even when the value's position drifts.

That set of definitions is a Template. Save it once and run thousands of similar documents through it — clean structured data out, consistent column names, consistent types, ready for downstream systems.

Draw, don't code — no JSON schemas, no regex
Anchor-based matching handles position drift between similar documents
Six field types: text, number, date, currency, enum, table
Templates are versioned — edits create v+1, existing batches stay reproducible
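
To make the Template idea concrete, here is a minimal sketch of how field definitions and versioning could be modeled. The class names, the `anchor_label` attribute, the `with_field` helper, and the (x, y, width, height) reading of the bounding box are all assumptions for illustration, not the product's actual schema.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# The six field types named above.
FIELD_TYPES = {"text", "number", "date", "currency", "enum", "table"}


@dataclass(frozen=True)
class FieldDef:
    name: str
    type: str
    bbox: Tuple[int, int, int, int]      # assumed layout: (x, y, width, height)
    anchor_label: Optional[str] = None   # hypothetical: nearby label used as an anchor

    def __post_init__(self):
        if self.type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.type}")


@dataclass(frozen=True)
class Template:
    name: str
    version: int
    fields: Tuple[FieldDef, ...]

    def with_field(self, new_field: FieldDef) -> "Template":
        """Return a new template at version + 1; the old version is untouched."""
        return replace(self, version=self.version + 1,
                       fields=self.fields + (new_field,))


v1 = Template("invoice_template", 1, (
    FieldDef("invoice_number", "text", (412, 120, 140, 18), anchor_label="Invoice no."),
))
v2 = v1.with_field(
    FieldDef("total", "currency", (412, 891, 88, 18), anchor_label="Total")
)
```

Because both classes are frozen, an edit cannot mutate `v1` in place. A batch that ran against `v1` can be reproduced later even after `v2` exists.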

invoice_template · v3

Acme Corporation · Invoice

Invoice no.: INV-2026-0042
Date: 2026-06-15
Due date: 2026-07-21
Bill to: Beta Industries LLC

Consulting (40 hrs)	$15,000.00
Materials	$617.37
Subtotal	$15,617.37
VAT (19%)	$2,967.30
Total	$18,584.67

Template fields

Field	Type	Confidence
invoice_number	text	1.00
invoice_date	date	0.99
due_date	date	0.96
vendor_name	text	1.00
bill_to	text	0.98
subtotal	currency	0.99
tax	currency	0.97
total	currency	0.97

Every value links back to a region in the source — and to the model + version that read it.

Anti-hallucination spine

Verbatim grounding.

Every value the engine extracts must point back to a source span in the original document. The LLM doesn't invent — it finds. Values that can't be grounded are dropped (default) or routed to the human-review queue (configurable per template).

This isn't a feature toggle. It's structural. The is_grounded column on every extraction defaults to false and is flipped to true only after a real source span is located. A crashed extraction can never accidentally surface an ungrounded value.

Direct verbatim match → highest confidence
Fuzzy match (1–2 char edit distance) handles OCR noise like O↔0, 1↔l↔I
Semantic match grounds typed values (date "Jan 5" → 2026-01-05 → source token)
No match → drop or route to review. Never invent.
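
The matching ladder above can be sketched as follows. This is an illustrative simplification, not the engine's implementation: the `Grounding` record, the `ground` function, and the same-length sliding window are assumptions, and the semantic (typed-value) step is left out.

```python
from dataclasses import dataclass
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class Grounding:
    is_grounded: bool = False        # stays False unless a source span is found
    method: Optional[str] = None
    span: Optional[str] = None
    distance: Optional[int] = None


def ground(value: str, source_text: str, max_edits: int = 2) -> Grounding:
    if not value:
        return Grounding()
    # 1) Direct verbatim match.
    if value in source_text:
        return Grounding(True, "verbatim", value, 0)
    # 2) Fuzzy match: compare against every window of the same length
    #    (a simplification that mainly covers character substitutions).
    n = len(value)
    best = None
    for start in range(max(len(source_text) - n + 1, 0)):
        window = source_text[start:start + n]
        d = edit_distance(value, window)
        if d <= max_edits and (best is None or d < best[0]):
            best = (d, window)
    if best is not None:
        return Grounding(True, "fuzzy", best[1], best[0])
    # 3) No match: the caller drops the value or routes it to review.
    return Grounding()


clean = ground("$18,584.67", "...invoice total amount $18,584.67 due by July 21...")
noisy = ground("INV-2026-0042", "Invoice no. INV-2O26-0042")  # OCR read 0 as O
```

Note that the default `Grounding()` is ungrounded, so any early return or exception leaves the value ungrounded. A real implementation would also need a minimum value length for fuzzy matching, since very short values sit within two edits of almost anything.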

Extracted field

total · $18,584.67

confidence 0.97 · grounded ✓

Source span

...invoice total amount $18,584.67 due by July 21...

page 1 · bbox (412, 891, 88, 18) · text matched verbatim

Quality boundary

Human-in-the-loop, by default

Confidence below the template's threshold? That field surfaces in a review queue with the source region highlighted on the original document image. The reviewer sees the field name, the engine's best guess, and exactly where it came from. One click to approve, one to correct.

Corrections feed back as exemplars that improve future extractions for that template. This is what separates "OCR with confidence scores" from a usable production workflow.

Configurable confidence threshold per template (default 0.75)
Source-region highlight on the page image, with text snippet context
Per-field review — only the uncertain values, not whole documents
Corrections logged to the audit trail (who, when, before / after)
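
A rough sketch of per-field triage against a template threshold, with corrections recorded as who, when, before, and after. The `ReviewQueue` class, its method names, and the `field_corrected` event name are hypothetical; only the 0.75 default and the logged attributes come from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float


@dataclass
class ReviewQueue:
    threshold: float = 0.75                       # per-template default
    pending: List[ExtractedField] = field(default_factory=list)
    audit: List[dict] = field(default_factory=list)

    def triage(self, fields: List[ExtractedField]) -> List[ExtractedField]:
        """Queue only the uncertain fields; return the ones accepted as-is."""
        accepted = []
        for f in fields:
            if f.confidence < self.threshold:
                self.pending.append(f)
            else:
                accepted.append(f)
        return accepted

    def correct(self, f: ExtractedField, new_value: str, reviewer: str) -> None:
        """Apply a reviewer correction and log who, when, before, after."""
        self.pending.remove(f)
        self.audit.append({
            "event": "field_corrected",           # hypothetical event name
            "field": f.name,
            "who": reviewer,
            "when": datetime.now(timezone.utc).isoformat(),
            "before": f.value,
            "after": new_value,
        })
        f.value = new_value


queue = ReviewQueue()
batch = [
    ExtractedField("invoice_date", "06/15/2026", 0.68),
    ExtractedField("tax", "$2,967.30", 0.71),
    ExtractedField("total", "$18,584.67", 0.97),
]
accepted = queue.triage(batch)
queue.correct(batch[0], "2026-06-15", reviewer="reviewer@example.com")
```

Only `invoice_date` and `tax` reach the queue here. `total` clears the threshold and never needs a reviewer, which is the point of reviewing per field rather than per document.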

Review queue · 3 fields

invoice_date	0.68	06/15/2026	date ambiguity (US/EU)
Approve · Correct · View source
tax	0.71	$2,967.30	OCR uncertainty: 7 vs 1
Approve · Correct · View source
vendor_id	0.62	AC-2024-1109	no clear anchor label
Approve · Correct · View source

Languages we actually support

Honest multilingual tiering.

Every competitor markets "100+ languages." Most of those claims fall apart on real documents — bad accuracy, broken bounding boxes for right-to-left, no Indic support at all. We label by maturity instead.

Stable means production-ready, validated. Beta means clean documents work; bring yours. Experimental means active development, accuracy varies. You always know which is which.

Stable: English, Spanish
Beta: Chinese (Simplified + Traditional), French, Vietnamese, Korean, Tagalog, Portuguese, Arabic (RTL), Hindi
Experimental: Punjabi, Tamil, Telugu, Bengali (Indic scripts most tools fail on)
Right-to-left + complex-script handling is real engineering, not a language pack

Beyond clean printed text

Handwriting + degraded scans.

Form fields filled in by hand, signatures, dates scrawled in margins, faxed receipts with stamped overlays, photos taken with phone cameras at bad angles. The OCR cascade escalates these automatically to a self-hosted vision-LLM (Qwen 2.5-VL 7B by default) that reads what traditional OCR engines miss.

We're honest about the limits. Clean hand-printed form fields and short field values (signatures, names, dates, single-line answers) extract at 70–85% accuracy. Cursive freeform handwriting and degraded multi-line text often drop below the confidence threshold and route to the human-review queue rather than confidently guessing wrong.

Signatures, hand-printed form fields, marginal notes, stamps
Available on Premium tier (and bundled with Team and up)
Same model handles handwriting and multilingual scripts — no separate setup
Low-confidence handwriting routes to review with the source region highlighted
Self-host: runs on your own GPU; hosted: Modal scale-to-zero GPU

Handwritten signature field

Signature

Extracted: J. Müller · confidence 0.78 · grounded ✓

confidence 0.78 < threshold 0.80

→ routed to review queue for confirmation

Forensic-grade provenance

Audit trail.

Every field carries its full lineage: which OCR tier ran, which model + version, the confidence score, the source page and bounding region, any human correction (who, when, before, after). All written to an append-only event log.

This is what makes the product auditable, not just accurate. Regulated workflows (legal, medical, financial) need to reconstruct exactly why any field reached its final state. The audit log is that reconstruction.

Append-only — events are never updated or deleted
Indexed by extraction, field, template, time, and event type
15+ canonical event types covering OCR, extraction, grounding, review
Export-ready for compliance, SOC 2, HIPAA workflows (managed-deployment tier)
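
One way to picture the append-only property is the in-memory sketch below. The `AuditLog` and `AuditEvent` names and the `ext_demo` identifier are made up for illustration, and a real deployment would enforce immutability in the storage layer rather than in Python objects.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Tuple


@dataclass(frozen=True)
class AuditEvent:
    seq: int                      # position in the log, assigned on append
    extraction_id: str
    event_type: str
    detail: MappingProxyType      # read-only view of the event payload


class AuditLog:
    """Exposes append and read operations only: no update, no delete."""

    def __init__(self) -> None:
        self._events = []

    def append(self, extraction_id: str, event_type: str, **detail) -> AuditEvent:
        event = AuditEvent(len(self._events), extraction_id, event_type,
                           MappingProxyType(dict(detail)))
        self._events.append(event)
        return event

    def for_extraction(self, extraction_id: str) -> Tuple[AuditEvent, ...]:
        return tuple(e for e in self._events if e.extraction_id == extraction_id)

    def by_type(self, event_type: str) -> Tuple[AuditEvent, ...]:
        return tuple(e for e in self._events if e.event_type == event_type)


log = AuditLog()
log.append("ext_demo", "document_uploaded", filename="invoice_07.pdf", pages=3)
log.append("ext_demo", "extraction_field_extracted", field="total", confidence=0.97)
log.append("ext_demo", "grounding_passed", field="total", edit_distance=0)
history = log.for_extraction("ext_demo")
```

Reconstructing why a field reached its final state is then a matter of reading its events in `seq` order.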

Audit log · extraction d8f3...4a91

14:02:01.341	document_uploaded	invoice_07.pdf · 3 pages
14:02:02.118	language_detected	script=Latin · lang=en · 0.99
14:02:03.005	ocr_tier_passed	tier=1 · confidence=0.94
14:02:08.622	extraction_field_extracted	total · method=llm · 0.97
14:02:08.847	grounding_passed	total · edit_distance=0
14:02:09.011	extraction_completed	7 fields · overall=0.95

What runs under the hood

Progressive OCR cascade.

Cheap, fast OCR runs first. Quality gates escalate to heavier engines only when needed. Born-digital PDFs never touch a model. Clean scans run Tesseract. Hard cases escalate to PaddleOCR. Handwriting and degraded scans go to a self-hosted vision-LLM. Failures route to human review — never to a paid third-party API by default.

On Premium tiers, two independent LLM passes vote on each field. Disagreements get a third pass or route to review. No silent guessing.

Tier 0: embedded PDF text (born-digital, free, instant)
Tier 1: Tesseract (Latin scripts, 8 supported languages)
Tier 2: PaddleOCR (CJK, Indic, complex layouts)
Tier 3: Vision-LLM (Qwen 2.5-VL 7B, self-hosted on scale-to-zero GPU)
2-pass agreement on Premium; 3rd-call tiebreaker on disagreement
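
The two-pass agreement rule could look roughly like this. The `agree` function and its return labels are hypothetical, and the choice to run the tiebreaker first and fall back to review only when all three passes differ is one possible reading of "a third pass or route to review."

```python
from typing import Callable, Optional, Tuple

# A pass is any callable that takes a field name and returns the extracted value.
ExtractPass = Callable[[str], str]


def agree(field_name: str,
          pass_a: ExtractPass,
          pass_b: ExtractPass,
          tiebreak: ExtractPass) -> Tuple[Optional[str], str]:
    """Return (value, outcome); value is None when the field goes to review."""
    a = pass_a(field_name)
    b = pass_b(field_name)
    if a == b:
        return a, "agreed"
    # Disagreement: spend a third call only now.
    c = tiebreak(field_name)
    if c == a or c == b:
        return c, "tiebreak"
    # Three different answers: no silent guessing, a human decides.
    return None, "review"


same = agree("total", lambda f: "$18,584.67", lambda f: "$18,584.67",
             lambda f: "unused")
split = agree("tax", lambda f: "$2,967.30", lambda f: "$2,961.30",
              lambda f: "$2,967.30")
stuck = agree("vendor_id", lambda f: "AC-2024-1109", lambda f: "AC-2024-1100",
              lambda f: "AC-2024-7709")
```

The third call is only paid for on disagreement, which keeps the common case at two passes.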

Cascade flow

Tier	Stage	Detail	Cost
Tier 0	Embedded text	pypdfium2	free
Tier 1	Tesseract	Latin scripts	~CPU sec
Tier 2	PaddleOCR	CJK / Indic / layout	~CPU sec
Tier 3	Vision-LLM	Qwen 2.5-VL 7B	~$0.003/page
Tier 4	Human review	queue with source highlight	your labor

Quality gate at each tier decides whether to escalate or stop. Cheap tiers run first.
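
The escalation logic can be sketched with stub engines standing in for the real ones. Nothing below calls pypdfium2, Tesseract, PaddleOCR, or a vision model; the `Tier` structure, the stub functions, and the gate values are assumptions chosen only to show the control flow.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class OcrResult(NamedTuple):
    text: str
    confidence: float


class Tier(NamedTuple):
    name: str
    run: Callable[[bytes], Optional[OcrResult]]   # None means "nothing usable"
    min_confidence: float                         # hypothetical quality gate


def run_cascade(document: bytes,
                tiers: List[Tier]) -> Tuple[Optional[OcrResult], List[Tuple[str, bool]]]:
    """Try tiers cheapest-first; stop at the first one that passes its gate."""
    trail = []
    for tier in tiers:
        result = tier.run(document)
        passed = result is not None and result.confidence >= tier.min_confidence
        trail.append((tier.name, passed))
        if passed:
            return result, trail
    # Every automated tier failed its gate: hand off to human review.
    return None, trail


# Stub engines for a scanned (not born-digital) invoice.
def embedded_text(doc: bytes) -> Optional[OcrResult]:
    return None                                   # a scan has no text layer


def tesseract_stub(doc: bytes) -> Optional[OcrResult]:
    return OcrResult("Invoice no. INV-2026-0042", 0.94)


def paddle_stub(doc: bytes) -> Optional[OcrResult]:
    raise AssertionError("should not run when an earlier tier passes")


tiers = [
    Tier("tier0_embedded_text", embedded_text, 0.99),
    Tier("tier1_tesseract", tesseract_stub, 0.90),
    Tier("tier2_paddleocr", paddle_stub, 0.85),
]
result, trail = run_cascade(b"%PDF-scan", tiers)
```

Heavier tiers are never invoked once a cheaper tier clears its gate, and an empty result (rather than a low-confidence guess) is what triggers the review queue.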

For developers

Public API.

Every feature of the product is available over a documented REST API. Generate API keys in the dashboard, pick a tier (Standard / Premium / Premium + multi-pass), integrate from any stack.

Webhooks fire on batch completion. Rate limits scale with your plan. Idempotency headers on every endpoint. OpenAPI spec at /docs.

REST + JSON, OpenAPI 3.1 spec, Swagger UI at /docs
Per-key API keys with scopes and rotation
Webhooks for batch completion (HMAC-signed)
Tier-aware rate limits (60 → 6,000 req/min by plan)
Idempotency keys to make retries safe
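
On the receiving side, checking an HMAC-signed webhook usually looks something like the sketch below. The exact scheme is an assumption here: SHA-256 over the raw request body, hex-encoded, with the signature delivered in a header. Check the OpenAPI spec at /docs for the real algorithm and header name.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, see note above)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body).encode("ascii")
    received = signature_header.strip().encode("utf-8")
    return hmac.compare_digest(expected, received)


# Hypothetical payload and secret, for illustration only.
secret = b"whsec_demo"
body = b'{"batch_id": "b_123", "status": "completed"}'
signature = sign(secret, body)

ok = verify(secret, body, signature)
tampered = verify(secret, body.replace(b"completed", b"failed"), signature)
```

Verify against the raw bytes as received, before any JSON parsing, and use `hmac.compare_digest` rather than `==` so the comparison does not leak timing information.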

Quick start

curl -X POST https://docuextract.ai/v1/extract \
  -H "Authorization: Bearer $DOCUEXTRACT_API_KEY" \
  -H "Content-Type: application/pdf" \
  -H "X-Template: tpl_acme_invoices" \
  -H "X-Tier: premium" \
  --data-binary @invoice.pdf

# → { "fields": [...], "audit_id": "...", ... }

Bring your own template, or use a public one from the gallery.
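
For a Python stack, the same request can be assembled with the standard library. The URL and the Authorization, Content-Type, X-Template, and X-Tier headers mirror the curl call above. The `Idempotency-Key` header name and the `build_extract_request` helper are assumptions; the sketch builds the request without sending it.

```python
import os
import urllib.request
import uuid
from typing import Optional

API_URL = "https://docuextract.ai/v1/extract"


def build_extract_request(pdf_bytes: bytes,
                          template: str,
                          tier: str = "premium",
                          api_key: Optional[str] = None,
                          idempotency_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST equivalent to the curl quick start."""
    api_key = api_key or os.environ.get("DOCUEXTRACT_API_KEY", "")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/pdf",
        "X-Template": template,
        "X-Tier": tier,
        # Assumed header name: reuse the same key on retries so that a
        # repeated request is not processed twice.
        "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
    }
    return urllib.request.Request(API_URL, data=pdf_bytes,
                                  headers=headers, method="POST")


request = build_extract_request(b"%PDF-1.7 demo", "tpl_acme_invoices",
                                api_key="test_key", idempotency_key="retry-001")

# To send it for real:
#   with urllib.request.urlopen(request) as response:
#       payload = response.read()
```

Generating the idempotency key once and passing it into every retry of the same upload is what makes the retry safe.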

How we compare

The combination is the moat.

Almost every individual feature exists somewhere. No competitor combines them: OSS + self-host + visual picker + HITL + verbatim grounding + honest multilingual + no paid API by default. That's the gap we're built to fill.

Capability	DocuExtract	Typical competitor
Apache-2.0 open source
Full self-host (no contact-sales)
Polished visual bounding-box picker
Integrated HITL review queue
Verbatim grounding (no-hallucination as guarantee)
Honest multilingual tier labeling
Multi-pass LLM agreement
No paid third-party APIs by default
Custom-template visual editor
Public API + documented OpenAPI spec
Batch processing + webhooks

"Typical competitor" is a composite of Nanonets, Sensible, Rossum, Docparser, and hyperscaler APIs. Individual competitors may match on a given row; none match on the full set.

Try it on your documents.

50 free documents per month covers a real evaluation. Or self-host for unlimited volume on your own hardware — same code, same features.

Start free · See pricing · Self-host on GitHub →