DocuExtract

Features

Built for documents that matter.

Most extraction tools optimize for the demo. We optimize for the production requirement: every value traceable to source, uncertainty surfaced honestly, no third-party data exfiltration. Here's what that actually means.

Signature feature

Visual field-picker

Upload a sample of your document. Draw a bounding box around each field you want extracted. Name it, set its type (text, number, date, currency, enum, table), and optionally anchor it to a nearby label so the engine still finds it when the value's position drifts.

That set of definitions is a Template. Save it once and run thousands of similar documents through it — clean structured data out, consistent column names, consistent types, ready for downstream systems.

  • Draw, don't code — no JSON schemas, no regex
  • Anchor-based matching handles position drift between similar documents
  • Six field types: text, number, date, currency, enum, table
  • Templates are versioned — edits create v+1, existing batches stay reproducible

invoice_template · v3

Acme Corporation · Invoice

Invoice no.: INV-2026-0042
Date: 2026-06-15
Due date: 2026-07-21
Bill to: Beta Industries LLC

Consulting (40 hrs): $15,000.00
Materials: $617.37
Subtotal: $15,617.37
VAT (19%): $2,967.30
Total: $18,584.67

Template fields

  • invoice_number · text · 1.00
  • invoice_date · date · 0.99
  • due_date · date · 0.96
  • vendor_name · text · 1.00
  • bill_to · text · 0.98
  • subtotal · currency · 0.99
  • tax · currency · 0.97
  • total · currency · 0.97

Every value links back to a region in the source — and to the model + version that read it.
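
A Template as described above can be pictured as a small, immutable data structure. The sketch below is illustrative only: the class names, the `anchor_label` attribute, the `(x, y, width, height)` reading of a bounding box, and the coordinates themselves are assumptions, not DocuExtract's actual schema. It shows the six field types and the "edits create v+1" rule.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional, Tuple


class FieldType(Enum):
    TEXT = "text"
    NUMBER = "number"
    DATE = "date"
    CURRENCY = "currency"
    ENUM = "enum"
    TABLE = "table"


@dataclass(frozen=True)
class FieldDef:
    name: str
    type: FieldType
    bbox: Tuple[int, int, int, int]      # assumed layout: (x, y, width, height)
    anchor_label: Optional[str] = None   # hypothetical: nearby label used when the value drifts


@dataclass(frozen=True)
class Template:
    name: str
    version: int
    fields: Tuple[FieldDef, ...]

    def with_field(self, field: FieldDef) -> "Template":
        """Return a new version; the existing version object is never mutated."""
        return replace(self, version=self.version + 1, fields=self.fields + (field,))


# Coordinates are made up for illustration.
v3 = Template(
    name="invoice_template",
    version=3,
    fields=(
        FieldDef("invoice_number", FieldType.TEXT, (320, 90, 140, 18), anchor_label="Invoice no."),
        FieldDef("total", FieldType.CURRENCY, (412, 891, 88, 18), anchor_label="Total"),
    ),
)
v4 = v3.with_field(FieldDef("due_date", FieldType.DATE, (320, 130, 140, 18), anchor_label="Due date"))
```

Because both dataclasses are frozen, a batch that ran against `v3` can keep referring to it unchanged after `v4` exists, which is one simple way to keep older batches reproducible.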

Anti-hallucination spine

Verbatim grounding.

Every value the engine extracts must point back to a source span in the original document. The LLM doesn't invent — it finds. Values that can't be grounded are dropped (default) or routed to the human-review queue (configurable per template).

This isn't a feature toggle. It's structural. The is_grounded column on every extraction defaults to false — only flipped to true after a real source span is located. A crashed extraction can never accidentally surface an ungrounded value.

  • Direct verbatim match → highest confidence
  • Fuzzy match (1–2 char edit distance) handles OCR noise like O↔0, 1↔l↔I
  • Semantic match grounds typed values (date "Jan 5" → 2026-01-05 → source token)
  • No match → drop or route to review. Never invent.

Extracted field

total: $18,584.67

confidence 0.97 · grounded ✓

Source span

...invoice total amount $18,584.67 due by July 21...

page 1 · bbox (412, 891, 88, 18) · text matched verbatim
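
The verbatim and fuzzy matching steps can be sketched in a few lines. This is a minimal illustration, not the engine's implementation: the function names, the sliding-window search, and the return shape are assumptions. It tries an exact substring match first, then accepts the closest same-length window within two character edits, and otherwise reports the value as ungrounded.

```python
from typing import Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ground(value: str, source: str, max_edits: int = 2) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, edits) for the best source span, or None if ungrounded."""
    if not value:
        return None
    start = source.find(value)
    if start != -1:
        return (start, start + len(value), 0)   # direct verbatim match
    best = None
    width = len(value)
    for i in range(len(source) - width + 1):    # fuzzy: slide a same-length window
        d = edit_distance(value, source[i:i + width])
        if d <= max_edits and (best is None or d < best[2]):
            best = (i, i + width, d)
    return best                                  # None means: drop or route to review


source = "...invoice total amount $18,584.67 due by July 21..."
exact = ground("$18,584.67", source)             # verbatim hit, zero edits
noisy = ground("INV-2026-0042", "Invoice no. INV-2O26-0042")  # OCR read 0 as O
missing = ground("$99,999.99", source)           # no span within two edits
```

A fixed edit budget is too generous for very short values (two edits can turn almost any three-character string into another), so a real implementation would likely scale the budget with value length.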

Quality boundary

Human-in-the-loop, by default

Confidence below the template's threshold? That field surfaces in a review queue with the source region highlighted on the original document image. The reviewer sees the field name, the engine's best guess, and exactly where it came from. One click to approve, one to correct.

Corrections feed back as exemplars that improve future extractions for that template. This is what separates "OCR with confidence scores" from a usable production workflow.

  • Configurable confidence threshold per template (default 0.75)
  • Source-region highlight on the page image, with text snippet context
  • Per-field review — only the uncertain values, not whole documents
  • Corrections logged to the audit trail (who, when, before / after)

Review queue · 3 fields

  • invoice_date · 0.68 · 06/15/2026 · date ambiguity (US/EU)
  • tax · 0.71 · $2,967.30 · OCR uncertainty: 7 vs 1
  • vendor_id · 0.62 · AC-2024-1109 · no clear anchor label

Actions per field: Approve · Correct · View source
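
The routing rule behind the review queue is simple enough to sketch. The class and function names here are hypothetical, and the sketch assumes every incoming field has already been grounded; it only shows the per-field split at the template's confidence threshold (default 0.75).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float


def route(fields: Iterable[ExtractedField],
          threshold: float = 0.75) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split fields into (accepted, needs_review) by the template threshold."""
    accepted, needs_review = [], []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)   # per-field review, not whole documents
    return accepted, needs_review


fields = [
    ExtractedField("total", "$18,584.67", 0.97),
    ExtractedField("invoice_date", "06/15/2026", 0.68),
    ExtractedField("tax", "$2,967.30", 0.71),
    ExtractedField("vendor_id", "AC-2024-1109", 0.62),
]
accepted, needs_review = route(fields)
```

Only the three uncertain fields land in the queue; `total` passes straight through.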

Languages we actually support

Honest multilingual tiering.

Every competitor markets "100+ languages." Most of those claims fall apart on real documents — bad accuracy, broken bounding boxes for right-to-left, no Indic support at all. We label by maturity instead.

Stable means production-ready, validated. Beta means clean documents work; bring yours. Experimental means active development, accuracy varies. You always know which is which.

  • Stable: English, Spanish
  • Beta: Chinese (Simplified + Traditional), French, Vietnamese, Korean, Tagalog, Portuguese, Arabic (RTL), Hindi
  • Experimental: Punjabi, Tamil, Telugu, Bengali (Indic scripts most tools fail on)
  • Right-to-left + complex-script handling is real engineering, not a language pack

Beyond clean printed text

Handwriting + degraded scans.

Form fields filled in by hand, signatures, dates scrawled in margins, faxed receipts with stamped overlays, photos taken with phone cameras at bad angles. The OCR cascade escalates these automatically to a self-hosted vision-LLM (Qwen 2.5-VL 7B by default) that reads what traditional OCR engines miss.

We're honest about the limits. Clean printed-form handwriting and short field values (signatures, names, dates, single-line answers) extract at 70–85% accuracy. Cursive freeform handwriting and degraded multi-line text often drop below the confidence threshold and route to the human-review queue rather than confidently guessing wrong.

  • Signatures, hand-printed form fields, marginal notes, stamps
  • Available on Premium tier (and bundled with Team and up)
  • Same model handles handwriting and multilingual scripts — no separate setup
  • Low-confidence handwriting routes to review with the source region highlighted
  • Self-host: runs on your own GPU; hosted: Modal scale-to-zero GPU

Handwritten signature field

Signature

Extracted: J. Müller · confidence 0.78 · grounded ✓

confidence 0.78 < threshold 0.80

→ routed to review queue for confirmation

Forensic-grade provenance

Audit trail.

Every field carries its full lineage: which OCR tier ran, which model + version, the confidence score, the source page and bounding region, any human correction (who, when, before, after). All written to an append-only event log.

This is what makes the product auditable, not just accurate. Regulated workflows (legal, medical, financial) need to reconstruct exactly why any field reached its final state. The audit log is that reconstruction.

  • Append-only — events are never updated or deleted
  • Indexed by extraction, field, template, time, and event type
  • 15+ canonical event types covering OCR, extraction, grounding, review
  • Export-ready for compliance, SOC 2, HIPAA workflows (managed-deployment tier)

Audit log · extraction d8f3...4a91

  • 14:02:01.341 · document_uploaded · invoice_07.pdf · 3 pages
  • 14:02:02.118 · language_detected · script=Latin · lang=en · 0.99
  • 14:02:03.005 · ocr_tier_passed · tier=1 · confidence=0.94
  • 14:02:08.622 · extraction_field_extracted · total · method=llm · 0.97
  • 14:02:08.847 · grounding_passed · total · edit_distance=0
  • 14:02:09.011 · extraction_completed · 7 fields · overall=0.95
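
To make "append-only" concrete, here is a minimal in-memory sketch. The class names, the `ext_demo` identifier, and the storage choice are all assumptions for illustration; a production log would live in a database. The point is the interface: events can be appended and read, and nothing is offered to update or delete them.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import List, Mapping


@dataclass(frozen=True)
class AuditEvent:
    extraction_id: str
    event_type: str
    detail: Mapping[str, object]   # read-only view of the event payload
    at: datetime


class AuditLog:
    """In-memory stand-in for an append-only event log."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def append(self, extraction_id: str, event_type: str, **detail: object) -> AuditEvent:
        event = AuditEvent(
            extraction_id,
            event_type,
            MappingProxyType(dict(detail)),
            datetime.now(timezone.utc),
        )
        self._events.append(event)
        return event

    def for_extraction(self, extraction_id: str) -> List[AuditEvent]:
        """Return a copy, so callers cannot reorder or remove stored events."""
        return [e for e in self._events if e.extraction_id == extraction_id]


log = AuditLog()
log.append("ext_demo", "document_uploaded", filename="invoice_07.pdf", pages=3)
log.append("ext_demo", "grounding_passed", field="total", edit_distance=0)
events = log.for_extraction("ext_demo")
```

Replaying `events` in order is what lets a reviewer reconstruct how a field reached its final state.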

What runs under the hood

Progressive OCR cascade.

Cheap, fast OCR runs first. Quality gates escalate to heavier engines only when needed. Born-digital PDFs never touch a model. Clean scans run Tesseract. Hard cases escalate to PaddleOCR. Handwriting and degraded scans go to a self-hosted vision-LLM. Failures route to human review — never to a paid third-party API by default.

On Premium tiers, two independent LLM passes vote on each field. Disagreements get a third pass or route to review. No silent guessing.

  • Tier 0: embedded PDF text (born-digital, free, instant)
  • Tier 1: Tesseract (Latin scripts, 8 supported languages)
  • Tier 2: PaddleOCR (CJK, Indic, complex layouts)
  • Tier 3: Vision-LLM (Qwen 2.5-VL 7B, self-hosted on scale-to-zero GPU)
  • 2-pass agreement on Premium; 3rd-call tiebreaker on disagreement
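
The two-pass agreement rule can be sketched as a small voting function. This is one plausible reading, not the shipped logic: the function name and the `extract` callable are hypothetical, and the choice to send a field to review when the third call matches neither of the first two is an assumption.

```python
from typing import Callable, List, Optional


def agreed_value(extract: Callable[[int], str]) -> Optional[str]:
    """Run two independent passes; call a third only to break a disagreement.

    Returns the agreed value, or None when the field should go to review.
    """
    first, second = extract(0), extract(1)
    if first == second:
        return first
    third = extract(2)                  # tiebreaker pass
    if third in (first, second):
        return third
    return None                         # three different answers: no silent guessing


def scripted(answers: List[str]) -> Callable[[int], str]:
    """Stand-in for an LLM pass: returns a pre-scripted answer per attempt."""
    return lambda attempt: answers[attempt]


unanimous = agreed_value(scripted(["$2,967.30", "$2,967.30"]))
tiebroken = agreed_value(scripted(["$2,967.30", "$2,961.30", "$2,967.30"]))
unresolved = agreed_value(scripted(["$2,967.30", "$2,961.30", "$2,907.30"]))
```

The third call is only paid for on disagreement, so the common case costs two passes.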

Cascade flow

  1. Tier 0 · Embedded text · pypdfium2 · free
  2. Tier 1 · Tesseract · Latin scripts · ~CPU sec
  3. Tier 2 · PaddleOCR · CJK / Indic / layout · ~CPU sec
  4. Tier 3 · Vision-LLM · Qwen 2.5-VL 7B · ~$0.003/page
  5. Tier 4 · Human review · queue with source highlight · your labor

Quality gate at each tier decides whether to escalate or stop. Cheap tiers run first.
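
The escalation logic reduces to a loop over tiers, cheapest first. In the sketch below the engines are stubs rather than real Tesseract or PaddleOCR calls, and the tier names and the 0.90 gate value are assumptions; only the control flow is the point.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class OcrResult(NamedTuple):
    text: str
    confidence: float


Engine = Callable[[bytes], Optional[OcrResult]]


def run_cascade(page: bytes, tiers: List[Tuple[str, Engine]],
                gate: float = 0.90) -> Tuple[str, Optional[OcrResult]]:
    """Try tiers in order; stop at the first result that passes the quality gate."""
    for name, engine in tiers:
        result = engine(page)
        if result is not None and result.confidence >= gate:
            return name, result
    return "human_review", None          # nothing passed: escalate to a person


# Stub engines standing in for the real tiers.
def embedded_text(page: bytes) -> Optional[OcrResult]:
    return None                          # a scanned page has no text layer


def tesseract_stub(page: bytes) -> Optional[OcrResult]:
    return OcrResult("INV-2026-0042", 0.94)


def paddle_stub(page: bytes) -> Optional[OcrResult]:
    return OcrResult("INV-2026-0042", 0.97)


TIERS = [("tier_0", embedded_text), ("tier_1", tesseract_stub), ("tier_2", paddle_stub)]

stopped_at, result = run_cascade(b"scanned page bytes", TIERS)
strict_stop, strict_result = run_cascade(b"scanned page bytes", TIERS, gate=0.99)
```

With the default gate the loop stops at `tier_1` and never invokes the heavier engine; with a stricter gate every stub fails and the page falls through to human review.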

For developers

Public API.

Every feature of the product is available over a documented REST API. Generate API keys in the dashboard, pick a tier (Standard / Premium / Premium + multi-pass), integrate from any stack.

Webhooks fire on batch completion. Rate limits scale with your plan. Idempotency headers on every endpoint. OpenAPI spec at /docs.

  • REST + JSON, OpenAPI 3.1 spec, Swagger UI at /docs
  • Per-key API keys with scopes and rotation
  • Webhooks for batch completion (HMAC-signed)
  • Tier-aware rate limits (60 → 6,000 req/min by plan)
  • Idempotency keys to make retries safe

Quick start

curl -X POST https://docuextract.ai/v1/extract \
  -H "Authorization: Bearer $DOCUEXTRACT_API_KEY" \
  -H "Content-Type: application/pdf" \
  -H "X-Template: tpl_acme_invoices" \
  -H "X-Tier: premium" \
  --data-binary @invoice.pdf

# → { "fields": [...], "audit_id": "...", ... }

Bring your own template, or use a public one from the gallery.
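
On the receiving side, an HMAC-signed webhook is typically verified by recomputing the signature over the raw request body. The sketch below assumes HMAC-SHA256 with a hex-encoded signature; the actual algorithm, encoding, and header name are not stated here and should be taken from the OpenAPI spec at /docs.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison, so timing does not leak how many characters matched."""
    return hmac.compare_digest(sign(secret, body), signature_hex)


secret = b"whsec_example"                # hypothetical signing secret
body = b'{"event": "batch.completed"}'   # hypothetical payload
header_value = sign(secret, body)        # what the sender would attach

is_valid = verify_webhook(secret, body, header_value)
is_tampered_valid = verify_webhook(secret, body + b" ", header_value)
```

Verify against the exact bytes received, before any JSON parsing, since re-serializing the payload can change whitespace and break the signature.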

How we compare

The combination is the moat.

Almost every individual feature exists somewhere. No competitor combines them: OSS + self-host + visual picker + HITL + verbatim grounding + honest multilingual + no paid API by default. That's the gap we're built to fill.

Capability · DocuExtract · Typical competitor
Apache-2.0 open source
Full self-host (no contact-sales)
Polished visual bounding-box picker
Integrated HITL review queue
Verbatim grounding (no-hallucination as guarantee)
Honest multilingual tier labeling
Multi-pass LLM agreement
No paid third-party APIs by default
Custom-template visual editor
Public API + documented OpenAPI spec
Batch processing + webhooks

"Typical competitor" abstracts across Nanonets, Sensible, Rossum, Docparser, and hyperscaler APIs. Individual competitors may match on a given row; none match on the full set.

Try it on your documents.

50 free documents per month covers a real evaluation. Or self-host for unlimited volume on your own hardware — same code, same features.