DocuExtract

Languages

We tell you what works.
Most vendors don't.

Every competitor markets "100+ languages." Most of those claims fall apart on real documents: poor accuracy, broken bounding boxes on right-to-left text, no Indic support at all. We label each language by maturity tier and list its real caveats. The full matrix is below.

Beyond language

We also read what others miss.

Language coverage is only half the story. The other half is document shape and quality: handwriting, complex tables, degraded scans. Here's what the engine handles regardless of language.

Printed text & born-digital PDFs

Embedded PDF text (born-digital) reads at 100% accuracy with no model involved. Printed scans run through Tesseract for Latin scripts and PaddleOCR for CJK/Indic.
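The routing described above can be sketched as a small lookup. This is a minimal illustration, not the engine's actual code: the function name `choose_ocr_path`, the returned labels, and the exact membership of the `CJK_INDIC_SCRIPTS` set are assumptions. It follows only the summary in this paragraph; the per-language entries list more specific paths for some scripts.

```python
# Hypothetical grouping of script names; the set membership is an assumption.
CJK_INDIC_SCRIPTS = {
    "Han", "Hangul", "Devanagari", "Gurmukhi", "Tamil", "Telugu", "Bengali",
}


def choose_ocr_path(has_embedded_text: bool, script: str) -> str:
    """Pick an OCR path for a page (illustrative labels, not real engine IDs)."""
    if has_embedded_text:
        # Born-digital PDF: read the embedded text directly, no model involved.
        return "embedded_text"
    if script == "Latin":
        return "tesseract"
    if script in CJK_INDIC_SCRIPTS:
        return "paddleocr"
    # Scripts outside this summary are not covered by the sketch.
    raise ValueError(f"no OCR path defined in this sketch for script: {script}")


print(choose_ocr_path(True, "Han"))      # embedded text wins regardless of script
print(choose_ocr_path(False, "Latin"))
print(choose_ocr_path(False, "Hangul"))
```

Checking for embedded text first means a born-digital page never reaches an OCR model, whatever its script.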

Handwriting

Signatures, hand-filled form fields, dates, short freeform answers. These are routed to the Tier 3 vision-LLM (Qwen 2.5-VL 7B). Expect 70–85% accuracy on clean handwriting; cursive multi-line text often drops to the review queue. Available on the Premium tier (bundled with the Team plan and up).

Multi-column tables

Single-page tables extracted with bounding-polygon awareness. Multi-page tables (continuation rows across pages) are on the roadmap — tracked in KNOWN_ISSUES.

Degraded scans

Low-resolution photos, faxed documents, stained pages. The cascade escalates to vision-LLM automatically when traditional OCR confidence drops below 0.65.
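The escalation rule can be sketched in a few lines. Only the 0.65 threshold comes from the text above; the `OcrResult` type, the `run_cascade` function, and the two engine callables are hypothetical stand-ins, since the engine's real interfaces are not documented here.

```python
from dataclasses import dataclass
from typing import Callable

ESCALATION_THRESHOLD = 0.65  # from the text: escalate when confidence drops below this


@dataclass
class OcrResult:
    text: str
    confidence: float  # assumed to be a score in the range 0.0 to 1.0
    engine: str


# Hypothetical engine signature: image bytes in, OcrResult out.
Engine = Callable[[bytes], OcrResult]


def run_cascade(image: bytes, traditional_ocr: Engine, vision_llm: Engine,
                threshold: float = ESCALATION_THRESHOLD) -> OcrResult:
    """Try traditional OCR first; fall back to the vision-LLM on low confidence."""
    first = traditional_ocr(image)
    if first.confidence >= threshold:
        return first
    # Confidence is below the threshold, so escalate.
    return vision_llm(image)


# Usage with stub engines standing in for real OCR calls.
def blurry_fax_ocr(image: bytes) -> OcrResult:
    return OcrResult("1nv0ice t0tal", 0.41, "traditional")


def stub_vision_llm(image: bytes) -> OcrResult:
    return OcrResult("Invoice total", 0.88, "vision-llm")


result = run_cascade(b"", blurry_fax_ocr, stub_vision_llm)
print(result.engine, result.text)
```

In this sketch a result at exactly 0.65 is kept, because the text says escalation happens when confidence drops below the threshold.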

Tier · Stable

Stable support

Production-ready. Validated against a comprehensive fixture set. Use these in business-critical workflows with confidence.

English

Code: en
Script: Latin
OCR path: Embedded text / Tesseract
Accuracy: 99%+ on born-digital PDFs; 96–98% on clean scans.
Caveats: Degraded scans escalate to vision-LLM (Tier 3).

Spanish

Code: es
Script: Latin
OCR path: Embedded text / Tesseract
Accuracy: 99%+ on born-digital PDFs; 96–98% on clean scans.
Caveats: Diacritic handling validated; bilingual EN/ES docs use dominant-script routing.

Tier · Beta

Beta support

Works on clean documents. Real-world accuracy depends on your document quality. Validate on a sample before committing to volume.
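One way to validate on a sample is to hand-transcribe a few dozen pages and score the extracted text against them. The sketch below uses character-level accuracy (one minus the character error rate, via edit distance). That metric is an assumption chosen for illustration; this page does not state how its accuracy figures are computed, and the function names are hypothetical.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def _nfc(text: str) -> str:
    # Normalize so precomposed and combining diacritics compare as equal.
    return unicodedata.normalize("NFC", text)


def char_accuracy(predicted: str, truth: str) -> float:
    """Character accuracy of one extraction against its ground truth."""
    predicted, truth = _nfc(predicted), _nfc(truth)
    if not truth:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, truth) / len(truth))


def sample_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Accuracy over (predicted, truth) pairs, weighted by ground-truth length."""
    errors = sum(edit_distance(_nfc(p), _nfc(t)) for p, t in pairs)
    chars = sum(len(_nfc(t)) for _, t in pairs)
    return max(0.0, 1.0 - errors / chars) if chars else 1.0


sample = [("Tổng cộng: 1.250.000", "Tổng cộng: 1.250.000"),
          ("Ngay 12/03/2024", "Ngày 12/03/2024")]
print(f"{sample_accuracy(sample):.3f}")
```

Weighting by ground-truth length keeps a few short fields from dominating the score. The NFC normalization matters for diacritic-heavy languages, where the same visible character can be encoded in more than one way.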

Chinese (Simplified)

Code: zh-Hans
Script: Han
OCR path: PaddleOCR
Accuracy: 92–95% on clean printed text.
Caveats: Handwritten or stylized fonts often need Tier 3 vision-LLM. Tables work; complex multi-column layouts can confuse the layout parser.

Chinese (Traditional)

Code: zh-Hant
Script: Han
OCR path: PaddleOCR
Accuracy: 90–94% on clean printed text.
Caveats: Some traditional characters are less frequent in training data; vision-LLM fallback often improves accuracy.

French

Code: fr
Script: Latin
OCR path: Tesseract
Accuracy: 95–97% on clean printed text.
Caveats: Diacritic-heavy text handled; currency placement in tables (symbol after the amount, e.g. 12,50 €) validated.

Vietnamese

Code: vi
Script: Latin (diacritic-heavy)
OCR path: Tesseract
Accuracy: 93–96% on clean printed text.
Caveats: Stacked tone marks can confuse low-resolution OCR. Validate on your scan quality.

Korean

Code: ko
Script: Hangul
OCR path: PaddleOCR
Accuracy: 91–95% on clean printed text.
Caveats: Mixed Hangul/Hanja documents (older formal text) escalate to vision-LLM.

Tagalog

Code: tl
Script: Latin
OCR path: Tesseract
Accuracy: 94–96% on clean printed text.
Caveats: Spanish loanwords and abbreviations handled correctly.

Portuguese

Code: pt
Script: Latin
OCR path: Tesseract
Accuracy: 95–97% on clean printed text.
Caveats: BR and PT variants both supported; currency/date formats locale-aware.

Arabic

Code: ar
Script: Arabic (right-to-left)
OCR path: Tesseract + RTL pipeline
Accuracy: 88–93% on clean printed RTL text.
Caveats: Field-picker overlay coordinates need polish on rotated scans (tracked in KNOWN_ISSUES). Diacritical marks in classical Arabic may need Tier 3 vision-LLM.

Hindi

Code: hi
Script: Devanagari
OCR path: Indic-specialized / vision-LLM
Accuracy: 80–90% on clean printed text.
Caveats: Accuracy drops on conjunct-heavy handwriting and degraded scans. Vision-LLM fallback typically improves accuracy by 10–15 percentage points.

Tier · Experimental

Experimental support

Active development. Accuracy varies. These are the hardest scripts, the ones most tools fail on entirely. We ship them honestly labeled instead of overpromising.

Punjabi

Code: pa
Script: Gurmukhi
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text; varies widely.
Caveats: Active development. Bring real documents for evaluation. Government forms with stamped overlays remain difficult.

Tamil

Code: ta
Script: Tamil
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text; degraded scans much lower.
Caveats: Complex consonant clusters and Vatteluttu-influenced glyphs can confuse OCR. Vision-LLM helps but adds latency.

Telugu

Code: te
Script: Telugu
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text.
Caveats: Compound vowel signs sit above and below the base character, so small OCR noise causes character-level errors.

Bengali

Code: bn
Script: Bengali
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text.
Caveats: Conjuncts and ligatures are common in formal text; vision-LLM significantly improves over traditional OCR.

How we set the tiers

Honest beats optimistic.

Stable means we have run thousands of synthetic and anonymized real-world documents through it, validated extraction accuracy against ground truth, and shipped it as production-ready. We'd use it ourselves in a regulated workflow.

Beta means the OCR engine and routing work, but the maturity isn't backed by exhaustive validation. We've tested it on clean documents. Your scan quality, document layout, and font choice might surface failures we haven't seen. Validate on a sample first.

Experimental means it works enough to be useful, but accuracy varies meaningfully across documents. Indic scripts are the hardest cases in OCR; most tools refuse to ship them at all, or ship them with misleading "supported" labels. We ship them with honest expectations instead.

As accuracy improves, languages get promoted. Promotions are documented in the CHANGELOG with the validation work that supported them.

Language not on this list?

Inspire AI Lab has run extraction at scale on a 230M-document multilingual legal corpus. If you have a custom language requirement — fine-tuning, new script support, dialect handling — we can scope a custom build against your real corpus.