Languages

We tell you what works.
Most vendors don't.

Every competitor markets "100+ languages." Most of those claims fall apart on real documents: poor accuracy, broken bounding boxes on right-to-left scripts, no Indic support at all. We label each language by maturity tier and list its real caveats. What follows is the full matrix.

Beyond language

We also read what others miss.

Language coverage is only half the story. The other half is document shape and quality: handwriting, complex tables, degraded scans. Here's what the engine handles regardless of language.

Printed text & born-digital PDFs

Embedded PDF text (born-digital) reads at 100% accuracy with no model involved. Printed scans run through Tesseract for Latin scripts and PaddleOCR for CJK/Indic.
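
As a rough illustration of the routing described above, the sketch below picks an OCR path from two inputs: whether the PDF carries embedded text, and which script the page uses. The names choose_ocr_path and OcrPath, and the script groupings, are assumptions made for this example; they are not the engine's actual API, and the per-language entries in the matrix list more specific paths.

```python
from enum import Enum


class OcrPath(Enum):
    EMBEDDED_TEXT = "embedded text"
    TESSERACT = "Tesseract"
    PADDLEOCR = "PaddleOCR"


# Hypothetical script groupings, used only for this illustration.
LATIN_SCRIPTS = {"Latin"}
CJK_INDIC_SCRIPTS = {
    "Han", "Hangul", "Devanagari", "Gurmukhi", "Tamil", "Telugu", "Bengali",
}


def choose_ocr_path(has_embedded_text: bool, script: str) -> OcrPath:
    """Pick an OCR path for one document (illustrative sketch only)."""
    if has_embedded_text:
        # Born-digital PDF: read the embedded text, no model involved.
        return OcrPath.EMBEDDED_TEXT
    if script in LATIN_SCRIPTS:
        return OcrPath.TESSERACT
    if script in CJK_INDIC_SCRIPTS:
        return OcrPath.PADDLEOCR
    # Anything else is outside what the paragraph above describes.
    raise ValueError(f"no route defined for script: {script!r}")


print(choose_ocr_path(False, "Hangul").value)
```

Checking for embedded text first mirrors the order in the paragraph: a born-digital PDF never reaches an OCR model, whatever its script.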

Handwriting

Signatures, hand-filled form fields, dates, short freeform answers. Routed to the Tier 3 vision-LLM (Qwen 2.5-VL 7B). Expect 70–85% accuracy on clean handwriting; cursive multi-line text often drops to the review queue. Available on the Premium tier (bundled with the Team plan and up).

Multi-column tables

Single-page tables are extracted with bounding-polygon awareness. Multi-page tables (continuation rows across pages) are on the roadmap and tracked in KNOWN_ISSUES.

Degraded scans

Low-resolution photos, faxed documents, stained pages. The cascade escalates to vision-LLM automatically when traditional OCR confidence drops below 0.65.
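
The escalation rule above can be sketched in a few lines. This is a minimal illustration, assuming each engine returns a confidence score between 0 and 1. OcrResult, run_cascade, and the two engine callables are hypothetical names for this example; only the 0.65 threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Callable

ESCALATION_THRESHOLD = 0.65  # threshold stated in the paragraph above


@dataclass
class OcrResult:
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]
    engine: str


# Hypothetical engine signature: raw image bytes in, OcrResult out.
Engine = Callable[[bytes], OcrResult]


def run_cascade(image: bytes, traditional_ocr: Engine, vision_llm: Engine) -> OcrResult:
    """Try traditional OCR first; escalate only when confidence is too low."""
    first_pass = traditional_ocr(image)
    if first_pass.confidence < ESCALATION_THRESHOLD:
        # Strictly below the threshold: hand the page to the vision-LLM.
        return vision_llm(image)
    return first_pass
```

Passing the engines in as callables keeps the sketch independent of any particular OCR library. A result at exactly 0.65 is not escalated here, which is one reading of "drops below".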

Tier · Stable

Stable support

Production-ready. Validated against a comprehensive fixture set. Use these in business-critical workflows with confidence.

English

Script: Latin
OCR path: Embedded text / Tesseract
Accuracy: 99%+ on born-digital PDFs; 96–98% on clean scans.

Caveats

Degraded scans escalate to vision-LLM (Tier 3).

Spanish

Script: Latin
OCR path: Embedded text / Tesseract
Accuracy: 99%+ on born-digital PDFs; 96–98% on clean scans.

Caveats

Diacritic handling validated; bilingual EN/ES docs use dominant-script routing.

Tier · Beta

Beta support

Works on clean documents. Real-world accuracy depends on your document quality. Validate on a sample before committing to volume.
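
One way to validate on a sample is to compare extracted text against hand-checked ground truth for a small set of your own documents. The sketch below assumes character-level accuracy, computed as one minus edit distance divided by ground-truth length. This page does not say how its accuracy figures are measured, so treat the metric and every function name here as assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(extracted: str, ground_truth: str) -> float:
    """1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not extracted else 0.0
    errors = edit_distance(extracted, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


def sample_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Mean character accuracy over (extracted, ground_truth) pairs."""
    if not pairs:
        raise ValueError("sample is empty")
    return sum(character_accuracy(e, g) for e, g in pairs) / len(pairs)


# One wrong character out of four:
print(character_accuracy("abxd", "abcd"))  # 0.75
```

Comparing the sample figure against the accuracy range listed for a language gives a rough read on whether your scan quality falls inside the "clean documents" case.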

Chinese (Simplified)

zh-Hans

Script: Han
OCR path: PaddleOCR
Accuracy: 92–95% on clean printed text.

Caveats

Handwritten or stylized fonts often need Tier 3 vision-LLM. Tables work; complex multi-column layouts can confuse the layout parser.

Chinese (Traditional)

zh-Hant

Script: Han
OCR path: PaddleOCR
Accuracy: 90–94% on clean printed text.

Caveats

Some traditional characters are less frequent in training data; vision-LLM fallback often improves accuracy.

French

Script: Latin
OCR path: Tesseract
Accuracy: 95–97% on clean printed text.

Caveats

Diacritic-heavy text handled; tables with the currency symbol placed after the amount validated.

Vietnamese

Script: Latin (diacritic-heavy)
OCR path: Tesseract
Accuracy: 93–96% on clean printed text.

Caveats

Stacked tone marks can confuse low-resolution OCR. Validate on your scan quality.

Korean

Script: Hangul
OCR path: PaddleOCR
Accuracy: 91–95% on clean printed text.

Caveats

Mixed Hangul/Hanja documents (older formal text) escalate to vision-LLM.

Tagalog

Script: Latin
OCR path: Tesseract
Accuracy: 94–96% on clean printed text.

Caveats

Spanish loanwords and abbreviations handled correctly.

Portuguese

Script: Latin
OCR path: Tesseract
Accuracy: 95–97% on clean printed text.

Caveats

BR and PT variants both supported; currency/date formats locale-aware.

Arabic

Script: Arabic (right-to-left)
OCR path: Tesseract + RTL pipeline
Accuracy: 88–93% on clean printed RTL text.

Caveats

Field-picker overlay coordinates need polish on rotated scans (tracked in KNOWN_ISSUES). Diacritical marks in classical Arabic may need Tier 3 vision-LLM.

Hindi

Script: Devanagari
OCR path: Indic-specialized / vision-LLM
Accuracy: 80–90% on clean printed text.

Caveats

The engine struggles with conjunct-heavy handwriting and degraded scans. Vision-LLM fallback typically improves accuracy by 10–15 percentage points.

Tier · Experimental

Experimental support

Active development. Accuracy varies. These are the hardest scripts, the ones most tools fail on entirely. We ship them honestly labeled instead of overpromising.

Punjabi

Script: Gurmukhi
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text; varies widely.

Caveats

Active development. Bring real documents for evaluation. Government forms with stamped overlays remain difficult.

Tamil

Script: Tamil
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text; degraded scans much lower.

Caveats

Complex consonant clusters and Vatteluttu-influenced glyphs can confuse OCR. Vision-LLM helps but adds latency.

Telugu

Script: Telugu
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text.

Caveats

Compound vowel signs sit above and below the base character, so small OCR noise causes character-level errors.

Bengali

Script: Bengali
OCR path: Indic-specialized / vision-LLM
Accuracy: 70–85% on clean printed text.

Caveats

Conjuncts and ligatures are common in formal text; vision-LLM significantly improves over traditional OCR.

How we set the tiers

Honest beats optimistic.

Stable means we have run thousands of synthetic and anonymized real-world documents through it, validated extraction accuracy against ground truth, and shipped it as production-ready. We'd use it ourselves in a regulated workflow.

Beta means the OCR engine and routing work, but the maturity isn't backed by exhaustive validation. We've tested it on clean documents. Your scan quality, document layout, and font choice might surface failures we haven't seen. Validate on a sample first.

Experimental means it works enough to be useful, but accuracy varies meaningfully across documents. Indic scripts are the hardest cases in OCR: most tools refuse to ship them at all, or ship them with misleading "supported" labels. We ship them with honest expectations instead.

As accuracy improves, languages get promoted. Promotions are documented in the CHANGELOG with the validation work that supported them.

Language not on this list?

Inspire AI Lab has run extraction at scale on a 230M-document multilingual legal corpus. If you have a custom language requirement — fine-tuning, new script support, dialect handling — we can scope a custom build against your real corpus.

