Advanced Multi-Language OCR System

Features:

🌐 Multi-language support: English, Bangla (Bengali), and Mathematical expressions
🧮 Advanced Math Recognition: Pix2Text integration for LaTeX and mathematical formulas
📊 Detailed Analysis: Character-level classification and confidence scores
💾 Download Results: Get extracted text and detailed JSON analysis

📄 Upload PDF File

📊 Extraction Summary

📝 Extracted Text

Evaluation Features:

🎯 Character-level accuracy: Precise character matching and edit distance
📚 Word-level accuracy: Word matching and error rates
📄 Line-level accuracy: Line comparison and similarity scores
🌐 Language-specific metrics: Separate accuracy for English, Bangla, and Math
🏆 Grading system: Letter grades from A+ to F with recommendations

📄 OCR Extracted Text File (.txt)

📑 Ground Truth Baseline File (.txt)

📝 Evaluation Name (Optional)

🎯 Evaluation Summary

📈 Detailed Evaluation Results

🔍 Advanced Multi-Language OCR System

This application performs Optical Character Recognition (OCR) on PDF documents that mix English, Bangla (Bengali), and mathematical expressions.

Features:

Multi-language Support: Process English and Bangla (Bengali) text within the same document
Mathematical Recognition: Advanced extraction of mathematical formulas and equations using Pix2Text
Intelligent Classification: Automatic detection and classification of text regions by language/content type
Accuracy: Image preprocessing combined with multiple OCR engines to improve recognition quality
Detailed Analysis: Character-by-character analysis with confidence scores and language distribution

Evaluation Features:

Comprehensive Metrics: Character, word, and line-level accuracy measurements
Language-Specific Analysis: Separate accuracy scores for different languages and mathematical content
Edit Distance Calculation: Precise measurement of text differences using Levenshtein distance
Grading System: Letter grades (A+ to F) with improvement recommendations
Detailed Comparison: Side-by-side diff analysis showing insertions, deletions, and matches
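
To make the character-level and word-level metrics above concrete, here is a minimal sketch of Levenshtein distance and an accuracy score derived from it. The function names and the formula `1 - distance / len(ground_truth)` are assumptions chosen for illustration. The application may normalize or tokenize differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # drop item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def accuracy(ocr, truth):
    """Assumed definition: 1 - distance / len(truth), clamped to [0, 1]."""
    if len(truth) == 0:
        return 1.0 if len(ocr) == 0 else 0.0
    return max(0.0, 1.0 - levenshtein(ocr, truth) / len(truth))


def char_accuracy(ocr_text, truth_text):
    # Compare the two texts character by character.
    return accuracy(ocr_text, truth_text)


def word_accuracy(ocr_text, truth_text):
    # Compare whitespace-separated tokens instead of characters.
    return accuracy(ocr_text.split(), truth_text.split())
```

With the OCR output "helo world" and the ground truth "hello world", one character is missing out of 11, so character accuracy is about 0.91, while one of the two words is wrong, so word accuracy is 0.5.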

Grading Scale:

A+ (95-100%): Excellent - Professional-grade accuracy
A (90-94%): Very Good - High-quality results with minor errors
B (80-89%): Good - Acceptable for most applications
C (70-79%): Fair - May require manual review
D (60-69%): Poor - Significant improvements needed
F (<60%): Very Poor - Major issues requiring attention
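
The grade bands above can be sketched as a simple threshold lookup. This is an illustrative sketch, not the application's code: the names `GRADE_BANDS` and `letter_grade` are hypothetical, and it assumes each band's lower bound is inclusive, so a fractional score such as 94.5% falls into the A band.

```python
# (lower bound in percent, grade, label), checked from highest to lowest.
# Assumption: a score belongs to the first band whose lower bound it reaches.
GRADE_BANDS = [
    (95.0, "A+", "Excellent"),
    (90.0, "A", "Very Good"),
    (80.0, "B", "Good"),
    (70.0, "C", "Fair"),
    (60.0, "D", "Poor"),
]


def letter_grade(accuracy_percent):
    """Map an accuracy percentage (0 to 100) to a (grade, label) pair."""
    if not 0.0 <= accuracy_percent <= 100.0:
        raise ValueError("accuracy_percent must be between 0 and 100")
    for lower_bound, grade, label in GRADE_BANDS:
        if accuracy_percent >= lower_bound:
            return grade, label
    return "F", "Very Poor"
```

For example, `letter_grade(97)` gives `("A+", "Excellent")` and `letter_grade(59.9)` gives `("F", "Very Poor")`.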

Supported Languages:

English: Full Latin alphabet with punctuation and symbols
Bangla (Bengali): Complete Bengali Unicode range (U+0980-U+09FF)
Mathematical Expressions:
- Basic arithmetic operators (+, -, ×, ÷, =)
- Greek letters (α, β, γ, δ, π, θ, λ, μ, Ω, etc.)
- Mathematical symbols (∑, ∫, √, ∞, ∂, →, ≤, ≥, etc.)
- Subscripts and superscripts
- Functions and equations
- LaTeX-style expressions
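
One way to picture character-level classification for the categories above is a lookup on Unicode code points. The sketch below is an assumption about how such a classifier could work, not the application's implementation. The category names and the choice of Unicode blocks for math (Greek and Coptic, Arrows, Mathematical Operators) are illustrative.

```python
from collections import Counter

# Basic arithmetic operators listed above. Treating "-" as math is a
# simplification: it is also an ordinary hyphen in English text.
ARITHMETIC_OPERATORS = set("+-×÷=")


def classify_char(ch):
    """Return 'bangla', 'math', 'english', or 'other' for one character."""
    code = ord(ch)
    if 0x0980 <= code <= 0x09FF:   # Bengali Unicode block
        return "bangla"
    if 0x0370 <= code <= 0x03FF:   # Greek and Coptic block (α, β, π, Ω, ...)
        return "math"
    if 0x2190 <= code <= 0x21FF:   # Arrows block (→)
        return "math"
    if 0x2200 <= code <= 0x22FF:   # Mathematical Operators block (∑, ∫, √, ≤, ...)
        return "math"
    if ch in ARITHMETIC_OPERATORS:
        return "math"
    if ch.isascii() and ch.isalpha():
        return "english"
    return "other"                 # digits, punctuation, everything else


def language_distribution(text):
    """Count non-whitespace characters per category."""
    return Counter(classify_char(ch) for ch in text if not ch.isspace())
```

A real system would also need context, since a Latin letter such as `x` can be English text or a math variable depending on where it appears.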

Tips for Best Results:

PDF Quality: Use high-resolution PDFs (300+ DPI) for better accuracy
Text Clarity: Ensure text is not blurry, skewed, or low contrast
Language Consistency: Mixed-language documents work best when languages are clearly separated
Mathematical Content: Complex equations may require manual verification
File Size: Larger documents may take longer to process

Troubleshooting:

Empty Results: Check whether the PDF contains selectable text or only scanned images that need OCR
Low Accuracy: Try preprocessing the PDF to improve image quality
Mixed Languages: Ensure the document has clear language boundaries
Mathematical Errors: Complex formulas may need manual correction

For issues, suggestions, or contributions, please visit our GitHub repository.

Made with ❤️ for advancing multilingual text recognition