Advanced Multi-Language OCR System
Powered by Pix2Text, Tesseract, and FastAPI
Extract text from PDFs containing English, Bangla, and Mathematical expressions with high accuracy. Evaluate OCR performance with comprehensive metrics and detailed analysis.
Upload a PDF and extract text using advanced multi-language OCR
Features:
- Multi-language support: English, Bangla (Bengali), and Mathematical expressions
- Advanced Math Recognition: Pix2Text integration for LaTeX and mathematical formulas
- Detailed Analysis: Character-level classification and confidence scores
- Download Results: Get extracted text and detailed JSON analysis
Compare OCR extracted text with ground truth baseline for accuracy analysis
Evaluation Features:
- Character-level accuracy: Precise character matching and edit distance
- Word-level accuracy: Word matching and error rates
- Line-level accuracy: Line comparison and similarity scores
- Language-specific metrics: Separate accuracy for English, Bangla, and Math
- Grading system: Letter grades from A+ to F with recommendations
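As a rough illustration of the character-level and word-level metrics listed above, the following sketch computes an edit distance and turns it into an accuracy score. The function names and the formula `1 - distance / reference length` are assumptions made for this example; the application's own evaluator may normalize differently.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # drop an item from a
                current[j - 1] + 1,      # add an item to a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def accuracy(ocr, truth):
    """Assumed definition: 1 - distance / len(truth), clamped at 0."""
    if len(truth) == 0:
        return 1.0 if len(ocr) == 0 else 0.0
    return max(0.0, 1.0 - levenshtein(ocr, truth) / len(truth))


def evaluate(ocr_text, truth_text):
    """Character-level and word-level accuracy for one pair of texts."""
    return {
        "char_accuracy": accuracy(ocr_text, truth_text),
        "word_accuracy": accuracy(ocr_text.split(), truth_text.split()),
    }
```

Calling `evaluate("hello wrld", "hello world")` scores the single missing character against 11 reference characters and the single wrong word against 2 reference words, which shows why word-level accuracy is usually the harsher of the two.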
Advanced Multi-Language OCR System
This application performs Optical Character Recognition (OCR) on documents that mix languages and mathematical expressions.
Key Features
PDF Text Extraction
- Multi-language Support: Simultaneously process English and Bangla (Bengali) text
- Mathematical Recognition: Advanced extraction of mathematical formulas and equations using Pix2Text
- Intelligent Classification: Automatic detection and classification of text regions by language/content type
- High Accuracy: Optimized preprocessing and multiple OCR engines for best results
- Detailed Analysis: Character-by-character analysis with confidence scores and language distribution
OCR Accuracy Evaluation
- Comprehensive Metrics: Character, word, and line-level accuracy measurements
- Language-Specific Analysis: Separate accuracy scores for different languages and mathematical content
- Edit Distance Calculation: Precise measurement of text differences using Levenshtein distance
- Grading System: Letter grades (A+ to F) with improvement recommendations
- Detailed Comparison: Side-by-side diff analysis showing insertions, deletions, and matches
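The line-level similarity and the diff of insertions, deletions, and matches can be approximated with Python's standard `difflib` module. This is a minimal sketch under that assumption; the helper names `line_similarities` and `diff_ops` are hypothetical and not taken from the application.

```python
import difflib
from itertools import zip_longest


def line_similarities(truth_text, ocr_text):
    """Similarity ratio (0.0 to 1.0) for each pair of corresponding lines."""
    truth_lines = truth_text.splitlines()
    ocr_lines = ocr_text.splitlines()
    scores = []
    # A line missing on one side is compared against an empty string.
    for truth_line, ocr_line in zip_longest(truth_lines, ocr_lines, fillvalue=""):
        matcher = difflib.SequenceMatcher(None, truth_line, ocr_line)
        scores.append(matcher.ratio())
    return scores


def diff_ops(truth_text, ocr_text):
    """List of (tag, truth_fragment, ocr_fragment) tuples.

    tag is one of 'equal', 'replace', 'delete', 'insert', describing how
    the ground truth would have to change to become the OCR output.
    """
    matcher = difflib.SequenceMatcher(None, truth_text, ocr_text, autojunk=False)
    return [
        (tag, truth_text[i1:i2], ocr_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]
```

A `delete` entry marks text the OCR dropped and an `insert` entry marks text it added, which is the information a side-by-side view needs.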
Technology Stack
- Pix2Text: Advanced mathematical expression recognition
- Tesseract OCR: Multi-language text recognition with Bengali support
- OpenCV: Image preprocessing and enhancement
- PDF2Image: High-quality PDF to image conversion
- FastAPI: RESTful API backend
- Gradio: Interactive web interface
Usage Instructions
For PDF Text Extraction:
- Upload a PDF file using the file picker
- Click "Extract Text" to start processing
- Review the extraction summary for statistics
- Copy the extracted text or download the files
- Download the JSON file for detailed analysis data
For OCR Evaluation:
- Upload the OCR-extracted text file (what you want to evaluate)
- Upload the ground truth baseline file (the correct text)
- Optionally provide an evaluation name for identification
- Click "Evaluate Accuracy" to run the comparison
- Review the detailed metrics and recommendations
Accuracy Grading System
- A+ (95-100%): Excellent - Professional-grade accuracy
- A (90-94%): Very Good - High-quality results with minor errors
- B (80-89%): Good - Acceptable for most applications
- C (70-79%): Fair - May require manual review
- D (60-69%): Poor - Significant improvements needed
- F (<60%): Very Poor - Major issues requiring attention
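The grade bands above can be expressed as a small lookup. This sketch assumes each band's lower bound is inclusive, so that a fractional score such as 94.5% falls into A; the function name `letter_grade` is hypothetical.

```python
def letter_grade(accuracy_percent):
    """Map an accuracy percentage (0 to 100) to a letter grade."""
    # (inclusive lower bound, grade), checked from highest to lowest.
    thresholds = [(95, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D")]
    for minimum, grade in thresholds:
        if accuracy_percent >= minimum:
            return grade
    return "F"
```

For example, `letter_grade(97.2)` returns `"A+"` and `letter_grade(59.9)` returns `"F"`.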
Supported Languages & Content
- English: Full Latin alphabet with punctuation and symbols
- Bangla (Bengali): Complete Bengali Unicode range (U+0980-U+09FF)
- Mathematical Expressions:
  - Basic arithmetic operators (+, -, ×, ÷, =)
  - Greek letters (α, β, γ, δ, π, θ, λ, μ, Ω, etc.)
  - Mathematical symbols (such as ∫, ≤, ≥)
  - Subscripts and superscripts
  - Functions and equations
  - LaTeX-style expressions
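Character-level classification by content type can be sketched with Unicode code point ranges. The Bengali range is the one stated above; treating the Greek and Coptic block (U+0370-U+03FF), the Mathematical Operators block (U+2200-U+22FF), and a handful of ASCII operators as "math" is an assumption of this example, as are the function names. Digits and punctuation fall into "other" here for simplicity.

```python
from collections import Counter

# Assumed set of operator characters counted as mathematical content.
BASIC_MATH_CHARS = set("+-=<>^_\u00d7\u00f7")  # includes the multiply and divide signs


def classify_char(ch):
    """Return 'bangla', 'math', 'english', or 'other' for one character."""
    code = ord(ch)
    if 0x0980 <= code <= 0x09FF:  # Bengali Unicode block
        return "bangla"
    if 0x0370 <= code <= 0x03FF or 0x2200 <= code <= 0x22FF:
        return "math"  # Greek letters or mathematical operators
    if ch in BASIC_MATH_CHARS:
        return "math"
    if ch.isascii() and ch.isalpha():
        return "english"
    return "other"


def language_distribution(text):
    """Count non-whitespace characters per category."""
    return Counter(classify_char(ch) for ch in text if not ch.isspace())
```

A real classifier would also need context, since a character such as `-` can be a hyphen in English text or a minus sign in a formula.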
Tips for Best Results
- PDF Quality: Use high-resolution PDFs (300+ DPI) for better accuracy
- Text Clarity: Ensure text is not blurry, skewed, or low contrast
- Language Consistency: Mixed-language documents work best when languages are clearly separated
- Mathematical Content: Complex equations may require manual verification
- File Size: Larger documents may take longer to process
Troubleshooting
- Empty Results: Check whether the PDF contains selectable text or only scanned images that need OCR
- Low Accuracy: Try preprocessing the PDF to improve image quality
- Mixed Languages: Ensure the document has clear language boundaries
- Mathematical Errors: Complex formulas may need manual correction
Support & Feedback
For issues, suggestions, or contributions, please visit our GitHub repository.
Made with ❤️ for advancing multilingual text recognition
Links: GitHub Repository | Documentation
Powered by: Pix2Text • Tesseract OCR • OpenCV • FastAPI • Gradio