Thai-TrOCR Model
Introduction
ThaiTrOCR is a fine-tuned version of the TrOCR base handwritten model for Optical Character Recognition (OCR) in Thai and English. It reads handwritten text-line images in both languages using the TrOCR architecture, which pairs a Vision Transformer encoder with an Electra-based text decoder. The model is designed to be compact and lightweight, so it can be deployed in resource-constrained environments while keeping character recognition accuracy high.
- Encoder: TrOCR Base Handwritten
- Decoder: Electra Small (Trained with Thai corpus)
Training Dataset
- pythainlp/thai-wiki-dataset-v3
- pythainlp/thaigov-corpus
- Salesforce/wikitext
How to Use
Here’s how to use this model in PyTorch:
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests
# Load processor and model
processor = TrOCRProcessor.from_pretrained('openthaigpt/thai-trocr')
model = VisionEncoderDecoderModel.from_pretrained('openthaigpt/thai-trocr')
# Load an image
url = 'your_image_url_here'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# Process and generate text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
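Because the model works on text-line images, a page usually has to be split into lines and each line recognized separately. The sketch below batches several line images through the same `processor` and `model` calls used above. The helper name `recognize_lines` and the `max_new_tokens=64` cap are illustrative assumptions, not part of the model's published interface.

```python
from typing import Any, List, Sequence


def recognize_lines(images: Sequence[Any], processor: Any, model: Any,
                    max_new_tokens: int = 64) -> List[str]:
    """Recognize several text-line images in one batch (hypothetical helper)."""
    if not images:
        return []
    # The processor preprocesses every image and stacks them into one tensor.
    pixel_values = processor(images=list(images), return_tensors="pt").pixel_values
    # max_new_tokens is an assumed cap on output length; tune it for your lines.
    generated_ids = model.generate(pixel_values, max_new_tokens=max_new_tokens)
    # batch_decode returns one string per input image, in input order.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)
```

With `processor` and `model` loaded as shown above, calling `recognize_lines([Image.open(p).convert("RGB") for p in paths], processor, model)` should return one string per file, where `paths` is a placeholder for your own list of line-image files.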
Model Performance Comparison
The table below compares the models by document type; the final row gives each model's adjusted mean score:
Document Type | ThaiTrOCR | EasyOCR | Tesseract |
---|---|---|---|
Handwritten | 0.190034 | 0.410738 | 1.032375 |
PDF Document | 0.057597 | 0.085937 | 0.761595 |
PDF Document (EN-TH) | 0.053968 | 0.308075 | 1.061107 |
Real Document | 0.147440 | 0.293482 | 0.915707 |
Scene Text | 0.134182 | 0.390583 | 2.408704 |
Adjusted Mean | 0.123600 | 0.298474 | 1.269101 |
Notes
- Scores are character error rates (CER), so lower is better.
- Tesseract was run with a single language in this benchmark: Thai only.
- Benchmarking was performed on a Google Colab CPU runtime.
- The evaluation dataset is openthaigpt/thai-ocr-evaluation.
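For readers unfamiliar with CER, here is a minimal sketch of the usual definition: the character-level edit distance between the reference and the recognized text, divided by the reference length. The benchmark's exact normalization and the meaning of "adjusted" are not stated here, so treat this as an illustration under that assumption rather than the evaluation script. The function names are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # delete ref_char
                            curr[j - 1] + 1,      # insert hyp_char
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition CER can exceed 1 when the output contains many spurious characters, which is consistent with some Tesseract scores in the table being above 1. Python strings iterate over code points, so Thai vowel and tone marks count as separate characters in this sketch.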
Authors
- Suchut Sapsathien (suchut@outlook.com)
- Jillaphat Jaroenkantasima (autsadang41@gmail.com)