Thai-TrOCR Model

Introduction

ThaiTrOCR is a fine-tuned version of the TrOCR base handwritten model for Optical Character Recognition (OCR) of Thai and English text. It reads handwritten text-line images in both languages using the TrOCR architecture, which pairs a Vision Transformer encoder with an Electra-based text decoder. The model is designed to be compact and lightweight so that it can be deployed in resource-constrained environments while keeping character error rates low (see Model Performance Comparison).

Encoder: TrOCR Base Handwritten
Decoder: Electra Small (trained on a Thai corpus)

Training Dataset

pythainlp/thai-wiki-dataset-v3
pythainlp/thaigov-corpus
Salesforce/wikitext

How to Use

Here’s how to use this model in PyTorch:

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# Load processor and model
processor = TrOCRProcessor.from_pretrained('openthaigpt/thai-trocr')
model = VisionEncoderDecoderModel.from_pretrained('openthaigpt/thai-trocr')

# Load an image (replace the placeholder with the URL of a text-line image)
url = 'your_image_url_here'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Process and generate text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
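
The snippet above handles a single image. For a page that has already been split into text lines, the same three calls can be applied batch by batch. The following is a minimal sketch, not part of the original model card: the helper names `chunked` and `recognize_lines` and the default `batch_size` of 8 are illustrative assumptions, and `processor` and `model` are expected to be the objects loaded above.

```python
from typing import Any, Iterator, List, Sequence


def chunked(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def recognize_lines(images: Sequence[Any], processor: Any, model: Any,
                    batch_size: int = 8) -> List[str]:
    """Run OCR on a sequence of RGB text-line images, one batch at a time.

    `processor` and `model` are assumed to be the TrOCRProcessor and
    VisionEncoderDecoderModel from the single-image example (hypothetical
    helper, not an API of the model repository).
    """
    texts: List[str] = []
    for batch in chunked(images, batch_size):
        # Preprocess every image in the batch into one tensor of pixel values.
        pixel_values = processor(images=list(batch), return_tensors="pt").pixel_values
        # Generate token ids for the whole batch, then decode them to strings.
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

Called as recognize_lines(line_images, processor, model), this returns one string per input image in the same order. A smaller batch_size lowers peak memory use on CPU-only machines at the cost of more forward passes.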

Model Performance Comparison

The table below reports the character error rate (CER) of ThaiTrOCR, EasyOCR, and Tesseract on each document type. The final row gives the adjusted mean score for each model:

Document Type	ThaiTrOCR	EasyOCR	Tesseract
Handwritten	0.190034	0.410738	1.032375
PDF Document	0.057597	0.085937	0.761595
PDF Document (EN-TH)	0.053968	0.308075	1.061107
Real Document	0.147440	0.293482	0.915707
Scene Text	0.134182	0.390583	2.408704
Adjusted Mean	0.123600	0.298474	1.269101
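
The table does not state how the adjusted mean is derived. As a rough cross-check only, the sketch below computes the plain unweighted mean of the five document-type rows for each model and prints it next to the reported figure. The variable names are illustrative, and the plain mean is an assumption made for comparison, not the benchmark's own aggregation.

```python
from statistics import mean

# Scores copied from the table above, in row order:
# Handwritten, PDF Document, PDF Document (EN-TH), Real Document, Scene Text.
scores_by_model = {
    "ThaiTrOCR": [0.190034, 0.057597, 0.053968, 0.147440, 0.134182],
    "EasyOCR":   [0.410738, 0.085937, 0.308075, 0.293482, 0.390583],
    "Tesseract": [1.032375, 0.761595, 1.061107, 0.915707, 2.408704],
}

# "Adjusted Mean" row as reported in the table.
reported_adjusted_mean = {
    "ThaiTrOCR": 0.123600,
    "EasyOCR": 0.298474,
    "Tesseract": 1.269101,
}

# Plain unweighted mean over the five document types (comparison only).
plain_mean = {name: mean(values) for name, values in scores_by_model.items()}

for name in scores_by_model:
    print(f"{name:10s} plain mean {plain_mean[name]:.6f}  "
          f"reported {reported_adjusted_mean[name]:.6f}")
```

For these numbers the plain means come out slightly below the reported adjusted means, so the adjustment presumably involves a weighting or filtering step that is not described here. The ranking of the three models is the same under both.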

Notes

Scores are character error rates (CER); lower is better.
Tesseract supports only one language at a time; this benchmark runs it with Thai only.
Benchmarking was performed on a Google Colab CPU runtime.
The evaluation dataset is openthaigpt/thai-ocr-evaluation.
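
For readers unfamiliar with CER, the sketch below shows the usual definition: the Levenshtein edit distance between the reference and the recognized text, divided by the reference length. It is an illustration under stated assumptions, not the benchmark's evaluation script. The function names are hypothetical, characters are counted as Unicode code points (so Thai vowel and tone marks count individually), and no text normalization is applied; the benchmark may differ on all of these.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("abc", "abc"))    # identical strings
print(cer("ab", "abcdef"))  # four extra characters against a 2-character reference
```

Because the denominator is the reference length, a hypothesis with many spurious insertions can score above 1, which is how values such as those in the Tesseract column can arise.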

Authors

Suchut Sapsathien (suchut@outlook.com)
Jillaphat Jaroenkantasima (autsadang41@gmail.com)