Benchmarking and Improving OCR Systems for Southeast Asian Languages

While Optical Character Recognition (OCR) has been widely studied for high-resource languages such as English and Chinese, its performance on Southeast Asian (SEA) languages remains largely unexplored. This study addresses that gap by evaluating three OCR tools, namely EasyOCR, Tesseract, and the transformer-based General OCR Theory (GOT), on English, Indonesian, Vietnamese, and Thai. We introduce a reusable pipeline for collecting textual data from Wikipedia and benchmarking OCR tools on it. Contrary to popular belief, our results show that OCR tools perform well on complex scripts like Vietnamese and Thai, with most errors arising from characters outside the target language being misclassified. Additionally, we demonstrate that fine-tuning GOT is effective even with limited training data, yielding notable improvements on Vietnamese and Thai.
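
The abstract does not say which metric the benchmark reports. As a minimal sketch, assuming character error rate (CER) is the score of interest, the comparison between a Wikipedia reference text and an OCR output could look like the following. The function names `edit_distance` and `character_error_rate` are hypothetical and are not taken from the study's pipeline. The NFC normalization step is also an assumption, added so that precomposed and decomposed forms of the same Vietnamese letter are not counted as errors.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length, after NFC normalization (assumed)."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: the OCR output drops the diacritics on one letter.
    print(character_error_rate("Việt Nam", "Viet Nam"))
```

Counting in code points is a simplification: for Thai, where vowels and tone marks are separate combining characters, a grapheme-level count may be more appropriate, depending on how the errors are meant to be reported.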

Jason Qiu
FYP Student (Aug '24)

FYP student

Tongyao Zhu
IPP Doctoral Student (Jan '23; SEA)

PhD candidate, January 2023 intake

Min-Yen Kan
Associate Professor

WING lead; interests include Digital Libraries, Information Retrieval and Natural Language Processing.