Benchmarking and Improving OCR Systems for Southeast Asian Languages

1 Aug, 2024

While Optical Character Recognition (OCR) has been widely studied for high-resource languages such as English and Chinese, its performance on Southeast Asian (SEA) languages remains largely unexplored. This study addresses this gap by evaluating three OCR tools — EasyOCR, Tesseract, and the transformer-based General OCR Theory (GOT) — on English, Indonesian, Vietnamese, and Thai. We introduce a reusable pipeline for collecting textual data from Wikipedia and benchmarking OCR tools. Contrary to popular belief, our results show that OCR tools perform well on complex scripts like Vietnamese and Thai, with most errors arising from misclassifying characters outside the target language. Additionally, we demonstrate the effectiveness of fine-tuning GOT with limited training data, yielding notable improvements on Vietnamese and Thai.

Jason Qiu

FYP Alumnus (Aug ’24) Thesis: Benchmarking and Improving OCR System for Southeast Asian Languages

Tongyao Zhu

IPP Doctoral Student (Jan ’23; SEA)

Min-Yen Kan

Associate Professor