Benchmarking and Improving OCR Systems for Southeast Asian Languages

While Optical Character Recognition (OCR) has been widely studied for high-resource languages such as English and Chinese, the efficacy and limitations of OCR models on Southeast Asian (SEA) languages remain largely unexplored. This study aims to bridge that gap by assessing and improving the performance of OCR technologies on SEA languages. To this end, we propose a reusable pipeline that gathers SEA-language text from Wikipedia and benchmarks popular OCR tools.

Jason Qiu
FYP Student (Aug ’24)

Tongyao Zhu
IPP Doctoral Student (Jan ’23; SEA)

PhD candidate, January 2023 intake

Min-Yen Kan
Associate Professor

WING lead; interests include Digital Libraries, Information Retrieval and Natural Language Processing.