
While Optical Character Recognition (OCR) has been widely studied for high-resource languages such as English and Chinese, its performance on Southeast Asian (SEA) languages remains largely unexplored. This study addresses this gap by evaluating three OCR tools, namely EasyOCR, Tesseract, and the transformer-based General OCR Theory (GOT), on English, Indonesian, Vietnamese, and Thai. We introduce a reusable pipeline for collecting textual data from Wikipedia and benchmarking OCR tools on it. Contrary to popular belief, our results show that OCR tools perform well on the complex scripts of Vietnamese and Thai, with most errors arising from the misclassification of characters that fall outside the target language. Additionally, we demonstrate that fine-tuning GOT with limited training data is effective, yielding notable improvements on Vietnamese and Thai.