Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs

Nguyen Viet Cuong, Muthu Kumar Chandrasekaran, Min-Yen Kan, Wee Sun Lee

January 2015

Abstract

We address the tasks of recovering bibliographic and document structure metadata from scholarly documents. We leverage higher order semi-Markov conditional random fields to model long-distance label sequences, improving upon the performance of the linear-chain conditional random field model. We introduce the notion of extensible features, which allows the expensive inference process to be simplified through memoization, resulting in lower computational complexity. Our method significantly improves on the state of the art in three related scholarly document extraction tasks.

Type

Conference paper

Publication

Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries

Muthu Kumar Chandrasekaran

Doctoral Alumnus (May ’19). Thesis: Predicting Instructor Intervention in MOOC forums.

Min-Yen Kan

Associate Professor