Stylistic and lexical co-training for web block classification

Abstract

Many applications that use web data extract information from only a limited number of regions on a web page. As such, dividing a web page into blocks and then classifying those blocks have become a preprocessing step. We introduce PARCELS, an open-source, co-trained approach that performs classification based on separate stylistic and lexical views of the web page. Unlike previous work, PARCELS performs classification on fine-grained blocks. In addition to table-based layouts, the system handles real-world pages whose layout is based on divisions and spans, and it performs stylistic inference for pages that use cascading style sheets. Our evaluation shows that the co-training process yields a 28.5% reduction in error rate over a single-view classifier and that our approach is comparable to other state-of-the-art systems.
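
To make the two-view idea concrete, the following is a minimal sketch of a generic co-training loop, not the PARCELS implementation. The abstract does not specify the learners, features, or selection rule, so everything here is an assumption for illustration: the tiny naive Bayes learner, the feature names (such as "small-font"), the block labels ("nav", "content"), and the helpers `co_train` and `classify` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesView:
    """Tiny multinomial naive Bayes over string features (one view)."""

    def fit(self, feature_lists, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_lists, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict_proba(self, feats):
        """Return {label: posterior}, using add-one smoothing."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab) + 1
            score = math.log(count / total)
            for f in feats:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(exp.values())
        return {l: v / norm for l, v in exp.items()}


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    labels = [y for _, _, y in labeled]
    style_clf = NaiveBayesView().fit([s for s, _, _ in labeled], labels)
    lex_clf = NaiveBayesView().fit([w for _, w, _ in labeled], labels)
    return style_clf, lex_clf


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: (style_feats, lexical_feats, label) triples.
    unlabeled: (style_feats, lexical_feats) pairs."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Each view's classifier labels the blocks it is most confident
        # about; those blocks join the labeled set shared by both views.
        for view, clf in enumerate(fit_views(labeled)):
            scored = []
            for i, block in enumerate(pool):
                proba = clf.predict_proba(block[view])
                label = max(proba, key=proba.get)
                scored.append((proba[label], i, label))
            chosen = sorted(scored, reverse=True)[:per_round]
            for _, i, label in chosen:
                labeled.append((pool[i][0], pool[i][1], label))
            taken = {i for _, i, _ in chosen}
            pool = [b for i, b in enumerate(pool) if i not in taken]
    return fit_views(labeled)


def classify(style_clf, lex_clf, style_feats, lexical_feats):
    """Combine the two views by multiplying their posteriors."""
    p_style = style_clf.predict_proba(style_feats)
    p_lex = lex_clf.predict_proba(lexical_feats)
    combined = {label: p_style[label] * p_lex[label] for label in p_style}
    return max(combined, key=combined.get)


# Hypothetical toy blocks: two labeled, two unlabeled.
labeled_blocks = [
    (["small-font", "link-heavy"], ["home", "about"], "nav"),
    (["large-font", "paragraph"], ["results", "study"], "content"),
]
unlabeled_blocks = [
    (["small-font", "link-heavy"], ["home", "contact"]),
    (["large-font", "paragraph"], ["results", "data"]),
]
style_clf, lex_clf = co_train(labeled_blocks, unlabeled_blocks)
prediction = classify(style_clf, lex_clf, ["small-font"], ["contact"])
```

In this sketch the lexical classifier only learns that "contact" signals a navigation block because the stylistic view labeled (or helped label) the unlabeled block containing it. That is the kind of cross-view transfer co-training relies on.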

Publication
Proceedings of the 6th Annual ACM International Workshop on Web Information and Data Management
Min-Yen Kan
Associate Professor

WING lead; interests include Digital Libraries, Information Retrieval and Natural Language Processing.