Advancing PARCELS: PARser for Content Extraction and Logical Structure Using Inter- and Intra- Similarity Features
Research Area: Information Extraction
Year: 2005
Type of Publication: Technical Report  
Authors:
  • Aik Miang Lau
 
Institution: School of Computing, National University of Singapore
Type of Publication: Undergraduate Thesis
   
Abstract:
Labeling the different regions of a web page helps to extract the desired information. PARCELS is a system that divides a web page into blocks and automatically classifies them under different labels. It aims to distinguish, for example, between site navigation and main content. Many web pages from the same site are generated from a fixed template, resulting in pages that are similar in layout and wording. Blocks on the same page can also resemble one another. Hence, I introduce features to PARCELS that help to detect inter- and intra-page similarities. The purpose of these features is to model related blocks that are likely to share the same label, and hence to enhance the performance of the classification task in PARCELS. Evaluation shows that the additional features improve the system: the F1 measure increases by 5.1% over the original system. Blocks are also classified more precisely for each class label in the combined classifier, with macro precision reaching 75.2%. The improvement in the accuracy of the system is also statistically significant.
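For readers unfamiliar with the evaluation measures named in the abstract, the sketch below shows how macro precision and F1 are conventionally computed for a block-labeling task. It is not taken from the thesis: the function names, the label strings ("nav", "main", "footer") and the toy data are all hypothetical, and the thesis may compute its figures differently.

```python
from collections import Counter


def per_class_precision_recall(gold, predicted):
    """Return {label: (precision, recall)} for every label seen in either list."""
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    scores = {}
    for label in set(gold) | set(predicted):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


def macro_precision(scores):
    """Unweighted mean of per-class precision, so each label counts equally."""
    return sum(p for p, _ in scores.values()) / len(scores)


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical toy data: one label per block, invented for illustration only.
gold = ["nav", "nav", "main", "main", "main", "footer"]
predicted = ["nav", "main", "main", "main", "nav", "footer"]

scores = per_class_precision_recall(gold, predicted)
macro_p = macro_precision(scores)
per_class_f1 = {label: f1(p, r) for label, (p, r) in scores.items()}
```

Macro averaging gives every class label the same weight regardless of how many blocks carry it, so a rare label affects the score as much as a frequent one.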