Automatic Inference of Web Document Metadata - A Study on Sitemaps
Research Area: Information Retrieval
Year: 2004
Type of Publication: Technical Report  
Authors:
  • Alex Ng
 
Institution: School of Computing, National University of Singapore
Type of Publication: Undergraduate Thesis
   
Abstract:
Although different types of metadata can be extracted or inferred from a website, little research has examined which metadata help users judge a website’s utility. This thesis presents the results of a survey that examines the importance of different types of metadata for webpage relevance and the relationships among them. The results suggest that the ranking of different metadata depends heavily on the domain and size of the website. To follow up on the novel findings of the survey, we describe and implement a system for the automatic annotation and detection of sitemaps in websites. Sitemap features were automatically extracted from websites, and a model was trained on the labeled training data, using Support Vector Machines (SVMs) as a black-box classifier, to annotate websites automatically. The system was tested on 488 random websites and shows high accuracy in sitemap detection.