Predicting Web 2.0 Thread Updates
Research Area: Information Retrieval
Year: 2012
Type of Publication: Technical Report
Keywords: web crawlers, revisitation, discussion threads, evaluation metrics
  • Shawn Tan
With the advent of Web 2.0, sites with forums or similar thread-based discussion features are increasingly common. An incremental web crawler that maintains a database of up-to-date information extracted from such sites must strike a balance between bandwidth usage and freshness of data. Our objective is to estimate the arrival time of the next update to such threads. We demonstrate three regression-based methods for doing so and show that they outperform the baseline. We also propose a novel metric for evaluating such a model that balances the timeliness of its predictions against bandwidth consumption. Finally, we recommend a way to incorporate these methods into an incremental crawler.
