WING paper reading
2004/09/07
 
A Repetition Based Measure for Verification of Text Collections and for Text Categorization
Dmitry V. Khmelev and William J Teahan

- In SIGIR '03
- highlighted, printed and filed
- related to plagiarism detection, webpage similarity, corpus verification, PARCELS.

Uses simple repetition of text substrings for plagiarism and duplicate detection. The measure is computed over a suffix array built from the concatenation of the entire document set. The idea is to use not just the single longest common substring but a normalized sum: for every suffix of the target document, take the length of its longest prefix that also occurs elsewhere in the collection, and add these lengths up.
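
A minimal sketch of the R measure as I read it, to make the sum concrete: for each starting position in the target, find the length of the longest substring starting there that also occurs in some other document, then divide the total by its maximum possible value, l(l+1)/2 for a target of length l. The function name r_measure is my own, and this version uses plain substring search instead of the suffix array from the paper, so it only illustrates the quantity, not the efficient computation.

```python
def r_measure(target, collection):
    """Normalized sum, over every start position in `target`, of the length
    of the longest substring starting there that occurs in `collection`.

    Naive substring search; the paper uses a suffix array instead.
    """
    n = len(target)
    if n == 0:
        return 0.0
    total = 0
    q = 0  # match length carried over from the previous start position
    for i in range(n):
        # If target[i-1:i-1+q] occurred in a document, target[i:i-1+q] does
        # too, so the match at i is at least q - 1 long.
        q = max(q - 1, 0)
        while i + q < n and any(target[i:i + q + 1] in doc for doc in collection):
            q += 1
        total += q
    # The largest possible total is n + (n - 1) + ... + 1 = n(n + 1) / 2.
    return 2.0 * total / (n * (n + 1))


if __name__ == "__main__":
    print(r_measure("abc", ["abc"]))   # fully repeated text
    print(r_measure("abc", ["xyz"]))   # nothing repeated
    print(r_measure("abxab", ["ab"]))  # partially repeated
```

A value near 1 means almost all of the target is repeated elsewhere in the collection (a likely duplicate), and a value near 0 means almost nothing is.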

The R measure apparently works well not just for duplicate detection but also for authorship detection on the test corpora used in their paper.

To think about: how to adapt this measure into an effective (and speedy) tool for web page fragment classification.



 
Detecting and Partitioning of Data Objects in Complex Web Pages
Shiren Ye and Tat-Seng Chua

- In Web Intelligence '04: available at http://www.comp.nus.edu.sg/~yesr/webmining/Wr2290_ye_s.pdf
- Read (4 page version)
- Related to: information extraction, PARCELS, web page cleaning

Uses a tree-based kernel (?) to calculate the similarity of a page to a corpus of webpages, working over the DOM tree structure of each page. They define a novelty value to distinguish the "data" portion of a web page from the "non-data" portion. Further processing delimits the data portion into records, but I will not focus on that aspect of the work in this summary.

Their similarity metric uses both attributes (think HTML tags) and the text of the HTML node to calculate the similarity of a DOM tree node. I'm not sure I am deciphering the formulas for novelty and repeatability correctly, or exactly how they are calculated; better to ask Shiren or Tat-Seng about this...
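
Until the exact formulas are confirmed, here is a rough sketch of what a node similarity mixing structure and text could look like. To be clear, this is NOT the formula from the paper: the Node class, the Jaccard overlap, and the alpha weight are all my own assumptions, chosen only to illustrate the idea of combining tag/attribute evidence with text evidence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: tag name, attributes, own text."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def node_similarity(x, y, alpha=0.5):
    """Weighted mix of structural and textual overlap (assumed form, not the
    paper's). alpha is the weight given to the tag/attribute part."""
    struct_x = {("tag", x.tag)} | {("attr", k) for k in x.attrs}
    struct_y = {("tag", y.tag)} | {("attr", k) for k in y.attrs}
    structural = jaccard(struct_x, struct_y)
    textual = jaccard(x.text.lower().split(), y.text.lower().split())
    return alpha * structural + (1 - alpha) * textual


if __name__ == "__main__":
    a = Node("td", {"class": "price"}, "USD 10")
    b = Node("td", {"class": "price"}, "USD 12")
    print(node_similarity(a, b))
```

Two table cells with the same tag and attributes but different prices score high on the structural part and lower on the text part, which is roughly the behaviour one would want when spotting repeated record templates.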

Food for thought: how can this work be related to the R measure introduced at SIGIR '03 by Khmelev and Teahan? Sort of a tree-kernel-based R similarity (but without the efficiency gains?)







