Introduction

March 8th, 2007

Previous work has repeatedly found that evaluating keyphrase extraction algorithms is subjective and difficult. According to Medelyan and Witten (2006), such an evaluation requires multiple judgements and cannot rely on a single set of keyphrases provided by a paper's author. We followed this line of argument and sought to construct a corpus in which each document has multiple keyphrase annotations.

In our corpus, each document may have more than one set of keyphrases: one from the author and the remaining sets from other annotators. Moreover, each document is broken down into sections, which are stored in a separate XML file. This enables partial retrieval of a document when only a particular section is needed, e.g. the abstract or the introduction.
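As an illustration of partial retrieval, the sketch below pulls a single section out of a sectioned XML document using Python's standard library. The element and attribute names (`document`, `section`, `name`) and the sample text are assumptions made for this example; the corpus files may use a different schema, so the sample and the path expression would need adjusting.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the real corpus schema may use different tag and
# attribute names. Adjust SAMPLE and the path expression to match the files.
SAMPLE = """\
<document id="example">
  <section name="abstract">Text of the abstract.</section>
  <section name="introduction">Text of the introduction.</section>
  <section name="conclusion">Text of the conclusion.</section>
</document>
"""


def get_section(xml_text, name):
    """Return the text of the first section with the given name, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree supports simple attribute predicates in its path syntax.
    node = root.find(f"./section[@name='{name}']")
    if node is None:
        return None
    # itertext() also collects text nested inside child elements.
    return "".join(node.itertext()).strip()


abstract = get_section(SAMPLE, "abstract")
missing = get_section(SAMPLE, "references")
print(abstract)
```

Returning `None` for an absent section lets a caller skip documents that lack, say, an explicit conclusion, rather than failing on them.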

The corpus consists of more than 200 scientific publications, each available in four formats: PDF, HTML, plain text, and XML.

Link to the corpus (for browsing): Keyphrase Corpus.

(updated by Min: Sun Aug 17 12:13:06 SGT 2008) To download the full corpus, you or your institution must have a subscription to the ACM Digital Library, because the download contains source PDF files from the ACM Digital Library: Keyphrase Corpus (warning: large download, ~87 MB).

Corpus collection

The keyphrase sets were collected via email. We first emailed students and staff in the School of Computing at the National University of Singapore (NUS) to invite them to participate in the experiment. To ensure that all participants were familiar with reading Computer Science papers, we restricted student participants to third- and fourth-year students of the School of Computing.

More than 80 subjects responded to our invitation. We divided these subjects into groups of four. Next, we emailed each subject links to three PDF papers. One of the three had already been annotated by us; this annotated paper was meant to measure how carefully the subjects performed the task. Subjects in the same group were allocated the same annotated paper. The papers ranged from 4 to 12 pages in length. The authors' keyphrases were removed so as not to influence the annotators' decisions. Each subject was asked to first read each paper for 10 minutes and then to come up with a list of 10 keyphrases for it. However, only 40 subjects returned results. Including the documents that we annotated ourselves, there are 156 annotated documents in total.
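The allocation scheme described above can be sketched roughly as follows: subjects are split into groups of four, every member of a group receives the same pre-annotated paper, and each subject receives two further papers. This is only an illustration, not the script used for the experiment. The function name, the identifiers, and the choice to give every subject distinct additional papers drawn at random from a pool are all assumptions; the description does not say how the other two papers were chosen.

```python
import random


def allocate(subjects, annotated_papers, pool, group_size=4, extra=2, seed=0):
    """Map each subject to [shared annotated paper, extra papers...].

    Assumption: the extra papers are distinct across subjects and drawn
    at random from `pool`; the source does not specify this.
    """
    groups = [subjects[i:i + group_size]
              for i in range(0, len(subjects), group_size)]
    if len(annotated_papers) < len(groups):
        raise ValueError("need one pre-annotated paper per group")
    if len(pool) < extra * len(subjects):
        raise ValueError("pool too small for distinct extra papers")

    shuffled = list(pool)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    remaining = iter(shuffled)

    allocation = {}
    for group, shared in zip(groups, annotated_papers):
        for subject in group:
            extras = [next(remaining) for _ in range(extra)]
            allocation[subject] = [shared] + extras
    return allocation


# Hypothetical identifiers for a small example with two groups.
subjects = [f"subject{i}" for i in range(8)]
annotated = ["annotated0", "annotated1"]
pool = [f"paper{i}" for i in range(16)]
allocation = allocate(subjects, annotated, pool)
```

Because all members of a group share one pre-annotated paper, their keyphrase lists for that paper can be compared against the same reference annotation.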

Publications

  • Thuy Dung Nguyen and Min-Yen Kan (2007) Keyphrase Extraction in Scientific Documents. In D.H.-L. Goh et al. (Eds.): ICADL 2007, LNCS 4822, pp. 317-326.
    [ preprint ] [ slides ]
  • Thuy Dung Nguyen (2007) Automatic keyphrase generation. Technical report, National University of Singapore.
    [ Thesis ]

Group Members

  • Min-Yen Kan, project lead
  • Emma Thuy Dung Nguyen, final year project student