Proactive Scholarly Paper Metadata Crawler

This is the home page of the open-source scholarly paper metadata crawler Kairos. 

This focused crawler proactively crawls likely conference and workshop web sites, visiting them around the event dates, to acquire metadata and links to any open-access scholarly documents in a timely manner. Because Kairos acts proactively, new scholarly work can become available to other scholars with a relatively short delay between publication and ingestion into a digital library (DL). This shortens the time researchers must wait before new scholarly work appears in search engines.

The crawler is built on top of the popular open-source crawler Nutch, and is logically structured in two stages. The first stage locates likely conference websites from portals that list scholarly events. The second stage performs a breadth-limited crawl of each confirmed conference website, locating pertinent scholarly document metadata and links to full-text PDF source documents. The separation allows future modules to interact with the crawler at two different interfaces: entering a known portal that lists scholarly events, or entering a known webpage of a workshop or conference event that lists scholarly documents.

For the first stage, we modified the inner Lucene text retrieval library to use a custom filtering module that discards candidate URLs that do not appear to be conference sites. The URL classifier is a maximum entropy classifier that bases its judgment on features extracted from the candidate URL.
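
As an illustration of what URL-derived features for such a classifier might look like, here is a minimal Python sketch. The feature set, the keyword list and the function name url_features are assumptions made for this example; they are not taken from the Kairos source, and the sketch stops at feature extraction rather than training a maximum entropy model.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; the features Kairos actually uses are not listed here.
EVENT_KEYWORDS = ("conf", "workshop", "symposium", "cfp")


def url_features(url: str) -> dict:
    """Map a candidate URL to a sparse feature dictionary (illustrative only)."""
    parsed = urlparse(url.lower())
    host = parsed.hostname or ""
    path = parsed.path
    features = {}

    # One indicator feature per alphanumeric token in the host and path.
    for token in re.split(r"[^a-z0-9]+", host + " " + path):
        if token:
            features["token=" + token] = 1.0

    # Indicator features for event-related substrings.
    for keyword in EVENT_KEYWORDS:
        if keyword in host or keyword in path:
            features["has_kw=" + keyword] = 1.0

    # Event sites often embed a four-digit year in the host or path.
    if re.search(r"(19|20)\d{2}", host + path):
        features["has_year"] = 1.0

    # Number of slashes in the path, as a rough depth measure.
    features["path_depth"] = float(path.count("/"))

    # Last label of the host name, e.g. "org" or "edu".
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    features["tld=" + tld] = 1.0
    return features
```

A dictionary of this shape can be handed to any maximum entropy (multinomial logistic regression) trainer that accepts sparse named features.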

In the second stage, Nutch is called to crawl the verified conference and workshop event websites during and after the date of the event, identifying and extracting metadata from each HTML page using a conditional random field (CRF) model.
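
CRF++ reads its input as one token per line with whitespace-separated feature columns, the label in the last column, and a blank line between sequences. The following Python sketch shows how tokens from a page might be written in that layout. The per-token features and the label names used in the test data are assumptions for illustration, not the features produced by the Kairos feature generator.

```python
def token_features(token: str) -> list:
    """Hypothetical per-token features: the word plus three simple flags."""
    return [
        token,
        "CAP" if token[:1].isupper() else "NOCAP",
        "DIGIT" if any(c.isdigit() for c in token) else "NODIGIT",
        "PUNCT" if not any(c.isalnum() for c in token) else "NOPUNCT",
    ]


def to_crf_rows(tokens, labels=None) -> str:
    """Render one token sequence in CRF++ column format.

    Tokens must not contain whitespace. If labels are given, each label
    is appended as the last column of its token's row (training data).
    """
    rows = []
    for i, token in enumerate(tokens):
        columns = token_features(token)
        if labels is not None:
            columns.append(labels[i])
        rows.append(" ".join(columns))
    # A blank line terminates the sequence.
    return "\n".join(rows) + "\n\n"
```

Calling to_crf_rows without labels yields rows suitable for tagging with an already trained model; with labels it yields training rows.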

In the current implementation, Kairos extracts conference websites and conference dates from WikiCFP to implement its time-sensitive crawl. While Kairos is still under development, we have chosen to make the project open-source and publicly available, as we feel other DLs and communities may benefit from such a component.
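
The time-sensitive part of the crawl can be pictured as a simple date-window check: a site is visited during the event and for some period afterwards. The sketch below assumes a follow-up window of 30 days and a function named should_crawl; both are hypothetical and are not taken from the Kairos code.

```python
from datetime import date, timedelta

# Hypothetical parameter: how many days after the event to keep revisiting.
POST_EVENT_DAYS = 30


def should_crawl(today: date, event_start: date, event_end: date,
                 post_event_days: int = POST_EVENT_DAYS) -> bool:
    """Return True if `today` falls during the event or in the follow-up window."""
    window_end = event_end + timedelta(days=post_event_days)
    return event_start <= today <= window_end
```

A scheduler could run this check daily over the list of events harvested from WikiCFP and pass only the sites that qualify to the second crawl stage.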



You can download the open-source code below. The package contains training data, a feature generator, shell scripts and documentation. To build it, you must recompile the CRF++ source code.

Kairos is available on GitHub now!

GitHub Project Page

# git clone git://github.com/WING-NUS/Kairos.git

HTML Annotation Tool

This is an HTML annotation tool for Windows and/or Unix systems.

It works by processing an input HTML file or a URL. The output is the original file with extra JavaScript added and its <A HREF> tags altered so that the text can be annotated. A user can then annotate this file in a JavaScript-enabled browser by highlighting spans and selecting an appropriate annotation from the annotation pane. The user can also annotate images with the same tags by clicking on them directly.

The annotation tool helps you label scholarly paper metadata from scientific conference Web pages along four axes: title, author, author with associated affiliation, and affiliation.

Detailed documentation of the annotation tool can be found in the thesis (Appendix A).

The tool assumes that Perl is installed on your system and can be invoked as perl (it must be in your path). Annotation works only in the Mozilla Firefox Web browser.


annotation-tool-1.0.tbz [5.0M / MD5 = 79cac1751fe938b503fd2e2c2ad54da9]


Gold Standard Input

Chunk-tagged scholarly paper metadata dataset (suitable for system training). The metadata consists of author names, affiliations, author names with associated affiliations, and titles from scholarly papers.

golden-standard-input-1.0.tbz [372K / MD5 = 3a4606741747d5f32713fe5a10fb1b6e]
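
After downloading either archive, you may want to check it against the MD5 value listed on this page. A minimal Python sketch follows; the function names are chosen for this example, while the file names and checksums are the ones given above.

```python
import hashlib

# File names and MD5 checksums as listed on this page.
EXPECTED_MD5 = {
    "annotation-tool-1.0.tbz": "79cac1751fe938b503fd2e2c2ad54da9",
    "golden-standard-input-1.0.tbz": "3a4606741747d5f32713fe5a10fb1b6e",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, verify_download("annotation-tool-1.0.tbz", EXPECTED_MD5["annotation-tool-1.0.tbz"]) should return True for an intact download placed in the current directory.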


Markus Hänse (2009) Harvesting Research Paper Metadata From Scientific Conference Web Sites. Undergraduate Thesis, Hochschule Furtwangen University. [.pdf] [.slide] [.web]

Markus Hänse, Min-Yen Kan and Achim P. Karduck (2010) Proactive Harvesting of Research Paper Metadata from Scientific Conference Web Sites. Proceedings of the International Conference on Asia-Pacific Digital Libraries (ICADL '10), Brisbane, Australia, June. pp. 226-235. [.pdf] [.slide] [.web]

Group Members


If you find problems, email the project leader at <kanmy@comp.nus.edu.sg>. Please use the subject “[Kairos]” to ensure that it reaches our attention. If you have annotated data that you are willing to share with us, we can use it to further improve the system's capabilities.

Related Links

ParsCit - citation parsing using maximum entropy and global repairs.

ForeCite - Web 2.0-based citation manager a la Citeulike, Citeseer, Rexa.

ForeCiteNote - a personal notebook that aims to support beginning researchers in taking notes and synthesizing their ideas during the literature search for work related to their projects.

Nutch - open-source web-search software, built on Lucene Java.

Lucene - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

CRF++ - a simple, customizable, and open-source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.

Maxent - opennlp.maxent package is a mature Java package for training and using maximum entropy models.