
ParsCit: An open-source CRF Reference String Parsing Package

This is the home page of the ParsCit project, which performs reference string parsing. It is designed as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service (coming soon!). The code contains the training data, the feature generator, and shell scripts to connect the system to a web service (used here too).

Some definitions (thanks to Robert Dale):

Reference String:
A text string in the bibliography or reference section of a work, usually at the end of the document, that refers to a unique document. It usually occurs alongside other reference strings that point to other documents, and may also appear as a footnote.
Citation:
A text string (usually explicit) in the document body that points to a corresponding reference string at the end of the document. Several citations may co-refer to a single reference string.

This project deals with the problem of parsing reference strings. Other projects related to ParsCit (some here in WING, some elsewhere) deal with parsing the headers of the document (i.e., information on the title page) and with identifying citations and linking them to reference strings.


Download

You can download the open-source code for ParsCit here (coming soon). The source requires you to re-compile the CRFPP source code. It also assumes that Perl is installed on your system and can be invoked as perl (it must be in your path).

Web Service

Coming soon!

More NLP services are now being made available on the web. Following this trend, you can send your plain text citations to us via our web service. We will parse them for you free of charge, as and when time and processing power allow (these jobs run at lower priority).

N.B.: We keep logs of what's parsed in these demos, to improve the accuracy and productivity of ParsCit. If you'd like these to be kept private, or you find you use this service a lot, why not install a local copy of ParsCit for yourself? If you do, please let us know where you are so that we can acknowledge you here and redirect some traffic your way.

Web-based Demonstration

N.B.: We keep logs of what's parsed in these demos, to improve the accuracy and productivity of ParsCit. If you'd like these to be kept private, why not install a local copy of ParsCit for yourself?

You can also run ParsCit directly in your browser. The form below submits your text input (after suitable cleaning) to the ParsCit service to parse the input file or strings.

Demo #1: Parsing the citation contexts and the reference strings from a whole text file

N.B.: This demo does not handle PDF input at this time. You can use another web service or software to convert PDFs to text.

Method 1) Enter a URL to a file on the web (e.g., http://wing.comp.nus.edu.sg/~forecite/samples/E06-1050.txt).

Method 2) Upload a .txt file (ASCII; UTF-8)

Method 3) Paste the whole file here:


Demo #2: Parsing individual strings only

Method 1) Enter a URL to a file on the web in the correct format (each line should be a separate citation; e.g., http://wing.comp.nus.edu.sg/~forecite/samples/E06-1050-cites.txt).

Method 2) Upload a file (again, each line should be a separate citation)

Method 3) Enter a list of plain text citations (again, one per line):
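
If your reference list was copied from a document, entries often wrap across several lines, which does not match the one-citation-per-line format that Demo #2 expects. The following is a minimal Python sketch for preparing such input. It assumes that entries in the pasted text are separated by blank lines; the function name one_citation_per_line is hypothetical and is not part of ParsCit.

```python
import re


def one_citation_per_line(raw: str) -> str:
    """Join wrapped lines so each blank-line-separated entry becomes one line.

    Assumption (not from ParsCit): entries in `raw` are separated by at
    least one blank line, and a single entry may wrap over several lines.
    """
    # Split the text into entries at runs of blank (or whitespace-only) lines.
    entries = re.split(r"\n\s*\n", raw.strip())
    # Collapse every run of whitespace, including newlines, to a single space.
    lines = [" ".join(entry.split()) for entry in entries]
    # Drop any entries that ended up empty and emit one entry per line.
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    pasted = "A. Author (1999):\n  A wrapped title.\n\nB. Author. 2000. Another title."
    print(one_citation_per_line(pasted))
```

The result can be saved as a .txt file for Method 2 or pasted directly for Method 3.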


Publications

International Refereed Conference Publications:

Others:

Gold Standard Input and Sample Output

Group Members

Troubleshooting

ParsCit's output occasionally contains errors. If you find a problem, email the lead developer at <kanmy@comp.nus.edu.sg>, and please use the subject "[ParsCit]" to ensure that it reaches our attention. If you have hand-corrected tagged data that you don't mind providing us, we can use it to further improve ParsCit's extraction capabilities. Below are some common problems people have encountered.

Issue numbers don't get extracted.
We're looking into this. The training data does not distinguish between volume and issue number. We'd like to fix that in a subsequent release.
Separation of author names and publishing year fails.
Some reference data uses non-standard sequences of first names and family names, e.g.
  Baltes, Paul, Ursula Staudinger, Ulmann Lindenberger (1999): Lifespan
  psychology: theory and application of intellectual functioning; in:
  Annual Review of Psychology, 50, 471-507
ParsCit's post-processing step may not detect and deal with these problems reliably. We're working to fix this too.
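
Until this is fixed, one possible workaround is to check references of this shape yourself before or after parsing. The following Python sketch is not ParsCit's post-processing step. It is a hypothetical helper (split_at_year and YEAR_IN_PARENS are names introduced here) that assumes the year appears in parentheses, as in the example above, and splits the string at that point.

```python
import re

# Assumption: the publishing year is a four-digit number in parentheses,
# optionally followed by a letter (e.g. "1999a") and a colon.
YEAR_IN_PARENS = re.compile(r"\((1[89]\d{2}|20\d{2})[a-z]?\)\s*:?")


def split_at_year(reference: str):
    """Return (authors, year, rest), or None if no parenthesized year is found."""
    match = YEAR_IN_PARENS.search(reference)
    if match is None:
        return None
    authors = reference[:match.start()].strip()  # text before "(1999):"
    rest = reference[match.end():].strip()       # title, venue, pages, ...
    return authors, match.group(1), rest


if __name__ == "__main__":
    example = (
        "Baltes, Paul, Ursula Staudinger, Ulmann Lindenberger (1999): Lifespan "
        "psychology: theory and application of intellectual functioning; in: "
        "Annual Review of Psychology, 50, 471-507"
    )
    print(split_at_year(example))
```

This only separates the author block from the rest of the string. It does not decide which tokens in the author block are first names and which are family names.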

Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Fri Dec 24 01:48:05 SGT 2004 | Version: 1.0 | Last modified: Wed Mar 7 09:14:24 2007