ParsCit: An open-source CRF Reference String and Logical Document Structure Parsing Package


This is the home page of the ParsCit project, which performs two tasks: 1) reference string parsing, sometimes also called citation parsing or citation extraction, and 2) logical structure parsing of scientific documents. It is built as a supervised machine learning system that uses Conditional Random Fields (CRFs) as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service. The code includes the training data, the feature generator, and shell scripts that connect the system to a web service (the one used on this web site).
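
To make the idea of CRF-based reference string parsing more concrete, here is a minimal sketch of the kind of per-token features a feature generator might emit for a sequence labeler. The function names, the feature names, and the sample reference string are all hypothetical illustrations; they are not ParsCit's actual feature set or API.

```python
import re
import string


def token_features(token: str, position: int, total: int) -> dict:
    """Return an illustrative feature dictionary for one token of a reference string."""
    # Strip surrounding punctuation so "2003." and "2003" share the same surface features.
    stripped = token.strip(string.punctuation)
    return {
        "lower": stripped.lower(),
        "is_capitalized": stripped[:1].isupper(),
        "is_all_digits": stripped.isdigit(),
        "looks_like_year": bool(re.fullmatch(r"(19|20)\d{2}", stripped)),
        "ends_with_period": token.endswith("."),
        "ends_with_comma": token.endswith(","),
        # 0.0 for the first token, 1.0 for the last one.
        "relative_position": round(position / max(total - 1, 1), 2),
    }


def reference_features(reference: str) -> list:
    """Split a reference string on whitespace and featurize every token."""
    tokens = reference.split()
    return [token_features(tok, i, len(tokens)) for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    # A made-up reference string, used only for illustration.
    for feats in reference_features("Doe, J. 2003. An example title. Example Press."):
        print(feats)
```

A trained CRF would consume one such feature dictionary per token and assign each token a field label (author, title, date, and so on); the training and decoding steps are omitted here.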

Some definitions (thanks to Robert Dale for Citations and Reference Strings):

Reference String:

A text string in the bibliography or reference section of a work, usually at the end of the document, that refers to a unique document. It usually occurs alongside other reference strings that point to other documents, and may also appear in footnotes.


Citation:

A text string (usually explicit) in the document body that points to a corresponding reference string at the end of the document. Several citations may co-refer to a single reference string.
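
The many-to-one relationship between in-text pointers and reference strings can be sketched as follows. This example assumes a simple numeric bracket style such as `[1]`; the function name `link_citations`, the sample body text, and the sample references are hypothetical, and real documents use many other citation styles that this sketch does not handle.

```python
import re
from collections import defaultdict


def link_citations(body: str, references: dict) -> dict:
    """Map each reference number to the character offsets of the citations pointing to it."""
    links = defaultdict(list)
    for match in re.finditer(r"\[(\d+)\]", body):
        number = int(match.group(1))
        # Ignore markers that have no corresponding reference string.
        if number in references:
            links[number].append(match.start())
    return dict(links)


if __name__ == "__main__":
    # Made-up body text and reference strings, used only for illustration.
    body = "CRFs work well [1]. Others agree [2], extending [1]."
    references = {
        1: "Doe, J. 2003. An example title. Example Press.",
        2: "Roe, R. 2005. Another example title. Example Press.",
    }
    print(link_citations(body, references))
```

In this sketch, reference 1 is the target of two separate citations, which is the co-reference case described above.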

Document Logical Structure:

A hierarchy of logical components, for example, titles, authors, affiliations, abstracts, sections, etc., according to (Mao, Rosenfeld & Kanungo, 2003). Our logical structure is more comprehensive: it comprises not only header metadata and references, but also the internal structure of the document, namely sections, subsections, figures, tables, equations, footnotes and captions.

This project deals with two problems: parsing reference strings and parsing the logical structure of a document. The first task is handled by a module that shares the project's name, ParsCit, and the second by a separate module, SectLabel.

Extracting and Matching Authors and Affiliations in Scholarly Documents

We introduce Enlil, an information extraction system that discovers the institutional affiliations of authors in scholarly papers. Enlil consists of two steps: the first identifies authors and affiliations using a conditional random field, and the second uses a support vector machine to connect authors to their affiliations. Enlil relies on SectLabel for its input: it collects all author and affiliation lines reported by SectLabel for processing.


Gold Standard Input and Sample Output

  • Chunk tagged data for Cora, CiteSeerX, FLUX-CiM and humanities (Italian, English, and mixed language) datasets (suitable for ParsCit training). For FLUX-CiM data, please try the original hosting site maintained by Eli Cortez. Credits to Matteo Romanello for contributing the humanities datasets.
  • Chunk tagged data for some ICONIP papers. Contributed by Cheong Chi Hung.
  • Results of running the v080917 version of ParsCit on FLUX-CiM’s dataset for [ 300 computer science references ] [ 2000 medical references ] [ on the CORA dataset ]. Note that these results are optimistic (effectively "cheating"), because the current version has been trained on this data.
  • Tagged section data for the SectLabel module. [ XML Format ] [ Plain Text Format ] [ GenericSect training data ]
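
As an illustration of how chunk-tagged reference data could be turned into token-level training labels, here is a minimal sketch. It assumes an inline-tag layout such as `<author> ... </author> <title> ... </title>`; that layout, the function names, and the sample line are assumptions made for this example, so check the downloaded files for the format they actually use.

```python
import re

# Matches one tagged chunk, e.g. "<title> An example title. </title>".
TAGGED_CHUNK = re.compile(
    r"<(?P<label>[A-Za-z]+)>\s*(?P<text>.*?)\s*</(?P=label)>", re.DOTALL
)


def parse_tagged_reference(line: str) -> list:
    """Return (label, text) pairs for every tagged chunk in one reference line."""
    return [(m.group("label"), m.group("text")) for m in TAGGED_CHUNK.finditer(line)]


def to_token_labels(chunks: list) -> list:
    """Expand chunk-level labels into one (token, label) pair per whitespace token."""
    return [(token, label) for label, text in chunks for token in text.split()]


if __name__ == "__main__":
    # A made-up tagged line, used only for illustration.
    line = "<author> J. Doe. </author> <title> An example title. </title> <date> 2003. </date>"
    for token, label in to_token_labels(parse_tagged_reference(line)):
        print(f"{token}\t{label}")
```

Token and label pairs of this shape are what a sequence labeler is typically trained on, with one feature vector attached to each token.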

