ParsCit: An open-source CRF Reference String and Logical Document Structure Parsing Package
This is the home page of the ParsCit project, which performs two
tasks: 1) reference string parsing, sometimes also called citation
parsing or citation extraction, and 2) logical structure parsing of
scientific documents. It is architected as a supervised machine
learning procedure that uses Conditional Random Fields as its learning
mechanism. You can download the code below, parse strings online, or
send batch jobs to our web service. The code contains the
training data, feature generator and shell scripts to connect the
system to a web service (used on this web site).
Some definitions (thanks to Robert Dale for Citations and Reference
- Reference String:
- A text string in the bibliography or
reference section of a work, usually at the end of the document that
refers to a unique document. Usually occurs with other reference
strings that point to other documents. May also appear as
- Citation:
- A text string (usually explicit) in the
document body that points to a corresponding reference string at the
end of the document. Several citations may co-refer to a single
reference string.
- Document Logical Structure:
- A hierarchy of logical
components, for example, titles, authors, affiliations, abstracts,
sections, etc., according to (Mao, Rosenfeld &
Kanungo, 2003). Our logical structure is more comprehensive,
comprising not only header metadata and references, but also the
logical structure of the internals of the document -- sections,
subsections, figures, tables, equations, footnotes and
captions.
This project deals with the problem of parsing the reference
strings and parsing the logical structure of a document. The first
task is handled by a module with the project namesake, ParsCit, and
the second task by a separate module SectLabel.
Extracting and Matching Authors and Affiliations in Scholarly Documents
We introduce Enlil, an information extraction system that discovers
the institutional affiliations of authors in scholarly papers. Enlil
consists of two steps: a first step that identifies authors and affiliations
using a conditional random field, and a second step in which a support vector
machine connects authors to their affiliations. Enlil relies on SectLabel
for its input. It collects all reported author and affiliation lines from SectLabel
This software is licensed under the GNU Lesser General Public
License (LGPL), which means you are free to use it for any
purpose, including embedding in commercial products.
You can download the open-source code for ParsCit here. The source requires you to re-compile the CRFPP source code
and assumes that perl is installed on your system and can be invoked
as perl (it must be in your path).
- Current version 130908
The (partially ported) Windows version is here (provided by Yumichika). See the CHANGES FOR WINDOWS.txt
We have also pushed a copy of the ParsCit current distribution into GitHub:knmnyn/parscit.
The Windows version has also been pushed to GitHub:wing-nus/parscit.
While we'll strive to keep the GitHub version as updated as possible, the versions on this page will remain the most authoritative for major updates.
- Other versions:
110505b: Added XML::Twig for XML processing. ParsCit now uses input provided by SectLabel. See CHANGELOG.txt .
101101: Incorporated BiblioScript and BibUtils software. See CHANGELOG.txt;
100401d: Added SectLabel (logical structure parsing) software from the NUS team, and Iconip training data from Cheong Chi Hong for ParsCit with new ParsCit model retrained. See CHANGELOG.txt;
090625b: Added documentation for complete re-installation. Improved ParsHed with an added line-level CRF model and post-processing module from the NUS team; added a WSDL file and client for the service at NUS; minor bug fixes for ParsCit. See CHANGELOG.txt;
090316: Incorporation of ParsHed (header parsing) software from the NUS team. See CHANGELOG.txt;
081201: Bug fixes and incorporation of byte position offset from the Scienstein.org team. See CHANGELOG.txt;
080917: Minor changes (improved models and multilingual support), see CHANGELOG.txt;
080402: First public release. Comes with precompiled Linux binaries for CRF++;
080310: Beta release.
- CRF++: A conditional random fields toolkit that you may need to install, if the compiled one does not work for you. We recommend that you use version 0.51.
More NLP services are now being made available on the web.
Following this trend, you can send your plain text citations to us via
our web service. We will parse these for you free of charge (as and
when time and processing power allow, these processes are done with
N.B. We keep logs of what's parsed in these demos, to
improve the accuracy and productivity of ParsCit. If you'd like these
to be kept private or you find you use this service a lot, why not
install a local copy of ParsCit for yourself? If you do, please
let us know where you are so we can acknowledge you here and redirect
some traffic your way.
You can also run ParsCit directly in your browser. The form below
submits your text input (after suitable cleaning) to the ParsCit
service to parse the input file or strings.
Note that if the system load gets high, your demo call may not be executed. If you want to run this program in batch, please download your own copy.
Demo #1: Parsing the header, logical structure and/or reference strings (and citation contexts) from a text file
Demo #2: As above but using XML input (XML must conform to Omnipage output). This demo is slow so please be patient.
Demo #3: Parsing individual reference strings only (just
- Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan (2010)
Logical Structure Recovery in Scholarly Articles with Rich
Document Features. International
Journal of Digital Library Systems (IJDLS), 1(4), 1-23.
[ pre-print .pdf ]
International Refereed Conference Publications:
- Nguyen Viet Cuong, Muthu Kumar Chandrasekaran, Min-Yen Kan and Wee Sun Lee (2015).
Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs.
To appear in proceedings of Joint Conference on Digital Libraries 2015 (JCDL 2015), Knoxville, USA.
[ pre-print .pdf ]
- Huy Hoang Nhat Do, Muthu Kumar Chandrasekaran, Philip S. Cho,
and Min-Yen Kan (2013). Extracting and Matching Authors and Affiliations
in Scholarly Documents. In Proceedings of the Thirteenth Annual International
ACM/IEEE Joint Conference on Digital Libraries (JCDL'13), Indianapolis, USA. ACM.
[ pre-print .pdf ]
- Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008)
ParsCit: An open-source CRF reference string parsing
package. In Proceedings of the Language Resources and
Evaluation Conference (LREC 08), Marrakesh, Morocco, May, 2008.
[ .pdf ]
[ Poster (.png) ]
- Yong Kiat Ng. (2004) Citation Parsing Using Maximum Entropy
and Repairs. Undergraduate thesis. National University of Singapore.
[ .pdf ]
Gold Standard Input and Sample Output
- What platforms does ParsCit work on?
- ParsCit works on all major platforms: Windows, Linux and MacOS.
The installation requires ruby and perl and the CRF++ embedded
package also requires standard UNIX utilities like sed. You
should have a working knowledge of UNIX and some experience in
installing UNIX tools. Due to our time constraints, we may not be
able to answer your particular problems with installation. Do let us
know if there was something important that you had to do to get
your particular download and installation working; we'll
incorporate it into the Troubleshooting section below.
- What is the difference between SectLabel and the previous ParsHed?
- SectLabel is a newly-developed module that further extends
ParsHed in functionality. It not only classifies header metadata,
but analyzes full documents to output the logical structure of
the internals of the document -- sections, subsections, figures,
tables, equations, footnotes and captions.
issues, the ParsHed module is still retained in our source code
and command line options.
- How do I retrain ParsCit for a different language? I saw code in
lib/ParsCit/PreProcess' to find the beginning of the bibliography
section, and changed that but it doesn't work.
- The current version no longer depends on those regular
expressions; they are for previous versions (e.g., v101101). ParsCit
now first labels each line using the SectLabel module and
then decides which lines to parse for references based on that first
step's output. For a different language you need to retrain SectLabel, by
providing labeled data stating the class of each line in your
training data. It's also possible to "downgrade" the current
version to go back to using the rule-based method for identifying
the reference section.
- What is the "genericHeader" in the output of SectLabel? What is
the difference between "genericSect.tagged" and "SectLabel.tagged"?
- Generic headers, such as introduction, methodology, and
evaluation, represent generic purposes of different sections in a
scholarly article. We map all section names to generic ones
(e.g., "5. Text Features" to "Methodology"). This promotes
comparative viewing of sections with identical purpose across
articles. As for the second question, Generic section is
a component of SectLabel. It is responsible for classifying the
section headers of a paper into generic categories such as
Introduction, Methodology, Result, etc. For details, refer to our
IJDLS journal paper.
- Why is there an option to input file in XML format? Which DTD
should it follow?
- SectLabel is a robust logical document structure inference
system that can exploit rich input (such as the font or spatial
features produced by OCR software) to boost recognition
performance, but can still perform inference on
impoverished input (plain text) with degraded
performance. Currently, the XML input must be in the XML format
output by Nuance OmniPage (version 16), and hence
should follow OmniPage's DTD. Note: The ParsCit team is not
affiliated with Nuance in any way nor does it endorse
- I need to run ParsCit but I can't get well-formed text from my
PDF documents. Can you help?
- No, we cannot help you with this. We don't perform OCR or text
extraction from PDF documents. You will have to find your own
source for doing the extraction or conversion. We've found
Omnipage useful in our own project work (hence the possibility of
XML input), but we don't endorse any product.
- The OmniPage XML doesn't seem to be well-formed. Is that OK?
- Yes. The sample "XML" files provided in the links (for Demo 2) are
actual outputs for a sequence of XML pages (one XML file per
page). If you use OmniPage to save an XML file for input to
ParsCit, make sure to save individual pages as separate files,
then concatenate them to send to ParsCit. You may want to
download the sample links for inspection (as they are
concatenations of several XML files, your browser will likely
complain about them not being well-formed).
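The save-then-concatenate step can be sketched in shell as follows. This is a minimal illustration, not something shipped with ParsCit: the function name concat_pages and the zero-padded page file names (page_001.xml, page_002.xml, ...) are assumptions, so adapt them to however you saved the OmniPage pages.

```shell
# Hypothetical helper (not part of ParsCit): concatenate per-page
# OmniPage XML files into a single input file.
# Assumes the page files are named so that glob (lexical) order equals
# page order, e.g. zero-padded: page_001.xml, page_002.xml, ...
concat_pages() {
    pages_dir=$1   # directory holding one XML file per page
    out_file=$2    # concatenated file to send to ParsCit
    # Write the output outside pages_dir so a rerun does not pick it up.
    cat "$pages_dir"/*.xml > "$out_file"
}

# Usage (paths are placeholders):
#   concat_pages ./pages combined.xml
```

Without zero-padding, page_10.xml would sort before page_2.xml, so the pages would be concatenated out of order.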
- I ran Demos 1 and 2 with the default "all" settings, but sections
don't seem to be detected.
- There's no problem. The demo just hides the SectLabel output
by default. Click "Show SectLabel output" to reveal it.
- I ran ParsCit using the OmniPage XML output, but encountered malformed UTF8 character errors.
- OmniPage normally outputs XML results in UTF-16 format. Converting them into UTF-8 will solve the problem, as shown below:
iconv --from-code UTF-16 --to-code UTF-8 omnipageOutput.xml > newOmnipageOutput.xml
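If you process many files and are unsure which ones are UTF-16, a guarded conversion along these lines may help. It is only a sketch under stated assumptions: the function name to_utf8_if_utf16 is hypothetical, and it relies on the file starting with a UTF-16 byte-order mark, which may not hold for every OmniPage export. It uses -f and -t, the short forms of the iconv options shown above.

```shell
# Hypothetical helper (not part of ParsCit): convert a file to UTF-8 only
# if it starts with a UTF-16 byte-order mark; otherwise copy it unchanged.
to_utf8_if_utf16() {
    in_file=$1
    out_file=$2
    # First two bytes as hex: "fffe" (little-endian BOM) or "feff" (big-endian).
    bom=$(od -An -tx1 -N2 "$in_file" | tr -d ' \n')
    case $bom in
        fffe|feff) iconv -f UTF-16 -t UTF-8 "$in_file" > "$out_file" ;;
        *)         cp "$in_file" "$out_file" ;;
    esac
}

# Usage (file names are placeholders):
#   to_utf8_if_utf16 omnipageOutput.xml newOmnipageOutput.xml
```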
- I read your paper 'Extracting and matching authors and affiliations in scholarly documents'
and tried the latest version of ParsCit. You mentioned in the paper that this process has two parts:
extraction and matching. I looked at the code of citeExtract.pl. I wonder if the author-affiliation matching
only works if the input file is XML? Is there a way to make this work for .txt files as well?
- Other modules of ParsCit run on text input too, but the author-affiliation matching that Enlil performs
needs XML as input (Demo 2). You could still get the author and affiliation lines extracted from
-- see the colour-coded SectLabel output for this on the Demo 1 output page.
XML input & Omnipage
In general, to obtain the optimum performance we report in our paper, you need to provide an XML file output
by an OCR package. We use OmniPage. You could use your favourite OCR package, as long as it provides as much of
the information that OmniPage provides as possible.
For an example OmniPage output file, see http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.xml on Demo 2.
Please also see FAQ "Why is there an option to input file in XML format? Which DTD should it follow?"
Other OCR similar to Omnipage
If you are using a different OCR package then you need to write a crosswalk script to translate
your package output into a file format that ParsCit consumes as input.
A list of common problems with ParsCit. If you find problems,
email the lead developer at <firstname.lastname@example.org>. Please use
the subject "[ParsCit]" to ensure that it reaches our attention. If
you have hand-corrected tagged data that you don't mind providing us,
we can use that to further improve ParsCit's extracting capabilities.
Nevertheless, there are problems with the output occasionally. Below
are some common problems people have encountered.
- ParsCit v110505 seems to be a lot slower when used on Omnipage
output than the previous versions, why?
- You are correct. We are now using XML::Twig to do the XML
processing correctly, rather than do it ad-hoc ourselves, but this
requires constructing an exhaustive DOM tree for the Omnipage input.
This is the timesink that you are experiencing.
- I'm running ParsCit on Windows but I can't get it to work, even
after installing a perl interpreter. Specifically, the
citeExtract.pl program dies complaining that it Can't open
"/tmp/...." at line 175.
- ParsCit hasn't been fully tested on Windows at NUS, so we can't
vouch for whether it will run correctly. In this specific error
case, the "/tmp/" directory (a standard place for temporary files in
UNIX systems) is normally not available in Windows, and may generate
problems. You may need to change the code and/or create an
appropriate directory for ParsCit to generate such files.
- I tried downloading and running ParsCit but I get complaints
about /bin/sed and crf not being found. Help?
- Please read the INSTALL.txt directions. You need to recompile
CRF++ for your platform. The paths included with the install are
for our version; you need to recompile to have the paths point
- When running citeExtract.pl I get some errors complaining about
the wrong ELF class of the binaries. How can I fix this?
- This seems to be a problem with the compiled executables of
CRF++ bundled with the software. Follow the INSTALL instructions
but after step 1 do:
$ cp -Rf * ../../.libs
$ cp crf_learn ../../.libs/lt-crf_learn
$ cp crf_test ../../.libs/lt-crf_test
- I'm trying to install ParsCit v110505 using the instructions in the install file, and when I get to the point where you're supposed to recompile CRF, it exits with an error:
In file included from node.h:13:0,
path.h:26:52: error: 'size_t' has not been declared
make: *** [node.lo] Error 1
make: Leaving directory `/home/agarnett/parscit/crfpp/CRF++-0.51'
make: *** [all] Error 2
The install file mentions that this may fail the first time; unfortunately for me, it keeps failing. Any help?
- The error is from the CRF++ package (not from ParsCit). There are two ways to fix it:
1. Add the line #include<iostream> to node.cpp and compile CRF++ again; or
2. Go to http://crfpp.googlecode.com/svn/trunk/doc/index.html and download the latest version. The instructions are the same. Hope this helps.
- Issue numbers don't get extracted.
- This issue should be fixed as of the v110505
release. There is now some heuristic postprocessing code to
take care of breaking single or multiple tokens for issues and
- Separation of author names and publishing year fails
- In some reference data with non-standard sequences of
first names and family names, e.g.
Baltes, Paul, Ursula Staudinger, Ulmann Lindenberger (1999): Lifespan
psychology: theory and application of intellectual functioning; in:
Annual Review of Psychology, 50, 471-507
ParsCit's post processing step may not detect and deal with these
problems reliably. We're working to fix these too.
- I passed ParsCit plain text output but in another, non-English
language. I didn't get good results or I got empty results. Can
- Aside from English, ParsCit can handle Italian and German to a
limited extent, thanks to the multilingual training data.
However, the demo web interface uploads non-ASCII data (e.g., UTF-8 or
UTF-16) as binary data and fails to execute ParsCit.
If you download a copy of ParsCit, though, the libraries do work
on such data. Here's a sample. We'd love to help make
a more universal model that can accommodate reference strings in
other languages. If you're willing to help contribute ground
truth data, we'd love to hear from you!
- How about retraining ParsCit for another language/domain?
- You can put your supervised exemplar data into the same format
as tagged_references.txt found in crfpp/traindata/. Once you have
this file you can generate the appropriate model for ParsCit, by
using three commands (assuming you are in the crfpp/traindata directory):
$ ../../bin/tr2crfpp.pl tagged_references.txt > parsCit.train.data
$ ../crf_learn parsCit.template parsCit.train.data model
$ mv model ../../resources/parsCit.model
The first command creates the input feature file that crfpp uses
from the training data. The second creates the model using the
crf_learn command. You can then move the model file to the
resources/ subdirectory where it can be utilized. To replace the
default model that comes with ParsCit, just execute the final command.
- Can I retrain the package for a different set of tags if I
change the tagset in the training data?
- Yes, you should be able to change the tagset to suit your
dataset. You can add, remove and change tags as you
wish. You need to retrain the parser system after creating your
tagged data. For more details on the training process, see the
documentation for CRF++, which is on the web at SourceForge.
- When retraining I get a "bad_alloc" error. What gives?
- We're not entirely sure about this. CRF training is quite memory
intensive, and training on a large number of training data tuples may
cause the embedded CRF++ package to fail. You can try with less
training data, or try training on a machine with a larger amount
of memory.
- Does the web service actually work? I can't seem to run it.
- Occasionally our school's networking staff changes the firewall
settings, so the port for our group's web services may be blocked
(port 4000 on host wing.comp.nus.edu.sg). If you find you can't
reach our services (they time out), please let us know.
- I get funny errors with crf_test not being useful. How do I
- The updated README.txt file in the 090625b
distribution fixes this. Basically you need to recompile CRF++
0.51 and place the libraries and the executables in the proper
place. See the README for details.
- I am trying to install CRF++(CRFPP).
When I executed the following in the terminal,
$ /opt/parscit2/bin/citeExtract.pl -m extract_all
the following error was thrown.
$ /opt/parscit2/crfpp/.libs/lt-crf_test: error while loading shared libraries:
libcrfpp.so.0: cannot open shared object file: No such file or directory
- The file libcrfpp.so.0 is not in your library path. Normally, a successful
reinstallation of CRF++ (following the instructions above) will also solve this
problem. If it still persists, you can try setting the LD_LIBRARY_PATH variable to the location of the libcrfpp.so.0 file:
$ export LD_LIBRARY_PATH=<PATH-TO-LIBCRFPP.SO.0-FILE>
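If you are not sure where libcrfpp.so.0 ended up after compiling, something like the following could locate it and set the variable for you. This is a sketch rather than part of ParsCit: the function name add_crfpp_to_library_path is hypothetical, and the search root in the usage line is taken from the example path in the error message above.

```shell
# Hypothetical helper (not part of ParsCit): search a directory tree for
# libcrfpp.so.0 and prepend the directory containing it to LD_LIBRARY_PATH.
add_crfpp_to_library_path() {
    search_root=$1
    # Take the first match found under the search root.
    lib_file=$(find "$search_root" -name 'libcrfpp.so.0' 2>/dev/null | head -n 1)
    if [ -z "$lib_file" ]; then
        echo "libcrfpp.so.0 not found under $search_root" >&2
        return 1
    fi
    lib_dir=$(dirname "$lib_file")
    # Prepend the directory, keeping any entries that were already set.
    LD_LIBRARY_PATH=$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    export LD_LIBRARY_PATH
}

# Usage (search root is an example):
#   add_crfpp_to_library_path /opt/parscit2/crfpp
```

The setting only lasts for the current shell session; add it to your shell profile if you want it to persist.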
- I get the following error message:
Can't locate XML/Twig.pm in @INC (@INC contains: ...)
BEGIN failed--compilation aborted at ...
Compilation failed in require at ...
This is because the Perl interpreter on your machine/server can't locate all the Perl modules required (XML::Twig in this case)
for ParsCit to function properly.
- You will have to install these modules from CPAN before running ParsCit. You can find a list of all the
required modules for ParsCit on this page.
- In case you do have the required modules on the server, it is possible that the interpreter was not able to locate
these. In this case, you would have to update the list of paths that the interpreter searches to find the required modules.
This can be done by updating the environment variable PERL5LIB (for Perl5) or PERLLIB to include the path of the modules.
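A quick way to check whether the interpreter can see a module, and to extend PERL5LIB if it cannot, might look like this. Both function names are hypothetical and the module directory in the usage line is a placeholder; substitute the directory where your modules are actually installed.

```shell
# Hypothetical helper: succeed if the perl on your PATH can load the
# named module (e.g. XML::Twig), fail otherwise.
perl_has_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Hypothetical helper: prepend a module directory to PERL5LIB,
# keeping any entries that were already set.
prepend_perl5lib() {
    PERL5LIB=$1${PERL5LIB:+:$PERL5LIB}
    export PERL5LIB
}

# Usage (the module directory is a placeholder):
#   perl_has_module XML::Twig || prepend_perl5lib "$HOME/perl5/lib/perl5"
```

If perl_has_module still fails after PERL5LIB is extended, the module is most likely not installed at all and needs to be fetched from CPAN.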
- When trying to install CRF++ using make, the build fails because the file
path.h is missing an include that defines size_t. This can easily be
fixed by adding the missing #include to that file and recompiling. I would submit a
pull request on GitHub but the source file is in the CRF .tar.gz.
- We have directly embedded the CRF++ source code, but since we don't manage
that project we've left that as-is.
ParsCit owes its continued maintenance and support to its user
base. Here we'd like to thank them for their help.
Thanks to Trevor Killeen for spotting a missing dependency in the CRF++
source code. Thanks to David Judd who reconfigured how CRF++ is located with
respect to the main code. Thanks to Alex Garnett for spotting more
problems with CRF dependencies. Thanks to George E. Raptis and Eric
Tran for the port to Windows. Thanks to Zhu Ying-Bo
(email@example.com) from the Language Computing and Web Mining Group,
Institute of Computer Science and Technology of Peking University for
the partial port to Windows. Thanks to Yustus Oktian for questions
about training for another language. Thanks to Madhur Kapoor for
asking questions about PDF conversion. Thanks to Behrang Qasemizadeh
for reporting problems with truncation of XML entities in XML output
(v110505). Thanks to Tim Brody for his BiblioScript patch. Thanks to
David Jurgens for suggesting that we remove temporary files after runs
(v110505). Thanks to Nikolay Nikolov for suggesting the conversion of
OmniPage XML results from UTF-16 to UTF-8 to avoid encoding
problems. Thanks to Matteo Romanello for the suggestion and permission
to incorporate BiblioScript software (v101101). Many thanks to Kris
Jack for pointing out problems with the ELF binaries and an
appropriate fix. Thanks to Cheong Chi Hong for fixing problems with
Preprocess.pm (v100401) and contributing the ICONIP training data and
XML entity problems in reference string parsing (v100401). Thanks to
Priya Venkateshan for pointing out sudo/root installation
possibilities (v100401). Thanks to Mario Lipinski for reporting
punctuation stripping problems in reference string parsing (v100401).
Thanks to Artemy Kolchinsky for fixes in Preprocess.pm
(v090625). Thanks to Matteo Romanello for the humanities training
datasets. Thanks to Dain Kaplan for helping us fix the Preprocess.pm
bug. Thanks to Ayeh Bandeh-Ahmadi for correcting the warning in
parseRefString.pl. Thanks to Nick Friedrich and Jöran Beel of
scienstein.org for all fixes in the v081201 version of ParsCit. Also
thanks to Madian Khabsa for indicating problems with installation to
ParsCit is used by many projects worldwide, and not just in
experimental, research and academic places, but in commercial
enterprises as well. Mendeley
is using ParsCit to parse references from contributed papers, as is
the Citations in Economics
Other, open-source citation parsers:
Other related links. Contact Min below to get your other related
software listed here. Thanks!
Min-Yen Kan <firstname.lastname@example.org>
Created on: Fri Dec 24 01:48:05 SGT 2004
| Version: 1.0
| Last modified:
Tue May 27 07:21:03 SGT 2014