ParsCit: An open-source CRF Reference String and Logical Document Structure Parsing Package

This is the home page of the ParsCit project, which performs two tasks: 1) reference string parsing, sometimes also called citation parsing or citation extraction, and 2) logical structure parsing of scientific documents. It is built as a supervised machine learning pipeline that uses Conditional Random Fields as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service. The distribution contains the training data, the feature generator, and the shell scripts that connect the system to a web service (used on this web site).

Some definitions (thanks to Robert Dale for Citations and Reference Strings):

Reference String:
A text string in the bibliography or reference section of a work, usually at the end of the document, that refers to a unique document. It usually occurs alongside other reference strings that point to other documents, and may also appear in footnotes.
Citation:
A text string (usually explicit) in the document body that points to a corresponding reference string at the end of the document. Several citations may co-refer to a single reference string.
Document Logical Structure:
A hierarchy of logical components, for example, titles, authors, affiliations, abstracts, sections, etc., according to (Mao, Rosenfeld & Kanungo, 2003). Our logical structure is more comprehensive, comprising not only header metadata and references, but also the logical structure of the internals of the document -- sections, subsections, figures, tables, equations, footnotes and captions.

This project deals with both problems: parsing the reference strings of a document and parsing its logical structure. The first task is handled by the module that gives the project its name, ParsCit, and the second by a separate module, SectLabel.

Extracting and Matching Authors and Affiliations in Scholarly Documents

We introduce Enlil, an information extraction system that discovers the institutional affiliations of authors in scholarly papers. Enlil works in two steps: it first identifies authors and affiliations using a conditional random field, and then connects authors to their affiliations using a support vector machine. Enlil relies on SectLabel for its input, collecting all author and affiliation lines reported by SectLabel for processing.


License

This software is licensed under the GNU Lesser General Public License (LGPL), which means you are free to use it for any purpose, including embedding in commercial products.


Download

You can download the open-source code for ParsCit here. The source requires you to re-compile the bundled CRF++ (CRFPP) code and assumes that perl is installed on your system and can be invoked as perl (i.e., it must be on your path).
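
For orientation, the recompile step looks roughly like the following. This is a sketch only; the CRF++-0.51 directory name is taken from the build paths shown in the Troubleshooting section, and INSTALL.txt has the authoritative steps.

  $ perl -v                  # confirm perl is on your path
  $ cd crfpp/CRF++-0.51      # bundled CRF++ sources
  $ ./configure && make      # rebuild the CRF++ binaries for your platform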

Web Service

More NLP services are now being made available on the web. Following this trend, you can send your plain text citations to us via our web service. We will parse these for you free of charge, as and when time and processing power allow (these jobs are run at lower priority).

N.B. We keep logs of what is parsed in these demos, to improve the accuracy and productivity of ParsCit. If you'd like your data to be kept private, or you find you use this service a lot, why not install a local copy of ParsCit for yourself? If you do, please let us know where you are so we can acknowledge you here and re-direct some traffic your way.

Web-based Demonstration

You can also run ParsCit directly in your browser. The form below submits your text input (after suitable cleaning) to the ParsCit service to parse the input file or strings. Note that if system load gets high, your demo call may not be executed. If you want to run the program in batch, please download your own copy.

Demo #1: Parsing the header, logical structure and/or reference strings (and citation contexts) from a text file

N.B. This demo does not handle PDF input at this time. You can use another web service or other software to convert PDFs to text.

Internal key (if applicable):

Input Method 1) Enter a URL to a file on the web (e.g., http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.txt or http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.txt).

Input Method 2) Upload a .txt file (ASCII; UTF-8)

Input Method 3) Paste the whole file here:

Parse the document using the following options

Citation export formats: ADS, BIB, EndNote, ISI, RIS, WordBib


Demo #2: As above, but using XML input (the XML must conform to OmniPage output). This demo is slow, so please be patient.

Internal key (if applicable):

Input Method 1) Enter a URL to a file on the web (e.g., http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.xml or http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.xml).

Input Method 2) Upload a .xml file (ASCII; UTF-8)

Input Method 3) Paste the whole .xml file here:

Input Method 4) Upload your own .pdf file (less than 50 pages & smaller than 10MB):

Parse the document using the following options

Citation export formats: ADS, BIB, EndNote, ISI, RIS, WordBib


Demo #3: Parsing individual reference strings only (just extract_citations)

Internal key (if applicable):

Input Method 1) Enter a URL to a file on the web in the correct format (each line should be a separate citation; e.g., http://wing.comp.nus.edu.sg/~wing.nus/samples/E06-1050.cite or http://wing.comp.nus.edu.sg/~wing.nus/samples/W06-0102.cite).

Input Method 2) Upload a file (again, each line should be a separate citation)

Input Method 3) Enter a list of plain text citations (again, one per line):

Citation export formats: ADS, BIB, EndNote, ISI, RIS, WordBib


Publications

Journal Papers:

International Refereed Conference Publications:

Others:

Gold Standard Input and Sample Output

Group Members

FAQ

What platforms does ParsCit work on?
ParsCit works on all major platforms: Windows, Linux and MacOS. The installation requires ruby and perl, and the embedded CRF++ package also requires standard UNIX utilities like sed. You should have a working knowledge of UNIX and some experience in installing UNIX tools. Due to our time constraints, we may not be able to answer your particular problems with installation. Do let us know if there was something important that you had to do to get your particular download and installation working; we'll incorporate it into the Troubleshooting section below.
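A quick, illustrative sanity check that the required tools are on your path:

  $ which perl ruby sed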
What is the difference between SectLabel and the previous ParsHed?
SectLabel is a newly-developed module that extends ParsHed in functionality. It not only classifies header metadata, but also analyzes full documents to output the logical structure of the internals of the document -- sections, subsections, figures, tables, equations, footnotes and captions.
For backward compatibility, the ParsHed module is still retained in our source code and command line options.
How do I retrain ParsCit for a different language? I saw code in lib/ParsCit/PreProcess to find the beginning of the bibliography section, and changed that, but it doesn't work.
The current version does not depend on those regular expressions anymore; they were used only by previous versions (e.g., v101101). ParsCit now first labels each line using the SectLabel module and, based on that first step's output, discovers which lines to parse references from. You need to retrain SectLabel for this, by providing labeled data that states which class each line in your training data belongs to. It's also possible to "downgrade" the current version to go back to the rule-based method for identifying the reference section.
What is the "genericHeader" in the output of SectLabel? What is the difference between "genericSect.tagged" and "SectLabel.tagged"?
Generic headers, such as introduction, methodology, and evaluation, represent the generic purposes of different sections in a scholarly article. We map all section names to generic ones (e.g., "5. Text Features" to "Methodology"). This promotes comparative viewing of sections with identical purposes across articles. As for the second question, generic section classification is a component of SectLabel; it is responsible for classifying the section headers of a paper into generic categories such as Introduction, Methodology, Result, etc. For details, refer to our IJDLS journal paper.
Why is there an option to input a file in XML format? Which DTD should it follow?
SectLabel is a robust logical document structure inference system: it can exploit rich input produced by OCR software (such as font or spatial features) to boost recognition performance, yet can still perform inference on impoverished input (plain text), with degraded performance. Currently, the XML input must be in the form of output from Nuance OmniPage (version 16), and hence should follow the DTD defined by OmniPage. Note: The ParsCit team is not affiliated with Nuance in any way, nor does it endorse OmniPage.
I need to run ParsCit but I can't get well-formed text from my PDF documents. Can you help?
No, we cannot help you with this. We don't perform OCR or text extraction from PDF documents. You will have to find your own tool for the extraction or conversion. We've found OmniPage useful in our own project work (hence the possibility of XML input), but we don't endorse any product.
The OmniPage XML doesn't seem to be well-formed. Is that OK?
Yes. The sample "XML" files provided in the links (for Demo 2) are actual outputs for a sequence of XML pages (one XML file per page). If you use OmniPage to save XML for input to ParsCit, make sure to save individual pages as separate files, then concatenate them to send to ParsCit, as shown below. You may want to download the sample links for inspection (as they are concatenations of several XML files, your browser will likely complain about them not being well-formed).
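For example, if OmniPage produced one XML file per page (the filenames below are made up), simple concatenation is enough:

  $ cat page01.xml page02.xml page03.xml > document.xml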
I ran Demos 1 and 2 with the default "all" settings, but sections don't seem to be detected.
There's no problem. The demo just hides the SectLabel output by default. Click "Show SectLabel output" to reveal it.
I ran ParsCit using the OmniPage XML output, but encountered malformed UTF8 character errors.
OmniPage normally outputs XML results in UTF-16 format; converting the file to UTF-8 will solve the problem, as shown below:
      iconv --from-code UTF-16 --to-code UTF-8 omnipageOutput.xml > newOmnipageOutput.xml

Troubleshooting

A list of common problems with ParsCit. If you find problems, email the lead developer at <kanmy@comp.nus.edu.sg>. Please use the subject "[ParsCit]" to ensure that it reaches our attention. If you have hand-corrected tagged data that you don't mind providing us, we can use it to further improve ParsCit's extraction capabilities. Nevertheless, problems with the output do occur occasionally; below are some common ones people have encountered.

ParsCit v110505 seems to be a lot slower on OmniPage output than previous versions. Why?
You are correct. We now use XML::Twig to do the XML processing correctly, rather than doing it ad hoc ourselves, but this requires constructing an exhaustive DOM tree for the OmniPage input. This is the time sink you are experiencing.
I'm running ParsCit on Windows but I can't get it to work, even after installing a perl interpreter. Specifically, the citeExtract.pl program dies complaining that it Can't open "/tmp/...." at line 175.
ParsCit hasn't been fully tested on Windows at NUS, so we can't vouch for whether it will run correctly. In this specific case, the "/tmp/" directory (a standard place for temporary files on UNIX systems) is normally not available on Windows, which may cause problems. You may need to change the code and/or create an appropriate directory for ParsCit to write such files to; see the sketch below.
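One possible workaround, assuming ParsCit is run from the C: drive so that the hard-coded /tmp paths resolve to the root of that drive, is simply to create the directory:

  C:\> mkdir C:\tmp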
I tried downloading and running ParsCit but I get complaints about /bin/sed and crf not being found. Help?
Please read the INSTALL.txt directions. You need to recompile CRF++ for your platform; the paths included with the install are for our build, so you need to recompile for the paths to point correctly.
When running citeExtract.pl I get some errors complaining about the wrong ELF class of the binaries. How can I fix this?
This seems to be a problem with the compiled executables of CRF++ bundled with the software. Follow the INSTALL instructions but after step 1 do:

$ cp -Rf * ../../.libs
$ cp crf_learn ../../.libs/lt-crf_learn
$ cp crf_test ../../.libs/lt-crf_test

I'm trying to install ParsCit v110505 using the instructions in the install file, and when I get to the point where you're supposed to recompile CRF++, it exits with an error:
  In file included from node.h:13:0,
                   from node.cpp:9:
  path.h:26:52: error: 'size_t' has not been declared
  make[1]: *** [node.lo] Error 1
  make[1]: Leaving directory `/home/agarnett/parscit/crfpp/CRF++-0.51'
  make: *** [all] Error 2

The install file mentions that this may fail the first time; unfortunately for me, it keeps failing. Any help?
The error comes from the CRF++ package (not from ParsCit); there are two ways to fix it:
1. Add the line #include <iostream> to node.cpp and compile CRF++ again, or
2. Go to http://crfpp.googlecode.com/svn/trunk/doc/index.html and download the latest version. The installation instructions are the same. Hope this helps.
Issue numbers don't get extracted.
This issue should be fixed as of the v110505 release. There is now some heuristic postprocessing code to take care of breaking single or multiple tokens for issues and volumes.
Separation of author names and publishing year fails
In some reference data with non-standard orderings of first names and family names, e.g.:
  Baltes, Paul, Ursula Staudinger, Ulmann Lindenberger (1999): Lifespan
  psychology: theory and application of intellectual functioning; in:
  Annual Review of Psychology, 50, 471-507
ParsCit's post-processing step may not detect and deal with these cases reliably. We're working to fix these too.
I passed ParsCit plain text, but in another, non-English language. I didn't get good results, or I got empty results. Can you help?
Aside from English, ParsCit can handle Italian and German to a limited extent, thanks to the multilingual training data. However, the demo web interface uploads non-ASCII input (e.g., UTF-8 or UTF-16 data) as binary data and fails to execute ParsCit. If you download a copy of ParsCit, though, the libraries do work on such data. Here's a sample. We'd love to build a more universal model that can accommodate reference strings in other languages; if you're willing to contribute ground truth data, we'd love to hear from you!
How about retraining ParsCit for another language/domain?
Put your supervised exemplar data into the same format as tagged_references.txt, found in crfpp/traindata/. Once you have this file, you can generate the appropriate model for ParsCit using three commands (this assumes you are in the crfpp/traindata directory):

$ ../../bin/tr2crfpp.pl tagged_references.txt > parsCit.train.data
$ ../crf_learn parsCit.template parsCit.train.data model
$ mv model ../../resources/parsCit.model

The first command creates the input feature file that CRF++ uses from the training data. The second creates the model using the crf_learn command. The final command moves the model file to the resources/ subdirectory, replacing the default model that comes with ParsCit.
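
For illustration, each line of tagged_references.txt is one reference string whose tokens are wrapped in field tags, roughly like the made-up example below; consult the bundled file for the exact tagset and tokenization conventions:

  <author> A. B. Smith and C. D. Jones , </author> <title> An example title for illustration , </title> <journal> Journal of Examples , </journal> <volume> 12 </volume> <pages> 1 -- 10 , </pages> <date> 1999 . </date>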

Can I retrain the package for a different set of tags if I change the tagset in the training data?
Yes, you should be able to change the tagset to suit your dataset. You can add, remove and rename tags as you wish; you just need to retrain the parser after creating your tagged data. For more details on the training process, see the documentation for CRF++, which is available on the web at SourceForge.
When retraining I get a "bad_alloc" error. What gives?
We're not entirely sure about this. CRF training is quite memory intensive, and training on a large number of data tuples may cause the embedded CRF++ package to fail. You can try with less training data, or train on a machine with more RAM; one other possibility is sketched below.
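If more RAM isn't an option, one knob worth trying (assuming the bundled CRF++ exposes its standard command-line options) is the -f frequency cutoff, which drops rare features and can reduce memory use substantially:

  $ ../crf_learn -f 3 parsCit.template parsCit.train.data model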
Does the web service actually work? I can't seem to run it.
Occasionally our school's networking staff changes the firewall settings, so the port for our group's web services may be blocked (port 4000 on host wing.comp.nus.edu.sg). If you find you can't reach our services (they time out), please let us know.
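A quick way to test reachability from your side, assuming netcat is installed:

  $ nc -zv wing.comp.nus.edu.sg 4000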
I get funny errors with crf_test not being useful. How do I fix this?
The updated README.txt file in the 090625b distribution fixes this. Basically you need to recompile CRF++ 0.51 and place the libraries and the executables in the proper place. See the README for details.
I am trying to install CRF++ (CRFPP). When I executed the following in the terminal,

$ /opt/parscit2/bin/citeExtract.pl -m extract_all \
    /var/www/internal/indra/uploads_batch/proses.txt \
    /var/www/internal/indra/uploads_batch/hasiltemp.xml


the following error was thrown:

  /opt/parscit2/crfpp/.libs/lt-crf_test: error while loading shared libraries:
  libcrfpp.so.0: cannot open shared object file: No such file or directory

The file libcrfpp.so.0 is not in your library path. Normally, a successful reinstallation of CRF++ (following the instructions above) will also solve this problem. If it still persists, you can try setting the LD_LIBRARY_PATH variable to the location of the libcrfpp.so.0 file:
$ export LD_LIBRARY_PATH=<PATH-TO-LIBCRFPP.SO.0-FILE>
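If you are unsure where the library ended up on your system, a brute-force search will find it (illustrative):

  $ find / -name 'libcrfpp.so.0' 2>/dev/null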

I get the following error message:

  Can't locate XML/Twig.pm in @INC (@INC contains: ...)
  BEGIN failed--compilation aborted at ...
  Compilation failed in require at ...
  ...
This is because the Perl interpreter on your machine/server can't locate all the Perl modules (XML::Twig in this case) that ParsCit requires to function properly.
  1. You may have to install these modules from CPAN before running ParsCit. You can find a list of all the required modules for ParsCit on this page.
  2. If you do have the required modules on the server, it is possible that the interpreter was not able to locate them. In this case, update the list of paths that the interpreter searches by setting the environment variable PERL5LIB (for Perl 5) or PERLLIB to include the path of the installed modules. Both options are illustrated below.
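For example (the module path in the second command is a placeholder for wherever your modules are actually installed):

  $ cpan XML::Twig
  $ export PERL5LIB=/path/to/your/perl/lib:$PERL5LIB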

Kudos

ParsCit owes its continued maintenance and support to its user base. Here we'd like to thank them for their help.

Thanks to David Judd, who reconfigured how CRF++ is located with respect to the main code. Thanks to Alex Garnett for spotting more problems with CRF dependencies. Thanks to George E. Raptis and Eric Tran for the port to Windows. Thanks to Zhu Ying-Bo (yumichika@163.com) from the Language Computing and Web Mining Group, Institute of Computer Science and Technology of Peking University for the partial port to Windows. Thanks to Yustus Oktian for questions about training for another language. Thanks to Madhur Kapoor for asking questions about PDF conversion. Thanks to Behrang Qasemizadeh for reporting problems with truncation of XML entities in XML output (v110505). Thanks to Tim Brody for his BiblioScript patch. Thanks to David Jurgens for suggesting that we remove temporary files after runs (v110505). Thanks to Nikolay Nikolov for suggesting the conversion of OmniPage XML results from UTF-16 to UTF-8 to avoid encoding problems. Thanks to Matteo Romanello for the suggestion and permission to incorporate the BiblioScript software (v101101). Many thanks to Kris Jack for pointing out problems with the ELF binaries and an appropriate fix. Thanks to Cheong Chi Hong for fixing problems with Preprocess.pm and XML entity problems in reference string parsing, and for contributing the ICONIP training data (v100401). Thanks to Priya Venkateshan for pointing out sudo/root installation possibilities (v100401). Thanks to Mario Lipinski for reporting punctuation stripping problems in reference string parsing (v100401). Thanks to Artemy Kolchinsky for fixes in Preprocess.pm (v090625). Thanks to Matteo Romanello for the humanities training datasets. Thanks to Dain Kaplan for helping us fix the Preprocess.pm bug. Thanks to Ayeh Bandeh-Ahmadi for correcting the warning in parseRefString.pl. Thanks to Nick Friedrich and Jöran Beel of scienstein.org for all the fixes in the v081201 version of ParsCit. Also thanks to Madian Khabsa for indicating problems with installation on MacOS.

ParsCit is used by many projects worldwide, not just in experimental, research and academic settings, but in commercial enterprises as well. Mendeley is using ParsCit to parse references from contributed papers, as is the Citations in Economics (CitEc) project.

Related Links

Other open-source citation parsers:

Other related links. Contact Min (below) to have your related software listed here. Thanks!


Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Fri Dec 24 01:48:05 SGT 2004 | Version: 1.0 | Last modified: Tue Jul 23 00:28:05 SGT 2013