Index of /~min/dAnth/acl

Name               Last modified       Size  Description

Parent Directory                       -
pdfbox/            02-May-2008 17:49   -
pdftohtml/         23-Oct-2006 10:35   -

README for ACL Anthology Text Extraction

The directories above contain text conversions of the ACL Anthology (as of Oct 2006), produced by either pdfbox (with the UTF-8 extraction flag) or pdftohtml (with the -c "complex" flag on). These conversions are done on a best-effort basis by a consortium of university labs; no guarantee of correct extraction or cleanliness is given. These materials are being made available to the public on an "as-is" basis, and copyrights to the materials still belong to the ACL and its copyright holders.

pdftohtml produces HTML output and thus may preserve more of the formatting and the 2-column layout, but pdfbox has been observed to be more robust to failure. If you know of an alternate PDF-to-text converter that works well and doesn't use one of the above engines as a core, then we'd love to hear about it.

Each zip file contains a slice of the Anthology data and is named after the subdirectory structure of the Anthology itself. For example, the zip files beginning with "A" are the conversion outputs of the ANLP series (ANLP '83's link is http://acl.ldc.upenn.edu/A/A83/).
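
As a rough illustration of that naming scheme, the sketch below builds an Anthology directory link from a volume ID such as "A83". It assumes that every series follows the same letter/volume pattern as the single ANLP '83 example above, which this README does not confirm; the function names anthology_url and series_letter are hypothetical and not part of any released tool.

```python
import re

# Base URL taken from the ANLP '83 example in this README.
BASE_URL = "http://acl.ldc.upenn.edu"


def series_letter(name: str) -> str:
    """Return the series letter that a zip file or volume name begins with."""
    if not name or not name[0].isalpha():
        raise ValueError(f"name does not start with a series letter: {name!r}")
    return name[0].upper()


def anthology_url(volume_id: str) -> str:
    """Build the directory URL for a volume ID such as 'A83'.

    Assumption: all series use the <letter>/<letter><two-digit year>/ layout
    seen in the one example link; other layouts would need their own handling.
    """
    if re.fullmatch(r"[A-Z]\d{2}", volume_id) is None:
        raise ValueError(f"unexpected volume ID: {volume_id!r}")
    return f"{BASE_URL}/{series_letter(volume_id)}/{volume_id}/"


if __name__ == "__main__":
    print(anthology_url("A83"))
```

Run as a script, this prints the ANLP '83 link quoted above. Any volume ID that does not match the assumed pattern raises a ValueError, so nothing is silently guessed.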

If you are interested in the extraction details or in helping to contribute, please consider joining the mailing list or contributing to the wiki (note that the wiki requires registration):

Digital Anthology: [ Mailing list ] [ Wiki ]

Please be considerate in downloading these zip files. They are large and hosted in Singapore. If you encounter problems, the web server may be too busy; please come back later and try again. Contact me at the address below if you encounter systematic problems.


Min-Yen Kan <kanmy@comp.nus.edu.sg>
Created on: Thu Feb 1 13:57:30 2007 | Version: 1.0 | Last modified: Thu Feb 1 14:09:54 2007