Corpora - written, spoken, transcribed data for natural language analysis and use.
- 20 newsgroups - The twenty newsgroup collection is often used for machine
learning benchmarks. It was installed locally at SoC to test the bow
machine learning package.
- 4 Stopword lists - Four downloaded stoplists available from the web. See the README.html file in the directory for more information.
- A Test Collection of Preference Judgments - A collection of preference judgments over documents judged for the Topic Distillation task of the TREC 2003 Web track. http://ciir.cs.umass.edu/~carteret/BBR.html
- Academic Web Link Databases - Link structure of Spanish, U.K., Taiwanese and Australian universities. See the local copy of the original description HTML file; the original is hosted by Wolverhampton at http://cybermetrics.wlv.ac.uk/database/.
- AQUAINT (TREC) QA evaluation corpus - TREC QA (AQUAINT) Data for 2002/2003.
A corpus comprising data from the New York Times, Xinhua news
service and the Associated Press. See the index.html file in the
directory for more details.
- AQUAINT-2 corpus - The AQUAINT-2 collection is a subset of the LDC English Gigaword Third Edition (LDC catalog number LDC2007T07). The AQUAINT-2 collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period of October 2004 - March 2006. Articles are in English and come from a variety of sources including Agence France Presse, Central News Agency (Taiwan), Xinhua News Agency, Los Angeles Times-Washington Post News Service, New York Times, and the Associated Press. Document IDs identify both the source newswire service and the date when the article was delivered to LDC by the service.
- Argumentative Zoning Corpus (pre-distribution) - This is a mostly cleaned corpus of 80 computational linguistics articles that have been marked up for argumentative zoning relations. You can learn more about this from Simone's home page or from Yee Seng Chan's (search for "zoning") Digital Library course project.
- Bank Search Dataset - A web document clustering dataset, provided free of charge from the University of Reading.
- British National Corpus, World Edition - The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. See the home page of the BNC at http://www.natcorp.ox.ac.uk/ for more details. We have a five year license for this product.
- CBC4Kids corpus - This directory contains the automatically annotated marked-up version
of MITRE's CBC4Kids corpus of online news stories for teenagers
- Chinese Treebank - The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000, and it was later corrected and released in 2001 as Chinese Treebank 2.0. More information about the project is available on the Penn Chinese Treebank website at: http://www.cis.upenn.edu/%7Echinese/.
- CoPhIR - Content-based Photo Image Retrieval Test-Collection. Homepage: http://cophir.isti.cnr.it/
- Cora datasets - This is the data from Andrew McCallum's home page on the scientific search engine CORA. It includes the citation matching, research paper classification and information extraction datasets.
- CoreLex - Systematic polysemy and underspecification of nouns. Home page: http://www.cs.brandeis.edu/~paulb/CoreLex/corelex.html
- Cotraining Web KB Data - This is a subsection of the WebKB text
classification corpus containing both hyperlink and the documents with
judgments on the webpages into two categories, course and non-course.
The relevant web page has been downloaded into root directory and is
also found on the WWW at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/
(as of Mon Apr 14 20:03:52 GMT-8 2003).
- DBLP XML records - These are the XML records of the entire DBLP database. They are available from http://dblp.uni-trier.de/xml/. The copies here are dated 18 Jul 2005, 1 Sep 2005 and 1 Aug 2006.
- DUC 2001-2003 data - Data (mostly testing data) from the Document Understanding Conference for the years 2001-2003. This is a summarization competition held by NIST in the USA. See the DUC web site for details.
- English gazetteer - The main part of the World Gazetteer consists of current and some historical population data for all countries and territories, their administrative divisions, cities and towns. "Historical" in this context means the last or last two censuses or estimates after the last census. There are no older data.
Furthermore, it contains the national flags for all countries and territories and some administrative divisions, some summary statistics (such as a list of the largest cities or a complete list of countries with their population) and further information such as a pronunciation table for a number of languages.
- English Wikipedia - English Wikipedia corpus (3 CDs)
- Excite Query Logs - For research
purposes only. Anyone connected to corporate research may not
use these logs. Access is restricted.
- Excite Query Logs (2001) - Image and text queries from the Excite search engine, circa 2001.
See Amanda Spink's publications for details.
URL: http://www.sis.pitt.edu/~aspink/
- HIT-IR Lab Language Technology Platform Sharing Package 1.5.0 - The Harbin Institute of Technology Information Retrieval Lab sharing package consists of the following corpora:
1. HIT-IR Chinese-English Bilingual Corpus
2. HIT-IR Chinese Dependency Treebank
3. HIT-IR Tongyici Cilin (Extended)
4. HIT-IR QA Question Set
5. HIT-IR Text Summarization Corpus
6. HIT-IR Multi-Document Summarization Corpus
The sharing package also contains programming tools for the Windows platform.
- Hong Kong News Parallel Text - This FTP publication contains the Hong Kong News Parallel Text, produced by the Linguistic Data Consortium (LDC), catalog number LDC2000T46, ISBN 1-58563-169-8. The Hong Kong News Parallel Text was created when the LDC collected parallel Cantonese - English news articles from the Information Services Department of Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China.
- ICWSM 2007 Blogs Dataset - UMBC Ebiquity group is hosting the Blogs Collection associated with the 1st International Conference on Weblogs and Social Media, 2007 (ICWSM). This collection is provided by Nielsen BuzzMetrics. You are required to have completed the Data Share Agreement prior to download.
The URL is here: http://ebiquity.umbc.edu/blogger/icwsm-2007-blogs-dataset/
- ILP learning dataset - Another subset of the WebKB text classification
corpus as used in the ILP 98 paper. See the root directory README for
more details.
- Internet Movie Database - Plain text files for the Internet Movie Database, downloaded from http://www.imdb.com/interfaces.
- ISL Meeting transcripts - The ISL Meeting Corpus Part 1 is a first subset of the ISL Meeting Corpus (112 meetings). It contains 18 meetings collected at the Interactive Systems Laboratories at Carnegie Mellon University in Pittsburgh, PA during the years 2000-2001. The recorded meetings were either natural meetings where participants needed to meet in the real world, or artificial meetings, which were designed explicitly for the purposes of data collection but still had real topics and tasks. The duration of the meetings in this corpus ranges from 8 to 64 minutes and averages at 34 minutes. The audio files are available as ISL Meeting Speech Part 1. See the home page for the corpus at: http://wave.ldc.upenn.edu/Catalog/docs/LDC2004T10/.
- LDC Chinese Resources - LDC Chinese Resources, see http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm
- LDC English Gigaword Corpus - A large newspaper article corpus from the LDC, overlaps with WSJ and the AQUAINT corpora. Here's a link to its description from the LDC: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05.
- Live mirror of the Citeseer - This is our live mirror of the Citeseer system, which
contains over 750K PDF and PS documents used by the system. The
dataset can be used by anyone in the WING group. The front-end of the
system can be found at http://citeseer.comp.nus.edu.sg/
- Mirror of www.comp.nus.edu.sg - This is a crawl of the http://www.comp.nus.edu.sg/ site consisting of
approximately 250K pages/data items equaling ~17GB. Crawled in Aug
2006. Useful for focused crawling/spidering experiments.
- MNIST database of handwritten digits - The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. Homepage: http://yann.lecun.com/exdb/mnist/
- Moby corpus' complete works of Shakespeare - The Moby corpus' version of the unabridged works of William Shakespeare. The Moby project has a number of other lexica, see below and at the source home page: http://www.dcs.shef.ac.uk/research/ilash/Moby/.
- MovieLens Collaborative Filtering dataset - Two datasets used for collaborative filtering research. The first one consists of 100,000 ratings for 1682 movies by 943 users. The second one consists of approximately 1 million ratings for 3900 movies by 6040 users. Before using these datasets, please review the included readme files for the usage license. More information is available from the GroupLens webpage: http://www.grouplens.org/
- MPQA Opinion Corpus 1.1 and 2.0 - The MPQA Opinion Corpus contains news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.). Homepage: http://www.cs.pitt.edu/mpqa/.
- MUC 6 co-reference data - Message Understanding Conference 6 data, from the Linguistic Data Consortium. See the README file in the source directory for details.
- Name Files - Name files
- North American News Text Corpus - Contains text from the Wall Street Journal,
Reuters, New York Times and the LA Times-Washington Post News Service.
- NPIC Synthetic Image Corpus - This is a large collection of 15K+ images spidered from Google and downloaded from
the Wikipedia Commons, used in "Fei Wang and Min-Yen Kan (2006) NPIC: Hierarchical synthetic image classification using image
search and generic features, In Proceedings of the Conference on
Image and Video Retrieval (CIVR), Tempe, Arizona, USA, July 2006."
Status: freely available to all, for non-commercial
use only. See the WING download section for NPIC.
- NTU OPAC query logs - This is a list of about 700K online public access catalog queries collected by the Nanyang Technological University (NTU) OPAC server in 2002.
- NUS Libraries query logs - About 800 K queries from the simple keyword
interface for the LINC online catalog system of NUS. On-going
collection of queries likely. Provided by NUS Libraries. For research
purposes only. Description updated: Fri Nov 21 09:13:43 GMT-8 2003.
- Open Directory Project web page data - The ODP is a large, open-source, human-edited
directory similar to Yahoo!. The data is distributed under GNU GPL and
is provided here for IR research purposes. See their web page for more details.
- OPUS Parallel corpus (v0.2) - OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic data, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and is also delivered as an open source package. We used several tools to compile the current corpus. (Manual corrections have not been made.) See the home page for more details and for their online search interface: http://logos.uio.no/opus/
- Penn Treebank - The Penn Treebank contains Wall Street Journal text that has
been tagged and parsed by both machine and linguists. It is a benchmark
corpus for parsing and part-of-speech tagging tasks. Contains binaries
for grepping on tree nodes (e.g., tgrep).
- Presentation to Document Alignment Corpus (v1.1) - This is a manual alignment of 21 scholarly papers from the database community to their corresponding presentations. The alignments are from one slide to multiple paragraphs. Compiled by Eugene Ezekiel. The corpus is distributed with the original papers (.PDFs) and presentations (.PPTs). Mentioned on the downloads page of the WING
website at http://aye.comp.nus.edu.sg/~slideseer/
- PRESRI presentation to Document Alignment Corpus (v1.0) - This is a set of document to presentation alignments compiled by Hayama et al. in their paper: Alignment between a Technical Paper and Presentation Sheets Using a Hidden Markov Model, available from http://www.jaist.ac.jp/~t-hayama/paper/hayamaAMT05.pdf. Two datasets are available, one in Japanese and a smaller one in English.
- Processed versions of the webbase - Processed versions of the webbase. Currently only contains all the URLs contained in the webbase corpus, see directory and script for information.
- PropBank - The PropBank project is creating a corpus of text annotated
with information about basic semantic propositions. Predicate-argument
relations are being added to the syntactic trees of the Penn Treebank. See http://www.cis.upenn.edu/~ace/
for details.
- Remedia Corpus - Remedia corpus
- Reuters-21578 text categorization corpus - The classic text categorization corpus. Obtained from http://www.daviddlewis.com/resources/testcollections/reuters21578/.
- Search Engine Transaction Logs (Excite 1997, 1999, 2001) - Search Engine Transaction Logs File.
Excite datasets and the query logs from Dr Amanda Spink.
URL: http://www.sis.pitt.edu/~aspink/
- Short messages service corpus (SMS Corpus) - A collection of about 10.1K SMS messages collected by How Yijue as part of her honors year thesis work. Will be available for worldwide use by the end of 2004. Please see How Yijue's thesis for more documentation.
- Stanford WebBase Corpus - Partial collection of the Stanford WebBase Project.
- STATTAB - STATTAB calculates cumulative functions, their inverses, and parameters of the following distributions: Incomplete Beta, Binomial, Negative Binomial, Chi-square, Non-central chi-square, Variance Ratio - F, Non-central F, Incomplete Gamma, Normal, Poisson, T, Non-central T. Downloaded from http://biostatistics.mdanderson.org/SoftwareDownload/.
- Summbank - Summary corpus linked to the HKSAR news corpus. Produced and studied extensively by one of the JHU Workshops in 2001. More information about the corpus is at: http://www.summarization.com/summbank/.
- Surname List - A list of 23K+ English surnames compiled from the rootsweb mailing list. See the local README file for more information.
- TAC 2008 Summarization task data - Contains data for the
update and opinion summarization tasks.
- Text Retrieval Conference (TREC) English Queries - The Text
Retrieval Conference (TREC) has been held for numerous years. The
queries for the competition are housed here. The TREC English queries
home page is at: http://trec.nist.gov/data/topics_eng/index.html.
Status: Currently available for research purposes, as cleared by the TREC
administrators.
- Three test collections for personal name matching - Three test collections for personal name matching were created from DBLP data by Patrick Reuther. Homepage: http://dbis.uni-trier.de/Mitarbeiter/reuther_files/private/reuther.shtml
License: For non-commercial use only
- Tipster Text Research Collection, Vol 1. - The TIPSTER Text research collections
were used extensively for the Text Retrieval Conferences (TREC). Still
a good source of text corpora for the research community.
- Topic Detection & Tracking - The TDT dataset is used for Topic Detection & Tracking (TDT) research. Currently, TDT2, used for 1998 TDT test; TDT3, used for 1999 ~ 2001 TDT tests; and TDT4, used for 2002 ~ 2003 TDT tests are installed. Please refer to http://www.nist.gov/speech/tests/tdt/index.htm for details of TDT research.
- TREC 2003 QA Main Task Questions and Judgments - Questions used in TREC 2003 QA main task, including factoid, list and definition questions, as well as their judgments.
- TREC BLOG06 corpus - Test collection for the TREC-2006 Blogs track. See here for more information.
- TREC Results (Raw) - This collection contains results of previous Text REtrieval Conferences from 1993 (TREC-2) to 2005 (TREC-14). The collection of data in this directory is not in public domain, and is for use within SoC. Please refer to the file licence.txt
for (possibly not up-to-date) terms and conditions. An up to date version can be found at http://trec.nist.gov/results.html.
- Web 1T 5-gram Version 1 - This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses. The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages. See here for more information.
- Web corpora wt10g and wt2g - These are two corpora of 10 GB and 2 GB used by the TREC web
track. Compiled by CSIRO. See the directory for more information. More details on the corpus can be found on the TREC website and at the CSIRO website. Anyone wishing to use
this corpus must sign an individual license agreement
before proceeding.
- Web Pages of Biographies - Crawled web pages of biographies.
- WebBase statistics - Statistics on the Stanford WebBase corpus as compiled by UC
Berkeley. Scripts and files that compute the IDF value of words over
133 M web pages are included. Big file!
- WebKB webpages and judgments - This is the WebKB text classification corpus. The relevant home page
is in the root directory and can be found at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
(as of Mon Apr 14 19:59:38 GMT-8 2003). It contains a corpus of 4000+
web pages and their classification into 7 categories.
- William Hersh's MEDLINE corpus as used for the TREC 9 filtering task - The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. For more information, consult the TREC data home page, http://trec.nist.gov/data.html.
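The Web 1T 5-gram entry above describes a table of word n-grams, from unigrams to five-grams, with their observed frequency counts. The following is a minimal Python sketch of how such counts can be built from a token list. It assumes whitespace tokenization, and the function name `count_ngrams` is hypothetical; it is not part of the Web 1T distribution and says nothing about how that data set was actually produced.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens.

    Hypothetical helper for illustration only. Keys are tuples of
    tokens, values are observed frequency counts.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Assumption: simple whitespace tokenization of a toy sentence.
    counts = count_ngrams("the cat sat on the mat".split())
    for ngram, freq in counts.most_common(3):
        print(" ".join(ngram), freq)
```

A collection as large as the Web 1T source text would not fit in an in-memory counter like this one; the sketch only illustrates the shape of the data.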
Proceedings - proceedings and workshop notes from previous research congresses in IR and NLP.
- ACL 2003 - Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 7-12 July 2003, Sapporo Convention Center, Sapporo, Japan.
- ACL 2004 - Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain.
- ACL Anthology - The ACL Anthology is now mirrored locally in SoC as part of my group's
activities. This is a collection of over 10,000 papers in digital
form (PDF) that have been published by the Association for
Computational Linguistics and the International Committee on
Computational Linguistics. Our mirror comes directly from the main
copy at acl.ldc.upenn.edu and will be updated approximately once a
week. This is available on both sf3 and aye servers.
- ACL-EACL 2001 - Proceedings of the ACL-EACL Conference, Student Research Workshop, Workshops and local information
- ACM Multimedia - 2002 - Proceedings of the 10th ACM International Conference on Multimedia (ACM MM 2002) - Juan-les-Pins, France, 2002 December 1-6.
- ACM Multimedia - 2004 - Proceedings of the 12th ACM International Conference on Multimedia (ACM MM 2004) - New York, USA, 2004 October 10-16.
- ACM Multimedia - 2005 - Proceedings of the 13th ACM International Conference on Multimedia (ACM MM 2005) - Singapore, 2005 November 6-11.
- AIRS 2006 - Proceedings of the Information Retrieval Technology, Third Asia Information Retrieval Symposium, AIRS 2006, Singapore, October 16-18, 2006.
- CHI 2009 - Proceedings of the 2009 ACM SIGCHI Conference on Human Factors in Computing Systems, 4-9 April 2009, Boston, MA, USA.
- CIKM 2004 - Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, USA, November 8-13, 2004.
- Coling/ACL 2006 - This is a mirror of the proceedings CD (Sydney) complete with papers
from: Fifth SIGHAN Workshop on Chinese Language Processing,
Information Extraction Beyond The Document, Workshop on Sentiment and
Subjectivity in Text, Constraints and Language Processing, 2nd
Workshop on Ontology Learning and Population: Bridging the Gap between
Text and Knowledge, Frontiers in Linguistically Annotated Corpora
2006, Task-Focused Summarization and Question Answering, How Can
Computational Linguistics Improve Information Retrieval?, Annotating
and Reasoning about Time and Events, Multilingual Language Resources
and Interoperability, Linguistic Distances, Multiword Expressions:
Identifying and Exploiting Underlying Properties, The 7th SIGdial
Workshop on Discourse and Dialogue, Fourth International Natural
Language Generation Conference, The Eighth International Workshop on
Tree Adjoining Grammar and Related Formalisms, 2006 Conference on
Empirical Methods in Natural Language Processing.
- EACL 2006 - Proceedings of the 11th European Association for Computational Linguistics 2006 meeting and associated workshops.
Trento Italy, April 3-7 2006.
- HCII 2005 - These are the proceedings of the HCI International conference held in Caesar's Palace, Las Vegas, USA on July 22-27, 2005. HCII is formed of 7 different meetings that are co-located: Symposium on Human Interface (Japan) 2005; 6th International Conference on Engineering Psychology & Cognitive Ergonomics; 3rd International Conference on Universal Access in Human-Computer Interaction; 1st International Conference on Virtual Reality; 1st International Conference on Usability and Internationalization; 1st International Conference on Online Communities and Social Computing; 1st International Conference on Augmented Cognition.
- HLT-NAACL 2007 - Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2007 April 22-27, Rochester, NY.
- HLT/NAACL 2004 - The Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2004) - Boston, USA, 2-7 May 2004
- IJCNLP 2005 - Proceedings of the Second International Joint Conference on Natural Language Processing, Jeju Island, Korea, October 2005.
- IJCNLP 2008 - Proceedings of the Third International Joint Conference on Natural Language
Processing, Hyderabad, India, January 2008.
- JCDL/HT 2008 - Proceedings of the Joint Conference on Digital Libraries (JCDL) 2008 and Hypertext (HT) 2008, Pittsburgh, Pennsylvania, June 2008.
- LREC 2002 - The proceedings for the Language Resources and Evaluation Conference, held in the Canary Islands, Spain, in May 2002. Contains workshop and poster session papers as well.
- LREC 2004 - The proceedings for the Language Resources and Evaluation Conference, held in Lisbon, Portugal, in May 2004. Contains workshop and poster session papers as well.
- LREC 2008 - The proceedings for the Language Resources and Evaluation Conference, held in Marrakech, Morocco, in May 2008. Contains workshop and poster session papers as well.
- Multimedia Data Mining 2001 - Proceedings of the KDD 01 workshop
- Multimedia Data Mining 2002 - Proceedings of the KDD 02 workshop
- NAACL 2001 - The Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001) - Carnegie Mellon University - Pittsburgh, PA USA 2-7 June 2001
- PAKDD 2003 - Proceedings of the Advances in Knowledge Discovery and Data Mining, 7th Pacific-Asia Conference, PAKDD 2003, Seoul, Korea, April 30 - May 2, 2003
- SIGIR 2007 - The proceedings for the 2007 ACM SIGIR Conference on Research and Development in Information Retrieval, held in Amsterdam, The Netherlands.
- SIGIR 2008 - The proceedings for the 2008 ACM SIGIR Conference on Research and Development in Information Retrieval, held in Singapore.
- SIGMOD/PODS 2008 - Proceedings of the ACM SIGMOD/PODS 2008, Vancouver, BC, Canada, June 9-12, 2008.
- WebDB 2004 - Proceedings of the Seventh International Workshop on the Web and Databases, WebDB 2004, June 17-18, 2004, Maison de la Chimie, Paris, France.
- WebDB 2008 - Proceedings of the Eleventh International Workshop on the Web and Databases, WebDB 2008, June 13, 2008, Vancouver, British Columbia, Canada.
- WWW 2003 - The Twelfth International World Wide Web Conference (WWW-2003) - Budapest, Hungary, 20-24 May 2003. The proceedings contain 77 refereed papers, 207 posters and 38 alternate track papers.
- WWW 2004 - The Thirteenth International World Wide Web Conference (WWW-2004) - New York, USA, 17-22 May 2004
Grammars - hand crafted grammars for analysis and generation
- Surge 2.2 - A comprehensive unification grammar for English language generation. Widely used with FUF. Developed by Jacques Robin from Brazil. Home page: http://www.cs.bgu.ac.il/research/projects/surge/index.htm.
Lexicons - lexicons and ontologies for word senses, word relations and conflation
- Beth Levin's English Verb Classes and Alternations (EVCA) - Files that describe the verb classes from Levin's seminal work on verb classification by their case frames and alternations. Flat text files.
- CMU Pronunciation Dictionary - The Carnegie Mellon University Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions. This format is particularly useful for speech recognition and synthesis, as it has mappings from words to their pronunciations in the given phoneme set. The current phoneme set contains 39 phonemes, for which the vowels may carry lexical stress. See http://www.speech.cs.cmu.edu/cgi-bin/cmudict. See the README in the directory for more details.
- Extended WordNet 2.0 - In the eXtended WordNet the WordNet glosses are syntactically parsed, transformed into logic forms and content words are semantically disambiguated. The data is available in XML form. I have only installed the version that tracks WordNet 2.0. This is work by Moldovan et al. at U Texas. See their home page at: http://xwn.hlt.utdallas.edu/index.html
- FrameNet Release 1.3 - The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results. The major product of this work, the FrameNet lexical database, currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences. It has gone through five releases, and is now in use by hundreds of researchers, teachers, and students around the world. Home page: http://framenet.icsi.berkeley.edu/
- Java WordNet Library (JWNL) 1.3 RC3, 1.4 RC1 - JWNL is a Java API for accessing the WordNet relational dictionary. WordNet is widely used for developing NLP applications, and a Java API such as JWNL will allow developers to more easily use Java for building NLP applications. Home page at http://jwordnet.sourceforge.net/. Usage notes: Please refer to the README-SOC.TXT file for some usage notes.
- Moby Lexica - The Moby lexicons containing: Hyphenator - 185,000 entries fully hyphenated. Moby Language - Word lists in five of the world's great languages. Moby Part-of-Speech - 230,000 entries fully described by part(s) of speech, listed in priority order. Moby Pronunciator - 175,000 entries fully International Phonetic Alphabet coded. Moby Thesaurus - 30,000 root words, 2.5 million synonyms and related words. Moby Words - 610,000+ words and phrases. The largest word list in the world. The source Moby website is at: University of Sheffield
- NomBank - NomBank is an annotation project at New York University that is related to the PropBank project at the University of Pennsylvania. Our goal is to mark the sets of arguments that cooccur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs. As a side effect of the annotation process, we will produce a number of other resources including various dictionaries, as well as PropBank style lexical entries called frame files. These resources help the user label the various arguments and adjuncts of the head nouns with roles (sets of argument labels for each sense of each noun). NYU and UPenn are making a coordinated effort to ensure that, when possible, role definitions are consistent across parts of speech. For example, we are using Penn's frame file for the verb "decide" in our annotation of the noun "decision". However, our coordination goes far beyond that.
- WordNet 1.7.1 - Probably the most famous lexical ontology. Home page at http://wordnet.princeton.edu/. Documentation and papers available from its home page.
Usage notes: Make sure either $WNHOME is properly set to /home/rpnlpir/lexicons/WordNet-1.7.1 or $WNSEARCHDIR is properly set to /home/rpnlpir/lexicons/WordNet-1.7.1/dict.
- WordNet 2.0 - An update to 1.7.1 featuring quite a lot of changes. See the CHANGES file in the directory. Documentation and papers available from its home page.
Usage notes: Make sure $WNSEARCHDIR is properly set to /home/rsch/rpnlpir/lexicons/WordNet-2.0/dict.
- WordNet 2.1 - An update to 2.0 featuring quite a lot of changes. Documentation and papers available from its home page. Usage notes: Make sure either $WNHOME is properly set to /home/rpnlpir/lexicons/WordNet-2.1 or $WNSEARCHDIR is properly set to /home/rpnlpir/lexicons/WordNet-2.1/dict
- WordNet 3.0 - WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet's structure makes it a useful tool for computational linguistics and natural language processing. http://wordnet.princeton.edu/
- WordNet log likelihood statistics - Negative log likelihood statistics for WordNet 1.6 synsets. Can be used to compute (or partially compute) the semantic similarity of words, similar to lexical chaining. See the directory's README file for more information.
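
The WordNet usage notes above ask for either $WNHOME or $WNSEARCHDIR to be set. As a rough sketch only (not part of any WordNet distribution), the Python below shows one way a script could work out which dictionary directory those variables imply. The helper name `wordnet_dict_dir` is hypothetical, and checking $WNSEARCHDIR before $WNHOME is an assumption made for this illustration.

```python
import os
from typing import Mapping, Optional


def wordnet_dict_dir(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the WordNet dictionary directory implied by the environment.

    Hypothetical helper: checks WNSEARCHDIR first (assumed precedence),
    then falls back to WNHOME + "/dict", following the either/or wording
    of the usage notes. Returns None when neither variable is set.
    """
    env = os.environ if env is None else env
    search_dir = env.get("WNSEARCHDIR")
    if search_dir:
        return search_dir
    home = env.get("WNHOME")
    if home:
        # The usage notes place the database in the "dict" subdirectory.
        return home.rstrip("/") + "/dict"
    return None


if __name__ == "__main__":
    print(wordnet_dict_dir() or "neither WNHOME nor WNSEARCHDIR is set")
```

Running the script before starting a WordNet-based tool gives a quick check that the environment points at the intended version.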
Tools - a large list of language analysis and generation tools, including parsers, chunkers, part-of-speech taggers, etc.
- Alignment-Based Learning (ABL) grammatical inference system - ABL learns structure from plain sequences (for example, natural language sentences) by comparing them. Structure is inserted into the sequences based on the parts that two sequences share and the parts where they differ. Look in the /src folder for three executable files. Homepage: http://www.ics.mq.edu.au/~menno/research/software/abl/
- Ant 1.6.2, 1.6.5 and 1.7.0 - The build utility for Java projects. From http://ant.apache.org/. You may need to set your CLASSPATH to get this tool running properly.
- bmeps - bmeps is an image conversion program, installed with the zlib and libpng libraries.
- BoosTexter - BoosTexter is a machine learning algorithm that computes a classifier from simple single-level decision trees (a.k.a. decision stumps) via boosting.
- BOW machine learning toolkit - Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow). The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students. The library provides facilities for: recursively descending directories, finding text files; finding "document" boundaries when there are multiple documents per file; tokenizing a text file, according to several different methods; including N-grams among the tokens; mapping strings to integers and back again, very efficiently; building a sparse matrix of document/token counts; pruning vocabulary by word counts or by information gain; building and manipulating word vectors; setting word vector weights according to Naive Bayes, TFIDF, and several other methods; smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing; scoring queries for retrieval or classification; writing all data structures to disk in a compact format; reading the document/token matrix from disk in an efficient, sparse fashion; performing test/train splits, and automatic classification tests; operating in server mode, receiving and answering queries over a socket. The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL). Home Page: http://www-2.cs.cmu.edu/~mccallum/bow
- btparse 0.34 - Parser for BibTeX entries, as a C library. Homepage: http://www.gerg.ca/software/btOOL/
- c2html - A converter that colorizes C code and writes it out as HTML markup. From Ashley Clark's Debian Linux package. Compiles fine on Linux.
- C4.5 decision tree learner - The classic decision tree learner by Quinlan. Superseded by his commercial C5.0 product. Handles numerical and categorical features. More information from http://www.cse.unsw.edu.au/~quinlan/.
- Charniak Parser - Eugene Charniak's parser, as made available from his Brown homepage, at http://www.cs.brown.edu/people/ec/#software
- Collins Parser - The Collins parser as made available by Michael Collins of MIT. Michael Collins' home page: http://www.ai.mit.edu/people/mcollins/
- Coloring HTML annotation package - An annotation tool for HTML pages. It converts a target HTML page to a new HTML page that uses JavaScript to annotate text spans and images with respect to an {annotation category, color} hash table. The utility features undo and resume capabilities and keyboard shortcuts to make annotation easy. Annotations are saved automatically after each annotation. A recoloring utility allows annotated pages to be exported into finalized formats for subsequent viewing or for use in downstream machine learning algorithms.
- CRF 1.2 - The CRF package is a Java implementation of Conditional Random Fields for sequential labeling developed by Sunita Sarawagi of IIT Bombay. The package is distributed with the hope that it will be useful for researchers working in information extraction or related areas. Please read CRF-USAGE.TXT for usage notes. Homepage: http://crf.sourceforge.net/
- CRUNCH HTML Content Extractor Proxy - Described in Gupta et al.'s paper in WWW 2003. Status: Restricted license for research purposes only
- Daemonized Collins Parser - The modified Collins parser as made available by Min-Yen Kan of NUS. Modified to allow the parser to load the hash tables once and stay resident (as a background daemon process) so that the parser can parse multiple files without having to reload the hash tables each time.
- Duke University's Autobib - The Autobib project proposes and implements a framework for automatically extracting and integrating bibliographic information on the Web using Hidden Markov Models. Here, you will find code and documentation related to this project, and you can also browse the experimental bibliographic data and check its quality. This project was carried out in the Computer Science Department at Duke University, under the supervision of Prof. Jun Yang.
- FUF 5.3 - Functional unification based natural language generation system developed by Michael Elhadad. Home page at: http://www.cs.bgu.ac.il/surge/index.html.
- g95 - Fortran 95 compiler. Home page: http://www.g95.org/.
- GATE 4.0 - Architecture and framework for language processing.
- GNU Trove 1.1b5 - Provides fast, lightweight implementations of the java.util Collections API. Whenever possible, provides the same collections support for primitive types. Homepage: http://trove4j.sourceforge.net/
(This tool is required by the CRF package.)
- Google Web API - API for accessing Google search results; preferable to screen or page scraping. Home page at: http://www.google.com/apis/
- Hidden Markov Model Toolkit (HTK) 3.2.1 - The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples. See http://htk.eng.cam.ac.uk/ for more information.
- HMM Tagger (Xerox tagger) - Xerox part-of-speech tagger. XPOST is a hidden Markov model based part-of-speech tagger. Given a sentence, each token is assigned a part-of-speech ambiguity class from a lexicon (e.g. "package" is in the ambiguity class {noun,verb}). Words not in the lexicon are subjected to suffix analysis. A probabilistic model that assesses the likelihood of particular part-of-speech assignments based on word order is then applied to disambiguate the available choices. The final output is a sentence with each word tagged with the most likely part-of-speech tag. XPOST can process all the languages for which word order predicts part-of-speech tag. FTP site at: ftp://ftp.parc.xerox.com/pub/tagger/.
- International Components for Unicode for Java (ICU4J) 4.2 - ICU4J is a mature, widely used set of Java libraries providing Unicode and Globalization support for software applications. Homepage: http://icu-project.org/
- IQMT Framework for MT Evaluation 1.0 and 1.2 - The IQMT Framework for Machine Translation Evaluation is an open-source Perl software package based on the QARLA Framework. Homepage: http://www.lsi.upc.edu/~nlp/IQMT/.
- IQMT Framework for MT Evaluation 2.0.2 - The IQMT Framework for Machine Translation Evaluation is an open-source Perl software package based on the QARLA Framework. Homepage: http://www.lsi.upc.edu/~nlp/IQMT/.
- KEA 5.0 - KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary. For more information, visit http://www.nzdl.org/Kea/.
- LIBLINEAR 1.12 - LIBLINEAR is a linear classifier for data with millions of instances and
features. It supports L2-regularized logistic regression (LR), L2-loss
linear SVM, and L1-loss linear SVM. The main approach for L1-SVM and
L2-SVM is a coordinate descent method. We also implement a trust region
Newton method for LR and L2-SVM.
- LIBSVM 2.85 and 2.86 - LIBSVM is an integrated software package for support vector classification
(C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution
estimation (one-class SVM). It supports multi-class classification.
- LibWWW 5.4.0 - Libwww is a highly modular, general-purpose client side Web API written in C for Unix and Windows (Win32). It's well suited for both small and large applications, like browser/editors, robots, batch tools, etc. Pluggable modules provided with libwww include complete HTTP/1.1 (with caching, pipelining, PUT, POST, Digest Authentication, deflate, etc), MySQL logging, FTP, HTML/4, XML (expat), RDF (SiRPAC), WebDAV, and much more. The purpose of libwww is to serve as a testbed for protocol experiments. See the home page at http://www.w3.org/Library/.
- Lovins' Stemmer - Three different implementations of the stemmer
are available from Eibe Frank's home page on the Lovins stemmer
(http://www.cs.waikato.ac.nz/~eibe/stemmers/index.html). The software
is downloadable from Sourceforge.
- lp_solve 5.5.0.14 - lp_solve is a mixed integer linear programming solver. Homepage: http://lpsolve.sourceforge.net/
- Lucene 2.2.0, 2.3.1, and 2.3.2 - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download from Apache Jakarta. Installed from http://www.apache.org/dyn/closer.cgi/lucene/java/.
- Managing Gigabytes - Installed from http://www.cs.mu.oz.au/mg/
- Matrix Toolkits for Java 0.9.12 - Matrix Toolkits for Java is designed to be used as a library for developing numerical applications, both for small and large scale computations. Homepage: http://code.google.com/p/matrix-toolkits-java/
- Maximum Entropy Part of Speech (POS) Tagger and sentence splitter (MXPOST and MXTERMINATOR) - Adwait Ratnaparkhi's Maximum-Entropy based tagger, as per his
1997 ACL paper. This tool outputs the format expected by Collins'
parser (also locally installed).
- MINIPAR - MINIPAR is a broad-coverage parser for the English language. Homepage: http://www.cs.ualberta.ca/~lindek/minipar.htm
- Minorthird (9 May) - N/A
- morph - A fast and robust morphological analyser for English based on finite-state techniques that returns the lemma and inflection type of a word, given the word form and its part of speech.
Home page: http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
- MySQL Connector/J 5.1.6 - A native Java driver that converts JDBC (Java Database Connectivity) calls into the network protocol used by the MySQL database. Home page: http://www.mysql.com/products/connector/j/
- nlparser (2005 May 26) - A natural language parser for English and Chinese. See the README file for more information. Home page: http://www.cs.brown.edu/software/
This installation has been updated to version 05-Aug-16.
- OpenNLP Maximum Entropy Toolkit 2.3.0, 2.1, and 2.0 - The opennlp.maxent package is a mature Java
package for training and using maximum entropy models. The
documentation has some details about maximum entropy and using the
opennlp.maxent package. It is updated only periodically, so check out
the Sourceforge page for Maxent for the latest news. You can also ask
questions and join in discussions on the forums.
- OpenNLP Tools 1.3.0, 1.2.0 - OpenNLP hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package. http://opennlp.sourceforge.net/
- Perl 5.10.0 - Perl packages for RPNLPIR will be installed on this copy of Perl.
- Perl 5.8.7 - See the manual inside the directory.
- Porter's Stemmer - The Porter stemming algorithm (or ‘Porter stemmer’) is a
process for removing the commoner morphological and inflexional
endings from words in English. Its main use is as part of a term
normalisation process that is usually done when setting up Information
Retrieval systems. Detailed description and a host of downloadable
versions of it in different languages can be found at Porter Stemming
Algorithm.
- Prescript - This is a PostScript to text converter,
developed by the NZDL group. I believe this is the converter used by
Google for PDF files too. Installed from http://www.nzdl.org/html/prescript.html.
- Python 2.4.2 - The Python programming language. An older version central to sf3/sunfire can be found at /opt/sfw/bin/python.
- Record matching package - This record matching package is written as a framework with the goal of making the writing of programs that perform record matching tasks easier. Homepage: http://wing.comp.nus.edu.sg/~tanyeefa/downloads/recordmatching/
- Reranking Parser - Can be found at Robust Accurate Statistical Parsing (RASP) - Available at http://www.informatics.susx.ac.uk/research/nlp/rasp/
- ROUGE 1.5.5, 1.5.4 and 1.4.2 - ROUGE is an automated summarization evaluation program used by NIST in the DUC conferences to evaluate summarization systems. It is based on the BLEU machine translation scoring metric. See http://www.isi.edu/~cyl/ROUGE/ for more information.
- Search Engine Wrapper (2009 June 9) - This package provides a Java wrapper framework for unifying programmatic access to search engines. It contains an API as well as a command-line application. Homepage: http://wing.comp.nus.edu.sg/~tanyeefa/downloads/searchenginewrapper/
- SecondString (2006 June) - An open-source Java package containing implementations for approximate string-matching techniques, such as Jaccard, Jaro and TF-IDF. Home page: http://secondstring.sourceforge.net/
- Segmenter 1.10 - Min-Yen Kan's linear topical
segmentation program, as described in Coling-ACL 1998.
- SimMetrics 1.6.2 - SimMetrics is an open source extensible library of similarity or distance metrics, e.g. Levenshtein distance, L2 distance, cosine similarity, Jaccard similarity, etc. SimMetrics provides a library of float-based similarity measures between string data as well as the typical unnormalised metric output. Homepage: http://www.dcs.shef.ac.uk/~sam/simmetrics.html
- SNoW POS Tagger - A POS tagger from UIUC, available at http://l2r.cs.uiuc.edu/~cogcomp/eoh/pos.html
- Stanford Named Entity Recognizer 1.0 and 1.5 - Named Entity tagger that uses conditional random fields. URL: http://nlp.stanford.edu/software/CRF-NER.shtml
- SVMalign - SVMalign is an implementation of inverse sequence alignment that uses large-margin methods for the underlying optimization, as described in [1]. It uses a simple model for sequence alignment with four distinct operations (Match, Substitution, Insertion, and Deletion) and a number of parameters associated with each (e.g. gap opening, extension, substitution matrix). SVMalign includes, as a subroutine, the Smith-Waterman algorithm for sequence alignment.
It is based on two other software packages, which are included in the distribution below:
SVMlight – an efficient implementation of Support Vector Machines
SVMstruct – an API for learning complex output spaces
Those familiar with SVMlight should find that the usage of SVMalign is quite similar.
- SVMcfg - SVMcfg is an implementation of the Support Vector Machine (SVM) algorithm for learning a weighted context free grammar as described in [1]. The goal is to learn an accurate model from supervised training data, so that this model predicts the correct tree y for a given input x (as, e.g., in natural language parsing). It includes a modified version of the CKY parser written by Mark Johnson.
- SVMlight 6.02 - SVMlight is an implementation of Vapnik's Support Vector Machine [Vapnik, 1995] for the problems of pattern recognition, regression, and learning a ranking function. The optimization algorithms used in SVMlight are described in [Joachims, 2002a] and [Joachims, 1999a]. The algorithm has scalable memory requirements and can handle problems with many thousands of support vectors efficiently.
- SVMmulticlass 2.20 - SVMmulticlass is an implementation of the multi-class Support Vector Machine (SVM).
- SVMstruct - SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn). Unlike regular SVMs, however, which consider only univariate predictions as in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and Markov models for part-of-speech tagging.
- SWI-Prolog 5.4.7 - SWI-Prolog offers a comprehensive Free Software Prolog environment. See its home page at: http://www.swi-prolog.org/.
- Tcl/Tk 8.4.11 - A software system providing a simple command language, and a set of widgets for use in building GUIs. Home page: http://www.tcl.tk/
Tcl/Tk is installed here only because WordNet 2.1 requires it for installation, and only Tcl (not Tk) is found on sf3.
- Tidy (5 Sep) - A tool to change non-conformant HTML to compliant HTML code. From Sourceforge, based on the original version from Dave Raggett.
- TiMBL 6.1 - Tilburg Memory-Based Learner. See http://ilk.uvt.nl/timbl/ for more.
- Tiny SVM 0.09 - TinySVM is an implementation of Support Vector Machines (SVMs) [Vapnik 95], [Vapnik 98] for the problem of pattern recognition. This installation includes the shared library under the lib/ subdirectory. Installed from http://www.chasen.org/~taku/software/TinySVM/; see the doc/index.html file for more information on this tool.
- Transformation Based part of speech tagger (Eric Brill's tagger; a.k.a. Brill tagger) - Brill's part-of-speech tagger, generating Penn Treebank tags. Home page at: http://www.cs.jhu.edu/~brill/.
- Tree Kernels In SVM-light - SVM-lighT-TK 1.2 (feature vector Set and Tree FOREST). Contains the source code, the executable and the example data.
For more details, please check http://ai-nlp.info.uniroma2.it/moschitti/TK1.2-software/Tree-Kernel.htm
- umdhmm - An HMM tool from Tapas Kanungo's software page. Implementation of the Forward-Backward, Viterbi, and Baum-Welch algorithms.
- Weka 3.4.13, 3.6.0 - Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. Homepage: http://www.cs.waikato.ac.nz/ml/weka/
- YamCha Chunker v 0.27 - YamCha is a generic, customizable, and open source text chunker
oriented toward a lot of NLP tasks, such as POS tagging, Named
Entity Recognition, base NP chunking, and Text Chunking. YamCha uses
a state-of-the-art machine learning algorithm called Support
Vector Machines (SVMs), first introduced by Vapnik in 1995.
Installed from http://chasen.org/~taku/software/yamcha/.
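
The SecondString and SimMetrics entries above name approximate string-matching measures such as Jaccard similarity and Levenshtein distance. For readers who have not met them, here is a minimal pure-Python sketch of both. It is an illustration only: it does not use or reproduce the API of either package, the function names are hypothetical, and tokenizing on whitespace for Jaccard is an assumption made here.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated token sets (assumed tokenization)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty strings count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(jaccard("record matching package", "record linkage package"))
    print(levenshtein("kitten", "sitting"))
```

Jaccard compares strings as bags of tokens and so ignores word order, while Levenshtein works character by character and is sensitive to small spelling differences. That contrast is one reason libraries of this kind ship several measures side by side.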
Libraries - customized libraries to link software to.
- CiteSeer logs - These are logs of the CiteSeer mirror at NUS. Currently three log periods are available.