Here you will find deliverables of the projects done by members of WING, exclusive of publications. If you're looking for demonstrations of systems, they are listed with each project.

Tools

These are some of the in-house NLP and IR tools that we have built to facilitate our research at WING. We hope you'll find some of these tools helpful. A full list of all such tools that we have installed for research at NUS (including many from external sources) can be found on our resource page.

- Daemonized Collins parser - Got more than a few sentences to parse? The Collins head-driven parser is still considered one of the best open-source English language parsers. We've taken Michael Collins's source code and wrapped it into a daemonized version that you can send sentences to through a socket service, avoiding the long initialization needed by the parser.
- JavaRAP - An open-source Java reimplementation of the famed RAP (Resolution of Anaphora Program) by Boguraev and Kennedy. Note: this program is not considered competitive for anaphora resolution by today's standards, but we have implemented it for benchmarking purposes. Feel free to download and use it for non-commercial purposes. Contributed by Long Qiu.
- ParsCit - An open-source reference string parser written primarily in Perl, with a machine learning core based on CRF++ (C++). It has been tested and integrated with the CiteSeerX and ForeCite projects. A collaborative project between IST PSU and NUS. Includes over 1K instances of training data from the computer science, medical, and humanities domains. Partially coded by Min-Yen Kan of WING.
- Prastava - (which means "suggestion" in Hindi) An open source, 100% Ruby recommendation system package. Used as the basis for some of our internal recommendation systems work. Coded by Tarun Kumar and Himanshu Gahlot, in their internships with WING.
- Rapi - (which means "tidy" or "neat" in Malay/Indonesian) An open-source OPAC package under the MIT license that allows you to: 1) build a Lucene index from your MARC files, 2) screen scrape live circulation data from your own iii OPAC, and 3) wrap your OPAC with a customizable user interface. A live demo is available here. Contributed by Jesse Prabawa Gozali.
- Record matching package - This package is written as a framework that aims to make it easier to write programs that perform record matching tasks. Coded by Yee Fan Tan of WING.
- Search engine wrapper - This package provides a Java wrapper framework for unifying programmatic access to search engines. It contains an API as well as a command-line application. Coded by Yee Fan Tan of WING.

Corpora

These are the text, image, and other datasets used in experiments by our group. Most are freely available for research use (in some cases, not for commercial use).

- ACL Anthology Reference Corpus - A corpus of scholarly publications about Computational Linguistics. This corpus is a canonicalized subset of the ACL Anthology, up to February 2007, consisting of 10,921 articles. We hope this frozen corpus will be used for benchmarking applications for scholarly and bibliometric data processing.
- JavaScript Functionality Annotations - Over 1.8K different JavaScript units have been extracted and annotated from the WT10G standard web corpus. These are all the unique JavaScript units that we were able to detect in the entire WT10G, although there were many duplicates in the original 20+K unit instances. Compiled by Wei Lu.
- Keyphrase Corpus - The corpus consists of more than 200 scientific publications, each available in 4 different formats: PDF, HTML, plain text, and XML. Compiled by Emma Thuy Dung Nguyen.
- Light Verb / Support Verb Annotations - This is a corpus of light verb annotations (aka support verbs; e.g., "make a call") that were annotated to support a supervised learning algorithm that differentiates them from meaning-bearing (heavy) verbs. Compiled by Yee Fan Tan.
- NPIC Image Corpus - This is a large 4.7GB image collection comprising two different collections: a spidered portion gathered from the Web and another portion taken from the freely accessible Wikipedia Commons. Compiled by Fei Wang.
- NUS Scenario Corpus - This corpus is composed of online newswire articles about 15 scenarios, ranging from natural or technological disasters, sports matches, and layoffs to political elections. It is a balanced corpus in the sense that each scenario has a similar number of representative events (10), and each event has 5 different articles. Furthermore, carefully prepared annotations for all the scenarios are also provided. Compiled by Long Qiu.
- NUS SMS Corpus - This is a corpus of about 10K Short Message Service messages from mobile phone users in Singapore. All are in English. The contributors were mostly university students who contributed messages to this corpus for a small amount of remuneration. Compiled by Yijue How of WING.
- Presentation to Document Alignment Corpus - This is a manual alignment of 20 scholarly papers from the database community to their corresponding presentations. The alignments are from one slide to multiple paragraphs. Compiled by Eugene Ezekiel of WING.