|
On this page, we list the various projects that are currently undertaken by the WING group. Projects which have closed are listed at the bottom of this page.
For a brief overview of the research work undertaken in WING, you may want to refer to these presentation slides. They are a bit outdated, but will give you a sense of the work that we are involved in.
If you are a student looking to work on a research project for your graduate studies (GP/MSc), Honours Year Project (HYP) or as part of the Undergraduate Research Opportunity Program (UROP), or considering an internship, we advise you to look at the slides above, read the following notes and visit our group's open project listings.
Reader-centric Scholarly Digital Libraries
ForeCite
Our group's flagship project to combine next-generation technologies for research digital libraries. We believe that formal publications are just one facet of research output and that digital libraries need to provide a method for archiving presentations, datasets and subsequent commentaries. ForeCite combines standard citation-based information extraction from PDFs with presentation-to-paper alignment, a document reading and commenting interface, and argumentative zoning of documents.
We also believe that future digital libraries will not be monolithic web sites; DL services need to be accessible in external workflows. The ForeCite project is examining the feasibility of creating web services and browser extensions that allow the client to interact with ForeCite without needing to interact with the main ForeCite server directly. We have implemented and provided open-source citation services and are exploring a further range of services.
ForeCiteNote
Fully integrated with ForeCite, ForeCiteNote (FCN) is a personal digital library system based on TiddlyWiki. FCN aims to help beginning researchers, such as advanced undergraduates and beginning graduate students, take notes and synthesize their ideas as they search the literature for work related to their projects. FCN is architected as a simple, standalone web page that can be accessed both online on the web and offline on a local client computer.
Within FCN's webpage interface, the user can take notes, automatically retrieve metadata from the server, and organize entries by tagging them and by recording what actions to take. The system also helps to organize a user's local document storage by canonicalizing the naming and structure of such documents.
Web-based disambiguation of digital library metadata
As digital libraries grow in size, the quality of their metadata records becomes an issue. Data entry mistakes, string representation differences, ambiguous names, missing field data, repeated entries and other factors contribute to errors and inconsistencies in the metadata records. Noisy metadata records make searching difficult, may cause certain information not to be found at all, lead to under-counts or over-counts that distort aggregate statistics, and decrease the utility of digital libraries in general.
In this project, we concentrate on using the Web to aid the disambiguation of the metadata records. This is because the metadata records themselves sometimes contain insufficient information, or the required knowledge is very difficult to mine. However, the richness of the Web, which represents the collective knowledge of the human population, often provides the answer instantly when suitable queries are presented to a search engine. As search engine calls and web page downloads are time-consuming, any Web-based disambiguation algorithm must acquire such information selectively in order to scale up to large numbers of records.
Part of this project was joint work with Dr. Dongwon Lee and Ergin Elmacioglu from the Pennsylvania State University.
Researchers :
Statistical Machine Translation
Modern machine translation (MT) systems have advanced significantly using statistical frameworks. However, these systems treat each word as a separate entity, without any notion of whether words share morphological relations such as number, gender, or tense. This poses a great challenge for MT systems dealing with morphologically-rich languages, in which an abundant number of affixes can be appended to a word to enrich its meaning.
As such, we aim to imbue MT systems with morphological knowledge, specifically to capture the role of affixes, which are the main constituent parts of a morphologically-rich language, using only unannotated data. We adopt a language-independent approach, and focus on the translation task from a morphologically-poor language to a morphologically-rich language. We consider this translation direction an interesting challenge, as affixes in a target word might either correspond to separate source words, such as function words, or not exist at all on the source side. The ultimate goal is to model such translation relations to improve translation quality, as well as to address the data sparseness problem.
In the past we have also pioneered work on function word syntax-based (FWS) machine translation, which models the reordering task as being anchored by key "function words", as approximated by frequency information. This past work was joint work with Haizhou LI of I2R. Due to the nature of the problem, machine translation provides an interesting playground for implementing statistical approaches. The problems in machine translation stem from ambiguity at several levels, from the surface level to the semantic level, each of which poses a great challenge even in isolation. In this project, our pursuit is to advance the performance of the reordering model. A reordering model attempts to restructure the lexically translated sentence into the target language's correct ordering. In particular, we examine reordering centered around function words. This is motivated by the observation that phrases around function words are often incorrectly reordered. By modelling reordering patterns around function words, we hope to capture prominent reordering patterns in both the source and target languages.
Researchers:
Domain-Specific Research Digital Libraries
In this project, we construct software that will impose a generic, shareable, publishable and searchable framework for organizing scientific publications, similar to Cora and CiteSeer. Our work attempts to enable researchers to share annotations, search by fields and add new fields and organization as appropriate, as well as publish annotations. In our projects we are examining DLs for mathematics as well as for coordinating multimedia.
For mathematics, despite the enormous success of common search engines for general search, their performance on domain-specific search is often compromised by a lack of knowledge of (and hence support for) the entities and the users in the domain. In our project, we choose to tackle this problem in the domain of mathematics. Our ultimate goal is to build a system which is able to 1) automatically index and categorize math materials from multiple online resources, and 2) understand the intents and needs of the users and present the results in an accurate and organized manner.
Project Deliverables (excluding publications) :
- dAnth: digital anthologies mailing list - a clearinghouse for researchers dealing with text conversion, citation processing and other scaling issues in digital libraries.
- ForeCite: web 2.0 based citation manager a la Citeulike, Citeseer, Rexa.
- SlashDoc: a Ruby on Rails new media knowledge base for research groups.
- SlideSeer: a digital library of aligned slides and papers.
- ParsCit: citation parsing using maximum entropy and global repairs.
Project Staff :
- Min-Yen Kan
- Dang Dinh Trung: TiddlyWiki for scholarly digital libraries (Spring 2007)
- Ezekiel Eugene Ephraim: Presentation summarization, alignment and generation (Spring 2005)
- Guo Min Liew, Visual Slide Analysis (Spring 2007)
- Jesse Prabawa Gozali, ForeCite integration lead (summer project, Summer 2007)
- Hoang Oanh Nguyen Thi, alumnus, Publication Spiderer (Spring 2004)
- Yong Kiat Ng, Maximum Entropy Citation Parsing with Repairs (Spring 2004)
- Emma Thuy Dung Nguyen, Automatic keyword generation for academic publications (Spring 2006)
- Thien An Vo, Support for annotation of scientific papers (Spring 2004)
- Tinh Ky Vu, Public-domain research corpora gatherer (Fall 2004) and Academic Research Repository (Spring 2005)
- Yue Wang, Presentation summarization, alignment and generation (Spring 2006)
- Jin Zhao, Math IR and SlashDoc integrator (Fall 2007)
Intra-Event Photo Organisation
While many works on photo organization have focused on photo collections, very few have studied photos within a single event. Some events have substructure and comprise sub-events that relate to one another by sharing important facets of the main event. In this project, we define the notion of intra-event photo organization that deals with these related sub-events. The differences from conventional inter-event organization suggest that current inter-event methods may not perform well for organizing photos within one event. Some works on event-based photo organization are based on analyzing the appearance of the photos -- through extraction of appropriate features, including color, texture and shapes. However, since we are analyzing essentially the same set of participants across the sub-events, often in the same location, appearance-based methods can lose their discriminative power. For example, most of the participants will wear the same set of clothes across sub-events; many sub-events will take place in the same location.
We take the novel approach of relying on the photo taking behaviors of the photographer, rather than the appearance of the resulting photos, to identify the sub-events. Photo taking behaviors are analyzed by modeling the contextual information (metadata) of the photos, such as the rate at which they are taken. The key idea is that the relationships between the metadata are indicative of the intent of the photographer. We further assume that the photographer's intent is consistent within the same sub-event. Using this model, we propose to infer changes in the underlying photo taking behavior for intra-event photo organization.
We have conducted a preliminary study in which we used Kleinberg's model for bursty and hierarchical structures to approximate photo taking behaviors. This approximation was evaluated along with four current inter-event methods against manual reference segmentations. Our approach outperforms the four baseline methods, providing evidence that inferring changes in photo taking behavior can lead to better intra-event organization.
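To make the idea concrete, here is a minimal sketch of a two-state version of Kleinberg's burst automaton applied to photo timestamps. It assumes the gaps between consecutive photos are exponentially distributed, with a "burst" state whose rate is `s` times the base rate, and it charges `gamma * ln(n)` for entering the burst state. The function name `burst_states`, the default parameter values and the sample timestamps are illustrative assumptions, not the implementation or settings used in the study.

```python
import math

def burst_states(timestamps, s=2.0, gamma=1.0):
    """Label each gap between consecutive photos as 0 (base rate) or 1 (burst).

    Two-state Kleinberg-style automaton decoded with dynamic programming.
    Assumes timestamps are sorted and span a non-zero amount of time.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    n = len(gaps)
    if n == 0:
        return []
    total = sum(gaps)
    if total <= 0:
        raise ValueError("timestamps must span a positive duration")

    base_rate = n / total
    rates = (base_rate, s * base_rate)
    up_cost = gamma * math.log(n)  # cost of moving from state 0 to state 1

    def emit(state, gap):
        # Negative log-likelihood of the gap under an exponential density.
        return -math.log(rates[state]) + rates[state] * gap

    inf = float("inf")
    cost = [0.0, inf]  # the automaton starts in the base state
    back = []
    for gap in gaps:
        new_cost, pointers = [], []
        for j in (0, 1):
            best_i, best = 0, inf
            for i in (0, 1):
                trans = up_cost if (i == 0 and j == 1) else 0.0
                if cost[i] + trans < best:
                    best_i, best = i, cost[i] + trans
            new_cost.append(best + emit(j, gap))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest state sequence backwards.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        states.append(state)
    states.reverse()
    return states

# Hypothetical timestamps in seconds: sparse shots, a rapid burst, sparse again.
photos = [0, 100, 200, 300, 301, 302, 303, 304, 305, 405, 505, 605]
print(burst_states(photos))
```

A change in the returned state sequence marks a candidate sub-event boundary; a fuller treatment would use more than two states to recover hierarchical structure.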
Researchers :
Advanced OPACs

When library patrons utilize an online public access catalog (OPAC), they tend to type very few query words. Although it has been observed that patrons often have specific information needs, current search engine usability encourages users to underspecify their queries. With an OPAC's fast response times and the difficulty of using more advanced query operators, users are pushed towards a probe-and-browse mode of information seeking. Additionally, patrons have adapted (or been forced to adapt) to OPACs and give keywords as their queries, rather than more precise queries. As a consequence, the search results often only approximate the patron's information need, missing crucial resources that may have been phrased differently or offering search results that may be phrased exactly as wanted but which only address the patron's information need tangentially.
One solution is to teach the library patron to use advanced query syntax and to formulate more precise queries. However, this solution is both labor-intensive for library staff and time-intensive for patrons. Furthermore, different OPACs support different levels of advanced capabilities and often represent these operators with different syntax. An alternative solution that we propose is the use of advanced query analysis and query expansion. Rather than change the behavior of the patron, a system can analyze their keyword queries to infer more precise queries that use advanced operators when appropriate. To make these inferences, the system will not only rely on logic but will also dynamically access both a) historical query logs and b) the library catalog to assess the feasibility of its suggestions.
The proposed research will target three different types of query rewriting/expansion:
- correction of misspellings,
- inferring the relationships between a query's noun and noun phrases, and
- inferring intended advanced query syntax.
The realization of these techniques will allow patrons to continue using OPACs by issuing simple keyword searches while benefiting from more precise querying and alternative search suggestions that would originate from the implemented system.
In our current work we have examined how to re-engineer and design the user interface to better support the actual information seeking methods used by library patrons. Jesse has re-engineered the work and has incorporated our own NUS OPAC as well as Colorado State University's OPAC results into his tabbed, overview+details framework. If you have a library catalog with MARC21 results that can be exported, we can wrap our UI around your database. Alternatively, we have open sourced the code under the MIT license so you can try it out for yourselves. You can also contact us for more information.
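As an illustration of the first rewriting type, correction of misspellings, the sketch below checks each query word against a vocabulary drawn from catalog titles and replaces unknown words with the closest catalog term. It relies on Python's standard `difflib` string matcher as a stand-in; the function names, the sample titles and the `cutoff` value are assumptions made for this example and do not describe the project's prototype.

```python
import difflib
from collections import Counter

def build_vocabulary(titles):
    """Count how often each word occurs in the catalog titles."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts

def correct_query(query, vocabulary, cutoff=0.8):
    """Replace words missing from the catalog with their closest catalog term."""
    known_words = list(vocabulary)
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        candidates = difflib.get_close_matches(word, known_words, n=3, cutoff=cutoff)
        if candidates:
            # Among close matches, prefer the term that is most frequent in the catalog.
            corrected.append(max(candidates, key=lambda w: vocabulary[w]))
        else:
            corrected.append(word)  # no plausible correction, keep the word as typed
    return " ".join(corrected)

# Hypothetical catalog titles.
catalog = [
    "Introduction to Information Retrieval",
    "Statistical Machine Translation",
    "Information Extraction",
]
vocab = build_vocabulary(catalog)
print(correct_query("informaton retreival", vocab))
```

Consulting the catalog in this way keeps the system from suggesting a correction that would return no results; historical query logs could be used as a second source of term frequencies.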
This project was carried out jointly with Danny C. C. POO.
Project Deliverables (excluding publications) :
- Jesse Prabawa's OPAC interface for [ NUS ] [ Colorado State University ]
- Long Qiu's report on the Namekeeper (author spelling correction) system.
- List of misspelled titles in the LINC OPAC catalog system (over 1.2K misspellings), reported to Libraries
- Prototype spelling correction and morphology system (linc2.cgi, and its past incarnation mirror.cgi)
- Notes/Slides from past meetings with NUS Library staff:
[ 16 June 2004 ] [ 11 May 2005 ] [ 26 Aug 2005 ]
Researchers :
- Min-Yen KAN, Project lead
- Jin ZHAO
- Malcolm LEE
- Tranh Son NGO
- Kalpana KUMAR
- Jesse Prabawa GOZALI
- Siru TAN
- Meichan NG
- Roopak SELVANATHAN
- Long QIU
Discourse Relation Classification
Discourse relation classification is the task of classifying the discourse relation (such as 'contrast' or 'causal') that holds between a given pair of sentences (or clauses). Recognizing discourse relations is a very important building block of a discourse parser, a summarization system, a natural language generator, and other NLP tasks.
We are working on the recently released corpus, the Penn Discourse Treebank 2.0, with a focus on the implicit dataset (i.e., no discourse markers such as 'if' and 'because') and the four top-level relations ('comparison', 'contingency', 'expansion', and 'temporal'). By detecting the discourse relations of a text, we can also evaluate how coherent the text is. In the future, we want to design a coherence evaluation metric based on the discourse relations detected.
Researchers :
Co-training NLP Systems and Language Learners
In recent years, the notion of human computing has been introduced to complement machine learning systems. The idea is to harness what people are good at but machines are poor at. In this way we can
- create Turing tests to separate people from bots and
- create annotated data for training and improving research systems.
This approach is also called human-based computation. It is a technique that leverages differences in abilities and alternative costs between humans and computer agents to achieve human-computer interaction. Many tasks are trivial for humans but continue to challenge even the most sophisticated computer programs. Thus, the intelligent combination of computer and human to solve complex tasks becomes a promising approach.
Luis von Ahn et al. proposed a notion related to human-based computation called games with a purpose (GWAP). GWAPs are games in which the limitations of computation are overcome by employing human guidance. The data generated as a by-product of game play has an important role in solving computational problems and training AI algorithms. Such games are designed both to entertain their human players and to serve a useful computational purpose. Their main aim is to collect as much data as possible through human guidance and the intelligent support of the computer.
So far, while successful, this work has not shown how humans can improve from this relationship. Thus far, the idea of human computing has been just to formulate GWAPs for entertainment. However, such an interaction opportunity also makes it an ideal situation to train or tutor the game player, in the guise of Game Based Learning.
In this project, we will study and implement this framework for acquiring machine translation training data. The goal is to implement a system to train second language learners of English and Chinese, while gathering useful training data for machine translation systems. By gathering useful training data, we mean that the data annotated by the humans must be helpful in improving eventual translation quality. By training second language learners, we mean that the users measurably improve in their language production or consumption.
Researchers :
Information Extraction and Focused Crawling

Web crawling algorithms have now been devised for topic-specific resources, a task known as focused crawling. We examine the specialized crawling of structurally-similar resources that are used as input to other projects. We examine how to devise trainable crawling algorithms that "sip" the minimal amount of bandwidth and web pages from a site by using context graphs, negative information, web page layout, and URL structure as evidence.
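As a rough illustration of using URL structure as evidence, the sketch below scores candidate links by the tokens in their URLs and fetches only the highest-scoring ones within a fixed page budget. The token lists, function names and example URLs are hypothetical; a trainable crawler would learn such weights from data and combine them with the other evidence sources rather than hard-coding them.

```python
import heapq
import re
from urllib.parse import urlparse

def url_score(url, positive, negative):
    """Score a URL by counting positive and negative tokens in its host and path."""
    parsed = urlparse(url)
    tokens = re.split(r"[^a-z0-9]+", (parsed.netloc + parsed.path).lower())
    tokens = [t for t in tokens if t]
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

def crawl_order(urls, positive, negative, budget):
    """Return at most `budget` URLs, best-scoring first (ties keep input order)."""
    heap = [(-url_score(u, positive, negative), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(min(budget, len(heap)))]

# Hypothetical evidence for a crawler that collects scientific publications.
positive_tokens = {"publications", "pubs", "paper", "pdf"}
negative_tokens = {"news", "sports"}
links = [
    "http://example.edu/~kan/publications/paper1.pdf",
    "http://example.edu/news/sports.html",
    "http://example.edu/people/kan/pubs.html",
]
print(crawl_order(links, positive_tokens, negative_tokens, budget=2))
```

The negative tokens play the role of negative information: they let the crawler skip links that are unlikely to lead to the target resource, which saves bandwidth.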
To motivate the crawling algorithm design, we concentrate on the collection of four real-world problems: topical page collection, the collection of song lyrics, scientific publications and geographical map images.
Once resource pages have been found, information extraction processes need to be run to convert the information into database-accessible rows. As a practical research area, researcher home pages often contain a wealth of information about the researcher herself. This may include research interests, publications and a brief CV. This project examines how to mine these areas from known home pages of researchers and maintain a database of researcher-centric information.
Project Deliverables (excluding publications)
- Maptlas: A collection of map images culled from the web.
Researchers :
- Abhishek ARORA, alumnus, Map Spidering and Browsing User Interface (Summer 2005)
- Min-Yen KAN, Project lead
- Hoang Oanh NGUYEN THI, alumnus, Publication Spiderer (Spring 2004)
- Litan WANG, alumnus, music lyrics spider (Fall 2004)
- Vasesht RAO, alumnus, Map tiling and spidering (Fall 2004)
- Nha Linh TA, undergraduate student (Fall 2008)
- Sein LIN, Justin, undergraduate student (Spring 2009)
- Fei WANG, alumnus, Non-photograph image categorization (Fall 2004 and Spring 2005)
- Xuan WANG, alumnus, Augmenting Focused Crawling using Search Engine Queries (Spring 2006)
Question-Answering
Information is growing at a much faster rate than we can process it. The need to be able to process and utilize all of this information has led to the growth and development of applications such as search engines, question answering (QA) systems and automatic summarisation systems. These systems are used to help us quickly locate required facts, or highlight important text snippets.
In our work, we seek to explore the use of QA to answer questions about human relationships. Given a large database of text documents, it is useful to identify and reason about the relationships between individuals or groups. One interesting question that such a relationship QA system (RQA) can answer is, for example, "Does Person A know Person B?".
A huge number of relationships can exist between different people within a large database of documents. A QA system makes it easy to query for specific information from within this collection of identified relationships.
As part of this work, we have developed an open-source QA system - QANUS. QANUS is a pipelined, information-retrieval based QA framework upon which new QA systems can be quickly and easily developed.
Researchers :
Closed Projects
Browser Extensions for Citation Recognition and Server Side Support
On the web, researchers often give a listing of their publications for both record keeping and publicity. Even though the full text or PDF of the paper may be linked to the publication information, having easy access to vital information (e.g. number of citations, which papers cite this work, which works are cited by this paper, what people say about this work) is often difficult: the scholar must actively search for the work in a digital library such as Google Scholar or the ACM Digital Library (ACM Portal).
This project aims to implement a client-side solution to this problem, by developing a Firefox browser extension that recognises publications in the form of citations. The extension will automatically retrieve, from the Internet, additional information about the citations encountered. The retrieved information is then presented to the user. The user can then save the time that would have been spent searching for and retrieving the papers cited.
Researchers :
Re-examining Association Measures for MWEs
Light verb constructions
Multiword expressions (MWE) are phrases that possess some degree of syntactic, semantic and/or statistical idiosyncrasy. Sub-types of MWEs include verb particle constructions (VPC) such as take off and put on; light verb constructions (LVC) such as take a break and give a talk; etc.
We are here primarily concerned with ranking a list of MWE candidates using statistical association measures (AM), or mathematical formulae that are designed to capture MWEs' degree of association. This particular project first focuses on AMs for Verb Particle Constructions (VPCs) and Light Verb Constructions (LVCs), especially those listed in Pavel (2006).
In previous work we examined enhancing the detection of light verb constructions. A light verb construction (LVC) is a verb-complement pair in which the verb has little lexical meaning and much of the semantic content of the construction is obtained from the complement. Examples of LVCs are "make a decision" and "give a presentation", and these pose challenges for natural language processing and understanding. In this project, we investigate methods to identify LVCs from a corpus, as well as to recognize linguistic features of LVCs.
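As one concrete example of an association measure, the sketch below ranks candidate verb-particle pairs by pointwise mutual information (PMI), a widely used AM. PMI is chosen here purely for illustration and is not necessarily among the measures the project evaluated; the function name and the toy counts are assumptions.

```python
import math
from collections import Counter

def rank_by_pmi(pairs):
    """Rank (verb, particle) candidates by pointwise mutual information.

    `pairs` is a list with one entry per observed occurrence of a candidate.
    PMI(v, p) = log2( P(v, p) / (P(v) * P(p)) ), estimated from raw counts.
    """
    pair_counts = Counter(pairs)
    verb_counts = Counter(verb for verb, _ in pairs)
    particle_counts = Counter(particle for _, particle in pairs)
    total = len(pairs)

    scores = {}
    for (verb, particle), count in pair_counts.items():
        expected = verb_counts[verb] * particle_counts[particle] / total
        scores[(verb, particle)] = math.log2(count / expected)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy counts: "take off" and "put on" co-occur far more often than chance predicts.
observations = (
    [("take", "off")] * 4
    + [("put", "on")] * 4
    + [("take", "on")]
    + [("put", "off")]
)
for candidate, score in rank_by_pmi(observations):
    print(candidate, round(score, 3))
```

Candidates with a high positive score co-occur more often than their individual frequencies would predict, which is the kind of statistical idiosyncrasy an AM is meant to capture. PMI is known to overrate rare pairs, which is one reason to compare several measures.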
Researchers :
Web Query Analysis
Web queries are often dense and short, but they often have distinct purposes. In our work, we examine how to automatically classify web queries using only the simple, lightweight data of query logs and search results. In comparison, most existing automatic methods integrate rich data sources, such as user sessions and click-through data. We believe there is more untapped potential for analyzing and typing queries based on deeper analysis of these simple sources.
Project Staff
- Min-Yen Kan, Project Lead
- Viet Bang NGUYEN, Macro and Microscopic Query Analysis for Web Queries (Spring 2006)
- Hoang Minh TRINH, Implementing Query Classification (Fall 2007)
LyricAlly: Lyric Alignment
Joint work with Wang Ye and Haizhou Li.
Popular music is often characterized by sung lyrics and regular, repetitive structure. We examine how to capitalize on these characteristics along with constraints from music knowledge to find a suitable alignment of the text lyrics with the acoustic musical signal. Our previous work showed a proof of concept of aligning lyrics to popular music using a hierarchical, musically-informed approach, without the use of noisy results from speech recognition. Later results tried to improve alignment to the per-syllable level using an adapted speech recognition model, initially trained on newswire broadcasts.
However, these approaches are slow, require offline computation, and cannot be run in real time. In recent work, we have been examining whether we can do away with intense computation by using self-similarity to align the lyrics and music directly, without explicit multimodal processing.
Project Deliverables (excluding publications)
- Minh Thang LUONG's RepLyal lyric alignment demo / home page
Project Staff
- Min-Yen Kan, Project Co-Lead
- Denny ISKANDAR, MS alumnus
- Minh Thang LUONG, Using Self-Similarity in Lyric Alignment for Popular Music (Spring 2007)
Automatic Text Summarization
Joint work with Wee Sun Lee and Hwee Tou Ng.
We examine graph-based methods for text summarization, with respect to the graph construction and representation of (multidocument) texts and the graphical decomposition methods leading to summaries. Unlike previous approaches to graph-based summarization, we devise a graph-based approach that creates the graph with a simple model of how an author produces text and a reader consumes it. We are currently applying this work to blog summarization.
Project Staff
- Min-Yen KAN, Project Lead
- Ziheng LIN, undergraduate alumnus, Automatic Text Summarization using a Lead Classifier (Spring 2005 and Spring 2006)
- Xuan WANG, undergraduate alumnus, Blog Summarization (Fall 2007)
Scenario Template Generation
A Scenario Template is a data structure that reflects the salient aspects shared by a set of events, which are similar enough to be considered as belonging to the same scenario. The salient aspects are typically the scenario's characteristic actions, the entities involved in these actions and the related attributes. Such a scenario template, once populated with respect to a particular event, serves as a concise overview of the event. It also provides valuable information for applications such as information extraction (IE), text summarization, etc.
Manually defining a scenario template is expensive. In this project, we aim to automate the template generation process. Sentences from different event reports are broken down into predicate-argument tuples, which are clustered semantically. Salient aspects are then generalized from the large clusters. For this purpose, the features we investigate include word similarity, context similarity, etc. The resulting scenario template is not only a structured collection of salient aspects, as a manual template is, but also an information source that other NLP systems can refer to for how these salient aspects are realized in news reports.
Stay tuned for a corpus release of newswire articles that Long has compiled for use in the Scenario Template tasks.
Project Staff
Lightweight NLP
Joint work with Dr. Samarjit Chakraborty.
For embedded systems with constrained power and CPU resources, how should NLP and other machine learning tasks be done? We investigate how different combinations of features and learners can affect machine-learned NLP tasks on embedded devices with respect to time, power and accuracy.
Project Staff

Project Duration: December 2003 - July 2005. Completed.
Web documents that look similar often use different HTML tags to achieve their layout effect. These tags often make it difficult for a machine to find text or images of interest.
Parcels is a backend system [Java] designed to distinguish different components of a web site and parse it into a logical structure. This logical structure is independent of the design/style of any website. The system is implemented using a co-training framework between two independent views: a lexical module and a stylistic module.
Each component in the structure is given a tag relevant to the domain it is classified under.
Project Deliverables (excluding publications)
- PARCELS toolkit, hosted on sourceforge.net.
- Similar document similarity (integrated within the PARCELS toolkit).
Project Staff
- Min-Yen KAN, Project lead
- Chee How LEE, alumnus, PARCELS Web logical structure parser (Fall 2003)
- Aik Miang LAU, alumnus, Advancing PARCELS (Fall 2004)
- Sandra LAI, alumnus, PARCELS Web logical structure parser (Fall 2003)
Metadata-based webpage summarization
Project Duration: June 2003 - December 2004.
Search engines currently report the top n documents that seem most relevant to a user's query. We investigate how to change the structure of this ranked list into a more meaningful natural language summary. Rather than just focus on the content of the actual webpages, we examine how metadata can be used to create useful summaries for researchers.
Project Deliverables (excluding publications)
- SMART - Supervised Categorization of JavaScript using Program Analysis Features. Product of Wei LU.
- MeURLin: URL based website classification. Poster paper in WWW 2004.
- Welcome Exclusivity Classifier, hosted on sourceforge.net. Product of Edwin LEE.
Project Staff
- Min-Yen KAN, Project lead
- Thiam Chye LEE, alumnus, Multidocument summarization using NLG (Fall 2004)
- Wei LU, alumnus, Multidocument summarization using NLG (Fall 2004)
- Yung Kiat TEO, alumnus, Hierarchical Text Segmentation (Fall 2004)
- Eileen XIE, alumnus, URL-based Web Page Classification (Fall 2004)
- Edwin LEE, alumnus, Automatic metadata extraction for the web (Fall 2003)
- Alex Ng, alumnus, Automatic metadata extraction for the web (Fall 2003)
SMS text input

Project Duration: December 2003 - July 2005
Short message service (SMS) is now a ubiquitous way of communicating. On a phone keypad, each numeric key is mapped to several letters. This project examines predictive text entry techniques and the user interface design for text entry.
We are working on extending a publicly available corpus of SMS messages and using it to compile statistics for subsequent analysis. We examine predictive completion of one word and of two or more words, as well as models and data structures for computing completions efficiently, both for individual hand phones (per-phone modeling) and on a corpus-wide basis (per-language modeling).
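A minimal sketch of dictionary-based predictive entry on a numeric keypad: each word is indexed by the key sequence that types it, and the words sharing a sequence are ranked by corpus frequency. The keypad layout is the standard one found on phones; the function names and the word counts are hypothetical and are not taken from the NUS SMS Corpus.

```python
from collections import defaultdict

# Standard letter assignment on a 12-key phone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {letter: key for key, letters in KEYPAD.items() for letter in letters}

def key_sequence(word):
    """Return the digits pressed to type `word` (ASCII letters a-z only)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

def build_index(word_counts):
    """Map each key sequence to its candidate words, most frequent first."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[key_sequence(word)].append((count, word))
    for candidates in index.values():
        candidates.sort(reverse=True)
    return index

def predict(index, keys):
    """List the words that the pressed keys could spell, best guess first."""
    return [word for _, word in index.get(keys, [])]

# Hypothetical word frequencies.
counts = {"home": 50, "good": 80, "gone": 30, "hood": 5, "see": 40}
index = build_index(counts)
print(predict(index, "4663"))
```

Building the index from one user's messages corresponds to per-phone modeling, while building it from a whole corpus corresponds to per-language modeling. Multi-word completion would additionally condition on the preceding word.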
Project Deliverables (excluding publications)
- NUS SMS Corpus - A collection of over 10,000 Short Message Service (SMS) messages.
- Shortform / Longform Codec
Project Staff
- Min-Yen KAN, Project lead
- Mingfeng LEE, alumnus, Shortform Longform SMS Codec (Fall 2004)
- Yijue HOW, alumnus, Analysis of SMS input efficiency (Fall 2003)
Definitional question answering
Project Duration: Jan 2003 - June 2006. Joint work with the PRIS group led by Prof. Tat-Seng Chua.
We explore advanced techniques in definition question answering: soft pattern matching, and boosting of IR recall and precision of extended definition sentences using external web resources and historical query logs. We also explore the construction of fluent definitions using sentence understanding and re-synthesis.
Project Deliverables (excluding publications)
- DefSearch: definition searcher demonstration, based on work in TREC 2003 and a paper in WWW 2004.
- JavaRAP: Lappin and Leass' pronominal anaphora resolution program, implemented as a Java program. Freely downloadable and installable. Published in LREC 2004.
- Processed Gigaword and AQUAINT corpora. Fully POS tagged and parsed versions of two commonly-used corpora used in TREC research.
- Processed DUC corpus. Canonicalized versions of the 2001-2004 corpus for internal research.
Project Staff
Page maintained by Jun-Ping. Last updated, 14-Feb-2009
|