Projects
For a brief introduction to research in the WING group, check out these presentation slides, prepared by Min in April 2004. There is also a more recent set of slides from the presentation on 13 May 2005. If you are a student looking for a research project for your graduate studies (GP/MSc), Honors Year (HYP) or undergraduate research opportunity program (UROP), or are considering a summer internship, we advise you to look at the slides above, read the following notes and visit our group's open project listings.
Web Query Analysis
Project Duration: Continuing.
Web queries are often short and dense, yet they serve distinct purposes. In our work, we examine how to automatically classify web queries using only simple, lightweight data: query logs and search results. In comparison, most existing automatic methods integrate rich data sources, such as user sessions and click-through data. We believe there is untapped potential for analyzing and typing queries based on deeper analysis of these simple sources.
Project Staff
LyricAlly: Lyric Alignment
Project Duration: Continuing.
However, these approaches are slow, require offline computation, and cannot run in real time. In recent work, we have been examining whether we can do away with intensive computation by using self-similarity to align the lyrics and music directly, without explicit multimodal processing.
Project Deliverables (excluding publications)
Project Staff
Automatic Text Summarization
Project Duration: Continuing.
We examine graph-based methods for text summarization, focusing on how (multidocument) texts are constructed and represented as graphs, and on the graph decomposition methods that lead to summaries. Unlike previous approaches to graph-based summarization, our approach builds the graph using a simple model of how an author produces text and how a reader consumes it. We are currently applying this work to blog summarization.
Project Staff
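As a point of reference for the graph-based summarization described in this project, here is a minimal sketch of a generic sentence-graph summarizer. It is not the group's author/reader model. It assumes that sentences are nodes, that word overlap (Jaccard) gives the edge weights, and that a sentence's salience is its weighted degree. The function names `tokenize`, `jaccard` and `summarize` are hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize(sentences, k=1):
    """Return the k sentences with the highest weighted degree, in document order."""
    tokens = [tokenize(s) for s in sentences]
    scores = []
    for i, ti in enumerate(tokens):
        # Weighted degree: sum of edge weights to every other sentence node.
        degree = sum(jaccard(ti, tj) for j, tj in enumerate(tokens) if j != i)
        scores.append((degree, i))
    # Highest degree first; ties are broken by earlier position in the text.
    top = sorted(scores, key=lambda pair: (-pair[0], pair[1]))[:k]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Swapping `jaccard` for a different edge-weighting function is the natural place to experiment with other models of how text is produced and read.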
Last updated: Wed Aug 8 10:03:11 GMT-8 2007
Advanced OPACs
Project Duration: 4 years, ending July 2007. Continuing work without formal funding. Joint work with Danny C. C. Poo.
When library patrons use an online public access catalog (OPAC), they tend to type very few query words. Although patrons often have specific information needs, current search engine usability encourages them to underspecify their queries. An OPAC's fast response times, combined with the difficulty of using more advanced query operators, push users towards a probe-and-browse mode of information seeking. Additionally, patrons have adapted (or been forced to adapt) to OPACs and give keywords as their queries, rather than more precise queries. As a consequence, the search results often only approximate the patron's information need: they miss crucial resources that may have been phrased differently, or offer results that are phrased exactly as wanted but address the patron's information need only tangentially.
One solution is to teach library patrons to use advanced query syntax and to formulate more precise queries. However, this solution is both labor-intensive for library staff and time-intensive for patrons. Furthermore, different OPACs support different levels of advanced capabilities and often represent these operators with different syntax.
The alternative solution that we propose is advanced query analysis and query expansion. Rather than change the behavior of patrons, a system can analyze their keyword queries to infer more precise queries that use advanced operators when appropriate. To make these inferences, the system relies not only on logic but also dynamically accesses both a) historical query logs and b) the library catalog to assess the feasibility of its suggestions. The proposed research targets three types of query rewriting/expansion: 1) correction of misspellings, 2) inferring the relationships between a query's nouns and noun phrases, and 3) inferring intended advanced query syntax.
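The first rewriting type, correction of misspellings, can be illustrated with a minimal sketch. It assumes a vocabulary built from catalog titles and historical query logs, and uses the standard library's `difflib.get_close_matches` as a stand-in for whatever matching the actual system uses. The names `build_vocabulary` and `correct_query` and the `cutoff` value are hypothetical.

```python
import difflib
from collections import Counter


def build_vocabulary(catalog_titles, query_log):
    """Count word frequencies across catalog titles and past queries."""
    counts = Counter()
    for text in list(catalog_titles) + list(query_log):
        counts.update(text.lower().split())
    return counts


def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each unknown word with its closest, most frequent known word."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        candidates = difflib.get_close_matches(
            word, list(vocabulary), n=3, cutoff=cutoff
        )
        # Prefer the candidate seen most often; keep the word if nothing is close.
        best = max(candidates, key=lambda c: vocabulary[c], default=word)
        corrected.append(best)
    return " ".join(corrected)
```

Because suggestions are drawn only from words that occur in the catalog or the logs, a corrected query is likely to return results, which is one simple way to "assess the feasibility" of a suggestion.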
The realization of these techniques will allow patrons to continue using OPACs by issuing simple keyword searches, while benefiting from the more precise querying and alternative search suggestions that the implemented system provides. In our current work, we have examined how to re-engineer and design the user interface to better support the actual information seeking methods used by library patrons. Jesse has re-engineered the work and has incorporated results from our own NUS OPAC as well as Colorado State University's OPAC into his tabbed, overview+details framework. If you have a library catalog with MARC21 results that can be exported, we can wrap our UI around your database. Contact us for more information.
Project Deliverables (excluding publications)
Project Staff
Last updated: Wed Aug 8 10:08:33 GMT-8 2007
Scenario Template Generation
Project Duration: Continuing.
A scenario template is a data structure that reflects the salient aspects shared by a set of events that are similar enough to be considered as belonging to the same scenario. The salient aspects are typically the scenario's characteristic actions, the entities involved in these actions and the related attributes. Such a scenario template, once populated with respect to a particular event, serves as a concise overview of the event. It also provides valuable information for applications such as information extraction (IE) and text summarization. Manually defining a scenario template is expensive. In this project, we aim to automate the template generation process. Sentences from different event reports are broken down into predicate-argument tuples, which are clustered semantically. Salient aspects are then generalized from each of the large clusters. The features we investigate for this purpose include word similarity and context similarity. The resulting scenario template is not only a structured collection of salient aspects, as a manual template is, but also an information source that other NLP systems can consult to see how these salient aspects are realized in news reports. Stay tuned for a corpus release of newswire articles that Long has compiled for use in the Scenario Template tasks.
Project Staff
Web-based disambiguation of digital library metadata
Project Duration: Continuing.
As digital libraries grow in size, the quality of their metadata records becomes an issue. Data entry mistakes, string representation differences, ambiguous names, missing field data, repeated entries and other factors contribute to errors and inconsistencies in the metadata records. Noisy metadata records make searching difficult and can result in certain information not being found at all. They also cause under-counts or over-counts that distort aggregate statistics, and decrease the utility of digital libraries in general. In this project, we concentrate on using the Web to aid the disambiguation of the metadata records. This is because the metadata records themselves sometimes contain insufficient information, or the required knowledge is very difficult to mine. However, the richness of the Web, which represents the collective knowledge of the human population, often provides the answer instantly when suitable queries are presented to a search engine. As search engine calls and web page downloads are time-consuming, any Web-based disambiguation algorithm must be able to scale up to large numbers of records.
Project Staff
Phrase-Based Statistical Machine Translation
Project Duration: Continuing.
By its nature, machine translation provides an interesting playground for statistical approaches. Its problems stem from ambiguity at several levels, from the surface level down to the semantic level, each of which poses a great challenge even in isolation. In this project, we aim to advance the performance of the reordering model, which attempts to restructure a lexically translated sentence into the correct word order of the target language. In particular, we examine reordering centered on function words. This is motivated by the observation that phrases around function words are often incorrectly reordered. By modelling reordering patterns around function words, we hope to capture prominent reordering patterns in both the source and target languages.
Project Staff
Last updated: Fri Mar 2 13:31:47 GMT-8 2007
Focused crawling
Project Duration: Continuing.
Web crawling algorithms have now been devised for topic-specific resources, a task known as focused crawling. We examine the specialized crawling of structurally similar resources that are used as input to other projects. We examine how to devise trainable crawling algorithms that "sip" the minimal amount of bandwidth and web pages from a site, using context graphs, negative information, web page layout, and URL structure as evidence. To motivate the crawling algorithm design, we concentrate on four real-world collection problems: topical pages, song lyrics, scientific publications and geographical map images.
Project Deliverables (excluding publications)
Project Staff
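The idea of a crawler that "sips" from a site, as described for the focused crawling project, can be sketched as a best-first frontier under a page budget. This is a minimal illustration only: it uses URL structure as the sole evidence, scored by hand-picked terms rather than a trained model, and treats negative evidence as a hard prune. The names `url_score` and `focused_crawl`, and the caller-supplied `fetch_links` function, are hypothetical.

```python
import heapq
from urllib.parse import urlparse


def url_score(url, positive_terms, negative_terms):
    """Score a URL from its path and query: +1 per positive term, -1 per negative term."""
    parsed = urlparse(url.lower())
    text = parsed.path + "?" + parsed.query
    score = sum(term in text for term in positive_terms)
    score -= sum(term in text for term in negative_terms)
    return score


def focused_crawl(seed, fetch_links, positive_terms, negative_terms, budget=10):
    """Visit at most `budget` pages, always expanding the best-scoring URL first."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-url_score(seed, positive_terms, negative_terms), seed)]
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            score = url_score(link, positive_terms, negative_terms)
            if score >= 0:  # negative evidence prunes the link entirely
                heapq.heappush(frontier, (-score, link))
    return visited
```

In a real crawler, `fetch_links` would download and parse a page; passing it in as a function keeps the sketch free of network access and makes the bandwidth budget explicit.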
Last updated: Fri Mar 2 13:31:47 GMT-8 2007
Domain-Specific Research Digital Libraries
Project Duration: Continuing.
For mathematics: despite the enormous success of common search engines for general search, their performance on domain-specific search is often compromised by a lack of knowledge of (and hence support for) the entities and users in the domain. In our project, we choose to tackle this problem in the domain of mathematics. Our ultimate goal is to build a system that is able to 1) automatically index and categorize math materials from multiple online resources, and 2) understand the intents and needs of users and present the results in an accurate and organized manner.
Project Deliverables (excluding publications)
Project Staff
Last updated: Wed Aug 8 10:31:05 GMT-8 2007
Lightweight NLP
Project Duration: Continuing.
Project Staff
Last updated: Wed Aug 8 10:31:05 GMT-8 2007
PARCELS: Web page division and classification
Project Duration: December 2003 - July 2005. Completed.
Web documents that look similar often use different HTML tags to achieve their layout effect. These tags often make it difficult for a machine to find text or images of interest. PARCELS is a backend system [Java] designed to distinguish the different components of a web site and parse it into a logical structure. This logical structure is independent of the design/style of any website. The system is implemented using a co-training framework between two independent views: a lexical module and a stylistic module. Each component in the structure is given a tag relevant to the domain it is classified under.
Project Deliverables (excluding publications)
Project Staff
Last updated: Sat Jun 19 13:07:43 SGT 2004
Metadata-based webpage summarization
Project Duration: June 2003 - December 2004.
Search engines currently report the top n documents that seem most relevant to a user's query. We investigate how to restructure this ranked list into a more meaningful natural language summary. Rather than focus only on the content of the actual webpages, we examine how metadata can be used to create useful summaries for researchers.
Project Deliverables (excluding publications)
Project Staff
Last updated: Sat Jun 19 13:07:43 SGT 2004
SMS text input
Project Duration: December 2003 - July 2005.
Short message service (SMS) is now a ubiquitous way of communicating. On a phone, each numeric key is mapped to several letters. This project examines predictive text entry techniques and the user interface design for text entry. We are working to extend a publicly available corpus of SMS messages and to use it to compile statistics for subsequent analysis. We examine predictive completion of one word and of two or more words, as well as models and data structures for computing completions efficiently, both for individual hand phones (per-phone modeling) and on a corpus-wide basis (per-language modeling).
Project Deliverables (excluding publications)
Project Staff
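To make the keypad mapping and single-word prediction in the SMS text input project concrete, here is a minimal sketch. It assumes the standard 12-key letter layout and a dictionary of word counts containing only alphabetic words. The names `to_keys`, `candidates` and `complete` are hypothetical, and the linear scan stands in for the more efficient data structures the project investigates.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {
    letter: key for key, letters in KEYPAD.items() for letter in letters
}


def to_keys(word):
    """Map an alphabetic word to the digit sequence typed on the keypad."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def candidates(word_counts, keys):
    """Words typed with exactly this key sequence, most frequent first."""
    matches = [w for w in word_counts if to_keys(w) == keys]
    return sorted(matches, key=lambda w: (-word_counts[w], w))


def complete(word_counts, keys, limit=3):
    """Up to `limit` words whose key sequence starts with `keys`, most frequent first."""
    matches = [w for w in word_counts if to_keys(w).startswith(keys)]
    return sorted(matches, key=lambda w: (-word_counts[w], w))[:limit]
```

Building `word_counts` from one user's messages corresponds to per-phone modeling, while building it from the whole corpus corresponds to per-language modeling.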
Last updated: Wed May 18 23:42:33 SGT 2005
Light verb constructions
Project Duration: May 2004 - July 2005.
A light verb construction (LVC) is a verb-complement pair in which the verb has little lexical meaning and much of the semantic content of the construction is obtained from the complement. Examples of LVCs are "make a decision" and "give a presentation"; such constructions pose challenges for natural language processing and understanding. In this project, we investigate methods to identify LVCs in a corpus and to recognize their linguistic features.
Project Staff
Definition question answering
Project Duration: Jan 2003 - June 2006.
We explore advanced techniques in definition question answering: soft pattern matching, and boosting the IR recall and precision of extended definition sentences using external web resources and historical query logs. We also explore the construction of fluent definitions using sentence understanding and re-synthesis.
Project Deliverables (excluding publications)
Project Staff
Last updated: Tuesday, 08 April 2008
In this project, we construct software that imposes a generic, shareable, publishable and searchable framework for organizing scientific publications, similar to Cora and CiteSeer. Our work aims to enable researchers to share and publish annotations, search by fields, and add new fields and organization as appropriate. We are examining digital libraries (DLs) for mathematics as well as for coordinating multimedia.
For embedded systems with constrained power and CPU resources, how should NLP and other machine learning tasks be done? We investigate how different combinations of features and learners affect machine-learned NLP tasks on embedded devices with respect to time, power and accuracy.
