Workshop on text and citation analysis for scholarly digital libraries (NLPIR4DL)

The inaugural workshop at ACL-IJCNLP 09 in Singapore, 7 August 2009


Workshop Program and Paper Archive

Friday, August 7, 2009

9:00 Opening Remarks
9:10–10:00 Invited Talk by Rick Lee of the World Scientific Publishing Company
[ Slides .ppt .htm .pdf ] [ Video Lo Res (.wmv) ]
10:00–10:30 Coffee Break
Session 1: Metadata and Content (Session Chair: Bonnie Webber)
10:30–10:55 Researcher affiliation extraction from homepages
István Nagy, Richárd Farkas and Márk Jelasity
[ Slides .ppt .htm .pdf ] [ Home Page Corpus ]
10:55–11:20 Anchor Text Extraction for Academic Search
Shuming Shi, Fei Xing, Mingjie Zhu, Zaiqing Nie and Ji-Rong Wen
11:20–11:45 Accurate Argumentative Zoning with Maximum Entropy models
Stephen Merity, Tara Murphy and James R. Curran
11:45–12:10 Classification of Research Papers into a Patent Classification System Using Two Translation Models
Hidetsugu Nanba and Toshiyuki Takezawa
12:10–13:50 Lunch Break
Session 2: Systems (Session Chair: Manabu Okumura)
13:50–14:15 Detecting key sentences for automatic assistance in peer reviewing research articles in educational sciences
Ágnes Sándor and Angela Vorndran
[ Slides (.pdf) ] [ Demo of highlighted papers (.zip) ]
14:15–14:40 Designing a Citation-Sensitive Research Tool: An Initial Study of Browsing-Specific Information Needs
Stephen Wan, Cécile Paris, Michael Muthukrishna and Robert Dale
14:40–15:05 The ACL Anthology Network
Dragomir R. Radev, Pradeep Muthukrishnan and Vahed Qazvinian
15:05–15:30 NLP Support for Faceted Navigation in Scholarly Collection
Marti A. Hearst and Emilia Stoica
[ Slides .ppt .htm .pdf ]
15:30–16:00 Coffee Break
Session 3: Citation Support (Session Chair: Robert Dale)
16:00–16:25 FireCite: Lightweight real-time reference string extraction from webpages
Ching Hoi Andy Hong, Jesse Prabawa Gozali and Min-Yen Kan
[ Slides (.pdf) ] [ Firefox 3 Extension ]
16:25–16:50 Citations in the Digital Library of Classics: Extracting Canonical References by Using Conditional Random Fields
Matteo Romanello, Federico Boschetti and Gregory Crane
[ Slides (.pdf) ] [ CREFEX software site @ Google Code ]
16:50–17:15 Automatic Extraction of Citation Contexts for Research Paper Summarization: A Coreference-chain based Approach
Dain Kaplan, Ryu Iida and Takenobu Tokunaga
17:15–18:00 Informal Demonstration Session -- Wrap up


Informal Demonstration Session

We are planning an informal demo session as the final event of the workshop. Many of us are shown demos spontaneously by colleagues at conferences anyway, and many of us would like a small audience for our own new digital library-related demos. We want to give a semi-formal opportunity at NLPIR4DL to anybody who is interested in doing just that. We hope that this workshop is small and focussed enough for this to be a successful experiment.

Organisation of the informal demos is also very informal, i.e., without peer review:

Call for Papers

In recent years, interest in scholarly publications in electronic form has boomed, and several large-scale digital libraries and citation indices are now used every day by researchers. Current digital libraries collect digital papers and their metadata (including citations) and provide access to them, but largely do not attempt to analyze the items they collect.

The goal of this workshop is to investigate how developments in natural language processing and information retrieval techniques can advance the state of the art in scholarly document understanding, analysis and retrieval. Full document text analysis can help design automatic summarization and sentiment detection methods, automated recommendation and reviewing systems, and may provide data for visualizing scientific trends and bibliometrics. Citation analysis takes this a step further, adding scientific social network analysis as another strand of evidence to enhance solutions to the above challenges. Web-based digital libraries add download counts and Web 2.0 information such as tagging.

Aside from researchers, this workshop hopes to interest other stakeholders, namely implementers, publishers and policymakers. Even within computer science, many different scholarly sites exist -- ACM Portal, IEEE Xplore, Google Scholar, PSU's CiteSeerX, MSRA's Libra, Tsinghua's ArnetMiner, Trier's DBLP, UMass' Rexa, Hiroshima's PRESRI -- and with this workshop we hope to bring a number of these contributors together. Today's publishers continue to seek new ways to stay relevant to their consumers by disseminating the right published works to their audience. Formal citation metrics have become an increasingly large factor in decision-making by universities and funding bodies worldwide, which makes research on such topics, and on better methods for measuring the impact of work, more pressing.

We invite stimulating, previously unpublished submissions on topics including (but not limited to) full-text analysis, multimedia and multilingual analysis and alignment, and citation-based NLP or IR. Specific examples of areas of interest include:

Submission details:

Important Dates

Program Committee


Simone Teufel
University of Cambridge Computer Laboratory
William Gates Building, JJ Thomson Ave,
Cambridge CB3 0FD, United Kingdom.

Simone Teufel is a senior lecturer in the Computer Laboratory at Cambridge University, where she has worked since 2001. Her main research interests are in corpus-linguistic approaches to discourse theory, and in the application of such information to summarisation, information retrieval and citation analysis. She has a background in computer science (1994 Diploma from the University of Stuttgart) and in cognitive science (2000 PhD from Edinburgh University). She also has experience in medical information processing and search, from a postdoctoral stay at Columbia University, and in collocation extraction, from a research post at Xerox Europe. Her latest research interests include lexical acquisition, and the visualisation and language generation of the analysis results of scientific articles.

Min-Yen Kan
AS6 05-12
Computing 1, Law Link
National University of Singapore

Min-Yen Kan is an assistant professor at the National University of Singapore. His research interests include digital libraries and applied natural language processing. Specific projects include work in the areas of citation analysis, document structure acquisition, verb analysis, and applied text summarization. Prior to joining NUS, he was a graduate research assistant at Columbia University and interned at various industry laboratories, including AT&T, IBM and Eurospider Technologies in Switzerland.

