Scientific Document Processing


The number of scientific articles published each year continues to grow, which makes it difficult for scientists to organise the knowledge that is crucial to their everyday work. Scholarly document processing, which draws on natural language processing and information retrieval, facilitates the management of scientific knowledge and helps scientists work more efficiently. It covers tasks such as retrieving scientific articles related to a topic, summarising articles, and identifying influential works and authors in a particular field, among many other related tasks. Neural network methods are increasingly applied to understand scientific articles semantically. Such semantic understanding provides fine-grained access to the knowledge available in the scientific literature, as semantic search engines like Semantic Scholar demonstrate.

Dataset Mention Extraction and Classification

Many research fields now conduct empirical studies based on real-world datasets, yet there is no proper mechanism for finding the papers that use a given dataset or for identifying the datasets used in a given paper. Identifying important aspects of scientific publications, such as dataset mentions, matters for many downstream tasks, including indexing and search. In social science, datasets are integral to the studies, but the same dataset is referred to by many different surface forms. In this project, we explore different approaches to identifying mentions of datasets in papers (mention extraction) and to classifying each mention to the dataset it refers to (dataset discovery).
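To make the two steps concrete, the following is a minimal sketch of a dictionary-based baseline, not the approach the project itself takes. It assumes a small, hand-written alias table (the `ALIASES` mapping and the function names are illustrative) and shows how different surface forms of a dataset can be extracted from a sentence and linked to one canonical name.

```python
import re

# Illustrative alias table (an assumption for this sketch):
# canonical dataset name -> surface forms it may appear under.
ALIASES = {
    "General Social Survey": ["General Social Survey", "GSS"],
    "European Social Survey": ["European Social Survey", "ESS"],
}


def build_pattern(aliases):
    """Compile one regex that matches any known surface form as a whole word."""
    forms = {form for forms in aliases.values() for form in forms}
    # Longest forms first so that full names win over embedded abbreviations.
    ordered = sorted(forms, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(f) for f in ordered) + r")\b")


def extract_and_link(text, aliases=ALIASES):
    """Return (character offset, mention, canonical dataset) for each match."""
    lookup = {form: name for name, forms in aliases.items() for form in forms}
    pattern = build_pattern(aliases)
    return [(m.start(), m.group(1), lookup[m.group(1)])
            for m in pattern.finditer(text)]


sentence = "We analyse GSS data and compare it with the European Social Survey."
for start, mention, dataset in extract_and_link(sentence):
    print(start, mention, "->", dataset)
```

Exact matching of this kind misses unseen surface forms and cannot disambiguate abbreviations, which is one reason learned extraction and classification models are of interest for this task.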

Citation String Parsing on Large-scale Synthetic Data

Reference strings, which usually appear at the end of scientific documents, are text strings that contain information such as the authors, the title of the referenced document and the time when it was published. Identifying this information and parsing it from raw text strings is important for calculating bibliometrics such as the h-index, understanding the network of scientific papers, and supporting recommendation systems, summarization and other applications.

To identify the information in reference strings accurately, we can use deep learning models. However, a deep learning model that performs well often requires large-scale data, and annotating citation strings by hand can be very time-consuming. To alleviate this problem, we can generate synthetic data. This project examines topics such as the performance of deep learning models on synthetic data.
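As a rough illustration of how synthetic data removes the need for manual annotation, the sketch below renders structured records into reference strings and emits a label for every token at the same time. The field values, the two citation styles and the function name are all hypothetical placeholders; a real pipeline would draw records from a bibliographic database and use many more styles.

```python
import random

# Hypothetical field values used only for this sketch.
AUTHORS = ["A. Smith", "B. Lee", "C. Kumar"]
TITLES = ["Parsing reference strings", "Neural sequence labelling"]
VENUES = ["Proceedings of ACL", "Journal of Informetrics"]

# Hypothetical citation styles: an ordered list of (field, trailing separator).
STYLES = [
    [("author", "."), ("year", "."), ("title", "."), ("venue", ".")],
    [("author", ","), ("title", ","), ("venue", ","), ("year", ".")],
]


def make_example(rng):
    """Render one random record in a random style; return (tokens, labels)."""
    record = {
        "author": rng.choice(AUTHORS),
        "title": rng.choice(TITLES),
        "venue": rng.choice(VENUES),
        "year": str(rng.randint(1990, 2022)),
    }
    tokens, labels = [], []
    for field, separator in rng.choice(STYLES):
        words = record[field].split()
        words[-1] += separator  # attach punctuation to the field's last token
        tokens.extend(words)
        # The label is known for free because we know which field we rendered.
        labels.extend([field] * len(words))
    return tokens, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens, labels = make_example(rng)
print(" ".join(tokens))
print(labels)
```

Each generated pair can serve directly as a training example for a sequence-labelling model, since every token already carries its field label.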

Other Research Tasks

  • Multi-document Summarization
  • Long-document Modeling
  • Scientific Claim Verification



SDP Toolkits: BERT-ParsCit, SciAssist.

Dataset Extraction Dataset:

Scientific Document Processing Workshops: SDP@EMNLP2020, SDP@NAACL2021, SDP@COLING2022.


  • Benjamin Aw (Mcomp Capstone Project Student)
  • Linxiao Zhu (Mcomp Capstone Project Student)
  • Niranjana Unnithan (Remote Intern Student)
  • Yixi Ding (Remote Intern Student)
  • Jiahe Li (UROP Student)
  • Po-Wei Huang (UROP Student)
  • Hon Hao Chen (FYP Student)
  • Animesh Prasad (Graduate Student)
  • Chenglei Si (Research Assistant)
  • Yanxia Qin (Research Fellow)
  • Min-Yen Kan (Professor and Advisor)