The number of scientific articles published each year keeps growing. This volume makes it very difficult for scientists to organise knowledge, which is crucial in their everyday work. Scholarly document processing, which draws on natural language processing and information retrieval, facilitates the management of scientific knowledge and helps scientists work more efficiently. It involves retrieving scientific articles related to a topic, summarising scientific articles, and identifying influential works and authors in a particular field, among many other related tasks. Neural network methods are increasingly applied to understand scientific articles semantically. A semantic understanding of such articles provides fine-grained access to the knowledge available in the scientific literature, to which semantic search engines like Semantic Scholar are a testament.
Dataset Mention Extraction and Classification
Nowadays many research fields conduct empirical studies based on real-world datasets. However, there is no proper mechanism for finding the papers that use a given dataset or for identifying the datasets used in a given paper. Identifying important aspects of scientific publications, such as dataset mentions, matters for many downstream tasks, including indexing and search. In social science, datasets form an integral part of studies, yet the same dataset is referred to by many different surface forms. In this project, we explore different approaches to identifying such mentions of datasets in papers (mention extraction) and to classifying each mention to the dataset it refers to (dataset discovery).
Original Dataset: https://coleridgeinitiative.org/richcontextcompetition
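
To make the two tasks concrete, here is a minimal sketch of a dictionary-matching baseline in Python. It is not the approach used in this project, only an illustration of what mention extraction (finding the text span) and dataset discovery (mapping the span to a dataset) produce. The `CATALOGUE` contents, the `Mention` type, and the function names are all assumptions made for this example and are not taken from the competition data.

```python
import re
from typing import Dict, List, NamedTuple, Pattern, Tuple


class Mention(NamedTuple):
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    text: str         # surface form as it appears in the paper
    dataset_id: str   # dataset the mention is resolved to


# Hypothetical catalogue: dataset id -> known surface forms.
# The entries are illustrative only.
CATALOGUE: Dict[str, List[str]] = {
    "nhanes": ["National Health and Nutrition Examination Survey", "NHANES"],
    "anes": ["American National Election Studies", "ANES"],
}


def build_pattern(catalogue: Dict[str, List[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one regex over all surface forms and a lookup back to dataset ids."""
    alias_to_id = {
        alias.lower(): dataset_id
        for dataset_id, aliases in catalogue.items()
        for alias in aliases
    }
    # Try longer surface forms first so a full name wins over a shorter alias.
    ordered = sorted(alias_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(alias) for alias in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, alias_to_id


def extract_mentions(text: str, catalogue: Dict[str, List[str]]) -> List[Mention]:
    """Mention extraction (span) plus dataset discovery (id) by exact alias match."""
    pattern, alias_to_id = build_pattern(catalogue)
    return [
        Mention(m.start(), m.end(), m.group(0), alias_to_id[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "We use data from the National Health and Nutrition Examination "
        "Survey (NHANES) and compare against the ANES."
    )
    for mention in extract_mentions(sample, CATALOGUE):
        print(mention)
```

A baseline like this only finds surface forms that are already listed, which is exactly the limitation described above: datasets appear under many different surface forms, so unseen variants are missed and a learned extractor is needed to recover them.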