Mining person name data across datasets is challenging because person names and their roles naturally appear in many different forms. We refine the modern neural named entity recognition (NER) approach for extracting person names and their roles by leveraging the key relationships between them. Using high-quality embeddings extracted from clean datasets, we improve BiLSTM–CRF extraction performance on lower-quality datasets. Our method addresses name data sparsity through a process of data augmentation and refinement that synthesizes auxiliary data to improve the recognition of names from underrepresented ethnicities.
We employ our method to extract service contributions, in the form of editorial board roles, from journal websites. Our method augments limited supervised data consisting of tuples of researchers’ names, affiliations, and board roles. We use it to construct a large dataset covering approximately 300 journals, together with the resulting extractions of such service contributions, across three major scientific publishers: IEEE, Springer, and ACM. We demonstrate that these refinements significantly and consistently reduce the errors made by the standard BiLSTM–CRF by over 30%, and that these improvements hold across the component data sources, which exhibit differing levels of consistency.