Abstract
Named Entity Recognition (NER) is a core task of NLP. State-of-art supervised NER models rely heavily on a large amount of high-quality annotated data, which is quite expensive to obtain. Various existing ways have been proposed to reduce the heavy reliance on large training data, but only with limited effect. In this paper, we propose a novel way to make full use of the weakly-annotated texts in encyclopedia pages for exactly unsupervised NER learning, which is expected to provide an opportunity to train the NER model with no manually-labeled data at all. Briefly, we roughly divide the sentences of encyclopedia pages into two parts simply according to the density of inner url links contained in each sentence. While a relatively small number of sentences with dense links are used directly for training the NER model initially, the left sentences with sparse links are then smartly selected for gradually promoting the model in several self-training iterations. Given the limited number of sentences with dense links for training, a data augmentation method is proposed, which could generate a lot more training data with the help of the structured data of encyclopedia to greatly augment the training effect. Besides, in the iterative self-training step, we propose to utilize a graph model to help estimate the labeled quality of these sentences with sparse links, among which those with the highest labeled quality would be put into our training set for updating the model in the next iteration. Our empirical study shows that the NER model trained with our unsupervised learning approach could perform even better than several state-of-art models fully trained on newswires data.
Original language | English (US) |
---|---|
Title of host publication | Web and Big Data |
Publisher | Springer International Publishing |
Pages | 329-344 |
Number of pages | 16 |
ISBN (Print) | 9783030260712 |
DOIs | |
State | Published - Jul 18 2019 |
Bibliographical note
KAUST Repository Item: Exported on 2020-10-01Acknowledgements: This research is partially supported by National Natural Science Foundation of China (Grant No. 61632016, 61572336, 61572335, 61772356), and the Natural Science Research Project of Jiangsu Higher Education Institution (No. 17KJA520003, 18KJA520010).