Overcoming Semantic drift in information extraction

Zhixu Li, Hongsong Li, Haixun Wang, Yi Yang, Xiangliang Zhang, Xiaofang Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iteration. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the "symptoms" of semantic drift, then DPs are its "causes". In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effects they introduce. We use isA (concept-instance) extraction as an example to demonstrate the effectiveness of our approach in cleaning information extraction errors caused by semantic drift. We perform experiments on an iterative isA relation extraction, in which 90.5 million isA pairs are automatically extracted from 1.6 billion web documents with low precision. The experimental results show that our DP cleaning method removes more than 90% of the incorrect instances with 95% precision, outperforming the previous approaches we compare against. As a result, our method greatly improves the precision of this large isA data set, from less than 50% to over 90%.
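The idea of removing the effect of a DP can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes each extraction records which earlier extraction triggered it, and that the DPs have already been identified by some means not shown here. The function name `clean_drift`, the `triggers` mapping, and the example pairs are all hypothetical.

```python
from collections import deque


def clean_drift(seeds, triggers, drifting_points):
    """Return the extractions still supported once DPs are removed.

    seeds: extractions trusted from the start (assumed input).
    triggers: dict mapping an extraction to the extractions it led to.
    drifting_points: extractions already flagged as DPs (detection not shown).
    """
    blocked = set(drifting_points)
    kept = set()
    queue = deque(s for s in seeds if s not in blocked)
    while queue:
        item = queue.popleft()
        if item in kept:
            continue
        kept.add(item)
        # Follow provenance links, but never pass through a drifting point.
        for child in triggers.get(item, ()):
            if child not in blocked and child not in kept:
                queue.append(child)
    return kept


# Hypothetical isA pairs for the concept "animal"; "chicken" drifts toward food.
seeds = [("animal", "dog")]
triggers = {
    ("animal", "dog"): [("animal", "cat"), ("animal", "chicken")],
    ("animal", "chicken"): [("animal", "beef"), ("animal", "pork")],
}
kept = clean_drift(seeds, triggers, drifting_points=[("animal", "chicken")])
print(sorted(kept))
```

In this toy run, blocking the single pair ("animal", "chicken") also drops the two pairs that were reached only through it, which mirrors the abstract's point that a few early questionable extractions account for many later errors.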

Original language: English (US)
Title of host publication: Advances in Database Technology - EDBT 2014
Subtitle of host publication: 17th International Conference on Extending Database Technology, Proceedings
Editors: Vincent Leroy, Vassilis Christophides, Stratos Idreos, Anastasios Kementsietsidis, Minos Garofalakis, Sihem Amer-Yahia
Publisher: OpenProceedings.org, University of Konstanz, University Library
Pages: 169-180
Number of pages: 12
ISBN (Electronic): 9783893180653
State: Published - 2014
Event: 17th International Conference on Extending Database Technology, EDBT 2014 - Athens, Greece
Duration: Mar 24 2014 → Mar 28 2014

Publication series

Name: Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology, Proceedings

Conference

Conference: 17th International Conference on Extending Database Technology, EDBT 2014
Country/Territory: Greece
City: Athens
Period: 03/24/14 → 03/28/14

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
  • Software
