TY - GEN
T1 - Overcoming semantic drift in information extraction
AU - Li, Zhixu
AU - Li, Hongsong
AU - Wang, Haixun
AU - Yang, Yi
AU - Zhang, Xiangliang
AU - Zhou, Xiaofang
PY - 2014
Y1 - 2014
N2 - Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iterations. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the "symptoms" of semantic drift, then DPs are the "causes" of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect introduced by the DPs. We use isA (concept-instance) extraction as an example to demonstrate the effectiveness of our approach in cleaning information extraction errors caused by semantic drift. We perform experiments on an iterative isA relation extraction, where 90.5 million isA pairs are automatically extracted from 1.6 billion web documents with low precision. The experimental results show that our DP cleaning method enables us to clean more than 90% of incorrect instances with 95% precision, which outperforms the previous approaches we compare against. As a result, our method greatly improves the precision of this large isA data set from less than 50% to over 90%.
AB - Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iterations. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the "symptoms" of semantic drift, then DPs are the "causes" of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect introduced by the DPs. We use isA (concept-instance) extraction as an example to demonstrate the effectiveness of our approach in cleaning information extraction errors caused by semantic drift. We perform experiments on an iterative isA relation extraction, where 90.5 million isA pairs are automatically extracted from 1.6 billion web documents with low precision. The experimental results show that our DP cleaning method enables us to clean more than 90% of incorrect instances with 95% precision, which outperforms the previous approaches we compare against. As a result, our method greatly improves the precision of this large isA data set from less than 50% to over 90%.
UR - http://www.scopus.com/inward/record.url?scp=85014389620&partnerID=8YFLogxK
U2 - 10.5441/002/edbt.2014.16
DO - 10.5441/002/edbt.2014.16
M3 - Conference contribution
AN - SCOPUS:85014389620
T3 - Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology, Proceedings
SP - 169
EP - 180
BT - Advances in Database Technology - EDBT 2014
A2 - Leroy, Vincent
A2 - Christophides, Vassilis
A2 - Idreos, Stratos
A2 - Kementsietsidis, Anastasios
A2 - Garofalakis, Minos
A2 - Amer-Yahia, Sihem
PB - OpenProceedings.org, University of Konstanz, University Library
T2 - 17th International Conference on Extending Database Technology, EDBT 2014
Y2 - 24 March 2014 through 28 March 2014
ER -