Web Harvesting enables the enrichment of incomplete data sets by retrieving required information from the Web. However, instance ambiguity may greatly decrease the quality of the harvested data, since any instance in the local data set may become ambiguous when one attempts to identify it on the Web. Although many disambiguation methods have been proposed to deal with ambiguity problems in various settings, none of them can handle the instance ambiguity problem in Web Harvesting. In this paper, we propose to perform instance disambiguation in Web Harvesting with a novel disambiguation method inspired by the idea of collaborative identity recognition. In particular, we expect to find common properties, in the form of latent shared attribute values, among the instances in the list, such that these shared attribute values can distinguish the instances in the list from ambiguous ones on the Web. Our extensive experimental evaluation illustrates the utility of collaborative disambiguation for a popular Web Harvesting application, and shows that it substantially improves the accuracy of the harvested data.
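The idea of using shared attribute values can be illustrated with a minimal sketch. This is not the method from the paper, whose details the abstract does not give; it is a simple voting heuristic under stated assumptions: each local instance already has a list of candidate Web entities, each candidate is a flat attribute-to-value dictionary, and the function name `collaborative_disambiguate` and the toy data are hypothetical.

```python
from collections import Counter


def collaborative_disambiguate(candidates):
    """Pick one candidate per instance by favoring shared attribute values.

    candidates: dict mapping each local instance name to a list of
    candidate Web entities, each a dict of attribute -> value.
    (Hypothetical input format, assumed for illustration only.)
    """
    # For each (attribute, value) pair, count how many distinct instances
    # have at least one candidate carrying that pair.
    support = Counter()
    for cands in candidates.values():
        pairs = {(attr, val) for cand in cands for attr, val in cand.items()}
        support.update(pairs)

    def score(cand):
        # Subtract 1 so that a candidate gets no credit from its own instance.
        return sum(support[(attr, val)] - 1 for attr, val in cand.items())

    # Keep the best-scoring candidate; ties resolve to the first one listed.
    return {
        name: max(cands, key=score)
        for name, cands in candidates.items()
        if cands
    }


# Toy data (invented): a list of people who are mostly database researchers.
toy = {
    "A. Chen": [
        {"occupation": "athlete", "team": "City FC"},
        {"occupation": "researcher", "field": "databases"},
    ],
    "B. Kumar": [
        {"occupation": "researcher", "field": "databases"},
        {"occupation": "chef", "city": "Lyon"},
    ],
    "C. Lopez": [
        {"occupation": "researcher", "field": "databases"},
    ],
}

resolved = collaborative_disambiguate(toy)
for name, entity in resolved.items():
    print(name, entity)
```

In this toy list, the pairs `occupation = researcher` and `field = databases` are supported by all three instances, so the researcher candidate is chosen for each name. A realistic system would also need to discover the candidates on the Web and weight attributes by how discriminative they are, which this sketch leaves out.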
|Original language||English (US)|
|Title of host publication||18th International Workshop on the Web and Databases, WebDB 2015|
|Subtitle of host publication||Freshness, Correctness, Quality of Information and Knowledge on the Web - Proceedings|
|Editors||Julia Stoyanovich, Fabian M. Suchanek|
|Publisher||Association for Computing Machinery, Inc|
|Number of pages||7|
|State||Published - May 31 2015|
|Event||18th International Workshop on the Web and Databases, WebDB 2015 - Melbourne, Australia|
Duration: May 31 2015 → …
Bibliographical note
Funding Information:
This research is partially supported by the Natural Science Foundation of China (Nos. 61472263, 61402313, 61303019), the Australian Research Council (No. DP140103171), the Youth Teacher Startup Fund of South China Normal University (No. 14KJ18) and the King Abdullah University of Science and Technology (KAUST).
Copyright ©2010 by the Association for Computing Machinery.
ASJC Scopus subject areas
- Computer Networks and Communications
- Computer Science Applications
- Information Systems