TY - JOUR
T1 - Exploiting link structure for web page genre identification
AU - Zhu, Jia
AU - Xie, Qing
AU - Yu, Shoou I.
AU - Wong, Wai Hung
N1 - KAUST Repository Item: Exported on 2020-10-01
PY - 2015/7/7
Y1 - 2015/7/7
N2 - As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently inputted into a machine learning algorithm that will perform classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information. © 2015 The Author(s)
AB - As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently inputted into a machine learning algorithm that will perform classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information. © 2015 The Author(s)
UR - http://hdl.handle.net/10754/566107
UR - http://link.springer.com/10.1007/s10618-015-0428-8
UR - http://www.scopus.com/inward/record.url?scp=84936817751&partnerID=8YFLogxK
U2 - 10.1007/s10618-015-0428-8
DO - 10.1007/s10618-015-0428-8
M3 - Article
SN - 1384-5810
VL - 30
SP - 550
EP - 575
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 3
ER -