TY - JOUR
T1 - Data Stream Clustering With Affinity Propagation
AU - Zhang, Xiangliang
AU - Furtlehner, Cyril
AU - Germain-Renaud, Cecile
AU - Sebag, Michele
N1 - KAUST Repository Item: Exported on 2020-10-01
PY - 2013/8/23
Y1 - 2013/8/23
N2 - Data stream clustering provides insights into the underlying patterns of data flows. This paper focuses on selecting the best representatives from clusters of streaming data. There are two main challenges: how to cluster with the best representatives and how to handle the evolving patterns that are important characteristics of streaming data with dynamic distributions. We employ the Affinity Propagation (AP) algorithm presented in 2007 by Frey and Dueck for the first challenge, as it offers good guarantees of clustering optimality for selecting exemplars. The second challenging problem is solved by change detection. The presented StrAP algorithm combines AP with a statistical change point detection test; the clustering model is rebuilt whenever the test detects a change in the underlying data distribution. Besides the validation on two benchmark data sets, the presented algorithm is validated on a real-world application, monitoring the data flow of jobs submitted to the EGEE grid.
UR - http://hdl.handle.net/10754/556655
UR - http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6585253
UR - http://www.scopus.com/inward/record.url?scp=84904433383&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2013.146
DO - 10.1109/TKDE.2013.146
M3 - Article
SN - 1041-4347
VL - 26
SP - 1644
EP - 1656
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 7
ER -