TY - JOUR
T1 - Dynamic Set Similarity Join: An Update Log Based Approach
AU - Yang, Chengcheng
AU - Chen, Lisi
AU - Wang, Hao
AU - Shang, Shuo
AU - Mao, Rui
AU - Zhang, Xiangliang
N1 - Generated from Scopus record by KAUST IRTS on 2023-09-20
PY - 2023/4/1
Y1 - 2023/4/1
N2 - The set similarity join finds all pairs of similar sets from two collections of sets. It has many real world applications, such as personalized recommendation and community mining. In this paper, we study the problem of computing the similarity join in a dynamic context, where the sets are updated dynamically. This, however, is inefficient with the state-of-the-art join methods, because they usually assume that data collections are static and have to compute the join result from scratch whenever a set is updated. To address this issue, we propose ALJoin, an adaptive filtering approach that computes the join result incrementally based on the update logs. We first investigate the effect of set updates on the similarity values, and on this basis we propose to build a neighborhood index for each set. The neighborhood index of a specific set consists of any other sets that can be transformed into its similar sets within a threshold number of update operations. ALJoin then uses this index to effectively identify both similar and dissimilar set pairs based on their update logs. To efficiently build the neighborhood index, we devise several filtering techniques and propose a 'lazy-forward' method to reduce the computational cost. In addition, to improve the efficiency on varying workloads, we propose an analytical cost model, and design an online algorithm with performance guarantees to dynamically consolidate the update logs and adapt the neighborhood indexes. We evaluated our method using four real-world datasets. Experimental results show that our approach outperforms existing methods by up to 3.7×.
AB - The set similarity join finds all pairs of similar sets from two collections of sets. It has many real world applications, such as personalized recommendation and community mining. In this paper, we study the problem of computing the similarity join in a dynamic context, where the sets are updated dynamically. This, however, is inefficient with the state-of-the-art join methods, because they usually assume that data collections are static and have to compute the join result from scratch whenever a set is updated. To address this issue, we propose ALJoin, an adaptive filtering approach that computes the join result incrementally based on the update logs. We first investigate the effect of set updates on the similarity values, and on this basis we propose to build a neighborhood index for each set. The neighborhood index of a specific set consists of any other sets that can be transformed into its similar sets within a threshold number of update operations. ALJoin then uses this index to effectively identify both similar and dissimilar set pairs based on their update logs. To efficiently build the neighborhood index, we devise several filtering techniques and propose a 'lazy-forward' method to reduce the computational cost. In addition, to improve the efficiency on varying workloads, we propose an analytical cost model, and design an online algorithm with performance guarantees to dynamically consolidate the update logs and adapt the neighborhood indexes. We evaluated our method using four real-world datasets. Experimental results show that our approach outperforms existing methods by up to 3.7×.
UR - https://ieeexplore.ieee.org/document/9609684/
UR - http://www.scopus.com/inward/record.url?scp=85149992851&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2021.3126631
DO - 10.1109/TKDE.2021.3126631
M3 - Article
SN - 1558-2191
VL - 35
SP - 3727
EP - 3741
JO - IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
JF - IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
IS - 4
ER -