TY - JOUR
T1 - Fast and scalable inequality joins
AU - Khayyat, Zuhair
AU - Lucia, William
AU - Singh, Meghna
AU - Ouzzani, Mourad
AU - Papotti, Paolo
AU - Quiané-Ruiz, Jorge Arnulfo
AU - Tang, Nan
AU - Kalnis, Panos
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: Portions of the research in this paper used the MDC Database made available by Idiap Research Institute, Switzerland and owned by Nokia.
PY - 2016/9/7
Y1 - 2016/9/7
N2 - Inequality joins, which is to join relations with inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research ranging from efficient join algorithms such as sort-merge join, to the use of efficient indices such as (Formula presented.)-tree, (Formula presented.)-tree and Bitmap. However, inequality joins have received little attention and queries containing such joins are notably very slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins which is then used to optimize multiple predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show our solution is more scalable and several orders of magnitude faster. © 2016 Springer-Verlag Berlin Heidelberg
AB - Inequality joins, which is to join relations with inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research ranging from efficient join algorithms such as sort-merge join, to the use of efficient indices such as (Formula presented.)-tree, (Formula presented.)-tree and Bitmap. However, inequality joins have received little attention and queries containing such joins are notably very slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins which is then used to optimize multiple predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show our solution is more scalable and several orders of magnitude faster. © 2016 Springer-Verlag Berlin Heidelberg
UR - http://hdl.handle.net/10754/622197
UR - http://link.springer.com/article/10.1007%2Fs00778-016-0441-6
UR - http://www.scopus.com/inward/record.url?scp=84986275869&partnerID=8YFLogxK
U2 - 10.1007/s00778-016-0441-6
DO - 10.1007/s00778-016-0441-6
M3 - Article
SN - 1066-8888
VL - 26
SP - 125
EP - 150
JO - The VLDB Journal
JF - The VLDB Journal
IS - 1
ER -