TY - GEN
T1 - BigDansing
AU - Khayyat, Zuhair
AU - Ilyas, Ihab F.
AU - Jindal, Alekh
AU - Madden, Samuel
AU - Ouzzani, Mourad
AU - Papotti, Paolo
AU - Quiané-Ruiz, Jorge-Arnulfo
AU - Tang, Nan
AU - Yin, Si
N1 - KAUST Repository Item: Exported on 2020-10-01
PY - 2015/6/2
Y1 - 2015/6/2
N2 - Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
AB - Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
UR - http://hdl.handle.net/10754/623134
UR - http://dl.acm.org/citation.cfm?doid=2723372.2747646
UR - http://www.scopus.com/inward/record.url?scp=84949872769&partnerID=8YFLogxK
U2 - 10.1145/2723372.2747646
DO - 10.1145/2723372.2747646
M3 - Conference contribution
SN - 9781450327589
SP - 1215
EP - 1230
BT - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15
PB - Association for Computing Machinery (ACM)
ER -