Efficient processing of distributed iceberg semi-joins

Mohammed Kasim Imthiyaz*, Dong Xiaoan, Panos Kalnis

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingChapterpeer-review

Abstract

The Iceberg SemiJoin (ISJ) of two datasets R and S returns the tuples in R which join with at least k tuples of S. The ISJ operator is essential in many practical applications including OLAP, Data Mining and Information Retrieval. In this paper we consider the distributed evaluation of Iceberg SemiJoins, where R and S reside on remote servers. We developed an efficient algorithm which employs Bloom filters. The novelty of our approach is that we interleave the evaluation of the Iceberg set in server S with the pruning of unmatched tuples in server R. Therefore, we are able to (i) eliminate unnecessary tuples early, and (ii) extract accurate Bloom filters from the intermediate hash tables which are constructed during the generation of the Iceberg set. Compared to conventional two-phase approaches, our experiments demonstrate that our method transmits up to 80% less data through the network, while reducing the disk I/O cost.

Original languageEnglish (US)
Title of host publicationLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
EditorsFernando Galindo, Makoto Takizawa, Roland Traunmuller
PublisherSpringer Verlag
Pages634-643
Number of pages10
ISBN (Print)3540229361, 9783540229360
DOIs
StatePublished - 2004
Externally publishedYes

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume3180
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

Fingerprint

Dive into the research topics of 'Efficient processing of distributed iceberg semi-joins'. Together they form a unique fingerprint.

Cite this