Details
Original language | English |
---|---|
Pages (from-to) | 119-156 |
Number of pages | 38 |
Journal | Distributed and parallel databases |
Volume | 28 |
Issue number | 2-3 |
Publication status | Published - 2 Sept 2010 |
Abstract
Bloom filters are extensively used in distributed applications, especially in distributed databases and distributed information systems, to reduce network requirements and to increase performance. In this work, we propose two novel Bloom filter features that are important for distributed databases and information systems. First, we present a new approach to encode a Bloom filter such that its length can be adapted to the cardinality of the set it represents, with negligible overhead with respect to computation and false positive probability. The proposed encoding allows for significant network savings in distributed databases, as it enables the participating nodes to optimize the length of each Bloom filter before sending it over the network, for example, when executing Bloom joins. Second, we show how to estimate the number of distinct elements in a Bloom filter, for situations where the represented set is not materialized. These situations frequently arise in distributed databases, where estimating the cardinality of the represented sets is necessary for constructing an efficient query plan. The estimation is highly accurate and comes with tight probabilistic bounds. For both features we provide a thorough probabilistic analysis and extensive experimental evaluation which confirm the effectiveness of our approaches.
Keywords
- Bloom filters, Distributed databases, Distributed information systems
ASJC Scopus subject areas
- Computer Science(all)
- Software
- Computer Science(all)
- Information Systems
- Computer Science(all)
- Hardware and Architecture
- Decision Sciences(all)
- Information Systems and Management
Cite this
- Standard
- Harvard
- Apa
- Vancouver
- BibTeX
- RIS
In: Distributed and parallel databases, Vol. 28, No. 2-3, 02.09.2010, p. 119-156.
Research output: Contribution to journal › Article › Research › peer review
}
TY - JOUR
T1 - Cardinality estimation and dynamic length adaptation for Bloom filters
AU - Papapetrou, Odysseas
AU - Siberski, Wolf
AU - Nejdl, Wolfgang
PY - 2010/9/2
Y1 - 2010/9/2
N2 - Bloom filters are extensively used in distributed applications, especially in distributed databases and distributed information systems, to reduce network requirements and to increase performance. In this work, we propose two novel Bloom filter features that are important for distributed databases and information systems. First, we present a new approach to encode a Bloom filter such that its length can be adapted to the cardinality of the set it represents, with negligible overhead with respect to computation and false positive probability. The proposed encoding allows for significant network savings in distributed databases, as it enables the participating nodes to optimize the length of each Bloom filter before sending it over the network, for example, when executing Bloom joins. Second, we show how to estimate the number of distinct elements in a Bloom filter, for situations where the represented set is not materialized. These situations frequently arise in distributed databases, where estimating the cardinality of the represented sets is necessary for constructing an efficient query plan. The estimation is highly accurate and comes with tight probabilistic bounds. For both features we provide a thorough probabilistic analysis and extensive experimental evaluation which confirm the effectiveness of our approaches.
AB - Bloom filters are extensively used in distributed applications, especially in distributed databases and distributed information systems, to reduce network requirements and to increase performance. In this work, we propose two novel Bloom filter features that are important for distributed databases and information systems. First, we present a new approach to encode a Bloom filter such that its length can be adapted to the cardinality of the set it represents, with negligible overhead with respect to computation and false positive probability. The proposed encoding allows for significant network savings in distributed databases, as it enables the participating nodes to optimize the length of each Bloom filter before sending it over the network, for example, when executing Bloom joins. Second, we show how to estimate the number of distinct elements in a Bloom filter, for situations where the represented set is not materialized. These situations frequently arise in distributed databases, where estimating the cardinality of the represented sets is necessary for constructing an efficient query plan. The estimation is highly accurate and comes with tight probabilistic bounds. For both features we provide a thorough probabilistic analysis and extensive experimental evaluation which confirm the effectiveness of our approaches.
KW - Bloom filters
KW - Distributed databases
KW - Distributed information systems
UR - http://www.scopus.com/inward/record.url?scp=77957859235&partnerID=8YFLogxK
U2 - 10.1007/s10619-010-7067-2
DO - 10.1007/s10619-010-7067-2
M3 - Article
AN - SCOPUS:77957859235
VL - 28
SP - 119
EP - 156
JO - Distributed and parallel databases
JF - Distributed and parallel databases
SN - 0926-8782
IS - 2-3
ER -