KSample: Dynamic Sampling Over Unbounded Data Streams

Tiago Rodrigo Kepe, Eduardo Cunha de Almeida, Thomas Cerqueus


Data sampling over data streams is common practice to allow the analysis of data in real-time. However, sampling over data streams becomes complex when the stream does not fit in memory, and worse yet, when the length of the stream is unknown. A well-known technique for sampling data streams is the Reservoir Sampling. It requires a fixed-size reservoir that corresponds to the resulting sample size. But, defining the reservoir size is challenging: huge samples may waste computing resources and may not fit in memory; whereas tiny samples may be inadequate and prevent from drawing meaningful conclusions. This article presents KSample, a novel data sampling algorithm over unbounded data streams. It does not require to know the length of the stream or the size of the sample. The key idea of KSample is based on an invariant that keeps the percentage of the stream regardless of its length. That is the reservoir invariably represents at least the target percentage of the stream. KSample eliminates the problem of memory space by defining the concept of distributed mini-reservoirs grounded on the same invariant. Experiments show that KSample is substantially faster than the Reservoir Sampling algorithm to generate samples. Finally, KSample was put in practice to speed up data analytics over MapReduce jobs, reducing their response times by up to a factor of 20.


Data Analytics; Data Sampling; Data Stream

Full Text:


An official publication of the Brazilian Computer Society Special Interest Group on Databases.