Shane Chu

Tracking the frequently occurring elements with count-min sketch

Motivations

Imagine we need to process a data stream and track frequently occurring (or large) elements, often called heavy hitters. The items in the data stream are given as pairs (it,st)(i_t, s_t), where tt denotes the time point, iti_t is the item encoded in some form, and sts_t represents the size of iti_t. In some cases, we may not have enough space to store all the items (such as in routers for networking; or k-mers in genomics, which require exponential space), so we need a smart way to store only the important or frequently occurring elements.

Count-Min sketch

This is where a probabilistic data structure called the count-min sketch comes into play. Briefly put, the count-min sketch consists of a count table CC whose entries allow us to approximate the counts of the most frequent items that pass through the data stream. Additionally, a count-min sketch consists of KK universal hash functions h1,,hKh_1,\dots,h_K, whose domain is the space of the items and whose range is the column indices of the table CC. Note that universal hash functions can be constructed easily (e.g. see Universal hash functions).

To construct the count table CC, we define it to have KK rows and LL columns, where each row in the table sometimes referred to as a group. Each entry C[k,]C[k,\ell] in the table CC is called a counter, where 1kK1\leq k \leq K and 1L1\leq \ell \leq L. Now, we will define

Count(i,T)= t:it=i,1tTst\text{Count}(i,T)=\underset{\begin{aligned} &\,\,t :i_t=i, \\ &1\leq t \leq T \end{aligned}}{\sum} s_t

as the total count of item ii up to time TT. Next, assume that

We now have the following result:

Let an error tolerance parameter ϵ>0\epsilon > 0 be given and a K×LK\times L count table CC constructed as above. Fix ii as an item from the input space. Then, with probability 1(1Lϵ)K1-(\frac{1}{L\epsilon})^K over the choice of hash functions, for a sequence of items (i1,s1),,(iT,sT)(i_1,s_1),\dots,(i_T,s_T), we have that

Count(i,T)min =hk(i), 1kKC[k,] Count(i,T)+ϵt=1Tst \text{Count}(i,T) \leq \underset{\begin{aligned} &\,\,\,\ell=h_k(i), \\\ &1\leq k\leq K \end{aligned}}{\text{min}} C[k, \ell] \,\,\,\leq \text{Count}(i,T) + \epsilon\sum_{t=1}^T s_t

This results states that, for a given item ii, one of the K entries (i.e. C[1,h1(i)],,C[K,hK(i)]C[1,h_1(i)],\dots,C[K,h_K(i)]) in the count table CC reasonably approximate its count in the data stream with high probability. Specifically, this entry is the one with the minimum count among different groups. This result is significant because the space complexity is now fixed – we only need the space to store the size of the count table, which is O(KL)\mathcal O(KL). Even though the count table entry (i.e. the minimum of the entries C[k,hk(i)]C[k,h_k(i)] for k=1,,Kk=1,\dots,K) may overestimate the real count Count(i,T)\text{Count}(i,T), this overestimation is bounded. In particular, each count estimate is bounded by ϵt=1Tst\epsilon\sum_{t=1}^T s_t, which is a fraction of the total size of the items observed so far.

We also have the following result:

Suppose that we use a count-min sketch with K=ln1δK=\lceil \ln\frac{1}{\delta}\rceil hash functions, m=K×eϵm=K \times\lceil \frac{e}{\epsilon}\rceil counters, and a threshold value qq. Let QQ be the total size of the items in the input data stream. Then all the heavy hitters (whose count are more than qq) are included on the list, and any item that corresponds to fewer than qϵQq-\epsilon Q counts is put on the list with probability at most δ\delta.

This results states that if we construct the count table CC in count-min sketch according to the specified δ\delta and ϵ\epsilon, we are guaranteed that false positive (non-heavy hitter appear on the heavy hitter's list) occur only with a small probability.

References

Contact me by E-mail | Github | Linkedin
This work is licensed under CC BY-SA 4.0. Last modified: January 15, 2025.
Website built with Franklin.jl and the Julia language.