Imagine we need to process a data stream and track frequently occurring (or large) elements, often called heavy hitters. The items in the data stream are given as pairs , where denotes the time point, is the item encoded in some form, and represents the size of . In some cases, we may not have enough space to store all the items (such as in routers for networking; or k-mers in genomics, which require exponential space), so we need a smart way to store only the important or frequently occurring elements.
This is where a probabilistic data structure called the count-min sketch comes into play. Briefly put, the count-min sketch consists of a count table whose entries allow us to approximate the counts of the most frequent items that pass through the data stream. Additionally, a count-min sketch consists of universal hash functions , whose domain is the space of the items and whose range is the column indices of the table . Note that universal hash functions can be constructed easily (e.g. see Universal hash functions).
To construct the count table , we define it to have rows and columns, where each row in the table sometimes referred to as a group. Each entry in the table is called a counter, where and . Now, we will define
as the total count of item up to time . Next, assume that
The data stream came in discrete time points
As each point came in, we compute the hash values for
We increment the each entry , for in the count table by
We now have the following result:
Let an error tolerance parameter be given and a count table constructed as above. Fix as an item from the input space. Then, with probability over the choice of hash functions, for a sequence of items , we have that
This results states that, for a given item , one of the K entries (i.e. ) in the count table reasonably approximate its count in the data stream with high probability. Specifically, this entry is the one with the minimum count among different groups. This result is significant because the space complexity is now fixed – we only need the space to store the size of the count table, which is . Even though the count table entry (i.e. the minimum of the entries for ) may overestimate the real count , this overestimation is bounded. In particular, each count estimate is bounded by , which is a fraction of the total size of the items observed so far.
We also have the following result:
This results states that if we construct the count table in count-min sketch according to the specified and , we are guaranteed that false positive (non-heavy hitter appear on the heavy hitter's list) occur only with a small probability.