Subject: Perspective in Informatics 3 – Fall Semester 2014 Professor: Davood Rafiei
Assignment No.2
HOANG Nguyen Phong
Submitted on: November 25th
ID number: 6930-26-1264
Question 1 (15 Marks)

4.2.1 Suppose we have a stream of tuples with the schema:

Grade (University, courseID, studentID, grade)

Assume universities are unique, but a courseID is unique only within a university (i.e., different universities may have different courses with the same ID, e.g., “CS101”) and likewise, studentIDs are unique only within a university (different universities may assign the same ID to different students). Suppose we want to answer certain queries approximately from a 1/20th sample of the data. For each of the queries below, indicate how you would construct the sample. That is, tell what the key attributes should be.

a) For each university, estimate the average number of students in a course.
In this case, we construct the sample with the key attributes (University, courseID). Sampling on University alone would drop 19 out of 20 universities entirely, whereas sampling whole courses keeps every student of each sampled course, so the course sizes within every university stay intact.

b) Estimate the fraction of students who have a GPA of 3.5 or more.
In this case, we construct the sample with the key attributes (University, studentID), because a studentID identifies a student only within a university. All grades of a sampled student are then kept, so the GPA can be computed.

c) Estimate the fraction of courses where at least half the students got “A.”
In this case, we construct the sample with the key attributes (University, courseID), because a courseID identifies a course only within a university. All grades of a sampled course are then kept.

4.3.1 For the situation of our running example (8 billion bits, 1 billion members of the set S), calculate the false-positive rate if we use three hash functions. What if we use four hash functions?

The notation used below is:
k is the number of hash functions
n is the bit-array length (the length of the Bloom Filter)
m is the number of members of S.
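The probability that a given bit is still 0 after all m members have been inserted is:

\[ \left(1 - \frac{1}{n}\right)^{km} = \left(\left(1 - \frac{1}{n}\right)^{n}\right)^{km/n} \approx e^{-km/n} \]

It follows that the probability that k arbitrarily picked bits are all set (in other words, the probability of a false positive) is:

\[ \left(1 - e^{-km/n}\right)^{k} \]

With n = 8 × 10^9 and m = 10^9 we have m/n = 1/8. Therefore, the false-positive rate is:

Three hash functions: $(1 - e^{-3/8})^{3} = 0.030579 \approx 3.1\%$
Four hash functions: $(1 - e^{-4/8})^{4} = 0.023969 \approx 2.4\%$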
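The two rates above can be checked numerically. The following is a minimal sketch, assuming the approximation (1 - e^(-km/n))^k; the function and parameter names are illustrative and not part of the assignment.

```python
import math

def bloom_false_positive_rate(n_bits: float, m_members: float, k_hashes: int) -> float:
    """Approximate false-positive rate (1 - e^(-k*m/n))^k of a Bloom filter."""
    # Fraction of bits expected to be 1 after inserting all members.
    fraction_of_ones = 1.0 - math.exp(-k_hashes * m_members / n_bits)
    # A false positive needs all k probed bits to be 1.
    return fraction_of_ones ** k_hashes

n = 8e9  # bits in the array (running example)
m = 1e9  # members of the set S
for k in (3, 4):
    print(k, round(bloom_false_positive_rate(n, m, k), 6))
```

Trying other values of k with the same n and m shows how the rate changes with the number of hash functions.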
Question 2 (15 Marks)

4.4.1 Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash functions will all be of the form h(x) = ax + b mod 32 for some a and b. You should treat the result as a 5-bit binary integer. Determine the tail length for each stream element and the resulting estimate of the number of distinct elements.

In this problem, we apply the Flajolet-Martin Algorithm as follows:
We apply a hash function h(x) to the stream element.
The bit string of h(x) will end in some number of 0’s.
Count this number as tail length.
Let R be the maximum tail length of any x seen so far in the stream.
The number of distinct elements seen in the stream is estimated by $2^{R} / \phi$, where $\phi = 0.77351$.
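The steps listed above can be sketched in a few lines. This is only an illustration under the assumptions of this answer: the estimate is 2^R divided by the correction factor 0.77351, hash values are 5-bit integers, and a hash value of 0 is treated as having tail length 5. The names `tail_length` and `fm_estimate` are hypothetical.

```python
PHI = 0.77351  # correction factor used in this assignment

def tail_length(value: int, bits: int = 5) -> int:
    """Number of trailing 0s in the `bits`-bit binary form of `value`."""
    if value == 0:
        return bits  # all-zero string: every bit counts as a trailing 0
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def fm_estimate(stream, a: int, b: int, modulus: int = 32) -> float:
    """Estimate 2^R / PHI for the hash function h(x) = (a*x + b) mod modulus."""
    # R is the maximum tail length over all hashed stream elements.
    max_tail = max(tail_length((a * x + b) % modulus) for x in stream)
    return 2 ** max_tail / PHI

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]
print(fm_estimate(stream, a=3, b=7))
```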
a) h(x) = 2x + 1 mod 32

Element   Hashed value   Binary   Tail length   2^(tail length)
3         7              00111    0             1
1         3              00011    0             1
4         9              01001    0             1
1         3              00011    0             1
5         11             01011    0             1
9         19             10011    0             1
2         5              00101    0             1
6         13             01101    0             1
5         11             01011    0             1
With this hash function, the maximum tail length is R = 0, so the number of distinct elements is estimated to be $2^{R}/\phi = 2^{0}/0.77351 = 1.293 \approx 1$.

b) h(x) = 3x + 7 mod 32

Element   Hashed value   Binary   Tail length   2^(tail length)
3         16             10000    4             16
1         10             01010    1             2
4         19             10011    0             1
1         10             01010    1             2
5         22             10110    1             2
9         2              00010    1             2
2         13             01101    0             1
6         25             11001    0             1
5         22             10110    1             2
With this hash function, the maximum tail length is R = 4, so the number of distinct elements is estimated to be $2^{R}/\phi = 2^{4}/0.77351 = 20.685 \approx 21$.
c) h(x) = 4x mod 32

Element   Hashed value   Binary   Tail length   2^(tail length)
3         12             01100    2             4
1         4              00100    2             4
4         16             10000    4             16
1         4              00100    2             4
5         20             10100    2             4
9         4              00100    2             4
2         8              01000    3             8
6         24             11000    3             8
5         20             10100    2             4
With this hash function, the maximum tail length is R = 4, so the number of distinct elements is estimated to be $2^{R}/\phi = 2^{4}/0.77351 = 20.685 \approx 21$.

To improve the accuracy, the results of the three hash functions are combined. For each hash function, the per-element values of 2^(tail length) (the last column of each table above) are first averaged over the nine stream elements. The median of these three averages is then divided by $\phi$ to estimate the number of distinct elements in the stream:

Hash function   Average of 2^(tail length)
a               1
b               3.2222
c               6.2222

Median of the averages: 3.2222

Distinct elements estimate = $3.2222 / 0.77351 = 4.1657 \approx 4$
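The combination step used in this answer can be reproduced as follows. It is a sketch of the procedure described here (average of 2^(tail length) per hash function, then the median of the averages divided by 0.77351), which is not necessarily the grouping a textbook implementation would use. The variable names are illustrative.

```python
from statistics import median

PHI = 0.77351
stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]
# (a, b) of h(x) = (a*x + b) mod 32 for the three cases of Exercise 4.4.1
hash_params = {"a": (2, 1), "b": (3, 7), "c": (4, 0)}

def tail_length(value: int, bits: int = 5) -> int:
    """Trailing zeros of `value`; an all-zero value counts as `bits` zeros."""
    if value == 0:
        return bits
    # value & -value isolates the lowest set bit; its position is the tail length.
    return (value & -value).bit_length() - 1

averages = {}
for name, (a, b) in hash_params.items():
    powers = [2 ** tail_length((a * x + b) % 32) for x in stream]
    averages[name] = sum(powers) / len(powers)

estimate = median(averages.values()) / PHI
print(averages, estimate)
```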
Although the estimate of 4 is still far from the true number of distinct elements in this stream (which is 7), averaging and then taking the median over all hash functions gives a result closer to the truth than the estimate from any single hash function. If more accuracy is desired, more hash functions are needed, at the price of a higher computation cost.

4.4.2 Do you see any problems with the choice of hash functions in Exercise 4.4.1? What advice could you give someone who was going to use a hash function of the form h(x) = ax + b mod 2^k?

The computations above show that the algorithm is quite sensitive to the parameters of the hash function. The main advice for someone who is going to use h(x) = ax + b mod 2^k as a hash function is to choose an odd value for a. When a is even, the hash function does not give good results whatever b is, as observed in 4.4.1:
When a is even and b is odd, the hash function always returns odd numbers. The binary value therefore always ends in 1, so the tail length is always 0, as in case a).
When a is even and b is also even, the hash function always returns even numbers, so only part of the range is ever used and different elements collide more often than necessary. For instance, in case c) every hashed value is a multiple of 4, and two different elements (1 and 9) receive the same hashed value of 4. This defeats the purpose of a hash function, which is supposed to spread different elements over different values as far as possible.
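The effect of the parity of a can be made visible by listing which values each hash function can produce at all. The sketch below assumes inputs 0 to 31 and k = 5; `hash_range` is an illustrative helper name.

```python
def hash_range(a: int, b: int, k: int = 5) -> set:
    """All values h(x) = (a*x + b) mod 2^k takes for x = 0 .. 2^k - 1."""
    modulus = 2 ** k
    return {(a * x + b) % modulus for x in range(modulus)}

for a, b in [(2, 1), (3, 7), (4, 0)]:
    values = hash_range(a, b)
    parities = sorted({v % 2 for v in values})
    print(f"a={a}, b={b}: {len(values)} of 32 values reachable, parities {parities}")
```

Only the function with odd a reaches all 32 values. The other two use 16 and 8 values respectively, which is why their tail lengths are so constrained.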
Question 3 (20 Marks)

4.5.1 Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?

The frequency moment of a stream is calculated by using the following formula:

\[ F_m = \sum_{i=1}^{N} f_i^{m} \]

where m is the order of the moment, N is the number of distinct elements, and $f_i$ is the number of occurrences of the i-th distinct element.

Element   Occurrences   1st moment   2nd moment   3rd moment
1         3             3            9            27
2         2             2            4            8
3         2             2            4            8
4         2             2            4            8
Total                   F_1 = 9      F_2 = 21     F_3 = 51
The first moment, or the length of the stream is 9.
The second moment (or the surprise number, self-join size, repeat rate, homogeneity index) of the stream is 21.
The third moment of this stream is 51.
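The three moments can be recomputed directly from the definition. This is a small sketch using the stream of Exercise 4.5.1; `frequency_moment` is an illustrative name.

```python
from collections import Counter

def frequency_moment(stream, order: int) -> int:
    """Sum of f_i ** order over the distinct elements of the stream."""
    counts = Counter(stream)  # element -> number of occurrences
    return sum(f ** order for f in counts.values())

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
for order in (1, 2, 3):
    print(order, frequency_moment(stream, order))
```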
4.5.2 If a stream has n elements, of which m are distinct, what are the minimum and maximum possible surprise number, as a function of m and n?

The second moment of a stream measures the repeat rate of its elements, that is, how uneven the distribution of elements in the stream is. The repeat rate is therefore at its minimum when the occurrences are evenly distributed over all m distinct elements, so that each element occurs n/m times (assuming m divides n). Conversely, the repeat rate is at its maximum when the occurrences are concentrated on one particular element, which occurs n - m + 1 times, while each of the other m - 1 elements occurs only once. The minimum and maximum possible surprise numbers in terms of m and n are therefore:

Min $F_2 = m \left(\frac{n}{m}\right)^{2} = \frac{n^{2}}{m}$

Max $F_2 = (n - m + 1)^{2} + (m - 1)$
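The two extreme cases described above can be checked by brute force for small streams. The sketch below assumes the even split for the minimum (generalised to the case where m does not divide n by making the counts as equal as possible) and one dominant element for the maximum. All function names are illustrative.

```python
from itertools import combinations_with_replacement

def f2_bounds_brute_force(n: int, m: int):
    """Min and max of sum(c^2) over all ways to give m elements positive counts summing to n."""
    values = [
        sum(c * c for c in counts)
        for counts in combinations_with_replacement(range(1, n + 1), m)
        if sum(counts) == n
    ]
    return min(values), max(values)

def f2_min(n: int, m: int) -> int:
    q, r = divmod(n, m)  # r elements occur q + 1 times, the others q times
    return r * (q + 1) ** 2 + (m - r) * q ** 2

def f2_max(n: int, m: int) -> int:
    # One element occurs n - m + 1 times, the other m - 1 elements once each.
    return (n - m + 1) ** 2 + (m - 1)

print(f2_bounds_brute_force(9, 4), f2_min(9, 4), f2_max(9, 4))
```

When m divides n, `f2_min` reduces to n^2 / m. For example, n = 8 and m = 4 gives 16.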
4.5.3 Suppose we are given the stream of Exercise 4.5.1, to which we apply the Alon-Matias-Szegedy Algorithm to estimate the surprise number. For each possible value of i, if Xi is a variable starting position i, what is the value of Xi.value?

According to the Alon-Matias-Szegedy Algorithm, we have the following table:

Starting position i   Xi.element   Xi.value
1st                   3            2
2nd                   1            3
3rd                   4            2
4th                   1            2
5th                   3            1
6th                   4            1
7th                   2            2
8th                   1            1
9th                   2            1
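The Xi.value column can be recomputed mechanically: it is the number of times Xi.element occurs from position i to the end of the stream. The sketch below also evaluates the usual per-variable estimate n(2 * X.value - 1) of the Alon-Matias-Szegedy algorithm and its average over all nine variables. The helper name `x_value` is an assumption made for illustration.

```python
stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
n = len(stream)

def x_value(position: int) -> int:
    """X.value for a variable starting at 1-based `position`."""
    element = stream[position - 1]
    # Count occurrences of the element from the starting position onwards.
    return stream[position - 1:].count(element)

values = [x_value(i) for i in range(1, n + 1)]
estimates = [n * (2 * v - 1) for v in values]  # one estimate per variable
average_estimate = sum(estimates) / n
print(values, estimates, average_estimate)
```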
From this table, each variable gives an estimate of the second moment by the following formula:

\[ n \, (2 \cdot X_i.\mathit{value} - 1) \]

where n is the length of the stream. Averaging over all nine variables, the second moment of the stream is estimated as:

\[ \frac{1}{9} \sum_{i=1}^{9} 9 \, (2 \cdot X_i.\mathit{value} - 1) = \frac{27 + 45 + 27 + 27 + 9 + 9 + 27 + 9 + 9}{9} = \frac{189}{9} = 21 \]

This result is exactly the same as the result calculated in 4.5.1 using the formula $F_m = \sum_{i=1}^{N} f_i^{m}$, because we used all 9 possible starting positions for the variables. If fewer than 9 variables are used to save computational cost, the estimate will generally differ from the true value, although it remains correct in expectation when the starting positions are chosen uniformly at random.

Question 4 (10 Marks)

Consider the count sketch (also referred to as FastAGMS) discussed in class; you have seen how the sketch is used to estimate the self-join size of a stream. Now suppose you are given two streams s1 and s2 and are asked to estimate the join size of s1 and s2. How can you use the count-sketch for this estimation? Explain how and why the estimation would work.

The size of the join of two data streams is defined as the inner product of the frequency vectors of the two streams, and it can be estimated from the inner product of their sketch vectors. Let f and g be the sketch tables (one row of counters per hash function) built from s1 and s2, respectively:

Sketch f of s1:

        1      2      3      4      ...   y
h(1)    f_11   f_12   f_13   f_14   ...   f_1y
...     ...    ...    ...    ...    ...   ...
h(x)    f_x1   f_x2   f_x3   f_x4   ...   f_xy

Sketch g of s2:

        1      2      3      4      ...   y
h(1)    g_11   g_12   g_13   g_14   ...   g_1y
...     ...    ...    ...    ...    ...   ...
h(x)    g_x1   g_x2   g_x3   g_x4   ...   g_xy
Here, x is the number of hash functions, y is the width of the sketch (the number of counters per hash function), and f_ij and g_ij are the counters maintained by hash function h(i) for s1 and s2, respectively. Next, the inner product of the corresponding rows is calculated for every hash function:

\[ \sum_{j=1}^{y} f_{ij} \, g_{ij}, \quad i = 1, 2, \ldots, x \]

Finally, the join size is estimated by the median of these x values.

This algorithm works properly because the two data streams are processed with the same set of hash functions, although the elements arrive from different data streams. In other words, although s1 and s2 are processed separately in two different tables, the same (and ideal*) set of hash functions sends an identical element to the corresponding position in both sketch tables. The product of the two counters at that position therefore contains the product of the frequencies of that element in s1 and s2, and summing over all positions gives the inner product of the two frequency vectors, which is the join size of s1 and s2. When different elements do collide in a counter, they add cross terms to the product, but with the random +1/-1 signs used by the count sketch these terms cancel in expectation, and taking the median over the hash functions limits the effect of an unlucky row.

*An ideal hash function is one that generates different values for different elements. In other words, a hash function is ideal when it does not cause collisions.
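The estimation procedure described above can be sketched as follows. This is a simplified illustration, not the implementation discussed in class: the linear hash family modulo a prime, the seed, the sketch width and all function names are assumptions, and the sign hash used here is weaker than the four-wise independent family the analysis of the count sketch normally requires.

```python
import random
from statistics import median

P = 2_147_483_647  # a large prime for the illustrative hash family (assumption)

def make_hashes(rows: int, seed: int = 0):
    """One (a, b, c, d) tuple per row: (a, b) picks the bucket, (c, d) picks the sign."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(1, P) for _ in range(4)) for _ in range(rows)]

def build_sketch(stream, hashes, width: int):
    """Count sketch: every row adds +1 or -1 to one bucket per stream element."""
    table = [[0] * width for _ in hashes]
    for v in stream:
        for row, (a, b, c, d) in zip(table, hashes):
            bucket = ((a * v + b) % P) % width
            sign = 1 if ((c * v + d) % P) % 2 == 0 else -1
            row[bucket] += sign
    return table

def join_size_estimate(sketch_f, sketch_g):
    """Median over rows of the row-wise inner products of the two sketches."""
    products = [
        sum(x * y for x, y in zip(row_f, row_g))
        for row_f, row_g in zip(sketch_f, sketch_g)
    ]
    return median(products)

hashes = make_hashes(rows=5)  # both streams must share these hash functions
sketch_1 = build_sketch([7, 7, 7, 2, 5], hashes, width=16)
sketch_2 = build_sketch([7, 7, 5, 9], hashes, width=16)
print(join_size_estimate(sketch_1, sketch_2))
```

For these two streams the true join size is 3 * 2 + 1 * 1 = 7. The printed estimate equals it only when no colliding elements distort the median, which is why several rows and a sufficiently wide sketch are used in practice.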