Subject: Perspective in Informatics 3 – Fall Semester 2014 Professor: Davood Rafiei

Assignment No.2

HOANG Nguyen Phong

Submitted on: November 25th

ID number: 6930-26-1264

Question 1 (15 Marks)

4.2.1 Suppose we have a stream of tuples with the schema:

Grade (University, courseID, studentID, grade)

Assume universities are unique, but a courseID is unique only within a university (i.e., different universities may have different courses with the same ID, e.g., “CS101”) and likewise, studentIDs are unique only within a university (different universities may assign the same ID to different students). Suppose we want to answer certain queries approximately from a 1/20th sample of the data. For each of the queries below, indicate how you would construct the sample. That is, tell what the key attributes should be.

a) For each university, estimate the average number of students in a course.

In this case, we construct the sample with the university as the key attribute.

b) Estimate the fraction of students who have a GPA of 3.5 or more.

In this case, we construct the sample with the student as the key attribute. Since studentIDs are unique only within a university, a student is identified by the pair (University, studentID).

c) Estimate the fraction of courses where at least half the students got “A.”

In this case, we construct the sample with the course as the key attribute. Since courseIDs are unique only within a university, a course is identified by the pair (University, courseID).

4.3.1 For the situation of our running example (8 billion bits, 1 billion members of the set S), calculate the false-positive rate if we use three hash functions. What if we use four hash functions?

Notation:

k is the number of hash functions



n is the bit-array length (the length of the Bloom Filter)



m is the number of members of S.

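With this notation, the probability that a given bit is still 0 after m inserts is:

$$\left(1 - \frac{1}{n}\right)^{km} = \left(1 - \frac{1}{n}\right)^{n \cdot \frac{km}{n}} \approx e^{-\frac{km}{n}}$$

It follows that the probability that k arbitrarily picked bits are all set (in other words, the probability of a false positive) is:

$$\left(1 - e^{-\frac{km}{n}}\right)^{k}$$

Therefore, with n = 8 billion and m = 1 billion (so m/n = 1/8), the false-positive rate is:

Three hash functions (k = 3): $(1 - e^{-3/8})^3 = 0.030579 \approx 3.1\%$

Four hash functions (k = 4): $(1 - e^{-4/8})^4 = 0.023969 \approx 2.4\%$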
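The two rates above can be reproduced numerically. The following is a minimal sketch that assumes the approximation (1 - e^(-km/n))^k derived above; the function name `bloom_false_positive_rate` and its parameter names are illustrative only.

```python
import math


def bloom_false_positive_rate(n_bits: int, n_members: int, k_hashes: int) -> float:
    """Approximate false-positive rate (1 - e^(-k*m/n))^k of a Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_members / n_bits)) ** k_hashes


N_BITS = 8_000_000_000     # n: length of the bit array
N_MEMBERS = 1_000_000_000  # m: number of members of S

for k in (3, 4):
    rate = bloom_false_positive_rate(N_BITS, N_MEMBERS, k)
    print(f"k = {k}: false-positive rate = {rate:.6f}")
```

Evaluating the same function for other values of k is a quick way to see how the rate changes with the number of hash functions.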


Question 2 (15 Marks)

4.4.1 Suppose our stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Our hash functions will all be of the form h(x) = ax + b mod 32 for some a and b. You should treat the result as a 5-bit binary integer. Determine the tail length for each stream element and the resulting estimate of the number of distinct elements.

In this problem, we apply the Flajolet-Martin algorithm as follows:

We apply a hash function h(x) to the stream element.



The bit string of h(x) will end in some number of 0’s.



Count this number as tail length.



Let R be the maximum tail length of any x seen so far in the stream.



The number of distinct elements seen in the stream is estimated by $2^R / \phi$, where $\phi = 0.77351$.
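The steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the names `tail_length` and `fm_estimate` are made up for this example, and a hash value of 0 is assumed to have a tail length equal to the bit width (this case does not occur for the stream and hash functions used here).

```python
PHI = 0.77351  # correction factor used in the text


def tail_length(value: int, bits: int = 5) -> int:
    """Number of trailing zeros in the binary form of `value`.

    Assumption: a value of 0 is treated as having tail length `bits`.
    """
    if value == 0:
        return bits
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def fm_estimate(stream, a, b, modulus=32):
    """Return (R, 2^R / PHI) for the hash h(x) = (a*x + b) mod modulus."""
    max_tail = max(tail_length((a * x + b) % modulus) for x in stream)
    return max_tail, 2 ** max_tail / PHI


stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]
for a, b in [(2, 1), (3, 7), (4, 0)]:
    r, estimate = fm_estimate(stream, a, b)
    print(f"h(x) = {a}x + {b} mod 32: R = {r}, estimate = {estimate:.3f}")
```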

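a) h(x) = 2x + 1 mod 32

| Element | Hashed value | Binary | Tail length (R) | Number of distinct elements = 2^R |
|---------|--------------|--------|-----------------|-----------------------------------|
| 3 | 7  | 00111 | 0 | 1 |
| 1 | 3  | 00011 | 0 | 1 |
| 4 | 9  | 01001 | 0 | 1 |
| 1 | 3  | 00011 | 0 | 1 |
| 5 | 11 | 01011 | 0 | 1 |
| 9 | 19 | 10011 | 0 | 1 |
| 2 | 5  | 00101 | 0 | 1 |
| 6 | 13 | 01101 | 0 | 1 |
| 5 | 11 | 01011 | 0 | 1 |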

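With this hash function, the maximum tail length is R = 0, so the number of distinct elements is estimated to be $2^R/\phi = 2^0/0.77351 = 1.293 \approx 1$.

b) h(x) = 3x + 7 mod 32

| Element | Hashed value | Binary | Tail length (R) | Number of distinct elements = 2^R |
|---------|--------------|--------|-----------------|-----------------------------------|
| 3 | 16 | 10000 | 4 | 16 |
| 1 | 10 | 01010 | 1 | 2 |
| 4 | 19 | 10011 | 0 | 1 |
| 1 | 10 | 01010 | 1 | 2 |
| 5 | 22 | 10110 | 1 | 2 |
| 9 | 2  | 00010 | 1 | 2 |
| 2 | 13 | 01101 | 0 | 1 |
| 6 | 25 | 11001 | 0 | 1 |
| 5 | 22 | 10110 | 1 | 2 |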


With this hash function, the maximum tail length is R = 4, so the number of distinct elements is estimated to be $2^R/\phi = 2^4/0.77351 = 20.684 \approx 21$.

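c) h(x) = 4x mod 32

| Element | Hashed value | Binary | Tail length (R) | Number of distinct elements = 2^R |
|---------|--------------|--------|-----------------|-----------------------------------|
| 3 | 12 | 01100 | 2 | 4 |
| 1 | 4  | 00100 | 2 | 4 |
| 4 | 16 | 10000 | 4 | 16 |
| 1 | 4  | 00100 | 2 | 4 |
| 5 | 20 | 10100 | 2 | 4 |
| 9 | 4  | 00100 | 2 | 4 |
| 2 | 8  | 01000 | 3 | 8 |
| 6 | 24 | 11000 | 3 | 8 |
| 5 | 20 | 10100 | 2 | 4 |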

With this hash function, the maximum tail length is R = 4, so the number of distinct elements is estimated to be $2^R/\phi = 2^4/0.77351 = 20.684 \approx 21$.

To improve the accuracy of the algorithm, the values of 2^R obtained for the individual stream elements are first averaged for each hash function. Then the median of these averages is used to estimate the number of distinct elements in the stream, as follows:

| Hash function | Average of 2^R |
|---------------|----------------|
| a | 1 |
| b | 3.2222 |
| c | 6.2222 |

Median of the averages of 2^R: 3.2222

Distinct-elements estimate = 3.2222 / 0.77351 = 4.1657 ≈ 4
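The averaging and median step used here can be reproduced with the short sketch below. It follows the procedure of this answer (average 2^R over the stream elements for each hash function, then take the median of the averages and divide by 0.77351); the helper names are illustrative, and treating a hash value of 0 as tail length `bits` is an assumption that does not come into play for this stream.

```python
from statistics import median

PHI = 0.77351
STREAM = [3, 1, 4, 1, 5, 9, 2, 6, 5]
# (a, b) pairs for h(x) = (a*x + b) mod 32, labelled as in the text
HASHES = {"a": (2, 1), "b": (3, 7), "c": (4, 0)}


def tail_length(value, bits=5):
    """Trailing zeros of `value`; a value of 0 is assumed to give `bits`."""
    if value == 0:
        return bits
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count


def average_power(stream, a, b, modulus=32):
    """Average of 2^(tail length) over all stream elements for one hash function."""
    powers = [2 ** tail_length((a * x + b) % modulus) for x in stream]
    return sum(powers) / len(powers)


averages = {name: average_power(STREAM, a, b) for name, (a, b) in HASHES.items()}
estimate = median(averages.values()) / PHI

for name, avg in averages.items():
    print(f"hash function {name}: average of 2^R = {avg:.4f}")
print(f"estimate = {estimate:.4f}")
```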

Although the estimate of 4 elements is still far from the true number of distinct elements in this stream (which is 7), averaging and taking the median over all hash functions gives a better result than any single hash function on its own. If more accuracy is desired, more hash functions are needed, at the price of a higher computation cost; the estimate can then get closer to the true number of distinct elements.

4.4.2 Do you see any problems with the choice of hash functions in Exercise 4.4.1? What advice could you give someone who was going to use a hash function of the form h(x) = ax + b mod 2^k?

The computations above show that the algorithm is quite sensitive to the parameters of the hash function. The main advice for someone who is going to use h(x) = ax + b mod 2^k as a hash function is that the parameter a should be odd. When a is even, whatever b is, the hash function does not give good results, as observed in 4.4.1:




When a is even and b is odd, the hash function always returns odd numbers. The binary value therefore always ends in 1, so the tail length is always 0, as in case a).

When a is even and b is also even, the hashed values are badly affected because different elements can collide. For instance, in case c) two different elements get the same hashed value (both 1 and 9 hash to 4). This violates the basic expectation for a hash function: “with different elements, the hash function is supposed to generate different values”.
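The effect of the parity of a can be checked directly by hashing every 5-bit input. The sketch below is illustrative only (`hash_profile` is a made-up helper); it counts how many distinct values h(x) = ax + b mod 2^k produces over all inputs 0 .. 2^k - 1 and whether every value is odd.

```python
def hash_profile(a, b, k=5):
    """Return (number of distinct hash values, whether all values are odd)
    for h(x) = (a*x + b) mod 2^k over all x in [0, 2^k)."""
    modulus = 2 ** k
    values = [(a * x + b) % modulus for x in range(modulus)]
    return len(set(values)), all(v % 2 == 1 for v in values)


for a, b in [(2, 1), (3, 7), (4, 0)]:
    distinct, all_odd = hash_profile(a, b)
    print(f"a = {a}, b = {b}: {distinct} distinct values out of 32, all odd: {all_odd}")
```

With an odd a the 32 inputs map to 32 different values, whereas an even a maps them onto a smaller set, which is the collision problem described above.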

Question 3 (20 Marks)

4.5.1 Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2. What is the third moment of this stream?

The frequency moment of a stream is calculated by using the following formula:

$$F_m = \sum_{i=1}^{N} f_i^{\,m}$$

where m is the order of the moment and $f_i$ is the number of occurrences of the i-th distinct element.

| Element | Occurrences | 1st moment | 2nd moment | 3rd moment |
|---------|-------------|------------|------------|------------|
| 1 | 3 | 3 | 9 | 27 |
| 2 | 2 | 2 | 4 | 8 |
| 3 | 2 | 2 | 4 | 8 |
| 4 | 2 | 2 | 4 | 8 |
| Total | 9 | $F_1 = 9$ | $F_2 = 21$ | $F_3 = 51$ |



The first moment, or the length of the stream is 9.



The second moment (or the surprise number, self-join size, repeat rate, homogeneity index) of the stream is 21.



The third moment of this stream is 51.
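The three moments can be recomputed with a few lines of Python. This is a minimal sketch; `frequency_moment` is an illustrative name, and it simply sums the m-th powers of the occurrence counts.

```python
from collections import Counter


def frequency_moment(stream, order):
    """Sum of (number of occurrences)^order over the distinct elements."""
    counts = Counter(stream)
    return sum(count ** order for count in counts.values())


stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
for order in (1, 2, 3):
    print(f"F_{order} = {frequency_moment(stream, order)}")
```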

4.5.2 If a stream has n elements, of which m are distinct, what are the minimum and maximum possible surprise number, as a function of m and n?

The second moment of a stream measures the repeat rate of its elements; in other words, it measures how uneven the distribution of elements in the stream is. As a result, the surprise number is at its minimum when the occurrences are distributed evenly over all m distinct elements. Conversely, it is at its maximum when the occurrences are concentrated on one particular element, while each of the other m − 1 elements occurs only once. The minimum and maximum possible surprise numbers in terms of m and n are therefore:

$$\min F_2 = m \left(\frac{n}{m}\right)^{2} = \frac{n^2}{m} \quad \text{(attained exactly when } m \text{ divides } n\text{)}$$

$$\max F_2 = (n - m + 1)^2 + (m - 1)$$
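The two extreme configurations described in this answer (evenly spread occurrences for the minimum; one heavy element plus m − 1 singletons for the maximum) correspond to the closed forms n²/m and (n − m + 1)² + (m − 1). A brute-force check over all ways of splitting n occurrences among m distinct elements is sketched below; `surprise_range` is a hypothetical helper and is only practical for small n and m.

```python
from itertools import combinations_with_replacement


def surprise_range(n, m):
    """Brute-force (min, max) of sum(f_i^2) over counts f_1..f_m >= 1 with sum n."""
    values = [
        sum(f * f for f in counts)
        for counts in combinations_with_replacement(range(1, n + 1), m)
        if sum(counts) == n
    ]
    return min(values), max(values)


for n, m in [(8, 4), (9, 4)]:
    low, high = surprise_range(n, m)
    print(f"n = {n}, m = {m}: min = {low} (n^2/m = {n * n / m}), "
          f"max = {high} ((n-m+1)^2 + (m-1) = {(n - m + 1) ** 2 + (m - 1)})")
```

When m does not divide n, as for n = 9 and m = 4, the true minimum is slightly above n²/m because the counts must be integers; for that case the minimum equals 21, which is the surprise number of the stream in 4.5.1.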


4.5.3 Suppose we are given the stream of Exercise 4.5.1, to which we apply the Alon-Matias-Szegedy Algorithm to estimate the surprise number. For each possible value of i, if Xi is a variable starting at position i, what is the value of Xi.value?

According to the Alon-Matias-Szegedy Algorithm, we have the following table:

| Starting position i | Xi.element | Xi.value |
|---------------------|------------|----------|
| 1st | 3 | 2 |
| 2nd | 1 | 3 |
| 3rd | 4 | 2 |
| 4th | 1 | 2 |
| 5th | 3 | 1 |
| 6th | 4 | 1 |
| 7th | 2 | 2 |
| 8th | 1 | 1 |
| 9th | 2 | 1 |
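The Xi.value column can be recomputed by counting, for each starting position, how often that element occurs from that position to the end of the stream. The sketch below also averages the standard AMS per-variable estimate n(2 · X.value − 1); the function names and the optional `positions` argument are illustrative assumptions.

```python
STREAM = [3, 1, 4, 1, 3, 4, 2, 1, 2]


def ams_variables(stream):
    """For each start position, return (element, occurrences from that position onward)."""
    return [(x, stream[i:].count(x)) for i, x in enumerate(stream)]


def ams_estimate(stream, positions=None):
    """Average of n * (2 * X.value - 1) over the chosen 0-based start positions."""
    n = len(stream)
    variables = ams_variables(stream)
    if positions is None:
        positions = range(n)  # use every possible starting position
    estimates = [n * (2 * variables[i][1] - 1) for i in positions]
    return sum(estimates) / len(estimates)


for position, (element, value) in enumerate(ams_variables(STREAM), start=1):
    print(f"position {position}: element = {element}, value = {value}")
print("estimate using all positions:", ams_estimate(STREAM))
print("estimate using positions 1, 4, 7:", ams_estimate(STREAM, positions=[0, 3, 6]))
```

Passing a subset of positions shows how the estimate moves away from the exact second moment when fewer variables are kept.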

From this table, we can derive an estimate of the second moment from each variable by the following formula:

$$n\,(2 \cdot X_i.\text{value} - 1)$$

where n is the length of the stream.

Therefore, averaging over all 9 variables, the second moment of the stream is estimated as:

$$\frac{1}{9}\sum_{i=1}^{9} 9\,(2 \cdot X_i.\text{value} - 1) = 2 \cdot 15 - 9 = 21$$

This result is exactly the same as the result calculated in 4.5.1 using the formula $F_m = \sum_{i=1}^{N} f_i^{\,m}$, because we used all 9 possible starting positions for the variables. If we use fewer than 9 variables to save computational cost, the result may differ slightly from the true value, but the error is still acceptable.

Question 4 (10 Marks)

Consider the count sketch (also referred to as FastAGMS) discussed in class; you have seen how the sketch is used to estimate the self-join size of a stream. Now suppose you are given two streams s1 and s2 and are asked to estimate the join size of s1 and s2. How can you use the count-sketch for this estimation? Explain how and why the estimation would work.

The join size of two data streams is defined as the inner product of the frequency vectors of the two streams, which can be estimated by the inner product of their sketch vectors. Let $\vec{f}$ and $\vec{g}$ be the frequency vectors of the two data streams s1 and s2, sketched as follows:

Sketch of s1:

|      | 1 | 2 | 3 | 4 | … | y |
|------|---|---|---|---|---|---|
| h(1) | $f_{11}$ | $f_{12}$ | $f_{13}$ | $f_{14}$ | … | $f_{1y}$ |
| ⋮    | ⋮ | ⋮ | ⋮ | ⋮ | … | ⋮ |
| h(x) | $f_{x1}$ | $f_{x2}$ | $f_{x3}$ | $f_{x4}$ | … | $f_{xy}$ |

Sketch of s2:

|      | 1 | 2 | 3 | 4 | … | y |
|------|---|---|---|---|---|---|
| h(1) | $g_{11}$ | $g_{12}$ | $g_{13}$ | $g_{14}$ | … | $g_{1y}$ |
| ⋮    | ⋮ | ⋮ | ⋮ | ⋮ | … | ⋮ |
| h(x) | $g_{x1}$ | $g_{x2}$ | $g_{x3}$ | $g_{x4}$ | … | $g_{xy}$ |


Here, x is the number of hash functions, y is the length of each sketch row (the number of counters per hash function), and f and g are the occurrence counts of the elements as accumulated by the hash functions. Next, the inner product of the corresponding rows is calculated for every hash function:

$$h(1)_f \cdot h(1)_g,\; h(2)_f \cdot h(2)_g,\; h(3)_f \cdot h(3)_g,\; h(4)_f \cdot h(4)_g,\; \ldots,\; h(x)_f \cdot h(x)_g$$

Finally, the join size is estimated by the median of the above set.

This algorithm works properly because the two data streams are processed with the same set of hash functions, although the elements arrive from different data streams. In other words, although s1 and s2 are processed separately in two different tables, the same (and ideal*) set of hash functions maps an identical element to the same position in both sketch tables. As a result, the inner product of the two tables (or the two frequency vectors) properly reflects the join size of the two data streams s1 and s2.

*An ideal hash function is one that generates a different value for each different element. In other words, a hash function is ideal when it causes no collisions.
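The procedure described in this answer can be sketched as follows. Several details are assumptions made for the illustration and are not spelled out in the answer: items are integers, the bucket hash has the form (a·x + b) mod width with an odd a and a power-of-two width, and each row also uses a ±1 sign hash, as in the usual count sketch (here a simple linear hash, not a 4-wise independent one). All function names are hypothetical.

```python
import random
from statistics import median

PRIME = 2_147_483_647  # modulus for the sign hash (an arbitrary choice)


def make_hashes(depth, width, seed=0):
    """Draw (a, b, c, d) per row; `width` is assumed to be a power of two."""
    rng = random.Random(seed)
    params = []
    for _ in range(depth):
        a = 2 * rng.randrange(width // 2) + 1  # odd multiplier for the bucket hash
        b = rng.randrange(width)
        c = rng.randrange(1, PRIME)
        d = rng.randrange(PRIME)
        params.append((a, b, c, d))
    return params


def build_sketch(stream, params, width):
    """One row of `width` counters per hash function; both streams must share `params`."""
    sketch = [[0] * width for _ in params]
    for item in stream:
        for row, (a, b, c, d) in enumerate(params):
            bucket = (a * item + b) % width
            sign = 1 if ((c * item + d) % PRIME) % 2 == 0 else -1
            sketch[row][bucket] += sign
    return sketch


def estimate_join_size(sketch1, sketch2):
    """Median over rows of the inner product of corresponding rows."""
    row_products = [
        sum(u * v for u, v in zip(row1, row2))
        for row1, row2 in zip(sketch1, sketch2)
    ]
    return median(row_products)


WIDTH = 16
params = make_hashes(depth=5, width=WIDTH)
s1 = [1, 2, 2, 3]
s2 = [2, 2, 3, 3, 4]
join_estimate = estimate_join_size(
    build_sketch(s1, params, WIDTH), build_sketch(s2, params, WIDTH)
)
print("estimated join size:", join_estimate)
```

In this small usage example every item is below the row width, so the odd-multiplier bucket hash produces no collisions and each row product equals the true join size (2·2 + 1·2 = 6). With more distinct items than counters, collisions occur and the median over rows gives an approximation instead.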
