Hadoop Design and k-Means Clustering

Kenneth Heafield, Google Inc

January 15, 2008

Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.


Outline

Hadoop Design
    1. Fault Tolerance
    2. Data Flow: Input, Output
    3. MapTask: Map, Partition
    4. ReduceTask: Fetch and Sort, Reduce

Later in this talk: Performance and k-Means Clustering


Fault Tolerance

Managing Tasks

[Diagram: JobTracker, TaskTrackers, and their MapTask and ReduceTask instances]

Design
    TaskTracker reports status or requests work every 10 seconds
    MapTask and ReduceTask report progress every 10 seconds
Issues
    + Detects failures and slow workers quickly
    - JobTracker is a single point of failure


Coping With Failure

Failed Tasks
    Rerun map and reduce as necessary.
Slow Tasks
    Start a second backup instance of the same task.
Consistency
    Any MapTask or ReduceTask might be run multiple times.
    Map and Reduce should be functional: the same input must always produce the same output.


Use of Random Numbers

Purpose
    Support randomized algorithms while remaining consistent

Sampling Mapper

    private Random rand = new Random();

    // Seed from the task's partition number so that a rerun of the same
    // task draws the same sequence and emits the same sample.
    public void configure(JobConf conf) {
      rand.setSeed((long) conf.getInt("mapred.task.partition", 0));
    }

    // Keep roughly 10% of the input records.
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      if (rand.nextFloat() < 0.1) {
        output.collect(key, value);
      }
    }
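
The consistency argument can be checked without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class SeededSampler and its sample method are illustrative names, and the partition argument stands in for the value read from mapred.task.partition. It suggests that two runs with the same seed keep exactly the same records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SeededSampler {
  // Returns the indices of the records kept by a roughly 10% sample.
  // The partition number is used as the seed, as in the Sampling Mapper.
  static List<Integer> sample(int partition, int numRecords) {
    Random rand = new Random();
    rand.setSeed((long) partition);
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < numRecords; i++) {
      if (rand.nextFloat() < 0.1) {
        kept.add(i);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Integer> first = sample(3, 1000);
    List<Integer> rerun = sample(3, 1000);  // simulated rerun of the same task
    System.out.println("rerun matches: " + first.equals(rerun));
  }
}
```

An unseeded generator would give a different sample on each rerun, so a backup instance of a task could disagree with the original.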


Data Flow

HDFS Input: InputFormat splits and reads files
Mapper
Local Output: SequenceFileOutputFormat writes serialized values
HTTP Input: Map outputs are retrieved over HTTP and merged
Reduce
HDFS Output: OutputFormat writes a SequenceFile or text


Input

InputSplit

Purpose
    Locate a single map task’s input.
Important Functions
    Path FileSplit.getPath();
Implementations
    MultiFileSplit is a list of small files to be concatenated.
    FileSplit is a file path, offset, and length.
    TableSplit is a table name, start row, and end row.
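
To make the "path, offset, and length" description concrete, here is a minimal sketch in plain Java of cutting one file into consecutive splits. The Split class is a hypothetical stand-in for FileSplit, and the fixed splitSize parameter is an assumption; the way Hadoop actually chooses split sizes is not shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
  // Hypothetical stand-in for FileSplit: a file path, offset, and length.
  static final class Split {
    final String path;
    final long offset;
    final long length;

    Split(String path, long offset, long length) {
      this.path = path;
      this.offset = offset;
      this.length = length;
    }
  }

  // Cuts a file of fileLength bytes into consecutive pieces of at most splitSize bytes.
  static List<Split> splits(String path, long fileLength, long splitSize) {
    if (splitSize <= 0) {
      throw new IllegalArgumentException("splitSize must be positive");
    }
    List<Split> out = new ArrayList<>();
    for (long offset = 0; offset < fileLength; offset += splitSize) {
      // The last piece may be shorter than splitSize.
      out.add(new Split(path, offset, Math.min(splitSize, fileLength - offset)));
    }
    return out;
  }

  public static void main(String[] args) {
    for (Split s : splits("ratings.txt", 250, 100)) {
      System.out.println(s.path + " offset=" + s.offset + " length=" + s.length);
    }
  }
}
```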


RecordReader

Purpose
    Parse input specified by InputSplit into keys and values.
    Handle records on split boundaries.
Important Functions
    boolean next(Writable key, Writable value);
Implementations
    LineRecordReader reads lines. Key is an offset, value is the text.
    KeyValueLineRecordReader reads delimited key-value pairs.
    SequenceFileRecordReader reads a SequenceFile, Hadoop’s binary representation of key-value pairs.
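
One way to handle lines that straddle a split boundary is sketched below in plain Java over an in-memory string. This is an illustration of the idea, not LineRecordReader itself: the convention assumed here is that a line belongs to the split containing its first character, so a reader skips a partial first line and finishes a last line even if it runs past the end of the split. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class LineSplitSketch {
  // Returns the lines whose first character lies in [start, end).
  static List<String> readSplit(String data, int start, int end) {
    int pos = start;
    if (start != 0) {
      // Back up one character and discard through the next newline.
      // If start is exactly at a line beginning, only the previous
      // newline is discarded; otherwise the partial line is skipped,
      // because the previous split's reader will finish it.
      int newline = data.indexOf('\n', start - 1);
      pos = (newline < 0) ? data.length() : newline + 1;
    }
    List<String> lines = new ArrayList<>();
    while (pos < end && pos < data.length()) {
      int newline = data.indexOf('\n', pos);
      int stop = (newline < 0) ? data.length() : newline;
      lines.add(data.substring(pos, stop));  // may read past end to finish the line
      pos = stop + 1;
    }
    return lines;
  }

  public static void main(String[] args) {
    String data = "alpha\nbeta\ngamma\n";
    System.out.println(readSplit(data, 0, 8));
    System.out.println(readSplit(data, 8, data.length()));
  }
}
```

Under this convention every line is read by exactly one split, wherever the cut falls.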


InputFormat

Purpose
    Specifies input file format by constructing InputSplit and RecordReader.
Important Functions
    RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter);
    InputSplit[] getSplits(JobConf job, int numSplits);
Implementations
    TextInputFormat reads text files.
    TableInputFormat reads from a table.


Output

OutputFormat

Purpose
    Machine or human readable output.
    Makes RecordWriter, which is analogous to RecordReader.
Important Functions
    RecordWriter getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress);
Formats
    SequenceFileOutputFormat writes a binary SequenceFile.
    TextOutputFormat writes text files.


MapTask

MapTask Default Setup

InputFormat: Split files and read records
MapRunnable: Map all records in the task
Mapper: Map a record
OutputCollector: Consult Partitioner and save files
Partitioner: Assign key-value pairs to reducers
Reducer: Reducers retrieve files over HTTP


Map

MapRunnable

Purpose
    Sequence of map operations
Default Implementation

    public void run(RecordReader input, OutputCollector output,
                    Reporter reporter) throws IOException {
      try {
        // Allocate one key and one value object; next() refills them.
        WritableComparable key = input.createKey();
        Writable value = input.createValue();
        while (input.next(key, value)) {
          mapper.map(key, value, output, reporter);
        }
      } finally {
        mapper.close();
      }
    }


Mapper

Purpose
    Single map operation
Important Functions
    void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter);
Pre-defined Mappers
    IdentityMapper
    InverseMapper flips key and value.
    RegexMapper matches regular expressions set in job.
    TokenCountMapper implements word count map.


Partition

Partitioner

Purpose
    Decide which reducer handles map output.
Important Functions
    int getPartition(WritableComparable key, Writable value, int numReduceTasks);
Implementations
    HashPartitioner uses key.hashCode() % numReduceTasks.
    KeyFieldBasedPartitioner hashes only part of key.
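
The hash rule can be sketched in a few lines of plain Java. This is an illustration with made-up class and method names, not the Hadoop class. One detail is an addition to the formula on the slide: hashCode() may be negative and Java's % keeps the sign of the dividend, so this sketch clears the sign bit before taking the remainder.

```java
class HashPartitionSketch {
  // Maps a key to a reducer index in [0, numReduceTasks).
  static int getPartition(Object key, int numReduceTasks) {
    // Clearing the sign bit keeps the result non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    String[] keys = {"Alien", "Brazil", "Casablanca"};
    for (String key : keys) {
      System.out.println(key + " -> reducer " + getPartition(key, 4));
    }
  }
}
```

Because the partition depends only on the key, every value for a given key reaches the same reducer.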


ReduceTask

Fetch and Sort

Fetch
    TaskTracker tells Reducer where mappers are
    Reducer requests input files from mappers via HTTP
Merge Sort
    Recursively merges 10 files at a time
    100 MB in-memory sort buffer
    Calls key’s Comparator, which defaults to key.compareTo
Important Functions
    int WritableComparable.compareTo(Object o);
    int WritableComparator.compare(WritableComparable a, WritableComparable b);
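
The recursive merge can be illustrated with in-memory lists standing in for sorted files. The sketch below is plain Java under that assumption: merge combines a group of sorted runs with a heap, and mergeAll repeats passes of at most factor runs each until one run remains. The names and the list-based representation are illustrative, not Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class MergeSketch {
  // Merges already-sorted runs into one sorted list using compareTo.
  static <T extends Comparable<T>> List<T> merge(List<List<T>> runs) {
    // Heap entries are {run index, position}, ordered by the element they point at.
    PriorityQueue<int[]> heap = new PriorityQueue<>(
        (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
    for (int r = 0; r < runs.size(); r++) {
      if (!runs.get(r).isEmpty()) {
        heap.add(new int[] {r, 0});
      }
    }
    List<T> out = new ArrayList<T>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      List<T> run = runs.get(top[0]);
      out.add(run.get(top[1]));
      if (top[1] + 1 < run.size()) {
        heap.add(new int[] {top[0], top[1] + 1});
      }
    }
    return out;
  }

  // Repeatedly merges groups of at most `factor` runs until one run is left.
  static <T extends Comparable<T>> List<T> mergeAll(List<List<T>> runs, int factor) {
    if (factor < 2) {
      throw new IllegalArgumentException("factor must be at least 2");
    }
    List<List<T>> current = new ArrayList<List<T>>(runs);
    while (current.size() > 1) {
      List<List<T>> next = new ArrayList<List<T>>();
      for (int i = 0; i < current.size(); i += factor) {
        next.add(merge(current.subList(i, Math.min(i + factor, current.size()))));
      }
      current = next;
    }
    return current.isEmpty() ? new ArrayList<T>() : current.get(0);
  }
}
```

With a factor of 10, as on the slide, 1000 runs would need three passes: 1000 to 100, 100 to 10, and 10 to 1.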


Reduce

Important Functions
    void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter);
Pre-defined Reducers
    IdentityReducer
    LongSumReducer sums LongWritable values
Behavior
    Reduce cannot start until all Mappers finish and their output is merged.


Using Hadoop
    5. Performance: Combiners
    6. k-Means Clustering: Algorithm, Implementation


Performance

Why We Care
    ≥ 10,000 programs
    Average 100,000 jobs/day
    ≥ 20 petabytes/day
Source: Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51 (2008), 107–113.


Barriers

Concept
    Barriers wait for N things to happen
Examples
    Reduce waits for all Mappers to finish
    Job waits for all Reducers to finish
    Search engine assembles pieces of results
Moral
    Worry about the maximum time. This implies balance.
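
The moral can be shown with a little arithmetic. In this minimal sketch (plain Java, invented task times, all tasks assumed to run in parallel), two workloads have the same total work, but the barrier releases only when the slowest task finishes.

```java
class BarrierSketch {
  // Time at which a barrier over parallel tasks releases: the slowest task.
  static double barrierTime(double[] taskSeconds) {
    double max = 0;
    for (double t : taskSeconds) {
      max = Math.max(max, t);
    }
    return max;
  }

  // Total work, for comparison.
  static double totalTime(double[] taskSeconds) {
    double total = 0;
    for (double t : taskSeconds) {
      total += t;
    }
    return total;
  }

  public static void main(String[] args) {
    double[] balanced = {10, 10, 10, 10};
    double[] unbalanced = {1, 1, 1, 37};
    System.out.println("balanced:   total " + totalTime(balanced)
        + ", barrier at " + barrierTime(balanced));
    System.out.println("unbalanced: total " + totalTime(unbalanced)
        + ", barrier at " + barrierTime(unbalanced));
  }
}
```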


Combiners

Combiner

Purpose
    Lessen network traffic by combining repeated keys in MapTask.
Important Functions
    void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter);
Example Implementation
    LongSumReducer adds LongWritable values
Behavior
    Framework decides when to call.
    Uses Reducer interface, but called with partial list of values.
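
The sketch below imitates a summing combiner in plain Java, with maps and lists standing in for the framework; none of these names are Hadoop APIs. The same sumByKey logic runs once per map task on a partial list of values and again in the reduce, which is safe here because addition is associative and commutative.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
  // Summing logic shared by the combiner and the reducer.
  static long sum(Iterator<Long> values) {
    long total = 0;
    while (values.hasNext()) {
      total += values.next();
    }
    return total;
  }

  // Groups (key, value) records by key and sums each group.
  static Map<String, Long> sumByKey(List<Map.Entry<String, Long>> records) {
    Map<String, List<Long>> grouped = new TreeMap<>();
    for (Map.Entry<String, Long> r : records) {
      grouped.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
    }
    Map<String, Long> out = new TreeMap<>();
    for (Map.Entry<String, List<Long>> g : grouped.entrySet()) {
      out.put(g.getKey(), sum(g.getValue().iterator()));
    }
    return out;
  }

  // Word-count style map output: every word is emitted with the value 1.
  static List<Map.Entry<String, Long>> mapOutput(String... words) {
    List<Map.Entry<String, Long>> records = new ArrayList<>();
    for (String w : words) {
      records.add(new AbstractMap.SimpleEntry<String, Long>(w, 1L));
    }
    return records;
  }

  public static void main(String[] args) {
    // Combine inside each map task: 4 records shrink to 2, and 2 stay 2.
    Map<String, Long> task1 = sumByKey(mapOutput("a", "b", "a", "a"));
    Map<String, Long> task2 = sumByKey(mapOutput("b", "a"));
    // Only the combined records cross the network to the reduce.
    List<Map.Entry<String, Long>> shuffled = new ArrayList<>(task1.entrySet());
    shuffled.addAll(task2.entrySet());
    System.out.println(sumByKey(shuffled));
  }
}
```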


Extended Combining

Problem
    1000 map outputs are buffered before combining.
    Keys can still be repeated enough to unbalance a reduce.
Two Phase Reduce
    1. Run a MapReduce to combine values
        Use Partitioner to balance a key over Reducers
        Run Combiner in Mapper and Reducer
    2. Run a MapReduce to reduce values
        Map with IdentityMapper
        Partition normally
        Reduce normally
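
A minimal sketch of the two phases for a counting job, in plain Java. The round-robin assignment in phaseOne is a hypothetical balancing partitioner chosen for illustration; the slide does not say how the partitioner should spread a key. Phase one leaves each reducer with a partial count per key, and phase two adds the partials together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoPhaseSketch {
  // Phase 1: deal records across reducers regardless of key, and let each
  // reducer combine what it receives into partial counts.
  static List<Map<String, Long>> phaseOne(List<String> keys, int numReducers) {
    List<Map<String, Long>> partials = new ArrayList<>();
    for (int r = 0; r < numReducers; r++) {
      partials.add(new TreeMap<String, Long>());
    }
    int i = 0;
    for (String key : keys) {
      int reducer = i++ % numReducers;  // hypothetical balancing partitioner
      partials.get(reducer).merge(key, 1L, Long::sum);
    }
    return partials;
  }

  // Phase 2: partition normally by key and add the partial counts.
  static Map<String, Long> phaseTwo(List<Map<String, Long>> partials) {
    Map<String, Long> totals = new TreeMap<>();
    for (Map<String, Long> partial : partials) {
      for (Map.Entry<String, Long> e : partial.entrySet()) {
        totals.merge(e.getKey(), e.getValue(), Long::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    List<String> keys = new ArrayList<>();
    for (int n = 0; n < 9; n++) {
      keys.add("hot");
    }
    keys.add("cold");
    List<Map<String, Long>> partials = phaseOne(keys, 3);
    System.out.println("phase 1 partials: " + partials);
    System.out.println("phase 2 totals:   " + phaseTwo(partials));
  }
}
```

In phase two the reducer for a hot key receives at most one record per phase-one reducer, rather than one record per occurrence.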


General Advice

Small Work Units
    More inputs than Mappers
    Ideally, more reduce tasks than Reducers
    Too many tasks increases overhead
    Aim for constant-memory Mappers and Reducers
Map Only
    Skip IdentityReducer by setting numReduceTasks to -1
Outside Tables
    Increase HDFS replication before launching
    Keep random access tables in memory
    Use multithreading to share memory


k-Means Clustering

Data

Netflix Data

Goal
    Find similar movies from ratings provided by users
Vector Model
    Give each movie a vector
    Make one dimension per user
    Put origin at average rating (so poor is negative)
    Normalize all vectors to unit length
    Often called cosine similarity
Issues
    - Users are biased in the movies they rate
    + Addresses different numbers of raters
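
A minimal sketch of the vector model in plain Java, using a sparse map from user id to rating for each movie. Two points are assumptions rather than details from the slide: the average is passed in as a parameter (the slide does not say which average is meant), and users who did not rate a movie contribute 0 in that dimension. After normalization, the dot product of two movie vectors is their cosine similarity.

```java
import java.util.HashMap;
import java.util.Map;

class RatingVectors {
  // Centers ratings on the given average and scales the vector to unit length.
  static Map<Integer, Double> toUnitVector(Map<Integer, Double> ratingsByUser, double average) {
    Map<Integer, Double> vector = new HashMap<>();
    double normSquared = 0;
    for (Map.Entry<Integer, Double> e : ratingsByUser.entrySet()) {
      double centered = e.getValue() - average;  // a poor rating becomes negative
      vector.put(e.getKey(), centered);
      normSquared += centered * centered;
    }
    double norm = Math.sqrt(normSquared);
    if (norm == 0) {
      return vector;  // every rating equals the average; leave the zero vector
    }
    for (Map.Entry<Integer, Double> e : vector.entrySet()) {
      e.setValue(e.getValue() / norm);
    }
    return vector;
  }

  // Dot product over the users that rated both movies.
  static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
    double total = 0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        total += e.getValue() * other;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    Map<Integer, Double> movieA = new HashMap<>();
    movieA.put(1, 5.0);
    movieA.put(2, 1.0);
    Map<Integer, Double> movieB = new HashMap<>();
    movieB.put(1, 4.0);
    movieB.put(2, 2.0);
    double similarity = dot(toUnitVector(movieA, 3.0), toUnitVector(movieB, 3.0));
    System.out.println("cosine similarity: " + similarity);
  }
}
```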


Algorithm

k-Means Clustering

[Figure: Two Dimensional Clusters]

Goal
    Cluster similar data points
Approach
    Given data points x[i] and distance d:
    Select k centers c
    Assign x[i] to closest center c[i]
    Minimize $\sum_i d(x[i], c[i])$
    d is sum of squares


Lloyd’s Algorithm

Algorithm
    1. Randomly pick centers, possibly from data points
    2. Assign points to closest center
    3. Average assigned points to obtain new centers
    4. Repeat 2 and 3 until nothing changes
Issues
    - Takes superpolynomial time on some inputs
    - Not guaranteed to find optimal solution
    + Converges quickly in practice
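
The four steps can be sketched on a single machine as follows. This is plain Java with illustrative names, dense double[] points, and squared Euclidean distance. Two choices are assumptions of the sketch: an iteration cap guards against long runs, and a cluster that ends up empty keeps its previous center.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class LloydSketch {
  static double squaredDistance(double[] a, double[] b) {
    double total = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      total += diff * diff;
    }
    return total;
  }

  static int nearest(double[] point, double[][] centers) {
    int best = 0;
    for (int c = 1; c < centers.length; c++) {
      if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
        best = c;
      }
    }
    return best;
  }

  // Step 1: pick k distinct data points as centers, using a seeded generator.
  static double[][] pickCenters(double[][] points, int k, long seed) {
    List<double[]> shuffled = new ArrayList<>(Arrays.asList(points));
    Collections.shuffle(shuffled, new Random(seed));
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) {
      centers[c] = shuffled.get(c).clone();
    }
    return centers;
  }

  // Steps 2 to 4: assign, average, and repeat until no assignment changes.
  static double[][] cluster(double[][] points, double[][] initialCenters, int maxIterations) {
    int k = initialCenters.length;
    int dim = points[0].length;
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) {
      centers[c] = initialCenters[c].clone();
    }
    int[] assignment = new int[points.length];
    Arrays.fill(assignment, -1);
    for (int iter = 0; iter < maxIterations; iter++) {
      boolean changed = false;
      for (int i = 0; i < points.length; i++) {
        int c = nearest(points[i], centers);
        if (c != assignment[i]) {
          assignment[i] = c;
          changed = true;
        }
      }
      if (!changed) {
        break;  // nothing changed, so the centers are final
      }
      double[][] sums = new double[k][dim];
      int[] counts = new int[k];
      for (int i = 0; i < points.length; i++) {
        counts[assignment[i]]++;
        for (int d = 0; d < dim; d++) {
          sums[assignment[i]][d] += points[i][d];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) {
          continue;  // empty cluster keeps its previous center
        }
        for (int d = 0; d < dim; d++) {
          centers[c][d] = sums[c][d] / counts[c];
        }
      }
    }
    return centers;
  }
}
```

A typical call would be cluster(points, pickCenters(points, k, seed), 100); the seed keeps reruns consistent.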


Implementation

Lloyd’s Algorithm in MapReduce

Reformatting Data
    Create a SequenceFile for fast reading. Partition as you see fit.
Initialization
    Use a seeded random number generator to pick initial centers.
Iteration
    Load centers table in MapRunnable or Mapper.
Termination
    Use TextOutputFormat to list movies in each cluster.


Iterative MapReduce

[Diagram: Centers Version i and Points feed the Mappers (Find Nearest Center; Key is Center, Value is Movie). The Reducers (Average Ratings) produce Centers Version i + 1.]


Direct Implementation

Mapper
    Load all centers into RAM off HDFS
    For each movie, measure distance to each center
    Output key identifying the closest center
Reducer
    Output average ratings of movies
Issues
    - Brute force distance and all centers in memory
    - Unbalanced reduce, possibly even for large k
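
One iteration of the direct implementation can be imitated in plain Java, with a TreeMap playing the role of the shuffle. This is a sketch under stated assumptions, not Hadoop code: the names are illustrative, movies are dense double[] vectors, and a center that attracts no movies is carried over unchanged.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DirectKMeansSketch {
  // Map: brute-force search over all centers; the output key is the closest one.
  static int map(double[] movie, double[][] centers) {
    int best = 0;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centers.length; c++) {
      double distance = 0;
      for (int d = 0; d < movie.length; d++) {
        double diff = movie[d] - centers[c][d];
        distance += diff * diff;
      }
      if (distance < bestDistance) {
        bestDistance = distance;
        best = c;
      }
    }
    return best;
  }

  // Reduce: average the ratings of all movies assigned to one center.
  static double[] reduce(Iterator<double[]> movies) {
    double[] sum = movies.next().clone();
    int count = 1;
    while (movies.hasNext()) {
      double[] movie = movies.next();
      for (int d = 0; d < sum.length; d++) {
        sum[d] += movie[d];
      }
      count++;
    }
    for (int d = 0; d < sum.length; d++) {
      sum[d] /= count;
    }
    return sum;
  }

  // One MapReduce pass: centers version i in, centers version i + 1 out.
  static double[][] iterate(double[][] movies, double[][] centers) {
    Map<Integer, List<double[]>> shuffle = new TreeMap<>();
    for (double[] movie : movies) {
      shuffle.computeIfAbsent(map(movie, centers), k -> new ArrayList<>()).add(movie);
    }
    double[][] next = new double[centers.length][];
    for (int c = 0; c < centers.length; c++) {
      List<double[]> group = shuffle.get(c);
      next[c] = (group == null) ? centers[c].clone() : reduce(group.iterator());
    }
    return next;
  }
}
```

The reduce for a popular center receives every movie assigned to it, which is the imbalance listed under Issues.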


Two Phase Reduce Implementation

    1. Combine
        Mapper key identifies closest center, value is point.
        Partitioner balances centers over reducers.
        Combiner and Reducer add and count points.
    2. Recenter
        IdentityMapper
        Reducer averages values
Issues
    + Balanced reduce
    - Two phases
    - Mapper still has all k centers in memory
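
"Add and count points" suggests carrying a running sum together with a count, so that partial results can be merged in any grouping before the final division. A minimal sketch of such a value type in plain Java follows; PartialSum is a hypothetical name, not a Hadoop class.

```java
class PartialSum {
  final double[] sum;
  final long count;

  PartialSum(double[] sum, long count) {
    this.sum = sum;
    this.count = count;
  }

  // A single point is a partial sum with count 1.
  static PartialSum of(double[] point) {
    return new PartialSum(point.clone(), 1);
  }

  // Phase 1 (Combiner and Reducer): add sums and counts. Grouping does not matter.
  PartialSum add(PartialSum other) {
    double[] total = sum.clone();
    for (int d = 0; d < total.length; d++) {
      total[d] += other.sum[d];
    }
    return new PartialSum(total, count + other.count);
  }

  // Phase 2 (Recenter): divide once, after all partials for a center are merged.
  double[] average() {
    double[] mean = new double[sum.length];
    for (int d = 0; d < mean.length; d++) {
      mean[d] = sum[d] / count;
    }
    return mean;
  }

  public static void main(String[] args) {
    PartialSum a = of(new double[] {1, 2});
    PartialSum b = of(new double[] {3, 4});
    PartialSum c = of(new double[] {5, 6});
    double[] center = a.add(b).add(c).average();
    System.out.println(center[0] + ", " + center[1]);
  }
}
```

Averaging partial averages directly would weight groups incorrectly, which is why the count travels with the sum.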


Large k Implementation

Map task
    Responsible for part of movies and part of k centers.
    For each movie, finds closest of known centers.
    Output key is point, value identifies center and distance.
Reducer
    Takes minimum distance center.
    Output key identifies center, value is movie.
Second phase
    Averages points in each center.
Issues
    + Large k while still fitting in RAM
    - Reads data points multiple times
    - Startup and intermediate storage costs
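
The first phase can be sketched in plain Java as follows. Each call to map sees only one part of the centers and reports its best candidate for a movie; reduce then keeps the candidate with the minimum distance. The Candidate class and the use of integer center ids are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

class LargeKSketch {
  // Map output value: which center, and how far away it is.
  static final class Candidate {
    final int centerId;
    final double distance;

    Candidate(int centerId, double distance) {
      this.centerId = centerId;
      this.distance = distance;
    }
  }

  // Map: closest center among the part of the centers this task holds.
  static Candidate map(double[] movie, int[] centerIds, double[][] centers) {
    Candidate best = null;
    for (int i = 0; i < centers.length; i++) {
      double distance = 0;
      for (int d = 0; d < movie.length; d++) {
        double diff = movie[d] - centers[i][d];
        distance += diff * diff;
      }
      if (best == null || distance < best.distance) {
        best = new Candidate(centerIds[i], distance);
      }
    }
    return best;
  }

  // Reduce: for one movie, keep the minimum distance center over all parts.
  static Candidate reduce(Iterator<Candidate> candidates) {
    Candidate best = candidates.next();
    while (candidates.hasNext()) {
      Candidate candidate = candidates.next();
      if (candidate.distance < best.distance) {
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    double[] movie = {9, 2};
    Candidate fromPartA = map(movie, new int[] {0, 1}, new double[][] {{0, 0}, {5, 5}});
    Candidate fromPartB = map(movie, new int[] {2, 3}, new double[][] {{10, 10}, {9, 1}});
    Candidate winner = reduce(Arrays.asList(fromPartA, fromPartB).iterator());
    System.out.println("closest center id: " + winner.centerId);
  }
}
```

Each movie is read once per part of the centers, which is the "reads data points multiple times" cost listed under Issues.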


Exercises

Recommended: PageRank
    Finish iterative step
    Balance pages with many incoming links
Optional: k-Means
    Run on part of Netflix
    Read about and implement Canopies: http://www.kamalnigam.com/papers/canopy-kdd00.pdf
