Advanced Tools, Techniques and Applications Unit No: 6

• MapReduce
• Combiner
• Partitioner
• MapReduce Word Count Example
• MongoDB and MapReduce Functions
• ETL Processing
• Apache Pig
• Pig Features
• Pig Execution Modes
• Pig Running Modes
• Pig UDFs
• Word Count Example using Pig

Advanced Tools, Techniques, Applications




Large-scale data processing was difficult!
◦ Managing hundreds or thousands of processors
◦ Managing parallelization and distribution
◦ I/O scheduling
◦ Status and monitoring
◦ Fault/crash tolerance

MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html




What is it?
◦ A programming model used by Google
◦ A combination of the Map and Reduce models with an associated implementation
◦ Used for processing and generating large data sets




How does it solve our previously mentioned problems?
◦ MapReduce is highly scalable and can be used across many computers.
◦ Many small machines can be used to process jobs that normally could not be processed by a single large machine.




Map

Inputs a key/value pair
◦ Key is a reference to the input value
◦ Value is the data set on which to operate

Evaluation
◦ Function defined by the user
◦ Applied to every value in the input
  ▪ Might need to parse the input

Produces a new list of key/value pairs
◦ Can be of a different type from the input pair
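The map step described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the class name `MapSketch`, the use of a file name as the input key, and whitespace tokenization are all hypothetical choices, not something the slides specify.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of a map function (hypothetical; no Hadoop API involved).
class MapSketch {
    // Input pair: key = file name (a reference to the input), value = the file's text.
    // Output: a new list of (word, 1) pairs, whose types differ from the input pair.
    static List<Map.Entry<String, Integer>> map(String fileName, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // Parse the input value into words, then emit one pair per word.
        for (String word : contents.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("a.txt", "to be or not to be"));
    }
}
```

Note that repeated words produce repeated pairs; collapsing them is left to the reduce step.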


Reduce
◦ Starts with intermediate key/value pairs
◦ Ends with finalized key/value pairs
◦ Starting pairs are sorted by key
◦ An iterator supplies the values for a given key to the Reduce function
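A minimal plain-Java sketch of this flow follows. It assumes summation as the user-defined reduce and uses a `TreeMap` as a stand-in for the framework's sort-and-group phase; `ReduceSketch` and its method names are hypothetical and not part of any MapReduce library.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the reduce side (hypothetical; no Hadoop API involved).
class ReduceSketch {
    // User-defined reduce: folds all values for one key into a single value.
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Stand-in for the framework: groups intermediate pairs by key in sorted
    // order, then calls reduce once per key with an iterator over its values.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> intermediate) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue().iterator()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                List.of(Map.entry("to", 1), Map.entry("be", 1), Map.entry("to", 1));
        System.out.println(run(pairs));
    }
}
```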




Typically a function that:
◦ Starts with a large number of key/value pairs
  ▪ One key/value pair for each word in all files being grepped (including multiple entries for the same word)
◦ Ends with very few key/value pairs
  ▪ One key/value pair for each unique word across all the files, with the number of instances summed into this entry

Broken up so that a given worker works with input of the same key.
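One common way to make sure a given worker sees all pairs for the same key is to hash the key. The sketch below is plain Java with hypothetical names (`PartitionSketch`, `partitionFor`); the arithmetic mirrors what Hadoop's default hash partitioner does, but the slides do not say which scheme is meant, so treat it as an assumption.

```java
// Plain-Java illustration of key-based partitioning (hypothetical names).
class PartitionSketch {
    // Maps a key to a worker index in [0, numWorkers).
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode cannot produce a negative index.
    static int partitionFor(String key, int numWorkers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numWorkers;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"to", "be", "or", "not", "to", "be"}) {
            System.out.println(word + " -> worker " + partitionFor(word, 3));
        }
    }
}
```

Because the index depends only on the key, every occurrence of a word is routed to the same worker, which can then sum its counts without consulting other workers.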


Map returns information

Reduce accepts information

Reduce applies a user-defined function to reduce the amount of data




Yahoo!
◦ Webmap application uses Hadoop to create a database of information on all known webpages

Facebook
◦ Hive data center uses Hadoop to provide business statistics to application developers and advertisers

Rackspace
◦ Analyzes server log files and usage data using Hadoop




Creates an abstraction for dealing with complex overhead
◦ The computations are simple; the overhead is messy

Removing the overhead makes programs much smaller and thus easier to use
◦ Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested

Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation


import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths are taken from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}




MapReduce is composed of several components, including:

◦ JobTracker: the master node that manages all jobs and resources in a cluster
◦ TaskTrackers: agents deployed to each machine in the cluster to run the map and reduce tasks
◦ JobHistoryServer: a component that tracks completed jobs, and is typically deployed as a separate function or with JobTracker


• MapReduce operates in parallel across massive cluster sizes. MapReduce libraries are available in several languages, including C, C++, Java, Ruby, Perl and Python.
• Programmers can use MapReduce libraries to create tasks without dealing with communication or coordination between nodes.
• MapReduce is also fault-tolerant: each node periodically reports its status to a master node. If a node doesn't respond as expected, the master node reassigns that piece of the job to other available nodes in the cluster. This resiliency makes it practical for MapReduce to run on inexpensive commodity servers.
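The reassignment idea can be sketched as a small plain-Java simulation. Everything here is an assumption made for illustration: the class `ReassignSketch`, the timestamp-based timeout, and the round-robin choice of replacement node are hypothetical and do not describe how any particular MapReduce implementation schedules work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a master reassigning work (hypothetical names and policy).
class ReassignSketch {
    // Returns a new task -> node assignment in which every task owned by a node
    // whose last status report is older than timeoutMs is moved to a live node.
    static Map<String, String> reassign(Map<String, String> taskToNode,
                                        Map<String, Long> lastReportMs,
                                        long nowMs, long timeoutMs) {
        // A node counts as live if it reported within the timeout window.
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastReportMs.entrySet()) {
            if (nowMs - e.getValue() <= timeoutMs) {
                live.add(e.getKey());
            }
        }
        if (live.isEmpty()) {
            throw new IllegalStateException("no live nodes to take over the work");
        }
        Collections.sort(live); // deterministic order for the round-robin below

        Map<String, String> result = new HashMap<>();
        int next = 0;
        // Iterate tasks in sorted order so the outcome is reproducible.
        for (Map.Entry<String, String> e : new TreeMap<>(taskToNode).entrySet()) {
            if (live.contains(e.getValue())) {
                result.put(e.getKey(), e.getValue());          // owner is healthy
            } else {
                result.put(e.getKey(), live.get(next % live.size())); // hand over
                next++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> tasks = new HashMap<>();
        tasks.put("t1", "n1");
        tasks.put("t2", "n2");
        Map<String, Long> reports = new HashMap<>();
        reports.put("n1", 1000L);
        reports.put("n2", 100L);
        System.out.println(reassign(tasks, reports, 1000L, 50L));
    }
}
```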


Specifying ranges in FOREACH Operator


The NESTED FOREACH


Splitting Data Sets


UDFs


1. How does Hadoop help to process big data? Explain with a suitable case study.
2. Why is MapReduce important as far as big data is concerned? Explain with an analogy.
3. Write a short note on mapper, reducer, combiner, and partitioner.
4. Write and explain a MapReduce program (Java/Python) to count word occurrences in a text file.
5. How can MongoDB be helpful in the MapReduce paradigm?
6. What is Pig? Where can it be found in the Hadoop architecture?
7. Explain the execution modes of Pig.
8. Write and explain any 10 HDFS commands.
9. Write a short note on UDFs.
10. Write and explain a word count example using Pig.


Thank You http://www.pavanjaiswal.com

