Unit 5 Hadoop Programming.pdf


Unit No: 5

Hadoop Programming Pavan R Jaiswal



Data science



Symmetric multi-processing system (SMP)



Distributed systems



Shared memory, shared nothing architecture (SNA)



Hadoop and its ecosystem



Hadoop distributions



Hadoop YARN



Word count program

Hadoop and Programming



 



Data science is the process of extracting knowledge from large volumes of data
Data can be in structured or unstructured form
This is also known as “knowledge discovery and data mining (KDD)”
Unstructured data covers a wide range of content. It includes
◦ Email
◦ Videos
◦ Photos
◦ Social media, etc.








Data science often requires sorting through a great amount of information
It needs efficient algorithms to extract insights from this data
It involves techniques that utilize
◦ Data preparation
◦ Statistics
◦ Machine learning
to find problems in various domains








In-memory analytics enables users to have immediate access to the right information, which in turn results in more informed decisions
Traditional BI technology loads data onto disk in the form of tables and multi-dimensional cubes against which queries are run
As the name suggests, in-memory analytics loads data into main memory (RAM or flash memory) instead of the hard disk








This eliminates the need for optimized databases, indexes, aggregates and the design of cubes and star schemas
Most in-memory tools use compression algorithms that reduce the in-memory size of the data beyond what would be necessary on hard disks
With this, the data available for analysis can be as large as a data mart or a small data warehouse
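The effect of compression on in-memory size can be illustrated with a small sketch. It uses Python's standard `zlib` module as a stand-in; the slides do not say which algorithms in-memory tools use, and the helper name `compression_ratio` and the sample column are assumptions made for illustration only.

```python
import zlib

def compression_ratio(values):
    """Return raw size divided by compressed size for a column of strings."""
    raw = "\n".join(values).encode("utf-8")   # the column as plain bytes
    packed = zlib.compress(raw)               # general-purpose compression (assumption)
    assert zlib.decompress(packed) == raw     # compression is lossless
    return len(raw) / len(packed)

# A repetitive column, as is common for dimension attributes such as a city name.
city_column = ["Pune"] * 10000
ratio = compression_ratio(city_column)
```

Columns with many repeated values compress well, which is one reason a data mart can fit in RAM.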


1. Cheaper and higher performance hardware
2. 64-bit operating systems
3. Data volumes
4. Reduced cost






 

Symmetric multi-processing (SMP) is a computer architecture that provides fast performance by making multiple CPUs available
SMP uses a single OS and shares common memory
Both UNIX and Windows NT support SMP
SMP systems are tightly coupled multi-processor systems
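The shared-memory idea behind SMP can be sketched with threads that update one common value. This is only an analogy: the threads stand in for CPUs, the function name `parallel_sum` is hypothetical, and in CPython the global interpreter lock limits true parallelism, so the sketch shows the sharing model rather than a speedup.

```python
import threading

def parallel_sum(values, workers=4):
    """Sum `values` using several threads that update one shared total."""
    total = [0]                  # shared memory visible to every thread
    lock = threading.Lock()      # serializes updates to the shared total

    def work(chunk):
        partial = sum(chunk)     # compute privately first
        with lock:               # then update shared state under the lock
            total[0] += partial

    # Split the input into strided chunks, one per worker.
    chunks = [values[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Calling `parallel_sum(list(range(1, 101)))` returns 5050. The lock is the price of tight coupling: every worker can reach the same memory, so access has to be coordinated.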








In massively parallel processing (MPP), a large number of processors are used to perform a set of coordinated computations in parallel
Massively parallel processor arrays (MPPAs) are a type of integrated circuit with an array of hundreds or thousands of CPUs and RAM banks
These processors pass work to one another through a reconfigurable interconnect of channels









Typically, MPP processors communicate using some messaging interface
In some cases, 200 or more processors can work on the same application
An MPP system is also known as a “loosely coupled” or “shared nothing” system
Applications of MPP systems can be found in decision support systems, data warehouse applications, etc.
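A minimal sketch of the message-passing style described above follows. It simulates the nodes with threads and queues inside one process, which is an assumption made to keep the example self-contained; a real MPP system runs separate processors connected by an interconnect. The names `worker` and `mpp_sum` and the message strings are hypothetical.

```python
import queue
import threading

def worker(local_data, inbox, outbox):
    """A simulated MPP node: owns `local_data` and talks only through messages."""
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        if msg == "sum":
            outbox.put(sum(local_data))   # reply with a partial result

def mpp_sum(partitions):
    """Ask every node for the sum of its own partition and combine the replies."""
    outbox = queue.Queue()
    nodes = []
    for part in partitions:
        inbox = queue.Queue()
        # Each node gets its own copy of its partition; nothing is shared.
        t = threading.Thread(target=worker, args=(list(part), inbox, outbox))
        t.start()
        nodes.append((t, inbox))
    for _, inbox in nodes:
        inbox.put("sum")                  # broadcast the request
    total = sum(outbox.get() for _ in nodes)
    for t, inbox in nodes:
        inbox.put("stop")                 # shut the nodes down
        t.join()
    return total
```

For example, `mpp_sum([[1, 2, 3], [4, 5], [6]])` returns 21. No lock is needed because no node ever touches another node's data.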










A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages
In a distributed system, the machines can be far from each other, e.g. on different continents
Here, communication cost and communication problems cannot be ignored
Distributed computing issues: transparency, scalability, reliability/fault tolerance










In a shared nothing architecture, each node consists of a processor, memory and one or more disks
The processor at one node communicates with the processor at another node using the interconnection network
Each node functions as the server for the data on the disk or disks it owns
Shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient







Processors share neither a common memory nor a common disk
Data accessed from local disks does not pass through the interconnection network, thereby minimizing the interference of resource sharing
This architecture eliminates the single point of failure, allows self-healing capabilities and offers non-disruptive upgrades
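One common way to decide which node owns which data in a shared nothing cluster is hash partitioning. The slides do not name a partitioning scheme, so the sketch below is an assumption; the classes `Node` and `Cluster` are hypothetical and `zlib.crc32` is used only because it gives the same result on every run.

```python
import zlib

class Node:
    """A self-sufficient node with its own private storage."""
    def __init__(self, name):
        self.name = name
        self.store = {}          # stands in for the node's local disk

class Cluster:
    """Routes each key to exactly one owning node; nodes share nothing."""
    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]

    def owner(self, key):
        # Deterministic hash of the key picks the owning node.
        idx = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[idx]

    def put(self, key, value):
        self.owner(key).store[key] = value

    def get(self, key):
        return self.owner(key).store.get(key)

cluster = Cluster(["node0", "node1", "node2"])
cluster.put("user42", {"city": "Pune"})
record = cluster.get("user42")
```

Every read and write for a key goes to its single owner, so a request never needs another node's memory or disk.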




 





The Hadoop filesystem (HDFS) was developed using a distributed filesystem design
It runs on commodity hardware
Unlike other filesystems, HDFS is highly fault tolerant and designed for low-cost hardware
HDFS holds very large amounts of data and provides easier access
To store such huge data, files are stored across multiple machines







 

HDFS is suitable for distributed storage and processing
Hadoop provides a common interface to interact with HDFS
The built-in servers of the namenode and datanode help users to easily check the status of the cluster
Streaming access to filesystem data
HDFS provides file permissions and authentication




HDFS follows a master-slave architecture and has the following daemons running
1. Namenode
2. Secondary namenode
3. Datanode










There are a number of datanodes, usually one per node in the cluster
These manage the storage attached to the nodes that they run on
HDFS exposes a filesystem namespace and allows user data to be stored in files
Internally, a file is split into one or more blocks and these blocks are stored in a set of datanodes
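The splitting of a file into blocks and their placement on datanodes can be sketched as below. This is a toy model under stated assumptions: the block size is a few bytes instead of the 128 MB default of recent Hadoop versions, placement is simple round-robin rather than the rack-aware policy HDFS actually applies, and the function names are hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into consecutive blocks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, datanodes, replication=3):
    """Map each block index to the datanodes holding a replica (round-robin)."""
    replication = min(replication, len(datanodes))   # cannot exceed cluster size
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["dn0", "dn1", "dn2", "dn3"])
```

The namenode plays the role of `placement` here: it keeps the mapping from blocks to datanodes, while the datanodes hold the block contents.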








Datanodes are responsible for serving read and write requests from the filesystem’s clients
In HDFS, writes are pipelined and reads are parallel
The datanode also performs block creation, deletion and replication upon instruction from the namenode
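A simplified sketch of "writes are pipelined and reads are parallel" follows. It is an in-process model, not the HDFS client protocol: real HDFS streams a block in packets with acknowledgements, while here each datanode stores a whole block and forwards it to the next one. The class `DataNode` and the functions `write_block` and `read_file` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                      # block_id -> bytes

    def write(self, block_id, data, downstream):
        """Store the block, then forward it to the next datanode in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])

    def read(self, block_id):
        return self.blocks[block_id]

def write_block(block_id, data, pipeline):
    """The client sends the block only to the first datanode of the pipeline."""
    pipeline[0].write(block_id, data, pipeline[1:])

def read_file(block_locations):
    """Fetch blocks from their datanodes in parallel and join them in file order.

    `block_locations` is a list of (block_id, datanode) pairs in file order.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = pool.map(lambda loc: loc[1].read(loc[0]), block_locations)
        return b"".join(parts)                # map() preserves input order

a, b, c = DataNode("dn-a"), DataNode("dn-b"), DataNode("dn-c")
write_block(0, b"hello ", [a, b, c])
write_block(1, b"world", [b, c, a])
content = read_file([(0, a), (1, b)])
```

The client pays for one transfer per block on a write, and the replicas are created by the datanodes themselves.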


 

YARN: Yet Another Resource Negotiator
Its original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
◦ A global ResourceManager
◦ A per-application ApplicationMaster
◦ A per-node slave NodeManager
◦ A per-application Container running on a NodeManager
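How these four entities relate can be sketched with a toy model. It assumes a first-fit allocation by memory only, which is a deliberate simplification; the schedulers shipped with YARN are far more elaborate. The class and method names mirror the roles listed above but are not the YARN API.

```python
class NodeManager:
    """Per-node agent that reports how much memory is still free."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Global authority that hands out containers on NodeManagers."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        # First-fit: grant a container on the first node with enough free memory.
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                nm.free_mb -= memory_mb
                return {"app": app_id, "node": nm.name, "memory_mb": memory_mb}
        return None                           # cluster is full for this request

class ApplicationMaster:
    """Per-application process that negotiates containers for its tasks."""
    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.containers = []

    def request(self, count, memory_mb):
        for _ in range(count):
            container = self.rm.allocate(self.app_id, memory_mb)
            if container is None:
                break                         # stop once nothing more is granted
            self.containers.append(container)
        return self.containers

rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
am = ApplicationMaster("app-1", rm)
granted = am.request(count=4, memory_mb=1024)
```

With 3072 MB in the cluster, only three of the four requested 1024 MB containers are granted: two on `nm1` and one on `nm2`.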




For more information on Hadoop and its ecosystem, refer to the “Hadoop” tab at the link below
www.pavanjaiswal.com




For more information on the Hadoop word count program (in Java and Python), refer to the “Hadoop” tab at the link below
www.pavanjaiswal.com
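For orientation, the map, shuffle and reduce phases of word count can be sketched in plain Python. This is not the program from the linked site and it does not run on a cluster; it is a local simulation under the assumption that words are lowercase runs of letters, digits and apostrophes. The function names are hypothetical.

```python
import re
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce phase: add up the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(pairs)
    # Keys are processed in sorted order, as a reducer would receive them.
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))

counts = word_count(["Hadoop is fast", "hadoop is scalable"])
```

Here `counts` maps `hadoop` and `is` to 2, and `fast` and `scalable` to 1. On a cluster the framework performs the shuffle between the map and reduce tasks, so only the mapper and reducer logic is written by the programmer.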


1. What is data science? Explain the data science process with a flowchart
2. Explain the difference between in-memory analytics and in-database analytics
3. Explain the factors involved in in-memory analytics
4. Write a short note on symmetric multi-processing (SMP) systems
5. Describe the massively parallel processing architecture
6. Explain the difference between parallel and distributed systems
7. Explain the shared memory concept with the help of an example
8. Write a short note on Shared Nothing Architecture (SNA). State the advantages of SNA
9. Explain Brewer’s Theorem for distributed systems
10. Explain the difference between NoSQL and NewSQL
11. Explain HDFS architecture
12. State the difference between RDBMS and Hadoop
13. Explain YARN architecture

Thank You http://www.pavanjaiswal.com
