Unit No: 5

Hadoop Programming
Pavan R Jaiswal



Data science



Symmetric multi-processing system (SMP)



Distributed systems



Shared memory, shared nothing architecture (SNA)



Hadoop and its ecosystem



Hadoop distributions



Hadoop YARN



Word count program

Hadoop and Programming



 



Data science is the process of extracting knowledge from large volumes of data.
Data can be in structured or unstructured form.
This is also known as “knowledge discovery and data mining (KDD)”.
Unstructured data has a wide scope. It includes
◦ Email
◦ Videos
◦ Photos
◦ Social media, etc.








Data science often requires sorting through a great amount of information.
It needs efficient algorithms to extract insights from this data.
It involves techniques that utilize
◦ Data preparation
◦ Statistics
◦ And machine learning
to investigate problems in various domains.








In-memory analytics gives users immediate access to the right information, which in turn results in more informed decisions.
Traditional BI technology loads data onto disk in the form of tables and multi-dimensional cubes against which queries are run.
As the name suggests, in-memory analytics loads data into main memory (RAM or flash memory) instead of onto hard disk.








This eliminates the need for optimized databases, indexes, aggregates, and the design of cubes and star schemas.
Most in-memory tools use compression algorithms that reduce the size of in-memory data beyond what would be necessary for hard disks.
With this, the data available for analysis can be as large as a data mart or a small data warehouse.
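The effect of compression on in-memory size can be illustrated with a small sketch. It uses Python's standard zlib module on a made-up column of repetitive values. Both the column contents and the choice of zlib are assumptions for illustration, not a description of what any particular in-memory tool does.

```python
import zlib

# Hypothetical column of repetitive values, as often found in reporting data.
column = ",".join(["Pune", "Mumbai", "Delhi"] * 1000).encode("utf-8")

# Compress the column before keeping it in memory.
compressed = zlib.compress(column)

# Decompression restores the original bytes exactly (the compression is lossless).
restored = zlib.decompress(compressed)

print(len(column), len(compressed))
```

Repetitive data such as this compresses well; the gain on real data depends on how much redundancy it contains.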


1. Cheaper and higher performance hardware
2. 64-bit operating systems
3. Data volumes
4. Reduced cost






 

SMP is a computer architecture that provides fast performance by making multiple CPUs available.
SMP uses a single OS and shares common memory.
Both UNIX and Windows NT support SMP.
SMP systems are tightly coupled multi-processor systems.
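The shared-memory idea behind SMP can be sketched with threads that all update one variable in common memory. This is only an analogy: the threads stand in for CPUs, the names `counter` and `worker` are made up for the example, and standard CPython threads do not run Python bytecode truly in parallel.

```python
import threading

counter = 0                  # lives in memory shared by all threads
lock = threading.Lock()      # serialises updates to the shared value

def worker(increments):
    """Increment the shared counter a fixed number of times."""
    global counter
    for _ in range(increments):
        with lock:
            counter += 1

# Four threads play the role of four processors sharing one memory.
threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

The lock is the price of sharing: because every worker touches the same memory, updates have to be coordinated.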








In massively parallel processing (MPP), a large number of processors are used to perform a set of coordinated computations in parallel.
Massively parallel processor arrays (MPPAs) are a type of integrated circuit with an array of hundreds or thousands of CPUs and RAM banks.
These processors pass work to one another through a reconfigurable interconnect of channels.









Typically, MPP processors communicate using some messaging interface.
In some cases, 200 or more processors can work on the same application.
An MPP system is also known as a “loosely coupled” or “shared nothing” system.
Applications of MPP systems can be found in decision support systems, data warehouse applications, etc.
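The message-passing style of MPP can be sketched as workers that receive their share of the data as a message and send back a partial result. This is a single-machine simulation under stated assumptions: threads and `queue.Queue` stand in for processors and the messaging interface, and the `processor` function is hypothetical.

```python
import queue
import threading

def processor(inbox, outbox):
    """Work only on the chunk received as a message, then reply."""
    chunk = inbox.get()
    outbox.put(sum(chunk))

data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]   # one chunk per simulated processor

results = queue.Queue()
workers = []
for chunk in chunks:
    inbox = queue.Queue()
    t = threading.Thread(target=processor, args=(inbox, results))
    t.start()
    inbox.put(chunk)          # send the work as a message
    workers.append(t)

for t in workers:
    t.join()

# Combine the partial sums returned by the processors.
total = sum(results.get() for _ in workers)
print(total)
```

No worker reads another worker's data; all coordination happens through the messages.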










A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.
In a distributed system, machines can be far from each other, for example on different continents.
As a result, communication cost and communication problems cannot be ignored.
Distributed computing issues: transparency, scalability, reliability/fault tolerance.









In a shared nothing architecture, each node consists of a processor, memory and one or more disks.
A processor at one node communicates with a processor at another node using an interconnection network.
A node functions as the server for the data on the disk or disks it owns.
Shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient.







Processors share neither a common memory nor a common disk.
Data accessed from local disks does not pass through the interconnection network, thereby minimizing the interference of resource sharing.
This architecture eliminates single points of failure, allows self-healing capabilities and offers non-disruptive upgrades.
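One way to picture shared nothing is as a set of nodes that each own a disjoint slice of the data, with a hash deciding which node owns a given key. The sketch below is illustrative only: the `Node` class, the `owner` function, and the use of CRC32 as the partitioning hash are all assumptions, not part of any specific system.

```python
import zlib

class Node:
    """A node that owns its data; nothing is shared with other nodes."""
    def __init__(self, name):
        self.name = name
        self.local_disk = {}      # stands in for the node's own disk

    def put(self, key, value):
        self.local_disk[key] = value

    def get(self, key):
        return self.local_disk.get(key)

nodes = [Node("node-0"), Node("node-1"), Node("node-2")]

def owner(key):
    """Pick the single node that owns a key, using a stable hash."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

for key in ["alice", "bob", "carol", "dave"]:
    owner(key).put(key, key.upper())

print(owner("bob").get("bob"))
```

Each key is read and written only on the node that owns it, so local access never involves the other nodes.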




 





The Hadoop filesystem (HDFS) was developed using a distributed filesystem design.
It runs on commodity hardware.
Unlike other filesystems, HDFS is highly fault tolerant and designed using low-cost hardware.
HDFS holds very large amounts of data and provides easier access.
To store such huge data, files are stored across multiple machines.







 

HDFS is suitable for distributed storage and processing.
Hadoop provides a common interface to interact with HDFS.
The built-in servers of the namenode and datanode help users easily check the status of the cluster.
It offers streaming access to filesystem data.
HDFS provides file permissions and authentication.



HDFS follows a master-slave architecture and has the following daemons running
1. Namenode
2. Secondary namenode
3. Datanode










There are a number of datanodes, usually one per node in the cluster.
These manage the storage attached to the nodes that they run on.
HDFS exposes a filesystem namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of datanodes.
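Splitting a file into blocks and spreading them over datanodes can be sketched as follows. This is a toy model under stated assumptions: the 256-byte block size, the replication factor of 3, the datanode names, and the round-robin placement are chosen for illustration and are not HDFS's actual defaults or placement policy.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Assumes replication <= len(datanodes) so the replicas are distinct.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

file_data = b"x" * 1000
blocks = split_into_blocks(file_data, block_size=256)
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)

print(len(blocks), placement[0])
```

Joining the blocks in order gives back the original file, which is what a client does when it reads.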








Datanodes are responsible for serving read and write requests from the filesystem’s clients.
In HDFS, writes are pipelined and reads are parallel.
The datanode also performs block creation, deletion and replication upon instruction from the namenode.
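The difference between a pipelined write and a parallel read can be sketched with a simplified model. The `DataNode` class and its `write` and `read` methods are hypothetical and do not mirror the real HDFS client protocol; a thread pool stands in for a client fetching several blocks at once.

```python
from concurrent.futures import ThreadPoolExecutor

class DataNode:
    """Toy datanode that stores blocks in a dictionary."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        # Store locally, then forward to the next datanode in the pipeline.
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])

    def read(self, block_id):
        return self.blocks[block_id]

dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
pipeline = [dn1, dn2, dn3]

# Pipelined write: the client sends each block only to the first datanode.
blocks = {0: b"hello ", 1: b"hadoop"}
for block_id, data in blocks.items():
    pipeline[0].write(block_id, data, pipeline[1:])

# Parallel read: fetch different blocks from different replicas at once.
sources = {0: dn2, 1: dn3}
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(lambda b: sources[b].read(b), sorted(sources)))

content = b"".join(parts)
print(content)
```

The client writes each block once and the datanodes relay it among themselves, while reads can be served by any node holding a copy.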


 

YARN: Yet Another Resource Negotiator
Its original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
◦ A global resource manager
◦ A per-application Application-Master
◦ A per-node slave NodeManager
◦ A per-application Container running on a NodeManager
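How these entities relate can be sketched with a toy scheduler. All class and method names below are hypothetical and are not YARN's API; the first-fit, memory-only allocation is a deliberate simplification of what a real scheduler does.

```python
class NodeManager:
    """Per-node agent that tracks free memory and launches containers."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, app_id, memory_mb):
        self.free_mb -= memory_mb
        container = (app_id, self.name, memory_mb)
        self.containers.append(container)
        return container

class ResourceManager:
    """Global scheduler that picks a node for each container request."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        # First fit: use the first node with enough free memory.
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                return nm.launch(app_id, memory_mb)
        return None   # no node has enough free memory

class ApplicationMaster:
    """Per-application coordinator that requests containers."""
    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def request_containers(self, count, memory_mb):
        granted = [self.rm.allocate(self.app_id, memory_mb) for _ in range(count)]
        return [c for c in granted if c is not None]

rm = ResourceManager([NodeManager("node-1", 2048), NodeManager("node-2", 1024)])
am = ApplicationMaster("app-1", rm)
containers = am.request_containers(count=3, memory_mb=1024)
print(containers)
```

The split of duties is the point of the sketch: the resource manager decides where work may run, the application master decides what its application needs, and the node manager hosts the containers.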




For more information on Hadoop and its ecosystem, refer to the “Hadoop” tab at the link below:
www.pavanjaiswal.com




For more information on the Hadoop word count program (in Java and Python), refer to the “Hadoop” tab at the link below:
www.pavanjaiswal.com
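As a rough idea of what a word count looks like in the map, shuffle and reduce style, here is a minimal local sketch in Python. It is not the program from the linked site and uses no Hadoop API: the `mapper`, `shuffle` and `reducer` functions are hypothetical names, and everything runs in a single process on two made-up input lines.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, counts):
    """Reduce step: sum the counts for one word."""
    return word, sum(counts)

lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for line in lines for pair in mapper(line)]
word_counts = dict(reducer(w, c) for w, c in shuffle(pairs).items())

print(word_counts)
```

On a real cluster the mappers and reducers would run on different nodes and the framework would perform the shuffle between them.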


1. What is data science? Explain the data science process with a flowchart
2. Explain the difference between in-memory analytics and in-database analytics
3. Explain the factors involved for in-memory analytics
4. Write a short note on symmetric multi-processing systems (SMP)
5. Describe Massively Parallel Processing architecture
6. Explain the difference between parallel and distributed systems
7. Explain the shared memory concept with the help of an example
8. Write a short note on Shared Nothing Architecture (SNA). State the advantages of SNA
9. Explain Brewer’s Theorem for distributed systems
10. Explain the difference between NoSQL and NewSQL
11. Explain HDFS architecture
12. State the difference between RDBMS and Hadoop
13. Explain YARN architecture

Thank You http://www.pavanjaiswal.com
