Unit No: 5
Hadoop Programming
Pavan R Jaiswal
Data science
Symmetric multi-processing system (SMP)
Distributed systems
Shared memory, shared nothing architecture (SNA)
Hadoop and its ecosystem
Hadoop distributions
Hadoop YARN
Word count program
Hadoop and Programming
Data science is the process of extracting knowledge from large volumes of data
Data can be in structured or unstructured form
This is also known as “knowledge discovery and data mining (KDD)”
Unstructured data covers a wide range of sources. It includes
◦ Email
◦ Videos
◦ Photos
◦ Social media, etc
Data science often requires sorting through a great amount of information
It needs efficient algorithms to extract insights from this data
It involves techniques that utilize
◦ Data preparation
◦ Statistics
◦ And machine learning
to investigate problems in various domains
In-memory analytics gives users immediate access to the right information, which in turn results in more informed decisions
Traditional BI technology loads data onto disk in the form of tables and multi-dimensional cubes, against which queries are run
As the name suggests, in-memory analytics loads data into main memory (RAM or flash memory) instead of onto hard disk
This eliminates the need for optimized databases, indexes, aggregates and the design of cubes and star schemas
Most in-memory tools use compression algorithms that reduce the in-memory size of the data beyond what would be necessary on hard disk
With this, the data available for analysis can be as large as a data mart or a small data warehouse
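The effect of compression on repetitive data can be sketched as follows. This is only an illustration: it uses Python's standard `zlib` module as a stand-in, the sample rows are made up, and the slides do not say which algorithms actual in-memory tools use.

```python
import zlib

def compressed_size(rows):
    """Return (raw_bytes, packed_bytes) for a list of text rows."""
    raw = "\n".join(rows).encode("utf-8")
    packed = zlib.compress(raw)   # general-purpose compression as a stand-in
    return len(raw), len(packed)

# Hypothetical sample: a column-like dataset with many repeated values
rows = ["Pune,Maharashtra,India"] * 10_000
raw_bytes, packed_bytes = compressed_size(rows)
print("raw:", raw_bytes, "bytes; compressed:", packed_bytes, "bytes")
```

Highly repetitive data, as in this made-up sample, compresses very well, which is why more data fits in a given amount of RAM.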
1. Cheaper and higher performance hardware
2. 64-bit operating systems
3. Data volumes
4. Reduced cost
SMP is a computer architecture that provides fast performance by making multiple CPUs available
SMP uses a single OS and shares common memory
Both UNIX and Windows NT support SMP
SMP systems are tightly coupled multi-processor systems
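The shared-memory idea can be sketched with threads that all update one value in common memory. This is a minimal illustration, not a real SMP program: Python threads stand in for CPUs, and the function name and worker count are assumptions made for the example.

```python
import threading

def shared_memory_sum(values, workers=4):
    """Sum `values` with several threads that all update one shared total."""
    total = {"value": 0}        # lives in memory visible to every thread
    lock = threading.Lock()     # guards the shared total against races

    def work(chunk):
        partial = sum(chunk)    # each worker computes on its own slice
        with lock:              # only one worker updates the total at a time
            total["value"] += partial

    # Deal the input out round-robin, one slice per worker
    chunks = [values[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]

print(shared_memory_sum(list(range(1, 101))))
```

Because every worker touches the same memory, access has to be coordinated (here with a lock). This is the coupling that "tightly coupled" refers to.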
In massively parallel processing (MPP), a large number of processors are used to perform a set of coordinated computations in parallel
Massively parallel processor arrays (MPPAs) are a type of integrated circuit with an array of hundreds or thousands of CPUs and RAM banks
These processors pass work to one another through a reconfigurable interconnect of channels
Typically, MPP processors communicate using some messaging interface
In some cases, 200 or more processors can work on the same application
An MPP system is also known as a “loosely coupled” or “shared nothing” system
Applications of MPP systems can be found in decision support systems, data warehouse applications, etc
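Message-based coordination can be sketched as below. This is a toy model under stated assumptions: threads with private `queue.Queue` inboxes stand in for MPP processors, and the message layout `(task_id, numbers)` is invented for the example. Workers share no data and only see what arrives as a message.

```python
import queue
import threading

def worker(inbox, outbox):
    """Stand-in for one processor: it only sees what arrives in its inbox."""
    while True:
        message = inbox.get()
        if message is None:          # sentinel: no more work
            break
        task_id, numbers = message
        outbox.put((task_id, sum(numbers)))

def message_passing_sum(values, workers=4):
    """Sum `values` by sending slices to workers and collecting their replies."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(workers)]
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for inbox in inboxes]
    for t in threads:
        t.start()
    # Send each worker its own slice of the data, then the stop sentinel
    for i, inbox in enumerate(inboxes):
        inbox.put((i, values[i::workers]))
        inbox.put(None)
    # Collect exactly one reply per worker
    partials = [outbox.get() for _ in range(workers)]
    for t in threads:
        t.join()
    return sum(result for _, result in partials)

print(message_passing_sum(list(range(1, 101))))
```

No lock is needed on the data itself, because each worker owns its slice. Coordination happens only through the messages.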
A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages
In this, systems can be far from each other, for example on different continents
In this, communication cost and problems cannot be ignored
Distributed computing issues: transparency, scalability, reliability/fault tolerance
In this, a node consists of a processor, memory and one or more disks
The processor at one node communicates with the processor at another node using an interconnection network
A node functions as the server for the data on the disk or disks it owns
Shared nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient
Processors share neither a common memory nor a common disk
Data accessed from local disks does not pass through the interconnection network, thereby minimizing the interference of resource sharing
This architecture eliminates single points of failure, allows self-healing capabilities and offers non-disruptive upgrades
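A shared-nothing layout can be sketched as a set of nodes that each own a private store, with every key routed to exactly one owner. The class names, the CRC32-based routing and the sample keys are assumptions made for this illustration. Real systems use their own partitioning schemes.

```python
import zlib

class Node:
    """One shared-nothing node: a private store, no shared memory or disk."""
    def __init__(self, name):
        self.name = name
        self.local_disk = {}      # only this node reads or writes it

    def put(self, key, value):
        self.local_disk[key] = value

    def get(self, key):
        return self.local_disk.get(key)

class Cluster:
    """Routes each key to the single node that owns it."""
    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def owner(self, key):
        # Stable hash so the same key always maps to the same node
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self.owner(key).put(key, value)

    def get(self, key):
        return self.owner(key).get(key)

cluster = Cluster(["node-1", "node-2", "node-3"])
for user in ["asha", "ravi", "meena", "john"]:
    cluster.put(user, {"name": user})
print(cluster.owner("asha").name, cluster.get("asha"))
```

Each request is served by the owning node from its local store, so nodes do not contend for a common memory or disk.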
The Hadoop filesystem (HDFS) was developed using a distributed filesystem design
It runs on commodity hardware
Unlike other filesystems, HDFS is highly fault tolerant and designed using low-cost hardware
HDFS holds very large amounts of data and provides easier access
To store such huge data, files are stored across multiple machines
It is suitable for distributed storage and processing
Hadoop provides a common interface to interact with HDFS
The built-in servers of the namenode and datanode help users to easily check the status of the cluster
Streaming access to filesystem data
HDFS provides file permissions and authentication
HDFS follows a master-slave architecture and has the following daemons running
1. Namenode
2. Secondary namenode
3. Datanode
There are a number of datanodes, usually one per node in the cluster
These manage the storage attached to the nodes that they run on
HDFS exposes a filesystem namespace and allows user data to be stored in files
Internally, a file is split into one or more blocks and these blocks are stored in a set of datanodes
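The block-splitting idea can be sketched as follows. This is a toy model only: the tiny block size, the datanode names and the round-robin assignment are assumptions for the example. Real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(num_blocks: int, datanodes):
    """Toy block map: block index -> datanode, assigned round-robin."""
    return {b: datanodes[b % len(datanodes)] for b in range(num_blocks)}

file_bytes = b"hadoop stores files as blocks"
blocks = split_into_blocks(file_bytes, block_size=8)
block_map = assign_blocks(len(blocks), ["dn1", "dn2", "dn3"])
for index, block in enumerate(blocks):
    print(index, block_map[index], block)
```

The block map plays the role of the metadata that records which datanode holds which block. Joining the blocks in index order gives back the original file.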
Datanodes are responsible for serving read and write requests from the filesystem’s clients
In HDFS, writes are pipelined and reads are parallel
The datanode also performs block creation, deletion and replication upon instruction from the namenode
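A pipelined write can be sketched as a chain in which each datanode stores the block and then forwards it to the next one. This is a deliberately simplified model: the `DataNode` class and `pipelined_write` function are hypothetical names for this example, and details of a real write path, such as acknowledgements, are left out.

```python
class DataNode:
    """Toy datanode that keeps blocks in a dictionary."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it down the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])

def pipelined_write(block_id, data, pipeline):
    """The client sends the block only to the first datanode in the pipeline."""
    pipeline[0].write(block_id, data, pipeline[1:])

nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
pipelined_write("blk_1", b"some bytes", nodes)
print([name for name in (n.name for n in nodes)],
      all("blk_1" in n.blocks for n in nodes))
```

The client talks to one datanode only. The copies on the other datanodes are created as the block travels along the chain.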
YARN: Yet Another Resource Negotiator
Its original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
◦ A global resource manager
◦ A per-application Application-Master
◦ A per-node slave NodeManager
◦ A per-application Container running on a NodeManager
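How these entities relate can be sketched with a toy model in which a global resource manager grants containers on per-node node managers. Everything here is an assumption made for illustration: the class and method names, the memory figures and the first-fit rule are invented, and the Application-Master's role in requesting containers is left out. It does not reproduce any actual YARN scheduler.

```python
class NodeManager:
    """Toy per-node manager that tracks free memory and running containers."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def launch(self, app_id, memory_mb):
        # Reserve memory on this node and record the container
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))

class ResourceManager:
    """Toy global manager that decides where a container may run."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, app_id, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                nm.launch(app_id, memory_mb)
                return nm.name
        return None               # no node can host the container right now

rm = ResourceManager([NodeManager("nm1", 1024), NodeManager("nm2", 2048)])
print(rm.allocate("app-1", 800))   # fits on nm1
print(rm.allocate("app-1", 800))   # nm1 is too full now, so nm2 is used
print(rm.allocate("app-2", 1500))  # no node has 1500 MB free
```

The point of the split is visible even in this toy: placement decisions are global, while the containers themselves are launched and tracked per node.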
For more information on Hadoop and its ecosystem, refer to the “Hadoop” tab at the link below: www.pavanjaiswal.com
For more information on the Hadoop word count program (in Java and Python), refer to the “Hadoop” tab at the link below: www.pavanjaiswal.com
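As a rough local illustration of the map, shuffle and reduce phases behind a word count job, the sketch below runs all three steps in plain Python on a single machine. It is not the program from the linked site and does not use any Hadoop API. The function names and sample lines are assumptions made for the example.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce phase: add up the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())

print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

In a cluster, many mappers and reducers would run in parallel on different blocks of the input. The grouping step in between is what brings all counts for one word to the same reducer.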
1. What is data science? Explain the data science process with flowchart
2. Explain the difference between in-memory analytics and in-database
3. Explain the factors involved for in-memory analytics
4. Write a short note on symmetric multi-processing systems (SMP)
5. Describe Massively Parallel Processing architecture
6. Explain the difference between parallel and distributed systems
7. Explain shared memory concept with help of example
8. Write a short note on Shared Nothing Architecture (SNA). State advantages of SNA
9. Explain Brewer’s Theorem for distributed systems
10. Explain difference between NoSQL and NewSQL
11. Explain HDFS architecture
12. State difference between RDBMS and Hadoop
13. Explain YARN architecture
Thank You http://www.pavanjaiswal.com