A new way to store and analyze data
Presented by: Harsha Jain, CSE – IV Year Student
Topics Covered
•What is Hadoop?
•Why, Where, When?
•Benefits of Hadoop
•How Hadoop Works
•Hadoop Architecture
•Hadoop Common
•HDFS
•Hadoop MapReduce
•Installation & Execution
•Demo of Installation
•Hadoop Community
What is Hadoop?
• Hadoop was created by Doug Cutting, who named it after his child's stuffed elephant, to support the Lucene and Nutch search-engine projects.
• Open-source project administered by Apache Software Foundation.
• Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called
MapReduce.
• Hadoop keeps large-scale, high-performance processing jobs running in spite of system changes or failures.
Hadoop, Why?
• Need to process 100 TB datasets
• On 1 node:
– scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
– scanning @ 50 MB/s = 33 min
• Need an efficient, reliable, and usable framework
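These figures are plain bandwidth arithmetic; as a quick check (assuming decimal units, 1 TB = 10^12 bytes):

\[
\frac{100\ \mathrm{TB}}{50\ \mathrm{MB/s}} = \frac{10^{14}\ \mathrm{B}}{5\times10^{7}\ \mathrm{B/s}} = 2\times10^{6}\ \mathrm{s} \approx 23\ \text{days},
\qquad
\frac{2\times10^{6}\ \mathrm{s}}{1000} = 2000\ \mathrm{s} \approx 33\ \mathrm{min}
\]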
Where and When Hadoop

Where
• Batch data processing, not real-time / user-facing (e.g. document analysis and indexing, web graphs and crawling)
• Highly parallel, data-intensive distributed applications
• Very large production deployments (GRID)

When
• When processing lots of unstructured data
• When your processing can easily be made parallel
• When running batch jobs is acceptable
• When you have access to lots of cheap hardware
Benefits of Hadoop
• Hadoop is designed to run on cheap commodity
hardware
• It automatically handles data replication and node
failure
• It does the hard work – you can focus on processing
data
• Cost-saving, efficient, and reliable data processing
How Hadoop Works
• Hadoop implements a computational paradigm named
Map/Reduce, where the application is divided into many small
fragments of work, each of which may be executed or re-executed
on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that
stores data on the compute nodes, providing very high aggregate
bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so
that node failures are automatically handled by the framework.
Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop consists of:
• Hadoop Common*: The common utilities that support the other
Hadoop subprojects.
• HDFS*: A distributed file system that provides high throughput
access to application data.
• MapReduce*: A software framework for distributed processing of
large data sets on compute clusters.
Hadoop is made up of a number of elements, starting with the Hadoop Common utilities. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
* This presentation focuses primarily on the Hadoop architecture and its related subprojects.
Data Flow

[Figure: data-flow diagram showing Web Servers, Scribe Servers, Network Storage, a Hadoop Cluster, Oracle RAC, and MySQL.]
Hadoop Common
•Hadoop Common is a set of utilities that support the other Hadoop subprojects. It includes the FileSystem, RPC, and serialization libraries.
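As an illustration of the FileSystem abstraction, here is a minimal sketch (not part of the original slides) that writes a file through Hadoop's Java API; the path is hypothetical, and the default configuration is assumed to point at your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsWriteSketch {
  public static void main(String[] args) throws Exception {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    // The same API works against HDFS, the local file system, and others.
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt"); // hypothetical path
    FSDataOutputStream out = fs.create(file);     // create (or overwrite) the file
    out.writeUTF("hello from the FileSystem API");
    out.close();
    fs.close();
  }
}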
HDFS
•Hadoop Distributed File System (HDFS) is
the primary storage system used by
Hadoop applications.
•HDFS creates multiple replicas of data
blocks and distributes them on compute
nodes throughout a cluster to enable
reliable, extremely rapid computations.
•Replication and locality: blocks are replicated across nodes for fault tolerance, and computation is scheduled close to the data (see the sketch below).
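To make the replication and locality bullet concrete, here is a minimal sketch (my illustration, not from the slides) that raises a file's replication factor and then asks HDFS which hosts hold each block; the path and replication value are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt");   // hypothetical path

    // Ask HDFS to keep 3 replicas of every block of this file.
    fs.setReplication(file, (short) 3);

    // Locality: list the hosts that store each block, so computation
    // can be scheduled on (or near) the machines holding the data.
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(java.util.Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}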
HDFS Architecture

[Figure: HDFS architecture diagram]
Hadoop MapReduce
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in a generic framework
• Common design pattern in data processing:
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
MapReduce Implementation
1. Input files split (M splits)
2. Assign master & workers
3. Map tasks
4. Intermediate data written to disk (R regions)
5. Intermediate data read & sorted
6. Reduce tasks
7. Return
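In Hadoop's Java API, the split count M in step 1 follows from the input size and format, while the number R of reduce tasks/regions is set explicitly by the driver program. A hedged sketch of such a driver in the newer MapReduce API (class names are my own; the Mapper and Reducer are the ones defined with the Word Count slide below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    // Mapper/Reducer classes as defined with the Word Count slide below.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(2);  // R = 2: two partitions, two output files
    FileInputFormat.addInputPath(job, new Path(args[0]));   // M splits come from the input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // one file per reduce task
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiled into a JAR, this is what the bin/hadoop jar command on the Execution slide actually runs.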
MapReduce Cluster Implementation

[Figure: M map tasks read the input file splits (split 0 … split 4) and write intermediate files; R reduce tasks read those files and write the output files (Output 0, Output 1).]
• Several map or reduce tasks can run on a single computer.
• Each intermediate file is divided into R partitions by the partitioning function.
• Each reduce task corresponds to one partition.
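The partitioning function is pluggable; Hadoop's default, HashPartitioner, is essentially the following (reproduced from memory, so treat it as a sketch). Each key is hashed and taken modulo R, so equal keys always land in the same partition and hence the same reduce task:

import org.apache.hadoop.mapreduce.Partitioner;

// Essentially org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.
public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then mod R.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}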
Example of MapReduce: Word Count
•Read text files and count how often words occur.
o The input is text files
o The output is a text file: each line contains a word, a tab, and its count
•Map: produce (word, 1) pairs
•Reduce: for each word, sum up the counts (sketched in code below).
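A minimal sketch of this Word Count in Hadoop's Java MapReduce API (the classic example; class names here are my own):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token on the input line.
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      // Sum the counts for this word; the default output format then
      // writes "word <tab> count" per line, as described above.
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(word, new IntWritable(sum));
    }
  }
}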
Let's Go…

Installation:
• Requirements: Linux, Java 1.6, sshd, rsync
• Unpack the Hadoop distribution
• Edit a few configuration files
• Configure SSH for password-free authentication
• Format the DFS on the name node
• Start all the daemon processes

Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with the relevant args
• Monitor tasks via the Web interface (optional)
• Examine the output when the job is complete
Demo Video for Installation
Hadoop Community
Hadoop Users
• Adobe
• Alibaba
• Amazon
• AOL
• Facebook
• Google
• IBM
Major Contributors
• Apache
• Cloudera
• Yahoo
References
•Apache Hadoop! (http://hadoop.apache.org)
•Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
•Free Search by Doug Cutting (http://cutting.wordpress.com)
•Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop)
•Cloudera - Apache Hadoop for the Enterprise (http://www.cloudera.com)