Hadoop - HAMS technologies

HAMS Technologies
www.hams.co.in
Hadoop overview
» A framework that makes it easy to write and run applications that process vast
amounts of data. Related terms include MapReduce, HDFS, Hive, HBase, and Pig.
» Yahoo is the biggest contributor. Other major contributors are Facebook, Google,
and Amazon/A9.
» Here's what makes it especially useful:
   • Scalable and reliable
   • Easy to implement
   • Efficient
   • Many tools available
   • Support for many well-known languages and scripts
How Hadoop works
• MapReduce divides applications into small blocks of work.
• HDFS creates the desired number of replicas of each data block for reliability,
placing them on compute nodes around the cluster.
• MapReduce can then process the data locally, followed by aggregation of the
intermediate results.
General flow in MapReduce architecture
1. Create a clustered network.
2. Load the data into the cluster using Map (mapper task).
3. Fetch the data to be processed with the help of Map (mapper task).
4. Aggregate the results with the Reducer (reducer task).
[Figure: three Map tasks each process their own local data and produce Partial Result-1, Partial Result-2, and Partial Result-3; a Reduce task combines them into the aggregated result.]
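The flow above can be mimicked in plain Java without a cluster. This is a minimal sketch, not Hadoop code: the class name LocalFlowSketch, its helper methods, and the sample data are assumptions made for illustration. Each "map" call counts the lines containing a keyword in one node's local data and yields a partial result; the "reduce" step adds the partial results together.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the slide's flow; not Hadoop code.
class LocalFlowSketch {

    // "Map": process one node's local data and produce a partial result.
    // Here the partial result is the number of lines that contain the keyword.
    static int mapPartial(List<String> localData, String keyword) {
        int count = 0;
        for (String line : localData) {
            if (line.contains(keyword)) {
                count++;
            }
        }
        return count;
    }

    // "Reduce": combine the partial results into one aggregated result.
    static int reduce(List<Integer> partialResults) {
        int total = 0;
        for (int partial : partialResults) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical local data held by three nodes.
        List<String> node1 = Arrays.asList("error: disk full", "ok");
        List<String> node2 = Arrays.asList("ok", "ok");
        List<String> node3 = Arrays.asList("error: timeout", "error: retry", "ok");

        List<Integer> partials = Arrays.asList(
                mapPartial(node1, "error"),
                mapPartial(node2, "error"),
                mapPartial(node3, "error"));

        System.out.println("Partial results: " + partials);
        System.out.println("Aggregated result: " + reduce(partials));
    }
}
```

With the sample data the three partial results are 1, 0, and 2, and the aggregated result is 3.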
General attributes of MapReduce architecture
1. Distributed file system (DFS)
2. Data locality
3. Data redundancy for fault tolerance
4. Map tasks are applied to partitioned data and scheduled so that their input blocks
are on the same machine.
5. Reducer tasks are applied to process the data partitioned by the Map tasks.
Hadoop is an open source implementation of the MapReduce architecture,
maintained by Apache.
[Figure: Hadoop cluster layout. Components: HDFS (Hadoop Distributed File System), MapReduce (job trackers), and Hive (Hadoop interactIVE). Master nodes: name node/s, job tracker node/s, data node/s. Slave nodes: data node/s and tracker node/s.]
» Hadoop Streaming lets you create and run MapReduce jobs with any executable or
script as the Mapper and/or the Reducer.
» HDFS (Hadoop Distributed File System) is a clustered network used to
store data. HDFS contains the logic to replicate and track the different data
blocks. An HDFS write is shown below. Data is retrieved from HDFS in the
reverse manner.
[Figure: HDFS write. The client holds a file, hams.txt, made of Block-1, Block-2, and Block-3, and asks the name node: "I have a file that contains 3 blocks. Where should I write these?" The name node replies: "Okay, write these on data nodes 1, 2 and 3." The blocks are then written to Data Node-1, Data Node-2, and Data Node-3 among Data Node-1 ... Data Node-n.]
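The exchange above, where the client asks the name node where to write each block, can be illustrated with a toy model. This is a sketch under stated assumptions, not the HDFS placement policy: the class BlockPlacementSketch, the round-robin rule, and the replication parameter are all hypothetical, and real HDFS also takes rack topology into account.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a name node choosing data nodes for each block of a file.
// Hypothetical round-robin rule; not the real HDFS placement policy.
class BlockPlacementSketch {

    // Returns, for each block number (1-based), the data node numbers (1-based)
    // that should hold a replica of that block.
    static Map<Integer, List<Integer>> place(int blockCount, int dataNodeCount, int replication) {
        if (replication > dataNodeCount) {
            throw new IllegalArgumentException("replication exceeds number of data nodes");
        }
        Map<Integer, List<Integer>> placement = new LinkedHashMap<>();
        for (int block = 1; block <= blockCount; block++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Start at the node matching the block number, then wrap around,
                // so replicas of one block always land on distinct nodes.
                nodes.add((block - 1 + r) % dataNodeCount + 1);
            }
            placement.put(block, nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        // hams.txt has 3 blocks; assume 4 data nodes and 2 replicas per block.
        Map<Integer, List<Integer>> placement = place(3, 4, 2);
        for (Map.Entry<Integer, List<Integer>> e : placement.entrySet()) {
            System.out.println("Block-" + e.getKey() + " -> data nodes " + e.getValue());
        }
    }
}
```

Calling place(3, 3, 1) reproduces the figure: Block-1, Block-2, and Block-3 go to data nodes 1, 2, and 3 respectively.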
When to use Hadoop
• Unstructured data for analysis
• Very large amounts of data
• Write once (or rarely), read many times
• Multiple modules written in different languages
Kinds of people who develop applications using Hadoop
1. Hadoop admin/technical person: people who configure the Hadoop
environment, setting up the required number of clusters with details of all data
sources and the different nodes.
2. Hadoop programmer: people who write the different map and reduce functions
that perform the data analysis.
* Here we take the perspective of the Hadoop programmer.
Map/Reduce is a programming model for efficient distributed computing.
It works like a Unix pipeline:
Unix   -> cat input | grep | sort | uniq -c | cat > output
Hadoop -> Input | Map | Shuffle & Sort | Reduce | Output
A simple model, but good for a lot of applications:
• Log processing
• Web index building
• Count of URL access frequency
• Reverse web-link graph: list of all source URLs associated with a given target URL
• Inverted index: produces <word, list(Document ID)> pairs
• Distributed sort
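The Map | Shuffle & Sort | Reduce stages can be walked through in plain Java using the inverted index from the list above. This is a hedged sketch, not Hadoop code: the class InvertedIndexSketch, its method names, and the two-document corpus are assumptions for illustration, and everything runs in a single process.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java walk through Map | Shuffle & Sort | Reduce for an inverted index.
class InvertedIndexSketch {

    // Map: emit one (word, documentId) pair per word in the document.
    static List<String[]> map(String documentId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new String[] {word, documentId});
            }
        }
        return pairs;
    }

    // Shuffle & Sort: group the values by key, with keys in sorted order.
    static Map<String, List<String>> shuffleAndSort(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce: for one word, produce the sorted list of distinct document IDs.
    static List<String> reduce(String word, List<String> documentIds) {
        return new ArrayList<>(new TreeSet<>(documentIds));
    }

    // Runs the three stages in order over a map of documentId -> text.
    static Map<String, List<String>> run(Map<String, String> documents) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            pairs.addAll(map(doc.getKey(), doc.getValue()));
        }
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : shuffleAndSort(pairs).entrySet()) {
            index.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical two-document corpus.
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc1", "hadoop stores data");
        documents.put("doc2", "hadoop processes data data");
        System.out.println(run(documents));
    }
}
```

In a real job the map and reduce calls would run on different nodes, and the framework would perform the grouping step between them.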
For the word count example, we need to implement the Map and Reduce functions and
write the code that launches the application.
Mapper
   Input: value: lines of text of input
   Output: key: word, value: 1
Reducer
   Input: key: word, value: set of counts
   Output: key: word, value: sum
Launching program
   Defines the job
   Submits the job to the cluster
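The input/output contract above can be checked without a cluster. The following is a minimal plain-Java sketch of that contract only: the class WordCountContractSketch and its method signatures are assumptions for illustration and do not use the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java statement of the word count contract; not Hadoop code.
class WordCountContractSketch {

    // Mapper: input value is one line of text; output is (word, 1) per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reducer: input is a word and its set of counts; output is (word, sum).
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new SimpleEntry<>(word, sum);
    }

    public static void main(String[] args) {
        System.out.println(map("to be or not to be"));
        System.out.println(reduce("be", Arrays.asList(1, 1)));
    }
}
```

The mapper never adds anything up; it only emits a 1 per word. All summing happens in the reducer, once the counts for the same word have been brought together.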
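Mapper (example for word count)
// Imports needed: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.IntWritable, LongWritable and Text, org.apache.hadoop.mapreduce.Mapper.
public static class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input line: emits (word, 1) for every token in the line.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Tokens are split on tab characters, as on the original slide; use
        // new StringTokenizer(line) to split on any whitespace instead.
        StringTokenizer tokenizer = new StringTokenizer(line, "\t");
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}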
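Reducer (example for word count)
// Imports needed: java.io.IOException, org.apache.hadoop.io.IntWritable and Text,
// org.apache.hadoop.mapreduce.Reducer.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Called once per word: sums all the counts emitted for that word.
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}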
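Map reduce launcher
// Imports needed: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Job,
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat and TextInputFormat,
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat and TextOutputFormat.
Configuration conf = new Configuration();
// Job.getInstance replaces the deprecated "new Job(conf, name)" constructor.
Job job = Job.getInstance(conf, "wordcount");
// Tells Hadoop which jar to ship to the cluster.
job.setJarByClass(WordCountMap.class);
// Types of the keys and values the job emits.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WordCountMap.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// The slide reads the input and output paths from args[1] and args[2]; adjust the
// indices if your launch command passes them as args[0] and args[1].
FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));
// Blocks until the job finishes; true prints progress to the console.
job.waitForCompletion(true);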
Running the complete program
• Build the jar file, either directly in Eclipse or with the jar command.
• Configure Hadoop.
• Place the jar file in an appropriate location.
• Let's move to the demo :)
Documentation:
• Hadoop Wiki
– Introduction
• http://hadoop.apache.org/core/
– Getting Started
• http://wiki.apache.org/hadoop/GettingStartedWithHadoop
– Map/Reduce Overview
• http://wiki.apache.org/hadoop/HadoopMapReduce
– DFS
• http://hadoop.apache.org/core/docs/current/hdfs_design.html
• Javadoc
– http://hadoop.apache.org/core/docs/current/api/index.html
Thank you
Kindly contact us at the address below with any suggestions or requests
for clarification. We would like to hear from you.
HAMS Technologies
www.hams.co.in