Hadoop Tutorial
Jian Wang
Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das
Yahoo! Inc. Bangalore & Apache Software Foundation
Need to process 10TB datasets
On 1 node:
◦ scanning @ 50MB/s = 2.3 days
On a 1,000-node cluster:
◦ scanning @ 50MB/s = 3.3 min
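The arithmetic behind those figures: 10 TB ≈ 10,000,000 MB, and 10,000,000 MB ÷ 50 MB/s = 200,000 s ≈ 2.3 days on one node; spread evenly over 1,000 nodes that is 200 s ≈ 3.3 minutes.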
Need an efficient, reliable, and usable framework; Hadoop's design is based on:
◦ Google File System (GFS) paper
◦ Google's MapReduce paper
Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB by default) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ The default replication factor is 3 (configurable)
◦ HDFS cannot be directly mounted by an existing operating system
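You can see this block layout directly: HDFS ships an fsck tool that prints each file's blocks, their replica counts, and the datanodes holding each replica (the path below is just an example):
bin/hadoop fsck /user/jwang30/example -files -blocks -locations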
Once you use the DFS (put something in it), relative paths are resolved from /user/{your user id}.
E.g., if your id is jwang30, your "home dir" is /user/jwang30
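Concretely, with the jwang30 id above, these two commands list the same directory:
bin/hadoop dfs -ls example
bin/hadoop dfs -ls /user/jwang30/example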
Master-Slave Architecture
The master (irkm-1) runs the HDFS "Namenode" and the MapReduce "Jobtracker", which
◦ Accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and Tasktracker status, re-executes tasks upon failure
The slaves (irkm-1 to irkm-6) run the HDFS "Datanodes" and the MapReduce "Tasktrackers", which
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output
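In a stock Hadoop 0.19 install the slave set is listed one hostname per line in conf/slaves, which the start/stop scripts read; for this cluster the file would presumably contain:
irkm-1
irkm-2
irkm-3
irkm-4
irkm-5
irkm-6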
Hadoop is locally "installed" on each machine
◦ Version 0.19.2
◦ Installed in /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
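These locations come from the cluster configuration in conf/hadoop-site.xml; a minimal sketch along these lines would produce them (hostnames and ports are illustrative, not read off the course setup):
<?xml version="1.0"?>
<configuration>
  <!-- Namenode address; relative DFS paths resolve against this filesystem -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://irkm-1:9000</value>
  </property>
  <!-- Jobtracker address for submitting MR jobs -->
  <property>
    <name>mapred.job.tracker</name>
    <value>irkm-1:9001</value>
  </property>
  <!-- Base directory for local storage on each node -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>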
If it is the first time that you use it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format
Most commands look similar:
◦ bin/hadoop <command> [options]
◦ If you just type bin/hadoop you get a list of all possible commands (including undocumented ones)
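For example (all of these exist in 0.19.2):
bin/hadoop version (print the Hadoop version)
bin/hadoop dfs -help (file system commands, see below)
bin/hadoop job -list (list running MapReduce jobs)
bin/hadoop fsck / (check HDFS health)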
hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]
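A few of these in combination (directory and file names here are just illustrations):
bin/hadoop dfs -mkdir scratch
bin/hadoop dfs -put local.txt scratch
bin/hadoop dfs -test -e scratch/local.txt && echo exists
bin/hadoop dfs -stat scratch/local.txt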
bin/start-all.sh: starts the master node and all slave nodes
bin/stop-all.sh: stops the master node and all slave nodes
Run jps to check the status
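jps shows the Hadoop daemons as JVM processes. On irkm-1, which acts as both master and slave here, expect something like the following (PIDs are illustrative):
12345 NameNode
12346 JobTracker
12347 SecondaryNameNode
12348 DataNode
12349 TaskTracker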
Log in to irkm-1
rm -fr /tmp/hadoop/$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
After that:
bin/hadoop dfs -ls
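If the copy succeeded, the new entry shows up in the listing, with output along these lines (values illustrative):
Found 1 items
drwxr-xr-x   - jwang30 supergroup          0 2010-03-15 10:30 /user/jwang30/example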
Mapper.py
Reducer.py
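The transcript doesn't include the scripts themselves; a minimal streaming word-count pair, obeying the stdin/stdout contract Hadoop Streaming expects, would look roughly like this (a sketch, not necessarily the exact course files):
#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write('%s\t1\n' % word)

#!/usr/bin/env python
# reducer.py: sum the counts for each word; Hadoop Streaming delivers
# the mapper output sorted by key, so equal words arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write('%s\t%d\n' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    sys.stdout.write('%s\t%d\n' % (current_word, current_count))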
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar \
  -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py \
  -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py \
  -input example -output java-output
bin/hadoop dfs -cat java-output/part-00000
bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
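Since the scripts only read stdin and write stdout, you can dry-run the pair locally before submitting to the cluster (the input string is made up):
echo "foo foo bar" | python wordcount-py.example/mapper.py | sort | python wordcount-py.example/reducer.py
which should print bar 1 and foo 2.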
Hadoop job tracker
◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
Hadoop task tracker
◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
Hadoop dfs checker
◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp