
Hadoop on Palmetto HPC
Pengfei Xuan
Jan 8, 2015
School of Computing
Clemson University
Outline
• Introduction
• Hadoop over Palmetto HPC Cluster
• Hands-on Practice
HPC Cluster vs. Hadoop Cluster
[Diagram: an HPC cluster of compute nodes (RAM, SSD) linked by high-speed networking, side by side with a Hadoop cluster of data nodes using local HDDs]
HPC Clusters
The National Center for Supercomputing Applications (NCSA)
Forge (NVIDIA GPU Cluster)
• 44 GPU Nodes
• 6 or 8 NVIDIA Fermi M2070 GPUs per Node
• 6GB Graphics Memory per GPU
• 600 TB GPFS File System
• 40 Gb/sec InfiniBand QDR per Node
(point-to-point unidirectional speed)
[Photos: InfiniBand switch + 40 Gb InfiniBand adapter + 8 NVIDIA Fermi M2070 GPUs]
Hadoop Clusters
History of Hadoop
[Photos: Jeffrey Dean (Google), Doug Cutting (Hadoop)]
• 1998 Google founded
• 2003 GFS paper
• 2004 MapReduce paper; Nutch DFS implementation
• 2005 Nutch MapReduce implementation
• 2006 BigTable paper; Hadoop becomes its own project
• 2008 World's largest Hadoop cluster
• 2010 Facebook stores 21 PB of data in Hadoop
• 2011 Microsoft, IBM, Oracle, Twitter, Amazon adopt Hadoop
• Now: everywhere, including our class!
Google vs. Hadoop Infrastructures
[Diagram: layer-by-layer comparison of Google's internal stack with its open-source Hadoop-ecosystem counterparts]
• I. Data Coordination Layer: Chubby ↔ Zookeeper
• II. Data Storage and Computing Layer: GFS / MapReduce / Bigtable ↔ HDFS / Hadoop MapReduce / HBase
• III. Data Flow Layer: Evenflow / Sawzall ↔ Oozie, Azkaban, Hive / Pig
• IV. Data Analysis Layer: Dremel / MySQL Gateway ↔ Hive, HiPal, Hue
• Supporting tools in the Hadoop stacks: Scribe, Data Highway, Sqoop, Flume, Kafka, Databee, Crunch, Voldemort
MapReduce Word Count Example
cat * | grep | sort | uniq -c | cat > file
(the stages correspond to MapReduce's input, map, shuffle/sort, reduce, and output phases)
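
To make the analogy concrete, here is a runnable version of the pipeline (a sketch: the input/ directory and the output file name are hypothetical), with each stage labeled by the MapReduce phase it mirrors:

# Word count as a Unix pipeline; each stage mirrors a MapReduce phase.
cat input/* |               # read the input splits   (input)
  tr -s '[:space:]' '\n' |  # emit one word per line  (map)
  sort |                    # group identical words   (shuffle/sort)
  uniq -c > wordcount.txt   # count each group        (reduce/output)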
Run Hadoop over Palmetto Cluster
1. Setup Hadoop configuration files
2. Start Hadoop services
3. Copy input files to HDFS (stage-in)
4. Run Hadoop job (MapReduce WordCount)
5. Copy output files from HDFS to your home directory (stage-out)
6. Stop Hadoop services
7. Clean up (see the PBS script sketch below)
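
A minimal sketch of how runHadoop.pbs can implement these seven steps, assuming myHadoop's helper scripts (myhadoop-configure.sh, myhadoop-cleanup.sh) and a Hadoop 1.x installation with $HADOOP_HOME/bin on the PATH; the resource request, paths, and file names are illustrative, not the exact contents of the course script:

#!/bin/bash
#PBS -N runHadoop
#PBS -l select=4:ncpus=8,walltime=01:00:00

cd $PBS_O_WORKDIR

# 1-2. Generate per-job Hadoop configs on node-local scratch, then start the daemons
myhadoop-configure.sh
start-all.sh

# 3. Stage input files into HDFS
hadoop dfs -mkdir wordcount-input
hadoop dfs -put input/* wordcount-input/

# 4. Run the MapReduce WordCount example
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount wordcount-input wordcount-output

# 5. Stage output from HDFS back to the home directory
hadoop dfs -get wordcount-output $PBS_O_WORKDIR/wordcount-output

# 6. Stop Hadoop services
stop-all.sh

# 7. Clean up node-local scratch space
myhadoop-cleanup.sh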
Commands
1. Create job directory:
$> mkdir myHadoopJob1
$> cd myHadoopJob1
2. Get Hadoop PBS Script:
$> cp /tmp/runHadoop.pbs .
Or,
$> wget https://raw.githubusercontent.com/pfxuan/myhadoop/master/examples/runHadoop.pbs
3. Submit job to Palmetto cluster:
$> qsub runHadoop.pbs
4. Check status of your job:
$> qstat -anu your_cu_username
5. Verify the correctness of your result:
$> grep Hadoop wordcount-output/* | grep 51
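
WordCount writes one "word<TAB>count" line per distinct word, so on success the check above should print a line resembling the one below (the part file name varies with the Hadoop version, and the count of 51 is specific to the sample input):

wordcount-output/part-r-00000:Hadoop	51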