CSCI 572: Information Retrieval and Search Engines

Introduction to Apache Hadoop
CSCI 572: Information Retrieval and Search Engines
Summer 2010
Outline
• What is Hadoop?
• Where did it come from?
• What are the current versions of Hadoop?
• What can it do?
May-20-10
CS572-Summer2010
CAM-2
Apache Hadoop
• The brainchild of Doug Cutting
• Built out by brilliant engineers and contributors from Yahoo, Facebook, Cloudera, and other companies
• Started when code was spun out of Nutch in 2006; became a top-level Apache project in 2008
• Has grown into a very large Apache project with a significant ecosystem
How to get started
• Hadoop (0.20.0/0.20.2)
– Put your Java hat on
– Go here:
• http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• If you want to do this on Windows, get Cygwin, VMware, or something else that lets you run Linux
• Run the Map Reduce examples in local mode
• Check on the data generated in your HDFS
– Scaling it out
• Amazon Elastic Map Reduce
• Setting it up on your own cluster: DataNodes, TaskTrackers, and the JobTracker
Basic Operations
• Listing files
– ./bin/hadoop fs -ls
• Writing files
– ./bin/hadoop fs -put <localsrc> <dst>
• Running Map Reduce Jobs
– mkdir input
– cp conf/*.xml input
– ./bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
– cat output/*
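To make the grep job above less of a black box, here is a minimal plain-Java sketch of what it appears to compute: every substring matching the regular expression is extracted from each input line, and the occurrences of each distinct match are summed. It uses no Hadoop classes. The class name GrepSketch, its method names, and the sample lines are assumptions made for illustration, not part of the Hadoop examples jar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: plain Java, no Hadoop dependency.
class GrepSketch {

    // "Map" step: emit every substring of the line that matches the pattern.
    static List<String> map(String line, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // "Reduce" step: sum the occurrences of each distinct match.
    static Map<String, Integer> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String match : map(line, pattern)) {
                counts.merge(match, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical lines resembling the conf/*.xml files copied into input/.
        List<String> lines = List.of(
            "<name>dfs.replication</name>",
            "<name>dfs.name.dir</name>",
            "<description>see dfs.replication</description>");
        countMatches(lines, "dfs[a-z.]+").forEach(
            (match, count) -> System.out.println(count + "\t" + match));
    }
}
```

On a cluster the map step would run in parallel over file splits and the framework would do the grouping; the sketch folds both into a single loop.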
Advanced Topics
• Writing your Mappers and Reducers
– Check out the Map Reduce Tutorial here:
– http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html
– Includes code for several examples, including Word Count
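Before reading the tutorial's Hadoop code, it may help to see the mapper/reducer contract in isolation. The sketch below is plain Java with no Hadoop classes: a mapper turns one line into (word, 1) pairs, a shuffle step groups values by key, and a reducer sums each group. The class WordCountSketch and its method names are assumptions for illustration; the real tutorial code uses the Hadoop API instead.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: models the three phases without the Hadoop API.
class WordCountSketch {

    // Mapper: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all values emitted for the same key, keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reducer: one key and all of its values in, a single sum out.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        wordCount(List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop"))
            .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real job you write only the map and reduce functions; the framework supplies the shuffle and the driver loop.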
Other Hadoop ecosystem projects
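• HBase
– Distributed column-oriented store modeled on Google's Bigtable
• Hive
– Built at Facebook, provides a SQL-like interface over data in HDFS
• Chukwa
– Log collection and processing
• Pig
– High-level data analysis language on top of Map Reduce and HDFS
• ZooKeeper
– Coordination service for distributed systems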
No releases in a while
• Stick with 0.20.x
Wrapup
• Lots more information at
– http://hadoop.apache.org
– http://hadoop.apache.org/mapreduce/
– http://hadoop.apache.org/hdfs/
• Project ideas
– Implement a GIS or geometric algorithm in Map Reduce
– Write a REST interface to control HDFS and Map Reduce
– Add new Writable input data formats
– Integrate Solr and Hadoop
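For the Writable project idea, the core of the work is a pair of methods that serialize a record to a DataOutput and read it back from a DataInput, which is the shape of Hadoop's org.apache.hadoop.io.Writable interface. The sketch below mirrors those two method signatures in plain Java so it runs without Hadoop on the classpath. The GeoPointRecord type, its fields, and the roundTrip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type. In a real job it would additionally declare
// "implements org.apache.hadoop.io.Writable"; that is omitted here so the
// sketch has no Hadoop dependency.
class GeoPointRecord {
    double latitude;
    double longitude;
    String label = "";

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeDouble(latitude);
        out.writeDouble(longitude);
        out.writeUTF(label);
    }

    // Read the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        latitude = in.readDouble();
        longitude = in.readDouble();
        label = in.readUTF();
    }

    // Helper for experimenting: serialize to bytes, then deserialize a copy.
    static GeoPointRecord roundTrip(GeoPointRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            GeoPointRecord copy = new GeoPointRecord();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GeoPointRecord point = new GeoPointRecord();
        point.latitude = 34.02;
        point.longitude = -118.28;
        point.label = "campus";
        GeoPointRecord copy = roundTrip(point);
        System.out.println(copy.label + " " + copy.latitude + " " + copy.longitude);
    }
}
```

A record used as a key would also need a comparison method so it can be sorted during the shuffle; that part is left out of this sketch.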
Acknowledgements
• Material inspired by discussions and talks on the Apache mailing lists for Hadoop and by discussions with the rest of the Hadoop community