Slide 1 - Kangwon
Distributed and Parallel Processing Technology
Chapter 1.
Meet Hadoop
Sun Jo
1
Data!
We live in the data age.
The size of the digital universe was estimated at 0.18 ZB in 2006, with tenfold growth to 1.8 ZB forecast by 2011
• 1 ZB = 10^21 bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB
The flood of data is coming from many sources
New York Stock Exchange generates 1 TB of new trade data per day
Facebook hosts about 10 billion photos taking up 1 PB (=1,000 TB) of storage
Internet Archive stores around 2 PB, and is growing at a rate of 20 TB per month
‘Big Data’ also affects smaller organizations and individuals
Digital photos, individual’s interactions – phone calls, emails, documents – are
captured and stored for later access
The amount of data generated by machines will be even greater than that
generated by people
Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions
2
Data!
Data can be shared for anyone to download and analyze
Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
Astrometry.net project
• Watches the astrometry group on Flickr for new photos of the night sky
• Analyzes each image and identifies which part of the sky it shows
The project shows what becomes possible when data is made available and used for something that
was not anticipated by its creator
Big Data is here. We are struggling to store and analyze it.
3
Data Storage and Analysis
Storage capacities have increased, but access speeds haven’t kept up
                                          1990               2010
1 drive stores                            1,370 MB           1 TB
Transfer speed                            4.4 MB/s           100 MB/s
Time to read all data from a full drive   around 5 minutes   over 2 hours 30 minutes
• Writing is even slower!
Solution : Read and write data in parallel to/from multiple disks
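The read times quoted for a full drive follow from capacity divided by transfer rate. The sketch below reproduces that arithmetic and shows the effect of parallel reads. It assumes decimal units (1 TB = 1,000,000 MB) and a hypothetical array of 100 drives that each hold an equal share of the data. The drive count and the function name are illustrative and do not come from the slides.

```python
def full_read_seconds(capacity_mb: float, transfer_mb_per_s: float, drives: int = 1) -> float:
    """Seconds to read all the data when it is split evenly across
    `drives` disks that are read in parallel (idealized, no overhead)."""
    return capacity_mb / drives / transfer_mb_per_s


drive_1990 = full_read_seconds(1_370, 4.4)
drive_2010 = full_read_seconds(1_000_000, 100)
# Hypothetical: the same 1 TB spread evenly over 100 drives.
parallel_2010 = full_read_seconds(1_000_000, 100, drives=100)

print(f"1990 drive: {drive_1990 / 60:.1f} minutes")
print(f"2010 drive: {drive_2010 / 3600:.2f} hours")
print(f"2010 data over 100 drives: {parallel_2010:.0f} seconds")
```

Under these assumptions, the single 2010 drive needs roughly 2.8 hours, while the 100-drive layout needs about 100 seconds. The model ignores seek time, contention, and failures, so it is only an upper bound on the benefit.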
Problems with using multiple disks
Hardware failure, which is addressed by replication
• RAID : Redundant copies of the data are kept in case of failure
Combining the data read from one disk with the data from the other disks
What Hadoop provides
A reliable shared storage (HDFS)
Efficient analysis (MapReduce)
4
Comparison with Other Systems - RDBMS
RDBMS
B-Tree index
Optimized for accessing and updating a small proportion of records
MapReduce
Efficient for updating the majority of a database; uses Sort/Merge to rebuild the database
Well suited to problems that need to analyze the whole dataset in a batch fashion
Structured vs. Semi- or Unstructured Data
Structured data : conforms to a particular predefined schema -> RDBMS
Semi- or Unstructured data : looser or no particular internal structure -> MapReduce
Normalization
To retain the integrity and remove redundancy, relational data is often normalized
MapReduce performs high-speed streaming reads and writes, and records that are not
normalized are well suited to analysis with MapReduce.
5
Comparison with Other Systems - RDBMS
RDBMS vs. MapReduce
Co-evolution of RDBMS and MapReduce systems
Relational databases are starting to incorporate some of the ideas from MapReduce
Higher-level query languages built on MapReduce
• Making MapReduce systems more approachable to traditional database programmers
6
Comparison with Other Systems – Grid Computing
Grid Computing
The High Performance Computing (HPC) and Grid Computing communities have been
doing large-scale data processing
• Using APIs such as the Message Passing Interface (MPI)
HPC
• Distribute the work across a cluster of machines, which access a shared filesystem, hosted by
a SAN
• Works well for compute-intensive jobs
• Runs into problems when nodes need to access larger data volumes (hundreds of GB), since
the network bandwidth becomes the bottleneck and compute nodes sit idle
Data locality, the heart of MapReduce
MapReduce collocates the data with the compute node, so data access is fast since it
is local
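The idea of running a task where its data already lives can be sketched as a placement rule. This is a simplified illustration, not Hadoop's actual scheduler. The function `pick_node` and the node names are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_node(replica_nodes: List[str], free_nodes: List[str]) -> Optional[Tuple[str, bool]]:
    """Return (node, is_local) for a task, or None when no node is free.

    A free node that already stores the input block is preferred, so the
    task reads from local disk instead of pulling data over the network.
    """
    for node in free_nodes:
        if node in replica_nodes:
            return node, True  # data is already on this node's disk
    if free_nodes:
        return free_nodes[0], False  # data must travel over the network
    return None


free = ["node-b", "node-d"]
local_choice = pick_node(["node-a", "node-b"], free)  # block stored on a free node
remote_choice = pick_node(["node-c"], free)           # no free node holds the block
print(local_choice, remote_choice)
```

The first task lands on a node that holds its block. The second falls back to a remote read, which is the case that the shared-filesystem HPC design hits for every access.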
MPI vs. MapReduce
MPI programmers need to handle the mechanics of the data flow
MapReduce programmers think in terms of functions of key and value pairs, and the
data flow is implicit
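To make the key/value view concrete, here is a minimal in-memory word count. It is a sketch of the programming model only, not the Hadoop API. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical. The programmer writes only the two functions, and the small driver stands in for the framework, which groups values by key between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines: List[str]) -> Dict[str, int]:
    """Stand-in for the framework: run map, group by key, run reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)  # the implicit "shuffle" step
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["big data is here", "data is big"])
print(result)
```

Neither `map_fn` nor `reduce_fn` mentions how data moves between machines. In an MPI program, by contrast, that movement would be written out explicitly.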
7
Comparison with Other Systems – Grid Computing
Partial failure
MapReduce is a shared-nothing architecture: tasks have no dependence on one
another, so the order in which the tasks run doesn’t matter and a failed task can simply be rerun.
MPI programs have to manage the check-pointing and recovery
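Because tasks are independent, a framework can handle a partial failure by rerunning only the task that failed. The sketch below simulates this in a single process. The helper names, the simulated failures, and the limit of four attempts are assumptions made for illustration, not Hadoop behavior taken from the slides.

```python
from typing import Callable, Dict


def flaky(value: int, failures: int) -> Callable[[], int]:
    """Build a task that fails `failures` times, then returns value squared."""
    state = {"remaining": failures}

    def task() -> int:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated node failure")
        return value * value

    return task


def run_with_retries(tasks: Dict[str, Callable[[], int]], max_attempts: int = 4) -> Dict[str, int]:
    """Run each task on its own, rerunning only the ones that fail."""
    results: Dict[str, int] = {}
    for name, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = task()
                break  # this task is done; other tasks are unaffected
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up on the whole job after too many failures
    return results


results = run_with_retries({"t1": flaky(2, 0), "t2": flaky(3, 2), "t3": flaky(4, 1)})
print(results)
```

No task needs a checkpoint of another task's state, because nothing is shared between them.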
8
Comparison with Other Systems – Volunteer Computing
Volunteer computing projects
Breaking the problem into chunks called work units
Sending to computers around the world to be analyzed
The results are sent back to the server when the analysis is completed
The client gets another work unit
SETI@home
Analyzes radio telescope data for signs of intelligent life outside Earth
SETI@home vs. MapReduce
SETI@home
• Very CPU-intensive, which makes it suitable for running on hundreds of thousands of
computers across the world. Volunteers are donating CPU cycles, not bandwidth
• Runs a perpetual computation on untrusted machines on the Internet with highly variable
connection speeds and no data locality
MapReduce
• Designed to run jobs that last minutes or hours on hardware running in a single data center with
very high aggregate bandwidth interconnects
9
A Brief History of Hadoop
Hadoop
Created by Doug Cutting, the creator of Apache Lucene, the text search library
Has its origins in Apache Nutch, an open source web search engine that was itself part of the
Lucene project
‘Hadoop’ was the name that Doug’s kid gave to a stuffed yellow elephant toy
History
In 2002, Nutch was started
• A working crawler and search system emerged
• Its architecture wouldn’t scale to the billions of pages on the Web
In 2003, Google published a paper describing the architecture of Google’s
distributed filesystem, GFS
In 2004, the Nutch project implemented the GFS idea as the Nutch Distributed
Filesystem, NDFS
In 2004, Google published the paper introducing MapReduce
In 2005, the Nutch developers had a working MapReduce implementation in Nutch
• By the middle of that year, all the major Nutch algorithms had been ported to run using
MapReduce and NDFS
10
A Brief History of Hadoop
History
In Feb. 2006, Doug Cutting started an independent subproject of Lucene, called Hadoop
• In Jan. 2006, Doug Cutting joined Yahoo!
• Yahoo! provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale
In Feb. 2008, Yahoo! announced that its search index was being generated by a 10,000-core
Hadoop cluster
In Apr. 2008, Hadoop broke the world record for sorting a terabyte of data
In Nov. 2008, Google reported that its MapReduce implementation sorted one terabyte
in 68 seconds.
In May 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds
11
Apache Hadoop and the Hadoop Ecosystem
The Hadoop projects covered in this book are the following:
Common – a set of components and interfaces for filesystems and I/O.
Avro – a serialization system for RPC and persistent data storage.
MapReduce – a distributed data processing model.
HDFS – a distributed filesystem running on large clusters of machines.
Pig – a data flow language and execution environment for large datasets.
Hive – a distributed data warehouse providing SQL-like query language.
HBase – a distributed, column-oriented database.
ZooKeeper – a distributed, highly available coordination service.
Sqoop – a tool for efficiently moving data between relational databases and HDFS.
12