Slide 1 - Kangwon

Distributed and Parallel Processing Technology
Chapter 1.
Meet Hadoop
Sun Jo
Data!
• We live in the data age.
  • One estimate put the size of the digital universe at 0.18 ZB in 2006, with a forecast of tenfold growth to 1.8 ZB by 2011.
    • 1 ZB = 10^21 bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB
• The flood of data comes from many sources.
  • The New York Stock Exchange generates about 1 TB of new trade data per day.
  • Facebook hosts about 10 billion photos, taking up 1 PB (= 1,000 TB) of storage.
  • The Internet Archive stores around 2 PB and is growing at a rate of 20 TB per month.
• ‘Big Data’ also affects smaller organizations and individuals.
  • Digital photos and an individual’s interactions (phone calls, emails, documents) are captured and stored for later access.
• The amount of data generated by machines will be even greater than that generated by people.
  • Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions

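The unit relationships and growth figures above can be checked with a few lines of Python. This is only a sketch: it uses decimal (power-of-ten) units as on the slide, the constant names and the helper `to_tb` are illustrative rather than taken from the source, and the last calculation assumes the Internet Archive's growth rate stays constant.

```python
# Decimal storage units, as used on the slide (1 ZB = 10**21 bytes).
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21


def to_tb(num_bytes):
    """Convert a byte count to terabytes (hypothetical helper)."""
    return num_bytes / TB


# 1 ZB expressed in EB, PB, and TB.
print(ZB // EB, ZB // PB, ZB // TB)  # 1000 1000000 1000000000

# Forecast growth from 0.18 ZB (2006) to 1.8 ZB (2011).
growth_factor = round(1.8 / 0.18)

# Internet Archive: about 2 PB stored, growing by 20 TB per month.
# Assumption: the growth rate stays constant.
archive_tb = to_tb(2 * PB)
months_to_add_another_2pb = archive_tb / 20
print(growth_factor, archive_tb, months_to_add_another_2pb)
```
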
Data!
• Data can be shared for anyone to download and analyze.
  • Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
  • Astrometry.net project
    • Watches the Astrometry group on Flickr for new photos of the night sky
    • Analyzes each image and identifies which part of the sky it shows
  • The project shows the kinds of things that are possible when data is made available and used for something that was not anticipated by its creator.
• Big Data is here, and we are struggling to store and analyze it.

Data Storage and Analysis
• Storage capacities have increased, but access speeds haven’t kept up.

                                   1990          2010
  Capacity of one drive            1,370 MB      1 TB
  Transfer speed                   4.4 MB/s      100 MB/s
  Time to read the full drive      ~5 minutes    more than 2.5 hours

    • Writing is even slower!
• Solution: read and write data in parallel to/from multiple disks.
• Problems
  • Hardware failure → solved by replication
    • RAID: redundant copies of the data are kept in case of failure
  • Data read from one disk has to be combined with data from the others
• What Hadoop provides
  • Reliable shared storage (HDFS)
  • Efficient analysis (MapReduce)

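The read times in the table follow directly from capacity divided by transfer speed, and the same arithmetic shows why reading from many disks in parallel helps. The sketch below is an idealized model (no seek or coordination overhead). The function names and the choice of 100 drives are assumptions made for illustration, not figures from the slide.

```python
def read_time_seconds(capacity_mb, transfer_mb_per_s, num_drives=1):
    """Time to read capacity_mb spread evenly over num_drives read in parallel.

    Idealized: ignores seek time and any coordination cost.
    """
    return capacity_mb / num_drives / transfer_mb_per_s


def fmt(seconds):
    """Format a duration in seconds as 'Hh MMm SSs'."""
    s = round(seconds)
    return f"{s // 3600}h {s % 3600 // 60:02d}m {s % 60:02d}s"


drive_1990 = read_time_seconds(1_370, 4.4)        # 1,370 MB at 4.4 MB/s
drive_2010 = read_time_seconds(1_000_000, 100)    # 1 TB at 100 MB/s
# Assumption: the same 1 TB split evenly across 100 drives.
parallel_2010 = read_time_seconds(1_000_000, 100, num_drives=100)

print(fmt(drive_1990))     # 0h 05m 11s
print(fmt(drive_2010))     # 2h 46m 40s
print(fmt(parallel_2010))  # 0h 01m 40s
```

Under these assumptions, the 2010 drive needs well over two and a half hours on its own, while 100 drives working in parallel would finish in under two minutes.
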
Comparison with Other Systems - RDBMS
• RDBMS
  • Uses a B-Tree index
  • Optimized for accessing and updating a small proportion of records
• MapReduce
  • Efficient for updating the majority of a large dataset; uses Sort/Merge to rebuild the database
  • Well suited to problems that need to analyze the whole dataset in a batch fashion
• Structured vs. semi-structured or unstructured data
  • Structured data: conforms to a particular predefined schema → RDBMS
  • Semi-structured or unstructured data: looser or no particular internal structure → MapReduce
• Normalization
  • To retain integrity and remove redundancy, relational data is often normalized.
  • MapReduce performs high-speed streaming reads and writes, so records that are not normalized are well suited to analysis with MapReduce.

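The trade-off between in-place B-Tree updates and a streaming Sort/Merge rebuild can be illustrated with a back-of-the-envelope cost model. Every number below (seek time, transfer rate, record size, dataset size) is an assumption chosen for illustration and does not come from the slides. The model also simplifies a rebuild to one sequential read plus one sequential write, and charges one seek per updated record.

```python
SEEK_TIME_S = 0.010        # assumed average seek time per random update
TRANSFER_MB_PER_S = 100    # assumed sequential transfer rate
RECORD_BYTES = 100         # assumed record size
DATASET_BYTES = 10**12     # assumed dataset size (1 TB)

NUM_RECORDS = DATASET_BYTES // RECORD_BYTES


def seek_update_seconds(fraction):
    """Update a fraction of the records in place, one seek per record."""
    return fraction * NUM_RECORDS * SEEK_TIME_S


def rewrite_seconds():
    """Stream the whole dataset once to read it and once to write it back."""
    dataset_mb = DATASET_BYTES / 10**6
    return 2 * dataset_mb / TRANSFER_MB_PER_S


def break_even_fraction():
    """Fraction of updated records at which both strategies cost the same."""
    return rewrite_seconds() / (NUM_RECORDS * SEEK_TIME_S)


for fraction in (0.00001, 0.001, 0.1):
    print(fraction, seek_update_seconds(fraction), rewrite_seconds())
print("break-even fraction:", break_even_fraction())
```

With these assumed figures, seek-based updates win only while a tiny fraction of the records changes. Beyond that point, streaming through the whole dataset is cheaper, which matches the division of labor described on the slide.
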
Comparison with Other Systems - RDBMS
• RDBMS vs. MapReduce
• Co-evolution of RDBMS and MapReduce systems
  • Relational databases are starting to incorporate some of the ideas from MapReduce
  • Higher-level query languages are being built on MapReduce
    • They make MapReduce systems more approachable to traditional database programmers

Comparison with Other Systems – Grid Computing
• Grid Computing
  • The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing
    • Using APIs such as the Message Passing Interface (MPI)
  • HPC
    • Distributes the work across a cluster of machines, which access a shared filesystem hosted by a SAN
    • Works well for compute-intensive jobs
    • Runs into problems when nodes need to access larger data volumes (hundreds of GB), since network bandwidth becomes the bottleneck and compute nodes sit idle
• Data locality, the heart of MapReduce
  • MapReduce co-locates the data with the compute node, so data access is fast because it is local
• MPI vs. MapReduce
  • MPI programmers need to handle the mechanics of the data flow explicitly
  • MapReduce programmers think in terms of functions of key and value pairs, and the data flow is implicit

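To make "functions of key and value pairs" concrete, here is a toy word count written in plain Python. It is a single-process sketch of the programming model only, not Hadoop's API. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the driver stands in for the grouping and data movement that a real framework performs across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


lines = ["Meet Hadoop", "Hadoop stores data", "MapReduce analyzes data"]
records = list(enumerate(lines))  # (line number, line text) pairs
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

The programmer writes only `map_fn` and `reduce_fn`. How intermediate pairs travel from mappers to reducers is hidden inside the driver, which is the sense in which the data flow is implicit.
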
Comparison with Other Systems – Grid Computing
• Partial failure
  • MapReduce is a shared-nothing architecture → tasks have no dependence on one another → the order in which the tasks run doesn’t matter, so a failed task can simply be rescheduled
  • MPI programs have to manage their own checkpointing and recovery

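The following sketch shows why independent tasks make partial failure easy to handle: a failed task is run again, and nothing else has to be rolled back. It is a simulation under stated assumptions, not how Hadoop schedules work. `run_tasks` and `flaky_sum` are hypothetical names, and the attempt number is passed to the task only so that the failure can be simulated deterministically.

```python
def run_tasks(chunks, task, max_attempts=3):
    """Run task on each chunk independently, retrying failures.

    Because tasks share no state, a failed task is simply run again;
    no other task has to be rolled back or restarted.
    """
    results = {}
    pending = list(range(len(chunks)))
    attempts = {i: 0 for i in pending}
    while pending:
        i = pending.pop()              # any order works
        attempts[i] += 1
        try:
            results[i] = task(i, chunks[i], attempts[i])
        except RuntimeError:
            if attempts[i] >= max_attempts:
                raise
            pending.insert(0, i)       # reschedule only the failed task
    return [results[i] for i in range(len(chunks))]


def flaky_sum(task_id, chunk, attempt):
    """Simulated task: task 1 fails on its first attempt."""
    if task_id == 1 and attempt == 1:
        raise RuntimeError("simulated node failure")
    return sum(chunk)


chunks = [[1, 2], [3, 4], [5, 6]]
partial_sums = run_tasks(chunks, flaky_sum)
print(partial_sums, sum(partial_sums))  # [3, 7, 11] 21
```
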
Comparison with Other Systems – Volunteer Computing
• Volunteer computing projects
  • Break the problem into chunks called work units
  • Send the work units to computers around the world to be analyzed
  • Results are sent back to the server when the analysis is completed
  • The client then gets another work unit
• SETI@home
  • Analyzes radio telescope data for signs of intelligent life outside Earth
• SETI@home vs. MapReduce
  • SETI@home
    • Very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world; volunteers are donating CPU cycles, not bandwidth
    • Runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality
  • MapReduce
    • Designed to run jobs that last minutes or hours on hardware running in a single data center with very high aggregate bandwidth interconnects

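The work-unit cycle described above (split, send, analyze, report, fetch the next unit) can be sketched in a few lines. This is an in-memory illustration, not SETI@home's actual protocol. The function names, the sample data, and the "strongest sample" analysis are all made up for the example, and a single loop stands in for many remote volunteers.

```python
from collections import deque


def make_work_units(samples, unit_size):
    """Server side: break the problem into fixed-size work units."""
    return deque(
        (unit_id, samples[start:start + unit_size])
        for unit_id, start in enumerate(range(0, len(samples), unit_size))
    )


def analyze(unit_data):
    """Client side: stand-in analysis that reports the strongest sample."""
    return max(unit_data)


def client_loop(work_queue, results):
    """One volunteer: fetch a unit, analyze it, send the result back, repeat."""
    while work_queue:
        unit_id, unit_data = work_queue.popleft()  # get a work unit
        results[unit_id] = analyze(unit_data)      # report back to the server


signal = [3, 9, 1, 4, 7, 2, 8, 5]
queue = make_work_units(signal, unit_size=3)
results = {}
client_loop(queue, results)
print(results)  # {0: 9, 1: 7, 2: 8}
```
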
A Brief History of Hadoop
• Hadoop
  • Created by Doug Cutting, the creator of Apache Lucene, a text search library
  • Has its origins in Apache Nutch, an open source web search engine that is itself part of the Lucene project
  • ‘Hadoop’ is the name that Doug’s kid gave to a stuffed yellow elephant toy
• History
  • In 2002, Nutch was started
    • A working crawler and search system emerged
    • Its architecture wouldn’t scale to the billions of pages on the Web
  • In 2003, Google published a paper describing the architecture of its distributed filesystem, GFS
  • In 2004, the Nutch project implemented the GFS ideas as the Nutch Distributed Filesystem, NDFS
  • In 2004, Google published the paper introducing MapReduce
  • In 2005, Nutch had a working MapReduce implementation
    • By the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS

A Brief History of Hadoop
• History
  • In Feb. 2006, Doug Cutting started an independent subproject of Lucene called Hadoop
    • In Jan. 2006, Doug Cutting joined Yahoo!
    • Yahoo! provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale
  • In Feb. 2008, Yahoo! announced that its search index was being generated by a 10,000-core Hadoop cluster
  • In Apr. 2008, Hadoop broke a world record for sorting a terabyte of data
  • In Nov. 2008, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds
  • In May 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds

Apache Hadoop and the Hadoop Ecosystem
• The Hadoop projects covered in this book are the following:
  • Common – a set of components and interfaces for filesystems and I/O.
  • Avro – a serialization system for RPC and persistent data storage.
  • MapReduce – a distributed data processing model.
  • HDFS – a distributed filesystem running on large clusters of machines.
  • Pig – a data flow language and execution environment for large datasets.
  • Hive – a distributed data warehouse providing a SQL-like query language.
  • HBase – a distributed, column-oriented database.
  • ZooKeeper – a distributed, highly available coordination service.
  • Sqoop – a tool for efficiently moving data between relational databases and HDFS.