2008 Sales Proposal - University of California, Irvine
Hadoop
Origins & Applications
Christopher Smith
Xavier Stevens
John Carnahan
Original Map & Reduce
• LISP
– map f(x) [x0, x1, …, xn]
• yields: [f(x0), f(x1), …, f(xn)]
– reduce f(x, y) [x0, x1, ..., xn]
• yields: f(…f(f(x0, x1), x2) …, xn)
• reduce + [1, 2, 3, 4] = ((1 + 2) + 3) + 4 = 10
• Key properties:
– input is read only
– operation on each list element is isolated
– while order of execution can matter, often doesn’t
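A minimal Java sketch (illustrative, not from the slides) of the same two primitives using the stream API: map transforms each element in isolation, and reduce folds the list left to right exactly as in the examples above.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of functional map and reduce.
public class FunctionalMapReduce {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);

        // map f over [x0, x1, ..., xn]: each element is transformed independently
        List<Integer> squares = xs.stream()
                                  .map(x -> x * x)
                                  .collect(Collectors.toList());
        System.out.println(squares);   // [1, 4, 9, 16]

        // reduce + over [1, 2, 3, 4] = ((1 + 2) + 3) + 4 = 10
        int sum = xs.stream().reduce(0, Integer::sum);
        System.out.println(sum);       // 10
    }
}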
Distributed Computing Strategies
• It’s all about state
– Single System Image: PVM, LOCUS, OpenSSI
– Message Passing: MPI, Io, Erlang
– Parallel Storage: PVFS2, Lustre, GPFS, LINDA
– Grid Computing: distributed.net, Sun Grid Engine,
Globus, GridGain
– Functional models: distributed SQL, ??
Overcoming Hardware
• Most of those strategies devolve to trying to get fast
access to random bits of data
– Problem: Network != RAM
– Problem: Remote disk != Local disk
– Problem: Latency == HARD
– Hardware Solution: SAN, Myrinet, Infiniband, RDMA,
etc.
– End Result
• Compute power follows Moore’s Law
• Throughput almost follows Moore’s Law
• Latency follows Murphy’s Law
Google Part 1 - GoogleFS
• Even if hardware could do it, can’t afford it
• Need to scale beyond hardware solutions anyway
• Somewhere between embarrassingly parallel and
tightly coupled algorithms.
• BigFiles for document repository -> GoogleFS
– NOT a filesystem (just ask POSIX)
– Break up files into large (64MB) chunks and distribute them
– Send code out to the nodes, process the chunks that are on the node, and send results to new files or over the network
– Append-only mutation model
Google Part 2 – Herding Cats
• Needed to simplify distributed computing work
– Hundreds of different distributed programs running
against GoogleFS
– Hard to find programmers good at distributed
computing.
– Harder to find ops people who could support it
– Impossible to find anyone who understood all the
interactions
Google Part 3 – MapReduce
• Key Insight: Use very specific subset of functional
programming concepts for the framework, but allow
applications to be written imperatively.
• Common Process
– Extract features and combine by feature
• Extract features = map
• Combine by feature = reduce
– Slightly different from LISP (tuple based)
• map has arbitrary key instead of offset
• reduce input & output contain key
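The tuple-based contract can be sketched as a pair of signatures (illustrative only, not Google's actual API): map takes an arbitrary key rather than a list offset, and reduce both receives and re-emits keys.

import java.util.Map;

// Illustrative shape of the MapReduce contract described above.
interface MapReduceContract<K1, V1, K2, V2> {

    // "Extract features": (k1, v1) -> list of (k2, v2), with arbitrary keys k2
    Iterable<Map.Entry<K2, V2>> map(K1 key, V1 value);

    // "Combine by feature": (k2, [v2, v2, ...]) -> list of (k2, v2)
    Iterable<Map.Entry<K2, V2>> reduce(K2 key, Iterable<V2> values);
}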
MapReduce Visualization
[Diagram: the Master coordinates the job; Mappers read <K1,V1>, <K2,V2>, … input pairs from Chunk Servers and emit intermediate pairs, which are grouped by key into <K1,[V1,V2,V3,V4]>, <K2,[V1,V2,V3,V4]>, <K3,[V1,V2,V3,V4]> lists; each Reducer then emits a single <K,V> result per key.]
End Result
• Jobs NOT latency bound
– Most state stays on the local nodes
– each node scans data at over 300MB/s
– Jobs inherently scan driven, rather than latency
driven
– Mappers cull data to reduce network I/O
• Redundancy improves reliability *and* performance
• Dirt cheap hardware, insane scalability (1000+ nodes)
• Fast developer ramp up time
• New jobs easy to develop
…But Wait, There’s More!
• Sawzall
– Scripting language for writing mappers
– Common off the shelf reducers
– Writing common MapReduce jobs easier than writing
SQL or AWK scripts
• BigTable
– Column store built on GoogleFS
– Highly concurrent atomic updates to a single record
– MapReduce & Sawzall can run against it
Lessons Learned
• One step beyond “embarrassingly parallel” can provide significant value.
• If your approach is fighting hardware trends, try a
different approach.
• Mature, sophisticated systems often have biases that
fight both of these principles, leading to opportunities
for new, primitive systems.
• You know your idea is good when everyone else
imitates it.
Hadoop History
• Nutch Search Engine
• Started in 2004 with the aim to replicate Google’s toolset
• Open-source implementation written in Java
• FAN has been using Hadoop since 2006
Motivation
• Lots of data to process
– Utilize as much hardware as we can throw at it
– Make programming as easy as possible
• Hadoop
– Automatic parallelization
– Fault-Tolerance
– I/O Scheduling
– Status and Monitoring
– Programming model is easy to understand
– Near linear scaling in terms of cost and performance
– Software is free
Hadoop vs. Google
• Hadoop
– Hadoop File System (HDFS)
– Hadoop Map/Reduce
– HBase
– Pig/Hive
• Google
– Google File System (GFS)
– MapReduce
– BigTable
– Sawzall
Hadoop Sub-Projects
• Language Support
– Streaming
• Process Workflow
– Pig, Cascading
• Distributed Databases
– HBase, Cassandra
• Data Warehouse
– Hive
• Coordination Service
– ZooKeeper
How Scalable?
• Yahoo! - Largest Cluster
– 4,000 nodes
– 16PB Raw Disk
– 64TB of RAM
– 32K CPUs
• Gray’s Sort Benchmark
– 1TB Sort on 1500 nodes in 62 seconds
– 1PB Sort on 3700 nodes in 16.25 hours
Hadoop Distributed File System (HDFS)
• Distributed File System designed to run on
low-cost hardware
• Highly fault tolerant
• Designed for high throughput rather than low
latency
• Not POSIX compliant
Hadoop Distributed File System (HDFS)
[Architecture diagram: the Master Node runs the NameNode, with a Secondary NameNode alongside; a Client talks to the NameNode; each Slave Node runs a DataNode that stores data on its Local Disk.]
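Since HDFS is not POSIX compliant, it is reached through Hadoop's own FileSystem API. A minimal client sketch; the NameNode address and file path below are placeholders, not values from the talk.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client sketch: write a file, then read it back.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");  // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");               // placeholder path

        // Create and write (HDFS files are write-once / append-oriented, not randomly mutable)
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("Hello HDFS\n");
        out.close();

        // Read back through the same API
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();

        fs.close();
    }
}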
Hadoop Map/Reduce
• Easy-to-use software framework for processing large datasets in parallel
• Designed to run on low-cost hardware in a reliable, fault-tolerant manner
Hadoop Map/Reduce
[Architecture diagram: a Client submits jobs to the JobTracker on the Master Node; each Slave Node runs a TaskTracker that executes individual Tasks.]
Hadoop Map/Reduce
Data flow: Input <k1,v1> -> Map -> <k2,v2> -> Reduce -> <k3,v3> -> Output
Hadoop Map/Reduce – Word Count
Data flow: Input <1,Hello World>, <2,Hello UCI> -> Map -> <Hello,1>, <World,1>, <Hello,1>, <UCI,1> -> Reduce -> <Hello,2>, <UCI,1>, <World,1> -> Output
Input data:
Hello World!
Hello UCI!
Pseudo-code:
map(key, value):
  // key: line number
  // value: text of the line
  for each word w in value:
    EmitIntermediate(w, 1);
reduce(key, Iterator values):
  // key: a word
  // values: a list of counts
  totalcount = 0;
  for each v in values:
    totalcount += v;
  Emit(key, totalcount);
Map output:
(Hello, 1)
(World, 1)
(Hello, 1)
(UCI, 1)
Reduce output:
(Hello, 2)
(UCI, 1)
(World, 1)
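The same job written against the Hadoop Java API of this era (the old-style org.apache.hadoop.mapred interfaces); a sketch, with illustrative class names.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // map: (line offset, line text) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int totalcount = 0;
            while (values.hasNext()) {
                totalcount += values.next().get();
            }
            output.collect(key, new IntWritable(totalcount));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(TokenizerMapper.class);
        conf.setReducerClass(SumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}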
Hadoop Map/Reduce – Combiner
Data flow: Input <k1,v1> -> Map -> <k2,v2> -> Combine -> <k2,v2> -> Reduce -> <k3,v3> -> Output
Hadoop Map/Reduce – Word Count w/ Combiner
Data flow: Input <1,Hello World>, <2,Hello UCI> -> Map -> <Hello,1>, … -> Combine -> <Hello,2>, … -> Reduce -> Output
Input data:
Hello World!
Hello UCI!
Map output:
(Hello, 1)
(World, 1)
(Hello, 1)
(UCI, 1)
Combine output:
(Hello, 2)
(World, 1)
(UCI, 1)
Reduce output:
(Hello, 2)
(UCI, 1)
(World, 1)
Hadoop Map/Reduce – Word Count w/ Combiner
• Imagine you have 1,000 map tasks that each output 1 billion values with the same key
• How many key-value pairs are sent to the reducer without a combiner? With a combiner?
– Without combiner: 1,000 * 10^9 = 10^12
– With combiner: 1,000 (each map task pre-aggregates its output down to a single pair)
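In Hadoop the combiner is typically the reducer class registered a second time; with the WordCount sketch above it is one extra line in the job setup, safe here because summing counts is associative and commutative.

// In WordCount.main(), alongside setMapperClass/setReducerClass:
conf.setCombinerClass(SumReducer.class);  // pre-aggregate each map task's output before the shuffle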
Hadoop Map/Reduce – Distributed Cache
• Used for large read-only files that are needed
by Map/Reduce jobs
– Dictionaries
– Statistical Models
– Shared Memory Objects
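A sketch of how a job might use the distributed cache with the old-style API; the dictionary path is a placeholder, not something from the talk.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Sketch: ship a read-only dictionary to every task and load it once per task.
public class DictionaryLookupMapper extends MapReduceBase {
    private final Set<String> dictionary = new HashSet<String>();

    // Driver side: register the HDFS file so it is copied to every task's local disk.
    public static void addDictionary(JobConf conf) throws IOException {
        DistributedCache.addCacheFile(URI.create("/data/dictionary.txt"), conf);  // placeholder path
    }

    // Task side: read the locally cached copy during task setup.
    @Override
    public void configure(JobConf job) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                dictionary.add(line.trim());
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("failed to load cached dictionary", e);
        }
    }
}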
FAN
• What is FAN and what do we do?
• The challenge of social network data
• The opportunity of social network data
What is FAN?
FOX Audience Network
– Audience Targeting (vs Search or Context)
– Audience Segmentation (Dimensionality Reduction)
– Lookalikes and Prediction (Recommendation)
– Display Ad Marketplace
– Ad Optimization
– Ad Network (MySpace and many other publishers …)
FAN Products
– MyAds (Self Service Display Ads)
– MyInsights (for Advertisers)
– Audience Insights (for Publishers)
– Media Planner
– Query Prediction
Social Network Data
Why do they provide data? Niche Envy.
How is our social network data different from others?
What data do users provide?
• Demographic, About Me, General Interests
• Semi-structured: movies, interests, music, heroes
• Self-expression: embedded songs and videos, Comments, Status (Tweets)
• Hand-raisers, no rankings: movies examples
– goodfellas, jewel of the nile, romancing the stone, my cousin vinny, who’s harry crumb, beetlejuice
– green mile, pulp fiction, lock stock, human traffic, football factory, trainspotting, saw, casino, life of brian
– transformers, blazing saddles, major league, smokey and the bandit
– action, adventure
– detroit rock city, crow, gladiator, braveheart
– drama
– gore, trash, thriller
– romance, drama
– fiction
How honest are the users? Should we care?
• Early Research: the elusive “income predictor”
• Initial test (demo+): 64% accuracy
• Just Including Movies: 76% accuracy
• Powerful Latent Semantic Space
Why FAN?
100+ Million profiles and sets of more independent data sources
Most of our problems are related to dimensionality reduction and scale
Social network data: What data do users provide?
Friend Graph
Number of nodes: 68M (excluding hubs, spam, duplicates etc)
Mean number of friends: 41
Mean number of friends one hop away: 1010
Sparsity of connectivity matrix: 1.2e-06
Sparsity of similarity matrix: 14.7e-06
Segmentation and Prediction on this Feature Space
Latent Semantic Space
Massive Scale LSA (SVD)
Targeting By Archetypes
Compressing Data -> Understanding Data
E.g. Movies:
What makes FAN different?
Our ability to predict who a user is and what a user wants.
User to Movie Matrix
Segmentation and Prediction difficult with sparse matrices because of low overlap
• Users: 4,038,105
• Items: 24,291
• Matrix: 95B cells but only 30M values
FAN Dimensions Matrix
High overlap of features in FAN Latent Semantic Space = better prediction
Query Lookalikes: Performance Test
Query               bunk beds        puma shoes       flat screen tv
Known Users         3,793            1,756            6,850
Lookalike Users     790,609          627,382          438,021
Relative Precision* 1.00 (38.01%)    0.30 (22.81%)    0.21 (6.31%)
Impressions         545K             438K             267K
FAN CTR             0.18%            0.07%            0.06%
Test eCPM           $0.304           $0.114           $0.185
3rd Party aCvR      108%             31%              6%
* Relative precision for users above threshold
Recommenders
Simple Item-Based Collaborative Filtering
Are the relations between the features based on user co-occurrence useful? How robust are the social
features?
CF Recommenders by Cosine Similarity
Take 2 Vectors: Users with Band Prefs
Calculate the co-occurrence of bands in user space.
Cosine Similarity: cos(U, V) = (U · V) / (||U|| ||V||)
1. Inverted Index: transform to user space
Bands
User A: 1, 2, 3
User B: 1, 2, 4
User C: 1, 4, 5
Users
Band 1: A, B, C
Band 2: A, B
…
2. Pairwise Similarity: Dot Product
Inner product:
- The number of users in common
- Does not scale well, huge memory requirements
[Band-by-band co-occurrence matrix illustration]
CF Recommenders using Hadoop
Hadoop: Map/Reduce
Map: Input Data -> (k1, v1)
Sort
Reduce: (k1, [v1]) -> (k2, v2)
Can we simplify the cosine similarity inner product?
Find the number of users in common directly
Bands
User A: 1, 2, 3
User B: 1, 2, 4
User C: 1, 4, 5
Map:
User A Data -> (1, 2) (2, 3) (1, 3)
User B Data -> (1, 2) (2, 4) (1, 4)
Reduce: (1, [2, 2, 3, 4]) -> (1_2, 2), (1_3, 1)
…
No inverted index, No big memory requirement
One pass through the user data
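A sketch of that job in Hadoop Java (old-style mapred API), following the keying shown above. It assumes each input record is one user's comma-separated numeric band IDs, which is an assumption about the input format, not something specified in the slides.

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BandCooccurrence {

    // Map: for each user's band list, emit (smaller band, larger band) for every pair,
    // e.g. "1,2,3" -> (1,2) (1,3) (2,3)
    public static class PairMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
        public void map(LongWritable offset, Text userBands,
                        OutputCollector<IntWritable, IntWritable> out, Reporter reporter)
                throws IOException {
            String[] bands = userBands.toString().split(",");
            for (int i = 0; i < bands.length; i++) {
                for (int j = i + 1; j < bands.length; j++) {
                    int a = Integer.parseInt(bands[i].trim());
                    int b = Integer.parseInt(bands[j].trim());
                    out.collect(new IntWritable(Math.min(a, b)), new IntWritable(Math.max(a, b)));
                }
            }
        }
    }

    // Reduce: key = band, values = all bands co-liked with it across users.
    // Counting each co-band gives the number of users in common per pair,
    // e.g. (1, [2, 2, 3, 4]) -> (1_2, 2), (1_3, 1), (1_4, 1)
    public static class CooccurrenceReducer extends MapReduceBase
            implements Reducer<IntWritable, IntWritable, Text, IntWritable> {
        public void reduce(IntWritable band, Iterator<IntWritable> coBands,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
            while (coBands.hasNext()) {
                int b = coBands.next().get();
                Integer c = counts.get(b);
                counts.put(b, c == null ? 1 : c + 1);
            }
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                out.collect(new Text(band.get() + "_" + e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }
}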
FAN Hadoop Team
• Arshavir Blackwell
• Chris Bowling
• Dongwei Cao
• John Carnahan
• Tongjie Chen
• Dan Gould
• Pradhuman Jhala
• Mohammad Kolahdouzan
• Ramya Krishnamurthy
• Vasanth Kumar
• Anurag Phadke
• Saranath Raghavan
• Mohammad Sabah
• Nader Salehi
• Christopher Smith
• Xavier Stevens
• Vivek Tawde
• Bing Zheng