Large Scale Data Processing with
DryadLINQ
Dennis Fetterly
Microsoft Research, Silicon Valley
Workshop on Data-Intensive Scientific
Computing Using DryadLINQ
Outline
• Brief introduction to TidyFS
• Preparing/loading data onto a cluster
• Desirable properties in a Dryad cluster
• Detailed description of several IR algorithms
TidyFS goals
• A simple distributed filesystem that provides the abstractions necessary for data-parallel computations
• High performance, reliable, scalable service
• Workload
  – High throughput, sequential IO, write once
  – Cluster machines working in parallel
  – Terasort
    • 240 machines reading at 240 MB/s = 56 GB/s
    • 240 machines writing at 160 MB/s = 37 GB/s
TidyFS Names
• Stream: a sequence of partitions
  – e.g. tidyfs://dryadlinqusers/fetterly/clueweb09-English
  – Can have leases for temp files or cleanup from crashes
• Partition:
  – Immutable
  – 64-bit identifier
  – Can be a member of multiple streams
  – Stored as an NTFS file on cluster machines
  – Multiple replicas of each partition can be stored
[Figure: Stream-1 containing partitions Part 1 through Part 4]
Preparation of Data
• Often substantially harder than it appears
• Issues:
  – Data format
  – Distribution of data
  – Network bandwidth
• Generating synthetic datasets is sometimes useful
Data Prep – Format
• Text records are simplest
  – Caveat – information that is not in the line
    • e.g. – if a line number encodes information
• Binary records often require custom code to load onto the cluster
  – Serialization/de-serialization code generated by DryadLINQ uses C# Reflection
Custom Deserialization Code
public class UrlDocIdScoreQuery
{
    public string queryId;
    public string url;
    public string docId;
    public string queryString;
    public double score;

    public static UrlDocIdScoreQuery Read(DryadBinaryReader reader)
    {
        UrlDocIdScoreQuery rec = new UrlDocIdScoreQuery();
        rec.queryId = ReadAnyString(reader);
        rec.queryString = ReadAnyString(reader);
        rec.url = ReadAnyString(reader);
        rec.docId = ReadAnyString(reader);
        rec.score = reader.ReadDouble();
        return rec;
    }

    public static string ReadAnyString(DryadBinaryReader dbr) {…}
}
Data Prep - Loading
• DryadLINQ job
– Often needs a dummy input anchor
• Custom program
– Write records to TidyFS partitions
• “SneakerNet” often a good option
Data Loading - DryadLINQ
• Need input “anchor” to run on cluster
  – Generate or use existing stream
• Sample:
IEnumerable<Entry> GenerateEntries(Random x, int numItems)
{
    for (int i = 0; i < numItems; i++) {
        // code to generate records
        yield return record;
    }
}
DryadLINQ Job
var streamname = "tidyfs://datasets/anchor";
var os = @"tidyfs://msri/teamname/data?compression=" +
         CompressionScheme.GZipFast;
var r = PartitionedTable.Get<int>(streamname)
        .Take(1)
        .SelectMany(x => Enumerable.Range(0, partitions))
        .HashPartition(x => x, partitions)
        .Select(x => new Random(x))
        .SelectMany(x => GenerateEntries(x, numItems))
        .ToPartitionedTable(os);
Data Loading - Databases
• Bulk copy into files
  – Use queries to produce multiple files
• Perform queries within a DryadLINQ UDF
IEnumerable<Entry> PerformQuery(string queryArg)
{
    var results = RunQuery("select * from …"); // RunQuery: placeholder for DB access
    foreach (var record in results) {
        yield return record;
    }
}
Building a cluster
• Overall goal – a high-throughput system
  – Not latency sensitive
• More slower computers often better than fewer faster computers
• Multiple cores better than higher frequency
• Multiple disks – increase throughput
• Sufficient RAM
Networking a Cluster
• Network topology – medium to large clusters
  – Attempt to maximize cross-rack bandwidth
  – Two-tier topology
    • Rack switches and core switches
• Port aggregation
  – Bond multiple connections together
• 1 GbE or 10 GbE
Cluster Software
• Runs on Windows HPC Server 2008
• Academic Release
– For non-commercial use
• Commercial License
DryadLINQ IR Toolkit
• Library that uses DryadLINQ
• Source code for a number of IR algorithms
– Text retrieval - BM25/BM25F
– Link based ranking - PageRank/SALSA-SETR
– Text processing - Shingle based duplicate detection
• Designed to work well with ClueWeb09 collection
– Including preprocessing the data to load the cluster
• Available from
http://research.microsoft.com/dryadlinqir/
ClueWeb09 Collection
• Collected/Distributed by CMU
• 1 billion web pages crawled in Jan/Feb 2009
• 10 different languages
– en, zh, es, ja, de, fr, ko, it, pt, ar
• 5 TB compressed, 25 TB uncompressed
• Available to research community
• Dataset available for your projects
– Web graph, 503m English web pages
Example: Term Frequencies
Count term frequencies in a set of documents:
var docs = new PartitionedTable<Doc>("tidyfs://dennis/docs");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToPartitionedTable("tidyfs://dennis/counts.txt");
[Figure: operator graph — IN → SM (doc => doc.words) → GB (word => word) → S (g => new …) → OUT, with metadata at input and output]
Distributed Execution of Term Freq
[Figure: DryadLINQ compiles the LINQ expression (IN → SM → GB → S → OUT) into a Dryad execution plan]
Execution Plan for Term Frequency
[Figure: the logical plan SM → GB → S expands into two pipelined stages:
stage (1): SM (SelectMany) → Q (Sort) → GB (GroupBy) → C (Count) → D (Distribute),
then MS (Mergesort) → GB (GroupBy) → Sum]
Execution Plan for Term Frequency
[Figure: the same plan instantiated across four partitions — four parallel
SM → Q → GB → C → D pipelines in stage (1), feeding four MS → GB → Sum
pipelines in stage (2)]
BM25 “Grep”
• For batch evaluation of queries, calculating BM25 is just a select operation
string queryTermDocFreqURLLocal = @"E:\TREC\query-doc-freqs.txt";
Dictionary<string, int> dfs = GetDocFreqs(queryTermDocFreqURLLocal);
PartitionedTable<InitialWordRecord> initialWords =
    PartitionedTable.Get<InitialWordRecord>(initialWordsURL);
var BM25s = from doc in initialWords
            select ComputeDocBM25(queries, doc, dfs);
BM25s.ToPartitionedTable("tidyfs://dennis/scoredDocs");
PageRank
Ranks web pages by propagating scores along the hyperlink structure.
Each iteration as an SQL query:
1. Join edges with ranks
2. Distribute rank on edges
3. GroupBy edge destination
4. Aggregate into ranks
5. Repeat
One PageRank Step in DryadLINQ
// one step of pagerank: dispersing and re-accumulating rank
public static IQueryable<Rank> PRStep(IQueryable<Page> pages,
                                      IQueryable<Rank> ranks)
{
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);
    // re-accumulate.
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}
A Complete DryadLINQ Program
public struct Page {
    public UInt64 name;
    public Int64 degree;
    public UInt64[] links;
    public Page(UInt64 n, Int64 d, UInt64[] l) {
        name = n; degree = d; links = l; }
    public Rank[] Disperse(Rank rank) {
        Rank[] ranks = new Rank[links.Length];
        double score = rank.rank / this.degree;
        for (int i = 0; i < ranks.Length; i++) {
            ranks[i] = new Rank(this.links[i], score);
        }
        return ranks;
    }
}

public struct Rank {
    public UInt64 name;
    public double rank;
    public Rank(UInt64 n, double r) {
        name = n; rank = r; }
}

public static IQueryable<Rank> PRStep(IQueryable<Page> pages,
                                      IQueryable<Rank> ranks) {
    // join pages with ranks, and disperse updates
    var updates = from page in pages
                  join rank in ranks on page.name equals rank.name
                  select page.Disperse(rank);
    // re-accumulate.
    return from list in updates
           from rank in list
           group rank.rank by rank.name into g
           select new Rank(g.Key, g.Sum());
}

var pages = DryadLinq.GetTable<Page>("tidyfs://pages.txt");
// repeat the iterative computation several times
var ranks = pages.Select(page => new Rank(page.name, 1.0));
for (int iter = 0; iter < iterations; iter++) {
    ranks = PRStep(pages, ranks);
}
ranks.ToDryadTable<Rank>("outputranks.txt");
PageRank Optimizations
• Benchmark PageRank on a 954m page graph
• Naïve approach – 10 iterations, ~3.5 hours, 1.2 TB
• Apply several optimizations
  – Change data distribution
  – Pre-group pages by host
  – Rename host groups with dense names
  – Cull out leaf nodes
  – Pre-aggregate ranks for each host
• Final version – 10 iterations, 11.5 min, 116 GB
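One of these optimizations, renaming host groups with dense names, pays off because a dense 0..n-1 page identifier can index directly into an array of ranks, replacing the hash join in PRStep with an array lookup. A hypothetical sketch of the idea, not the toolkit's actual code (`pageCount` and `pagesInPartition` are assumed locals):

```csharp
// Hypothetical sketch: with dense page names, ranks live in flat arrays
// and dispersing/accumulating rank needs no hash join at all.
double[] rank = new double[pageCount];     // current ranks, indexed by dense name
double[] newRank = new double[pageCount];  // accumulator for the next iteration
foreach (Page p in pagesInPartition) {
    double score = rank[(int)p.name] / p.degree;
    foreach (UInt64 dest in p.links)
        newRank[(int)dest] += score;       // array index replaces the join
}
```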
Tactics for Improving Performance
• Loop unrolling
• Reduce data movement
– Improve data locality
• Choose what to Group
Gotchas
• Non-deterministic output
– E.g. RNG in user defined function
• Writing to shared state
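The RNG gotcha comes from re-execution: Dryad may run a vertex more than once after a failure, and a time-seeded generator then produces different records on each run. A minimal sketch of the unsafe and safe patterns (the record field `id` and the `Perturb` helper are assumptions for illustration):

```csharp
// Unsafe: a time-seeded Random gives different output each time the
// vertex is re-executed, so downstream results are non-deterministic.
var bad = records.Select(r => Perturb(r, new Random().NextDouble()));

// Safe: derive the seed from the record (or partition) itself, as the
// data-generation job earlier does with new Random(x); re-runs then
// reproduce identical output.
var good = records.Select(r => Perturb(r, new Random((int)r.id).NextDouble()));
```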
Schedule for Today
• 9:30 – 10:00 Meet with team, finalize project
• 10:30-12:00 Work on projects, discuss
approach with a speaker
Backup Slides
Cluster Configuration
[Figure: cluster layout — a head node, TidyFS servers, and cluster machines running tasks and the TidyFS storage service]
How a Dryad job reads from TidyFS
[Figure: the job manager asks the TidyFS service to list the partitions in a stream (e.g. Part 1 on Machine 1, Part 2 on Machine 2) and schedules each vertex on a machine holding its partition; each vertex then calls GetReadPath to resolve its partition to a local NTFS path such as D:\tidyfs\0001.data]
How a Dryad job writes to TidyFS
[Figure: the job manager schedules Vertex 1 and Vertex 2 on Machines 1 and 2; each vertex asks the TidyFS service to create a per-vertex output stream (Str1_v1, Str1_v2), each holding one partition]
How a Dryad job writes to TidyFS
[Figure: each machine calls GetWritePath to obtain a local path (e.g. D:\tidyfs\0001.data), writes its partition, and on completion calls AddPartitionInfo with the partition's size, fingerprint, and other metadata; the job manager then calls ConcatenateStreams(str1, str1_v1, str1_v2) to form the final stream Str1 and deletes the temporary per-vertex streams]