Presentation - UTPA Faculty Web
Presenter: Ran Ding
1. Introduction
2. Where the MR wins
3. DBMS “sweet spot” tests
4. Why the Parallel DBMS wins
5. Conclusion
The MapReduce (MR) paradigm has been hailed as
a revolutionary new platform for large-scale,
massively parallel data access.
Hadoop is one example.
Parallel DBMSs appeared in the mid-1980s. The
Teradata and Gamma projects pioneered a
new architectural paradigm based on a
cluster of commodity computers.
The rows of a relational table are distributed
across the nodes of the cluster so they can
be processed in parallel.
One benefit is that the system automatically manages
the various alternative partitioning strategies
for the tables involved in the query,
such as hash, range, and round-robin.
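The three partitioning strategies named above can be sketched as follows. This is a minimal illustration, not how any particular parallel DBMS implements them; the function names, the dictionary-based rows, and the use of crc32 as the hash function are all assumptions made for the example.

```python
import bisect
import zlib
from collections import defaultdict


def hash_partition(rows, key, n_nodes):
    """Send each row to the node chosen by hashing its partitioning key."""
    nodes = defaultdict(list)
    for row in rows:
        # crc32 is an assumed stand-in for the system's hash function;
        # it is stable across runs, unlike Python's salted hash() for str.
        node = zlib.crc32(str(row[key]).encode()) % n_nodes
        nodes[node].append(row)
    return dict(nodes)


def range_partition(rows, key, boundaries):
    """Node i holds keys k with boundaries[i-1] <= k < boundaries[i]."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[bisect.bisect_right(boundaries, row[key])].append(row)
    return dict(nodes)


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = defaultdict(list)
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return dict(nodes)
```

For example, `range_partition(rows, "id", [2, 4])` places ids below 2 on node 0, ids 2 and 3 on node 1, and ids 4 and above on node 2. Hash and range partitioning keep rows with equal keys on the same node, which is what lets joins and aggregations on that key run locally.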
It is not easy!
UDFs (user-defined functions) help.
Like GROUP BY in SQL.
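The GROUP BY comparison can be made concrete with a small sketch. It assumes the comparison refers to the grouping of Map output by key before Reduce runs; the record layout (department, salary), the function names, and the in-memory SQLite table are hypothetical and exist only for illustration.

```python
import sqlite3
from collections import defaultdict


def map_fn(record):
    """Map: emit (key, value) pairs; here (department, salary)."""
    dept, salary = record
    yield dept, salary


def reduce_fn(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: group values by key
    return sorted(reduce_fn(k, vs) for k, vs in groups.items())


def run_sql(records):
    """The same aggregation expressed as a single GROUP BY query."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE emp (dept TEXT, salary INTEGER)")
    con.executemany("INSERT INTO emp VALUES (?, ?)", records)
    rows = con.execute(
        "SELECT dept, SUM(salary) FROM emp GROUP BY dept ORDER BY dept"
    ).fetchall()
    con.close()
    return rows
```

Both `run_mapreduce(records)` and `run_sql(records)` return one (department, total) pair per department, so for this kind of aggregation the two formulations are interchangeable.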
1. ETL and “read once” data sets
2. Complex analytics
3. Semi-structured data
4. Quick-and-dirty analyses
5. Limited-budget operations
Extract-transform-load system
An MR system can be considered a general-purpose parallel ETL system.
DBMSs may perform the ETL
Complex analytics often cannot be structured as single SQL aggregate
queries
MR is a good candidate for such workloads
MR systems are good at processing data that
is being prepared for loading into a back-end
system
A DBMS requires wide tables with many
attributes
Plus, MR-style systems easily store and
process semi-structured data
A DBMS needs the programmer to write the schema
and then load the data
With MR, just copy the data!
MR systems are mostly open source and free
Parallel DBMS: huge cost
1. Repetitive record parsing
2. Compression
3. Pipelining
4. Scheduling
5. Column-oriented storage
In an MR system, each Map and Reduce
task must repeatedly parse and convert string
fields into the appropriate type
In a DBMS, records are parsed once, when the data
is initially loaded.
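A rough sketch of the parsing overhead, assuming a hypothetical two-field text record ("id,value") and counting parse calls rather than measuring time. The function names and sample data are invented for the illustration.

```python
RAW_LINES = ["1,3.5", "2,4.0", "3,1.5"]  # hypothetical text records


def parse(line):
    """Convert the string fields of one record into typed values."""
    id_str, value_str = line.split(",")
    return int(id_str), float(value_str)


def mr_style_queries(lines, n_queries):
    """Every job re-parses the raw text. Returns (results, parse calls)."""
    parses = 0
    results = []
    for _ in range(n_queries):
        total = 0.0
        for line in lines:
            _, value = parse(line)  # parsed again on every pass
            parses += 1
            total += value
        results.append(total)
    return results, parses


def dbms_style_queries(lines, n_queries):
    """Records are parsed once at load time, then queried in typed form."""
    table = [parse(line) for line in lines]
    parses = len(lines)
    results = [sum(value for _, value in table) for _ in range(n_queries)]
    return results, parses
```

With three records and four queries, the MR-style version performs twelve parse calls and the load-once version performs three. The gap grows with the number of jobs run over the same data.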
It is hard to say...
Commercial DBMSs may use carefully tuned
compression algorithms
In a parallel DBMS, data is streamed from
producer to consumer;
the intermediate data is never written to disk
In an MR system, the producer writes its result to local data
structures, and consumers read from them
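The difference between streaming and materializing intermediate data can be sketched as follows. This is only an illustration under stated assumptions: a Python generator stands in for the streamed connection between operators, a temporary file stands in for the MR producer's local data structure, and all names are hypothetical.

```python
import os
import tempfile


def producer(n):
    """Upstream operator: yields one intermediate value at a time."""
    for i in range(n):
        yield i * i


def pipelined_sum(n):
    """Pipelined: each value goes straight from producer to consumer
    and is never stored as a whole intermediate result."""
    return sum(producer(n))


def materialized_sum(n):
    """MR-style: the producer first writes its full output to local
    storage; the consumer then reads it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            for value in producer(n):
                f.write(f"{value}\n")
        with open(path) as f:
            return sum(int(line) for line in f)
    finally:
        os.remove(path)
```

Both functions return the same answer; the second pays for an extra write and read of the intermediate data. In exchange, the materialized output can be reread if the consumer fails, which is the usual argument for the MR design.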
In a parallel DBMS, every node knows what it
should do
In an MR system, tasks are scheduled on processing nodes
one storage block at a time.
Vertica is a column store:
it reads only the attributes necessary for
solving the user query
DBMS-X and Hadoop are both row stores
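A minimal sketch of why a column store reads less data for a query that touches one attribute. The three-attribute table, the attribute names, and the count of "field values read" as a proxy for I/O are all assumptions made for the illustration; this is not Vertica's actual storage format.

```python
ROWS = [  # hypothetical table with three attributes
    {"url": "a.com", "rank": 10, "duration": 5},
    {"url": "b.com", "rank": 20, "duration": 7},
    {"url": "c.com", "rank": 30, "duration": 9},
]


def to_columns(rows):
    """Column layout: one list per attribute instead of one dict per row."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_store_avg(rows, attr):
    """Row store: whole records are read even though one field is used.
    Returns (average, number of field values read)."""
    fields_read = 0
    total = 0
    for row in rows:
        fields_read += len(row)
        total += row[attr]
    return total / len(rows), fields_read


def column_store_avg(columns, attr):
    """Column store: only the requested attribute is read."""
    col = columns[attr]
    return sum(col) / len(col), len(col)
```

Averaging `rank` over this table reads nine field values in the row layout and three in the column layout. The saving scales with the number of attributes the query does not need.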
MR advocates should learn from parallel
DBMSs the technologies and techniques for
efficient parallel query execution.
MR systems are powerful tools for ETL-style
applications and for complex analytics. If the
application is query-intensive, whether
semi-structured or rigidly structured, then a DBMS
is probably the better choice.