程啟聖_Revolution_R_Enterprise


RRE (Revolution R Enterprise)
vs. R on the PC Cluster
Edward Cheng
2.18.2014
2015/4/9
PC Cluster
Environment
• Node01~node36, stathpc: RHEL 5 + RRE 6.1 (R-2.14.2)
• Node51~node60, himemhpc: RHEL 6 + RRE 7.0 (R-3.0.2)
History
• Origins of R
• 1993, Professors Ross Ihaka and
Robert Gentleman, University of
Auckland, New Zealand
• Revolution Analytics
(www.revolutionanalytics.com)
• 2008: funded by venture capital
investors including Intel Capital
• Board members include Professor
Robert Gentleman (R founder) and
Norman H. Nie, advisor (former
SPSS CEO)
• Revolution R Enterprise (the
enterprise edition of R)
R
• R is the world’s most widely used
statistical programming language.
• Free and open source software
R usage
R package growth
Why Revolution R
Performance
Benchmark                      R-2.14.2  RRE 6.1   R-3.0.1   RRE 7.0
Matrix Multiply (10000*10000)  751 sec   35 sec    568 sec   20 sec
SVD (10000*10000)              5746 sec  374 sec   4549 sec  256 sec
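The speedups implied by the timings above can be worked out directly. The short Python sketch below assumes each RRE release is compared with the open-source R version listed next to it (RRE 6.1 with R-2.14.2, RRE 7.0 with R-3.0.1); the dictionary and variable names are illustrative only.

```python
# Timings in seconds, copied from the Performance slide.
timings = {
    "Matrix Multiply": {"R-2.14.2": 751, "RRE 6.1": 35, "R-3.0.1": 568, "RRE 7.0": 20},
    "SVD": {"R-2.14.2": 5746, "RRE 6.1": 374, "R-3.0.1": 4549, "RRE 7.0": 256},
}

# Assumed pairing: each RRE release against the plain R version beside it.
pairs = [("R-2.14.2", "RRE 6.1"), ("R-3.0.1", "RRE 7.0")]

speedups = {}
for task, secs in timings.items():
    for base, rre in pairs:
        # Speedup = time with plain R divided by time with RRE.
        speedups[(task, rre)] = secs[base] / secs[rre]
        print(f"{task}: {rre} is {speedups[(task, rre)]:.1f}x faster than {base}")
```

Under that pairing, the ratios fall between roughly 15x and 28x.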
Big Data is coming
Definition
• “Big Data” is data whose scale,
diversity, and complexity require new
architecture, techniques, algorithms,
and analytics to manage it and
extract value and hidden knowledge
from it…
Bytes
Big Data
• In 2011, global digital data usage
was about 1.8 ZB (1 ZB = 2^70 bytes).
A research report by IDC
(International Data Corporation)
forecasts that by 2020 the total will
be 44 times the current amount, about
35.2 ZB.
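The unit definition above can be made concrete with a few lines of arithmetic. This Python sketch uses the slide's binary definition (1 ZB = 2^70 bytes); the helper name `zb_to_bytes` is made up for illustration.

```python
# Binary-prefix byte units, following the slide's definition 1 ZB = 2**70 bytes.
TB = 2**40
PB = 2**50
EB = 2**60
ZB = 2**70

def zb_to_bytes(zb):
    """Convert a size in zettabytes to bytes."""
    return zb * ZB

total_2011 = zb_to_bytes(1.8)  # the 2011 figure quoted on the slide
print(f"1 ZB   = {ZB:,} bytes")
print(f"1.8 ZB = {total_2011:.3e} bytes = {total_2011 / TB:,.0f} TB")
```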
Big Data
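2006
850 TB of web page data stored
cumulatively
2009
About 220 million photos uploaded per
week, requiring 25 TB of storage
2011
BIG DATA tsunami strikes
About 48 hours (48 GB) of video
uploaded per minute
(about 70 TB per day)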
eBay
The world’s largest online marketplace
• We have over 50 petabytes of data
• We have over 400 million items for sale
• We process more than 250 million user queries per day
• We have over 112 million active users
• We sold over US$75 billion in merchandise in 2012
Big Problems
• Capacity
data too big to fit into memory
• Speed
computation may be too slow to be
useful
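One common way around the capacity problem is to read the data in fixed-size chunks and keep only running totals in memory. The Python sketch below computes a mean that way; the generator `read_chunks` and the chunk size are hypothetical stand-ins for reading blocks of rows from disk.

```python
def read_chunks(values, chunk_size):
    """Yield successive chunks; stands in for reading blocks of rows from a file."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]

def chunked_mean(chunks):
    """Compute a mean from running totals, one chunk at a time."""
    count = 0
    total = 0.0
    for chunk in chunks:
        count += len(chunk)
        total += sum(chunk)
    return total / count

data = list(range(1, 1001))  # small example data; a real source would be a large file
result = chunked_mean(read_chunks(data, chunk_size=100))
print(result)
```

Because each chunk is independent, the same pattern also lends itself to spreading chunks across threads or nodes and combining the totals at the end.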
Distributed computing
RevoScaleR
• RevoScaleR Package
RevoScaleR analysis functions such as
rxCube, rxLinMod, rxCovCor, rxLogit,
and rxGlm will provide significant
speed improvements over any
alternatives. These algorithms are all
optimized for handling big data.
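The slide does not describe how these functions work internally. As a rough illustration of what a big-data-friendly linear model can look like, the Python sketch below fits a one-predictor least-squares line from sums accumulated chunk by chunk, so the full data set never has to be in memory. It is not RevoScaleR code; `fit_line_in_chunks` is a hypothetical name.

```python
def fit_line_in_chunks(chunks):
    """Least-squares fit of y = a + b*x from chunks of (x, y) rows using running sums."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    # Closed-form solution of the normal equations for one predictor.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Example data lying exactly on y = 2 + 3x, split into chunks of 4 rows.
rows = [(x, 2 + 3 * x) for x in range(12)]
chunks = [rows[i:i + 4] for i in range(0, len(rows), 4)]
intercept, slope = fit_line_in_chunks(chunks)
print(intercept, slope)
```

Only five running sums are kept, regardless of how many rows are processed.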
Multi-threaded Processing
.xdf data format
• The XDF file format is a binary file
format with an interface optimized
for row and column processing and
analysis.
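The actual XDF layout is not described on the slide, so the following is only a toy illustration of why a binary, column-aware layout helps: a block of rows from one column can be read with a single seek, without parsing the rest of the file. The file layout and the functions `write_columns` and `read_column_rows` are invented for this sketch.

```python
import os
import struct
import tempfile

def write_columns(path, columns):
    """Write equal-length float columns one after another (toy column-major layout)."""
    n_rows = len(next(iter(columns.values())))
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(columns), n_rows))  # header: column count, row count
        for values in columns.values():
            f.write(struct.pack(f"<{n_rows}d", *values))

def read_column_rows(path, col_index, start, stop):
    """Read rows [start, stop) of one column without loading the rest of the file."""
    with open(path, "rb") as f:
        n_cols, n_rows = struct.unpack("<II", f.read(8))
        f.seek(8 + (col_index * n_rows + start) * 8)  # 8-byte header, 8 bytes per float64
        count = stop - start
        return list(struct.unpack(f"<{count}d", f.read(count * 8)))

path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_columns(path, {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})
block = read_column_rows(path, col_index=1, start=1, stop=3)
print(block)
```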