Machine Learning for Big Data, Methods, Trends and
Big Data Mining and Machine Learning
A. Taylan Cemgil
24.12.2012, ITO Istanbul
http://www.cmpe.boun.edu.tr/pilab
Machine Learning
Use Cases
Supervised Learning
Classification
Unsupervised Learning
Clustering
Dimensionality Reduction
Probabilistic Approach to Machine Learning
Probability Theory
Graphical Models, Probabilistic Expert Systems
Time Series
Matrix and Tensor Factorization
Sensor Fusion
Scaling up Machine Learning
Architectures
References
ML for Big Data, Cemgil, 24.12.2012
Collection of computational methods to …
Detect hidden patterns in data
Create useful predictions about unseen data
Decision making under uncertainty
Transform raw data into useful knowledge
Mathematics and Statistics
• Optimization
• Numerical Linear Algebra
• Probability Theory
Computer Science
• Databases
• Parallel Processing
• Artificial Intelligence
• Information Retrieval
• Graphics/Visualization
Electrical Engineering
• Pattern Recognition
• Signal Processing
• Detection/Estimation
• Information Theory
• Data Compression
Facets of the same problem
Differences in emphasis/terminology
Historical Evolution of the fields
Data Mining: Database systems, Data Structures
Statistics: Probability Theory, Mathematics
Machine Learning: Artificial Intelligence, Pattern Recognition
Thinking about old methods with a new mind set
… and invent new ones
Curse/Blessing of Dimensionality
Infrastructure is cheaper
Cloud Computing
Sensor Networks (“new kind of data”)
Speed (“real time”)
Emphasis on System Integration
Reached Critical Mass/Mature technology
“data explosion is bigger than Moore's law”
Computers get faster and cheaper every year, but the amount of data that needs to be processed grows even faster.
[Figure: growth of data volume (DATA) compared with processing power (CPU)]
Power     American/Turkish (short scale)   European (long scale)
10^3      Thousand                         Thousand
10^6      Million                          Million
10^9      Billion                          Milliard
10^12     Trillion                         Billion
10^15     Quadrillion                      Billiard
10^18     Quintillion                      Trillion
…         …                                …
General   1000 × 1000^n                    1000000^n
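The two naming rules can be checked with a few lines of Python. This is only a sketch of the formulas on this slide; the function names and the small NAMES table are illustrative choices, not part of the talk.

```python
# n is the index of the "-illion" name: 1 = million, 2 = billion, 3 = trillion.
NAMES = {1: "million", 2: "billion", 3: "trillion"}

def short_scale(n):
    """Value of the n-th name on the short scale: 1000 * 1000**n."""
    return 1000 * 1000 ** n

def long_scale(n):
    """Value of the n-th name on the long scale: 1000000**n."""
    return 1000000 ** n

for n, name in NAMES.items():
    # Print the number of decimal zeros each scale assigns to the same name.
    zeros_short = len(str(short_scale(n))) - 1
    zeros_long = len(str(long_scale(n))) - 1
    print(f"{name}: 10^{zeros_short} (short) vs 10^{zeros_long} (long)")
```

A short-scale trillion and a long-scale billion are the same number, which is why the scale has to be stated when quoting data sizes in words.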
Unit             Decimal   Binary
kilobyte (kB)    10^3      2^10
megabyte (MB)    10^6      2^20
gigabyte (GB)    10^9      2^30
terabyte (TB)    10^12     2^40
petabyte (PB)    10^15     2^50
exabyte (EB)     10^18     2^60
zettabyte (ZB)   10^21     2^70
yottabyte (YB)   10^24     2^80
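As a rough illustration of the unit table, the sketch below computes the decimal and binary size of each unit and converts between decimal units. The helper names (decimal_bytes, binary_bytes, convert) are assumptions made for this example.

```python
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def decimal_bytes(unit):
    """Bytes in one unit using powers of ten, e.g. 1 TB = 10**12."""
    return 10 ** (3 * (UNITS.index(unit) + 1))

def binary_bytes(unit):
    """Bytes in one unit using powers of two, e.g. 2**40 for the TB row."""
    return 2 ** (10 * (UNITS.index(unit) + 1))

def convert(value, src, dst):
    """Convert between decimal units, e.g. petabytes to terabytes."""
    return value * decimal_bytes(src) / decimal_bytes(dst)

for unit in UNITS:
    # The gap between the two conventions grows with every step up the table.
    gap = binary_bytes(unit) / decimal_bytes(unit) - 1.0
    print(f"{unit}: binary value is {gap:.1%} larger than decimal")
```

For example, `convert(15, "PB", "TB")` expresses 15 petabytes as a count of terabytes.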
1 TB = 1 000 000 000 000 bytes = 1 trillion bytes
1 PB = 1 000 000 000 000 000 bytes = 1 quadrillion bytes
CERN: The Large Hadron Collider produces about 15 petabytes of data per year (15 000 × 1 TB).
Google processes about 24 petabytes of data per day (24 000 × 1 TB).
Facebook's Hadoop Distributed File System (HDFS) is reported to hold about 100 PB (100 000 × 1 TB).
Global Internet traffic per month in 2011 is estimated at about 27 500 PB (27 500 000 × 1 TB; source: Cisco).
We are drowning in data and starving for knowledge
– J. Naisbitt
(from Machine Learning: A Probabilistic Perspective, K. P. Murphy)
Product Recommendation
Market Basket Analysis
Event/Activity/Behavior Analysis
Campaign management and optimization
Supply-chain management and analytics
Market and consumer segmentations
Netflix: 18K movies × 500K users, 99% sparse
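A quick back-of-the-envelope sketch of what this sparsity means, using the rounded figures on the slide. The toy ratings dictionary and the function names are hypothetical; the point is only that a sparse matrix stores the observed entries and nothing else.

```python
def observed_entries(n_rows, n_cols, sparsity):
    """Number of entries that are actually present in a sparse matrix."""
    return round(n_rows * n_cols * (1.0 - sparsity))

# Coordinate-style storage: keep only observed ratings, keyed by (movie, user).
ratings = {(0, 42): 5, (0, 7): 3, (17, 42): 1}  # hypothetical toy data

def get_rating(ratings, movie, user):
    """Return the stored rating, or None when the user has not rated the movie."""
    return ratings.get((movie, user))

total_cells = 18_000 * 500_000
stored = observed_entries(18_000, 500_000, 0.99)
print(f"{total_cells:,} cells, about {stored:,} observed ratings")
```

With these figures a dense table would need 9 billion cells, while only about one percent of them carry a rating.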
Network Monitoring and Performance Optimization
Pricing Optimization
Customer Churn Management
Call Detail Record (CDR) Analysis
(Mobile) User Behavior Analysis
Cybersecurity, Detection and Prevention of DDoS Attacks
Infrastructure Planning
Fraud Detection/Risk Estimation
High Speed Trading
Anomaly/Changepoint Detection
Clickstream Segmentation and Analysis
Ad Targeting/Selection, Forecasting and Optimization
Click Fraud Detection/Prevention
Social Graph Analysis
Customer Segmentation
Newsgroup/Blog/Social Media opinion tracking
Community Detection (source: matlab exchange)
Ad Personalization: Match ads with users
Key income generator for Google, Yahoo
Urban Traffic Management
Energy Grid Management/Optimization, Power Generation Management
Environment Monitoring
Diagnosis and Medical Expert systems
Health Insurance fraud detection
Patient care quality and program analysis
Drug discovery
Remote Monitoring
𝑋(𝑔𝑒𝑛𝑒, 𝑠𝑎𝑚𝑝𝑙𝑒, 𝑡𝑖𝑚𝑒)
Pragmatic view
Small Data: Naïve algorithms are feasible
Medium Data: Feasibly processed on one machine
Big Data: Does not fit on one machine
Complex relational data
Analysis of pairwise/higher-order interactions between entities
Classification
Feature 1   Feature 2   Feature 3   Feature 4   Class
5.1         4.3         2.1         0.3         0
5.7         3.5         3.2         0.8         0
3.4         5.2         0.4         0.6         1
x_1         x_2         x_3         x_4         c

c ≈ f(w_1 x_1 + w_2 x_2 + ⋯ + w_N x_N)
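A minimal sketch of the formula above, assuming f is the logistic (sigmoid) function, which the slide does not specify. The weight vector is made up for illustration; it was picked by hand so that the three example rows come out with their listed classes.

```python
import math

def sigmoid(z):
    """Logistic function, used here as an assumed choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    """Compute f(w_1 x_1 + ... + w_N x_N) for one example."""
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# The three rows of the table: (features, class).
rows = [
    ([5.1, 4.3, 2.1, 0.3], 0),
    ([5.7, 3.5, 3.2, 0.8], 0),
    ([3.4, 5.2, 0.4, 0.6], 1),
]

weights = [-1.0, 1.0, -1.0, 0.0]  # hypothetical weights, not learned from data

# Threshold the score at 0.5 to obtain a class label.
predicted = [int(predict(weights, x) >= 0.5) for x, _ in rows]
print(predicted, [c for _, c in rows])
```

In practice the weights are learned from many labelled rows rather than chosen by hand.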
Ad Prediction on a Cluster of 1000 Machines
What is the probability that a given ad will be clicked, given some context?
A Reliable Effective Terascale Linear Learning System, Agarwal et al., 2012
Features: 16 M
Number of examples: 17 billion
3TB Entries
1000 Machines
1. On each node, use online learning independently to find a parameter vector.
2. Use AllReduce to average the weights.
3. On each node, compute the sum of the per-example gradients.
4. Use AllReduce to add the gradients across nodes.
5. Use L-BFGS to update the weight vector, then go to 3.
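The five steps above can be sketched on a single machine. This is a simulation under stated assumptions, not the system from the paper: the "nodes" are plain Python lists, allreduce_sum is a stand-in for a real AllReduce, the model is logistic regression on synthetic data, and step 5 uses a plain gradient step in place of L-BFGS to keep the example short.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def online_pass(shard, dim, lr=0.1):
    """Step 1: one pass of online gradient descent on a single node."""
    w = [0.0] * dim
    for x, y in shard:
        err = sigmoid(dot(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def local_gradient(shard, w):
    """Step 3: sum of per-example logistic-loss gradients on a single node."""
    g = [0.0] * len(w)
    for x, y in shard:
        err = sigmoid(dot(w, x)) - y
        g = [gi + err * xi for gi, xi in zip(g, x)]
    return g

def allreduce_sum(vectors):
    """Simulated AllReduce: element-wise sum that every node would receive."""
    return [sum(col) for col in zip(*vectors)]

def total_loss(shards, w):
    loss = 0.0
    for shard in shards:
        for x, y in shard:
            p = min(max(sigmoid(dot(w, x)), 1e-12), 1.0 - 1e-12)
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

def train(shards, dim, rounds=50, lr=0.5):
    # Steps 1-2: independent online learning, then average the weights.
    local = [online_pass(s, dim) for s in shards]
    w = [v / len(shards) for v in allreduce_sum(local)]
    n = sum(len(s) for s in shards)
    for _ in range(rounds):
        # Steps 3-4: local gradient sums, combined with AllReduce.
        g = allreduce_sum([local_gradient(s, w) for s in shards])
        # Step 5 (substituted): gradient step instead of an L-BFGS update.
        w = [wi - lr * gi / n for wi, gi in zip(w, g)]
    return w

def make_shards(n_nodes=4, per_node=50, seed=0):
    """Synthetic data: bias term plus two features, label 1 if x1 + x2 > 0."""
    rng = random.Random(seed)
    shards = []
    for _ in range(n_nodes):
        shard = []
        for _ in range(per_node):
            x = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
            shard.append((x, 1 if x[1] + x[2] > 0 else 0))
        shards.append(shard)
    return shards

shards = make_shards()
w = train(shards, dim=3)
print("loss at zero:", total_loss(shards, [0.0] * 3), "after training:", total_loss(shards, w))
```

Only weight and gradient vectors cross node boundaries; the examples themselves never leave the node that holds them, which is what makes the scheme attractive at terabyte scale.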
Clustering
Dimensionality Reduction
Visualization
Terms-Documents
Probability Theory
Probability theory is nothing but common sense reduced to calculation – P. Laplace
Graphical Models, Probabilistic Expert Systems
Time Series
Example: Network flow classification
Graphical Model Through Time
Mobile 3G usage patterns: monitor applications without Deep Packet Inspection (DPI)
8-hour capture, anonymised, without payload: 1 TB
Joint work Kurt, Mungan, Saygun with Ericsson/Avae FP7 Mevico
Tracking
Matrix with missing entries (?):
       1     2     1.5
       ?     4     3
       3     6     ?
       4     8     6.1

Same matrix with factors (column factors on top, row factors on the left):
             1     2     1.5
       1     1     2     1.5
       2     ?     4     3
       3     3     6     ?
       4     4     8     6.1

Missing entries filled in as products of the factors:
             1     2     1.5
       1     1     2     1.5
       2     2     4     3
       3     3     6     4.5
       4     4     8     6.1
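The matrix example above can be reproduced with a small rank-one factorization. The sketch below is one possible way to do it, assuming an alternating least squares fit on the observed entries; the talk does not say which algorithm was used, and the function name is illustrative.

```python
def rank_one_complete(X, iters=100):
    """Fit X[i][j] ~ a[i] * b[j] on observed entries (None marks a missing one)
    by alternating least squares, then fill the missing entries with a[i] * b[j]."""
    n_rows, n_cols = len(X), len(X[0])
    a = [1.0] * n_rows
    b = [1.0] * n_cols
    for _ in range(iters):
        # Update each row factor with the column factors held fixed.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if X[i][j] is not None]
            a[i] = sum(X[i][j] * b[j] for j in obs) / sum(b[j] ** 2 for j in obs)
        # Update each column factor with the row factors held fixed.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if X[i][j] is not None]
            b[j] = sum(X[i][j] * a[i] for i in obs) / sum(a[i] ** 2 for i in obs)
    return [
        [X[i][j] if X[i][j] is not None else a[i] * b[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]

# The matrix from the slides, with the two unknown entries marked as None.
X = [
    [1, 2, 1.5],
    [None, 4, 3],
    [3, 6, None],
    [4, 8, 6.1],
]

filled = rank_one_complete(X)
for row in filled:
    print([round(v, 2) for v in row])
```

Because the observed 6.1 is slightly off an exact rank-one pattern, the fitted values land close to, but not exactly on, the 2 and 4.5 shown on the slide.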
Slide from ICML 2011 tutorial, Langford et al.
A. Gray, Analyzing Massive Datasets, Skytree, ML Company
Data Scientist: The Sexiest Job of the 21st Century (HBR)
Agarwal et al., A Reliable Effective Terascale Linear Learning System
Data is not Knowledge
More Data is not more Knowledge
ML for Big Data requires a new mindset for algorithm design
Big Data is not only about entities but also about their relations and interactions
Many applications; ML provides viable solutions
New CS education: need more Maths, Physics and Social Science majors
Big Data = Big Potential
Ground Truth Labelling
Difficult but a must
Cheaters abound
Validation of labellers + qualification test
Amazon Mechanical Turk
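One common way to validate labellers is to mix known-answer ("gold") items into the task and keep only labellers who pass. The sketch below illustrates that idea; the names, the 0.8 threshold and the toy answers are all assumptions for the example, not details from the talk.

```python
from collections import Counter

def labeller_accuracy(answers, gold):
    """Fraction of gold (known-answer) items this labeller got right."""
    scored = [item for item in gold if item in answers]
    if not scored:
        return 0.0
    return sum(answers[item] == gold[item] for item in scored) / len(scored)

def qualified(all_answers, gold, threshold=0.8):
    """Labellers whose accuracy on the gold items reaches the threshold."""
    return sorted(
        name for name, ans in all_answers.items()
        if labeller_accuracy(ans, gold) >= threshold
    )

def majority_label(all_answers, item, trusted):
    """Most frequent label for an item among the trusted labellers."""
    votes = Counter(all_answers[n][item] for n in trusted if item in all_answers[n])
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical toy data: q1-q3 are gold items, x1 is a real unlabelled item.
gold = {"q1": "cat", "q2": "dog", "q3": "cat"}
all_answers = {
    "ann": {"q1": "cat", "q2": "dog", "q3": "cat", "x1": "dog"},
    "bob": {"q1": "dog", "q2": "dog", "q3": "dog", "x1": "cat"},  # always answers "dog" on gold
    "eve": {"q1": "cat", "q2": "dog", "q3": "dog", "x1": "dog"},
}

trusted = qualified(all_answers, gold, threshold=0.6)
print(trusted, majority_label(all_answers, "x1", trusted))
```

A labeller who gives the same answer everywhere fails the gold items and is excluded before the votes on real items are counted.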