KDD-2010 Review - IEEE Entity Web Hosting
Data Mining in Practice:
Techniques and Practical Applications
Junling Hu
May 14, 2013
What is data mining?
Mining patterns from data
Is it statistics?
Functional form?
Computation speed concern?
Data size
Variable size
Is it machine learning?
Big data issue
New methods: network mining
Examples of data mining
Frequently bought together
Movie recommendation
More examples of data mining
Keyword suggestions
Genome & disease mining
Heart monitoring
Overview of data mining
Frequent pattern mining
Machine Learning
Supervised
Unsupervised
Stream mining
Recommender system
Graph mining
Unstructured data: text, audio, image and video
Big data technology
Frequent Pattern Mining
Diaper and Beer
?
Product assortment
Click behavior
Machine breakdown
The case of Amazon
User  Items
1     {Princess dress, crown, gloves, t-shirt}
2     {Princess dress, crown, gloves, pink dress, t-shirt}
3     {Princess dress, crown, gloves, pink dress, jeans}
4     {Princess dress, crown, gloves, pink dress}
5     {crown, gloves}
Count frequency of co-occurrence
Efficient algorithm
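Counting co-occurrence can be sketched as follows. This is a minimal brute-force illustration that enumerates every item pair in every basket; it is not the efficient algorithm the slide alludes to. The basket contents are copied from the Amazon example, while the function names `count_pairs` and `frequent_pairs` and the `min_support` threshold are assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations

# Baskets copied from the user/items table in the Amazon example.
baskets = [
    {"Princess dress", "crown", "gloves", "t-shirt"},
    {"Princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    {"Princess dress", "crown", "gloves", "pink dress", "jeans"},
    {"Princess dress", "crown", "gloves", "pink dress"},
    {"crown", "gloves"},
]


def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (a, b) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(baskets, min_support):
    """Keep only the pairs found in at least `min_support` baskets."""
    return {p: n for p, n in count_pairs(baskets).items() if n >= min_support}


pair_counts = count_pairs(baskets)
```

Calling `frequent_pairs(baskets, 4)` keeps only pairs bought together by at least four of the five users. The number of candidate pairs grows quadratically with basket size, which is why large catalogs need a smarter algorithm.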
Machine Learning Process
Machine Learning
Supervised
Unsupervised (clustering)
Binary classification
Input features: Checking, Duration (years), Savings ($k), Current Loans, Loan Purpose
Output class: Risky?
Each row is one data point.

Checking  Duration (years)  Savings ($k)  Current Loans  Loan Purpose  Risky?
Yes       1                 10            Yes            TV            0
Yes       2                 4             No             TV            1
No        5                 75            No             Car           0
Yes       10                66            No             Repair        1
Yes       5                 83            Yes            Car           0
Yes       1                 11            No             TV            0
Yes       4                 99            Yes            Car           0
Classification (1): Decision tree
Classification (2): Neural network
Perceptron
Multi-layer neural network
Head pose detection
Support Vector Machine (SVM)
Search for a separating hyperplane
Maximize margin
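For reference, the textbook hard-margin formulation of "maximize margin" can be written as below. The symbols are assumptions of this note rather than notation from the slides: $w$ and $b$ define the hyperplane $w^\top x + b = 0$, and $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ are the labeled training points. The margin width is $2 / \lVert w \rVert$, so maximizing it is equivalent to minimizing $\lVert w \rVert^2$.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,\bigl(w^\top x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
```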
Perceived advantage of SVM
Transform data into higher dimension
Applications of SVM: Spam Filter
Input features:
Transmission
  IP address -- 167.12.24.555
  Sender URL -- one-spam.com
Email header
  From -- “[email protected]”
  To -- “undisclosed”
  cc
Email body
  # of paragraphs
  # of words
Email structure
  # of attachments
  # of links
Logistic regression
Advantage: Simple functional form
Can be parallelized
Large scale
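The "simple functional form" is a weighted sum of the features passed through the logistic (sigmoid) function. The sketch below shows that form plus a plain batch gradient-descent fit. The toy data, the learning rate, the epoch count and the function names are all hypothetical choices for illustration, not taken from the slides.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """P(y = 1 | x) = sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def fit(rows, labels, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        # Each row contributes an independent term to the gradient, so this
        # sum could be split across workers and added up afterwards.
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical one-feature toy data (not from the slides).
rows = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights, bias = fit(rows, labels)
```

After fitting, `predict_proba(weights, bias, [3.0])` should be above 0.5 and `predict_proba(weights, bias, [0.0])` below it.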
Applications of logistic regression
Click prediction
Search ranking (web pages, products)
Online advertising
Recommendation
The model
Output: Click/no click
Input features:
page content,
search keyword,
User information
Regression
Linear regression
Non-linear regression
Application:
• Stock price prediction
• Credit scoring
• Employment forecast
History of Supervised learning
Semi-supervised learning
Application:
Speech dialog system
Unsupervised learning: Clustering
No labeled data
Methods
K-means
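A minimal sketch of K-means (Lloyd's algorithm) using NumPy, assuming Euclidean distance and random initialization from the data points. The toy data, the `kmeans` signature and the `seed` parameter are hypothetical choices for this illustration.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        new_centers = centers.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical toy data: two well-separated groups of three points each.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(data, k=2)
```

No labels are supplied anywhere: the grouping comes only from distances between points, which is what makes this an unsupervised method.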
Categories of machine learning
Applications of Clustering
Malware detection
Document clustering: Topic detection
Graphs in our life
Social network
Friend recommendation
Molecular compound
Drug discovery
Graph and its matrix representation
Adjacency matrix
[Figure: undirected graph with nodes 1 to 6]

     1  2  3  4  5  6
1    0  1  0  0  0  1
2    1  0  1  1  0  0
3    0  1  0  1  1  0
4    0  1  1  0  1  0
5    0  0  1  1  0  1
6    1  0  0  0  1  0
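As a sketch of how a graph maps to its matrix representation, the snippet below builds a symmetric 0/1 adjacency matrix from an edge list. The edge list is read off the six-node matrix on this slide; the function name `adjacency_matrix` is an assumption of this example.

```python
import numpy as np

# Undirected edges as read from the 6-node adjacency matrix on this slide.
edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 6)]


def adjacency_matrix(edges, n_nodes):
    """Build a symmetric 0/1 adjacency matrix from 1-based node pairs."""
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for u, v in edges:
        A[u - 1, v - 1] = 1
        A[v - 1, u - 1] = 1  # undirected graph: mirror every edge
    return A


A = adjacency_matrix(edges, 6)
degrees = A.sum(axis=1)  # row sums give each node's number of neighbors
```

Because the graph is undirected, the matrix equals its own transpose, and each row sum is the degree of the corresponding node.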
The web graph
[Figure: Page 1, Page 2 and Page 3 connected by hyperlinks, each labeled with anchor text]
PageRank as a steady state
Transition matrix P =

     1     2     3     4     5     6
1    0     0.5   0.25  0     0     0.5
2    0.33  0     0.25  1     0     0
3    0.33  0.5   0     0     0.33  0
4    0     0     0.25  0     0.33  0
5    0     0     0.25  0     0     0.5
6    0.33  0     0     0     0.33  0

Each column of P sums to 1.
PageRank is a probability vector r such that P r = r, that is, the steady state of P.
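The steady state can be approximated by power iteration, sketched below with NumPy on the slide's 6x6 transition matrix. Two assumptions are made here: the rounded 0.33 entries are read as 1/3 so that every column sums to exactly 1, and no damping factor is applied (practical PageRank implementations add one). The names `steady_state`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np

t = 1.0 / 3.0  # assumption: the slide's rounded 0.33 entries stand for 1/3
P = np.array([
    [0, 0.5, 0.25, 0, 0, 0.5],
    [t, 0,   0.25, 1, 0, 0],
    [t, 0.5, 0,    0, t, 0],
    [0, 0,   0.25, 0, t, 0],
    [0, 0,   0.25, 0, 0, 0.5],
    [t, 0,   0,    0, t, 0],
])


def steady_state(P, tol=1e-12, max_iter=10_000):
    """Power iteration: apply P repeatedly until the vector stops changing."""
    n = P.shape[0]
    r = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        r_next = P @ r
        r_next /= r_next.sum()  # guard against rounding drift
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r


rank = steady_state(P)
```

The returned vector has non-negative entries that sum to 1, and multiplying it by P leaves it (numerically) unchanged, which is the steady-state condition.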
Discover influencers on Twitter
The Twitter graph
Node
Link
A PageRank approach: TwitterRank
[Figure: Twitter graph with nodes 1 to 5 joined by "Following" links]
Facebook graph search
Entity graph
Natural language search
“Restaurants liked by my friends”
Recommending a game
Recommendation on a travel site
Prediction Problems
Rating Prediction
Given how a user rated other items, predict the user’s rating for a given item
Top-N Recommendation
Given the list of items liked by a user, recommend new items that the user might like
Explicit vs. Implicit Feedback Data
Explicit feedback
Ratings and reviews
Implicit feedback (user behavior)
Purchase behavior: Recency, frequency, …
Browsing behavior: number of visits, time of visit, time spent on the page, clicks
Collaborative Filtering
Hypotheses
User/item similarities
  Similar users purchase similar items
  Similar items are purchased by similar users
Matching characteristics
  A match exists between the user’s and the item’s characteristics
User-User similarity
Users’ movie ratings

        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
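A sketch of user-user similarity on the ratings from this slide: compare Laura with each other user over the movies both have rated, then estimate her missing rating as a similarity-weighted average. Cosine similarity and the weighted-average rule are assumptions of this example, since the slides do not say which measure was used, and the function names are hypothetical.

```python
import math

# Ratings copied from the slide; Laura's "Out of Africa" rating is unknown.
ratings = {
    "John": {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam": {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = sorted(set(a) & set(b))
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other != user and item in their:
            sim = cosine_similarity(ratings[user], their)
            num += sim * their[item]
            den += sim
    return num / den


sim_john = cosine_similarity(ratings["Laura"], ratings["John"])
sim_adam = cosine_similarity(ratings["Laura"], ratings["Adam"])
laura_guess = predict("Laura", "Out of Africa")
```

Laura's ratings track John's more closely than Adam's, so John's rating of "Out of Africa" carries more weight in the estimate.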
Item-item similarity
        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
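The item-item view compares columns instead of rows. The sketch below computes cosine similarity between movie columns using the two users who rated every movie (John and Adam), then estimates Laura's missing rating from her own ratings of similar movies. The similarity measure, the weighting rule and the names `item_similarity` and `predict_for_laura` are assumptions of this example.

```python
import numpy as np

movies = ["Out of Africa", "Star Wars", "Air Force One", "Liar, Liar"]
# Rows: John, Adam (the users who rated every movie on the slide).
R = np.array([
    [4, 4, 5, 1],
    [1, 1, 2, 5],
], dtype=float)
laura = {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2}


def item_similarity(R):
    """Cosine similarity between every pair of item (column) vectors."""
    unit = R / np.linalg.norm(R, axis=0, keepdims=True)
    return unit.T @ unit


S = item_similarity(R)


def predict_for_laura(target):
    """Weight Laura's own ratings by each movie's similarity to the target."""
    t = movies.index(target)
    num = sum(S[t, movies.index(m)] * r for m, r in laura.items())
    den = sum(S[t, movies.index(m)] for m in laura)
    return num / den


guess = predict_for_laura("Out of Africa")
```

"Out of Africa" and "Star Wars" received identical ratings from John and Adam, so their similarity is 1 and Laura's "Star Wars" rating dominates the estimate.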
Application of item-item similarity
Amazon
SVD (Singular Value Decomposition)
Latent factors
Application of Latent Factor Model
GetJar
Ranking-based recommendation
Application in LinkedIn
Ranking-based model
Thanks and Contact
Co-author: Patricia Hoffman
Contact:
[email protected]
Twitter: @junling_tech