Final Review
This is not a comprehensive review but highlights certain key areas.
Top-Level Data Mining Tasks
• At the highest level, data mining tasks can be divided into:
  – Prediction Tasks (supervised learning)
    • Use some variables to predict unknown or future values of other variables
    – Classification
    – Regression
  – Description Tasks (unsupervised learning)
    • Find human-interpretable patterns that describe the data
    – Clustering
    – Association Rule Mining
Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class, which is to be predicted.
• Find a model for the class attribute as a function of the values of the other attributes.
  – The model maps each record to a class value.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model.
• Can you think of classification tasks?
Classification
• Simple linear
• Decision trees (entropy, GINI)
• Naïve Bayesian
• Nearest Neighbor
• Neural Networks
Regression
• Predict the value of a given continuous (numerical) variable based on the values of other variables
• Extensively studied in statistics
• Examples:
  – Predicting sales amounts of a new product based on advertising expenditure
  – Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  – Time series prediction of stock market indices
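The slides stop at the definition; as a minimal sketch (not from the slides), a straight-line fit for the advertising/sales example can be obtained with ordinary least squares. The numbers and variable names below are made up for illustration.

# Minimal sketch: ordinary least-squares fit of sales vs. advertising spend.
# The numbers below are made up for illustration only.
import numpy as np

ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # advertising expenditure
sales    = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # observed sales amounts

# Fit sales ~ slope * ad_spend + intercept (degree-1 polynomial fit).
slope, intercept = np.polyfit(ad_spend, sales, 1)
print(f"predicted sales at spend 6.0: {slope * 6.0 + intercept:.2f}")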
Clustering
• Given a set of data points, find clusters so that
  – Data points in the same cluster are similar
  – Data points in different clusters are dissimilar
You try it on the Simpsons: how can we cluster these 5 "data points"?
Association Rule Discovery
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on the occurrences of other items.

  TID   Items
  1     Bread, Coke, Milk
  2     Beer, Bread
  3     Beer, Coke, Diaper, Milk
  4     Beer, Bread, Diaper, Milk
  5     Coke, Diaper, Milk

  Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But properties of the attribute values can be different
      – ID has no limit, but age has a maximum and minimum value
Types of Attributes
• There are different types of attributes
  – Nominal (Categorical)
    • Examples: ID numbers, eye color, zip codes
  – Ordinal
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Interval
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio
    • Examples: temperature in Kelvin, length, time, counts
Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification
outlook
  sunny → humidity
    high → no
    normal → yes
  overcast → yes
  rain → wind
    strong → no
    weak → yes
How do we construct the decision tree?
• Basic algorithm (a greedy algorithm; a code sketch follows this list)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they can be discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  – There are no samples left
• Pre-pruning/post-pruning
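A minimal sketch of the greedy, top-down procedure just described, assuming categorical attributes. Here choose_attribute stands in for whatever selection heuristic is used (e.g., the information-gain sketch two slides below), and the representation of examples as (record, label) pairs with dict records is an assumption of this sketch.

# Sketch of top-down recursive decision-tree construction (greedy, categorical attributes).
from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    labels = [label for _, label in examples]
    # Stop when all samples at this node share one class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop when no attributes remain: classify the leaf by majority vote.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)
    remaining = [a for a in attributes if a != best]
    subtree = {}
    # Partition the examples on each observed value of the chosen attribute and recurse
    # (splitting only on observed values also avoids empty child nodes).
    for value in {record[best] for record, _ in examples}:
        subset = [(r, l) for r, l in examples if r[best] == value]
        subtree[value] = build_tree(subset, remaining, choose_attribute)
    return (best, subtree)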
How To Split Records
• Random split
  – The tree can grow huge
  – Such trees are hard to understand
  – Larger trees are typically less accurate than smaller trees
• Principled criterion
  – Selection of an attribute to test at each node: choosing the most useful attribute for classifying examples
  – How? Information gain
    • Measures how well a given attribute separates the training examples according to their target classification
    • This measure is used to select among the candidate attributes at each step while growing the tree (a sketch follows below)
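A minimal sketch of entropy and information gain for categorical attributes, matching the choose_attribute hook used in the tree-building sketch above; the (record, label) representation is an assumption carried over from that sketch.

# Sketch: entropy and information gain for categorical attributes.
import math
from collections import Counter

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    total = len(examples)
    remainder = 0.0
    # Weighted entropy of the partitions induced by this attribute.
    for value in {record[attribute] for record, _ in examples}:
        subset = [(r, l) for r, l in examples if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def choose_attribute(examples, attributes):
    # Pick the attribute that best separates the examples (highest gain).
    return max(attributes, key=lambda a: information_gain(examples, a))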
Advantages/Disadvantages of Decision Trees
• Advantages:
  – Easy to understand (Doctors love them!)
  – Easy to generate rules
• Disadvantages:
  – May suffer from overfitting
  – Classifies by rectangular partitioning (so does not handle correlated features very well)
  – Can be quite large – pruning is necessary
  – Does not handle streaming data easily
Overfitting (another view)
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
  – There may be noise in the training data that the tree is erroneously fitting.
  – The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
(Figure: accuracy on training data and on test data plotted against hypothesis complexity/size of the tree (number of nodes).)
Notes on Overfitting
• Overfitting results in decision trees (models in general) that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways for estimating errors
Evaluation
• Accuracy
• Recall/Precision/F-measure
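The review only names these measures; as a reminder of the standard definitions (not spelled out on this slide), here is a minimal sketch computing them from a binary confusion matrix. The counts in the example call are made up, and nonzero denominators are assumed.

# Standard evaluation measures from a binary confusion matrix (sketch).
def evaluate(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many are correct
    recall    = tp / (tp + fn)          # of actual positives, how many were found
    f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R
    return accuracy, precision, recall, f_measure

print(evaluate(tp=40, fp=10, fn=5, tn=45))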
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier, also called:
• Idiot Bayes
• Naïve Bayes
• Simple Bayes
We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea: find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.
Go through all the examples on the slides and be ready to generate tables similar to the ones presented in class and the one you created for your HW assignment.
Smoothing
Bayesian Classifiers
• Bayesian classifiers use Bayes theorem, which says
  p(cj | d) = p(d | cj) p(cj) / p(d)
• p(cj | d) = probability of instance d being in class cj
  This is what we are trying to compute
• p(d | cj) = probability of generating instance d given class cj
  We can imagine that being in class cj causes you to have feature d with some probability
• p(cj) = probability of occurrence of class cj
  This is just how frequent the class cj is in our database
• p(d) = probability of instance d occurring
  This can actually be ignored, since it is the same for all classes
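A minimal sketch of the "pick the most probable class" rule under the naive independence assumption, for categorical features, with simple add-one (Laplace) smoothing. The data representation and the smoothing denominator are assumptions of this sketch, not the exact formulation from class.

# Sketch of a categorical naive Bayes classifier with add-one smoothing.
from collections import Counter, defaultdict

def train(examples):
    # examples: list of (feature_dict, class_label) pairs
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (class, feature) -> Counter of observed values
    for features, label in examples:
        for feat, val in features.items():
            value_counts[(label, feat)][val] += 1
    return class_counts, value_counts

def classify(features, class_counts, value_counts):
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for label, count in class_counts.items():
        # p(cj) * product over features of p(d_i | cj), with add-one smoothing.
        score = count / total
        for feat, val in features.items():
            counter = value_counts[(label, feat)]
            # Rough smoothing denominator: observed values for this feature/class, plus one.
            score *= (counter[val] + 1) / (sum(counter.values()) + len(counter) + 1)
        if score > best_score:
            best_class, best_score = label, score
    return best_class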
Bayesian Classification
– A statistical method for classification
– A supervised learning method
– Assumes an underlying probabilistic model, via Bayes theorem
– Can solve diagnostic and predictive problems
– Particularly suited when the dimensionality of the input is high
– In spite of the over-simplified independence assumption, it often performs surprisingly well in many complex real-world situations
Advantages/Disadvantages of Naïve Bayes
• Advantages:
  – Fast to train (single scan); fast to classify
  – Not sensitive to irrelevant features
  – Handles real and discrete data
  – Handles streaming data well
• Disadvantages:
  – Assumes independence of features
Nearest-Neighbor Classifiers
Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
  – Compute the distance to the training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote), as in the sketch below
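A minimal sketch of the three ingredients above (stored records, a distance metric, and k), using Euclidean distance and a majority vote; the toy 2-D data is made up.

# Sketch of k-nearest-neighbor classification with Euclidean distance and majority vote.
import math
from collections import Counter

def euclidean(q, c):
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))

def knn_classify(unknown, training, k=3, distance=euclidean):
    # training: list of (feature_vector, class_label) pairs
    neighbors = sorted(training, key=lambda rec: distance(unknown, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny made-up example: two well-separated 2-D classes.
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
        ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify((2, 2), data, k=3))   # -> "A"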
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case.

Euclidean distance:  D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}
Minkowski distance:  D(Q, C) = \left( \sum_{i=1}^{n} |q_i - c_i|^p \right)^{1/p}

Special cases and alternatives:
  – Max (p = infinity)
  – Manhattan (p = 1)
  – Weighted Euclidean
  – Mahalanobis
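A minimal sketch of the Minkowski family: p = 1 gives Manhattan, p = 2 Euclidean, and the limit p → infinity is the max (Chebyshev) distance. Weighted Euclidean and Mahalanobis would additionally need a weight vector or covariance matrix, which this sketch omits.

# Sketch: the Minkowski family of distances.
def minkowski(q, c, p=2):
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1 / p)

def chebyshev(q, c):
    # Limiting case p -> infinity: the largest coordinate-wise difference.
    return max(abs(qi - ci) for qi, ci in zip(q, c))

print(minkowski((0, 0), (3, 4), p=1))   # Manhattan: 7.0
print(minkowski((0, 0), (3, 4), p=2))   # Euclidean: 5.0
print(chebyshev((0, 0), (3, 4)))        # Max: 4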
Strengths and Weaknesses
• Strengths:
  – Simple to implement and use
  – Comprehensible – easy to explain the prediction
  – Robust to noisy data by averaging k-nearest neighbors
  – Distance function can be tailored using domain knowledge
  – Can learn complex decision boundaries
    • Much more expressive than linear classifiers & decision trees
    • More on this later
• Weaknesses:
  – Need a lot of space to store all examples
  – Takes much more time to classify a new example than with a parsimonious model (need to compare distances to all other examples)
  – Distance function must be designed carefully with domain knowledge
Perceptrons
• The perceptron is a type of artificial neural network which can be seen as the simplest kind of feedforward neural network: a linear classifier
• Introduced in the late 50s
• Perceptron convergence theorem (Rosenblatt, 1962):
  – The perceptron will learn to classify any linearly separable set of inputs
• The perceptron is a network:
  – single-layer
  – feed-forward: data only travels in one direction
• It cannot represent the XOR function (no linear separation)
Perceptron: Artificial Neuron Model
Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, with weight wji.
The input value received by a neuron is calculated by summing the weighted input values from its input links: \sum_{i=0}^{n} w_i x_i
This sum is then passed through a threshold (activation) function.
Vector notation: the weighted sum is the dot product w \cdot x.
Examples (step activation function)

AND                 OR                  NOT
In1  In2  Out       In1  In2  Out       In   Out
 0    0    0         0    0    0         0    1
 0    1    0         0    1    1         1    0
 1    0    0         1    0    1
 1    1    1         1    1    1

The threshold t can be folded into the weighted sum by adding a constant input x0 = 1 with weight w0 = -t, so the unit fires when \sum_{i=0}^{n} w_i x_i > 0. A perceptron sketch that learns AND this way follows below.
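A minimal sketch of perceptron learning with a step activation, using the bias trick (x0 = 1, w0 = -t) and learning the AND function from the table above. The learning rate and epoch count are arbitrary choices for this toy example; convergence here is guaranteed by the convergence theorem because AND is linearly separable.

# Sketch: a perceptron with a step activation learning the AND function.
def step(z):
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    weights = [0.0, 0.0, 0.0]                     # w0 (bias), w1, w2
    for _ in range(epochs):
        for (x1, x2), target in samples:
            x = [1, x1, x2]                       # prepend the bias input x0 = 1
            output = step(sum(w * xi for w, xi in zip(weights, x)))
            # Perceptron rule: nudge each weight by the error on this example.
            weights = [w + lr * (target - output) * xi for w, xi in zip(weights, x)]
    return weights

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_samples)
print([step(w[0] + w[1] * x1 + w[2] * x2) for (x1, x2), _ in and_samples])   # [0, 0, 0, 1]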
Summary of Neural Networks
When are neural networks useful?
  – Instances are represented by attribute-value pairs
    • Particularly when attributes are real-valued
  – The target function is
    • Discrete-valued
    • Real-valued
    • Vector-valued
  – Training examples may contain errors
  – Fast evaluation times are necessary
When not?
  – Fast training times are necessary
  – Understandability of the function is required
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The basic algorithm is very simple
  – K-means tutorial available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html
K-means Clustering
1. Ask the user how many clusters they'd like (e.g., k = 3)
2. Randomly guess k cluster center locations
3. Each datapoint finds out which center it's closest to
4. Each center finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat until terminated!
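A minimal sketch of the loop above; it runs a fixed number of iterations instead of a convergence test, and initializing the centers by sampling k data points is one common choice.

# Sketch of the basic k-means loop (random initial centers, assign, recompute, repeat).
import random

def kmeans(points, k, iterations=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)            # step 2: random initial guesses
    for _ in range(iterations):
        # Step 3: each point finds the center it is closest to.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Steps 4-5: each center jumps to the centroid of the points it owns.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2)
print(centers)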
K-means Clustering: Steps 1 through 5
(Figure slides: successive k-means iterations on a 2-D example, with axes "expression in condition 1" and "expression in condition 2", showing the three centers k1, k2, k3 and the point assignments being updated step by step.)
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Agglomerative is the most common
How to Define Inter-Cluster Similarity
Similarity? (illustrated in the slides with a proximity matrix over points p1…p5)
  – MIN
  – MAX
  – Group Average
  – Distance Between Centroids
  – Other methods driven by an objective function
    • Ward's Method uses squared error
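A minimal sketch, assuming SciPy is available, showing how the criteria above map onto standard linkage options ('single' ~ MIN, 'complete' ~ MAX, 'average' ~ group average, 'ward' ~ Ward's squared-error method); cutting the result into two clusters with fcluster mirrors the "cut the dendrogram at the proper level" idea from the previous slide. The toy points are made up.

# Sketch: agglomerative clustering with different inter-cluster similarity criteria.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(points, method=method)                  # the dendrogram as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram into 2 clusters
    print(method, labels)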
DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps
    • These are points that are in the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point
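A minimal sketch of just the core/border/noise labelling defined above (not the full cluster-growing algorithm). It counts a point's own neighborhood membership and uses ">= MinPts" rather than a strict "more than", which is a common implementation choice; the toy points are made up.

# Sketch: label points as core, border, or noise using the DBSCAN definitions.
import math

def classify_points(points, eps, min_pts):
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]   # includes p itself
    core = [p for p in points if len(neighbors(p)) >= min_pts]
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(c in core for c in neighbors(p)):
            labels[p] = "border"    # not dense itself, but inside a core point's neighborhood
        else:
            labels[p] = "noise"
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (3.2, 2.0), (9, 9)]
print(classify_points(pts, eps=1.5, min_pts=3))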
What Is Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Rule Mining
• We are interested in rules that are
  – non-trivial (and possibly unexpected)
  – actionable
  – easily explainable
Support and Confidence
(Venn diagram in the slides: customers who buy beer, customers who buy diapers, and customers who buy both.)
• Find all the rules X => Y with minimum confidence and support
  – Support = probability that a transaction contains {X, Y}
    • i.e., the ratio of transactions in which X and Y occur together to all transactions in the database
  – Confidence = conditional probability that a transaction having X also contains Y
    • i.e., the ratio of transactions in which X and Y occur together to those in which X occurs
In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of the LHS:
Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
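A minimal sketch computing support and confidence over the toy transactions from the Association Rule Discovery slide earlier in this review; the two printed values correspond to the itemset {Diaper, Milk} and the rule {Diaper, Milk} => {Beer}.

# Sketch: support and confidence over the toy transaction table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Milk"}))                 # 3/5 = 0.6
print(confidence({"Diaper", "Milk"}, {"Beer"}))    # 2/3 ~ 0.67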
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g., σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Beer, Eggs
  3     Milk, Diaper, Beer, Coke
  4     Bread, Milk, Diaper, Beer
  5     Bread, Milk, Diaper, Coke
The Apriori algorithm
• The best-known algorithm
• Two steps:
  – Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
  – Use frequent itemsets to generate rules.
• E.g., a frequent itemset
  {Chicken, Clothes, Milk}  [sup = 3/7]
  and one rule from the frequent itemset
  Clothes => Milk, Chicken  [sup = 3/7, conf = 3/3]
(Slide source: CS583, Bing Liu, UIC)
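A minimal sketch of the two steps, level-wise frequent-itemset generation followed by rule generation, without Apriori's candidate-pruning optimizations; the transactions are the toy table from the frequent-itemset slide, and the minsup/minconf values are illustrative.

# Sketch of Apriori: find frequent itemsets level by level, then generate rules.
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    def sup(s):
        return sum(s <= t for t in transactions) / len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent, level = {}, {s for s in items if sup(s) >= minsup}
    while level:
        frequent.update({s: sup(s) for s in level})
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = {c for c in candidates if sup(c) >= minsup}
    return frequent

def rules(frequent, minconf):
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[lhs]        # every subset of a frequent itemset is frequent
                if conf >= minconf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

transactions = [frozenset(t) for t in [{"Bread", "Milk"},
                                       {"Bread", "Diaper", "Beer", "Eggs"},
                                       {"Milk", "Diaper", "Beer", "Coke"},
                                       {"Bread", "Milk", "Diaper", "Beer"},
                                       {"Bread", "Milk", "Diaper", "Coke"}]]
freq = apriori_frequent_itemsets(transactions, minsup=0.6)
print(rules(freq, minconf=0.8))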
Associations: Pros and Cons
• Pros
  – Can quickly mine patterns describing business/customers/etc. without major effort in problem formulation
  – Virtual items allow much flexibility
  – An unparalleled tool for hypothesis generation
• Cons
  – Unfocused
    • not clear exactly how to apply the mined "knowledge"
    • only hypothesis generation
  – Can produce many, many rules!
    • there may be only a few nuggets among them (or none)
Association Rules
• Association rule types:
  – Actionable Rules – contain high-quality, actionable information
  – Trivial Rules – information already well known by those familiar with the business
  – Inexplicable Rules – have no explanation and do not suggest action
• Trivial and Inexplicable Rules occur most often