
Final Exam Review
• The following is a list of items that you should
review in preparation for the exam. Note that
not every item in the following slides may be
on the exam, and there may be items on the
exam that are not on these slides.
Overview of three techniques
• Decision Tree
• Clustering
• Association Rule
What is classification?
• Determining to what group a
data element belongs
– Or “attributes” of that “entity”
• Examples
– Determining whether a customer
should be given a loan
– Flagging a credit card transaction
as a fraudulent charge
– Categorizing a news story as
finance, entertainment, or sports
What is Cluster Analysis?
Grouping data so that
elements in a group
will be
• Similar (or related) to
one another
• Different (or unrelated)
from elements in other
groups
Distance within
clusters is
minimized
Distance
between
clusters is
maximized
http://www.baseball.bornbybits.com/blog/uploaded_images/Takashi_Saito-703616.gif
Association Mining
Find out which items
predict the occurrence of
other items
Also known as “affinity
analysis” or “market
basket” analysis
Uses
• What products are bought together?
• Amazon’s recommendation engine
• Telephone calling patterns
Match Scenario with Data Mining
Technique
• Which data mining technique (Decision Trees,
Clustering, or Association Rules) would be most
appropriate to answer each question below?
– What products are bought at the same time as Coke?
– What is the probability that a 57-year-old female in a
low income family will die because of cancer?
– How many types of customers visit fresh grocery?
Interpret your model
• You should be able to interpret your model
from two aspects:
– First, whether it is a good model
– Second, how you can use your model to help you
answer questions or make decisions.
Basic Statistical Information
• Be able to understand the basics of your
data by looking at the explore window with
descriptive statistics
– Distribution, average, range, etc.
– And what those numbers can tell you.
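As a small illustration of reading descriptive statistics, the sketch below computes the average, median, and range for a list of spending amounts. The values are made up for illustration and are not taken from the course data; the helper name `describe` is also hypothetical.

```python
import statistics

# Hypothetical spending amounts (illustrative only, not course data).
spend = [5, 8, 10, 12, 15, 18, 20, 25, 40, 250]

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

summary = describe(spend)
for name, value in summary.items():
    print(f"{name}: {value}")
```

When the mean is well above the median, as in this made-up list, a few large values are pulling the average up and most people spend relatively little.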
What can you tell from this histogram? Do most people
spend a lot or not?
Decision Tree
• Whether it is a good model
– Use the Subtree Assessment Plot to find the Average
Square Error and/or Misclassification Rate. A lower
average square error and misclassification rate
suggest a better model.
– Think about why these numbers can give you the
optimal number of leaves.
• How to use your model
– Follow the tree path that matches the descriptions
in your question.
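The two measures above can be sketched in a few lines. This is a simplified illustration for a binary target, assuming the model outputs a predicted probability for each case and that a case is classified as 1 when that probability is at least 0.5. All data values, including the error-by-leaf-count table, are hypothetical, and the software's exact formulas may differ.

```python
def misclassification_rate(actual, predicted_prob, cutoff=0.5):
    """Share of cases whose predicted class differs from the actual class."""
    wrong = sum(
        1
        for y, p in zip(actual, predicted_prob)
        if (1 if p >= cutoff else 0) != y
    )
    return wrong / len(actual)

def average_squared_error(actual, predicted_prob):
    """Mean squared gap between the 0/1 outcome and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted_prob)) / len(actual)

# Hypothetical outcomes (1 = bought, 0 = did not buy) and predicted probabilities.
actual = [1, 0, 1, 0]
predicted_prob = [0.75, 0.25, 0.25, 0.5]

print(misclassification_rate(actual, predicted_prob))
print(average_squared_error(actual, predicted_prob))

# Hypothetical error for subtrees of different sizes: the error falls as
# leaves are added, then rises again once the tree starts to overfit.
error_by_leaves = {5: 0.21, 9: 0.17, 13: 0.15, 17: 0.16, 21: 0.18}
best_leaves = min(error_by_leaves, key=error_by_leaves.get)
print(best_leaves)
```

The subtree with the lowest error is the one to prefer, which is one way to reason about why a plot would point to a particular number of leaves.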
Why is the optimal number of leaves 13?
What is the likelihood of a 52-year-old
man with an affluence of 5 buying an
organic product?
Cluster and Segment
• Whether it is a good model
– You want to have higher cohesion within each
cluster and higher separation between
clusters.
– A higher Root Mean Square Standard Deviation
suggests lower cohesion. A higher distance to the
nearest cluster suggests higher separation.
• How to use your model
– Be able to tell how each cluster differs
from your overall result.
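To make cohesion and separation concrete, here is a minimal sketch that computes a root mean square standard deviation for each cluster and the distance from each cluster centroid to its nearest neighbor. The points are hypothetical, and the formula used here (the square root of the average per-variable sample variance) is an assumption; the software's exact calculation may differ in detail.

```python
import math
import statistics

def centroid(points):
    """Mean of each variable across the points in a cluster."""
    return [statistics.mean(col) for col in zip(*points)]

def rms_std(points):
    """Square root of the average per-variable sample variance (assumed formula)."""
    variances = [statistics.variance(col) for col in zip(*points)]
    return math.sqrt(sum(variances) / len(variances))

def nearest_cluster_distance(name, centroids):
    """Euclidean distance from one centroid to the closest other centroid."""
    return min(
        math.dist(centroids[name], other_centroid)
        for other, other_centroid in centroids.items()
        if other != name
    )

# Hypothetical two-variable clusters (illustrative only).
clusters = {
    "A": [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)],
    "B": [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
    "C": [(0.0, 20.0), (0.0, 28.0), (8.0, 20.0), (8.0, 28.0)],
}
centroids = {name: centroid(points) for name, points in clusters.items()}

for name, points in clusters.items():
    print(name, round(rms_std(points), 3),
          round(nearest_cluster_distance(name, centroids), 3))
```

In this made-up data, cluster C is the most spread out, so it has the highest root mean square standard deviation and the lowest cohesion.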
Which model is better in terms of
cluster cohesion?
For each model, which cluster
has the highest cohesion?
How might the maximum
number of clusters in your
model affect the cohesion
and separation?
Are the sales of stretch jeans in cluster 2 better than the average sales
of stretch jeans for the entire population?
Association Rule
• Whether it is a good model
– Confidence: the chance that Y is bought when X has
been bought
– Support: the chance that X and Y are bought together
– Lift: the ratio of confidence to the chance that Y is
bought regardless of X. Equivalently, the ratio of support
to the chance that X and Y are bought together by coincidence.
• How to use your model
– Be able to give suggestions based on your analysis
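The three measures can be computed directly from a list of baskets. The sketch below uses a handful of made-up transactions; the item names and counts are purely illustrative, and the function names are hypothetical rather than part of any particular tool.

```python
# Hypothetical market baskets (illustrative only).
transactions = [
    {"coke", "chips"},
    {"coke", "chips", "beer"},
    {"coke", "pepsi"},
    {"beer", "chips"},
    {"pepsi"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(x, y, baskets):
    """Chance that Y is bought given that X has been bought."""
    return support(x | y, baskets) / support(x, baskets)

def lift(x, y, baskets):
    """Confidence of X -> Y divided by the overall chance that Y is bought."""
    return confidence(x, y, baskets) / support(y, baskets)

for y in ({"chips"}, {"pepsi"}):
    x = {"coke"}
    print(
        sorted(x), "->", sorted(y),
        "support", round(support(x | y, transactions), 3),
        "confidence", round(confidence(x, y, transactions), 3),
        "lift", round(lift(x, y, transactions), 3),
    )
```

A lift above 1 means the two items appear together more often than chance would suggest; a lift below 1 means they appear together less often, which is typical of substitutes.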
Is Coke often bought with beer or Pepsi? Why?
Can you suggest two products that
should be placed close to each other? Can you
suggest two products that should not be
placed together? Why?