Excel – Not a Bad Data Mining Client At All Allan Mitchell SQL Server MVP Konesans Limited ww.SQLIS.com.

Download Report

Transcript Excel – Not a Bad Data Mining Client At All Allan Mitchell SQL Server MVP Konesans Limited ww.SQLIS.com.

Excel – Not a Bad Data
Mining Client At All
Allan Mitchell
SQL Server MVP
Konesans Limited
ww.SQLIS.com
Who am I
•
•
•
•
•
SQL Server MVP
SQL Server Consultant
Joint author on Wrox Professional SSIS book
Worked with SQL Server since version 6.5
www.SQLDTS.com and www.SQLIS.com
Today’s Schedule
• Mostly Demos
• Data Mining Add-In for Excel 2007
– Added XL Functions
– Visualisation Methods
Today’s Schedule
• Added XL Functions - Not a lot of people know these
exist
– DMPREDICT
– DMPREDICTTABLEROW
– DMCONTENTQUERY
– Only exist after add-in installed
Today’s Schedule
• Visualisation Methods
–
–
–
–
–
Accuracy Charts
Classification Matrix
Profit Charts
Folding (X-Validation)
Calculator (if we get time)
Excel Functions
• DMPREDICT
• Can take a variable number of arguments, the minimum being 3.
• The first parameter is the Analysis Services connection to be used.
An empty string refers to the current (active) connection.
• The second parameter is the name of the mining model that will
execute the prediction
• The third parameter, is the requested predicted entity (predictable
column, in general, but could also be any prediction function)
• The function may also take up to 32 pairs of arguments. Each such
pair contains the value and the name of an input (in this order, i.e.
value followed by name).
Excel Functions
• DMPREDICTTABLEROW
• The first parameter is the Analysis Services connection to be
used. An empty string refers the current (active) connection.
• The second parameter is the name of the mining model that
will execute the prediction
• The third parameter, is the requested predicted entity
(predictable column, in general, but could also be any
prediction function)
• The fourth parameter is a range of cells to be passed as inputs
• The fifth parameter (optional) is a comma-separated list of
column names to be used as names for the inputs
Excel Functions
•
•
•
•
DMPREDICTTABLEROW
If range of cells is form XL List Object
Column Headers taken from List
5th Parameter not necessary
– Unless Column Name != Model Column Name
Excel Functions
• DMCONTENTQUERY
• The first parameter is the Analysis Services connection to
be used. An empty string refers to the current (active)
connection.
• The second parameter is the name of the mining model
that will execute the prediction
• The third parameter, is the requested content column
• The fourth parameter is a WHERE clause to be appended
to the content query
Data Mining Excel functions
DEMO
Excel Add-In
•
•
•
•
•
Great way of visualising Data Mining
Takes away some of the mystery
Easy to use
Some wizards
Freedom vs. flexibility
Accuracy Charts
• Compare 1-n models against
– Another model
– Best model
– Thumb in the air model/no model/chance
Accuracy Charts
• Interpreting
– How does a model compare with other models
– What is the cumulative gain
– Lift
• The real thing we want to see is.....
– By how much do we beat the “chance” model
Accuracy Charts
DEMO
Classification Matrix
• What are we interested in
–
–
–
–
–
How well did my model predict outcomes
False Positive
False Negative
True Positive
True Negative
Classification Matrix
Predicted
TRUE
FALSE
TRUE
True Positive
False Negative (type 2 error)
FALSE
False Positive (type 1 error)
True Negative
Actual
Classification Matrix
• A misclassification is not always a bad thing
• Consider
–
–
–
–
–
Predicted possibility of disease
Extra care/treatment given
Real result is “No disease”
Example of false positive
Is it such a bad thing?
Classification Matrix
DEMO
Profit Charts
• Closely follows lift/cumulative gain chart
• Apply costs to efforts
Profit Charts
• Apply costs to
– Initial/Fixed outlay
– Cost per case
– Return per case
• Target predictable column
• Target Outcome
• Count of cases to use
Profit Chart
DEMO
X-Validation/Folding/Rotation
Estimation
• Validates your model
• Tests whether model generally applicable
• Large variations in results between partitions
– Model not generally applicable
– May need tuning
X-Validation/Folding/Rotation Estimation
• Stratified K-Fold Cross Validation
• Creates K folds
– Representative partitions
•
•
•
•
Holds one partition out
Trains model with others
Tests with holdout partition
Repeat (different holdout/test partition)* K
X-Validation/Folding/Rotation Estimation
DEMO
Prediction Calculator
• Set costs and profits associated with
– Getting the prediction right
– Getting the prediction wrong
• See profit curves
• See profit threshold scores
• Pad for entering new data
Prediction Calculator
•
•
•
•
Cloud Version available
Print version available for later data entry
Easy to use
Easy to understand
Prediction Calculator
DEMO
Thank you…
[email protected]