Understanding Data Mining Craig A. Stevens, PMP, CC [email protected] www.westbrookstevens.com Examples of Classical Statistical Methods.
Slide 1
Understanding Data Mining
Craig A. Stevens, PMP, CC
[email protected]
www.westbrookstevens.com
Slide 2
Examples of
Classical Statistical
Methods
Slide 3
Latitude 36.19N and Longitude 86.78W
Nashville, TN, USA
Slide 4
$Y_i = a + b x_i + e_i$
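As an illustration of the simple linear regression model on this slide, the intercept a and slope b can be estimated by ordinary least squares. The sketch below is a minimal example, not taken from the presentation: the function name `fit_simple_regression` and the sample data are assumptions made for illustration only.

```python
from statistics import mean


def fit_simple_regression(xs, ys):
    """Estimate (a, b) in y = a + b*x by ordinary least squares.

    Hypothetical helper for illustration; not part of the original slides.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Made-up data lying exactly on y = 1 + 2x, so every residual e_i is zero.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_simple_regression(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
```

With real data the residuals would not vanish; they are the estimates of the error term in the model.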
Slide 5
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Slide 6
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Slide 7
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Slide 8
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Slide 9
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Slide 10
Data Mining
Slide 11
http://datamining.typepad.com/photos/uncategorized/livejournal.png
Slide 12
Slide 13
What is Data Mining?
• The process of identifying hidden patterns, trends, and relationships in large quantities of data.
Why Do Data Mining?
• To discover useful information for making decisions.
• Too many variables for Classical Statistical methods to work.
  – Large Number of Records: 10^8 – 10^12
    • Gigabyte – Terabyte
  – High Dimensional Data
    • Lots of Variables (10 – 10^4 attributes)
Slide 14
The Huber-Wegman Taxonomy of Data Set Sizes
Descriptor      Data Set Size in Bytes   Storage Mode
Tiny            10^2                     Piece of Paper
Small           10^4                     A few Pieces of Paper
Medium          10^6                     A Floppy Disk
Large           10^8                     Hard Disk
Huge            10^10                    Multiple Hard Disks
Massive         10^12                    Robotic Magnetic Tape Storage Silos
Super Massive   10^15                    Distributed Data Archives
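To make the taxonomy concrete, the sketch below maps a data set size in bytes to the descriptor whose listed size is closest in order of magnitude. The table gives only one representative size per descriptor, so matching on the nearest power of ten is an assumption made here, and the name `nearest_descriptor` is hypothetical.

```python
import math

# (descriptor, exponent of the representative size in bytes), from the table.
HUBER_WEGMAN = [
    ("Tiny", 2),
    ("Small", 4),
    ("Medium", 6),
    ("Large", 8),
    ("Huge", 10),
    ("Massive", 12),
    ("Super Massive", 15),
]


def nearest_descriptor(size_bytes):
    """Return the descriptor whose listed size is nearest on a log10 scale.

    Assumption: nearest-order-of-magnitude matching; the taxonomy itself
    does not define boundaries between categories.
    """
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    log_size = math.log10(size_bytes)
    return min(HUBER_WEGMAN, key=lambda row: abs(row[1] - log_size))[0]


# A 1.5 MB file is closest to 10^6 bytes.
example = nearest_descriptor(1_500_000)
```

Under this rule, the 10^8 to 10^12 record counts cited for data mining fall in the Large to Massive range.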
Slide 15
Name      Model Role   Measurement Level   Description
BAD       Target       Binary              1=client defaulted on loan, 0=loan repaid
CLAGE     Input        Interval            Age of oldest trade line in months
CLNO      Input        Interval            Number of trade lines
DEBTINC   Input        Interval            Debt-to-income ratio
DELINQ    Input        Interval            Number of delinquent trade lines
DEROG     Input        Interval            Number of major derogatory reports
JOB       Input        Nominal             Six occupational categories
LOAN      Input        Interval            Amount of the loan request
MORTDUE   Input        Interval            Amount due on existing mortgage
NINQ      Input        Interval            Number of recent credit inquiries
REASON    Input        Binary              DebtCon=debt consolidation, HomeImp=home improvement
VALUE     Input        Interval            Value of current property
YOJ       Input        Interval            Years at present job
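The variable table above can also be written as a small data dictionary, which is a common first step before modeling: find the target and group the inputs by measurement level so each group can be handled appropriately. This is only a sketch. The names `VARIABLES` and `split_by_role` are hypothetical, and nothing here reflects how SAS Enterprise Miner stores this metadata.

```python
from collections import defaultdict

# (name, model role, measurement level), copied from the table.
VARIABLES = [
    ("BAD", "Target", "Binary"),
    ("CLAGE", "Input", "Interval"),
    ("CLNO", "Input", "Interval"),
    ("DEBTINC", "Input", "Interval"),
    ("DELINQ", "Input", "Interval"),
    ("DEROG", "Input", "Interval"),
    ("JOB", "Input", "Nominal"),
    ("LOAN", "Input", "Interval"),
    ("MORTDUE", "Input", "Interval"),
    ("NINQ", "Input", "Interval"),
    ("REASON", "Input", "Binary"),
    ("VALUE", "Input", "Interval"),
    ("YOJ", "Input", "Interval"),
]


def split_by_role(variables):
    """Return (target_name, {measurement level: [input names]})."""
    target = None
    inputs_by_level = defaultdict(list)
    for name, role, level in variables:
        if role == "Target":
            target = name
        else:
            inputs_by_level[level].append(name)
    return target, dict(inputs_by_level)


target, inputs_by_level = split_by_role(VARIABLES)
```

Here `inputs_by_level["Nominal"]` and `inputs_by_level["Binary"]` hold the categorical inputs (JOB and REASON), which would typically need encoding before use in a regression model.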
Slide 16
SAS Enterprise Miner Objects
Slide 17
Slide 18
Shows the Cut off Point is 6 Variables
Slide 19
Small Number of Useful Variables
Slide 20
Slide 21
Comparing Methods and Profit vs
Marketing Cost
Slide 22
Slide 23
Slide 24
Decision Trees for Predictive Modeling
Padraic G. Neville, SAS Institute Inc., 4 August 1999
Slide 25
Clustering As in Different Brands
Slide 26
[Figure: apparently a scatterplot matrix; only axis ticks and labels survive. Recoverable variable labels: MOIS_I9B, PROT_TR3, FAT_FCLJ, ASH_JOD6, SODI_HGQ, CARB_SZ0, CAL_JOH4, PCR1_1, PCR2_1, PCR3_1]
Slide 27
Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/
Slide 28
Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/
Slide 29
Slide 30
National Energy Research Scientific Computing Center
Slide 31
SurfStat
A Matlab toolbox for the statistical analysis of
univariate and multivariate surface and
volumetric data using linear mixed effects
models and random field theory
Keith J. Worsley
Slide 32
Latitude 36.19N and Longitude 86.78W
Nashville, TN, USA
Slide 33
Genealogical Tree
On YouTube
http://www.youtube.com/watch?v=CnniJR5Ah7g