Understanding Data Mining Craig A. Stevens, PMP, CC [email protected] www.westbrookstevens.com Examples of Classical Statistical Methods.


Slide 1

Understanding Data Mining

Craig A. Stevens, PMP, CC
[email protected]
www.westbrookstevens.com


Slide 2

Examples of
Classical Statistical
Methods


Slide 3

Latitude 36.19N and Longitude 86.78W

Nashville, TN, USA


Slide 4

Y_i = a + b x_i + e_i
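The straight-line model on this slide can be fit by ordinary least squares. The sketch below is one minimal way to estimate the intercept a and slope b; the function name fit_line and the sample data are made up for illustration and do not come from the slides.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares estimates of intercept a and slope b for y = a + b*x + e."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-products of x and y
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Illustrative data only (assumed, not from the presentation)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
# Residuals are the estimated error terms e, one per observation
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"a = {a:.3f}, b = {b:.3f}")
```

With an intercept in the model, the residuals sum to zero (up to rounding), which is a quick sanity check on the fit.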


Slide 5

Multiple Regression

http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm


Slide 6

Multiple Regression

http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm


Slide 7

Multiple Regression

http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm


Slide 8

Multiple Regression

http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm


Slide 9

Multiple Regression

http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm


Slide 10

Data Mining


Slide 11

http://datamining.typepad.com/photos/uncategorized/livejournal.png


Slide 12


Slide 13

What is Data Mining?
• The process of identifying hidden patterns, trends, and relationships in large quantities of data.

Why Do Data Mining?
• To discover useful information for making decisions.
• Too many variables for classical statistical methods to work.
– Large number of records (10^8 to 10^12)
• Gigabytes to terabytes
– High-dimensional data
• Lots of variables (10 to 10^4 attributes)


Slide 14

The Huber-Wegman Taxonomy of Data Set Sizes

Descriptor      Data Set Size in Bytes   Storage Mode
Tiny            10^2                     Piece of Paper
Small           10^4                     A few Pieces of Paper
Medium          10^6                     A Floppy Disk
Large           10^8                     Hard Disk
Huge            10^10                    Multiple Hard Disks
Massive         10^12                    Robotic Magnetic Tape Storage Silos
Super Massive   10^15                    Distributed Data Archives
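As a rough illustration of how the taxonomy might be applied, the sketch below maps a byte count to a descriptor. The slide lists single order-of-magnitude sizes rather than ranges, so the rule used here (pick the largest listed size that does not exceed the byte count, and call anything under 10^2 bytes "Tiny") is an assumption, as is the function name describe_size.

```python
# (descriptor, size in bytes) pairs as listed on the slide, smallest first
TAXONOMY = [
    ("Tiny", 10**2),
    ("Small", 10**4),
    ("Medium", 10**6),
    ("Large", 10**8),
    ("Huge", 10**10),
    ("Massive", 10**12),
    ("Super Massive", 10**15),
]


def describe_size(n_bytes):
    """Return the descriptor of the largest listed size not exceeding n_bytes.

    Assumption: sizes below 10^2 bytes are also labeled "Tiny".
    """
    label = TAXONOMY[0][0]
    for name, size in TAXONOMY:
        if n_bytes >= size:
            label = name
    return label


# A 5 GB file falls between the "Large" and "Huge" reference sizes
print(describe_size(5 * 10**9))
```

Under this rule a data set is promoted to the next descriptor only once it reaches that descriptor's listed size; rounding to the nearest order of magnitude would be an equally defensible choice.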


Slide 15

Name      Model Role   Measurement Level   Description
BAD       Target       Binary              1=client defaulted on loan, 0=loan repaid
CLAGE     Input        Interval            Age of oldest trade line in months
CLNO      Input        Interval            Number of trade lines
DEBTINC   Input        Interval            Debt-to-income ratio
DELINQ    Input        Interval            Number of delinquent trade lines
DEROG     Input        Interval            Number of major derogatory reports
JOB       Input        Nominal             Six occupational categories
LOAN      Input        Interval            Amount of the loan request
MORTDUE   Input        Interval            Amount due on existing mortgage
NINQ      Input        Interval            Number of recent credit inquiries
REASON    Input        Binary              DebtCon=debt consolidation, HomeImp=home improvement
VALUE     Input        Interval            Value of current property
YOJ       Input        Interval            Years at present job
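The variable table on this slide is essentially a small schema: each variable has a model role and a measurement level. A minimal sketch of holding that metadata in code follows; the names Variable, VARIABLES, and names_by are hypothetical helpers introduced here, not part of SAS Enterprise Miner.

```python
from collections import namedtuple

# One record per variable: name, model role, measurement level (from the slide)
Variable = namedtuple("Variable", "name role level")

VARIABLES = [
    Variable("BAD", "Target", "Binary"),
    Variable("CLAGE", "Input", "Interval"),
    Variable("CLNO", "Input", "Interval"),
    Variable("DEBTINC", "Input", "Interval"),
    Variable("DELINQ", "Input", "Interval"),
    Variable("DEROG", "Input", "Interval"),
    Variable("JOB", "Input", "Nominal"),
    Variable("LOAN", "Input", "Interval"),
    Variable("MORTDUE", "Input", "Interval"),
    Variable("NINQ", "Input", "Interval"),
    Variable("REASON", "Input", "Binary"),
    Variable("VALUE", "Input", "Interval"),
    Variable("YOJ", "Input", "Interval"),
]


def names_by(role=None, level=None):
    """Names of variables matching the given role and/or level (None = any)."""
    return [
        v.name
        for v in VARIABLES
        if (role is None or v.role == role) and (level is None or v.level == level)
    ]


print("Target:", names_by(role="Target"))
print("Interval inputs:", names_by(role="Input", level="Interval"))
```

Separating inputs by measurement level this way is useful because interval, nominal, and binary inputs are typically prepared differently before modeling.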


Slide 16

SAS Enterprise Miner Objects


Slide 17


Slide 18

Shows the Cutoff Point Is 6 Variables


Slide 19

Small Number of Useful Variables


Slide 20


Slide 21

Comparing Methods and Profit vs
Marketing Cost


Slide 22


Slide 23


Slide 24

Decision Trees for Predictive Modeling
Padraic G. Neville, SAS Institute Inc., 4 August 1999


Slide 25

Clustering As in Different Brands


Slide 26

[Figure: matrix of plots; only axis ticks and fragments of variable labels survived extraction. Legible labels: PCR1_1, PCR2_1, PCR3_1.]


Slide 27

Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/


Slide 28

Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/


Slide 29


Slide 30

National Energy Research Scientific Computing Center


Slide 31

SurfStat
A Matlab toolbox for the statistical analysis of
univariate and multivariate surface and
volumetric data using linear mixed effects
models and random field theory
Keith J. Worsley


Slide 32

Latitude 36.19N and Longitude 86.78W

Nashville, TN, USA


Slide 33

Genealogical Tree
on YouTube
http://www.youtube.com/watch?v=CnniJR5Ah7g