Introduction to KDD for Tony's MI Course
1
COMP 3503
Data Preparation and Meta Data
with Daniel L. Silver
2
The KDD Process
[Figure: the KDD process. Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Pre-processing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x)=0.02) → Interpretation and Evaluation → Knowledge]
3
Selection and Pre-processing
Core Problems & Approaches
Problems:
• identification of relevant data
• representation of data
• search for valid pattern or model
Approaches:
• top-down verification by expert
• interactive visualization of data/models
• * bottom-up induction from data *
[Figure: probability of sale plotted against Age and Income; labels: OLAP, Data Mining]
4
Selection and Pre-processing
As much effort is expended preparing data as
applying a data mining tool
Iterative approach: prepare data ↔ develop model
Data Mining phase will benefit from any insight that leads to an improved set of attributes
Representation can facilitate or frustrate the search for the most accurate model (hypothesis)
Spreadsheet, OLAP and visualization tools are very helpful
5
Selection and Pre-processing
Data and Variable Characteristics
Three basic variable data types:
• Nominal (categorical): qualitative values
  marital status = single, married, divorced, widowed
• Ordinal (ranked): values have rank order
  grade = A, B, C
• Interval: values have order plus a metric scale for comparisons and arithmetic operations
  temperature = 2, 10, 20 or 10.5, 15.2, 19.3
  date = 12Aug99, 13Feb02
6
Selection and Pre-processing
Data and Variable Characteristics
Variables can be either discrete or continuous
• Only interval numeric values are continuous
In addition data can be of various formats:
• Text, numeric, logical, binary, date, money, …
Data mining software will vary in its ability
to accept these types and formats
7
Selection and Pre-processing
Data Selection and Sampling
Select response (dependent) variable
• determine prior probability of categories
• deal with volume bias issues
Select predictor (independent) attributes
Generate a set of examples
• choose sampling method (random, stratified)
• consider sample complexity: How many examples do I need to develop a reliable model?
Handle outliers (obvious exceptions)
• Remove the row or replace with imputed value
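The selection steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the helper names (`class_priors`, `stratified_sample`) and the toy data are hypothetical, and the 2% sale rate is chosen only to show why a rare category needs stratified rather than purely random sampling.

```python
import random
from collections import Counter, defaultdict

def class_priors(labels):
    """Return the prior probability of each category of the response variable."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw the same fraction from every category so class proportions are kept."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        # Keep at least one example per category, even for very rare ones.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 2 "sale" rows and 98 "no sale" rows.
rows = [{"id": i, "sale": i < 2} for i in range(100)]
priors = class_priors([r["sale"] for r in rows])
sample = stratified_sample(rows, lambda r: r["sale"], fraction=0.5)
print(priors)
print(len(sample), "rows sampled,", sum(r["sale"] for r in sample), "of them sales")
```

A plain random sample of the same size could easily contain no sales at all, which is one form of the volume bias issue noted above.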
8
Selection and Pre-processing
Data Reduction
The curse of dimensionality
number of attributes / number of values
Reduce number of attributes
• remove redundant and correlated attributes
• combine attributes (logically, arithmetically, or statistically, e.g. Principal Components Analysis)
Reduce attribute value ranges
• group symbolic discrete values
• quantize continuous numeric values
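As a rough sketch of reducing attribute value ranges, the Python below quantizes a continuous value into equal-width bins and groups symbolic values into fewer categories. The function names, the bin count, and the province-to-region table are assumptions made for illustration only.

```python
def quantize(value, low, high, bins):
    """Map a continuous value in [low, high] to an equal-width bin index 0..bins-1."""
    if not low <= value <= high:
        raise ValueError("value outside the expected range")
    width = (high - low) / bins
    # min() places value == high in the last bin instead of one past it.
    return min(int((value - low) / width), bins - 1)

# Hypothetical grouping of a symbolic attribute into fewer values.
REGION_OF_PROVINCE = {
    "NS": "Atlantic", "NB": "Atlantic", "PE": "Atlantic",
    "ON": "Central", "QC": "Central",
}

def group_value(province):
    """Replace a province code by its region; unknown codes fall into 'Other'."""
    return REGION_OF_PROVINCE.get(province, "Other")

print(quantize(37, 0, 100, 5))   # age 37 with five 20-year bins
print(group_value("NS"))
```

Equal-width bins are the simplest choice; with skewed data, equal-frequency bins may be a better fit.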
9
Selection and Pre-processing
Preliminary Statistical Analysis
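Coefficient of correlation, r, measures the linear dependence of two variables X and Y,
-1 <= r <= +1; r^2 shows the strength of the relationship regardless of sign.
[Figure: three scatter plots of Y against X: positive r, negative r, and an unclear (?) relationship]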
Select attributes which correlate strongly with
the response variable and pertain to problem
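For reference, the correlation coefficient can be computed directly from its definition. The sketch below is a plain-Python version of Pearson's r; the function name and the sample values are illustrative assumptions, and a statistics package would normally be used instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical attribute (income) and response (amount spent).
income = [20, 35, 50, 65, 80]
spend = [2, 4, 5, 7, 9]
r = pearson_r(income, spend)
print(round(r, 3), round(r * r, 3))
```

Ranking candidate attributes by the absolute value of r against the response variable gives a first, purely linear, screening; it does not replace the check that an attribute pertains to the problem.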
10
Selection and Pre-processing
Preliminary Statistical Analysis
Remove or combine attributes that correlate
with each other or try to de-correlate through
transformation
Factor Analysis - ANOVA can be used to compare
relative contribution of each attribute to outcomes
Principal Component Analysis - generates
variates - linear combinations of original attributes
Tools such as Minitab, SAS, SPSS can be used
11
Selection and Pre-processing
Transform data
• de-correlate and normalize values
• map time-series data to static representation
Encode data
• representation must be appropriate for the Data
Mining tool which will be used
• continue to reduce attribute dimensionality where
possible without loss of information
Use spreadsheet functions or transformation
and encoding software within DM tool
12
Selection and Pre-processing
Transformation and Encoding
Discrete variable values
If necessary transform to discrete numeric
values
Example, encode the value 4 as follows:
• Nominal: one-of-N code (0 0 0 1 0) - five inputs
• Ordinal: thermometer code (1 1 1 1 0) - five inputs
• Interval: real value (0.4)* - one input
Consider relationship between values
• (single, married, divorced) vs. (youth, adult, senior)
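The three encodings can be sketched as follows. The sketch assumes values are numbered 1 to N and positions are read left to right, and it assumes a divisor of 10 for the real-valued code, since the slide does not say how 0.4 was scaled. All function names are hypothetical.

```python
def one_of_n(value, n):
    """Nominal code: a single 1 in the position of the value, 0 elsewhere."""
    if not 1 <= value <= n:
        raise ValueError("value must be between 1 and n")
    return [1 if position == value else 0 for position in range(1, n + 1)]

def thermometer(value, n):
    """Ordinal code: 1s up to and including the value, so rank order is preserved."""
    if not 1 <= value <= n:
        raise ValueError("value must be between 1 and n")
    return [1 if position <= value else 0 for position in range(1, n + 1)]

def scaled(value, divisor=10):
    """Interval code: one real-valued input (divisor of 10 is an assumption)."""
    return value / divisor

print(one_of_n(4, 5))
print(thermometer(4, 5))
print(scaled(4))
```

The one-of-N code treats every pair of values as equally different, which suits marital status; the thermometer code makes neighbouring values look similar, which suits youth, adult, senior.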
13
Selection and Pre-processing
Transformation and Encoding
Continuous numeric values
De-correlate via normalization of values:
• Min-max: x’ = [(newmax – newmin) (x – min) / (max – min)] + newmin
• Euclidean: x’ = x / sqrt(sum of all x^2)
• Percentage: x’ = x / (sum of all x)
• Variance based: x’ = (x - (mean of all x)) / (standard deviation of all x)
Scale values using a linear transform if data is uniformly distributed, or use a non-linear transform (log, power) if the distribution is skewed
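A minimal Python sketch of the four normalizations follows. The function names are hypothetical, and the variance-based version uses the common z-score form, which divides by the (population) standard deviation.

```python
import math

def min_max(xs, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("all values are identical")
    return [(new_max - new_min) * (x - lo) / (hi - lo) + new_min for x in xs]

def euclidean(xs):
    """Divide each value by the Euclidean length of the whole vector."""
    norm = math.sqrt(sum(x * x for x in xs))
    if norm == 0:
        raise ValueError("all values are zero")
    return [x / norm for x in xs]

def percentage(xs):
    """Express each value as a fraction of the total."""
    total = sum(xs)
    if total == 0:
        raise ValueError("values sum to zero")
    return [x / total for x in xs]

def z_score(xs):
    """Centre on the mean and divide by the population standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    if std == 0:
        raise ValueError("all values are identical")
    return [(x - mean) / std for x in xs]

temperatures = [2, 10, 20]  # values taken from the interval example earlier
print(min_max(temperatures))
print(z_score(temperatures))
```

For skewed data, applying `math.log` to each (positive) value before one of these linear rescalings is one way to follow the non-linear option mentioned above.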
14
Selection and Pre-processing
Transformation and Encoding
Other encodings for continuous numeric values
Example: 1.6 meters could be encoded as:
• Single real-valued number (0.16)*
  o OK! But what if data is skewed?
• Bits of a binary number (010000)
  o BAD! Modeling system must now learn the binary encoding
• One-of-N quantized intervals (0 1 0 0 0)
  o NOT GREAT! Presents discontinuities
• Distributed (fuzzy) overlapping intervals (0.3 0.8 0.1 0.0 0.0)
  o BEST! Deals well with skewing and has no discontinuities
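One way to produce a distributed (fuzzy) code is with overlapping triangular membership functions. The sketch below is only an assumption about how such a code might be built: the interval centres, the width, and the triangular shape are all hypothetical, so its output differs from the illustrative numbers on the slide.

```python
def fuzzy_encode(x, centers, width):
    """Membership of x in each overlapping interval, using triangular functions.

    Membership is 1.0 at an interval's centre and falls linearly to 0.0
    at a distance of `width` from it.
    """
    return [max(0.0, 1.0 - abs(x - c) / width) for c in centers]

# Hypothetical interval centres (in meters) for a height attribute.
centers = [1.0, 1.5, 2.0, 2.5, 3.0]
code = fuzzy_encode(1.6, centers, width=0.5)
print([round(m, 2) for m in code])
```

Because neighbouring intervals overlap, a small change in height shifts the memberships gradually rather than flipping a single bit, which is what removes the discontinuities of the one-of-N code.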
15
Selection and Pre-processing
Extracting Features from a Single Variable
From dates:
• Day, week, month, quarter, holiday, weekend day
From time:
• Hour, minute, morning, afternoon, evening
From address:
• Postal Code components mean something
Telephone number:
• NPA-NNX-9999
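Date features of the kind listed above can be derived with Python's standard `datetime` module. This is a small sketch: the function name and the returned keys are hypothetical, and the holiday set must be supplied by the caller because holidays depend on the jurisdiction.

```python
from datetime import date

def date_features(d, holidays=frozenset()):
    """Derive several model inputs from a single date value."""
    return {
        "day": d.day,
        "weekday": d.weekday(),            # 0 = Monday ... 6 = Sunday
        "week": d.isocalendar()[1],        # ISO 8601 week number
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "is_weekend": d.weekday() >= 5,
        "is_holiday": d in holidays,       # caller-supplied set of dates
    }

# 12Aug99, the date used as an example earlier in the deck.
features = date_features(date(1999, 8, 12))
print(features)
```

The same pattern applies to times (hour, part of day), postal codes, and telephone numbers: split the raw value into components that each carry meaning for the problem.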
16
Selection and Pre-processing
Time Series Data
Of great interest to business, science, medicine
Time series data has high dimensionality
• T_1, T_2, … , T_n
Approaches to summary/characterization
• Current value = T_i
• Moving average = MA_i = (T_i + T_(i-1) + T_(i-2)) / 3
• Trends = T_i - MA_i or = MA_i - MA_(i-k)
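The three summaries above reduce a long series to a handful of static inputs. A minimal sketch, assuming a 3-point window as on the slide and zero-based list indexing; the function names and the sample series are hypothetical.

```python
def moving_average(series, i, window=3):
    """Mean of the `window` most recent values ending at index i."""
    if i < window - 1:
        raise ValueError("not enough history for the moving average")
    return sum(series[i - window + 1 : i + 1]) / window

def summarize(series, i, k=1, window=3):
    """Static summary of a time series at index i."""
    ma_i = moving_average(series, i, window)
    ma_prev = moving_average(series, i - k, window)
    return {
        "current": series[i],
        "moving_average": ma_i,
        "trend_vs_average": series[i] - ma_i,   # T_i - MA_i
        "trend_of_average": ma_i - ma_prev,     # MA_i - MA_(i-k)
    }

sales = [10, 12, 14, 16, 18]  # hypothetical steadily rising series
summary = summarize(sales, i=4, k=1)
print(summary)
```

Both trend values are positive here, which is how a model that sees only one row per example can still learn that the series is rising.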
17
Selection and Pre-processing
Textual Data
A difficult data type
• freeform, open-ended, syntax -vs- semantics
Can have very high dimensions
• thousands of potential values
Approaches to summary/characterization
• define a fixed set of N word classes based on frequency analysis
• map word combinations to one of the N classes
• automate via specialty software
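A very simplified version of the frequency-based approach can be sketched as follows. It assumes that each of the N-1 most frequent words gets its own class and that every other word falls into a shared "other" class; real word classes would usually group related words or word combinations. The function names and sample comments are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_word_classes(documents, n):
    """Give each of the n-1 most frequent words a class index 0..n-2."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    return {word: index for index, (word, _) in enumerate(counts.most_common(n - 1))}

def encode(document, classes, n):
    """Count how many words of the document fall into each of the n classes."""
    vector = [0] * n
    for word in tokenize(document):
        vector[classes.get(word, n - 1)] += 1  # unseen or rare words go to the last class
    return vector

comments = ["late delivery late refund", "late delivery damaged"]
classes = build_word_classes(comments, n=3)
print(classes)
print(encode("late late refund", classes, n=3))
```

Whatever the length or vocabulary of the text, every document becomes a vector of exactly N numbers, which is the fixed-width representation a data mining tool needs.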
Data Warehousing and Preparation
Access to Recent Information
www.datawarehousing.com
DWI - Data Warehouse Institute www.dw-institute.com
Wikipedia http://en.wikipedia.org/wiki/Data_warehouse
DW Information Centre http://www.dwinfocenter.org
A DW Tutorial: http://www.planet-sourcecode.com/vb/scripts/ShowCode.asp?lngWId=5&txtCodeId=378
Text Books:
Data Warehousing texts by W.H. Inmon, Claudia Imhoff, Ralph Kimball
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
18
19
THE END