Descriptive Exploratory Data Analysis III
Jagdish S. Gangolly
State University of New York at Albany
Trellis Graphics I
Syntax:
dependent variable ~ explanatory variable | conditioning variable, data = data set
Output:
> trellis.device(motif)   (opens a graphics device)
> dev.off() or > graphics.off()   (closes the current device, or all devices)
Trellis Graphics II
Example:
histogram(~ height | voice.part, data = singer)
– No dependent variable for a histogram
– height is the explanatory variable
– voice.part is the conditioning variable
– Data set is singer
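The effect of the conditioning variable can be sketched outside S-PLUS. The Python snippet below is only an illustration, not the Trellis implementation: it splits the explanatory variable by the conditioning variable and tallies each group separately, which is roughly what each panel of the conditioned histogram displays. The records are made-up values, not the actual singer data, and the name conditioned_counts is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (height in inches, voice part) records; NOT the real singer data.
records = [
    (64, "Soprano"), (62, "Soprano"), (66, "Soprano"),
    (65, "Alto"), (62, "Alto"),
    (72, "Bass"), (70, "Bass"), (72, "Bass"),
]


def conditioned_counts(rows):
    """Group the explanatory variable by the conditioning variable,
    then count how often each value occurs within each group."""
    panels = defaultdict(Counter)
    for height, part in rows:
        panels[part][height] += 1
    return panels


panels = conditioned_counts(records)

# Print one crude text histogram per level of the conditioning variable.
for part in sorted(panels):
    print(part)
    for height in sorted(panels[part]):
        print(f"  {height}: {'*' * panels[part][height]}")
```

Each key of panels plays the role of one panel in the Trellis display.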
Trellis Graphics III
• Layout: the layout, skip, and aspect parameters (p.147).
• Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p.149).
Descriptive Data Exploration
• summary: mean, median, quartiles (p.171)
• stem: stem-and-leaf display (p.171)
• quantile (p.172)
• stdev (p.173)
• tapply: splits data into groups and applies a function to each group (p.174)
• by (p.175)
• mean works on vectors; other structures need to be converted to vectors before computing means.
• (example on p.176-7)
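As a rough analogue of summary, stdev, and tapply, the following Python sketch uses only the standard library. It is not S-PLUS code, and the quartile rule chosen here (the "inclusive" method) is an assumption that may differ from what S-PLUS reports. The data and the names summarize and group_apply are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical heights (inches) and voice parts, for illustration only.
heights = [64, 62, 66, 65, 62, 72, 70, 72]
parts = ["Soprano", "Soprano", "Soprano", "Alto", "Alto", "Bass", "Bass", "Bass"]


def summarize(values):
    """Return minimum, quartiles, mean and maximum, loosely like summary."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "mean": statistics.mean(values), "q3": q3, "max": max(values)}


def group_apply(values, groups, func):
    """Split values by group label and apply func to each piece, like tapply."""
    pieces = defaultdict(list)
    for value, label in zip(values, groups):
        pieces[label].append(value)
    return {label: func(piece) for label, piece in pieces.items()}


overall = summarize(heights)
mean_by_part = group_apply(heights, parts, statistics.mean)
spread = statistics.stdev(heights)  # sample standard deviation
```

group_apply first flattens each group into a plain list, mirroring the point that mean expects a vector.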
Data Preprocessing for Datamining I
• Why: raw data are often
– Incomplete
• Attribute values not available, equipment malfunctions, attributes not considered important
– Noisy (errors)
• instrument problems, human/computer errors,
transmission errors
– Inconsistent
• inconsistencies due to data definitions
Data Preprocessing for Datamining II
• Data Cleaning
– Missing values:
• ignore the tuple, fill in values manually, use a global constant (e.g. "unknown"), use the attribute mean, use the attribute mean of the tuple's group, or use the most probable value
– Noisy data:
• Binning: partition the sorted values into equal-sized bins, then smooth by bin means or bin boundaries
• Clustering
• Inspection: computer & human
• Regression
– Inconsistencies
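Two of the cleaning steps listed above can be sketched in a few lines of Python: filling a missing value with its group mean, and smoothing noisy values by bin means or bin boundaries. This is a minimal sketch under stated assumptions: missing values are represented by None, "equal-sized" is read as bins holding the same number of sorted values, and all data and function names are hypothetical.

```python
from statistics import mean


def fill_with_group_mean(values, groups):
    """Replace None by the mean of the known values in the same group.
    Assumes every group has at least one known value."""
    known = {}
    for value, group in zip(values, groups):
        if value is not None:
            known.setdefault(group, []).append(value)
    return [mean(known[g]) if v is None else v for v, g in zip(values, groups)]


def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical data, for illustration only.
incomes = [30, None, 50, 80, None, 100]
segments = ["a", "a", "a", "b", "b", "b"]
filled = fill_with_group_mean(incomes, segments)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

Smoothing by means pulls every value to the centre of its bin, while smoothing by boundaries keeps the extremes and moves only the interior values.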
Data Preprocessing for Datamining III
• Data Integration: Combining data from different
sources into a coherent whole
– Schema integration: combining data models (entity
identification problems)
– Redundancy (derived values, calculated fields, use of
different key attributes): use of correlations to detect
redundancies
– Resolution of data value conflicts (coding values in
different measures)
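The use of correlations to detect redundancy can be illustrated with a small Python sketch. It assumes numeric attributes, uses the Pearson coefficient, and flags pairs above an arbitrary threshold of 0.95; the threshold, the table, and the function names are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(table, threshold=0.95):
    """Return attribute pairs whose absolute correlation reaches threshold."""
    names = sorted(table)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(table[a], table[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical integrated table: price_cents is derived from price_dollars,
# as might happen when two sources code the same value in different measures.
table = {
    "price_dollars": [1.0, 2.5, 4.0, 5.5],
    "price_cents": [100.0, 250.0, 400.0, 550.0],
    "units_sold": [7.0, 3.0, 9.0, 4.0],
}
flagged = redundant_pairs(table)
```

A flagged pair is only a candidate for removal; a high correlation does not by itself show that one attribute was derived from the other.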
Data Preprocessing for Datamining III
• Transformation
– Smoothing
– Aggregation
– Generalisation
– Normalisation
– Attribute (or feature) construction
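Normalisation is the easiest of these transformations to show in code. The slides do not say which scheme is meant, so the Python sketch below assumes two common ones: min-max scaling to a target range and z-score standardisation. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max.
    Assumes the values are not all equal."""
    low, high = min(values), max(values)
    return [new_min + (v - low) / (high - low) * (new_max - new_min)
            for v in values]


def z_score(values):
    """Centre values on their mean and divide by the population
    standard deviation. Assumes the values are not all equal."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values, for illustration only.
values = [10, 20, 30, 40, 50]
scaled = min_max(values)
standardized = z_score(values)
```

Min-max scaling preserves the shape of the distribution within fixed bounds, while z-scores are less affected by where the extremes happen to lie.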
Data Preprocessing for Datamining IV
• Data Reduction & compression
– Data cube aggregation (p.117)
– Dimension reduction: minimise loss of
information.
• Attribute selection
• Decision tree induction
• Principal components analysis
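To give a concrete feel for principal components analysis as a dimension-reduction step, here is a minimal pure-Python sketch. It finds only the first component, and it does so by power iteration on the sample covariance matrix, which is a simplification chosen for brevity rather than the method a statistics package would use. The data and function names are hypothetical.

```python
from math import sqrt


def covariance_matrix(rows):
    """Sample covariance matrix and column means (one observation per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, means


def first_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration.
    Assumes cov is non-zero and the start vector is not orthogonal
    to the leading eigenvector."""
    d = len(cov)
    vec = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def project(rows, means, vec):
    """Score of each centred observation along the direction vec."""
    return [sum((r[j] - means[j]) * vec[j] for j in range(len(vec)))
            for r in rows]


# Hypothetical two-attribute data lying close to the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
cov, means = covariance_matrix(rows)
component = first_component(cov)
scores = project(rows, means, component)  # one number per observation
```

Here two attributes are replaced by a single score per observation, with little information lost because the points lie almost on one line.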
Data Preprocessing for Datamining IV
– Numerosity reduction
• Regression and log-linear models
• Histograms
• Clustering
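A histogram reduces numerosity by storing a few bucket counts in place of the raw values. The Python sketch below assumes equal-width buckets, which is only one of several possible bucketing rules; the data and the name equal_width_histogram are hypothetical.

```python
def equal_width_histogram(values, buckets):
    """Summarise values as (low, high, count) triples over equal-width buckets.
    Assumes the values are not all equal."""
    low, high = min(values), max(values)
    width = (high - low) / buckets
    counts = [0] * buckets
    for v in values:
        # The maximum value falls on the upper edge, so clamp it
        # into the last bucket.
        index = min(int((v - low) / width), buckets - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(buckets)]


# Hypothetical prices, for illustration only.
prices = [1, 5, 5, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28, 31]
reduced = equal_width_histogram(prices, 3)
counts = [count for _, _, count in reduced]
```

Fifteen values are represented by three triples; the individual values can no longer be recovered, only their approximate distribution.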