Transcript Chapter 5
Chapter 5 The Lure of Statistics: Data Mining Using Familiar Tools
Note: Included in this Slide Set is a subset of Chapter 5 material and additional material from the instructor.
Why a Manager (or You) Needs to Know Some Basics about Statistics
• To know how to properly present information
• To know how to draw conclusions about populations based on sample information
• To know how to improve processes
• To know how to obtain reliable forecasts
Statistics vs Data Mining
• For statisticians, data mining has a negative connotation: searching for data to support preconceived ideas
• Statistics don’t lie, but liars use statistics!
• Statistics developed as a discipline to help scientists make sense of observations and experiments, hence the scientific method
• For statisticians, the problem has often been too little data
• Data mining (DM) is faced with too much data
• Many of the techniques and algorithms are shared by statisticians and data miners
Some Definitions
• Population (universe) is the collection of things under consideration
• Sample is a portion of the population selected for analysis
• Statistic is a summary measure computed to describe a characteristic of the sample
Some Definitions*
• Mean (average) is the sum of the values divided by the number of values
• Median is the midpoint of the values (50% above; 50% below) after they have been ordered from smallest to largest, or largest to smallest
• Mode is the value that appears most frequently among all the values observed
• Range is the difference between the largest and smallest observations in the sample
* Layman’s definitions
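The four definitions above can be checked on a small data set. The sketch below uses Python's standard statistics module; the seven values are made up for illustration and do not come from the slides.

```python
import statistics

# Hypothetical sample of seven observations (not from the slides).
values = [3, 7, 7, 2, 9, 7, 4]

mean = statistics.mean(values)            # sum of the values / number of values
median = statistics.median(values)        # middle value after sorting
modes = statistics.multimode(values)      # every value tied for most frequent
value_range = max(values) - min(values)   # largest minus smallest

print(f"mean={mean:.2f} median={median} modes={modes} range={value_range}")
```

multimode is used instead of mode because a data set can have several values tied for most frequent, and it returns all of them.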
Population and Sample
• Population: use parameters to summarize features
• Sample: use statistics to summarize features
• Inference on the population from the sample
Occam’s Razor – “Kiss”
• William of Occam, Franciscan monk, 1280-1349: prior to modern statistics, the Renaissance, and the printing press
• Influential philosopher, theologian, and professor with a very simple idea:
– Latin: Entia non sunt multiplicanda sine necessitate
– English: The simpler explanation is the preferable one, or “Keep it simple, stupid!”
The Null Hypothesis
• The null hypothesis (NH) assumes that differences among observations are due simply to chance
• Bush vs. Kerry: the poll’s margin of error is ~3% to 4%
• Layperson asks, “Are these percentages different?”
• Statistician asks, “What is the probability that these two values are really the same?”
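A margin of error of about 3% to 4% is what the usual normal approximation gives for polls of roughly 600 to 1,000 respondents. The sketch below assumes a simple random sample and a 95% confidence level (z = 1.96); the sample sizes are illustrative and are not taken from any actual Bush/Kerry poll.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p from n respondents.

    Assumes simple random sampling and the normal approximation;
    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical poll sizes with a candidate at 50% (the worst case for the margin).
for n in (600, 1000):
    print(f"n={n}: +/- {margin_of_error(0.5, n):.1%}")
# n=600: +/- 4.0%
# n=1000: +/- 3.1%
```

If two candidates are closer together than the margin of error, chance alone could plausibly explain the gap, which is exactly what the null hypothesis assumes.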
Skepticism
• Is good for both statisticians and data miners
• Goal for both is to demonstrate results that work, hence discounting the null hypothesis
• The less reliance on chance the better
P-Values and Q-Values
• The null hypothesis can be quantified
• The p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis is true; it is often loosely described as the probability that the null hypothesis is true
• When the null hypothesis is true, nothing is really happening; differences are due to chance
• Confidence, the reverse of a p-value (1 minus the p-value), is called the q-value. If the p-value is 5%, then the q-value (confidence) is 95%.
• Example: Bush/Kerry…p-value 60% or 5%
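One common way to put a number on the null hypothesis for two percentages is a two-proportion z-test. The sketch below is one possible approach, not necessarily the chapter's; it assumes two independent samples, and the counts are hypothetical. It returns a two-sided p-value: the chance of a gap at least this large when both samples really share one underlying proportion.

```python
import math

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the null hypothesis that both samples share one proportion."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null hypothesis, pool both samples to estimate the shared proportion.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail area of the standard normal: erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical polls: 49% of 1,000 respondents vs. 47% of 1,000 respondents.
p_value = two_proportion_p_value(490, 1000, 470, 1000)
confidence = 1 - p_value  # the "q-value" in this chapter's terminology
print(f"p-value={p_value:.2f} confidence={confidence:.2f}")
```

With these made-up counts the p-value comes out well above 5%, so a two-point gap between polls of this size is not strong evidence of a real difference.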
Data Visualization
• Discrete data, such as products, channels, regions, and descriptions, is the main focus of data mining
• Histogram: bars show the number of times different values occur
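A histogram of discrete data is just a count per distinct value. The minimal sketch below uses collections.Counter and prints text bars; the channel names and orders are hypothetical.

```python
from collections import Counter

# Hypothetical sales channel recorded for each order (not from the slides).
channels = ["web", "store", "web", "phone", "web", "store", "web"]

counts = Counter(channels)

# One bar per distinct value; bar length = number of times the value occurs.
for channel, count in counts.most_common():
    print(f"{channel:<6} {'#' * count} ({count})")
```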
Data Visualization
• Histograms describe a single moment in time
• Data mining is often concerned with what is happening over time
• Time Series Analysis: choosing an appropriate time frame to consider the data
Standardized Values
• Time series charts are useful but have limitations: they cannot tell whether the changes over time are expected or unexpected
• We could look at a segment of the data, say a day at a time, asking: “Is it possible that the differences seen on each day are strictly due to chance?” (null hypothesis)
• Answer: standardize each day’s value (subtract the mean, then divide by the standard deviation) and calculate the p-value for the day
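A standardized value (z-score) expresses a day's count as a number of standard deviations from the mean, which can then be turned into a p-value. The sketch below assumes the daily counts are roughly normally distributed; the counts themselves are hypothetical.

```python
import math
import statistics

# Hypothetical daily counts (for example, orders per day); not from the slides.
daily_counts = [102, 98, 105, 97, 101, 99, 103, 96, 100, 130]

mean = statistics.mean(daily_counts)
std_dev = statistics.pstdev(daily_counts)  # population standard deviation

def standardize(count):
    """Number of standard deviations between this count and the mean."""
    return (count - mean) / std_dev

def two_sided_p_value(z):
    """Chance of a value at least |z| standard deviations from the mean (normal assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))

z_scores = [standardize(count) for count in daily_counts]
for day, (count, z) in enumerate(zip(daily_counts, z_scores), start=1):
    print(f"day {day:2d}: count={count:3d} z={z:+.2f} p={two_sided_p_value(z):.3f}")
```

In this made-up series only the last day stands far enough from the mean to have a small p-value, so it is the one day that chance alone explains poorly. A more careful analysis would estimate the mean and standard deviation from a baseline period that excludes the day being tested.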
Central Limit Theorem
• When many samples are taken from a population, the distribution of the averages of the samples approaches the normal distribution as the sample size grows. The average of the samples comes arbitrarily close to the average of the entire population.
• Normal distribution is described by the mean (average count) and the standard deviation (spread around the mean)
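The theorem can be illustrated with a small simulation. The sketch below draws repeated samples from a deliberately non-normal population and records how the sample averages behave. The population, the sample sizes, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: uniform integers from 1 to 100.
population = [random.randint(1, 100) for _ in range(20_000)]
pop_mean = statistics.fmean(population)

def sample_means(sample_size, num_samples=2_000):
    """Average of each of num_samples random samples of the given size."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]

results = {}
for size in (5, 30, 100):
    means = sample_means(size)
    # Store (average of the sample averages, spread of the sample averages).
    results[size] = (statistics.fmean(means), statistics.pstdev(means))
    print(f"size={size:3d} mean of means={results[size][0]:.2f} "
          f"spread={results[size][1]:.2f} (population mean={pop_mean:.2f})")
```

As the sample size grows, the averages cluster more tightly around the population mean. Plotting a histogram of means for each size would show the bell shape emerging.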
Different Shapes of Distributions
Variance and Standard Deviation
• Variance is a measure of the dispersion of a sample (or how closely the observations cluster around the mean [average])
• Standard Deviation, the square root of the variance, is the measure of variation in the observed values (or variation in the clustering around the mean)
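Both measures can be computed directly from their definitions. The sketch below is a plain-Python illustration with hypothetical values. The sample flag switches between dividing by n (population form) and n - 1 (sample form), since the slides do not say which form is intended.

```python
import math

def variance(values, sample=False):
    """Average squared deviation from the mean.

    Divides by n (population form) by default, or by n - 1 when sample=True.
    """
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = len(values) - 1 if sample else len(values)
    return sum(squared_deviations) / divisor

def standard_deviation(values, sample=False):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values, sample))

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5
print(variance(values), standard_deviation(values))  # 4.0 2.0
print(round(standard_deviation(values, sample=True), 3))
```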
Example: Sample Scores/Grades
• Scores: 84, 65, 74, 72, 85, 65, 96, 30, 78, 72, 85, 64, 65, 96, 15, 72, 73, 85
1. Sort the data from highest to lowest and assign grades
2. Find the Mean, Median, Mode, and Standard Deviation
3. Create a histogram for the grades
Using MS Excel…
Sorted Raw Data   Grade   (Bx-I5)^2
96                A        630.57
96                A        630.57
85                B        199.12
85                B        199.12
85                B        199.12
84                B        171.90
78                C         50.57
74                C          9.68
73                C          4.46
72                C          1.23
72                C          1.23
72                C          1.23
65                D         34.68
65                D         34.68
65                D         34.68
64                D         47.46
30                F       1671.90
15                F       3123.57

(Bx-I5)^2 is the squared difference between the score in column B and the mean in cell I5.

Range                81
Mean                 70.9
Median               72.5
Mode                 85
Standard Deviation   19.8

A's    2
B's    4
C's    6
D's    4
F's    2
W's    0
Sum   18
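The worksheet figures can be reproduced outside Excel. The sketch below uses the 18 scores from the example. The grade cutoffs (90/80/70/60) are an assumption inferred from the grades shown, not something the slides state.

```python
import statistics
from collections import Counter

scores = [84, 65, 74, 72, 85, 65, 96, 30, 78, 72, 85, 64, 65, 96, 15, 72, 73, 85]

def letter_grade(score):
    """Assumed 10-point scale; consistent with the grades in the worksheet."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

ranked = sorted(scores, reverse=True)               # step 1: highest to lowest
grades = Counter(letter_grade(s) for s in ranked)   # step 3: counts per grade

summary = {                                         # step 2: summary statistics
    "range": max(scores) - min(scores),
    "mean": round(statistics.mean(scores), 1),
    "median": statistics.median(scores),
    "modes": statistics.multimode(scores),
    "std_dev": round(statistics.pstdev(scores), 1),  # population form (divide by n)
}
print(summary)
print(dict(grades))
```

Two details are worth noting. The scores 65, 72, and 85 each occur three times, so the data has three modes; the worksheet lists only 85. The worksheet's 19.8 matches the population standard deviation; the sample form (dividing by n - 1) gives about 20.4.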
Using MS Excel…
Grade Distribution (bar chart: number of A's, B's, C's, D's, and F's; vertical axis from 0 to 7)
End of Chapter 5