Turning Data Into Information


Take a challenge with time; never let time idle away aimlessly.
Turning Data Into Information
• Types of data
• Summary statistics
• Distributions

What are Data?
Any set of data contains information about some group of individuals. The information is organized in variables.
• Individuals are the objects described by a set of data. They may be people, animals, or things.
• A variable is any characteristic of an individual. A variable can take different values for different individuals.

Population/Sample/Raw Data
• A population is the collection of all individuals about which information is desired.
• A sample is a subset of a population.
• Raw data: information that has been collected but not yet processed.
E.g., What’s the fastest you’ve ever driven a car? 100 mph.
** Use this example to show the individual, population, sample, and variable.

Example: A College’s Student Dataset
The data set includes data about all currently
enrolled students such as their ages, genders,
heights, grades, and choices of major.

• Who? What individuals do the data describe? Is this the population, a sample, or the raw data of the study?
• What? How many variables do the data describe? Give an example of a variable.

Types of Variables
• A categorical variable places an individual into one of several groups or categories.
• A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.
Q. Which variables are categorical? Which are quantitative?

A variable
Q: Does “average” make sense?
    No: Categorical/Qualitative
        Q: Is there any natural ordering among the categories?
            No: Nominal variable
            Yes: Ordinal variable
    Yes: Numerical/Quantitative
        Q: Can all possible values be listed?
            Yes: Discrete variable
            No: Continuous variable
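The two-question chart above can be sketched as a small function. This is only an illustration: the function name, its parameters, and the answers given for each example variable (for instance, treating grades as ordered letter grades) are assumptions, not part of the slides.

```python
def classify_variable(average_makes_sense, has_natural_order=False,
                      values_can_be_listed=False):
    """Walk the two-question chart and return the variable type."""
    if not average_makes_sense:
        # Categorical/qualitative branch: ask about natural ordering.
        return "ordinal" if has_natural_order else "nominal"
    # Numerical/quantitative branch: ask whether all values can be listed.
    return "discrete" if values_can_be_listed else "continuous"


# Hypothetical answers for variables like those in the college example.
examples = {
    "choice of major": classify_variable(False, has_natural_order=False),
    "letter grade": classify_variable(False, has_natural_order=True),
    "age in whole years": classify_variable(True, values_can_be_listed=True),
    "height": classify_variable(True, values_can_be_listed=False),
}

for name, kind in examples.items():
    print(f"{name}: {kind}")
```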

Asking the Right Questions
• The relationship between gender and age
** What is your age and gender?
** Reading assignment: pp. 16-18.

The Role of a Variable
• Response/outcome/dependent variables: the variables whose changes are of primary interest.
• Explanatory/predictor/independent variables: the variables that can explain how the responses change.
Study Q: Do males tend to drive faster than females?
What are the two variables? What are their roles?

Two Basic Strategies to Explore Data
• Begin by examining each variable by itself. Then move on to study the relationships among the variables.
• Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.

Summarizing Data
Goal: to study or estimate the distributions of variables.
The distribution of a variable tells us what values/categories it takes and how often it takes those values/categories.
• Displaying distributions of data with graphs
• Describing distributions of data with numbers

A Dataset of CSUEB Students
Gender  Height (inches)  Weight (pounds)  College
M       68.5             155              Bsns
F       61.2             99               Sci
F       63.0             115              Sci
M       70.0             205              Arts
F       68.6             170              Bsns
F       65.1             125              Arts
M       72.4             220              Sci
M       --               188              Bsns

Numerical Summaries for Categorical Variables
• Frequency/counts
• Relative frequency/percentage
** show gender
• Contingency table
** show gender vs. college
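As a rough sketch of these three summaries, the snippet below uses the Gender and College columns of the CSUEB dataset. It assumes the values pair up row by row in the order they are listed; the College pairing in particular is uncertain, so the contingency counts are illustrative only.

```python
from collections import Counter

# Columns from the CSUEB dataset; the row pairing is an assumption.
gender = ["M", "F", "F", "M", "F", "F", "M", "M"]
college = ["Bsns", "Sci", "Sci", "Arts", "Bsns", "Arts", "Sci", "Bsns"]

# Frequency: how many individuals fall in each category.
freq = Counter(gender)

# Relative frequency: each count divided by the number of individuals.
n = len(gender)
rel_freq = {category: count / n for category, count in freq.items()}

# Contingency table: counts for every (gender, college) combination.
pair_counts = Counter(zip(gender, college))
rows = sorted(set(gender))
cols = sorted(set(college))
table = {g: {c: pair_counts[(g, c)] for c in cols} for g in rows}

print("Frequency:", dict(freq))
print("Relative frequency:", rel_freq)
print("Gender".ljust(8) + "".join(c.ljust(6) for c in cols))
for g in rows:
    print(g.ljust(8) + "".join(str(table[g][c]).ljust(6) for c in cols))
```

Each row of the printed table is one gender and each column is one college, so every cell counts the individuals who share both categories.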

Graphs for Categorical Variables
• Pie charts
** good for one variable; show gender pie chart
• Bar graphs
** good for one or two variables; show gender, then gender and college bar graphs

Graphs for Quantitative Variables
• Stem-and-leaf plots; dotplots
** to see the distribution. Good for small datasets; explicit but choppy. E.g., height
• Histograms
** to see the distribution. Good for all datasets. E.g., height

How to Make a Stemplot
1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
   Example: for an age of 20.5, the leaf is “5” and the remaining digits, “20”, form the stem.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
*** work on blackboard for height ***
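The three steps can be sketched in code for the seven recorded heights (the missing height is left out). The function name and its `decimals` parameter are hypothetical, and listing empty stems between the smallest and largest stem is a common convention assumed here rather than something the steps state.

```python
from collections import defaultdict


def stemplot(values, decimals=1):
    """Return stemplot lines; the leaf is the final digit of each value.

    Assumes non-negative values recorded to `decimals` decimal places.
    """
    scale = 10 ** decimals
    leaves = defaultdict(list)
    for v in values:
        digits = round(v * scale)         # 68.5 -> 685
        stem, leaf = divmod(digits, 10)   # 685 -> stem 68, leaf 5
        leaves[stem].append(leaf)
    lines = []
    # Stems in a column, smallest at the top (empty stems are kept).
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in sorted(leaves.get(stem, [])))
        lines.append(f"{stem} | {row}".rstrip())
    return lines


heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]
for line in stemplot(heights):
    print(line)
```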

How to Make a Histogram
1. Break the range of values of the variable into equal-width intervals.
2. Count the number of individuals in each interval. These counts are called frequencies, and the corresponding percentages are called relative frequencies.
3. Draw the histogram, with the variable on the horizontal axis and the count (or percentage) on the vertical axis.
*** work on blackboard for height ***
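A minimal sketch of the counting steps for the seven recorded heights follows. The interval choice (width 3 starting at 60) and the function name are assumptions made for illustration; the text bars are printed sideways, so the variable runs down the page instead of along the horizontal axis.

```python
import math


def histogram_counts(values, start, width, bins):
    """Count values in the intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        i = math.floor((v - start) / width)
        if 0 <= i < bins:
            counts[i] += 1
    return counts


heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]
start, width, bins = 60, 3, 5   # hypothetical intervals: 60-63, ..., 72-75
counts = histogram_counts(heights, start, width, bins)

n = len(heights)
for i, count in enumerate(counts):
    low = start + i * width
    # Frequency as a bar of stars, then the count and relative frequency.
    print(f"[{low}, {low + width}): {'*' * count}  {count} ({count / n:.0%})")
```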

What Do We See from the Graphs?
Important features we should look for:
• Overall pattern
  – Shape
  – Center (the location the data tend to cluster around)
  – Spread (how widely the data are spread out)
• Outliers, the values that fall far outside the overall pattern
(for quantitative variables only)

Overall Pattern—Shape


• How many peaks, called modes? A distribution with one major peak is called unimodal.
• Symmetric or skewed?
  – Symmetric if the large values are mirror images of the small values
  – Skewed to the right if the right tail (large values) is much longer than the left tail (small values)
  – Skewed to the left if the left tail (small values) is much longer than the right tail (large values)
*** Show examples on blackboard. ***

Numerical Summaries for Quantitative Variables
• To measure center (location): mode, mean, and median
• To measure spread: range, interquartile range (IQR), and standard deviation (SD)
• Five-number summaries
** show height
• Outliers
** give a large number for the missing height
• Regression
** height vs. weight, discussed in chapter 5
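The center, spread, and five-number summaries can be sketched with Python's standard `statistics` module, using the seven recorded heights. Two assumptions are worth flagging: quartile rules differ slightly between textbooks and software (the "inclusive" method is used here), and the value 200 standing in for the missing height is a made-up outlier. The mode is skipped because every height occurs only once.

```python
import statistics

heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]

# Center
mean = statistics.mean(heights)
median = statistics.median(heights)

# Spread
data_range = max(heights) - min(heights)
sd = statistics.stdev(heights)   # sample SD (divides by n - 1)
q1, _, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

# Five-number summary: minimum, Q1, median, Q3, maximum
five_number = (min(heights), q1, median, q3, max(heights))

# Outlier effect: put a large made-up value in place of the missing height.
with_outlier = heights + [200.0]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(f"mean={mean:.2f} median={median} range={data_range:.1f} "
      f"IQR={iqr:.2f} SD={sd:.2f}")
print("five-number summary:", five_number)
print(f"with outlier: mean={mean_out:.2f} median={median_out:.2f}")
```

The outlier pulls the mean up sharply while the median barely moves, which is the usual reason to report the median for data with extreme values.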

More Graphs for Quantitative Variables
• Boxplots
** to show location and spread, and identify outliers
• Scatterplots
** to see the relationship between two quantitative variables: height vs. weight

Graphs for the Relation of Two Variables
• 1 categorical + 1 quantitative variable:
• 2 quantitative variables:
• 2 categorical variables:

Empirical Rule (68-95-99.7 rule)
(for bell-shaped distributions only)

If a variable X follows a normal distribution, that is, the X values of the whole population form a bell shape, then:
• Mean(X) ± 1*SD(X) covers 68% of possible X values
• Mean(X) ± 2*SD(X) covers 95% of possible X values
• Mean(X) ± 3*SD(X) covers 99.7% of possible X values

Empirical Rule (68-95-99.7 rule)

If the data (from a sample) of a variable X form a bell shape, then:
• X̄ ± 1*S covers about 68% of possible X values
• X̄ ± 2*S covers about 95% of possible X values
• X̄ ± 3*S covers about 99.7% of possible X values
(X̄ is the sample mean; S is the sample standard deviation.)
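To make the three intervals concrete, here is a small sketch that computes center ± k*spread for k = 1, 2, 3. The function name is hypothetical, and the population values (mean 100, SD 15) are made up for illustration. Seven heights are far too few to judge whether the data are bell-shaped, so the sample intervals only show the arithmetic.

```python
import statistics


def empirical_rule_intervals(center, spread):
    """Return {coverage: (low, high)} for center ± k*spread, k = 1, 2, 3."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {coverage[k]: (center - k * spread, center + k * spread)
            for k in (1, 2, 3)}


# Population version with made-up values: Mean(X) = 100, SD(X) = 15.
population_intervals = empirical_rule_intervals(100, 15)

# Sample version: sample mean and sample SD of the recorded heights.
heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]
sample_intervals = empirical_rule_intervals(
    statistics.mean(heights), statistics.stdev(heights))

for label, (low, high) in sample_intervals.items():
    print(f"about {label} of heights between {low:.1f} and {high:.1f}")
```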

How to Use the Empirical Rule
• Find the range covering 68%, 95%, or 99.7% of X values.
• Check whether X follows a normal distribution.
• Provide an estimate of S without messy calculation.
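The last two uses can be sketched on simulated data. Everything here is an assumption for illustration: the data are generated as bell-shaped (mean 66, SD 4), the function name is hypothetical, and the quick estimate of S simply reads the rule backwards: if about 95% of values lie within 2*S of the mean, the width of the middle 95% is roughly 4*S.

```python
import random
import statistics


def share_within(values, k):
    """Fraction of values within k sample SDs of the sample mean."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(abs(v - x_bar) <= k * s for v in values) / len(values)


# Simulated bell-shaped data (made-up mean 66, SD 4), seeded for repeatability.
rng = random.Random(0)
sample = [rng.gauss(66, 4) for _ in range(10_000)]

# Check: shares near 0.68, 0.95, and 0.997 are consistent with a bell shape.
shares = {k: share_within(sample, k) for k in (1, 2, 3)}

# Quick estimate of S: width of the middle 95% of the data divided by 4.
ordered = sorted(sample)
n = len(ordered)
low = ordered[int(0.025 * n)]
high = ordered[int(0.975 * n) - 1]
rough_s = (high - low) / 4

print("shares within 1, 2, 3 SDs:", shares)
print(f"rough S = {rough_s:.2f}, computed S = {statistics.stdev(sample):.2f}")
```

Shares that fall far from 68%, 95%, and 99.7% suggest the data are not bell-shaped, in which case the quick estimate of S should not be trusted either.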