Turning Data Into Information
Take a challenge with time; never let time idle away
aimlessly.
Turning Data Into Information
types of data
summary statistics
distributions
What are Data?
Any set of data contains information about
some group of individuals. The information is
organized in variables.
Individuals are the objects described by a set of
data. They could be animals, people, or things.
A variable is any characteristic of an individual.
A variable can take different values for different
individuals.
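One way to picture these definitions is as a small table in code, where each record is an individual and each field is a variable. The sketch below borrows three rows from the CSUEB student table shown later in these slides; the field names (`gender`, `height_in`, `weight_lb`) are made up for illustration.

```python
# Each dict is one individual; each key is a variable.
# Field names are hypothetical; values come from the CSUEB slide table.
students = [
    {"gender": "M", "height_in": 68.5, "weight_lb": 155},
    {"gender": "F", "height_in": 61.2, "weight_lb": 99},
    {"gender": "F", "height_in": 63.0, "weight_lb": 115},
]

# The variables are the characteristics recorded for every individual.
variables = list(students[0].keys())

# A variable takes different values for different individuals.
heights = [s["height_in"] for s in students]

print(len(students), "individuals")
print("variables:", variables)
print("values of height_in:", heights)
```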
Population/Sample/Raw Data
A population is a collection of all individuals
about which information is desired.
A sample is a subset of a population.
Raw data: information that has been collected but
not yet processed.
E.g., What’s the fastest you’ve ever driven a car?
100 mph.
** use this example to show individual, population,
sample and variable.
Example: A College’s Student Dataset
The data set includes data about all currently
enrolled students such as their ages, genders,
heights, grades, and choices of major.
Who? What individuals do the data describe?
What are the population, sample, and raw data of the study?
What? How many variables do the data
describe? Give an example of a variable.
Types of Variables
A categorical variable places an individual
into one of several groups or categories.
A quantitative variable takes numerical
values for which arithmetic operations such
as adding and averaging make sense.
Q. Which variables are categorical? Which are quantitative?
A variable
Q: Does “average” make sense?
  No: Categorical/Qualitative
    Q: Is there any natural ordering among categories?
      No: Nominal variable
      Yes: Ordinal variable
  Yes: Numerical/Quantitative
    Q: Can all possible values be listed?
      Yes: Discrete variable
      No: Continuous variable
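The two-question decision tree above can be written as a small function. This is only a sketch: the function name and its arguments are invented here, and the answers passed in for each example variable are judgment calls (for instance, it assumes grades are recorded as letters).

```python
def classify_variable(average_makes_sense, naturally_ordered=False,
                      values_listable=False):
    """Walk the decision tree: first ask whether averaging makes sense,
    then ask the follow-up question for that branch."""
    if not average_makes_sense:
        # Categorical branch: is there a natural ordering?
        return "ordinal" if naturally_ordered else "nominal"
    # Quantitative branch: can all possible values be listed?
    return "discrete" if values_listable else "continuous"

# Choice of major: no average, no natural order.
print(classify_variable(False))                           # nominal
# Letter grade (assumed): no average, but A > B > C is a natural order.
print(classify_variable(False, naturally_ordered=True))   # ordinal
# Number of courses taken (hypothetical variable): 0, 1, 2, ... can be listed.
print(classify_variable(True, values_listable=True))      # discrete
# Height: any value in an interval is possible.
print(classify_variable(True))                            # continuous
```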
Asking the Right Questions
The relationship between gender and age
** What is your age and gender?
** Reading assignment: pp. 16-18.
The Role of a Variable
Response/outcome/dependent variables:
the variables whose changes are of primary interest.
Explanatory/predictor/independent variables:
the variables that may explain how the responses change.
Study Q: Do males tend to drive faster than
females?
1. What are the two variables? What are their roles?
Two Basic Strategies to Explore Data
Begin by examining each variable by itself.
Then move on to study the relationship
among the variables.
Begin with a graph or graphs. Then add
numerical summaries of specific aspects of
the data.
Summarizing Data
Goal: to study or estimate the distributions of
variables
The distribution of a variable tells us what
values/categories it takes and how often it
takes those values/categories.
Displaying distributions of data with graphs
Describing distributions of data with numbers
A Dataset of CSUEB Students
Gender  Height (inches)  Weight (pounds)  College
M       68.5             155              Bsns
F       61.2             99               Sci
F       63.0             115              Sci
M       70.0             205              Arts
F       68.6             170              Bsns
F       65.1             125              Arts
M       72.4             220              Sci
M       --               188              Bsns
Numerical Summaries for Categorical
Variables
Frequency/counts
Relative frequency/percentage
** show gender
Contingency table
** show gender vs. college
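As a rough sketch of these summaries, the code below computes counts, relative frequencies, and a gender-by-college contingency table for the eight CSUEB students using only the standard library. The pairing of each College value with a student follows the order the values appear in the transcript, which is an assumption, so the cell counts should be treated as illustrative.

```python
from collections import Counter

# (gender, college) for the eight students.
# ASSUMPTION: colleges are matched to students in transcript order.
students = [
    ("M", "Bsns"), ("F", "Sci"), ("F", "Sci"), ("M", "Arts"),
    ("F", "Bsns"), ("F", "Arts"), ("M", "Sci"), ("M", "Bsns"),
]
n = len(students)

# Frequency (count) and relative frequency for one categorical variable.
gender_counts = Counter(g for g, _ in students)
gender_rel = {g: c / n for g, c in gender_counts.items()}

# Contingency table: counts for every (gender, college) combination.
table = Counter(students)
genders = sorted(gender_counts)
colleges = sorted({c for _, c in students})

print("Gender  Count  Percent")
for g in genders:
    print(f"{g:<7} {gender_counts[g]:>5}  {100 * gender_rel[g]:.1f}%")

print()
print("Gender  " + "  ".join(f"{c:>4}" for c in colleges))
for g in genders:
    print(f"{g:<7} " + "  ".join(f"{table[(g, c)]:>4}" for c in colleges))
```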
Graphs for Categorical Variables
Pie charts
** good for one variable; show gender pie chart
Bar graphs
** good for one or two variables; show gender
then gender and college bar graphs
Graphs for Quantitative Variables
Stem-and-leaf plots; dotplots
** to see the distribution. Good for small datasets;
explicit but choppy. E.g. height
Histograms
** to see the distribution. Good for all datasets. E.g.
height
How to Make a Stemplot
1. Separate each observation into a stem
consisting of all but the final (rightmost) digit
and a leaf, the final digit. Stems may have
as many digits as needed, but each leaf
contains only a single digit.
Example: for an age of 20.5, the leaf is “5” and the
remaining digits, “20”, form the stem.
2. Write the stems in a vertical column with the
smallest at the top, and draw a vertical line
at the right of this column.
3. Write each leaf in the row to the right of its
stem, in increasing order out from the stem.
*** work on blackboard for height ***
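The three steps can be sketched in a few lines of Python, applied here to the seven recorded heights from the CSUEB table (the missing height is left out). The function name `stemplot` is made up for this example, and it assumes every value has exactly one decimal digit, so the tenths digit is the leaf.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stem-and-leaf plot for one-decimal values."""
    leaves = defaultdict(list)
    # Step 1: split each observation into stem and leaf.
    for v in sorted(values):          # sorting keeps leaves in increasing order
        n = round(v * 10)             # 68.5 -> 685
        leaves[n // 10].append(n % 10)
    # Steps 2 and 3: stems in a column, smallest first, leaves to the right.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {digits}".rstrip())
    return rows

heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]
for row in stemplot(heights):
    print(row)
```

Stems with no observations are still printed, so gaps in the data stay visible.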
How to Make a Histogram
1. Break the range of values of a variable into
equal-width intervals.
2. Count the # of individuals in each interval.
These counts are called frequencies and
the corresponding %’s are called relative
frequencies.
3. Draw the histogram: the variable on the
horizontal axis and the count (or %) on the
vertical axis.
*** work on blackboard for height ***
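A minimal sketch of these steps for the seven recorded heights follows. The starting point (60 inches), the interval width (3 inches), and the number of intervals (5) are choices made for this illustration, not values taken from the slides, and the function name is hypothetical.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width interval
    [start + i*width, start + (i+1)*width). Values outside are skipped."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]
start, width, n_bins = 60, 3, 5       # assumed bin choices

# Step 2: frequencies and relative frequencies.
counts = histogram_counts(heights, start, width, n_bins)
relative = [c / len(heights) for c in counts]

# Step 3: a text version of the histogram, one row per interval.
for i, (c, r) in enumerate(zip(counts, relative)):
    low = start + i * width
    print(f"[{low}, {low + width})  {'#' * c:<3} {c} ({r:.0%})")
```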
What Do We See from the Graphs?
Important features we should look for:
Overall pattern
– Shape
– Center (the location where the data tend to cluster)
– Spread (how spread out the data are)
Outliers, the values that fall far outside the
overall pattern
(for quantitative variables only)
Overall Pattern—Shape
How many peaks, called modes? A distribution with
one major peak is called unimodal.
Symmetric or skewed?
– Symmetric if the large values are mirror images of the small
values
– Skewed to the right if the right tail (large values) is much
longer than the left tail (small values)
– Skewed to the left if the left tail (small values) is much
longer than the right tail (large values)
*** Show examples on blackboard. ***
Numerical Summaries for Quantitative
Variables
To measure center (location): Mode, Mean and
Median
To measure spread: Range, Interquartile Range
(IQR) and Standard Deviation (SD)
Five-number summaries
** show height
Outliers
** give a large number for the missing height
Regression
** height vs. weight, discussed in chapter 5
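The center, spread, and five-number summaries listed above can be computed for height with Python's standard `statistics` module, as sketched below. The missing height is simply left out. Quartile conventions differ between textbooks and software; this sketch uses the module's default ("exclusive") method, which is an assumption about the convention intended here. The mode is skipped because no height repeats.

```python
import statistics

# Seven recorded heights (inches); the missing value is omitted.
heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]

# Center
mean = statistics.mean(heights)
median = statistics.median(heights)

# Quartiles (default "exclusive" method; other conventions may differ)
q1, _, q3 = statistics.quantiles(heights, n=4)

# Spread
data_range = max(heights) - min(heights)
iqr = q3 - q1
sd = statistics.stdev(heights)        # sample SD, divides by n - 1

# Five-number summary: min, Q1, median, Q3, max
five_number = (min(heights), q1, median, q3, max(heights))

print(f"mean = {mean:.2f}, median = {median}")
print(f"range = {data_range:.1f}, IQR = {iqr:.1f}, SD = {sd:.2f}")
print("five-number summary:", five_number)
```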
More Graphs for Quantitative
Variables
Boxplots
** to show location and spread, and identify outliers
Scatterplots
** to see the relationship between two quantitative variables:
height vs. weight
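To see how a boxplot-style rule can flag an outlier, the sketch below applies the common 1.5 × IQR convention to the heights. That rule is not stated in these slides and is assumed here, and the value 720.0 is a hypothetical stand-in for a wrongly entered height. It also shows that the mean moves a lot when the entry is added while the median barely changes.

```python
import statistics

def iqr_outliers(values):
    """Return values beyond 1.5 * IQR from the quartiles
    (a common convention, assumed here)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]
with_entry = heights + [720.0]        # hypothetical bad entry

print("outliers, recorded heights:", iqr_outliers(heights))
print("outliers, with bad entry:  ", iqr_outliers(with_entry))
print(f"mean:   {statistics.mean(heights):.2f} -> "
      f"{statistics.mean(with_entry):.2f}")
print(f"median: {statistics.median(heights):.2f} -> "
      f"{statistics.median(with_entry):.2f}")
```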
Graphs for the Relation of Two
Variables
1 categorical + 1 quantitative variable:
2 quantitative variables:
2 categorical variables:
(for Bell-shaped distributions only)
Empirical Rule (68-95-99.7 rule)
If a variable X follows a normal distribution, that
is, the values of X over the whole population form a
bell-shaped distribution, then:
Mean(X) ± 1*SD(X) covers about 68% of possible X values
Mean(X) ± 2*SD(X) covers about 95% of possible X values
Mean(X) ± 3*SD(X) covers about 99.7% of possible X values
Empirical Rule (68-95-99.7 rule)
If the data (from a sample) of a variable X
are bell-shaped, then, with X̄ the sample mean and S the
sample standard deviation:
X̄ ± 1*S covers about 68% of possible X values
X̄ ± 2*S covers about 95% of possible X values
X̄ ± 3*S covers about 99.7% of possible X values
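As a quick check on where 68, 95, and 99.7 come from, the sketch below builds the three intervals for a given mean and standard deviation and computes the exact coverage under a normal distribution with `statistics.NormalDist`. The mean of 100 and SD of 15 are made-up numbers for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist

def empirical_intervals(mean, sd):
    """Intervals mean ± k*sd for k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

def normal_coverage(k):
    """Probability that a normal variable falls within k SDs of its mean."""
    z = NormalDist()                  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

# Hypothetical mean and SD, chosen only for illustration.
for k, (lo, hi) in empirical_intervals(100, 15).items():
    print(f"mean ± {k} SD: [{lo}, {hi}] covers {normal_coverage(k):.1%}")
```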
How to use Empirical Rule
Find the range covering 68%, 95% or 99.7%
of X values
Check if X follows a normal distribution.
Provide an estimate of S without messy
calculation
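One rough way to carry out the normality check is to compare the share of the data within 1, 2, and 3 sample standard deviations of the sample mean against 68%, 95%, and 99.7%. The sketch below does this for the seven recorded heights; with so few observations the comparison is only suggestive, and the function name is made up for this example.

```python
import statistics

def proportion_within(values, k):
    """Share of values within k sample SDs of the sample mean."""
    center = statistics.mean(values)
    s = statistics.stdev(values)
    inside = [v for v in values if abs(v - center) <= k * s]
    return len(inside) / len(values)

heights = [68.5, 61.2, 63.0, 70.0, 68.6, 65.1, 72.4]
for k, expected in ((1, 0.68), (2, 0.95), (3, 0.997)):
    observed = proportion_within(heights, k)
    print(f"within {k} SD: observed {observed:.0%}, rule says about {expected:.1%}")
```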