Introduction to Statistics (Variables and Displaying Categorical Data)


Stats Starts Here
Copyright © 2009 Pearson Education, Inc.


According to 100% of people surveyed, this is the
greatest class ever offered in college.
List of people surveyed: ME
What Is (Are?) Statistics?



Statistics (the discipline) is a way of reasoning, a
collection of tools and methods, designed to help us
understand the world.
Statistics (plural) are particular calculations made from
data.
Data are values with a context.
What is Statistics About?



Statistics is about variation.
All measurements are imperfect, since there is
variation that we cannot see.
Statistics helps us to understand the real, imperfect
world in which we live.
Think, Show, Tell

There are three simple steps to doing Statistics right:
Think first. Know where you’re headed and why.
Show is about the mechanics of calculating statistics
and graphical displays, which are important (but are
not the most important part of Statistics).
Tell what you’ve learned. You must explain your
results so that someone else can understand your
conclusions.
Chapter 2
Data
What Are Data?



Data can be numbers, record names, or other
labels.
Not all data represented by numbers are
numerical data (e.g., 1=male, 2=female).
Data are useless without their context…
The Exam (FICTIONAL DATA)





The class average last semester was a 94 on the
final exam.
It was out of 500 points!!!!
A group of individuals averaged 78% on an
algebra exam…
The individuals were 7-year-olds…
Or… the individuals were algebra
teachers!
Dream Job


The average salary at a company that has 25
employees is $8,500,000 per year.
Would you like to be hired by this company?
The CEO makes $212,400,000 per year…
The other 24 employees average $4,166.67 per year.
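The arithmetic behind this slide can be checked with a short sketch. It assumes that the 24 non-CEO salaries are all equal, since the slide gives only their average. The median is not mentioned on the slide; it is used here only as one common summary that a single extreme value does not distort.

```python
from statistics import mean, median

# Salaries from the slide: one CEO plus 24 other employees.
# Assumption: the 24 other salaries are identical (only their average is given).
ceo_salary = 212_400_000
other_salary = 100_000 / 24  # about $4,166.67 each
salaries = [ceo_salary] + [other_salary] * 24

average_salary = mean(salaries)    # pulled far upward by the single CEO salary
typical_salary = median(salaries)  # the middle value of the 25 salaries

print(f"mean:   ${average_salary:,.2f}")
print(f"median: ${typical_salary:,.2f}")
```

The mean is a correct calculation, but without the context of who earns what it says little about what a new hire could expect.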

The “W’s”


To provide context we need the W’s of the data:
- Who
- What (and in what units)
- When
- Where
- Why (if possible)
- and How
Note: the answers to “who” and “what” are
essential.
Data Tables

The following data table clearly shows the context
of the data presented:

Notice that this data table tells us the What
(column titles) and Who (row titles) for these data.
Who

The Who of the data tells us the individual cases
about which (or whom) we have collected data.
- Individuals who answer a survey are called
  respondents.
- People on whom we experiment are called
  subjects or participants.
- Animals, plants, and inanimate subjects are
  called experimental units.
What and Why



Variables are characteristics recorded about each
individual.
Each variable should have a name that identifies
What has been measured.
To understand variables, you must Think about
what you want to know.
What and Why (cont.)


A categorical (or qualitative) variable names
categories and answers questions about how
cases fall into those categories.
- Categorical examples: sex, race, ethnicity
A quantitative variable is a measured variable
(with units) that answers questions about the
quantity of what is being measured.
- Quantitative examples: income ($), height
  (inches), weight (pounds)
What and Why (cont.)


Example: In a student evaluation of instruction at a
large university, one question asks students to
evaluate the statement “The instructor was
generally interested in teaching” on the following
scale: 1 = Disagree Strongly; 2 = Disagree;
3 = Neutral; 4 = Agree; 5 = Agree Strongly.
Question: Is interest in teaching categorical or
quantitative?
What and Why (cont.)



Question: Is interest in teaching categorical or
quantitative?
We sense an order to these ratings, but there are
no natural units for the variable interest in
teaching.
Variables like interest in teaching are often called
ordinal variables.
- With an ordinal variable, look at the Why of the
  study to decide whether to treat it as
  categorical or quantitative.
Identifying Identifiers



Identifier variables are categorical variables with
exactly one individual in each category.
- Examples: Social Security Number, ISBN,
  FedEx Tracking Number
Don’t be tempted to analyze identifier variables.
Be careful not to consider all variables with one
case per category, like year, as identifier
variables.
- The Why will help you decide how to treat
  identifier variables.
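A minimal sketch of how one might flag candidate identifier variables, using made-up records. The function name one_case_per_category and both lists are hypothetical. As the slide warns with the year example, passing this check only makes a variable a candidate; the Why of the study still decides how to treat it.

```python
from collections import Counter

def one_case_per_category(values):
    """Return True when no value repeats, i.e. every category holds one case."""
    return all(count == 1 for count in Counter(values).values())

# Hypothetical records, invented for illustration only.
tracking_numbers = ["TN-0001", "TN-0002", "TN-0003", "TN-0004"]
ticket_class = ["First", "Crew", "Third", "Crew"]

print(one_case_per_category(tracking_numbers))  # every value is unique
print(one_case_per_category(ticket_class))      # "Crew" appears twice
```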
Where, When, and How


We need the Who, What, and Why to
analyze data. But, the more we know, the
more we understand.
When and Where give us some nice
information about the context.
- Example: Values recorded at a large
  public university may mean something
  different than similar values recorded at a
  small private college.
Where, When, and How (cont.)



How the data are collected can make the
difference between insight and nonsense.
- Example: results from voluntary Internet
  surveys are often useless.
The first step of any data analysis should be to
examine the W’s—this is a key part of the Think
step of any analysis.
And, make sure that you know the Why, Who,
and What before you proceed with your analysis.
What Can Go Wrong?



Don’t label a variable as categorical or
quantitative without thinking about the question
you want it to answer.
Just because your variable’s values are numbers,
don’t assume that it’s quantitative.
Always be skeptical—don’t take data for granted.
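The warning about numeric codes can be illustrated with a short sketch. The list of responses is hypothetical; the coding (1 = male, 2 = female) is the one used earlier in these slides.

```python
from collections import Counter
from statistics import mean

# Hypothetical coded responses: 1 = male, 2 = female.
codes = [1, 2, 2, 1, 2, 2, 1, 2]
labels = {1: "male", 2: "female"}

# The arithmetic runs without complaint, but the result names no category.
meaningless_mean = mean(codes)

# Counting cases per category is the summary that fits a categorical variable.
counts = Counter(labels[code] for code in codes)

print(meaningless_mean)
print(counts)
```

The software cannot tell that the codes are labels rather than measurements, so that judgment has to come from the analyst.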
What have we learned? (cont.)

We treat variables as categorical or quantitative.
- Categorical variables identify a category for
  each case.
- Quantitative variables record measurements or
  amounts of something and must have units.
- Some variables can be treated as categorical
  or quantitative depending on what we want to
  learn from them.
Displaying and Describing
Categorical Data
The Three Rules of Data Analysis

The three rules of data analysis won’t be difficult to
remember:
1. Make a picture—things may be revealed that are
not obvious in the raw data. These will be things to
think about.
2. Make a picture—important features of and
patterns in the data will show up. You may also
see things that you did not expect.
3. Make a picture—the best way to tell others about
your data is with a well-chosen picture.
Frequency Tables: Making Piles


We can “pile” the data by counting the number of
data values in each category of interest.
We can organize these counts into a frequency
table, which records the totals and the category
names.
Frequency Tables: Making Piles (cont.)

A relative frequency table is similar, but gives the
percentages (instead of counts) for each
category.
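A minimal sketch of both kinds of table. The counts by ticket class are an assumption taken from the widely circulated Titanic data set; the table shown on the original slide is not part of this text. The raw list is rebuilt only to mimic one label per person.

```python
from collections import Counter

# Assumed counts by class (widely circulated Titanic data set, not from the slide).
class_counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

# One label per person, as raw categorical data would arrive.
raw = [cls for cls, n in class_counts.items() for _ in range(n)]

# "Making piles": count the data values in each category.
frequency = Counter(raw)
total = sum(frequency.values())

# Relative frequencies: the same piles expressed as percentages.
relative = {cls: 100 * n / total for cls, n in frequency.items()}

print(f"{'Class':<8}{'Count':>7}{'%':>8}")
for cls in class_counts:
    print(f"{cls:<8}{frequency[cls]:>7}{relative[cls]:>8.2f}")
```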
What’s Wrong With This Picture?
You might think that a good way to show
the Titanic data is with this display:

The Area Principle



The ship display makes it look like most of the people
on the Titanic were crew members, with a few
passengers along for the ride.
When we look at each ship, we see the area taken up
by the ship, instead of the length of the ship.
The ship display violates the area principle:
- The area occupied by a part of the graph should
  correspond to the magnitude of the value it
  represents.
Bar Charts

A bar chart displays the distribution of a categorical
variable, showing the counts for each category next to
each other for easy comparison.

A bar chart stays true to the area principle.

Thus, a better display for the ship data is:
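A rough text-only rendering of a bar chart can be sketched as follows. The counts by class are assumed from the widely circulated Titanic data set rather than read from the slide, and the scale PEOPLE_PER_MARK is a hypothetical choice. Every bar has the same height, so the area of each bar grows in step with the count it represents.

```python
# Assumed counts by class (not taken from the slide itself).
class_counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

PEOPLE_PER_MARK = 25  # hypothetical scale: one '#' per 25 people

def bar(count, per_mark=PEOPLE_PER_MARK):
    """Return a bar whose length is proportional to the count."""
    return "#" * round(count / per_mark)

bars = {cls: bar(n) for cls, n in class_counts.items()}

for cls, marks in bars.items():
    print(f"{cls:<7}|{marks} {class_counts[cls]}")
```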
Bar Charts (cont.)



A relative frequency bar chart displays the relative
proportion of counts for each category.
A relative frequency bar chart also stays true to the area
principle.
Replacing counts with percentages in the ship data:
Pie Charts

When you are interested in parts of the whole, a pie chart
might be your display of choice.

Pie charts show the whole group of cases as a circle.

They slice the circle into pieces whose size is
proportional to the fraction of the whole in each category.
Contingency Tables


A contingency table allows us to look at two categorical
variables together.
It shows how individuals are distributed along each variable,
contingent on the value of the other variable.
- Example: we can examine the class of ticket and whether a
  person survived the Titanic:
Contingency Tables (cont.)


The margins of the table, both on the right and on the bottom,
give totals and the frequency distributions for each of the
variables.
Each frequency distribution is called a marginal distribution of
its respective variable.
- The marginal distribution of Survival is:
Contingency Tables (cont.)

Each cell of the table gives the count for a combination of
values of the two variables.
- For example, the second cell in the crew column tells us
  that 673 crew members died when the Titanic sank.
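A minimal sketch of a contingency table with its margins. Only the 673 crew deaths are stated in these slides; the remaining cell counts are an assumption taken from the widely circulated Titanic data set, and the nested-dictionary layout is just one convenient choice.

```python
# Rows are Survival, columns are ticket Class.
# Assumed cell counts; only the Dead/Crew value (673) is stated in the slides.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}
classes = ["First", "Second", "Third", "Crew"]

# Margins: totals along each row and down each column.
row_totals = {s: sum(row.values()) for s, row in table.items()}
col_totals = {c: sum(table[s][c] for s in table) for c in classes}
grand_total = sum(row_totals.values())

# Marginal distribution of Survival, in percent.
survival_marginal = {s: 100 * n / grand_total for s, n in row_totals.items()}

print(f"{'':<7}" + "".join(f"{c:>8}" for c in classes) + f"{'Total':>8}")
for s in table:
    cells = "".join(f"{table[s][c]:>8}" for c in classes)
    print(f"{s:<7}" + cells + f"{row_totals[s]:>8}")
totals = "".join(f"{col_totals[c]:>8}" for c in classes)
print(f"{'Total':<7}" + totals + f"{grand_total:>8}")
```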
Conditional Distributions

A conditional distribution shows the distribution of
one variable for just the individuals who satisfy
some condition on another variable.
- The following is the conditional distribution of
  ticket Class, conditional on having survived:
Conditional Distributions (cont.)

The following is the conditional distribution of
ticket Class, conditional on having perished:
Conditional Distributions (cont.)

The conditional distributions tell us that there is a
difference in class for those who survived and those
who perished.

This is better shown with pie charts of
the two distributions:
Conditional Distributions (cont.)



We see that the distribution of Class for the
survivors is different from that of the nonsurvivors.
This leads us to believe that Class and Survival
are associated; that is, they are not independent.
The variables would be considered independent
when the distribution of one variable in a
contingency table is the same for all categories of
the other variable.
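The comparison described here can be sketched by computing the conditional distribution of Class within each Survival group. The cell counts are an assumption from the widely circulated Titanic data set (only the 673 crew deaths appear in these slides), and the helper conditional_distribution is a hypothetical name. Comparing percentages this way is a descriptive check only, not a formal test of independence.

```python
# Assumed cell counts; rows are Survival, columns are ticket Class.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead":  {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

def conditional_distribution(row):
    """Percent of the row's cases that fall in each column category."""
    total = sum(row.values())
    return {category: 100 * n / total for category, n in row.items()}

class_given_alive = conditional_distribution(table["Alive"])
class_given_dead = conditional_distribution(table["Dead"])

# If the variables were independent, the two distributions would match.
gaps = {c: abs(class_given_alive[c] - class_given_dead[c]) for c in class_given_alive}
largest_gap = max(gaps.values())

print(f"{'Class':<7}{'Alive':>8}{'Dead':>8}")
for c in class_given_alive:
    print(f"{c:<7}{class_given_alive[c]:>7.1f}%{class_given_dead[c]:>7.1f}%")
print(f"largest gap: {largest_gap:.1f} percentage points")
```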
Segmented Bar Charts


A segmented bar chart displays the same
information as a pie chart, but in the form of bars
instead of circles.
Here is the segmented bar chart for ticket Class by
Survival status:
What Can Go Wrong?

Don’t violate the area principle.

While some people might like the pie chart on
the left better, it makes it harder to compare fractions of
the whole, which a well-done pie chart lets you do.
What Can Go Wrong? (cont.)

Keep it honest—make sure your display shows
what it says it shows.

This plot of the percentage of high-school
students who engage in specified dangerous
behaviors has a problem. Can you see it?
What Can Go Wrong? (cont.)



Don’t confuse similar-sounding percentages—pay
particular attention to the wording of the context.
Don’t forget to look at the variables separately
too—examine the marginal distributions, since it is
important to know how many cases are in each
category.
Example: 20% of 100 (20 cases) is very different from
21% of 10,000 (2,100 cases), even though the
percentages are nearly the same.
What Can Go Wrong? (cont.)

Be sure to use enough individuals!

Do not make a report like “We found that
66.67% of the rats improved their performance
with training. The other rat died.”
What Can Go Wrong? (cont.)


Don’t overstate your case—don’t claim something
you can’t.
Don’t use unfair or silly averages—this could lead
to Simpson’s Paradox, so be careful when you
average one variable across different levels of a
second variable.
Simpson’s Paradox
Hitter A:
- Against right-handed pitchers: 300 at-bats, 90 hits (.300 average)
- Against left-handed pitchers: 200 at-bats, 50 hits (.250 average)
Hitter B:
- Against right-handed pitchers: 100 at-bats, 32 hits (.320 average)
- Against left-handed pitchers: 300 at-bats, 78 hits (.260 average)
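In this example the hitter with the better average against each kind of pitcher ends up with the worse combined average, because the at-bats are split unevenly between the two kinds of pitchers. A minimal sketch that recomputes both views from the counts on the slide (the dictionary layout and the names are illustrative choices):

```python
# (at_bats, hits) for each hitter against each kind of pitcher, from the slide.
records = {
    "A": {"right": (300, 90), "left": (200, 50)},
    "B": {"right": (100, 32), "left": (300, 78)},
}

def batting_average(at_bats, hits):
    return hits / at_bats

# Averages within each level of the second variable (pitcher handedness).
split = {
    hitter: {side: batting_average(*record) for side, record in sides.items()}
    for hitter, sides in records.items()
}

# Combined averages: pool the at-bats and hits before dividing.
overall = {
    hitter: batting_average(
        sum(at_bats for at_bats, _ in sides.values()),
        sum(hits for _, hits in sides.values()),
    )
    for hitter, sides in records.items()
}

for hitter in records:
    print(
        f"Hitter {hitter}: right {split[hitter]['right']:.3f}, "
        f"left {split[hitter]['left']:.3f}, overall {overall[hitter]:.3f}"
    )
```

Hitter A takes most at-bats against right-handed pitchers, where both hitters do better, while Hitter B takes most against left-handed pitchers, so pooling the two levels reverses the comparison.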
What have we learned?




We can summarize categorical data by counting
the number of cases in each category
(expressing these as counts or percents).
We can display the distribution in a bar chart or
pie chart.
And, we can examine two-way tables called
contingency tables, examining marginal and/or
conditional distributions of the variables.
If conditional distributions of one variable are the
same for every category of the other, the
variables are independent.