Transcript Slide 1
Probability & Statistics
Data (Bock Ch2)
What are Data?
• Data can be numbers, record names, or other
labels.
• Not all data represented by numbers are
numerical data (e.g., 1=male, 2=female).
• Data are useless without their context…
The “W’s”
• To provide context we need the W’s
– Who
– What (and in what units)
– When
– Where
– Why (if possible)
– and How
of the data.
• Note: the answers to “who” and “what” are
essential.
Organization!
Organization is important!
Here are some customer records from a
company’s database:
B001OAA
24
Veterans
902
Boston
Y
Garbage
43
Y
440
17
15.98
Chicago
18
Kansas
10.99
B0015Y6
413
Y
N
368
Fenway
B001OAA
312
Without some organization, what can be made of
these data?
Data Table
• The following data table clearly shows the
context of the data presented:
Name          Ship To    Price   Area Code   # of Previous Purchases   Gift?   Catalog ID   Artist
Katherine H.  Ohio       10.99   440         17                        N       B0015Y6      Kansas
Samuel P.     Illinois   16.99   312         3                         Y       B002BK9      Boston
Chris G.      New York   15.98   413         0                         N       B0068ZVQ     Chicago
Monique D.    Canada     11.98   902         10                        Y       B001OAA      Garbage
• Notice that this data table tells us the What
(column titles) and Who (row titles) for these
data.
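As a rough sketch of the same idea in code (Python is an assumption here; the slides name no language), the table can be stored as a list of records. Each record is one case (a "who") and each key is one variable (a "what"). The key names are illustrative, and area codes are kept as text because they are labels rather than amounts.

```python
# Each dict is one case (a "who"); each key is one variable (a "what").
# Key names are illustrative; values are copied from the data table.
orders = [
    {"name": "Katherine H.", "ship_to": "Ohio", "price": 10.99, "area_code": "440",
     "previous_purchases": 17, "gift": "N", "catalog_id": "B0015Y6", "artist": "Kansas"},
    {"name": "Samuel P.", "ship_to": "Illinois", "price": 16.99, "area_code": "312",
     "previous_purchases": 3, "gift": "Y", "catalog_id": "B002BK9", "artist": "Boston"},
    {"name": "Chris G.", "ship_to": "New York", "price": 15.98, "area_code": "413",
     "previous_purchases": 0, "gift": "N", "catalog_id": "B0068ZVQ", "artist": "Chicago"},
    {"name": "Monique D.", "ship_to": "Canada", "price": 11.98, "area_code": "902",
     "previous_purchases": 10, "gift": "Y", "catalog_id": "B001OAA", "artist": "Garbage"},
]

who = [row["name"] for row in orders]   # the row titles
what = list(orders[0].keys())           # the column titles

print("Who:", who)
print("What:", what)
```

Once the values are tied to rows and columns, the jumble of loose values shown earlier becomes readable: 440 is an area code, 17 is a count of previous purchases, and "Kansas" is an artist rather than a shipping destination.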
Data Table
The “What”: the column titles (Name, Ship To, Price, Area Code, # of Previous Purchases, Gift?, Catalog ID, Artist).
The “Who”: the row titles (Katherine H., Samuel P., Chris G., Monique D.).
Data Table
The “What” needs to be labeled as either Categorical or
Quantitative. If it is Quantitative, the units should be
included.
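One plausible labeling of the table's variables can be sketched in Python as follows. This is not taken from the slides: the labels are a reasonable reading of the table, the unit for Price is an assumption (no currency is stated), and the helper `missing_units` is hypothetical.

```python
# Illustrative labels for the table's variables. The unit for Price is an
# assumption: the slides do not state a currency.
variables = {
    "Name": ("categorical", None),
    "Ship To": ("categorical", None),
    "Price": ("quantitative", "dollars (assumed)"),
    "Area Code": ("categorical", None),
    "# of Previous Purchases": ("quantitative", "purchases (count)"),
    "Gift?": ("categorical", None),
    "Catalog ID": ("categorical", None),
    "Artist": ("categorical", None),
}

def missing_units(variables):
    """Return the quantitative variables that have no units recorded."""
    return [name for name, (kind, units) in variables.items()
            if kind == "quantitative" and not units]

for name, (kind, units) in variables.items():
    print(f"{name}: {kind}" + (f" ({units})" if units else ""))

print("Quantitative variables missing units:", missing_units(variables))
```

Area Code is made of digits but is labeled categorical here, because an area code identifies a region rather than measuring an amount.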
Who
• The Who of the data tells us the individual
cases about which (or whom) we have
collected data.
– Individuals who answer a survey are called
respondents.
– People on whom we experiment are called
subjects or participants.
– Animals, plants, and inanimate subjects are called
experimental units.
Who (cont.)
• Sometimes people just refer to data values as
observations and are not clear about the Who.
– But we need to know the Who of the data so we
can learn what the data say.
What
• The “what” are Variables or characteristics
recorded about each individual.
• Each variable should have a name that identifies
What has been measured.
• To understand variables, you must Think about
what you want to know.
What
• Some variables have units that tell how each
value has been measured and tell the scale of
the measurement.
What
• A categorical variable names categories and
answers questions about how cases fall into
those categories.
– Categorical examples: sex, race, ethnicity
• A quantitative variable is a measured variable
(with units) that answers questions about the
quantity of what is being measured.
– Quantitative examples: income ($), height (inches),
weight (pounds)
What
• Example:
An online store that sells sports memorabilia
keeps track of the addresses of all of its
customers. One of the variables it keeps
track of is each customer’s zip code.
• Question: Are zip codes categorical or quantitative?
What
• Question: Are zip codes categorical or quantitative?
• Although zip codes are numbers and can be
put in order, there are no natural units for the
variable zip code.
• Variables like “zip code” are treated as
categorical variables.
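A small Python sketch makes the point concrete. The zip codes below are made up for illustration, and they are stored as text so that leading zeros survive. Counting cases per zip code answers a real question; averaging the codes produces a number that measures nothing.

```python
from collections import Counter
from statistics import mean

# Made-up zip codes for illustration; stored as text so leading zeros survive.
zip_codes = ["02134", "60614", "02134", "10001", "60614", "02134"]

# A meaningful categorical summary: how many customers fall in each zip code.
counts = Counter(zip_codes)
print(counts.most_common())

# An arithmetic summary runs without error, but the result has no units and
# does not describe any quantity about the customers.
average = mean(int(z) for z in zip_codes)
print(average)
```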
Why
• Why we are collecting data is important in
understanding what we think about and how
we treat the variables.
Where, When, and How
• We need the Who, What, and Why to analyze
data. But, the more we know, the more we
understand.
• When and Where give us some nice
information about the context.
– Example: Values recorded at a large public
university may mean something different
than similar values recorded at a small
private college.
(The “where” makes a difference)
Where, When, and How
• How the data are collected can make the
difference between insight and nonsense.
– Example: results from voluntary Internet surveys
are often useless
• The first step of any data analysis should be to
examine the W’s—this is a key part of the
Think step of any analysis.
• And, make sure that you know the Why, Who,
and What before you proceed with your
analysis.
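The check described above can be sketched as a small Python helper. Everything here is hypothetical: the field names, the `check_ws` function, and the sample record are illustrations, not part of the course material.

```python
# Hypothetical metadata check for a data set; field names are illustrative.
REQUIRED = ("who", "what", "why")     # needed before any analysis
OPTIONAL = ("when", "where", "how")   # add context when available

def check_ws(context):
    """Return (missing_required, missing_optional) for a dict of W's."""
    missing_required = [w for w in REQUIRED if not context.get(w)]
    missing_optional = [w for w in OPTIONAL if not context.get(w)]
    return missing_required, missing_optional

context = {
    "who": "customers of an online store",
    "what": "zip code (categorical)",
    "why": "",                         # not yet stated
    "how": "store database records",
}

required, optional = check_ws(context)
print("Missing required W's:", required)
print("Missing optional W's:", optional)
```

An empty answer counts as missing, so a record that lists a W without actually answering it is still flagged.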
What can go wrong?
• Don’t label a variable as categorical or
quantitative without thinking about the
question you want it to answer.
• Just because your variable’s values are
numbers, don’t assume that it’s quantitative.
• Always be skeptical—don’t take data for
granted.
Summary
• Data are information in a context.
– The W’s help with the context.
– We must know the Who (cases), What (variables),
and Why to be able to say anything useful about
the data.
Who: Who (or what) the data are about.
What: The variables (and their units). (What about them?)
Both categorical and quantitative measurements about the “who”.
When and Where: When and where the data were recorded.
Why: Why the data were collected.
How: How the data were collected. Is it a legitimate source?
Summary (cont.)
• We treat variables (the “what”) as categorical
or quantitative.
– Categorical variables identify a category for each
case.
– Quantitative variables record measurements or
amounts of something and must have units.
– Some variables can be treated as categorical or
quantitative depending on what we want to learn
from them.
Example
For the following description of data, identify the W’s, name the variables, classify
each variable as categorical or quantitative, and for any quantitative variable,
identify the units in which it was measured (or state that they were not provided).
According to an article in Fortune (Dec 28, 1992), 401(k)
plans permit employees to shift part of their before-tax
salaries into investments such as mutual funds. Employers
typically match 50% of the employees’ contribution up to
about 6% of salary. One company, concerned with what it
believed was a low employee participation rate in its
401(k) plan, sampled 30 other companies with similar
plans and asked for their 401(k) participation rates.
Example
Who:
30 other companies
Although the interest here is in 401(k) plans, the 30
companies are the cases being studied.
Example
What:
(Be sure to label each variable as quantitative or categorical.)
Employer’s contribution – quantitative (%)
Contribution limit – quantitative (%)
Participation rate – ???
Participation rate could be quantitative or categorical
depending on how it is measured. If it is measured by
number of employees or percent of employees, then it would
be quantitative. If it is categorized as high, medium, or low,
then participation rate would be categorical.
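The two treatments can be sketched in Python. The rates and the cutoffs below are made up for illustration; the passage does not report any actual participation rates or define what counts as high, medium, or low.

```python
# Made-up participation rates in percent; the cutoffs below are arbitrary
# assumptions chosen only for illustration.
rates_percent = [82.0, 45.5, 67.0, 30.0, 91.2]

def categorize(rate, low_cutoff=50.0, high_cutoff=75.0):
    """Turn a quantitative rate (%) into a categorical label."""
    if rate < low_cutoff:
        return "low"
    if rate < high_cutoff:
        return "medium"
    return "high"

# Quantitative treatment: the values keep their units (percent of employees).
print(rates_percent)

# Categorical treatment: each company falls into one named category.
labels = [categorize(r) for r in rates_percent]
print(labels)
```

The same underlying measurement supports either treatment; which one applies depends on how the rate is recorded and on the question being asked.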
Example
Why:
to improve participation rates
Because the company has low employee participation in its
401(k) plan, we can infer that the study is being done in an
attempt to improve participation rates.
Example
Where:
not specified
Although the “where” is not specified here, we can assume
that it is the area that Fortune magazine serves. If the
magazine were local, then we could assume that the “where”
is local. If the magazine were distributed only in the U.S., then
we could assume that the “where” is the U.S. only.
Example
When: before Dec. 28, 1992
This passage seems to imply that the study of the 30
companies was part of the magazine article. If this is so, then
the data must have been collected before the article was
published.
If the company had done the study as a result of the
magazine article, then we would have said that the “when”
was after Dec. 28, 1992.
Example
How:
Sampled other companies
How the sample was conducted is not specified. It could have
been done by asking each company directly, by sending a survey
to fill out, or by simply checking company records. We know only
that the company sampled the other companies; no more specific
“how” is given.