Surviving Statistics - University of Alberta

The Winter Institute on Statistical Literacy for Librarians
Demystifying statistics for the practitioner
Anna Bombak, Chuck Humphrey, Lindsay Johnston and Leah Vanderjagt
University of Alberta
Outline







Introductions
Statistics and data: what are we talking about?
Definitions, standards and metadata
Official statistics: national
Official statistics: international
Census geography and small area statistics
Non-official statistics
Introductions: your backgrounds

Please introduce yourself

• Your name
• Your institutional affiliation
• Your librarian responsibilities
• Is there anything in particular that you are hoping is covered in this workshop?


Introductions: your backgrounds
You are equally split between non-academic and academic libraries.
The largest group, with 13, is from universities other than the U of A.
The second largest group, with 10, is from government libraries.
[Bar chart: participants by library type. Academic: Other Universities (13), University of Alberta (2). Non-academic: Public / Special (5), Government (10).]


Introductions: your backgrounds
Geographically, 21 of you are from Alberta and nine are from other provinces.
We have representation from Ontario, Manitoba, Saskatchewan and Alberta.
Thirteen are from the Edmonton region.
[Bar chart: participants by location. Alberta (21), Outside Alberta (9).]
Statistics: what are we talking about
Statistics are ubiquitous
“Statistics are generated today about nearly every activity on
the planet. Never before have we had so much statistical
information about the world in which we live. Why is this
type of information so abundant? For one thing, statistics
have become a form of currency in today’s information
society. Through computing technology, society has
become very proficient in calculating statistics from the
vast quantities of data that are collected. As a result, our
lives involve daily transactions revolving around some use
of statistical information.”
Data Basics, page 1.1
Numeric information
Statistics
• numeric facts/figures
• created from data, i.e., already processed
• presentation-ready
Data
• numeric files created and organized for analysis/processing
• requires processing
• not display-ready
Numeric information
[Annotated example table]
The cells in the table are the number of estimated smokers.
Six dimensions or variables in this table:
• Geography: Region
• Time: Periods
• Unit of Observation Attributes: Smokers, Education, Age, Sex
Statistics are about definitions!
Definitions
• Sex: Total, Male, Female
• Periods: 1994-1995, 1996-1997
Statistics are about definitions!
Some definitions are based
on standards while others
are based on convention or
practice.
For example, Standard
Geography classifications
Numeric information
Stories are told through statistics


The National Population Health Survey in the previous example had over 80,000 respondents in its 1996-97 sample, and the Canadian Community Health Survey in 2005 had over 130,000 cases.
How do we tell the stories about each of these respondents?
We create summaries of these life experiences using statistics.
Summary





• Statistics are derived from observational, experimental or simulated data.
• A table is a format for displaying statistics and presents a summary or one view of the data.
• Tables are structured around geography, time and attributes of the unit of observation.
• Statistics are dependent on definitions.
• Statistics summarize individual stories into common or general stories.
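To make the idea of a table as "one view of the data" concrete, here is a minimal sketch in Python. The records, field names and the function `smokers_table` are invented for illustration; they are not taken from any survey file. The same respondent-level data yield different tables depending on which dimensions are chosen.

```python
from collections import Counter

# Hypothetical respondent-level records (invented for illustration only).
# Each record is one unit of observation with its attributes.
records = [
    {"region": "Alberta", "period": "1994-1995", "sex": "Female", "smoker": True},
    {"region": "Alberta", "period": "1994-1995", "sex": "Male", "smoker": False},
    {"region": "Alberta", "period": "1996-1997", "sex": "Male", "smoker": True},
    {"region": "Ontario", "period": "1994-1995", "sex": "Female", "smoker": True},
    {"region": "Ontario", "period": "1996-1997", "sex": "Female", "smoker": True},
    {"region": "Ontario", "period": "1996-1997", "sex": "Male", "smoker": False},
]


def smokers_table(rows, dimensions):
    """Count smokers in each cell defined by the chosen dimensions."""
    cells = Counter()
    for row in rows:
        if row["smoker"]:
            key = tuple(row[d] for d in dimensions)
            cells[key] += 1
    return cells


# Two different views (tables) of the same data.
by_region_period = smokers_table(records, ["region", "period"])
by_sex = smokers_table(records, ["sex"])

for cell, count in sorted(by_region_period.items()):
    print(cell, count)
for cell, count in sorted(by_sex.items()):
    print(cell, count)
```

Each key in the result is a table cell, and each value is the statistic (here a simple count) that summarizes the individual records falling into that cell.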
Methods producing data
Observational Methods
• Focus is on developing observational instruments to collect data
• Correlation
• Replicate the analysis (same data or similar)
• Statistics summarize observations
Experimental Methods
• Focus is on manipulating causal agents to measure change in a response agent
• Causation
• Replicate the experiment
• Statistics summarize experiment results
Computational Methods
• Focus is on modeling phenomena through mathematical equations
• Prediction
• Replicate the simulation
• Statistics summarize simulation results
Methods producing data



A particular discipline or field will tend to be
dominated by one of these three methods,
although outputs may also exist from the
other two methods.
Consequently, the knowledge disseminated
within a field is often fairly homogeneous in
how statistical information is used and
reported.
Knowing this and the life cycle in which
statistics are produced can help in the search
for statistics.
Life cycle of survey statistics
[Diagram: a nine-stage cycle surrounding "Access to Information"]
1 Program objective
2 Survey unit organized
3 Questionnaire & sample
4 Data collection
5 Data production & release
6 Analysis
7 Findings released
8 Popularizing findings
9 Needs & gaps evaluation
Life cycle of survey statistics
[Diagram: a nine-stage cycle surrounding "Preserving Information"]
1 Program objective
2 Survey unit organized
3 Questionnaire & sample
4 Data collection
5 Data production & release
6 Analysis
7 Official findings released
8 Popularizing findings
9 Needs & gaps evaluation
Life cycle applied to health statistics
[Diagram: a nine-stage cycle surrounding "Health Information Roadmap Initiative", with stage 1 highlighted]
1 Program objectives
• increased emphasis on health promotion and disease prevention;
• decentralization of accountability and decision-making;
• shift from hospital to community-based services;
• integration of agencies, programs and services; and
• increased efficiency and effectiveness in service delivery.
Life cycle applied to health statistics
[Diagram: a nine-stage cycle surrounding "Health Information Roadmap Initiative", with stages 2 to 7 highlighted]
2 Survey unit organized
3 Questionnaire & sample
4 Data collection
5 Data production & release
6 Analysis
7 Official findings released
Reconstructing statistics

One way to see the relationship between statistics and the data from which they were derived is to reconstruct statistics that someone else has produced from data that are publicly accessible.
Reconstructing statistics
[Diagram: a nine-stage cycle surrounding "Health Information Roadmap Initiative"]
1 Program objective
2 Survey unit organized
3 Questionnaire & sample
4 Data collection
5 Data production & release
6 Analysis
7 Official findings released
8 Popularizing findings
9 Needs & gaps evaluation
Reconstructing statistics


The statistics that we will reconstruct are reported in “Health Facts from the 1994 National Population Health Survey,” Canadian Social Trends, Spring 1996, pp. 24-27.
The steps we will follow are:
• identify the characteristics of the respondents in the article;
• identify the data source;
• locate these characteristics in the data documentation;
• find the original questions used to collect the data;
• retrieve the data; and
• run an analysis to reproduce the statistics.
The findings to be replicated
[Image: excerpt from page 26 of the article]
Summary of variables identified
• Findings apply to Canadian adults
  • Likely need age of respondents
• Men and women
  • Look for the sex of respondents
• Type of drinkers
  • Look for frequency of drinking or a variable categorizing types of drinkers
• Age
  • Look for actual age or age in categories
• Smokers
  • Look for smoking status
Identify the data source

• Survey title is identified: National Population Health Survey, 1994-95
• Public-use microdata file is announced
[Image: excerpt from page 25 of the article]
Locate the variables
• Examine the data documentation for the National Population Health Survey, 1994-95
  • PDF version is on-line
  • Use TOC and link to “Data Dictionary for Health”
• Identify the variables from their content
  • NOTE: check how missing data were handled
Trace the variables back to the questionnaire
• Did the sampling method require weighting cases?
  • NOTE: in addition to the other variables, is a weight variable needed to adjust for the sampling method?
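The effect of a weight variable and of the handling of missing data can be sketched in a few lines of Python. Everything below is hypothetical: the records, the weights and the function `percent_smokers` are invented for illustration and do not come from the NPHS file. Excluding missing answers is only one of several possible choices.

```python
# Hypothetical records: (sex, smoking status, survey weight).
# A weight says how many people in the population one respondent represents.
# A status of None stands in for a missing answer.
respondents = [
    ("Male", "smoker", 1200.0),
    ("Male", "non-smoker", 800.0),
    ("Female", "smoker", 500.0),
    ("Female", "non-smoker", 1500.0),
    ("Female", None, 700.0),
]


def percent_smokers(rows, weighted=True):
    """Percentage of smokers among respondents with a valid answer."""
    smokers = 0.0
    total = 0.0
    for _sex, status, weight in rows:
        if status is None:
            continue  # one possible choice: exclude missing answers
        w = weight if weighted else 1.0
        total += w
        if status == "smoker":
            smokers += w
    return 100.0 * smokers / total


unweighted = percent_smokers(respondents, weighted=False)
weighted = percent_smokers(respondents, weighted=True)
print(round(unweighted, 1), round(weighted, 1))
```

With these invented numbers the unweighted and weighted percentages differ, which is why a reconstruction that ignores the weight variable may not match the published figures.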
Retrieve and analyze the data


For universities subscribed to
the Statistics Canada Data
Liberation Initiative (DLI), the
public use microdata from the
NPHS can be downloaded
without additional cost. See
the Statistics Canada Online
Catalogue for further cost
details.
Make use of local data
services to retrieve data from
the NPHS.
Lessons from the NPHS example


This example demonstrates the distinction
between creating statistics and interpreting
statistics that have been created by others.
This is an important distinction because:
• Choices are made in creating statistics.
• Interpreting statistics requires an ability to understand the choices that were made.
Searching for statistics that others have
created can be facilitated by understanding
these points.
Provide a different perspective
Building on the previous example using the NPHS, compare the statistics from an article about young adults giving and receiving help with those for their parents’ age cohort.
Statistics are about definitions

• Look at the Census definitions
  • Definitions are in the Census Handbook (2001) and the Census Dictionary (2006)
  • Search by Census Variable under Topic-Based Tabulations (2006) for value categorizations
• Look at some standard classifications used in statistics
  • SIC, NAICS, NOC, Standard Classification of Goods (SCG), Standard Geographic Classification (SGC), Classification of Instructional Programs (CIP), ICD-10
Statistics in the News

Three recent newspaper articles that include statistics have been selected for this exercise. For each of the articles, answer the following questions.
• What is the concept represented by the statistic or statistics in this story?
• Is a definition for this concept provided? If it is, what is it? Or is the definition implicit?
• Are any classifications identifiable? What are they?
• Are the data from which this statistic was derived identified in the article?
Metadata for describing tables


As we have discussed, tables are a typical display format for statistics. Because tables are often published within an article, they don’t get indexed. Therefore, finding published tables requires connecting characteristics of the table with other indexed content.
Two indices of tables that exist are Statistical
Universe and Tablebase. They use
traditional elements to index tables without
defining unique properties of tables.
Metadata for describing tables


What are the properties of a table that we
might use to develop useful descriptors for
describing their content?
What is the motivation for doing this
exercise?



Indexing tables with such descriptors would make finding statistics much easier.
The movement toward open access journals and publishing
lends an opportunity to introduce metadata elements for
statistical tables.
Once we have statistical tables described more
comprehensively, opportunities will exist to link tables to the data
sources from which the statistics in the table were derived.
[Annotated example table showing metadata elements]
• Title
• Unit of Observation
• Variables: Average Tuition, Discipline, Academic Year, Province
• Statistical Metric: Dollars
• Footnote
• Date
• Producer
What are the metadata
characteristics of tables & graphs?






Is a title provided?
Is an author, producer or agency
identifiable?
Is there a date of creation or
publication?
What is the entity that has been
observed to make this statistic?
That is, what is the unit of
observation?
Are the characteristics of the
unit of observation (i.e., variables)
and their categories clearly
identified and defined?
Is there a key to explain the use
of colours or lines in the graph?





Is the type of statistic clearly
identified? That is, does the
table or graph contain
percentages, counts, averages,
etc.?
Is there a scale for the numbers
presented in the table or graph?
Is there an overall figure or
number (N) presented upon
which the table or graph was
calculated?
Are there footnotes?
Are geography, time and social
content clearly expressed in the
table or graph?
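The checklist above can be read as a small metadata record for a table or graph. The following Python sketch is one hypothetical way to express it; the class `TableMetadata`, its field names and the helper `missing_elements` are assumptions made for illustration and do not correspond to any existing metadata standard.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class TableMetadata:
    """Hypothetical descriptors for a statistical table or graph."""
    title: Optional[str] = None
    producer: Optional[str] = None
    date: Optional[str] = None
    unit_of_observation: Optional[str] = None
    variables: List[str] = field(default_factory=list)
    statistical_metric: Optional[str] = None
    scale: Optional[str] = None
    base_n: Optional[int] = None
    footnotes: List[str] = field(default_factory=list)
    geography: Optional[str] = None
    time_period: Optional[str] = None


def missing_elements(meta):
    """List the descriptors that have not been filled in."""
    missing = []
    for f in fields(meta):
        value = getattr(meta, f.name)
        if value is None or value == []:
            missing.append(f.name)
    return missing


# A partially described table; the title is invented for illustration.
tuition = TableMetadata(
    title="Average tuition by discipline, academic year and province",
    variables=["Average Tuition", "Discipline", "Academic Year", "Province"],
    statistical_metric="Dollars",
)

print(missing_elements(tuition))
```

Running the check against a table found in an article shows at a glance which questions from the checklist cannot be answered from the table itself.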
Summary



If statistical tables and graphs were described
and indexed by rich metadata, our ability to
locate statistics would be greatly enhanced.
In the absence of such metadata, we use
elements of this metadata structure to search
our existing databases.
The next generation of metadata in the field of
data will work to integrate the description of
both data and statistics.