MBP1010 – Jan. 5, 2010
Today’s Topics
1. Introduction
2. Course Information and Schedule
3. Study Design
4. Looking at Data
Introduction to the Practice of Statistics
Ch. 1, 2.5, 3.2
Meaning from Data
(1) How can we describe and draw meaning
from a collection of data?
(2) How can we infer information about the whole
population when we know data from only some
of the population (a sample)?
What is statistics?
- science of understanding data and making
decisions in the face of variability
and uncertainty
- statistics is NOT a field of mathematics
Statistical Thinking
-humans are good at recognizing patterns and
there is real danger of over-interpreting patterns
that are merely due to the play of chance
(false leads)
- role of statistics - to reject chance as an explanation
so that we can have reasonable assurance that
patterns seen are worthy of interpretation
Statistical Thinking
- explore data prior to analysis
- think about context and design
- reasoning behind standard statistical
methods
Interpretation/Conclusions
Course Overview
1. Looking at data
2. Concepts of statistical inference
and hypothesis testing
3. Specific statistical tests
- 1 and 2 sample tests for continuous
and categorical data
- correlation, regression and ANOVA
4. Other Topics
- eg survival analysis, logistic regression
5. Bioinformatics
Course Information and Schedule
Lectures: Tuesdays 1 to 3 pm
620 University, 7-709
Tutorials: Thursdays 2 to 3:30 pm
OCI 7-605
R Tutorials: Thurs Jan 7 and 14
(Part 1 and part 2)
Study Design
Can what we eat influence our risk of cancer?
The case of dietary fat and breast cancer
Posted on website: New York Times article
Searching for clarity: A primer on medical studies
What should we do next?
Observational Studies
An observational study observes individuals
and measures variables of interest but does
not attempt to influence the responses.
Observational Studies
Case/control and cohort studies common in
cancer research (epidemiology)
- outcome is binary: cancer/ no cancer
Observational studies often examine factors
associated with continuous outcome variables
- eg association of body weight or diet
with hormone levels
- calcium intake and blood pressure
Case Control Study
[Figure: schematic of subjects marked X and 0, with exposure (eg diet) assessed in each group]
Cohort Study
[Figure: schematic of a cohort of subjects (0) with exposure (eg diet) assessed, then followed for cancer (yes/no); subjects marked X and 0 at follow-up]
Relative Risk
• Compare risk of disease in those with highest versus lowest intake
RR = 1.0
no association
RR = 1.4
1.4 times the risk
40% higher risk
RR = 0.8
20% lower risk
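The arithmetic behind these readings of RR can be sketched in a few lines of Python. The counts below are hypothetical, chosen only so that the results match the RR values on the slide; the function names are illustrative and not from any library.

```python
def relative_risk(cases_high, total_high, cases_low, total_low):
    """Risk in the highest-intake group divided by risk in the lowest-intake group."""
    risk_high = cases_high / total_high
    risk_low = cases_low / total_low
    return risk_high / risk_low


def percent_change(rr):
    """Express a relative risk as a percent higher (+) or lower (-) risk."""
    return (rr - 1.0) * 100.0


# Hypothetical counts: (cases, women) in the highest and lowest intake groups
examples = {
    "no association": (50, 1000, 50, 1000),
    "higher risk": (70, 1000, 50, 1000),
    "lower risk": (40, 1000, 50, 1000),
}

for label, counts in examples.items():
    rr = relative_risk(*counts)
    print(f"{label}: RR = {rr:.1f} ({percent_change(rr):+.0f}% risk)")
```

An RR of 1.4 and a statement of "40% higher risk" are the same comparison expressed two ways.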
a. Total Fat
Case Control:
Challier (1998)
DeStefani (1998)
Ewertz (1990)
Franceschi (1996)
Graham (1982)
Graham (1991)
Hirohata (1985)
Hirohata (1987) (Caucasian)
Hirohata (1987) (Japanese)
Ingram (1991)
Katsouyanni (1988)
Katsouyanni (1994)
Landa (1994)
Lee (1991)
Levi (1993)
Mannisto (1999)
Martin-Moreno (1994)
Miller (1978)
Núñez (1996)
Potischman (1998)
Pryor (1989)
Richardson (1991)
Rohan (1988)
Shun-Zhang (1990)
Toniolo (1989)
Trichopoulou (1995)
van't Veer (1990,1991)
Wakai (2000)
Witte (1997)
Yuan (1995)
Zaridze (1991)
Case Control Summary
Cohort:
Bingham (2003)
Cho (2003)
Gaard (1995)
Graham (1992)
Holmes (1999)
Howe (1991)
Jones (1987)
Knekt (1990)
Kushi (1992)
Thiébaut (2001)
Toniolo (1994)
van den Brandt (1993)
Velie (2000)
Wolk (1998)
Cohort Summary
All Studies Summary
[Figure: forest plot of the studies listed above; x-axis: Odds Ratio or Relative Risk, scale 0 to 5]
Interpretation
Suppose we find that women who eat a low fat
diet tend to have lower risk of breast cancer.
Can we conclude that the fat in the diet is
responsible for the lower risk of breast cancer?
No. Other factors may be responsible
for the association with dietary fat
(confounding)
Problem of Confounding
Suppose A is associated with B:
This may be because:
• A causes B
• B causes A
• X is associated with both A and B
X need not be a cause of either A or B
Problem of Confounding
In our dietary fat example:
- women who eat more dietary fat may differ from
those who eat less fat (eg weight, exercise,
other dietary factors)
- these factors may influence the risk of
breast cancer
Trying to control for confounding
- measure potential confounders
eg. measure weight and physical activity
-“control” for possible confounders in analysis
- but…what about confounding with variables we
don’t know exist or can’t measure?
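One simple way to "control" for a measured confounder is to compare risks within strata of that confounder. The sketch below uses entirely hypothetical counts, constructed so that dietary fat has no effect within either weight stratum, yet the crude (unstratified) comparison suggests a 50% higher risk. The variable and function names are illustrative only.

```python
def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


# Hypothetical data: (cases, women) by diet, within strata of the confounder X (weight).
# Overweight women are more likely to eat high fat AND have higher risk.
strata = {
    "overweight": {"high_fat": (80, 800), "low_fat": (20, 200)},
    "not overweight": {"high_fat": (10, 200), "low_fat": (40, 800)},
}

# Stratum-specific risk ratios: the comparison is made among women of similar weight
stratum_rr = {
    name: risk_ratio(*groups["high_fat"], *groups["low_fat"])
    for name, groups in strata.items()
}

# Crude risk ratio: pool the strata, ignoring weight
crude_high = [sum(g["high_fat"][i] for g in strata.values()) for i in (0, 1)]
crude_low = [sum(g["low_fat"][i] for g in strata.values()) for i in (0, 1)]
crude_rr = risk_ratio(*crude_high, *crude_low)

print(f"crude RR = {crude_rr:.2f}")
for name, rr in stratum_rr.items():
    print(f"RR among {name} women = {rr:.2f}")
```

Stratifying only works for confounders that were measured, which is the limitation raised in the last point above.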
Observational Studies
An observational study observes individuals and
measures variables of interest but does not attempt
to influence the responses.
An association between an explanatory variable and a
response variable, even if it is very strong, is not by itself
good evidence of a cause and effect link between the variables
Correlation is not causation
Randomized Experiments
- impose treatment and observe response
- subjects/animals randomly assigned to treatments
and control
- randomization should result in groups that are
similar with respect to any possible confounding
variables
- difference in outcome must be due to treatment
(OR the play of chance in random assignment)
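Random assignment itself is mechanical. A minimal sketch, assuming 60 animals split equally between treatment and control (the identifiers, group sizes and seed are made up for illustration):

```python
import random


def randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical animal identifiers
rats = [f"rat{i:02d}" for i in range(1, 61)]
treatment, control = randomize(rats, seed=2010)

print("treatment:", sorted(treatment)[:5], "...")
print("control:  ", sorted(control)[:5], "...")
```

Because chance alone decides who gets the treatment, no characteristic of the animals (known or unknown) can systematically favour one group.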
Basic principles of experimental design
1. Formulate question/goal in advance
2. Comparison/control
3. Replication
4. Randomization
5. Stratification (or blocking)
6. Factorial experiments
Replication
Randomized Design
Dietary fat and mammary tumors in Sprague-Dawley rats
(n=30 per diet group)
Diet       % energy from fat   % developed cancer   Time to Cancer (weeks)
Low Fat    5                   50.0                 14.5 ± 1.38
High Fat   26                  76.6                 11.6 ± 0.96
Jackson et al. Nutr.Cancer, 1998
Stratification
• Suppose that some measurements will be made
in males and females
AND
• You anticipate a difference in responses between
males and females
– Randomize within males and females separately
- any systematic difference by sex removed
- this is sometimes called “blocking”.
- Take account of the difference between males and
females in analysis: helps control variability
Randomization and stratification
• If you can (and want to), fix a variable.
– e.g., study only men or women or a single
strain of animal
• If you don’t fix a variable, stratify on it.
– e.g., randomize treatment within men and within women
• If you can neither fix nor stratify a variable,
randomize to treatment.
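Stratified (blocked) randomization can be sketched as "randomize separately within each stratum". The example assumes 10 male and 10 female subjects, which are hypothetical, and the function name is illustrative.

```python
import random
from collections import defaultdict


def stratified_randomize(subjects, stratum_of, seed=None):
    """Randomly assign half of each stratum to treatment and half to control."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for subject in subjects:
        by_stratum[stratum_of(subject)].append(subject)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # randomization happens inside the stratum
        half = len(members) // 2
        for subject in members[:half]:
            assignment[subject] = "treatment"
        for subject in members[half:]:
            assignment[subject] = "control"
    return assignment


# Hypothetical subjects: (sex, id)
subjects = [("M", i) for i in range(10)] + [("F", i) for i in range(10)]
assignment = stratified_randomize(subjects, stratum_of=lambda s: s[0], seed=1)

for sex in ("M", "F"):
    treated = sum(1 for s, arm in assignment.items() if s[0] == sex and arm == "treatment")
    print(f"{sex}: {treated} treatment, {10 - treated} control")
```

Each arm ends up with the same number of males and females, so sex cannot differ systematically between treatment and control.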
Factorial Experiment
Dietary fat and fiber and mammary tumors in Sprague-Dawley rats
(n=30)
Diet                    % developed cancer   Time to Cancer (weeks)
Low fat - high fiber    56.7                 14.2 ± 1.4
Low fat - mid fiber     50.0                 14.5 ± 1.4
Low fat - low fiber     56.7                 13.2 ± 1.4
Mid fat - high fiber    80.0                 13.9 ± 1.2
Mid fat - mid fiber     70.0                 12.7 ± 1.2
Mid fat - low fiber     76.7                 11.6 ± 0.9
High fat - high fiber   60.0                 12.4 ± 1.5
High fat - mid fiber    76.6                 11.6 ± 1.0
High fat - low fiber    86.7                 11.8 ± 0.8
Randomized Clinical Trials in Humans: Dietary Fat and Breast Cancer
• Diet and Breast Cancer Prevention Study
  - 4793 high risk women followed for 7-17 years
    (not yet published)
• Women’s Health Initiative (US)
  - 48,835 postmenopausal women followed for
    8-12 years
  - reported in 2006
Eligible Subjects Identified (> 50% density)
Prerandomization Assessment
Randomized to Intervention (n=2,343) or Control (n=2,350)
Annual Visits
• demo/anthro data
• diet records
• non fasting serum
Follow up until Dec 2005 (7-17 years per subject)
Outcome: breast cancer incidence
Women’s Health Initiative
- Postmenopausal women (50-79 years
of age)
- n=48,835; follow-up 8-12 years
- randomized 40:60 intervention and
control
- group dietary counselling
- follow up for breast cancer
Kaplan-Meier Estimates of the Cumulative Hazard for Invasive Breast Cancer
Prentice, R. L. et al. JAMA 2006;295:629-642.
Randomized Clinical Trials in Humans
Practical Issues:
- long (particularly for cancer outcomes!)
- expensive
- limited in “treatment” options
Randomized Clinical Trials in Humans
Other issues:
- highly selected subjects
- selection criteria and motivation
- subject/investigator blinding
- subjects drop out
-compliance?
Main Points
- primary interest is causal relationships
between variables
- observational studies show associations only
- randomized studies best for causation but are
not without challenges
- totality of evidence important
What’s in the dataset?
What are the observations (individuals)?
Eg people, animals, cells, countries
How many observations are in the
dataset?
How many observations should there be?
Are the observations independent?
- repeated in an individual?
What’s in the dataset?
What are the variables?
What is their exact definition?
How were they measured?
What are the units of measurement?
What type of variables?
Main Types of Variables
Categorical:
- include nominal and dichotomous variables
- qualitative difference between values
- eg sex (male/female), smoker/non smoker
Continuous:
- quantitative
- equal distance between each value
- eg blood pressure, age, dietary fat
Ordinal variables can be ordered but they do not have
specific numeric values, eg scales, ratings
Continuous Variables
Stem and Leaf Plots
- displays distribution of small/moderate amounts of data
- includes the actual numerical values
Example data: Blood pressure data in 21 patients
107 110 123 129 112 111 107 112 136 102
123 109 112 102 98 114 119 112 110 117 130
Stem (all but last digit) : Leaf (last digit)
 9 : 8
10 : 22779
11 : 0012222479
12 : 339
13 : 06
Stem and Leaf Plot
Blood Pressure Data:
107 110 123 129 112 111 107 112 136 102
123 109 112 102 98 114 119 112 110 117 130
Stem
 9 :
10 :
11 :
12 :
13 :
Add leaves
 9 : 8
10 : 7 7 2 9 2
11 : 0 2 1 2 2 4 9 2 0 7
12 : 3 9 3
13 : 6 0
Order leaves
 9 : 8
10 : 2 2 7 7 9
11 : 0 0 1 2 2 2 2 4 7 9
12 : 3 3 9
13 : 0 6
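The three steps (write the stems, add the leaves, order the leaves) can be sketched in Python as follows. This is a minimal illustration for whole-number data with a one-digit leaf; the function name is made up for the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem and leaf plot (stem = all but last digit)."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):  # sorting first gives ordered leaves
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)

    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>2} : {leaves}")
    return lines


blood_pressure = [107, 110, 123, 129, 112, 111, 107, 112, 136, 102,
                  123, 109, 112, 102, 98, 114, 119, 112, 110, 117, 130]

plot_lines = stem_and_leaf(blood_pressure)
print("\n".join(plot_lines))
```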
Frequency Histograms
- like a stem plot but leaves (individual data points) are not
distinguished
- usually plotted horizontally
How to make a histogram?
1. Divide data into classes of equal width.
2. Count the number in each class.
3. Plot bars with heights proportional to number or
percent of data points in each interval.
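Steps 1 and 2 above amount to counting how many values fall into each class. A minimal sketch using the blood pressure data, with class widths of 10 and 20 chosen purely for illustration (the starting points 90 and 80 are also assumptions):

```python
def class_counts(values, start, width):
    """Count the values in classes [start, start+width), [start+width, ...), etc."""
    counts = {}
    for value in values:
        lower = start + ((value - start) // width) * width  # lower edge of the class
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


blood_pressure = [107, 110, 123, 129, 112, 111, 107, 112, 136, 102,
                  123, 109, 112, 102, 98, 114, 119, 112, 110, 117, 130]

narrow = class_counts(blood_pressure, start=90, width=10)
wide = class_counts(blood_pressure, start=80, width=20)

# Step 3: draw bars with length proportional to the count in each class
for lower, count in narrow.items():
    print(f"{lower:>3}-{lower + 9:<3} {'#' * count}")
```

With a width of 10 the counts are the same as the number of leaves on each stem of the stem and leaf plot; the wider classes hide some of that detail.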
Similarity of Histogram and Stem Leaf Plot
Blood Pressure Data: n= 21 measurements
Effect of Using Different Intervals
Blood Pressure Data: n= 21 measurements
Describing Distributions with Numbers
Blood Pressure Data: n= 21 measurements
98 102 102 107 107 109 110 110 111 112 112
112 112 114 117 119 123 123 129 130 136
mean = 2395/21 = 114
median = observation 11 = 112
Mean versus Median - skewed data
2 8 15 3 29 5 8 1 20 17 6 5 31 44 10 12 23 62
Stem Plot
0 : 12355688
1 : 0257
2 : 039
3 : 1
4 : 4
5 :
6 : 2
Mean = 16.7
Median = 11
Remove highest observation (62): mean = 14.1
median = 10
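The figures on the last two slides can be checked with Python's standard `statistics` module. The sketch below is only a verification of the arithmetic, using the data exactly as listed on the slides.

```python
import statistics

blood_pressure = [98, 102, 102, 107, 107, 109, 110, 110, 111, 112, 112,
                  112, 112, 114, 117, 119, 123, 123, 129, 130, 136]
skewed = [2, 8, 15, 3, 29, 5, 8, 1, 20, 17, 6, 5, 31, 44, 10, 12, 23, 62]

# Roughly symmetric data: mean and median are close
bp_mean = statistics.mean(blood_pressure)
bp_median = statistics.median(blood_pressure)

# Skewed data: the mean is pulled toward the long right tail
skewed_mean = statistics.mean(skewed)
skewed_median = statistics.median(skewed)

# Removing the highest observation moves the mean more than the median
trimmed = sorted(skewed)[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(f"BP:        mean = {bp_mean:.0f}, median = {bp_median}")
print(f"skewed:    mean = {skewed_mean:.1f}, median = {skewed_median}")
print(f"62 removed: mean = {trimmed_mean:.1f}, median = {trimmed_median}")
```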
BP data; n = 10
100 102 104 105 106 112 114 115 116 125
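Min = 100
Q1 = 104
Median = 109
Q3 = 115
Max = 125
[Figure: boxplot. The box runs from the 25% quantile to the 75% quantile (its length is the IQR), with a line at the median; limits are drawn 1.5 x IQR above and below the box.]
Observations above or below these limits are considered potential outliers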
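A sketch of the five-number summary and the 1.5 x IQR rule for the BP data (n = 10). Quartiles are computed here as the medians of the lower and upper halves of the sorted data, which is an assumption about the method: it reproduces the quartiles on the slide, but statistical software often uses other quartile definitions and may give slightly different values.

```python
import statistics


def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles as medians of each half."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values, leave the middle value out of both halves
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])


bp = [100, 102, 104, 105, 106, 112, 114, 115, 116, 125]

minimum, q1, median, q3, maximum = five_number_summary(bp)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
potential_outliers = [x for x in bp if x < lower_limit or x > upper_limit]

print("five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr, "limits:", lower_limit, upper_limit)
print("potential outliers:", potential_outliers)
```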
Measures of Spread
- range of data set: largest - smallest value
- interquartile range (IQR): 3rd minus 1st quartile
- sample variance and standard deviation
Deviation from the Mean
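The sample variance is built from deviations from the mean: square each deviation, add them up, and divide by n - 1. The standard deviation is its square root. A minimal sketch using the BP data (n = 10) from the earlier slide, with the hand calculation checked against Python's `statistics` module:

```python
import math
import statistics

bp = [100, 102, 104, 105, 106, 112, 114, 115, 116, 125]

data_range = max(bp) - min(bp)  # largest - smallest value

mean = sum(bp) / len(bp)
deviations = [x - mean for x in bp]  # deviation of each value from the mean
sum_of_squares = sum(d ** 2 for d in deviations)

sample_variance = sum_of_squares / (len(bp) - 1)  # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)

print(f"range = {data_range}")
print(f"mean = {mean:.1f}")
print(f"sample variance = {sample_variance:.2f}")
print(f"sample SD = {sample_sd:.2f}")

# The standard library gives the same answers
library_variance = statistics.variance(bp)
library_sd = statistics.stdev(bp)
```

The deviations always sum to zero, which is why they are squared before being averaged.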
Extreme Observations or Outliers
- rule of thumb 1.5 x IQR for potential outliers
- observations that stand apart from the overall pattern
(not just extreme values)
- do not automatically delete outliers
- try to explain them
- an error in measurement or in recording data
- an unusual occurrence
- describe outliers, what you do with them and what
their effect is
Energy expenditure in 29 women measured by doubly
labelled water (MJ per day).
Stem Leaf          #  Boxplot
  18 9             1     0
  16
  14
  12 0258          4     |
  10 244579        6  +-----+
   8 1122447839   10  *--+--*
   6 5886689       7  +-----+
   4 6             1     |
     ----+----+----+----+
1.5 x 3.5 (IQR) = 5.25
75th (11.46) + 5.25 = 16.71
Outlier: 19.9 MJ
What did we do about the outlier?
- checked recording/calculations/data entry
- unusual occurrence?
- biologically plausible?
- re-measured laboratory samples
- analysis with and without outlier
- described all above in paper
Data Relationships
Dietary fat intake in the intervention and control groups
(n=150 intervention and 187 control)
[Figure: schematic plots (side-by-side boxplots) of % dietary fat by group (1, 2); y-axis: % Dietary Fat, 5 to 45]
Group          Mean % Dietary Fat (SD)
Intervention   17.5 (5.0)
Control        28.3 (6.2)
[Figure: dot plot of % dietary fat, Intervention and Control]
How to Display Data Badly
H Wainer (1984) How to display data badly.
American Statistician 38(2):137-147
- posted at website
-use of Microsoft Excel and Powerpoint has
resulted in remarkable advances in the field
(of poor data display)
General principles
The aim of good data graphics:
Display data accurately and clearly.
Some rules for displaying data badly:
– Display as little information as possible.
– Obscure what you do show (with chart junk).
– Use pseudo-3d and color gratuitously.
– Make a pie chart (preferably in color and 3d).
– Use a poorly chosen scale.
Pay attention to scale!
Same data, different scale
Displaying data well
• Be accurate and clear.
• Let the data speak.
– Show as much information as possible, taking
care not to obscure the message.
• Science not sales.
– Avoid unnecessary frills — esp. gratuitous 3d.
• In tables, every digit should be meaningful.
Further reading – Data Display
• ER Tufte http://www.edwardtufte.com/tufte/
(1983) The visual display of quantitative information.
(1990) Envisioning information.
(1997) Visual explanations.
• WS Cleveland (1993) Visualizing data. Hobart Press.
• WS Cleveland (1994) The elements of graphing data.
CRC Press.