Transcript Slide 1

Biostatistics I using SPSS
The Master of Science in Clinical Investigation Program
Vanderbilt University Medical Center
Date: September 13, 2005
Instructor: Ayumi Shintani, Ph.D., M.P.H.
Department of Biostatistics, Vanderbilt University
E-mail: [email protected]
1
Graphical Display of Data Part 1
Overview:
3.1 Categorical
3.2 Continuous
3.2.1 Histograms
3.2.2 Stem-&-Leaf Plots
3.2.3 Boxplots
3.2.4 Dotplots
3.2.5 Error bar charts
3.2.6 Error bar charts with lines
3.2.7 Pie-charts
2
Graphical Display of Data Part 2
Overview:
3.2.8 Simple Scatterplot
3.2.8.1 Labeling points
3.2.8.2 Identifying different groups for scatterplot
3.2.8.3 Representing Multiple Points
3.2.9 Scatterplot Matrix
3.2.9.1 Adding lines into scatter plots
3.2.9.2 Overlay plot with Loess Smoothers
3.2.10 Three-dimensional Scatterplot
3
Graphs are pictorial representations of numerical data:
“A picture is worth a thousand t-tests.”
Graphical displays should:
•Easily convey characteristics of the data.
•Present many numbers in a small space.
•Make large datasets coherent.
•Encourage the eye to compare different sections of data.
•Be closely integrated with the statistical and verbal descriptions of the
dataset.
•Be clearly labeled for easy understanding.
4
Mean log dose of sedative and analgesic medications administered
during 24-hour period prior to cognitive assessment *
24-hour Transition     N     Lorazepam dose +/- SD   Fentanyl dose +/- SD   Morphine dose +/- SD   Propofol dose +/- SD
Normal to Normal       97    0.2±4.0                 0.1±2.7                0.2±5.7                0.0
Normal to Delirium     17    0.5±7.3                 0.1±1.5                0.1±3.1                0.2±9.2
Normal to Coma         3     6.3±1.3                 0.4±9.5                0.3±5.8                0.0
Delirium to Normal     62    0.2±4.3                 0.1±3.1                0.2±5.8                0.0
Delirium to Delirium   197   0.5±8.4                 0.2±4.5                0.3±10.5               0.1±5.5
Delirium to Coma       51    1.3±9.1                 0.4±7.3                0.5±13.8               0.0
Coma to Normal         13    0.6±7.2                 0.2±4.5                0.1±2.8                0.2±18.0
Coma to Delirium       89    0.7±10.4                0.3±5.8                0.3±11.7               0.1±4.0
Coma to Coma           167   1.4±14.2                0.4±7.4                0.4±11.5               0.2±12.6
Total                  696
5
[Figure: Error bar chart of mean lorazepam dose (mg) in 24 hours (y-axis 0 to 30) by previous and current cognitive status: Coma (C), Delirium (D), Normal (N). Error bars show 95.0% CI of mean.]
6
3.1 Graphical Display of Categorical Data
In medical papers, categorical data are very rarely graphically
displayed. However, for posters, such graphical displays are typically more
eye-catching than a table.
A histogram graphically displays the frequency distribution of
categorical and continuous data. For categorical data, it is also called
a bar diagram, bar chart, or bar graph.
•The x-axis denotes each value of the categorical variable.
•A vertical bar is drawn for each category. The bar can denote:
• Frequency (number of observations having that categorical value).
• Fraction (proportion of total observations having that categorical value).
• Cumulative Frequency (total number of patients who fall in that
category or in any lower-ordered category).
• Mean (or other summary measure) of another variable for the category.
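The four bar heights listed above can be sketched in a few lines of Python. This is only an illustration outside SPSS: the category names follow the Education variable used later in this lecture, but the data values and variable names are hypothetical.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data (not from Rothman.sav): an ordered categorical variable
# and one continuous measurement per patient.
ORDER = ["Some High School", "High School Grad", "College Grad or above"]
education = ["High School Grad", "Some High School", "High School Grad",
             "College Grad or above", "High School Grad", "Some High School"]
hba1c = [9.0, 11.0, 8.0, 7.0, 10.0, 12.0]

counts = Counter(education)

# Frequency: number of observations in each category.
frequency = [counts[c] for c in ORDER]

# Fraction: proportion of all observations in each category.
total = len(education)
fraction = [n / total for n in frequency]

# Cumulative frequency: observations in this category or any lower one.
cumulative = list(accumulate(frequency))

# Group mean: mean of another (continuous) variable within each category.
group_mean = [
    sum(y for e, y in zip(education, hba1c) if e == c) / counts[c]
    for c in ORDER
]

for row in zip(ORDER, frequency, fraction, cumulative, group_mean):
    print(row)
```

Each printed row gives the category followed by the height its bar would have under each of the four definitions.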
7
How to obtain Histogram in SPSS using Graph Option (1)
In SPSS, open Rothman.sav then go to
Graphs (non-interactive), Bar Charts, then select Simple
8
How to obtain Histogram in SPSS using Graph Option (2):
Frequency
A frequency distribution is shown when each bar gives the number of
observations having that categorical value.
9
SPSS screen shot: Frequency
10
How to obtain Histogram in SPSS using Graph Option (3):
Fraction
Fraction is shown when each bar represents the proportion of total
observations having that categorical value.
11
SPSS screen shot: Fraction
12
How to obtain Histogram in SPSS using Graph Option (4):
Cumulative Frequency
Cumulative frequency is shown when each bar represents the total
number of patients who fall in that category or in any lower-ordered
category.
13
Cumulative Frequency
14
How to obtain Histogram in SPSS using Graph Option (5):
Group Means
Each bar represents the mean of another (continuous) variable for the category
15
Group Means
16
How to obtain Histogram in SPSS using Interactive Graph Option (1):
Group Means
[Figure: Bar chart of Education; bars show counts (y-axis 0 to 60). Categories: 8th degree or less, Some High School, High School Grad, Some College, College Grad or above. Bar labels: n=34, n=42, n=65, n=39, n=12.]
17
Using Interactive graphics:
In SPSS, go to: Graphs, Interactive, Bar, …
Frequency using Interactive Graph Option
18
How to obtain Histogram in SPSS using Interactive Graph Option (2):
Group Means with Error Bars
[Figure: Bar chart of group means with error bars by Education (y-axis 0.0 to 12.0). Categories: 8th degree or less, Some High School, High School Grad, Some College, College Grad or above. Bar labels: n=34, n=42, n=65, n=39, n=12.]
Note: I don’t personally recommend this type of graph.
19
Using Interactive graphics:
In SPSS, go to: Graphs, Interactive, Bar, …
Group Means with Error Bars (1)
20
Group Means with Error Bars (2)
21
3.2 Graphical Displays of Continuous Data
3.2.1 Histograms
A histogram displays the frequency distribution of continuous data.
However, in contrast to categorical data, continuous data
need to be grouped into intervals (bins), and the number of bins
must be chosen, which is subjective.
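To make the grouping step concrete, here is a minimal Python sketch that counts values in equal-width bins. The function name, the ages, and the choice of equal-width bins are assumptions for illustration; they are not how SPSS chooses its intervals.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical ages in years (not taken from Rothman.sav).
ages = [25, 31, 38, 42, 45, 47, 51, 53, 55, 56, 58, 59, 62, 64, 67, 71, 78]

for n_bins in (3, 6, 12):
    print(n_bins, bin_counts(ages, n_bins))
```

The same data give a different picture for each number of bins, which is why the choice is subjective.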
22
How to obtain a Histogram of Continuous Data using
Interactive Graph Option (1): Frequency Distribution
[Figure: Histogram of age (yrs); y-axis Count 0 to 30, x-axis 30 to 80.]
23
In SPSS, read Rothman.sav, go to:
Graphs, Interactive
Histogram
Frequency Distribution for Continuous Data (1)
24
Frequency Distribution for Continuous Data (2)
25
What kinds of things should I look for in a histogram?
1. Look for cases with values very different from the rest.
2. Look at whether the distribution is symmetric (normality).
3. Look for separate clusters of data values. For example, you may
see two clusters, i.e., peaks. One peak may be from male
patients, and the other may be from female patients. In such a situation,
you may want to analyze the data separately for males and females.
26
Editing Histogram (1): Adding normality curve
[Figure: Histogram of age (yrs) with a normal curve added; y-axis Count 0 to 30, x-axis 30 to 80.]
27
In SPSS, read Rothman.sav, go to:
Graphs, Interactive
Select Histogram
Click on Histogram dialog box
Adding Normal Curve to Histogram
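For intuition about what such a curve represents, the sketch below assumes it is a normal density using the sample mean and SD, rescaled to the count axis by multiplying by the number of cases and the bin width. This scaling is an assumption for illustration, not a description of SPSS internals, and the ages are hypothetical.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical ages in years (not taken from Rothman.sav).
ages = [31, 38, 42, 45, 47, 51, 53, 55, 56, 58, 59, 62, 64, 67, 71, 78]
bin_width = 10  # assumed width of each histogram bar, in years

# Normal distribution with the same mean and SD as the sample.
fitted = NormalDist(mean(ages), stdev(ages))


def curve_height(x):
    """Height of the normal curve on the count axis at age x."""
    return len(ages) * bin_width * fitted.pdf(x)


for x in range(30, 90, 10):
    print(x, round(curve_height(x), 2))
```

If the bars sit close to these heights, the data look roughly normal; large gaps suggest skewness or several clusters.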
28
Editing bin size on histogram (1)
In SPSS, after you create a histogram using interactive graphs, double click
on the figure and open Chart Editor. Click Interval Tool.
29
Editing bin size on histogram (2)
NOTE: Without specification, SPSS automatically determines the number of
groups (bins).
30
What will happen if you use a smaller or larger number of bins?
[Figure: Three histograms of age (yrs), with y-axis Count, drawn with #bins=5, #bins=20, and #bins=50.]
Which histogram do you find more useful?
31
Now, consider histograms of age stratified by study arms:
[Figure: Histograms of age (yrs) in two panels, Control and Intervention; y-axis Count 0 to 15, x-axis 30 to 80.]
Important:
Whenever you are interested in comparing a continuous variable between
groups, you must look at the data separately for each group.
32
Histogram of Age Stratified by Status
33
3.2.2 Stem-&-Leaf Plots
A useful way of tabulating the original data and, at the same
time, depicting the general shape of the frequency distribution.
The stem consists of all but the rightmost digit of each value.
The leaf is the rightmost digit.
age (yrs) Stem-and-Leaf Plot
Frequency    Stem &  Leaf
     2.00 Extremes    (=<21)
     3.00        2 .  588
     4.00        3 .  1233
    10.00        3 .  5577888999
    13.00        4 .  0000113333344
    28.00        4 .  5555556666677777778888999999
    30.00        5 .  000000111111122222222333333444
    42.00        5 .  555555566666677777778888889999999999999999
    25.00        6 .  0000111122222233333344444
    14.00        6 .  55566666777778
     9.00        7 .  000112234
    12.00        7 .  555666777889
     1.00 Extremes    (>=87)
Stem width:  10
Each leaf:   1 case(s)
A stem-and-leaf plot, like a
histogram, shows how many
cases have various data values.
A stem-and-leaf plot preserves
more information than a
histogram because it does not
use the same symbol to
represent all cases. Instead,
the symbol depends on the
actual value for a case.
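The construction can be sketched in Python as follows. This is a simplified illustration that assumes non-negative whole-number values and a stem width of 10; unlike the SPSS output above, it does not split each stem into two rows or set aside extremes. The function name and ages are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its string of leaves (last digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 47 -> stem 4, leaf 7
        rows[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in rows.items()}


# Hypothetical ages in years.
ages = [47, 52, 36, 41, 58, 52, 44, 39, 61, 55, 47, 50]

for stem, leaves in stem_and_leaf(ages).items():
    # The frequency of a row is simply the number of leaves in it.
    print(f"{len(leaves):6.2f}  {stem} . {leaves}")
```

Reading a row back is the reverse step: stem 4 with leaves 1477 stands for the ages 41, 44, 47, and 47.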
Question: What are the exact ages of the patients who are 20 years or older and less than 30 years old?
34
In SPSS, go to: Analyze, Descriptive Statistics, Explore
Stem-&-leaf plot of patient’s age.
35
3.2.3 Box Plots / Box-and-Whisker plot
A graphical summary for continuous data using percentiles
Bar charts and histograms are convenient for displaying summary
information about data, but they provide very little information about
anything other than the values of the measure. Box plots are
widely used to summarize data because they simultaneously display the
median, the inter-quartile range, and the smallest and largest values
of the data. A useful application of box plots is to graphically compare the
distribution of a continuous measure across different levels of a
categorical variable.
36
[Figure: Box plots of 12 Month HbA1c (y-axis 5.0 to 15.0) by insulin use at enrollment (Non-User, User) and Study Status (Control, Intervention). Outliers and extreme values are hidden. Annotations: the top of the box is the 75th percentile, the line inside the box is the 50th percentile (median), and the bottom of the box is the 25th percentile. The "whiskers" extend to the largest and smallest observed values within 1.5 box lengths.]
How do you interpret these box plots?
37
Outliers: observed values more than 1.5 box lengths but less than 3 box
lengths from the upper (75th percentile) or lower (25th percentile) edge of the box.
Extreme values: observed values more than 3 box lengths from the upper
(75th percentile) or lower (25th percentile) edge of the box.
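These two definitions can be checked by hand with a short Python sketch. The quartile method used here (`method="inclusive"` in the standard library) is an assumption and may differ slightly from the percentiles SPSS uses for its boxes; the function name and data are hypothetical.

```python
from statistics import quantiles


def classify(values):
    """Label each value as 'inside', 'outlier', or 'extreme' relative to the box."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    labels = []
    for v in values:
        # Distance beyond the nearer box edge (0 if the value is inside the box).
        distance = max(q1 - v, v - q3, 0)
        if distance > 3 * box_length:
            labels.append((v, "extreme"))
        elif distance > 1.5 * box_length:
            # A value exactly 3 box lengths away is treated as an outlier here.
            labels.append((v, "outlier"))
        else:
            labels.append((v, "inside"))
    return labels


# Hypothetical measurements.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
for value, label in classify(data):
    print(value, label)
```

With these made-up numbers the box runs from 3.5 to 8.5, so values beyond 16 are outliers and values beyond 23.5 are extreme values.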
38
How to obtain Box-plot using SPSS (1):
39
How to obtain Box-plot using SPSS (2):
Then click Boxes to go to the next page.
40
How to obtain Box-plot using SPSS (3):
41
What can you tell from box-plot?
•From the median, you can get an idea of the typical value (central
tendency).
•From the length of the box, you can see how much the values vary
(data dispersion).
If the median line is not in the center of the box, you can tell
that the distribution of your data values is not symmetric.
If the median is closer to the bottom of the box than to the
top, there is a tail toward large values (positive skewness).
If the median is closer to the top of the box than to the bottom, there is
a tail toward smaller values (negative skewness).
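The reading rule above amounts to comparing the two halves of the box. A minimal Python sketch, assuming the standard library's inclusive quartile method and hypothetical data:

```python
from statistics import quantiles


def skew_direction(values):
    """Compare the lower and upper halves of the box around the median."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    lower_half = median - q1  # bottom of box to median
    upper_half = q3 - median  # median to top of box
    if lower_half < upper_half:
        return "positive"  # median nearer the bottom: tail toward large values
    if lower_half > upper_half:
        return "negative"  # median nearer the top: tail toward small values
    return "symmetric"


# Hypothetical samples.
print(skew_direction([1, 2, 3, 4, 10, 20, 40]))
print(skew_direction([1, 2, 3, 4, 5]))
```

In real data the two halves are rarely exactly equal, so small differences should be read as roughly symmetric rather than as skewness.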
42
Let’s compare box-plot with other methods.
Using histogram
[Figure: Histograms of 12 Month HbA1c in four panels: Non-User Control, User Control, Non-User Intervention, User Intervention; y-axis Count 0 to 12, x-axis 6.0 to 14.0.]
43
Using bar-chart for mean of 12 month HbA1c
[Figure: Bar chart of mean 12 Month HbA1c (y-axis 2.0 to 10.0) by Study Status (Control, Intervention) and insulin use at enrollment (Non-User, User). Bars show means; error bars show mean +/- 1.0 SD. Bar labels: n=60, n=35, n=60, n=38.]
Let’s discuss the pros and cons of each graphical method.
44
Checking for Normality of Data in SPSS
How do we know if data are normally distributed? SPSS has nice features
for both formal testing and visual diagnosis of normality.
In SPSS, open Rothman.sav and go to:
Analyze, Descriptive Statistics, Explore
Put ranChisq and ranNorm into the Dependent List box.
Click on Plots.
In the Plots dialog box, select Normality plots with tests.
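To see what a Kolmogorov-Smirnov type statistic measures, the sketch below computes the largest gap between the empirical distribution of a sample and a normal distribution with the same mean and SD. It computes only the statistic, not the significance value reported by SPSS. The two simulated samples (one normal, one right-skewed) and all names are assumptions for illustration; they are not the ranNorm and ranChisq variables.

```python
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d


rng = random.Random(0)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(193)]
skewed_sample = [rng.gauss(0, 1) ** 2 for _ in range(193)]  # right-skewed

print("normal:", round(ks_statistic(normal_sample), 3))
print("skewed:", round(ks_statistic(skewed_sample), 3))
```

A larger statistic means the sample departs further from a normal shape, which is the pattern to look for in the Statistic column of the SPSS output.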
45
Checking Normality (1)
46
Checking Normality (2)
47
Checking Normality (3)
48
SPSS Output from Explore : Skewed Data (1)
ranChisq Stem-and-Leaf Plot
Frequency    Stem &  Leaf
    65.00        0 .  00000000000000000011111111111111
    21.00        0 .  2222233333
    26.00        0 .  444444455555
    18.00        0 .  666666777
    14.00        0 .  888999
     7.00        1 .  000&
    10.00        1 .  2333
     4.00        1 .  5&
     4.00        1 .  67
     5.00        1 .  88&
     3.00        2 .  1
     3.00        2 .  3&
     1.00        2 .  &
    12.00 Extremes    (>=2.5)
Stem width:  1.00
Each leaf:   2 case(s)
& denotes fractional leaves.
49
SPSS Output from Explore : Skewed Data (2)
50
SPSS Output from Explore : Skewed Data (3)
Formal Statistical Test for Normality
Tests of Normality for ranChisq
            Kolmogorov-Smirnov (a)         Shapiro-Wilk
            Statistic   df    Sig.         Statistic   df    Sig.
ranChisq    .214        193   .000         .729        193   .000
a. Lilliefors Significance Correction
51
SPSS Output from Explore : Normally Distributed Data (1)
ranNorm Stem-and-Leaf Plot
Frequency    Stem &  Leaf
     2.00       -2 .  55
     3.00       -2 .  223
     5.00       -1 .  57789
    16.00       -1 .  0000011112222233
    30.00       -0 .  555556666677777777888888999999
    34.00       -0 .  0000011111111111112222333334444444
    45.00        0 .  000000000000011111112222222222223333333333444
    26.00        0 .  55555556666777777788888899
    19.00        1 .  0000000000123333444
     7.00        1 .  5566777
     6.00        2 .  011222
Stem width:  1.00
Each leaf:   1 case(s)
Tests of Normality for ranNorm
            Kolmogorov-Smirnov (a)         Shapiro-Wilk
            Statistic   df    Sig.         Statistic   df    Sig.
ranNorm     .040        193   .200 *       .993        193   .440
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
52
SPSS Output from Explore : Normally Distributed Data (2)
53
Data transformation to achieve normality
Many types of laboratory data, specifically concentrations of a
substance or lengths of time, have a skewed distribution.
A transformation, such as taking the logarithm, sometimes makes these
skewed variables approximately normally (Gaussian) distributed.
In SPSS, use the Transform, Compute dialog box to transform the baseline HbA1c value
into log(e) scale. Then compare distributions of un-transformed and transformed data.
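The effect of the transformation can be illustrated outside SPSS with a small Python sketch that compares a simple skewness measure before and after taking natural logs. The values and the helper function are hypothetical; they are not the HbA1c data from Rothman.sav.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Average cubed standardized deviation: 0 for symmetric data, > 0 for a right tail."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])


# Hypothetical right-skewed measurements (all positive, as log requires).
raw = [6.1, 6.5, 6.8, 7.0, 7.2, 7.5, 7.9, 8.4, 9.0, 10.2, 11.8, 14.0]
logged = [math.log(v) for v in raw]  # natural log, i.e. log(e)

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

The log pulls in the long right tail, so the skewness moves toward zero, though it need not reach it.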
[Figure: Two histograms, 12 Month HbA1c (x-axis 6.0 to 14.0) and logHa1c 12 (x-axis 1.80 to 2.60); y-axis Count 5 to 25.]
54
3.2.4 Dotplots
Similar to a stem-&-leaf plot (or a histogram displayed vertically), but
with the data shown as dots.
[Figure: Dotplot of 12 Month HbA1c; x-axis 5.0 to 15.0, y-axis Count 2 to 10. Dots/lines show counts.]
Similar to box plots, dotplots are useful for comparing distributions of
a continuous measure across different levels of a categorical
variable.
55
Dotplots of 12 month HbA1c stratified by Study arm and insulin use:
56
How to obtain dot plot in SPSS (1)
57
How to obtain dot plot in SPSS (2)
58
3.2.5. Error Bar Chart
[Figure: Error bar chart of Baseline HbA1c (y-axis 9.5 to 11.5) by Study Status (Control, Intervention), in two panels, Non-User and User. Error bars show 95.0% CI of mean.]
59
How to obtain Error Bar Chart in SPSS (1)
Read Rothman.sav into SPSS, then go to:
Graphs, Interactive, Error bar..
60
How to obtain Error Bar Chart in SPSS (2)
Select a set of Ha1c as Y-axis variable
Select Status as X-axis variable
Click on Error bars, select Display error bars, OK
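What an error bar for a 95% CI of the mean represents can be sketched in Python. This version uses a normal quantile (about 1.96), which is a simplifying assumption: statistical software typically uses a t critical value, giving slightly wider intervals in small samples. The function name and data are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Return (mean, lower limit, upper limit) using a normal approximation."""
    n = len(values)
    m = mean(values)
    standard_error = stdev(values) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return m, m - z * standard_error, m + z * standard_error


# Hypothetical HbA1c values for one group.
group = [9.8, 10.4, 11.1, 10.0, 12.3, 9.5, 10.9, 11.6]
m, lower, upper = mean_ci(group)
print(f"mean={m:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The dot on the chart corresponds to the mean and the bar runs from the lower to the upper limit.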
61
3.2.6. Error bar chart with line:
62
How to obtain Error Bar Chart with Line in SPSS (1)
63
How to obtain Error Bar Chart with Line in SPSS (2)
64
How to obtain Error Bar Chart with Line in SPSS (3)
65
How to obtain Error Bar Chart with Line in SPSS (4)
66
How to obtain Error Bar Chart with Line in SPSS (5)
67
Editing Error Bar Chart with Lines: Editing Connecting lines (1)
Double click on the error bar chart to open Chart Editor.
In Chart Editor, click on the object you want to edit. Here we want to edit
the lines, so click on the lines, then change the dot and line size.
Click on an error bar. In the error bar dialog box, click on Width to fix the gap
between the connecting lines and the error bars, and move the cluster setting to 10%.
68
Editing Error Bar Chart with Lines: Editing Connecting lines (2)
69
70
3.2.7. Never use Pie charts.
[Figure: Pie chart with slices for VAR00001 values 1.00 to 7.00. Pies show sums of VAR00002.]
Which category (from 1 to 7) do you think is the largest?
71
Redoing the pie chart from the previous page using a bar chart and a line chart.
In SPSS, go to:
Graphs, Interactive, Bar,
[Figure: Bar chart of VAR00002 (y-axis 10.00 to 40.00) by Case (1 to 10). Bars show means.]
72
Creating a bar graph directly from each data point.
73
Redoing the pie chart from the previous page using a line chart.
[Figure: Line chart of VAR00002 (y-axis 0.00 to 40.00) by Case (1 to 10).]
74
Creating a line graph directly from each data point.
In SPSS, go to:
Graphs, Interactive, Bar,
75
3.2.8 Scatterplots
One of the best ways to look for relationships and patterns among multiple
continuous variables.
In the previous lecture, you used a variety of graphical displays to
summarize a single variable. In this lecture, we will learn how to display
the values of two variables on a meaningful scale.
[Figure: Scatterplot in which the circled point represents ID=216, with Baseline HbA1c=21.1% and 12 month HbA1c=13.5%.]
Each point represents a pair of values. One variable is represented by the
x-axis and the other by the y-axis.
76
How to obtain the scatter plot in SPSS (1)
•Read Rothman.sav into SPSS
•To produce a scatterplot of 12 months HbA1c by baseline HbA1c, from
the menus choose:
Graphs, Scatter/Dot...{uses non-interactive mode this time}
•Select simple scatter plot
•Click Define.
77
How to obtain the scatter plot in SPSS (2)
78
What can you tell from the scatterplot?
The points are not randomly scattered over the grid. There
seems to be a pattern.
The points are concentrated in a band from bottom left to top right,
indicating that as the baseline HbA1c value increases, the 12 month value
increases. That is, a straight line might be a reasonable
summary of the data.
You can also determine whether there are cases that have
unusual combinations of values for the two variables. You may
want to validate the observation with ID=216: is it clinically plausible to
have Baseline HbA1c=21.1% with 12 month HbA1c=13.5%?
79
3.2.8.1 Labeling the Points
80
How to label a point in a scatter plot (1)
In order to add a label for the observed value, as shown on the next page:
In the Simple Scatterplot dialog box,
select 12 Month HbA1c as the y variable and Baseline HbA1c
as the x variable.
Additionally, set ID under “case labeled by”.
Click OK.
81
How to label a point in a scatter plot (2)
Double click on the scatterplot to open Chart Editor.
In Chart Editor, click on [toolbar icon],
then click on the point for which you want to
show the ID number.
82
3.2.8.2. Identifying different groups for scatterplot.
83
How to identify different groups for scatterplot
To identify points by study arm, select STATUS for Set Markers by, as shown below.
84
3.2.8.3. Representing Multiple Points
85
How to represent multiple points in scatter plot.
In the Chart Editor, double-click on any point in the figure.
In the Properties dialog box, click the Point Bins tab.
Under Display At, select Bins.
Under Count Indicator, select Marker Size.
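The idea behind sizing markers by count can be sketched in Python: identical (x, y) pairs are tallied, and the tally drives the marker size. This simplified version groups only exactly equal pairs, whereas a charting tool may also group nearby points into the same bin. The data and the size formula are hypothetical.

```python
from collections import Counter

# Hypothetical (baseline, 12 month) value pairs; several patients share a pair.
points = [(8.0, 7.5), (8.0, 7.5), (9.5, 8.0), (8.0, 7.5), (9.5, 8.0), (12.0, 10.5)]

# Number of cases that fall on each distinct location.
bins = Counter(points)

for (x, y), count in sorted(bins.items()):
    marker_size = 4 + 2 * count  # illustrative scaling: more cases, bigger marker
    print(f"x={x}, y={y}, cases={count}, marker size={marker_size}")
```

Without this step, three patients at the same location would be drawn as a single dot and would look like one case.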
86
3.2.9. Scatterplot Matrices.
So far, we have looked at the relationship between two variables.
What if you want to see how these variables relate to another
variable? A scatterplot matrix is a display that contains
scatterplots for all possible pairs of variables.
Is there any way to help understand the relationship between two
variables?
87
How to obtain scatterplot matrices.
88
89
3.2.9.1. Adding Lowess smoother to scatterplot
90
How to add Lowess smoother to scatterplot (1)
Read Rothman.sav into SPSS
Follow the instructions for scatterplots.
After you create scatterplot matrices:
* Activate the graph by double-clicking on it.
* Highlight all points in the Chart Editor.
* Click the Add fit line tool, click on fit line, then choose
LOESS with % of points to fit =50
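For readers curious about what a smoother of this kind does with "% of points to fit", here is a minimal Python sketch of one common form: at each x, a straight line is fitted to the nearest 50% of points, with closer points weighted more heavily. The tricube weights and the absence of robustness iterations are assumptions; the source does not state how SPSS implements its LOESS fit, and the data are hypothetical.

```python
def loess(xs, ys, fraction=0.5):
    """Smooth ys at each x using a locally weighted straight-line fit."""
    n = len(xs)
    k = max(2, int(round(fraction * n)))  # points used in each local fit
    smoothed = []
    for x0 in xs:
        # Keep the k points nearest to x0.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        span = max(abs(xs[i] - x0) for i in nearest) or 1.0
        # Tricube weights: near points count most, the farthest counts zero.
        w = {i: (1 - (abs(xs[i] - x0) / span) ** 3) ** 3 for i in nearest}
        sw = sum(w.values())
        mx = sum(w[i] * xs[i] for i in nearest) / sw
        my = sum(w[i] * ys[i] for i in nearest) / sw
        sxx = sum(w[i] * (xs[i] - mx) ** 2 for i in nearest)
        sxy = sum(w[i] * (xs[i] - mx) * (ys[i] - my) for i in nearest)
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted line at x0.
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Hypothetical data: a curved trend with some scatter.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9, 7.2, 7.1, 7.4, 7.3]
for x, s in zip(xs, loess(xs, ys, fraction=0.5)):
    print(x, round(s, 2))
```

A smaller percentage follows the data more closely and gives a wigglier curve; a larger percentage gives a smoother one.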
91
How to add Lowess smoother to scatterplot (2)
92
A scatterplot matrix of 12 month HbA1c, 12 month systolic blood pressure,
age, baseline BMI has the same number of rows and columns as there
are variables. In this example, you see 5 rows and 5 columns. Each cell
of the matrix, except for cells on the diagonal, is a plot of a pair of
variables.
What’s the easiest way to read a scatterplot matrix?
Try to scan across an entire row or column. For example, in the
figure on the previous page, you will see that the 12 month HbA1c value
correlates with the 6 month value but not much with the baseline value.
Plots that mirror each other across the diagonal are in fact the same
plot with the axes swapped, so you may want to ignore one
of them.
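The layout of a scatterplot matrix can be enumerated in Python. The sketch below uses four of the variable names mentioned above purely as labels; it lists every off-diagonal cell and then the distinct pairs that remain once mirror-image plots are ignored.

```python
from itertools import combinations, product

# Variable labels used only for illustration.
variables = ["12 month HbA1c", "12 month systolic BP", "age", "baseline BMI"]

# Every off-diagonal cell: the row variable is plotted against the column variable.
cells = [(row, col) for row, col in product(variables, repeat=2) if row != col]

# Distinct pairs: one plot per pair, ignoring which variable is on which axis.
unique_pairs = list(combinations(variables, 2))

print("off-diagonal cells:", len(cells))
print("distinct pairs:", len(unique_pairs))
for a, b in unique_pairs:
    print(f"  {a}  vs  {b}")
```

With k variables there are k × (k − 1) off-diagonal cells but only half as many distinct pairs, which is why one triangle of the matrix can be ignored.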
93
3.2.9.2. Overlay Plots
The non-interactive option does not work well for this, so use interactive graphs.
[Figure: Overlay scatterplot of Baseline HbA1c (y-axis 8.0 to 20.0) against 12 Month HbA1c (x-axis 5.0 to 15.0), with separate markers and LLR Smoother fit lines for each Study Status (Control, Intervention).]
94
How to overlay 2 scatter plots (1)
In SPSS, go to: Graph, Interactive,
Scatter…
In the Scatterplot dialog box,
open the “Fit” dialog box by clicking the menu.
Enter 5 into each bandwidth field.
Choose Subgroup under “Fit lines for”.
95
How to overlay 2 scatter plots (2)
96
3.2.10. Three dimensional Scatter Plots
The non-interactive option does not work well for this, so use interactive graphs.
97
How to create three dimensional scatter plots
In SPSS, go to: Graph, Interactive,
Scatter…
In the Scatterplot dialog box,
select 3-D coordinate, which will give you an option to add the third coordinate.
98
Compare the figures below. You may realize that it is very hard to
understand the relationship between variables from the three-dimensional figure.
You may instead want to show each pairwise relationship to describe the
dynamic relationship.
I recommend that you never use three-dimensional graphs. Use scatterplot matrices
instead.
99
Example from a real practice: (Before paper revision)
The prevalence of coronary-artery calcification among patients
with rheumatoid arthritis and control subjects, according to age.
[Figure: Three bar-chart panels, one per age group (< 50 years, 50-59 years, >=60 years). Each panel shows Percentage (y-axis 0 to 90) for Control subjects, Early RA, and Established RA, with separate bars for Agatston score = 0, Agatston score = 1-109, and Agatston score >109. Each bar is labeled with its count fraction, for example 29/35.]
100
Example from a real practice: (After paper revision)
The prevalence of coronary-artery calcification among patients
with rheumatoid arthritis and control subjects, according to age.
[Figure: Prevalence of coronary-artery calcification (%) (y-axis 0 to 90) by Age (<50 years, 50-59 years, >60 years) for Controls, Early RA, and Established RA.]
There was a significant interaction between age and disease status (P-value for
interaction <0.05). For age < 50 years and 50-59 years, the prevalence of
coronary calcification was increased in patients with established RA compared to
controls (both P<0.05), but this was not significant in subjects > 60 years.
101