SPSS Tutorial 101: Import, Merge and Save Data Sets

SPSS 201: Using SPSS to Perform Commonly Used Statistical Testing in Medical Research (Workshop)
Dr. Daisy Dai
Department of Medical Research
1
Who are biostatisticians?

Ashley Sherman
Phone: 816-701-1347
Email: [email protected]

Daisy Dai
Phone: 816-701-5233
Email: [email protected]

Consultation
Experimental design and sampling plan
Collaboration in presentation and publication of studies
Education
Research
2
Statistical Courses




SPSS 201: Using SPSS to perform statistical tests I
SPSS 202: Using SPSS to perform statistical tests II
SPSS 203: Summarize data with tables and graphs
SPSS 204: Using SPSS to manage data

STA 101: Properly Setting up and Designing a Clinical Research Study Including Power Analysis for Proper Patient Numbers (July 16th)
STA 102: Commonly Used Statistical Tests in Medical Research - Part I
STA 103: Commonly Used Statistical Tests in Medical Research - Part II
3
Contents




Review statistical tools (1 hour)
Introduce SPSS (30 minutes)
Practice (1 hour)
Questions and discussions (30 minutes)
4
Statistical tools
5
Medical Research

Clinical Trials



Intervention or therapeutic
Preventative
Retrospective Studies
6
Data









Medical data
Physics data
Chemistry data
Education
Economics
Social studies
Sensory
Nutrition
Many more…

Continuous variable



Interval variable
Ordinal variable
Categorical variable



Binary variable
Discrete variable
Ordinal variable
7
Information Collections
1. Historical Data
   Pro: Convenient; saves a lot of work
   Con: Outdated; different objectives and designs; unknown detailed information
2. Census
   Pro: Reliable, accurate and comprehensive (e.g. population census)
   Con: Time consuming; requires more resources; difficult to investigate all subjects in the population
3. Sampling
   Pro: Efficient; less risky; exploratory; informative
   Caveats: Selection bias; misinterpretation; design flaw
8
Statistics

Descriptive Statistics
Methods to organize and summarize information
Mean, median, max, min, frequency and proportions, etc. that summarize sample demographics

Inferential Statistics
Methods to draw conclusions about a population based on information obtained from a sample of the population
9
[Diagram: Population -> (Sampling Plan) -> Sample -> (Descriptive Statistics); Sample -> (Inferential Statistics) -> Conclusion about the Population]
10
Summary Statistics

Measures of Center
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: the middle value in its ordered list
Mode: the most frequently occurring value

Measures of variation
Range: the difference between the largest and smallest value in the data set, i.e., Range = Max - Min.
Standard deviation: measures variation by indicating how far, on average, the observations are from the mean.
$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n}x_i^2 - \left(\sum_{i=1}^{n}x_i\right)^2/n}{n-1}}$

We will talk more about data summary and distribution graphs in SPSS 204 Workshop.
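The summary statistics on this slide can be checked by hand or with a few lines of code. The following is a minimal Python sketch, not part of the SPSS workflow itself; it uses four exam grades (88, 75, 95, 100) purely as illustrative data and confirms that the definitional and computational forms of the standard deviation agree. The variable names are assumptions made for this example.

```python
import math
import statistics

# Illustrative data only: four exam grades.
grades = [88, 75, 95, 100]

n = len(grades)
mean = sum(grades) / n                      # x-bar = (1/n) * sum(x_i)
median = statistics.median(grades)          # middle value of the ordered list
data_range = max(grades) - min(grades)      # Range = Max - Min

# Standard deviation, definitional form: squared deviations from the mean.
s_def = math.sqrt(sum((x - mean) ** 2 for x in grades) / (n - 1))

# Standard deviation, computational form: sums of x and x squared only.
s_comp = math.sqrt((sum(x * x for x in grades) - sum(grades) ** 2 / n) / (n - 1))

# The mode is only informative when a value repeats; small separate example.
mode_example = statistics.mode([2, 3, 3, 5])

print(f"mean={mean}, median={median}, range={data_range}")
print(f"s (definitional)={s_def:.4f}, s (computational)={s_comp:.4f}")
print(f"mode of [2, 3, 3, 5] = {mode_example}")
```

Both forms of s give the same number; the computational form is convenient when only the sums are available.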
11
Exercise: For each of the following case studies, determine which measure of center (mean, median or mode) is most appropriate.

A student takes four exams in a biology class. His grades are 88, 75, 95, and 100.

The National Association of REALTORS publishes data on resale prices of U.S. homes.

In the 2003 Boston Marathon, there were two categories of official finishers: male and female, of which there were 10,737 and 6,309, respectively.
12
Statistical Testing Procedures
1. Clarify study objectives.
2. Establish hypotheses.
3. Determine the outcome variables, treatment groups, risk factors and covariates.
4. Perform appropriate statistical testing.
5. Interpret results.
13
Statistical Testing Procedures
1. Null Hypothesis
   - Ho: Mean_Treatment = Mean_Control
2. Alternative Hypothesis
   - Ha: Mean_Treatment ≠ Mean_Control (Two-sided Test)
   - Ha: Mean_Treatment > Mean_Control (One-sided Test)
   - Ha: Mean_Treatment < Mean_Control (One-sided Test)
3. Calculate statistics
4. Make Inference
   - If P-value ≥ 0.05, do not reject Ho (the data do not provide sufficient evidence against it)
   - If P-value < 0.05, reject Ho in favor of Ha
14
Continuous Variables
Two or multiple treatment groups
15
Two samples t-test
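Compare the means of a normally distributed interval dependent variable for two independent groups.
null hypothesis: $H_0: \mu_1 = \mu_2$
alt. hypothesis: $H_A: \mu_1 \neq \mu_2$
test statistic: $t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
decision rule: reject $H_0$ if $|t| > t_{\alpha/2,\,N-2}$, where $N = n_1 + n_2$ and $s_p^2$ is the pooled variance of the two groups.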
16
Case Study: FEV1 Changes
A new compound, ABC123, is being developed for long-term treatment of patients with chronic asthma. Asthma patients were enrolled in a double-blind study and randomized to receive daily oral ABC123 or a placebo for 6 weeks.
[Diagram: asthmatic patients randomized to Placebo or Test; outcome: FEV1 after 6-week treatment]
17
FEV1 Data
Test Group                          Placebo Group
Patient ID  Baseline  Week 6        Patient ID  Baseline  Week 6
101         1.35      n/a           102         3.01      3.9
103         3.22      3.55          104         2.24      3.01
106         2.78      3.15          105         2.25      2.47
108         2.45      2.3           107         1.65      1.99
109         1.84      2.37          11          1.95      n/a
110         2.81      3.2           112         3.05      3.26
113         1.9       2.65          114         2.5       2.55
116         3         3.96          115         1.6       2.2
118         2.25      2.97          117         0.77      2.56
120         2.86      2.28          119         2.06      2.9
121         1.56      2.67          122         1.71      n/a
124         2.66      3.76          123         3.54      2.92
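To make the two-sample t-test concrete, here is a minimal Python sketch that applies the pooled-variance formula to the week-6 FEV1 values in the table above, leaving out the patients whose week-6 value is n/a. It is an illustration under stated assumptions rather than the SPSS procedure: the function name `pooled_t` is hypothetical, and the critical value 2.093 is the two-sided 5% value for 19 degrees of freedom taken from a standard t table.

```python
import math
import statistics

# Week-6 FEV1 values transcribed from the FEV1 Data table ("n/a" rows omitted).
test = [3.55, 3.15, 2.3, 2.37, 3.2, 2.65, 3.96, 2.97, 2.28, 2.67, 3.76]
placebo = [3.9, 3.01, 2.47, 1.99, 3.26, 2.55, 2.2, 2.56, 2.9, 2.92]


def pooled_t(a, b):
    """Return (t, df) for the equal-variance two-sample t-test."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)) / df
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df


t_stat, df = pooled_t(test, placebo)
T_CRIT_19 = 2.093  # two-sided 5% critical value for 19 df (standard t table)

print(f"t = {t_stat:.3f} on {df} df")
print("reject H0" if abs(t_stat) > T_CRIT_19 else "do not reject H0")
```

SPSS reports a P-value instead of comparing against a tabulated critical value; the two approaches lead to the same decision at the 5% level.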
18
What is the difference between std (standard deviation) and std error (standard error)?
[SPSS output; callouts mark the P-value]
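As a hedged illustration of the question above: the standard deviation describes the spread of the individual observations, while the standard error of the mean describes the uncertainty of the sample mean and equals the standard deviation divided by the square root of n. The sketch below uses the test group's week-6 FEV1 values; the variable names are assumptions for this example.

```python
import math
import statistics

# Week-6 FEV1 values for the test group ("n/a" row omitted).
fev1_week6 = [3.55, 3.15, 2.3, 2.37, 3.2, 2.65, 3.96, 2.97, 2.28, 2.67, 3.76]

n = len(fev1_week6)
sd = statistics.stdev(fev1_week6)   # spread of individual observations
se = sd / math.sqrt(n)              # uncertainty of the sample mean

print(f"n = {n}, std = {sd:.3f}, std error = {se:.3f}")
```

The standard error shrinks as the sample grows, whereas the standard deviation settles toward the population value.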
19
Mean and Error Bar
Conclusion: Compared to placebo, the new drug did not show a statistically significant effect on FEV1.
20
Paired t-test
Compare the means of a normally distributed interval dependent variable for two related groups.

Test Group
Patient ID  Baseline  Week 6
101         1.35      n/a
103         3.22      3.55
106         2.78      3.15
108         2.45      2.3
109         1.84      2.37
110         2.81      3.2
113         1.9       2.65
116         3         3.96
118         2.25      2.97
120         2.86      2.28
121         1.56      2.67
124         2.66      3.76
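The paired t-test works on the within-patient differences. The following Python sketch, offered as an illustration rather than the SPSS procedure, computes t as the mean difference divided by its standard error for the test group above, omitting patient 101 whose week-6 value is n/a. The critical value 2.228 is the two-sided 5% value for 10 degrees of freedom from a standard t table.

```python
import math
import statistics

# Test group, patients with both measurements (patient 101 omitted: week 6 is n/a).
baseline = [3.22, 2.78, 2.45, 1.84, 2.81, 1.9, 3.0, 2.25, 2.86, 1.56, 2.66]
week6 = [3.55, 3.15, 2.3, 2.37, 3.2, 2.65, 3.96, 2.97, 2.28, 2.67, 3.76]

# Within-patient change: week 6 minus baseline.
diffs = [w - b for b, w in zip(baseline, week6)]

n = len(diffs)
df = n - 1
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
T_CRIT_10 = 2.228  # two-sided 5% critical value for 10 df (standard t table)

print(f"mean change = {statistics.mean(diffs):.3f}, t = {t_stat:.3f} on {df} df")
print("reject H0" if abs(t_stat) > T_CRIT_10 else "do not reject H0")
```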
21
Conclusion: For subjects on the new drug, FEV1 at week 6 is significantly higher than baseline.
[SPSS output; callout marks the P-value]
22
One-way ANOVA
Test for differences of the means for continuous variables in multiple independent treatment groups.
null hypothesis: $H_0: \mu_1 = \mu_2 = \dots = \mu_k$
alt. hypothesis: $H_A$: not $H_0$
test statistic: $F = \frac{MSG}{MSE}$
decision rule: reject $H_0$ if $F > F^{k-1}_{N-k}(\alpha)$
23
Case Study: HAM-A Scores in GAD
A new serotonin-uptake inhibiting agent, SN-X95, is being studied in subjects with generalized anxiety disorder (GAD). Fifty-two subjects diagnosed with GAD were enrolled and randomly assigned to one of three treatment groups: 25mg SN-X95, 100mg SN-X95 or placebo. After 10 weeks of once-daily oral dosing in a double-blind fashion, a test based on the Hamilton Rating Scale for Anxiety (HAM-A) was administered. This test consists of 14 anxiety-related items (e.g. ‘anxious mood’, ‘tension’, ‘insomnia’, ‘fear’, etc.), each rated by the subject as ‘not present’, ‘mild’, ‘moderate’, ‘severe’, or ‘very severe’. HAM-A test scores were found by summing the coded values of all 14 items using the numeric coding scheme of 0 for “not present”, 1 for …. Are there any differences in mean HAM-A test score among the three groups?
[Diagram: patients with GAD randomized to Placebo, 25mg SN-X95 or 100mg SN-X95; outcome: HAM-A score after 10-week treatment]
24
Data
Lo-Dose  Hi-Dose  Placebo
21       16       22
18       21       26
19       31       29
99       25       19
28       23       99
22       25       33
30       18       37
27       20       25
28       18       28
19       16       26
23       24       99
22       22       31
20       21       27
19       16       30
26       33       25
35       21       22
99       17       36
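The F statistic for the HAM-A data can be sketched in a few lines of Python. This is an illustration under two assumptions that are not stated on the slide: the flattened data are read row by row as Lo-Dose, Hi-Dose, Placebo, and the value 99 is treated as a missing-value code (a HAM-A total cannot exceed 56 with 14 items scored 0 to 4). The function name `one_way_anova` is hypothetical.

```python
import statistics

MISSING = 99  # assumption: 99 is a missing-value code, not a real score

lo = [21, 18, 19, 99, 28, 22, 30, 27, 28, 19, 23, 22, 20, 19, 26, 35, 99]
hi = [16, 21, 31, 25, 23, 25, 18, 20, 18, 16, 24, 22, 21, 16, 33, 21, 17]
placebo = [22, 26, 29, 19, 99, 33, 37, 25, 28, 26, 99, 31, 27, 30, 25, 22, 36]

# Drop the missing-value code before analysis.
groups = [[x for x in g if x != MISSING] for g in (lo, hi, placebo)]


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    k, n_total = len(groups), len(all_values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    msg = ss_between / (k - 1)        # mean square between groups (MSG)
    mse = ss_within / (n_total - k)   # mean square error (MSE)
    return msg / mse, k - 1, n_total - k


f_stat, df1, df2 = one_way_anova(groups)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) df")
```

With these assumptions the F statistic is well above the tabulated 5% critical value for (2, 44) degrees of freedom, which is a little over 3.2.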
25
[SPSS output; callout marks the P-value]
26
Mean and Error Bar
Conclusion: There is a significant difference in mean HAM-A score among the three treatment groups at the 5% significance level.
27
Categorical Variables
Two or multiple treatment groups
28
Fisher’s Exact Test
A conservative non-parametric test about a relationship between two categorical variables.

           Responders  Non-responders  Total
Group 1    N11         N12             N11+N12
Group 2    N21         N22             N21+N22
Combined   N11+N21     N12+N22         N
29
Case Study: CHF Incidence in CABG after ARA
A new adenosine-releasing agent (ARA), thought to reduce side effects in patients undergoing coronary artery bypass surgery (CABG), was studied in a pilot trial.

           CHF       No CHF   Total
ARA        2 (6%)    33       35
Placebo    5 (25%)   15       20
Combined   7         48       55
Fisher’s exact test: p=0.0455
30
Chi-square test
Test a relationship between two categorical variables. The chi-square test assumes that the expected value for each cell is five or higher.
null hypothesis: $H_0: p_1 = p_2$
alt. hypothesis: $H_A: p_1 \neq p_2$
test statistic: $\chi^2 = \frac{NUM^2}{DEN}$
decision rule: reject $H_0$ if $\chi^2 > \chi^2_1(\alpha)$
31
Case Study: ADR Frequency with Antibiotic Treatment
A study was conducted to monitor the incidence of GI adverse drug reactions of a new antibiotic used in lower respiratory tract infections.

                        Responders  Non-responders  Total
Test (new antibiotic)   22 (33%)    44              66
Control (erythromycin)  28 (54%)    24              52
Combined                50 (42%)    68              118
Chi-square test: p=0.0252; Fisher’s exact test: p=0.0385
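The chi-square result for this table can be reproduced approximately with a short Python sketch. It uses the standard Pearson formula for a 2x2 table without continuity correction, which is an assumption about how the quoted value was obtained, and it relies on the fact that the upper tail of a chi-square distribution with 1 degree of freedom can be written with the complementary error function. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Counts from the ADR table: 22 of 66 on the new antibiotic, 28 of 52 on erythromycin.
chi2, p = chi_square_2x2(22, 44, 28, 24)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

The printed p-value can be compared with the chi-square value quoted above.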
32
Other tests







One-way repeated measures ANOVA
Repeated measures logistic regression
Factorial ANOVA
Friedman test
Factorial logistic regression
Simple Linear Regression
Multiple Regression
Factor analysis
Multiple logistic regression
Discriminant analysis
One-way MANOVA
Multivariate multiple regression
Canonical correlation
Analysis of covariance

We will cover all tests including non-parametric tests in SPSS 202 Workshop.
33
Questions?
34
Introduction to SPSS
35
What is SPSS?



Statistical software.
16 server licenses.
SPSS 18.
36
SPSS Data Entry

SPSS data can be entered manually.
The format is ready for analysis.
SAS, Excel, txt, etc. data can be easily imported to SPSS.
SPSS data files are saved as “SPSS data document (.sav)”.
SPSS output files are saved as “SPSS viewer document (.spv)”.
37
SPSS Data Entry

SPSS has a few unique features in data entry.
- Categorical variables need to be coded. For instance, code male as 1 and female as 0 or vice versa.
- When you have two treatments, test and control, please use 1 for test and 0 for control.
- Categorical variables that are not coded in other sourced data files will not be imported or analyzed properly in SPSS.
- Continuous variables don’t need coding.
- Missing values need to be defined in the “Variable View” page.
38
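The coding rules above can be applied to a source file before it is imported. The sketch below is a small Python illustration, not an SPSS feature: the records, the field names and the missing-value code 99 are all hypothetical and only show the idea of mapping text categories to 0/1 codes and flagging a missing-value code.

```python
MISSING_CODE = 99  # hypothetical code used for a missing score in the source file

# Hypothetical raw records as they might appear in a spreadsheet.
raw_records = [
    {"id": 1, "sex": "male", "group": "test", "score": 21},
    {"id": 2, "sex": "female", "group": "control", "score": 26},
    {"id": 3, "sex": "female", "group": "test", "score": 99},
]

SEX_CODES = {"male": 1, "female": 0}
GROUP_CODES = {"test": 1, "control": 0}


def encode(record):
    """Return a copy of the record with categories coded and missing scores set to None."""
    score = record["score"]
    return {
        "id": record["id"],
        "sex": SEX_CODES[record["sex"]],
        "group": GROUP_CODES[record["group"]],
        "score": None if score == MISSING_CODE else score,
    }


coded = [encode(r) for r in raw_records]
for row in coded:
    print(row)
```

In SPSS itself the same effect comes from entering numeric codes and declaring the missing-value code for the variable.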
Example: CDC Survey Data


An allergy survey was conducted in 2005 and 2006 among children more than 1 year old.
Two data sets, allergy questionnaire and demographic information, are saved in SAS export format.
39
Tasks




Import these two SAS data files to SPSS and save them as SPSS data files.
Sort each data set by study ID.
Merge allergy variables and demographic variables.
Save new data set as SPSS data file.
40
Log in SPSS


CMH offers the server version of SPSS 18. Any employee can log in to SPSS from their employee account.
Go to Start -> Program -> Accessories -> Remote Desktop Connection
41
Log in SPSS


In the connection window that appears, enter cmhterm.
Click Connect.
42
Log in SPSS



In the Log On window, enter your CMH user name and password.
Choose log on to CMH.
Click OK.
43
Task 1: Import Data

We need to import two data sets to SPSS.
Allergy questionnaire: aqq_d.xpt (xpt is the SAS export file format)
Demographic information: demo_d.xpt
Please note that SPSS runs on a server, so data must be saved on a shared drive such as the U drive or W drive. You will not be able to find the files in SPSS if you save them on your local disk.
44
Task 1: Import Data



Double-click the SPSS 18 icon on the screen.
In the task wizard, click Open an existing source.
Click OK.
45
Task 1: Import Data

If the wizard does not appear, go to File -> Open -> Data
46
Task 1: Import Data





Select the folder.
Choose the aqq_d file.
Select xpt format.
Click Open.
Note: SPSS is compatible with other commonly used statistical and data management software packages. Excel, SAS and Access files are all convertible to SPSS.
47
Task 1: Import Data


Now the data is open.
You can see the data in the “Data View” tab.
48
Task 1: Import Data

The data structure, variable name, label, etc. are in the “Variable View” tab.
49
Task 2: Sort Data

Variable to sort by: SEQN, that is, Respondent sequence number.
50
Task 2: Sort Data





Go to Data and select Sort Cases.
On the Sort Cases page, select the variable, Respondent sequence number.
Click on the right arrow.
Choose Ascending or Descending.
Click OK.
51
Practice

Now let’s repeat this process by doing the following:
Open the demographic data, demo_d.xpt.
Sort the data by variable, Respondent Sequence Number.
52
Task 3: Merge Two Data Sets




Two data sets need to be linked by key variables.
In our case, the key variable is SEQN (Respondent Sequence Number).
Make sure the key variable has the same name and variable type in the two data sets.
Both data sets need to be sorted by the key variable.
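The reason both files must be sorted by the key can be seen in a small sketch of a match-merge. The Python code below is an illustration of the idea, not what SPSS runs internally: it sorts two lists of records by SEQN and walks through them in step, combining rows whose keys match. Only the key name SEQN comes from the slides; the other column names and all values are hypothetical, and each key is assumed to appear at most once per file.

```python
# Hypothetical rows; only the key name SEQN comes from the data sets above.
allergy = [{"SEQN": 31130, "allergy_item": 2}, {"SEQN": 31128, "allergy_item": 1}]
demo = [
    {"SEQN": 31129, "age_years": 15},
    {"SEQN": 31128, "age_years": 11},
    {"SEQN": 31130, "age_years": 85},
]


def sort_by_key(rows, key="SEQN"):
    """Return the rows in ascending order of the key variable."""
    return sorted(rows, key=lambda r: r[key])


def merge_sorted(left, right, key="SEQN"):
    """Match-merge two key-sorted lists, keeping every key found in either list."""
    merged, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        lk = left[i][key] if i < len(left) else None
        rk = right[j][key] if j < len(right) else None
        if rk is None or (lk is not None and lk < rk):
            merged.append(dict(left[i]))              # key only in the left file
            i += 1
        elif lk is None or rk < lk:
            merged.append(dict(right[j]))             # key only in the right file
            j += 1
        else:
            merged.append({**left[i], **right[j]})    # keys match: add variables
            i += 1
            j += 1
    return merged


merged = merge_sorted(sort_by_key(allergy), sort_by_key(demo))
for row in merged:
    print(row)
```

If either list were unsorted, the step-by-step comparison would pair the wrong rows or miss matches, which is why the sort comes first.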
53
Task 3: Merge Two Data Sets

In either data set, go to Data -> Merge Files -> Add Variables
54
Task 3: Merge Two Data Sets


Choose the other data set to add on.
Note: this page may look different in SPSS 18. In any case, choose the other data set.
55
Task 4: Save the New Data





Go to File -> Save As…
Select the folder.
Create new file, MergedData.
Choose SPSS data format.
Click Save.
56
Questions?
We will cover more data management in SPSS 203 workshop.
58
Let’s play with SPSS
59
Project 1: FEV1 Changes
60
Tasks
1. Log in to intranet and open SPSS.
2. Define variables and missing values in “Variable View” tab.
3. Enter data in “Data View” tab.
4. Perform two-sample t-tests to compare FEV1 at 6 weeks between test and control.
5. Generate mean and error bar graph for two groups.
6. Interpret the SPSS output and make conclusion.
63
Tasks to be continued
7. Perform paired t-test to compare the FEV1 between baseline and 6 weeks for test group.
8. Interpret SPSS results and draw conclusions.
9. Save SPSS data and SPSS output respectively.
10. Open SPSS data and SPSS output by double clicking the icons.
11. Close both files.
64
Project 2: HAM-A Scores in GAD
65
Tasks
1. Open data in Excel. Make sure the data structure, variables and missing values are set up properly.
2. Import Excel to SPSS.
3. Perform one-way ANOVA to compare high dose, low dose and control groups.
4. Generate mean and error bar graph for three groups.
5. If the global F-test is significant, then perform post-hoc pair-wise comparisons.
6. Interpret the SPSS output and make conclusion.
7. Save data and output.
8. Close files.
68
Project 3: CHF Incidence in CABG after ARA
69
Case study: CHF Incidence in CABG after ARA

A new adenosine-releasing agent (ARA), thought to reduce side effects in patients undergoing coronary artery bypass surgery (CABG), was studied in a pilot trial that enrolled 35 patients who received active medication and 20 patients who received a placebo. Follow-up observation revealed that 2 patients who received active medication and 5 patients who received the placebo had shown symptoms of congestive heart failure (CHF) within 90 days post surgery. Is this evidence of a reduced rate of CHF for patients treated with the ARA compound?
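Fisher's exact test treats the margins of the 2x2 table as fixed and sums hypergeometric probabilities. The Python sketch below is an illustration under that assumption, using the counts stated in the case study text (2 of 35 on ARA and 5 of 20 on placebo with CHF). The function names are hypothetical. It prints both the probability of the observed table and the one-sided p-value for a lower CHF rate on ARA, so that either can be compared with the SPSS output.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of the table [[a, b], [c, d]] with all margins fixed."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_one_sided_lower(a, b, c, d):
    """P(first cell <= a) under fixed margins: evidence of a lower event rate in row 1."""
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)  # smallest first-cell count the margins allow
    return sum(
        table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        for x in range(lowest, a + 1)
    )


# Rows: ARA, placebo. Columns: CHF, no CHF.
p_table = table_probability(2, 33, 5, 15)
p_lower = fisher_one_sided_lower(2, 33, 5, 15)

print(f"probability of the observed table = {p_table:.4f}")
print(f"one-sided p-value (fewer CHF cases on ARA) = {p_lower:.4f}")
```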
70
Tasks
1. Open SPSS data.
2. Summarize frequency, percentage in two-way contingency table.
3. Perform Fisher’s exact test.
4. Perform Chi-square test.
5. Compare Fisher’s exact test with Chi-square test.
6. Interpret the SPSS output and make conclusion.
7. Close files.
71
Project 4: ADR Frequency with Antibiotic Treatment
72
Case Study: ADR Frequency with Antibiotic Treatment

A study was conducted to monitor the incidence of GI adverse drug reactions of a new antibiotic used in lower respiratory tract infections (LRTI). Two parallel groups were included in the study: one group of 66 LRTI patients randomized to receive the new treatment, and a reference group of 52 patients randomized to receive erythromycin.
73
Tasks
1. Open SPSS data.
2. Summarize frequency, percentage in two-way contingency table.
3. Perform Fisher’s exact test.
4. Perform Chi-square test.
5. Compare Fisher’s exact test with Chi-square test.
6. Interpret the SPSS output and make conclusion.
7. Close files.
74
Questions?
Let us know which statistics topics you are interested in.
75
In summary…
76
Thank You
For more information, visit my website
http://www.childrensmercy.org/content/view.aspx?id=9740
Or go to Scope -> Research -> Medical Research -> Statistics
77
References



Medical Statistics by Campbell et al.
Introductory Statistics by Neil Weiss
Common Statistical Methods for Clinical Research by Walker
78