
Quantitative Research Methods for
Social Sciences
Spring 2012
Module 2: Lecture 8
Cox Regression and Path Analysis
Priyantha Wijayatunga
Department of Statistics Umeå University
Modeling time length
Sometimes we want to model the time length until a certain event
happens (to a subject)





Age at which a person gets married
It may depend on his/her income, education, etc.
Time until an unemployed person finds employment again
It may depend on the person's gender, education, etc., and also on
labour market conditions in the country (time)
Time until a patient is cured after taking a certain medicine
It may depend on the person's general health, the effectiveness of the
medicine, etc.
Life span of a person
It may depend on the person's health, profession, etc.
And many more!
Nature of time to event data
Modeling events that happen over time is called event history analysis
• Data have a high rate of right-censoring: we cannot observe the event
for every subject. By the time we do the analysis, many people may not
yet have been cured by the medicine, etc., so the time durations for
some people are censored
All we know is that each of their time lengths to the event is greater
than the time that has elapsed by then!
• If we leave out these censored time lengths we bias the analysis
seriously: you cannot include only the people who found employment if
you want to model the duration of unemployment
• SPSS can handle right-censored data
• The response (time length) may depend on time and also on some
explanatory variables that may change over time: in the unemployment
duration problem, labour market conditions may change over time and
the person concerned may participate in training programmes during
the unemployment period
• Rather than modeling the length of time, we usually model the
intensity/rate of the event (hazard rate)
Hazard Rate (Rate of an event)
Suppose 5 subjects were admitted to a nursing home. The event of interest is
receiving special medical care, so we record the time until the care is
given. Let us assume we observe times of 0.5, 0.2, 1.3 and 0.1 years for the
first 4 subjects, and that the 5th subject did not require care during her
stay of 0.4 years at the nursing home (censored!).






Total number of events that occurred = 4
Total number of years observed = 0.5+0.2+1.3+0.1+0.4 = 2.5 years
Assuming the rate at which events happen remains the same throughout,
sample rate of occurrence = hazard rate = 4/2.5 = 1.6 events per year
To be more realistic, we should not assume that the rate stays constant
over time.
Furthermore, the rate varies with some characteristics of the subjects
(values of some X)
In Cox regression we can allow for both, and some more!
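The arithmetic of the nursing home example can be checked with a few lines of Python. This is only a sketch of the constant-rate calculation above; the function name `constant_hazard_rate` is made up for illustration.

```python
# Times (in years) for the five nursing-home subjects from the slide.
# The second entry is True if the special care was given (event observed)
# and False if the observation was censored.
observations = [
    (0.5, True),
    (0.2, True),
    (1.3, True),
    (0.1, True),
    (0.4, False),  # 5th subject: stayed 0.4 years without needing the care
]

def constant_hazard_rate(observations):
    """Events per unit of observed time, assuming the rate never changes."""
    n_events = sum(1 for _, event in observations if event)
    total_time = sum(time for time, _ in observations)  # censored time counts too
    return n_events / total_time

rate = constant_hazard_rate(observations)
print(f"{rate:.1f} events per year")  # prints "1.6 events per year"
```

Note that the censored subject adds 0.4 years to the denominator but no event to the numerator, which is how censored durations still enter the estimate.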
Hazard Rate at time t, H(t)
The hazard rate at time t, say H(t), depends on time t and two other
explanatory variables X1 and X2.
The model:
$\log H(t) = \beta_0(t) + \beta_1 X_1 + \beta_2 X_2$
as in Poisson regression

Therefore H(t) is always positive

The focus is on the effect of each variable X1 and X2 on the hazard rate, i.e.,
on β1 and β2 (parameters), but not on β0(t) (we allow it to vary arbitrarily)

Since the hazard rate depends on time t and on two other explanatory
variables X1 and X2, we may write it (to be explicit) as
$H_{x_1, x_2}(t)$
Hazard Rate…
Model:


$\log H_{x_1, x_2}(t) = \beta_0(t) + \beta_1 X_1 + \beta_2 X_2$

Parameters β1 and β2 are (shown here for β1):

$\log H_{x_1+1, x_2}(t) - \log H_{x_1, x_2}(t) = \beta_1$

Hazard rate ratio:
$\dfrac{H_{x_1+1, x_2}(t)}{H_{x_1, x_2}(t)} = e^{\beta_1}$

$H_{x_1+1, x_2}(t) = e^{\beta_1} H_{x_1, x_2}(t)$



The hazard rate increases or decreases proportionally: a proportional hazards model
The sign of each β says whether the respective X has an increasing, no, or
decreasing effect (>0, =0 or <0 respectively)
The usual hypothesis tests are applicable for all β's.
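To see why the model is called proportional, the following sketch evaluates the hazard for x1 and x1+1 at several time points. The coefficient values and the shape of β0(t) are hypothetical, chosen only for illustration; the point is that the ratio does not depend on t or on β0(t).

```python
import math

def log_baseline(t):
    """A hypothetical beta0(t); any other function of t would do."""
    return -1.0 + 0.3 * t

def hazard(t, x1, x2, beta1, beta2):
    """H_{x1,x2}(t) = exp(beta0(t) + beta1*x1 + beta2*x2)."""
    return math.exp(log_baseline(t) + beta1 * x1 + beta2 * x2)

# Hypothetical parameter values.
beta1, beta2 = 0.5, -0.2

# Hazard ratio for a one-unit increase in x1, at three different times.
ratios = [
    hazard(t, 3.0, 1.0, beta1, beta2) / hazard(t, 2.0, 1.0, beta1, beta2)
    for t in (0.5, 1.0, 5.0)
]

# Every ratio equals exp(beta1), whatever the time point.
for r in ratios:
    print(round(r, 6), round(math.exp(beta1), 6))
```

Because β0(t) cancels in the ratio, it never has to be specified in order to estimate β1 and β2.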
Example: We use the data file rearrest.sav
(Thanks to UCLA Academic Technology Services for the data)
The file contains data on 194 inmates released from a certain security
prison. ’months’ is the failure time, and ’censor’ is 1 if it is censored.
‘personal’ identifies the 61 former inmates (31.4%) who had a history
of person-related crimes (e.g., assault, kidnapping). ‘property’
identifies the 158 (81.4%) who had a history of property crimes. ‘cage’ is
the person's age at release, centered on the sample mean of 30.7.
Data:
id   months   censor   personal   property   cage
1    ,0657    0        1          1          -1,67519776
2    ,1314    0        0          1          -10,48286374
3    ,2300    0        1          1          -4,42673780
4    ,2957    0        0          1          -11,32885963
5    ,2957    0        1          1          -7,16458859
6    ,3285    0        1          0          -2,86890070
.    .        .        .          .          .
Analyze > Survival > Kaplan-Meier, then move ’months’ to the Time
box and ’censor’ to the Status box. Press Define events, enter 0 as
Single value and press Continue. Move ’personal’ to the Factor box.
Survival functions and hazard functions stratified by the factor ’personal’
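For readers without SPSS, the Kaplan-Meier survival estimate can be sketched in plain Python. This is an illustrative implementation, not what SPSS runs, and it is applied here to the five nursing home durations from the earlier slide rather than to rearrest.sav. The function name `kaplan_meier` is an assumption; events are coded True when observed (the slides' ’censor’ = 0) and False when censored.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : observed durations
    events : True if the event occurred, False if right-censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects that share the same observed time t.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the share of those at risk who survive past t.
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without changing S(t).
        at_risk -= n_leaving
    return curve

times = [0.5, 0.2, 1.3, 0.1, 0.4]
events = [True, True, True, True, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

The censored stay of 0.4 years produces no step in the curve, but it shrinks the risk set, so the drop at 0.5 years is larger than it would be with five subjects still at risk.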
Running Cox regression
Analyze > Survival > Cox Regression, then move ’months’ to the
Time box and ’censor’ to the Status box. Press Define events, enter
0 as Single value and press Continue. Move ’personal’, ’property’
and ’cage’ to the Covariates box. Press Plots and tick Survival and
Hazard under Plot type before pressing Continue. Finally press
OK.
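Behind the menu clicks, Cox regression estimates the β's by maximizing a partial likelihood in which β0(t) cancels out. The sketch below does this for a single covariate with Newton-Raphson steps. It is a minimal illustration under stated assumptions: the data set `toy` is hypothetical, there are no tied event times, and the function names are made up; it is not a reproduction of the SPSS procedure.

```python
import math

def score_and_information(beta, data):
    """First derivative and negative second derivative of the Cox
    partial log-likelihood for a single covariate x.

    data: list of (time, event, x); event is True if the event was observed.
    """
    score, information = 0.0, 0.0
    for t_i, event_i, x_i in data:
        if not event_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_x = [x for t, _, x in data if t >= t_i]
        weights = [math.exp(beta * x) for x in risk_x]
        total = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, risk_x)) / total
        mean_x2 = sum(w * x * x for w, x in zip(weights, risk_x)) / total
        score += x_i - mean_x
        information += mean_x2 - mean_x ** 2
    return score, information

def fit_cox_one_covariate(data, iterations=25):
    """Newton-Raphson iterations starting from beta = 0."""
    beta = 0.0
    for _ in range(iterations):
        score, information = score_and_information(beta, data)
        beta += score / information
    return beta

# Hypothetical toy data: (time, event observed?, x)
toy = [(1, True, 1), (2, True, 0), (3, True, 1), (4, False, 1),
       (5, True, 0), (6, True, 1), (7, False, 0), (8, True, 0)]

beta_hat = fit_cox_one_covariate(toy)
hazard_ratio = math.exp(beta_hat)  # plays the role of Exp(B) in SPSS output
print(round(beta_hat, 3), round(hazard_ratio, 3))
```

Only the ordering of the event times and the composition of each risk set matter here, which is why the baseline β0(t) can be left unspecified.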
Omnibus Tests of Model Coefficients (a)

-2 Log Likelihood: 950,583

                              Chi-square   df   Sig.
Overall (score)               30,263       3    ,000
Change From Previous Step     38,907       3    ,000
Change From Previous Block    38,907       3    ,000

a. Beginning Block Number 1. Method = Enter

Variables in the Equation

            B       SE     Wald     df   Sig.   Exp(B)
personal    ,569    ,205   7,680    1    ,006   1,766
property    ,935    ,351   7,107    1    ,008   2,548
cage        -,067   ,017   15,778   1    ,000   ,936

Covariate Means

            Mean
personal    ,314
property    ,814
cage        ,000
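The Exp(B) and Wald columns of the "Variables in the Equation" output can be recomputed from B and SE: Exp(B) is e^B (the hazard rate ratio) and the Wald statistic is (B/SE)^2. The sketch below uses the printed values with decimal commas converted to points. Because the printed B and SE are rounded to three decimals, the results agree with the SPSS columns only approximately.

```python
import math

# (B, SE) as printed in the SPSS output, decimal commas converted to points.
coefficients = {
    "personal": (0.569, 0.205),
    "property": (0.935, 0.351),
    "cage": (-0.067, 0.017),
}

results = {}
for name, (b, se) in coefficients.items():
    hazard_ratio = math.exp(b)   # Exp(B) column
    wald = (b / se) ** 2         # Wald chi-square with 1 df
    results[name] = (hazard_ratio, wald)
    print(f"{name}: Exp(B) = {hazard_ratio:.3f}, Wald = {wald:.2f}")
```

Read as hazard rate ratios: with the other variables held fixed, a history of person-related crime multiplies the estimated hazard of rearrest by about 1.77, a history of property crime by about 2.55, and each additional year of age at release by about 0.94.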
Plots
Path Analysis

To analyze the relationships among events and phenomena
(random, though not fully) in the world

Associational relationships and causal relationships

Uses regression models to model associations

Uses diagrams to hypothesize causal relationships

Combines many regression models into one through a graphical
representation
Relations between random events
Events in the world may not be completely random. Some events are
related (association) and some events can cause some other events







For high school students, math exam score and science exam score
may be associated (positive correlation)
Higher income may cause a higher living standard (causal relation:
something more than association) for individuals.
Informally, if an event A happens before an event B, and the occurrence
of A changes the probability of B occurring when all other related
factors are held fixed, then A is a cause of B
Causation implies association, but association does not always imply
causation
Correlation and the other dependence measures we talked about earlier
measure only association.
To say something about causation we need extra information, such as the
time order of events, subject domain knowledge, etc.
We use path diagrams to say something about causation
From association to causation
An association may be replaced by a causal structure that is the true model:
math score and science score are associated, but this association may
be a result of the intelligence level of the student. A high level causes high
scores in both math and science, and a low level causes low scores in both.
[Figure: two path diagrams. In one, intelligence level points to both math score and science score; in the other, math score and science score are linked directly.]
We can prove non-causation, but it is very difficult to prove causation: suppose
someone claims that A causes B after observing their association. If the
association disappears when we control for another factor C, then there is no
direct causation from A to B (e.g., a causal chain). If the association does not
disappear, the relation is still not necessarily causal
• If an effect has two (independent) causes, it is possible to identify them
as causes (when we control for the effect, the causes become dependent): if passing
an exam is caused by two scores, then given that someone has passed it, the two
scores should be high, and vice versa.
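The idea of an association that disappears when we control for a third factor can be illustrated by simulation. In this sketch the intelligence level causes both scores and there is no direct link between them. All numbers (effect sizes 0.8 and 0.7, sample size, seed) are hypothetical, and "controlling for" is done by correlating the residuals left after regressing each score on intelligence.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def residuals(y, x):
    """Residuals from the least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

random.seed(0)
n = 20000
intelligence = [random.gauss(0, 1) for _ in range(n)]
# Each score depends on intelligence plus its own independent noise.
math_score = [0.8 * c + random.gauss(0, 1) for c in intelligence]
science_score = [0.7 * c + random.gauss(0, 1) for c in intelligence]

raw = pearson(math_score, science_score)
controlled = pearson(residuals(math_score, intelligence),
                     residuals(science_score, intelligence))
print("raw correlation:", round(raw, 3))
print("after controlling for intelligence:", round(controlled, 3))
```

The raw correlation is clearly positive, while the correlation after controlling for intelligence is close to zero, which is the pattern expected when the common cause explains the whole association.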
What is path analysis?







It is a theoretical explanation of cause-effect relations among a set of variables
using a number of regression equations with the aid of diagrams
A single regression equation is not sufficient to explain many cause-effect
relationships
Direct causal relationships are shown with arrows emanating from causes and
pointing to their effects. Thus, indirect causal relationships are shown too
Each arrow has a coefficient associated with it: the standardized regression
coefficient of the cause variable in the multiple regression of the effect variable on all
its possible cause variables
For example, if there is a causal chain X → Z → Y, then in the multiple regression of Y on X and Z,
the effect of X on Y should not be significant
Every effect is associated with a residual-variable path explaining the variation in it
that is unexplained by its causes. It has the coefficient $\sqrt{1 - R^2}$, where $R^2$ is the R-squared from the corresponding multiple regression equation
Also, path diagrams can contain bi-directed paths, meaning
associations/feedback relations
Some variables can have both direct and indirect causal relations
[Figure: path diagram with variables X, Z, Y and W, path coefficients .3, .4 and .2, and residual paths Rz and Ry]
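The coefficient on a residual path can be computed directly from the R-squared of the corresponding regression. The sketch below assumes the common convention that this coefficient is the square root of 1 - R^2, and uses a hypothetical effect with a single cause whose standardized coefficient is 0.3 (so that R^2 = 0.3^2). The function name is made up for illustration.

```python
import math

def residual_path_coefficient(r_squared):
    """Coefficient on the residual path: sqrt(1 - R^2)."""
    return math.sqrt(1.0 - r_squared)

# Hypothetical example: one cause with standardized coefficient 0.3.
path_coefficient = 0.3
r_squared = path_coefficient ** 2  # R^2 of a one-predictor standardized regression
residual = residual_path_coefficient(r_squared)

# For a standardized effect variable, explained and unexplained
# variance add up to 1.
total_variance = path_coefficient ** 2 + residual ** 2
print(round(residual, 3), round(total_variance, 3))
```

The check on `total_variance` shows why the square root is used: squared path coefficients are shares of the effect variable's variance.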
Path decomposition
[Figure: path diagram X → Z → Y, a causal chain]
The correlation between X and Y is explained by Z: controlling for Z will
make the X-Y correlation disappear. That is, the partial correlation
between X and Y given Z is zero.

$\rho_{xy \cdot z} = \dfrac{\rho_{xy} - \rho_{zx}\rho_{zy}}{\sqrt{(1 - \rho_{zx}^2)(1 - \rho_{zy}^2)}} = 0$

so that

$\rho_{xy} = \rho_{zx}\rho_{zy}$

The above formula can be generalized to more complex paths
between X and Y, if they are there.
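The decomposition for a causal chain can be checked numerically. In this sketch the two correlations along the chain are hypothetical values (0.3 and 0.4); the X-Y correlation implied by the chain is their product, and plugging it into the partial correlation formula gives zero.

```python
import math

def partial_correlation(r_xy, r_zx, r_zy):
    """Partial correlation between X and Y given Z."""
    return (r_xy - r_zx * r_zy) / math.sqrt((1 - r_zx ** 2) * (1 - r_zy ** 2))

# Hypothetical correlations along the chain X -> Z -> Y.
r_zx, r_zy = 0.3, 0.4

# Correlation between X and Y implied by the chain.
r_xy = r_zx * r_zy

print(round(r_xy, 2))                         # 0.12
print(partial_correlation(r_xy, r_zx, r_zy))  # 0.0

# If the observed X-Y correlation were larger than the product,
# something other than the chain through Z would have to explain it.
print(round(partial_correlation(0.30, r_zx, r_zy), 3))
```

A nonzero partial correlation, as in the last line, is the signal that the simple chain does not account for the whole X-Y association.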