Transcript Slide 1
Quantitative Research Methods for Social Sciences
Spring 2012
Module 2: Lecture 8
Cox Regression and Path Analysis
Priyantha Wijayatunga, Department of Statistics, Umeå University

Modeling time length
Sometimes we want to model the length of time until a certain event happens to a subject:
Age at which a person gets married. It may depend on his/her income, education, etc.
Time until an unemployed person finds employment again. It may depend on the person's gender, education, etc., and also on labour market conditions in the country (which change over time).
Time until a patient is cured after taking a certain medicine. It may depend on the person's general health, the effectiveness of the medicine, etc.
Life span of a person. It may depend on the person's health, profession, etc.
And many more!

Nature of time-to-event data
Modeling events happening over time is called event history analysis.
Data have a high rate of right-censoring: we cannot observe the event for all subjects. By the time we do the analysis, many people may not yet have been cured by the medicine, etc.
For some people the time durations are censored: all we know is that each such person's time to the event is greater than the time that has elapsed so far.
If we leave out these censored time lengths we bias the analysis seriously: you cannot include only the people who found employment if you want to model unemployment duration.
SPSS can handle right-censored data.
The response (time length) may depend on time and also on explanatory variables that may change over time: in the unemployment duration problem, labour market conditions may change over time, and the person concerned may participate in training programmes while unemployed.
Rather than modeling the length of time, we usually model the intensity/rate of the event (the hazard rate).

Hazard Rate (Rate of an event)
Consider 5 subjects who were admitted to a nursing home. The event of interest is receiving special medical care, so we measure the time until the care is given.
Let us assume the observed times are 0.5, 0.2, 1.3 and 0.1 years for the first 4 subjects, and that the 5th subject did not require the care during her stay of 0.4 years at the nursing home (censored!).
Total number of events occurred = 4
Total number of years elapsed = 0.5 + 0.2 + 1.3 + 0.1 + 0.4 = 2.5 years
Assuming the rate at which events happen remains the same throughout, sample rate of occurrence = hazard rate = 4/2.5 = 1.6 events per year.
To be more realistic, we should not assume that the rate stays constant over time. Furthermore, the rate varies with some characteristics of the subjects (values of some X).
In Cox regression we can allow for both, and more!

Hazard Rate at time t, H(t)
The hazard rate at time t, say H(t), depends on time t and two other explanatory variables X1 and X2.
The model, like in Poisson regression:
$$\log H(t) = \beta_0(t) + \beta_1 X_1 + \beta_2 X_2$$
Therefore H(t) is always positive.
The focus is on the effects of X1 and X2 on the hazard rate, i.e., on the parameters β1 and β2, but not on β0(t) (we allow it to vary arbitrarily).
Since the hazard rate depends on time t and on the two explanatory variables X1 and X2, we may write it (to be indicative) as $H_{x_1, x_2}(t)$.

Hazard Rate…
Model:
$$\log H_{x_1, x_2}(t) = \beta_0(t) + \beta_1 X_1 + \beta_2 X_2$$
Interpretation of the parameters β1 and β2 (shown here for β1):
$$\log H_{x_1 + 1, x_2}(t) - \log H_{x_1, x_2}(t) = \beta_1$$
$$H_{x_1 + 1, x_2}(t) = H_{x_1, x_2}(t)\, e^{\beta_1}$$
Hazard rate ratio:
$$\frac{H_{x_1 + 1, x_2}(t)}{H_{x_1, x_2}(t)} = e^{\beta_1}$$
The hazard rate increases or decreases proportionally: a proportional hazards model.
The sign of each β says whether the respective X has an increasing, no, or decreasing effect (β > 0, β = 0 or β < 0 respectively).
The usual hypothesis tests are applicable for all β's.

Example:
We use the data file rearrest.sav (thanks to UCLA Academic Technology Services for the data).
The file contains data on 194 inmates released from a certain security prison.
‘months’ is the failure time, and ‘censor’ is 1 if it is censored.
‘personal’ identifies the 61 former inmates (31.4%) who had a history of person-related crimes (e.g., assault, kidnapping).
‘property’ identifies the 158 (81.4%) who had a history of property crimes.
‘cage’ is the person's age at release, centered on the sample mean (30.7).

Data (decimal commas as in the SPSS output):
id   months   censor   personal   property   cage
1    ,0657    0        1          1          -1,67519776
2    ,1314    0        0          1          -10,48286374
3    ,2300    0        1          1          -4,42673780
4    ,2957    0        0          1          -11,32885963
5    ,2957    0        1          1          -7,16458859
6    ,3285    0        1          0          -2,86890070
.    .        .        .          .          .

Analyze > Survival > Kaplan-Meier, then move ‘months’ to the Time box and ‘censor’ to the Status box. Press Define Event, enter 0 as Single value and press Continue. Move ‘personal’ to the Factor box.

[Figure: survival functions and hazard functions stratified by the factor ‘personal’]

Running Cox regression
Analyze > Survival > Cox Regression, then move ‘months’ to the Time box and ‘censor’ to the Status box. Press Define Event, enter 0 as Single value and press Continue. Move ‘personal’, ‘property’ and ‘cage’ to the Covariates box. Press Plots and tick Survival and Hazard under Plot Type before pressing Continue. Finally press OK.

Omnibus Tests of Model Coefficients (a)
-2 Log Likelihood: 950,583
                             Chi-square   df   Sig.
Overall (score)              30,263       3    ,000
Change From Previous Step    38,907       3    ,000
Change From Previous Block   38,907       3    ,000
a. Beginning Block Number 1. Method = Enter

Variables in the Equation
           B       SE     Wald     df   Sig.   Exp(B)
personal   ,569    ,205   7,680    1    ,006   1,766
property   ,935    ,351   7,107    1    ,008   2,548
cage       -,067   ,017   15,778   1    ,000   ,936

Covariate Means
           Mean
personal   ,314
property   ,814
cage       ,000

[Figure: Plots]

Path Analysis
Used to analyze the relationships among events and phenomena (random, though not fully) in the world.
Associational relationships and causal relationships.
Uses regression models to model associations.
Uses diagrams to hypothesize causal relationships.
Combines many regression models into one through a graphical representation.

Relations between random events
Events in the world may not be completely random.
Some events are related (association) and some events can cause other events.
For high school students, math exam score and science exam score may be associated (positive correlation).
Higher income may cause a higher living standard for individuals (a causal relation, something more than association).
Informally, if an event A happens before an event B, and the occurrence of A changes the probability of B occurring when all other related factors are held stable, then A is a cause of B.
Causation implies association, but association does not always imply causation.
Correlation and the other dependence measures we discussed earlier measure only association. To say something about causation we need extra information, such as the time order of events, subject domain knowledge, etc.
We use path diagrams to say something about causation.

From association to causation
An association may be replaced by a causal structure that may be the true model: math score and science score are associated, but this association may be a result of the student's intelligence level. A high level causes high scores in both math and science, and a low level causes low scores in both.
[Diagrams: intelligence level -> math score and intelligence level -> science score; math score - science score]
We can prove non-causation, but it is very difficult to prove causation. Suppose someone claims that A causes B after observing their association. If the association disappears when we control for another factor C, then there is no direct causation from A to B (e.g., a causal chain). If the association does not disappear, the relation is still not necessarily causal.
If an effect has two (independent) causes, it is possible to identify them as causes, because controlling for the effect makes the causes dependent: if passing an exam is caused by two scores, then given that someone has passed, a low value of one score implies that the other must be high, and vice versa.

What is path analysis?
It is a theoretical explanation of cause-effect relations among a set of variables, using a number of regression equations with the aid of diagrams.
A single regression equation is not sufficient to explain many cause-effect relationships.
Direct causal relationships are shown with arrows emanating from causes and pointing to their effects. Thus, indirect causal relationships are shown too.
Each arrow has a coefficient associated with it: the standardized regression coefficient of the cause variable in the multiple regression of the effect variable on all of its possible cause variables.
For example, if there is a causal chain from X through Z to Y, then in the multiple regression of Y on X and Z the effect of X on Y should not be significant.
Every effect is associated with a residual-variable path explaining the variation in it that is unexplained by its causes. It has the coefficient $\sqrt{1 - R^2}$, where $R^2$ is the R-squared from the corresponding multiple regression equation.
Path diagrams can also contain bi-directed paths, meaning associations/feedback relations.
Some variables can have both direct and indirect causal relations.
[Diagram: path diagram with variables X, W, Z and Y, residual paths Rz and Ry, and path coefficients .3, .4 and .2]

Path decomposition
[Diagram: X -> Z -> Y, a causal chain]
The correlation between X and Y is explained by Z: controlling for Z will make the X-Y correlation disappear. That is, the partial correlation between X and Y given Z is zero:
$$\rho_{xy.z} = \frac{\rho_{xy} - \rho_{zx}\rho_{zy}}{\sqrt{(1 - \rho_{zx}^2)(1 - \rho_{zy}^2)}} = 0 \quad\Longrightarrow\quad \rho_{xy} = \rho_{zx}\rho_{zy}$$
The above formula can be generalized to more complex paths between X and Y, if they exist.
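
The causal-chain result can be checked numerically. The sketch below is an illustration only and is not part of the lecture material. It simulates a standardized chain X -> Z -> Y with hypothetical path coefficients 0.3 and 0.4 (the names `partial_corr`, `corr` and `simulate_chain`, the sample size and the seed are all assumptions), then compares the X-Y correlation with the product of the two path correlations and computes the partial correlation given Z.

```python
import math
import random


def partial_corr(r_xy, r_zx, r_zy):
    """Partial correlation of X and Y given Z, from the pairwise correlations."""
    return (r_xy - r_zx * r_zy) / math.sqrt((1 - r_zx**2) * (1 - r_zy**2))


def corr(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def simulate_chain(n=20000, b_zx=0.3, b_yz=0.4, seed=1):
    """Simulate a standardized chain X -> Z -> Y (hypothetical coefficients)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # A residual standard deviation of sqrt(1 - b^2) keeps Z and Y at unit variance.
    z = [b_zx * xi + math.sqrt(1 - b_zx**2) * rng.gauss(0, 1) for xi in x]
    y = [b_yz * zi + math.sqrt(1 - b_yz**2) * rng.gauss(0, 1) for zi in z]
    return x, z, y


x, z, y = simulate_chain()
r_xy, r_zx, r_zy = corr(x, y), corr(z, x), corr(z, y)
print("r_xy:", round(r_xy, 3), " r_zx * r_zy:", round(r_zx * r_zy, 3))
print("partial r_xy.z:", round(partial_corr(r_xy, r_zx, r_zy), 3))
```

With a large sample, the printed X-Y correlation should be close to the product of the two path correlations and the partial correlation should be close to zero. Small deviations are sampling noise.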