Multilevel modelling Volume 4 Slide Set 1 Revised June 2011 1 Session 1: Concepts & Structures • • • • • • • • • • • • Who are we? Your knowledge? Course structure and aims Going further and Software Four key notions that define ML Hierarchical structures Task 1 (over to you) Hierarchies resulting from research designs Non-hierarchical structures Alternative analysis strategies Lab 1: MLwiN preliminaries (before modelling) Homework: Task 2 and Statistics 101 2 Us and You? • US • Kelvyn Jones (Bristol) - geographer (health) - cohorts & panel studies - ML since 1987 • YOU • Fitted a multilevel model? • Used MLwiN? • Familiarity with…... - ANOVA - ANCOVA ? - LRT ? - ZPR? - REML - FIML - MCMC? - logit model ? - Bayesian modelling ? 3 Course structure and aims • A thorough introduction to the BASIC multilevel modelling: two-level model • Sessions 1 to 8 : concepts, specification and interpretation of 2-level model, continuous response • Sessions 9 & 10: IGLS/RIGLS estimation, basic introduction to MCMC estimation; 3-level model; discrete response, repeated measures, spatial model….. • Practical: hands-on experience us: modelling house prices you: modelling pupil progress between schools • Participation: talk, tasks, discussion 4 Multilevel Models: AKA • • • • • random-effects models, hierarchical models, variance-components models, random-coefficient models, mixed models • First known application: 1861 one-way, random-effects model: several telescopic observations per night for several different nights; separated the variance into between and within-night variation • Modern day version: 1986, publication of algorithms (linked to software) for dealing with unbalanced data and complex variance functions 5 Texts • Comprehensive but demanding! : Goldstein • Thorough but a little dated: Snijders & Bosker • Approachable : Hox • Thorough (HLM) Raudenbush & Bryk • Authoritative: de Leeuw & Meijer • Applications: education, O’Connell & McCoach • Applications: health, Leyland & Goldstein http://www.cmm.bristol.ac.uk/learningtraining/multilevel-m-support/books.shtml 6 Realistically complex modelling Statistical models as a formal framework of analysis with a complexity of structure that matches the system being studied Four KEY Notions 1 : Modelling data with a complex structure A large range of structures that ML can handle routinely; eg houses nested in neighbourhoods 2: Modelling heterogeneity standard regression models ‘averages’, ie the general relationship ML additionally models variances; eg individual house prices vary from n’hood to neighbourhood 3: Modelling dependent data potentially complex dependencies in the outcome over time, over space, over context; eg houses within a n’hood tend to have similar prices 4: Modelling contextuality: micro & macro relations eg individual house prices depends on individual property characteristics and on neighbourhood characteristics 7 Modelling data with complex structure • ‘Once you know that hierarchies exist you see them everywhere’ (Kreft, 1990) • Rest of Session 1 • Basic Idea: For many (most?) 
research questions Either the real world has a complex structure and/or we impose one during research design • Two fundamental types: hierarchical and nonhierarchical structures: • Ignoring structure: leads to impoverished analyses and inferential error • ML analyses data with complex structure 8 Strict hierarchies Two-level structures and three-level structures Unit Diagrams 2 9 Examples of strict hierarchy • • • Education pupils (1) in schools (2) pupils (1) in classes( 2) in schools (3) • • Geography houses(1) in neighbourhoods(2) in regions(3) in countries(4) • • Business individuals(1) within teams(2) within organizations(3) • • • Psychology individuals(1) within family(2) individuals(1) within twin sibling pair(2) • • Economics employees(1) within firms(2) • NB all are structures in the POPULATION (ie exist in reality) 10 Characteristics of strict hierarchy • Exact nesting – a lower level unit is nested within one and only one unit at the next higher level • Lower-level units repeated samples of higher level units – cannot measure hospital performance directly, but make repeated measures of it through individual patient outcome • Successive sampling from each level of a hierarchical population (even with full enumeration); concept of hypothetical super-population • Imbalanced structure – Imbalance in sample size within higher level units is to be expected in observational research; in one hospital you may have sampled 22 patients, in another it is only 10. Used to be troublesome for estimation – corollary: NEVER throw-away information to achieve balance; even if only 1 patient in the hospital 11 What does the data-frame look like? a 2-level study for examining school effects on pupil progress Classifications or levels Response Explanatory variables Pupil i School j Pupil Exam scoreij Pupil previous Examination scoreij Pupil genderij School typej 1 1 75 56 M State 2 1 71 45 M State 3 1 91 72 F State 1 2 68 49 F Private 2 2 37 36 M Private 3 2 67 56 M Private 1 3 82 76 F State 2 3 85 50 F State 1 4 54 39 M Private 2 4 91 71 M Private 3 4 43 41 M Private 4 4 66 55 F Private Some multilevel questions •What is the between-school variation in progress? •Do pupils make more progress in private than public schools? •Do girls make greater progress in private schools 12 TASK 1 • “Many research designs can be accommodated within the framework of a strict hierarchy” • Specify a multilevel structure and dataframe for EITHER a response variable of your choice OR for a model to explain house prices OR voting behaviour Template for answer • what is the response (must always be measured at level 1)? • What are the levels: 1, 2 , etc? • What are the predictor variables, and at what level are they measured (1,2 etc)? • What are the multilevel questions that you want answered? 13 What generates a multilevel structure? • Real world generates them; population structures or Naturally occurring hierarchies Pupils in Classes in Schools – Patients in Clinics in Health authorities – Voters in Polling districts in Constituencies – People in Neighborhoods in Cities – Houses in Neighborhood in Towns in Regions – But also can get - strict hierarchies imposed during research design and / or during data collection………. Examples of different designs…... 
14 Multistage sampling designs • for efficient collection of data • most large-scale surveys are not SRS • Two-level structure imposed by design • respondents nested within PSU’s 15 Multistage sampling designs • Multistage designs (usually) generate dependent data individuals living within the same PSU can be expected to be more alike than a random sample • The ‘design effect’ Inferential procedures (SE’s, confidence limits, tests) are likely to be incorrect incorrect estimates of precision Type 1 errors: finding a relationship where none exists • Multilevel models model this dependency and automatically corrects for the ‘design effect’ 16 Repeated cross-sectional design • Structure of Example:(Nuttall et al’s analysis of changing school performance in London) – Level-3 is the school, level-2 is the year and level-1 is the individual. Level-2 represents repeated measurements on the school. – Imbalance: – particular schools not included in particular years (level 2); – different number of pupils in each cohort in each school (level 1) – even if school is only measured once do NOT discard it, include it! • Research Question – modeling changing school performance as different cohorts of pupils pass through the schools 17 Repeated measures design: True Panels • Structure of Example: – Level-3 is the commune, level-2 is the voter and level-1 is the occasion on which they voted ; voters are repeatedly measured; SAME individuals • Research Question -what is the volatility of the vote; how changeable are individuals, and how changeable are communes?; the metaphor of the serene swan! • Imbalance - does not require a fixed set of repeated observations for all persons - both the number of observations per person and the spacing between observations may vary 18 Multivariate multilevel structures I • Structure: – Multiple response variables, representing repeated measurements of distinctive but not unrelated outcome variables – SET of response variables (level-1) nested within individuals (level-2) nested within neighborhoods (level-3) – NB- imbalance: not all responses measured on all individuals • Research Question -responses are health related behaviours (drinking, smoking exercise & diet) – are there unhealthy communities as well as unhealthy individuals? – how correlated are the behaviours at the individual level and community level (taking account of other characteristics)? 19 Multivariate multilevel structures II • Handles extreme case of matrix sampling design; imbalance by design • All respondents are asked a set of core questions but additional questions are asked of random subsets of the total sample • Example: all pupils asked core set of core mathematics questions; but subsets asked detailed questions on trigonometry, matrix algebra, calculus; all modeled simultaneously • Another variant is the ‘mixed’ multivariate model………. 20 Mixed multilevel structures • Structure: – Level-3: Places; Level-2: Individuals; Level-1: Two responses: • 1) smoke or not (a qualitative state) • 2) how many cigarettes they consume in a day (a quasi-continuous measure)………Imbalance; data with a spike at zero • Question – is social class related to whether you smoke or not; but unrelated to the amount you smoke? – Are places that have a high proportion of smokers also tend to have higher smoking intensity? 21 Non-Hierarchical structures Cross-classified structures I • Structure: – Individuals at level-1 in Workplaces at level-2 AND neighborhoods at level-2 – Workplaces and neighborhoods are not nested but CROSSED. 
– Individuals are seen as occupying more than one set of contexts • Questions: – Relative contribution of neighborhoods and workplaces to life satisfaction. 22 Cross-classified structures II • Examples patients (1) nested by GP (2) and by hospitals (2) - pupils (1) nested within primary schools(2), secondary schools (2), and neighbourhoods (2) • Cross classifications and strict hierarchies repeated sampling potential imbalance but latter do not have exact nesting 23 Non-Hierarchical structures Multiple membership structures • Structure: lower level unit can belong simultaneously to more than one higherlevel unit – Pupils nested within primary school-teachers – Some pupils are taught by more than one teacher – Structure includes a ‘weight’ based upon the proportion of time the pupil spent with the teacher (sum to 1) • Generalization - handles ‘missing identifiers; when only probability of membership to a high-level unit is known 24 CLASSIFICATION DIAGRAMS a) 3-level hierarchical structure b) cross-classified structure 25 CLASSIFICATION DIAGRAMS(cont) c) multiple membership structure d) spatial structure 26 Spatial Models as combination of strict hierarchy and multiple membership (eg contagious disease diffusion) A D B Person in A C Affected by A(SH) and B,C,D (MM) E F Person in H G H I J K L M Affected by H(SH) and E,I,K,G (MM) Multiple membership defined by common boundary; weights as function of inverse distance between centroids of areas 27 Combining structures: crossed-classifications and multiple membership relationships School Pupils Area S1 S2 S3 S4 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 A1 A2 A3 Pupil 1 moves in the course of the study from residential area 1 to 2 and from school 1 to 2 Area School Pupil 8 has moved schools but still lives in the same area Pupil 7 has moved areas but still attends the same school Student Now in addition to schools being crossed with residential areas pupils are multiple members of both areas and schools. 28 CLASSIFICATIONS FOR ALSPAC Primary school Area School Cohort Pupil Teacher occasions • • • • • All children born in Avon in 1990 followed longitudinally Multiple attainment measures on a pupil Pupils span 3 school-year cohorts (say 1996,1997,1998) Pupils move between teachers,schools,neighbourhoods Pupils progress potentially affected by their own changing characteristics, the pupils around them, their current and past teachers, schools and neighbourhoods 29 CLASSIFICATION DIAGRAM FOR ALSPAC Primary school Area School Cohort Pupil Teacher occasions IS SUCH COMPLEXITY NEEDED? 
• Complex models are not reducible to simpler models • Confounding of variation across levels (eg primary and secondary school variation) 30 Structures, classifications and dataframes School S1 S2 S3 S4 2-level strict hierarchy structure Pupils P1 school pupil P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 Classifications or levels Response Explanatory variables Pupil I School j Pupil Exam scoreij Pupil previous Examination scoreij Pupil genderij School typej 1 1 75 56 M State 2 1 71 45 M State 3 1 91 72 F State 1 2 68 49 F Private 2 2 37 36 M Private 3 2 67 56 M Private 1 3 82 76 F State 2 3 85 50 F State 1 4 54 39 M Private 2 4 91 71 M Private 3 4 43 41 M Private 4 4 66 55 F Private 31 Distinguishing Variables and Levels School type School Pupils state S1 P1 P2 private S3 P3 P6 S2 P7 P8 P4 P5 Classifications or levels Response Explanatory Variables Pupil I School Type k Pupil Exam Scoreijk Previous Exam scoreijk Pupil genderijk School j 1 1 State 75 56 M 2 1 State 71 45 M 3 1 State 91 72 F 1 2 Private 68 49 F 2 2 Private 37 36 M Etc S4 NO! P9 P10 P11 P12 School type is not a random classification it is a fixed classification, and therefore a variable not as a level. Random classification if units can be regarded as a random sample from a wider population of units. Eg pupils and schools Fixed classfication is a small fixed number of categories. Eg State and Private are not two types sampled from a large number of types, on the basis of these two we cannot generalise to a wider population of types of schools, But countries? Later….. 32 Structures, classifications and dataframes 2 level repeated measures design Classification diagram Unit diagram Person P1 O1 O2 O3 O4 Measurement Occasion Classifications or levels Response Explanatory variables Occasion i Person j Heightij Ageij Genderj 1 1 75 5 F 2 1 85 6 F 3 1 95 7 F 1 2 82 7 M 2 2 91 8 M 1 3 88 5 F 2 3 93 6 F 3 3 96 7 F P2 P3 ..... O1 O2 O1 O2 O3 HOcc1 HOcc2 HOcc3 AgeOcc1 AgeOcc2 AgeOcc3 Gender Person 1 75 85 95 5 6 7 F 2 82 91 * 7 8 * M 3 88 93 96 5 6 7 F b) in short form : a) in long form 33 Structures, classifications and dataframes Classification and unit diagrams for multivariate response model P1 Pupil E M P2 P3 P4..... E E M M Subject b) in long form a) in short form Englis h Score Maths Score 1 95 75 2 55 3 4 Pupil Gende r Response Explanatory variables Classifications or levels M Exam Subject i Pupil j Exam Scoreij EngIndicij MathIndicij GenderEngj GenderMathj * F Eng 1 1 95 1 0 M 0 65 40 F Math 2 1 75 0 1 0 M * 75 M Eng 1 2 55 1 0 F 0 Eng 1 3 65 1 0 F 0 Math 2 3 40 0 1 0 M Math 2 4 75 0 1 0 M 34 Analysis Strategies for Multilevel Data What are the alternatives; and why use multilevel modelling? 35 Analysis Strategies I Group-level analysis. Move up the scale: analyse only at the macro level; Aggregate to level 2 and fit standard regression model. • Problem: Cannot infer individual-level relationships from group-level relationships (ecological or aggregation fallacy) Example: research on school effects Response: Current score on a test, turned into an average for each of j schools; Predictor: past score turned into an average for each of _ j schools X j Model: regress means on means _ _ Same mean Y j , X j but three very different within school relations _ Yj Means on means analysis is meaningless! Mean does not reflect within group relationship Aitkin, M., Longford, N. (1986), "Statistical modelling issues in school effectiveness studies", Journal of the Royal Statistical Society, Vol. 149 No.1, pp.1-43. 
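The aggregation problem can be made concrete with a small simulation (an illustration added here, not part of the course materials): the regression of group means on group means can have a positive slope even though the relationship within every group is negative. All names and numbers below are made up.

```python
# Illustrative simulation of the aggregation (ecological) fallacy:
# between-group and within-group slopes can differ in sign.
import numpy as np

rng = np.random.default_rng(1)
n_groups, n_per_group = 20, 50

group_mx = np.linspace(0, 10, n_groups)   # group means of the predictor
group_my = 2.0 * group_mx                 # group means of the response rise with them

x, y, g = [], [], []
for j in range(n_groups):
    xj = group_mx[j] + rng.normal(size=n_per_group)
    # ...but WITHIN each group the slope is negative (-1)
    yj = group_my[j] - (xj - group_mx[j]) + rng.normal(scale=0.5, size=n_per_group)
    x.append(xj); y.append(yj); g.append(np.full(n_per_group, j))
x, y, g = np.concatenate(x), np.concatenate(y), np.concatenate(g)

# "Means on means": regression of group means on group means (level-2 analysis)
mx = np.array([x[g == j].mean() for j in range(n_groups)])
my = np.array([y[g == j].mean() for j in range(n_groups)])
print("between-group slope:", round(np.polyfit(mx, my, 1)[0], 2))   # about +2

# Pooled within-group slope, centring both variables on their group means
within = np.polyfit(x - mx[g], y - my[g], 1)[0]
print("within-group slope:", round(within, 2))                      # about -1
```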
36 Analysis Strategies continued I Group-level analysis Continued Aggregate to level 2 and fit standard regression model. • Problem: Cannot infer individual-level relationships from grouplevel relationships (ecological or aggregation fallacy) Level Black illiteracy Foreign-born illiteracy Individual 0.20 0.11 State 0.77 -0.52 Robinson (1950) demonstrated the problem by calculated the correlation between illiteracy and ethnicity in the USA for 2 aggregate and individual 2 scales of analysis for 1930 USA - Individual: for 97 million people; States: 48 units - very different results! The ECOLOGICAL FALLACY 37 Analysis Strategies continued II Individual-level analysis. Move down the scale; work only at the micro level; Fit standard OLS regression model • Problem: Assume independence of residuals, but may expect dependency between individuals in the same group; leads to underestimation of SEs; Type I errors Bennet’s (1976) “teaching styles” study uses a single-level model: test scores for English, Reading and Maths aged 11 were significantly influenced by teaching style; PM calls for a return to ‘traditional’ or formal methods Re-analysis: Aitkin, M. et al (1981) Statistical modelling of data on teaching styles (with Discussion). J. Roy. Statist. Soc. A 144, 419-461 Used multilevel models to handle dependence of pupils within classes; no significant effect Also atomistic fallacy…………. 38 What does an individual analysis miss? ( Subramaniam, SV, Jones, K,et al (2009) 'Revisiting Robinson: The perils of individualistic and ecological fallacy', International Journal of Epidemiology ) • Re-analysis as a two level model (97m in 48 States) Who is illiterate? Individual model States People Does this vary from State to State? Cross-level interactions? 39 Analysis Strategies (cont.) III Contextual analysis. Analysis individual-level data but include group-level predictors Problem: Assumes all group-level variance can be explained by group-level predictors; incorrect SE’s for group-level predictors • • • • Do pupils in single-sex school experience higher exam attainment? Structure: 4059 pupils in 65 schools Response: Normal score across all London pupils aged 16 Predictor: Girls and Boys School compared to Mixed school Parameter Cons (Mixed school) Boy school Girl school Between school variance(u2) Between student variance (e2) Single level -0.098 (0.021) 0.122 (0.049) 0.245 (0.034) 0.985 (0.022) Multilevel -0.101 (0.070) 0.064 (0.149) 0.258 (0.117) 0.155 (0.030) 0.848 (0.019) SEs 40 Problem is worst for higher-level variables If we fit a single level model we get incorrect uncertainty intervals leading to incorrect inferences Single level model ignores the clustering effect of schools mixed school boy school girl school Multilevel model automatically corrects the SE, and CI’s 41 Analysis Strategies (cont.) IV Analysis of covariance (fixed effects model). Include dummy variables for each and every group Problems • What if number of groups very large, eg households? • No single parameter assesses between group differences • Cannot make inferences beyond groups in sample • Cannot include group-level predictors as all degrees of freedom at the group-level have been consumed • Target of inference: individual School versus schools 42 Analysis Strategies (cont.) 
V Fit single-level model but adjust standard errors for clustering (GEE approach) Problems: Treats groups as a nuisance rather than of substantive interest; no estimate of between-group variance; not extendible to more levels and complex heterogeneity VI Multilevel (random effects) model. Partition residual variance into between- and within-group (level 2 and level 1) components. Allows for un-observables at each level, corrects standard errors, Micro AND macro models analysed simultaneously, avoids ecological fallacy and atomistic fallacy: richer set of research questions BUT (as usual) need well-specified model and assumptions met. 43 Why should we use multilevel models? Sometimes: single level models can be seriously misleading! 44 What have we learned? • Multilevel models can handle social science research problems with “realistic complexity” • Complexity takes two forms and two types • As ‘Structure’ ie dependencies - naturally occurring dependencies Eg: pupils in schools ; measurements over time - ‘imposed-by-design’ dependencies Eg: multistage sample • As ‘Missingness’ ie imbalance - naturally occurring imbalances Eg: not answering in a panel study - ‘imposed-by-design’ imbalances Eg: rotational questions • Most (all?) social science research problems and designs are a combination of strict hierarchies, cross-classifications and multiple memberships 45 So what? • Substantive reasons: richer set of research questions – To what extent are pupils affected by school context in addition to their individual characteristics? – What proportion of the variability in achievement at aged 16 can be accounted for by primary school, secondary school and neighbourhood characteristics? • Technical reasons: – Individuals drawn from a particular ‘groupings’ can be expected to be more alike than a random sample – Incorrect estimates of precision, standard errors, confidence limits and tests; increased risk of finding relationships and differences where none exists 46 Some reading • Johnston, RJ, Jones, K, Propper, C & Burgess, SM (2007) Region, local context, and voting at the 1997 General Election in England, American Journal of Political Science, 51 (3), 641-655 • Jones, K, Subramanian, SV & Duncan, C. (2003) Multilevel methods for public health research', in Kawachi, I and Berkman L F (Eds.), Neighbourhoods and Health, Oxford University Press, 65-111. • Jones, K & Duncan, C. (2001) Using multilevel models to model heterogeneity: potential and pitfalls, Geographical Analysis, 32, 279305 • Bullen N, Jones K, Duncan C. (1997) Modelling complexity: analysing between individual and between-place variation—a multilevel tutorial. Environment and Planning. 29: 585–609 • Jones, K., and Duncan, C. (1998) ‘Modelling context and heterogeneity: Applying multilevel models’, in E. Scarbrough and E. Tannenbaum (Eds.), Research Strategies in the Social Sciences. Oxford University Press. 
47 Rest of Session and homework • Homework Task 2 : think of a social-science research problem in your field; does it have a hierarchical or non-hierarchical structure; are cross-classified model and multiple membership models appropriate; draw up a classification diagram for a realistically complex model • Task 3 read Chapter 1 in accompanying book “Developing multilevel models in MlwiN” (pp.3-11) • Laboratory 1 • Start working through Chapter 2 “Getting started with MlwiN” stop before Graphical Display (work through pp.12-30) • NB review Stats 101: following slides 48 Stat101: A reminder of some key ideas in statistics What you should have got out of Stats 101 and hopefully have! KEY concept: random coefficient model analyses the MEAN AND VARIANCE of a response variable as a FUNCTION of other predictor variables MEAN: the central tendency; the average or typical value; eg what is the average rainfall on this island? VARIANCE: the spread, the typical or ‘average’ variability around the mean; the average deviation around the mean eg how unequal is the rainfall, how consistent across the island? Questions of volatility, difference, variability, equality and segregation are answered by examining the variance Description, modelling and inference in the simplest possible case………..ESSENTIAL PRE-REQUISITES Description: calculating the Mean and Variance Index i Rainfall Inches (yi) Deviation from mean Deviations Squared 1 5 5-7 = -2 -2 * -2 = 4 2 7 7-7 = 0 0* 0=0 3 9 9-7 = 2 2* 2=4 Sum 21 0 8 Mean = [sum(Values)]/n = 21/3 = 7 inches n is the no of rainfall stations, = 3 yi is the amount if rain in inches at each of the i = 1,2,3 stations y Mean Deviation = [sum(Deviations)]/n = 0; Always, therefore useless! Variance = [sum(Deviations)2]/n = 8/3 = 2.666 inches-SQUARED = 2 Therefore square-root to get back to original measurements SD 2.666 1.633 INCHES Usually use n-1 to avoid downwardly biased estimate [1 df has been consumed in the calculation of the mean; only n-1 values can freely vary] Consequently, unbiased estimates Variance = 8/2 = 4 inches squared; SD= 2 inches 51 Description: calculating the covariance Station Y:Rainfall Inches Distance Inland Kms X1 DistanceInland Metres X1 1 9.20 0 0 2 7.37 1 1000 3 6.16 2 2000 4 3.04 3 3000 5 2.37 4 4000 n Covy , x [ ( y y )(x x)] / n 1 i 1 The extent to which two random variables vary together (co-vary) NB scale-dependent covariance Y and X1 : inches*kms:= -4.5 Y and X2 : inches* Metres: -4497.5 Therefore standardise to get a dimensionless number between -1 and 1; divide by the product of the Standard deviations Corry, x Covy, x /( y x ) NB scale independent correlation coefficient Y and X1 : inches kms:= -0.985 Y and X2 : inches Metres: -0.985 A covariance is a un-standardised correlation coefficient! 
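As a quick check of the scale-dependence point, the five stations above can be run through numpy (an added illustration; numpy's default divisor is n − 1, which reproduces the −4.5 on the slide):

```python
# Covariance is scale-dependent, correlation is not: a check on the five
# rainfall stations above (np.cov uses the n-1 divisor by default).
import numpy as np

rain = np.array([9.20, 7.37, 6.16, 3.04, 2.37])    # inches
dist_km = np.array([0, 1, 2, 3, 4], dtype=float)   # distance inland, km
dist_m = dist_km * 1000                            # same distances in metres

print(np.cov(rain, dist_km)[0, 1])    # about -4.5    (inches * km)
print(np.cov(rain, dist_m)[0, 1])     # about -4497.5 (inches * metres)

# Standardising by the two SDs gives a dimensionless correlation,
# identical whichever unit of distance is used.
print(np.corrcoef(rain, dist_km)[0, 1])   # about -0.985
print(np.corrcoef(rain, dist_m)[0, 1])    # about -0.985
```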
http://www.visualstatistics.net/Visual%20Statistics%20Multimedia/covariance.htm 52 Variances of sum of two variables (more esoteric but use later) EG summing the responses for each question on a test to obtain the total test score Can be expanded to give IE Sum of the variances of the variances plus two times their covariance Person Bia David Tina Alex Var Cov Test 1 3 1 3 9 12 Test 2 Total: T1 + T2 4 4 8 8 5.33333 7 5 11 17 28 5.33333 Variance of the Total = Var(T1) + Var(T2) + 2Cov(T1,T2) = 12 + 5.33 + 2(5.33) = 28 Knowing the variance and covariance of the Original variables, can calculate the Variance of the sum Theory II: Variance of a weighted sum of variables EG if we wanted to weight Test 1 by double counting, and Test 2 by treble counting Begin with double counting of Test 1: weight =2 Person Test 1 Mean: 3 * weight= 3* 2 = 6 Test 1 * 2 Bia 1 2 David 2 4 Tina 4 8 Alex 5 10 Mean 3 6 Var 2 8 Var(αY) = α2Var(Y) where α is the weight Variance : weight2 * 2=22 *2 = 8 Now do both weightings Person Test 1 Test 2 T1 * 2 T2*3 Bia 1 3 2 9 11 David 2 1 4 3 7 Tina 4 3 8 9 17 Alex 5 9 10 27 37 Mean 3 4 6 12 18 Var 2 9 8 81 133 Covar 3.5 (T1*2) + (T2*3) Var(αY + βX) = α2Var(Y) + β2Var(X) + 2αβCov(Y,X); α and β are the weights Variance[(T1*2)+ (T2*3)] = 22(2) + 32(9) + 2*2*3*3.5 = 133 Knowing the weights, variance and covariance of the Original variables, can calculate the Variance of the weighted sum 54 Modelling the Mean and Variance Verbal equation: the Observed Response = Mean + Deviation Index i Observed Rainfall (yi) Estimated Mean Deviation from mean 1 5 7 -2 2 7 7 0 3 9 7 2 Symbolically yi y e i 0 e, e ~ i i N (0, e ) 2 Note: distributional assumption that the deviations come from a Normal distribution; can & be must be assessed Note: mean and variance are the sample estimates of the underlying parameters 55 Inference I: Bounds or Coverage Intervals Make use of Normality assumptions to infer coverage intervals for island rainfall EG what are the likely extremes on the island? 20000 P o in ts o n th e Is la n d A theoretical Normal distribution with a Mean of 7 and a SD of 2; complete description 10000 0 0 10 20 A Normal Distribution with Mean of 7 and SD of 2 Using the properties of a Normal distribution 95% of places will lie +/- 1.96 SD’s from the mean 2.5% wettest = 7 + (1.96*2) = 10.92 inches 2.5% dryest = 7 - (1.96*2) = 3.08 inches 95% coverage interval = 3 to 11 inches (assumes real world follows a Normal distribution) 56 Inference II: Confidence, precision, and the Standard Error •All estimates have a degree of uncertainty or un-reliability around them •The Standard Error assesses this unreliability or imprecision We can estimate the SE given observed values (and assumptions) SE of the mean = SD/ √n = 2/√3 = 1.1547 Note: reliability affected by how much evidence we have (the size of the sample, n) and how variable rainfall is (SD) Again making use of Normal theory, we can be 95% confident that population mean lies between Upper 95 Confidence Interval: 7 + (1.96*1.1547) = 9.26 Lower 95 Confidence interval: 7 – (1.96*1.1547) = 4.74 We can be 95% confident (if our assumptions are met) that the Island’s mean rainfall is between 4¾ and 9¼ inches The SE is the special name for the SD of the sampling distribution of an estimate, ie a summary of the spread of an estimate in repeated sampling 57 Inference III: Confidence Intervals around Coverage Bounds Do not confuse the two! 
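Both of these results can be verified numerically; a short added sketch using the test scores and the island rainfall figures from the slides:

```python
# A small numeric check (illustrative, not part of the course materials) of
# two results used above: Var(aY + bX) = a^2 Var(Y) + b^2 Var(X) + 2ab Cov(Y, X),
# and the standard error / 95% interval for the island rainfall mean.
import numpy as np

# Test scores for Bia, David, Tina and Alex (from the slides)
t1 = np.array([3.0, 1.0, 3.0, 9.0])
t2 = np.array([4.0, 4.0, 8.0, 8.0])
a, b = 2.0, 3.0                         # weights: double- and treble-counting

# The identity holds whichever divisor (n or n-1) is used, provided it is
# used consistently; here ddof=1 (the unbiased, n-1 form).
lhs = np.var(a * t1 + b * t2, ddof=1)
rhs = (a**2 * np.var(t1, ddof=1) + b**2 * np.var(t2, ddof=1)
       + 2 * a * b * np.cov(t1, t2, ddof=1)[0, 1])
print(lhs, rhs)                         # equal

# Island rainfall: mean 7, SD 2 (n-1 divisor), n = 3 stations
rain = np.array([5.0, 7.0, 9.0])
mean, sd, n = rain.mean(), rain.std(ddof=1), len(rain)
se = sd / np.sqrt(n)                    # about 1.15
print(mean - 1.96 * se, mean + 1.96 * se)   # roughly 4.74 to 9.26
```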
Coverage Bound - given estimated Mean and Variance; what is the 95 coverage of possible observations - 95 % of the observed values should lie between these bounds -But like all estimates, they should have SE’s & confidence intervals! 95% Coverage SE of Fit 95.0% CI around CB 3.08: lower 0.51 (0.89, 5.27) 10.92: upper 1.80 (3.17, 18.67) Cannot tell a thing here due to small n Never done in practice, but making you aware of different concepts! 58 Mean and Variance as represented in MLwiN As a (single level) population model As an estimated model (RIGLS estimate; ie n-1) Estimated Mean Estimated Variance Estimated Standard Error of the Mean Estimated Standard Error of the Variance 59 Interpretation of the intercept and slope in simple regression 0 1 yi 0 1 x1i (ei ) [ei ] ~ N (0, ) 2 e Intercept: Mean of y when xi is zero Slope: Change in y when xi increases by 1 unit; how y changes with x + means if x is increased, so does y; also if x is decreased, then so does y - means if x is decreased, then y increases; if x increases then y decreases Station Rainfall Distance Inland Kms Distance Inland Metres 1 9.20 0 0 2 7.37 1 1000 3 6.16 2 2000 4 3.04 3 3000 5 2.37 4 4000 Slope Mean inches of Rainfall at the coast 60 The meaning of the fixed part of the model as averages Fixed part 0 1 x1i The response as a function of x1, how does the mean rainfall change as you move inland? IE, make predictions for y by substituting values of x1 in to the estimated equation ^ Station Rainfall Distance Predicted y 10 2 DistKm s Inland Kms Mean Rainfall 1 9.20 0 10 2 7.37 1 8 3 6.16 2 6 4 3.04 3 4 5 2.37 4 2 61 The meaning of the random part of the model ^ ei yi y i y i ( 0 1 x1i ) Random part [ei ] ~ N (0, e2 ) Residuals as deviations from the mean summarised as a variance Station Rainfall Distance Inland Kms Predicted Mean Rainfall Deviation or Residual 1 9.20 0 10 9.20 -10 = -0.8 2 7.37 1 8 7.37 – 8 = -0.63 3 6.16 2 6 6.16 – 6 = +0.16 4 3.04 3 4 3.04 – 4 = -0.96 5 2.37 4 2 2.37 – 2 = +0.37 [ei ] ~ N (0, e2 ); ^ y y (10 2 DistKm s) e2 0.343 The Variance is NOT a function of x1, same variability around the mean at the coast and far inland; ie homoscedaticity 95% Coverage bounds = Mean ± 1.96 * √Variance = Mean ± 1.96 * √0.343 Based on Normality assumptions 62 What’s going to be new? 
Nothing, except • (1) So far variance & covariance of observed or manifest variables; now latent or unobserved variables: eg measure people but posit & estimate place effects • (2) Variances at different levels; between people within n’hood, and also between n’hood • (3) Variances are a function of variables; ie variance changes with a predictor; eg greater n’hood differences for large as compared to small properties; boys more variable in achievement than girls 63 Session 2 • Varying relationships plots • Tasks 4 and 5 • From graphs to equations I the general structure of statistical models the simple regression model the random -intercepts two-level model as portrayed in MLwiN • Lab 2: MLwiN preliminaries (before modelling) 64 Varying relationships “There are NO general laws in social science that are constant over time and independent of the context in which they are embedded” Rein (quoted in King, 1976) 65 Varying relations plot • Ignore the complexities of Session 1 Two level model houses at level 1 nested within districts at level 2 • Single continuous response: price of a house • Single continuous predictor: size = number of rooms and this variable has been centred around average size of 5 Rooms x1i 1 2 3 4 5 6 7 8 -4 -3 -2 -1 0 1 2 3 66 67 68 Task 4 • Draw a single diagram relating a response (fat consumption) to a continuous predictor (age) centred around its national mean with the following characteristics no national relationship; substantial differences between 6 places in terms of the elderly, but less marked for middle-ages and least for young 69 TASK 5 • Draw three separate graphs showing the following relations between response of house prices and a 3-category predictor (detached, semi-detached, terrace) • a) differences between 3 categories of housing but no differences between places • b) differences between 3 categories of housing and same differences between places (random intercepts model) • c) differences between 3 categories of housing and different differences between places (random intercepts & slopes) • HINT: use different colours or line styles to show different places 70 General Structure for Statistical models • • • • Response = general trend + fluctuations Response = average + variability around the average Response = systematic component + stochastic element Response = fixed + random • Specific case: the single level simple regression model Response Systematic Part Random Part House Price Price of average- + sized house Cost of extra room house residual variation Intercept Slope = + Residual 71 Simple regression model 1 0 y x1 is the outcome, price of a house is the predictor, number of rooms, which we shall deviate around its mean, 5 Rooms 1 2 3 4 5 6 7 8 x1 -4 -3 -2 -1 0 1 2 3 72 Simple regression model yi 0 1x1i (ei ) yi x1 is the price of house i is the individual predictor variable 0 is the intercept: meaning? 1 is the fixed slope term: meaning? 
ei is the residual/random term, one for every house Summarizing the random term: ASSUME IID Mean of the random term is zero Constant variability (Homoscedasticy) No patterning of the residuals (i.e, they are independent) [ei ] ~ N (0, e2 ) 2 e is variations between houses conditional on size 73 Visualising IID assumptions I Both assumptions apply to the residuals (not the responses) 1)Independently distributed: ie no spatial or temporal autocorrelation; patternless residuals, ‘patterns’ has been captured in the fixed part of the model; a positive residual does not increase the chance of another positive residual nearby positive residual (fixed part under-prediction) negative residual (fixed part over-prediction) Left map: strongly patterned; eg western neighbourhoods are more expensive or wetter than predicted by model Right map: no strong patterning; this is what the model presumes 74 Visualising IID assumptions II 2)Identically distributed: ie each residual comes from a distribution with the same variance; homoscedaticity Eg Big houses are no more variable in their price than small ones Eg The same variability of rainfall at the coast as inland y Mean relationship ei has a constant variance for each value of xi x Distribution of y-values at each value of x is Normal with the same variance 75 Two-level random-intercept multilevel model Prices for house in a district is a function of the size of the house and a differential effect associated with its district Basic concept: of multilevel model: Develop models at each level and link them in a combined model. Micro-model: specifying the within-district part of model; ie house level Macro-model: specifying the between-district part of the model Note: the allowed-to-vary parameters of within-district model become the responses in the between-district model 76 Micro Model (Within-district, between-house) Price of a house in a district = Price of an averaged sized house in each district + Cost of extra room everywhere + Differential for each house in a district 77 Macro Model 1 (Between-district) Price of an averaged sized house in each district = Price of an averaged sized house (across all districts) + Differential (discount /premium) on the price of an averaged sized house for each district 78 Combined model Price of a house in a district = Price of an averaged sized house + Cost of extra room + Differential on the price of an averaged sized house for each district + Residual differential for each house in a district 79 Random intercepts model Premium Y ^ Citywide line u0 j 0 Discount 1 x 1 u0 j Differential shift for each district j : index the intercept Micro-model yij 0 j 1x1ij eij Macro-model: index parameter as a response 0 j 0 u0 j Price of average = district j citywide + price differential for district j Substitute macro into micro……. 
80 Random intercepts COMBINED model Substituting the macro model into the micro model yields yij ( 0 u0 j ) 1x1ij eij Grouping the random parameters in brackets yij 0 1 x1ij (u0 j eij ) • Fixed part 0 1 • Random part (Level 2) [u0 j ] ~ N (0, u20 ) • Random part (Level 1) [eij ] ~ N (0, e2 ) • District and house differentials are independent Cov[u0 j , eij ] 0 81 The meaning of the random terms • Level 2 : between districts [u0 j ] ~ N (0, u20 ) 2 u0 • Between district variance conditional on size • Level 1 : within districts between houses [eij ] ~ N (0, e2 ) 2 e • Within district, between-house variation variance conditional on size 82 Variants on the same model • Combined model: what variable is 1 associated with? and ? 0 yij 0 1 x1ij (u0 j eij ) • Fully explicit Combined model yij 0 x0ij 1x1ij (u0 j x0ij e0ij x0ij ) x0ij • Is the constant ; a set of 1’s • In MLwiN Differentials at each level 83 Chapter 3: Fitting a RI two-level model in MLwiN (pp.41) • Model 1: The null random intercepts model yij 0 x0ij (u0 j x0ij e0ij x0ij ) u0 j Y ? ^ 0 u0 j 84 Chapter 3 Fitting a RI two-level model in MLwiN (pp.41) • Model 1: The null random intercepts model • Standard sequence: - preliminaries: input, sorting, create constant - specification (response, structures, predictors, fixed and random) - estimation of model & initial interpretation - estimate residuals - graphics to aid interpretation - re-specify model 85 Laboratory 2 • Complete chapter 2: Getting started in MLwiN: preliminaries before modelling • If completed; visit the Bristol Centre for Multilevel Modelling 86 Resources Centre for Multilevel Modelling http://www.cmm.bris.ac.uk Provides access to general information about multilevel modelling and MlwiN. Lemma training repository http://www.ncrm.ac.uk/nodes/lemma/about.php Email discussion group: www.jiscmail.ac.uk/multilevel/ 87 http://www.cmm.bristol.ac.uk/ 88 http://www.cmm.bristol.ac.uk/learning-training/course.shtml 89 http://www.cmm.bristol.ac.uk/links/index.shtml 90 http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/index.shtml 91 The MLwiN manuals are another training resource http://www.cmm.bristol.ac.uk/MLwiN/download/manuals.shtml 92 Session 3 • Review of the random intercepts model • Types of question tackled by Multilevel models • Residuals in multilevel models - concepts and estimation • Fitting a two-level model to house-price data - + a single predictor • Context, composition and valued-added analyses • Laboratory 3: an analysis of the JSP data - pupils within school - random intercepts null model - then add in single predictor 93 2-level random-intercept multilevel model: review • Combined model in full yij 0 x0ij 1x1ij (u0 j x0ij e0ij x0ij ) • Fixed part 0 1 • Random part (Level 2) • Random part (Level 1) 2 u0 2 e0 94 Type of questions tackled by multilevel modelling I • 2-level model: current attainment given prior attainment of pupils(1) in schools(2) • NB assuming a random sample of pupils from a random samples of schools • Do Boys make greater progress than Girls (F) • Are boys more or less variable in their progress than girls?(R) • What is the between-school variation in progress? (R) • Is School X different from other schools in the sample in its effect? (F) • continued……. 95 Type of questions tackled by multilevel modelling II • Are schools more variable in their progress for pupils with low prior attainment? (R) • Does the gender gap vary across schools? 
(R)
• Do pupils make more progress in denominational schools? (F)
• Are pupils in denominational schools less variable in their progress? (R)
• Do girls make greater progress in denominational schools? (F) (cross-level interaction)
96 Residuals in a two-level random intercept model: simplified to three groups (Red, Blue, Other districts) [plot of Price against Size]
97 Residuals in a single-level model [plot of Price against Size: one fitted line ŷ = β0 + β1x1, with a residual ei for each house]
98 Random intercepts model [plot of Price against Size: the citywide line ŷ = β0 + β1x1 and parallel district lines ŷ = β0j + β1x1]
99 Residuals in a two-level model – Level 2 [plot: the district differentials u0j, the vertical gaps between each district line and the citywide line]
100 Residuals in a two-level model – Level 1 [plot: the house differentials eij, the gaps between each house and its own district line]
101 Two ways of estimating a set of group effects
• Method 1: treat each group (eg each school) as a separate entity, ie a fixed classification: put a dummy in the model for each and every school
• Equivalent to fitting a separate line to each group; in terms of estimates of coefficients you get the same results as 'select if'
• Equivalent to ANOVA/ANCOVA, ie a fixed-effects analysis
• No general line, just many separate lines (and all the degrees of freedom at the higher level are used up)
Two ways of estimating a set of group effects
• Method 2: treat each group (eg each school) as belonging to a distribution, ie a random classification: group differences are assumed to come from a distribution in the random part of the model, [u0j] ~ N(0, σ²u0)
• Estimate a general line (the fixed part) and allow it to vary from group to group, ie a random-effects analysis
• Will the fixed results differ from the random? Potentially YES! If a group line is imprecisely estimated it is shrunk towards the general line; if it is precisely estimated (eg many pupils in a school) there is little difference; random effects is the protective, conservative approach
Estimating multilevel residuals
• In single-level models, residuals can be estimated simply by subtracting the individual predictions from the observed values
• In multilevel models it is more complex: a two-stage procedure: 1: estimate the variances; 2: estimate the specific random effects
• Raw residual for level-1 units (departures from the general line): rij = yij − ŷij
• Raw residual for level-2 units: r̄j, the mean of the rij in group j (ie a separate analysis; see next slide)
• The estimated, precision-weighted residual for level-2 units (the random effect) is then û0j = [σ²u0 / (σ²u0 + σ²e0/nj)] r̄j
104 Raw residuals [plot of Price against Size: the general line across houses and districts, a raw house residual ri, and the mean residual r̄j for district j]
• The raw residual for a level-1 unit is made up of a house differential and a district differential; r̄j is the mean residual for district j (eg Red district = ?; Blue = ?)
• Method 1 gives a separate (fixed) estimate of the district difference; it takes no account of the district belonging to a distribution, so it has to be shrunk to get the estimate of the district random effect, ie Method 2
105 AKA posterior residuals
û0j = [σ²u0 / (σ²u0 + σ²e0/nj)] r̄j
where σ²u0 is the between-district variation, σ²e0 is the within-district, between-house variation, nj is the number of houses in that district, and r̄j is the raw residual for that district
106 AKA shrunken residuals
û0j = [σ²u0 / (σ²u0 + σ²e0/nj)] r̄j
The multiplier in this formula is always less than or equal to 1, so the estimated residual û0j is less than or equal in magnitude to the raw residual r̄j. The raw residual is multiplied by a shrinkage factor, and the estimated (or posterior) residual is therefore also sometimes called a shrunken residual.
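A small numeric illustration of the shrinkage factor (the variance estimates and raw residual below are made-up values, chosen only to show how the factor behaves as the number of houses in a district changes):

```python
# Illustrative calculation of the shrunken (posterior) residual
#   u_hat_0j = [ sigma2_u0 / (sigma2_u0 + sigma2_e0 / n_j) ] * rbar_j
# The parameter values and raw residual are assumed, not estimates from
# the course data.
sigma2_u0 = 4.0            # between-district variance (assumed)
sigma2_e0 = 16.0           # within-district, between-house variance (assumed)
raw_mean_residual = 3.0    # rbar_j for one district (assumed)

for n_j in (1, 5, 25, 100):   # number of houses sampled in that district
    shrinkage = sigma2_u0 / (sigma2_u0 + sigma2_e0 / n_j)
    print(n_j, round(shrinkage, 2), round(shrinkage * raw_mean_residual, 2))
# With only 1 house the estimate is pulled strongly towards zero (the general
# line); with 100 houses it is barely shrunk at all.
```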
Greatest shrinkage towards zero when between district variance is relatively small and when the raw residual would be un-reliably estimated: Session 10 provides comparison 16 minute video on multilevel residuals http://www.cmm.bristol.ac.uk/learning-training/videos/RP/Residuals/Residuals.html 107 Residuals and variances uˆ0 j ^ 2 u0 • The estimated residual, • AKA the random effect at level 2; • AKA the allowed-to-vary differential effect; • These have been shrunk towards general line • The ‘difference a school makes’, • Positive and negative values • The variance at level 2 • Summarizes the differences, • Not the variance of the shrunken residuals uˆ0 j • But variance of the raw mean residuals, rj • Not the estimated between group variance of the sample • But the estimated between-group variance in the • population…….. More later and comparison Chapter 3: Fitting a RI two-level model in MLwiN (pp.39) • Model 1: The null random intercepts model yij 0 x0ij (u0 j x0ij e0ij x0ij ) u0 j Y ? ^ 0 u0 j 109 Chapter 3 Fitting a RI two-level model in MLwiN (pp.39) • Model 1: The null random intercepts model • Standard sequence: - preliminaries: input, sorting, create constant - specification (response, structures, predictors, fixed and random) - estimation of model & initial interpretation - estimate residuals Page 48 graphics to aid interpretation re-specify model 110 Moving from a null model to a model with a predictor What can happen as we include a level 1 predictor? What might happen to the between school and between pupil within school variances when we condition on prior ability in a value added study? Y ^ Y ^ Prior ability x 0 1 1 0 Prior ability Both the between school and between pupil within school variation could be reduced when we model attainment as a function of prior ability Apparent contextual effects are in fact due to different composition in terms of intake 111 Moving from a null model to a model with a predictor What can happen as we include a level 1 predictor? What might happen to between neighbourhood variation when we condition on size of property? The between neighbourhood variation could increase when size included; Contextual effects are being masked; expensive neighbourhood have small houses; cheap neighbourhood have large houses 112 Laboratory 3: an analysis of the JSP data Fitting you own models! The data are from the Junior School project, a longitudinal study of 2000 pupils (level 1) who entered junior school aged 7 and left in 1984 aged 11. The JSP pupils attended 48 primary schools (level 2) selected randomly from the 636 such schools of the Inner London Education Authority. The response is a score for mathematics at age 9 (Maths5), that is year 5 of the National Curriculum and the single predictor is Math3, that is year 3 of the National Curriculum. At this stage we want you to fit two models 1 random intercepts null model with Maths5 as the response and only a constant term in the model; the model effectively analyses attainment and the main question is consequently “is there a between school difference in attainment”? Continued…………. 113 Laboratory: an analysis of the JSP data 2 random intercepts model with Maths5 as the response and two predictors, the constant and Math3 We would like you to centre Math3 around the value of 25. This analysis effectively models progress, that is attainment given prior achievement. The aim of this analysis is assess the valueadded that schools are making to progress in Mathematics. 
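For orientation only, the two laboratory models can also be written outside MLwiN; a rough sketch using statsmodels (the file name is an assumption, and the column names follow the hint given below):

```python
# A rough parallel of the two laboratory models, for orientation only: the lab
# itself is done in MLwiN. The file name is an assumption; the column names
# follow the hint given below (school, pupil, math3, math5, Male).
import pandas as pd
import statsmodels.formula.api as smf

jsp = pd.read_csv("jsp.csv")                     # hypothetical file
jsp["math3c"] = jsp["math3"] - 25                # centre prior score at 25

# Model 1: null random-intercepts model -- is there between-school
# variation in attainment at all?
m1 = smf.mixedlm("math5 ~ 1", data=jsp, groups=jsp["school"]).fit()

# Model 2: add the centred prior score -- between-school variation in
# progress (value-added), conditional on math3.
m2 = smf.mixedlm("math5 ~ math3c", data=jsp, groups=jsp["school"]).fit()

for label, m in [("attainment (null)", m1), ("progress (value-added)", m2)]:
    between = float(m.cov_re.iloc[0, 0])   # between-school variance
    within = m.scale                        # between-pupil, within-school variance
    share = between / (between + within)
    print(label, "- share of variance between schools:", round(share, 3))
```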
It is important to realise that exactly the same test was used on both occasions for the Maths score. Hint Read in the data and name the columns c1 ‘school’ c2 ‘pupil’ c3 ‘math3’ c4 ‘math5’ c5 ‘Male’ Male is a binary predictor where 1 represents male; ignore for now NB Strongly advised to save each model as you go along with a ‘meaningful’ file name eg jspmod1 114 Session 4 • Review Lab 3 • Random slopes AKA random-coefficient models • From graphs to equations • Characteristics of a random slopes model • Fitting random slopes models in MLwiN - allowing price-size relationships to vary over districts - hybrid model with districts in the fixed and random part • Laboratory 4: an analysis of the JSP data 115 Random slopes & Random Intercepts: AKA random-coefficient models 116 Two-level model In words: Variation in prices for houses in neighborhoods is a function of the size of the house; an effect associated with neighbourhoods; and a differential effect for size in relation to neighbourhoods Basic concept (as before) Develop models at each level and link them in a combined model. Micro-model: specifying the within-district part of model Macro-model: specifying the between-district part of the model Note: the allowed-to-vaty parameters of within-district model become the responses in the between-district model 117 Micro Model (Within-district, between-house) Price of a house in a district = Price of an averaged sized house in each district + Cost of extra room in each district + Residual differential for each house in a district 118 Macro Model 1 (Between-district) Price of an averaged sized house in each district = Price of an averaged sized house (across all districts) + Differential for averaged sized house for each district 119 Macro Model 2 (Between-district) Cost of an extra room in each district = Cost of an extra room (across all districts) + Differential in the cost of an extra room for each district 120 Combined model Price of a house in a district = Price of an averaged sized house (generally) + Cost of extra room (generally) + Differential for the price of an averaged sized house for each district + Differential in price of an extra room for each district + Differential for each house in a district 121 Random slopes model Micro-model yij 0 j x0ij 1 j x1ij e0ij x0ij Note: Index the intercept and the slope associated with a constant, and number of rooms, respectively Macro-model (Random Intercepts) 0 j 0 u0 j Macro-model (Random Slopes) 1 j 1 u1 j Slope for district j = citywide slope + differential slope for district j Substitute macro models into micro model………… 122 Random slopes model Substituting the macro model into the micro model yields yij (0 u0 j ) x0ij (1 u1 j ) x1ij e0ij x0ij Multiplying the parameters with the associated variable and grouping them into fixed and random parameters yields the combined model: yij 0 x0ij 1x1ij (u0 j x0ij u1 j x1ij e0ij x0ij ) 123 Characteristics of random slopes model yij 0 x0ij 1x1ij (u0 j x0ij u1 j x1ij e0ij x0ij ) • Fixed part 0 1 u1 j u0 j • Random part (Level 2) u0 j u20 ) ~ N (0, 2 u1 j u 0u1 u1 • Random part (Level 1) [e0ij ] ~ N (0, e20 ) Estimated correlation between two sets of random effects = covariance divided between the product of the square root of the variances corr(u0 j , u1 j ) u 0u1 /( u0 u1 ) 124 attain pre-test y x i 0 1 i 2 ei ~ N (0, e ) yij 0 j 1 xij eij 0 j 0 u0 j 2 u0 j ~ N (0, u 0 ) e ~ N (0, 2 ) e ij attain pre-test attain pre-test attain pre-test attain pre-test y x e 0j 1 j ij ij ij u 0 0j 0j 1 j 1 u1 j u0 j 2 ~ 
N (0, ) : u 0 2 u u u1 u1 j u 01 e ~ N (0, 2 ) e ij 125 Interpreting varying relationship plot through mean and variance-covariance Intercepts: terms associated with Constant x0 Graph Mean 0 Slopes terms associated with Predictor x1 Variance Mean u20 1 Variance u21 Intercept/Slope terms associated with x0 x1 Covariance u 0u1 A + 0 + 0 undefined B + + + 0 undefined C + + + + + D + + + + - E + + + + 0 F + + 0 + + 126 Equation in MLwiN 127 Task 6 • Specify a micro model relating house prices (y) to number of rooms (x1) terraced or not (x2) and garage or not (x3). Specify macro models in which: 1 there are differential overall costs for living in different districts of a city; 2 the cost of an extra room varies from place to place 3 the cost of a non-terraced house varies from place to place. • Write the combined model, and specify as variances/ covariances at each level. What is the implicit assumption about how the cost of a garage varies over the city? 128 Fitting Random-Slopes Model in MLwiN (p.59) Model 3 129 Laboratory 4: an analysis of the JSP data • Return to the JSP data and fit a new model in which the relation between Maths5 and Math3 is allowed to vary between schools? • Is there a difference between schools for the average pupil on entry? • For what sort of pupil do schools make the most difference? • NB do not forget to save the resultant model as something useful like jspmodel3! • Work through Chapter 4 on your own applied to the Jsp data 130 Session 5 • Review Lab 4 • Dependent observations • Modeling variance functions: a general approach to characterize multilevel models • Heterogeneity at level-2 • Calculating and displaying intra-unit correlation and variance functions in MLwiN • Laboratory 5: an analysis of the JSP data continued – variance functions 131 Modeling correlated /dependent data “Everything is related to everything else, but near things are more related than distant things.” (Tobler 1970) Yet, conventional methods and software assume independence (IID) 132 Same thing, different names • Auto-correlation = “self” correlation - temporal autocorrelation -spatial autocorrelation • Clustered data • Intra-class correlations • Intra-unit correlations • Variance partitioning coefficientS 133 Characteristics of a dependent data • It is dependent according to some grouping; eg households tend to be alike; measurements on occasions within people tend to very alike • Consequently; not as much level-1 information as appears; not as many degrees of freedom as we think • If ignored leads to overstatement of significance (mis-estimate precision; standard errors too small; Type I error – finding relationships where none exists) • ρ = 0.01; m= 4, n= 100 nominal α = 0.05; true α = 0.17 134 Sources that generate clustered data • Survey design – Multistage sampling (primary sampling unit induced clustering; known as design effects; interviewer induced clustering); • Population induced clustering (social processes!) – Schools, Clinics, Hospital settings, Workplaces, ‘Neighborhoods’, Administrative areas, Social groupings) 135 So, what do we about it? 
• Ignore it (increasingly getting less common in social and health research) • ‘Adjust’ or ‘Correct’ it (probably getting most common) eg Sudaan http://www.rti.org/sudaan fits marginal or population-averaged models using generalized estimating equations (GEE) • Expect and model it explicitly (increasingly becoming common) – Multilevel Modeling Approach 136 Problem is worst for higher-level variables If we fit a single level model we get incorrect uncertainty intervals leading to incorrect inferences Single level model ignores the clustering effect of schools mixed school boy school girl school Multilevel model automatically corrects the SE, and CI’s 137 ML approach to modeling clustered data In the two-level random intercepts case yij 0 x0ij 1x1ij (u0 j x0ij e0ij x0ij ) it is simply, u20 2 u 0 e20 Which is: between-district variation divided by the between-district and within-district (betweenhouse) or total variation What happens when = 1 .............. 138 yij 5 (u0 j x0ij e0ij x0ij ) j u0 j e0ij 1 -3 0 2 1 -3 0 . . . . . . . . . . 2 10 1 -3 0 8 1 2 +3 0 8 2 2 +3 0 . . . . . . . . . . 8 15 2 +3 0 4 1 3 -1 0 4 2 3 -1 0 . . . . . . . . . . 4 6 3 -1 0 6 1 4 +1 0 6 2 4 +1 0 . . . . . . . . . . 6 25 4 +1 0 yij i 2 1 2 139 Meaning of • When = 0 - No dependency, all level-1 information, no differences between groups • When = 1 Complete dependency Maximum difference between groupings equivalent to Complete similarity within grouping 9 minute video on dependency in multilevel models http://www.cmm.bristol.ac.uk/learningtraining/videos/RP/dependency/dependency.html 140 Notes on modeling clustering In the single-level case yi 0 x0i 1x1i (e0i x0i ) • Assume, that y given x: pattern-less residuals, conditionally no dependence • If e0i is independent (not patterned), given x, then SE, t test etc correctly estimated in the fixed part. If not, then problem In the multilevel case yij 0 x0ij 1x1ij (u0 j x0ij e0ij x0ij ) • Assume, e0ij is independent, given x and u0j; • In other words, u0j is district “similarity” effect • Consequently SE’s, t test etc are now corrected for this similarity within district (the DISTRICT premium/discount) 141 Correlation structure of single level model yi 0 x0i 1x1i (e0i x0i ) Pupils within classes: FULL correlation structure S 1 1 1 2 2 2 2 3 3 3 P 1 2 3 1 2 3 4 1 2 3 1 1 1 0 0 0 0 0 0 0 0 0 1 2 0 1 0 0 0 0 0 0 0 0 1 3 0 0 1 0 0 0 0 0 0 0 2 1 0 0 0 1 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 0 0 0 2 3 0 0 0 0 0 1 0 0 0 0 2 4 0 0 0 0 0 0 1 0 0 0 3 1 0 0 0 0 0 0 0 1 0 0 3 2 0 0 0 0 0 0 0 0 1 0 3 3 0 0 0 0 0 0 0 0 0 1 142 yij 0 x0ij 1x1ij (u0 j x0ij e0ij x0ij ) Correlation structure of two level model S 1 1 1 2 2 2 2 3 3 3 P 1 2 3 1 2 3 4 1 2 3 1 1 1 0 0 0 0 0 0 0 1 2 1 0 0 0 0 0 0 0 1 3 1 0 0 0 0 0 0 0 2 1 0 0 0 1 0 0 0 2 2 0 0 0 1 0 0 0 2 3 0 0 0 1 0 0 0 2 4 0 0 0 1 0 0 0 3 1 0 0 0 0 0 0 0 1 3 2 0 0 0 0 0 0 0 1 3 3 0 0 0 0 0 0 0 1 143 Correlation structure of cross-classified model (pupils within schools & neighbourhoods) S 1 1 1 2 2 2 2 3 3 3 P 1 2 3 1 2 3 4 1 2 3 1 1 1 0 0 0 2 0 0 0 1 2 1 0 0 0 0 0 0 0 1 3 1 0 0 0 0 0 2 0 2 1 0 0 0 1 0 0 0 2 2 0 0 0 1 0 0 0 2 3 0 0 0 1 0 0 0 2 4 2 0 0 1 0 0 0 3 1 0 0 0 0 0 0 0 1 3 2 0 0 2 0 0 0 0 1 3 3 0 0 0 0 0 0 0 1 144 School styles study • Bennet’s (1976) “teaching styles” study uses a singlelevel model • test scores for English, Reading and Mathematics aged 11 were significantly influenced by teaching style • PM calls for a return to ‘traditional’ or formal methods • Re-analysis: Aitkin, M., Anderson, D.A. and Hinde, J.P. 
(1981) Statistical modelling of data on teaching styles (with Discussion). J. Roy. Statist. Soc. A 144, 419-461. • Using proto- multilevel models to handle dependence of pupils within classes • Spencer, N.H. (2002) "Combining modelling strategies to analyse teaching styles data". Quality and Quantity, 36, 2, pp 113-127. 145 ML approach to modeling clustered data In the two-level random intercepts and slopes case yij 0 x0ij 1x1ij (u0 j x0ij u1 j x1ij e0ij x0ij ) it is not so simple! u20 2 u 0u1 x1ij u21 x12ij 2 u 0 2 u 0u1 x1ij u21 x12ij e20 So what is this beast all about? 9 minute video on correlation and covariance matrices http://www.cmm.bristol.ac.uk/learning-training/videos/RP/corr-covar/corr-covar.html 146 Modeling heterogeneity through variance functions • In multilevel models, we have sample of houses, AND a sample of districts. • We need to infer to the general population of houses and the general population of districts • What does the sample suggest about the population from which it is drawn? • We need to do this not simply for ‘population’ averages, but also ‘population heterogeneities’. Or VARIANCE FUNCTIONS • Modeling variation as a function of observed predictor variables: now level 2; later level 1 147 RI model: Constant variance function yij 0 x0ij 1x1ij (u0 j x0ij e0ij x0ij ) • Total variance at level-2 based on the random variable u0 j Var(u0 j x0ij ) x 2 u 0 0ij • Total variance at level-1 based on the random variable e0ij Var(e0ij x0ij ) x 2 e0 0ij • Normality assumptions – 95% of district lines will lie within +/- 2 standard deviations of the overall line – Population of districts likely to vary within the tramlines 148 Between district-heterogeneity as CONSTANT function of size Var(u0 j x0ij ) u20 x0ij • Variance function by Size • Population bounds for districts y x1 x1 149 Random Intercepts and Random Slopes Model: Quadratic variance function at level-2 yij 0 x0ij 1x1ij (u0 j x0ij u1 j x1ij e0ij x0ij ) • Total variance at level-2, sum of TWO random terms u0 j u1 j Var(u0 j x0ij u1 j x1ij ) u20 x02ij 2 u 0u1x0ij x1ij u21x12ij • Total variance at level-1 based on the random variable e0ij Var(e0ij x0ij ) e20 x0ij • Clearly (if random slopes are required) results in a quadratic function at level-2 150 Between district-heterogeneity as QUADRATIC function of size Var(u0 j x0ij u1 j x1ij ) u20 x02ij 2 u 0u1x0ij x1ij u21x12ij • Population bounds for districts • Variance function by Size y x1 x1 151 Between district-heterogeneity as QUADRATIC function of size Var(u0 j x0ij u1 j x1ij ) x 2( u 0u1 ) x0ij x1ij x 2 2 u 0 0ij • Variance function by Size 2 2 u1 1ij • Population bounds for districts y x1 x1 152 Constant, quadratic and linear variance functions Constant Var(u0 j x0ij ) x A 2 u 0 0ij Quadratic Var(u0 j x0ij u1 j x1ij ) x 2 u 0u1 x0ij x1ij x 2 2 u 0 0ij 2 2 u1 1ij A B C Linear Var(u0 j x0ij u1 j x1ij ) u20 x02ij 2 u 0u1 x0ij x1ij A B • Linear is a simplification of the Quadratic variance function in which the C element is zero • Perfectly plausible as a variance function, and more parsimonious • And can do 2-step procedure; estimate the variance, then estimate the level 2 residuals – pictures …………… 153 Between district-heterogeneity as LINEAR function of size Var(u0 j x0ij u1 j x1ij ) x 2 u 0u1 x0ij x1ij 2 2 u 0 0ij • Variance function by Size • Population bounds for districts y x1 x1 154 Between district-heterogeneity as LINEAR function of size Var(u0 j x0ij u1 j x1ij ) u20 x02ij 2( u 0u1 ) x0ij x1ij • Variance function by Size • Population bounds for districts y 
x1 x1 155 Default setting for variance function in MLwiN • What model is suggested by this level 2 result? u0 j u20 Var ~ N (0, ) 2 u1 j u 0u1 u1 27.46(3.2) 0.00(0.00) 0.00(0.00) • But this may be hiding the full result u0 j u20 Var ~ N (0, ) 2 u1 j u 0u1 u1 27.46(3.2) 15.32(1.23) -0.00(0.61) Empirically negative result • Default: if higher-level variance becomes negative both variance and covariance are set to zero 156 Changing the default setting • In the command window • RESE at level ? With option ? Option 0 Default 1 2 Action Variance is set to zero as, is covariance Negative variance reset to 0, but not associated covariance (allows linear function) No re-setting takes place • Example and recommended Rese 1 1 Rese 2 1 Rese 3 1 157 Changing the default setting in GUI 158 Laboratory 5: an analysis of the JSP data • Return to the jsp data and draw graphs of level-2 and level-1 variance functions;pictures of the between school and between pupil populations , and VPC for • Model 1: • Model 2: null random intercepts model random intercepts model with Maths3 as predictor • Model 3: random intercepts and slopes model: quadratic function and level 2 • New Model: linear function for between school heterogeneity 159 Modeling variance functions in MLwiN Chapter 5 (p.84) Calculate rho (= VPC) for range of models Display the graphs of population heterogeneity at each level for a range of models 160 Session 6 • Review Lab 5 • Heterogeneity at level-1 -Linear and quadratic • Significance testing - specific parameters, overall model • Laboratory 6: an analysis of the JSP data continued 161 Complex heterogeneity at level-1 • So far, we assumed a constant variance function at level-1 – the assumption of HOMOCEDASTICITY . But other possibilities exist: – Large properties may not only be expensive; they may also be more variable (level 1 variance function fanning out) – Pupils with high prior ability, less variable in mathematics progress (level 1 variance function fanning in) – Girls may be better than boys on average but also more consistent (level 1 variance function; differential variability) Need to model heterogeneity explicitly 162 CATCH-ALL plot for diagnosis • CATCH-ALL plot: residuals (on the vertical axis) by predicted response (on the horizontal axis). What we are looking for is a structure-less plot. Magnifies problems 3 5 2 resids resid 1 0 0 -1 -2 -5 -3 30 40 50 yhat 60 70 30 40 50 60 70 yhat • Read Chapter 4 diagnostic procedures 163 A range of catch-all plots 164 Complex heterogeneity at level-1 • Should be routinely done, but almost nobody ever does it. Not possible in most regression software • Why is it important, especially for multilevel models? – Confounding of variances between levels, what appears to be complex level-2 could be a complex level-1 – Implications for fixed estimates standard errors Substantively – EG impact heterogeneity. There is an ‘average’ treatment effect, but are there some groups with greater impact variation? 165 So, what do we about it? 
• Ignore it: very common • ‘Adjust’ or ‘Correct’ it through transformations of y or x’s or WLS, rare • Expect and model it explicitly; that is Multilevel Modeling Approach; unfortunately very rare • Exactly same principles as adopted for Level-2 166 House-heterogeneity as CONSTANT function of size (homoscedasticity) • Micro model yij 0 j x0ij 1 j x1ij (e0ij x0ij ) Var(e0ij x0ij ) e20 x0ij • Variance function by Size • Population bounds for houses y x1 x1 167 • Between house-heterogeneity as QUADRATIC function of size Micro model yij 0 j x0ij 1 j x1ij (e0ij x0ij e1ij x1ij ) Var(e0ij x0ij e1ij x1ij ) e20 x02ij 2 e0e1x0ij x1ij e21x12ij • Variance function by Size • Population bounds for houses y x1 x1 168 Between house-heterogeneity as QUADRATIC function of size • Micro model yij 0 j x0ij 1 j x1ij (e0ij x0ij e1ij x1ij ) Var(e0ij x0ij e1ij x1ij ) e20 x02ij 2( e0e1 ) x0ij x1ij e21x12ij • Variance function by Size • Population bounds for houses y x1 x1 169 Between house-heterogeneity as LINEAR function of size • Micro model yij 0 j x0ij 1 j x1ij (e0ij x0ij e1ij x1ij ) Var(e0ij x0ij e1ij x1ij ) x 2 e0e1x0ij x1ij 2 2 e0 0ij • Variance function by Size • Population bounds for houses y x1 x1 170 Between house-heterogeneity as LINEAR function of size • Micro model yij 0 j x0ij 1 j x1ij (e0ij x0ij e1ij x1ij ) Var(e0ij x0ij e1ij x1ij ) e20 x02ij 2( e0e1 ) x0ij x1ij • Population bounds for houses • Variance function by Size y x1 x1 171 Level 1 variation Level 2 variation Modeling heterogeneity at multiple levels x1 x1 Level-2: Positive quadratic function Level-1: Negative linear function u20 x02ij 2 u 0u1x0ij x1ij u21x12ij e20 x02ij 2( e0e1 ) x0ij x1ij 172 95% population bounds y y x1 Fixed price/size relationship along with the quadratic bounds for between district variations x1 Fixed price/size relationship along with the linear bounds for within district variations 173 WHY Use variance functions? Substantively • Some research questions are about variances as well as means: segregation, inequality are all about differences between units; lots of such work remains descriptive • Impact heterogeneity: eg 2 interventions that reduce the mean of the problem, but one is preferred becomes it narrows the variance; ie more consistent in effect. • Impact heterogeneity eg 2 interventions that have same mean effect but one is more consistent for men, the other for women; the variance is structured by gender • First lesson in data analysis; middle and scatter around the middle; yet in modelling: homoscedastic assumption, ie all variation consigned to one term and labelled error! 174 WHY Use variance functions? Technically • Gives correct estimate of standard errors in the presence of heteroscedasticity • Multilevel models are about variance functions at higher levels; eg between district variance in terms of price BUT can get confounding across levels; need explicit modelling of variance at all levels • Example: between-district variation in price in terms of size, requires model simultaneously betweenhouse variation in terms of size • Example: detached houses are more expensive than non-detached (fixed), have bigger differences between neighbourhoods (level 2) and bigger differences between houses (level 1) 175 Exemplifying variance heterogeneity • How has life satisfaction changed in Germany 1991-2006? 
• Structure: 15 occasions for 16k individuals in 16 lander • Response: Life satisfaction on a 10 point score 176 Model specification Fixed part: 3rd order polynomial function of time and East/West interaction Between Lander as function of time and East/West Between Individuals as function of time and East/West Between occasions as function of time and East/West 177 Exemplifying variance heterogeneity 2: varying treatment effects • How has diastolic BP changed with 3 different drug treatments • Structure: 4 post-treatment occasions for 287 patients nested in 29 treatment centres; ie longitudinal, panel model • Baseline average BP is 102 • Below: raw data and fixed effects • All three drugs lower BP; no significant differences between them Variance functions • Variance increases over time between centre • Drug A: greater volatility around falling trend • Drug B: most consistent and reduction over time • Drug C: less consistent, but reduction over time Modeled patient Trajectories • • • Drug A: divergence in BP around falling mean Drug B: convergence in BP around falling mean Drug C: variable impact Modelling segregation Motivation: are we become a segregated society? Virtuous and Vicious circles Following 1988 Education Reform Act - choice, league tables, competition expectation of INCREASED segregation Attractive to High status parents Apparent high performance Apparent poor performanc e Unattractive to High status parents Apparent worsening Apparent improved performance Choice Choice performance increased polarization in terms of ability increased polarization in terms of socio-economic background; poverty; ethnicity etc Research Questions Eligibility for free school meals as a measure of poverty • Has school FSM segregation increased? • Has LA segregation increased? • Has segregation been differential between types of schools and types of LA’s Approach: model the variance not calculate an Index; low variance = low segregation FSM: the data • Source: Pupil Level Annual School Census • Outcome: Proportion of intake Eligible for FSM • Intake: Year 7 of the national curriculum in 2001-2006, Action LA’s Schools Cohorts Pupils Complete data from PLASC; 2001-2006 148 5.615 26,178 3,587,459 Omit ‘special’ schools 148 4.088 20.952 3,536,152 Omit cohorts with less than 20 pupils 148 3,636 20,429 3,535,056 Omit schools without a new intake at aged 11 (ie middle schools) LA loss, eg IOW, Poole 144 3,076 17,695 3271,010 Omit cohorts with implausible year-on-year differences (see next slide) 144 3,076 17,637 3,261,372 Are LA’s that are selective (Grammar/ Secondary) more segregated than totally Comprehensive systems? • • Pupils going to school in Selecting LA’s are less likely to be in poverty Slight decline in poverty in both types of area • • Schools in Selecting areas are more segregated Slight evidence of an increase Is there more segregation in areas that are Selective and where less schools are under LA control in terms of admission policies? Variance function for Selective/Non-selective, structured by the proportion of pupils in an LA who go to Community or Voluntary Controlled schools (contra Voluntary Aided,Foundation, CTC’s, Academies • Pupils going to school in NonSelecting LA’s with low LA control are more likely to be in poverty • • Schools in Selecting areas are more segregated Segregation decreases with greater LA control for both types of LA Area characteristics 3 • • Which of England’s LA’s have the most segregated school system? 
Model with 144 averages and 144 variances, one for each LA! 185 Which of England's LA's have the most segregated school system? Modelling 4.2 million pupils; 144 Means and Variances!!
[Scatterplot 'LA analysis FSM 2001-6': between-school variance (vertical axis, 0 to 4) against median proportion FSM 2001-6 (horizontal axis, 0.0 to 0.6), with each of the 144 LAs plotted and labelled by name]
LA's with highest segregation (not including estimates less than 2*SE)
LA | Variance | D equiv Index | Median prop FSM 2001-6 | Select | Prop LA control
Buckinghamshire | 2.12 | 0.46 | 0.03 | Select | 0.77
Southend-on-Sea | 1.92 | 0.45 | 0.09 | Select | 0.21
Slough | 1.76 | 0.43 | 0.11 | Select | 0.37
Trafford | 1.75 | 0.43 | 0.08 | Select | 0.40
Oldham | 1.72 | 0.43 | 0.18 | Non | 0.75
Calderdale | 1.59 | 0.42 | 0.12 | Select | 0.32
Sutton | 1.50 | 0.41 | 0.05 | Select | 0.39
Telford & Wrekin | 1.46 | 0.40 | 0.15 | Select | 0.53
Solihull | 1.42 | 0.40 | 0.08 | Non | 0.85
Barnet | 1.42 | 0.40 | 0.16 | Select | 0.41
Knowsley | 1.38 | 0.40 | 0.34 | Non | 0.67
Wirral | 1.38 | 0.40 | 0.18 | Select | 0.74
Milton Keynes | 1.36 | 0.39 | 0.12 | Non | 0.43
Croydon | 1.30 | 0.39 | 0.16 | Non | 0.31
Stockton-on-Tees | 1.29 | 0.39 | 0.16 | Non | 0.69
Take home message • Heterogeneity is a norm, not an aberration or error • Traditional approach: assuming a constant variance that can be summarized in one parameter is overly simplistic and potentially misleading • Multilevel approach: anticipates and allows for modeling of the structure of variation not accounted for by the fixed part, in simple and complex ways 188 Significance testing • Individual fixed part and random coefficients at each level • Overall model • Simultaneous and complex comparisons 189 Standard procedure • H0 Null hypothesis of no difference • H1 Alternate hypothesis of difference • Calculate statistic (depends on the nature of the problem) • Compare with the tabulated statistic based on H0 • Calculated statistic should be greater than the tabulated statistic at a certain level (often the 0.05 level) and with given degrees of freedom to be
‘significant’ • Answers question: What are the chances of getting observed statistic if H0 is true? 190 Individual coefficients • Akin to t tests in regression models • Either specific fixed effect or specific variance-covariance component – – H0: H1: 1 1 is 0 is not 0 – – H0: H1: u20 u20 is 0 is not 0 • Procedure: Divide coefficient by their standard error – Judge against a z distribution – If ratio exceeds 1.96 then significant at 0.05 level • Approximate procedure; asymptotic test, small sample properties not well-known. • OK for fixed part coefficients but not for random (typically small numbers; variance distribution is likely to have + skew) 191 Overall model • Akin to F tests in regression models, i.e., is a more complex model a significantly model better fit to the data • Model 1: Null two-level random intercepts model – Model 2: As model 1 + two additional fixed parameters – Model 3: As model 2 + quadratic variance function at level 2 • Procedure: – Calculate deviance (function of the actual and fitted values, which equals -2*logarithm of the likelihood) – Calculate the difference in the deviance • Deviance of Model 3 – Deviance of Model 2 • Deviance of Model 2 – Deviance of Model 1 • Under H0 of no improvement in goodness of fit, difference in deviance is distributed as chi-square with df = to number of extra parameters. – Does result of Model 2 – Model 1 exceed chi square with df = ? – Does result of Model 3 – Model 2 exceed chi square with df=? 192 Simultaneous/complex comparisons • Example: Testing H0: 2 – 3 = 0 AND 3 = 5 • H0: [C][] [k] • [C] is the contrast matrix (p by q) specifying the nature of hypothesis: p is number of simultaneous tests; q is number of parameters in that part of the model. 1 if parameter involved -1 if involved as a difference 0 not involved otherwise • [] is a vector of parameters (fixed or random); q • [k] is a vector of values that the parameters are contrasted against (usually the null); p 193 Simultaneous/complex comparisons • Example in contrast matrix terms – q = 4 (intercept and 3 slope terms) – p = 2 (2 sets of tests) [C] [] [k] 0 0 0 1 -1 0 0 0 1 1 * 2 = 0 5 3 • Overall test against chi square with p degrees of freedom • Output – Result of the contrast – Chi-square statistic for each test separately – Chi-square statistic for overall test; all contrasts simultaneously 194 Laboratory 6 • Return to the jsp data and draw graphs of level-2 and level-1 variance functions;confidence bounds of the population and VPC for • • • • Model 4a linear function level 1 and linear function at level 2 Model 4b quadratic function level 1, linear level 2 Model 4c quadratic function level 2, linear level 1 Model 4d fully random, quadratic at level 1 & level 2 • Test to see if the more complex model is an improvement over the simpler using the deviance statistic; and test whether all level 2 variance terms are needed using the contrast matrix approach 195 SESSION 7 • Modeling categorical predictors • Coding categorical predictors - contrast specification • Categorical predictors in the fixed part (as main effects and Interactions) and MLwiN implementation • Categorical predictors in the random part at level 2 and level 1; contrast and separate coding • Fitting models with categorical predictors in MLwiN • Categorical response later 196 Coding: turning qualitative states into numbers House Heating Type Size-5 Price 1 Yes Terraced 1 75 2 No Terraced -2 61 3 No Semi 4 135 4 Yes Detached 3 140 • Typical approach (Contrast Coding) – Put in a constant for everybody – 
Give most common category 0 (modal centering) – Give numbers to other categories • Example – Heating: Yes = 0; No = 1 – Type: Terraced = 0; Semi = 1; Detached = 2 197 Creating indicator variables House Constant Non-heating Type Size-5 Price 1 1 0 0 1 75 2 1 1 0 -2 61 3 1 1 1 4 135 4 1 0 2 3 140 • Are we ready to model this data? • Heating is correct; 1 is different from zero – But, type is not: 2 is not 2 *1 • Solution • Replace k categories by k-1 indicators/dummies – Re-label variables 198 Creating indicator variables House Constant Non-heating Semi Detached Size-5 Price 1 1 0 0 0 1 75 2 1 1 0 0 -2 61 3 1 1 1 0 4 135 4 1 0 0 1 3 140 • Other codings CONTRASTED with constant (the base category) • Constant represents a 5 roomed terraced property with central heating (base category), • Dummy variable trap: Use 1 less category than number of states (exact multi-collinearity) 2 heating categories becomes 1 dummy contrasted with constant 3 types become 2 dummies contrasted with constant 199 Contrast coding: most common Model : full model with main effects and interactions yij 0 (Cons) 1 ( Nheat) 2 (Sem) 3 (Det) 4 ( NH * Sem) 5 ( NH * Det) • Allows simplification to additive effects - 6 means (based on a 2 by 3 combination of heating and type) can be estimated with just four terms Model : main effects yij 0 (Cons) 1 ( Nonheat) 2 (Semi) 3 (Detach) • Additive model What is average price for terrace with heating? What is average price for detached without central heating? • Asking the question in term of differences if not different, not needed in model 200 Price Possibility of complex variation between places Semi Detached 201 Quadratic function at level 2 • Same principles and procedures (include as macro equation; variance/covariance matrix) yij 0 x0ij 1x1ij (u0 j x0ij u1 j x1ij e0ij x0ij ) • Variance for base category: Var(u0 j x0ij ) x 2 u 0 0ij • Variance for contrast category: Var(u0 j x0ij u1 j x1ij ) u20 x02ij 2 u 0u1x0ij x1ij u21x12ij • Note: Variance of base category + 2 times the covariance + variance of the contrasted category 202 Task 7 The response is house-prices and there are two levels , houses and districts. Houses are classified into terrace semi’s and detached properties. Using terrace as the base and contrast coding, write down a micro equation and macro equations which allow each type of property to have differential geography over the city. Form the combined multilevel model and identify the fixed terms and the variance-covariance matrices at each level. Draw for each type of property a graph showing between district variations. For lines on the graph identify the terms forming the average prices and the district differential price. Finally write down the variance associated with each type of housing. 
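The contrast and separate coding schemes can be sketched outside MLwiN; here is a minimal pandas illustration using the four example houses from the slides, with Terraced taken as the base category. The column names are assumptions for illustration, not the MLwiN worksheet names.

```python
# A minimal pandas sketch of contrast vs separate coding for the house-type predictor,
# using the four example houses shown on the slides.
import pandas as pd

houses = pd.DataFrame({
    "heating": ["Yes", "No", "No", "Yes"],
    "type":    ["Terraced", "Terraced", "Semi", "Detached"],
    "size5":   [1, -2, 4, 3],
    "price":   [75, 61, 135, 140],
})

# Contrast coding: a constant plus k-1 dummies, base category = Terraced
# (dropping the base avoids the dummy variable trap)
contrast = pd.get_dummies(houses["type"], dtype=int).drop(columns="Terraced")
contrast.insert(0, "Cons", 1)
contrast["Non-heat"] = (houses["heating"] == "No").astype(int)

# Separate coding: all k indicators and no constant
separate = pd.get_dummies(houses["type"], dtype=int)

print(contrast)   # columns: Cons, Detached, Semi, Non-heat
print(separate)   # columns: Detached, Semi, Terraced
```

In the fixed part, the contrast-coded columns estimate differentials from the base category; the separately coded columns (next slide) estimate each category's value directly.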
203 Separate coding House District Type Size-5 Price 1 1 Terraced 1 75 2 1 Terraced -2 61 3 1 Semi 4 135 4 1 Detached 3 140 Separate coding Replace all distinct categories with indicators Re-label indicators 3 categories; therefore three dummies Type Terrace Semi Detached Terraced 1 0 0 Terraced 1 0 0 Semi 0 1 0 Detached 0 0 1 204 Use contrast code in fixed and separate code in random part at level 2 yij 0 (Cons) 1 (Size) 2 (Semi) 3 (Detach) (u1 jTerrace u2 j Semi u3 j Detach e0ijCons ) NB: no Cons at level 2 Big Advantage: the between district variance is given directly Var(u2 j Semi) 2 Var(u3 j Detach) u3 Var(u1 jTerrace) 2 u1 2 u2 Big advantage: the correlation between different maps of effects for each type of house given directly; not as correlation between a base and a differential Big disadvantage: model not reducible: no base; not easily extendible; more than 1 set of categories requires combinations of categories (see later) 205 Chapter 7 Modelling categorical predictors Create dummy/indicator variables 206 Categorical predictors varying at level-1 • • • • House Heating Type Size-5 Price 1 Yes Terraced 1 75 2 No Terraced -2 61 3 No Semi 4 135 4 Yes Detached 3 140 Separate coding Replace all distinct categories with indicators Re-label indicators DO NOT use a constant at all: 6 dummies, no Base T-H T-NH S-H S-NH D-H D-NH 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 207 Separate coding Model for fixed effects yij 1TH 2TNH 3SH 4 SNH 5 DH 6 DNH • Price for non-heated detached? • Results given directly not by differentials • Matter of convenience, BUT • Separate NOT reducible to a simpler form 208 In multilevel models • Contrast coding in the fixed part and higher-level (level-2) random part • Separate coding in level-1 random part • Micro model yij 0 j (Cons) 1 j ( NH ) e1ij ( NH ) e2ij (H ) e1ij e21 Var ~ N (0, ) 2 e2ij 0 e2 • Interpret • Can only estimate variances, no ‘covariance’; can only belong to one category not both simultaneously 209 Summary – Separate/contrast coding House Heating Cons (x0) Non-heat (x1) Heat (x2) 1 Yes 1 0 1 2 No 1 1 0 3 No 1 1 0 4 Yes 1 0 1 5 Yes 1 0 1 Contrast • Fixed part: contrast coded Separate 0 x0ij 1x1ij 210 Contrast coding at level-2 • ‘Quadratic’ at level-2 Cons( x0 ) Non Heat( x1 ) u0 j 2 u0 ~ N ,0 Cons( x0 ) u 1 j 2 Non Heat( x ) 1 u 0 u1 u1 • Between-district variance for heated properties 2 u0 • Between-district variance for non-heated properties 2 u 0u1 2 u0 2 u1 211 EITHER Separate coding at level-1 Non heat( x1 ) Heat( x2 ) e1ij 2 e1 ~ N ,0 Non Heat( x1 ) e2ij 2 Heat( x ) 0 2 e2 • Between-house variance for heated properties 2 e2 • Between-house variance for non-heated properties 2 e1 212 OR Contrast coding at level-1 Cons( x0 ) Non Heat( x1 ) e0ij 2 e0 ~ N ,0 Cons( x0 ) e 1ij Non Heat( x ) 0 1 e 0 e1 • Between-house variance for heated properties e20 • Between-house variance for non-heated properties e20 2 e0e1 213 Summary of level -1 heterogeneity with categorical level-1 predictors • At level 1 only have information to estimate two random terms; not all 3 of full quadratic; therefore • Either model directly using separate coding and variance for each category; ie two+ variances; • Or model indirectly using contrast coding and base variance and covariance; variance of contrasted category derived as var(base) + 2* Covar(base, contrast) • NOT variance of base and variance of contrasted category as then forcing base to have smaller variance 214 Model 10 Categorical predictors at level-1 (contrast coding) QUESTION 6 Why are there zeros at 
level-1? 215 Laboratory 7 • In MLwiN open sample worksheet, choosing the tutorial data, this is a new and larger dataset for secondary school children. • The response is Normexam exam score aged 16, and an individual categorical predictor is the categorical variable Vrband of the pupil 5 years earlier. The exam score has been formed by summing a score for each examination that is sat at age 16. This is typically skewed. It has been replaced by a normal score. You can do this by generating a random normal vector of the same length as the number of pupils; sorting the score and sorting the generated values, replacing one by the other. Fit the following models 216 Laboratory 7 (continued) • A: Setup a random-intercepts model with Normexam as response, and Cons and VRband in the fixed part. Level 2 is school and level 1 is student. Choose VB3, the lowest achievement band as the base referent category. Use customised predictions to plot the main effects of Vrband and the predictions window to draw a varying relations plot. Are the school differences remaining after taking account of prior ability at entry? • B: Develop model A to answer the following questions. Are there different school effects for pupils of different initial prior ability? Are children of different prior ability more or less variable in their progress? That is allow the effects of VRB2 and VRB1 to be random at level 2 and level 1; that is use contrast coding as Vrb3 is already there. Estimate the model. Why are some of the level-1 variances zero? Plot the variance functions, and draw up the varying relations plot for each VRband separately. For which type of pupil is the greatest between school differences? Are different types of pupil differentially variable in their progress? 217 Laboratory 7 (continued) • C: This model addresses the same questions as Model B, but this time we will use separate coding. The usefulness of this approach is that we can assess the extent to which a school performs well for one category, also performs well for other categories. Click on any of the fixed VRband terms and delete all two of these terms to leave a null random intercepts model. Click on the Cons term and tick off the level 1 and 2 random terms and the fixed term. There should be no terms in the fixed and random part of the model . Add the VRband term choosing None for the base reference category. Click on each of 3 separately coded dummies and tick on the level 1 and level 2 random terms. Estimate this model. Which of the level-1 random terms are zero, why is this? Plot the variance functions at level 1 and level 2; do they differ from the results for model B. Produce pairwise plots of these school differentials (level 2 residuals) and examine (via the estimates table) the correlations between the level 2 school differentials; how similar are schools in their performance for pupils of different VRband abilities?. 
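A small sketch of how the between-school variance for each category is recovered when the level-2 random part is contrast coded, as in Lab 7 model B; the covariance matrix below is a made-up placeholder, not an estimate from the tutorial data.

```python
# A minimal sketch: between-school variance per category from a contrast-coded
# level-2 random part (base + contrasted dummies); numbers are placeholders.
import numpy as np

Omega_u = np.array([
    [0.20, 0.05, 0.03],   # var(u_cons), covariances with u_vrb2, u_vrb1
    [0.05, 0.10, 0.02],
    [0.03, 0.02, 0.08],
])

loadings = {
    "vrb3 (base)": np.array([1, 0, 0]),   # base category loads on Cons only
    "vrb2":        np.array([1, 1, 0]),   # Cons + VRB2 contrast
    "vrb1":        np.array([1, 0, 1]),   # Cons + VRB1 contrast
}

for cat, z in loadings.items():
    print(cat, float(z @ Omega_u @ z))
# e.g. vrb2: var(base) + 2*cov(base, vrb2) + var(vrb2) = 0.20 + 0.10 + 0.10 = 0.40
```

With separate coding (model C) each category's between-school variance is a parameter of the model and can be read off directly.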
218 Session 8 • Concept of higher-level predictors • Contextual effect and cross-level interaction • Specifying higher-level predictors • Fitting models in MLwiN 219 Higher-level variables • So far all predictors have been level 1 (Math3, boy/girl); (size, type of property) • Now higher level predictors (contextual, ecological) - global occurs only at the higher level; -aggregate based on summarising a level 1 attribute • Example: pupils in classes progress affected by previous score (L1); class average score (A:L2); class homogeneity (SD, A:L2); teaching style (G:L2) • NOW: trying to account for between school differences 220 Typology of contextual/ecologic measures Typology Derived, Aggregate, Collective Description Aggregate of attributes measured at the individual (or lower) level. Examples Mean/Median income Proportion reporting mistrust Expressed as a measure of central tendency (e.g. mean, median), or measures of variation of individual-level variables (e.g. standard deviation), but caution when sample-based. Gini coefficient for inequality Individual analogue can often be measured. Contagion, Peer Integral, Global Aggregate of the individual-level outcome, rather than predictor, that in turn predicts the individual name-sake of the same variable. Proportion smokers Individual analogue is the outcome. Infectious disease prevalence Measure attributes of groups, organizations, or areas. Systematic Social Observation Occurs uniquely at the higher level. Policy variables Individual analogue cannot be measured, thus, irreducible. Environmental, Physical attributes of an area Suicide rate 221 Main and cross-level relationships: a graphical typology: voting example Yij Propensity for left vote The individual and the ecological - 1 Low SES: X1ij = 1 High SES: X1ij = 0 % Working class X2j 222 The individual and the ecological - 2 Propensity for left vote Low SES High SES % Working class 223 The individual and the ecological - 3 consensual Propensity for left vote Low SES High SES % Working class 224 A graphical typology of cross-level interactions (A to F) (Jones & Duncan 1993) Individual Consensual Ecological Reactive Reactive for W; Consensual for M Non-linear crosslevel interactions 225 Main and cross-level relationships: Housing example: symbols & variables Yij House price Detached properties: X1ij = 1 Environmental quality X2j Of district Other housing types: X1ij = 0 226 Data matrix: for main ecologic effect District j House i Price Yij Detached X1ij EnvQual X2j 1 1 75 1 10 1 2 61 0 10 1 3 135 0 10 1 4 140 1 10 1 5 125 0 10 . . . . . . . . . . . . . . . 
50 1 75 1 80 50 2 61 0 80 50 3 135 0 80 50 4 140 1 80 Higherlevel variable 227 Micro Model (Within-district, between-house) Price of a house in a district = Price of non-detached properties in each district + Price DIFFERENTIAL for detached properties + Residual differential for each house in a district 228 Macro Model (Between-district) Price of properties in each district = Average price of properties (across all districts) + Effect of environmental quality for each district + Residual differential on the price of properties for each district 229 Combined model Price of a house in a district = Average price of non-detached properties + Average DIFFERENTIAL for detached properties + Change in price for a unit change in DISTRICT environmental quality + Residual differential on the price of properties for each district + Residual differential for each house in a district 230 Specifying ecologic predictors – Main effect allows fitting of types A to D Micro model yij 0 j x0ij 1x1ij e0ij x0ij Macro model 0 j 0 2 x2 j u0 j Effect for ENVQ Overall model yij 0 x0ij 1x1ij 2 x2 j x0ij (u0 j x0ij e0ij x0ij ) [u ] ~ N,0( 0j 2 u0 ) [e ] ~ N,0( 0ij 2 e0 ) Note: Enters the macro equation, but as a fixed parameter 231 Data matrix: for cross-level ecological effect (Graphs A to F) District House Price Detached EnvQual X2j Detach* EnvQual X2jX1ij 1 1 75 1 10 10 1 2 61 0 10 0 1 3 135 0 10 0 1 4 140 1 10 10 1 5 125 0 10 0 . . . . . . . . . . . . . . . . . . 50 1 75 1 80 80 50 2 61 0 80 0 50 3 135 0 80 0 50 4 140 1 80 80 232 Micro Model (Within-district, between-house) Price of a house in a district = Price of non-detached properties in each district + Price DIFFERENTIAL detached properties in each district + Residual differential for each house in a district 233 Macro Models (Between-district) - 1 Price of non-detached properties in each district = Average price of non-detached properties (across all districts) + Effect of environmental quality for each district + Residual differential on the price of non-detached properties for each district 234 Macro Models (Between-district) - 2 Price DIFFERENTIAL for detached properties in each district = Average price differential for detched properties (across all districts) + Effect of environmental quality for each district + Residual differential on the price differential of detached properties for each district 235 Combined model Price of a house in a district = Average price of non-detached properties + Average DIFFERENTIAL for detached properties + Change in price for a unit change in DISTRICT environmental quality for non-detached properties + Differential change in price for a unit change in DISTRICT environmental quality for detached properties + Residual differential on the price of detached properties for each district + Residual differential on the price differential of detached properties for each district + Residual differential for each house in a district 236 Specifying ecologic predictors – Cross-level interaction effect allows fitting of types A to F yij 0 j x0ij 1 j x1ij e0ij x0ij Micro model Main effect Macro model 0 j 0 2 x2 j u0 j X-level effect 1 j 1 3 x2 j u1 j Overall model yij 0 x0ij 1 x1ij 2 x2 j x0ij 3 x2 j x1ij (u0 j x0ij u1 j x1ij e0ij x0ij ) ( x0 ) ( x1 ) u0 j 2 ~ N ,0 ( x0 ) u 0 u1 j 2 (x ) u 0 u1 u1 1 [e ] ~ N,0( 0ij 2 e0 ) Note: Combined makes it clear that it is a cross-level interaction 237 Cross-level interaction equation in MLwiN 238 Non-linear macro models allows fitting of types A to H Polynomial for ENVQ Macro model(s) 0 j 0 2 
x2 j 3 x22 j u0 j 1 j 1 4 x2 j x u1 j 2 5 2j Overall model yij 0 x0ij 1 x1ij 2 x2 j x0ij 3 x22 j x0ij 4 x2 j x1ij 5 x22 j x1ij (u0 j x0ij u1 j x1ij e0ij x0ij ) 239 Cross-level interactions with continuous variables • So far typology, equation and examples for continuous w and categorical x • Now for continuous w and x • No real changes except that needs a surface plot or a contour plot to show the results City-wide cost of an 5 room house when ENVq = 0 Effect for ENV quality yij 0 x0ij 1x1ij 2 x2 j x0ij 3 x2 j x1ij Cost of an extra room Effect for EVQ * Size 240 Cross-level interactions: continuous variables Row 1 2 3 Price district house room-5 85.0 1 1 -2 50.0 1 2 -4 102.5 1 3 -1 ENVq ENVQ* Rooms 5 -10 5 -20 5 -5 10 11 190.0 83.0 1 2 10 1 4 -2 5 7 20 -14 21 103.5 3 1 -1 7 -7 yij 100x0ij 10* x1ij 2 * x2 j x0ij 1.5 * x2 j x1ij 241 Multilevel data manipulations in MLwiN • Creates aggregate variables for each block Average Calculates the mean of the input column(s) values Count Calculates the number of level 1 units of the input column Stand Dev Form the standard deviation (divisor n-1) of the input column(s) values Beware Measurement error of SAMPLE-BASED aggregate variables 242 Multilevel data manipulations: school average attainment and heterogeneity on entrance 243 EXAMPLE • • • - STRUCTURE: 2275 voters in 218 constituencies, 1992 RESPONSE: vote Labour not Conservative PREDICTORS: Level individual: age, sex, education, tenure, income 1 : 8-fold classification of class - constituency:% Local authority renters 2 % Employers and managers;100 - % Unemployed • MODEL: cross-level interactions between INDIVIDUAL&CONSTITUENCY characteristics Fixed part main effects: 8 fold division of class Random part at level 2: 2 fold division of class Working class: Non-working class: unskilled and skilled manual, foreman public and private-sector salariat, routine nonmanual, petty-bourgeoisie, ‘unstated’ 244 Note different terms in the fixed and random Fixed: all 8 fractions of class (contrast coded) Random: two groupings of class (separate coding) yij 0 x0ij 1Class2ij 2Class3ij ...... 7Class8ij (u9 j Class1 3ij u10 j Class4 8ij e9ij x0ij e10ij x1ij ) (Class1 3) (Class4 8) u 9j 2 u9 ~ N ,0 (Class1 3) u 10 j 2 (Class4 8) u 9 u10 u10 245 Cross-level interactions 246 Conclusions 1 • Specifying ecologic measures within a multilevel statistical model makes it clear that Ecological variables - account for between area differences NOT level 1 outcomes - account for place differences AFTER allowing for within place composition • Single-level model under-estimate the relative importance of ecological variables; moreover multilevel models result in correct SE’s for higherlevel fixed terms……………….. 247 Conclusions 2 Random variability between-areas (the response in the macro models) is composed of three parts: – TRUE parameter variance: implying that area differences can be accounted by observable/measurable ecological variables. 
– UNOBSERVED heterogeneity due to in-numerable small-scale factors that are hard to measure; unexplainable – SAMPLE variance: a consequence of working with a sample of areas; unexplainable • Single level model presume that higher-level has all been accounted for 248 Two-level hierarchical model Micro model : continuous predictor and response yij 0 j x0ij 1 j x1ij (e0ij x0ij e1ij x1ij ) e1ij e21 Var ~ N (0, ) 2 e2ij e1e 2 e 2 Macro models 0 j 0 2 x2 j u0 j 1 j 1 3 x2 j u1 j Combined multilevel model ( x0 ) ( x1 ) u0 j 2 ~ N ,0 ( x0 ) u 0 u 2 (x ) 1 j u 0 u1 u1 1 yij 0 x0ij 1 x1ij 2 x2 j x0ij 3 x2 j x1ij (u0 j x0ij u1 j x1ij e0ij x0ij e1ij x1ij ) Level 2 variance Level 1 variance x 2 u 0u1x0ij x1ij x x 2 e0e1x0ij x1ij x 2 u0 2 e0 2 0ij 2 0ij 2 u1 2 e1 2 1ij 2 1ij 249 Task 8: 3 level model of London Schools There are 3 levels to this model; schools(3) are repeatedly measured over 3 years as cohorts (2) pass through by the performance of pupils (1) in public examinations. The response is the score of a pupil aged 16 in public examinations; the variables titles are reasonably selfexplanatory with the exception of VR1, and VR2. Pupils are tested for ability (Verbal reasoning test) at age 11 prior to entry to the school. VR1 is the highest 25%; VR2 is the middle 50%. • Interpret the model; begin by working out who is the constant? • How would you improve the model?. Source: Nuttall, D. L., Goldstein, H., Prosser, R., and Rasbash, J. (1989). Differential school effectiveness. International Journal of Educational Research, 13: 769-776. 250 Repeated cross-sectional design • Structure of Example:(Nuttall et al’s analysis of changing school performance in London) – Level-3 is the school, level-2 is the year and level-1 is the individual. Level-2 represents repeated measurements on the school. – Imbalance: – particular schools not included in particular years (level 2); – different number of pupils in each cohort in each school (level 1) – even if school is only measured once do NOT discard it, include it! • Research Question – modeling changing school performance as different cohorts of pupils pass through the schools 251 3-level Random Intercept Model y ijk 0 jk 1x ijk eijk 0 jk 0 u0 jk v 0k Note ijk subscripts Note 2 higher-level random effects i level 1 (e.g. pupil), j level 2 (e.g. cohort) , school) k level 3 (e.g. Variance between pupils within cohorts within schools = Var (eijk) = e2 Variance between cohorts within schools = Var (u0jk) = u20 Variance between schools = Var (v0k) = v20 Variance between cohorts = u20 + v20 Random effects at different levels assumed to be uncorrelated. 
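A minimal simulation sketch of the 3-level random-intercepts structure just specified; the variance values echo the London schools results reported below (between schools 2.9, between years 1.2, between pupils 96.7), and the cohort and pupil counts are illustrative.

```python
# A minimal simulation sketch of the 3-level random-intercepts structure
# (pupils i, in cohorts j, in schools k); sample sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(2011)
n_school, n_cohort, n_pupil = 140, 3, 75
sig2_v, sig2_u, sig2_e = 2.9, 1.2, 96.7        # school, cohort, pupil variances

v = rng.normal(0, np.sqrt(sig2_v), n_school)                        # v_0k
u = rng.normal(0, np.sqrt(sig2_u), (n_school, n_cohort))            # u_0jk
e = rng.normal(0, np.sqrt(sig2_e), (n_school, n_cohort, n_pupil))   # e_ijk
y = 17.8 + v[:, None, None] + u[:, :, None] + e                     # beta_0 + random effects

# The three uncorrelated components add up: total variance ~ sig2_v + sig2_u + sig2_e
print(round(float(y.var()), 1), sig2_v + sig2_u + sig2_e)
```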
252 Variance Partition Coefficients for 3-level random intercepts model Proportion variance due to differences between cohorts u20 v20 = intra-cohort correlation = 2 u 0 v20 e2 Proportion variance due to differences between schools = intra-school correlation = u20 v20 v20 e2 Proportion between-cohorts variance due to differences between v20 cohorts within schools = u20 v20 253 Correlation structure of 3 level model S Intra-class correlation (within same school & same class) 1 Intra-school correlation (within same school, different class) 2 1 1 1 2 2 2 2 3 3 3 1 1 2 1 1 2 2 1 1 2 P 1 2 3 1 2 3 4 1 2 3 C 1 1 1 1 1 2 0 0 0 0 0 0 0 1 1 2 1 1 2 0 0 0 0 0 0 0 1 2 3 2 2 1 0 0 0 0 0 0 0 2 1 1 0 0 0 1 1 2 2 0 0 0 2 1 2 0 0 0 1 1 2 2 0 0 0 2 2 3 0 0 0 2 2 1 1 0 0 0 2 2 4 0 0 0 2 2 1 1 0 0 0 3 1 1 0 0 0 0 0 0 0 1 1 2 3 1 2 0 0 0 0 0 0 0 1 1 2 3 2 3 0 0 0 0 0 0 0 2 2 1 254 Task 8 continued Response: ILEA score on pupils leaving aged 16 Lev 1: 31,623 pupils Lev 2: 3 years (1984, 5, 6) Lev 3:140 schools FIXED EFFECTS (ratio of estimate to standard error) Level 1: Pupil Intercept Girl VR1 (highest ability at 11 years) VR2 (middle ability at 11years) African 4.0(8.0) Bangladeshi 4.7(6.7) Greek 4.6(6.6) Pakistani 6.0(10.0) Turkish 3.7(9.3) 17.8 2.5(12.5) 19.0(63.3) 8.2(41.0) Arab Caribbean Indian SE Asian Other 4.4(4.0) -0.4(1.9) 7.3(14.6) 8.3(13.8) 3.8(9.5) 255 Task 8 continued Fixed effects Level 2: Year year 1.4(7.0) Level 3: School Boys Girls Church of England Roman Catholic Prop. Free school meals Prop. Squared of fsm 0.8(2.7) single sex school 1.4(4.7) single sex school 1.2(3.0) denominational school 2.4(8.0) denominational school -0.41(10.3) 0.003(7.5) 256 Task 8 continued RANDOM EFFECTS (correlations in brackets) Level 1: Between pupils 96.7 Level 2: Between years 1.2 Level 3: Between schools: estimated var-cov matrix Intercept VR1 VR2 Sex Carib Year Intercept 2.9 VR1 -1.9 (0.0) 17.4 VR2 -0.1 (0.0) 6.1 (0.9) 2.8 Sex -1.5 (-0.6) 2.4 (0.4) 0.6 (0.2) 2.1 Carib -0.4 (-0.2) -1.8(-0.4) -0.5 (-0.3) Year (0.1) 0.5 257 Task 8 continued • Interpretation Who is the Intercept? Including VR1, VR2 for age 11 in the model, means that we are modelling ? Fixed pupil effects? Fixed Year effects? Fixed school effects? Level-1 random effects? Level-2 random effects? Level-3 random effects? • Critique 258 Chapter 8 Higher-level variables Model 10 Complex level-1, level-2 (with size, type) 259 Lab 8 Using higher level variables • In MLwiN open sample worksheet, choosing the tutorial data • The response is Normexam exam score aged 16, and the individual predictor is the categorical variable Vrband of the pupil 5 years earlier. The higher level variable is avslrt, the average reading ability in the school for pupils on entry aged 11. Then fit the following models 260 Lab 8 continued • A: Setup a base random-intercepts model with Normexam as response, and Cons and VRband in the fixed part. Level 2 is school and level 1 is student. Choose VB3, the lowest achievement band as the base referent category. Use customised predictions to plot the main effects of Vrband and the predictions window to draw a varying relations plot. Are the school differences remaining after taking account of prior ability at entry? (same as 7A) • B: Does the prior average ability of entry in a school have an effect on subsequent achievement that is in addition to the pupil’s own prior ability? That is include centred school Avslrt (around its grand mean) in addition to individual VRband as predictor variables. 
Draw a plot of the main effects with Avslrt on the horizontal axis, groups of lines determined by VRband and the vertical axis is the predicted Normexam based on the fixed part of the model. Are there significant peer group effects? In comparison to Model A, has school average ability accounted for some of the unexplained between school variance. Has the level 1 variance reduced with the inclusion of Avslrt, why or why not? 261 Lab 8 continued • C: Setup a new base model to answer the following questions. Are there different school effects for pupils of different initial prior ability? Are children of different prior ability more or less variable in their progress? First remove the fixed effect of Avslrt by deleting the term (back to Model A). Then allow the effects of VRB2 and VRB1 to be random at level 2 and level 1; that is use contrast coding as Vrb3 is already there. Estimate the model. Why are some of the level-1 variances zero? Plot the variance functions, and draw up the varying relations plot for each VRband separately. For which type of pupil is the greatest between school differences? This model does not use higher level predictor variables but sets up a base for the next model, and is needed if we are going to fit cross-level interactions same as Model 7b) • D: Does the prior average ability of entry in a school have an effect on subsequent achievement that is in addition to the pupil’s own prior ability and is this effect differential for pupils of different prior ability? Staring with Model C, include centred school Avslrt (around its grand mean) and then add 1st order cross-level interactions between Vrband 262 Lab 8 continued • (reference category is VR3) and centred Avslrt, in addition to individual VRband as predictor variables. Draw a plot of the main effects with Avslrt on the horizontal axis, groups of lines determined by VRband and the vertical axis is the predicted Normexam based on the fixed part of the model. . Are the significant peer group effects? Which type of pupil is most affected by the peers? In comparison to Model C, has school average ability accounted for some of the unexplained between school variance? • You could do other analyses with this data, that require higher-level variables and cross-level interactions eg • E: does school heterogeneity in terms of LRT have an effect on progress in addition to AvsLrt, and is this differential for pupils of different ability. 263 Session 9 Estimates and Estimation A comparison of Fixed and Random specification estimates - Anova versus ML random intercepts - Ancova versus ML random slopes and intercepts - empirical and algebraic comparisons - borrowing strength; shrinkage IGLS/RIGLS estimation Markov Chain Monte Carlo estimation - Frequentist approach - Bayesian approach - What does MCMC do? - IGLS versus MCMC - MCMC in practice in MLwiN - DIC MCMC is now the preferred form of estimation for realistically complex models 264 Properties of ML estimates: One picture two specifications 1a 1b 2 Select if; and fit a separate line to each district ANOVA; ie include a dummy for each district; fixed part expansion Fit multilevel model: random part expansion Comparison: Target of inference OLS Specific districts Assumption: Df consumed in fit Separate places 1 for each district, grows! 
ML Between-district differences Exchangeability Grand mean + Variance(s) 265 Two specifications in MLwiN Districts as Random effects Districts treated as Categorical and added to Model as Fixed effects 266 Data for London mortgages (Jones and Bullen, 1994) • 33 London districts 5% sample of Q1 1990 (Jan-March) • • • • ML estimates Grand mean: Between district: Within districts between houses £84.27k 284.1 (117.9) 3112 • Comparison of OLS/ANOVA: separate estimation • ML: assumes come from a normal distribution with a variance of 284.1; results in precision-weighted estimation; = “borrowing strength” & shrinkage 267 District City Barking Barnet Bexley Brent Bromley Camden Croydon Ealing Enfield Greenwich Hackney Hammersmith Haringey Harrow Havering Hillingdon Hounslow Islington Kensington Kingston Lambeth Lewisham Merton Newham Redbridge Richmond Southwark Sutton Twr Hamlets WForest Wandsworth Westminster nj 3 6 31 6 32 15 17 18 23 34 26 9 16 12 22 4 27 26 22 12 8 18 38 10 28 11 26 9 4 13 28 39 16 wj 0.21 0.35 0.73 0.35 0.74 0.57 0.60 0.62 0.67 0.75 0.70 0.45 0.59 0.52 0.66 0.26 0.71 0.70 0.66 0.52 0.42 0.62 0.77 0.47 0.71 0.50 0.70 0.45 0.26 0.54 0.71 0.78 0.59 OLS 79 39 93 84 73 97 76 72 78 80 58 54 103 76 81 59 100 99 90 124 91 81 66 89 63 85 130 66 207 42 75 91 109 ML 83 68 91 84 76 92 79 76 80 81 66 70 95 80 82 77 96 94 88 105 87 82 70 87 69 85 117 76 117 61 78 89 99 Diff -4 -29 2 0 -3 5 -3 -4 -2 -1 -8 -16 8 -4 -1 -18 4 5 2 19 4 -1 -4 2 -6 0 13 -10 90 -19 -3 2 10 Rank 17.5 2 26 32.6 22 13.5 22 17.5 26 30 10.5 6 10.5 17.5 30 5 17.5 13.5 26 3.5 17.5 30 17.5 26 12 32.5 7 8.5 1 3.5 22 26 8.5 268 One picture two specifications 1a 1b 2 OLS Select if; and fit a separate line to each district ANCOVA; ie include a dummy for each district and dummy interaction with size; Fit multilevel model: random part expansion = ANCOVA = separate lines : overall model?? 33 place = 66 parameters (separate slope and intercept) Unstable estimation - few houses in district - little variation in predictor (all 5 roomed houses) 269 Two specifications in MLwiN Districts as Random effects Districts treated as Categorical and District-Size interactions: all as Fixed effects 270 District OLS Mean City 81.9 Barking 38.7 Barnet 84.7 Bexley 79.1 Brent 73.4 Bromley 79.1 Camden 79.5 Croydon 72.7 Ealing 76.9 Enfield 80.1 Greenwich 58.0 Hackney 58.0 Hamrsmith 113. Haringey 83.5 Harrow 81.4 Havering 63.7 Hillingdon 86.8 Hounslow 92.9 Islington 89.5 Kensington 138.0 Kingston 90.0 Lambeth 81.2 Lewisham 65.5 Merton 84.4 Newham 64.7 Redbridge 70.1 Richmond 110.0 Southwark 29.4 Sutton -203.6 T.Hamlets 39.8 W Forest 72.5 Wandswrth 92.7 Westmnser 75.6 ML Diff Rnk 80.7 65.2 82.1 80.5 74.6 87.8 76.4 74.8 74.7 83.8 65.2 73.5 100. 79.3 82.5 69.8 94.7 94.4 83.2 107. 83.3 78.9 70.8 82.8 70.8 79.0 104. 63.1 99.6 64.7 76.2 86.8 62.6 1.2 -26. 2.65 -1.3 -1.1 -8.6 3.12 -2.1 2.12 -3.6 -7.1 -15. 12.4 4.18 -1.1 -6.1 -7.9 -1.4 6.27 30.5 6.7 2.29 -5.3 1.6 -6.0 -8.8 5.41 -33 -302 -24. -3.6 5.98 13. 31 4 24 30 32 10 23 26 27 22 12 6 8 20 33 15 11 29 14 3 13 25 19 28 16 9 18 2 1 5 21 17 7 OLS Slope 14.3 -8.4 13.6 15.6 7.69 29.5 5.07 7.77 3.66 23.9 1.63 6.88 37.2 10.2 18.2 -21.0 41.6 34.3 12.7 40.1 14.2 9.91 8.59 16.8 8.00 18.1 39.7 -27.0 150. -2.6 12.9 13.9 -27. ML Diff Rank 14.4 0.24 14.3 14.4 8.15 25.2 7.27 8.5 6.83 21.4 0.96 9.19 33.5 11.3 17.3 2.73 35.2 31.3 13.9 37.5 16.2 11.2 7.9 16.5 7.26 14.0 38.9 -5.3 35.9 2.49 11.6 15.4 -17. -0.1 -8.7 -0.7 1.26 -0.4 4.23 -2.1 -0.7 -3.1 2.48 0.67 -2.3 3.7 -1.1 0.88 -24. 
6.45 2.94 -1.1 2.56 -2.0 -1.3 0.69 0.32 0.73 4.07 0.84 -22. 114. -5.1 1.27 -1.5 -9.8 33 5 26 21 31 8 16 28 11 14 30 15 10 23 24 2 6 12 22 13 17 19 29 32 27 9 25 3 1 7 20 18 4 271 Borrowing strength: Sutton Sutton 4 houses 3 with 7 rooms £155,255,128 (average= £180k) 1 with 8 rooms 330K cost of extra room = £150k back project to average size (4.5 rooms) = -203.6 Unstable estimation (few houses and little variation in size) ML line: combination of OLS line shrunk to grand mean line 272 Relating ML to OLS estimates algebraically: Null model I OLS/ ANOVA Y Y * i1 01 . Y Y ML i1 01 * . i1 i1 . D D i , 33 * i , 33 0 , 33 * ij * 1 01 02 ... 2 x Y x ( ~ N ( 0 , ) Y ij 0j ij 0 0j 0 0 * 0 , 33 D 33 ij ij 0j 0 0j ij ) 2 0j o 273 Relating ML to OLS estimates algebraically: Null model II * (1) 0 w j / w j 0j ML grand mean is weighted average of all OLS intercepts (weighted to emphasise reliability) 0 j w j * 0j (1 w j ) 0 (2) ML intercept is weighted combination of OLS intercept and overall grand Mean 2 2 2 /( (3) w j o j uo ) weights are function of between-place variation and sampling variance within a place 2 j 2 o /nj (4) sampling variance of OLS intercept depends on level 1 variance and number of houses 274 Relating ML to OLS estimates algebraically: Null model III Consequently If large differences between districts and large number of house in a district; sampling variance is small, reliable estimates; weight approaches 1 & equation 2 becomes 0 j w j * 0j (1 w j ) 1 * 0 * 0j (1 1) 0 * 0j ML intercept = OLS estimate If small differences between districts and small number of houses in a district; sampling variance is large, imprecise estimate; weight approaches 0 then equation 2 becomes 0 j w j * 0j (1 w j ) 0 * 0 * 0j (1 0 ) 0 0 ML intercept = Grand mean estimate 275 Relating ML to OLS estimates algebraically for one district 2 0 = £84k; 2 = 284.1 ML estimates = 3112 The city (j=1) n w City /( o 2 =3 City 2 City 2 City 2 o 2 uo / n City ) = 3112/3 = 284.1/ (3112/3 + 28.41) = 0.21 0 j w = (0.21 *79) +((1-0.21) * 84) = 83 City j * 0j (1 w j ) 0 79 shrunken towards 84 becomes 83 “Pooled” estimation = “borrowing strength “ = precision-weighted estimation NB: fully random model borrowing strength between intercepts and slopes according to degree of covariance 276 London data: shrinkage of intercepts and slopes: Chapter 9 277 London data: the two models: Chapter 9 278 Rubin’s analysis of law schools • Aim: Predict performance (Y) in 1st year examination on basis of test before entry(X) • Data: thousands of applications to 83 law schools • Problem: restricted range of X in any one law school; uninterpretable negative slopes; “bouncing β’s” over time • Comparative test: ML and OLS calibration for one year (1974) Predict next year’s examination results (1975) Compare actual and predicted for 1975, ML 40* better than OLS • WHY: unreliable school-specific estimates shrunk towards grand slope based on wide range of X values RUBIN, D. (1980). Using empirical Bayes techniques in the law school validity studies(with discussion). J. Amer. Statist. Assoc. 75 801-827. 279 Analysis Strategies (from Session 1) IV Analysis of covariance (fixed effects model). Include dummy variables for each and every group Problems • What if number of groups very large, eg households? 
• No single parameter assesses between group differences • Cannot make inferences beyond groups in sample • Cannot include group-level predictors as all degrees of freedom at the group-level have been consumed • Target of inference: individual School versus schools • Potentially unstable estimates 280 Fixed and Random classifications
Random classification: generalization of a level (e.g., schools); Fixed classification: discrete categories of a variable (eg Gender)
Random: effects come from a distribution; Fixed: not a sample from a population, but exhaust the categories
Random: all observed schools contribute to the between-school variance; Fixed: specific categories only contribute to their respective means
Random: pooling of information across groups (exchangeability); Fixed: separate estimation, potentially imprecise estimation
281 AKA shrunken residuals (session 3)
û0j = [σ²u0 / (σ²u0 + σ²e0/nj)] · rj
The multiplier in the above formula is always less than or equal to 1, so that the estimated residual û0j is usually less in magnitude than the raw residual rj (ie the FIXED effect). The raw residual is multiplied by a shrinkage factor, and the estimated (or posterior) residual is therefore also sometimes called a shrunken residual. Greatest shrinkage towards zero when the between-district variance is relatively small and when the raw residual would be unreliably estimated: 16 minute video on multilevel residuals http://www.cmm.bristol.ac.uk/learning-training/videos/RP/Residuals/Residuals.html 282 IGLS Estimation = FIML • Maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them • Iterative scheme, each cycle: estimate fixed part, estimate random part on updated estimate of fixed part • IGLS (Goldstein, 1986): see Chapter 9 for worked example 1) OLS estimation of fixed part ignoring complex structure of random part 2) squared residuals from this fit are regressed on a set of indicators defining the structure of the random part (defined by constant, linear, quadratic structure etc) to get random parameter estimates 3) these estimates are used in a generalised least-squares fit to obtain revised estimates of the fixed part 283 RIGLS Estimation = REML • Restricted (or Residual) Maximum Likelihood modifies the maximum likelihood estimation procedures to take account of the sampling variation in the fixed effects when producing the variance estimates • Calculating a simple variance - FIML estimator: divide the sum of the squared deviations around the mean by n - REML estimator: divide by n-1, the -1 representing the degree of freedom consumed in estimating the mean • In large samples there will be very little impact of this correction on the estimated value of the variance. 284 Advantages of REML • Morrell (1998) found the following in a simulation study [1] • In comparing models with different random effects, REML Type I error rates tended to be truer than FIML rates. • REML was especially superior for small group sizes (20 or below) and for more random effects (three/four vs. fewer). • REML error rates decreased the more FIML error rates increased. • While REML is more realistic and is recommended, especially when the number of groups is small (it may additionally help convergence), we need to be careful when using likelihood tests [1] Morrell, C. H. (1998) Likelihood ratio testing of variance components in the mixed-effects models using restricted maximum likelihood. Biometrics, 54(4), 1560-1568.
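A small numeric sketch of two points from these slides: the n versus n-1 division that distinguishes the FIML and REML variance estimators, and a FIML deviance (likelihood-ratio) comparison of nested models; the deviance values are made-up placeholders.

```python
# A minimal sketch: FIML vs REML division for a simple variance, and a
# FIML-based deviance (likelihood-ratio) comparison of nested models.
import numpy as np
from scipy.stats import chi2

y = np.array([75.0, 61.0, 135.0, 140.0, 125.0])   # illustrative house prices
ss = ((y - y.mean()) ** 2).sum()
print("FIML-style variance:", ss / len(y))         # divide by n
print("REML-style variance:", ss / (len(y) - 1))   # divide by n-1

# Difference in deviance (-2*loglik) between nested fixed-effects models is
# judged against chi-square with df = number of extra parameters
dev_model1, dev_model2, extra_params = 9357.2, 9311.8, 2   # made-up values
print("p-value:", chi2.sf(dev_model1 - dev_model2, df=extra_params))
```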
REML and testing
• Two models with nested fixed effects can be compared using likelihood-based methods only if they are fitted using FIML (not REML)
• In contrast, models with the same fixed-effects structure but different random effects should be compared by fitting with REML
This suggests a twofold strategy:
1 Use FIML-based IGLS likelihood-ratio-test model comparisons to decide which fixed effects to include in the model
2 Switch to REML to decide which random effects to include and to obtain final estimates of the random effects 286

Markov Chain Monte Carlo estimation
• Frequentist approach versus Bayesian approach
• What does MCMC do?
• IGLS versus MCMC
• MCMC in practice in MLwiN
• DIC
Chapter 10 deals with Normal-theory models 287

What is a frequentist approach?
• Consider the actual sample that you have as just one out of all the samples that you could have collected
• Infer from the one sample to a hypothetical population, using statistics to guard against inferential error
• Assume that the estimate has a sampling distribution (with a particular form)
• Use the data to estimate the reliability or precision of the estimate, that is the standard error
• Use the estimate (called the point estimate, because there is only one) and the standard error to infer about the population
Example… 288

What is a frequentist approach? (continued)
The IGLS point estimate of the level-2 variance is 94.44 with an SE of 22.1. Making the assumption that the sampling distribution of this variance is Normal, theory gives us confidence intervals:
  calc b1 = 94.44 + (1.96 * 22.1) = 137.76
  calc b1 = 94.44 - (1.96 * 22.1) = 51.12
That is, we can be 95% confident that the value in the population lies between 51 and 138 289

Problems of the frequentist approach
• Heroic assumption that the sampling distribution of the variance is Normal (not just the random effects)
• Normal theory for the sampling distribution is based on asymptotics, whereas in reality we only have 50 districts
• Moreover, as a variance cannot go negative, we can expect its sampling distribution to be positively skewed, not symmetric
• And with just one sample we have no way of checking this Normality assumption: unlike the individual random effects, where we can examine the distribution of the estimated residual values, we have only one estimate of the variance 290

The Bayesian approach
• Not repeated samples from hypothetical super-populations
• Instead: what is the degree of support for a parameter, combining the researcher's beliefs about the parameter and the evidence from the data?
• The answer is a plot of a distribution which shows the degree of support for a parameter (see below)
• MCMC obtains and plots this distribution for each and every term that is estimated; indeed, estimation is about getting this distribution
• Frequentist: the parameter is conceived as fixed but unknown and subject to sampling fluctuations
• Bayes: a parameter is simply another random variable that can be characterized by its distribution 291
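For reference, the frequentist interval quoted two slides back can be checked in a couple of lines; a sketch that assumes the symmetric Normal sampling distribution which the "Problems" slide goes on to question.

    # Frequentist 95% interval for the level-2 variance, as on the slide: 94.44 +/- 1.96 * 22.1
    estimate, se = 94.44, 22.1
    print(estimate - 1.96 * se, estimate + 1.96 * se)   # roughly 51.1 to 137.8
    # This assumes a symmetric Normal sampling distribution for a variance - the very assumption criticised above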
The Bayesian approach (continued)
Full probability modelling - everything is treated as a distribution; parameters and observations are random variables
Fundamental theorem, three types of distribution: the Posterior is proportional to the Prior times the Likelihood
• Prior - belief about a parameter before seeing the data; subjectivity! Includes diffuse or flat priors
• Likelihood - the estimate of the parameter based on the data and assumptions
• Posterior - the updated evidence after combining prior and likelihood; a distribution, not just a point estimate with SEs based on asymptotics
In practice (the "celibate sex counsellor"…) the posterior cannot usually be evaluated analytically, due to the complex hyperspace of the joint posterior when several (maybe hundreds of) parameters need evaluating - but MCMC can 292

What can we do with MCMC?
• A simulation technique for evaluating the joint posterior by iterative simulation from the marginal (ie one at a time) distributions; it converges to a distribution, not to a point estimate
• It allows a building-block approach to estimation: decompose complex problems into lots of small ones, with local computations on the graph - "decomposition of complex problems into smaller components, which communicate locally to make global inferences" (Spiegelhalter et al)
• The downside is convergence and time - but there are lots of diagnostics for non-convergence 293

Fundamentals of MCMC
• Anything about a probability distribution (even one of great complexity) can be characterized by sampling sufficiently many draws from it; eg the mode is the most common draw, and the 95% tails are found by ranking all the draws and taking the values placed 2.5% from the bottom and from the top
• Draws do not have to be independent; if they are not, you need more draws in a longer run - dependency affects efficiency, not validity, and non-independent draws learn at a slower rate
• Initial estimates are needed to start the simulation; if they are good, you are quickly sampling from the equilibrium distribution of the posterior and characterizing that distribution
• If they are bad, you could be exploring a small corner of the hyperspace, and it could take many thousands of samples before you 'drift' or 'jump' to the proper equilibrium and begin characterizing the complete distribution 294

What is MCMC analysis and how does it work?
• A numerical engine for evaluating the joint posterior
• It simulates/samples from the marginal distributions in an iterative scheme
• Different samplers suit different types of problem, eg Gibbs: Gibbs simulates a new value for each parameter in turn from its conditional distribution, assuming that the current values of the other parameters are the true values
Example - the simplest possible model, with 2 parameters (a mean and a variance):
  y_i = β0 + e_i,  e_i ~ N(0, σ²e)
The two unknowns are β0 and σ²e. We want the joint posterior distribution of the mean and variance given the observed values of the response vector, that is p(β0, σ²e | y) 295

Gibbs MCMC sampling of the 2-parameter model
Sampling cycles between a two-step procedure, as there are only two parameters:
Step 1: update the mean based on the current estimate of the variance - draw from p(β0 | y, σ²e(0)) to get β0(1)
Step 2: update the variance based on the revised estimate of the mean - draw from p(σ²e | y, β0(1)) to get σ²e(1)
The process is simply repeated many times, using the previously generated set of estimates to generate the next 296

Gibbs MCMC sampling of the 2-parameter model (continued)
Using non-informative default priors, the conditional distributions at each step take familiar forms (following Browne, 2009, Chapter 1):
Step 1: p(β0 | y, σ²e):  β0 ~ N( Σ y_i / n,  σ²e / n )
  The MCMC estimate of β0 is simply a draw from a Normal distribution with mean Σ y_i / n and variance σ²e / n, where n is the number of observations.
Step 2: p(1/σ²e | y, β0):  1/σ²e ~ Gamma( 0.001 + n/2,  0.001 + Σ e_i² / 2 )
  The MCMC estimate of the precision parameter (1/variance) is simply a draw from a Gamma (non-Normal) distribution whose first parameter is 0.001 plus half the number of observations, and whose second parameter is 0.001 plus half the sum of the squared residuals.
Coded as an MLwiN macro and run for 50 iterations… 297
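A minimal Python/NumPy sketch of the two-step Gibbs sampler described above, assuming the Gamma(0.001, 0.001) default priors; the simulated response vector is made up purely for illustration and this is not the MLwiN macro itself.

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(loc=80, scale=30, size=50)   # any response vector (illustrative)
    n = len(y)

    beta0, sigma2_e = y.mean(), y.var()         # starting values
    draws = []
    for it in range(5000):
        # Step 1: beta0 | y, sigma2_e  ~  N( mean(y), sigma2_e / n )
        beta0 = rng.normal(y.mean(), np.sqrt(sigma2_e / n))
        # Step 2: 1/sigma2_e | y, beta0  ~  Gamma( 0.001 + n/2, rate = 0.001 + SS/2 )
        ss = np.sum((y - beta0) ** 2)
        precision = rng.gamma(shape=0.001 + n / 2, scale=1.0 / (0.001 + ss / 2))
        sigma2_e = 1.0 / precision
        draws.append((beta0, sigma2_e))

    draws = np.array(draws)
    print(draws.mean(axis=0))   # posterior means for beta0 and sigma2_e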
Exploring the joint bivariate posterior distribution: 50 and 5000 draws
[Scatter plots of the joint draws] A very good simulation: the sampler is continuously and rapidly exploring the bivariate space, with no distinctive patches of colour suggesting that it is stuck in one region of the graph 298

Marginal plot of 5000 MCMC draws
[Histograms of the draws for the mean (Betas, roughly 78-84) and the variance (Vars, roughly 700-950)] The marginal distributions are well behaved in that they show no evidence of multimodality; marginal distributions give the degree of support for each parameter 299

What does MCMC do in multilevel models?
In a two-level variance-components model the unknowns are β, u, σ²u and σ²e (note the vectors of uj's and β's), giving a joint posterior with many dimensions: p(β, u, σ²u, σ²e | y)
MCMC is a numerical engine for evaluating this joint posterior by simulating from each of the conditional distributions in an iterative scheme, eg Gibbs. First assume some starting values β(0), u(0), σ²u(0), σ²e(0), then simulate from the following conditional distributions in rotation:
  p(β | y, u(0), σ²u(0), σ²e(0)) to get β(1), then
  p(u | y, β(1), σ²u(0), σ²e(0)) to get u(1), then
  p(σ²u | y, β(1), u(1), σ²e(0)) to get σ²u(1), then finally
  p(σ²e | y, β(1), u(1), σ²u(1)) to get σ²e(1) 300

MCMC in practice in MLwiN
1 Use maximum likelihood (IGLS) to get starting values
2 Switch to MCMC, use the default (uninformative) priors and burn in for 500 simulations; these are thrown away, the aim being to get to the equilibrium distribution and away from the starting values
3 Monitor for 5000 simulations and check for convergence to the distribution: good is "white noise", poor is "slow drift"; check the prospective diagnostics; check the retrospective diagnostics (especially the effective sample size - you need at least 500 for the key parameters of interest)
4 Increase the monitoring run if there is an insufficient effective sample size or a lack of convergence (ie substantive trending)
5 Report the mean or mode and the 95% credible intervals of the parameters, and the DIC
6 Repeat the above with stronger priors to assess sensitivity 301
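A hedged sketch of the chain summaries referred to in steps 3-5 above (posterior mean, 95% credible interval and a crude effective sample size). It assumes a 1-D array of monitored draws, such as one column of the draws array from the Gibbs sketch earlier; the ESS calculation here is a deliberately simple approximation, not MLwiN's diagnostic.

    import numpy as np

    def summarise_chain(chain):
        """Posterior summaries of a single monitored parameter (1-D array of draws)."""
        mean = chain.mean()
        lower, upper = np.percentile(chain, [2.5, 97.5])   # 95% credible interval
        # crude ESS: n / (1 + 2 * sum of the initial positive lag autocorrelations)
        c = chain - mean
        acf = np.correlate(c, c, mode="full")[len(c) - 1:] / (c.var() * len(c))
        rho = acf[1:]
        cutoff = np.argmax(rho <= 0) if np.any(rho <= 0) else len(rho)
        ess = len(chain) / (1 + 2 * rho[:cutoff].sum())
        return mean, (lower, upper), ess

    # eg summarise_chain(draws[:, 0]) for beta0 from the Gibbs sketch above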
IGLS versus MCMC
• IGLS: fast to compute / MCMC: slower to compute
• IGLS: deterministic convergence, easy to judge / MCMC: stochastic convergence, harder to judge
• IGLS: uses MQL/PQL approximations to fit discrete response models, which can produce biased estimates in some cases / MCMC: does not use approximations when estimating discrete response models, so estimates are less biased
• IGLS: in samples with small numbers of level-2 units, confidence intervals for level-2 variance parameters assume Normality, which may be inappropriate / MCMC: in samples with small numbers of level-2 units, Normality is not assumed when making inferences about level-2 variance parameters
• IGLS: cannot incorporate prior information / MCMC: can incorporate prior information
• IGLS: difficult to extend to new models / MCMC: easy to extend to new models (eg multiple membership, cross-classification) and to functions of parameters (eg ranks) 302

Two-level hierarchical model estimated in MCMC
[Equations screenshot] All estimates are shown in blue, ie not converged; clicking on the + button on the toolbar brings up the default weak priors 303

Monitor trajectories for each parameter (last 500 shown) [figure] 304

MCMC diagnostics
A: the trace of all of the estimates
B: the posterior distribution
C: the autocorrelation function
D: the partial autocorrelation function
E: the Monte Carlo standard error
F: prospective diagnostics
G: the summary statistics of the posterior distribution, and the effective sample size 305

"White noise": trajectories showing good mixing
[Trace and posterior for σ²u] A posterior distribution with positive skew; prospective convergence diagnostics; 95% credible intervals; posterior mean; ESS = 2751 306

Trajectories showing less-good mixing
[Trace and posterior for β0] The RL (Raftery-Lewis) diagnostic suggests a longer run is needed; an autocorrelated trace - "stickiness" - explores the parameter space slowly; ESS = 375 307

Very problematic mixing despite 100K draws
ESS only 51; a highly autocorrelated trace; still trending at 50k; a multimodal distribution (it has not converged to a distribution); due to the very small n - only 11 schools 308

Model comparison table: extra information
[Table screenshot] Note the summary information: ESS and potentially asymmetric credible intervals

More efficient MCMC estimation
• More efficient = a less autocorrelated monitoring chain
• ie a greater ESS for the same monitoring length of chain
• In MLwiN a number of procedures are described in Browne, W.J. (2009) MCMC Estimation in MLwiN, v2.13. Centre for Multilevel Modelling, University of Bristol; accessed through the MCMC options
• Strongly recommended as a minimum: use HC (hierarchical centring) at the highest level of the model

Deviance Information Criterion
• A diagnostic for model comparison
• A goodness-of-fit criterion that is penalized for model complexity
• A generalization of the Akaike Information Criterion (AIC, where the df is known)
• Used for comparing non-nested models (eg the same number of, but different, variables)
• Valuable in MLwiN for testing the improved goodness of fit of a non-linear model (eg logit), because there the likelihood (and hence the deviance) is incorrect
• Estimated by MCMC sampling; the output gives the Bayesian Deviance Information Criterion (DIC):
  Dbar = 9763.54, D(thetabar) = 9760.51, pD = 3.02, DIC = 9766.56
  Dbar: the average deviance over the complete set of iterations
  D(thetabar): the deviance at the expected values of the unknown parameters
  pD: the estimated degrees of freedom consumed in the fit, ie Dbar - D(thetabar)
  DIC: fit + complexity, ie Dbar + pD; lower values = a better, more parsimonious model
• Somewhat controversial! Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64: 583-640. 311
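A small sketch of the DIC arithmetic defined above; the function is hypothetical and simply reproduces Dbar, D(thetabar), pD and DIC from a chain of deviance values plus the deviance evaluated at the posterior means of the parameters.

    import numpy as np

    def dic(deviance_chain, deviance_at_posterior_mean):
        """DIC from an MCMC run: Dbar, D(thetabar), pD and DIC (Spiegelhalter et al, 2002)."""
        dbar = np.mean(deviance_chain)            # average deviance over the iterations
        d_thetabar = deviance_at_posterior_mean   # deviance at the posterior means of the parameters
        pd = dbar - d_thetabar                    # effective number of parameters
        return dbar, d_thetabar, pd, dbar + pd    # DIC = Dbar + pD

    # With the summary values quoted on the slide above (Dbar fed in directly):
    print(dic(np.array([9763.54]), 9760.51))      # pD ~ 3.03, DIC ~ 9766.57 (slide rounds to 3.02, 9766.56)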
Some guidance on DIC
• Any decrease in DIC suggests a better model
• But given the stochastic nature of MCMC, with a small difference in DIC you should confirm whether this is a real difference by checking the results with different seeds and/or starting values
• There is more experience with the AIC, and its common rules of thumb… 312

Example: house price dataset - sequence of models
• Model 0: null one-level model; two parameters, a fixed mean and a variance
• Model 1b: null model but with districts as 50 fixed effects (49 dummies and the constant); separate estimation of each intercept
• Model 1: null two-level random-intercepts model; differential intercepts coming from a distribution
• Model 2b: as 1b with Size-5 in the model, districts still as 50 fixed effects; separate estimation of each intercept
• Model 2: the same basic model but with districts as random effects; differential intercepts coming from a distribution
• Model 3b: differential district slopes and intercepts as 100 fixed effects; separate estimation of each intercept and slope for each district
• Model 3: random-slopes model; quadratic variance function at level 2; differential intercepts and slopes coming from a joint multivariate distribution 313

DIC for the sequence of models
Model                            Nominal DF   Estimated DF   DIC
0: single level                  2            2.00           10728.31
1b: 50 district effects          51           50.95          10504.88
1: 2-level random intercepts     51           44.43          10498.74
2b: 1b + Size                    52           51.93          9874.45
2: 1 + Size                      52           45.27          9868.08
3b: 2b + 49 slopes               101          101.16         9843.51
3: 2 + random slopes             101          65.24          9807.47
• Model 1b: constant + 49 differential intercepts + level-1 variance = 51 effective parameters
• Model 1: districts as random effects; only 44 effective parameters, as these effects are estimated as coming from an overall distribution - the nominal 51 parameters are shrunk by sharing a prior distribution and therefore do not contribute whole parameters to the parameter count
• Model 3: random intercepts and slopes; the lowest DIC, the most parsimonious parameterization, and the model most strongly favoured by this approach 314

Laboratory 9
Using either the jsp data or the tutorial data, fit a sequence of models of increasing complexity using the MCMC approach. Compare the models using the DIC diagnostic. Store the models and check that there has been convergence to a distribution by examining the trajectories and ensuring that the ESS is greater than 500 for each and every parameter. 315

10: Some final bits for going further
• Sensitizing you to issues and technical language
• Discrete outcome models, and population-average versus cluster-specific effects
• Missing data
• Repeated measures data with 'extra' serial correlation
• Endogeneity and fixed versus random coefficient models (Hausman test)
• Age, period and cohort models
• Spatial models with 'extra' spatial autocorrelation
• Design and power
• Modelling strategy
• How to write up your results
• What are the benefits of multilevel modelling?
• Lab 10 316
Discrete outcome models: the type of response
• Response: Binary | Example: Yes/No | Model: logit, probit or log-log model with a binomial L1 random term
• Response: Proportion | Example: proportion unemployed | Model: logit etc with a binomial L1 random term
• Response: Multiple categories, unordered | Example: travel by train, car, foot | Model: logit model with a multinomial random term
• Response: Multiple categories, ordered | Example: very happy, quite happy, neutral, unhappy, very unhappy | Model: logit model with a multinomial random term; proportional-odds assumption
• Response: Count | Example: number of crimes in an area | Model: log model with an L1 Poisson random term; can include an offset
• Response: Count | Example: length of stay (LOS) | Model: log model with an L1 NBD random term; can include an offset
Reading: Subramanian S V, Duncan C, Jones K (2001) Multilevel perspectives on modeling census data. Environment and Planning A, 33(3), 399-417. 317

Discrete outcome models: MLwiN implementation
• Handles all the models listed on the previous slide
• Must use MCMC, otherwise poor estimates
• Compare models with the DIC
• Ensure the data are properly coded: eg binary 1/0; ordinal 0, 1, 2, 3
• Click on N in the Equations window to change the distribution
• Multivariate models with more than one response of different types: Realcom (free), http://www.bristol.ac.uk/cmm/software/realcom/; uses MCMC 318

Discrete outcome models: MLwiN implementation, eg the binomial logit
• A non-linear transformation of the response prevents predictions going outside the admissible range of 0 to 1
• Random terms at level 2 and higher: Normal distributions, as usual
• Level 1: binomial random term 319

Discrete outcome models: VPC and MOR
• In discrete outcome models the level-1 variance is not generally estimable
• EG binomial: set the level-1 variance equal to the variance of a standard logistic distribution (3.29); then
  VPC = σ²u / (σ²u + 3.29)
• Illustrative values:
  Variance   MOR    VPC
  0.06       1.25   0.02
  0.18       1.50   0.05
  0.34       1.75   0.09
  0.53       2.00   0.14 320

Population average versus cluster-specific
• Only applies to discrete outcomes
• Different interpretations of the predicted probability from the estimated model
• Marginal (population average, or GEE estimation): the effect of change in the whole population if everyone's predictor variable changed by one unit
• Conditional (cluster-specific): the slope is the effect, for a particular 'individual', of a unit change in a predictor
• Marginal estimates should not be used to make inferences about individuals (ecological fallacy); conditional estimates should not be used to make inferences about populations (atomistic fallacy)
• The population-average approach underestimates the individual risk, and vice versa
Subramanian S V and O'Malley A J (2010) The futility in comparing mixed and marginal approaches to modeling neighborhood effects. Epidemiology, 21(4), 475-478. 321

Population average versus cluster-specific (continued)
• KJ: the multilevel model is fundamental, as it allows the calculation of both
• Fit the multilevel model in MLwiN and then use Customised predictions
• A set of scenarios with different higher-level variances but the same fixed estimate of the logit = -1.5:
  Between variance   Logit   VPC    Subject-specific median probability   Population-average mean probability
  0                  -1.5    0.00   0.18                                  0.18
  1                  -1.5    0.23   0.18                                  0.22
  3                  -1.5    0.48   0.18                                  0.27
  5                  -1.5    0.60   0.18                                  0.30
  8                  -1.5    0.71   0.18                                  0.33
  20                 -1.5    0.86   0.18                                  0.38
• Case 1: no higher-level variance; the subject-specific median probability of the disease for a typical individual is 0.18, and the prevalence of the disease in the population is also 0.18
• Case 6: very large higher-level variance, ie large differences between individuals in their propensity for the 'Yes' outcome; individual risk remains at 0.18, while population prevalence is 0.38. WHY? 322
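A hedged sketch reproducing the VPC values in the table two slides above; the MOR line uses the standard median-odds-ratio formula, which the slide tabulates but does not spell out, so treat that formula as an assumption (the printed values approximately match the slide's).

    import numpy as np
    from scipy.stats import norm

    def vpc_logit(sigma2_u):
        """Variance partition coefficient for a 2-level binomial logit model."""
        return sigma2_u / (sigma2_u + np.pi ** 2 / 3)        # pi^2/3 = 3.29

    def mor(sigma2_u):
        """Median odds ratio (standard formula; not spelled out on the slide)."""
        return np.exp(np.sqrt(2 * sigma2_u) * norm.ppf(0.75))

    for s2 in (0.06, 0.18, 0.34, 0.53):                      # values from the slide's table
        print(s2, round(mor(s2), 2), round(vpc_logit(s2), 2))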
Population average versus cluster-specific (continued)
• Simulated 1000 individuals with the same mean on the logit scale of -1.5 and with 5 variances (1, 3, 5, 8, 20); transform to probabilities
• Positively skewed, as probabilities cannot go below 0
• The median on the probability scale = 0.18 (the median of the logits is the logit of the median): the cluster-specific risk for the typical individual is unchanging
• The mean on the probability scale changes with the variance, being pulled upwards by the skewed data 323

Population average versus cluster-specific: so what is the point of GEE estimation?
• Upside: robust to assumptions about the higher-level distribution; gives correct SEs for the fixed part; gives the population-average value
• Downside: does not give the higher-level variance; not extendible to random slopes, 3-level models etc; does not give, and cannot derive, the cluster-specific estimates
• KJ recommends: use the multilevel model; and if concerned about the distributional assumptions on the higher-level variance, fit discrete (non-parametric) random effects, eg NPMLREG in R, Latent Gold, MPlus 324

Chapter 12: Binomial logistic model - teenage employment in Glasgow
Differentials by gender and qualifications, and by adult unemployment in the area 325

Population-average and cluster-specific estimates: null random-intercepts logit model
• CLUSTER-SPECIFIC = medians: the probability of being employed for the median teenager in the median district
• POPULATION-AVERAGE = means: the probability of being employed for the mean teenager in the mean district
• Little difference here, as the level-2 variance is relatively small 326
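A minimal Python sketch of the simulation described on the slides above (individual logits with mean -1.5 and increasing between-individual variance): the median probability stays near 0.18 whatever the variance, while the mean is pulled upwards by the skew. The sample size of 1000 follows the slide, so the printed means are approximate.

    import numpy as np

    rng = np.random.default_rng(0)
    inv_logit = lambda z: 1 / (1 + np.exp(-z))

    for var in (0, 1, 3, 5, 8, 20):                      # between-individual variances from the scenario table
        logits = rng.normal(-1.5, np.sqrt(var), 1000)    # individual propensities on the logit scale
        p = inv_logit(logits)                            # transform to probabilities (positively skewed)
        print(var,
              round(float(np.median(p)), 2),             # cluster-specific: stays ~0.18
              round(float(p.mean()), 2))                 # population average: rises with the variance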
Chapter 13: HIV in India - SMRs
[Figure] Higher-educated females have the lowest risk, across the age groups 327

HIV in India: some results
[Figure] Risks for the different States relative to living in urban and rural areas nationally 328

Missingness
• The random-coefficient model allows imbalance and missingness in the outcome
• This is based on the Missing at Random assumption (Rubin, D B, 1976, Inference and missing data. Biometrika, 63, 581-592): conditional on the predictors in the fixed part of the model, there should be no patterning to the missingness
• If there is a lot of missing data, or informative dropout (eg in a study of alcohol consumption, the heavy drinkers not showing up for appointments), you can use Realcom-Impute, which can impute continuous and discrete variables at different levels
• The model of interest is first set up in MLwiN, the variables are exported to REALCOM-IMPUTE, and the imputed datasets are returned to MLwiN, where they are fitted and combined automatically for the specified model of interest
• http://www.bristol.ac.uk/cmm/software/realcom/imputation.html (free software)
• Excellent practical advice on missing data can be found at http://www.lshtm.ac.uk/msu/missingdata/index.html 329

Repeated measures data
The standard two-level model deals with this:
  y_ij = β0 x0ij + β1 Time_ij + (u0j x0ij + e0ij x0ij)
where i is the occasion and j is the individual.
Term in the model / what is shown / Crowder and Hand's phrase:
• Fixed part with a linear trend: a generally increasing secular trend of happiness, a rising population average - 'the immutable constant of the universe'
• Random effects for individuals: Person 2 is consistently above the trend and Person 1 consistently below; the heterogeneity between individuals (the gap between them) increases with time, which needs random slopes - 'the lasting characteristic of the individuals'
• Random departure for each occasion: individuals have their good and bad days, which seem 'patternless'; there is no evidence of either individual becoming more 'volatile' with time, and one day is unrelated to the next - 'the fleeting aberration of the moment' 330

Repeated measures data with 'extra' serial correlation
The standard two-level model is not enough when there is residual time dependence: today carries over into tomorrow. Chapter 14 gives an extended discussion of such models: the development of aerobic performance in Madeira children 331

Fixed versus random
Two versions of the subject-specific repeated measures model:
• Multilevel random effects: u0j ~ N(0, σ²u0); e0ij ~ N(0, σ²e0)
• Fixed effects: a dummy for each and every person, with e0ij ~ N(0, σ²e0); 300 children require 300 separate parameters - not a multilevel model!
The dominant approach to time-series cross-sectional data (econometrics and comparative political economy) is to apply the Hausman (1978) specification test and, if it is significant, use fixed effects (>2700 citations on Web of Science) 332

Fixed versus random: the drawbacks of the fixed-effects model
• Target of inference: cannot generalize to children, only to each specific child
• No parameter for the between-child variance, and it does not extend to complex heterogeneity (random slopes; between-occasion, between-child and between-school variation etc)
• Cannot include any time-invariant variables (eg gender), as all the degrees of freedom are used up
• Rarely used in educational research, where interest lies in gender, school effects etc
The perceived advantages of the fixed-effects model
• As all child-level variability is removed, the estimated slope gives the pure longitudinal 'causal' effect
• Children become their own controls
• The BIG prize: 'these statistical models perform neatly the same function as random assignment in a designed experiment' 333

Fixed versus random: the Mundlak specification [title slide]

Age, period and cohort: three types of 'change'
1 Age: how the outcome develops as individuals develop; maturation; internal to individuals; eg becoming more conservative as one grows older
2 Period: the specific time at which the outcome is measured; external to individuals; represents variation over time periods regardless of age, eg 7/11
3 Cohort: a group who share the same events within the same time interval, eg the baby boomers
The centrality of cohorts for understanding social change was made by Ryder (1965): successive replacement

The identification problem
The three types of change are conceptually clearly different, but in practice they are difficult to distinguish:
  Age = Period - Cohort
eg a child aged 12 measured in 1992 will have been born in 1980. The ability to disentangle these different kinds of time effect depends on the research design that is used.

Accelerated longitudinal design
• Multiple sequential age cohorts followed longitudinally, usually with overlap; eg follow 7-year-olds until 11, 11 until 15, and 15 until 19
• Accelerated? 4 years of study gives a 12-year span of development; by splicing together you get one overall development trajectory
• Advantages: very efficient in terms of time; less problem of attrition; less cost keeping the team in the field
• Age, period and cohort are potentially un-confounded in this design, so we can assess how developed an 11-year-old is in different cohorts
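A tiny sketch of the accelerated design and the identification identity above; the cohort labels and starting ages are illustrative only.

    # Accelerated longitudinal design: three overlapping cohorts, each followed for 4 years (illustrative)
    cohorts = {"cohort A": 7, "cohort B": 11, "cohort C": 15}   # age at first measurement
    for name, start_age in cohorts.items():
        ages = list(range(start_age, start_age + 5))            # eg 7..11, 11..15, 15..19
        print(name, ages)
    # Spliced together, the three 4-year windows cover a 7-19 span of development

    # The identification problem: Age = Period - Cohort
    period, cohort = 1992, 1980
    print(period - cohort)   # a child measured in 1992 and born in 1980 is 12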
Madeira Growth Study
3 levels: repeated measurement occasions within students within age cohorts
[Structure diagram: 5 cohorts; students St1 … St100 within each cohort; measurement occasions O1 … O3 within each student]
• Response: 12-minute run test - aerobic performance, cardiovascular endurance
• Accelerated longitudinal design: five age cohorts (8, 10, 12, 14 and 16 year olds) studied for 3 consecutive years
• Cohorts tell us about the stability of effects over time
• Measurement occasions are repeated measures on students and tell us about students' development trajectories
• Period is ignored: with physical growth, the short-term time effect can safely be assumed to be zero

Results of the 3-level model
Allowing Age by Sex by Cohort interactions in the fixed part: convergence for girls; cohort change for boys

Changing performance with cohort: boys and girls
[Figure] Predicted performance for a 14-year-old boy and girl born in different cohorts

Yang (2008) Happiness in the USA: an APC model
• An instant citation classic
• Innovation: used synthetic cohorts from the General Social Survey, grouping people into 5-year bands: 1895-1900, 1900-05, … , 1975-1980
• Period: 1973 to 2004, in years
• Age: continuous age as at the survey
• Multilevel cross-classification of people in periods and in cohorts
• Overcomes the identification problem

Yang (2008) Random-effects APC model
People cross-classified by cohorts and periods; a multilevel proportional-odds model (ordinal response: cumulative odds) with random effects

Developing the APC model to take account of geography: form of the model
People nested within States, periods and cohorts
• Individual effects: age, gender, education, income, etc
• Random effects: 18 cohorts + 21 periods + 49 States + 21 × 49 period-by-State combinations

Spatial models with 'extra' spatial autocorrelation
'Small-for-age' children in Zambia: decompose the area variation into two parts - the surrounding local area (structured spatial effect) and the 'pure' local area (unstructured spatial effect); shrinkage brings smoothness 344

Spatial models as a combination of strict hierarchy and multiple membership (including GWR)
The Adjacency for WinBUGS tool can be used to create the membership in ArcMap: http://www.umesc.usgs.gov/management/dss/adjacency_tool.html
[Map diagram of areas A to M] A person in A is affected by A (strict hierarchy) and by B, C, D (multiple membership); a person in H is affected by H (SH) and by E, I, K, G (MM)
Multiple membership is defined by a common boundary; the weights are a function of the inverse distance between the centroids of the areas 345

Ohio cancer: repeated measures (space and time!)
• Response: counts of respiratory cancer deaths in Ohio counties
• Aim: are there hotspot counties with distinctive trends? (small numbers, so 'borrow strength' from the neighbours)
• Structure: annual repeated measures (1979-1988) for counties
  Classification 3: neighbourhoods as multiple membership (3-8 neighbours)
  Classification 2: counties (88)
  Classification 1: occasions (88 × 10)
• Predictors: expected deaths; time
• Model: the log of the response is related to the fixed predictors, with an offset and a Poisson distribution for the counts (C1); two sets of random effects - county random effects allowed to vary over time, with a trend for each county drawn from the Ohio distribution (C2), and a multiple-membership set of random effects for the neighbours of each county (C3) 346

MCMC estimation: repeated measures model, 50,000 draws
[Equations screenshot] General trend; default priors; neighbourhood variance; variance function for the between-county time trend 347

Respiratory cancer trends in Ohio: raw and modelled
[Figure] Red: County 41 in 1988, SMR = 77/49 = 1.57; Blue: County 80 in 1988, SMR = 6/19 = 0.31 348
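A hedged sketch of multiple-membership weights of the kind described on the spatial slides above (proportional to the inverse distance between centroids and normalised to sum to 1); the area names and coordinates are hypothetical, and other weighting schemes (eg equal weights for adjacent areas) are equally possible.

    import numpy as np

    # Hypothetical centroids (x, y) for areas A-D; A's neighbours by common boundary are B, C, D
    centroids = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0), "D": (2.0, 2.0)}

    def mm_weights(area, neighbours, centroids):
        """Multiple-membership weights proportional to inverse centroid distance, summing to 1."""
        ax, ay = centroids[area]
        inv_d = {n: 1.0 / np.hypot(centroids[n][0] - ax, centroids[n][1] - ay) for n in neighbours}
        total = sum(inv_d.values())
        return {n: w / total for n, w in inv_d.items()}

    print(mm_weights("A", ["B", "C", "D"], centroids))   # weights attached to the neighbours' random effects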
Design issues I: what sort of design?
• Not SRS, eg 10,000 people in 10,000 communities: within-community variation is confounded with between-community variation
• BUT a multistage design - indeed, highly clustered samples - may be needed when you require sample estimates of higher-level variables
• Lots of studies (especially geographical studies) have a poor higher-level design, eg they just use convenient administrative areas; you need theoretically relevant definitions, and to stratify on types of place
Eg Laura Stoker and Jake Bowers (2002) Designing multi-level studies: sampling voters and electoral contexts. Electoral Studies, 21(2): 235-267. 349

Design issues II: what sample size?
• General advice: as many as possible at each level! Not thousands of people in 10 schools
• From the literature:
  - Bryk and Raudenbush (1992), HLM: with 60 students in each of 160 schools you can estimate 4 random variables at the school level
  - Paterson and Goldstein (1991): a minimum of 25 individuals in 25 schools, in preference 100 schools. New statistical methods for analysing social structures: an introduction to multilevel models. British Educational Research Journal, 17: 387-394.
• But think about the target of inference, eg households 350

Free software for sample-size determination
• PinT is a program for Power analysis IN Two-level designs: determination of standard errors and optimal sample sizes in multilevel designs with 2 levels
  - Snijders T A B and Bosker R J (1993) Standard errors and sample sizes for two-level research. Journal of Educational Statistics, 18, 237-259.
  - Snijders T A B (2001) Sampling, Chapter 11 in Leyland, A and Goldstein, H (2001) Multilevel Modelling of Health Statistics, pp 159-174, contains an example of PinT
  - http://stat.gamma.rug.nl/snijders/multilevel.htm#progPINT
• Optimal Design software (Raudenbush): http://sitemaker.umich.edu/group-based/optimal_design_software
• Bill Browne's software using MLwiN: MLPowSim - download from http://seis.bris.ac.uk/~frwjb/esrc.html 351

General modelling strategy for multilevel analysis
• Start by specifying the fixed part, and a variance component at each level of your analysis
• Examine the variance component at the cluster level for normality and outliers
• Examine the level-1 residuals for normality and heteroscedasticity
• Consider modelling heteroscedasticity at level 1 before considering complex higher-level heterogeneity
• Consider modelling higher-level heterogeneity as a function of observed individual predictor variables
• Include cluster-level predictors in the fixed part (as main or cross-level effects, depending on the research question)
• Chapter 13 is a 'project'; it discusses alternative strategies and implements both a bottom-up exploratory and a theoretically-driven approach 352

How to write up your results
• Kromrey et al (2009) Multilevel modelling: a review of methodological issues and applications. Review of Educational Research, 79: 69-102, developed a checklist of good practice
• http://www.infoagepub.com/products/content/files/serlinfiles_multilevel/Chapter%2011/checklist.pdf
• Could the reader answer the following general questions about the study? 99 questions - write it so they can! 353
Final example (BJPS 1992)
• Structure: 3-level strict hierarchy - individuals within constituencies within regions
• Response: voting Labour in 1987
• Predictors: (1) age, class, tenure, employment status; (2) % unemployed, employment change, % in mining in 1981
• Expectation: coal-mining areas vote for the left
• Allow the mining parameter for the mining effect (level 2) to vary over regions (level 3) in a 3-level logistic model 354

Varying relations for Labour voting and % mining [figure] 355

Conclusions
Substantive advantages
1 Modelling contextuality and heterogeneity
2 Micro AND macro models analysed simultaneously - avoids the ecological fallacy and the atomistic fallacy
3 Social contexts are maintained in the analysis; this permits intensive, qualitative research on 'interesting' cases
Technical advantages
1 Autocorrelation: expected and modelled
2 Heterogeneity: expected and modelled
3 Correct assessment of the size of effect of higher-level variables
4 Precision-weighted estimation 356

Conclusion
"The complexity of the world is not ignored in the pursuit of a single universal equation, but the specifics of people and places are retained in a model which still has a capacity for generalisation"
LEMMA: http://www.ncrm.ac.uk/nodes/lemma/about.php 357

Lab 10: your choice
• Chapter 11: three-level model - changing school performance in London schools, a project bringing together everything we have done on Normal-theory responses; data is ileaschools.ws; start p211
• Chapter 12: logistic modelling - teenage employment in Glasgow; data is employ.ws; start p5
• Chapter 13: HIV in India - Poisson and NBD models; data is Hivcounts; start p35
• Chapter 14: longitudinal analysis of repeated measures data - development of children in Madeira; data is Endurancewide.wsz; start p120 (skip the theory!)
• Chapter 16: the analysis of spatial and space-time models - low-birthweight babies in South Carolina, cancer deaths in Ohio 358

Task 5: district-level price-type relationship [figure] 359

Some answers to Task 4 [figure: Fat against Age-46] 360

Mean and district differentials for the base group
[Figure: the mean line β0 with district differentials β0 + u0 and β0 - u0] 361

JSP Model 1: two-level null random-intercepts model
Some questions (1): What does 30.501 represent? And 4.861 - is it significantly different from zero? And 39.42? Does it appear that student achievement varies between schools?
• What is the overall mean? 30.5
• What is the total variation around this mean? 4.861 + 39.42 = 44.28
• What proportion of the variance is at the school level? 4.861 / (4.861 + 39.42) = 11%
• What are the likely bounds of variation on schools? Assuming normality, 95% of schools lie within 30.5 ± 1.96 × sqrt(level-2 variance), that is between 26.2 and 34.8 362

School-level residuals and ranks from the null model
School (residual, rank): 1 (-3.78, 2); 2 (-0.50, 20); 3 (1.54, 40); 4 (-1.71, 9); 5 (0.88, 32); 6 (0.04, 24); 7 (1.28, 37); 8 (-0.10, 22); 9 (0.77, 31); 10 (-0.65, 18); 11 (-0.88, 15); 12 (-0.83, 17); 13 (-1.30, 10); 14 (2.42, 45); 15 (-1.87, 8); 16 (0.92, 33); 17 (-0.58, 19); 18 (1.22, 35); 19 (0.50, 29); 20 (-2.52, 5); 21 (-1.27, 11); 22 (-1.00, 13); 23 (2.66, 46); 24 (1.27, 36); 25 (0.68, 30); 26 (-0.07, 23); 27 (-3.87, 1); 28 (-2.01, 6); 29 (0.38, 28); 30 (4.09, 48); 31 (-0.85, 16); 32 (-1.02, 12); 33 (2.29, 44); 34 (0.08, 25); 35 (3.65, 47); 36 (0.98, 34); 37 (1.34, 38); 38 (-0.41, 21); 39 (-3.37, 4); 40 (1.82, 42); 41 (-1.90, 7); 42 (0.34, 26); 43 (-3.48, 3); 44 (1.75, 41); 45 (2.20, 43); 46 (-0.94, 14); 47 (0.36, 27); 48 (1.45, 39)
Some questions (2): What is the highest-achieving school, and what does a pupil on average achieve there? What is the lowest-achieving school, and what does a pupil there achieve on average?
The extremes 'add' and 'take away' 4.1 and -3.9, corresponding to the likely bounds 363
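A two-line check of the arithmetic in "Some questions (1)" above - a sketch using the quoted JSP Model 1 estimates.

    import math

    beta0 = 30.501          # overall mean (fixed intercept)
    sigma2_u0 = 4.861       # between-school (level-2) variance
    sigma2_e0 = 39.42       # between-pupil (level-1) variance

    total = sigma2_u0 + sigma2_e0
    vpc = sigma2_u0 / total                                  # share of the variance at the school level
    low, high = (beta0 - 1.96 * math.sqrt(sigma2_u0),
                 beta0 + 1.96 * math.sqrt(sigma2_u0))        # 95% coverage bounds for school means
    print(round(total, 2), round(vpc, 2), round(low, 1), round(high, 1))   # 44.28, 0.11, 26.2, 34.8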
Some questions (3): Which are the high- and low-achieving schools? Click in the graph and use the Identify points tab. What determines the width of a confidence band? Click on one with a wide band and look at the hierarchy 364

Variations in attainment between schools [caterpillar plot] 365

JSP Model 2: random intercepts with Math3 as predictor
Some questions (4): What do the estimates represent?
• 30.265: the predicted score for a pupil with Math3 = 25 (remember Math3 is centred at 25)
• 0.604: the general rate of progress across all schools
• 3.975: is there still variation between schools after taking account of intake?
• 28.349: the remaining between-pupil variance
Partitioning the variance
• A: original variance from the null model: 4.861 + 39.42 = 44.28
• B: total residual variance from Model 2: 3.975 + 28.349 = 32.224
• C: proportion of the original variance accounted for by Math3: (A - B) / A = (44.28 - 32.224) / 44.28 = 27% accounted for
• D: proportion of the remaining variance at the school level: 3.975 / (3.975 + 28.349) = 12 per cent
• What are the likely bounds of variation on schools? Assuming normality, 95% of schools lie within 30.3 ± 1.96 × sqrt(level-2 variance), that is between 26.4 and 34.2. That is, the typical child starts with 25 and typically progresses to 30.3, but in the top 2.5% of schools progresses to 34.2 and in the bottom 2.5% only to 26.4
• Notice that for one extra parameter the deviance has decreased from 6263 to 5952 366

School residuals before and after adjusting for Math3
School: null residual (rank) → Model 2 residual (rank): 1: -3.78 (2) → -2.62 (5); 2: -0.50 (20) → -0.21 (23); 3: 1.54 (40) → 0.88 (36); 4: -1.71 (9) → -2.03 (7); 5: 0.88 (32) → -0.22 (22); 6: 0.04 (24) → -0.13 (25); 7: 1.28 (37) → 1.56 (40); 8: -0.10 (22) → 0.26 (27); 9: 0.77 (31) → -1.88 (8); 10: -0.65 (18) → 0.75 (34); 11: -0.88 (15) → -0.77 (12); 12: -0.83 (17) → -0.14 (24); 13: -1.30 (10) → -0.76 (13); 14: 2.42 (45) → 0.69 (33); 15: -1.87 (8) → -0.56 (17); 16: 0.92 (33) → 0.45 (31); 17: -0.58 (19) → -0.28 (20); 18: 1.22 (35) → 1.01 (38); 19: 0.50 (29) → -0.44 (19); 20: -2.52 (5) → -3.06 (2); 21: -1.27 (11) → -1.57 (9); 22: -1.00 (13) → -0.58 (16); 23: 2.66 (46) → 2.93 (44); 24: 1.27 (36) → 1.80 (42); 25: 0.68 (30) → 0.99 (37); 26: -0.07 (23) → 0.34 (28); 27: -3.87 (1) → -2.88 (3); 28: -2.01 (6) → -3.37 (1); 29: 0.38 (28) → 1.38 (39); 30: 4.09 (48) → 3.05 (47); 31: -0.85 (16) → 0.42 (30); 32: -1.02 (12) → -1.37 (10); 33: 2.29 (44) → 3.02 (46); 34: 0.08 (25) → -0.64 (14); 35: 3.65 (47) → 2.95 (45); 36: 0.98 (34) → 0.83 (35); 37: 1.34 (38) → 3.20 (48); 38: -0.41 (21) → -1.33 (11); 39: -3.37 (4) → -2.58 (6); 40: 1.82 (42) → 2.10 (43); 41: -1.90 (7) → -0.45 (18); 42: 0.34 (26) → -0.23 (21); 43: -3.48 (3) → -2.83 (4); 44: 1.75 (41) → 0.41 (29); 45: 2.20 (43) → 1.68 (41); 46: -0.94 (14) → -0.62 (15); 47: 0.36 (27) → 0.16 (26); 48: 1.45 (39) → 0.67 (32)
Some questions (5): After taking account of the pre-test, which school shows the most progress, and which the least? Values of 3.2 and -3.3, compared to what? What has happened to particular schools, and why? School 6? School 9? School 10? 367

The caterpillar plot [figure]
Some questions (6): After taking account of the pre-test, what can you say about the majority of schools? And hence about league tables? 368

Varying relations plot [figure] 369

JSP Model 3: fully random model at level 2
Some questions (1): What do the estimates represent? 30.23 (intercept), 0.61 (Math3 slope), 4.638 (level-2 intercept variance), -0.348 (level-2 intercept-slope covariance), 0.035 (level-2 slope variance), 27.206 (level-1 variance). Do you anticipate fanning in or fanning out? 370
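A short sketch of the variance-partitioning arithmetic from the JSP Model 2 slide above, using the quoted estimates.

    # Variance partitioning when Math3 is added (JSP Models 1 and 2)
    null_l2, null_l1 = 4.861, 39.42       # Model 1 (null) variances: school, pupil
    m2_l2, m2_l1 = 3.975, 28.349          # Model 2 variances after adjusting for Math3

    a = null_l2 + null_l1                 # A: original total variance
    b = m2_l2 + m2_l1                     # B: residual variance from Model 2
    explained = (a - b) / a               # C: share of the original variance accounted for by Math3
    vpc_m2 = m2_l2 / b                    # D: share of the remaining variance at the school level
    print(round(a, 2), round(b, 2), round(explained, 2), round(vpc_m2, 2))   # 44.28, 32.22, 0.27, 0.12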
Caterpillar plots
Some questions (2): What does the top graph (intercepts) show? And the bottom graph (slopes)? 371

School intercept and slope residuals from the fully random model
School: intercept residual (rank), slope residual (rank): 1: -2.07 (8), 0.10 (35); 2: 0.15 (26), -0.04 (18); 3: 0.55 (33), -0.01 (22); 4: -2.44 (5), 0.21 (45); 5: -0.24 (22), 0.02 (26); 6: 0.34 (29), -0.07 (17); 7: 1.29 (37), -0.07 (16); 8: 0.07 (25), 0.03 (27); 9: -3.13 (3), 0.23 (46); 10: 0.48 (32), -0.01 (21); 11: -1.04 (14), 0.11 (37); 12: 0.31 (28), -0.08 (15); 13: -0.75 (15), 0.06 (32); 14: 1.26 (36), -0.10 (13); 15: -0.65 (17), 0.07 (33); 16: -0.08 (24), 0.03 (28); 17: -0.20 (23), 0.01 (25); 18: 1.49 (39), -0.14 (10); 19: -0.59 (18), 0.05 (30); 20: -3.02 (4), 0.18 (43); 21: -1.79 (9), 0.14 (39); 22: -1.25 (12), 0.17 (42); 23: 3.14 (45), -0.24 (3); 24: 2.42 (42), -0.23 (5); 25: 1.24 (35), -0.12 (11); 26: 0.21 (27), 0.01 (24); 27: -2.42 (7), 0.15 (40); 28: -3.71 (1), 0.24 (47); 29: 1.31 (38), -0.09 (14); 30: 3.39 (47), -0.21 (7); 31: 0.36 (30), -0.02 (20); 32: -1.23 (13), 0.05 (31); 33: 2.95 (44), -0.21 (6); 34: -1.49 (10), 0.17 (41); 35: 3.40 (48), -0.24 (2); 36: 1.16 (34), -0.11 (12); 37: 3.16 (46), -0.27 (1); 38: -1.34 (11), 0.08 (34); 39: -2.43 (6), 0.18 (44); 40: 2.51 (43), -0.20 (8); 41: -0.30 (21), 0.00 (23); 42: -0.50 (20), 0.05 (29); 43: -3.59 (2), 0.34 (48); 44: 0.45 (31), -0.03 (19); 45: 2.40 (41), -0.24 (4); 46: -0.74 (16), 0.10 (36); 47: -0.57 (19), 0.12 (38); 48: 1.53 (40), -0.16 (9)
Some questions (3): What does a pupil with a score of 25 achieve in school 8? 30.23 + 0.07. In school 28? 30.23 - 3.7. In school 35? 30.23 + 3.4. What does a pupil with a score of 35 achieve in school 8? 30.23 + 0.07 + 10 × (0.61 + 0.03). In school 28? 30.23 - 3.7 + 10 × (0.61 + 0.24). In school 35? 30.23 + 3.4 + 10 × (0.61 - 0.24) 372

Varying relations plot [figure] 373

Lab 5: analysis of JSP data - modelling heterogeneity at level 2
Return to the jsp data and draw graphs of the level-2 and level-1 variance functions, pictures of the between-school and between-pupil populations, and the VPC, for:
• Model 1: null random-intercepts model
• Model 2: random-intercepts model with Maths3 as predictor
• Model 3: random intercepts and slopes model, with a quadratic variance function at level 2
• New model: a linear function for the between-school heterogeneity 374

JSP data: Model #1, variance components [equations and results screenshots]
JSP data: Model #2, random intercepts [equations and results screenshots]
JSP data: Model #3, level-2 random quadratic [equations and results screenshots]
JSP data: Model #3, level-2 random quadratic - the variance functions
  Level 1: Var(e0ij x0ij) = σ²e0 x0ij² = 27.206
  Level 2: Var(u0j x0ij + u1j x1ij) = σ²u0 x0ij² + 2σu01 x0ij x1ij + σ²u1 x1ij²
           = 4.64 + 2(-0.35)·Maths3 + 0.035·Maths3²
JSP data: Model #3 - fitted line and between-pupil bounds
  Fit = β0 x0ij + β1 x1ij = 30.23 + 0.612·Math3
  95% coverage: Fit ± 1.96 × sqrt(σ²e0 x0ij²) = Fit ± 1.96 × sqrt(27.206) ≈ Fit ± 10.2
JSP data: Model #3 - between-school bounds
  95% coverage: Fit ± 1.96 × sqrt(σ²u0 x0ij² + 2σu01 x0ij x1ij + σ²u1 x1ij²)
              = Fit ± 1.96 × sqrt(4.64 + 2(-0.35)·Maths3 + 0.035·Maths3²)
JSP data: Model #3 - VPC as a function of Maths3
  VPC = (σ²u0 + 2σu01 x1ij + σ²u1 x1ij²) / (σ²u0 + 2σu01 x1ij + σ²u1 x1ij² + σ²e0)
      = (4.64 + 2(-0.35)·Maths3 + 0.035·Maths3²) / (4.64 + 2(-0.35)·Maths3 + 0.035·Maths3² + 27.206)
JSP data: Model #3, level-2 LINEAR and level-1 CONSTANT - doesn't converge

Laboratory 6
Return to the jsp data and draw graphs of the level-2 and level-1 variance functions, the confidence bounds of the population, and the VPC, for:
• Model 4a: linear function at level 1 and linear function at level 2
• Model 4b: quadratic function at level 1, linear at level 2
• Model 4c: quadratic function at level 2, linear at level 1
• Model 4d: fully random, quadratic at level 1 and level 2
Test whether the more complex model is an improvement over the simpler one using the deviance statistic, and test whether all the level-2 variance terms are needed using the contrast-matrix approach 385
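A hedged Python sketch evaluating the Model #3 variance functions, coverage bounds and VPC set out above over a grid of Maths3 values; the grid itself is an assumption, as the slides do not state the range of the centred Maths3 score.

    import numpy as np

    # Model #3 estimates from the slides above
    b0, b1 = 30.23, 0.612                       # fixed part: intercept and Maths3 slope
    s2_u0, s_u01, s2_u1 = 4.64, -0.35, 0.035    # level-2 variances and covariance
    s2_e0 = 27.206                              # level-1 variance

    x = np.linspace(-20, 15, 8)                 # assumed grid of (centred) Maths3 values
    fit = b0 + b1 * x
    var_l2 = s2_u0 + 2 * s_u01 * x + s2_u1 * x ** 2      # between-school variance function
    vpc = var_l2 / (var_l2 + s2_e0)                      # VPC as a function of Maths3
    school_bounds = np.c_[fit - 1.96 * np.sqrt(var_l2), fit + 1.96 * np.sqrt(var_l2)]
    print(np.round(np.c_[x, fit, var_l2, vpc], 2))
    print(np.round(school_bounds, 1))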
JSP data: Model #4a, fully random, linear at levels 1 and 2 [equations and results screenshots]
JSP data: Model #4b, fully random, linear at level 2 and quadratic at level 1 [equations and results screenshots]
JSP data: Model #4c, fully random, quadratic at level 2 and linear at level 1 [equations and results screenshots]
JSP data: Model #4d, fully random, quadratic at levels 1 and 2 [equations and results screenshots]

Task 7: mean and district differentials
[Figure: for the base category (Terrace) the mean line is β0 with district differentials β0 + u0 and β0 - u0; for Semi the mean line is β0 + β1 with district differentials β0 + β1 + (u0 + u1) and β0 + β1 - (u0 + u1)] 394

LAB 7 A: base random-intercepts model, main effects of prior ability at the pupil level
Clearly there are significant differences between the subsequent progress of the three VR bands. There are significant differences between schools after taking account of banded pupil ability on entry. 395

LAB 7 B: new base model, complex variance functions at level 2 and level 1
The greatest variation in progress is for the middle-ability pupil, and the greatest between-school variation is for those of highest ability on entry; however, there are wide confidence intervals on these estimates.

LAB 7 C: complex variance at levels 1 and 2, separate coding
The variance-function results are the same as in Model B, but are given directly. This formulation is helpful in comparing differential school performance: eg the correlation between the school effects for a VR band 1 and a VR band 3 pupil is only 0.37, so a school could be effective for one type of pupil but not another. The best-performing school (53) for high-ability pupils is only average for low-ability pupils.

LAB 8 A: base random-intercepts model, main effects of prior ability at the pupil level
Clearly there are significant differences between the subsequent progress of the three VR bands. There are significant differences between schools after taking account of banded pupil ability on entry. 398

LAB 8 B: main effects of prior ability at the pupil and school level
There is a significant effect of school-level pupil ability over and above individual ability. The introduction of the level-2 variables has accounted for some of the unexplained variation, but only between schools.

LAB 8 C: new base model, complex variance functions at level 2 and level 1
The greatest variation in progress is for the middle-ability pupil, and the greatest between-school variation is for those of highest ability on entry; however, there are wide confidence intervals on these estimates.

LAB 8 D: cross-level interactions
There appears to be an overall peer-group effect that is not differential for pupils of different prior ability. 401

LAB 8 E: cross-level interactions - school heterogeneity and prior ability
Rather striking results: high-ability pupils make more progress when there are diverse children around them, while lower-ability pupils make less progress when schools are diverse! Are they interpretable? 402