Multilevel modelling
Volume 4
Slide Set 1
Revised June 2011
1
Session 1: Concepts &
Structures
• Who are we?
• Your knowledge?
• Course structure and aims
• Going further and software
• Four key notions that define ML
• Hierarchical structures
• Task 1 (over to you)
• Hierarchies resulting from research designs
• Non-hierarchical structures
• Alternative analysis strategies
• Lab 1: MLwiN preliminaries (before modelling)
• Homework: Task 2 and Statistics 101
2
Us and You?
• US
• Kelvyn Jones (Bristol)
- geographer (health)
- cohorts & panel studies
- ML since 1987
• YOU
• Fitted a multilevel model?
• Used MLwiN?
• Familiarity with…
- ANOVA?
- ANCOVA?
- LRT?
- ZPR?
- REML?
- FIML?
- MCMC?
- logit models?
- Bayesian modelling?
3
Course structure and aims
• A thorough introduction to BASIC multilevel modelling: the two-level model
• Sessions 1 to 8: concepts, specification and interpretation of the 2-level model with a continuous response
• Sessions 9 & 10: IGLS/RIGLS estimation, a basic introduction to MCMC estimation; the 3-level model; discrete responses, repeated measures, spatial models…
• Practical: hands-on experience
us: modelling house prices
you: modelling pupil progress between schools
• Participation: talk, tasks, discussion
4
Multilevel Models:
AKA
• random-effects models
• hierarchical models
• variance-components models
• random-coefficient models
• mixed models

• First known application: 1861, a one-way random-effects model: several telescopic observations per night for several different nights; separated the variance into between-night and within-night variation
• Modern-day version: 1986, publication of algorithms (linked to software) for dealing with unbalanced data and complex variance functions
5
Texts
• Comprehensive but demanding! : Goldstein
• Thorough but a little dated: Snijders & Bosker
• Approachable : Hox
• Thorough (HLM) Raudenbush & Bryk
• Authoritative: de Leeuw & Meijer
• Applications: education, O’Connell & McCoach
• Applications: health, Leyland & Goldstein
http://www.cmm.bristol.ac.uk/learningtraining/multilevel-m-support/books.shtml
6
Realistically complex modelling
Statistical models as a formal framework of analysis with a complexity of structure
that matches the system being studied
Four KEY Notions
1 : Modelling data with a complex structure
A large range of structures that ML can handle routinely; eg
houses nested in neighbourhoods
2: Modelling heterogeneity
standard regression models 'averages', ie the general relationship
ML additionally models variances; eg individual house prices vary
from neighbourhood to neighbourhood
3: Modelling dependent data
potentially complex dependencies in the outcome over time, over
space, over context; eg houses within a neighbourhood tend to have
similar prices
4: Modelling contextuality: micro & macro relations
eg individual house prices depend on individual property
characteristics and on neighbourhood characteristics
7
Modelling data with complex
structure
• ‘Once you know that hierarchies exist you see them
everywhere’ (Kreft, 1990)
• Rest of Session 1
• Basic idea: for many (most?) research questions, either the real world has a complex structure and/or we impose one during research design
• Two fundamental types: hierarchical and non-hierarchical structures
• Ignoring structure leads to impoverished analyses and inferential error
• ML analyses data with complex structure
8
Strict hierarchies
Two-level structures and three-level structures
Unit Diagrams
9
Examples of strict hierarchy
• Education
  - pupils (1) in schools (2)
  - pupils (1) in classes (2) in schools (3)
• Geography
  - houses (1) in neighbourhoods (2) in regions (3) in countries (4)
• Business
  - individuals (1) within teams (2) within organizations (3)
• Psychology
  - individuals (1) within families (2)
  - individuals (1) within twin/sibling pairs (2)
• Economics
  - employees (1) within firms (2)
• NB all are structures in the POPULATION (ie exist in reality)
10
Characteristics of strict hierarchy
• Exact nesting
– a lower level unit is nested within one and only one unit at the next
higher level
• Lower-level units are repeated samples of higher-level units
  - cannot measure hospital performance directly, but make repeated measures of it through individual patient outcomes
• Successive sampling from each level of a hierarchical population (even
with full enumeration); concept of hypothetical super-population
• Imbalanced structure
  - imbalance in sample size within higher-level units is to be expected in observational research; in one hospital you may have sampled 22 patients, in another only 10. This used to be troublesome for estimation
  - corollary: NEVER throw away information to achieve balance, even if there is only 1 patient in a hospital
11
What does the data-frame look like?
a 2-level study for examining school effects on
pupil progress
Classifications or levels   Response               Explanatory variables
Pupil i   School j          Pupil exam score_ij    Pupil previous exam score_ij   Pupil gender_ij   School type_j
1         1                 75                     56                             M                 State
2         1                 71                     45                             M                 State
3         1                 91                     72                             F                 State
1         2                 68                     49                             F                 Private
2         2                 37                     36                             M                 Private
3         2                 67                     56                             M                 Private
1         3                 82                     76                             F                 State
2         3                 85                     50                             F                 State
1         4                 54                     39                             M                 Private
2         4                 91                     71                             M                 Private
3         4                 43                     41                             M                 Private
4         4                 66                     55                             F                 Private
Some multilevel questions
• What is the between-school variation in progress?
• Do pupils make more progress in private than public schools?
• Do girls make greater progress in private schools?
12
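Purely as an illustration (the course itself uses MLwiN), the long-format data frame above could be set up as follows with pandas; the column names are my own shorthand, the values are those in the slide's table.

```python
import pandas as pd

# One row per pupil (level 1), with the school identifier (level 2) as a column.
df = pd.DataFrame({
    "pupil":   [1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 3, 4],       # i, numbered within school
    "school":  [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4],       # j, the level-2 identifier
    "exam":    [75, 71, 91, 68, 37, 67, 82, 85, 54, 91, 43, 66],
    "prior":   [56, 45, 72, 49, 36, 56, 76, 50, 39, 71, 41, 55],
    "gender":  ["M", "M", "F", "F", "M", "M", "F", "F", "M", "M", "M", "F"],
    "schtype": ["State"]*3 + ["Private"]*3 + ["State"]*2 + ["Private"]*4,
})
print(df.groupby("school").size())   # unbalanced: 3, 3, 2, 4 pupils per school
```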
TASK 1
• "Many research designs can be accommodated within the framework of a strict hierarchy"
• Specify a multilevel structure and dataframe for EITHER a response variable of your choice OR for a model to explain house prices OR voting behaviour
Template for answer
• what is the response (must always be measured at level 1)?
• What are the levels: 1, 2 , etc?
• What are the predictor variables, and at what level are they
measured (1,2 etc)?
• What are the multilevel questions that you want answered?
13
What generates a multilevel structure?
• The real world generates them: population structures or naturally occurring hierarchies
  - Pupils in Classes in Schools
  - Patients in Clinics in Health authorities
  - Voters in Polling districts in Constituencies
  - People in Neighborhoods in Cities
  - Houses in Neighborhoods in Towns in Regions
• But strict hierarchies can also be imposed during research design and/or during data collection… examples of different designs follow
14
Multistage sampling designs
• for efficient collection of data
• most large-scale surveys are not SRS
• Two-level structure imposed by design
• respondents nested within PSU’s
15
Multistage sampling designs
• Multistage designs (usually) generate dependent data
individuals living within the same PSU can be
expected to be more alike than a random sample
• The ‘design effect’
Inferential procedures (SE’s, confidence limits, tests)
are likely to be incorrect
incorrect estimates of precision
Type 1 errors: finding a relationship where none
exists
• Multilevel models model this dependency and automatically correct for the 'design effect'
16
Repeated cross-sectional design
• Structure of Example:(Nuttall et al’s analysis of changing school
performance in London)
– Level-3 is the school, level-2 is the year and level-1 is the individual.
Level-2 represents repeated measurements on the school.
– Imbalance:
– particular schools not included in particular years (level 2);
– different number of pupils in each cohort in each school (level 1)
– even if school is only measured once do NOT discard it, include it!
• Research Question
– modeling changing school performance as different cohorts of pupils
pass through the schools
17
Repeated measures design: True Panels
• Structure of Example:
– Level-3 is the commune, level-2 is the voter and level-1 is the occasion
on which they voted ; voters are repeatedly measured; SAME
individuals
• Research Question
-what is the volatility of the vote; how changeable are individuals, and how
changeable are communes?; the metaphor of the serene swan!
• Imbalance
- does not require a fixed set of repeated observations for all persons
- both the number of observations per person and the spacing between
observations may vary
18
Multivariate multilevel structures I
• Structure:
– Multiple response variables, representing repeated measurements of
distinctive but not unrelated outcome variables
– SET of response variables (level-1) nested within individuals (level-2)
nested within neighborhoods (level-3)
– NB- imbalance: not all responses measured on all individuals
• Research Question
-responses are health-related behaviours (drinking, smoking, exercise & diet)
– are there unhealthy communities as well as unhealthy individuals?
– how correlated are the behaviours at the individual level and community
level (taking account of other characteristics)?
19
Multivariate multilevel structures II
• Handles extreme case of matrix sampling design; imbalance by design
• All respondents are asked a set of core questions but additional questions
are asked of random subsets of the total sample
• Example:
all pupils asked core set of core mathematics questions; but subsets asked
detailed questions on trigonometry, matrix algebra, calculus; all modeled
simultaneously
• Another variant is the ‘mixed’ multivariate model……….
20
Mixed multilevel structures
• Structure:
– Level-3: Places; Level-2: Individuals; Level-1: Two responses:
  • 1) smoke or not (a qualitative state)
  • 2) how many cigarettes they consume in a day (a quasi-continuous measure)
– Imbalance: data with a spike at zero
• Question
– is social class related to whether you smoke or not, but unrelated to the amount you smoke?
– do places that have a high proportion of smokers also tend to have higher smoking intensity?
21
Non-Hierarchical structures
Cross-classified structures I
• Structure:
– Individuals at level-1 in Workplaces at level-2 AND neighborhoods at
level-2
– Workplaces and neighborhoods are not nested but CROSSED.
– Individuals are seen as occupying more than one set of contexts
• Questions:
– Relative contribution of neighborhoods and workplaces to life
satisfaction.
22
Cross-classified structures II
• Examples
  - patients (1) nested by GPs (2) and by hospitals (2)
  - pupils (1) nested within primary schools (2), secondary schools (2), and neighbourhoods (2)
• Cross-classifications share features with strict hierarchies
  - repeated sampling
  - potential imbalance
  - but cross-classifications do not have exact nesting
23
Non-Hierarchical structures
Multiple membership structures
• Structure:
  - a lower-level unit can belong simultaneously to more than one higher-level unit
  – Pupils nested within primary-school teachers
  – Some pupils are taught by more than one teacher
  – Structure includes a 'weight' based upon the proportion of time the pupil spent with each teacher (weights sum to 1)
• Generalization
  - handles 'missing identifiers', when only the probability of membership of a higher-level unit is known
24
CLASSIFICATION DIAGRAMS
a) 3-level hierarchical structure
b) cross-classified structure
25
CLASSIFICATION DIAGRAMS(cont)
c) multiple membership structure
d) spatial structure
26
Spatial Models as combination of strict hierarchy
and multiple membership
(eg contagious disease diffusion)
[Map diagram: areas A–M]
Person in A: affected by A (strict hierarchy) and by B, C, D (multiple membership)
Person in H: affected by H (strict hierarchy) and by E, I, K, G (multiple membership)
Multiple membership defined by common boundary; weights as a function of the inverse distance between the centroids of areas
27
Combining structures: crossed-classifications
and multiple membership relationships
[Unit diagram: schools S1–S4, pupils P1–P12, areas A1–A3]
Pupil 1 moves in the course of the study from residential area 1 to 2 and from school 1 to 2
Pupil 8 has moved schools but still lives in the same area
Pupil 7 has moved areas but still attends the same school
[Classification diagram: Area and School crossed, with Student a multiple member of both]
Now, in addition to schools being crossed with residential areas, pupils are multiple members of both areas and schools.
28
CLASSIFICATIONS FOR ALSPAC
[Classification diagram: Area, School cohort, Primary school, Teacher, Pupil, Occasions]
• All children born in Avon in 1990 followed longitudinally
• Multiple attainment measures on a pupil
• Pupils span 3 school-year cohorts (say 1996, 1997, 1998)
• Pupils move between teachers, schools, neighbourhoods
• Pupils' progress potentially affected by their own changing characteristics, the pupils around them, their current and past teachers, schools and neighbourhoods
29
CLASSIFICATION DIAGRAM FOR
ALSPAC
[Classification diagram for ALSPAC: Area, School cohort, Primary school, Teacher, Pupil, Occasions]
IS SUCH COMPLEXITY NEEDED?
• Complex models are not reducible to simpler models
• Confounding of variation across levels (eg primary and secondary school variation)
30
Structures, classifications and dataframes
2-level strict hierarchy structure
[Unit diagram: schools S1–S4 at level 2, pupils P1–P12 at level 1]

Classifications or levels   Response              Explanatory variables
Pupil i   School j          Pupil exam score_ij   Previous exam score_ij   Pupil gender_ij   School type_j
1         1                 75                    56                       M                 State
2         1                 71                    45                       M                 State
3         1                 91                    72                       F                 State
1         2                 68                    49                       F                 Private
2         2                 37                    36                       M                 Private
3         2                 67                    56                       M                 Private
1         3                 82                    76                       F                 State
2         3                 85                    50                       F                 State
1         4                 54                    39                       M                 Private
2         4                 91                    71                       M                 Private
3         4                 43                    41                       M                 Private
4         4                 66                    55                       F                 Private
31
Distinguishing Variables and Levels
[Unit diagram: school type (state, private) wrongly drawn as a level above schools S1–S4 and pupils P1–P12 — NO!]

Classifications or levels      Response          Explanatory variables
Pupil i   School type k        Exam score_ijk    Previous exam score_ijk   Gender_ijk   School j
1         State                75                56                        M            1
2         State                71                45                        M            1
3         State                91                72                        F            1
1         Private              68                49                        F            2
2         Private              37                36                        M            2
Etc

School type is not a random classification; it is a fixed classification, and therefore a variable, not a level.
A random classification arises when the units can be regarded as a random sample from a wider population of units, eg pupils and schools.
A fixed classification has a small, fixed number of categories, eg State and Private: these are not two types sampled from a large number of types, so on the basis of these two we cannot generalise to a wider population of types of school. But countries? Later…
32
Structures, classifications and dataframes
2 level repeated measures design
Classification diagram and unit diagram
[Persons P1, P2, P3, … at level 2; measurement occasions O1, O2, O3, … at level 1]

a) in long form:

Classifications or levels   Response     Explanatory variables
Occasion i   Person j       Height_ij    Age_ij   Gender_j
1            1              75           5        F
2            1              85           6        F
3            1              95           7        F
1            2              82           7        M
2            2              91           8        M
1            3              88           5        F
2            3              93           6        F
3            3              96           7        F

b) in short form:

Person   HOcc1   HOcc2   HOcc3   AgeOcc1   AgeOcc2   AgeOcc3   Gender
1        75      85      95      5         6         7         F
2        82      91      *       7         8         *         M
3        88      93      96      5         6         7         F
33
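A minimal sketch (assuming pandas) of moving the short-form data above into the long form used for a 2-level repeated-measures analysis; the frame and column names simply reproduce the slide's example.

```python
import pandas as pd

# Short (wide) form: one row per person, one column per occasion.
wide = pd.DataFrame({
    "Person":  [1, 2, 3],
    "HOcc1":   [75, 82, 88], "HOcc2":  [85, 91, 93], "HOcc3":  [95, None, 96],
    "AgeOcc1": [5, 7, 5],    "AgeOcc2": [6, 8, 6],   "AgeOcc3": [7, None, 7],
    "Gender":  ["F", "M", "F"],
})

# Long form: one row per person-occasion; person 2's missing occasion 3 is dropped.
long = pd.wide_to_long(wide, stubnames=["HOcc", "AgeOcc"],
                       i="Person", j="Occasion").reset_index()
long = long.rename(columns={"HOcc": "Height", "AgeOcc": "Age"}).dropna()
print(long)
```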
Structures, classifications and dataframes
Classification and unit diagrams for multivariate response model
[Classification and unit diagram: exam subjects (E, M) at level 1 nested within pupils P1, P2, P3, P4, … at level 2]

a) in short form:

Pupil   English score   Maths score   Gender
1       95              75            M
2       55              *             F
3       65              40            F
4       *               75            M

b) in long form:

Classifications or levels   Response        Explanatory variables
Exam subject i   Pupil j    Exam score_ij   EngIndic_ij   MathIndic_ij   GenderEng_j   GenderMath_j
Eng 1            1          95              1             0              M             0
Math 2           1          75              0             1              0             M
Eng 1            2          55              1             0              F             0
Eng 1            3          65              1             0              F             0
Math 2           3          40              0             1              0             M
Math 2           4          75              0             1              0             M
34
Analysis Strategies
for Multilevel Data
What are the alternatives; and why use
multilevel modelling?
35
Analysis Strategies
I Group-level analysis. Move up the scale: analyse only at the macro
level; Aggregate to level 2 and fit standard regression model.
• Problem: cannot infer individual-level relationships from group-level relationships (ecological or aggregation fallacy)

Example: research on school effects
  Response: current score on a test, turned into an average Ȳ_j for each of the j schools
  Predictor: past score, turned into an average X̄_j for each of the j schools
  Model: regress means on means

[Plot: the same school means (X̄_j, Ȳ_j) are compatible with three very different within-school relations]

Means-on-means analysis is meaningless!
The mean does not reflect the within-group relationship.
Aitkin, M., Longford, N. (1986), "Statistical modelling issues
in school effectiveness studies", Journal of the Royal
Statistical Society, Vol. 149 No.1, pp.1-43.
36
Analysis Strategies continued
I Group-level analysis Continued Aggregate to level 2 and fit standard
regression model.
•
Problem: Cannot infer individual-level relationships from grouplevel relationships (ecological or aggregation fallacy)
Level        Black illiteracy   Foreign-born illiteracy
Individual   0.20               0.11
State        0.77               -0.52

Robinson (1950) demonstrated the problem by calculating the correlation between illiteracy and ethnicity in the USA at two scales of analysis, aggregate and individual, for 1930:
- Individual: 97 million people; States: 48 units
- very different results! The ECOLOGICAL FALLACY
37
Analysis Strategies continued
II
Individual-level analysis. Move down the scale; work only at
the micro level; Fit standard OLS regression model
• Problem: Assume independence of residuals, but may
expect dependency between individuals in the same group;
leads to underestimation of SEs; Type I errors
Bennett's (1976) "teaching styles" study uses a single-level model: test
scores for English, Reading and Maths aged 11 were significantly
influenced by teaching style; PM calls for a return to ‘traditional’ or
formal methods
Re-analysis: Aitkin, M. et al (1981) Statistical modelling of data on
teaching styles (with Discussion). J. Roy. Statist. Soc. A 144, 419-461
Used multilevel models to handle dependence of pupils within classes; no
significant effect
Also atomistic fallacy………….
38
What does an individual analysis miss?
(Subramanian, SV, Jones, K, et al (2009) 'Revisiting Robinson: The perils of individualistic and ecological fallacy', International Journal of Epidemiology)

• Re-analysis as a two-level model (97m people in 48 States)
[Classification diagram: States at level 2, People at level 1]
• Who is illiterate? Individual model
• Does this vary from State to State?
• Cross-level interactions?
39
Analysis Strategies (cont.)
III Contextual analysis. Analysis individual-level data but
include group-level predictors
Problem: Assumes all group-level variance can be
explained by group-level predictors; incorrect SE’s for
group-level predictors
• Do pupils in single-sex schools experience higher exam attainment?
• Structure: 4059 pupils in 65 schools
• Response: Normal score across all London pupils aged 16
• Predictor: Girls' and Boys' schools compared to Mixed schools

Parameter                            Single level      Multilevel
Cons (Mixed school)                  -0.098 (0.021)    -0.101 (0.070)
Boys' school                          0.122 (0.049)     0.064 (0.149)
Girls' school                         0.245 (0.034)     0.258 (0.117)
Between-school variance (σ²_u0)                         0.155 (0.030)
Between-student variance (σ²_e0)      0.985 (0.022)     0.848 (0.019)

(SEs in parentheses)
40
Problem is worst for higher-level variables
If we fit a single-level model we get incorrect uncertainty intervals, leading to incorrect inferences: the single-level model ignores the clustering effect of schools.

[Plot: estimates and confidence intervals for mixed, boys' and girls' schools under the single-level and multilevel models]

The multilevel model automatically corrects the SEs and CIs.
41
Analysis Strategies (cont.)
IV Analysis of covariance (fixed effects model). Include dummy
variables for each and every group
Problems
• What if number of groups very large, eg households?
• No single parameter assesses between group differences
• Cannot make inferences beyond groups in sample
• Cannot include group-level predictors as all degrees of
freedom at the group-level have been consumed
• Target of inference: the individual school versus schools in general
42
Analysis Strategies (cont.)
V  Fit single-level model but adjust standard errors for clustering (GEE approach)
Problems: Treats groups as a nuisance rather than of substantive
interest; no estimate of between-group variance; not
extendible to more levels and complex heterogeneity
VI Multilevel (random effects) model. Partition residual
variance into between- and within-group (level 2 and level
1) components. Allows for un-observables at each level,
corrects standard errors, Micro AND macro models
analysed simultaneously, avoids ecological fallacy and
atomistic fallacy: richer set of research questions BUT (as
usual) need well-specified model and assumptions met.
43
Why should we use multilevel models?
Sometimes:
single level
models can be
seriously
misleading!
44
What have we learned?
• Multilevel models can handle social science research problems with
“realistic complexity”
• Complexity takes two forms, each of two types
• As ‘Structure’ ie dependencies
- naturally occurring dependencies
Eg:
pupils in schools ; measurements over time
- ‘imposed-by-design’ dependencies
Eg:
multistage sample
• As ‘Missingness’ ie imbalance
- naturally occurring imbalances
Eg:
not answering in a panel study
- ‘imposed-by-design’ imbalances
Eg:
rotational questions
• Most (all?) social science research problems and designs are a combination
of strict hierarchies, cross-classifications and multiple memberships
45
So what?
• Substantive reasons: richer set of research questions
– To what extent are pupils affected by school context in addition to their individual characteristics?
– What proportion of the variability in achievement at age 16 can be accounted for by primary school, secondary school and neighbourhood characteristics?
• Technical reasons:
– Individuals drawn from a particular 'grouping' can be expected to be more alike than a random sample
– Incorrect estimates of precision, standard errors, confidence limits and tests; increased risk of finding relationships and differences where none exist
46
Some reading
• Johnston, RJ, Jones, K, Propper, C & Burgess, SM (2007) Region,
local context, and voting at the 1997 General Election in England,
American Journal of Political Science, 51 (3), 641-655
• Jones, K, Subramanian, SV & Duncan, C. (2003) Multilevel methods for
public health research', in Kawachi, I and Berkman L F (Eds.),
Neighbourhoods and Health, Oxford University Press, 65-111.
• Jones, K & Duncan, C. (2001) Using multilevel models to model heterogeneity: potential and pitfalls, Geographical Analysis, 32, 279-305
• Bullen N, Jones K, Duncan C. (1997) Modelling complexity: analysing
between individual and between-place variation—a multilevel tutorial.
Environment and Planning. 29: 585–609
• Jones, K., and Duncan, C. (1998) ‘Modelling context and heterogeneity:
Applying multilevel models’, in E. Scarbrough and E. Tannenbaum
(Eds.), Research Strategies in the Social Sciences. Oxford
University Press.
47
Rest of Session and homework
• Homework
Task 2 : think of a social-science research problem in your
field; does it have a hierarchical or non-hierarchical structure;
are cross-classified model and multiple membership models
appropriate; draw up a classification diagram for a realistically
complex model
• Task 3 read Chapter 1 in accompanying book “Developing
multilevel models in MlwiN” (pp.3-11)
• Laboratory 1
• Start working through Chapter 2 “Getting started with MlwiN”
stop before Graphical Display (work through pp.12-30)
• NB review Stats 101: following slides
48
Stat101:
A reminder of some key ideas in statistics
What you should have got out of
Stats 101 and hopefully have!
KEY concept: random coefficient model analyses the MEAN
AND VARIANCE of a response variable as a FUNCTION of
other predictor variables
MEAN: the central tendency; the average or typical value; eg
what is the average rainfall on this island?
VARIANCE: the spread, the typical or ‘average’ variability
around the mean; the average deviation around the mean eg
how unequal is the rainfall, how consistent across the island?
Questions of volatility, difference, variability, equality and
segregation are answered by examining the variance
Description, modelling and inference in the simplest possible
case………..ESSENTIAL PRE-REQUISITES
Description: calculating the Mean and Variance
Index i   Rainfall inches (y_i)   Deviation from mean   Deviation squared
1         5                       5 - 7 = -2            (-2)(-2) = 4
2         7                       7 - 7 =  0            0 * 0 = 0
3         9                       9 - 7 = +2            2 * 2 = 4
Sum       21                      0                     8

Mean = [sum(values)]/n = 21/3 = 7 inches = ȳ
  n is the number of rainfall stations, = 3
  y_i is the amount of rain in inches at each of the i = 1, 2, 3 stations

Mean deviation = [sum(deviations)]/n = 0; always, therefore useless!

Variance = [sum(deviations²)]/n = 8/3 = 2.666 inches-SQUARED = σ²
Therefore take the square root to get back to the original measurements:
SD = σ = √2.666 = 1.633 INCHES

Usually use n-1 to avoid a downwardly biased estimate [1 df has been consumed in the calculation of the mean; only n-1 values can freely vary]
Consequently, unbiased estimates:
Variance = 8/2 = 4 inches squared; SD = 2 inches
51
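A minimal numpy check of the slide's arithmetic for the three rainfall stations, showing both divisors:

```python
import numpy as np

y = np.array([5.0, 7.0, 9.0])    # rainfall at the three stations, inches
print(y.mean())                  # 7.0 inches
print(y - y.mean())              # deviations: -2, 0, 2 (always sum to zero)
print(np.var(y))                 # 2.666... inches squared (divisor n)
print(np.var(y, ddof=1))         # 4.0 inches squared (divisor n-1, unbiased)
print(np.std(y, ddof=1))         # 2.0 inches
```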
Description: calculating the covariance
Station   Y: Rainfall inches   Distance inland, km (X1)   Distance inland, metres (X2)
1         9.20                 0                          0
2         7.37                 1                          1000
3         6.16                 2                          2000
4         3.04                 3                          3000
5         2.37                 4                          4000

Cov(y, x) = [ Σ_{i=1..n} (y_i - ȳ)(x_i - x̄) ] / (n - 1)

The covariance is the extent to which two random variables vary together (co-vary).
NB the covariance is scale-dependent:
  Y and X1 (inches * kms): -4.5        Y and X2 (inches * metres): -4497.5

Therefore standardise to get a dimensionless number between -1 and +1: divide by the product of the standard deviations

Corr(y, x) = Cov(y, x) / (σ_y σ_x)

NB the correlation coefficient is scale-independent:
  Y and X1: -0.985                     Y and X2: -0.985

A covariance is an un-standardised correlation coefficient!
http://www.visualstatistics.net/Visual%20Statistics%20Multimedia/covariance.htm
52
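The scale-dependence of the covariance and scale-independence of the correlation can be checked directly with numpy on the slide's rainfall data:

```python
import numpy as np

y  = np.array([9.20, 7.37, 6.16, 3.04, 2.37])   # rainfall, inches
x1 = np.array([0, 1, 2, 3, 4], dtype=float)     # distance inland, km
x2 = x1 * 1000                                   # distance inland, metres

print(np.cov(y, x1)[0, 1])        # ~ -4.5     (inches * km)
print(np.cov(y, x2)[0, 1])        # ~ -4497.5  (inches * metres)
print(np.corrcoef(y, x1)[0, 1])   # ~ -0.985
print(np.corrcoef(y, x2)[0, 1])   # ~ -0.985 (unchanged by rescaling)
```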
Variances of the sum of two variables (more esoteric, but used later)
EG summing the responses for each question on a test to obtain the total test score

Var(T1 + T2) = Var(T1) + Var(T2) + 2Cov(T1, T2)

IE the sum of the two variances plus two times their covariance

Person   Test 1   Test 2    Total: T1 + T2
Bia      3        4         7
David    1        4         5
Tina     3        8         11
Alex     9        8         17
Var      12       5.33333   28
Cov(T1, T2) = 5.33333

Variance of the total = Var(T1) + Var(T2) + 2Cov(T1, T2) = 12 + 5.33 + 2(5.33) = 28
Knowing the variance and covariance of the original variables, we can calculate the variance of the sum
Theory II: Variance of a weighted sum of variables
EG if we wanted to weight Test 1 by double counting, and Test 2 by treble counting

Begin with double counting of Test 1: weight = 2

Person   Test 1   Test 1 * 2
Bia      1        2
David    2        4
Tina     4        8
Alex     5        10
Mean     3        6
Var      2.5      10

Var(αY) = α²Var(Y), where α is the weight
Mean: 3 * weight = 3 * 2 = 6
Variance: weight² * 2.5 = 2² * 2.5 = 10

Now do both weightings

Person   Test 1   Test 2   T1*2   T2*3   (T1*2) + (T2*3)
Bia      1        3        2      9      11
David    2        1        4      3      7
Tina     4        3        8      9      17
Alex     5        9        10     27     37
Mean     3        4        6      12     18
Var      2.5      9        10     81     133
Covar(T1, T2) = 3.5

Var(αY + βX) = α²Var(Y) + β²Var(X) + 2αβCov(Y, X); α and β are the weights

Variance[(T1*2) + (T2*3)] = 2²(2.5) + 3²(9) + 2*2*3*3.5 = 10 + 81 + 42 = 133

Knowing the weights, variance and covariance of the original variables, we can calculate the variance of the weighted sum
54
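A minimal numpy check of both identities, reusing the slide's test-score tables (note the first table uses the n-1 divisor, the weighted example the population divisor n):

```python
import numpy as np

# First table: Var(T1 + T2) = Var(T1) + Var(T2) + 2 Cov(T1, T2), n-1 divisor
t1 = np.array([3.0, 1.0, 3.0, 9.0])
t2 = np.array([4.0, 4.0, 8.0, 8.0])
lhs = np.var(t1 + t2, ddof=1)
rhs = np.var(t1, ddof=1) + np.var(t2, ddof=1) + 2 * np.cov(t1, t2, ddof=1)[0, 1]
print(lhs, rhs)                   # both 28.0

# Weighted sum, second table's data, population divisor n
t1 = np.array([1.0, 2.0, 4.0, 5.0])
t2 = np.array([3.0, 1.0, 3.0, 9.0])
a, b = 2, 3
lhs = np.var(a * t1 + b * t2)
rhs = a**2 * np.var(t1) + b**2 * np.var(t2) + 2 * a * b * np.cov(t1, t2, ddof=0)[0, 1]
print(lhs, rhs)                   # both 133.0
```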
Modelling the Mean and Variance
Verbal equation: the Observed Response = Mean + Deviation
Index i   Observed rainfall (y_i)   Estimated mean   Deviation from mean
1         5                         7                -2
2         7                         7                 0
3         9                         7                +2

Symbolically:
y_i = ȳ + e_i,    e_i ~ N(0, σ²_e)

Note: distributional assumption that the deviations come from a Normal distribution; this can and must be assessed
Note: the mean and variance are the sample estimates of the underlying parameters
55
Inference I: Bounds or Coverage Intervals
Make use of Normality assumptions to infer coverage intervals for island rainfall
EG what are the likely extremes on the island?
[Histogram of points on the island, with a theoretical Normal distribution overlaid: a mean of 7 and an SD of 2 give a complete description]

Using the properties of a Normal distribution, 95% of places will lie within +/- 1.96 SDs of the mean:
  2.5% wettest = 7 + (1.96 * 2) = 10.92 inches
  2.5% driest  = 7 - (1.96 * 2) = 3.08 inches
  95% coverage interval = 3 to 11 inches
(assumes the real world follows a Normal distribution)
56
Inference II: Confidence, precision, and the Standard Error
•All estimates have a degree of uncertainty or un-reliability around them
•The Standard Error assesses this unreliability or imprecision
We can estimate the SE given observed values (and assumptions)
SE of the mean = SD/ √n = 2/√3 = 1.1547
Note: reliability affected by how much evidence we have (the size of the
sample, n) and how variable rainfall is (SD)
Again making use of Normal theory, we can be 95% confident that
population mean lies between
Upper 95 Confidence Interval: 7 + (1.96*1.1547) = 9.26
Lower 95 Confidence interval: 7 – (1.96*1.1547) = 4.74
We can be 95% confident (if our assumptions are met) that the Island’s
mean rainfall is between 4¾ and 9¼ inches
The SE is the special name for the SD of the sampling distribution of an
estimate, ie a summary of the spread of an estimate in repeated sampling
57
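The two intervals on these slides can be reproduced in a few lines; note the coverage interval uses the SD, the confidence interval the standard error:

```python
import numpy as np

mean, sd, n = 7.0, 2.0, 3

# 95% coverage interval for individual places (uses the SD)
print(mean - 1.96 * sd, mean + 1.96 * sd)    # 3.08 to 10.92 inches

# 95% confidence interval for the island MEAN (uses the standard error)
se = sd / np.sqrt(n)                         # 1.1547
print(se)
print(mean - 1.96 * se, mean + 1.96 * se)    # 4.74 to 9.26 inches
```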
Inference III: Confidence Intervals around Coverage Bounds
Do not confuse the two!
Coverage Bound
- given estimated Mean and Variance; what is the 95 coverage of possible
observations
- 95 % of the observed values should lie between these bounds
-But like all estimates, they should have SE’s & confidence intervals!
95% coverage bound   SE of fit   95% CI around the coverage bound
3.08 (lower)         0.51        (0.89, 5.27)
10.92 (upper)        1.80        (3.17, 18.67)

Cannot tell a thing here due to the small n
Never done in practice, but making you aware of different
concepts!
58
Mean and Variance as represented in MLwiN
[MLwiN equations window: the model shown as a (single-level) population model and as an estimated model (RIGLS estimate, ie n-1), with the estimated mean, the estimated variance, and the estimated standard errors of the mean and of the variance]
59
Interpretation of the intercept and slope in simple regression
0
1
yi   0  1 x1i  (ei )
[ei ] ~ N (0, )
2
e
Intercept: Mean of y when xi is zero
Slope: Change in y when xi increases by 1 unit; how y changes with x
+ means if x is increased, so does y; also if x is decreased, then so does y
- means if x is decreased, then y increases; if x increases then y decreases
Station
Rainfall
Distance Inland
Kms
Distance Inland
Metres
1
9.20
0
0
2
7.37
1
1000
3
6.16
2
2000
4
3.04
3
3000
5
2.37
4
4000
Slope
Mean inches
of Rainfall at
the coast
60
The meaning of the fixed part of the model as averages
Fixed part:   β_0 + β_1 x_1i

The response as a function of x1: how does the mean rainfall change as you move inland?
IE make predictions for y by substituting values of x1 into the estimated equation:

ŷ = 10 - 2 * DistKm

Station   Rainfall   Distance inland (km)   Predicted mean rainfall
1         9.20       0                      10
2         7.37       1                      8
3         6.16       2                      6
4         3.04       3                      4
5         2.37       4                      2
61
The meaning of the random part of the model
Random part:   e_i = y_i - ŷ_i = y_i - (β_0 + β_1 x_1i),    [e_i] ~ N(0, σ²_e)

Residuals are deviations from the mean, summarised as a variance

Station   Rainfall   Distance inland (km)   Predicted mean rainfall   Deviation or residual
1         9.20       0                      10                        9.20 - 10 = -0.80
2         7.37       1                      8                         7.37 - 8  = -0.63
3         6.16       2                      6                         6.16 - 6  = +0.16
4         3.04       3                      4                         3.04 - 4  = -0.96
5         2.37       4                      2                         2.37 - 2  = +0.37

ŷ = 10 - 2 * DistKm,    [e_i] ~ N(0, σ²_e),    σ²_e = 0.343

The variance is NOT a function of x1: the same variability around the mean at the coast and far inland, ie homoscedasticity

95% coverage bounds = mean ± 1.96 * √variance = mean ± 1.96 * √0.343
Based on Normality assumptions
62
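A minimal numpy sketch of the least-squares fit to the five stations. The slides use the round-number line ŷ = 10 - 2*DistKm for illustration; the fitted line itself is close to it, and its residual variance matches the σ²_e = 0.343 quoted above.

```python
import numpy as np

y = np.array([9.20, 7.37, 6.16, 3.04, 2.37])   # rainfall, inches
x = np.array([0, 1, 2, 3, 4], dtype=float)     # distance inland, km

slope, intercept = np.polyfit(x, y, 1)
print(intercept, slope)                        # ~9.23 and ~-1.80

resid = y - (intercept + slope * x)
print((resid ** 2).sum() / (len(y) - 2))       # ~0.34, the estimate of sigma^2_e
```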
What’s going to be new?
Nothing, except
• (1) So far variance & covariance of observed or manifest
variables; now latent or unobserved variables: eg
measure people but posit & estimate place effects
• (2) Variances at different levels; between people within
n’hood, and also between n’hood
• (3) Variances are a function of variables; ie variance
changes with a predictor; eg greater n’hood differences
for large as compared to small properties; boys more
variable in achievement than girls
63
Session 2
• Varying relationships plots
• Tasks 4 and 5
• From graphs to equations I
the general structure of statistical models
the simple regression model
the random-intercepts two-level model
as portrayed in MLwiN
• Lab 2: MLwiN preliminaries (before modelling)
64
Varying relationships
“There are NO general laws in social science that are
constant over time and independent of the context
in which they are embedded”
Rein (quoted in King, 1976)
65
Varying relations plot
• Ignore the complexities of Session 1
Two level model
houses at level 1 nested within districts at level 2
• Single continuous response: price of a house
• Single continuous predictor: size = number of rooms and this
variable has been centred around average size of 5
Rooms   1    2    3    4    5    6    7    8
x1i     -4   -3   -2   -1   0    1    2    3
66
67
68
Task 4
• Draw a single diagram relating a
response (fat consumption) to a
continuous predictor (age) centred around
its national mean with the following
characteristics
no national relationship;
substantial differences between 6 places
in terms of the elderly, but less marked for
the middle-aged and least for the young
69
TASK 5
• Draw three separate graphs showing the following relations
between response of house prices and a 3-category predictor
(detached, semi-detached, terrace)
• a) differences between 3 categories of housing but no
differences between places
• b) differences between 3 categories of housing and same
differences between places (random intercepts model)
• c) differences between 3 categories of housing and different
differences between places (random intercepts & slopes)
• HINT: use different colours or line styles to show different
places
70
General Structure for Statistical models
• Response = general trend + fluctuations
• Response = average + variability around the average
• Response = systematic component + stochastic element
• Response = fixed + random

• Specific case: the single-level simple regression model

Response      =  Systematic part                                       +  Random part
House price   =  Price of average-sized house + Cost of extra room     +  House residual variation
              =  Intercept                     + Slope                 +  Residual
71
Simple regression model
1
0
y
x1
is the outcome, price of a house
is the predictor, number of rooms,
which we shall deviate around its mean, 5
Rooms
1
2
3
4
5
6
7
8
x1
-4
-3
-2
-1
0
1
2
3
72
Simple regression model
y_i = β_0 + β_1 x_1i + (e_i)

y_i   is the price of house i
x1    is the individual predictor variable
β_0   is the intercept: meaning?
β_1   is the fixed slope term: meaning?
e_i   is the residual/random term, one for every house

Summarizing the random term: ASSUME IID
  Mean of the random term is zero
  Constant variability (homoscedasticity)
  No patterning of the residuals (ie they are independent)

[e_i] ~ N(0, σ²_e)
σ²_e  is the variation between houses conditional on size
73
Visualising IID assumptions I
Both assumptions apply to the residuals (not the responses)
1) Independently distributed: ie no spatial or temporal autocorrelation; patternless residuals, any 'pattern' has been captured in the fixed part of the model; a positive residual does not increase the chance of another positive residual nearby
positive residual (fixed part
under-prediction)
negative residual (fixed part
over-prediction)
Left map: strongly patterned; eg western neighbourhoods are more
expensive or wetter than predicted by model
Right map: no strong patterning; this is what the model presumes
74
Visualising IID assumptions II
2) Identically distributed: ie each residual comes from a distribution with the same variance; homoscedasticity
Eg Big houses are no more variable in their price than small ones
Eg The same variability of rainfall at the coast as inland
y
Mean relationship
ei has a constant variance for
each value of xi
x
Distribution of y-values at each value
of x is Normal with the same variance
75
Two-level random-intercept multilevel model
Prices for house in a district is a function of the size of the
house and a differential effect associated with its district
Basic concept: of multilevel model:
Develop models at each level and link them in a combined
model.
Micro-model: specifying the within-district part of model; ie house
level
Macro-model: specifying the between-district part of the model
Note:
the allowed-to-vary parameters of within-district model
become the responses in the between-district model
76
Micro Model (Within-district, between-house)
Price of a house in a district
=
Price of an averaged sized house in each district
+
Cost of extra room everywhere
+
Differential for each house in a district
77
Macro Model 1 (Between-district)
Price of an averaged sized house in each district
=
Price of an averaged sized house (across all districts)
+
Differential (discount /premium) on the price of an
averaged sized house
for each district
78
Combined model
Price of a house in a district
=
Price of an averaged sized house
+
Cost of extra room
+
Differential on the price of an
averaged sized house for each district
+
Residual differential for each house
in a district
79
Random intercepts model
[Plot: citywide line ŷ = β_0 + β_1 x_1, with a premium (+u_0j) or discount (-u_0j) shift for each district]

Differential shift for each district j: index the intercept

Micro-model:   y_ij = β_0j + β_1 x_1ij + e_ij

Macro-model (index the parameter as a response):   β_0j = β_0 + u_0j

Price of an average-sized house in district j = citywide price + differential for district j

Substitute the macro model into the micro model…
80
Random intercepts COMBINED model
Substituting the macro model into the micro model yields

y_ij = (β_0 + u_0j) + β_1 x_1ij + e_ij

Grouping the random parameters in brackets:

y_ij = β_0 + β_1 x_1ij + (u_0j + e_ij)

• Fixed part:              β_0, β_1
• Random part (level 2):   [u_0j] ~ N(0, σ²_u0)
• Random part (level 1):   [e_ij] ~ N(0, σ²_e)
• District and house differentials are independent:   Cov[u_0j, e_ij] = 0
81
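The course fits this model in MLwiN; purely as an illustrative sketch, the same random-intercepts specification can be written in Python with statsmodels (the file name houses.csv and the columns price, size, district are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical two-level data: houses (level 1) nested in districts (level 2)
df = pd.read_csv("houses.csv")          # assumed columns: price, size, district
df["size_c"] = df["size"] - 5           # centre size around 5 rooms, as in the slides

# Random-intercepts model: y_ij = b0 + b1*size_c + (u_0j + e_ij)
# groups= defines the level-2 classification (districts)
ri_model = smf.mixedlm("price ~ size_c", data=df, groups=df["district"])
ri_fit = ri_model.fit(reml=True)        # REML estimation (analogous to RIGLS)
print(ri_fit.summary())                 # fixed part; "Group Var" = sigma^2_u0; scale = sigma^2_e
```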
The meaning of the random terms
• Level 2: between districts
  [u_0j] ~ N(0, σ²_u0)
  σ²_u0 is the between-district variance conditional on size

• Level 1: within districts, between houses
  [e_ij] ~ N(0, σ²_e)
  σ²_e is the within-district, between-house variance conditional on size
82
Variants on the same model
• Combined model: what variables are β_0 and β_1 associated with?

y_ij = β_0 + β_1 x_1ij + (u_0j + e_ij)

• Fully explicit combined model:

y_ij = β_0 x_0ij + β_1 x_1ij + (u_0j x_0ij + e_0ij x_0ij)

x_0ij is the constant: a set of 1's
• In MLwiN the differentials at each level are attached to the constant x_0ij
83
Chapter 3: Fitting a RI two-level
model in MLwiN (pp.41)
• Model 1: The null random intercepts model
y_ij = β_0 x_0ij + (u_0j x_0ij + e_0ij x_0ij)

[Plot: overall mean ŷ = β_0 with district departures ±u_0j]
84
Chapter 3 Fitting a RI two-level
model in MLwiN (pp.41)
• Model 1: The null random intercepts model
• Standard sequence:
- preliminaries: input, sorting, create constant
- specification (response, structures,
predictors, fixed and random)
- estimation of model & initial interpretation
- estimate residuals
- graphics to aid interpretation
- re-specify model
85
Laboratory 2
• Complete chapter 2: Getting started in
MLwiN: preliminaries before modelling
• If completed; visit the Bristol Centre for
Multilevel Modelling
86
Resources
Centre for Multilevel Modelling
http://www.cmm.bris.ac.uk
Provides access to general
information about multilevel
modelling and MlwiN.
Lemma training repository
http://www.ncrm.ac.uk/nodes/lemma/about.php
Email discussion group:
www.jiscmail.ac.uk/multilevel/
87
http://www.cmm.bristol.ac.uk/
88
http://www.cmm.bristol.ac.uk/learning-training/course.shtml
89
http://www.cmm.bristol.ac.uk/links/index.shtml
90
http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/index.shtml
91
The MLwiN manuals are another
training resource
http://www.cmm.bristol.ac.uk/MLwiN/download/manuals.shtml
92
Session 3
• Review of the random intercepts model
• Types of question tackled by Multilevel models
• Residuals in multilevel models
- concepts and estimation
• Fitting a two-level model to house-price data
- + a single predictor
• Context, composition and valued-added analyses
• Laboratory 3: an analysis of the JSP data
- pupils within school
- random intercepts null model
- then add in single predictor
93
2-level random-intercept multilevel model:
review
• Combined model in full:

y_ij = β_0 x_0ij + β_1 x_1ij + (u_0j x_0ij + e_0ij x_0ij)

• Fixed part:              β_0, β_1
• Random part (level 2):   σ²_u0
• Random part (level 1):   σ²_e0
94
Type of questions tackled by multilevel
modelling I
• 2-level model: current attainment given prior attainment
of pupils(1) in schools(2)
• NB assuming a random sample of pupils from a random
samples of schools
• Do Boys make greater progress than Girls (F)
• Are boys more or less variable in their progress than
girls?(R)
• What is the between-school variation in progress? (R)
• Is School X different from other schools in the sample in
its effect? (F)
• continued…….
95
Type of questions tackled by multilevel
modelling II
• Are schools more variable in their progress for pupils with
low prior attainment? (R)
• Does the gender gap vary across schools? (R)
• Do pupils make more progress in denominational
schools?(F)
• Are pupils in denominational schools less variable in their
progress? (R)
• Do girls make greater progress in denominational
schools? (F) (cross-level interaction)
96
Residuals in a two-level random-intercept model: simplified to three groups (Red, Blue, Other districts)
[Plot: price against size]
97
Residuals in a single-level model
[Plot: single line ŷ = β_0 + β_1 x_1; each house has a residual e_i from the line; price against size]
98
Random intercepts model
[Plot: citywide line ŷ = β_0 + β_1 x_1 and parallel district lines ŷ = β_0j + β_1 x_1; price against size]
99
Residuals in a two-level model – Level 2
[Plot: level-2 residuals ±u_0j are the district departures from the citywide line; price against size]
100
Residuals in a two-level model – Level 1
[Plot: level-1 residuals e_ij are the house departures from their district's line; price against size]
101
Two ways of estimating a set of group effects
Method 1: treat each group (eg each school) as a separate entity; ie
fixed classification, put a dummy in the model for each and every
school
[Plot: price against size, a separate line fitted to each group]

• Equivalent to fitting a separate line to each group
• In terms of estimates of coefficients, get the same results as 'select if'
• Equivalent to ANOVA/ANCOVA; ie a fixed-effects analysis
• No general line; just many separate lines (and all df at the higher level used up)
Two ways of estimating a set of group effects
Method 2: treat each group (eg each school) as belonging to a
distribution; ie a random classification, group differences assumed
to come from a distribution in the random part of the model
[u_0j] ~ N(0, σ²_u0)

[Plot: price against size, a general line with group departures]

• Estimate a general line (fixed part) and allow it to vary from group to group
• IE a random-effects analysis
Will the fixed results differ from the random? Potentially YES!
If the group line is imprecisely estimated, it will be shrunk towards
the general line; if precisely estimated (eg many pupils in a school):
little difference; random effects is protective, conservative approach
Estimating Multilevel Residuals
In single level models:
Residuals can be estimated simply by subtracting the individual
predictions from the observed values.
In multilevel models: more complex!
Two stage procedure:
1: estimate the variance
2: estimate the specific random effects
Raw residual for level-1 units (departures from the general line):
r_ij = y_ij - ŷ_ij

Raw residual for level-2 units: r_j is the mean of the r_ij in unit j
(IE a separate analysis — see next slide)

Then the estimated, precision-weighted residual for level-2 units (the random effect) is

û_0j = [ σ²_u0 / ( σ²_u0 + σ²_e0 / n_j ) ] * r_j
104
Raw Residuals
[Plot: general line ŷ = β_0 + β_1 x_1 across houses and districts; r_i is the raw residual for a level-1 unit, made up of house and district differentials; r_j is the mean residual for district j, eg Red district = ?, Blue = ?]

Method 1: a separate (fixed) estimate of each district difference takes no account of belonging to a distribution; it therefore has to be shrunk to get an estimate of the district random effect, ie Method 2
105
AKA Posterior residuals
û_0j = [ σ²_u0 / ( σ²_u0 + σ²_e0 / n_j ) ] * r_j

where
  σ²_u0  is the between-district variation
  σ²_e0  is the within-district, between-house variation
  n_j    is the number of houses in that district
  r_j    is the raw residual for that district
106
AKA shrunken residuals
û_0j = [ σ²_u0 / ( σ²_u0 + σ²_e0 / n_j ) ] * r_j

The multiplier in the above formula is always less than or equal to 1, so the estimated residual û_0j is less than or equal in magnitude to the raw residual r_j.

The raw residual is multiplied by a shrinkage factor, and the estimated (or posterior) residual is therefore also sometimes called a shrunken residual.

Shrinkage towards zero is greatest when the between-district variance is relatively small and when the raw residual would be unreliably estimated; Session 10 provides a comparison.
16 minute video on multilevel residuals
http://www.cmm.bristol.ac.uk/learning-training/videos/RP/Residuals/Residuals.html
107
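A minimal numpy sketch of the shrinkage factor above; the variance values are illustrative only (chosen close to the single-sex schools example earlier), and show how small units are shrunk most:

```python
import numpy as np

# u_hat_0j = k_j * r_j, where k_j = sigma2_u0 / (sigma2_u0 + sigma2_e0 / n_j)
sigma2_u0 = 0.15                      # between-group variance (assumed)
sigma2_e0 = 0.85                      # within-group variance (assumed)
n_j = np.array([2, 10, 50])           # units sampled per group
r_j = np.array([0.40, 0.40, 0.40])    # the same raw group-mean residual

k_j = sigma2_u0 / (sigma2_u0 + sigma2_e0 / n_j)   # shrinkage factor, always <= 1
print(k_j)            # ~0.26, ~0.64, ~0.90: small groups shrunk most towards 0
print(k_j * r_j)      # the shrunken (posterior) residuals
```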
Residuals and variances
û_0j — the estimated residual
• AKA the random effect at level 2
• AKA the allowed-to-vary differential effect
• These have been shrunk towards the general line
• The 'difference a school makes'
• Positive and negative values

σ̂²_u0 — the variance at level 2
• Summarizes the differences
• Not the variance of the shrunken residuals û_0j, but the variance of the raw mean residuals r_j
• Not the estimated between-group variance of the sample, but the estimated between-group variance in the population… more later, with a comparison
Chapter 3: Fitting a RI two-level
model in MLwiN (pp.39)
• Model 1: The null random intercepts model
y_ij = β_0 x_0ij + (u_0j x_0ij + e_0ij x_0ij)

[Plot: overall mean ŷ = β_0 with district departures ±u_0j]
109
Chapter 3 Fitting a RI two-level
model in MLwiN (pp.39)
• Model 1: The null random intercepts model
• Standard sequence:
  - preliminaries: input, sorting, create constant
  - specification (response, structures, predictors, fixed and random)
  - estimation of model & initial interpretation
  - estimate residuals (page 48)
  - graphics to aid interpretation
  - re-specify model
110
Moving from a null model to a model with a predictor
What can happen as we include a level 1 predictor?
What might happen to the between school and between pupil within school
variances when we condition on prior ability in a value added study?
[Plots of attainment against prior ability: the null model ŷ = β_0 versus the model ŷ = β_0 + β_1 x_1]

Both the between-school and the between-pupil within-school variation could be reduced when we model attainment as a function of prior ability.
Apparent contextual effects are in fact due to different composition in terms of intake.
111
Moving from a null model to a model with a predictor
What can happen as we include a level 1 predictor?
What might happen to between neighbourhood variation when we condition
on size of property?
The between neighbourhood variation could increase when size included;
Contextual effects are being masked; expensive neighbourhood have small
houses; cheap neighbourhood have large houses
112
Laboratory 3: an analysis of the JSP data
Fitting you own models! The data are from the Junior School
project, a longitudinal study of 2000 pupils (level 1) who entered
junior school aged 7 and left in 1984 aged 11. The JSP pupils
attended 48 primary schools (level 2) selected randomly from the
636 such schools of the Inner London Education Authority.
The response is a score for mathematics at age 9 (Maths5), that is
year 5 of the National Curriculum and the single predictor is
Math3, that is year 3 of the National Curriculum. At this stage we
want you to fit two models
1  random intercepts null model with Maths5 as the response and only a constant term in the model; the model effectively analyses attainment and the main question is consequently "is there a between-school difference in attainment?"

Continued…
113
Laboratory: an analysis of the JSP data
2  random intercepts model with Maths5 as the response and two predictors, the constant and Math3. We would like you to centre Math3 around the value of 25. This analysis effectively models progress, that is attainment given prior achievement. The aim of this analysis is to assess the value-added that schools are making to progress in Mathematics. It is important to realise that exactly the same test was used on both occasions for the Maths score.
Hint Read in the data and name the columns
c1 ‘school’ c2 ‘pupil’ c3 ‘math3’ c4 ‘math5’ c5 ‘Male’
Male is a binary predictor where 1 represents male; ignore
for now
NB Strongly advised to save each model as you go along with a
‘meaningful’ file name eg jspmod1
114
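The lab itself is done in MLwiN; purely as an illustration, the two requested models can be sketched in Python with statsmodels (the file name jsp.csv is hypothetical, the column names follow the hint above):

```python
import pandas as pd
import statsmodels.formula.api as smf

jsp = pd.read_csv("jsp.csv")                  # assumed columns: school, pupil, math3, math5, male
jsp["math3_25"] = jsp["math3"] - 25           # centre the prior score around 25, as asked

# Model 1: random-intercepts null model (attainment)
m1 = smf.mixedlm("math5 ~ 1", data=jsp, groups=jsp["school"]).fit()
print(m1.summary())                           # "Group Var" = between-school variance

# Model 2: add centred Math3 (progress / value-added)
m2 = smf.mixedlm("math5 ~ math3_25", data=jsp, groups=jsp["school"]).fit()
print(m2.summary())
```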
Session 4
• Review Lab 3
• Random slopes AKA random-coefficient models
• From graphs to equations
• Characteristics of a random slopes model
• Fitting random slopes models in MLwiN
- allowing price-size relationships to vary over districts
- hybrid model with districts in the fixed and random part
• Laboratory 4: an analysis of the JSP data
115
Random slopes & Random Intercepts: AKA
random-coefficient models
116
Two-level model
In words:
Variation in prices for houses in neighborhoods is a function
of the size of the house; an effect associated with
neighbourhoods; and a differential effect for size in relation to
neighbourhoods
Basic concept (as before)
Develop models at each level and link them in a combined
model.
Micro-model: specifying the within-district part of model
Macro-model: specifying the between-district part of the model
Note: the allowed-to-vary parameters of the within-district model
become the responses in the between-district model
117
Micro Model (Within-district, between-house)
Price of a house in a district
=
Price of an averaged sized house in each district
+
Cost of extra room in each district
+
Residual differential for each house in a district
118
Macro Model 1 (Between-district)
Price of an averaged sized house in each district
=
Price of an averaged sized house (across all districts)
+
Differential for averaged sized house
for each district
119
Macro Model 2 (Between-district)
Cost of an extra room in each district
=
Cost of an extra room (across all districts)
+
Differential in the cost of an extra room
for each district
120
Combined model
Price of a house in a district
=
Price of an averaged sized
house (generally)
+
Cost of extra room (generally)
+
Differential for the price of an
averaged sized house for each district
+
Differential in price of an
extra room for each district
+
Differential for
each house in a district
121
Random slopes model
Micro-model:
y_ij = β_0j x_0ij + β_1j x_1ij + e_0ij x_0ij

Note: index the intercept and the slope, associated with the constant and the number of rooms respectively

Macro-model (random intercepts):
β_0j = β_0 + u_0j

Macro-model (random slopes):
β_1j = β_1 + u_1j

Slope for district j = citywide slope + differential slope for district j

Substitute the macro models into the micro model…
122
Random slopes model
Substituting the macro models into the micro model yields

y_ij = (β_0 + u_0j) x_0ij + (β_1 + u_1j) x_1ij + e_0ij x_0ij

Multiplying the parameters by the associated variables and grouping into fixed and random parameters yields the combined model:

y_ij = β_0 x_0ij + β_1 x_1ij + (u_0j x_0ij + u_1j x_1ij + e_0ij x_0ij)
123
Characteristics of random slopes model
y_ij = β_0 x_0ij + β_1 x_1ij + (u_0j x_0ij + u_1j x_1ij + e_0ij x_0ij)

• Fixed part:   β_0, β_1

• Random part (level 2):
  [u_0j, u_1j]' ~ N(0, Ω_u),   Ω_u = [ σ²_u0, σ_u0u1 ; σ_u0u1, σ²_u1 ]

• Random part (level 1):   [e_0ij] ~ N(0, σ²_e0)

Estimated correlation between the two sets of random effects = the covariance divided by the product of the square roots of the variances:

corr(u_0j, u_1j) = σ_u0u1 / (σ_u0 σ_u1)
124
[Panels plotting attainment against pre-test, with the corresponding models:]

Single-level model:
  y_i = β_0 + β_1 x_i + e_i,    e_i ~ N(0, σ²_e)

Random-intercepts (two-level) model:
  y_ij = β_0j + β_1 x_ij + e_ij
  β_0j = β_0 + u_0j
  u_0j ~ N(0, σ²_u0),    e_ij ~ N(0, σ²_e)

Random-intercepts and random-slopes model:
  y_ij = β_0j + β_1j x_ij + e_ij
  β_0j = β_0 + u_0j
  β_1j = β_1 + u_1j
  [u_0j, u_1j]' ~ N(0, Ω_u),   Ω_u = [ σ²_u0, σ_u01 ; σ_u01, σ²_u1 ]
  e_ij ~ N(0, σ²_e)
125
Interpreting varying relationship plot through
mean and variance-covariance
        Intercepts (terms associated with the constant x0)   Slopes (terms associated with the predictor x1)   Intercept/slope (x0·x1)
Graph   Mean β_0      Variance σ²_u0                          Mean β_1      Variance σ²_u1                      Covariance σ_u0u1
A       +             0                                       +             0                                   undefined
B       +             +                                       +             0                                   undefined
C       +             +                                       +             +                                   +
D       +             +                                       +             +                                   -
E       +             +                                       +             +                                   0
F       +             +                                       0             +                                   +
126
Equation in MLwiN
127
Task 6
• Specify a micro model relating house prices (y) to number of
rooms (x1) terraced or not (x2) and garage or not (x3). Specify
macro models in which:
1 there are differential overall costs for living in different
districts of a city;
2 the cost of an extra room varies from place to place
3 the cost of a non-terraced house varies from place to place.
• Write the combined model, and specify the variances/covariances at each level. What is the implicit assumption about
how the cost of a garage varies over the city?
128
Fitting Random-Slopes Model
in MLwiN
(p.59)
Model 3
129
Laboratory 4: an analysis of the JSP
data
• Return to the JSP data and fit a new model in which the relation between Maths5 and Math3 is allowed to vary between schools.
• Is there a difference between schools for the average
pupil on entry?
• For what sort of pupil do schools make the most
difference?
• NB do not forget to save the resultant model as
something useful like jspmodel3!
• Work through Chapter 4 on your own applied to the Jsp
data
130
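Again the lab uses MLwiN; as an illustrative sketch only, the random-slopes version of the JSP model can be written in statsmodels by adding a random coefficient for the centred prior score (file and column names as in the earlier hypothetical sketch):

```python
import pandas as pd
import statsmodels.formula.api as smf

jsp = pd.read_csv("jsp.csv")                  # assumed columns: school, pupil, math3, math5
jsp["math3_25"] = jsp["math3"] - 25

# Random intercepts AND random slopes: the Maths5-Math3 relation varies by school.
# re_formula adds a school-level random coefficient for math3_25.
m3 = smf.mixedlm("math5 ~ math3_25", data=jsp,
                 groups=jsp["school"], re_formula="~math3_25").fit()
print(m3.summary())
print(m3.cov_re)      # 2x2 school-level covariance matrix: var(u0), cov(u0,u1), var(u1)
```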
Session 5
• Review Lab 4
• Dependent observations
• Modeling variance functions: a general approach to
characterize multilevel models
• Heterogeneity at level-2
• Calculating and displaying intra-unit correlation and variance
functions in MLwiN
• Laboratory 5: an analysis of the JSP data continued – variance
functions
131
Modeling correlated /dependent data
“Everything is related to everything else, but near
things are more related than distant things.”
(Tobler 1970)
Yet, conventional methods and software assume
independence (IID)
132
Same thing, different names
• Auto-correlation = “self” correlation
- temporal autocorrelation
-spatial autocorrelation
• Clustered data
• Intra-class correlations
• Intra-unit correlations
• Variance partitioning coefficients
133
Characteristics of a dependent data
• It is dependent according to some grouping; eg
households tend to be alike; measurements on
occasions within people tend to very alike
• Consequently; not as much level-1 information as
appears; not as many degrees of freedom as we
think
• If ignored leads to overstatement of significance
(mis-estimate precision; standard errors too small;
Type I error – finding relationships where none
exists)
• Example: ρ = 0.01; m = 4, n = 100;
  nominal α = 0.05;
  true α = 0.17
134
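A minimal sketch of the usual design-effect approximation (this formula is not given explicitly in the slides, but it underlies the point being made): with intra-unit correlation rho and cluster size n, naive standard errors are too small by roughly the square root of DEFF = 1 + (n - 1) * rho.

```python
rho = 0.01
n = 100                  # observations per cluster, as in the slide's example
deff = 1 + (n - 1) * rho
print(deff)              # ~1.99: variances of cluster-based means about twice the SRS value
print(deff ** 0.5)       # ~1.41: SEs understated by ~40% if the clustering is ignored
```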
Sources that generate clustered data
• Survey design
– Multistage sampling (primary sampling unit
induced clustering; known as design effects;
interviewer induced clustering);
• Population induced clustering (social processes!)
– Schools, Clinics, Hospital settings, Workplaces,
‘Neighborhoods’, Administrative areas, Social
groupings)
135
So, what do we do about it?
• Ignore it (increasingly getting less common in social
and health research)
• ‘Adjust’ or ‘Correct’ it (probably getting most
common)
eg Sudaan
http://www.rti.org/sudaan
fits marginal or population-averaged models using
generalized estimating equations (GEE)
• Expect and model it explicitly (increasingly
becoming common) – Multilevel Modeling Approach
136
Problem is worst for higher-level variables
If we fit a single-level model we get incorrect uncertainty intervals, leading to incorrect inferences: the single-level model ignores the clustering effect of schools.

[Plot: estimates and confidence intervals for mixed, boys' and girls' schools under the single-level and multilevel models]

The multilevel model automatically corrects the SEs and CIs.
137
ML approach to modeling clustered data
In the two-level random intercepts case
yij  0 x0ij  1x1ij  (u0 j x0ij  e0ij x0ij )
it is simply

ρ = σ²_u0 / (σ²_u0 + σ²_e0)

which is the between-district variation divided by the between-district plus within-district (between-house), ie total, variation.

What happens when ρ = 1? …
138
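As an illustration, the same quantity can be pulled out of a fitted random-intercepts model; this reuses the hypothetical m1 null-model fit from the earlier JSP sketch:

```python
import numpy as np

# rho = sigma^2_u0 / (sigma^2_u0 + sigma^2_e0), from the fitted variance components
sigma2_u0 = float(np.asarray(m1.cov_re)[0, 0])   # between-school (level-2) variance
sigma2_e0 = float(m1.scale)                      # between-pupil within-school variance
rho = sigma2_u0 / (sigma2_u0 + sigma2_e0)
print(rho)
```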
y_ij = 5 + (u_0j x_0ij + e_0ij x_0ij)

y_ij   i    j    u_0j   e_0ij
2      1    1    -3     0
2      2    1    -3     0
…      …    …    …      …
2      10   1    -3     0
8      1    2    +3     0
8      2    2    +3     0
…      …    …    …      …
8      15   2    +3     0
4      1    3    -1     0
4      2    3    -1     0
…      …    …    …      …
4      6    3    -1     0
6      1    4    +1     0
6      2    4    +1     0
…      …    …    …      …
6      25   4    +1     0
139
Meaning of ρ

• When ρ = 0
  - no dependency, all level-1 information, no differences between groups
• When ρ = 1
  - complete dependency, maximum difference between groupings, equivalent to complete similarity within groupings
9 minute video on dependency in multilevel models
http://www.cmm.bristol.ac.uk/learningtraining/videos/RP/dependency/dependency.html
140
Notes on modeling clustering
In the single-level case

y_i = β_0 x_0i + β_1 x_1i + (e_0i x_0i)

• Assume that, given x, the residuals are patternless: conditionally no dependence
• If e_0i is independent (not patterned) given x, then the SEs, t tests etc are correctly estimated in the fixed part. If not, there is a problem.

In the multilevel case

y_ij = β_0 x_0ij + β_1 x_1ij + (u_0j x_0ij + e_0ij x_0ij)

• Assume e_0ij is independent, given x and u_0j
• In other words, u_0j is the district "similarity" effect
• Consequently the SEs, t tests etc are now corrected for this similarity within districts (the DISTRICT premium/discount)
Correlation structure of single level model
y_i = β_0 x_0i + β_1 x_1i + (e_0i x_0i)

Pupils within schools: FULL correlation structure (the single-level model assumes the identity matrix)

S  P   1,1  1,2  1,3  2,1  2,2  2,3  2,4  3,1  3,2  3,3
1  1    1    0    0    0    0    0    0    0    0    0
1  2    0    1    0    0    0    0    0    0    0    0
1  3    0    0    1    0    0    0    0    0    0    0
2  1    0    0    0    1    0    0    0    0    0    0
2  2    0    0    0    0    1    0    0    0    0    0
2  3    0    0    0    0    0    1    0    0    0    0
2  4    0    0    0    0    0    0    1    0    0    0
3  1    0    0    0    0    0    0    0    1    0    0
3  2    0    0    0    0    0    0    0    0    1    0
3  3    0    0    0    0    0    0    0    0    0    1
142
yij  0 x0ij  1x1ij  (u0 j x0ij  e0ij x0ij )
Correlation structure of two level model
S
1
1
1
2
2
2
2
3
3
3
P
1
2
3
1
2
3
4
1
2
3
1
1
1


0
0
0
0
0
0
0
1
2

1

0
0
0
0
0
0
0
1
3


1
0
0
0
0
0
0
0
2
1
0
0
0
1



0
0
0
2
2
0
0
0

1


0
0
0
2
3
0
0
0


1

0
0
0
2
4
0
0
0



1
0
0
0
3
1
0
0
0
0
0
0
0
1


3
2
0
0
0
0
0
0
0

1

3
3
0
0
0
0
0
0
0


1
143
Correlation structure of cross-classified model (pupils within
schools & neighbourhoods); ρ = within-school correlation,
ρ2 = within-neighbourhood correlation (pupils in different schools who
share a neighbourhood)

S:      1   1   1   2   2   2   2   3   3   3
P:      1   2   3   1   2   3   4   1   2   3
S1 P1   1   ρ   ρ   0   0   0   ρ2  0   0   0
S1 P2   ρ   1   ρ   0   0   0   0   0   0   0
S1 P3   ρ   ρ   1   0   0   0   0   0   ρ2  0
S2 P1   0   0   0   1   ρ   ρ   ρ   0   0   0
S2 P2   0   0   0   ρ   1   ρ   ρ   0   0   0
S2 P3   0   0   0   ρ   ρ   1   ρ   0   0   0
S2 P4   ρ2  0   0   ρ   ρ   ρ   1   0   0   0
S3 P1   0   0   0   0   0   0   0   1   ρ   ρ
S3 P2   0   0   ρ2  0   0   0   0   ρ   1   ρ
S3 P3   0   0   0   0   0   0   0   ρ   ρ   1
144
School styles study
• Bennett's (1976) "teaching styles" study used a single-level model
• Test scores for English, Reading and Mathematics at age 11 were
reported to be significantly influenced by teaching style
• The PM called for a return to 'traditional' or formal methods
• Re-analysis: Aitkin, M., Anderson, D.A. and Hinde, J.P. (1981)
Statistical modelling of data on teaching styles (with Discussion). J.
Roy. Statist. Soc. A 144, 419-461.
• Used proto-multilevel models to handle the dependence of pupils
within classes
• Spencer, N.H. (2002) "Combining modelling strategies to analyse
teaching styles data". Quality and Quantity, 36, 2, pp 113-127.
145
ML approach to modeling clustered data
In the two-level random-intercepts-and-slopes case

$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$

it is not so simple! The VPC is now a function of $x_{1ij}$:

$\mathrm{VPC} = \dfrac{\sigma^2_{u0} + 2\sigma_{u0u1} x_{1ij} + \sigma^2_{u1} x_{1ij}^2}{\sigma^2_{u0} + 2\sigma_{u0u1} x_{1ij} + \sigma^2_{u1} x_{1ij}^2 + \sigma^2_{e0}}$

So what is this beast all about?
9 minute video on correlation and covariance matrices
http://www.cmm.bristol.ac.uk/learning-training/videos/RP/corr-covar/corr-covar.html
146
Modeling heterogeneity through
variance functions
• In multilevel models, we have sample of houses, AND a sample
of districts.
• We need to infer to the general population of houses and the
general population of districts
• What does the sample suggest about the population from
which it is drawn?
• We need to do this not simply for ‘population’ averages, but
also ‘population heterogeneities’. Or VARIANCE FUNCTIONS
• Modeling variation as a function of observed
predictor variables: now level 2; later level 1
147
RI model: Constant variance function
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + e_{0ij} x_{0ij})$
• Total variance at level 2, based on the random variable $u_{0j}$:
  $\mathrm{Var}(u_{0j} x_{0ij}) = \sigma^2_{u0} x_{0ij}$
• Total variance at level 1, based on the random variable $e_{0ij}$:
  $\mathrm{Var}(e_{0ij} x_{0ij}) = \sigma^2_{e0} x_{0ij}$
• Normality assumptions
  – 95% of district lines will lie within +/- 2 standard deviations of the
    overall line
  – The population of districts is likely to vary within the tramlines
– Population of districts likely to vary within the tramlines
148
Between-district heterogeneity as a CONSTANT function of size
$\mathrm{Var}(u_{0j} x_{0ij}) = \sigma^2_{u0} x_{0ij}$
[Two plots against x1 (size): the constant level-2 variance function, and
the population bounds for districts around the overall line]
149
Random Intercepts and Random Slopes Model:
Quadratic variance function at level-2
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$
• Total variance at level 2 is the sum of TWO random terms, $u_{0j}$ and $u_{1j}$:
  $\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
• Total variance at level 1 is based on the random variable $e_{0ij}$:
  $\mathrm{Var}(e_{0ij} x_{0ij}) = \sigma^2_{e0} x_{0ij}$
• Clearly (if random slopes are required) this results in a quadratic
  function at level 2
150
Between-district heterogeneity as a QUADRATIC function of size
$\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
[Two plots against x1 (size): the quadratic level-2 variance function, and
the population bounds for districts]
151
Between-district heterogeneity as a QUADRATIC function of size
$\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
[Two plots against x1 (size): the quadratic level-2 variance function, and
the population bounds for districts]
152
Constant, quadratic and linear variance functions
Constant:
$\mathrm{Var}(u_{0j} x_{0ij}) = \sigma^2_{u0} x_{0ij} = A$
Quadratic:
$\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} + \sigma^2_{u1} x_{1ij}^2 = A + B + C$
Linear:
$\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} = A + B$
• Linear is a simplification of the quadratic variance function in which
  the C element is zero
• Perfectly plausible as a variance function, and more parsimonious
• And can be fitted by a 2-step procedure: estimate the variance, then
  estimate the level-2 residuals – pictures ……………
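To make the three shapes concrete, here is a small Python sketch (my own illustration, with made-up variance estimates rather than course results) that evaluates the constant, linear and quadratic level-2 variance functions across a range of the size predictor x1:

```python
import numpy as np

# hypothetical level-2 estimates (not from the course data)
sigma2_u0, sigma_u0u1, sigma2_u1 = 30.0, 4.0, 1.5
x0 = 1.0                       # the constant
x1 = np.linspace(-3, 3, 7)     # size, centred around 5 rooms

constant  = sigma2_u0 * x0 * np.ones_like(x1)                 # A
linear    = sigma2_u0 * x0**2 + 2 * sigma_u0u1 * x0 * x1      # A + B
quadratic = linear + sigma2_u1 * x1**2                        # A + B + C

for xi, c, l, q in zip(x1, constant, linear, quadratic):
    print(f"x1={xi:+.1f}  constant={c:5.1f}  linear={l:5.1f}  quadratic={q:5.1f}")
```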
153
Between-district heterogeneity as a LINEAR function of size
$\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij}$
[Two plots against x1 (size): the linear level-2 variance function, and the
population bounds for districts]
154
Between-district heterogeneity as a LINEAR function of size
$\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij}$
[Two plots against x1 (size): the linear level-2 variance function, and the
population bounds for districts]
155
Default setting for variance function in MLwiN
• What model is suggested by this level-2 result?

$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{pmatrix}\right)$

   27.46 (3.2)
    0.00 (0.00)   0.00 (0.00)

• But this may be hiding the full result

   27.46 (3.2)
   15.32 (1.23)  -0.00 (0.61)

  an empirically negative variance estimate
• Default: if a higher-level variance becomes negative, both the variance
  and the covariance are set to zero
156
Changing the default setting
• In the command window: RESE at level ? with option ?

  Option        Action
  0 (default)   Variance is set to zero, as is the covariance
  1             Negative variance reset to 0, but not the associated
                covariance (allows a linear function)
  2             No re-setting takes place

• Example (and recommended):
  RESE 1 1
  RESE 2 1
  RESE 3 1
157
Changing the default setting in GUI
158
Laboratory 5: an analysis of the JSP data
• Return to the JSP data and draw graphs of the level-2 and level-1
variance functions; pictures of the between-school and between-pupil
populations, and the VPC for
• Model 1: null random intercepts model
• Model 2: random intercepts model with Maths3 as predictor
• Model 3: random intercepts and slopes model: quadratic function at
level 2
• New model: linear function for between-school heterogeneity
159
Modeling variance functions
in MLwiN
Chapter 5 (p.84)
Calculate rho (= VPC) for range of models
Display the graphs of population
heterogeneity at each level for a range of
models
160
Session 6
• Review Lab 5
• Heterogeneity at level-1
-Linear and quadratic
• Significance testing
- specific parameters, overall model
• Laboratory 6: an analysis of the JSP data continued
161
Complex heterogeneity at level-1
• So far we have assumed a constant variance function at level 1 – the
assumption of HOMOSCEDASTICITY. But other possibilities exist:
– Large properties may not only be expensive; they may also be more
variable (level-1 variance function fanning out)
– Pupils with high prior ability may be less variable in mathematics
progress (level-1 variance function fanning in)
– Girls may be better than boys on average but also more consistent
(level-1 variance function; differential variability)
Need to model heterogeneity explicitly
162
CATCH-ALL plot for diagnosis
• CATCH-ALL plot: residuals (on the vertical axis) by predicted response
(on the horizontal axis). What we are looking for is a structure-less
plot. It magnifies problems.
[Two example plots of residuals against yhat (roughly 30 to 70)]
• Read Chapter 4 diagnostic procedures
163
A range of catch-all plots
164
Complex heterogeneity at level-1
• Should be routinely examined, but almost nobody ever does it. It is
not possible in most regression software
• Why is it important, especially for multilevel models?
– Confounding of variances between levels: what appears to be
complex level-2 variation could really be complex level-1 variation
– Implications for the standard errors of the fixed estimates
Substantively
– eg impact heterogeneity: there is an 'average' treatment effect, but
are there some groups with greater variation in impact?
165
So, what do we do about it?
• Ignore it: very common
• 'Adjust' or 'correct' for it through transformations of y or the x's, or
WLS: rare
• Expect it and model it explicitly, that is the Multilevel Modeling
Approach: unfortunately very rare
• Exactly the same principles as adopted for level 2
166
House-heterogeneity as a CONSTANT function of size (homoscedasticity)
• Micro model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + (e_{0ij} x_{0ij})$
$\mathrm{Var}(e_{0ij} x_{0ij}) = \sigma^2_{e0} x_{0ij}$
[Two plots against x1 (size): the constant level-1 variance function, and
the population bounds for houses]
167
Between-house heterogeneity as a QUADRATIC function of size
• Micro model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + (e_{0ij} x_{0ij} + e_{1ij} x_{1ij})$
$\mathrm{Var}(e_{0ij} x_{0ij} + e_{1ij} x_{1ij}) = \sigma^2_{e0} x_{0ij}^2 + 2\sigma_{e0e1} x_{0ij} x_{1ij} + \sigma^2_{e1} x_{1ij}^2$
[Two plots against x1 (size): the quadratic level-1 variance function, and
the population bounds for houses]
168
Between-house heterogeneity as a QUADRATIC function of size
• Micro model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + (e_{0ij} x_{0ij} + e_{1ij} x_{1ij})$
$\mathrm{Var}(e_{0ij} x_{0ij} + e_{1ij} x_{1ij}) = \sigma^2_{e0} x_{0ij}^2 + 2\sigma_{e0e1} x_{0ij} x_{1ij} + \sigma^2_{e1} x_{1ij}^2$
[Two plots against x1 (size): the quadratic level-1 variance function, and
the population bounds for houses]
169
Between-house heterogeneity as a LINEAR function of size
• Micro model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + (e_{0ij} x_{0ij} + e_{1ij} x_{1ij})$
$\mathrm{Var}(e_{0ij} x_{0ij} + e_{1ij} x_{1ij}) = \sigma^2_{e0} x_{0ij}^2 + 2\sigma_{e0e1} x_{0ij} x_{1ij}$
[Two plots against x1 (size): the linear level-1 variance function, and the
population bounds for houses]
170
Between-house heterogeneity as a LINEAR function of size
• Micro model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + (e_{0ij} x_{0ij} + e_{1ij} x_{1ij})$
$\mathrm{Var}(e_{0ij} x_{0ij} + e_{1ij} x_{1ij}) = \sigma^2_{e0} x_{0ij}^2 + 2\sigma_{e0e1} x_{0ij} x_{1ij}$
[Two plots against x1 (size): the linear level-1 variance function, and the
population bounds for houses]
171
Modeling heterogeneity at multiple levels
[Two plots against x1: level-2 variation and level-1 variation]
Level 2: positive quadratic function
$\sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
Level 1: negative linear function
$\sigma^2_{e0} x_{0ij}^2 + 2\sigma_{e0e1} x_{0ij} x_{1ij}$
172
95% population bounds
[Left plot: the fixed price/size relationship with the quadratic bounds for
between-district variation. Right plot: the fixed price/size relationship
with the linear bounds for within-district variation.]
173
WHY use variance functions?
Substantively
• Some research questions are about variances as well as means:
segregation and inequality are all about differences between units;
lots of such work remains descriptive
• Impact heterogeneity: eg 2 interventions that reduce the mean of the
problem, but one is preferred because it narrows the variance, ie is
more consistent in effect
• Impact heterogeneity: eg 2 interventions that have the same mean
effect but one is more consistent for men, the other for women; the
variance is structured by gender
• The first lesson in data analysis is the middle and the scatter around
the middle; yet in modelling the homoscedastic assumption consigns
all variation to one term labelled error!
174
WHY use variance functions?
Technically
• Gives correct estimates of standard errors in the presence of
heteroscedasticity
• Multilevel models are about variance functions at higher levels, eg
between-district variance in price; BUT you can get confounding
across levels, so variance needs explicit modelling at all levels
• Example: between-district variation in price in terms of size requires
simultaneously modelling between-house variation in terms of size
• Example: detached houses are more expensive than non-detached
(fixed), have bigger differences between neighbourhoods (level 2) and
bigger differences between houses (level 1)
175
Exemplifying variance heterogeneity
• How has life satisfaction changed in Germany 1991-2006?
• Structure: 15 occasions for 16k individuals in 16 lander
• Response: Life satisfaction on a 10 point score
176
Model specification
Fixed part: 3rd order polynomial function of
time and East/West interaction
Between Lander as function
of time and East/West
Between Individuals as function
of time and East/West
Between occasions as function
of time and East/West
177
Exemplifying variance heterogeneity 2: varying treatment effects
• How has diastolic BP changed with 3 different drug treatments?
• Structure: 4 post-treatment occasions for 287 patients nested in 29
treatment centres; ie a longitudinal, panel model
• Baseline average BP is 102
• Below: raw data and fixed effects
• All three drugs lower BP; no significant differences between them
Variance functions
• Variance increases over time between centres
• Drug A: greater volatility around a falling trend
• Drug B: most consistent, and a reduction over time
• Drug C: less consistent, but a reduction over time
Modelled patient trajectories
• Drug A: divergence in BP around a falling mean
• Drug B: convergence in BP around a falling mean
• Drug C: variable impact
Modelling segregation
Motivation: are we becoming a segregated society?
Virtuous and vicious circles: following the 1988 Education Reform Act –
choice, league tables, competition – the expectation was of INCREASED
segregation.
[Diagram: two feedback loops working through parental choice. Apparent
high performance makes a school attractive to high-status parents,
leading to apparent improved performance; apparent poor performance
makes a school unattractive to high-status parents, leading to apparent
worsening performance.]
Expected results: increased polarization in terms of ability, and
increased polarization in terms of socio-economic background, poverty,
ethnicity etc.
Research Questions
Eligibility for free school meals (FSM) as a measure of poverty
• Has school FSM segregation increased?
• Has LA segregation increased?
• Has segregation been differential between types of schools and types
of LAs?
Approach: model the variance, not calculate an index; low variance =
low segregation
FSM: the data
• Source: Pupil Level Annual School Census (PLASC)
• Outcome: proportion of intake eligible for FSM
• Intake: Year 7 of the national curriculum in 2001-2006

  Action                                        LAs   Schools  Cohorts  Pupils
  Complete data from PLASC, 2001-2006           148    5,615   26,178   3,587,459
  Omit 'special' schools                        148    4,088   20,952   3,536,152
  Omit cohorts with fewer than 20 pupils        148    3,636   20,429   3,535,056
  Omit schools without a new intake at age 11
  (ie middle schools); LA loss, eg IOW, Poole   144    3,076   17,695   3,271,010
  Omit cohorts with implausible year-on-year
  differences (see next slide)                  144    3,076   17,637   3,261,372
Are LAs that are selective (grammar/secondary) more segregated than
totally comprehensive systems?
• Pupils going to school in selecting LAs are less likely to be in poverty
• Slight decline in poverty in both types of area
• Schools in selecting areas are more segregated
• Slight evidence of an increase

Is there more segregation in areas that are selective and where fewer
schools are under LA control in terms of admission policies?
Variance function for selective/non-selective LAs, structured by the
proportion of pupils in an LA who go to Community or Voluntary
Controlled schools (contra Voluntary Aided, Foundation, CTCs,
Academies)
• Pupils going to school in non-selecting LAs with low LA control are
more likely to be in poverty
• Schools in selecting areas are more segregated
• Segregation decreases with greater LA control for both types of LA
Area characteristics 3
• Which of England's LAs have the most segregated school system?
• Model with 144 averages and 144 variances, one for each LA!
185
Which of England's LAs have the most segregated school system?
Modelling 4.2 million pupils; 144 means and variances!!
[Scatter plot, "LA analysis FSM 2001-6": between-school variance
(vertical axis, 0 to 4) against median proportion FSM 2001-6 (horizontal
axis, 0.0 to 0.6), one labelled point per LA, with selective and
non-selective LAs distinguished; Reading, Hammersmith and Fulham and
Buckinghamshire appear with the largest variances.]
LAs with highest segregation
(not including estimates less than 2*SE)

  LA                Variance  D equiv  Median prop   Select  Prop LA
                              Index    FSM 2001-6            control
  Buckinghamshire     2.12     0.46       0.03       Select    0.77
  Southend-on-Sea     1.92     0.45       0.09       Select    0.21
  Slough              1.76     0.43       0.11       Select    0.37
  Trafford            1.75     0.43       0.08       Select    0.40
  Oldham              1.72     0.43       0.18       Non       0.75
  Calderdale          1.59     0.42       0.12       Select    0.32
  Sutton              1.50     0.41       0.05       Select    0.39
  Telford & Wrekin    1.46     0.40       0.15       Select    0.53
  Solihull            1.42     0.40       0.08       Non       0.85
  Barnet              1.42     0.40       0.16       Select    0.41
  Knowsley            1.38     0.40       0.34       Non       0.67
  Wirral              1.38     0.40       0.18       Select    0.74
  Milton Keynes       1.36     0.39       0.12       Non       0.43
  Croydon             1.30     0.39       0.16       Non       0.31
  Stockton-on-Tees    1.29     0.39       0.16       Non       0.69
Take home message
• Heterogeneity is the norm, not an aberration or error
• The traditional approach – assuming a constant variance that can be
summarized in one parameter – is overly simplistic and potentially
misleading
• The multilevel approach anticipates and allows for modeling of the
structure of the variation not accounted for by the fixed part, in
simple and complex ways
188
Significance testing
• Individual fixed part and random
coefficients at each level
• Overall model
• Simultaneous and complex comparisons
189
Standard procedure
• H0 Null hypothesis of no difference
• H1 Alternate hypothesis of difference
• Calculate statistic (depends on the nature of the problem)
• Compare with the tabulated statistic based on H0
• Calculated statistic should be greater than the tabulated
statistic at certain level (often 0.05 level) and with given
degrees of freedom to be ‘significant’
• Answers question: What are the chances of getting observed
statistic if H0 is true?
190
Individual coefficients
• Akin to t tests in regression models
• Either a specific fixed effect or a specific variance-covariance
component
  – H0: $\beta_1 = 0$;  H1: $\beta_1 \neq 0$
  – H0: $\sigma^2_{u0} = 0$;  H1: $\sigma^2_{u0} \neq 0$
• Procedure: divide the coefficient by its standard error
  – Judge against a z distribution
  – If the ratio exceeds 1.96 then significant at the 0.05 level
• Approximate procedure; asymptotic test, small-sample properties not
well known
• OK for fixed-part coefficients but not for random ones (typically small
numbers of higher-level units; the variance distribution is likely to
have positive skew)
191
Overall model
• Akin to F tests in regression models, ie is a more complex model a
significantly better fit to the data?
• Model 1: null two-level random intercepts model
  – Model 2: as Model 1 + two additional fixed parameters
  – Model 3: as Model 2 + quadratic variance function at level 2
• Procedure:
  – Calculate the deviance (a function of the actual and fitted values,
    equal to -2 * the logarithm of the likelihood)
  – Calculate the difference in the deviance
     • Deviance of Model 3 – Deviance of Model 2
     • Deviance of Model 2 – Deviance of Model 1
• Under H0 of no improvement in goodness of fit, the difference in
deviance is distributed as chi-square with df equal to the number of
extra parameters
  – Does the result of Model 2 – Model 1 exceed chi-square with df = ?
  – Does the result of Model 3 – Model 2 exceed chi-square with df = ?
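Once the two deviances are known, the comparison itself is a one-liner; a minimal Python sketch (illustrative deviance values only, not course results):

```python
from scipy.stats import chi2

def lrt(deviance_simple, deviance_complex, extra_params):
    """Likelihood-ratio (deviance) test for nested models."""
    diff = deviance_simple - deviance_complex        # >= 0 for nested ML fits
    p_value = chi2.sf(diff, df=extra_params)
    return diff, p_value

# hypothetical deviances
diff, p = lrt(deviance_simple=9357.2, deviance_complex=9320.8, extra_params=2)
print(f"change in deviance = {diff:.1f}, df = 2, p = {p:.4f}")
```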
192
Simultaneous/complex comparisons
• Example: testing H0: $\beta_2 - \beta_3 = 0$ AND $\beta_3 = 5$
• H0: $[C][\theta] = [k]$
• [C] is the contrast matrix (p by q) specifying the nature of the
hypothesis: p is the number of simultaneous tests; q is the number of
parameters in that part of the model
   1 if the parameter is involved
  -1 if involved as a difference
   0 if not involved
• $[\theta]$ is a vector of parameters (fixed or random), of length q
• [k] is a vector of values that the parameters are contrasted against
(usually the null), of length p
193
Simultaneous/complex comparisons
• Example in contrast matrix terms
  – q = 4 (intercept and 3 slope terms)
  – p = 2 (2 sets of tests)

$\begin{pmatrix} 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
= \begin{pmatrix} 0 \\ 5 \end{pmatrix}$

• Overall test against chi-square with p degrees of freedom
• Output
  – Result of the contrast
  – Chi-square statistic for each test separately
  – Chi-square statistic for the overall test; all contrasts simultaneously
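A generic Wald version of this test is easy to sketch in Python (my own illustration; MLwiN's own output differs in detail). Given the estimated parameters and their covariance matrix, the statistic is (Cθ − k)ᵀ (C V Cᵀ)⁻¹ (Cθ − k) on p degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

def wald_contrast(theta, V, C, k):
    """Wald test of H0: C theta = k, given estimates theta and covariance V."""
    C, k = np.atleast_2d(C), np.atleast_1d(k)
    r = C @ theta - k                                  # the contrasts
    stat = float(r @ np.linalg.solve(C @ V @ C.T, r))
    return stat, chi2.sf(stat, df=C.shape[0])

# hypothetical estimates for (beta0, beta1, beta2, beta3) and their covariance
theta = np.array([30.0, 2.1, 6.4, 5.9])
V = np.diag([1.0, 0.04, 0.25, 0.30])
C = np.array([[0, 0, 1, -1],
              [0, 0, 0, 1]])
k = np.array([0.0, 5.0])
print(wald_contrast(theta, V, C, k))
```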
194
Laboratory 6
• Return to the JSP data and draw graphs of the level-2 and level-1
variance functions; confidence bounds of the population and the VPC for
• Model 4a: linear function at level 1 and linear function at level 2
• Model 4b: quadratic function at level 1, linear at level 2
• Model 4c: quadratic function at level 2, linear at level 1
• Model 4d: fully random, quadratic at level 1 & level 2
• Test whether the more complex model is an improvement over the
simpler one using the deviance statistic; and test whether all level-2
variance terms are needed using the contrast matrix approach
195
SESSION 7
• Modeling categorical predictors
• Coding categorical predictors
- contrast specification
• Categorical predictors in the fixed part (as main
effects and Interactions) and MLwiN implementation
• Categorical predictors in the random part
at level 2 and level 1; contrast and separate coding
• Fitting models with categorical predictors in MLwiN
• Categorical response later
196
Coding: turning qualitative states into numbers

  House   Heating   Type       Size-5   Price
  1       Yes       Terraced     1        75
  2       No        Terraced    -2        61
  3       No        Semi         4       135
  4       Yes       Detached     3       140

• Typical approach (contrast coding)
  – Put in a constant for everybody
  – Give the most common category 0 (modal centering)
  – Give numbers to the other categories
• Example
  – Heating: Yes = 0; No = 1
  – Type: Terraced = 0; Semi = 1; Detached = 2
197
Creating indicator variables

  House   Constant   Non-heating   Type   Size-5   Price
  1       1          0             0        1        75
  2       1          1             0       -2        61
  3       1          1             1        4       135
  4       1          0             2        3       140

• Are we ready to model this data?
• Heating is correct: 1 is different from zero
  – But Type is not: 2 is not 2 * 1
• Solution
  – Replace k categories by k-1 indicators/dummies
  – Re-label the variables
198
Creating indicator variables

  House   Constant   Non-heating   Semi   Detached   Size-5   Price
  1       1          0             0      0            1        75
  2       1          1             0      0           -2        61
  3       1          1             1      0            4       135
  4       1          0             0      1            3       140

• The other codings are CONTRASTED with the constant (the base
category)
• The constant represents a 5-roomed terraced property with central
heating (the base category)
• Dummy variable trap: use 1 less category than the number of states
(exact multi-collinearity)
  – 2 heating categories become 1 dummy contrasted with the constant
  – 3 types become 2 dummies contrasted with the constant
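This k-1 dummy coding is what most software will produce automatically; a minimal pandas sketch (my own illustration, reproducing the small table above, with the heated terrace as the base):

```python
import pandas as pd

houses = pd.DataFrame({
    "heating": ["Yes", "No", "No", "Yes"],
    "type":    ["Terraced", "Terraced", "Semi", "Detached"],
    "size5":   [1, -2, 4, 3],
    "price":   [75, 61, 135, 140],
})

# contrast (treatment) coding: drop the base category of each factor
X = pd.get_dummies(houses[["heating", "type"]]).astype(int)
X = X.drop(columns=["heating_Yes", "type_Terraced"])   # base = heated terrace
X.insert(0, "constant", 1)
print(pd.concat([X, houses[["size5", "price"]]], axis=1))
```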
199
Contrast coding: most common
Model: full model with main effects and interactions
$y_{ij} = \beta_0(\mathrm{Cons}) + \beta_1(\mathrm{NHeat}) + \beta_2(\mathrm{Semi}) + \beta_3(\mathrm{Det}) + \beta_4(\mathrm{NH \cdot Semi}) + \beta_5(\mathrm{NH \cdot Det})$
• Allows simplification to additive effects
  – 6 means (based on the 2-by-3 combination of heating and type) can
    be estimated with just four terms
Model: main effects only
$y_{ij} = \beta_0(\mathrm{Cons}) + \beta_1(\mathrm{Nonheat}) + \beta_2(\mathrm{Semi}) + \beta_3(\mathrm{Detach})$
• Additive model
  – What is the average price for a terrace with heating?
  – What is the average price for a detached house without central heating?
• Asking the question in terms of differences:
  – if not different, not needed in the model
200
Possibility of complex variation between places
[Plot: price, with separate between-district variation for Semi and
Detached properties]
201
Quadratic function at level 2
• Same principles and procedures (include as a macro equation;
variance/covariance matrix)
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$
• Variance for the base category:
$\mathrm{Var}(u_{0j} x_{0ij}) = \sigma^2_{u0} x_{0ij}$
• Variance for the contrast category:
$\mathrm{Var}(u_{0j} x_{0ij} + u_{1j} x_{1ij}) = \sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
• Note: variance of the base category + 2 times the covariance +
variance of the contrasted category
202
Task 7
The response is house prices and there are two levels, houses and
districts. Houses are classified into terraced, semi and detached
properties. Using terraced as the base and contrast coding, write down
a micro equation and macro equations which allow each type of
property to have a differential geography over the city. Form the
combined multilevel model and identify the fixed terms and the
variance-covariance matrices at each level. Draw for each type of
property a graph showing between-district variations. For the lines on
the graph identify the terms forming the average prices and the district
differential price. Finally write down the variance associated with each
type of housing.
203
Separate coding

  House   District   Type       Size-5   Price
  1       1          Terraced     1        75
  2       1          Terraced    -2        61
  3       1          Semi         4       135
  4       1          Detached     3       140

Separate coding
Replace all distinct categories with indicators; re-label the indicators
3 categories, therefore three dummies

  Type       Terrace   Semi   Detached
  Terraced     1        0       0
  Terraced     1        0       0
  Semi         0        1       0
  Detached     0        0       1
204
Use contrast coding in the fixed part and separate coding in the random
part at level 2
$y_{ij} = \beta_0(\mathrm{Cons}) + \beta_1(\mathrm{Size}) + \beta_2(\mathrm{Semi}) + \beta_3(\mathrm{Detach})$
$\quad + (u_{1j}\,\mathrm{Terrace} + u_{2j}\,\mathrm{Semi} + u_{3j}\,\mathrm{Detach} + e_{0ij}\,\mathrm{Cons})$
NB: no Cons at level 2
Big advantage: the between-district variance is given directly
$\mathrm{Var}(u_{1j}\mathrm{Terrace}) = \sigma^2_{u1}$, $\mathrm{Var}(u_{2j}\mathrm{Semi}) = \sigma^2_{u2}$, $\mathrm{Var}(u_{3j}\mathrm{Detach}) = \sigma^2_{u3}$
Big advantage: the correlation between the different maps of effects for
each type of house is given directly, not as a correlation between a base
and a differential
Big disadvantage: the model is not reducible: no base; not easily
extendible; more than 1 set of categories requires combinations of
categories (see later)
205
Chapter 7 Modelling categorical predictors
Create dummy/indicator variables
206
Categorical predictors varying at level-1

  House   Heating   Type       Size-5   Price
  1       Yes       Terraced     1        75
  2       No        Terraced    -2        61
  3       No        Semi         4       135
  4       Yes       Detached     3       140

• Separate coding
• Replace all distinct categories with indicators
• Re-label the indicators
• DO NOT use a constant at all: 6 dummies, no base

  T-H   T-NH   S-H   S-NH   D-H   D-NH
   1     0      0     0      0     0
   0     1      0     0      0     0
   0     0      0     1      0     0
   0     0      0     0      1     0
207
Separate coding
Model for fixed effects
$y_{ij} = \beta_1\mathrm{TH} + \beta_2\mathrm{TNH} + \beta_3\mathrm{SH} + \beta_4\mathrm{SNH} + \beta_5\mathrm{DH} + \beta_6\mathrm{DNH}$
• Price for a non-heated detached property?
• Results are given directly, not by differentials
• A matter of convenience, BUT
• Separate coding is NOT reducible to a simpler form
208
In multilevel models
• Contrast coding in the fixed part and the higher-level (level-2) random
part
• Separate coding in the level-1 random part
• Micro model
$y_{ij} = \beta_{0j}(\mathrm{Cons}) + \beta_{1j}(\mathrm{NH}) + e_{1ij}(\mathrm{NH}) + e_{2ij}(\mathrm{H})$
$\mathrm{Var}\begin{pmatrix} e_{1ij} \\ e_{2ij} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{e1} & \\ 0 & \sigma^2_{e2} \end{pmatrix}\right)$
• Interpret
• Can only estimate variances, no 'covariance': a house can only belong
to one category, not both simultaneously
209
Summary – Separate/contrast coding

  House   Heating   Cons (x0)   Non-heat (x1)   Heat (x2)
  1       Yes       1           0               1
  2       No        1           1               0
  3       No        1           1               0
  4       Yes       1           0               1
  5       Yes       1           0               1

Contrast coding uses Cons and Non-heat; separate coding uses Non-heat
and Heat.
• Fixed part: contrast coded
$\beta_0 x_{0ij} + \beta_1 x_{1ij}$
210
Contrast coding at level-2
• 'Quadratic' at level 2
$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{pmatrix}\right)$
(rows/columns: Cons(x0), Non-Heat(x1))
• Between-district variance for heated properties: $\sigma^2_{u0}$
• Between-district variance for non-heated properties:
$\sigma^2_{u0} + 2\sigma_{u0u1} + \sigma^2_{u1}$
211
EITHER Separate coding at level-1
$\begin{pmatrix} e_{1ij} \\ e_{2ij} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{e1} & \\ 0 & \sigma^2_{e2} \end{pmatrix}\right)$
(rows/columns: Non-heat(x1), Heat(x2))
• Between-house variance for heated properties: $\sigma^2_{e2}$
• Between-house variance for non-heated properties: $\sigma^2_{e1}$
212
OR Contrast coding at level-1
$\begin{pmatrix} e_{0ij} \\ e_{1ij} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{e0} & \\ \sigma_{e0e1} & 0 \end{pmatrix}\right)$
(rows/columns: Cons(x0), Non-Heat(x1))
• Between-house variance for heated properties: $\sigma^2_{e0}$
• Between-house variance for non-heated properties: $\sigma^2_{e0} + 2\sigma_{e0e1}$
213
Summary of level-1 heterogeneity with categorical level-1 predictors
• At level 1 we only have information to estimate two random terms, not
all 3 of the full quadratic; therefore
• Either model directly, using separate coding and a variance for each
category, ie two+ variances;
• Or model indirectly, using contrast coding with a base variance and a
covariance; the variance of the contrasted category is derived as
var(base) + 2*covar(base, contrast)
• NOT a variance for the base and a variance for the contrasted
category, as that forces the base to have the smaller variance
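A tiny numerical check of that derivation, as a Python sketch (hypothetical values, not course estimates):

```python
# hypothetical level-1 estimates under contrast coding (not course data)
var_base   = 400.0   # sigma^2_e0 : heated (base) houses
cov_base_c = 75.0    # sigma_e0e1 : covariance with the non-heat contrast

var_contrast = var_base + 2 * cov_base_c   # derived variance for non-heated houses
print(var_contrast)                        # 550.0
```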
214
Model 10 Categorical predictors at level-1
(contrast coding)
QUESTION 6
Why are there zeros at level-1?
215
Laboratory 7
• In MLwiN open sample worksheet, choosing the tutorial data, this is
a new and larger dataset for secondary school children.
• The response is Normexam exam score aged 16, and an individual
categorical predictor is the categorical variable Vrband of the pupil 5
years earlier. The exam score has been formed by summing a
score for each examination that is sat at age 16. This is typically
skewed. It has been replaced by a normal score. You can do this by
generating a random normal vector of the same length as the
number of pupils; sorting the score and sorting the generated
values, replacing one by the other. Fit the following models
216
Laboratory 7 (continued)
• A: Setup a random-intercepts model with Normexam as response,
and Cons and VRband in the fixed part. Level 2 is school and level 1
is student. Choose VB3, the lowest achievement band as the base
referent category. Use customised predictions to plot the main effects
of Vrband and the predictions window to draw a varying relations plot.
Are the school differences remaining after taking account of prior
ability at entry?
• B: Develop model A to answer the following questions. Are there
different school effects for pupils of different initial prior ability? Are
children of different prior ability more or less variable in their progress?
That is allow the effects of VRB2 and VRB1 to be random at level 2
and level 1; that is use contrast coding as Vrb3 is already there.
Estimate the model. Why are some of the level-1 variances zero? Plot
the variance functions, and draw up the varying relations plot for each
VRband separately. For which type of pupil is the greatest between
school differences? Are different types of pupil differentially variable
in their progress?
217
Laboratory 7 (continued)
• C: This model addresses the same questions as Model B, but this time
we will use separate coding. The usefulness of this approach is that
we can assess the extent to which a school performs well for one
category, also performs well for other categories. Click on any of the
fixed VRband terms and delete all two of these terms to leave a null
random intercepts model. Click on the Cons term and tick off the level
1 and 2 random terms and the fixed term. There should be no terms in
the fixed and random part of the model . Add the VRband term
choosing None for the base reference category. Click on each of 3
separately coded dummies and tick on the level 1 and level 2 random
terms. Estimate this model. Which of the level-1 random terms are
zero, why is this? Plot the variance functions at level 1 and level 2; do
they differ from the results for model B. Produce pairwise plots of
these school differentials (level 2 residuals) and examine (via the
estimates table) the correlations between the level 2 school
differentials; how similar are schools in their performance for pupils of
different VRband abilities?.
218
Session 8
• Concept of higher-level predictors
• Contextual effect and cross-level interaction
• Specifying higher-level predictors
• Fitting models in MLwiN
219
Higher-level variables
• So far
all predictors have been level 1 (Math3,
boy/girl); (size, type of property)
• Now higher level predictors (contextual, ecological)
- global occurs only at the higher level;
-aggregate based on summarising a level 1 attribute
• Example: pupils in classes
progress affected by previous score (L1); class average
score (A:L2); class homogeneity (SD, A:L2); teaching style
(G:L2)
• NOW: trying to account for between school differences
220
Typology of contextual/ecologic measures

Derived, Aggregate, Collective
  – Aggregate of attributes measured at the individual (or lower) level.
  – Expressed as a measure of central tendency (eg mean, median) or of
    variation of individual-level variables (eg standard deviation), but
    caution when sample-based.
  – An individual analogue can often be measured.
  – Examples: mean/median income; proportion reporting mistrust; Gini
    coefficient for inequality.

Contagion, Peer
  – Aggregate of the individual-level outcome, rather than a predictor,
    that in turn predicts the individual name-sake of the same variable.
  – The individual analogue is the outcome.
  – Examples: proportion of smokers; infectious disease prevalence;
    suicide rate.

Integral, Global
  – Measures attributes of groups, organizations, or areas.
  – Occurs uniquely at the higher level; an individual analogue cannot be
    measured, thus irreducible.
  – Examples: systematic social observation; policy variables;
    environmental/physical attributes of an area.
221
Main and cross-level relationships: a graphical typology: voting example
The individual and the ecological - 1
[Plot: propensity for a left vote (Yij) against % working class in the
constituency (X2j), with separate lines for low-SES (X1ij = 1) and
high-SES (X1ij = 0) individuals]
222
The individual and the ecological - 2
[Plot: propensity for a left vote against % working class, with lines for
low and high SES]
223
The individual and the ecological - 3: consensual
[Plot: propensity for a left vote against % working class, with lines for
low and high SES]
224
A graphical typology of cross-level interactions (A to F) (Jones & Duncan 1993)
[Panels: individual; consensual; ecological; reactive; reactive for W,
consensual for M; non-linear cross-level interactions]
225
Main and cross-level relationships: housing example: symbols & variables
[Plot: house price (Yij) against environmental quality of district (X2j),
with separate lines for detached properties (X1ij = 1) and other housing
types (X1ij = 0)]
226
Data matrix: for main ecologic effect

  District j   House i   Price Yij   Detached X1ij   EnvQual X2j
  1            1          75          1               10
  1            2          61          0               10
  1            3         135          0               10
  1            4         140          1               10
  1            5         125          0               10
  .            .           .          .                .
  50           1          75          1               80
  50           2          61          0               80
  50           3         135          0               80
  50           4         140          1               80

EnvQual X2j is the higher-level variable: constant for every house within
a district.
227
Micro Model (Within-district, between-house)
Price of a house in a district
=
Price of non-detached properties in each district
+
Price DIFFERENTIAL for detached properties
+
Residual differential for each house in a district
228
Macro Model (Between-district)
Price of properties in each district
=
Average price of properties (across all districts)
+
Effect of environmental quality for each district
+
Residual differential on the price of
properties for each district
229
Combined model
Price of a house in a district
=
Average price of non-detached properties
+
Average DIFFERENTIAL for detached properties
+
Change in price for a unit change in DISTRICT
environmental quality
+
Residual differential on the price of
properties for each district
+
Residual differential for each house in a district
230
Specifying ecologic predictors – Main effect
allows fitting of types A to D
Micro model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_1 x_{1ij} + e_{0ij} x_{0ij}$
Macro model
$\beta_{0j} = \beta_0 + \beta_2 x_{2j} + u_{0j}$   ($\beta_2$ is the effect for ENVQ)
Overall model
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j} x_{0ij} + (u_{0j} x_{0ij} + e_{0ij} x_{0ij})$
$[u_{0j}] \sim N(0, \sigma^2_{u0})$,  $[e_{0ij}] \sim N(0, \sigma^2_{e0})$
Note: ENVQ enters the macro equation, but as a fixed parameter
231
Data matrix: for cross-level ecological effect (Graphs A to F)

  District   House   Price   Detached   EnvQual X2j   Detach*EnvQual X2j·X1ij
  1          1        75      1          10            10
  1          2        61      0          10             0
  1          3       135      0          10             0
  1          4       140      1          10            10
  1          5       125      0          10             0
  .          .         .      .           .             .
  50         1        75      1          80            80
  50         2        61      0          80             0
  50         3       135      0          80             0
  50         4       140      1          80            80
232
Micro Model (Within-district, between-house)
Price of a house in a district
=
Price of non-detached properties in each district
+
Price DIFFERENTIAL for detached properties in each district
+
Residual differential for each house in a district
233
Macro Models (Between-district) - 1
Price of non-detached properties in each district
=
Average price of non-detached properties (across all districts)
+
Effect of environmental quality for each district
+
Residual differential on the price of non-detached
properties for each district
234
Macro Models (Between-district) - 2
Price DIFFERENTIAL for detached properties in each district
=
Average price differential for detached properties (across all districts)
+
Effect of environmental quality for each district
+
Residual differential on the price differential of detached
properties for each district
235
Combined model
Price of a house in a district
=
Average price of non-detached properties
+
Average DIFFERENTIAL for detached properties
+
Change in price for a unit change in DISTRICT
environmental quality for non-detached properties
+
Differential change in price for a unit change in DISTRICT
environmental quality for detached properties
+
Residual differential on the price of detached properties
for each district
+
Residual differential on the price differential of detached
properties for each district
+
Residual differential for each house in a district
236
Specifying ecologic predictors – Cross-level interaction effect
allows fitting of types A to F
Micro model
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + e_{0ij} x_{0ij}$
Macro models
Main effect:    $\beta_{0j} = \beta_0 + \beta_2 x_{2j} + u_{0j}$
X-level effect: $\beta_{1j} = \beta_1 + \beta_3 x_{2j} + u_{1j}$
Overall model
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j} x_{0ij} + \beta_3 x_{2j} x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$
$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{pmatrix}\right)$,  $[e_{0ij}] \sim N(0, \sigma^2_{e0})$
Note: the combined form makes it clear that this is a cross-level interaction
237
Cross-level interaction equation in
MLwiN
238
Non-linear macro models
allows fitting of types A to H
Polynomial for ENVQ – macro model(s):
$\beta_{0j} = \beta_0 + \beta_2 x_{2j} + \beta_3 x_{2j}^2 + u_{0j}$
$\beta_{1j} = \beta_1 + \beta_4 x_{2j} + \beta_5 x_{2j}^2 + u_{1j}$
Overall model
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j} x_{0ij} + \beta_3 x_{2j}^2 x_{0ij} + \beta_4 x_{2j} x_{1ij} + \beta_5 x_{2j}^2 x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij})$
239
Cross-level interactions with continuous variables
• So far the typology, equations and examples have been for a
continuous w and a categorical x
• Now for continuous w and x
• No real changes, except that a surface plot or contour plot is needed
to show the results
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j} x_{0ij} + \beta_3 x_{2j} x_{1ij}$
where $\beta_0$ is the city-wide cost of a 5-room house when ENVq = 0,
$\beta_1$ is the cost of an extra room, $\beta_2$ is the effect of ENV quality, and
$\beta_3$ is the effect of ENVq * size
240
Cross-level interactions: continuous variables

  Row   Price   district   house   rooms-5   ENVq   ENVq*Rooms
  1      85.0    1          1        -2        5      -10
  2      50.0    1          2        -4        5      -20
  3     102.5    1          3        -1        5       -5
  …       …      …          …         …        …        …

$y_{ij} = 100 x_{0ij} + 10 x_{1ij} + 2 x_{2j} x_{0ij} + 1.5 x_{2j} x_{1ij}$
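As shown below, the fixed-part prediction from an equation of this form is easy to compute for any combination of rooms and environmental quality; a small Python sketch using the coefficients quoted above (treating them as given and ignoring the random part):

```python
def predicted_price(rooms_minus5, envq,
                    b0=100.0, b_rooms=10.0, b_envq=2.0, b_inter=1.5):
    """Fixed-part prediction: y = b0 + b_rooms*x1 + b_envq*x2 + b_inter*x2*x1."""
    return (b0 + b_rooms * rooms_minus5 + b_envq * envq
            + b_inter * envq * rooms_minus5)

# eg a 7-room house (x1 = +2) in districts of low and high environmental quality
for envq in (0, 5, 20):
    print(envq, predicted_price(rooms_minus5=2, envq=envq))
```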
241
Multilevel data manipulations in MLwiN
• Creates aggregate variables for each block
  Average    calculates the mean of the input column(s) values
  Count      calculates the number of level-1 units of the input column
  Stand Dev  forms the standard deviation (divisor n-1) of the input
             column(s) values
• Beware: measurement error of SAMPLE-BASED aggregate variables
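Outside MLwiN the same aggregate (derived) variables can be built with a groupby; a minimal pandas sketch, assuming a pupil-level frame with school identifiers (my own illustration, not the course worksheet):

```python
import pandas as pd

pupils = pd.DataFrame({
    "school": [1, 1, 1, 2, 2, 3, 3, 3],
    "lrt":    [0.2, -0.5, 1.1, 0.4, 0.9, -1.2, -0.3, 0.0],
})

# school-level aggregates, merged back onto every pupil row
agg = (pupils.groupby("school")["lrt"]
       .agg(avslrt="mean", n_pupils="count", sdlrt="std")   # std divisor is n-1
       .reset_index())
pupils = pupils.merge(agg, on="school")
print(pupils)
```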
242
Multilevel data manipulations: school
average attainment and heterogeneity
on entrance
243
EXAMPLE
• STRUCTURE: 2275 voters in 218 constituencies, 1992
• RESPONSE: vote Labour, not Conservative
• PREDICTORS:
  Level 1 (individual): age, sex, education, tenure, income;
  8-fold classification of class
  Level 2 (constituency): % local authority renters;
  % employers and managers; 100 - % unemployed
• MODEL: cross-level interactions between INDIVIDUAL & CONSTITUENCY
  characteristics
  Fixed part main effects: 8-fold division of class
  Random part at level 2: 2-fold division of class
  Working class: unskilled and skilled manual, foreman
  Non-working class: public and private-sector salariat, routine
  non-manual, petty-bourgeoisie, 'unstated'
244
Note the different terms in the fixed and random parts
Fixed: all 8 fractions of class (contrast coded)
Random: two groupings of class (separate coding)
$y_{ij} = \beta_0 x_{0ij} + \beta_1\,\mathrm{Class2}_{ij} + \beta_2\,\mathrm{Class3}_{ij} + \ldots + \beta_7\,\mathrm{Class8}_{ij}$
$\quad + (u_{9j}\,\mathrm{Class1\!-\!3}_{ij} + u_{10j}\,\mathrm{Class4\!-\!8}_{ij} + e_{9ij} x_{0ij} + e_{10ij} x_{1ij})$
$\begin{pmatrix} u_{9j} \\ u_{10j} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{u9} & \\ \sigma_{u9u10} & \sigma^2_{u10} \end{pmatrix}\right)$
(rows/columns: Class1-3, Class4-8)
245
Cross-level interactions
246
Conclusions 1
• Specifying ecologic measures within a multilevel
statistical model makes it clear that Ecological
variables
- account for between area differences NOT level 1
outcomes
- account for place differences AFTER allowing for
within place composition
• Single-level models under-estimate the relative importance of
ecological variables; moreover, multilevel models result in correct SEs
for higher-level fixed terms………………..
247
Conclusions 2
Random variability between-areas (the response in the
macro models) is composed of three parts:
– TRUE parameter variance: implying that area
differences can be accounted by
observable/measurable ecological variables.
– UNOBSERVED heterogeneity due to in-numerable
small-scale factors that are hard to measure;
unexplainable
– SAMPLE variance: a consequence of working with a
sample of areas; unexplainable
• Single level model presume that higher-level has all
been accounted for
248
Two-level hierarchical model
Micro model: continuous predictor and response
$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + (e_{0ij} x_{0ij} + e_{1ij} x_{1ij})$
$\mathrm{Var}\begin{pmatrix} e_{0ij} \\ e_{1ij} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{e0} & \\ \sigma_{e0e1} & \sigma^2_{e1} \end{pmatrix}\right)$
Macro models
$\beta_{0j} = \beta_0 + \beta_2 x_{2j} + u_{0j}$
$\beta_{1j} = \beta_1 + \beta_3 x_{2j} + u_{1j}$
Combined multilevel model
$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j} x_{0ij} + \beta_3 x_{2j} x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij} + e_{1ij} x_{1ij})$
$\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\left(0,\;\begin{pmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{pmatrix}\right)$
Level-2 variance: $\sigma^2_{u0} x_{0ij}^2 + 2\sigma_{u0u1} x_{0ij} x_{1ij} + \sigma^2_{u1} x_{1ij}^2$
Level-1 variance: $\sigma^2_{e0} x_{0ij}^2 + 2\sigma_{e0e1} x_{0ij} x_{1ij} + \sigma^2_{e1} x_{1ij}^2$
249
Task 8: 3-level model of London schools
There are 3 levels to this model: schools (3) are repeatedly measured
over 3 years as cohorts (2) pass through, by the performance of pupils
(1) in public examinations. The response is the score of a pupil aged 16
in public examinations; the variable titles are reasonably
self-explanatory with the exception of VR1 and VR2. Pupils are tested
for ability (verbal reasoning test) at age 11 prior to entry to the school.
VR1 is the highest 25%; VR2 is the middle 50%.
• Interpret the model; begin by working out who is the constant.
• How would you improve the model?
Source: Nuttall, D. L., Goldstein, H., Prosser, R., and Rasbash, J. (1989).
Differential school effectiveness. International Journal of Educational
Research, 13: 769-776.
250
Repeated cross-sectional design
• Structure of Example:(Nuttall et al’s analysis of changing school
performance in London)
– Level-3 is the school, level-2 is the year and level-1 is the individual.
Level-2 represents repeated measurements on the school.
– Imbalance:
– particular schools not included in particular years (level 2);
– different number of pupils in each cohort in each school (level 1)
– even if school is only measured once do NOT discard it, include it!
• Research Question
– modeling changing school performance as different cohorts of pupils
pass through the schools
251
3-level Random Intercept Model
$y_{ijk} = \beta_{0jk} + \beta_1 x_{ijk} + e_{ijk}$
$\beta_{0jk} = \beta_0 + u_{0jk} + v_{0k}$
Note the ijk subscripts and the 2 higher-level random effects:
i = level 1 (eg pupil), j = level 2 (eg cohort), k = level 3 (eg school)
Variance between pupils within cohorts within schools = $\mathrm{Var}(e_{ijk}) = \sigma^2_e$
Variance between cohorts within schools = $\mathrm{Var}(u_{0jk}) = \sigma^2_{u0}$
Variance between schools = $\mathrm{Var}(v_{0k}) = \sigma^2_{v0}$
Variance between cohorts = $\sigma^2_{u0} + \sigma^2_{v0}$
Random effects at different levels are assumed to be uncorrelated.
252
Variance Partition Coefficients for a 3-level random intercepts model
Proportion of variance due to differences between cohorts
= intra-cohort correlation = $(\sigma^2_{u0} + \sigma^2_{v0}) \,/\, (\sigma^2_{u0} + \sigma^2_{v0} + \sigma^2_e)$
Proportion of variance due to differences between schools
= intra-school correlation = $\sigma^2_{v0} \,/\, (\sigma^2_{u0} + \sigma^2_{v0} + \sigma^2_e)$
Proportion of the between-cohorts variance due to differences between
cohorts within schools = $\sigma^2_{u0} \,/\, (\sigma^2_{u0} + \sigma^2_{v0})$
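Given the three estimated variance components these VPCs are a few lines of arithmetic; a Python sketch (using, for illustration, the between-pupil, between-year and between-school variances quoted later in Task 8: 96.7, 1.2 and 2.9):

```python
def three_level_vpcs(sigma2_e, sigma2_u0, sigma2_v0):
    """VPCs for a 3-level (pupil in cohort in school) random-intercepts model."""
    total = sigma2_e + sigma2_u0 + sigma2_v0
    return {
        "intra_cohort": (sigma2_u0 + sigma2_v0) / total,   # same school & cohort
        "intra_school": sigma2_v0 / total,                 # same school, different cohort
        "cohort_within_school_share": sigma2_u0 / (sigma2_u0 + sigma2_v0),
    }

print(three_level_vpcs(sigma2_e=96.7, sigma2_u0=1.2, sigma2_v0=2.9))
```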
253
Correlation structure of 3 level model
ρ1 = intra-class correlation (within the same school & same class/cohort)
ρ2 = intra-school correlation (within the same school, different class/cohort)
[10-by-10 example correlation matrix for pupils (P) in classes/cohorts (C)
within schools (S): 1 on the diagonal, ρ1 between pupils in the same
school and same class, ρ2 between pupils in the same school but
different classes, and 0 between pupils in different schools]
254
Task 8 continued
Response: ILEA score on pupils leaving aged 16
Level 1: 31,623 pupils; Level 2: 3 years (1984, 85, 86); Level 3: 140 schools
FIXED EFFECTS (ratio of estimate to standard error in brackets)
Level 1: Pupil
  Intercept                         17.8
  Girl                               2.5 (12.5)
  VR1 (highest ability at 11)       19.0 (63.3)
  VR2 (middle ability at 11)         8.2 (41.0)
  African                            4.0 (8.0)
  Bangladeshi                        4.7 (6.7)
  Greek                              4.6 (6.6)
  Pakistani                          6.0 (10.0)
  Turkish                            3.7 (9.3)
  Arab                               4.4 (4.0)
  Caribbean                         -0.4 (1.9)
  Indian                             7.3 (14.6)
  SE Asian                           8.3 (13.8)
  Other                              3.8 (9.5)
255
Task 8 continued: fixed effects
Level 2: Year
  Year                                   1.4 (7.0)
Level 3: School
  Boys (single-sex school)               0.8 (2.7)
  Girls (single-sex school)              1.4 (4.7)
  Church of England (denominational)     1.2 (3.0)
  Roman Catholic (denominational)        2.4 (8.0)
  Prop. free school meals               -0.41 (10.3)
  Prop. squared of FSM                   0.003 (7.5)
256
Task 8 continued
RANDOM EFFECTS (correlations in brackets)
Level 1: Between pupils  96.7
Level 2: Between years    1.2
Level 3: Between schools: estimated variance-covariance matrix

             Intercept     VR1          VR2          Sex         Carib   Year
  Intercept   2.9
  VR1        -1.9 (0.0)    17.4
  VR2        -0.1 (0.0)     6.1 (0.9)    2.8
  Sex        -1.5 (-0.6)    2.4 (0.4)    0.6 (0.2)   2.1
  Carib      -0.4 (-0.2)   -1.8 (-0.4)  -0.5 (-0.3)
  Year                                                           (0.1)   0.5
257
Task 8 continued
• Interpretation
Who is the Intercept?
Including VR1, VR2 for age 11 in the model, means that we
are modelling ?
Fixed pupil effects?
Fixed Year effects?
Fixed school effects?
Level-1 random effects?
Level-2 random effects?
Level-3 random effects?
• Critique
258
Chapter 8 Higher-level variables
Model 10 Complex level-1, level-2 (with size,
type)
259
Lab 8 Using higher level variables
• In MLwiN open sample worksheet, choosing the tutorial data
• The response is Normexam exam score aged 16, and the
individual predictor is the categorical variable Vrband of the
pupil 5 years earlier. The higher level variable is avslrt, the
average reading ability in the school for pupils on entry aged
11. Then fit the following models
260
Lab 8 continued
• A: Setup a base random-intercepts model with Normexam as
response, and Cons and VRband in the fixed part. Level 2 is school
and level 1 is student. Choose VB3, the lowest achievement band as
the base referent category. Use customised predictions to plot the
main effects of Vrband and the predictions window to draw a varying
relations plot. Are the school differences remaining after taking
account of prior ability at entry? (same as 7A)
• B: Does the prior average ability of entry in a school have an effect on
subsequent achievement that is in addition to the pupil’s own prior
ability? That is include centred school Avslrt (around its grand mean)
in addition to individual VRband as predictor variables. Draw a plot of
the main effects with Avslrt on the horizontal axis, groups of lines
determined by VRband and the vertical axis is the predicted
Normexam based on the fixed part of the model. Are there significant
peer group effects? In comparison to Model A, has school average
ability accounted for some of the unexplained between school
variance. Has the level 1 variance reduced with the inclusion of Avslrt,
why or why not?
261
Lab 8 continued
• C: Setup a new base model to answer the following questions. Are
there different school effects for pupils of different initial prior ability?
Are children of different prior ability more or less variable in their
progress? First remove the fixed effect of Avslrt by deleting the term
(back to Model A). Then allow the effects of VRB2 and VRB1 to be
random at level 2 and level 1; that is use contrast coding as Vrb3 is
already there. Estimate the model. Why are some of the level-1
variances zero? Plot the variance functions, and draw up the varying
relations plot for each VRband separately. For which type of pupil is
the greatest between school differences? This model does not use
higher level predictor variables but sets up a base for the next model,
and is needed if we are going to fit cross-level interactions same as
Model 7b)
• D: Does the prior average ability of entry in a school have an effect
on subsequent achievement that is in addition to the pupil’s own prior
ability and is this effect differential for pupils of different prior ability?
Staring with Model C, include centred school Avslrt (around its grand
mean) and then add 1st order cross-level interactions between Vrband
262
Lab 8 continued
• (reference category is VR3) and centred Avslrt, in addition to
individual VRband as predictor variables. Draw a plot of the main
effects with Avslrt on the horizontal axis, groups of lines determined
by VRband and the vertical axis is the predicted Normexam based
on the fixed part of the model. Are there significant peer group
effects? Which type of pupil is most affected by their peers? In
comparison to Model C, has school average ability accounted for
some of the unexplained between school variance?
• You could do other analyses with this data, that require higher-level
variables and cross-level interactions eg
• E: does school heterogeneity in terms of LRT have an effect on
progress in addition to AvsLrt, and is this differential for pupils of
different ability.
263
Session 9 Estimates and Estimation
A comparison of Fixed and Random specification estimates
- Anova versus ML random intercepts
- Ancova versus ML random slopes and intercepts
- empirical and algebraic comparisons
- borrowing strength; shrinkage
IGLS/RIGLS estimation
Markov Chain Monte Carlo estimation
- Frequentist approach
- Bayesian approach
- What does MCMC do?
- IGLS versus MCMC
- MCMC in practice in MLwiN
- DIC
MCMC is now the preferred form of estimation for realistically
complex models
264
Properties of ML estimates: One picture, two specifications
1a  Select if; and fit a separate line to each district
1b  ANOVA; ie include a dummy for each district: fixed-part expansion
2   Fit a multilevel model: random-part expansion

Comparison:              OLS                    ML
Target of inference      Specific districts     Between-district differences
Assumption               Separate places        Exchangeability
Df consumed in fit       1 for each district,   Grand mean + variance(s)
                         grows!
265
Two specifications in MLwiN
Districts as Random effects
Districts treated as Categorical and added to
Model as Fixed effects
266
Data for London mortgages (Jones and Bullen, 1994)
• 33 London districts; 5% sample of Q1 1990 (Jan-March)
• ML estimates
  Grand mean:                               £84.27k
  Between-district variance:                284.1 (117.9)
  Within-district, between-house variance:  3112
• Comparison with OLS/ANOVA: separate estimation
• ML assumes the district means come from a normal distribution with a
variance of 284.1; this results in precision-weighted estimation =
"borrowing strength" & shrinkage
267
  District        n_j   w_j    OLS   ML   Diff   Rank
  City             3    0.21    79   83    -4    17.5
  Barking          6    0.35    39   68   -29     2
  Barnet          31    0.73    93   91     2    26
  Bexley           6    0.35    84   84     0    32.6
  Brent           32    0.74    73   76    -3    22
  Bromley         15    0.57    97   92     5    13.5
  Camden          17    0.60    76   79    -3    22
  Croydon         18    0.62    72   76    -4    17.5
  Ealing          23    0.67    78   80    -2    26
  Enfield         34    0.75    80   81    -1    30
  Greenwich       26    0.70    58   66    -8    10.5
  Hackney          9    0.45    54   70   -16     6
  Hammersmith     16    0.59   103   95     8    10.5
  Haringey        12    0.52    76   80    -4    17.5
  Harrow          22    0.66    81   82    -1    30
  Havering         4    0.26    59   77   -18     5
  Hillingdon      27    0.71   100   96     4    17.5
  Hounslow        26    0.70    99   94     5    13.5
  Islington       22    0.66    90   88     2    26
  Kensington      12    0.52   124  105    19     3.5
  Kingston         8    0.42    91   87     4    17.5
  Lambeth         18    0.62    81   82    -1    30
  Lewisham        38    0.77    66   70    -4    17.5
  Merton          10    0.47    89   87     2    26
  Newham          28    0.71    63   69    -6    12
  Redbridge       11    0.50    85   85     0    32.5
  Richmond        26    0.70   130  117    13     7
  Southwark        9    0.45    66   76   -10     8.5
  Sutton           4    0.26   207  117    90     1
  Twr Hamlets     13    0.54    42   61   -19     3.5
  WForest         28    0.71    75   78    -3    22
  Wandsworth      39    0.78    91   89     2    26
  Westminster     16    0.59   109   99    10     8.5
268
One picture, two specifications
1a  OLS: select if; and fit a separate line to each district
1b  ANCOVA; ie include a dummy for each district and a dummy
    interaction with size
2   Fit a multilevel model: random-part expansion
OLS = ANCOVA = separate lines: overall model??
33 places = 66 parameters (separate slope and intercept)
Unstable estimation:
  - few houses in a district
  - little variation in the predictor (all 5-roomed houses)
269
Two specifications in MLwiN
Districts as
Random effects
Districts treated as Categorical and District-Size
interactions: all as Fixed effects
270
  District      OLS mean  ML mean   Diff   Rnk    OLS slope  ML slope   Diff   Rank
  City            81.9      80.7     1.2    31      14.3      14.4      -0.1    33
  Barking         38.7      65.2    -26.     4      -8.4       0.24     -8.7     5
  Barnet          84.7      82.1     2.65   24      13.6      14.3      -0.7    26
  Bexley          79.1      80.5    -1.3    30      15.6      14.4       1.26   21
  Brent           73.4      74.6    -1.1    32       7.69      8.15     -0.4    31
  Bromley         79.1      87.8    -8.6    10      29.5      25.2       4.23    8
  Camden          79.5      76.4     3.12   23       5.07      7.27     -2.1    16
  Croydon         72.7      74.8    -2.1    26       7.77      8.5      -0.7    28
  Ealing          76.9      74.7     2.12   27       3.66      6.83     -3.1    11
  Enfield         80.1      83.8    -3.6    22      23.9      21.4       2.48   14
  Greenwich       58.0      65.2    -7.1    12       1.63      0.96      0.67   30
  Hackney         58.0      73.5    -15.     6       6.88      9.19     -2.3    15
  Hamrsmith      113.      100.     12.4     8      37.2      33.5       3.7    10
  Haringey        83.5      79.3     4.18   20      10.2      11.3      -1.1    23
  Harrow          81.4      82.5    -1.1    33      18.2      17.3       0.88   24
  Havering        63.7      69.8    -6.1    15     -21.0       2.73    -24.      2
  Hillingdon      86.8      94.7    -7.9    11      41.6      35.2       6.45    6
  Hounslow        92.9      94.4    -1.4    29      34.3      31.3       2.94   12
  Islington       89.5      83.2     6.27   14      12.7      13.9      -1.1    22
  Kensington     138.0     107.     30.5     3      40.1      37.5       2.56   13
  Kingston        90.0      83.3     6.7    13      14.2      16.2      -2.0    17
  Lambeth         81.2      78.9     2.29   25       9.91     11.2      -1.3    19
  Lewisham        65.5      70.8    -5.3    19       8.59      7.9       0.69   29
  Merton          84.4      82.8     1.6    28      16.8      16.5       0.32   32
  Newham          64.7      70.8    -6.0    16       8.00      7.26      0.73   27
  Redbridge       70.1      79.0    -8.8     9      18.1      14.0       4.07    9
  Richmond       110.0     104.      5.41   18      39.7      38.9       0.84   25
  Southwark       29.4      63.1    -33      2     -27.0      -5.3     -22.      3
  Sutton        -203.6      99.6   -302      1     150.       35.9     114.      1
  T.Hamlets       39.8      64.7    -24.     5      -2.6       2.49     -5.1     7
  W Forest        72.5      76.2    -3.6    21      12.9      11.6       1.27   20
  Wandswrth       92.7      86.8     5.98   17      13.9      15.4      -1.5    18
  Westmnser       75.6      62.6    13.      7     -27.      -17.       -9.8     4
271
Borrowing strength: Sutton
Sutton: 4 houses
3 with 7 rooms: £155k, £255k, £128k (average = £180k)
1 with 8 rooms: £330k
cost of an extra room = £150k
back-project to the average size (4.5 rooms) = -203.6
Unstable estimation (few houses and little variation in size)
ML line: a combination of the OLS line shrunk towards the grand mean line
272
Relating ML to OLS estimates algebraically: Null model I
OLS / ANOVA: fit a separate intercept for each of the 33 districts
$y_{i1} = \beta^*_{01} + \epsilon_{i1}, \;\ldots,\; y_{i,33} = \beta^*_{0,33} + \epsilon_{i,33}$
or equivalently, with district dummies $D_1 \ldots D_{33}$,
$y_{ij} = \beta^*_{01} D_1 + \beta^*_{02} D_2 + \ldots + \beta^*_{0,33} D_{33} + \epsilon_{ij}$
ML (null two-level model):
$y_{ij} = \beta_0 x_{0ij} + (u_{0j} x_{0ij} + e_{0ij} x_{0ij})$, with
$u_{0j} \sim N(0, \sigma^2_{u0})$ and $e_{0ij} \sim N(0, \sigma^2_{e0})$
273
Relating ML to OLS estimates algebraically: Null model II
(1)  $\hat\beta_0 = \sum_j w_j \beta^*_{0j} \,/\, \sum_j w_j$
     The ML grand mean is a weighted average of all the OLS intercepts
     (weighted to emphasise reliability)
(2)  $\hat\beta_{0j} = w_j \beta^*_{0j} + (1 - w_j)\hat\beta_0$
     The ML intercept is a weighted combination of the OLS intercept and
     the overall grand mean
(3)  $w_j = \sigma^2_{u0} \,/\, (\sigma^2_j + \sigma^2_{u0})$
     The weights are a function of the between-place variation and the
     sampling variance within a place
(4)  $\sigma^2_j = \sigma^2_{e0} / n_j$
     The sampling variance of the OLS intercept depends on the level-1
     variance and the number of houses
274
Relating ML to OLS estimates algebraically: Null model III
Consequently:
If there are large differences between districts and a large number of
houses in a district, the sampling variance is small and the estimate is
reliable; the weight approaches 1 and equation (2) becomes
$\hat\beta_{0j} = w_j \beta^*_{0j} + (1 - w_j)\hat\beta_0 \approx 1\cdot\beta^*_{0j} + 0\cdot\hat\beta_0 = \beta^*_{0j}$
ie the ML intercept = the OLS estimate.
If there are small differences between districts and a small number of
houses in a district, the sampling variance is large and the estimate is
imprecise; the weight approaches 0 and equation (2) becomes
$\hat\beta_{0j} = w_j \beta^*_{0j} + (1 - w_j)\hat\beta_0 \approx 0\cdot\beta^*_{0j} + 1\cdot\hat\beta_0 = \hat\beta_0$
ie the ML intercept = the grand mean estimate.
275
Relating ML to OLS estimates algebraically for one district
ML estimates: $\hat\beta_0$ = £84k; $\sigma^2_{u0}$ = 284.1; $\sigma^2_{e0}$ = 3112
The City (j = 1): $n_{City}$ = 3
$\sigma^2_{City} = \sigma^2_{e0} / n_{City} = 3112/3$
$w_{City} = \sigma^2_{u0} / (\sigma^2_{City} + \sigma^2_{u0}) = 284.1 / (3112/3 + 284.1) = 0.21$
$\hat\beta_{0,City} = w_{City}\beta^*_{0,City} + (1 - w_{City})\hat\beta_0 = (0.21 \times 79) + (0.79 \times 84) = 83$
79 shrunken towards 84 becomes 83
"Pooled" estimation = "borrowing strength" = precision-weighted estimation
NB: the fully random model borrows strength between intercepts and
slopes according to the degree of covariance
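The same precision-weighted calculation, written as a small Python sketch so it can be applied to any district (using only the quantities on this slide):

```python
def shrunken_intercept(ols_intercept, n_j, grand_mean=84.0,
                       between_var=284.1, within_var=3112.0):
    """Precision-weighted ("shrunken") district intercept for the null model."""
    sampling_var = within_var / n_j                    # sigma^2_j
    w = between_var / (sampling_var + between_var)     # reliability weight
    return w * ols_intercept + (1 - w) * grand_mean

print(round(shrunken_intercept(ols_intercept=79, n_j=3)))   # City: 79 -> 83
print(round(shrunken_intercept(ols_intercept=207, n_j=4)))  # Sutton: 207 -> ~117, as in the table
```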
276
London data: shrinkage of
intercepts and slopes: Chapter 9
277
London data: the two models:
Chapter 9
278
Rubin’s analysis of law schools
• Aim: Predict performance (Y) in 1st year examination on
basis of test before entry(X)
• Data: thousands of applications to 83 law schools
• Problem: restricted range of X in any one law school; uninterpretable negative slopes; “bouncing β’s” over time
• Comparative test:
ML and OLS calibration for one year (1974)
Predict next year’s examination results (1975)
Compare actual and predicted for 1975, ML 40*
better than OLS
• WHY: unreliable school-specific estimates shrunk
towards grand slope based on wide range of X values
RUBIN, D. (1980). Using empirical Bayes techniques in the law school validity studies(with
discussion). J. Amer. Statist. Assoc. 75 801-827.
279
Analysis Strategies (from Session 1)
IV Analysis of covariance (fixed effects model). Include dummy
variables for each and every group
Problems
• What if number of groups very large, eg households?
• No single parameter assesses between group differences
• Cannot make inferences beyond groups in sample
• Cannot include group-level predictors as all degrees of
freedom at the group-level have been consumed
• Target of inference: individual School versus schools
• Potentially unstable estimates
280
Fixed and Random classifications
Random classification
Fixed classification
Generalization of a level
(e.g., schools)
Discrete categories of a
variable (eg Gender)
Random effects come
from a distribution
Not sample from a
population, but exhaust
the categories
All observed schools
contribute to betweenschool variance
Specific categories only
contribute to their
respective means
Pooling of information
across groups:
exchangeability
Separate estimation;
potentially imprecise
estimation
281
AKA shrunken residuals (session 3)

û_0j = [ σ²_u0 / (σ²_u0 + σ²_e0 / n_j) ] × r_j

The multiplier in this formula is always less than or equal to 1, so the estimated residual û_0j is usually smaller in magnitude than the raw residual r_j, ie the FIXED-effect estimate.
The raw residual is multiplied by a shrinkage factor, and the estimated (or posterior) residual is therefore also sometimes called a shrunken residual.
Shrinkage towards zero is greatest when the between-district variance is relatively small and when the raw residual would be unreliably estimated (few level-1 units in the district).
16 minute video on multilevel residuals
http://www.cmm.bristol.ac.uk/learning-training/videos/RP/Residuals/Residuals.html
282
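A minimal sketch of the shrinkage factor applied to a raw residual; the default variance values simply reuse the house-price estimates from the earlier worked example, purely for illustration.

def shrunken_residual(r_j, n_j, sigma2_u0=284.1, sigma2_e0=3112.0):
    # k_j = sigma2_u0 / (sigma2_u0 + sigma2_e0 / n_j) is always <= 1
    k_j = sigma2_u0 / (sigma2_u0 + sigma2_e0 / n_j)
    return k_j * r_j

# The same raw residual is shrunk far more for a small district than a large one
print(shrunken_residual(-5.0, n_j=3))    # ~ -1.1
print(shrunken_residual(-5.0, n_j=300))  # ~ -4.8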
IGLS Estimation = FIML
• Maximum likelihood picks the values of the model
parameters that make the data "more likely" than any
other values of the parameters would make them
• Iterative scheme
each cycle : estimate fixed part, estimate random part on
updated estimate of fixed part
• IGLS (Goldstein, 1986) : see Chapter 9 for worked
example
1) OLS estimation of fixed part ignoring complex structure of
random part
2) squared residuals from this fit are regressed on a set of
indicators defining the structure of the random part
(defined by constant, linear quadratic structure etc) to get
random parameter estimates
3) these estimates are used in a generalised least-squares
to obtain revised estimates of the fixed part
283
RIGLS Estimation = REML
• Restricted (or Residual) Maximum Likelihood modifies
the maximum likelihood estimation procedures to take
account of the sampling variation in the fixed effects
when producing the variance estimates
• calculating a simple variance
- FIML estimator: divide the sum of the squared
deviations around the mean by n
- REML estimator: divide by n − 1; the −1 represents the degree of freedom consumed in estimating the mean
• In large samples there will be very little impact of this
correction on the estimated value of the variance.
284
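A toy illustration (not MLwiN output) of the FIML versus REML divisor for a simple variance, with made-up data:

y = [4.0, 7.0, 6.0, 9.0, 4.0]           # illustrative data
n = len(y)
mean = sum(y) / n
ss = sum((v - mean) ** 2 for v in y)     # sum of squared deviations
print(ss / n)        # FIML-style estimate: 3.6 (biased downwards)
print(ss / (n - 1))  # REML-style estimate: 4.5 (allows for estimating the mean)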
Advantages of REML
• Morrell (1998) found the following in a simulation
study.[1]
• In comparing models with different random effects, REML
Type I error rates tended to be truer than FIML rates.
• REML was especially superior for small group sizes (20
or below) and for more random effects (three/four vs.
fewer).
• REML error rates decreased the more FIML error rates
increased.
• While REML is more realistic and is recommended, especially when the number of groups is small (it may additionally help convergence), we need to be careful when using likelihood-ratio tests
[1] .Morrell, C. H. (1998) Likelihood ratio testing of variance components in the mixed-effects models using
restricted maximum likelihood. Biometrics, 54(4), 1560-1568.
285
REML and testing
• Two models with nested fixed effects can be compared using likelihood-based methods if they are fitted using FIML (not REML).
• In contrast, models with the same fixed-effect structure but different random effects should be compared by fitting with REML.
• This suggests a twofold strategy:
1 Use FIML-based IGLS LRT model comparisons to decide which fixed effects to include in the model.
2 Switch to REML to decide which random effects to include and to obtain final estimates of the random effects.
286
Markov Chain Monte Carlo estimation
- Frequentist approach
- Bayesian approach
- What does MCMC do?
- IGLS versus MCMC
- MCMC in practice in MLwiN
- DIC
Chapter 10 deals with Normal theory models
287
What is a Frequentist approach?
Consider the actual sample that you have as just one out of all the
samples that you could have collected.
Infer from the one sample to a hypothetical population and we use
statistics to guard against inferential error.
Assume that the estimate has a sampling distribution (with a
particular form)
Use the data to estimate the reliability or precision of the estimate,
that is the standard error.
Use the estimate (called the point estimate, because there is only
one) and the standard error to infer about the population.
Example…………
288
What is a Frequentist approach?
IGLS point estimate of the level-2 variance is 94.436, with a SE of 22.1.
Make an assumption that the sampling distribution of this
variance is Normal then we can use theory to give us confidence
intervals
calc b1 = 94.44 + (1.96*22.1)= 137.76
calc b1 = 94.44 - (1.96*22.1)= 51.124
That is we can be 95% confident that the value in the population
lies between 51 and 138
289
Problems of Frequentist approach
Heroic assumption that the sampling distribution of the variance
is Normal (not just the random effects)
But Normal theory for the sampling distribution is based on
asymptotics; whereas in reality we only have 50 districts
Moreover, as a variance should not go negative, we can expect
that it will be positively skewed, not symmetric
Moreover, with just one sample we have no way of checking this
Normality assumption
Unlike the individual random effects where we can examine the
distribution of the estimated residual values, we only have one
estimate of the variance.
290
The Bayesian Approach
• Not repeated samples from hypothetical super-populations.
• But what is the degree of support for a parameter, combining the
researchers beliefs about the parameter and the evidence from the
data?
• The answer is a plot of a distribution which shows the degree of
support for a parameter (see below)
• MCMC obtains and plots this distribution for each and every term that is estimated. Indeed, estimation is about getting this distribution.
• Frequentist: the parameter is conceived as fixed but unknown &
subject to sampling fluctuations.
• Bayes: a parameter is simply another random variable that can be
characterized by its distribution
291
The Bayesian Approach
Full probability modelling
- everything treated as a distribution
- parameters & observations as random
variables
Fundamental theorem: 3 types of distribution
Posterior is proportional to the Prior times the Likelihood
• Prior
- belief about a parameter before seeing the data; subjectivity! Includes diffuse or flat priors
• Likelihood
- estimate of parameter based on data and assumptions
• Posterior
- updated evidence after combining prior and likelihood; distribution, not just a
point estimate and SE’s based on asymptotics
In practice
- evaluating the joint posterior directly is intractable because of its complex hyperspace when several (maybe 100s of) parameters need evaluating; but MCMC makes it feasible
292
What can we do with MCMC?
• Simulation technique for evaluating the joint
posterior by iterative simulation from the marginal (ie
one at a time) distributions; converges to the
distribution; not to a point estimate
• Allows a building block approach to
estimation; decompose complex
problems into lots of small ones; local
computations on the graph
“decomposition
of complex problems
into smaller components, which
communicate locally to make global
inferences” (Spiegelhalter et al)
• Downside is convergence and time
- lots of diagnostics for non-convergence
293
Fundamentals of MCMC
• Anything about a probability distribution (even one of great
complexity) can be characterized by sampling sufficient draws from it;
eg the mode = most common draw; the 95% tails will be found by
ranking all the draws you get and finding the value that is placed in
the position 2.5% from the bottom and the top
• Draws do not have to be independent. If they are not, you need more draws in a longer run; dependency affects efficiency, not validity; non-independent draws simply learn at a slower rate.
• Need initial estimates to start the simulation; if they are good, the sampler quickly reaches the equilibrium posterior distribution and starts characterizing it.
• If they are bad, the sampler could be exploring a small part of the hyperspace, and it could take many thousands of samples before you ‘drift’ or ‘jump’ to the proper equilibrium and begin characterizing the complete distribution
294
What is MCMC analysis and how does it work?
• Numerical engine for evaluating joint posterior
• Simulates/ samples from marginal distribution in an iterative scheme
• Different types of sampler: for different types of problem; eg Gibbs
Gibbs simulates a new value for each parameter in turn from its
marginal distribution assuming that the current values for the other
parameters are the true values
Example: simplest possible model, 2 Parameters: mean and variance
y_i = β_0 + e_i,   e_i ~ N(0, σ²_e)

We have the following two unknowns: β_0 and σ²_e

We want to know the distribution of the joint posterior for the mean and variance, given the observed values of the vector of the dependent variable, that is:
p(β_0, σ²_e | y)
295
Gibbs MCMC sampling of 2 parameter model
involves cycling between a two-step procedure as there are only
two parameters:
Step 1: update the value of the mean based on the current estimate of the variance:
draw from p(β_0 | y, σ²_e(0)) to get β_0(1)

Step 2: update the value of the variance based on the revised estimate of the mean:
draw from p(σ²_e | y, β_0(1)) to get σ²_e(1)
Process is simply repeated many times using the previously generated
set of estimates to generate the next.
296
Gibbs MCMC sampling of 2 parameter model (continued)
Using non-informative default priors, the marginal distributions at each step
are given by familiar values (following Browne, 2009 Chapter 1)
Step 1:
p(β_0 | y, σ²_e):  draw  β_0 ~ N( Σ y_i / n ,  σ²_e / n )
The MCMC estimate of β_0 is simply a draw from a Normal distribution with mean Σ y_i / n and variance σ²_e / n, where n is the number of observations.
Step 2:
p(1/σ²_e | y, β_0):  draw  1/σ²_e ~ Gamma( 0.001 + n/2 ,  0.001 + Σ ê²_i / 2 )
The MCMC estimate of the precision (1/variance) is simply a draw from a Gamma (non-Normal) distribution with first parameter 0.001 plus half the number of observations, and second parameter 0.001 plus half the sum of the squared residuals.
Coded as MLwiN macro and run for 50 iterations……
297
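The MLwiN macro itself is not reproduced in these notes; the following is a minimal Python sketch of the same two-step Gibbs scheme under diffuse Gamma(0.001, 0.001) priors on the precision, run on simulated data.

import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=80, scale=30, size=500)   # simulated response
n = len(y)

beta0, sigma2_e = y.mean(), y.var()          # starting values
draws = []
for it in range(5000):
    # Step 1: draw the mean given the current variance
    beta0 = rng.normal(y.mean(), np.sqrt(sigma2_e / n))
    # Step 2: draw the precision given the current mean, then invert to a variance
    resid_ss = np.sum((y - beta0) ** 2)
    precision = rng.gamma(0.001 + n / 2, 1.0 / (0.001 + resid_ss / 2))
    sigma2_e = 1.0 / precision
    draws.append((beta0, sigma2_e))

kept = draws[500:]                           # discard a burn-in of 500
print(np.mean([b for b, v in kept]), np.mean([v for b, v in kept]))

The posterior means should be close to the simulated values (a mean of about 80 and a variance of about 900).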
Exploring the joint bivariate posterior distribution: 50 and 5000 draws
A very good simulation: continuously and rapidly exploring the bivariate space; no distinctive patches of colour suggesting that the sampler is getting stuck in one region of the graph
298
Marginal plot of 5000 MCMC draws
Marginal distributions are well
behaved in that they do not
show any evidence of
multimodality.
Marginal distributions give the degree of support for each parameter
[Histograms of the 5000 draws: Betas (roughly 78 to 84) and Vars (roughly 700 to 950)]
299
What does MCMC do in multilevel models?
What do we want to know? In a two-level, variance components model we
have the following unknowns:
β, u, σ²_u, σ²_e
a joint posterior distribution with many dimensions (notice the vector of u_j's and of β's):
p(β, u, σ²_u, σ²_e | y)
MCMC is a numerical engine for evaluating the joint posterior by simulating from each of the marginal distributions in an iterative scheme, eg Gibbs.
First we assume some starting values for our unknown parameters: β(0), u(0), σ²_u(0), σ²_e(0)
Then simulate from the following conditional distributions in rotation:
p(β | y, u(0), σ²_u(0), σ²_e(0))  to get β(1), then
p(u | y, β(1), σ²_u(0), σ²_e(0))  to get u(1), then
p(σ²_u | y, β(1), u(1), σ²_e(0))  to get σ²_u(1), then finally
p(σ²_e | y, β(1), u(1), σ²_u(1))  to get σ²_e(1)
300
MCMC in practice in MLwiN
1 Use maximum likelihood (IGLS) to get starting estimates
2 Switch to MCMC, use the default (uninformative) priors, and burn in for 500 simulations; throw these away, aiming to reach the equilibrium distribution and get away from the starting values
3 Monitor for 5000 simulations and check for convergence to the distribution: good is “white noise”, poor is “slow drift”; check prospective diagnostics; check retrospective diagnostics (especially the Effective Sample Size: you need at least 500 of these for the key parameters of interest)
4 Increase the monitoring size if an insufficient sample size or lack of convergence (ie substantive trending) is suggested
5 Report the mean or mode and the 95% credible intervals of the parameters, and the DIC
6 Repeat the above with stronger priors to assess sensitivity
301
IGLS versus MCMC
IGLS: fast to compute.  MCMC: slower to compute.
IGLS: deterministic convergence, easy to judge.  MCMC: stochastic convergence, harder to judge.
IGLS: uses MQL/PQL approximations to fit discrete response models, which can produce biased estimates in some cases.  MCMC: does not use approximations when estimating discrete response models, so estimates are less biased.
IGLS: in samples with small numbers of level-2 units, confidence intervals for level-2 variance parameters assume Normality, which may be inappropriate.  MCMC: Normality is not assumed when making inferences for level-2 variance parameters.
IGLS: cannot incorporate prior information.  MCMC: can incorporate prior information.
IGLS: difficult to extend to new models.  MCMC: easy to extend to new models (eg MM, cross-classified) and to functions of parameters (eg ranks).
302
Two-level hierarchical model : estimated in MCMC
All estimates are blue; ie
not converged
clicking on the + button
on the tool bar, brings
up default weak priors
303
Monitor trajectories for each parameter (last 500 shown)
304
MCMC diagnostics
A: The trace of all of the estimates
B: The Posterior distribution
C: The Auto-correlation function
D: The Partial Autocorrelation Function
E: The Monte Carlo standard error
F: Prospective diagnostics
G: The summary statistics of the Posterior distribution
And the Effective sample size
305
“White noise”: trajectories showing good mixing for σ²_u
Posterior
Distribution
+ skew
Prospective
convergence
diagnostics
95% credible
intervals
Posterior mean
ESS = 2751
306
Trajectories showing less-good mixing for β_0
RL diagnostic
suggests longer
run needed
Auto-correlated trace;
“stickiness”; exploring the
parameter space slowly
ESS = 375
307
Very problematic mixing despite 100K!
ESS only 51; highly auto-correlated trace
Still trending at 50k; multimodal distribution
(has not converged to a distribution); due to
very small n; only 11 schools
308
Model comparison table: extra information
Note:
summary information
ESS and potentially asymmetric
credible intervals
More efficient MCMC estimation
• More efficient = a less auto-correlated monitoring chain
• ie a greater ESS for the same monitoring length of chain
• In MLwiN a number of procedures are described in
Browne, W.J. (2009) MCMC Estimation in MLwiN, v2.13. Centre for Multilevel Modelling, University of Bristol.
Accessed through the MCMC options.
Strongly recommended as a minimum: use hierarchical centring (HC) at the highest level of the model
Deviance Information Criterion
• Diagnostic for model comparison
• Goodness-of-fit criterion that is penalized for model complexity
• Generalization of the Akaike Information Criterion (AIC; where the df is known)
• Used for comparing non-nested models (eg the same number of, but different, variables)
• Valuable in MLwiN for testing the improved goodness of fit of a non-linear model (eg logit), because the likelihood (and hence the deviance) is incorrect there
• Estimated by MCMC sampling; on output you get

Bayesian Deviance Information Criterion (DIC)
Dbar       D(thetabar)   pD     DIC
9763.54    9760.51       3.02   9766.56

Dbar:          the average deviance from the complete set of iterations
D(thetabar):   the deviance at the expected value of the unknown parameters
pD:            the estimated degrees of freedom consumed in the fit, ie Dbar − D(thetabar)
DIC:           fit + complexity; Dbar + pD
NB lower values = a better, more parsimonious model
• Somewhat controversial!
Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal
Statistical Society, Series B 64: 583-640.
311
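A minimal sketch of the DIC bookkeeping, plugging in the summary values from the table above (small discrepancies with the slide are rounding):

dbar = 9763.54          # Dbar: mean deviance over the monitored chain
d_thetabar = 9760.51    # D(thetabar): deviance at the parameter means
pd = dbar - d_thetabar  # pD: effective degrees of freedom consumed by the fit
dic = dbar + pd         # DIC = fit + complexity
print(round(pd, 2), round(dic, 2))   # ~3.03 and ~9766.57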
Some guidance on DIC
• any decrease in DIC suggests a better model
• But stochastic nature of MCMC; so, with small difference in DIC you should confirm if
this is a real difference by checking the results with different seeds and/or starting values.
There is more experience with the AIC, and common rules of thumb exist for it.
312
Example: House price dataset
Sequence of Models
• Model 0: Null one-level model; two parameters, a fixed mean and a variance
• Model 1b: Null model but districts as 50 fixed effects (49 dummies & constant); separate estimation of each intercept
• Model 1: Null two-level random-intercepts model; differential intercepts coming from a distribution
• Model 2b: With size−5 in the model, but districts as 50 fixed effects; separate estimation of each intercept
• Model 2: Same basic model but with districts as random effects; differential intercepts coming from a distribution
• Model 3b: Differential district slopes and intercepts as 100 fixed effects; separate estimation of each intercept and slope for each district
• Model 3: Random-slopes model; quadratic variance function at level 2; differential intercepts and slopes coming from a joint multivariate distribution
313
DIC for Sequence of Models
Model                          Nominal DF   Estimated DF   DIC
0: single level                2            2.00           10728.31
1b: 50 district effects        51           50.95          10504.88
1: 2-level random intercepts   51           44.43          10498.74
2b: 1b + Size                  52           51.93          9874.45
2: 1 + Size                    52           45.27          9868.08
3b: 2b + 49 slopes             101          101.16         9843.51
3: + Random Slopes             101          65.24          9807.47
Model 1b:
constant + 49 differential intercepts + level 1 variance = 51
effective parameters.
Model 1: districts as random effects; only 44 parameters as these effects
are estimated as coming from an overall distribution. The nominal 51
parameters are shrunk due to sharing a prior distribution and therefore do
not contribute whole parameters to the parameter count
Model 3: random intercepts & slopes: Lowest DIC; most parsimonious
parameterization: most strongly favoured by this approach
314
Laboratory 9
• Using either the jsp data or the tutorial data, fit a sequence of models of increasing complexity using the MCMC approach. Compare the models using the DIC
diagnostic. Store the models and check that there has
been convergence to a distribution by examining the
trajectories and ensuring that the ESS is greater than 500
for each and every parameter.
315
10: Some final bits for going further
• Sensitizing you to issues and technical language
• Discrete outcome models and Population average versus
Cluster-specific
• Missing data
• Repeated measures data with ‘extra’ serial correlation
• Endogeneity and fixed versus random coefficient models
(Hausman test)
• Age period and cohort models
• Spatial models with ‘extra’ spatial autocorrelation
• Design and Power
• Modelling strategy
• How to write up your results
• What are the benefits of multilevel modelling?
• Lab 10
316
Discrete outcome models: the type of
response
Response | Example | Model
Binary | Yes/No (categorical) | Logit, probit or log-log model with a binomial L1 random term
Proportion | Proportion unemployed | Logit etc. with a binomial L1 random term
Multiple categories: unordered | Travel by train, car, foot | Logit model with a multinomial random term
Multiple categories: ordered | Very happy, quite happy, neutral, unhappy, very unhappy | Logit model with a multinomial random term; proportional-odds assumption
Count | No of crimes in an area | Log model with an L1 Poisson random term; can include an offset
Count | LOS | Log model with an L1 NBD random term; can include an offset
Reading:
Subramanian S V,
Duncan C, Jones K (2001)
Multilevel perspectives on
modeling census data
Environment and Planning
A 33(3) 399 – 417
317
Discrete outcome models: MLwiN
Implementation
• Handles all the models listed on the previous slide
• Must use MCMC, otherwise poor estimates
• Compare models with the DIC
• Ensure the data are properly coded: eg binary 1, 0; ordinal 0, 1, 2, 3
• Click on N to change the distribution
Multivariate model with more than one response with different types
of response: Realcom (free)
http://www.bristol.ac.uk/cmm/software/realcom/
Uses MCMC
318
Discrete outcome models: MLwiN
Implementation EG Binomial Logit
The non-linear transformation of the response prevents predictions going outside the admissible range of 0 to 1
Random terms at level 2 and higher: Normal distribution, as usual
Level 1: binomial random terms
319
Discrete outcome models: VPC and MOR
• In discrete outcome models the level-1 variance is not generally estimable
EG Binomial: set the level-1 variance to the variance of a standard logistic distribution; then

VPC = σ²_u / (σ²_u + 3.29)

Variance   MOR    VPC
0.06       1.25   0.02
0.18       1.50   0.05
0.34       1.75   0.09
0.53       2.00   0.14
320
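A minimal sketch computing the two summaries for given level-2 variances; the VPC formula is the one on the slide, while the MOR expression exp(√(2σ²_u) × Φ⁻¹(0.75)) is the standard Larsen and Merlo formula, added here as an assumption because the slide only tabulates the values.

from math import exp, sqrt, pi

def vpc_logit(sigma2_u):
    # Level-1 variance fixed at that of a standard logistic distribution: pi^2/3 = 3.29
    return sigma2_u / (sigma2_u + pi ** 2 / 3)

def mor(sigma2_u, z75=0.6745):
    # z75 is the 75th percentile of the standard Normal, Phi^-1(0.75)
    return exp(sqrt(2 * sigma2_u) * z75)

for v in (0.06, 0.18, 0.34, 0.53):
    print(v, round(mor(v), 2), round(vpc_logit(v), 2))   # reproduces the table above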
Population average versus Cluster-specific
• Only applies to discrete outcomes
• Different interpretation of the predicted probability from estimated model
• Marginal: population average or GEE estimation:
- is the effect of change in the whole population if everyone’s
predictor variable changed by one unit
• Conditional: cluster specific:
- slope is the effect of change for a particular ‘individual’ of
unit change in a predictor
• Marginal estimates should not be used to make inferences about individuals (the ecological fallacy)
• Conditional estimates should not be used to make inferences about populations (the atomistic fallacy)
•
Population average approach underestimates the individual risk & vice-versa
Subramanian S V and O'Malley AJ. (2010) The futility in comparing mixed and
marginal approaches to modeling neighborhood effects. Epidemiology 21, 4: 475-478
321
Population average versus Cluster-specific
• KJ: Multilevel is fundamental; allows the calculation of both
• Fit multilevel in MLwiN and then use Customised predictions
• A set of scenarios with different higher level variance but same
fixed estimate of logit =-1.5
Between variance   Logit   VPC    Subject-specific median probability   Population-average mean probability
0                  −1.5    0.00   0.18                                  0.18
1                  −1.5    0.23   0.18                                  0.22
3                  −1.5    0.48   0.18                                  0.27
5                  −1.5    0.60   0.18                                  0.30
8                  −1.5    0.71   0.18                                  0.33
20                 −1.5    0.86   0.18                                  0.38
Case 1: no higher-level variance: subject-specific median: the probability of
the disease for a typical individual is 0.18; prevalence of disease in the
population is also 0.18.
Case 6: very large higher-level variance, large differences between individuals
in their propensity for the ‘Yes’ outcome; individual risk remains at 0.18,
while population prevalence is 0.38. WHY?
322
Population average versus Cluster-specific
Simulated 1000 individuals with the same mean on the logit scale of −1.5 and with 5 variances (1, 3, 5, 8, 20)
Transform to probabilities
Positively skewed as cannot
go below 0
Median on the probability
scale = 0.18 (median of logits
= logit of median)
cluster specific risk for typical
individual is unchanging
Mean on the probability scale
changes with variance and
pulled upwards by skewed
data
323
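A minimal sketch of the simulation described above: draw individual logits around −1.5 with increasing between-cluster variance, anti-logit them, and compare the cluster-specific median with the population-average mean.

import numpy as np

rng = np.random.default_rng(0)
fixed_logit = -1.5
for var in (0, 1, 3, 5, 8, 20):
    logits = fixed_logit + rng.normal(0.0, np.sqrt(var), size=1000)  # 1000 individuals, as on the slide
    p = 1 / (1 + np.exp(-logits))       # anti-logit to the probability scale
    # The median stays near 0.18 (median of logits = logit of the median);
    # the mean is pulled upwards by the positive skew as the variance grows
    print(var, round(float(np.median(p)), 2), round(float(np.mean(p)), 2))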
Population average versus Cluster-specific
So what is the point of GEE estimation:?
Upside:
robust to assumptions about the higher level
distribution;
gives correct SE for fixed part;
gives Population average value
Downside:
does not give the higher-level variance;
not extendable to random slopes, 3-level models etc;
does not give, and cannot derive, the cluster-specific estimates
KJ recommends: use the multilevel model; and if concerned about the distributional assumptions for the higher-level variance, fit discrete (non-parametric) random effects, eg NPMLREG in R, Latent Gold, MPlus
324
Chapter 12 Binomial Logistic Model:
Teenage employment in Glasgow
Differentials by Gender and Qualifications and Adult unemployment in the area
325
Population average and cluster specific estimates
Null random intercepts LOGIT Model
CLUSTER SPECIFIC = Medians
Probability of being employed for
median teenager in median district
POPULATION-AVERAGE= Means
Probability of being employed for
mean teenager in mean district
Little difference here as level 2
variance is relatively small
326
Chapter 13 HIV in India: SMR’s
Higher educated females have the lowest risk, across the age-groups
327
HIV in India: some results
Risks for different States relative to living in urban and rural areas nationally.
328
Missingness
• The random-coefficient model allows imbalance and missingness in the outcome
• Based on the Missing At Random assumption (Rubin, D B, 1976, Inference and missing data, Biometrika 63, 581-592). That is, conditional on the predictors in the fixed part of the model, there should be no patterning to the missingness.
• If there is a lot of missing data, or informative dropout (eg in a study of alcohol consumption, the heavy drinkers not showing up for appointments), you can use REALCOM-Impute, which can impute continuous and discrete variables at different levels
• The model of interest is first set up in MLwiN and the variables exported to REALCOM-IMPUTE; the imputed datasets are then returned to MLwiN, where they will be fitted and combined automatically for the specified model of interest.
• http://www.bristol.ac.uk/cmm/software/realcom/imputation.html: free software
• Excellent practical advice on missing data can be found at http://www.lshtm.ac.uk/msu/missingdata/index.html
329
Repeated measures data
Standard two level model deals with this:
y_ij = β_0 x_0ij + β_1 Time_ij + (u_0j x_0ij + e_0ij x_0ij)
Where i is occasion and j is individual
Term in the model | What is shown | Crowder and Hand
Fixed part with linear trend | A generally increasing secular trend of happiness, a rising population average | ‘immutable constant of the universe’
Random effects for individuals | Person 2 is consistently above the trend; Person 1 is consistently below; the heterogeneity between individuals (the gap between them) is increasing with time (the latter needs random slopes) | ‘the lasting characteristic of the individuals’
Random departure for occasion | Individuals have their good and bad days, which seem ‘patternless’; there is no evidence of either individual becoming more ‘volatile’ with time, and one day is unrelated to the next | ‘the fleeting aberration of the moment’
330
Repeated measures data with ‘extra’ serial
correlation
Standard two level model is not
enough: residual time dependence;
today carries over into tomorrow
Chapter 14: extended discussion
of such models: development of
aerobic performance in Madeira
children
331
Fixed versus random
Two versions of the subject-specific repeated measures model
Multilevel random effects:
[u_0j] ~ N(0, σ²_u0);   [e_0ij] ~ N(0, σ²_e0)

Fixed effects: a dummy for each and every person:
[e_0ij] ~ N(0, σ²_e0)
300 children requires 300 separate parameters, not a multilevel
model!
Dominant approach in time-series cross sectional data (econometrics &
comparative political economy is to apply Hausman (1978) specification test and if
significant, then use Fixed effects (>2700 citations Web of Science)
332
Fixed versus random
The drawbacks of Fixed effects model
• Target of inference: cannot generalize to children, only specific child
• No parameter for between child-variance, and does not extend to complex
heterogeneity (random slopes ; between occasion, between child & between
school etc).
• Cannot include any time-invariant variables (eg gender) as all df used up
• Rarely used in educational research as interested in gender, school effects etc.
The perceived advantages of Fixed effects model
• As removed all child–level variability, the estimated slope gives pure longitudinal
‘causal’ effect
• Children become their own controls
• The BIG prize: ‘these statistical models perform neatly the same function as
random assignment in a designed experiment’
333
Fixed versus random : the Mundlak
specification
Age-Period and Cohort: Three types of
‘change’
1 : Age: how outcome develops as individuals develop;
maturation; internal to individuals; eg more conservative
as grow older
2: Period: specific time that outcome is measured; external to
individuals, represent variation over time periods regardless of
age eg 7/11
3: Cohort: group who share the same events with the same
time interval; the baby boomers
The centrality of cohorts for understanding social change was
made by Ryder (1965): successive replacement
The identification problem
Three type of change are conceptually clearly different but
difficulty in practice of distinguishing between them
Age = Period – Cohort
EG a child aged 12 will have been born in 1980 if they
have been measured in 1992.
The ability to disentangle the effects of these different types of time depends on the research design that is used.
Accelerated Longitudinal Design
Multiple sequential age cohorts followed longitudinally,
usually with overlap
Eg follow 7 years until 11; 11 until 15, 15 until 19
Accelerated? 4 years of study get 12 year span of
development by splicing together get one overall
development trajectory
Advantages: very efficient in terms of time, less problem of
attrition; less cost keeping team in field
Age, period & cohort are potentially un-confounded in this
design so that we can assess how developed is a 11 year old
in different cohorts
Madeira Growth Study
3 levels: repeated measures within students within cohorts
[Structure diagram: five age cohorts; students St1 … St100 within each cohort; measurement occasions O1 … O3 within each student]
• Response : 12 minutes runs test; aerobic performance,
cardiovascular endurance
• Accelerated longitudinal design: five age cohorts 8,10 12,14 and
16 year olds studied for 3 consecutive years;
• Cohorts tell us about the stability of effects over time
• Measurement occasions are repeated measures on students and
can tell us about students’ development trajectories.
• Ignore period: with physical growth short-term time effect can
safely be assumed to be zero
Results of 3 level model
Allowing Age by
Sex by Cohort
interactions in
the fixed part
Convergence
for girls
Cohort change
for boys
Changing performance with cohort:
Boys and Girls
Predicted performance for a
14 year old boy and girl
born in different cohorts
Yang (2008) Happiness in USA: APC
model
• Instant Citation classic
• Innovation: used Synthetic cohorts from the General Social
Survey; grouped people in 5 year bands 1895-1900, 1900-5,
….. 1975-1980
• Period 1973 to 2004 in years
• Age: continuous age as at survey
• Multilevel cross classification of people in Period and in
Cohorts
• Overcomes identification problem
Yang (2008) Random-effects APC model
Cohorts
Periods
People
Multilevel proportional odds
model (Ordinal: cumulative
odds) with random effects
Developing APC model to take account
of Geography: Form of model
States
Periods
Cohorts
People
Individual effects Age, Gender,
Education, Income, etc
random effects: 18 Cohorts + 21 Periods + 49 States +
21*49 Periods*States
Spatial models with ‘extra’ spatial
autocorrelation
‘Small-for-Age’ children in Zambia
Decompose the area variation into two parts: the surrounding local area and the ‘pure’ local area; shrinkage brings smoothness
Un-structured spatial effect
Structured spatial effect
344
Spatial Models as a combination of strict hierarchy
and multiple membership (including GWR)
The Adjacency For WinBUGS tool can be used to create the membership in ArcMap: http://www.umesc.usgs.gov/management/dss/adjacency_tool.html
[Map grid of areas A to M]
A person in area A is affected by A (strict hierarchy, SH) and by its neighbours B, C, D (multiple membership, MM)
A person in area H is affected by H (SH) and by its neighbours E, I, K, G (MM)
Multiple membership defined by common boundary; weights as function of
inverse distance between centroids of areas
345
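A minimal sketch (with hypothetical centroids and an assumed adjacency list, not output from the tool above) of turning common-boundary neighbours into row-normalised inverse-distance multiple-membership weights:

import math

# Hypothetical area centroids and common-boundary neighbours for areas A and H
centroids = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1),
             "H": (3, 3), "E": (2, 3), "I": (3, 4), "K": (4, 3), "G": (3, 2)}
neighbours = {"A": ["B", "C", "D"], "H": ["E", "I", "K", "G"]}

def mm_weights(area):
    # Weight each neighbour by inverse centroid distance, then normalise to sum to 1
    inv = {n: 1 / math.dist(centroids[area], centroids[n]) for n in neighbours[area]}
    total = sum(inv.values())
    return {n: round(w / total, 3) for n, w in inv.items()}

print(mm_weights("A"))   # MM weights over B, C, D for someone living in A
print(mm_weights("H"))   # MM weights over E, I, K, G for someone living in H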
Ohio cancer: repeated measures (space and time!)
•
Response:
counts of respiratory cancer deaths in Ohio counties
• Aim:
Are there hotspot counties with distinctive trends? (small
numbers so ‘borrow strength’ from neighbours)
• Structure:
annual repeated measures (1979-1988) for counties
Classification 3: n’hoods as MM (3-8 n’hoods)
Classification 2: counties (88)
Classification 1: occasion (88*10)
•
Predictor:
Expected deaths; Time
• Model
1 Log of the response related to the fixed predictors, with an offset; Poisson distribution for the counts (C1)
2 Two sets of random effects:
- area random effects allowed to vary over time; a trend for each county drawn from the Ohio distribution (C2)
- a multiple-membership set of random effects for the neighbours of each region (C3)
346
MCMC estimation: repeated measures model,
50,000 draws
General
trend
Default
priors
N’hood
variance
Variance
function for
between
county time
trend
347
Respiratory cancer trends in Ohio: raw and modelled
Red: County 41 in 1988; SMR = 77/49 = 1.57
Blue: County 80 in 1988: SMR= 6/19 = 0.31
348
Design Issues I: What sort of design?
• Not SRS, eg 10,000 people in 10,000 communities: within-community variation is confounded with between-community variation
• BUT Multistage design
indeed, highly-clustered samples may be needed when
require sample estimate of higher-level variables
• Lots of studies (especially geographical studies) have
poor higher-level design
eg just use convenient administrative areas
but need theoretically relevant definitions; stratify on types
of place Eg Laura Stoker and Jake Bowers,"Designing Multi-level
Studies: Sampling Voters and Electoral Contexts". 2002. Electoral
Studies. 21(2):235-267.
349
Design Issues II: What sample size?
• General advice: as many as possible at each level!
• Not thousands of people in 10 schools
• From the literature
- Bryk and Raudenbush (1992) HLM; with 60 students in each of 160
-
schools estimate 4 random variables at school level
Paterson and Goldstein (1991) a minimum of 25 individuals in 25
schools; in preference 100 schools New statistical methods for
analysing social structures: an introduction to multilevel models. British
Educational Research Journal, 17: 387-394.
• But target of inference
eg households
350
Free software for sample-size determination
• PinT is a program for Power analysis IN Two-level designs for
determination of standard errors and optimal sample sizes in
multilevel designs with 2 levels)
Snijders T A B and Bosker R J (1993) Standard errors and
sample sizes for two-level research Journal of Educational
Statistics, 18 (1993), 237-259.
Snijders T A B (2001) Sampling , being Chapter 11 in Leyland, A
and Goldstein, H (2001) Multilevel modelling of health
statistics, p159-174 contains an example of PINT
http://stat.gamma.rug.nl/snijders/multilevel.htm#progPINT
• Optimal design software (Raudenbush)
http://sitemaker.umich.edu/group-based/optimal_design_software
• Bill Browne’s software using MLwiN: MLPowSim
• Download from http://seis.bris.ac.uk/~frwjb/esrc.html
351
General modeling strategy for multilevel analysis
• Start by specifying the fixed part, and a variance component at
each level of your analysis
• Examine variance component at the cluster-level for normality,
outliers
• Examine level-1 residuals for normality, heteroscedasticity
• Consider modeling heteroscedasticity at level-1 before
considering complex higher-level heterogeneity
• Consider modeling higher-level heterogeneity as a function of
observed individual predictor variables
• Include cluster-level predictors in the fixed part (as main or
cross-level effects depending upon research question)
• Chapter 13 is a ‘project’; discusses alternative strategies;
implements a bottom-up exploratory and theoretically-driven
approach
352
How to write up your results
• Kromrey, T R. et al (2009) Multilevel modelling: a review
of methodological issues and applications REVIEW OF
EDUCATIONAL RESEARCH, 79: 69-102, developed a
checklist of good practice
• http://www.infoagepub.com/products/content/files/serlinfil
es_multilevel/Chapter%2011/checklist.pdf
• Could the reader answer the following general questions
about the study? 99 questions
• Write it so they can!
353
Final example (BJPS 1992)
• Structure:
3 levels strict hierarchy
individuals within constituencies within regions
• Response:
Voting for labour in 1987
• Predictors
1 age, class, tenure, employment status
2 %unemployed, employment change, % in mining in 1981
• Expectation: coal mining areas vote for the left
• Allow: mining parameters for mining effect(2) to vary over
region(3) in a 3-level logistic model
354
Varying relations for Labour voting and
% mining
355
Conclusions
• Substantive advantages
1 Modelling contextuality and heterogeneity
2 Micro AND macro models analysed simultaneously
-avoids ecological fallacy and atomistic fallacy
3 Social contexts maintained in the analysis; permits
intensive, qualitative research on ‘interesting’ cases
• Technical advantages
1 auto-correlation: expected and modelled
2 heterogeneity: expected and modelled
3 correct assessment of the size of effect for higher-level variables
4 precision-weighted estimation
356
Conclusion
“The complexity of the world is not ignored in
the pursuit of a single universal equation,
but the specifics of people and places are
retained in a model which still has a
capacity for generalisation”
LEMMA:
http://www.ncrm.ac.uk/nodes/lemma/about.php
357
Lab 10: Your Choice
• Chapter 11 Three-level Model Changing School
performance in London schools, project bringing
together everything we have done on Normal-theory
response: data is ileaschools.ws; start p211
• Chapter 12 Logistic modelling: teenage employment of
Glasgow; data is employ.ws, start p5
• Chapter 13. HIV in India: Poisson, and a NBD model;
data is Hivcounts, start p35
• Chapter 14. Longitudinal analysis of repeated measures data: development of children in Madeira; data is Endurancewide.wsz; start p120 (skip theory!)
• Chapter 16. The analysis of spatial and space-time
models: low birthweight babies in S Carolina, cancer
deaths in Ohio
358
Task 5 District level price-type relationship
359
Some Answers
Task 4
Fat
Age-46
360
Mean and district differentials for base
group
Mean
District Differentials
β_0 + u_0j;   β_0;   β_0 − u_0j
361
JSP Model 1: two-level null random intercepts
Some questions:1
What does 30.501 represent?
And 4.861; and is it
significantly different from
zero?
And 39.42?
Does it appear that student
achievement varies
between schools?
What is overall mean?
30.5
What is total variation around
this mean 4.861 + 39.42 =
44.28
What proportion of the variance
is at the school level?
4.861/ (4.861 + 39.42) =
11%
What are the likely bounds of variation on schools?
Assuming Normality, 95% of schools lie within 30.5 ± 1.96 × sqrt(level-2 variance); that is, between 34.8 and 26.2
362
School residuals and ranks (null random-intercepts model)

School  Residual    Rank      School  Residual    Rank
1       -3.7826     2         25       0.67528    30
2       -0.49765    20        26      -0.067084   23
3        1.5372     40        27      -3.8734     1
4       -1.7136     9         28      -2.0139     6
5        0.87883    32        29       0.38381    28
6        0.037677   24        30       4.0935     48
7        1.2816     37        31      -0.85249    16
8       -0.10038    22        32      -1.0178     12
9        0.77229    31        33       2.2921     44
10      -0.65464    18        34       0.082354   25
11      -0.88464    15        35       3.6548     47
12      -0.83096    17        36       0.97511    34
13      -1.3033     10        37       1.3421     38
14       2.4153     45        38      -0.40763    21
15      -1.8671     8         39      -3.3689     4
16       0.92097    33        40       1.8217     42
17      -0.57512    19        41      -1.9047     7
18       1.2206     35        42       0.34291    26
19       0.49685    29        43      -3.4804     3
20      -2.5208     5         44       1.7519     41
21      -1.2670     11        45       2.2007     43
22      -0.99642    13        46      -0.93598    14
23       2.6617     46        47       0.36467    27
24       1.2658     36        48       1.4468     39

Some questions: 2
What is the highest-achieving school; what does a pupil on average achieve there?
What is the lowest-achieving school; what does a pupil on average achieve there?
The extremes ‘add’ and ‘take away’ about 4.1 and −3.9, corresponding to the likely bounds
363
Some questions:3
What is the high and low achieving schools?
Click in the graph and use the Identify points tab
What determines confidence band: click on one
with wide band and look at hierarchy
364
Variations in attainment between schools
365
JSP Model 2
Some questions:4
What do the estimates represent
30.265
(remember Math3 is centred at 25)
0.604
the general rate of progress across all schools
3.975
is there still variation between schools after taking account of input
28.349
Partitioning the variance
A: Original Variance from null model: 4.861 + 39.42 = 44.28
B: Total residual variance from Model 2: 3.975 + 28.349 = 32.324
C: Proportion of the original variance accounted for by ‘math3’
(A − B)/A = (44.28 − 32.324)/44.28 = 27% accounted for
D: Proportion of the remaining variance still to be accounted for at the school level
3.975/(3.975 + 28.349) = 12 per cent
What are the likely bounds of variation on schools? Assuming Normality,
95% of schools lie within 30.3 ± 1.96 × sqrt(level-2 variance); that is, between 34.2 and 26.4
That is, the typical child starts at 25 and progresses typically to 30.3, but in the top 2.5% of schools progresses to 34.2; in the bottom 2.5% only to 26.4
Notice that for 1 extra parameter the deviance has decreased from 6263 to 5952
366
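The same arithmetic as a small script (variance estimates taken from the null model and Model 2 above), to make the steps explicit:

from math import sqrt

null_l2, null_l1 = 4.861, 39.42     # null model: school-level and pupil-level variance
m2_l2, m2_l1 = 3.975, 28.349        # Model 2 (Maths3 as predictor)

total_null = null_l2 + null_l1                    # ~44.28
total_m2 = m2_l2 + m2_l1                          # ~32.32
print((total_null - total_m2) / total_null)       # ~0.27 of the variance explained by Maths3
print(m2_l2 / total_m2)                           # ~0.12 still at the school level
print(30.3 - 1.96 * sqrt(m2_l2), 30.3 + 1.96 * sqrt(m2_l2))   # ~26.4 to ~34.2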
School   RI Null   RI Null Rk   RI M3    RI M3 Rnk
1        -3.78     2            -2.62    5
2        -0.50     20           -0.21    23
3         1.54     40            0.88    36
4        -1.71     9            -2.03    7
5         0.88     32           -0.22    22
6         0.04     24           -0.13    25
7         1.28     37            1.56    40
8        -0.10     22            0.26    27
9         0.77     31           -1.88    8
10       -0.65     18            0.75    34
11       -0.88     15           -0.77    12
12       -0.83     17           -0.14    24
13       -1.30     10           -0.76    13
14        2.42     45            0.69    33
15       -1.87     8            -0.56    17
16        0.92     33            0.45    31
17       -0.58     19           -0.28    20
18        1.22     35            1.01    38
19        0.50     29           -0.44    19
20       -2.52     5            -3.06    2
21       -1.27     11           -1.57    9
22       -1.00     13           -0.58    16
23        2.66     46            2.93    44
24        1.27     36            1.80    42
25        0.68     30            0.99    37
26       -0.07     23            0.34    28
27       -3.87     1            -2.88    3
28       -2.01     6            -3.37    1
29        0.38     28            1.38    39
30        4.09     48            3.05    47
31       -0.85     16            0.42    30
32       -1.02     12           -1.37    10
33        2.29     44            3.02    46
34        0.08     25           -0.64    14
35        3.65     47            2.95    45
36        0.98     34            0.83    35
37        1.34     38            3.20    48
38       -0.41     21           -1.33    11
39       -3.37     4            -2.58    6
40        1.82     42            2.10    43
41       -1.90     7            -0.45    18
42        0.34     26           -0.23    21
43       -3.48     3            -2.83    4
44        1.75     41            0.41    29
45        2.20     43            1.68    41
46       -0.94     14           -0.62    15
47        0.36     27            0.16    26
48        1.45     39            0.67    32
Some questions:5
After taking account of pre-test, what is the school with the most progress, and the least?
Values of 3.2 -3.3 compared to ?
What has happened to particular schools and why?
School 6?
School 9?
School 10?
367
The caterpillar plot
Some questions: 6
After taking account of pre-test, what can you say about
the majority of schools?
And hence league tables?
368
Varying relations plot
369
JSP Model 3: fully random model at level 2
Some questions:1
What do the estimates represent
30.23
0.61
4.638
-0.348
0.035
27.206
Do you anticipate fanning in or out?
370
Caterpillar plots
Some questions: 2
What does the top graph show? And the bottom?
(Intercepts; slopes)
371
Sch   RI      RI Rnk   RS      RS Rnk
1     -2.07   8         0.10   35
2      0.15   26       -0.04   18
3      0.55   33       -0.01   22
4     -2.44   5         0.21   45
5     -0.24   22        0.02   26
6      0.34   29       -0.07   17
7      1.29   37       -0.07   16
8      0.07   25        0.03   27
9     -3.13   3         0.23   46
10     0.48   32       -0.01   21
11    -1.04   14        0.11   37
12     0.31   28       -0.08   15
13    -0.75   15        0.06   32
14     1.26   36       -0.10   13
15    -0.65   17        0.07   33
16    -0.08   24        0.03   28
17    -0.20   23        0.01   25
18     1.49   39       -0.14   10
19    -0.59   18        0.05   30
20    -3.02   4         0.18   43
21    -1.79   9         0.14   39
22    -1.25   12        0.17   42
23     3.14   45       -0.24   3
24     2.42   42       -0.23   5
25     1.24   35       -0.12   11
26     0.21   27        0.01   24
27    -2.42   7         0.15   40
28    -3.71   1         0.24   47
29     1.31   38       -0.09   14
30     3.39   47       -0.21   7
31     0.36   30       -0.02   20
32    -1.23   13        0.05   31
33     2.95   44       -0.21   6
34    -1.49   10        0.17   41
35     3.40   48       -0.24   2
36     1.16   34       -0.11   12
37     3.16   46       -0.27   1
38    -1.34   11        0.08   34
39    -2.43   6         0.18   44
40     2.51   43       -0.20   8
41    -0.30   21        0.00   23
42    -0.50   20        0.05   29
43    -3.59   2         0.34   48
44     0.45   31       -0.03   19
45     2.40   41       -0.24   4
46    -0.74   16        0.10   36
47    -0.57   19        0.12   38
48     1.53   40       -0.16   9
Some questions: 3
What does a pupil with a score of 25 achieve in school 8? 30.23 + 0.07
In school 28? 30.234 − 3.7
In school 35? 30.234 + 3.4
What does a pupil with a score of 35 achieve in school 8? 30.23 + 0.07 + 10 × (0.61 + 0.03)
In school 28? 30.234 − 3.7 + 10 × (0.61 + 0.239)
In school 35? 30.234 + 3.4 + 10 × (0.61 − 0.242)
372
Varying relations plot
373
Lab 5: Analysis of JSP data: Modelling
heterogeneity at Level 2
• Return to the jsp data and draw graphs of the level-2 and level-1 variance functions; pictures of the between-school and between-pupil populations, and the VPC, for
• Model 1: null random-intercepts model
• Model 2: random-intercepts model with Maths3 as predictor
• Model 3: random intercepts and slopes model; quadratic variance function at level 2
• New model: linear function for the between-school heterogeneity
374
JSP Data : Model #1, Variance Components
JSP Data : Model #2, Random Intercepts
JSP Data : Model #3, L2 Random QUADRATIC
Var(e_0ij x_0ij) = σ²_e0 x²_0ij = 27.206
Var(u_0j x_0ij + u_1j x_1ij) = σ²_u0 x²_0ij + 2σ_u01 x_0ij x_1ij + σ²_u1 x²_1ij
 = 4.638 − 2(0.348) Maths3_ij + 0.035 Maths3²_ij

JSP Data : Model #3, L2 Random QUADRATIC
Fit = β_0 x_0ij + β_1 x_1ij = 30.23 Cons_ij + 0.612 Math3_ij
95% CI = Fit ± 1.96 × sqrt(σ²_e0 x²_0ij) = Fit ± (1.96 × 5.216) = Fit ± 10.22

JSP Data : Model #3, L2 Random QUADRATIC
Fit = β_0 x_0ij + β_1 x_1ij = 30.23 Cons_ij + 0.612 Math3_ij
95% CI = Fit ± 1.96 × sqrt(σ²_u0 x²_0ij + 2σ_u01 x_0ij x_1ij + σ²_u1 x²_1ij)
 = Fit ± 1.96 × sqrt(4.638 − 2(0.348) Maths3_ij + 0.035 Maths3²_ij)

JSP Data : Model #3, L2 Random QUADRATIC
VPC = (σ²_u0 + 2σ_u01 x_1ij + σ²_u1 x²_1ij) / (σ²_u0 + 2σ_u01 x_1ij + σ²_u1 x²_1ij + σ²_e0)
 = (4.638 − 2(0.348) Maths3_ij + 0.035 Maths3²_ij) / (4.638 − 2(0.348) Maths3_ij + 0.035 Maths3²_ij + 27.206)
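A minimal sketch that evaluates the Model 3 level-2 variance function and VPC above across the centred Maths3 range (roughly −25 to +15, since Maths3 is centred at 25):

sigma2_u0, sigma_u01, sigma2_u1 = 4.638, -0.348, 0.035   # Model 3 level-2 estimates
sigma2_e0 = 27.206                                       # level-1 variance

def level2_variance(x):
    # Quadratic variance function in centred Maths3 (x)
    return sigma2_u0 + 2 * sigma_u01 * x + sigma2_u1 * x ** 2

def vpc(x):
    v2 = level2_variance(x)
    return v2 / (v2 + sigma2_e0)

for x in (-20, -10, 0, 10, 15):
    print(x, round(level2_variance(x), 2), round(vpc(x), 2))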
JSP Data : Model #3, L2 LINEAR & L1 CONSTANT
Doesn’t converge
Laboratory 6
• Return to the jsp data and draw graphs of the level-2 and level-1 variance functions, the confidence bounds of the population, and the VPC for
• Model 4a: linear function at level 1 and linear function at level 2
• Model 4b: quadratic function at level 1, linear at level 2
• Model 4c: quadratic function at level 2, linear at level 1
• Model 4d: fully random, quadratic at level 1 & level 2
• Test to see if the more complex model is an improvement over
the simpler using the deviance statistic; and test whether all
level 2 variance terms are needed using the contrast matrix
approach
385
JSP Data : Model #4a, Fully Random LINEAR L1&2
JSP Data : Model #4b, Fully Random LIN2 & QUAD1
JSP Data : Model #4c, Fully Random QUAD2 & LIN1
JSP Data : Model #4d, Fully Random QUAD1 & 2
Task 7: Mean and district differentials
Terrace:  β_0 + β_1;   β_0 + β_1 + u_0j + u_1j;   β_0 + β_1 − u_0j − u_1j
Semi:     β_0;         β_0 + u_0j;                β_0 − u_0j
394
LAB 7 A: base random intercepts model: main effects of prior ability at pupil level
Clearly there are
significant differences
between the subsequent
progress of the three
Vrbands. There are
significant differences
between schools after
taking account of banded
pupil ability on entry.
395
LAB 7 B: New base model, complex variance function
at level 2 & level 1
The greatest variation in progress is for the
middle-ability pupil, and the greatest
between-school variation is for those of
highest ability on entry; however there are
wide CI’s on these estimates.
LAB 7C: Complex variance level 1&2: separate coding
The variance function results are the same as
Model B, but are given directly. This formulation
is helpful in comparing differential school
performance eg the correlation between the
school effects of a VR1 and a VRB3 pupil is
only 0.37; a school could be effective for one
type of pupil but not another. The best
performing school (53) for the high ability pupils
is only average for low ability.
LAB 8 A: base random intercepts model: main effects of prior ability at pupil level
Clearly there are
significant differences
between the subsequent
progress of the three
Vrbands. There are
significant differences
between schools after
taking account of banded
pupil ability on entry.
398
LAB 8 B: main effects of prior ability at pupil and
school level
There is a significant effect of school pupil
ability over and above individual ability. The
introduction of the level 2 variables has
accounted for some of the unexplained
variation, but only between schools.
LAB C: New base model, complex variance function
at level 2 & level 1
The greatest variation in progress is for the
middle-ability pupil, and the greatest
between-school variation is for those of
highest ability on entry; however there are
wide CI’s on these estimates.
LAB 8 D: Cross-level interactions
There appears to be an overall peer group
effect that is not differential for pupils of
different prior ability.
401
Lab 8 E: cross-level interactions school heterogeneity
& prior ability
Rather striking results: high ability make
more progress when there are diverse
children around them; lower ability make
worse progress when schools are diverse!
Are they interpretable?
402