Data Screening - Structural Equations

SE Modeling When Some
Response Variables are
Categorical:
The special case of binary
(dichotomous) variables
In Amos, models involving categorical
responses require the use of the Markov chain
Monte Carlo (MCMC) option.
1
In Amos, models involving categorical responses require
the use of the MCMC option. This option is currently not
available in the free student version of Amos and is only
found in Amos 6 and higher versions.
To learn more about Bayesian and MCMC approaches
to SEM and how they are implemented in Amos, refer to
the tutorial at www.structuralequations.org entitled
“Intro to Bayesian SEM and MCMC”.
2
The special case of binary
(dichotomous) variables
3
Example Data
For this example, we will look at a
simple regression so as to focus our
attention on the most fundamental
issues. We have one continuous
predictor (flood_level), the level of
flooding in a coastal marsh (in meters),
and one binary response (mass_class),
which records whether the standing
biomass of vegetation is in the range of
mature communities (signified with the
letter b) or is low, as is typical of
disturbed or stressed conditions for
these communities (signified with the
letter a).
4
About the coding of Data
Amos allows you to code your categorical outcomes using
either letters (a vs b) or numbers (e.g., 0 vs 1). Note,
however, that you must recode the data for your categorical
response variable (as explained on slides 9 and 10) for
BOTH kinds of data. More on this in a few slides.
5
The Problem of Analyzing Binary Data
Fitting a straight line through
such a set of points represents
the points quite poorly and leads
to illogical extrapolations, such as
predicted values (including
intercepts) > 1 or < 0. It also
violates assumptions about
normality of residuals. What we
need is a way to interpret binary
outcomes that makes sense.
Often this is accomplished by
assuming that behind the binary
outcomes lies a continuous
probability of observing a 1 or 0
response, as shown on the next
slide.
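The problem with a straight-line fit can be seen in a minimal R sketch. The data here are simulated purely for illustration (they are not the tutorial's data set), and the coefficients used to generate them are assumptions.

```r
# Hypothetical data, simulated only for illustration
set.seed(1)
flood_level <- runif(100, min = 0, max = 1)          # made-up predictor values
p_true      <- pnorm(1.4 - 3.9 * flood_level)        # made-up true probabilities
y           <- rbinom(100, size = 1, prob = p_true)  # observed 0/1 outcomes

# Ordinary straight-line fit to the 0/1 outcomes
fit <- lm(y ~ flood_level)

# Predictions at and beyond the edges of the observed predictor range
new_x <- data.frame(flood_level = c(-0.5, 0, 0.5, 1, 1.5))
pred  <- predict(fit, newdata = new_x)
print(cbind(new_x, pred))
```

The predictions at the extremes fall outside the 0 to 1 range, which is not meaningful for a probability.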
6
Models of Binary Responses
Two of the most common ways
of representing the probability of
observing a 1 or 0 outcome are
the probit and logit models.
For the probit model, we link our
predictor to our responses using
a cumulative normal probability
function, as shown to the left.
With the logit model, we link our
predictor to our responses using
the logistic function, which is the
inverse of the log transformation of
the odds (the ratio of the
probabilities of the two outcomes).
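As a minimal sketch, the two functions can be compared in R using the base functions pnorm() (cumulative normal) and plogis() (logistic). The range of values for the linear predictor, called eta here, is an arbitrary choice for illustration.

```r
# Linear predictor values (hypothetical range, for illustration only)
eta <- seq(-4, 4, by = 0.5)

# Probit model: probability of a 1 is the cumulative normal of eta
p_probit <- pnorm(eta)

# Logit model: probability of a 1 is the logistic function of eta,
# i.e. the inverse of the log-odds, log(p / (1 - p))
p_logit <- plogis(eta)

# Each link function, applied to its probabilities, recovers eta
probit_ok <- isTRUE(all.equal(qnorm(p_probit), eta))
logit_ok  <- isTRUE(all.equal(log(p_logit / (1 - p_logit)), eta))

print(data.frame(eta, p_probit, p_logit))
```

Both curves are S-shaped and bounded by 0 and 1; they differ mainly in the scale of the coefficients needed to produce the same curve.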
7
Linking Continuous Probabilities to Binary
Responses
In SEM, the probit model is
generally preferred because it
has a slightly more natural
interpretation. Amos uses a
probit model for categorical
outcomes.
The continuous underlying
probability is connected to
observed outcomes by
thresholds, in this case a single
threshold, below which we see a
response of 0 and above which
we see a response of 1.
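The threshold idea can be sketched in R as follows. The data are simulated, and the intercept, slope, and sample size are assumptions made only for illustration; the names ystar and tau are hypothetical labels for the underlying continuous variable and the threshold.

```r
# Hypothetical data, simulated only for illustration
set.seed(2)
x     <- runif(200)                     # made-up predictor
ystar <- 1.4 - 3.9 * x + rnorm(200)     # underlying continuous variable, error variance 1
tau   <- 0                              # single threshold

# Observe a 1 when the underlying variable is above the threshold, otherwise a 0
y <- as.integer(ystar > tau)
print(table(y))

# Under this setup, P(y = 1 | x) = P(ystar > 0) = pnorm(1.4 - 3.9 * x)
p_one <- pnorm(1.4 - 3.9 * x)
```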
8
How Amos Handles Binary Responses
Probit modeling is specified by selecting the option
“Allow non-numeric data”. You MUST do this to get
appropriate results.
9
How Amos Handles Binary Responses
The next step is to select “Tools”
and then “Data Recode”.
You then have to highlight your
categorical response variable
and, from among the dropdown
options, choose the Recoding rule
“Ordered-categorical”.
Amos will show the frequency of
different responses in the data
set, along with the default
recoding it chooses.
If you select “Details”, you can
see (and alter) the threshold
specified for the probit model (in
this case it will be 0).
10
Reminder!
If your categorical outcomes are represented in your data
using numbers (e.g., 0 vs 1), you still must recode the data
for your categorical response variable in Amos as shown on
the previous slide (by selecting the Recoding rule "Ordered-categorical").
11
If you selected “Details”, you can see and alter the threshold specified for the
probit model (in this case it will be 0). Refer to the Amos manual for a more
comprehensive illustration of how and why you might change the thresholds.
12
The Amos manual has a clear presentation of the steps
involved in setting up models with categorical outcomes.
You can also find a helpful video at www.amosdevelopment.com:
select the Site Map -> Videos, then
“Ordered-Categorical: Recoding the Data”.
Important Note: The video focuses on the case of
multinomial responses (more than two outcomes), where
model identification does not have to be addressed. So it is
best to follow my example closely if you have only two
outcomes.
13
Example of Binary Outcome Regression
Note that in Amos, you are
required to fix the value of one
parameter in order to achieve
identification when working
with binary outcomes. This is
not necessary when working
with categorical variables
having more than two levels
(e.g., a, b, and c). Here we set
the variance of e1 to 1.0,
which allows us to solve for
the other free parameter
values.
Recall that to set the variance of e1 to 1.0,
right click on e1, and select "Object Properties".
Then select the "Parameters" tab and put a
value in for "Variance", typically the value of 1.
The "Mean" for e1 will usually be set at 0.
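Why one parameter must be fixed can be seen in a small R sketch (this is not Amos output). If the intercept, the regression weight, and the standard deviation of the error are all multiplied by the same constant, the implied probabilities are unchanged, so the binary data cannot distinguish between the two sets of values. The function name prob_one and the x values are hypothetical, and the threshold is assumed to be 0.

```r
# Hypothetical predictor values, for illustration only
x <- c(0.1, 0.3, 0.5, 0.7)

# P(y = 1 | x) = P(a + b*x + e > 0), with e ~ Normal(0, sigma^2) and threshold 0
prob_one <- function(a, b, sigma, x) pnorm((a + b * x) / sigma)

# Error standard deviation 1 versus everything doubled
p1 <- prob_one(a = 1.416,     b = -3.885,     sigma = 1, x = x)
p2 <- prob_one(a = 2 * 1.416, b = 2 * -3.885, sigma = 2, x = x)

same_probs <- isTRUE(all.equal(p1, p2))
print(same_probs)
```

Fixing the variance of e1 at 1.0 picks out one of these equivalent solutions.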
14
Example of Binary Outcome Regression (cont.)
For this kind of analysis, you
must select “Estimate means
and intercepts” in the Analysis
Properties estimation tab. Also,
you must use MCMC
estimation.
[Figure: Analysis Properties; MCMC Estimation]
15
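The regression weight of -3.885 specifies the effect of a one-unit change in flood_level on
the underlying continuous (probit-scale) variable that determines the probability of observing mass_class “b”.
Results: note that no value is reported for the variance of e1, since it was fixed
to a value of 1.0.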
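As a rough point of comparison outside Amos, a probit regression of the same form can be fit in R by maximum likelihood with glm(). This is a different estimator from the MCMC approach used in Amos, and the data below are simulated under assumed coefficients, so the numbers will not reproduce the Amos results. Like the Amos setup, the probit link in glm() implicitly takes the error variance to be 1.

```r
# Hypothetical data, simulated only for illustration (not the tutorial's data set)
set.seed(4)
n           <- 500
flood_level <- runif(n, min = 0.1, max = 0.7)
ystar       <- 1.4 - 3.9 * flood_level + rnorm(n)    # underlying variable, error variance 1
mass_class  <- factor(ifelse(ystar > 0, "b", "a"))   # "b" above the threshold, "a" below

# Maximum likelihood probit regression; for a factor response, glm() treats
# the first level ("a") as failure and the other level ("b") as success
fit <- glm(mass_class ~ flood_level, family = binomial(link = "probit"))
print(coef(fit))
```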
16
Posterior Distribution of Regression Weight
17
Standardized Parameters
When working with continuous response variables, we often
want to obtain standardized parameters, particularly:
- standardized path coefficients
- standardized error variances (typically expressed as R^2 values).
Standardized path coefficients:
For both continuous and categorical response models, the
relationship for standardized coefficients is
std coeff = unstd coeff * (sdx / sdy).
(eqn 1)
We can obtain such estimates from Amos by requesting
them and then going to "Additional Estimands" and
selecting the Standardized Effects (see next page).
18
Standardized Parameters (cont.)
19
Standardized Parameters (cont.)
For Background Information:
Standardized path coefficients for categorical response
models:
For categorical response models, the novel question is: how
do we get a standard deviation estimate for y? The obvious
choice (the standard deviation of the observed 0s and 1s) is
not appropriate because in probit modeling we are modeling
the responses of an underlying continuous probability
function (the probability of getting a 0 or 1). If y = the
observed 0s and 1s, let y* = the underlying probability of
0s and 1s. Thus, the question is how we got the estimates
for the standard deviation of y* shown on the previous
slide.
20
Standardized Parameters (cont.)
Standardized path coefficients for categorical response
models: (cont.)
The formula for the sd of y* is
sd(y*) = SQRT[(unstd_beta^2 * VAR(x)) + VAR(error of y)]
(eqn 2).
Typically, the error variance in probit models is set to a
value of 1.0. In Amos, we set the error variance to 1.0 on
slide 14 in this example, so VAR(error of y) = 1. We therefore
just need the unstd_beta and the variance of x, which come
from the results (page 16) and are shown again on the next
page.
21
Standardized Parameters (cont.)
Using eqn 2 to get sd(y*):
sd(y*) = SQRT[(unstd_beta^2 * VAR(x)) + VAR(error of y)]
= SQRT[((-3.885)^2 * 0.017) + 1]
= 1.121
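The arithmetic for eqn 2 can be checked with a few lines of R, using the values reported in the text. The variable names are chosen here for illustration.

```r
unstd_beta <- -3.885   # regression weight from the Amos results
var_x      <- 0.017    # variance of flood_level from the Amos results
var_error  <- 1.0      # variance of e1, fixed at 1.0 for identification

# eqn 2: standard deviation of y*
sd_ystar <- sqrt(unstd_beta^2 * var_x + var_error)
print(round(sd_ystar, 3))   # 1.121
```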
22
Standardized Parameters (cont.)
Now, we get the std coefficient
std_beta = unstd_beta * (sdx / sdy*).
(eqn 1)
where unstd_beta = -3.885
sdx = SQRT(VARx) = SQRT(0.017) = 0.130
sdy* = 1.121 (from previous page)
std_beta = -3.885 * (0.130 / 1.121) = -0.45 (which is, with
rounding, the same estimate we obtained directly from Amos on page 19).
We interpret this coefficient as follows: "If we were to
increase x by one standard deviation, y* (the underlying
continuous variable behind the probability of seeing a 0 or 1)
would decline by 0.45 of a standard deviation."
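The same calculation for eqn 1 can be written as a short R sketch, again using the values reported in the text (variable names are illustrative). The last line also squares the standardized coefficient, which in a simple regression gives the proportion of variance in y* that is explained.

```r
unstd_beta <- -3.885                      # regression weight from the Amos results
var_x      <- 0.017                       # variance of flood_level from the Amos results

sd_x     <- sqrt(var_x)                   # about 0.130
sd_ystar <- sqrt(unstd_beta^2 * var_x + 1)  # eqn 2, error variance fixed at 1

# eqn 1: standardized coefficient
std_beta <- unstd_beta * sd_x / sd_ystar
print(round(std_beta, 2))                 # -0.45
print(round(std_beta^2, 2))               # 0.2
```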
23
Calculating R-square for Categorical Responses
Calculating R^2 for categorical response models is a complex topic. While
logistic and probit models give very similar results for many parameters,
they give very different results when we are estimating R^2 values. Why, you
should ask? Putting it simply, logistic models draw inferences on
the observed 0s and 1s, while probit models draw inferences on
the underlying latent probability function.
The recommended procedures for logistic and probit results are
different. In the logistic case, there is a desire to have an R^2 that
corresponds with the observed success of classification for the 0s and
1s. Note that this is a nonlinear function. In the probit case, the R^2
operates as if we have estimated the actual underlying continuous
probabilities and wish to explain them using a linear model. In the
following pages, I show the calculation for the probit case. Michael
Anderson has developed an R script for the logistic case that I intend to
post soon, because it can be useful as well if one wants to emphasize
the success in classifying/predicting 0s and 1s from your model.
24
Calculating R-square for Categorical Responses:
Implementation
At the present time, Amos does not provide R^2 values for its
MCMC models. Here I describe how to calculate the R^2 for
our categorical response variable. Strictly speaking, what we
obtain is a "pseudo-R^2" for our ability to explain variance in
y*. To do this, we recognize that the predicted latent values of y*
come from the prediction equation. So, the predicted
values of y* are
y* = intercept + unstd_beta * x
(eqn 3)
where intercept = 1.416 (intercept of mass_class on p 22)
unstd_beta = -3.885
x = flood_level
25
Calculating R-square for Categorical Responses
explained sums of squares = ESS
ESS = SUM[(y* - y*bar)^2],
(eqn 4)
where y*bar is the mean of the predicted y* values,
and for the R^2,
R^2 = ESS / (ESS + N),
(eqn 5)
where N = sample size (see note).
Note: the sample size, N, winds up being used for the value of the unexplained
SS because the error variance has been set to 1 (in general, VAR = SS/N, so
SS = VAR*N and when VAR = 1, SS = N). The denominator for eqn 5 is the total
SS, the sum of the explained and error SS.
26
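Implementing the R^2 Calculation
A short script in the R language that implements
eqns 3, 4, and 5 so as to calculate R^2 is:
# flood_level is the numeric vector of observed predictor values
ystar <- 1.416 - 3.885 * flood_level    # predicted y* (eqn 3)
ystar_ave <- mean(ystar)
dev <- ystar - ystar_ave                # deviations from the mean
dsqr <- dev^2
sumsqr <- sum(dsqr)                     # explained sums of squares (eqn 4)
n <- length(flood_level)                # sample size, N
rsquare <- sumsqr / (sumsqr + n)        # eqn 5
In this case, the value for R^2 is 0.20.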
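For reuse, the calculation in eqns 3 to 5 can be wrapped in a small function. The sketch below is one way to do that; the function name probit_r2 is hypothetical, and the flood_level values are simulated for illustration rather than taken from the tutorial's data. It also checks that the result equals the square of the standardized coefficient from eqn 1 when the variance of x is computed with N (not N - 1) in the denominator.

```r
# Pseudo-R^2 for y* when the error variance is fixed at 1 (eqns 3 to 5)
probit_r2 <- function(x, intercept, unstd_beta) {
  ystar <- intercept + unstd_beta * x       # eqn 3: predicted y*
  ess   <- sum((ystar - mean(ystar))^2)     # eqn 4: explained sums of squares
  ess / (ess + length(x))                   # eqn 5: error SS equals N
}

# Hypothetical predictor values, simulated only for illustration
set.seed(3)
flood_level <- rnorm(150, mean = 0.4, sd = 0.13)
r2 <- probit_r2(flood_level, intercept = 1.416, unstd_beta = -3.885)

# Same quantity via the standardized coefficient, using VAR(x) = SS/N
var_x_n  <- mean((flood_level - mean(flood_level))^2)
std_beta <- -3.885 * sqrt(var_x_n) / sqrt(3.885^2 * var_x_n + 1)
matches  <- isTRUE(all.equal(r2, std_beta^2))
print(c(r2 = r2, std_beta_squared = std_beta^2))
```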
Consistency is obtained in this case of a simple regression
because the square of the standardized path coefficient
(page 23) equals the R^2 for y*.
27
For Categorical Outcomes with
More than Two Levels
Amos handles categorical variables with more than two
categories (e.g., a, b, c) in the same way as with binary
responses EXCEPT there is no need to fix any parameter
values to achieve identification. Thus, that case is even
easier.
28