Measurement - Columbia Law School


Measurement
Class 8
Purposes of Measurement

Make connections between concepts and data





A measure is a representation of a variable, or a construct
A measure also has to capture the true meaning of the
construct.
The accuracy of this capture is called validity
But there are many ways to operationalize and measure a
construct such as “damages,” “job satisfaction,” “productivity,”
“social class,” “segregation” or “attitudes toward the law."
What is critical to understand is that the decisions
we make in measuring any of these constructs
can bear on the results of the research and on the
decision to reject the null hypothesis. The gap such
decisions open between the measure and the true
value of the construct is called measurement error.

There is no bigger cancer on social research than measurement
error
Variables and Measures


The critical idea of a variable is that it has to
vary. That is, it must represent the range of
values implicit in the construct.
Variables are operationally defined by how they
are measured. Precision in measurement, and a
close relationship between measure and
construct, allow other researchers to replicate
findings across studies


No replication → No validity in the causal claim
By operationally measuring a variable, you allow
others to reach an independent judgment about
the meaning of your results, and you allow
others to try to reach those conclusions on their
own.
Illustrations of Theoretical
Meanings in Measurement


Some variables can be operationalized only in
one way: gender, for example, or jury size.
But most variables can be operationalized and
measured in many different ways.





Age can be conceptualized as a number, as a range, or
as a descriptive, qualitative assessment
Race and ethnicity – two constructs or one?
But gender too can be operationalized in more than
one way
Consider dangerousness in a civil proceeding, or
exposure in a tort case
Violence – acts or consequences? Threats or just
physical acts? What about robbery?
Types or Levels of Measurement

Examples of nominal measures (measures that
indicate distinct categories or types)





Gender
Religious preference
Region
Type of defense counsel
Examples of ordinal measures (measures that
enable the ranking of categories but offer no
information about the meanings of the intervals
between ranks)



Birth order
School grade
Criminal Sentence

Examples of interval scales (scales where the distances
between points are equal and signify actual
differences in the construct, but where the "zero" point
remains arbitrary)




Attitude scale scores
Crime rates (do we actually know what the "zero" point is?)
Heart rate
Examples of ratio scales (scales where the distances
between units are based on a meaningful "zero" point,
and where differences signify relative distances as well
as actual distances)




Age, income
Exposure to toxins
Prison sentence lengths
Punitive damage awards
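
To make the four levels concrete, here is a minimal sketch in Python (all data are hypothetical; pandas' categorical types are one convenient way to encode nominal and ordinal variables):

```python
# A minimal sketch of the four levels of measurement (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    # Nominal: distinct categories, no inherent order
    "counsel": pd.Categorical(["public", "private", "public", "pro se"]),
    # Ordinal: ranked categories; intervals between ranks are not defined
    "grade": pd.Categorical(["9th", "10th", "12th", "11th"],
                            categories=["9th", "10th", "11th", "12th"],
                            ordered=True),
    # Interval: equal units, arbitrary zero (e.g., an attitude scale score)
    "attitude_score": [42.0, 55.0, 38.0, 61.0],
    # Ratio: equal units and a meaningful zero (sentence length in months)
    "sentence_months": [0, 12, 36, 120],
})

# Ranking is legal only for the ordered categorical
print(df["grade"].min())  # 9th
# Ratios are meaningful only at the ratio level:
# a 36-month sentence is three times a 12-month one
print(df["sentence_months"][2] / df["sentence_months"][1])  # 3.0
```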
Scales




Scales combine separate items into
composite, more complex
representations of constructs
Scales avoid reliance on individual items
to represent a construct or
phenomenon
Most scales demand that the face validity
of the items comprising the scale is high
Never trust a single-item scale !!!
Types of Scales

Likert Scales


Scores are obtained directly from respondents, and
there is no discarding of items with disagreement.
Assignment of arbitrary numbers to low or high values.
Use of reversals within related items to avoid
"response sets."
The scale is developed in stages where we begin
with a large number of items and reduce them
through item analysis. Redundant items (where high
and low scorers answered specific questions in the
same way) are eliminated from the analysis. Scale
scores represent the total of responses to items in
the scale.
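
A minimal sketch of this scoring and item-analysis procedure in Python, with hypothetical data (the item indices to reverse and the 1-5 response range are assumptions):

```python
# A minimal sketch of Likert scoring with reversals and item analysis
# (hypothetical data; responses run 1-5).
import numpy as np

rng = np.random.default_rng(0)
n_respondents, n_items = 200, 10
responses = rng.integers(1, 6, size=(n_respondents, n_items)).astype(float)

# Reverse-code negatively worded items so a high score always means
# "more" of the construct (indices here are hypothetical).
reversed_items = [2, 5, 7]
responses[:, reversed_items] = 6 - responses[:, reversed_items]

# Scale score = total of responses to items in the scale
scale_scores = responses.sum(axis=1)

# Item analysis: corrected item-total correlation. Items on which high
# and low scorers answer the same way correlate weakly with the rest
# of the scale and are candidates for elimination.
for j in range(n_items):
    rest = scale_scores - responses[:, j]  # total excluding item j
    r = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"item {j}: corrected item-total r = {r:.2f}")
```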
Example:
Collective Efficacy Scale
For each of these statements, please tell me whether you
(1) strongly agree, (2) agree, (3) neither agree nor disagree,
(4) disagree, or (5) strongly disagree










If there is a problem around here the neighbors get together to deal with
it.
This is a close-knit neighborhood.
When you get right down to it, no one in this neighborhood cares much
about what happens to me
There are adults in this neighborhood that children can look up to
People around this neighborhood are willing to help their neighbors
People in this neighborhood generally don’t get along with each other
If I had to borrow $30 in an emergency, I could borrow it from a neighbor
People in this neighborhood do not share the same values
People in the neighborhood can be trusted
Parents in this neighborhood know their children’s friends

Guttman Scales




A scale where the responses indicate the precise order of
response to the constituent items. That is, responses to one item
predict responses to other items. It constructs a scale based on
the logic that if one scores positively or high at the upper
boundaries of the scale, they also (at least 90% of the time)
score high or positively on the lower ranges of the scale. It is a
convenient way of organizing ordinal or even nominal data into
an interval or ratio scale.
Statistical tests are available to assess whether items fit into a
Guttman scale and what the scale properties are (the coefficient
of reproducibility; a computational sketch appears below).
Raises questions of the temporal dimension of scaling: items
may scale well when the time span is long but the scale may be
plagued by errors if the time span is short.
Example: Spouse assault scale -- What happens if we
lengthen or shorten the time frame? Validity threats from
this concern?
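
The coefficient of reproducibility can be computed directly. A minimal sketch in Python, with hypothetical 0/1 responses and items assumed ordered from most- to least-often endorsed:

```python
# Guttman's coefficient of reproducibility: CR = 1 - errors / (n x k).
# Hypothetical binary responses; columns ordered "easiest" to "hardest".
import numpy as np

def coefficient_of_reproducibility(X):
    """X: (n_respondents, n_items) 0/1 array."""
    n, k = X.shape
    totals = X.sum(axis=1).astype(int)
    errors = 0
    for i in range(n):
        # Perfect Guttman pattern for this total: endorse the first s items
        ideal = np.zeros(k, dtype=int)
        ideal[: totals[i]] = 1
        errors += int(np.sum(X[i] != ideal))
    return 1 - errors / (n * k)

X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],  # deviates from the perfect pattern
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
print(coefficient_of_reproducibility(X))  # 0.9; 0.90+ suggests scalability
```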
Example: Conflict Tactics Scale
We also are interested in whether any of these things have happened to you during any
relationship you've been in over the past year. Please tell me if this has happened in the
past [YEAR], and, if so, how many times you think it has happened.
In the last [YEAR], has your [PARTNER]…
(recorded for each item: No / Yes / # Times)

pushed, grabbed, shoved, slapped, or shaken you?
punched, choked, strangled, kicked, or bitten you?
thrown an object at you or tried to hit you with an object?
threatened you with a knife or gun?
ever shot at or stabbed you?
tried to stop you from working or studying?
tried to stop you from having contact with family, friends or co-workers?
become angry (e.g., yelled, gotten real upset) when you disagreed with his or her point of view?
damaged, destroyed, hid or thrown out any of your clothes or possessions?
damaged or destroyed any other property when angry with you?
locked you out of the house?
insulted or shamed you in front of others?
threatened to leave you?
called you stupid, fat or ugly?
used physical force or threats of force to make you have sex when you didn't want to?

Simple additive scales or arbitrary scales


Example: social class
Factor analysis


Rather than use an additive or other computational
technique to arrange items in a scale, factor analysis
offers the possibility of identifying and quantifying
underlying patterns or dimensions among the items in
a scale. It avoids the assumption that all items tap
the same dimension of the phenomenon or construct.
Examples:



Neighborhoods
Childhood exposure to violence
Organizational climate



Procedures: The researcher constructs a correlation
matrix, and items that share a high correlation among
themselves are organized computationally as a "factor."
Each item is given a factor "loading" that shows its
relative correlation within the factor.
The researcher has the choice of either selecting those
items within each factor that best represent the factor
(those with the highest factor loads), or using the factor
score, a composite index of the items based on weights
assigned that reflect their factor loads.
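
A minimal sketch of this procedure, using scikit-learn's FactorAnalysis on hypothetical data (the item structure and all parameters below are assumptions for illustration):

```python
# A minimal sketch of factor analysis: loadings and factor scores.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
# Hypothetical data: 300 respondents, 6 items driven by 2 latent factors
latent = rng.normal(size=(300, 2))
true_loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                          [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = latent @ true_loadings.T + rng.normal(scale=0.5, size=(300, 6))

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)   # composite factor score per respondent
loadings = fa.components_.T    # estimated loading of each item on each factor

# Items load most heavily on the factor they best represent
for j, row in enumerate(loadings):
    print(f"item {j}: loadings = {np.round(row, 2)}")
```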
Criticisms:



It is sample-specific. The results will vary with the response
patterns of the sample. Addition of a few cases can alter the
factor scores.
It often is abused and used in a context devoid of theory. In this
regard, it is simply an exercise in "barefoot empiricism":
aimlessly tiptoeing through the data.
When used to analyze data that may itself contain measurement
errors (e.g., arrest records or bad scales), the errors can approach
a level that cannot be tolerated.
Evaluating Scales or Measures:
Validity and Reliability

Validity and reliability are assessment tools
that are a constant presence in the
background, standing off-center,
commenting on the state of affairs and
suggesting plots and weaknesses that
undermine the affairs of state at the
center stage.
Validity



Validity asks, quite simply, did we measure what
we thought we measured?
A variable is validly measured if it accurately
measures what you want it to measure. Thus,
we ask whether what we observed is a function of
the actual phenomena and relationships we have
hypothesized, or an artifact of the research
design (especially measurement) that we used
to generate these data.
In fact, some of the trends that we think have
held up consistently regarding some simple
relationships in criminality are explained better
by the artifacts of the study design than by the
behaviors themselves.
Types of Validity

Face validity



Am I measuring what I think I am measuring?
Example: Family supervision -- LIKE PARENTS
Content validity


Does the item measure the concept it addresses? This is a
dimension of validity similar to face validity. Yet it differs in one
important way: it refers to the ability of the item to distinguish
among people within a population.
Examples: Test scores -- all students score the same.
SRD scales -- kids with black eyes report NONE on the item
relating to fighting.

Construct validity


Am I measuring what the theory states and what the construct
implies? The match between the theoretical and the operational
definitions of the concept. Error could lie in the measurement, or
it could be in the formulation of the construct, but something is
mismatched in this critical relationship.
Examples: Fighting -> deviance --- maybe some behaviors are
normative!

Concurrent validity


Does the item have the ability to accurately state the present
state of another variable?
Example: Measures of spouse assault from one member of the
couple
Predictive validity



Does the item accurately forecast behaviors or outcomes in the
future?
Examples: "Dangerousness" and future behavior after release
"Rehabilitation" and parole outcome

Convergent validity





How consistent or distinctive are multiple measures of the same or
different constructs? Measures of the same construct should point
in the same direction; measures of different constructs should yield
distinctive results.
Examples: Using husband and wife self-reports of both victimization and
offending to measure spouse assault
Using MMPI (standardized) and Rorschach (projective, subjective) tests
to measure psychopathology
Another element of this strategy is the use of multiple methods. For
example, using participant observation to determine whether drinking
precedes gang fights
Techniques for assessing validity in surveys




Social desirability scales
Lie detector tests
Known-group tests
Secondary sources of data
Reliability

A reliability coefficient is a measure of the
consistency and stability of measurement across
subjects and populations.



Consistency: a reliable variable is one where you keep
getting the same value every time you measure it.
Stability refers to the consistency of measurement
across periods. The simplest example is the test-retest
score: scores should be consistent across time (after
controlling for rival causal factors such as history or
maturation).
Internal consistency refers to the associations among
items within the measurement of a complex phenomenon.
Examples:


Weight – how consistent is a digital v mechanical scale?
Parental supervision
Types of Reliability




Inter-rater reliability refers to the degree of agreement between
two independent people on a measure. The judges of Olympic
skating provide a good test each year of the reliability of the
judging technique. When the East German judge holds up a
number very dissimilar from the other scores, we question
whether the measure of scoring has good inter-rater reliability.
Test-retest reliability refers to the relationship between the score
a person gets on one occasion and the score he or she receives
on a subsequent occasion. The LSAT has good test-retest
reliability: in general, scores on one administration are similar to
scores on a second administration (compared to others who also
are taking a second administration)
Multiple instrumentation -- e.g., randomization of items within
multiple administrations of a test (used widely by the Educational
Testing Service and the major polling organizations).
Split-half methods -- randomly splitting the items of a test into two
halves and correlating the half scores to determine internal
consistency (see the sketch after this list).
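
A minimal sketch of the split-half method in Python, with hypothetical 0/1 test responses and the standard Spearman-Brown correction (all data and parameters below are assumptions):

```python
# Split-half reliability with the Spearman-Brown correction
# (hypothetical binary test data).
import numpy as np

rng = np.random.default_rng(2)
n_people, n_items = 100, 20
ability = rng.normal(size=(n_people, 1))
# Correct answers are more likely for higher-ability test takers
X = (rng.normal(size=(n_people, n_items)) + ability > 0).astype(float)

# Randomly split the items into two halves and score each half
items = rng.permutation(n_items)
half_a = X[:, items[: n_items // 2]].sum(axis=1)
half_b = X[:, items[n_items // 2 :]].sum(axis=1)

r_half = np.corrcoef(half_a, half_b)[0, 1]
# Spearman-Brown: step up to the reliability of the full-length test
reliability = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, full-test reliability = {reliability:.2f}")
```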
Measuring Reliability

Cronbach's alpha is used most widely, and is available in
most statistical packages:

alpha = (N x r-bar) / (1 + (N - 1) x r-bar)

where N is equal to the number of items and r-bar is the average
inter-item correlation among the items. (A computational sketch
follows below.)


If you increase the number of items, you increase
Cronbach's alpha. Additionally, if the average inter-item
correlation is low, alpha will be low. As the average
inter-item correlation increases, Cronbach's alpha
increases as well.
CFI, RMSEA, etc. – "latent construct" fit measures – show
whether all items form a single construct when
measured concurrently.
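
A minimal sketch of the alpha computation above, in Python with hypothetical item data:

```python
# Standardized Cronbach's alpha from the average inter-item correlation
# (hypothetical data: 8 items sharing a common construct).
import numpy as np

def cronbach_alpha(X):
    """X: (n_respondents, n_items) array of item responses."""
    n_items = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)                # inter-item correlations
    r_bar = corr[~np.eye(n_items, dtype=bool)].mean()  # average off-diagonal r
    return n_items * r_bar / (1 + (n_items - 1) * r_bar)

rng = np.random.default_rng(3)
common = rng.normal(size=(200, 1))      # shared construct
X = common + rng.normal(size=(200, 8))  # 8 related items
print(f"alpha = {cronbach_alpha(X):.2f}")
```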
The Relationship between
Validity and Reliability


You must have reliability in order to have (measurement) validity. If
everyone rating a variable came up with a different score every time the
variable was measured, you can't draw an inference about the measure
of the variable. You don't know if the measure is accurate, or if it is not
tapping into the dimension you really want to measure.
But you can have reliability without validity. For example, you could use
hat size as a measure of IQ. Reliable yes, but valid? Hardly.



Nothing is wrong with the tape measure; differences from one measurement to
the next aren't likely to vary much. But hat size has little to do with IQ, so
why measure?
Reasonable people will argue about validity. It is rarely an all or nothing
assessment, as in the case of age. Validity means not just that the
measure was accurate and clear, but that it was true.
Consider the different measures of faculty productivity: article credits,
pages per faculty member, prestige of law reviews or journals, footnotes
per faculty member, etc. (What about teaching productivity???) All
measures are quite reliable, but how valid are they re: "productivity"?