Introduction to Measurement
Goals of Workshop
• Reviewing assessment concepts
• Reviewing instruments used in the norming process
• Getting an overview of the secondary and elementary normative samples
• Learning how to use the manuals in interpreting students' scores.
ASSESSMENT
• The process of collecting data for the purpose of making decisions about students.
• It's a process and typically involves multiple sources and methods.
• Assessment is in service of a goal or purpose.
• The data we collect will be used to support some type of decision (e.g., monitoring, intervention, placement).
Major Types of Assessment in Schools
• More frequently used:
– Achievement: how well is the child doing in the curriculum?
– Aptitude: what are this child's intellectual and other capabilities?
– Behavior: is the child's behavior affecting learning?
• Less frequently used:
– Teacher competence: is the teacher actually imparting knowledge?
– Classroom environment: are classroom conditions conducive to learning?
– Other concerns: home, community, ...
Types of Tests
• Norm-referenced
– Comparison of performance to a specified population/set of individuals
• Individually-referenced
– Comparisons to self
• Criterion-referenced
– Comparison of performance to mastery of a content area; what does the student know?
• The data in the manual will allow you to look at norms and at individual growth.
MAJOR CONCEPTS
• Nomothetic and Idiographic
• Samples
• Norms
• Standardized Administration
• Reliability
• Validity
Nomothetic
• Relating to the abstract, the universal, the general.
• Nomothetic assessment focuses on the group as a unit.
• Refers to finding principles that are applicable on a broad level.
• For example, boys report higher math self-concepts than girls; girls report more depressive symptoms than boys.
Idiographic
• Relating to the concrete, the individual, the unique.
• Idiographic assessment focuses on the individual student.
• What type of phonemic awareness skills does Joe possess?
Populations and Samples I
• A population consists of all the representatives of a particular domain that you are interested in.
• The domain could be people, behavior, or curriculum (e.g., reading, math, spelling, ...).
Populations and Samples II
• A sample is a subgroup that you actually draw from the population of interest.
• Ideally, you want your sample to represent your population:
– people polled or examined, test content, manifestations of behavior
Random Samples
• A sample in which each member of the population has an equal and independent chance of being selected.
• Random samples are important because the idea is to have a sample that represents the population fairly; an unbiased sample.
• A sample can be used to represent the population.
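A minimal Python sketch of drawing a simple random sample (the population of student IDs and the sample size here are invented for illustration):

```python
import random

# Hypothetical population: student ID numbers 1-1000
population = list(range(1, 1001))

# Each member has an equal and independent chance of selection
sample = random.sample(population, k=50)  # draw 50 students without replacement
print(sample[:10])
```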
Probability Samples I
• Sampling in which elements are drawn according to some known probability structure.
• Random samples are subcases of probability samples.
• Probability samples are typically used in conjunction with subgroups (e.g., ethnicity, socioeconomic status, gender).
Probability Samples II
• Probability samples using subgroups are also referred to as stratified samples.
• Standardization samples are typically probability or stratified samples.
• Standardization samples need to represent the population because the sample's results will be used to create norms against which all members of the population will be compared.
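As a sketch of stratified sampling under invented assumptions (three made-up strata and a 5% sampling fraction), one could group the population by stratum and sample each group:

```python
import random
from collections import defaultdict

# Hypothetical population records: (student_id, stratum)
population = [(i, random.choice(["urban", "suburban", "rural"]))
              for i in range(1, 1001)]

# Group students by stratum
by_stratum = defaultdict(list)
for student_id, stratum in population:
    by_stratum[stratum].append(student_id)

# Sample the same fraction from each stratum so subgroups stay represented
fraction = 0.05
stratified_sample = []
for stratum, members in by_stratum.items():
    k = max(1, round(fraction * len(members)))
    stratified_sample.extend(random.sample(members, k))

print(len(stratified_sample))
```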
Norms I
• Norms are examples of how the "average" individual performs.
• Many of the tests and rating scales that are used to compare children in the US are norm-referenced.
– An individual child's performance is compared to the norms established using a representative sample.
Norms II
• For the score on a normed instrument to be valid, the person being assessed must belong to the population for which the test was normed.
• If we wish to apply the test to another group of people, we need to establish norms for the new group.
Norms III
• To create new norms, we need to do a number of things:
– Get a representative sample of the new population.
– Administer the instrument to the sample in a standardized fashion.
– Examine the reliability and validity of the instrument with that new sample.
– Determine how we are going to report on scores and create the appropriate tables.
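As a hedged sketch of that last step, here is one way to turn a standardization sample's raw scores into a percentile-rank norms table in Python (the scores are invented for illustration):

```python
import numpy as np

# Hypothetical raw scores from a standardization sample
norm_sample = np.array([12, 15, 15, 18, 20, 21, 21, 23, 25, 30])

def percentile_rank(score, sample):
    """Percent of the norm sample scoring at or below this score."""
    return 100.0 * np.mean(sample <= score)

# Norms table: one row per raw score observed in the sample
for raw in sorted(set(norm_sample.tolist())):
    print(f"raw score {raw:>3}: percentile rank {percentile_rank(raw, norm_sample):5.1f}")
```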
Standardized Administration
• All measurement has error.
• Standardized administration is one way to reduce error due to examiner/clinician effects.
• For example, consider these questions asked with different facial expressions and tone:
– "Please define a noun for me." :-)
– "DEFINE a noun, if you can?" :-(
Distributions
• Any group of scores can be arranged in a distribution from highest to lowest:
• 10, 3, 31, 100, 17, 4
• 3, 4, 10, 17, 31, 100
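In Python, for instance, the slide's example scores can be arranged into a distribution with sorted():

```python
scores = [10, 3, 31, 100, 17, 4]
print(sorted(scores))                # lowest to highest: [3, 4, 10, 17, 31, 100]
print(sorted(scores, reverse=True))  # highest to lowest: [100, 31, 17, 10, 4, 3]
```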
Normal Curve
• Many distributions of human traits form a normal curve.
• Most cases cluster near the middle, with fewer individuals at the extremes; the curve is symmetrical.
• We know how the population is distributed based on the normal curve.
Ways of reporting scores
• Mean, standard deviation
• Distribution of scores
– 68.26% of scores fall within ±1 SD of the mean; 95.44% within ±2 SD; 99.72% within ±3 SD
• Stanines (1, 2, 3, 4, 5, 6, 7, 8, 9)
• Standard scores - linear transformations of scores, but easier to interpret
• Percentile ranks*
• Box and whisker plots*
Percentiles
• A way of reporting where a person falls on a distribution.
• The percentile rank of a score tells you how many people obtained a score equal to or lower than that score.
• So if we have a score at the 23rd %tile and another at the 69th %tile, which score is higher?
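A minimal sketch of computing a percentile rank in Python with invented class scores; scipy's kind='weak' option counts scores less than or equal to the given score, matching the definition above:

```python
from scipy.stats import percentileofscore

# Hypothetical class scores
scores = [55, 60, 62, 70, 71, 75, 80, 84, 90, 95]

# 6 of the 10 scores are <= 75, so 75 falls at the 60th percentile
print(percentileofscore(scores, 75, kind='weak'))  # 60.0
```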
Percentiles 2
• Is a high percentile always better than a low percentile?
• It depends on what you are measuring.
• For example...
• Box and whisker plots are visual displays or graphic representations of the shape of a distribution using percentiles.
Explanation of the Box Plot
[Figure: box plot of the distribution of Grade 2 students' performance scores (scale 0-20), marking the 10th, 25th, 50th, 75th, and 90th percentiles, with individual outliers plotted beyond the whiskers.]
The box plot is a picture of the distribution of scores on a measure.
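A minimal matplotlib sketch of such a box plot, using invented Grade 2 scores; whis=(10, 90) places the whiskers at the 10th and 90th percentiles as in the figure:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=10, scale=3, size=60)  # hypothetical Grade 2 scores

fig, ax = plt.subplots()
# Whiskers at the 10th/90th percentiles; points beyond them plot as outliers
ax.boxplot(scores, whis=(10, 90), labels=["Grade 2 Students"])
ax.set_ylabel("Performance")
plt.show()
```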
Correlation
• We need to understand the correlation coefficient to understand the manual.
• The correlation coefficient, r, quantifies the relationship between two sets of scores.
• A correlation coefficient can range from -1 to +1.
– Zero means the two sets of scores are not related.
– One means the two sets of scores are identical (a perfect correlation).
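A minimal numpy sketch of computing r between two invented sets of scores:

```python
import numpy as np

# Two hypothetical sets of scores for the same six students
x = np.array([10, 12, 15, 18, 20, 25])
y = np.array([ 8, 11, 14, 17, 22, 24])

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(round(r, 2))
```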
Correlation 2
• Correlations can be positive or negative.
• A positive correlation tells us that as one set of scores increases, the second set of scores also increases.
• A negative correlation tells us that as one set of scores increases, the other set decreases. Think of some examples of variables with negative r's.
• The absolute value of a correlation indicates the strength of the relationship. Thus, .55 is equal in strength to -.55.
How would you describe the
correlations shown by these charts?
[Figure: three single-series charts: one with values rising (1, 4, 3, 7, 9, 10), one falling steadily (10, 9, 8, 7, 6, 5), and one flat at a constant 1.2.]
Correlation 4
• .25, .70, -.40, .55, -.87, .58, .05
• Order these from strongest to weakest:
• -.87, .70, .58, .55, -.40, .25, .05
• We will meet 3 different types of correlation coefficients today:
– Reliability coefficients - Definitions?
– Validity coefficients
– Pattern coefficients
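The ordering exercise above can be checked in a line of Python, since strength is the absolute value of r:

```python
rs = [.25, .70, -.40, .55, -.87, .58, .05]

# Sort by |r|, largest first
ordered = sorted(rs, key=abs, reverse=True)
print(ordered)  # [-0.87, 0.7, 0.58, 0.55, -0.4, 0.25, 0.05]
```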
Reliability
• Reliability addresses the stability, consistency, or reproducibility of scores.
– Internal consistency (split half, Cronbach's alpha)
– Test-retest
– Parallel forms
– Inter-rater
Reliability 2
• Internal consistency
– How do the items on a scale relate to one another? Are respondents relating to them in the same way?
• Test-retest
– How do respondents' scores at Time 1 relate to their scores at Time 2?
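A hedged numpy sketch of one common internal-consistency index, Cronbach's alpha; the item-response matrix is invented (rows are respondents, columns are items):

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses of 5 students to 4 items
responses = [[3, 4, 3, 4],
             [2, 2, 3, 2],
             [4, 5, 4, 5],
             [1, 2, 1, 2],
             [3, 3, 4, 3]]
print(round(cronbach_alpha(responses), 2))
```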
Reliability 3
• Parallel forms
– Begin by creating at least two versions of the exam. How does respondents' performance on one version compare to their performance on another version?
• Inter-rater
– Connected to ratings of behavior. How does one rater's scores compare to another's?
Validity
• Validity addresses the accuracy or truthfulness of scores. Are they measuring what we want them to?
– Content
– Criterion - Concurrent
– Criterion - Predictive
– Construct
– Face
Content Validity
• Is the assessment tool representative of the domain (behavior, curriculum) being measured?
• An assessment tool is scrutinized for its (a) completeness or representativeness, (b) appropriateness, (c) format, and (d) bias.
– E.g., MSPAS
Criterion-related Validity
• What is the correlation between our instrument, scale, or test and another variable that measures the same thing, or measures something very close to it?
• In concurrent validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at the same time.
• In predictive validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at some future time.
Structural Validity
• Used when an instrument has multiple scales.
• Asks the question, "Which items go together best?"
• For example, how would you group these items from the Self-Description Questionnaire?
– 3. I am hopeless in English classes.
– 5. Overall, I am no good.
– 7. I look forward to mathematics class.
– 15. I feel that my life is not very useful.
– 24. I get good marks in English.
– 28. I hate mathematics.
Structural Validity 2
• We expect the English items (3, 24), math items (7, 28), and global items (5, 15) to group together.
• The items that group together make up a new composite variable we call a factor.
• We want each item to correlate highly with the factor it clusters on, and less well with other factors.
• Typically, we accept item-factor coefficients from about .30 and higher.
What can we say about the structural
validity of the SDQ given these scores?
Item #   Verbal   Math    Global
3        .587     -.044   .624
5        -.016    .024    .561
7        .086     .630    -.059
23       .019     -.015   .625
24       .754     -.006   -.024
28       -.020    .750    .042
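A small Python sketch that applies the .30 rule of thumb to the loadings above, flagging which factor(s) each item clusters on:

```python
loadings = {
    # item: (Verbal, Math, Global) coefficients from the table above
    3:  ( .587, -.044,  .624),
    5:  (-.016,  .024,  .561),
    7:  ( .086,  .630, -.059),
    23: ( .019, -.015,  .625),
    24: ( .754, -.006, -.024),
    28: (-.020,  .750,  .042),
}
factors = ("Verbal", "Math", "Global")

for item, coefs in loadings.items():
    # Accept coefficients of about .30 and higher (in absolute value)
    hits = [f for f, c in zip(factors, coefs) if abs(c) >= .30]
    print(f"item {item}: loads on {hits}")
```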
Construct Validity
• Overarching construct: is the instrument measuring what it is supposed to?
– Dependent on reliability, content validity, and criterion-related validity.
• We also sometimes look at other types of validity evidence:
– Convergent validity: r with a similar construct
– Discriminant validity: r with an unrelated construct
– Structural validity: what is the structure of the scores on this instrument?
Statistical Significance
• When we examine group differences in science, we want to make objective rather than subjective decisions.
• We use statistics to let us know if the difference we are observing occurs by chance.
• In psychology, we typically set our alpha or error rate at 5% (i.e., .05); if a difference that large would occur by chance less than 5% of the time, we conclude that the difference is statistically significant.
Statistical Significance 2
• When our statistical test tells us that our difference is statistically significant (i.e., p < .05), we conclude it is unlikely to have occurred by chance.
• Statistical significance is affected by a number of variables, including sample size. The larger the sample, the easier it is to achieve statistical significance.
• We also look at the magnitude of the difference (the effect size).
• A difference may be statistically significant, but have a small effect size.
• .10 to .30 = small effect; .40 to .60 = medium effect; > .60 = large effect.
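A hedged scipy sketch contrasting a significance test with one common effect-size index, Cohen's d; the two groups' scores are invented for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Hypothetical test scores for two groups of students
group_a = rng.normal(loc=52, scale=10, size=40)
group_b = rng.normal(loc=48, scale=10, size=40)

# Statistical significance: is p below our .05 alpha level?
t, p = ttest_ind(group_a, group_b)
print(f"t = {t:.2f}, p = {p:.3f}, significant: {p < .05}")

# Effect size (Cohen's d): magnitude of the difference in SD units
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(f"d = {d:.2f}")
```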