Appraisal 2 - Troy University

Exam Review
Part 2
Statistical Concepts
for Appraisal
A frequency distribution is a tabulation of
scores in numerical order showing the
number of persons who obtain each score or
group of scores.
A frequency distribution is usually
described in terms of its measures of central
tendency (i.e., mean, median, and mode),
range, and standard deviation.
The (arithmetic) mean is the sum of a set of
scores divided by the number of scores.
The median is the middle score or point
above or below which an equal number of
ranked scores lie; it corresponds to the 50th
percentile.
The mode is the most frequently occurring
score or value in a distribution of scores.
The range is the arithmetic difference
between the lowest and the highest scores
obtained on a test by a given group.
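As a minimal sketch of the definitions above, the
following Python builds a frequency distribution and
computes the mean, median, mode, and range. The scores
are hypothetical, chosen only for illustration.

```python
from collections import Counter
from statistics import mean, median, multimode

# Hypothetical test scores for a small group (illustrative only).
scores = [70, 75, 75, 80, 85, 90, 95]

# Frequency distribution: each score in numerical order with its count.
frequency = sorted(Counter(scores).items())

score_mean = mean(scores)                # sum of scores / number of scores
score_median = median(scores)            # middle score of the ranked list
score_modes = multimode(scores)          # most frequent score(s), as a list
score_range = max(scores) - min(scores)  # highest minus lowest score

for score, count in frequency:
    print(score, count)
print(score_mean, score_median, score_modes, score_range)
```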
Variability is the dispersion or spread of a
set of scores; it is usually discussed in terms
of standard deviations.
The standard deviation is a measure of the
variability in a set of scores (i.e., frequency
distribution).
The standard deviation is the square root of the
average of the squared deviations around the
mean (i.e., the square root of the variance for
the set of scores).
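The computation can be sketched step by step as
follows. The scores are hypothetical, and the sketch
assumes the population form (dividing by the number of
scores); a sample estimate would divide by one less
than the number of scores.

```python
import math
from statistics import pstdev

# Hypothetical scores (illustrative only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean_score = sum(scores) / len(scores)

# Squared deviation of each score around the mean.
squared_deviations = [(x - mean_score) ** 2 for x in scores]

# Variance: the average of the squared deviations (population form).
variance = sum(squared_deviations) / len(scores)

# Standard deviation: the square root of the variance.
standard_deviation = math.sqrt(variance)

print(variance, standard_deviation)
# Cross-check against the standard library's population standard deviation.
print(pstdev(scores))
```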
The normal distribution curve is a bell-shaped
curve derived from the assumption
that variations from the mean are by chance,
as determined through repeated occurrences
in the frequency distributions of sets of
measurements of human characteristics in
the behavioral sciences.
Scores are symmetrically distributed above
and below the mean, with the percentage of
scores decreasing in equal amounts (standard
deviation units) as the scores progress away
from the mean.
Skewness is the degree to which a
distribution curve with one mode departs
horizontally from symmetry, resulting in a
positively or negatively skewed curve.
A positive skew is when the “tail” of the
curve is on the right and the “hump” is on
the left.
A negative skew is when the “tail” of the
curve is on the left and the “hump” is on
the right.
Kurtosis is the degree to which a
distribution curve with one mode departs
vertically from symmetry.
A leptokurtic distribution is one that is more
“peaked” than the normal distribution.
A platykurtic distribution is one that is
“flatter” than the normal distribution.
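One common way to quantify skewness and kurtosis is
with moment-based indices, sketched below. This
particular choice of index is an assumption, not
something the presentation specifies, and the data sets
are hypothetical. A positive skewness value corresponds
to a tail on the right; an excess kurtosis above zero
suggests a more peaked (leptokurtic) shape and below
zero a flatter (platykurtic) one.

```python
def central_moment(data, k):
    """Average of the k-th powers of deviations around the mean."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment scaled by the variance squared, minus 3
    so that a normal distribution scores approximately zero."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

# Hypothetical data sets (illustrative only).
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9]   # tail on the right
left_tailed = [10 - x for x in right_tailed]  # mirror image, tail on the left
flat = [1, 2, 3, 4, 5]                        # evenly spread scores
peaked = [5, 5, 5, 5, 5, 5, 5, 0, 10]         # most scores at the center

print(skewness(right_tailed), skewness(left_tailed))
print(excess_kurtosis(flat), excess_kurtosis(peaked))
```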
Percentiles result from dividing the
(normal) distribution into one hundred
linearly equal parts.
A percentile rank is the percentage of scores
that fall below a particular score.
Two different percentiles may represent
vastly different numbers of people in the
normal distribution, depending on where
the percentiles are in the distribution.
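The sketch below computes a percentile rank as defined
above and then illustrates one reading of the last
point: in a normal distribution, the same five-point
percentile step covers a much larger score distance in
the tail than near the middle. The function name and
the scores are hypothetical.

```python
from statistics import NormalDist

def percentile_rank(scores, score):
    """Percentage of scores in the group that fall below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)

# Hypothetical scores (illustrative only).
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(scores, 80))

# Distance in standard deviation units covered by a five-point
# percentile step near the middle versus in the upper tail.
standard_normal = NormalDist()
gap_middle = standard_normal.inv_cdf(0.55) - standard_normal.inv_cdf(0.50)
gap_tail = standard_normal.inv_cdf(0.95) - standard_normal.inv_cdf(0.90)
print(gap_middle, gap_tail)
```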
Standardization, sometimes called
“normalizing,” is the conversion of a
distribution of scores so that the mean equals
zero and the standard deviation equals 1.0
for a particular sample or population.
“Normalizing” a distribution is appropriate
when the sample size is large and the actual
distribution is not grossly different from a
normal distribution.
Standardization, or normalizing, is an
intermediate step in the derivation of
standardized scores, such as T scores, SAT
scores, or Deviation IQs.
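A minimal sketch of this intermediate step, assuming
the linear z-score transformation (which rescales the
scores without changing the shape of the distribution).
The T-score scale (mean 50, standard deviation 10) and
the deviation IQ scale (mean 100, standard deviation
15) are the conventional ones; the raw scores and
function names are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Rescale scores to a mean of zero and a standard deviation of 1.0."""
    m = mean(scores)
    sd = pstdev(scores)
    return [(x - m) / sd for x in scores]

def t_score(z):
    """T score: mean 50, standard deviation 10."""
    return 50 + 10 * z

def deviation_iq(z):
    """Deviation IQ: mean 100, standard deviation 15."""
    return 100 + 15 * z

# Hypothetical raw scores (mean 5, standard deviation 2).
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(raw)

# The highest raw score sits two standard deviations above the mean.
print(z[-1], t_score(z[-1]), deviation_iq(z[-1]))
```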
Stanines are a system for assigning a score
of one through nine for any particular
score. Stanines are derived from a
distribution having a mean of five and a
standard deviation of two.
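Given the mean of five and standard deviation of two,
a stanine can be approximated from a z-score as
sketched below. This linear rounding rule is an
assumption made for illustration; published tests
often assign stanines from fixed percentage bands
instead.

```python
import math

def stanine(z):
    """Approximate stanine for a z-score: scale to mean 5 and SD 2,
    round to the nearest whole number, and keep the result in 1-9."""
    rounded = math.floor(2 * z + 5 + 0.5)  # round half up
    return max(1, min(9, rounded))

for z in (-3.0, -1.0, 0.0, 0.3, 1.0, 3.0):
    print(z, stanine(z))
```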
A correlation coefficient is a measure of
relationship between two or more variables
or attributes that ranges in value from -1.00
(perfect negative relationship) through 0.00
(no relationship) to +1.00 (perfect positive
relationship).
A regression coefficient is a measure of the
linear relationship between a dependent
variable and a set of independent variables.
The probability (also known as the alpha)
level is the likelihood that a particular
statistical result occurred simply on the basis
of chance.
The coefficient of determination is the
square of a correlation coefficient. It is used
in the interpretation of the percentage of
shared variance between two sets of test
scores.
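The correlation coefficient and the coefficient of
determination can be sketched as follows, assuming the
Pearson product-moment coefficient (the presentation
does not name a specific coefficient). The two sets of
scores are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical paired test scores (illustrative only).
test_a = [1, 2, 3, 4, 5]
test_b = [2, 4, 5, 4, 5]

r = pearson_r(test_a, test_b)
r_squared = r ** 2  # coefficient of determination: shared variance

print(r, r_squared)
```

Here roughly 60 percent of the variance is shared
between the two hypothetical sets of scores.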
Error of measurement is the discrepancy
between the value of an observed score and
the value of the corresponding theoretical
true score.
The standard error of measurement is an
indicator of how closely an observed score
compares with the true score. This statistic
is derived by computing the standard
deviation of the distribution of errors for the
given set of scores.
Measurement error variance is the portion
of the observed score variance that is
attributed to one or more sources of
measurement error (i.e., the square of the
standard error of measurement).
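In practice the errors themselves are not observed, so
the standard error of measurement is commonly estimated
from the standard deviation of the observed scores and
a reliability coefficient. The sketch below assumes
that classical formula; the values are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical estimate: SD of observed scores times the square
    root of one minus the reliability coefficient."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical values: a scale with SD 15 and reliability .91.
sem = standard_error_of_measurement(15.0, 0.91)

# Measurement error variance is the square of the SEM.
error_variance = sem ** 2

print(sem, error_variance)
```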
Random error is an error associated with
statistical analyses that is unsystematic, often
indirectly observed, and appears to be
unrelated to any measurement variables.
Differential item functioning is a statistical
property of a test item in which, conditional
upon total test score or equivalent measure,
different groups of test takers have different
rates of correct item response.
The item difficulty index is the percentage
of a specified group that answers a test
item correctly.
The item discrimination index is a statistic
that indicates the extent to which a test item
differentiates between high and low scorers.
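Both item statistics can be sketched from a table of
scored responses (1 = correct, 0 = incorrect). The
discrimination index below compares the upper and
lower halves of the group ranked by total score; that
split is an assumption made for this small example,
and the data and function names are hypothetical.

```python
def item_difficulty(responses, item):
    """Proportion of the group answering the item correctly."""
    return sum(row[item] for row in responses) / len(responses)

def item_discrimination(responses, item):
    """Difference in proportion correct between the upper and lower
    halves of the group, ranked by total test score."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    return item_difficulty(upper, item) - item_difficulty(lower, item)

# Hypothetical scored responses: one row per test taker, one column per item.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

for item in range(4):
    print(item, item_difficulty(responses, item),
          item_discrimination(responses, item))
```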
Extrapolation is the process of estimating
values of a function beyond the range of the
available data.
A confidence interval is the interval
between two points on a scale within which
a score of interest lies, based on a certain
level of probability.
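A confidence interval around an observed score is often
built from the standard error of measurement. The
sketch below assumes normally distributed errors; the
observed score and standard error are hypothetical.

```python
from statistics import NormalDist

def confidence_interval(observed, sem, level=0.95):
    """Interval around an observed score at the given probability
    level, assuming normally distributed measurement error."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return observed - z * sem, observed + z * sem

# Hypothetical observed score of 100 with a standard error of 4.5.
lower, upper = confidence_interval(100.0, 4.5)
print(round(lower, 2), round(upper, 2))
```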
The error of estimate (standard or
probable) is the degree to which test
scores estimated from a criterion
correspond with actual scores.
The regression effect is the tendency of a
predicted score to be nearer to the mean of
its series of scores than was predicted.
A factor is a hypothetical dimension
underlying a psychological construct that is
used to describe the construct and
intercorrelations associated with it.
Factor analysis is a statistical procedure for
analyzing intercorrelations among a group
of variables, such as test scores, by
identifying a set of underlying hypothetical
factors and determining the amount of
variation in the variables that can be
accounted for by the different factors.
The factorial structure is the set of factors
resulting from a factor analysis.
Reliability is the degree to which an individual
would obtain the same score on a test if the
test were re-administered to the individual with
no intervening learning or practice effects.
The reliability coefficient is an index that
indicates the extent to which scores are free
from measurement error. It is an approximation of the ratio of true variance to
observed score variance for a particular
population of test takers.
The coefficient of equivalence is a correlation
between scores for two forms of a test given at
essentially the same time; also referred to as
alternate-form reliability, a measure of the
extent to which two equivalent or parallel
forms of a test are consistent in what they
measure.
The coefficient of stability is a correlation
between scores on two administrations of a
test, such as test administration and retest
with some intervening time period.
The coefficient of internal consistency is a
reliability index based on interrelationships
of item responses or of scores on sections of
a test obtained during a single
administration. The most common examples
include the Kuder-Richardson and split-half
coefficients.
Coefficient Alpha is a coefficient of internal
consistency for a measure in which there are
more than dichotomous response choices,
such as in the use of a Likert scale.
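Coefficient alpha can be sketched from a table of item
responses as follows, assuming the usual formula based
on the number of items, the item variances, and the
variance of the total scores. The Likert-type responses
are hypothetical.

```python
from statistics import pvariance

def coefficient_alpha(responses):
    """Coefficient alpha for rows of item responses
    (one row per respondent, one column per item)."""
    k = len(responses[0])
    item_variances = [pvariance([row[i] for row in responses])
                      for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses to three five-point Likert items.
likert = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 4, 5],
    [1, 2, 2],
    [3, 3, 4],
]

alpha = coefficient_alpha(likert)
print(round(alpha, 3))
```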
The split-half reliability coefficient is a
reliability coefficient that estimates the
internal consistency of a power test by
correlating the scores of two halves of the
test (usually the even-numbered items and
the odd-numbered items, if their
representative means and variances are
equal).
The Spearman-Brown Prophecy Formula
projects the reliability of a test that has been
reduced from the calculated reliability of the
test. It is a “correction” appropriate for use
only with a split-half reliability coefficient.
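A minimal sketch of the formula follows. The length
factor is written as a general parameter, which is an
assumption beyond what the slide states; with a factor
of 2 it gives the split-half correction, projecting
full-length reliability from the correlation between
two half-tests. The example values are hypothetical.

```python
def spearman_brown(reliability, length_factor=2.0):
    """Projected reliability when test length is multiplied by
    `length_factor` (2.0 projects a full test from a half-test)."""
    n = length_factor
    return (n * reliability) / (1 + (n - 1) * reliability)

# Hypothetical correlation of .60 between odd and even halves.
full_test = spearman_brown(0.60)
print(round(full_test, 2))

# Applying the formula in the other direction (half the length).
half_test = spearman_brown(0.75, length_factor=0.5)
print(round(half_test, 2))
```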
Interrater reliability is an index of the
consistency of two or more independent
raters’ judgments in an assessment
situation.
Intrarater reliability is an index of the
consistency of each independent rater’s
judgments in an assessment situation.
Validity is the extent to which a given test
measures or predicts what it purports to
measure or predict.
The two basic approaches to the determination of validity include logical analysis, which
applies to content validity and item structure,
and empirical analysis, which applies to
predictive validity and concurrent validity.
Construct validity falls under both logical and
empirical analyses.
Validity is application specific, not a
generalized concept. That is, a test is not in
and of itself valid, but rather is valid for use
for a specific purpose for a specific group of
people in a specific situation.
Validation is the process by which the validity
of an instrument is measured.
Face validity is a measure of the acceptability of
a given test and test situation by the examinee
or user, in terms of the apparent uses of the test.
Concurrent validity is a measure of how well
a test score matches a measure of criterion
performance.
Example applications include comparing a
distribution of scores for men in a given
occupation with those for men in general,
correlating a personality test score with an
estimate of adjustment made in a counseling
interview, and correlating an end-of-course
achievement or ability test score with a
grade-point average.
Content validity is a measure of how well
the content of a given test represents the
subject matter (domain or universe) or
situation about which conclusions are to be
drawn.
A construct is a grouping of variables or
behaviors considered to vary across people. A
construct is not directly observable but rather
is derived from theory.
Construct validity is a measure of how well
a test score yields results in line with
theoretical implications associated with the
construct label.
Predictive validity is a measure of how well
predictions made from a given test are
confirmed by data collected at a later time.
Example applications of predictive validity
include correlating intelligence test scores
with course grades or correlating test scores
obtained at the beginning of the year with
grades earned at the end of the year.
Factorial validity is a measure of how well
the factor structure resulting from a factor
analysis of the test matches the theoretical
framework for the test.
Cross-validation is the process of
determining whether a decision resulting
from one set of data is truly effective when
used with another relevant and independent
data set.
Convergent evidence is validity evidence
derived from correlations between test scores
and other types of measures of the same
construct and in which the relationships are
in predicted directions.
Discriminant evidence is validity evidence
derived from correlations between test scores
and other forms of assessment for different
constructs and in which the relationships are
in predicted directions.
Appraisal of
Intelligence
A very general definition of intelligence is
that it is a person’s global or general level of
mental (or cognitive) ability.
However, there is considerable debate as to
what intelligence is, and a corresponding
amount of debate about how it should be
measured.
Perhaps the biggest debate in the
assessment of intelligence is how to use
intelligence tests effectively.
Given that intelligence is a “global”
construct, what are the implications of
intelligence test results for relatively specific
circumstances and/or sets of behaviors?
In general, intelligence test results have been
most useful for interpretation in contexts
calling for use of mental abilities, such as in
educational processes.
Another argument concerns whether
intelligence is “a (single) thing,” which is
reflected in unifactor theories of
intelligence, or a unique combination of
things, which is reflected in multifactor
theories of intelligence.
The measurement implications from this
debate result in some intelligence tests attempting to measure a single construct
and some attempting to measure a unique
set of interrelated constructs.
Another debate centers on what
proportion of intelligence is genetic or
inherited and what proportion is
environmentally determined. This is the
so-called “nature-nurture” controversy.
So-called “fluid” intelligence (theoretically
a person’s inherent capacity to learn and
solve problems) is largely nonverbal and is
a relatively culture-reduced form of mental
ability.
So-called “crystallized” intelligence
(theoretically) represents what a person has
already learned, is most useful in
circumstances calling for learned or
habitual responses, and is heavily culturally
laden.
The nature-nurture concern has significant
implications for how intelligence is assessed
(e.g., what types of items and/or tasks are
included), but there has not been full or
consensual resolution of the debate.
A fourth major debate concerns the extent
to which intelligence tests are racially,
culturally, or otherwise biased.
Although evidence of such biases was
found in some “early” intelligence tests,
improvements in psychometry have done
much to alleviate such biases, at least in
regard to resultant psychometric properties
of “newer” intelligence tests.
In light of these and other considerations, the
primary focus for the assessment of
intelligence is on the construct validity of
intelligence tests.
In general, individually administered
intelligence tests have achieved the greatest
Individual intelligence tests typically are
highly verbal in nature, i.e., necessitate
command of language for effective
performance.
Individual intelligence tests typically include
both verbal (e.g., response selection or item
completion) and performance (e.g.,
manipulation task) subsets of items.
However, nonverbal and nonlanguage
intelligence tests have been developed.
Group administered intelligence tests, such as
those commonly used in schools, are typically
highly verbal and non-performance in nature.
Appraisal of
Aptitude
An aptitude is a relatively clearly defined
cognitive or behavioral ability.
An aptitude is a much more focused ability
than general intelligence, and the
measurement of aptitudes also has been
more focused.
Literally hundreds of aptitude tests have
been developed and are available for a
substantial number of rather disparate
human abilities.
Theoretically, aptitude tests are intended to
measure “innate” abilities (or capacities)
rather than learned behaviors or skills.
There remains considerable debate as to
whether this theoretical premise is actually
achieved in practice.
However, this debate is lessened in importance
IF the relationship between a current aptitude
test result and a future performance indicator
is meaningful and useful.
Aptitude tests are used primarily for
prediction of future behavior, particularly
in regard to the application of specific
abilities in specific contexts.
Predictive validity is usually the foremost
concern in aptitude appraisal and is
usually established by determining the
correlation between test results and some
future behavioral criterion.
Although there are many individual aptitude
tests, aptitude appraisal is much more
commonly achieved through use of
multiple-aptitude test batteries.
There are two primary advantages to the use
of multiple-aptitude batteries (as opposed to
a collection of individual aptitude tests from
different sources):
First, the subsections of multiple-aptitude test
batteries are designed to be used as a
collection; therefore, there is usually a
common item and response format, greater
uniformity in score reporting, and generally
better understanding of subsection and
overall results.
Second, the norms for the various subtests
are from a common population; therefore,
comparison of results across subtests is
facilitated.
Perhaps the most widely recognized use of
aptitude tests is for educational purposes, e.g.,
Scholastic Assessment Test (formerly the
Scholastic Aptitude Test; SAT), American
College Testing Program (ACT), and
Graduate Record Examination (GRE).
However, aptitude tests used specifically for
vocational purposes (e.g., General Aptitude
Test Battery; GATB) or armed services
purposes (e.g., Armed Services Vocational
Aptitude Battery; ASVAB) also are very
widely used.
Appraisal of
Achievement
Achievement tests are measures of success,
mastery, accomplishment, or learning in a
subject matter or training area.
The greatest use by far of achievement tests is
in school or educational systems to determine
student accomplishment levels in academic
subject areas.
The vast majority of achievement tests are
group tests.
Most achievement tests also are actually
multiple-achievement test batteries because
they typically have subtests for several
different subject matter areas.
However, there are achievement tests
available that measure across several
different subject matter areas but that are
designed for individual administration.
Individual achievement tests are used most
commonly in processes to diagnose learning
difficulties.
Most achievement tests are norm-referenced
to facilitate comparisons within and between
components of educational systems.
However, increasingly, criterion-referenced
achievement tests are being used in the attempt to determine with greater specificity
the particular skills and/or knowledge
students are mastering at various
educational levels.
Appraisal of
Interests
The primary goal of interest assessment is to
help individuals differentiate preferred
activities from among possible activities.
Presumably, the information derived from
interest assessment will enable the
respondent to achieve greater vocational
productivity, success, and/or life
satisfaction.
Most interest inventories are used in the
context of vocational counseling (i.e., to help
individuals determine preferences in various
aspects of the world of work).
However, increasingly, interest inventories
are being developed and used to assess
preferences in other aspects of life, such as
Some interest (and some personality)
inventories are ipsative measures, which
means that the average of the subscale
responses is the same for all respondents.
Ipsative measures usually have a forced-choice
format, which means that a
respondent cannot have all high scores or all
low scores across subscales.
Interest inventories are most commonly used
by and developed for young adults, such as
late high school or college students.
However, interest inventories suitable and
valid for use with persons at any age are
available.
The major problem with interest inventories
is the tendency for respondents to interpret
them as measures of ability or probable
satisfaction, neither of which is necessarily
directly related to any particular preference.
Appraisal of
Personality
Personality is a vague, difficult-to-define
construct. People tend to think of it as “the
way a person is.” However, there are at least
two points of agreement about personality:
First, each person is consistent to some extent
(i.e., has coherent traits and action patterns
that are repeated).
Second, each person is distinct or unique to
some extent (i.e., has traits and behaviors
different from others).
It is exactly this strange set of conflicting
conditions that makes the assessment of
personality so complex.
“Normality” is a relativistic term used to
describe how some identifiable group of
people (should) behave most of the time.
The assessment of personality thus involves
determining the extent to which a person’s
traits and/or behaviors fit normality (i.e., are
compared to average behavior in some
reference group).
Projective techniques and self-report
inventories are the two primary
methods of appraisal of personality.
Projective techniques involve respondents
constructing their own responses to vague
and ambiguous stimuli.
The projective hypothesis is that personal
interpretation of ambiguous stimuli reflects
unconscious needs, motives, and/or conflicts.
Generally, five types of projective assessment techniques are discussed:
Association techniques, such as the
Rorschach or Holtzman Inkblot techniques,
ask the respondent to “explain” what is seen
in the stimulus.
Construction techniques, such as the Thematic
Apperception Test or the Children’s
Apperception Test, ask the respondent to “tell
a story” about what is represented by the
stimulus, usually a vague picture.
Expression techniques, such as the
Draw-A-Person Test or the House-Tree-Person Test,
ask the respondent to create a figure or
drawing in response to some instruction.
Arrangement techniques ask the respondent to
place in order the elements of a set (usually)
of pictures and then to “explain” the
arrangement.
Completion techniques ask the respondent
to make a complete sentence from a sentence
stem.
Historically, the results of projective
techniques have exhibited poor psychometric
properties.
However, the use of projective techniques
remains quite popular, primarily because
respondents often do disclose information,
particularly “themes” of information, not
easily obtainable through other methods.
Generally, three types of self-report
personality inventories are discussed in the
professional literature:
Theory-based inventories, such as the
Myers-Briggs Type Indicator, State-Trait Anxiety
Inventory, or Personality Research Form,
assess traits and/or behaviors in accord with
the constructs upon which the inventory is
based.
Factor-analytic inventories, such as the
Sixteen Personality Factor Questionnaire
or the NEO Personality Inventory-Revised,
assess personality dynamics outside the
context of any particular theory of
personality.
Items in these types of instruments are
selected from the results of factor analyses of
large samples of items and generally have
very good psychometric properties.
Criterion-keyed inventories, such as the
Minnesota Multiphasic Personality
Inventory-2 or Millon Clinical Multiaxial Inventory-III,
contain subscale items that discriminate
between a criterion group (e.g., schizoid or
narcissistic) and a relevant control group (e.g.,
These types of inventories usually are used
to assist in making clinical diagnoses.
Self-report personality inventories generally
have much better psychometric properties
than do projective techniques.
However, clinical diagnoses should never be
made solely on the basis of personality
instrument results; clinical judgments
should be used in combination with
assessment results.
Computers and
Clearly the most prominent trend in
appraisal today is toward “computerization”
of testing.
In computer-based testing, instruments or
techniques that are or could be in other
formats (e.g., “paper-and-pencil”) are
converted to a situation in which they are
presented on and responded to through use
of a computer.
Adaptive testing is when an item presented
subsequently to a respondent is selected
based on the quality or accuracy
of the response to the preceding item.
Adaptive testing is facilitated through the
use of computers due to the capability to
handle large numbers of contingencies and
choices efficiently and accurately.
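The selection rule can be sketched in a deliberately
simplified form: items are assumed to be ordered from
easiest to hardest, a correct response leads to the
next harder item, and an incorrect response leads to
the next easier one. Operational adaptive tests
typically use more elaborate selection models; the
function names and the simulated respondent below are
hypothetical.

```python
def next_item_index(current, was_correct, n_items):
    """Move one step harder after a correct response and one step
    easier after an incorrect one, staying inside the item bank."""
    step = 1 if was_correct else -1
    return max(0, min(n_items - 1, current + step))

def administer(answer_fn, n_items, start, length):
    """Present `length` items and return the indices presented.
    `answer_fn(index)` reports whether the response was correct."""
    presented = []
    index = start
    for _ in range(length):
        presented.append(index)
        index = next_item_index(index, answer_fn(index), n_items)
    return presented

# Simulated respondent who answers items 0-2 correctly and misses
# anything harder (items are ordered from easiest to hardest).
sequence = administer(lambda i: i < 3, n_items=7, start=0, length=6)
print(sequence)
```

In this sketch the presented items settle around the
point where the simulated respondent begins to miss
items.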
Computer-generated interpretive reports
also are increasing in frequency of use.
A computer’s capability to analyze complex
data sets and intricate patterns in data is
the primary reason for the increasing use of
computer-generated interpretive reports.
However, computer-generated interpretive
reports are only as good as the programming
underlying them, and never as good as when
used in conjunction with sound clinical
judgment.
This concludes Part 2 of the
presentation on