Transcript Document

Determining the Validity
and Reliability of Key
Assessments
Julia M. Lee – Presenting the work of the
faculty of the Dewar College of Education at
Valdosta State University
GaPSC Assessment Workshop
May 14, 2012
Tips for Developing Key
Assessments
 “Begin with the end in mind”
 Have as many Education faculty members
as possible involved in the development
 Make sure you have P-12 and A&S
partners involved
 Look at “big picture” – ultimate
outcome(s) rather than isolated
knowledge and skills
 Explicit, explicit, explicit
Tips for Meeting Standard Two
 Involve as many faculty as possible in the assessment
system
 Development
 Implementation
 Evaluation
 Analysis
 Revision
 Implementation
 .....
 Have faculty complete a “self-study” or “self-evaluation” of each program and its assessment
components
 Provide training to faculty, candidates, and other
“users” on the instruments developed
 Develop users’ guides, data calendars, data collection
documents (information to be collected, timeline,
source, responsibility, etc.)
[Graphic: qualities of a sound assessment system surrounding the phrase “Aligned with Purpose” – Validity, Reliability, Consistency, Fairness, Objectivity, Impartiality, Legitimacy, Soundness, Comprehensiveness, Stability]
Assessment System and Unit
Evaluation (2a)
“4. The professional education unit has
taken effective steps to eliminate bias in
assessments and is working to establish
the fairness, accuracy, and consistency of
its assessment procedures and
professional education unit operations.”
Impact of 2a4
 “2b1. The professional education unit maintains an
assessment system that provides regular and
comprehensive information on applicant
qualifications, candidate proficiencies, competence
of graduates, professional education unit operations,
and preparation program quality.”
 “2b3. Candidate assessment data are regularly and
systematically collected, compiled, aggregated,
summarized, and analyzed to improve candidate
performance, preparation program quality, and
professional education unit operations.”
Impact of 2a4
 “2c1. The professional education unit regularly and
systematically uses data, including candidate and
graduate performance information, to evaluate the
efficacy of its courses, preparation programs, and
clinical experiences.”
 “2c2. The professional education unit analyzes
preparation programs’ evaluation and performance
assessment data to initiate changes in preparation
programs and professional education unit
operations.”
Establishing fairness, accuracy, and
consistency of assessment procedures
and instruments: Steps taken by the COE
 Use of multiple assessments (multiple sources)
 Primary use of analytic rather than holistic rubrics
 Use of multiple raters
 Provision of training on assessment instruments
 Completion of inter-rater reliability studies and/or consensus agreement
Processes Used to Determine
Reliability and Validity of Two Key
Assessments
 College of Education Observation Instrument
 College of Education Disposition Survey
College of Education
Observation Instrument
 Part of “determining” reliability and validity
involves building instruments and supporting
documents in such a way that these issues are
considered from the very beginning
 How the COE OI was developed
 Development and implementation of an
instructional manual and training sessions
 Completion of inter-rater reliability studies
Development of the COE
Observation Instrument
 Aligned to professional education
standards (Danielson, INTASC, Georgia
Framework)
 Georgia Framework indicators that were
observable formed the foundation of the
instrument
 P-12 Teachers, P-12 Administrators, and
University faculty participated in the
development
Development and Implementation of an
Instruction Guide and Training Sessions
for the COE OI
 Training manual and training developed by
a group of P-12 educators and University
Faculty Members
 Training manual provides explicit guidance
for decision-making regarding the rubric
 Training sessions are provided for first-time users (2-hour session) as well as for ongoing users (1-hour “refresher” training)
Completion of Inter-rater
Reliability Studies
 Provided training to 17 triads (student teachers, their
P-12 mentors, and their university supervisors) on the
Instrument
 Members of each triad independently rated one teaching episode for each candidate
 Computed inter-rater agreement between P-12 mentors and university supervisors:
(Agreements / (Agreements + Disagreements)) × 100
 Agreement was computed two ways: adjacent values and standard met / not met
 Results were examined for both criteria (a computational sketch follows this slide)
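To make the computation concrete, here is a minimal Python sketch of the percent-agreement calculation described above. The rating data, the 1-4 scale, and the met/not-met cut between 2 and 3 are illustrative assumptions; the deck does not specify the COE's actual scale, cut score, or tooling.

    # Minimal sketch (not the COE's actual code): percent inter-rater agreement
    # between a P-12 mentor and a university supervisor on one rubric item.
    # "Adjacent value" agreement counts ratings within one scale point of each
    # other; "standard met" agreement counts both raters making the same
    # met / not-met decision (assumed cut between 2 and 3 on a 1-4 scale).

    def percent_agreement(mentor, supervisor, agree):
        # (Agreements / (Agreements + Disagreements)) * 100 for one item
        pairs = list(zip(mentor, supervisor))
        agreements = sum(1 for m, s in pairs if agree(m, s))
        return 100 * agreements / len(pairs)

    def adjacent(m, s):
        return abs(m - s) <= 1           # ratings within one scale point

    def standard_met(m, s, cut=3):
        return (m >= cut) == (s >= cut)  # same met / not-met decision

    # Hypothetical ratings for one item across 17 candidates' teaching episodes
    mentor_ratings     = [3, 4, 2, 3, 3, 4, 1, 3, 4, 2, 3, 3, 4, 3, 2, 4, 3]
    supervisor_ratings = [3, 3, 2, 4, 3, 4, 3, 3, 4, 2, 3, 2, 4, 3, 3, 4, 3]

    print(percent_agreement(mentor_ratings, supervisor_ratings, adjacent))
    print(percent_agreement(mentor_ratings, supervisor_ratings, standard_met))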
Inter-rater Agreement Results
(% of agreement)
ITEM     Adjacent Value Reliability     Standard Met Reliability
I-A                 88                             71
I-B                100                             94
I-D                 94                             82
II-C               100                            100
III-B               94                             88
III-C              100                             94
III-G               53                             47
IV-C               100                            100
V-B                100                            100
V-C                100                             94
V-D                 88                             82
V-F                 94                             94
What did these data tell us?
A. All items on this instrument were reliable and valid.
B. There was a high level of agreement on all items on this instrument.
C. The independent raters did not agree with each other regarding whether or not candidates met the standard for most items.
D. In general, with the exception of one item, the independent raters had similar ratings for both types of reliability evaluated.
Decisions Made
 Required all faculty who supervise to complete the training session
 Provided training to mentors who frequently
supervise student teachers
 Provided training to several cohorts of Ed.S.
students, many of whom served as public school
mentors
 Asked COE Assessment Committee to review
data and make recommendations for changes
based on reliability data
 Modified this item on the instrument
Modification of the Item
 Learning Environments
 Original Item III-G: Communication
 Rating of 1-2: Errors in spoken/written
language; ineffective nonverbal
communication; unclear directions; does not
use effective questioning skills
 Rating of 3-4: Error-free spoken/written
language; effective nonverbal communication;
directions are clear or quickly clarified after
initial student confusion; effective questioning
and discussion strategies
Modification of the Item,
continued
 Learning Environments
 New Item III-Ga: Communication
 Rating of 1-2: Errors in spoken/written language
 Rating of 3-4: Error-free spoken/written language
 New Item III-Gb: Communication
 Rating of 1-2: Ineffective nonverbal
communication; unclear directions; does not use
effective questioning skills
 Rating of 3-4: Effective nonverbal communication;
directions are clear or quickly clarified after initial
student confusion; effective questioning and
discussion strategies
College of Education Disposition
Survey
 Again, at the initial development stage, there was
a focus on reliability and validity issues
 How the Unit-adopted dispositions were
chosen
 How the COE Disposition Survey was developed
Adoption of Dispositions
 Looked at all the disposition statements in the
INTASC standards
 Collected data from P-12 educators and candidates regarding the importance of specific dispositions
 Surveyed unit faculty regarding the relative importance of each disposition statement (a ranking sketch follows this slide)
 Conceptual framework committee reviewed
results and provided input into selection
 Three primary dispositions emerged from this
process
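The winnowing step (surveying faculty on relative importance, then selecting the top few) can be pictured with a small sketch. The disposition labels other than “fairness” and “the belief that all students can learn,” and all ratings, are invented for illustration; the deck names neither the full statement list nor the rating scale used.

    # Hypothetical sketch: ranking disposition statements by mean importance
    # rating from a faculty survey (assumed 1 = not important, 5 = essential).
    from statistics import mean

    survey = {
        "Fairness": [5, 5, 4, 5, 4],
        "Belief that all students can learn": [5, 4, 5, 5, 5],
        "Reflective practice": [4, 3, 4, 4, 3],          # invented statement
        "Enthusiasm for the discipline": [3, 4, 3, 3, 4],  # invented statement
    }

    ranked = sorted(survey.items(), key=lambda kv: mean(kv[1]), reverse=True)
    for disposition, ratings in ranked[:3]:   # the highest-rated few emerge
        print(f"{disposition}: {mean(ratings):.2f}")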
Development of the COE Disposition
Survey and Advanced Disposition Survey
 Initially designed and field tested in the summer
of 2005
 Original survey consisted of 12 items
 Of those 12 items, four were targeted to
specifically address two of the unit-adopted
dispositions (fairness and the belief that all
students can learn)
 Alternate forms of survey questions were
written to address reliability
 Candidates were asked to indicate, on a Likert scale, whether they “strongly agree,” “agree,” “n/a or neutral,” “disagree,” or “strongly disagree” with statements addressing these dispositions (a scoring sketch follows this slide)
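As an illustration of how such responses might be scored, the sketch below assigns numeric values to the five response options and reverse-scores negatively worded alternate forms. Treating Statements 2 and 11 (shown on the next slide) as reverse-worded counterparts of Statements 3 and 12, and the 1-5 coding itself, are assumptions for illustration; the deck does not describe the survey's actual scoring.

    # Illustrative only: one plausible numeric coding of the 5-point Likert scale.
    SCALE = {"strongly agree": 5, "agree": 4, "n/a or neutral": 3,
             "disagree": 2, "strongly disagree": 1}

    REVERSE_WORDED = {2, 11}  # assumed reverse-worded alternate forms

    def score(statement_number, response):
        # Flip reverse-worded items so a high score always reflects the
        # desired disposition
        raw = SCALE[response.lower()]
        return 6 - raw if statement_number in REVERSE_WORDED else raw

    print(score(3, "Agree"))              # 4: positively worded item
    print(score(2, "Strongly disagree"))  # 5: disagreeing is the desired response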
Original Statements on Surveys
 Statement 2: I believe that schools today need to
get back to basics--teachers should present lessons
for everyone in the same structured way for
students to learn the content.
 Statement 3: I believe that it is important to adapt
instruction to students' different learning styles,
and help students achieve in ways they find easy to
learn.
 Statement 11: The impact of my performance as a
teacher is primarily dependent upon the students'
family backgrounds and the students' personal
motivation.
 Statement 12: I believe all students can learn.
Early Data Gathered
Semester / Transition Point       Statement 2     Statement 3     Statement 11        Statement 12
                                  (“Fairness”)    (“Fairness”)    (“Belief all Ss     (“Belief all Ss
                                                                   can learn”)         can learn”)
Fall / Admission to Program          73.68           97.37           44.74               94.74
Spring / Exit from Program          100             100              55.6               100
Summer / Admission to Program        85.71          100              21.43              100
Summer / Exit from Program           94.12          100              35.29               94.12
What did these data tell us?
A. The four items on this instrument appeared to be reliable and valid.
B. Candidates’ responses appeared to be fairly consistent in terms of these items.
C. There appeared to be little if any consistency in candidates’ responses on these items.
D. In general, candidates’ responses related to the two items addressing “fairness” appear to be consistent; this does not appear to be the case with the two items addressing “the belief that all students can learn.”
Decisions Made
 Looked more in-depth at these items (at the individual candidate level) to determine agreement for the two items addressing the belief that all students can learn (see the sketch after this slide)
 Asked COE Assessment Committee to review data
and make recommendations for changes based on
these (and other) data
 The Assessment Committee recommended rewording the item and splitting it into separate statements rather than one combined statement
 Faculty across the unit had multiple conversations
about the role of the teacher in influencing
student achievement as well as motivation.
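The candidate-level check mentioned in the first bullet might look like the following sketch: after reverse-coding the negatively worded item, compare each candidate's scores on the paired statements. The data and the one-point tolerance are hypothetical; the deck describes the kind of analysis, not its implementation.

    # Sketch of a candidate-level consistency check for an alternate-form pair.
    # A candidate responding consistently should give roughly the same
    # (reverse-coded) score to both forms of the disposition statement.

    def pair_agreement(scores_a, scores_b, tolerance=1):
        # Percent of candidates whose two scores differ by at most
        # `tolerance` scale points
        pairs = list(zip(scores_a, scores_b))
        agree = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
        return 100 * agree / len(pairs)

    # Statement 11 (reverse-coded) vs. Statement 12 for one hypothetical cohort
    stmt11_reversed = [2, 3, 2, 4, 1, 3, 2, 5, 2, 3]
    stmt12          = [5, 5, 4, 5, 5, 4, 5, 5, 4, 5]

    print(pair_agreement(stmt11_reversed, stmt12))  # a low value flags the pair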
Some Common Errors Found In
Key Assessments
 Items included that are not appropriately aligned to the standard(s) OR what is supposed to be measured
 Not adequately measuring the standard (only certain aspects)
 Not setting clear performance expectations – e.g., what is “passing” or “acceptable”? – OR setting inappropriate performance expectations
 Not matching the type of rubric to the assessment need (e.g., use of holistic vs. analytic rubrics)
 Performance descriptors on rubrics that are not sufficiently differentiated across levels
 Use of non-specific terms in performance descriptors (“some,” “effectively,” “adequately”) without explicit guidance for how those terms are to be defined
 Use of broad terms – outcomes not well defined
 Lack of appropriate balance of “brevity and detail” – either not efficient or not effective
 Lack of well-defined criteria to guide ratings – may lead to biased ratings (e.g., leniency bias)
 Not using multiple measures to assess outcomes
Candidate will integrate research findings
in his/her practice: Research proposal
Components                 Target    Acceptable    Unacceptable
------------------------   ------    ----------    ------------
Abstract
Literature Review
Research Design
Methodology
Conclusion
Use of APA style guide
Effective communication
References and Resources
 Carey, J. (2011). Outcomes assessment: Linking learning, assessment, and program improvement. PowerPoint presentation from ALA Annual Meeting, June 27, 2011.
 Darling-Hammond, L. (2006). Assessing teacher education: The usefulness of multiple measures for assessing program outcomes. Journal of Teacher Education, 57(2), 120-138.
 Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E., & Rothstein, J. (2012). Evaluating teacher evaluation. Phi Delta Kappan, (Mar 2012 Supplement), 5-6.
 Gonsalvez, C.J., & Freestone, J. (2007). Field supervisors’ assessments of trainee performance: Are they reliable and valid? Australian Psychologist, 42(1), 23-32.
 Johnson, L.E. (2008). Teacher candidate disposition: Moral judgment or regurgitation? Journal of Moral Education, 37, 429-444.
References and Resources,
continued
 Magin, D., & Helmore, P. (2001). Peer and teacher assessments of oral presentation skills: How reliable are they? Studies in Higher Education, 26, 287-298.
 McAllister, S., Lincoln, M., Ferguson, A., & McAllister, L. (2010). Issues in developing valid assessments of speech pathology students’ performance in the workplace. International Journal of Language and Communication Disorders, 45(1), 1-14.
 Oláh, L.N., Lawrence, N.R., & Riggan, M. (2010). Learning to learn from benchmark assessment data: How teachers analyze results. Peabody Journal of Education, 85, 226-245.
 Sandholtz, J.H., & Shea, L.M. (2012). Predicting performance: A comparison of university supervisors’ predictions and teacher candidates’ scores on a teaching performance assessment. Journal of Teacher Education, 63(1), 39-50.
 VSU Dewar College of Education Institutional Report (2006).