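Chapter 5: The Evolution of Test Validity and Its Validation Methods (Part 2)
Foreign Studies College, Hunan Normal University
Prof. Deng Jie (邓杰)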

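Learning objectives
1. Understand the main ideas of the unified view of validity
• The unified view of validity
• Argument-based test validation
• The Toulmin model
2. Understand the main ideas of the progressive view of validity
• Logical errors in test argument models and their causes
• The progressive argument approach
• The progressive view of validity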

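The unified view of validity

Main period: from the mid-1980s onward
Basic view: validity is unified and multidimensional
Validation method: rational argument
Argument framework: the Toulmin model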

Definition

Standards for Educational and Psychological Testing (1985)
Validity … refers to the appropriateness, meaningfulness and usefulness of the specific inferences made from test scores.

Standards for Educational and Psychological Testing (1999)
Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.

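Interpretation (what the concept of validity entails)

Unified and multidimensional
• A unified concept: construct validity becomes validity as a whole, that is, simply validity (there are no separate types)
• Complementary dimensions: what used to be different types of validity become different dimensions of one whole (they complement and depend on one another, and supply evidence for overall validity)

Shift of focus
• Validity is no longer an inherent property of the test itself; it lies in the interpretation and use of scores, including the reasonableness of score interpretation, the appropriateness of decisions based on the scores, and the beneficence of the consequences of use
• No longer a question of "present" or "absent", but a question of degree
• No longer a single index, but an overall judgment
• No longer an abstract number, but a reasoned conclusion
• There is no such thing as a validity coefficient
• No longer computed by formula, but reached through logical inference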

Validity Model (Messick 1988: 42)

                       TEST INTERPRETATION      TEST USE
EVIDENTIAL BASIS       Construct validity       Construct validity + Relevance/utility
CONSEQUENTIAL BASIS    Value implications       Social consequences

1. An inductive summary of convergent and discriminant evidence that the test scores have a plausible meaning or construct interpretation
2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
(cited from McNamara & Roever, 2006: 25)

Chain of inferences:
Observation
  -> Evaluation via scoring procedure -> Observed Score
  -> Generalization via reliability studies -> Universe Score
  -> Extrapolation to nontest behavior; explanation in terms of a model -> Target Score (Inference)
  -> Relevance, associated values, consequences -> Decision

Descriptive interpretations (no particular use specified): semantic inferences, that is, claims about what the test scores mean. These cover the steps from Observation to Target Score.
Decision-based interpretations: policy inferences, that is, claims about positive consequences of adopting decision rules. These cover the step from Target Score to Decision.

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence (Observation)            Principle (Verification)          Claim (Inference)
What the students say or do       Statistical models;               What the students know or can do
                                  probability-based reasoning
Test construction,                Relevance of data,                Target knowledge,
administration, scoring,          value of observations             acquisition process,
and reporting                     as evidence                       contextualized use

Evidence-based Validation (Weir 2005)

[Diagram: Weir's validation framework. Recoverable labels, in order of appearance:]
A Priori Evidence: Test Taker Characteristic; Context Validity (Setting: task; Demand: task; Setting: administration); Theory-based Validity (Executive Processes; Executive Resources; monitoring)
Response
A Posteriori Evidence: Scoring Validity (Scoring); Score/Grade; Score Interpretation; Criterion-related; Reliability; Consequential Validity; Criterion-related Validity (Score Value)

However, is there not a problem if we do not have a clear idea of what we want to measure before we construct and administer a test to students?
Is there a problem with a 'suck-it-and-see' approach?

Assessment Use Argument (AUA, Bachman & Palmer 2010)

[Diagram: a chain linking Assessment Tasks, Test Taker's Performance, Assessment Record (score, description), Intended/Actual Interpretation(s), Intended/Actual Decision(s) and Intended/Actual Consequence(s). The two directions along the chain are labelled INTERPRETATION AND USE and ASSESSMENT DEVELOPMENT. Each link carries Warrants and Rebuttals.]

1. Claim: consequences are
• beneficial
2. Claim: decisions are
• values sensitive
• equitable
3. Claim: interpretations are
• meaningful
• impartial
• generalizable
• relevant
• sufficient
4. Claim: assessment records are
• consistent

Logical Structure of IA, ECD, AUA

IA, based on Kane (1990, 1992)
[Diagram, two panels. Chain of Inferences (labels: observation, score, Observed score, generalization, Universe score, extrapolation, Target score, interpretation, decision, Consequence). Base Argument (labels: inference, Premise, Conclusion, Assumption, Challenge, Evidence (?), Validity, Most questionable assumption).]

ECD, Mislevy et al. (2003: 15)
D, so C, since W, on account of B, unless A (Alternative Explanation or Rival Hypothesis); R supports or weakens A.

AUA, Bachman (2005: 15)
Data, so Claim, since Warrant, on account of Backing, unless Rebuttal; Rebuttal Data (Rebuttal Backing) support, weaken or reject the Rebuttal.

Logical Problems of ECD

Data (D): Sue's essay uses three incidents to illustrate Hamlet's indecisiveness.
so Claim (C): Sue can use specifics to illustrate a description of a fictional character.
since Warrant (W): Students who know how to use writing techniques will do so in an assignment that calls for them.
on account of Backing (B): The past three terms, students' understandings of the use of techniques in in-depth interviews have corresponded with their performances in their essays.
unless Alternative (A): The student has not actually produced the work.
Rebuttal (R), which supports A: Sue's essay is very similar to the character description in the Cliff Notes guide to Hamlet.

Logical Problems of AUA-1
(Bachman & Palmer, 2010, p. 97)

Data: Jim is going to the hospital.
so Claim: Jim is sick.
since (Warrant): People often go to the hospital when they are sick.
unless Rebuttal: Jim could be visiting someone who is in the hospital.
Rebuttal Backing (supports the Rebuttal): Jim is visiting his partner in the hospital.
Counterclaim: Jim is not sick.

Questions raised:
• If we already know Jim is visiting his partner in the hospital, do we still need to go through all these steps?!
• What if he is seeing the doctor himself as well?
• What if Jim is attending a meeting in the hospital, not visiting anyone in particular?

Logical Problems of AUA-2
(Bachman & Palmer, 2010, p. 98)

Data: Malissa worked overtime.
so Claim: Malissa was paid time and a half.
since Warrant: All employees who work overtime must be paid time and a half.
Backing: According to US labor law ...
unless Rebuttal: Malissa is in an exempt category.
Rebuttal Backing (rejects the Rebuttal): Malissa's personnel file indicates that she is not in an exempt category.

Question raised:
• Can it still be called Rebuttal Backing if it rejects the Rebuttal?

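The Toulmin Model (Toulmin 1958, 2003)

D: Harry was born in Bermuda
So, Q, C: So, presumably, Harry is a British subject
Since W: A man born in Bermuda will generally be a British subject
On account of B: The following statutes and other legal provisions: ……
Unless R: Both his parents were aliens / He has become a naturalized American / ……

Notes on the elements:
• Rational logic: rational reasoning takes as its premise a presumptive reason that is acceptable under normal circumstances, so the conclusion should be reasonable.
• Q (properly worded qualifier). Not to be omitted: the conclusion is usually not absolute, so a suitable qualifier should be chosen, according to how likely a rebuttal is, to limit the force of the claim or the conditions under which it holds.
• W (highly probable assumption). Beyond dispute: the presumptive reason should be self-evident or already proven.
• R (rare and exceptional conditions). Can be ignored: exceptions are not enough to threaten the overall reasonableness of the claim. Must be ignored: pursuing exceptions means falling into an endless loop.
• B (readily available facts or truth). Objectively existing: factual support should be available on demand, with no need for argument.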
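The six elements of the Toulmin layout (D, Q, C, W, B, R) can be held in a small data structure. The following is only a minimal sketch: the class name `ToulminArgument`, its field names and the `render` method are hypothetical names introduced here for illustration, not part of Toulmin's text or of the lecture. The qualifier defaults to "presumably" to reflect the point that it should not be omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToulminArgument:
    """One argument in Toulmin's layout (all names are illustrative)."""
    data: str                      # D: the facts appealed to
    claim: str                     # C: the conclusion drawn from the data
    warrant: str                   # W: presumptive reason licensing D -> C
    backing: str                   # B: readily available facts behind W
    qualifier: str = "presumably"  # Q: limits the force of the claim
    rebuttals: List[str] = field(default_factory=list)  # R: rare exceptions

    def render(self) -> str:
        """Spell out: D. So, Q, C, since W, on account of B, unless R."""
        text = (f"{self.data}. So, {self.qualifier}, {self.claim}, "
                f"since {self.warrant}, on account of {self.backing}")
        if self.rebuttals:
            text += ", unless " + " or ".join(self.rebuttals)
        return text + "."


# The Harry example from the slide; the backing is left as elided there.
harry = ToulminArgument(
    data="Harry was born in Bermuda",
    claim="Harry is a British subject",
    warrant="a man born in Bermuda will generally be a British subject",
    backing="the following statutes and other legal provisions",
    rebuttals=["both his parents were aliens",
               "he has become a naturalized American"],
)
print(harry.render())
```

Keeping the rebuttals as a plain list that only qualifies the rendered claim, rather than as arguments to be proven in turn, mirrors the reading that exceptions limit a claim and are not themselves pursued.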

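Logical errors shared by IA, ECD and AUA, and their root causes

Logical errors

Self-contradiction
• The claim comes first and the argument afterwards: a claim is made and is then said not necessarily to hold
• Arguing about the rebuttal is emphasized, yet in the course of arguing about it that argument has to be abandoned

Irrationality
• A claim is forced through even though it is known that it may not hold (a hypothesis is put forward as a claim)
• The argument about the rebuttal neither gives reasons nor considers rebuttals (a conclusion is forced through once again)

Infinite loop
• Rebuttals are inexhaustible and may even be unforeseeable
• The rebuttal of a claim's rebuttal is the claim itself

Root causes

Structural modification
• Evidence for the rebuttal was added
• The qualifier was deleted

Misunderstanding of the model
• The hypothesis is called a claim (because a model without a claim could not be called an argument model)
• The rebuttal is changed from a special exception that must be ignored into a negative interpretation that cannot be ignored (in order to dispel doubts and objections)
• The process of logical reasoning that both sides of a debate should follow is mistaken for the process of dispute between the two sides

Misuse of the rebuttal
• The rebuttal is used to argue for the claim rather than to qualify it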

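Toulmin's critique of the syllogism

Element          Example 1              Example 2 (p. 115)
Minor Premise    Socrates is a man      Anne is one of Jack's sisters
Major Premise    All men are mortal     All Jack's sisters have red hair
Conclusion       Socrates is mortal     So, Anne has red hair

Example 2 in Toulmin's layout:
D: Anne is one of Jack's sisters
So, presumably, C: Anne now has red hair
Since W: Any sister of Jack's may be taken to have red hair
On account of the fact that B: All his sisters have previously been observed to have red hair
Unless R: Anne has dyed / gone white / lost her hair…

• The major premise is ambiguous: it can be either an assumption or a fact, so the syllogism cannot tell genuine arguments from spurious ones.
• The conclusion is either yes or no and admits no exceptions, so the syllogism is of little practical value in everyday argument.

Example 1 takes an assumption as its major premise, and its conclusion is an inference about the future or the unknown, so it is open to argument.
In Example 2 the major premise is a fact, and the conclusion merely repeats that fact rather than resulting from inference, so there is nothing to argue about. If the fact is doubted, argument is unnecessary: simply present the fact (for example, call Anne over and her hair colour will speak for itself).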

The AUA example recast in the Toulmin model
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
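Thus the qualifier is the only explicit difference between the Toulmin model and the syllogism. Without a qualifier, the Toulmin model itself becomes a syllogism, which is exactly what Toulmin criticized. In IA, ECD and AUA the qualifier has been deleted, and the so-called "claim" is really a hypothesis. The three models are therefore not really argument models. Nor are they models of justification, because even if the "claim" is renamed a hypothesis, how that hypothesis is to be tested remains unknown.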

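Progressive validity and the progressive argument approach

Main period: most recently proposed (2011)
Basic view: validity accumulates stage by stage
Validation method: the progressive argument approach
Argument framework: the progressive argument model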

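Definition
The degree to which test data reflect the target construct of the test

The stage-by-stage progressive view
• Validity is relative to the stages of testing. The resulting data of every stage, not only the post-test scores, should fully reflect the target construct of the test.
• The validity of the current stage is the cumulative result of the validity of all preceding stages, and it affects the validity of all subsequent stages.
• Progression means that the validity of a stage can be no greater than that of the weakest preceding stage. If the validity of one stage is unacceptable, no subsequent stage has any validity to speak of.

Notes:
1. A progressive argument can start from any stage, provided there is reason to believe that the preceding stages are valid; otherwise a starting point could never be found.
2. Although validity is a matter of degree, a test is "valid" as long as its validity reaches an acceptable degree, and "invalid" otherwise.
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of data; otherwise it would be interpretation validity or use validity.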
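The rule that a stage can be no more valid than its weakest predecessor can be illustrated with a short sketch. The numeric scores in [0, 1], the 0.6 acceptability threshold and the function name `progressive_validity` are assumptions made purely for illustration; the lecture gives no numeric scale, and the stage names are borrowed from the stage diagram.

```python
def progressive_validity(stage_scores, threshold=0.6):
    """Cap each stage by the weakest stage before it and flag acceptability.

    stage_scores: list of (stage_name, own_score) in testing order.
    Scores in [0, 1] and the default threshold are illustrative assumptions.
    Returns a list of (stage_name, effective_score, acceptable).
    """
    results = []
    ceiling = 1.0      # upper bound inherited from all preceding stages
    chain_ok = True    # turns False once any stage is unacceptable
    for name, own in stage_scores:
        effective = min(own, ceiling)   # no greater than weakest predecessor
        chain_ok = chain_ok and effective >= threshold
        results.append((name, effective, chain_ok))
        ceiling = effective
    return results


# Hypothetical scores: the Task stage is the weak link.
scores = [("Construct", 0.9), ("Specification", 0.8),
          ("Task", 0.5), ("Response", 0.95), ("Score", 0.9)]
for name, effective, ok in progressive_validity(scores):
    print(name, effective, "valid" if ok else "invalid")
```

In this made-up run the Response and Score stages are capped at the Task stage's value and reported as invalid, however good their own data look, which is the sense in which later stages have no validity to speak of.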
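[Diagram: the stages of testing and the data each stage produces. Phases: A Priori, A Posteriori. Processes: Dsgn/Invstg, Developing, Administrating, Scoring, Referencing, Using. Data: Purpose/Consq, Construct, Specification, Task, Response, Score, Criterion.]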

The progressive argument approach

Planning (from effect to cause): list the questions in detail and state the hypotheses explicitly. (Hypothesis)
Executing (from cause to effect): test the hypotheses one by one and draw reasoned conclusions. (Claim)

[Diagram: each stage's data carry a numbered list of questions (1., 2., 3., 4., …):]
Consequence: Beneficence, Fairness, Ethics
Criterion: Comparability, Reference Value, Predictability
Score: Reliability, Item Quality, Language level
Response: Relevancy, Authenticity, Interactiveness
Task: Correctness, Representativeness, Sufficiency
Specification: 1. Clarity  2. Specificity  3. Practicability  4. …
Construct

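The progressive argument model: an organic integration of rational argument and scientific inquiry

• Base part: rational argument, which ensures that the model remains an argument model in essence (the design, execution and interpretation of statistical analyses all depend on logical reasoning)
• Extended part: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone is suitable only when the data are simple and clear and the reasons are obvious)
• No endless loop
1. As long as the reasons are sufficient, no hypothesis test is needed (this avoids overuse)
2. One hypothesis test necessarily yields a conclusion (misuse must be avoided)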

[Flowchart: interpreting hypothesis test results]
Hypothesis -> Analysis (evidential?): N / Y
Hypothesis test (H0|H1) mapped onto the Toulmin elements:
Data (p)
So Qualifier (α), Claim
Since Warrant (c = 1-α)
on account of Backing
Unless Rebuttal (α/β)
(a) Decision rule
D = p: probability
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)
p ≤ α?  N -> H0;  Y -> H1

(b) e.g. D = 0.8
C0: There is no significant difference
R = Type II error (β)

(c) e.g. D = 0.0
C1: The difference is significant
R = Type I error (α)

In both (b) and (c), D leads to the claim (C0 or C1) through W, B, R and Q:
W = 1-α = 0.95 (confidence level); B = Empirical data (e.g. statistics)
Q = at the significance level of α
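The mapping from a hypothesis test onto the Toulmin elements can be written out as a small function. This is a sketch under stated assumptions: the function name `interpret_test` and the dictionary keys are hypothetical, and the wording of the claims, rebuttals and qualifier simply follows panels (a) to (c). It takes a p-value as given and does not compute one.

```python
def interpret_test(p, alpha=0.05):
    """Read a p-value as a Toulmin-style argument, following panels (a)-(c)."""
    if not 0.0 <= p <= 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("p must lie in [0, 1] and alpha in (0, 1)")
    significant = p <= alpha            # decision rule (a): p <= alpha?
    return {
        "data": p,                                             # D = p
        "claim": ("C1: The difference is significant" if significant
                  else "C0: There is no significant difference"),
        "warrant": 1 - alpha,                                  # W = 1 - alpha
        "rebuttal": ("Type I error (alpha)" if significant
                     else "Type II error (beta)"),             # R
        "qualifier": f"at the significance level of {alpha}",  # Q
    }


# The two worked values from panels (b) and (c).
for p_value in (0.8, 0.0):
    result = interpret_test(p_value)
    print(result["claim"], "|", result["qualifier"], "| unless", result["rebuttal"])
```

Whichever branch is taken, the function returns exactly one claim with its qualifier and its single possible rebuttal, which is the sense in which one test always yields a conclusion and does not open a further round of rebuttals.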

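Justifying and falsifying

[Diagram: two mirrored Toulmin arguments drawn from the same Evidence. Justifying: (D), so Q, C, since W, on account of B, unless R. Falsifying: (D'), so Q', C', since W', on account of B', unless R'. C and C' are both Interpretations.]

• Positive and negative interpretations: an interpretation favourable to the test is a positive interpretation; otherwise it is a negative interpretation.
• Claims and counterclaims: a claim may be either a positive or a negative interpretation. In other words, a claim is not the same as a positive interpretation, and a counterclaim is not the same as a negative interpretation.
• Evidence and counter-evidence: evidence only ever "supports" a claim and never "rejects" it. So-called counter-evidence is in fact evidence that supports the counterclaim.
• Justifying and falsifying: falsification is in fact achieved indirectly, by justifying the counterclaim.

The content of the positive and negative interpretations is fixed: once the research question is set, both are determined.
The content of the claim and counterclaim is not fixed: the counterclaim depends on the claim for its existence, and without a claim there is no counterclaim. Claim and counterclaim appear only after the research results are in.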

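Iteration and recursion

[Diagram: a tree of questions with Q (Final Claim) at the root, branching into Q 1-1 (Q 1-1-1, Q 1-1-2, ……), Q 1-2 (Q 1-2-1, Q 1-2-2, ……), ……, down to the Initial Data. Planning: Q = Hypothesis. Executing: Q = Claim.]

• Solving each question requires one separate use of the argument model.
• Iteration: questions at the same level are solved one after another. The model is used in a loop, that is, when one argument ends the next one begins. These are called sibling arguments (Sibling Argument).
• Recursion: questions at different levels are solved level by level. The model is used recursively, that is, another argument is started before the current one has finished. This is called a sub-argument (Sub-Argument); when the sub-argument ends, control returns to the current argument.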
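The difference between sibling arguments and sub-arguments can be seen in a short traversal of a question tree. This is a minimal sketch: the `Question` class, the `argue` function and the rule that a question is accepted only if all of its sub-questions are accepted are assumptions introduced for illustration, since the lecture does not state how sub-results are combined.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    """A node in the question tree (names here are illustrative)."""
    label: str
    holds: bool = True                  # outcome for a leaf question
    children: List["Question"] = field(default_factory=list)


def argue(question, log, depth=0):
    """Resolve one question with one use of the argument model.

    Calling argue() on a child before the current call has finished is a
    sub-argument (recursion); moving on to the next child after one has
    finished is a sibling argument (iteration).
    """
    log.append("  " * depth + "start " + question.label)
    if question.children:
        outcomes = [argue(child, log, depth + 1)     # sub-argument
                    for child in question.children]  # sibling arguments
        accepted = all(outcomes)   # assumed combination rule
    else:
        accepted = question.holds
    log.append("  " * depth + "end " + question.label)
    return accepted


tree = Question("Q", children=[
    Question("Q 1-1", children=[Question("Q 1-1-1"), Question("Q 1-1-2")]),
    Question("Q 1-2", children=[Question("Q 1-2-1"), Question("Q 1-2-2")]),
])
trace = []
print(argue(tree, trace))
print("\n".join(trace))
```

The printed trace shows "start Q" first and "end Q" last, with every sub-argument opened and closed in between, and Q 1-2 starting only after Q 1-1 has ended.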

Application example: a progressive argument on the guessability of options

[Diagram. Planning (Q = Hypothesis) and Executing (Q = Claim). Question labels: Guessability, Option, Rating, Scale, Quality, Review, Factor, Consistency, p. Flow labels: Hypothesis, Recursion, Evidential? (N / Y). Toulmin elements: Warrant-c, Backing-s, Rebuttal-α/β, Qualifier-α.]

Legend:
α: significance level or Type I error
β: Type II error
p: probability
c: confidence level (1-α)
s: statistics theory

Slide 2

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
Descriptive interpretations—no particular use specified

Evaluation via
scoring
procedure

Observation

Generalization
via reliability
studies

Observed
Score

Decision-based interpretations

Extrapolation to
nontest behavior;
Explanation in terms
of a model

Universal
Score

Semantic inferences—claims about what the test scores mean

(转引自McNamara & Roever, 2006: 25)

Relevance,
associated values,
consequences

Target
Score
(Inference)

Decision

Policy inferences— claims
about positive consequences
of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence

Principle

Claim

（Observation）

（Verification）

（Inference）

What the students
say or do

Statistical models;
Probability-based
reasoning

What the students
know or can do

Test construction,

Relevance of data,

Target knowledge,

administration,

value of observations

acquisition process,

scoring, and

as evidence

contextualized use

reporting

Evidence-based Validation (Weir 2005)
Test Taker
Characteristic
Context Validity
Setting: task
Demand: task
•
•
•
•
Setting: administration
•
•

Executive
Processes
•
•

monitoring

Theory-based Validity
Executive
Resources
•
•

A Priori Evidence
Response

A posteriori Evidence
However, is there not a problem if
we do not have a clear idea of
what we want to measure before
we construct and administer a test
to students?

Is there a problem with a
'suck-it-and-see' approach?

Scoring Validity
•
•

Scoring

Score/Grade

•
•

Score Interpretation

•
•
•
•
•
•

Criterion-related
Reliability

Consequential
Validity

Criterion-related
Validity
Score Value

Assessment Use Argument (AUA,
Bachman & Palmer 2010)
Warrants and
Warrants
and
Rebuttals

Intended/Actual
Consequence(s)

Intended/Actual
Interpretation(s)
Assessment Record
(Score, description)
Test Taker’s Performance

Assessment Tasks

Rebuttals

Warrants and

INTERPRETATION AND USE

ASSESSMENT DEVELOPMENT

Intended/Actual
Decision(s)

1. Claim: consequences are

Warrants
and
Rebuttals
Rebuttals

1. Claim:
consequences are
• beneficial
• beneficial

2. Claim: decisions are
2. Claim:
decisions
are
• values
sensitive
• values
sensitive
• equitable
• equitable

3. Claim: interpretations are

3. Claim: interpretations are
• meaningful
• meaningful
• impartial
• impartial
• generalizable
• generalizable
• relevant
• relevant
• sufficient
• sufficient
Warrants
and and
Warrants
Rebuttals
Rebuttals

Warrants
and and
Warrants
Rebuttals
Rebuttals

4. Claim
assessment
recordsrecords
are
4. Claim:
assessment
are
• consistent
• consistent

TestPerform
Taker'sance
Performance
Assessment tasks

Assessment tasks

Logical Structure of IA, ECD, AUA
Chain of Inferences

Base Argument
inference

Consequence
Premise

decision
Target score

Conclusion

interpretation

extrapolation
Universal score

Assumption

Challenge

Evidence
(?)

Evidence
(?)

Validity

score

generalization
Observed score

Most questionable
assumption

observation
IA, based on Kane (1990, 1992)

unless

unless A

since

Warrant

W

on
account
of
B

Claim

Alternative Explanation or
Rival Hypothesis

C

support
weaken

so
D

R

ECD, Mislevy et al. (2003: 15)

on
account
of

Backing

Rebuttal

since

support
weaken
reject

so
Data

Rebuttal

AUA, Bachman (2005: 15)

Rebuttal Data
Rebuttal Backing

Logical Problems of ECD
Claim
C: Sue can use specifics
to illustrate a description
of a fictional character.

Warrant

Alternative

W: Students who know how
to use writing techniques
will do so in an assignment
that calls for them.
on
account
of

since

Backing

B: The past three terms,
students’ understandings of
the use of techniques in indepth interviews have
corresponded with their
performances in their essays.

unless

so

A: The student has not
actually produced the work.

Rebuttal

D: Sue’s essay uses three
incidents to illustrate
Hamlet’s indecisiveness.

supports

R: Sue’s essay is very
similar to the character
description in the Cliff
Notes guide to Hamlet.

Data

Logical Problems of AUA－1
Counterclaim：
Jim is not sick.
Claim：Jim is sick.

If we already
know Jim is
visiting hisunless
partner in the
since
(Warrant)：People
often go to the hospitalhospital, do we
go
when they are sick. still need to so
through all these
steps?!
Data：Jim
is going to the
hospital.
(Bachman & Palmer, 2010, p. 97)

What if he is seeing the
doctor himself as well?

Rebuttal：Jim could
be visiting someone
who is in the hospital.
Supports

Rebuttal Backing：
Jim is visiting his
partner in the hospital.
What if Jim is attending a meeting in the
hospital, not visiting anyone in particular?

Logical Problems of AUA－2
Claim: Malissa was
paid time and a half.
unless
Warrant: All
Employers who work
overtime must be paid
time and a half.

since
so

Rebuttal: Malissa is in an
exempt category.
Rejects
Rebuttal Backing:
Malissa’s personnel file
indicates that she is not in an
exempt category.

Backing: According to
US labor law ...
Data: Malissa
worked overtime.
(Bachman & Palmer, 2010, p. 98)

Can it still be called
Rebuttal Backing if it
rejects the Rebuttal?

The Toulmin Model (Toulmin 1958, 2003)
(properly worded qualifier)

Rational logic

Harry was born
in Bermuda

D

So,

理性推理：以一般情况下都可以
接受的假定性理由为前提，结论
应该具有合理性
(highly probable assumption)

Since

Unless

W

R

A man born in Bermuda will
generally be a British subject

presumably Harry
So, presumably,
is a British subject
不可省略：结论通常不是绝对
的，应该根据反驳的可能性选
用一个恰当的限定词限定声明
的语气强度或成立条件
(rare and exceptional conditions)

Both his parents were aliens/
He has become a naturalized American/
……

不容置疑：假定性理由应不证
自明，或已事先证明

On account of
(readily available facts or truth)

Q, C

B

可以忽略：例外不足以威胁声
明的整体合理性；
必须忽略：追究例外即为陷入
死循环

The following statutes and other legal provisions:
……

客观存在：事实性支
撑应可随时奉取，而
无需争辩

IA、ECD和AUA共同的逻辑错误
及其产生根源
逻辑错误


自相矛盾








先声明后论证，即先作出声明后又
说自己的声明不一定成立
强调论证反驳，但在论证反驳时又

不得不放弃论证反驳



结构修改



明知声明不一定成立，也要强行作
出声明（将假设作为声明提出）
对反驳的论证，既不讲理由也不顾
反驳（又一次强行做出结论）





无限循环



反驳不可穷尽，甚至不可预知
声明的反驳的反驳正是声明自身



增加了反驳的证据
删除了限定词

模型误解


不具理性




错误根源

将假设称为声明（因为没有声明的
模型就不能称为辩论模型）
将反驳由必须忽略的特殊例外替换
为不可忽略的反面解释（为了消除
质疑和异议）
将辩论双方都应该遵循的逻辑推理
过程误解为双方的争辩过程

反驳误用


用反驳来论证声明而不是限定声明

图尔明对三段论的批判
Element
Minor Premise
Major Premise
Conclusion

Example 1
Socrates is a man
All men are mortal
Socrates is mortal

Example 2（p.115）
Anne is one of Jack’s sisters
All Jack’s sisters have red hair
So, Anne has red hair

Anne now
So,
presumably has red hair

Anne is one of
Jack’s sisters




Since

大前提存在歧义，既可以是
Unless
Any sister of Jack’s
may be taken to have Anne has dyed/
假定，也可是事实，因此三
gone white/
red hair
段论不能区分真假辩论。
lost her hair…
结论非是即否，容不得例外， On account of the fact that
因此三段论在日常辩论中应 All his sisters have previously
been observed to have red hair
用价值不大。
例1以假定为大前期，结论为对未来或未知的推理，因此可争可辩；
例2的大前提为事实，结论实为大前提事实的重复，而不是推理的结果，因此
无可争辩。如对事实存在质疑，争辩没有必须，摆出事实即可（如把Anne叫
到跟前，头发颜色自知）。

基于图尔明模型的AUA示例
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
可见，限定词是图尔明模型与三段论的唯一的显性差别，没有限定词，图尔
明模型也成了三段论，这正是图尔明所批判的。IA、ECD和AUA中，限定词都
已被删除，且所谓的“声明”实为假设。因此，三个模型实质上并不是辩论模
型，也不是所谓的论证模型，因为即使将“声明”改为假设，但如何检验假设
仍然不得而知。

主要时期：最新提出（2011）
基本观点：层级累进观
验证方法：累进辩论法
辩论框架：累进辩论模型

累进效度及累进辩论法

定义


测试数据对测试目标构念的体
现程度

层级累进观






效度是相对于测试环节而言
的。每个环节的结果数据，
而不仅仅是测后分数，都应
该充分体现测试的目标构念
当前环节的效度是所有前任
环节效度层级累进的结果，
并对所有后续环节的效度产
生影响。
累进意味着一个环节的效度
最大不大于最薄弱前任环节
的效度；一个环节的效度不
可接受，所有后续环节都没
有效度可言。

说明：
1. 累进辩论可以始于任何一个环节，只要
有理由相信前任环节是有效的，否则永远找
不到起始点。
2. 效度虽是“程度”问题，但只要达到可
以接受的程度，测试就是“有效”的，否则
即为“无效”。
3. 测试效度自然是测试固有的属性，而不
属于数据的解释或使用，否则就是解释效度
或使用效度。
A Priori

A Posteriori

Response

Administrating

Scoring
Score

Construct

Task

Using

Developing
Dsgn/Invstg
Specification

Purpose/Consq
Referencing
Criterion

累进辩论法
1.
2.
3.
4.

Comparability
Reference Value
Predictability
1.
…
2.
3.
Criterion 4.

Beneficence
Fairness
Ethics
1.
…
2.
3.
Consequence 4.

Planning

Hypothesis

由果及因：
详细列举问题
明确提出假设

由因及果：
逐一检验假设
做出理性结论
Construct
Claim
Reliability
Item Quality
Language level
1.
…
2.
3.
4.
Score

Executing
Relevancy
Authenticity
Interactiveness
1.
…
2.
3.
Response 4.

Correctness
Representativeness
Sufficiency
1. Clarity
…
2. Specificity
3. Practicability
4. …
Task
Specification

累进辩论模型：理性辩论与科学调
查的有机整合




基础部分：理性辩论，确保模型本质上仍然属于辩论模型（统计分
析的设计、实施和解读都离不开逻辑推理）
扩展部分：假设检验，用于处理复杂数据并得出有说服力的结论
（逻辑推理仅适用于数据简单明了、理由显而易见的情况）
不会陷入死循环
1.只要理由充分，无需假设检验
（可避免滥用）

1.一次假设检验，必然得出结论
（需避免误用）

Hypothesis

Analysis
(evidential?)

N

(H0|H1)

Y

Data
(p)

Claim
(c=1-α)
Since
Warrant
on account of
Backing

(α/𝛽)
Unless
Rebuttal

(α)
So
Qualifie
r

假
设
检
验
结
果
解
读

H0
N

p≤α?
D

Y
H1
(a)
D=p: probability
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)
D

C0
W

B

R

Q

D

C1
W

R

Q

B
(b)

e.g. D=0.8
C0: There is no significant difference
R=Type II error (β)

(c)
e.g. D=0.0
C1: The difference is significant
R=Type I error (α)

W=1-α=0.95 (confidence level); B=Empirical data (e.g. statistics)
Q=at the significance level of α

证实与证伪

正面解释与反面解释：有利于测试
的解释为正面解释，反之即为反面
解释。

Interpretation

Falsifying
Q

C’

C

R

Q'

R'
W'

W
Justifying
B

声明与反声明：声明既可以是正面
解释，也可以是反面解释。也就是
说，声明并不等于正面解释，反声
明亦不等于反面解释。

(D')

(D)
Evidence

证据与反面证据：证据只会“支持”
而不会“拒绝”声明。所谓反面证
据，实际上是支持反声明的证据。

B'

证实与证伪：证伪实际上是通过证
实反声明来间接实现的。

正面解释和反面解释的内容是确定的。研究问题一旦确定，正面与反面解释随之确定；
声明与反声明的内容是不确定的，反声明依赖于声明而存在，没有声明也就无所谓反声明。只有研究
结果产生以后，声明和反声明才会出现。

循环与递归
(Final Claim)

Q
Planning

每解决一个问题，需要单独使用一
次辩论模型

Q 1-1
Q 1-1-1
Q 1-1-2

(Q = Hypothesis)

……
Q 1-2
Q 1-2-1
(Q = Claim)
Q 1-2-2
……
Executing

……

(Initial Data)

循环：逐步解决同一层级的问题，
涉及模型的循环使用，即一次辩论
结束后接着开始另一个辩论，称为
同级辩论（Sibling Argument）
递归：逐级解决不同层次的问题，
涉及模型的递归使用，即当前辩论
还未结束又启动另一个辩论,称为
子辩论（Sub-Argument），子辩论
结束后再返回当前辩论

应用示例：选项可猜性的累进辩论
(Q = Hypothesis)

Executing
Factor

Review

Quality

Scale

Planning

Rating
p
Option

Consistency

Hypothesis
Recursion

N
Evidential?

α - significance level or
Type I error
β - Type II error
p - Probability
c – confidence interval
s – statistics theory

Y

Warrant-c

Backing-s

Guessability
Rebuttal-α/β

Qualifier-α

(Q = Claim)

Slide 3

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
Descriptive interpretations—no particular use specified

Evaluation via
scoring
procedure

Observation

Generalization
via reliability
studies

Observed
Score

Decision-based interpretations

Extrapolation to
nontest behavior;
Explanation in terms
of a model

Universal
Score

Semantic inferences—claims about what the test scores mean

(转引自McNamara & Roever, 2006: 25)

Relevance,
associated values,
consequences

Target
Score
(Inference)

Decision

Policy inferences— claims
about positive consequences
of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence

Principle

Claim

（Observation）

（Verification）

（Inference）

What the students
say or do

Statistical models;
Probability-based
reasoning

What the students
know or can do

Test construction,

Relevance of data,

Target knowledge,

administration,

value of observations

acquisition process,

scoring, and

as evidence

contextualized use

reporting

Evidence-based Validation (Weir 2005)
Test Taker
Characteristic
Context Validity
Setting: task
Demand: task
•
•
•
•
Setting: administration
•
•

Executive
Processes
•
•

monitoring

Theory-based Validity
Executive
Resources
•
•

A Priori Evidence
Response

A posteriori Evidence
However, is there not a problem if
we do not have a clear idea of
what we want to measure before
we construct and administer a test
to students?

Is there a problem with a
'suck-it-and-see' approach?

Scoring Validity
•
•

Scoring

Score/Grade

•
•

Score Interpretation

•
•
•
•
•
•

Criterion-related
Reliability

Consequential
Validity

Criterion-related
Validity
Score Value

Assessment Use Argument (AUA,
Bachman & Palmer 2010)
Warrants and
Warrants
and
Rebuttals

Intended/Actual
Consequence(s)

Intended/Actual
Interpretation(s)
Assessment Record
(Score, description)
Test Taker’s Performance

Assessment Tasks

Rebuttals

Warrants and

INTERPRETATION AND USE

ASSESSMENT DEVELOPMENT

Intended/Actual
Decision(s)

1. Claim: consequences are

Warrants
and
Rebuttals
Rebuttals

1. Claim:
consequences are
• beneficial
• beneficial

2. Claim: decisions are
2. Claim:
decisions
are
• values
sensitive
• values
sensitive
• equitable
• equitable

3. Claim: interpretations are

3. Claim: interpretations are
• meaningful
• meaningful
• impartial
• impartial
• generalizable
• generalizable
• relevant
• relevant
• sufficient
• sufficient
Warrants
and and
Warrants
Rebuttals
Rebuttals

Warrants
and and
Warrants
Rebuttals
Rebuttals

4. Claim
assessment
recordsrecords
are
4. Claim:
assessment
are
• consistent
• consistent

TestPerform
Taker'sance
Performance
Assessment tasks

Assessment tasks

Logical Structure of IA, ECD, AUA
Chain of Inferences

Base Argument
inference

Consequence
Premise

decision
Target score

Conclusion

interpretation

extrapolation
Universal score

Assumption

Challenge

Evidence
(?)

Evidence
(?)

Validity

score

generalization
Observed score

Most questionable
assumption

observation
IA, based on Kane (1990, 1992)

unless

unless A

since

Warrant

W

on
account
of
B

Claim

Alternative Explanation or
Rival Hypothesis

C

support
weaken

so
D

R

ECD, Mislevy et al. (2003: 15)

on
account
of

Backing

Rebuttal

since

support
weaken
reject

so
Data

Rebuttal

AUA, Bachman (2005: 15)

Rebuttal Data
Rebuttal Backing

Logical Problems of ECD
Claim
C: Sue can use specifics
to illustrate a description
of a fictional character.

Warrant

Alternative

W: Students who know how
to use writing techniques
will do so in an assignment
that calls for them.
on
account
of

since

Backing

B: The past three terms,
students’ understandings of
the use of techniques in indepth interviews have
corresponded with their
performances in their essays.

unless

so

A: The student has not
actually produced the work.

Rebuttal

D: Sue’s essay uses three
incidents to illustrate
Hamlet’s indecisiveness.

supports

R: Sue’s essay is very
similar to the character
description in the Cliff
Notes guide to Hamlet.

Data

Logical Problems of AUA－1
Counterclaim：
Jim is not sick.
Claim：Jim is sick.

If we already
know Jim is
visiting hisunless
partner in the
since
(Warrant)：People
often go to the hospitalhospital, do we
go
when they are sick. still need to so
through all these
steps?!
Data：Jim
is going to the
hospital.
(Bachman & Palmer, 2010, p. 97)

What if he is seeing the
doctor himself as well?

Rebuttal：Jim could
be visiting someone
who is in the hospital.
Supports

Rebuttal Backing：
Jim is visiting his
partner in the hospital.
What if Jim is attending a meeting in the
hospital, not visiting anyone in particular?

Logical Problems of AUA－2
Claim: Malissa was
paid time and a half.
unless
Warrant: All
Employers who work
overtime must be paid
time and a half.

since
so

Rebuttal: Malissa is in an
exempt category.
Rejects
Rebuttal Backing:
Malissa’s personnel file
indicates that she is not in an
exempt category.

Backing: According to
US labor law ...
Data: Malissa
worked overtime.
(Bachman & Palmer, 2010, p. 98)

Can it still be called
Rebuttal Backing if it
rejects the Rebuttal?

The Toulmin Model (Toulmin 1958, 2003)
(properly worded qualifier)

Rational logic

Harry was born
in Bermuda

D

So,

理性推理：以一般情况下都可以
接受的假定性理由为前提，结论
应该具有合理性
(highly probable assumption)

Since

Unless

W

R

A man born in Bermuda will
generally be a British subject

presumably Harry
So, presumably,
is a British subject
不可省略：结论通常不是绝对
的，应该根据反驳的可能性选
用一个恰当的限定词限定声明
的语气强度或成立条件
(rare and exceptional conditions)

Both his parents were aliens/
He has become a naturalized American/
……

不容置疑：假定性理由应不证
自明，或已事先证明

On account of
(readily available facts or truth)

Q, C

B

可以忽略：例外不足以威胁声
明的整体合理性；
必须忽略：追究例外即为陷入
死循环

The following statutes and other legal provisions:
……

客观存在：事实性支
撑应可随时奉取，而
无需争辩

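Logical errors common to IA, ECD and AUA, and their root causes

Logical errors
Self-contradiction
• Claim first, justify afterwards: a claim is made, and then it is said that the claim may not hold.
• Justifying the rebuttal is stressed, yet when it comes to justifying the rebuttal, that effort has to be abandoned.
Lack of rationality
• A claim is forced through even though it is known that it may not hold (a hypothesis is put forward as a claim).
• The argument about the rebuttal gives no reasons and disregards further rebuttals (a conclusion is forced through once again).
Infinite loop
• Rebuttals cannot be exhausted, or even foreseen.
• The rebuttal of a claim's rebuttal is the claim itself.

Root causes
Structural changes
• Evidence for the rebuttal was added.
• The qualifier was deleted.
Misunderstanding of the model
• A hypothesis is called a claim (because a model without a claim cannot be called an argument model).
• The rebuttal is changed from a special exception that must be ignored into a negative interpretation that cannot be ignored (in order to remove doubts and objections).
• The process of logical reasoning that both sides of an argument should follow is mistaken for the process of dispute between the two sides.
Misuse of the rebuttal
• The rebuttal is used to justify the claim rather than to qualify it.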

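Toulmin's critique of the syllogism

Element | Example 1 | Example 2 (p. 115)
Minor Premise | Socrates is a man | Anne is one of Jack’s sisters
Major Premise | All men are mortal | All Jack’s sisters have red hair
Conclusion | Socrates is mortal | So, Anne has red hair

Example 2 in Toulmin's layout:
D: Anne is one of Jack’s sisters
So, Q, C: So, presumably, Anne now has red hair
Since W: Any sister of Jack’s may be taken to have red hair
On account of B: the fact that all his sisters have previously been observed to have red hair
Unless R: Anne has dyed / gone white / lost her hair…

The major premise is ambiguous: it can be either an assumption or a fact, so the syllogism cannot tell a genuine argument from a spurious one.
The conclusion is either yes or no and admits no exceptions, so the syllogism is of little use in everyday argument.
In Example 1 the major premise is an assumption and the conclusion is an inference about the future or the unknown, so it is open to argument.
In Example 2 the major premise is a fact and the conclusion merely repeats that fact rather than resulting from inference, so there is nothing to argue. If the fact is doubted, argument is unnecessary; presenting the fact is enough (for example, call Anne over and her hair colour speaks for itself).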

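An AUA example based on the Toulmin model
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.

As this shows, the qualifier is the only explicit difference between the Toulmin model and the syllogism. Without the qualifier, the Toulmin model itself becomes a syllogism, which is exactly what Toulmin criticised. In IA, ECD and AUA the qualifier has been deleted, and the so-called "claim" is in fact a hypothesis. The three models are therefore not really argument models, nor are they justification models, because even if "claim" is renamed "hypothesis", how the hypothesis is to be tested remains unknown.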

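Period: most recently proposed (2011)
Basic view: the hierarchical, progressive view
Validation method: the progressive argument method
Argument framework: the progressive argument model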

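Progressive validity and the progressive argument method

Definition
The degree to which test data reflect the target construct of the test.

The hierarchical, progressive view
• Validity is relative to the stages of testing. The output data of every stage, not just the post-test scores, should fully reflect the target construct of the test.
• The validity of the current stage is the hierarchical, progressive result of the validity of all preceding stages, and it affects the validity of all subsequent stages.
• Progression means that the validity of a stage can be no greater than that of the weakest preceding stage; if the validity of one stage is unacceptable, no subsequent stage has any validity to speak of.

Notes:
1. A progressive argument can start at any stage, provided there is reason to believe that the preceding stages are valid; otherwise a starting point could never be found.
2. Although validity is a matter of degree, a test is "valid" once it reaches an acceptable degree and "invalid" otherwise.
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of the data; otherwise it would be interpretation validity or use validity.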
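The propagation rule in the hierarchical, progressive view above can be illustrated with a short sketch. Everything numeric here is an assumption made for illustration: the lecture gives no scale, so the 0 to 1 scores, the `acceptable` threshold of 0.6, the stage order and the function name `progressive_validity` are all hypothetical. The sketch only shows the two stated rules: a stage is never more valid than its weakest predecessor, and once one stage is unacceptable every later stage has no validity.

```python
from typing import Dict, List, Tuple

# Stage names come from the slide's figure; their order here is an assumption.
STAGES = ["Construct", "Specification", "Task", "Response", "Score"]


def progressive_validity(own: Dict[str, float],
                         acceptable: float = 0.6) -> List[Tuple[str, float, bool]]:
    """Propagate validity forward through the stages.

    own[stage] is a hypothetical 0..1 judgement of how well that stage's
    output data reflect the target construct, taken on its own.
    Returns (stage, progressive validity, is_valid) for each stage.
    """
    results = []
    ceiling = 1.0     # validity of the weakest stage seen so far
    broken = False    # becomes True once any stage is unacceptable
    for stage in STAGES:
        if broken:
            value = 0.0   # later stages have no validity to speak of
        else:
            value = min(own[stage], ceiling)
            ceiling = value
            if value < acceptable:
                broken = True
        results.append((stage, value, value >= acceptable))
    return results


example = progressive_validity({
    "Construct": 0.9, "Specification": 0.8, "Task": 0.95,
    "Response": 0.5, "Score": 0.9,
})
for stage, value, valid in example:
    print(f"{stage:14s} {value:.2f} {'valid' if valid else 'invalid'}")
```

In this made-up example the Task stage is capped at 0.8 by the Specification stage before it, and the unacceptable Response stage leaves the Score stage with no validity at all, however well the scoring itself was done.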
[Figure: the stages of testing, divided into A Priori and A Posteriori. Processes: Dsgn/Invstg, Developing, Administrating, Scoring, Referencing, Using. Products: Purpose/Consq, Construct, Specification, Task, Response, Score, Criterion.]

The progressive argument method

Planning (from effect to cause): list the questions in detail and state the hypotheses explicitly. (Hypothesis)
Executing (from cause to effect): test the hypotheses one by one and reach rational conclusions. (Claim)

[Figure: one box per stage, running from Construct to Consequence, each holding a numbered list of questions (1., 2., 3., 4., …).]
Specification: Clarity, Specificity, Practicability, …
Task: Correctness, Representativeness, Sufficiency
Response: Relevancy, Authenticity, Interactiveness
Score: Reliability, Item Quality, Language level
Criterion: Comparability, Reference Value, Predictability
Consequence: Beneficence, Fairness, Ethics

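The progressive argument model: an organic integration of rational argument and scientific inquiry

• Base part: rational argument, which ensures that the model is still in essence an argument model (the design, execution and interpretation of statistical analysis all depend on logical reasoning).
• Extension: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone is suitable only when the data are simple and the reasons obvious).
• The model does not fall into an endless loop:
1. When the reasons are sufficient, no hypothesis test is needed (which avoids overuse).
2. One hypothesis test always yields a conclusion (misuse must be avoided).

[Figure: the progressive argument model. Labels: Hypothesis; Analysis (evidential? Y/N); (H0|H1); Data (p); Since Warrant (c = 1 - α), on account of Backing; Unless Rebuttal (α/β); So Qualifier (α); Claim.]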

Interpreting hypothesis test results

(a) Decision rule
D = p: probability
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)
p ≤ α? N: H0; Y: H1

(b) e.g. D = 0.8
C0: There is no significant difference
R = Type II error (β)

(c) e.g. D = 0.0
C1: The difference is significant
R = Type I error (α)

In both (b) and (c): W = 1 - α = 0.95 (confidence level); B = Empirical data (e.g. statistics); Q = at the significance level of α
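The mapping from a test result to the argument elements in panels (a) to (c) above can be written out as a small function. This is a sketch of the slide's labelling only; the function name `interpret_test` and the dictionary layout are assumptions made for illustration, and the sketch takes a p-value as given rather than computing one.

```python
def interpret_test(p: float, alpha: float = 0.05) -> dict:
    """Map one hypothesis-test result onto the slide's D/C/W/B/R/Q labels."""
    if not 0.0 <= p <= 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("p must be in [0, 1] and alpha in (0, 1)")
    significant = p <= alpha   # panel (a): p <= alpha selects H1, otherwise H0
    return {
        "D": p,                                   # data: the probability p
        "C": ("The difference is significant" if significant
              else "There is no significant difference"),
        "W": 1 - alpha,                           # warrant: confidence level
        "B": "empirical data (e.g. statistics)",  # backing
        # rebuttal: the error that remains possible for this conclusion
        "R": "Type I error (alpha)" if significant else "Type II error (beta)",
        "Q": f"at the significance level of {alpha}",  # qualifier
    }


print(interpret_test(0.8))   # panel (b): claim C0, rebuttal is a Type II error
print(interpret_test(0.0))   # panel (c): claim C1, rebuttal is a Type I error
```

Whichever way the comparison falls, the function returns a claim together with its qualifier and rebuttal, which is the sense in which a single test always yields a conclusion.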

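Justifying and falsifying

[Figure: two mirrored Toulmin structures built on the same Interpretation and Evidence. Justifying: (D), W, B, Q, R, C. Falsifying: (D'), W', B', Q', R', C’.]

Positive and negative interpretations: an interpretation favourable to the test is a positive interpretation; otherwise it is a negative interpretation.

Claim and counterclaim: a claim may be either a positive or a negative interpretation. In other words, a claim is not the same as a positive interpretation, nor is a counterclaim the same as a negative interpretation.

Evidence and counter-evidence: evidence only ever "supports" a claim and never "rejects" it. So-called counter-evidence is in fact evidence that supports the counterclaim.

Justifying and falsifying: falsification is in fact achieved indirectly, by justifying the counterclaim.

The content of the positive and negative interpretations is fixed: once the research question is set, so are the positive and negative interpretations.
The content of the claim and counterclaim is not fixed: the counterclaim depends on the claim, and without a claim there is no counterclaim. Claim and counterclaim appear only after the research results are available.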

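Iteration and recursion

[Figure: a tree of questions. Q (Final Claim) at the top; below it Q 1-1 (with Q 1-1-1, Q 1-1-2, ……) and Q 1-2 (with Q 1-2-1, Q 1-2-2, ……), down to the Initial Data. Planning: Q = Hypothesis. Executing: Q = Claim.]

Each question that is resolved requires one separate use of the argument model.

Iteration: questions at the same level are resolved one after another, which involves iterative use of the model: once one argument ends, the next begins. These are called sibling arguments (Sibling Argument).
Recursion: questions at different levels are resolved level by level, which involves recursive use of the model: another argument is started before the current one has ended. This is called a sub-argument (Sub-Argument); when the sub-argument ends, the process returns to the current argument.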
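The difference between sibling arguments and sub-arguments described above can be sketched as a traversal of the question tree. This is an illustration under assumptions: the `Question` class, the `resolve` function and the `argue` callback are hypothetical names, and `argue` merely stands in for one complete use of the argument model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Question:
    """A node in the question tree (Q, Q 1-1, Q 1-1-1, ...)."""
    label: str
    children: List["Question"] = field(default_factory=list)


def resolve(q: Question,
            argue: Callable[[Question, List[str]], str],
            log: List[str]) -> str:
    """Turn the hypothesis at q into a claim and record the order of resolution."""
    sub_claims = []
    for child in q.children:
        # The loop moves from one sibling argument to the next (iteration);
        # each call starts a sub-argument before q itself is finished (recursion).
        sub_claims.append(resolve(child, argue, log))
    claim = argue(q, sub_claims)   # one separate use of the model per question
    log.append(q.label)
    return claim


tree = Question("Q", [
    Question("Q 1-1", [Question("Q 1-1-1"), Question("Q 1-1-2")]),
    Question("Q 1-2", [Question("Q 1-2-1"), Question("Q 1-2-2")]),
])
order: List[str] = []
final = resolve(tree, lambda q, subs: f"claim({q.label})", order)
print(order)
print(final)
```

The recorded order shows that the lowest-level questions are settled first and the top question last, so the final claim is reached only after every sub-argument has returned.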

Application example: a progressive argument on option guessability

[Figure: Planning starts from the hypothesis (Q = Hypothesis) and Executing ends in the claim (Q = Claim). Labels: Guessability, Option, Factor, Review, Quality, Scale, Rating, Consistency, p; Hypothesis, Recursion, Evidential? (Y/N); Warrant-c, Backing-s, Rebuttal-α/β, Qualifier-α.]

α - significance level or Type I error
β - Type II error
p - probability
c - confidence level
s - statistical theory

Slide 6

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
Descriptive interpretations—no particular use specified

Evaluation via
scoring
procedure

Observation

Generalization
via reliability
studies

Observed
Score

Decision-based interpretations

Extrapolation to
nontest behavior;
Explanation in terms
of a model

Universal
Score

Semantic inferences—claims about what the test scores mean

(转引自McNamara & Roever, 2006: 25)

Relevance,
associated values,
consequences

Target
Score
(Inference)

Decision

Policy inferences— claims
about positive consequences
of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence

Principle

Claim

（Observation）

（Verification）

（Inference）

What the students
say or do

Statistical models;
Probability-based
reasoning

What the students
know or can do

Test construction,

Relevance of data,

Target knowledge,

administration,

value of observations

acquisition process,

scoring, and

as evidence

contextualized use

reporting

Evidence-based Validation (Weir 2005)
Test Taker
Characteristic
Context Validity
Setting: task
Demand: task
•
•
•
•
Setting: administration
•
•

Executive
Processes
•
•

monitoring

Theory-based Validity
Executive
Resources
•
•

A Priori Evidence
Response

A posteriori Evidence
However, is there not a problem if
we do not have a clear idea of
what we want to measure before
we construct and administer a test
to students?

Is there a problem with a
'suck-it-and-see' approach?

Scoring Validity
•
•

Scoring

Score/Grade

•
•

Score Interpretation

•
•
•
•
•
•

Criterion-related
Reliability

Consequential
Validity

Criterion-related
Validity
Score Value

Assessment Use Argument (AUA, Bachman & Palmer 2010)
[Figure: a chain linking Assessment Tasks, Test Taker's Performance, Assessment Record (score, description), Intended/Actual Interpretation(s), Intended/Actual Decision(s) and Intended/Actual Consequence(s). Each link carries Warrants and Rebuttals. The two directions of the chain are labelled INTERPRETATION AND USE and ASSESSMENT DEVELOPMENT.]

1. Claim: consequences are
• beneficial
2. Claim: decisions are
• values sensitive
• equitable
3. Claim: interpretations are
• meaningful
• impartial
• generalizable
• relevant
• sufficient
4. Claim: assessment records are
• consistent

Logical Structure of IA, ECD, AUA

IA, based on Kane (1990, 1992)
Chain of Inferences: observation -> Observed score -> (generalization) -> Universal score -> (extrapolation) -> Target score -> (interpretation, decision) -> Consequence
Base Argument (figure labels): Premise, Assumption, inference, Conclusion, Challenge, Most questionable assumption, Evidence (?), Validity

ECD, Mislevy et al. (2003: 15)
D, so C (Claim), since W (Warrant), on account of B, unless A (Alternative Explanation or Rival Hypothesis). R supports or weakens A.

AUA, Bachman (2005: 15)
Data, so Claim, since Warrant, on account of Backing, unless Rebuttal. Rebuttal Data (Rebuttal Backing) supports, weakens or rejects the Rebuttal.

Logical Problems of ECD

Data (D): Sue’s essay uses three incidents to illustrate Hamlet’s indecisiveness.
so
Claim (C): Sue can use specifics to illustrate a description of a fictional character.
since
Warrant (W): Students who know how to use writing techniques will do so in an assignment that calls for them.
on account of
Backing (B): The past three terms, students’ understandings of the use of techniques in in-depth interviews have corresponded with their performances in their essays.
unless
Alternative (A): The student has not actually produced the work.
Rebuttal (R), which supports A: Sue’s essay is very similar to the character description in the Cliff Notes guide to Hamlet.

Logical Problems of AUA-1

Data: Jim is going to the hospital.
so
Claim: Jim is sick.
since
(Warrant): People often go to the hospital when they are sick.
unless
Rebuttal: Jim could be visiting someone who is in the hospital.
Rebuttal Backing (supports the Rebuttal): Jim is visiting his partner in the hospital.
Counterclaim: Jim is not sick.
(Bachman & Palmer, 2010, p. 97)

If we already know Jim is visiting his partner in the hospital, do we still need to go through all these steps?!
What if he is seeing the doctor himself as well?
What if Jim is attending a meeting in the hospital, not visiting anyone in particular?

Logical Problems of AUA-2

Data: Malissa worked overtime.
so
Claim: Malissa was paid time and a half.
since
Warrant: All employees who work overtime must be paid time and a half.
Backing: According to US labor law ...
unless
Rebuttal: Malissa is in an exempt category.
Rebuttal Backing (rejects the Rebuttal): Malissa’s personnel file indicates that she is not in an exempt category.
(Bachman & Palmer, 2010, p. 98)

Can it still be called Rebuttal Backing if it rejects the Rebuttal?

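The Toulmin Model (Toulmin 1958, 2003)

Data (D): Harry was born in Bermuda.
So, Qualifier and Claim (Q, C): presumably, Harry is a British subject.
Since Warrant (W): A man born in Bermuda will generally be a British subject.
On account of Backing (B): The following statutes and other legal provisions: ……
Unless Rebuttal (R): Both his parents were aliens / He has become a naturalized American / ……

Rational logic: reasoning starts from a presumptive reason that is acceptable under ordinary circumstances, so the conclusion should be reasonable.
Qualifier (properly worded qualifier): cannot be omitted. The conclusion is usually not absolute, so a suitable qualifier should be chosen, according to how likely a rebuttal is, to limit the strength of the claim or the conditions under which it holds.
Warrant (highly probable assumption): beyond question. The presumptive reason should be self-evident or already established.
Rebuttal (rare and exceptional conditions): may be ignored, because exceptions are not enough to threaten the overall reasonableness of the claim; must be ignored, because pursuing exceptions leads into an endless loop.
Backing (readily available facts or truth): objectively exists. Factual support should be available on demand, with no need for argument.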
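
The six elements of the layout above can be sketched as a small data structure. This is an illustration only and not part of Toulmin's text: the class name `ToulminArgument` and the method `render` are hypothetical, and the backing string is a stand-in for the elided legal provisions. The sketch treats the qualifier as a required field, so the claim is never rendered as absolute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToulminArgument:
    """Hypothetical container for the six elements of a Toulmin layout."""
    data: str       # D: the fact the argument starts from
    claim: str      # C: the conclusion being drawn
    warrant: str    # W: presumptive reason linking D to C
    backing: str    # B: readily available facts behind W
    qualifier: str  # Q: strength of the claim, e.g. "presumably"
    rebuttal: str   # R: rare, exceptional conditions

    def render(self) -> str:
        # The qualifier sits in front of the claim, so the conclusion
        # is always stated with a limited strength.
        return (
            f"{self.data}. So, {self.qualifier}, {self.claim}, "
            f"since {self.warrant}, on account of {self.backing}, "
            f"unless {self.rebuttal}."
        )


harry = ToulminArgument(
    data="Harry was born in Bermuda",
    claim="Harry is a British subject",
    warrant="a man born in Bermuda will generally be a British subject",
    backing="the relevant statutes and other legal provisions",
    qualifier="presumably",
    rebuttal="both his parents were aliens or he has become a naturalized American",
)
print(harry.render())
```

Because the dataclass has no default for `qualifier`, constructing an argument without one raises a `TypeError`, which mirrors the point that the qualifier cannot be omitted.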

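Logical Errors Shared by IA, ECD and AUA and Their Root Causes

Logical errors
Self-contradiction
• Claiming first and arguing afterwards: a claim is made, and then it is said that the claim may not hold.
• Arguing the rebuttal is emphasized, yet when the rebuttal is argued, arguing it has to be abandoned.
Infinite loop
• Rebuttals are inexhaustible and may even be unforeseeable.
• The rebuttal of a claim's rebuttal is the claim itself.
Lack of rationality
• The claim is forced through even though it is known that it may not hold (a hypothesis is presented as a claim).
• The rebuttal is argued without giving reasons and without regard to its own rebuttals (a conclusion is forced once again).

Root causes
Structural modification
• Evidence for the rebuttal was added.
• The qualifier was deleted.
Misunderstanding of the model
• The hypothesis is called a claim (because a model without a claim cannot be called an argument model).
• The rebuttal is changed from a special exception that must be ignored into a counter-interpretation that cannot be ignored (in order to remove doubts and objections).
• The process of logical reasoning that both sides of an argument should follow is mistaken for the process of dispute between the two sides.
Misuse of rebuttal
• The rebuttal is used to argue for the claim rather than to qualify it.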

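Toulmin's Critique of the Syllogism

Element          Example 1            Example 2 (p. 115)
Minor Premise    Socrates is a man    Anne is one of Jack’s sisters
Major Premise    All men are mortal   All Jack’s sisters have red hair
Conclusion       Socrates is mortal   So, Anne has red hair

Example 2 in the Toulmin layout:
Data: Anne is one of Jack’s sisters.
So, presumably, Anne now has red hair.
Since: Any sister of Jack’s may be taken to have red hair.
On account of the fact that all his sisters have previously been observed to have red hair.
Unless: Anne has dyed / gone white / lost her hair…

The major premise is ambiguous: it may be an assumption or a fact, so the syllogism cannot tell genuine arguments from spurious ones.
The conclusion is either yes or no and admits no exceptions, so the syllogism is of little practical value in everyday argument.
Example 1 takes an assumption as its major premise, and its conclusion is an inference about the future or the unknown, so it is open to argument.
In Example 2 the major premise is a fact, and the conclusion merely repeats that fact rather than resulting from inference, so there is nothing to argue about. If the fact is doubted, argument is unnecessary; it is enough to produce the fact (for example, call Anne over and her hair colour will be plain to see).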

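AUA Example Based on the Toulmin Model
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
As this shows, the qualifier is the only explicit difference between the Toulmin model and the syllogism. Without the qualifier, the Toulmin model becomes a syllogism too, which is exactly what Toulmin criticized. In IA, ECD and AUA the qualifier has been deleted, and the so-called "claim" is in fact a hypothesis. The three models are therefore, in essence, not argument models, and not so-called justification models either: even if "claim" is renamed "hypothesis", how the hypothesis is to be tested remains unknown.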

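Period: most recently proposed (2011)
Basic view: hierarchical progression
Validation method: progressive argument
Argument framework: progressive argument model

Progressive Validity and the Progressive Argument Approach

Definition
The degree to which test data reflect the target construct of the test.

Hierarchical progression
• Validity is relative to the stages of testing. The resulting data of every stage, not only the post-test scores, should fully reflect the target construct of the test.
• The validity of the current stage is the result of the hierarchical accumulation of the validity of all preceding stages, and it affects the validity of all subsequent stages.
• Progression means that the validity of a stage is at most equal to that of its weakest preceding stage. If the validity of one stage is unacceptable, no subsequent stage has any validity to speak of.

Notes:
1. A progressive argument can start at any stage, provided there is reason to believe that the preceding stages are valid; otherwise a starting point can never be found.
2. Validity is a matter of degree, but once it reaches an acceptable degree the test is "valid"; otherwise it is "invalid".
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of data; otherwise it would be interpretation validity or use validity.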
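
The weakest-link rule stated above (a stage's validity is capped by its weakest preceding stage, and one unacceptable stage invalidates everything after it) can be sketched numerically. This is an illustration only: the slides give no numeric scale, so the 0 to 1 scores, the acceptance threshold of 0.6 and the function name `progressive_validity` are all assumptions.

```python
from typing import List, Tuple

ACCEPTABLE = 0.6  # assumed threshold; the slides do not specify one


def progressive_validity(
    stages: List[Tuple[str, float]],
) -> List[Tuple[str, float, bool]]:
    """Return (stage, capped validity, acceptable?) for each stage in order.

    Each stage's validity is capped by the weakest stage before it, and
    once one stage is unacceptable every later stage is unacceptable too.
    """
    results = []
    ceiling = 1.0   # weakest validity seen so far
    broken = False  # True once any stage falls below the threshold
    for name, own in stages:
        capped = min(own, ceiling)
        if capped < ACCEPTABLE:
            broken = True
        results.append((name, capped, not broken))
        ceiling = capped
    return results


# Hypothetical scores for five stages of a test.
chain = [
    ("Specification", 0.9),
    ("Task", 0.7),
    ("Response", 0.8),
    ("Score", 0.5),
    ("Criterion", 0.9),
]
for name, value, ok in progressive_validity(chain):
    print(f"{name:13s} {value:.1f} {'valid' if ok else 'invalid'}")
```

In this made-up chain, Response is capped at 0.7 by Task, and the unacceptable Score stage leaves Criterion invalid even though its own score is high.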
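[Figure: stages of testing and the data they produce. Recoverable labels: A Priori, A Posteriori; data: Construct, Specification, Task, Response, Score, Criterion, Purpose/Consq; processes: Dsgn/Invstg, Developing, Administrating, Scoring, Referencing, Using]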

The Progressive Argument Approach

Planning (from effect to cause, ending in a Hypothesis): list the questions in detail; state the hypotheses explicitly.
Executing (from cause to effect, ending in a Claim): test the hypotheses one by one; draw rational conclusions.

[Figure: questions listed for each stage; every list is open-ended]
Consequence: Beneficence, Fairness, Ethics, …
Criterion: Comparability, Reference Value, Predictability, …
Score: Reliability, Item Quality, Language level, …
Response: Relevancy, Authenticity, Interactiveness, …
Task: Correctness, Representativeness, Sufficiency, …
Specification: Clarity, Specificity, Practicability, …
Construct: …

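The Progressive Argument Model: An Organic Integration of Rational Argument and Scientific Inquiry

• Base part: rational argument, which ensures that the model remains in essence an argument model (the design, implementation and interpretation of statistical analysis all depend on logical reasoning).
• Extended part: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone suits only cases where the data are simple and clear and the reasons obvious).
• The model does not fall into an endless loop:
1. As long as the reasons are sufficient, no hypothesis test is needed (this avoids overuse).
2. A single hypothesis test always yields a conclusion (misuse must be avoided).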

[Figure: progressive argument model. A Hypothesis (H0|H1) passes to Analysis (evidential?), with branches N and Y, and then to a Toulmin layout: Data (p), So Qualifier (α), Claim; Since Warrant (c = 1-α), on account of Backing; Unless Rebuttal (α/β).]

Interpretation of hypothesis test results
(a) Decision rule: D = p (probability). Is p ≤ α? N: H0; Y: H1.
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)
(b) e.g. D = 0.8
C0: There is no significant difference
R = Type II error (β)
(c) e.g. D = 0.0
C1: The difference is significant
R = Type I error (α)

In both (b) and (c): W = 1-α = 0.95 (confidence level); B = Empirical data (e.g. statistics); Q = at the significance level of α
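
The mapping from a test result onto the Toulmin elements can be sketched as follows. This is an illustration under stated assumptions: the function name `interpret_test`, the dictionary keys and the default α of 0.05 are hypothetical, and the p-value is taken as already computed by some statistical test.

```python
from typing import Dict


def interpret_test(p: float, alpha: float = 0.05) -> Dict[str, str]:
    """Map a p-value onto the Toulmin elements used in panels (a) to (c)."""
    if not 0.0 <= p <= 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("p must lie in [0, 1] and alpha in (0, 1)")
    significant = p <= alpha  # decision rule from panel (a)
    return {
        "data": f"p = {p}",
        "claim": (
            "The difference is significant"        # C1
            if significant
            else "There is no significant difference"  # C0
        ),
        "warrant": f"confidence level 1 - alpha = {1 - alpha:.2f}",
        "backing": "empirical data (e.g. statistics)",
        # The rebuttal names the error that could overturn the claim.
        "rebuttal": (
            "Type I error (alpha)" if significant else "Type II error (beta)"
        ),
        "qualifier": f"at the significance level of {alpha}",
    }


for p in (0.8, 0.0):
    result = interpret_test(p)
    print(result["claim"], "|", result["rebuttal"])
```

Whichever branch is taken, the function returns a complete argument with a qualifier and a rebuttal, which corresponds to the point that one hypothesis test always produces a conclusion.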

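Justifying and Falsifying

[Figure: two mirrored Toulmin layouts for one Interpretation. Justifying: Evidence (D), W, B, Q, R, C. Falsifying: (D'), W', B', Q', R', C'.]

Positive and negative interpretations: an interpretation favourable to the test is a positive interpretation; otherwise it is a negative interpretation.
Claim and counterclaim: a claim may be either a positive or a negative interpretation. In other words, a claim is not the same as a positive interpretation, nor is a counterclaim the same as a negative interpretation.
Evidence and counter-evidence: evidence only "supports" a claim and never "rejects" it. So-called counter-evidence is in fact evidence that supports the counterclaim.
Justifying and falsifying: falsification is achieved indirectly, by justifying the counterclaim.

The content of the positive and negative interpretations is fixed: once the research question is set, so are the two interpretations. The content of the claim and counterclaim is not fixed: the counterclaim depends on the claim for its existence, and without a claim there is no counterclaim. The claim and counterclaim appear only after the research results are in.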

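Loop and Recursion

[Figure: question tree. The root Q is the Final Claim. Planning runs down the tree (Q = Hypothesis) through Q 1-1 (Q 1-1-1, Q 1-1-2, ……) and Q 1-2 (Q 1-2-1, Q 1-2-2, ……). Executing runs back up from the Initial Data (Q = Claim).]

Each question solved requires one separate use of the argument model.
Loop: questions at the same level are solved step by step, which involves looped use of the model. One argument ends and another then begins; this is called a sibling argument.
Recursion: questions at different levels are solved level by level, which involves recursive use of the model. Another argument is started before the current one has ended; this is called a sub-argument, and control returns to the current argument when the sub-argument ends.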
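
The difference between sibling arguments (loop) and sub-arguments (recursion) can be sketched with a small question tree. This is an illustration only: the class `Question`, the function `argue`, the root label "1" and the placeholder claim strings are hypothetical, and a real application would run one full argument where the placeholder is built.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    """Hypothetical node: a question (hypothesis) with optional sub-questions."""
    label: str
    subquestions: List["Question"] = field(default_factory=list)


def argue(question: Question, log: List[str]) -> str:
    """Resolve one question with one use of the argument model."""
    # Loop: sibling arguments run one after another at the same level.
    for sub in question.subquestions:
        # Recursion: a sub-argument starts before the current argument
        # has ended, and control returns here when it finishes.
        argue(sub, log)
    claim = f"claim for Q {question.label}"  # placeholder for the real argument
    log.append(claim)
    return claim


tree = Question("1", [
    Question("1-1", [Question("1-1-1"), Question("1-1-2")]),
    Question("1-2", [Question("1-2-1"), Question("1-2-2")]),
])
order: List[str] = []
final_claim = argue(tree, order)
print(order)
```

The log shows that leaf questions are settled first and the root last, so the hypothesis at the top becomes the final claim only after every sub-argument has returned.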

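Application Example: A Progressive Argument on Option Guessability

[Figure: Planning runs from the top question (Q = Hypothesis) and Executing returns to it (Q = Claim). Recoverable labels: Guessability, Option, Quality, Factor, Review, Scale, Rating, Consistency; Hypothesis, Evidential? (N: Recursion; Y: Toulmin layout with Data p, Warrant-c, Backing-s, Rebuttal-α/β, Qualifier-α).]

Legend:
α - significance level or Type I error
β - Type II error
p - probability
c - confidence level
s - statistics theory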

Slide 11

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
Descriptive interpretations—no particular use specified

Evaluation via
scoring
procedure

Observation

Generalization
via reliability
studies

Observed
Score

Decision-based interpretations

Extrapolation to
nontest behavior;
Explanation in terms
of a model

Universal
Score

Semantic inferences—claims about what the test scores mean

(转引自McNamara & Roever, 2006: 25)

Relevance,
associated values,
consequences

Target
Score
(Inference)

Decision

Policy inferences— claims
about positive consequences
of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence

Principle

Claim

（Observation）

（Verification）

（Inference）

What the students
say or do

Statistical models;
Probability-based
reasoning

What the students
know or can do

Test construction,

Relevance of data,

Target knowledge,

administration,

value of observations

acquisition process,

scoring, and

as evidence

contextualized use

reporting

Evidence-based Validation (Weir 2005)
Test Taker
Characteristic
Context Validity
Setting: task
Demand: task
•
•
•
•
Setting: administration
•
•

Executive
Processes
•
•

monitoring

Theory-based Validity
Executive
Resources
•
•

A Priori Evidence
Response

A posteriori Evidence
However, is there not a problem if
we do not have a clear idea of
what we want to measure before
we construct and administer a test
to students?

Is there a problem with a
'suck-it-and-see' approach?

Scoring Validity
•
•

Scoring

Score/Grade

•
•

Score Interpretation

•
•
•
•
•
•

Criterion-related
Reliability

Consequential
Validity

Criterion-related
Validity
Score Value

Assessment Use Argument (AUA,
Bachman & Palmer 2010)
Warrants and
Warrants
and
Rebuttals

Intended/Actual
Consequence(s)

Intended/Actual
Interpretation(s)
Assessment Record
(Score, description)
Test Taker’s Performance

Assessment Tasks

Rebuttals

Warrants and

INTERPRETATION AND USE

ASSESSMENT DEVELOPMENT

Intended/Actual
Decision(s)

1. Claim: consequences are

Warrants
and
Rebuttals
Rebuttals

1. Claim:
consequences are
• beneficial
• beneficial

2. Claim: decisions are
2. Claim:
decisions
are
• values
sensitive
• values
sensitive
• equitable
• equitable

3. Claim: interpretations are

3. Claim: interpretations are
• meaningful
• meaningful
• impartial
• impartial
• generalizable
• generalizable
• relevant
• relevant
• sufficient
• sufficient
Warrants
and and
Warrants
Rebuttals
Rebuttals

Warrants
and and
Warrants
Rebuttals
Rebuttals

4. Claim
assessment
recordsrecords
are
4. Claim:
assessment
are
• consistent
• consistent

TestPerform
Taker'sance
Performance
Assessment tasks

Assessment tasks

Logical Structure of IA, ECD, AUA
Chain of Inferences

Base Argument
inference

Consequence
Premise

decision
Target score

Conclusion

interpretation

extrapolation
Universal score

Assumption

Challenge

Evidence
(?)

Evidence
(?)

Validity

score

generalization
Observed score

Most questionable
assumption

observation
IA, based on Kane (1990, 1992)

unless

unless A

since

Warrant

W

on
account
of
B

Claim

Alternative Explanation or
Rival Hypothesis

C

support
weaken

so
D

R

ECD, Mislevy et al. (2003: 15)

on
account
of

Backing

Rebuttal

since

support
weaken
reject

so
Data

Rebuttal

AUA, Bachman (2005: 15)

Rebuttal Data
Rebuttal Backing

Logical Problems of ECD
Claim
C: Sue can use specifics
to illustrate a description
of a fictional character.

Warrant

Alternative

W: Students who know how
to use writing techniques
will do so in an assignment
that calls for them.
on
account
of

since

Backing

B: The past three terms,
students’ understandings of
the use of techniques in indepth interviews have
corresponded with their
performances in their essays.

unless

so

A: The student has not
actually produced the work.

Rebuttal

D: Sue’s essay uses three
incidents to illustrate
Hamlet’s indecisiveness.

supports

R: Sue’s essay is very
similar to the character
description in the Cliff
Notes guide to Hamlet.

Data

Logical Problems of AUA－1
Counterclaim：
Jim is not sick.
Claim：Jim is sick.

If we already
know Jim is
visiting hisunless
partner in the
since
(Warrant)：People
often go to the hospitalhospital, do we
go
when they are sick. still need to so
through all these
steps?!
Data：Jim
is going to the
hospital.
(Bachman & Palmer, 2010, p. 97)

What if he is seeing the
doctor himself as well?

Rebuttal：Jim could
be visiting someone
who is in the hospital.
Supports

Rebuttal Backing：
Jim is visiting his
partner in the hospital.
What if Jim is attending a meeting in the
hospital, not visiting anyone in particular?

Logical Problems of AUA－2
Claim: Malissa was
paid time and a half.
unless
Warrant: All
Employers who work
overtime must be paid
time and a half.

since
so

Rebuttal: Malissa is in an
exempt category.
Rejects
Rebuttal Backing:
Malissa’s personnel file
indicates that she is not in an
exempt category.

Backing: According to
US labor law ...
Data: Malissa
worked overtime.
(Bachman & Palmer, 2010, p. 98)

Can it still be called
Rebuttal Backing if it
rejects the Rebuttal?

The Toulmin Model (Toulmin 1958, 2003)
(properly worded qualifier)

Rational logic

Harry was born
in Bermuda

D

So,

理性推理：以一般情况下都可以
接受的假定性理由为前提，结论
应该具有合理性
(highly probable assumption)

Since

Unless

W

R

A man born in Bermuda will
generally be a British subject

presumably Harry
So, presumably,
is a British subject
不可省略：结论通常不是绝对
的，应该根据反驳的可能性选
用一个恰当的限定词限定声明
的语气强度或成立条件
(rare and exceptional conditions)

Both his parents were aliens/
He has become a naturalized American/
……

不容置疑：假定性理由应不证
自明，或已事先证明

On account of
(readily available facts or truth)

Q, C

B

可以忽略：例外不足以威胁声
明的整体合理性；
必须忽略：追究例外即为陷入
死循环

The following statutes and other legal provisions:
……

客观存在：事实性支
撑应可随时奉取，而
无需争辩

IA、ECD和AUA共同的逻辑错误
及其产生根源
逻辑错误


自相矛盾








先声明后论证，即先作出声明后又
说自己的声明不一定成立
强调论证反驳，但在论证反驳时又

不得不放弃论证反驳



结构修改



明知声明不一定成立，也要强行作
出声明（将假设作为声明提出）
对反驳的论证，既不讲理由也不顾
反驳（又一次强行做出结论）





无限循环



反驳不可穷尽，甚至不可预知
声明的反驳的反驳正是声明自身



增加了反驳的证据
删除了限定词

模型误解


不具理性




错误根源

将假设称为声明（因为没有声明的
模型就不能称为辩论模型）
将反驳由必须忽略的特殊例外替换
为不可忽略的反面解释（为了消除
质疑和异议）
将辩论双方都应该遵循的逻辑推理
过程误解为双方的争辩过程

反驳误用


用反驳来论证声明而不是限定声明

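Toulmin's critique of the syllogism

Example 1
Minor Premise: Socrates is a man
Major Premise: All men are mortal
Conclusion: Socrates is mortal

Example 2 (p. 115)
Minor Premise: Anne is one of Jack’s sisters
Major Premise: All Jack’s sisters have red hair
Conclusion: So, Anne has red hair

Example 2 in the Toulmin layout
D: Anne is one of Jack’s sisters
So, Q, C: So, presumably, Anne now has red hair
Since W: Any sister of Jack’s may be taken to have red hair
Unless R: Anne has dyed / gone white / lost her hair…
On account of B: the fact that all his sisters have previously been observed to have red hair

• The major premise is ambiguous: it can be either an assumption or a fact, so the syllogism cannot tell genuine arguments from spurious ones.
• The conclusion is either yes or no and admits no exceptions, so the syllogism is of little practical value in everyday argument.

Example 1 takes an assumption as its major premise, and its conclusion is an inference about the future or the unknown, so it is open to argument. In Example 2 the major premise is a fact, and the conclusion merely repeats that fact rather than resulting from inference, so there is nothing to argue about. If the fact is in doubt, argument is unnecessary; it is enough to produce the fact (for instance, call Anne over and her hair color speaks for itself).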

The AUA example recast in the Toulmin model
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
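
As this shows, the qualifier is the only explicit difference between the Toulmin model and the syllogism. Without the qualifier, the Toulmin model collapses into a syllogism, which is exactly what Toulmin criticized. In IA, ECD and AUA the qualifier has been removed, and the so-called "claim" is in fact a hypothesis. In substance, therefore, the three models are not argument models, nor are they the so-called justification models: even if the "claim" is relabeled as a hypothesis, how the hypothesis is to be tested remains unknown.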

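Main period: most recently proposed (2011)
Basic view: hierarchical progressive view
Validation method: progressive argument method
Argument framework: progressive argument model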

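Progressive validity and the progressive argument method

Definition
The degree to which the test data reflect the target construct of the test.

Hierarchical progressive view
• Validity is relative to the links of the testing process. The data resulting from every link, not only the post-test scores, should fully reflect the target construct of the test.
• The validity of the current link is the cumulative result, level by level, of the validity of all preceding links, and it affects the validity of all subsequent links.
• Progression means that the validity of a link can be no greater than that of the weakest preceding link; if the validity of one link is unacceptable, no subsequent link has any validity to speak of.

Notes:
1. A progressive argument can start at any link, provided there is reason to believe that the preceding links are valid; otherwise a starting point can never be found.
2. Although validity is a matter of degree, a test is "valid" as long as its validity reaches an acceptable degree, and "invalid" otherwise.
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of the data; otherwise it would be interpretation validity or use validity.

[Diagram: the testing process as a chain. Phases: A Priori, A Posteriori. Activities: Developing (Dsgn/Invstg), Administrating, Scoring, Referencing, Using. Links: Construct, Specification, Task, Response, Score, Criterion, Purpose/Consq.]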
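
The rule that a link can be no more valid than the weakest link before it can be sketched as a running minimum. This is an illustration only: the slides assign no numbers to links, so the scores, the acceptability threshold and the function name below are all assumptions, and the link order is read off the diagram labels.

```python
from typing import List, Tuple

# Hypothetical links in order, each with an illustrative score in [0, 1].
# These values are assumptions; the source gives no numeric validity scores.
LINKS: List[Tuple[str, float]] = [
    ("Construct", 0.95),
    ("Specification", 0.90),
    ("Task", 0.70),
    ("Response", 0.85),
    ("Score", 0.80),
]

ACCEPTABLE = 0.60  # assumed acceptability threshold, not taken from the source


def progressive_validity(
    links: List[Tuple[str, float]], threshold: float = ACCEPTABLE
) -> List[Tuple[str, float, bool]]:
    """Return (name, upper_bound, valid) for each link in order.

    upper_bound is the running minimum of the scores so far, so no link can
    exceed the weakest link before it. Once the bound drops below the
    threshold it stays there, so every later link is reported as not valid.
    """
    results = []
    bound = 1.0
    for name, score in links:
        bound = min(bound, score)
        results.append((name, bound, bound >= threshold))
    return results


for name, bound, valid in progressive_validity(LINKS):
    print(f"{name:<14} bound={bound:.2f} valid={valid}")
```

With these assumed scores, Response and Score are capped at the Task value even though their own scores are higher.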

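Progressive argument method

Planning (from effect to cause): list the questions in detail and state the hypotheses explicitly.
Executing (from cause to effect): test the hypotheses one by one and draw rational conclusions.

[Diagram: Planning and Executing run in opposite directions between Construct and Claim, with Hypothesis on the planning side; each link carries a numbered list of questions, 1. to 4.]
Consequence: Beneficence, Fairness, Ethics, …
Criterion: Comparability, Reference Value, Predictability, …
Score: Reliability, Item Quality, Language level, …
Response: Relevancy, Authenticity, Interactiveness, …
Task, Specification: Correctness, Representativeness, Sufficiency; Clarity, Specificity, Practicability, …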

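Progressive argument model: an organic integration of rational argument and scientific investigation

• Basic part: rational argument, which ensures that the model remains an argument model in essence (designing, carrying out and interpreting a statistical analysis all depend on logical reasoning).
• Extended part: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone suits only cases where the data are simple and clear and the reasons are obvious).
• The model does not fall into an endless loop:
1. If the reasons are sufficient, no hypothesis test is needed (this avoids overuse).
2. A single hypothesis test necessarily yields a conclusion (misuse must be avoided).

[Diagram: Hypothesis, then Analysis (evidential?) with branches Y and N, (H0|H1), feeding a Toulmin layout annotated for hypothesis testing: Data (p); Claim; Since Warrant (c=1-α); on account of Backing; Unless Rebuttal (α/β); So Qualifier (α).]

Interpreting hypothesis test results

(a) Decision rule: D=p (probability). Is p≤α? N: H0. Y: H1.
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)

(b) e.g. D=0.8
C0: There is no significant difference
R=Type II error (β)

(c) e.g. D=0.0
C1: The difference is significant
R=Type I error (α)

In both (b) and (c): W=1-α=0.95 (confidence level); B=Empirical data (e.g. statistics)
Q=at the significance level of α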
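
The mapping from a test result to the Toulmin elements in panels (a) to (c) can be written out as a small function. This is a sketch under stated assumptions: the function name and dictionary keys are hypothetical, and only the decision rule p ≤ α and the element labels come from the slide.

```python
def interpret_test(p: float, alpha: float = 0.05) -> dict:
    """Map a p-value onto the Toulmin elements listed on the slide.

    Function name and dictionary keys are illustrative assumptions.
    """
    if not 0.0 <= p <= 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("p must be in [0, 1] and alpha in (0, 1)")
    reject_h0 = p <= alpha  # decision rule from panel (a): p <= alpha?
    return {
        "data": p,  # D = p
        "claim": ("The difference is significant." if reject_h0
                  else "There is no significant difference."),  # C1 or C0
        "warrant": 1 - alpha,  # W = 1 - alpha (confidence level)
        "backing": "empirical data (e.g. statistics)",  # B
        # R: the error that could overturn the claim
        "rebuttal": "Type I error (alpha)" if reject_h0 else "Type II error (beta)",
        "qualifier": f"at the significance level of {alpha}",  # Q
    }


# Panels (b) and (c) from the slide: D = 0.8 and D = 0.0.
for p_value in (0.8, 0.0):
    result = interpret_test(p_value)
    print(p_value, "->", result["claim"], "unless", result["rebuttal"])
```

Whatever the p-value, the function returns exactly one claim with its qualifier, which is the sense in which a single test always ends the argument.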

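Justification and falsification

• Positive and negative interpretations: an interpretation favorable to the test is a positive interpretation; otherwise it is a negative interpretation.
• Claim and counterclaim: a claim can be either a positive or a negative interpretation. In other words, a claim is not the same as a positive interpretation, nor is a counterclaim the same as a negative interpretation.
• Evidence and counter-evidence: evidence only "supports" a claim and never "rejects" it. So-called counter-evidence is in fact evidence that supports the counterclaim.
• Justification and falsification: falsification is achieved indirectly, by justifying the counterclaim.

[Diagram: an Interpretation argued in two directions from Evidence. Justifying: (D), W, B, R, Q, C. Falsifying: (D'), W', B', R', Q', C’.]

The content of the positive and negative interpretations is fixed: once the research question is set, both follow from it. The content of the claim and counterclaim is not fixed: the counterclaim depends on the claim for its existence, and without a claim there is no counterclaim. Claim and counterclaim appear only after the research results are in.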

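Loop and recursion

Each question to be resolved requires one separate use of the argument model.

[Diagram: question tree between Q (Final Claim) at the top and the Initial Data at the bottom. Planning: Q = Hypothesis. Executing: Q = Claim.]
Q
  Q 1-1
    Q 1-1-1
    Q 1-1-2
    ……
  Q 1-2
    Q 1-2-1
    Q 1-2-2
    ……
  ……

• Loop: questions at the same level are resolved step by step. The model is used in a loop: one argument ends and the next begins. These are called sibling arguments (Sibling Argument).
• Recursion: questions at different levels are resolved level by level. The model is used recursively: another argument is started before the current one has ended. This is called a sub-argument (Sub-Argument); when the sub-argument ends, control returns to the current argument.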
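
The difference between sibling arguments (loop) and sub-arguments (recursion) can be traced with a short traversal of the question tree. This is a minimal sketch: the Question class, the argue function and the log format are hypothetical names introduced for illustration, and the body of each argument is reduced to open and close events.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    """A node in the planning tree, e.g. Q 1-1 with children Q 1-1-1 and Q 1-1-2."""
    label: str
    children: List["Question"] = field(default_factory=list)


def argue(question: Question, log: List[str], depth: int = 0) -> None:
    """Resolve one question with one use of the argument model.

    The for loop moves from one sibling argument to the next; the recursive
    call starts a sub-argument before the current argument has ended.
    """
    log.append(f"{'  ' * depth}open  {question.label}")
    for child in question.children:      # loop: sibling arguments
        argue(child, log, depth + 1)     # recursion: sub-argument
    log.append(f"{'  ' * depth}close {question.label}")


# The tree from the slide, without the elided branches.
tree = Question("Q", [
    Question("Q 1-1", [Question("Q 1-1-1"), Question("Q 1-1-2")]),
    Question("Q 1-2", [Question("Q 1-2-1"), Question("Q 1-2-2")]),
])
trace: List[str] = []
argue(tree, trace)
print("\n".join(trace))
```

In the printed trace, each parent closes only after all of its children have closed, and siblings open strictly one after another.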

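Application example: a progressive argument on the guessability of options

[Diagram: Planning (Q = Hypothesis) and Executing (Q = Claim) for the question of Guessability, with the labels Factor, Review, Quality, Scale, Rating, Option, Consistency, p, Hypothesis, Recursion and the check "Evidential?" (Y/N). Toulmin elements are tagged Warrant-c, Backing-s, Rebuttal-α/β, Qualifier-α.]

Legend:
α - significance level or Type I error
β - Type II error
p - probability
c - confidence level
s - statistics theory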

Slide 14

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
Descriptive interpretations—no particular use specified

Evaluation via
scoring
procedure

Observation

Generalization
via reliability
studies

Observed
Score

Decision-based interpretations

Extrapolation to
nontest behavior;
Explanation in terms
of a model

Universal
Score

Semantic inferences—claims about what the test scores mean

(转引自McNamara & Roever, 2006: 25)

Relevance,
associated values,
consequences

Target
Score
(Inference)

Decision

Policy inferences— claims
about positive consequences
of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence

Principle

Claim

（Observation）

（Verification）

（Inference）

What the students
say or do

Statistical models;
Probability-based
reasoning

What the students
know or can do

Test construction,

Relevance of data,

Target knowledge,

administration,

value of observations

acquisition process,

scoring, and

as evidence

contextualized use

reporting

Evidence-based Validation (Weir 2005)
Test Taker
Characteristic
Context Validity
Setting: task
Demand: task
•
•
•
•
Setting: administration
•
•

Executive
Processes
•
•

monitoring

Theory-based Validity
Executive
Resources
•
•

A Priori Evidence
Response

A posteriori Evidence
However, is there not a problem if
we do not have a clear idea of
what we want to measure before
we construct and administer a test
to students?

Is there a problem with a
'suck-it-and-see' approach?

Scoring Validity
•
•

Scoring

Score/Grade

•
•

Score Interpretation

•
•
•
•
•
•

Criterion-related
Reliability

Consequential
Validity

Criterion-related
Validity
Score Value

Assessment Use Argument (AUA,
Bachman & Palmer 2010)
Warrants and
Warrants
and
Rebuttals

Intended/Actual
Consequence(s)

Intended/Actual
Interpretation(s)
Assessment Record
(Score, description)
Test Taker’s Performance

Assessment Tasks

Rebuttals

Warrants and

INTERPRETATION AND USE

ASSESSMENT DEVELOPMENT

Intended/Actual
Decision(s)

1. Claim: consequences are

Warrants
and
Rebuttals
Rebuttals

1. Claim:
consequences are
• beneficial
• beneficial

2. Claim: decisions are
2. Claim:
decisions
are
• values
sensitive
• values
sensitive
• equitable
• equitable

3. Claim: interpretations are

3. Claim: interpretations are
• meaningful
• meaningful
• impartial
• impartial
• generalizable
• generalizable
• relevant
• relevant
• sufficient
• sufficient
Warrants
and and
Warrants
Rebuttals
Rebuttals

Warrants
and and
Warrants
Rebuttals
Rebuttals

4. Claim
assessment
recordsrecords
are
4. Claim:
assessment
are
• consistent
• consistent

TestPerform
Taker'sance
Performance
Assessment tasks

Assessment tasks

Logical Structure of IA, ECD, AUA
[Figure: three diagrams.
IA, based on Kane (1990, 1992): a Chain of Inferences running observation, Observed score, generalization, Universal score, extrapolation, Target score, decision. The Base Argument runs from Premise through inference to Conclusion. Other labels: interpretation, Consequence, Assumption, Most questionable assumption, Challenge, Evidence (?), Validity.
ECD, Mislevy et al. (2003: 15): Data (D), so Claim (C), since Warrant (W), on account of Backing (B), unless Alternative Explanation or Rival Hypothesis (A), which Rebuttal data (R) support or weaken.
AUA, Bachman (2005: 15): Data, so Claim, since Warrant, on account of Backing, unless Rebuttal, which Rebuttal Data (Rebuttal Backing) support, weaken or reject.]

Logical Problems of ECD
Data (D): Sue’s essay uses three incidents to illustrate Hamlet’s indecisiveness.
Claim (C), "so": Sue can use specifics to illustrate a description of a fictional character.
Warrant (W), "since": Students who know how to use writing techniques will do so in an assignment that calls for them.
Backing (B), "on account of": The past three terms, students’ understandings of the use of techniques in in-depth interviews have corresponded with their performances in their essays.
Alternative (A), "unless": The student has not actually produced the work.
Rebuttal (R), "supports" the Alternative: Sue’s essay is very similar to the character description in the Cliff Notes guide to Hamlet.

Logical Problems of AUA-1
Data: Jim is going to the hospital.
Claim, "so": Jim is sick.
Counterclaim: Jim is not sick.
Warrant, "since": People often go to the hospital when they are sick.
Rebuttal, "unless": Jim could be visiting someone who is in the hospital.
Rebuttal Backing, "supports" the Rebuttal: Jim is visiting his partner in the hospital.
(Bachman & Palmer, 2010, p. 97)
Questions raised on the slide:
If we already know Jim is visiting his partner in the hospital, do we still need to go through all these steps?!
What if he is seeing the doctor himself as well?
What if Jim is attending a meeting in the hospital, not visiting anyone in particular?

Logical Problems of AUA-2
Data: Malissa worked overtime.
Claim, "so": Malissa was paid time and a half.
Warrant, "since": All employees who work overtime must be paid time and a half.
Backing: According to US labor law ...
Rebuttal, "unless": Malissa is in an exempt category.
Rebuttal Backing, "rejects" the Rebuttal: Malissa’s personnel file indicates that she is not in an exempt category.
(Bachman & Palmer, 2010, p. 98)
Question raised on the slide:
Can it still be called Rebuttal Backing if it rejects the Rebuttal?

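The Toulmin Model (Toulmin 1958, 2003)
Rational logic
Rational reasoning: the premise is a presumptive reason that is acceptable under normal circumstances, so the conclusion should be reasonable.

Data (D): Harry was born in Bermuda.
Qualifier and Claim (Q, C): So, presumably, Harry is a British subject.
(properly worded qualifier) Not to be omitted: a conclusion is usually not absolute, so a suitable qualifier should be chosen, according to how likely a rebuttal is, to limit the strength of the claim or the conditions under which it holds.
Warrant (W), "Since": A man born in Bermuda will generally be a British subject.
(highly probable assumption) Beyond question: the presumptive reason should be self-evident or already proven.
Rebuttal (R), "Unless": Both his parents were aliens / He has become a naturalized American / ……
(rare and exceptional conditions) May be ignored: exceptions are not enough to threaten the overall reasonableness of the claim. Must be ignored: pursuing the exceptions leads into an endless loop.
Backing (B), "On account of": The following statutes and other legal provisions: ……
(readily available facts or truth) Objectively existing: factual support should be available on demand, with no need for argument.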
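
The six elements of the Toulmin layout can be held in a small record. The Python sketch below is only an illustration: the class name `ToulminArgument` and its `render` method are hypothetical and appear neither in Toulmin's text nor in the slides. The example values are the Harry case, with the backing paraphrased because the slide elides the actual statutes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToulminArgument:
    """Hypothetical container for the six elements of a Toulmin layout."""
    data: str        # D: the facts appealed to
    claim: str       # C: the conclusion being advanced
    qualifier: str   # Q: limits the force of the claim
    warrant: str     # W: presumptive reason licensing the step from D to C
    backing: str     # B: readily available facts behind the warrant
    rebuttals: List[str] = field(default_factory=list)  # R: rare exceptions

    def render(self) -> str:
        """Spell the argument out in the D / So Q, C / Since W order."""
        if not self.qualifier:
            # Without a qualifier the layout collapses into a syllogism.
            raise ValueError("a Toulmin claim needs a qualifier")
        lines = [
            f"{self.data}.",
            f"So, {self.qualifier}, {self.claim}.",
            f"Since {self.warrant}.",
            f"On account of {self.backing}.",
        ]
        if self.rebuttals:
            lines.append("Unless " + " / ".join(self.rebuttals) + ".")
        return "\n".join(lines)


harry = ToulminArgument(
    data="Harry was born in Bermuda",
    claim="Harry is a British subject",
    qualifier="presumably",
    warrant="a man born in Bermuda will generally be a British subject",
    backing="the following statutes and other legal provisions",
    rebuttals=["both his parents were aliens",
               "he has become a naturalized American"],
)
print(harry.render())
```

Refusing to render a claim with an empty qualifier is a design choice of this sketch. It mirrors the point that the qualifier must not be omitted.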

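Logical Errors Shared by IA, ECD and AUA, and Their Root Causes
Logical errors
• Self-contradiction
    • The claim is made first and argued afterwards: having made the claim, the arguer then says that it may not hold.
    • Arguing the rebuttal is stressed, yet when the rebuttal is actually argued, that argument has to be abandoned.
• Lack of rationality
    • A claim is forced through even though it is known that it may not hold (a hypothesis is put forward as a claim).
    • The argument about the rebuttal gives no reasons and takes no account of rebuttals (a conclusion is forced once again).
• Infinite loop
    • Rebuttals cannot be exhausted, or even foreseen.
    • The rebuttal of the rebuttal of a claim is the claim itself.
Root causes
• Structural modification
    • Evidence for the rebuttal was added.
    • The qualifier was deleted.
• Misunderstanding of the model
    • A hypothesis is called a claim (because a model without a claim could not be called an argument model).
    • The rebuttal, a special exception that must be ignored, is replaced by a negative interpretation that cannot be ignored (in order to remove doubts and objections).
    • The process of logical reasoning that both sides of an argument should follow is mistaken for the dispute between the two sides.
• Misuse of the rebuttal
    • The rebuttal is used to argue for the claim rather than to qualify it.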

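Toulmin's Critique of the Syllogism
Element: Example 1; Example 2 (p. 115)
Minor Premise: Socrates is a man; Anne is one of Jack’s sisters
Major Premise: All men are mortal; All Jack’s sisters have red hair
Conclusion: Socrates is mortal; So, Anne has red hair

Example 2 in Toulmin's layout:
Data: Anne is one of Jack’s sisters.
So, presumably, Anne now has red hair.
Since: Any sister of Jack’s may be taken to have red hair.
Unless: Anne has dyed / gone white / lost her hair…
On account of the fact that: All his sisters have previously been observed to have red hair.

• The major premise is ambiguous: it may be an assumption or a fact, so the syllogism cannot tell genuine arguments from spurious ones.
• The conclusion is either yes or no and admits no exceptions, so the syllogism is of little practical value in everyday argument.
Example 1 takes an assumption as its major premise, and its conclusion is an inference about the future or the unknown, so it is open to argument.
Example 2 takes a fact as its major premise, and its conclusion merely repeats that fact rather than resulting from inference, so there is nothing to argue about. If the fact is doubted, there is no need to argue: presenting the fact is enough (for example, call Anne over and the color of her hair is plain to see).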

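The AUA Example Recast in the Toulmin Model
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
As this shows, the qualifier is the only explicit difference between the Toulmin model and the syllogism. Without the qualifier, the Toulmin model itself becomes a syllogism, which is exactly what Toulmin criticized. In IA, ECD and AUA the qualifier has been deleted, and the so-called "claim" is in fact a hypothesis. The three models are therefore, in essence, not argument models. Nor are they justification models, because even if the "claim" is relabeled as a hypothesis, how that hypothesis is to be tested remains unknown.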

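Main period: most recently proposed (2011)
Basic view: hierarchical progression
Validation method: progressive argument approach
Argument framework: progressive argument model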

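Progressive Validity and the Progressive Argument Approach

Definition
• The degree to which test data reflect the target construct of the test.

Hierarchical-progressive view
• Validity is relative to the stages of testing. The outcome data of every stage, not only the post-test scores, should adequately reflect the target construct of the test.
• The validity of the current stage is the hierarchical, progressive result of the validity of all preceding stages, and it affects the validity of all subsequent stages.
• Progression means that the validity of a stage can be no greater than that of the weakest preceding stage. If the validity of one stage is unacceptable, no subsequent stage has any validity to speak of.

Notes:
1. A progressive argument may start at any stage, as long as there is reason to believe that the preceding stages are valid. Otherwise a starting point could never be found.
2. Although validity is a matter of degree, a test is "valid" once validity reaches an acceptable degree, and "invalid" otherwise.
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of the data. Otherwise it would be interpretation validity or use validity.

[Figure: the stages of testing and the data each stage produces, grouped under A Priori and A Posteriori. Labels: Dsgn/Invstg; Developing; Administrating; Scoring; Referencing; Using; Construct; Specification; Task; Response; Score; Criterion; Purpose/Consq.]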
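
A minimal sketch of the weakest-link rule in the hierarchical-progressive view, assuming purely for illustration that the validity of each stage can be scored on a 0 to 1 scale and that 0.6 marks the acceptable degree. The stage names follow the figure labels. The scores, the threshold and the function name `progressive_validity` are hypothetical.

```python
from typing import Dict, List, Tuple

ACCEPTABLE = 0.6  # hypothetical cut-off for an "acceptable degree" of validity


def progressive_validity(
    stages: List[Tuple[str, float]], threshold: float = ACCEPTABLE
) -> Dict[str, float]:
    """Cap each stage's validity at the weakest stage seen so far.

    Once a stage falls below the threshold, every later stage is
    reported as 0.0, since no subsequent stage can then be valid.
    """
    result: Dict[str, float] = {}
    ceiling = 1.0    # running minimum over this and all preceding stages
    broken = False   # True once some stage is unacceptable
    for name, own_score in stages:
        if broken:
            result[name] = 0.0
            continue
        ceiling = min(ceiling, own_score)
        result[name] = ceiling
        if ceiling < threshold:
            broken = True
    return result


# Illustrative scores only; they are not taken from any study.
chain = [("Construct", 0.9), ("Specification", 0.8), ("Task", 0.85),
         ("Response", 0.5), ("Score", 0.95)]
print(progressive_validity(chain))
```

In this run the Task stage is capped at 0.8 by the weaker Specification stage, and the unacceptable Response stage leaves the Score stage with no validity, however reliable the scoring itself.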

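The Progressive Argument Approach
Planning, from effect to cause: list the questions in detail and state the hypotheses explicitly.
Executing, from cause to effect: test the hypotheses one by one and draw rational conclusions.
[Figure: a numbered question list attached to each stage, with Hypothesis on the Planning side and Claim on the Executing side. Labels recoverable from the slide:
Consequence: Beneficence, Fairness, Ethics, …
Criterion: Comparability, Reference Value, Predictability, …
Score: Reliability, Item Quality, Language level, …
Response: Relevancy, Authenticity, Interactiveness, …
Task: Correctness, Representativeness, Sufficiency, …
Specification: Clarity, Specificity, Practicability, …
Construct]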

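The Progressive Argument Model: An Organic Integration of Rational Argument and Scientific Inquiry

• Base part: rational argument, which ensures that the model remains in essence an argument model (the design, execution and interpretation of statistical analysis all depend on logical reasoning).
• Extension: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone suits only cases where the data are simple and the reasons obvious).
The model does not fall into an endless loop:
• As long as the reasons are sufficient, no hypothesis test is needed (this avoids overuse).
• A single hypothesis test necessarily yields a conclusion (misuse must be avoided).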

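[Figure: the progressive argument model. A Hypothesis goes to Analysis, which asks "evidential?". Branch N leads to a hypothesis test (H0|H1), whose result is the Data (p). Branch Y leads to the Data directly. From the Data: So, Qualifier (α), Claim; Since Warrant (c = 1-α), on account of Backing; Unless Rebuttal (α/β).]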

Interpreting Hypothesis-Test Results
(a) Decision rule: D = p (probability). Is p ≤ α? If N, retain H0; if Y, accept H1.
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)
(b) e.g. D = 0.8, so C0: There is no significant difference. R = Type II error (β)
(c) e.g. D = 0.0, so C1: The difference is significant. R = Type I error (α)
In both (b) and (c): W = 1-α = 0.95 (confidence level); B = empirical data (e.g. statistics); Q = at the significance level of α.
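
The mapping from a test result to the elements of the argument can be written as a small function. This is a sketch only: the function name `interpret_p_value` and the dictionary keys are hypothetical, while the wording of each element follows the legend for D, C0/C1, W, B, R and Q.

```python
from typing import Dict


def interpret_p_value(p: float, alpha: float = 0.05) -> Dict[str, str]:
    """Map a hypothesis-test result onto the elements of the argument model.

    D is the p-value; the claim is C1 (H1) when p <= alpha, else C0 (H0).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    significant = p <= alpha
    return {
        "data": f"p = {p}",
        "claim": ("The difference is significant" if significant
                  else "There is no significant difference"),
        "qualifier": f"at the significance level of {alpha}",
        "warrant": f"confidence level 1 - alpha = {1 - alpha:.2f}",
        "backing": "empirical data (e.g. statistics)",
        # Rejecting H0 risks a Type I error; retaining it risks a Type II error.
        "rebuttal": ("Type I error (alpha)" if significant
                     else "Type II error (beta)"),
    }


for p in (0.8, 0.0):
    arg = interpret_p_value(p)
    print(f"{arg['data']}: so, {arg['qualifier']}, {arg['claim']}; "
          f"unless {arg['rebuttal']}")
```

Because the rebuttal is fixed by the outcome of the test (α or β), one test always ends in one qualified claim, which is why the procedure cannot loop.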

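Justifying and Falsifying

[Figure: an Interpretation is handled along two branches. Justifying: Evidence (D), W, B, R, Q, so C. Falsifying: Evidence (D'), W', B', R', Q', so C'.]

Positive and negative interpretations: an interpretation favorable to the test is a positive interpretation; otherwise it is a negative interpretation.

Claim and counterclaim: a claim may be either a positive or a negative interpretation. In other words, a claim is not the same thing as a positive interpretation, nor is a counterclaim the same thing as a negative interpretation.

Evidence and counter-evidence: evidence only "supports" a claim and never "rejects" it. So-called counter-evidence is in fact evidence that supports the counterclaim.

Justifying and falsifying: falsification is achieved indirectly, by justifying the counterclaim.

The content of the positive and negative interpretations is fixed: once the research question is set, both are determined. The content of the claim and the counterclaim is not fixed: the counterclaim depends on the claim for its existence, and without a claim there is no counterclaim. Claim and counterclaim appear only after the research results are in.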

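Loop and Recursion
Each question that is solved requires one separate use of the argument model.
[Figure: a question tree. During Planning each Q is a Hypothesis; during Executing each Q becomes a Claim. Executing starts from the Initial Data and ends in the Final Claim.
Q
    Q 1-1
        Q 1-1-1
        Q 1-1-2
        ……
    Q 1-2
        Q 1-2-1
        Q 1-2-2
        ……
    ……]
Loop: questions at the same level are solved one after another, so the model is used iteratively. When one argument ends, the next one begins. These are called sibling arguments.
Recursion: questions at different levels are solved level by level, so the model is used recursively. Another argument is started before the current one has finished. This is called a sub-argument, and when it ends, control returns to the current argument.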
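
The difference between sibling arguments (loop) and sub-arguments (recursion) can be sketched as a depth-first walk over the question tree. The `Question` class, the `argue` function and the placeholder leaf verdicts are all hypothetical; in practice each leaf would be settled by one run of the argument model. Combining the sub-verdicts with a logical AND is likewise an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Question:
    """A question (hypothesis) in the plan; leaves carry a verdict."""
    label: str
    children: List["Question"] = field(default_factory=list)
    leaf_verdict: Optional[bool] = None  # placeholder for one argument's result


def argue(q: Question, trace: List[str]) -> bool:
    """Turn a hypothesis into a claim, settling sub-questions first."""
    trace.append(f"start {q.label}")
    if q.children:
        verdict = True
        # Loop: sibling arguments run one after another at the same level.
        for child in q.children:
            # Recursion: a sub-argument starts before this one has finished.
            if not argue(child, trace):
                verdict = False  # assumed rule: all sub-claims must hold
    else:
        verdict = bool(q.leaf_verdict)
    trace.append(f"end {q.label}")
    return verdict


plan = Question("Q", [
    Question("Q 1-1", [Question("Q 1-1-1", leaf_verdict=True),
                       Question("Q 1-1-2", leaf_verdict=True)]),
    Question("Q 1-2", [Question("Q 1-2-1", leaf_verdict=True),
                       Question("Q 1-2-2", leaf_verdict=False)]),
])
trace: List[str] = []
final_claim = argue(plan, trace)
print(final_claim)
print(trace)
```

The trace shows each sub-argument opening and closing inside its parent, while siblings never overlap.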

Application Example: A Progressive Argument on Option Guessability
[Figure: the progressive argument model applied to option guessability. Labels: Planning (Q = Hypothesis); Executing (Q = Claim); Hypothesis; Recursion; Evidential? (Y/N); Option; Factor; Review; Quality; Scale; Rating; Consistency; p; Guessability; Warrant-c; Backing-s; Rebuttal-α/β; Qualifier-α.]
Legend:
α - significance level or Type I error
β - Type II error
p - probability
c - confidence level (1-α)
s - statistics theory

Slide 19

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
Descriptive interpretations—no particular use specified

Evaluation via
scoring
procedure

Observation

Generalization
via reliability
studies

Observed
Score

Decision-based interpretations

Extrapolation to
nontest behavior;
Explanation in terms
of a model

Universal
Score

Semantic inferences—claims about what the test scores mean

(转引自McNamara & Roever, 2006: 25)

Relevance,
associated values,
consequences

Target
Score
(Inference)

Decision

Policy inferences— claims
about positive consequences
of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence

Principle

Claim

（Observation）

（Verification）

（Inference）

What the students
say or do

Statistical models;
Probability-based
reasoning

What the students
know or can do

Test construction,

Relevance of data,

Target knowledge,

administration,

value of observations

acquisition process,

scoring, and

as evidence

contextualized use

reporting

Evidence-based Validation (Weir 2005)
Test Taker
Characteristic
Context Validity
Setting: task
Demand: task
•
•
•
•
Setting: administration
•
•

Executive
Processes
•
•

monitoring

Theory-based Validity
Executive
Resources
•
•

A Priori Evidence
Response

A posteriori Evidence
However, is there not a problem if
we do not have a clear idea of
what we want to measure before
we construct and administer a test
to students?

Is there a problem with a
'suck-it-and-see' approach?

Scoring Validity
•
•

Scoring

Score/Grade

•
•

Score Interpretation

•
•
•
•
•
•

Criterion-related
Reliability

Consequential
Validity

Criterion-related
Validity
Score Value

Assessment Use Argument (AUA,
Bachman & Palmer 2010)
Warrants and
Warrants
and
Rebuttals

Intended/Actual
Consequence(s)

Intended/Actual
Interpretation(s)
Assessment Record
(Score, description)
Test Taker’s Performance

Assessment Tasks

Rebuttals

Warrants and

INTERPRETATION AND USE

ASSESSMENT DEVELOPMENT

Intended/Actual
Decision(s)

1. Claim: consequences are

Warrants
and
Rebuttals
Rebuttals

1. Claim:
consequences are
• beneficial
• beneficial

2. Claim: decisions are
2. Claim:
decisions
are
• values
sensitive
• values
sensitive
• equitable
• equitable

3. Claim: interpretations are

3. Claim: interpretations are
• meaningful
• meaningful
• impartial
• impartial
• generalizable
• generalizable
• relevant
• relevant
• sufficient
• sufficient
Warrants
and and
Warrants
Rebuttals
Rebuttals

Warrants
and and
Warrants
Rebuttals
Rebuttals

4. Claim
assessment
recordsrecords
are
4. Claim:
assessment
are
• consistent
• consistent

TestPerform
Taker'sance
Performance
Assessment tasks

Assessment tasks

Logical Structure of IA, ECD, AUA
Chain of Inferences

Base Argument
inference

Consequence
Premise

decision
Target score

Conclusion

interpretation

extrapolation
Universal score

Assumption

Challenge

Evidence
(?)

Evidence
(?)

Validity

score

generalization
Observed score

Most questionable
assumption

observation
IA, based on Kane (1990, 1992)

unless

unless A

since

Warrant

W

on
account
of
B

Claim

Alternative Explanation or
Rival Hypothesis

C

support
weaken

so
D

R

ECD, Mislevy et al. (2003: 15)

on
account
of

Backing

Rebuttal

since

support
weaken
reject

so
Data

Rebuttal

AUA, Bachman (2005: 15)

Rebuttal Data
Rebuttal Backing

Logical Problems of ECD
Claim
C: Sue can use specifics
to illustrate a description
of a fictional character.

Warrant

Alternative

W: Students who know how
to use writing techniques
will do so in an assignment
that calls for them.
on
account
of

since

Backing

B: The past three terms,
students’ understandings of
the use of techniques in indepth interviews have
corresponded with their
performances in their essays.

unless

so

A: The student has not
actually produced the work.

Rebuttal

D: Sue’s essay uses three
incidents to illustrate
Hamlet’s indecisiveness.

supports

R: Sue’s essay is very
similar to the character
description in the Cliff
Notes guide to Hamlet.

Data

Logical Problems of AUA－1
Counterclaim：
Jim is not sick.
Claim：Jim is sick.

If we already
know Jim is
visiting hisunless
partner in the
since
(Warrant)：People
often go to the hospitalhospital, do we
go
when they are sick. still need to so
through all these
steps?!
Data：Jim
is going to the
hospital.
(Bachman & Palmer, 2010, p. 97)

What if he is seeing the
doctor himself as well?

Rebuttal：Jim could
be visiting someone
who is in the hospital.
Supports

Rebuttal Backing：
Jim is visiting his
partner in the hospital.
What if Jim is attending a meeting in the
hospital, not visiting anyone in particular?

Logical Problems of AUA－2
Claim: Malissa was
paid time and a half.
unless
Warrant: All
Employers who work
overtime must be paid
time and a half.

since
so

Rebuttal: Malissa is in an
exempt category.
Rejects
Rebuttal Backing:
Malissa’s personnel file
indicates that she is not in an
exempt category.

Backing: According to
US labor law ...
Data: Malissa
worked overtime.
(Bachman & Palmer, 2010, p. 98)

Can it still be called
Rebuttal Backing if it
rejects the Rebuttal?

The Toulmin Model (Toulmin 1958, 2003)
(properly worded qualifier)

Rational logic

Harry was born
in Bermuda

D

So,

理性推理：以一般情况下都可以
接受的假定性理由为前提，结论
应该具有合理性
(highly probable assumption)

Since

Unless

W

R

A man born in Bermuda will
generally be a British subject

presumably Harry
So, presumably,
is a British subject
不可省略：结论通常不是绝对
的，应该根据反驳的可能性选
用一个恰当的限定词限定声明
的语气强度或成立条件
(rare and exceptional conditions)

Both his parents were aliens/
He has become a naturalized American/
……

不容置疑：假定性理由应不证
自明，或已事先证明

On account of
(readily available facts or truth)

Q, C

B

可以忽略：例外不足以威胁声
明的整体合理性；
必须忽略：追究例外即为陷入
死循环

The following statutes and other legal provisions:
……

客观存在：事实性支
撑应可随时奉取，而
无需争辩

IA、ECD和AUA共同的逻辑错误
及其产生根源
逻辑错误


自相矛盾








先声明后论证，即先作出声明后又
说自己的声明不一定成立
强调论证反驳，但在论证反驳时又

不得不放弃论证反驳



结构修改



明知声明不一定成立，也要强行作
出声明（将假设作为声明提出）
对反驳的论证，既不讲理由也不顾
反驳（又一次强行做出结论）





无限循环



反驳不可穷尽，甚至不可预知
声明的反驳的反驳正是声明自身



增加了反驳的证据
删除了限定词

模型误解


不具理性




错误根源

将假设称为声明（因为没有声明的
模型就不能称为辩论模型）
将反驳由必须忽略的特殊例外替换
为不可忽略的反面解释（为了消除
质疑和异议）
将辩论双方都应该遵循的逻辑推理
过程误解为双方的争辩过程

反驳误用


用反驳来论证声明而不是限定声明

图尔明对三段论的批判
Element
Minor Premise
Major Premise
Conclusion

Example 1
Socrates is a man
All men are mortal
Socrates is mortal

Example 2（p.115）
Anne is one of Jack’s sisters
All Jack’s sisters have red hair
So, Anne has red hair

Anne now
So,
presumably has red hair

Anne is one of
Jack’s sisters




Since

大前提存在歧义，既可以是
Unless
Any sister of Jack’s
may be taken to have Anne has dyed/
假定，也可是事实，因此三
gone white/
red hair
段论不能区分真假辩论。
lost her hair…
结论非是即否，容不得例外， On account of the fact that
因此三段论在日常辩论中应 All his sisters have previously
been observed to have red hair
用价值不大。
例1以假定为大前期，结论为对未来或未知的推理，因此可争可辩；
例2的大前提为事实，结论实为大前提事实的重复，而不是推理的结果，因此
无可争辩。如对事实存在质疑，争辩没有必须，摆出事实即可（如把Anne叫
到跟前，头发颜色自知）。

基于图尔明模型的AUA示例
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
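As this shows, the qualifier is the only explicit difference between the Toulmin model and the syllogism: without the qualifier, the Toulmin model itself becomes a syllogism, which is exactly what Toulmin criticized. In IA, ECD and AUA the qualifier has been deleted, and the so-called "claim" is really a hypothesis. The three models are therefore not argument models in substance, nor are they the so-called justification models, because even if the "claim" is renamed a hypothesis, how that hypothesis is to be tested remains unknown.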

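Period: most recently proposed (2011)
Basic view: hierarchical progressive view
Validation method: progressive argument method
Argument framework: progressive argument model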

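Progressive validity (累进效度) and the progressive argument method (累进辩论法)

Definition
• The degree to which test data reflect the target construct of the test.

Hierarchical progressive view
• Validity is relative to the stages of testing. The resulting data of every stage, not only the final scores, should fully reflect the target construct of the test.
• The validity of the current stage is the hierarchical, cumulative result of the validity of all preceding stages, and it affects the validity of all subsequent stages.
• Progression means that the validity of a stage can be no greater than that of its weakest preceding stage; if the validity of one stage is unacceptable, no subsequent stage can be said to have validity.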

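Notes:
1. A progressive argument can start at any stage, provided there is reason to believe that the preceding stages are valid; otherwise a starting point could never be found.
2. Although validity is a matter of degree, a test is "valid" once it reaches an acceptable degree, and "invalid" otherwise.
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of the data; otherwise it would be interpretation validity or use validity.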
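
The weakest-link rule and the acceptability threshold described above can be sketched in a few lines of Python. This is only an illustration: the numeric 0-1 scale, the stage order, the threshold value and the function name `cumulative_validity` are all assumptions introduced here, since the slides state the rule only in words.

```python
from typing import Dict, List, Tuple


def cumulative_validity(
    stages: List[Tuple[str, float]], threshold: float = 0.6
) -> Dict[str, float]:
    """Cap each stage's validity at that of its weakest predecessor.

    `stages` lists (name, own_validity) pairs in testing order. The 0-1
    scale and the threshold are illustrative assumptions, not source values.
    """
    result: Dict[str, float] = {}
    ceiling = 1.0          # validity of the weakest stage seen so far
    chain_broken = False   # True once some stage is unacceptable
    for name, own in stages:
        if chain_broken:
            # An unacceptable stage leaves all later stages without validity.
            result[name] = 0.0
            continue
        effective = min(own, ceiling)
        result[name] = effective
        ceiling = effective
        if effective < threshold:
            chain_broken = True
    return result


# Hypothetical figures for five stages of one test.
example = cumulative_validity(
    [("Construct", 0.9), ("Specification", 0.8), ("Task", 0.95),
     ("Response", 0.5), ("Score", 0.9)]
)
print(example)
```

In this made-up run, Task is capped at the level of Specification, and Score loses all validity because Response falls below the assumed threshold.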
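[Figure: the stages of testing and the data each produces, divided into A Priori and A Posteriori. Stage labels: Dsgn/Invstg, Developing, Administrating, Scoring, Using, Referencing. Data labels: Construct, Specification, Task, Response, Score, Criterion, Purpose/Consq.]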

Progressive argument method

Planning (Hypothesis), from effect to cause:
• list the questions in detail
• state the hypotheses explicitly

Executing (Claim), from cause to effect:
• test the hypotheses one by one
• draw rational conclusions

[Figure: a numbered list of questions (1., 2., 3., 4., …) attached to each kind of test data:
Construct: (no entries shown)
Specification: Clarity, Specificity, Practicability, …
Task: Correctness, Representativeness, Sufficiency, …
Response: Relevancy, Authenticity, Interactiveness, …
Score: Reliability, Item Quality, Language level, …
Criterion: Comparability, Reference Value, Predictability, …
Consequence: Beneficence, Fairness, Ethics, …]

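Progressive argument model: an organic integration of rational argument and scientific investigation

• Base part: rational argument, which ensures the model remains an argument model in essence (the design, execution and interpretation of statistical analysis all depend on logical reasoning).
• Extension: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone suits only cases where the data are simple and clear and the reasons obvious).
• No endless loop:
  1. If the reasons are sufficient, no hypothesis test is needed (this avoids overuse).
  2. A single hypothesis test necessarily yields a conclusion (misuse must be avoided).

[Figure: progressive argument model. Hypothesis → Analysis, with a decision node "evidential?" (branches N and Y), leading to a Toulmin layout annotated with statistical quantities: Data (p); So Qualifier (α); Claim (H0|H1); Since Warrant (c=1-α); on account of Backing; Unless Rebuttal (α/β).]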

Interpreting hypothesis test results

(a) Decision: is p≤α? N: H0; Y: H1
D=p: probability
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)

(b) e.g. D=0.8
C0: There is no significant difference
R=Type II error (β)

(c) e.g. D=0.0
C1: The difference is significant
R=Type I error (α)

In both (b) and (c): W=1-α=0.95 (confidence level); B=Empirical data (e.g. statistics); Q=at the significance level of α

[Figure: panels (b) and (c) each show a Toulmin layout with nodes D, W, B, R, Q and the claim C0 or C1.]
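
The mapping from a test result to the six Toulmin elements can be written out as a small Python sketch. The class `ToulminArgument`, the function `interpret_test` and the wording of the strings are hypothetical names chosen for illustration; only the decision rule p≤α and the assignments D, C0/C1, W, B, R and Q come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ToulminArgument:
    data: str
    claim: str
    warrant: str
    backing: str
    rebuttal: str
    qualifier: str


def interpret_test(p: float, alpha: float = 0.05) -> ToulminArgument:
    """Read a p-value as a Toulmin argument, following panels (a)-(c)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    reject_h0 = p <= alpha  # panel (a): Y leads to H1, N leads to H0
    if reject_h0:
        claim = "C1: The difference is significant"
        rebuttal = "Type I error (alpha)"
    else:
        claim = "C0: There is no significant difference"
        rebuttal = "Type II error (beta)"
    return ToulminArgument(
        data=f"p = {p}",
        claim=claim,
        warrant=f"confidence level 1 - alpha = {1 - alpha:.2f}",
        backing="empirical data (e.g. statistics)",
        rebuttal=rebuttal,
        qualifier=f"at the significance level of {alpha}",
    )


panel_b = interpret_test(0.8)  # the slide's example (b)
panel_c = interpret_test(0.0)  # the slide's example (c)
print(panel_b.claim, "|", panel_b.rebuttal)
print(panel_c.claim, "|", panel_c.rebuttal)
```

Whichever claim results, the qualifier and the rebuttal stay attached to it, which is the point the slide makes against models that drop the qualifier.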

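Justifying and falsifying

• Positive and negative interpretations: an interpretation favourable to the test is a positive interpretation; otherwise it is a negative interpretation.
• Claims and counterclaims: a claim may be either a positive or a negative interpretation. In other words, a claim is not the same as a positive interpretation, nor is a counterclaim the same as a negative interpretation.
• Evidence and counter-evidence: evidence only ever "supports" a claim and never "rejects" it. So-called counter-evidence is in fact evidence that supports the counterclaim.
• Justifying and falsifying: falsification is actually achieved indirectly, by justifying the counterclaim.

[Figure: two mirrored Toulmin layouts between Evidence and Interpretation. Justifying: (D), W, B, R, Q, C. Falsifying: (D'), W', B', R', Q', C'.]

The content of the positive and negative interpretations is fixed: once the research question is set, both are determined. The content of the claim and the counterclaim is not fixed: a counterclaim depends on a claim for its existence, and without a claim there is no counterclaim. Claims and counterclaims appear only after the research results are in.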

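Loop and recursion

• Each question solved requires one separate use of the argument model.
• Loop: questions at the same level are solved one after another. The model is used in a loop: when one argument ends, another begins. These are sibling arguments (Sibling Argument).
• Recursion: questions at different levels are solved level by level. The model is used recursively: another argument starts before the current one has ended. This is a sub-argument (Sub-Argument), and when it ends, the process returns to the current argument.

[Figure: a tree of questions, Q → Q 1-1 (Q 1-1-1, Q 1-1-2, ……), Q 1-2 (Q 1-2-1, Q 1-2-2, ……), ……. Planning is marked (Q = Hypothesis) and Executing is marked (Q = Claim); the tree is bounded by the (Final Claim) at Q and the (Initial Data) at the leaves.]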
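
The difference between sibling arguments (loop) and sub-arguments (recursion) can be shown with a short Python sketch over the question tree from the figure. The `Question` class, the `argue` function and the log messages are hypothetical and stand in for one full use of the argument model per question.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    label: str
    children: List["Question"] = field(default_factory=list)


def argue(question: Question, log: List[str]) -> None:
    """Run one argument per question, visiting the tree depth-first."""
    log.append(f"start {question.label}")   # the question is still a hypothesis
    for child in question.children:         # loop: sibling arguments in turn
        argue(child, log)                   # recursion: a sub-argument starts
                                            # before the current one has ended
    log.append(f"claim {question.label}")   # back in the current argument


tree = Question("Q", [
    Question("Q 1-1", [Question("Q 1-1-1"), Question("Q 1-1-2")]),
    Question("Q 1-2", [Question("Q 1-2-1"), Question("Q 1-2-2")]),
])
steps: List[str] = []
argue(tree, steps)
print("\n".join(steps))
```

The claim for Q is logged last, after every sub-argument has returned, which matches the slide's placement of the final claim at the top-level question.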

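Application example: a progressive argument for option guessability

[Figure: progressive argument for Guessability. Planning (Q = Hypothesis) and Executing (Q = Claim); labels include Option, Factor, Quality, Scale, Review, Rating, Consistency, p; Hypothesis → "Evidential?" (branches N and Y) with Recursion; Toulmin elements Warrant-c, Backing-s, Rebuttal-α/β, Qualifier-α.]

Legend:
α - significance level or Type I error
β - Type II error
p - probability
c - confidence level
s - statistical theory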

Slide 22

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
Descriptive interpretations—no particular use specified

Evaluation via
scoring
procedure

Observation

Generalization
via reliability
studies

Observed
Score

Decision-based interpretations

Extrapolation to
nontest behavior;
Explanation in terms
of a model

Universal
Score

Semantic inferences—claims about what the test scores mean

(转引自McNamara & Roever, 2006: 25)

Relevance,
associated values,
consequences

Target
Score
(Inference)

Decision

Policy inferences— claims
about positive consequences
of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence

Principle

Claim

（Observation）

（Verification）

（Inference）

What the students
say or do

Statistical models;
Probability-based
reasoning

What the students
know or can do

Test construction,

Relevance of data,

Target knowledge,

administration,

value of observations

acquisition process,

scoring, and

as evidence

contextualized use

reporting

Evidence-based Validation (Weir 2005)
Test Taker
Characteristic
Context Validity
Setting: task
Demand: task
•
•
•
•
Setting: administration
•
•

Executive
Processes
•
•

monitoring

Theory-based Validity
Executive
Resources
•
•

A Priori Evidence
Response

A posteriori Evidence
However, is there not a problem if
we do not have a clear idea of
what we want to measure before
we construct and administer a test
to students?

Is there a problem with a
'suck-it-and-see' approach?

Scoring Validity
•
•

Scoring

Score/Grade

•
•

Score Interpretation

•
•
•
•
•
•

Criterion-related
Reliability

Consequential
Validity

Criterion-related
Validity
Score Value

Assessment Use Argument (AUA,
Bachman & Palmer 2010)
Warrants and
Warrants
and
Rebuttals

Intended/Actual
Consequence(s)

Intended/Actual
Interpretation(s)
Assessment Record
(Score, description)
Test Taker’s Performance

Assessment Tasks

Rebuttals

Warrants and

INTERPRETATION AND USE

ASSESSMENT DEVELOPMENT

Intended/Actual
Decision(s)

1. Claim: consequences are

Warrants
and
Rebuttals
Rebuttals

1. Claim:
consequences are
• beneficial
• beneficial

2. Claim: decisions are
2. Claim:
decisions
are
• values
sensitive
• values
sensitive
• equitable
• equitable

3. Claim: interpretations are

3. Claim: interpretations are
• meaningful
• meaningful
• impartial
• impartial
• generalizable
• generalizable
• relevant
• relevant
• sufficient
• sufficient
Warrants
and and
Warrants
Rebuttals
Rebuttals

Warrants
and and
Warrants
Rebuttals
Rebuttals

4. Claim
assessment
recordsrecords
are
4. Claim:
assessment
are
• consistent
• consistent

TestPerform
Taker'sance
Performance
Assessment tasks

Assessment tasks

Logical Structure of IA, ECD, AUA
Chain of Inferences

Base Argument
inference

Consequence
Premise

decision
Target score

Conclusion

interpretation

extrapolation
Universal score

Assumption

Challenge

Evidence
(?)

Evidence
(?)

Validity

score

generalization
Observed score

Most questionable
assumption

observation
IA, based on Kane (1990, 1992)

unless

unless A

since

Warrant

W

on
account
of
B

Claim

Alternative Explanation or
Rival Hypothesis

C

support
weaken

so
D

R

ECD, Mislevy et al. (2003: 15)

on
account
of

Backing

Rebuttal

since

support
weaken
reject

so
Data

Rebuttal

AUA, Bachman (2005: 15)

Rebuttal Data
Rebuttal Backing

Logical Problems of ECD
Data
D: Sue’s essay uses three incidents to illustrate Hamlet’s indecisiveness.
so
Claim
C: Sue can use specifics to illustrate a description of a fictional character.
since
Warrant
W: Students who know how to use writing techniques will do so in an assignment that calls for them.
on account of
Backing
B: The past three terms, students’ understandings of the use of techniques in in-depth interviews have corresponded with their performances in their essays.
unless
Alternative
A: The student has not actually produced the work.
supports (R supports A)
Rebuttal
R: Sue’s essay is very similar to the character description in the Cliff Notes guide to Hamlet.

Logical Problems of AUA－1
(Bachman & Palmer, 2010, p. 97)
Data：Jim is going to the hospital.
so
Claim：Jim is sick.
since
(Warrant)：People often go to the hospital when they are sick.
unless
Rebuttal：Jim could be visiting someone who is in the hospital.
Supports
Rebuttal Backing：Jim is visiting his partner in the hospital.
Counterclaim：Jim is not sick.

Questions raised about this argument:
• If we already know Jim is visiting his partner in the hospital, do we still need to go through all these steps?!
• What if he is seeing the doctor himself as well?
• What if Jim is attending a meeting in the hospital, not visiting anyone in particular?

Logical Problems of AUA－2
(Bachman & Palmer, 2010, p. 98)
Data: Malissa worked overtime.
so
Claim: Malissa was paid time and a half.
since
Warrant: All employees who work overtime must be paid time and a half.
Backing: According to US labor law ...
unless
Rebuttal: Malissa is in an exempt category.
Rejects
Rebuttal Backing: Malissa’s personnel file indicates that she is not in an exempt category.

Question raised about this argument:
• Can it still be called Rebuttal Backing if it rejects the Rebuttal?

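The Toulmin Model (Toulmin 1958, 2003)
Rational logic: reasoning starts from presumptive grounds that are generally acceptable under normal circumstances, so the conclusion should be reasonable.

D: Harry was born in Bermuda
So, Q, C: So, presumably, Harry is a British subject
Since W: A man born in Bermuda will generally be a British subject
On account of B: The following statutes and other legal provisions: ……
Unless R: Both his parents were aliens / He has become a naturalized American / ……

Q (properly worded qualifier). Not to be omitted: a conclusion is usually not absolute, so a suitable qualifier should be chosen, according to how likely the rebuttal is, to limit the force of the claim or the conditions under which it holds.
W (highly probable assumption). Beyond question: the presumptive ground should be self-evident or already proven.
R (rare and exceptional conditions). Can be ignored: exceptions do not threaten the overall reasonableness of the claim. Must be ignored: pursuing exceptions means falling into an endless loop.
B (readily available facts or truth). Objectively existing: factual backing should be available on demand, with no need for argument.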
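The Toulmin layout above can be written down as a small data structure. The following is only a minimal sketch: the class name `ToulminArgument`, its field names and the `render` method are assumptions made for illustration and do not come from Toulmin or from the slides. The one design point it tries to mirror is that the qualifier always has a value, since the slide treats it as an element that must not be omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToulminArgument:
    """One argument in Toulmin's layout: D, so Q, C, since W, on account of B, unless R."""
    data: str
    claim: str
    warrant: str
    backing: str
    qualifier: str = "presumably"  # always present, never dropped
    rebuttals: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Build the sentence in the order used on the slide.
        text = (f"{self.data}. So, {self.qualifier}, {self.claim}, "
                f"since {self.warrant}, on account of {self.backing}")
        if self.rebuttals:
            # Rebuttals only qualify the claim; they are listed, not argued.
            text += ", unless " + " / ".join(self.rebuttals)
        return text + "."


harry = ToulminArgument(
    data="Harry was born in Bermuda",
    claim="Harry is a British subject",
    warrant="a man born in Bermuda will generally be a British subject",
    backing="the following statutes and other legal provisions",
    rebuttals=["both his parents were aliens",
               "he has become a naturalized American"],
)
print(harry.render())
```

The rebuttals are stored as plain strings on purpose: in this reading of the model they limit the claim and are not themselves opened up as further arguments.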

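Logical Errors Shared by IA, ECD and AUA, and Their Root Causes

Logical errors

Self-contradiction
• Claim first, justify later: a claim is made and then said not necessarily to hold
• Arguing the rebuttal is emphasized, yet when the rebuttal is argued, arguing rebuttals has to be given up

Irrationality
• A claim is forced through even though it is known that it may not hold (a hypothesis is put forward as a claim)
• The argument about the rebuttal gives no grounds and disregards any rebuttal (once again a conclusion is forced)

Infinite loop
• Rebuttals are inexhaustible, even unpredictable
• The rebuttal of a claim's rebuttal is the claim itself

Root causes

Structural modification
• Evidence for the rebuttal was added
• The qualifier was deleted

Misunderstanding of the model
• A hypothesis is called a claim (because a model without a claim could not be called an argument model)
• The rebuttal, a special exception that must be ignored, is replaced by a counter-interpretation that cannot be ignored (in order to remove doubts and objections)
• The process of logical reasoning that both sides of an argument should follow is mistaken for the process of dispute between the two sides

Misuse of the rebuttal
• The rebuttal is used to justify the claim rather than to qualify it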

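Toulmin's Critique of the Syllogism

Element: Example 1 / Example 2 (p.115)
Minor Premise: Socrates is a man / Anne is one of Jack’s sisters
Major Premise: All men are mortal / All Jack’s sisters have red hair
Conclusion: Socrates is mortal / So, Anne has red hair

Example 2 in Toulmin's layout:
Anne is one of Jack’s sisters
So, presumably, Anne now has red hair
Since: Any sister of Jack’s may be taken to have red hair
On account of the fact that: All his sisters have previously been observed to have red hair
Unless: Anne has dyed/gone white/lost her hair…

• The major premise is ambiguous: it can be either an assumption or a fact, so the syllogism cannot tell a genuine argument from a spurious one.
• The conclusion is either yes or no and admits no exceptions, so the syllogism is of little practical value in everyday argument.

In Example 1 the major premise is an assumption and the conclusion is an inference about the future or the unknown, so it is open to argument. In Example 2 the major premise is a fact and the conclusion merely repeats that fact rather than resulting from inference, so there is nothing to argue about. If the fact is in doubt, there is no need to argue: presenting the fact is enough (for example, call Anne over and the colour of her hair is plain to see).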

An AUA Example Based on the Toulmin Model
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
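As this shows, the qualifier is the only explicit difference between the Toulmin model and the syllogism. Without the qualifier, the Toulmin model itself becomes a syllogism, which is exactly what Toulmin criticized. In IA, ECD and AUA the qualifier has been deleted, and the so-called "claim" is in fact a hypothesis. In substance, therefore, the three models are neither argument models nor so-called justification models, because even if "claim" is changed to "hypothesis", it remains unknown how the hypothesis is to be tested.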

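Period: most recently proposed (2011)
Basic view: hierarchical progression
Validation method: progressive argument
Argument framework: progressive argument model

Progressive Validity (累进效度) and the Progressive Argument Method (累进辩论法)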

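Definition
• The degree to which test data reflect the target construct of the test

The hierarchical progressive view
• Validity is relative to the stages of testing. The resulting data of every stage, not only the post-test scores, should adequately reflect the target construct of the test.
• The validity of the current stage is the hierarchically accumulated result of the validity of all preceding stages, and it affects the validity of all subsequent stages.
• Progression means that the validity of a stage can be no greater than that of its weakest preceding stage; if the validity of one stage is unacceptable, no subsequent stage has any validity to speak of.

Notes:
1. A progressive argument may start at any stage, as long as there is reason to believe that the preceding stages are valid; otherwise a starting point could never be found.
2. Although validity is a matter of degree, a test is "valid" once it reaches an acceptable degree and "invalid" otherwise.
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of the data; otherwise it would be interpretation validity or use validity.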
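The rule that a stage can be no more valid than its weakest predecessor, and that an unacceptable stage leaves later stages with no validity at all, can be illustrated with a few lines of Python. This is a sketch under loud assumptions: the slides do not assign numbers to validity, so the 0 to 1 ratings, the 0.6 acceptance threshold and the function name `progressive_validity` are all invented here purely for illustration. The stage names follow the labels used in the deck's diagrams.

```python
from typing import Dict, List, Optional

# Stage labels as they appear in the deck's diagrams.
STAGES: List[str] = ["Construct", "Specification", "Task", "Response",
                     "Score", "Criterion", "Consequence"]


def progressive_validity(own: Dict[str, float],
                         acceptable: float = 0.6) -> Dict[str, Optional[float]]:
    """Cap each stage by the weakest stage so far (hypothetical ratings)."""
    result: Dict[str, Optional[float]] = {}
    ceiling = 1.0
    broken = False
    for stage in STAGES:
        if broken:
            # An earlier stage was unacceptable: nothing to report here.
            result[stage] = None
            continue
        # A stage can be no more valid than its weakest predecessor.
        ceiling = min(ceiling, own[stage])
        result[stage] = ceiling
        if ceiling < acceptable:
            broken = True
    return result


ratings = {"Construct": 0.9, "Specification": 0.8, "Task": 0.85,
           "Response": 0.5, "Score": 0.9, "Criterion": 0.9,
           "Consequence": 0.9}
result = progressive_validity(ratings)
for stage in STAGES:
    print(stage, result[stage])
```

In this made-up run the Task stage is capped at 0.8 by the Specification stage, and the weak Response stage leaves Score, Criterion and Consequence without a value, however good their own ratings are.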
[Figure: test stages, divided into A Priori and A Posteriori. Stage data: Construct, Specification, Task, Response, Score, Criterion, Purpose/Consq. Processes between them: Dsgn/Invstg, Developing, Administrating, Scoring, Referencing, Using.]

The Progressive Argument Method
Planning → Hypothesis
From effect to cause: list the questions in detail and state the hypotheses explicitly
Executing → Claim
From cause to effect: test the hypotheses one by one and draw rational conclusions

[Figure: a numbered list of questions (1., 2., 3., 4., …) is attached to the data of each stage]
Consequence: Beneficence, Fairness, Ethics, …
Criterion: Comparability, Reference Value, Predictability, …
Score: Reliability, Item Quality, Language level, …
Response: Relevancy, Authenticity, Interactiveness, …
Task: Correctness, Representativeness, Sufficiency, …
Specification: Clarity, Specificity, Practicability, …
Construct

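The Progressive Argument Model: An Organic Integration of Rational Argument and Scientific Inquiry

• Base part: rational argument, which ensures that the model remains in essence an argument model (the design, execution and interpretation of statistical analysis all depend on logical reasoning)
• Extension part: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone suits only cases where the data are simple and the grounds are obvious)
• No endless loop
1. Where the grounds are sufficient, no hypothesis test is needed (overuse can be avoided)
2. A single hypothesis test necessarily yields a conclusion (misuse must be avoided)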

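[Figure: the progressive argument model. A Hypothesis goes through Analysis (evidential?), with branches Y and N, and is settled in a Toulmin layout whose elements carry hypothesis-testing counterparts.]
Hypothesis (H0|H1)
Data (p)
So, Qualifier (α), Claim
Since Warrant (c=1-α), on account of Backing
Unless Rebuttal (α/β)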

Interpreting Hypothesis Test Results

(a) Decision rule
D=p: probability
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)
p≤α? N: H0; Y: H1

(b) e.g. D=0.8
C0: There is no significant difference
R=Type II error (β)

(c) e.g. D=0.0
C1: The difference is significant
R=Type I error (α)

In both (b) and (c), each laid out as D, C, W, B, R, Q:
W=1-α=0.95 (confidence level); B=Empirical data (e.g. statistics)
Q=at the significance level of α
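The mapping from a test result to the elements of an argument, as in panels (a) to (c) above, might be sketched as follows. The class `Reading` and the function `read_p_value` are hypothetical names introduced for this illustration; the wording of the claims, rebuttals and qualifier is taken from the slide, and α = 0.05 is just one of the example levels it lists.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """Toulmin-style reading of one hypothesis test (illustrative only)."""
    data: str
    claim: str
    warrant: str
    rebuttal: str
    qualifier: str


def read_p_value(p: float, alpha: float = 0.05) -> Reading:
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    significant = p <= alpha  # the decision rule of panel (a)
    return Reading(
        data=f"p = {p}",
        # C1 when H0 is rejected, C0 when it is retained.
        claim=("The difference is significant" if significant
               else "There is no significant difference"),
        warrant=f"confidence level 1 - alpha = {1 - alpha:.2f}",
        # Rejecting H0 risks a Type I error; retaining it risks a Type II error.
        rebuttal=("Type I error (alpha)" if significant
                  else "Type II error (beta)"),
        qualifier=f"at the significance level of {alpha}",
    )


print(read_p_value(0.8))   # panel (b)
print(read_p_value(0.0))   # panel (c)
```

Whatever the p-value, one call returns exactly one claim with its qualifier and rebuttal, which matches the slide's point that a single hypothesis test always ends in a conclusion.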

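Justifying and Falsifying

[Figure: an Interpretation is approached from two sides. Justifying: Evidence (D) leads to Q, C, with W, B and R. Falsifying: evidence (D') leads to Q', C', with W', B' and R'.]

• Positive and negative interpretations: an interpretation favourable to the test is a positive interpretation; the opposite is a negative interpretation.
• Claim and counterclaim: a claim can be either a positive or a negative interpretation. In other words, a claim is not the same as a positive interpretation, and a counterclaim is not the same as a negative interpretation.
• Evidence and counter-evidence: evidence only "supports" a claim and never "rejects" it. So-called counter-evidence is in fact evidence that supports the counterclaim.
• Justifying and falsifying: falsifying is in fact achieved indirectly, by justifying the counterclaim.

The content of the positive and negative interpretations is fixed: once the research question is set, both are set with it. The content of the claim and counterclaim is not fixed: a counterclaim depends on a claim for its existence, and without a claim there is no counterclaim. Claim and counterclaim appear only after the research results have been produced.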

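Loop and Recursion

[Figure: a tree of questions. Planning runs from Q (Final Claim) down through Q 1-1 (Q 1-1-1, Q 1-1-2, ……) and Q 1-2 (Q 1-2-1, Q 1-2-2, ……), where Q = Hypothesis. Executing runs back up from the (Initial Data), where Q = Claim.]

• Solving each question requires one separate use of the argument model.
• Loop: questions at the same level are solved step by step. The model is used in a loop, that is, one argument ends and another begins; this is called a sibling argument (Sibling Argument).
• Recursion: questions at different levels are solved level by level. The model is used recursively, that is, another argument is started before the current one has ended; this is called a sub-argument (Sub-Argument), and control returns to the current argument when the sub-argument ends.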
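The difference between sibling arguments (loop) and sub-arguments (recursion) can be made concrete with a small tree walk. This is a sketch only: the `Question` class, the `argue` function and the "open"/"close" log entries are assumptions made for illustration, and the body of each argument is reduced to a log line instead of a real test of the hypothesis. The question labels are the ones shown in the figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    label: str
    sub_questions: List["Question"] = field(default_factory=list)


def argue(question: Question, log: List[str]) -> str:
    """Settle one question with one use of the argument model."""
    log.append(f"open {question.label}")
    # Loop: sibling arguments run one after another, each finishing
    # before the next one starts.
    for sub in question.sub_questions:
        # Recursion: a sub-argument starts while the current argument
        # is still open, and control returns here when it ends.
        argue(sub, log)
    log.append(f"close {question.label}")
    return f"claim for {question.label}"


tree = Question("Q", [
    Question("Q 1-1", [Question("Q 1-1-1"), Question("Q 1-1-2")]),
    Question("Q 1-2", [Question("Q 1-2-1"), Question("Q 1-2-2")]),
])
log: List[str] = []
final_claim = argue(tree, log)
print("\n".join(log))
print(final_claim)
```

In the printed log, "close Q 1-1" appears before "open Q 1-2" (siblings in a loop), while "open Q 1-1-1" appears between "open Q 1-1" and "close Q 1-1" (a sub-argument inside its parent).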

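Application Example: A Progressive Argument on Option Guessability

[Figure: the question of option Guessability is planned as a Hypothesis (Q = Hypothesis, Planning) and executed into a Claim (Q = Claim, Executing). Labels in the figure: Option, Factor, Review, Quality, Scale, Rating, Consistency, p. At the check "Evidential?", N leads to Recursion and Y leads to the argument, with Warrant-c, Backing-s, Rebuttal-α/β and Qualifier-α.]

α - significance level or Type I error
β - Type II error
p - probability
c - confidence level
s - statistical theory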

Slide 25

第五章测试效度及其验证
方法的演变（二）
湖南师范大学外国语学院
邓杰教授

教学目标
1.

了解整体效度观的主要思想




2.

整体效度观
测试辩论法
图尔明模型

了解累进效度观的主要思想




测试辩论模型的逻辑错误及成因
累进辩论法
累进效度观

主要时期：20世纪80年代中期以后
基本观点：整体多维性
验证方法：理性辩论
辩论框架：图尔明模型

整体效度观

定义


《教育与心理测验标准》（1985）


Validity … refers to the appropriateness,
meaningfulness and usefulness of the specific
inferences made form test scores.
（从考分中推理出来的特定结论的恰当性、意义性和有用性）



《教育与心理测验标准》（1999）


Validity refers to the degree to which evidence
and theory support the interpretations of test
scores entailed by proposed uses of tests.
（证据和理论支持测试使用所需的考分解释的程度）

解读（效度概念的内涵）


整体多维性





整体概念：构念效度整体效度，即效度（不存在类别之分）
多维互补：之前不同类别的效度同一整体的不同维度（相互补充、
相互依存，为整体效度提供证据）

重心转移







不再是测试本身固有的属性，而在于分数的解释和使用，包括分数
解释的合理性、使用决策的恰当性和使用后果的裨益性
不再是“有”或“无”的问题，而是“程度”问题
不再是单一指标，而是综合评判
不再是抽象数值，而是理性结论
无所谓效度系数之说
不再由公式计算，而应逻辑推理

Validity Model (Messick 1988: 42)
TEST INTERPRETATION
EVIDENTIAL
BASIS

Construct validity

CONSEQUENTIAL
BASIS

Value implications

TEST USE
Construct validity
+
Relevance/utility
Social consequences

1. An inductive summary of convergent and discriminant evidence that
the test scores have a plausible meaning or construct interpretation,

2. An appraisal of the value implications of the test interpretation
3. A rationale and evidence for the relevance of the construct and the
utility of the scores in particular applications
4. An appraisal of the potential social consequences of the proposed use

and of the actual consequences when used

Interpretive Argument (IA, Kane 1992)
IA: Score interpretation and use
(cited from McNamara & Roever, 2006: 25)

Observation
→ [Evaluation via scoring procedure] → Observed score
→ [Generalization via reliability studies] → Universe score
→ [Extrapolation to nontest behavior; explanation in terms of a model] → Target score (inference)
→ [Relevance, associated values, consequences] → Decision

Descriptive interpretations: no particular use specified
Decision-based interpretations
Semantic inferences: claims about what the test scores mean
Policy inferences: claims about positive consequences of adopting decision rules

Assessment Argument (Mislevy et al. 2003)
Evidence-centered Design (ECD)

Evidence (Observation)
What the students say or do
Test construction, administration, scoring, and reporting

Principle (Verification)
Statistical models; probability-based reasoning
Relevance of data, value of observations as evidence

Claim (Inference)
What the students know or can do
Target knowledge, acquisition process, contextualized use

Evidence-based Validation (Weir 2005)
[Figure: Weir's validation framework. Recoverable labels:]
Test Taker Characteristics
A priori evidence
• Context Validity: Setting (task); Demand (task); Setting (administration)
• Theory-based Validity: Executive Processes (monitoring); Executive Resources
Response
A posteriori evidence
• Scoring Validity: Scoring; Reliability
• Score/Grade
• Score Interpretation; Score Value
• Consequential Validity
• Criterion-related Validity

Comments on the slide:
• However, is there not a problem if we do not have a clear idea of what we want to measure before we construct and administer a test to students?
• Is there a problem with a 'suck-it-and-see' approach?

Assessment Use Argument (AUA, Bachman & Palmer 2010)
[Figure: a chain of elements read in two ways, ASSESSMENT DEVELOPMENT and INTERPRETATION AND USE; each link carries warrants and rebuttals.]
Intended/Actual Consequence(s)
Intended/Actual Decision(s)
Intended/Actual Interpretation(s)
Assessment Record (score, description)
Test Taker's Performance
Assessment Tasks

1. Claim: consequences are
• beneficial
2. Claim: decisions are
• values-sensitive
• equitable
3. Claim: interpretations are
• meaningful
• impartial
• generalizable
• relevant
• sufficient
4. Claim: assessment records are
• consistent

Logical Structure of IA, ECD, AUA
[Figure: three panels comparing the argument structures. Recoverable labels:]

IA, based on Kane (1990, 1992)
Chain of inferences: observation, Observed score, Universe score, Target score, Consequence, linked by scoring, generalization, extrapolation, interpretation, and decision inferences
Base argument: Premise, inference, Conclusion, resting on an Assumption
Annotations: Challenge; Evidence (?); Validity; Most questionable assumption

ECD, Mislevy et al. (2003: 15)
D, so C (Claim), since W (Warrant), on account of B, unless A (Alternative Explanation or Rival Hypothesis); R supports or weakens A

AUA, Bachman (2005: 15)
Data, so Claim, since Warrant, on account of Backing, unless Rebuttal; Rebuttal Data (Rebuttal Backing) support, weaken, or reject the Rebuttal

Logical Problems of ECD
Claim (C): Sue can use specifics to illustrate a description of a fictional character.
Data (D): Sue’s essay uses three incidents to illustrate Hamlet’s indecisiveness.
Warrant (W): Students who know how to use writing techniques will do so in an assignment that calls for them.
Backing (B): The past three terms, students’ understandings of the use of techniques in in-depth interviews have corresponded with their performances in their essays.
Alternative (A): The student has not actually produced the work.
Rebuttal (R): Sue’s essay is very similar to the character description in the Cliff Notes guide to Hamlet.
Structure: D, so C, since W, on account of B, unless A; R supports A.

Logical Problems of AUA (1)
Data: Jim is going to the hospital.
Claim: Jim is sick.
Warrant: People often go to the hospital when they are sick.
Rebuttal: Jim could be visiting someone who is in the hospital.
Rebuttal Backing: Jim is visiting his partner in the hospital. (Supports the Rebuttal)
Counterclaim: Jim is not sick.
Structure: Data, so Claim, since Warrant, unless Rebuttal.
(Bachman & Palmer, 2010, p. 97)

Comments on the slide:
• If we already know Jim is visiting his partner in the hospital, do we still need to go through all these steps?!
• What if he is seeing the doctor himself as well?
• What if Jim is attending a meeting in the hospital, not visiting anyone in particular?

Logical Problems of AUA (2)
Data: Malissa worked overtime.
Claim: Malissa was paid time and a half.
Warrant: All employees who work overtime must be paid time and a half.
Backing: According to US labor law ...
Rebuttal: Malissa is in an exempt category.
Rebuttal Backing: Malissa’s personnel file indicates that she is not in an exempt category. (Rejects the Rebuttal)
Structure: Data, so Claim, since Warrant, on account of Backing, unless Rebuttal.
(Bachman & Palmer, 2010, p. 98)

Comment on the slide:
• Can it still be called Rebuttal Backing if it rejects the Rebuttal?

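The Toulmin Model (Toulmin 1958, 2003)

D (Data): Harry was born in Bermuda.
So, Q (Qualifier), C (Claim): So, presumably, Harry is a British subject.
Since W (Warrant): A man born in Bermuda will generally be a British subject.
On account of B (Backing): The following statutes and other legal provisions: ……
Unless R (Rebuttal): Both his parents were aliens / He has become a naturalized American / ……

Annotations:
• Rational logic: reasoning takes as its premise an assumed reason that is acceptable under ordinary circumstances, so the conclusion should be reasonable.
• Qualifier (properly worded qualifier): cannot be omitted. A conclusion is usually not absolute, so a suitable qualifier should be chosen, according to how likely a rebuttal is, to limit the force of the claim or the conditions under which it holds.
• Warrant (highly probable assumption): beyond question. The assumed reason should be self-evident or already proven.
• Rebuttal (rare and exceptional conditions): may be ignored, because exceptions do not threaten the overall reasonableness of the claim; must be ignored, because pursuing exceptions leads into an endless loop.
• Backing (readily available facts or truth): objectively given. Factual support should be available on demand and need no argument.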
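The layout above can be written down as a small data structure. The following is only an illustrative sketch, not part of Toulmin's or the lecturer's material: the class name `ToulminArgument` and its `render` method are hypothetical, and the backing text in the usage example is a placeholder for the elided statutes. It shows the one constraint the slide stresses, namely that the qualifier may not be left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToulminArgument:
    """Hypothetical container for the six elements of a Toulmin argument."""
    data: str
    claim: str
    qualifier: str
    warrant: str
    backing: str
    rebuttals: List[str]

    def render(self) -> str:
        # The slide's point: a claim without a qualifier is no longer
        # a Toulmin argument, so refuse to render one.
        if not self.qualifier.strip():
            raise ValueError("a Toulmin claim needs a qualifier")
        unless = " / ".join(self.rebuttals)
        return (
            f"{self.data}; so, {self.qualifier}, {self.claim}, "
            f"since {self.warrant}, on account of {self.backing}, "
            f"unless {unless}."
        )


harry = ToulminArgument(
    data="Harry was born in Bermuda",
    claim="Harry is a British subject",
    qualifier="presumably",
    warrant="a man born in Bermuda will generally be a British subject",
    backing="the relevant statutes and legal provisions",  # placeholder text
    rebuttals=[
        "both his parents were aliens",
        "he has become a naturalized American",
    ],
)
print(harry.render())
```

Clearing the `qualifier` field makes `render` fail, which mirrors the argument that a structure without a qualifier is a syllogism rather than a Toulmin argument.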

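Logical errors shared by IA, ECD, and AUA, and their root causes

Logical errors
Self-contradiction
• Claim first, argue later: a claim is made and then said not necessarily to hold
• Arguing out the rebuttal is emphasized, yet in arguing a rebuttal the models have to give up arguing rebuttals
Irrationality
• Knowing that the claim may not hold, the models assert it anyway (a hypothesis is put forward as a claim)
• The argument over the rebuttal gives no reasons and ignores further rebuttals (a conclusion is forced once again)
Infinite loop
• Rebuttals cannot be exhausted, or even foreseen
• The rebuttal of a claim's rebuttal is the claim itself

Root causes
Structural modification
• Evidence for the rebuttal was added
• The qualifier was deleted
Misunderstanding of the model
• A hypothesis is called a claim (because a model without a claim could not be called an argument model)
• The rebuttal is changed from special exceptions that must be ignored into negative interpretations that cannot be ignored (in order to eliminate doubts and objections)
• The process of logical reasoning that both sides of an argument should follow is mistaken for the dispute between the two sides
Misuse of the rebuttal
• The rebuttal is used to argue for the claim rather than to qualify it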

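Toulmin's critique of the syllogism

Element          Example 1             Example 2 (p. 115)
Minor premise    Socrates is a man     Anne is one of Jack’s sisters
Major premise    All men are mortal    All Jack’s sisters have red hair
Conclusion       Socrates is mortal    So, Anne has red hair

Example 2 in the Toulmin layout:
D: Anne is one of Jack’s sisters
So, presumably, Anne now has red hair
Since W: Any sister of Jack’s may be taken to have red hair
On account of the fact that B: All his sisters have previously been observed to have red hair
Unless R: Anne has dyed / gone white / lost her hair…

• The major premise is ambiguous: it can be an assumption or a fact, so the syllogism cannot tell a real argument from a spurious one.
• The conclusion is either true or false and admits no exceptions, so the syllogism is of little practical value in everyday argument.

Example 1 takes an assumption as its major premise, and its conclusion is an inference about the future or the unknown, so it is open to argument. In Example 2 the major premise is a fact, and the conclusion merely repeats that fact rather than resulting from inference, so there is nothing to argue about. If the fact is in doubt, arguing is unnecessary; presenting the fact is enough (for example, call Anne over and her hair color speaks for itself).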

The AUA example recast in the Toulmin model
A: Jim is going to the hospital, so he is probably sick.
(since people often go to the hospital when they are sick,
unless they are visiting someone who is in the hospital)

B: Jim is going to the hospital to visit his partner, so he can’t
possibly be sick himself.
(since people are usually not sick themselves when they are visiting someone,
unless they are seeing the doctor themselves)

A: Jim is seeing the doctor himself as well, so he must be sick.
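As this shows, the qualifier is the only explicit difference between the Toulmin model and the syllogism. Without the qualifier, the Toulmin model becomes a syllogism, which is exactly what Toulmin criticized. In IA, ECD, and AUA the qualifier has been deleted, and the so-called "claim" is in fact a hypothesis. The three models are therefore not argument models in substance, nor are they validation models, because even if "claim" were renamed "hypothesis", how the hypothesis should be tested would remain unknown.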

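Progressive validity and the progressive argument approach

Main period: most recently proposed (2011)
Basic view: progressive, stage by stage
Validation method: the progressive argument approach
Argument framework: the progressive argument model

Definition
• The degree to which test data reflect the target construct of the test

The progressive, stage-by-stage view
• Validity is relative to the stages of testing. The data produced at every stage, not only the final scores, should adequately reflect the target construct of the test.
• The validity of the current stage is the cumulative result of the validity of all preceding stages, and it affects the validity of all subsequent stages.
• "Progressive" means that the validity of a stage can be no greater than that of the weakest preceding stage. If the validity of one stage is unacceptable, no subsequent stage has any validity to speak of.

Notes:
1. A progressive argument can start at any stage, provided there is reason to believe that the preceding stages are valid; otherwise no starting point could ever be found.
2. Although validity is a matter of degree, a test is "valid" once it reaches an acceptable degree and "invalid" otherwise.
3. Test validity is naturally an inherent property of the test and does not belong to the interpretation or use of data; otherwise it would be interpretation validity or use validity.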
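[Figure: the stages of testing as a chain, divided into an a priori part and an a posteriori part. Recoverable labels, with the linking processes in brackets:]
Construct → [Dsgn/Invstg] → Specification → [Developing] → Task → [Administering] → Response → [Scoring] → Score → [Referencing] → Criterion → [Using] → Purpose/Consq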
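The progressive rule described above (a stage can be no more valid than its weakest predecessor, and an unacceptable stage leaves every later stage without validity) can be sketched numerically. This is a minimal illustration under loud assumptions: the slides give no numeric scale, so the 0 to 1 judgments, the `threshold` of 0.6, and the function name `progressive_validity` are all hypothetical.

```python
def progressive_validity(stage_scores, threshold=0.6):
    """Accumulate validity along the chain of test stages.

    stage_scores: ordered list of (stage_name, own_validity) pairs, where
    own_validity is a hypothetical judgment between 0 and 1.
    Returns a list of (stage_name, cumulative_validity, acceptable).
    """
    results = []
    ceiling = 1.0      # validity of the weakest stage seen so far
    chain_ok = True    # False once any stage has been found unacceptable
    for stage, own in stage_scores:
        if chain_ok:
            # A stage can be no more valid than its weakest predecessor.
            cumulative = min(ceiling, own)
        else:
            # An unacceptable stage leaves later stages with no validity.
            cumulative = 0.0
        acceptable = chain_ok and cumulative >= threshold
        results.append((stage, cumulative, acceptable))
        ceiling = cumulative
        chain_ok = acceptable
    return results


chain = [
    ("Construct", 0.9),
    ("Specification", 0.8),
    ("Task", 0.85),
    ("Response", 0.5),
    ("Score", 0.9),
]
for row in progressive_validity(chain):
    print(row)
```

In this made-up chain the Task stage is capped at 0.8 by the Specification stage, the Response stage falls below the threshold, and the Score stage therefore ends up with no validity despite its own high rating.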

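The progressive argument approach
[Figure: a two-phase cycle over the stages of testing. Recoverable labels:]

Planning (from effect to cause): list the questions in detail and state the hypotheses explicitly (result: Hypothesis)
Executing (from cause to effect): test the hypotheses one by one and draw rational conclusions (result: Claim)

Questions listed per stage (each list is open-ended):
Construct: (no items shown)
Specification: Clarity, Specificity, Practicability, …
Task: Correctness, Representativeness, Sufficiency, …
Response: Relevancy, Authenticity, Interactiveness, …
Score: Reliability, Item Quality, Language level, …
Criterion: Comparability, Reference Value, Predictability, …
Consequence: Beneficence, Fairness, Ethics, …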

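The progressive argument model: an integration of rational argument and scientific inquiry

• Base part: rational argument, which ensures that the model remains an argument model in essence (designing, carrying out, and interpreting statistical analyses all depend on logical reasoning)
• Extended part: hypothesis testing, used to handle complex data and reach convincing conclusions (logical reasoning alone suffices only when the data are simple and the reasons obvious)

The model does not fall into an endless loop:
1. As long as the reasons are sufficient, no hypothesis test is needed (overuse can be avoided)
2. A single hypothesis test necessarily yields a conclusion (misuse must be avoided)

[Figure: flow of the model]
Hypothesis → Analysis (evidential? Y / N) → Data (p), so Qualifier (α), Claim (H0|H1), since Warrant (c = 1 − α), on account of Backing, unless Rebuttal (α/β)

Interpreting hypothesis-test results
(a) Decision rule: is p ≤ α? If N, the result is H0; if Y, the result is H1.
D = p: probability
H0: There is no significant difference.
H1: The difference is significant.
α: significance level (e.g. 0.05, 0.1, 0.01)
(b) e.g. D = 0.8
C0: There is no significant difference
R = Type II error (β)
(c) e.g. D = 0.0
C1: The difference is significant
R = Type I error (α)
In both (b) and (c): W = 1 − α = 0.95 (confidence level); B = empirical data (e.g. statistics); Q = at the significance level of α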
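The mapping from a hypothesis-test result to the elements of the argument can be sketched as a small function. This is an illustrative reading of panels (a) to (c), not the lecturer's own implementation: the function name `interpret_test` and the dictionary keys are hypothetical, and the p-value is assumed to come from whatever test was actually run.

```python
def interpret_test(p, alpha=0.05):
    """Map a p-value onto Toulmin-style elements (hypothetical helper).

    p: probability reported by a statistical test (the Data, D).
    alpha: significance level chosen in advance.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    reject_h0 = p <= alpha  # decision rule from panel (a)
    if reject_h0:
        claim = "The difference is significant (H1)"
        rebuttal = "Type I error (alpha)"   # panel (c)
    else:
        claim = "There is no significant difference (H0)"
        rebuttal = "Type II error (beta)"   # panel (b)

    return {
        "data": p,
        "claim": claim,
        "qualifier": f"at the significance level of {alpha}",
        "warrant": round(1 - alpha, 10),    # confidence level, 1 - alpha
        "backing": "empirical data (e.g. statistics)",
        "rebuttal": rebuttal,
    }


for p_value in (0.8, 0.01):
    print(interpret_test(p_value))
```

Either branch ends with a qualified claim after one test, which is the sense in which a single hypothesis test always yields a conclusion.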

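Justifying and falsifying

• Positive and negative interpretations: an interpretation that is favorable to the test is a positive interpretation; otherwise it is a negative interpretation.
• Claim and counterclaim: a claim can be either a positive or a negative interpretation. In other words, a claim is not the same as a positive interpretation, and a counterclaim is not the same as a negative interpretation.
• Evidence and counterevidence: evidence only ever "supports" a claim and never "rejects" one. So-called counterevidence is in fact evidence that supports the counterclaim.
• Justifying and falsifying: falsification is in fact achieved indirectly, by justifying the counterclaim.

[Figure: two mirrored Toulmin structures built on the Evidence, both leading to an Interpretation.]
Justifying: (D), so Q, C, since W, on account of B, unless R
Falsifying: (D'), so Q', C’, since W', on account of B', unless R'

The content of the positive and negative interpretations is fixed: once the research question is set, both are determined. The content of the claim and the counterclaim is not fixed: a counterclaim exists only relative to a claim, and without a claim there can be no counterclaim. Claim and counterclaim appear only after the research results are in.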

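Loop and recursion

[Figure: a tree of questions between the top question Q (Final Claim) and the Initial Data. During Planning each Q is a hypothesis; during Executing each Q becomes a claim.]
Q
Q 1-1
Q 1-1-1
Q 1-1-2
……
Q 1-2
Q 1-2-1
Q 1-2-2
……

Each question that is solved requires one separate use of the argument model.
• Loop: questions on the same level are solved one after another. The model is used in a loop, that is, one argument ends and then another begins. Such an argument is called a sibling argument.
• Recursion: questions on different levels are solved level by level. The model is used recursively, that is, another argument is started before the current one has ended. Such an argument is called a sub-argument; when the sub-argument ends, control returns to the current argument.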
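The difference between sibling arguments (loop) and sub-arguments (recursion) can be made concrete with a traversal of the question tree. The sketch below is an assumption-laden illustration: the `Question` class, the `argue` function, and the event log are hypothetical names, and "arguing" a question is reduced to recording when its argument opens and closes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Question:
    """Hypothetical node in the question tree (Q, Q 1-1, Q 1-1-1, ...)."""
    label: str
    children: List["Question"] = field(default_factory=list)


def argue(question: Question, log: List[Tuple[str, str]]) -> None:
    """Run one argument per question and record the order of events."""
    log.append(("open", question.label))
    # Loop: sibling arguments run one after another on the same level.
    for child in question.children:
        # Recursion: a sub-argument starts before the current argument ends.
        argue(child, log)
    # Only after all sub-arguments have returned can this argument conclude.
    log.append(("close", question.label))


tree = Question("Q", [
    Question("Q 1-1", [Question("Q 1-1-1"), Question("Q 1-1-2")]),
    Question("Q 1-2", [Question("Q 1-2-1"), Question("Q 1-2-2")]),
])
events: List[Tuple[str, str]] = []
argue(tree, events)
closed = [label for kind, label in events if kind == "close"]
print(closed)
```

The order in which arguments close runs from the lowest-level questions back up to Q, matching the idea that the final claim is reached only after every sub-argument has returned.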

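Application example: a progressive argument about option guessability

[Figure: the progressive argument model applied to the guessability of multiple-choice options. Recoverable labels:]
Planning (Q = Hypothesis); Executing (Q = Claim)
Guessability; Option; Factor; Scale; Rating; Review; Quality; Consistency
Hypothesis → Evidential? (Y / N; Recursion) → p; Warrant-c; Backing-s; Rebuttal-α/β; Qualifier-α

Legend:
α: significance level or Type I error
β: Type II error
p: probability
c: confidence level
s: statistical theory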


Directory