THE FOLLOWING LECTURE HAS BEEN APPROVED FOR
ALL STUDENTS
BY BIRMINGHAM CITY UNIVERSITY
This lecture may contain information, ideas,
concepts and discursive anecdotes that may
be thought provoking and challenging
Any issues raised in the lecture may require the
viewer to engage in further thought, insight,
reflection or critical evaluation
health.bcu.ac.uk/craigjackson
Validity & Variability
of Back Pain Assessments
Dr. Craig Jackson
Senior Lecturer in Psychology
School of Psychology
Faculty of Education, Law & Social Science
BCU
Who is Observing What?
The validity of any observation depends upon who is observing whom
Heisenberg’s uncertainty principle
(1927)
Content
Assessment Criteria
Validity
Reliability
Low Back Pain Assessments
Appropriateness & Feasibility
Between-Observer Variability and Consistency
The Future: Mathematical Models?
Validity without Psychology?
Variability
Specificity of Defined Field + Repeatable Measurement = Valid Measures
S + R = V
100
Problem of between-observer variation remains
GP eliciting signs in respiratory disease
Neurologist evaluating diagnosis of multiple sclerosis
Geriatrician assessing stroke rehab.
Anaesthetist determining fitness for operation
1. Judgements might be made differently by other observers
2. Judgements might be made differently by the same observer on repeated occasions
Between-Observer Variability
Variation between observers
Seriously compromise research / clinical findings
Worst example:
Patients with condition A - all examined by Dr X
Patients with condition B - all examined by Dr Y
Could one observer examine all patients?
Not possible or practical
Examples of Between-Observer Variability
Diagnostic classification for multiple sclerosis for 149 patients
By two clinicians (observers)

Neurologist A                        Neurologist B
diagnostic class    Certain   Probable   Possible   Doubtful   Total   Proportion
Certain                38         5          0          1        44       0.30
Probable               33        11          3          0        47       0.32
Possible               10        14          5          6        35       0.23
Doubtful                3         7          3         10        23       0.15
Total                  84        37         11         17       149
Proportion            0.56      0.25       0.07       0.11
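The disagreement in the table above can be summarised with a chance-corrected agreement statistic. The sketch below uses unweighted Cohen's kappa, which is an addition for illustration and is not named on the slide; the counts are copied from the table, while the names `TABLE` and `cohen_kappa` are hypothetical.

```python
# Agreement table from the slide: rows = Neurologist A, columns = Neurologist B
# (classes in order: Certain, Probable, Possible, Doubtful).
TABLE = [
    [38, 5, 0, 1],
    [33, 11, 3, 0],
    [10, 14, 5, 6],
    [3, 7, 3, 10],
]


def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square agreement table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Proportion of patients on the diagonal (both observers agree).
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


observed_agreement = sum(TABLE[i][i] for i in range(len(TABLE))) / 149
kappa = cohen_kappa(TABLE)
print(f"observed agreement = {observed_agreement:.2f}, kappa = {kappa:.2f}")
```

On these counts the two neurologists agree on 64 of 149 patients, and kappa comes out at roughly 0.21, which is conventionally read as weak agreement.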
Examples of Between-Observer Variability
Circum-corneal hyperaemia (scored 0,1,2,3,4) by four ophthalmologists

               Ophthalmologist
Patient      A     B     C     D
1            3     3     4     3
2            2     3     4     3
3            2     2     3     1
4            2     2     3     2
5            2     3     4     2
6            2     2     4     2
7            1     2     2     1
8            2     2     3     3
9            2     2     3     3

• Systematic error by observer C - consistently higher
• Observer B sticks to mid-ranges
• No patient on whom there is total agreement
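The observations about the circum-corneal scores can be checked directly from the data. This is a minimal sketch using the nine rows of scores from the slide; the names `SCORES`, `observer_means` and `full_agreement` are hypothetical.

```python
from statistics import mean

# Circum-corneal hyperaemia scores from the slide: patient -> (A, B, C, D)
SCORES = {
    1: (3, 3, 4, 3),
    2: (2, 3, 4, 3),
    3: (2, 2, 3, 1),
    4: (2, 2, 3, 2),
    5: (2, 3, 4, 2),
    6: (2, 2, 4, 2),
    7: (1, 2, 2, 1),
    8: (2, 2, 3, 3),
    9: (2, 2, 3, 3),
}
OBSERVERS = "ABCD"

# Mean score per observer: a consistently higher mean suggests systematic error.
observer_means = {
    obs: mean(row[i] for row in SCORES.values())
    for i, obs in enumerate(OBSERVERS)
}

# Patients on whom all four observers gave the same score.
full_agreement = [p for p, row in SCORES.items() if len(set(row)) == 1]

for obs, m in observer_means.items():
    print(f"Observer {obs}: mean score {m:.2f}")
print("Patients with total agreement:", full_agreement)
```

Observer C has the highest mean, and the list of patients with total agreement is empty, matching the bullet points.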
Examples of Between-Observer Variability
Iris hyperaemia (scored 0,1,2,3,4) by four ophthalmologists

               Ophthalmologist
Patient      A     B     C     D
1            1     1     4     3
2            0     0     0     3
3            0     1     0     1
4            0     4     0     1
5            0     1     0     2
6            3     1     4     2
7            1     1     0     1
8            2     1     4     2
9            0     4     9     2

• Observer C uses only extremes of scale
• Observer C introduces spurious code
• Observer D avoids extreme codes
• Only 2 cases with difference of 1 between highest and lowest scores
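Two of the points about the iris scores, the spurious code and the spread between highest and lowest score, lend themselves to a simple automated check. The sketch below uses the slide's data; `VALID_CODES`, `spurious` and `narrow` are hypothetical names.

```python
# Iris hyperaemia scores from the slide: patient -> (A, B, C, D)
IRIS = {
    1: (1, 1, 4, 3),
    2: (0, 0, 0, 3),
    3: (0, 1, 0, 1),
    4: (0, 4, 0, 1),
    5: (0, 1, 0, 2),
    6: (3, 1, 4, 2),
    7: (1, 1, 0, 1),
    8: (2, 1, 4, 2),
    9: (0, 4, 9, 2),
}
VALID_CODES = {0, 1, 2, 3, 4}  # the scale stated on the slide

# Any score outside the defined scale is a spurious code.
spurious = [
    (patient, observer, score)
    for patient, row in IRIS.items()
    for observer, score in zip("ABCD", row)
    if score not in VALID_CODES
]

# Spread between the highest and lowest score for each patient.
spread = {patient: max(row) - min(row) for patient, row in IRIS.items()}
narrow = [patient for patient, d in spread.items() if d == 1]

print("Spurious codes:", spurious)
print("Patients with a spread of exactly 1:", narrow)
```

The check flags observer C's score of 9 for patient 9 and finds only patients 3 and 7 with a spread of 1.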
Reducing Between-Observer Variability
Use expert panel / reference library – they evaluate all procedures
Compare rival observation methods in small pilot studies
Suspect observer at all times - how may s/he be biased?
Train observers / assessors
Standardised techniques & judgement criteria
Consider severity of disagreements
Randomise patients out to multiple observers / multiple observations
Appoint external assessor
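Randomising patients out to observers avoids the "worst example" in which each condition is seen by a single clinician. A minimal sketch of one way to do this, assuming a simple shuffle followed by round-robin dealing; the function name `allocate`, the observer labels and the seed are hypothetical.

```python
import random


def allocate(patients, observers, seed=None):
    """Shuffle patients, then deal them out to observers in turn."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(patients)
    rng.shuffle(shuffled)
    allocation = {obs: [] for obs in observers}
    for i, patient in enumerate(shuffled):
        allocation[observers[i % len(observers)]].append(patient)
    return allocation


allocation = allocate(range(1, 10), ["X", "Y", "Z"], seed=2024)
for observer, assigned in allocation.items():
    print(f"Dr {observer}: patients {sorted(assigned)}")
```

Dealing in turn keeps the caseloads balanced, while the shuffle means no observer's list is determined by diagnosis or order of arrival.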
Assessment Criteria
With any assessment – observation, questionnaire or equipment – we ask:
Utility
Is it useful?
Reliability
Is it dependable?
Validity
Does it do what it is supposed to?
Sensitivity
Can it identify patients with a condition?
Specificity
Can it identify those that do not have the condition?
Responsiveness
Can it measure differences over time?
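Sensitivity and specificity from the list above have simple arithmetic definitions when an assessment is compared against a reference diagnosis. A minimal sketch; the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of patients WITH the condition that the test identifies."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Proportion of patients WITHOUT the condition that the test clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 50 patients with the condition, 50 without.
sens = sensitivity(true_pos=40, false_neg=10)
spec = specificity(true_neg=45, false_pos=5)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```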
Purpose of Assessment Instruments
Is the purpose of the instrument clearly stated?
Is it discriminative?
Is it evaluative?
Is it prognostic?
Which population is it appropriate for?
Clinical
Working
Research / Epidemiological
Reliability & Validity
Validity
The degree to which an instrument measures what it is intended to measure
Reliability is a necessary but insufficient condition for validity
The approximate truth about inferences regarding causal relationships
Reliability
A degree of consistency of a measure
The degree to which a test is free of random error
A measure that produces consistent results is said to have high reliability
Validity in Research
Validity
Poor repeatability of examination implies a poor validity
How repeatable are results:
On two (or more) occasions by the same observer?
(Temporal Stability)
Or on repeated occasions by different observers?
Applies equally to Clinical Practice and Research
A clinical sign carries no information if it is assessed differently when re-examined
Measures for Clinical Use
Questionnaires
General health status
Pain
Functional status
Patient satisfaction
Physiological outcomes
Utilization measures
Cost measures
Mathematical Modelling
Face Validity
Are items measured in a sensible way?
How specific are the questions?
Do questions have a specific time frame / frame of reference?
Are questions performance related? (do you do it?)
Are questions capacity related? (can you do it?)
How is the index scored?
Weighting of items?
Content Validity
Content validity is concerned with “representativeness”
Are all relevant dimensions of functionality included ?
Subjective
Was method for choosing items appropriate ?
Draw an inference from test scores to a larger domain of functionality
e.g. the abilities covered by the test items should be representative of the larger domain of abilities and function
Construct Validity
What is the bigger concept that the assessment is trying to measure?
“Theoretical Construct”
Does the assessment perform satisfactorily when compared with other measures?
Is that concept a real one?
e.g. does specific local pain prevent general functioning?
Measured by correlation between the intended independent variable (back health) and a proxy independent variable (specific test performance) that is actually used
Construct Validity
For example:
Company physician wants to study the relationship between general back health and job performance
However, the physician may not be able to administer a comprehensive back health test to every worker
In this case, s/he can use a proxy variable such as “performance on a specific functional test” as an indirect indicator of back health
Administer the proxy test AND the comprehensive back test to a portion of the workers
If a strong correlation is found between general back health and the specific test, the proxy test can be used with the larger group because its construct validity is established
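The pilot step in this example amounts to correlating two sets of scores from the same workers. A minimal sketch using Pearson's correlation coefficient, which is an assumed choice since the slide does not say which coefficient is meant; the scores are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical pilot data for five workers who took both tests.
comprehensive_back_health = [10, 20, 30, 40, 50]
proxy_functional_test = [12, 18, 33, 41, 47]

r = pearson_r(comprehensive_back_health, proxy_functional_test)
print(f"r = {r:.2f}")
```

A coefficient close to 1 in the pilot sample is what would justify using the proxy test alone with the larger group.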
Criterion Validity
Drawing an inference from specific test scores to general performance
Criterion validity is about prediction rather than explanation.
Prediction is concerned with non-causal dependence
Explanation pertains to causal or logical dependence
E.g. one can predict the weather based on the height of mercury inside a barometer. However, one cannot use the behaviour of mercury height to explain why the weather changes.
Responsive to Change
Is measure sensitive enough to detect clinically relevant change?
Essential for evaluative measurements
Examples: Pain Perception
Visual Analogue Scales Reliable and Valid (Jensen & Karoly 1993)
Advantages over other pain assessment methods
(Scott & Huskisson 1976, Price et al. 1994)
Quadruple Visual Analogue Scales – 4 specific factors – Von Korff et al. 1992
CURRENT Pain Level
AVERAGE or TYPICAL Pain Level
Pain level at its BEST
Pain level at its WORST
Ratings are averaged, then x 10 = TOTAL SCORE (Range 0 – 100)
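The scoring rule above can be written out as a short function. This sketch assumes each of the four ratings is given on a 0 to 10 scale (implied by the 0 – 100 range) and that all four enter the average, as the slide states; the function name is hypothetical.

```python
def quadruple_vas_score(current, typical, best, worst):
    """Average the four 0-10 pain ratings and scale to 0-100."""
    ratings = (current, typical, best, worst)
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("each rating must lie between 0 and 10")
    return sum(ratings) / len(ratings) * 10


# Hypothetical patient: current 4, typical 5, best 2, worst 8.
total_score = quadruple_vas_score(current=4, typical=5, best=2, worst=8)
print(f"TOTAL SCORE = {total_score}")
```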
Condition-Specific Assessment – Low Back Pain
40+ low back functional questionnaires exist
5 identified as “gold standard” (Kopec & Esdaile, 1995)
1. Sickness Impact Profile (Bergner et al. 1981)
2. Roland-Morris Disability Questionnaire (Roland and Morris, 1983)
3. Oswestry Low Back Pain Disability Questionnaire (Fairbank et al. 1980)
4. Million Visual Analogue Scale (Million et al. 1982)
5. Waddell Disability Index (Waddell, 1984)
Condition-Specific Assessment – Low Back Pain
2 of the “gold standards” (Kopec & Esdaile, 1995)
1. Roland-Morris Disability Questionnaire (Roland and Morris, 1983)
2. Oswestry Low Back Pain Disability Questionnaire (Fairbank et al. 1980)
+ Quebec Back Pain Disability Scale (Kopec et al. 1995)
Roland Morris Disability Questionnaire (RMQ)
Purpose:
Acute and Chronic population of low back pain sufferers
An evaluative measure in clinical trials
Face Validity:
+ 24 Yes / No questions
+ Moderate specificity
+ Today is the frame of reference
+ Performance related
+ Double negatives
+ “Yes response” scores – score out of 24
Content Validity:
Mobility
Work
Sleeping
Recreation
Dressing / grooming
Standing
Mood
Appetite
Roland Morris Disability Questionnaire (RMQ)
“The best single study of assessing short-term outcomes of primary care
patients with low back pain“
(Von Korff & Saunders, 1996)
Scores greater than 13 = significant disability associated with an unfavorable outcome
(Von Korff & Saunders, 1996)
Any change of less than 4 points is both too small to matter and too small to be reliable
(Stratford et al. 1996)
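The RMQ scoring and the two thresholds quoted above can be sketched as follows. The thresholds (score above 13, change of at least 4 points) are taken from the slide; the function names and the example answers are hypothetical.

```python
def rmq_score(answers):
    """Count the 'Yes' responses across the 24 items (score out of 24)."""
    if len(answers) != 24:
        raise ValueError("the RMQ has 24 Yes / No items")
    return sum(1 for answer in answers if answer)


def significant_disability(score):
    """Scores greater than 13 indicate significant disability."""
    return score > 13


def meaningful_change(before, after):
    """Changes of less than 4 points are too small to matter or be reliable."""
    return abs(before - after) >= 4


# Hypothetical patient answering 'Yes' to 15 of the 24 items.
baseline = rmq_score([True] * 15 + [False] * 9)
print(baseline, significant_disability(baseline))
print(meaningful_change(baseline, 12), meaningful_change(baseline, 10))
```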
Oswestry Disability Questionnaire (revised)
Purpose:
Acute and Chronic population of low back pain sufferers
Discriminate between chronic and acute low back pain
An evaluative measure in clinical trials
Used to predict different rates of improvement
Face Validity:
+ Measured 0 – 5 by degree of difficulty
+ Very specific questions
+ No specific frame of reference
+ Capacity related questions
+ Score by summing all items = percentage score
Content Validity:
Pain intensity
Walking
Sleeping
Personal care
Lifting
Sitting
Standing
Sex / social life
Travelling
Oswestry Disability Questionnaire (revised)
Content Validity:
Omits: bending
twisting
emotional state
kneeling
turning
sudden movement
“Sex life” reduced response rates
(Hudson-Cook et al. 1989)
Scoring issues:
11% is a cut off score (Erhard et al. 1994)
00 - 20%     Minimal Disability
20 - 40%     Moderate Disability
40 - 60%     Severe Disability
60 - 80%     Crippled
80 - 100%    Bed Bound or Exaggerating
Stratford et al. 1988
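The percentage score and the disability bands can be combined in a small lookup. This is a sketch under two assumptions not stated on the slide: the percentage is the summed item scores divided by the maximum possible (5 per item), and a score on a shared boundary (e.g. exactly 20%) falls into the lower band. The names are hypothetical.

```python
# (upper limit in %, label), using the bands listed on the slide.
BANDS = [
    (20, "Minimal Disability"),
    (40, "Moderate Disability"),
    (60, "Severe Disability"),
    (80, "Crippled"),
    (100, "Bed Bound or Exaggerating"),
]


def oswestry_percentage(item_scores):
    """Sum the 0-5 item scores and express them as a percentage of the maximum."""
    if any(not 0 <= s <= 5 for s in item_scores):
        raise ValueError("each item is scored 0 to 5")
    return sum(item_scores) / (5 * len(item_scores)) * 100


def disability_band(percentage):
    """Return the first band whose upper limit is not exceeded (assumption)."""
    for upper, label in BANDS:
        if percentage <= upper:
            return label
    raise ValueError("percentage must not exceed 100")


# Hypothetical responses to ten items.
pct = oswestry_percentage([1, 2, 1, 0, 3, 2, 1, 1, 2, 1])
print(f"{pct:.0f}% -> {disability_band(pct)}")
```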
Quebec Back Pain Disability Scale
Purpose:
Acute and Chronic population of low back pain sufferers
Assess level of functional disability
Designed as discriminative, evaluative and predictive
Face Validity:
+ Response on rating scale 0 - 5
+ Very specific questions
+ “Today” as frame of reference
+ Performance related questions
+ Score by summing all items = percentage score
Content Validity:
Mobility
Sitting
Lifting
Travelling
Standing
Bending
Sleeping
Running
Quebec Back Pain Disability Scale
Content Validity:
Omits: twisting
emotional state
sex life
turning
sudden movement
Reliability
Has test-retest reliability been established ?
Measure reproducible on repeated use on stable patient ?
Internal consistency ?
Do items correlate with others ?
Alpha (reliability score), lowest to highest reported:
Roland-Morris Disability Questionnaire: 0.89 to 0.93
Oswestry Disability Questionnaire: 0.77
Waddell Disability Index: 0.76
Quebec Back Pain Disability Scale: 0.93 to 0.95
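The alpha values above are Cronbach's alpha, a measure of internal consistency. A minimal sketch of the standard formula, alpha = k / (k - 1) x (1 - sum of item variances / variance of total scores), follows; the item responses are hypothetical and are not drawn from any of the questionnaires listed.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of responses per item."""
    k = len(items)
    # Total score for each respondent across all items.
    totals = [sum(responses) for responses in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical data: 3 items answered by 4 respondents.
ITEMS = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [1, 3, 3, 5],
]

alpha = cronbach_alpha(ITEMS)
print(f"alpha = {alpha:.2f}")
```

Because the three hypothetical items rise and fall together across respondents, alpha is close to 1; items that do not correlate with one another would pull it down.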
Back Performance Scale (BPS)
5 Tests of sagittal-plane mobility
A) Sock test
B) Pick up test
C) Fingertip-to-floor test
D) Roll-up test
E) Lift test
Sum scores to obtain performance measure of mobility-related activities
Objectives:
Develop a sum scale
Discriminative ability
Sensitive to change
Strand et al. 2002
Back Performance Scale (BPS) – Evaluation of . . .
Correlations among 5 tests of sagittal-plane mobility:
Ranged from: 0.27 – 0.50
Correlations among 5 tests and BPS total:
Ranged from: 0.63 – 0.73
Cronbach Alpha (reliability):
Achieved: 0.73
Sum Scores Discrimination:
Higher scores in patients not returning to work
Higher scores in patients with back pain rather than MSD
Responsiveness:
Effect size high (1.33) for patients who returned to work
Effect size low (0.31) for patients who had not returned to work
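Responsiveness figures such as those above are effect sizes for change over time. The sketch below uses one common definition, mean change divided by the standard deviation of baseline scores; this is an assumption, since the slide does not state which formula was used, and the scores are hypothetical.

```python
from statistics import mean, stdev


def responsiveness_effect_size(baseline, follow_up):
    """Mean improvement divided by the SD of baseline scores (assumed definition)."""
    changes = [before - after for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


# Hypothetical sum scores for five patients (lower = better performance).
baseline_scores = [8, 10, 6, 12, 9]
follow_up_scores = [5, 6, 4, 7, 6]

effect_size = responsiveness_effect_size(baseline_scores, follow_up_scores)
print(f"effect size = {effect_size:.2f}")
```

A larger value means the measure registered more change relative to the spread of scores at baseline, which is the sense in which the slide contrasts 1.33 with 0.31.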
Back Performance Scale (BPS) – Evaluation of . . .
1. BPS sum more responsive than separate tests
2. Measures aspects of performance of clinical importance to back pain
3. Quick, simple and cheap to administer
4. No costly equipment
Future research:
Could tests with lateral bending and twisting be added?
Could twisting / lateral tests replace any of the sagittal bending tests?
Yellow Flags of Low Back Pain
Indicative of long term chronicity and disability
• Negative attitude – back pain is harmful and disabling
• Fear avoidance
• Reduced activity
• Expects passive treatment to be better than active treatment
• Tendency to low morale, depression and social withdrawal
• Social / Financial problems
Should these psychosocial aspects be included in assessment scale?
What validity does any scale have when omitting these constructs?
Appropriateness / Feasibility
Is administration format suitable ?
Time taken to complete questionnaire appropriate ?
Questions easy to understand ?
Questions acceptable to patient ?
Clinical relevance ?
Mathematical Models
Leg length differences and MSDs
Two measurement methods
1) Direct measurement / observation
2) Regression equations
MRI
Ultrasonics
Early stages
Not cost-effective
Complementary at present
Physiologically valid
Requires physiological uniformity
Not valid with clinical populations
Ashford & Marlbrook, 2003
Summary of Reliability & Validity
• There can be validity without reliability
• Reliability is an aspect of construct validity - as assessment becomes less standardized, distinctions between reliability and validity blur
• In many situations assessors are not trained to agree on a common set of criteria and standards
• Inconsistency in performance across tasks does not invalidate the assessment
• Rather it becomes an empirical puzzle to be solved by searching for a more comprehensive interpretation
• Initial disagreement does not invalidate any assessment - provides impetus for dialog
Moss, 1994
Implications for Back Pain Assessments
• Development of 1 single valid universal test may be pointless
• No grand Unifying Theory of measurements
• If something is easy to measure validly,
it would’ve been done by now
• Functional assessments seem alive and well
(for now)
• Functional assessments must develop and
include psychosocial aspects