PhD Research Seminar Series: Reliability and Validity in Tests and Measures


Dr. K. A. Korb, University of Jos

Outline

- Reliability
  - Theory of Reliability
  - Split-Half Reliability
  - Test-Retest Reliability
  - Alternate Forms Reliability
  - Inter-Rater Reliability
- Validity
  - Construct Validity
  - Criterion Validity
  - Content Validity
  - Face Validity

Overview

- Test Developer: The person who created a test
- Test User: A person administering the test
- Test Taker: A person taking the test

Reliability: Consistency of results

[Figure: three panels labeled Reliable, Reliable, and Unreliable]

Reliability Theory

- Actual score on test = True score + Error
- True Score: Hypothetical actual score on the test
- The reliability coefficient indicates the ratio between the true score variance on the test and the total variance:

  reliability = Var(true score) / Var(total)

- In other words, as the error in testing decreases, the reliability increases.
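To make the variance ratio concrete, here is a minimal Python sketch (not part of the original seminar) that simulates true scores and random error under this model; all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Classical test theory: observed score = true score + random error
true_scores = rng.normal(loc=50, scale=10, size=10_000)  # hypothetical true scores
error = rng.normal(loc=0, scale=5, size=10_000)          # random measurement error
observed = true_scores + error

# Reliability coefficient: true score variance / total observed variance
reliability = true_scores.var() / observed.var()
print(f"Reliability: {reliability:.2f}")  # roughly 100 / (100 + 25) = 0.80
```

Shrinking the error standard deviation pushes the coefficient toward 1, mirroring the point above that less error means higher reliability.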

Reliability: Sources of Error

- Error in Test Construction
  - Error in Item Sampling: Results from items that measure more than one construct in the same test. For example, a test that has items assessing both reading and math ability will have lower reliability than a test that assesses just reading.
- Error in Test Administration
  - Test environment: Room temperature, amount of light, noise, etc.
  - Test-taker variables: Illness, amount of sleep, test anxiety, etc.
  - Examiner-related variables: Absence of examiner, examiner's demeanor, etc.
- Error in Test Scoring
  - Scorer: With subjectively marked assessments, different scorers may give different scores to the same responses.

Reliability: Error due to Test Construction

- Measured by Split-Half Reliability: Determines how consistently your measure assesses the construct of interest. A low split-half reliability indicates poor test construction.
- If your measure assesses multiple constructs, split-half reliability will be considerably lower. Separate the constructs that you are measuring into different sections of the questionnaire and calculate the reliability separately for each construct.
- If you get a low reliability coefficient, then your measure is probably measuring more constructs than it is designed to measure. Revise your measure to focus more directly on the construct of interest.
- When validating a measure, you will most likely calculate the split-half reliability of your instrument.


Reliability: Error due to Test Construction

- Calculating Split-Half Reliability:
  - If you have dichotomous items (e.g., right/wrong answers), as you would with multiple choice exams, calculate the KR-20.
  - If you have a Likert scale, essays, or other types of items, use the Spearman-Brown formula.
- For a step-by-step example of calculating split-half reliability, see the associated presentation entitled Calculating Reliability of Quantitative Measures.
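As a rough illustration of the Spearman-Brown approach (a sketch, not taken from that presentation; the item data are simulated), split the items into two halves, correlate the half scores, then correct for test length:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 200 examinees on 20 items that all tap a single construct:
# each item score = examinee ability + item-specific noise
ability = rng.normal(size=(200, 1))
items = ability + rng.normal(size=(200, 20))

# Split into odd- and even-numbered items and total each half
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the two half scores, then apply the Spearman-Brown
# correction to estimate the reliability of the full-length test
r = np.corrcoef(odd_half, even_half)[0, 1]
split_half = 2 * r / (1 + r)
print(f"Split-half reliability: {split_half:.2f}")
```

For dichotomous items, the KR-20 would replace the correlation step, but the underlying logic of estimating internal consistency is the same.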


Reliability: Error due to Test Administration

- Test-Retest Reliability: Determines how much error in a test score is due to problems with test administration.
- To calculate:
  - Administer the same test to the same participants on two different occasions, perhaps a week or two apart.
  - Correlate the test scores of the two administrations of the same test using Pearson's Product-Moment Correlation (see the sketch below).
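A minimal sketch of that correlation step, assuming SciPy is available (the score lists are invented for illustration); the identical computation also serves for the parallel-forms and inter-rater coefficients described on the following slides:

```python
from scipy.stats import pearsonr

# Hypothetical scores from two administrations of the same test,
# one to two weeks apart, for the same eight participants
first_administration = [72, 85, 64, 90, 58, 77, 81, 69]
second_administration = [70, 88, 61, 93, 60, 74, 84, 66]

# Pearson's Product-Moment Correlation between the two administrations
r, p_value = pearsonr(first_administration, second_administration)
print(f"Test-retest reliability: r = {r:.2f}")
```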


Reliability: Error due to Test Construction with Two Forms of the Same Measure

- Parallel Forms Reliability: Determines the similarity of two different versions of the same measure.
- To calculate:
  - Administer the two tests to the same participants within a short period of time.
  - Correlate the test scores of the two tests using Pearson's Product-Moment Correlation.


Reliability: Error due to Test Scoring

- Inter-Rater Reliability: Determines how closely two different raters mark the assessment.
- To calculate:
  - Give the exact same test results from one test administration to two different raters.
  - Correlate the two sets of marks from the different raters using Pearson's Product-Moment Correlation.

Validity: Measuring what is supposed to be measured

[Figure: three panels labeled Valid, Invalid, and Invalid]

Validity

- Three types of validity:
  - Construct validity: Measures the appropriate psychological construct
  - Criterion validity: Predicts appropriate outcomes
  - Content validity: Adequately samples the content domain
- Each type of validity should be established for all psychological tests.

Construct Validity

Definition: The appropriateness of inferences drawn from test scores regarding an individual's status on the psychological construct of interest.

- For example, a test is developed to measure Reading Ability. Once the test is administered to students, does their score on the test accurately reflect their true reading ability?
- Two considerations:
  - Construct underrepresentation
  - Construct-irrelevant variance

Construct Validity

- Construct underrepresentation: A test does not measure all of the important aspects of the construct.
  - For example, a test of academic self-efficacy (perceived effectiveness in academics) might measure self-efficacy only in math and science, thus ignoring other important academic subjects.
- Construct-irrelevant variance: Test scores are affected by other unrelated processes.
  - For example, a test of statistical knowledge that requires complex calculations is likely influenced by construct-irrelevant variance. In addition to measuring statistical knowledge, the test is also measuring calculation ability.

Sources of Construct Validity Evidence

- Homogeneity: The test measures a single construct.
  - Evidence: High internal consistency, calculated by split-half reliability.
- Convergence: The test is related to other measures of the same construct and related constructs.
  - Evidence: High correlations with other measures (the same as criterion validity).
- Theory: The test behaves according to theoretical propositions about the construct.
  - Evidence from changes in test scores according to age: Scores on the measure should change with age as predicted by theory. For example, intelligence scores of one person should increase as that person gets older because theories of intelligence dictate increases with age.
  - Evidence from treatments: Scores on the measure change as predicted by theory from a treatment between pretest and posttest. For example, scores on a test of Knowledge of Nigerian Government should significantly increase after a course on the Nigerian Government.
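For the treatment evidence just described, one conventional way to check whether pretest-to-posttest gains are statistically significant is a paired-samples t-test. Below is a brief sketch with invented scores, assuming SciPy is available; the t-test is a standard choice here rather than something prescribed by the slides.

```python
from scipy.stats import ttest_rel

# Hypothetical Knowledge of Nigerian Government scores for eight students
pretest  = [45, 52, 38, 61, 47, 55, 42, 50]
posttest = [58, 66, 49, 70, 60, 63, 55, 64]

# Paired-samples t-test: did scores significantly increase after the course?
t_stat, p_value = ttest_rel(posttest, pretest)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A significant increase in the predicted direction would count as theory-based evidence of construct validity.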

Criterion Validity

- Definition: Correlation between the measure and a criterion.
- Criterion: Other accepted measures of the construct, or measures of other constructs similar in nature. A criterion can consist of any standard with which your test should be related.
- Examples:
  - Behavior (e.g., misbehavior in class, teacher's interactions with students, days absent from school)
  - Other test scores (e.g., standardized test scores)
  - Ratings (e.g., teachers' ratings of helpfulness)
  - Psychiatric diagnosis (e.g., depression, schizophrenia)

Criterion Validity

Three types:

- Convergent validity: High correlations with measures of similar constructs taken at the same time.
- Divergent validity: Low correlations with measures of different constructs taken at the same time.
- Predictive validity: High correlation with a criterion measured in the future.

Criterion Validity

- Example: You developed an essay test of science reasoning to admit students into the science programme at the university.
- Convergent Validity: Your test should have high correlations with other science tests, particularly well-established science tests.
- Divergent Validity: Your test should have low correlations with measures of writing ability because your test should measure only science reasoning, not writing ability.
- Predictive Validity: Your test should have high correlations with future grades in science courses because the purpose of the test is to determine who will do well in the science programme at the university.

Criterion Validity Example

Criterion Validity Evidence for New Science Reasoning Test: Correlations between the New Science Reasoning Test and Other Measures

Measure                                       Correlation
WAEC Science Scores                           .83
School Science Marks                          .75
WAEC Writing Scores                           .34
WAEC Reading Scores                           .24
Future marks in university science courses    .65

- High correlations with other measures of science ability (.83, .75) indicate good criterion validity.
- Low correlations with measures unrelated to science ability (.34, .24) indicate good criterion validity.
- A high correlation with future measures of science ability (.65) indicates good criterion validity.

Content Validity

- Definition: Sampling the entire domain of the construct the measure was designed to measure.
- For example:
  - The first chart represents the amount of time in class spent on each maths topic.
  - The second chart represents the number of test questions on each maths topic.
  - This test does NOT demonstrate content validity because the proportion of test questions does not match the proportion of coverage in class.

[Charts: Class Coverage and Test Coverage, each divided into Addition, Subtraction, Multiplication, and Division]

Content Validity

[Charts: Class Coverage and Test Coverage, each divided into Addition, Subtraction, Multiplication, and Division]

- For academic tests, a test is considered content valid when the proportion of material covered by the test approximates the proportion of material covered in the class.
- This maths test demonstrates good content validity because the proportion of test questions on each topic matches the proportion of time spent in class on each topic.
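As an informal sketch of the comparison this judgment rests on (the topic proportions below are invented for illustration), one can tabulate class coverage against test coverage and flag mismatches:

```python
# Hypothetical proportions of class time and of test questions per maths topic
class_coverage = {"Addition": 0.40, "Subtraction": 0.30,
                  "Multiplication": 0.20, "Division": 0.10}
test_coverage = {"Addition": 0.10, "Subtraction": 0.20,
                 "Multiplication": 0.30, "Division": 0.40}

# Flag any topic whose share of test questions strays far from its
# share of class time; large gaps suggest weak content validity
for topic, class_share in class_coverage.items():
    gap = abs(test_coverage[topic] - class_share)
    flag = "MISMATCH" if gap > 0.10 else "ok"
    print(f"{topic:<15} class={class_share:.0%} test={test_coverage[topic]:.0%}  {flag}")
```

With these invented numbers the test over-samples Division and under-samples Addition, so it would fail the proportionality check; the 0.10 threshold is arbitrary and only for illustration.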

Content Validity

- Content validity tends to be an important consideration ONLY for achievement tests.
- To assess:
  - Gather a panel of judges.
  - Give the judges a table of specifications of the amount of content covered in the domain.
  - Give the judges the measure.
  - The judges draw a conclusion as to whether the proportion of content covered on the test matches the proportion of content in the domain.

Face Validity

- Face validity addresses whether the test appears to measure what it purports to measure.
- To assess: Ask test users and test takers to evaluate whether the test appears to measure the construct of interest.
- Face validity is rarely of interest to test developers and test users.
  - The only instance where face validity is of interest is to instill confidence in test takers that the test is worthwhile.
- Face validity is NOT a consideration for educational researchers.
- Face validity CANNOT be used to determine the actual interpretive validity of a test.

Concluding Advice

- The best way to determine that the measures you use are both reliable and valid is to use a measure that another researcher has developed and validated.
- This will assist you in three ways:
  1. You can confidently report that you have accurately measured the variables you are studying.
  2. By using a measure that has been used before, your study is intimately tied to previous research that has been conducted in your field, an important consideration in determining the importance of your study.
  3. It saves you time and energy in developing your measure.

Finding Pre-Existing Measures

- Information on how to find pre-existing measures: http://www.apa.org/science/faq-findtests.html#printeddirec
- Online directory of pre-existing measures: http://www.ets.org/testcoll/
  - Type the construct you want to measure in the empty box and click the Search button.
  - Find the test that is most relevant for your purposes.
  - When you click on a measure name in blue, if it has a journal article listed in the Availability category, the measure will be published in that journal article.
  - Some tests can also be ordered from the ETS Tests collection for about N3000 and then downloaded to your computer.
  - You can also try googling the name of the test to determine if somebody else has published the measure on the internet.

Websites for Pre-existing Measures

- Personality Variables: International Personality Item Pool
  - http://ipip.ori.org/ipip/
- Motivation Constructs: Self-Determination Theory
  - http://www.psych.rochester.edu/SDT/
- Motivation Constructs: Students' goal orientations
  - http://www.umich.edu/~pals/