Introduction to Measurement

Goals of Workshop
• Reviewing assessment concepts
• Reviewing instruments used in the norming process
• Getting an overview of the secondary and elementary normative samples
• Learning how to use the manuals in interpreting students' scores

ASSESSMENT
• The process of collecting data for the purpose of making decisions about students
• It is a process and typically involves multiple sources and methods.
• Assessment is in service of a goal or purpose.
• The data we collect will be used to support some type of decision (e.g., monitoring, intervention, placement).

Major Types of Assessment in Schools
• More frequently used:
– Achievement: how well is the child doing in the curriculum?
– Aptitude: what are this child's intellectual and other capabilities?
– Behavior: is the child's behavior affecting learning?
• Less frequently used:
– Teacher competence: is the teacher actually imparting knowledge?
– Classroom environment: are classroom conditions conducive to learning?
– Other concerns: home, community, ...

Types of Tests
• Norm-referenced
– Comparison of performance to a specified population/set of individuals
• Individually-referenced
– Comparisons to self
• Criterion-referenced
– Comparison of performance to mastery of a content area; what does the student know?
• The data in the manual will allow you to look at norms and at individual growth.

MAJOR CONCEPTS
• Nomothetic and Idiographic
• Samples
• Norms
• Standardized Administration
• Reliability
• Validity

Nomothetic
• Relating to the abstract, the universal, the general.
• Nomothetic assessment focuses on the group as a unit.
• Refers to finding principles that are applicable on a broad level.
• For example, boys report higher math self-concepts than girls; girls report more depressive symptoms than boys.

Idiographic
• Relating to the concrete, the individual, the unique.
• Idiographic assessment focuses on the individual student.
• What type of phonemic awareness skills does Joe possess?

Populations and Samples I
• A population consists of all the representatives of a particular domain that you are interested in.
• The domain could be people, behavior, or curriculum (e.g., reading, math, spelling, ...).

Populations and Samples II
• A sample is a subgroup that you actually draw from the population of interest.
• Ideally, you want your sample to represent your population – the people polled or examined, the test content, the manifestations of behavior.

Random Samples
• A sample in which each member of the population had an equal and independent chance of being selected.
• Random samples are important because the idea is to have a sample that represents the population fairly; an unbiased sample.
• A sample can be used to represent the population.

Probability Samples I
• Sampling in which elements are drawn according to some known probability structure.
• Random samples are subcases of probability samples.
• Probability samples are typically used in conjunction with subgroups (e.g., ethnicity, socioeconomic status, gender).

Probability Samples II
• Probability samples using subgroups are also referred to as stratified samples.
• Standardization samples are typically probability or stratified samples.
• Standardization samples need to represent the population because the sample's results will be used to create norms against which all members of the population will be compared. (See the sampling sketch below.)
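To make the stratified-sampling idea concrete, here is a minimal Python sketch. The population records, the "group" strata, and the 10% sampling fraction are all hypothetical, invented for illustration; a real standardization sample would stratify on variables like ethnicity, socioeconomic status, and gender.

```python
# A minimal sketch of drawing a stratified (probability) sample.
# The population, strata, and fraction below are hypothetical.
import random

population = (
    [{"id": i, "group": "A"} for i in range(600)]
    + [{"id": i, "group": "B"} for i in range(600, 1000)]
)

def stratified_sample(pop, key, fraction, seed=42):
    """Sample the same fraction from each stratum so the sample
    mirrors the population's subgroup proportions."""
    rng = random.Random(seed)
    strata = {}
    for record in pop:
        strata.setdefault(record[key], []).append(record)
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)
        sample.extend(rng.sample(members, n))
    return sample

sample = stratified_sample(population, key="group", fraction=0.10)
print(len(sample))  # 100: 60 from stratum A, 40 from stratum B
```

Because each stratum is sampled at the same rate, the subgroup proportions in the sample match those in the population, which is what lets the resulting norms stand in for the whole population.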
Norms I
• Norms are examples of how the "average" individual performs.
• Many of the tests and rating scales that are used to compare children in the US are norm-referenced.
– An individual child's performance is compared to the norms established using a representative sample.

Norms II
• For the score on a normed instrument to be valid, the person being assessed must belong to the population for which the test was normed.
• If we wish to apply the test to another group of people, we need to establish norms for the new group.

Norms III
• To create new norms, we need to do a number of things:
– Get a representative sample of the new population.
– Administer the instrument to the sample in a standardized fashion.
– Examine the reliability and validity of the instrument with that new sample.
– Determine how we are going to report scores and create the appropriate tables.

Standardized Administration
• All measurement has error.
• Standardized administration is one way to reduce error due to examiner/clinician effects.
• For example, consider these questions asked with different facial expressions and tone:
– Please define a noun for me :-)
– DEFINE a noun if you can? :-(

Distributions
• Any group of scores can be arranged in a distribution from lowest to highest.
• 10, 3, 31, 100, 17, 4
• 3, 4, 10, 17, 31, 100

Normal Curve
• Many distributions of human traits form a normal curve.
• Most cases cluster near the middle, with fewer individuals at the extremes; the curve is symmetrical.
• We know how the population is distributed based on the normal curve.

Ways of Reporting Scores
• Mean, standard deviation
• Distribution of scores – 68.26% of cases fall within ±1 SD of the mean; 95.44% within ±2 SD; 99.72% within ±3 SD
• Stanines (1, 2, 3, 4, 5, 6, 7, 8, 9)
• Standard scores – linear transformations of scores, but easier to interpret
• Percentile ranks*
• Box and whisker plots*

Percentiles
• A way of reporting where a person falls on a distribution.
• The percentile rank of a score tells you how many people obtained a score equal to or lower than that score.
• So if we have a score at the 23rd percentile and another at the 69th percentile, which score is higher?

Percentiles 2
• Is a high percentile always better than a low percentile?
• It depends on what you are measuring.
• For example...
• Box and whisker plots are visual displays or graphic representations of the shape of a distribution using percentiles.

Explanation of the Box Plot
[Figure: box plot of performance for Grade 2 students on a 0–20 scale, marking the 10th, 25th, 50th, 75th, and 90th percentiles, with individual outliers shown above the 90th percentile.]
• The box plot is a picture of the distribution of scores on a measure.

Correlation
• We need to understand the correlation coefficient to understand the manual.
• The correlation coefficient, r, quantifies the relationship between two sets of scores.
• A correlation coefficient can range from -1 to +1.
– Zero means the two sets of scores are not related.
– One means the two sets of scores are identical (a perfect correlation).

Correlation 2
• Correlations can be positive or negative.
• A positive correlation tells us that as one set of scores increases, the second set of scores also increases. Examples?
• A negative correlation tells us that as one set of scores increases, the other set decreases. Think of some examples of variables with negative r's.
• The absolute value of a correlation indicates the strength of the relationship. Thus .55 is equal in strength to -.55.

How would you describe the correlations shown by these charts? (A computational sketch follows the charts.)
[Figure: three charts – an increasing series, a decreasing series, and a flat series – for discussion of positive, negative, and zero correlations.]
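Since r does so much of the work in the manual, here is a minimal Python sketch of computing it from its definition (covariance divided by the product of the standard deviations). The two score lists echo the increasing and decreasing series in the charts above and are illustrative only.

```python
# A minimal sketch of the Pearson correlation coefficient r.
# The score lists are illustrative, not data from the manual.
import statistics

def pearson_r(x, y):
    """r = covariance of x and y / (SD of x * SD of y)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

increasing = [3, 5, 7, 9, 11, 13]
decreasing = [10, 9, 8, 7, 6, 5]
print(pearson_r(increasing, decreasing))  # -1.0: a perfect negative correlation
```

Because both series are perfectly linear, one rising as the other falls, the sketch returns exactly -1.0; real score data land somewhere between the extremes.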
Correlation 4
• .25, .70, -.40, .55, -.87, .58, .05
• Order these from strongest to weakest:
• -.87, .70, .58, .55, -.40, .25, .05
• We will meet 3 different types of correlation coefficients today:
– Reliability coefficients – Definitions?
– Validity coefficients
– Pattern coefficients

Reliability
• Reliability addresses the stability, consistency, or reproducibility of scores.
– Internal consistency – split half, Cronbach's alpha (sketched in code after these slides)
– Test-retest
– Parallel forms
– Inter-rater

Reliability 2
• Internal consistency – How do the items on a scale relate to one another? Are respondents responding to them in the same way?
• Test-retest – How do respondents' scores at Time 1 relate to their scores at Time 2?

Reliability 3
• Parallel forms – Begin by creating at least two versions of the exam. How does respondents' performance on one version compare to their performance on another version?
• Inter-rater – Connected to ratings of behavior. How do one rater's scores compare to another's?

Validity
• Validity addresses the accuracy or truthfulness of scores. Are they measuring what we want them to?
– Content
– Criterion – concurrent
– Criterion – predictive
– Construct
– Face

Content Validity
• Is the assessment tool representative of the domain (behavior, curriculum) being measured?
• An assessment tool is scrutinized for its (a) completeness or representativeness, (b) appropriateness, (c) format, and (d) bias.
– E.g., MSPAS

Criterion-related Validity
• What is the correlation between our instrument, scale, or test and another variable that measures the same thing, or measures something very close to ours?
• In concurrent validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at the same time.
• In predictive validity, we compare scores on the instrument we are validating to scores on another variable that are obtained at some future time.

Structural Validity
• Used when an instrument has multiple scales.
• Asks the question, "Which items go together best?"
• For example, how would you group these items from the Self-Description Questionnaire?
– 3. I am hopeless in English classes.
– 5. Overall, I am no good.
– 7. I look forward to mathematics class.
– 15. I feel that my life is not very useful.
– 24. I get good marks in English.
– 28. I hate mathematics.

Structural Validity 2
• We expect the English items (3, 24), the math items (7, 28), and the global items (5, 15) to group together.
• The items that group together make up a new composite variable we call a factor.
• We want each item to correlate highly with the factor it clusters on, and less well with other factors.
• Typically, we accept item-factor coefficients from about .30 and higher.
• What can we say about the structural validity of the SDQ given these scores?

Item #   Verbal   Math    Global
  3       .587    -.044    .624
  5      -.016     .024    .561
  7       .086     .630   -.059
 23       .019    -.015    .625
 24       .754    -.006   -.024
 28      -.020     .750    .042

Construct Validity
• Overarching construct: Is the instrument measuring what it is supposed to?
– Dependent on reliability, content validity, and criterion-related validity.
• We also look at some other types of validity evidence sometimes:
– Convergent validity: r with a similar construct
– Discriminant validity: r with an unrelated construct
– Structural validity: What is the structure of the scores on this instrument?
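The reliability slides above name Cronbach's alpha as an index of internal consistency. Here is a minimal Python sketch of the standard formula, assuming a small hypothetical matrix of item responses (rows are respondents, columns are items); the numbers are invented for illustration.

```python
# A minimal sketch of Cronbach's alpha for internal consistency.
# Rows = respondents, columns = items; data are hypothetical.
import statistics

def cronbach_alpha(rows):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])                 # number of items on the scale
    items = list(zip(*rows))         # one tuple of scores per item
    item_vars = sum(statistics.variance(item) for item in items)
    total_var = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_vars / total_var)

scores = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [3, 4, 3, 3],
]
print(round(cronbach_alpha(scores), 2))  # 0.97: items move together closely
```

When respondents relate to the items in the same way, the item scores rise and fall together, the variance of the totals swamps the summed item variances, and alpha approaches 1.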
Statistical Significance
• When we examine group differences in science, we want to make objective rather than subjective decisions.
• We use statistics to tell us whether the difference we are observing is likely to have occurred by chance.
• In psychology, we typically set our alpha or error rate at 5% (i.e., .05), and we conclude that if a difference would occur by chance less than 5% of the time, that difference is statistically significant.

Statistical Significance 2
• When our statistical test returns p < .05, it is telling us that our difference is statistically significant.
• Statistical significance is affected by a number of variables, including sample size. The larger the sample, the easier it is to achieve statistical significance.
• We also look at the magnitude of the difference (or effect size).
• A difference may be statistically significant but have a small effect size.
• .10 to .30 = small effect; .40 to .60 = medium effect; > .60 = large effect. (See the sketch below.)
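As a sketch of how this plays out in practice, the code below runs a two-sample t-test and then converts t to an effect-size correlation so it can be read against the benchmarks above. The two groups of scores are hypothetical and SciPy is assumed to be available; this is an illustration, not part of the norming procedure.

```python
# A minimal sketch of a significance test plus an effect size.
# The two groups of scores are hypothetical; requires SciPy.
import math
from scipy import stats

group_a = [85, 90, 88, 92, 87, 91, 89, 86]
group_b = [82, 84, 86, 83, 85, 81, 84, 83]

# Two-sample t-test: would a mean difference this large occur by
# chance less than 5% of the time?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < .05 -> statistically significant

# Effect size as a correlation, r = t / sqrt(t^2 + df), readable against
# the slide's benchmarks (.10-.30 small, .40-.60 medium, > .60 large)
df = len(group_a) + len(group_b) - 2
r = t_stat / math.sqrt(t_stat**2 + df)
print(f"r = {r:.2f}")  # about .79 here: a large effect
```

With these small, clearly separated groups the difference is both significant and large; with a very large sample, a trivially small difference could reach p < .05 while r stayed near zero, which is exactly why the slides ask you to report effect size alongside significance.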