Scoring Language Tests and Assessments
Page 1: Introduction to Scoring in Language Testing
Models and Frameworks
Describe the "what" of language testing and the constructs for inference.
Frameworks give a general overview but do not specify scoring details; those are set out in the scoring model.
Scoring Importance
Essential for linking evidence from tasks to the evaluated constructs.
Scoring can vary based on the test purpose (diagnostic feedback vs. score-based inference).
Page 2: Score Meaning Interpretation
TOEFL iBT Scoring
Test takers must consult institutions for TOEFL score requirements.
Score interpretations can become fixed, affecting both test users and providers.
Differences in test versions can alter score meanings.
Implications of Score Change
Users develop expectations around score meanings, influencing test reliability and interpretation.
Page 3: Score Meaning and Ability Claims
Iconic Scores
Scores attain defined meanings through usage in educational contexts.
Understanding score function is critical for valid claims about student abilities.
Page 4: Defining Language Quality
Evolution of Rating Scales
Rating scales initially lacked detailed descriptors and later evolved into more complex systems.
Early Examples
FSI's simple scales focused on identifiable traits.
A shift towards hierarchical descriptors for assessing proficiency emerged over time.
Page 5: Proficiency Levels
Descriptors of Language Proficiency Levels
Level 1: Elementary Proficiency - Basic communication and understanding.
Level 2: Limited Working Proficiency - Handles routine social interactions; still requires assistance.
Level 3: Minimum Professional Proficiency - Effective in various topics with reasonable fluency.
Level 4: Full Professional Proficiency - High fluency in professional contexts.
Level 5: Native/Bilingual Proficiency - Equivalent to an educated native speaker.
Page 6: Challenges in Descriptors
General Applicability
Difficulty in aligning specific performances with broad descriptors.
Page 7: Types of Scoring Systems
Holistic vs. Primary Trait Scoring
Holistic scales: Assign single scores based on overall performance.
Primary trait scoring: Scales are tailored to individual tasks so that the evidence is closely linked to the score.
Page 8: Data-Based Scoring Systems
Empirical Approaches
Scales are built from data and refined through evaluations of sample performances.
Page 9: Common European Framework of Reference (CEF)
CEF Overview
Defines levels through 'can do' statements, though these may oversimplify actual language abilities.
Page 10: Limitations of 'Can Do' Statements
General Issues
It is difficult to determine how skill levels should be defined and how those definitions should be applied.
Page 11: Classical Test Theory in Scoring
Overview of Test Scoring
Items are typically scored as correct or incorrect, and the results are converted to numerical scores.
Page 12: Item Difficulty
Difficulty Assessment
Item facility is the proportion of test takers who answer an item correctly.
Ideal item facility is around 0.5, with an acceptable range of 0.3 to 0.7.
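The facility calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming each response is coded 1 (correct) or 0 (incorrect); the function names and the sample data are hypothetical and not taken from the source.

```python
def item_facility(responses):
    """Proportion of test takers who answered the item correctly.

    Assumes each response is coded 1 (correct) or 0 (incorrect).
    """
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(responses) / len(responses)


def in_acceptable_range(facility, low=0.3, high=0.7):
    """Check a facility value against the 0.3 to 0.7 range given in the notes."""
    return low <= facility <= high


# Hypothetical data: ten test takers' scores on a single item.
item_scores = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
facility = item_facility(item_scores)
print(facility, in_acceptable_range(facility))
```

An item that nearly everyone answers correctly (or incorrectly) falls outside the range and tells us little about differences between test takers.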
Page 13: Discrimination Ability
Importance of Item Discrimination
Items should differentiate between high and low ability test-takers effectively.
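One common way to quantify discrimination is an upper-lower group index: the item facility in the highest-scoring group minus the item facility in the lowest-scoring group. The sketch below assumes groups made of the top and bottom thirds by total score; that split, the function name, and the data are illustrative assumptions, not details from the source.

```python
def discrimination_index(item_scores, total_scores, group_size=None):
    """Upper-lower discrimination index for one item.

    item_scores: 1/0 score on the item for each test taker.
    total_scores: total test score for each test taker, in the same order.
    group_size: size of each comparison group (assumed default: one third).
    """
    n = len(item_scores)
    if n != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    k = group_size if group_size is not None else max(1, n // 3)

    # Rank test takers from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]

    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: nine test takers, listed from highest to lowest total.
item = [1, 1, 1, 0, 1, 0, 0, 0, 1]
totals = [9, 8, 7, 6, 5, 4, 3, 2, 1]
print(discrimination_index(item, totals))
```

A value near zero (or negative) suggests that the item does not separate stronger from weaker test takers.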
Page 14: Ensuring Reliability
Reliability in Testing
Reliability concerns the consistency of test results across multiple administrations.
Page 15: Approaches to Reliability
Types of Reliability Measures
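Test-Retest: The same test is administered to the same group on two occasions and the scores are compared.
Parallel Forms: Scores from different versions of the test are compared.
Split Halves: Scores on two halves of a single test (e.g. odd vs. even items) are compared.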
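The split-half approach can be sketched as follows. This is a minimal example, assuming dichotomous (1/0) item scores, an odd/even split, and the Spearman-Brown correction to step the half-test correlation up to full-test length; the function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: fails with ZeroDivisionError if either half has no score variation.
    return cov / sqrt(var_x * var_y)


def split_half_reliability(score_matrix):
    """Odd/even split-half reliability with Spearman-Brown correction.

    score_matrix: one row per test taker, one 1/0 entry per item.
    """
    odd_half = [sum(row[0::2]) for row in score_matrix]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in score_matrix]  # items 2, 4, 6, ...
    r_half = pearson(odd_half, even_half)
    return 2 * r_half / (1 + r_half)


# Hypothetical data: five test takers, six items.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 3))
```

Test-retest and parallel-forms estimates could reuse the same `pearson` helper on two sets of total scores, without the correction step.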
Page 16: Reliability Factors
Influences on Reliability
The number of items, variation in item difficulty, and sample homogeneity all affect reliability estimates.
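The effect of test length can be illustrated with the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened or shortened with comparable items. The notes do not name this formula; it is brought in here as a standard illustration, and the function name and numbers are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length.

    reliability: reliability of the current test.
    length_factor: new length divided by current length (2 = twice as long).
    Assumes the added items behave like the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical example: a test with reliability 0.6 is doubled, then halved.
print(spearman_brown(0.6, 2))
print(spearman_brown(0.6, 0.5))
```

Doubling the test raises the predicted reliability and halving it lowers it, which is why short tests tend to report lower coefficients.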
Page 17: Reliability Coefficients
Kuder-Richardson Reliability Calculation
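The Kuder-Richardson formulas estimate internal consistency reliability from a single administration of correct/incorrect items; these are the coefficients typically reported for tests.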
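As a worked sketch, the KR-20 coefficient is k/(k-1) * (1 - sum(p*q) / variance of total scores), where k is the number of items, p is each item's facility, and q = 1 - p. The code below assumes 1/0 item scores and uses the population variance of total scores; the function name and data are hypothetical.

```python
from statistics import pvariance


def kr20(score_matrix):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    score_matrix: one row per test taker, one 1/0 entry per item.
    Assumes population variance for the total scores.
    """
    n_people = len(score_matrix)
    n_items = len(score_matrix[0])

    totals = [sum(row) for row in score_matrix]
    total_variance = pvariance(totals)

    # Sum of p * q across items, where p is the item facility.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in score_matrix) / n_people
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Hypothetical data: five test takers, four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 3))
```

For this data the total-score variance is 2 and the sum of p*q is 0.8, so the coefficient works out to (4/3) * (1 - 0.4) = 0.8.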
Page 18: Error Measurement in Scoring
Standard Error of Measurement
The standard error of measurement estimates how far observed scores are likely to deviate from true scores.
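In classical test theory the standard error of measurement is usually computed as the standard deviation of the scores multiplied by the square root of (1 - reliability). The sketch below applies that formula and builds a band of plus or minus one SEM around an observed score; the function names and numbers are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)


def score_band(observed, sem, n_sem=1):
    """Band of n_sem standard errors on either side of an observed score."""
    return observed - n_sem * sem, observed + n_sem * sem


# Hypothetical test: standard deviation 10, reliability 0.91.
sem = standard_error_of_measurement(10, 0.91)
low, high = score_band(60, sem)
print(round(sem, 2), round(low, 2), round(high, 2))
```

With these figures the SEM is about 3 score points, so an observed score of 60 is better read as a band from roughly 57 to 63 than as a single exact value.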
Page 19: Score Transformations for Interpretation
Transformation Applications
Raw scores may be converted to z-scores, then to T-scores for standardized reporting.
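The two conversions can be sketched as follows: a z-score is (raw score - mean) / standard deviation, and a T-score rescales it as 50 + 10z. The example assumes the population standard deviation of the group being reported; the function names and raw scores are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("scores have no variation")
    return [(x - m) / sd for x in raw_scores]


def t_scores(raw_scores):
    """Convert raw scores to T-scores (mean 50, standard deviation 10)."""
    return [50 + 10 * z for z in z_scores(raw_scores)]


# Hypothetical raw scores for five test takers.
raw = [10, 20, 30, 40, 50]
print([round(z, 2) for z in z_scores(raw)])
print([round(t, 1) for t in t_scores(raw)])
```

The T-score scale avoids the negative values and decimals of z-scores, which makes reported scores easier for users to read.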
Page 20: Introduction to Item Response Theory
IRT Overview
Builds on classical theories for scoring, focusing on continuous metrics and latent traits.
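To make the idea of a latent trait concrete, here is a minimal sketch of the one-parameter (Rasch) model, one common IRT model. The notes do not say which IRT model is intended, so this choice is an assumption. The probability of a correct response depends on the gap between the test taker's ability and the item's difficulty, both placed on the same continuous scale.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch (1PL) model.

    ability and difficulty are expressed on the same logit scale.
    """
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


# Hypothetical item of difficulty 0.0 faced by three test takers.
for ability in (-1.0, 0.0, 1.0):
    print(ability, round(rasch_probability(ability, 0.0), 3))
```

When ability equals difficulty the probability is 0.5, and it rises as ability exceeds the item's difficulty.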
Page 21: Advantages of IRT
Comparison with Classical Test Theory
Provides detailed error estimation independent of test samples and enhances measurement precision.
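One way to see the difference in error estimation: classical test theory reports a single standard error for the whole test, whereas an IRT model gives a standard error at each ability level. The sketch below assumes the Rasch model, where test information at a given ability is the sum of P * (1 - P) across items and the standard error is 1 / sqrt(information). The model choice, function names, and item difficulties are illustrative assumptions.

```python
from math import exp, sqrt


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


def conditional_standard_error(ability, difficulties):
    """Standard error of the ability estimate at a given ability level."""
    information = 0.0
    for b in difficulties:
        p = rasch_probability(ability, b)
        information += p * (1 - p)  # item information under the Rasch model
    return 1.0 / sqrt(information)


# Hypothetical three-item test with difficulties centred on 0.
items = [-1.0, 0.0, 1.0]
for ability in (0.0, 3.0):
    print(ability, round(conditional_standard_error(ability, items), 2))
```

The error is smallest where item difficulties match the test taker's ability and grows for abilities far from the items.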
Page 22: Score Meaning and Decision Making
Endowing Scores with Significance
Iconic scores gain meaning through their application in critical contexts but must be interpreted cautiously due to measurement error.
Page 23: Cut Score Setting Approaches
Angoff and Zieky/Livingston Methods
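Angoff: Expert judges review each item and estimate how likely a borderline test taker is to answer it correctly; the cut score is derived from these judgments.
Zieky/Livingston: Judges classify test takers against defined standards, and the cut score is set by analysing the scores of the classified groups.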
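The arithmetic behind an Angoff-style cut score can be sketched as below. It assumes each judge gives, for every item, the estimated probability that a borderline test taker answers correctly; each judge's estimates are summed, and the sums are averaged across judges. The function name and ratings are hypothetical, and operational standard setting involves additional steps such as discussion rounds.

```python
def angoff_cut_score(judge_ratings):
    """Cut score from Angoff-style item judgments.

    judge_ratings: one list per judge; each entry is the estimated
    probability (0 to 1) that a borderline test taker answers that item
    correctly.
    """
    if not judge_ratings:
        raise ValueError("at least one judge is required")
    # Each judge's implied cut score is the sum of their item estimates.
    per_judge = [sum(ratings) for ratings in judge_ratings]
    # The panel's cut score is the mean across judges.
    return sum(per_judge) / len(per_judge)


# Hypothetical panel: two judges rating a four-item test.
ratings = [
    [0.6, 0.5, 0.7, 0.4],
    [0.5, 0.5, 0.8, 0.6],
]
print(round(angoff_cut_score(ratings), 2))
```

Here the judges' sums are 2.2 and 2.4, giving a cut score of 2.3 out of 4 raw score points.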
Page 24: Summary of Key Concepts
Conclusion on Scoring Procedures
Scoring links test performance to claims about test-taker capabilities; careful consideration is necessary in establishing cut scores and weighing their implications.