Scoring Language Tests and Assessments

Page 1: Introduction to Scoring in Language Testing

Models and Frameworks

  • Describe the "what" of language testing and the constructs for inference.

  • Frameworks give a general overview but do not specify scoring details; those are set out in the scoring model.

Scoring Importance

  • Essential for linking evidence from tasks to the evaluated constructs.

  • Scoring can vary based on the test purpose (diagnostic feedback vs. score-based inference).

Page 2: Score Meaning Interpretation

TOEFL iBT Scoring

  • Test takers must consult institutions for TOEFL score requirements.

  • Score interpretations can become fixed, affecting both test users and providers.

  • Differences in test versions can alter score meanings.

Implications of Score Change

  • Users develop expectations around score meanings, influencing test reliability and interpretation.

Page 3: Score Meaning and Ability Claims

Iconic Scores

  • Scores attain defined meanings through usage in educational contexts.

  • Understanding score function is critical for valid claims about student abilities.

Page 4: Defining Language Quality

Evolution of Rating Scales

  • Rating scales initially lacked detailed descriptors; they later evolved into more complex systems.

Early Examples

  • FSI's simple scales focused on identifiable traits.

  • A shift towards hierarchical descriptors for assessing proficiency emerged over time.

Page 5: Proficiency Levels

Descriptors of Language Proficiency Levels

  • Level 1: Elementary Proficiency - Basic communication and understanding.

  • Level 2: Limited Working Proficiency - Handles routine social interactions; still requires assistance.

  • Level 3: Minimum Professional Proficiency - Effective in various topics with reasonable fluency.

  • Level 4: Full Professional Proficiency - High fluency in professional contexts.

  • Level 5: Native/Bilingual Proficiency - Equivalent to an educated native speaker.

Page 6: Challenges in Descriptors

General Applicability

  • Difficulty in aligning specific performances with broad descriptors.

Page 7: Types of Scoring Systems

Holistic vs. Multiple Trait Scoring

  • Holistic scales: Assign single scores based on overall performance.

  • Primary Trait Scoring: Tailored to individual tasks, so that the evidence is closely linked to the scoring.

Page 8: Data-Based Scoring Systems

Empirical Approaches

  • Scales are derived from data and refined through the evaluation of performance samples.

Page 9: Common European Framework of Reference (CEF)

CEF Overview

  • Defines levels through 'can do' statements, though these may oversimplify actual language abilities.

Page 10: Limitations of 'Can Do' Statements

General Issues

  • Challenges in determining skill level definitions and their application.

Page 11: Classical Test Theory in Scoring

Overview of Test Scoring

  • Items are mainly scored as correct or incorrect, and the item scores are converted into a numerical total.

Page 12: Item Difficulty

Difficulty Assessment

  • Item facility is the proportion of test takers who answer an item correctly.

  • Ideal item facility is around 0.5, with an acceptable range of 0.3 to 0.7.
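
As a minimal sketch of the facility calculation above, the following computes the proportion correct for each item and flags items outside the 0.3 to 0.7 range. The function names and the response data are hypothetical, chosen only for illustration.

```python
def item_facility(responses):
    """Return the proportion of correct answers for each item.

    `responses` holds one row per test taker; each row contains
    1 (correct) or 0 (incorrect) for every item.
    """
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]


def flag_items(facilities, low=0.3, high=0.7):
    """Return the indices of items whose facility lies outside [low, high]."""
    return [i for i, f in enumerate(facilities) if not low <= f <= high]


# Hypothetical data: 5 test takers, 4 items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
]
facilities = item_facility(responses)
print(facilities)              # [1.0, 0.4, 0.2, 0.6]
print(flag_items(facilities))  # [0, 2]
```

Here the first item is answered correctly by everyone and the third by very few, so both fall outside the acceptable range.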

Page 13: Discrimination Ability

Importance of Item Discrimination

  • Items should differentiate between high and low ability test-takers effectively.
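
One common way to quantify this is an upper-lower discrimination index: rank test takers by total score, then subtract the item's facility in the lowest-scoring group from its facility in the highest-scoring group. The sketch below assumes groups of one third each; the group size, function name, and data are illustrative assumptions, not values taken from these notes.

```python
def discrimination_index(responses, item, fraction=1 / 3):
    """Upper-lower discrimination index for one item.

    `responses` holds one row of 0/1 item scores per test taker.
    Test takers are ranked by total score; the index is the proportion
    correct in the top group minus the proportion correct in the bottom group.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # size of each group
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Hypothetical data: 6 test takers, 3 items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
for item in range(3):
    print(item, discrimination_index(responses, item))
```

Values near 1 mean the item separates strong from weak test takers well; values near 0 or below suggest the item needs review. With tied total scores, group membership depends on row order, so real analyses usually need a larger sample.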

Page 14: Ensuring Reliability

Reliability in Testing

  • Importance of consistency in test results across multiple administrations.

Page 15: Approaches to Reliability

Types of Reliability Measures

  1. Test-Retest: The same test is administered more than once to the same group and the scores are compared.

  2. Parallel Forms: Comparing scores from different test versions.

  3. Split Halves: Comparing scores on the odd-numbered items with scores on the even-numbered items.
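
All three approaches come down to correlating two sets of scores from the same test takers. The sketch below illustrates the split-half case. It assumes a Pearson correlation between the odd and even half scores, followed by the Spearman-Brown correction to estimate full-length reliability; the correction, the function names, and the data are assumptions added for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(responses):
    """Odd/even split-half reliability with the Spearman-Brown correction."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Each half is only half as long as the test, so step the estimate up.
    return 2 * r_half / (1 + r_half)


# Hypothetical data: 4 test takers, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(split_half_reliability(responses), 3))  # 0.9
```

The same `pearson` helper could be applied to two administrations (test-retest) or to two versions (parallel forms) by passing the two lists of total scores.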

Page 16: Reliability Factors

Influences on Reliability

  • The number of items, variation in difficulty, and sample homogeneity all affect reliability estimates.
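
The effect of test length can be illustrated with the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened or shortened with comparable items. This is a sketch under that assumption; the function name and the starting reliability of 0.6 are hypothetical.

```python
def predicted_reliability(reliability, length_factor):
    """Spearman-Brown prophecy formula.

    `length_factor` is the new test length divided by the current length
    (2.0 doubles the test, 0.5 halves it). Assumes the added or removed
    items are comparable to the existing ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


current = 0.6  # hypothetical reliability of the current test
for factor in (0.5, 1.0, 2.0, 3.0):
    print(factor, round(predicted_reliability(current, factor), 3))
```

Doubling a test with reliability 0.6 is predicted to raise it to 0.75, while halving it lowers the estimate, which is why short tests tend to report lower reliability.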

Page 17: Reliability Coefficients

Kuder-Richardson Reliability Calculation

  • Steps to calculate internal consistency reliability coefficients reported for tests.
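
The notes do not reproduce the steps, so the following is a minimal sketch of the KR-20 formula as it is usually stated: KR-20 = k/(k-1) × (1 − Σpq / variance of total scores), where k is the number of items, p is each item's facility, and q = 1 − p. Using the population variance is an assumption here (some texts use the sample variance), and the data are hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored (0/1) items."""
    n_takers = len(responses)
    k = len(responses[0])  # number of items
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance (assumed convention)
    if total_variance == 0:
        raise ValueError("Total scores have zero variance; KR-20 is undefined.")
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n_takers  # item facility
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical data: 4 test takers, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 3))  # 0.867
```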

Page 18: Error Measurement in Scoring

Standard Error of Measurement

  • Estimates how far observed scores are likely to deviate from true scores.
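
A minimal sketch of the standard calculation, SEM = SD × √(1 − reliability), with a band of one SEM around an observed score. The standard deviation, reliability, and observed score below are hypothetical values.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def score_band(observed, sem, n_sem=1):
    """Return (low, high) bounds of `n_sem` SEMs around an observed score."""
    return observed - n_sem * sem, observed + n_sem * sem


sem = standard_error_of_measurement(sd=10, reliability=0.91)  # about 3 points
low, high = score_band(observed=50, sem=sem)
print(round(sem, 2), round(low, 2), round(high, 2))
```

A band of one SEM on either side is conventionally read as covering the true score about 68% of the time, which is one reason scores close to a cut score call for cautious interpretation.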

Page 19: Score Transformations for Interpretation

Transformation Applications

  • Raw scores may be converted to z-scores, then to T-scores for standardized reporting.
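
The conversion can be sketched as follows: z = (raw − mean) / SD, and T = 50 + 10z, so T-scores have a mean of 50 and a standard deviation of 10. The use of the population standard deviation and the raw scores below are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_scores(raw):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m, sd = mean(raw), pstdev(raw)  # population SD (assumed convention)
    return [(x - m) / sd for x in raw]


def t_scores(raw):
    """Convert raw scores to T-scores (mean 50, standard deviation 10)."""
    return [50 + 10 * z for z in z_scores(raw)]


raw = [10, 20, 30, 40, 50]  # hypothetical raw scores
for r, z, t in zip(raw, z_scores(raw), t_scores(raw)):
    print(r, round(z, 2), round(t, 1))
```

T-scores avoid the negative values and decimals of z-scores, which makes them easier to report to test users.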

Page 20: Introduction to Item Response Theory

IRT Overview

  • Builds on classical theories for scoring, focusing on continuous metrics and latent traits.
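
To make the idea of a latent trait concrete, the sketch below uses the Rasch (one-parameter) model, in which the probability of a correct answer depends on the difference between a test taker's ability and the item's difficulty, both on the same continuous scale. The notes do not name a specific model, so the choice of Rasch and the example values are assumptions.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model.

    Both arguments are on the same logit scale; when ability equals
    difficulty the probability is exactly 0.5.
    """
    return 1 / (1 + exp(-(ability - difficulty)))


# Hypothetical item of difficulty 0.0 attempted at three ability levels.
for ability in (-1.0, 0.0, 1.0):
    print(ability, round(rasch_probability(ability, 0.0), 3))
```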

Page 21: Advantages of IRT

Comparison with Classical Test Theory

  • Provides detailed error estimation independent of test samples and enhances measurement precision.
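
One way to see the more detailed error estimation is that, in IRT, the standard error is computed for each ability level rather than as a single value for the whole test. The sketch below assumes the Rasch model, where each item contributes information p(1 − p) and the standard error is 1/√(total information); the model choice and item difficulties are illustrative assumptions.

```python
from math import exp, sqrt


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1 / (1 + exp(-(ability - difficulty)))


def standard_error(ability, difficulties):
    """Standard error of the ability estimate at a given ability level."""
    information = 0.0
    for b in difficulties:
        p = rasch_probability(ability, b)
        information += p * (1 - p)  # item information under the Rasch model
    return 1 / sqrt(information)


difficulties = [-1.0, 0.0, 1.0]  # hypothetical item difficulties
for ability in (-3.0, 0.0, 3.0):
    print(ability, round(standard_error(ability, difficulties), 2))
```

The error is smallest where item difficulties match the test taker's ability and grows toward the extremes, which a single classical SEM cannot show.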

Page 22: Score Meaning and Decision Making

Endowing Scores with Significance

  • Iconic scores gain meaning through their application in critical contexts but must be interpreted cautiously due to measurement error.

Page 23: Cut Score Setting Approaches

Angoff and Zieky/Livingston Methods

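  1. Angoff: Expert judges estimate, item by item, the probability that a minimally competent test taker answers correctly; the estimates are combined into a cut score.

  2. Zieky/Livingston: Judges classify test takers against a defined standard, and the test scores of the resulting groups are analysed to locate the cut score.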
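
The Angoff approach can be sketched as follows, assuming its usual form: each judge estimates, for every item, the probability that a minimally competent test taker would answer correctly; each judge's estimates are summed, and the sums are averaged across judges. The judges' ratings below are hypothetical.

```python
def angoff_cut_score(ratings):
    """Return the Angoff cut score as the mean of the judges' summed estimates.

    `ratings` holds one row per judge; each row contains that judge's
    probability estimate (0.0 to 1.0) for every item.
    """
    judge_totals = [sum(row) for row in ratings]
    return sum(judge_totals) / len(judge_totals)


# Hypothetical ratings: 3 judges, 4 items.
ratings = [
    [0.50, 0.75, 0.25, 1.00],
    [0.50, 0.50, 0.50, 0.50],
    [0.75, 0.75, 0.50, 1.00],
]
print(angoff_cut_score(ratings))  # 2.5 out of a possible 4
```

Because the result depends entirely on judges' estimates, the spread between judges (here 2.0 to 3.0) is worth reporting alongside the cut score.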

Page 24: Summary of Key Concepts

Conclusion on Scoring Procedures

  • Scoring links test performance to claims about test-taker capabilities; cut scores must be established with careful attention to their implications.