Scoring Language Tests and Assessments

Page 1: Introduction to Scoring in Language Testing

Models and Frameworks

Models describe the "what" of language testing: the constructs about which inferences are made.
Frameworks give a general overview but do not specify scoring details; those are set out in the scoring model.

Scoring Importance

Essential for linking evidence from tasks to the evaluated constructs.
Scoring can vary based on the test purpose (diagnostic feedback vs. score-based inference).

Page 2: Score Meaning Interpretation

TOEFL iBT Scoring

Test takers must consult institutions for TOEFL score requirements.
Score interpretations can become fixed, affecting both test users and providers.
Differences in test versions can alter score meanings.

Implications of Score Change

Users develop expectations around score meanings, influencing test reliability and interpretation.

Page 3: Score Meaning and Ability Claims

Iconic Scores

Scores attain defined meanings through usage in educational contexts.
Understanding score function is critical for valid claims about student abilities.

Page 4: Defining Language Quality

Evolution of Rating Scales

Rating scales initially lacked detailed descriptors and later evolved into more complex systems.

Early Examples

FSI's simple scales focused on identifiable traits.
A shift towards hierarchical descriptors for assessing proficiency emerged over time.

Page 5: Proficiency Levels

Descriptors of Language Proficiency Levels

Level 1: Elementary Proficiency - Basic communication and understanding.
Level 2: Limited Working Proficiency - Handles routine social interactions; still requires assistance.
Level 3: Minimum Professional Proficiency - Effective in various topics with reasonable fluency.
Level 4: Full Professional Proficiency - High fluency in professional contexts.
Level 5: Native/Bilingual Proficiency - Equivalent to an educated native speaker.

Page 6: Challenges in Descriptors

General Applicability

Difficulty in aligning specific performances with broad descriptors.

Page 7: Types of Scoring Systems

Holistic vs. Multiple Trait Scoring

Holistic scales: Assign single scores based on overall performance.
Primary trait scoring: Tailored to individual tasks, so that the evidence is closely linked to the score.

Page 8: Data-Based Scoring Systems

Empirical Approaches

Scales are built from data and refined by evaluating samples of performance.

Page 9: Common European Framework of Reference (CEF)

CEF Overview

Defines levels through 'can do' statements, though these may oversimplify actual language abilities.

Page 10: Limitations of 'Can Do' Statements

General Issues

Challenges in determining skill level definitions and their application.

Page 11: Classical Test Theory in Scoring

Overview of Test Scoring

Items are primarily scored correct/incorrect, and the responses are converted to numerical scores.

Page 12: Item Difficulty

Difficulty Assessment

Item facility is the proportion of test takers who answer an item correctly.
Ideal item facility is around 0.5, with an acceptable range of 0.3 to 0.7.
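As an illustration of the facility values above, the following is a minimal sketch that computes item facility for dichotomously scored (0/1) items and flags those outside the 0.3 to 0.7 range. The function names and the small score matrix are hypothetical and not taken from the source.

```python
def item_facility(responses):
    """Proportion of test takers who answered one item correctly.

    responses: list of 0/1 scores for a single item.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(responses) / len(responses)


def flag_items(score_matrix, low=0.3, high=0.7):
    """Return (item index, facility, within range?) for every item.

    score_matrix: rows are test takers, columns are items, values are 0/1.
    """
    n_items = len(score_matrix[0])
    report = []
    for j in range(n_items):
        column = [row[j] for row in score_matrix]
        p = item_facility(column)
        report.append((j, p, low <= p <= high))
    return report


# Hypothetical data: 5 test takers, 4 items.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]

for item, p, ok in flag_items(scores):
    status = "ok" if ok else "review"
    print(f"item {item}: facility={p:.2f} ({status})")
```

In this made-up data the first item is answered correctly by everyone (facility 1.0) and the third by only one test taker (0.2), so both would be flagged for review.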

Page 13: Discrimination Ability

Importance of Item Discrimination

Items should differentiate between high and low ability test-takers effectively.
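One common way to quantify this is an upper-lower discrimination index: the item facility in the highest-scoring group minus the facility in the lowest-scoring group. The sketch below assumes that approach; the notes do not say which index is intended, and the 27% group size, the function name, and the data are illustrative assumptions.

```python
def discrimination_index(score_matrix, item, fraction=0.27):
    """Upper-lower discrimination index for one item.

    score_matrix: rows are test takers, columns are items, values are 0/1.
    fraction: share of test takers in each extreme group (assumed 27%).
    """
    # Rank test takers by total score, highest first.
    ranked = sorted(score_matrix, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    upper = ranked[:k]
    lower = ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Hypothetical data: 6 test takers, 4 items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

for item in range(4):
    print(f"item {item}: D={discrimination_index(scores, item):.2f}")
```

Values near 1 mean the item separates high and low scorers well; values near 0 (or negative) suggest the item does not discriminate and should be reviewed.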

Page 14: Ensuring Reliability

Reliability in Testing

Importance of consistency in test results across multiple administrations.

Page 15: Approaches to Reliability

Types of Reliability Measures

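Test-Retest: The same test is administered twice to the same group and the two sets of scores are compared.
Parallel Forms: Scores from two equivalent versions of the test are compared.
Split Halves: The test is divided in two (for example, odd vs. even items) and scores on the two halves are compared.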
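The split-half approach can be sketched as follows: correlate each test taker's score on the odd-numbered items with their score on the even-numbered items, then apply the Spearman-Brown correction to estimate reliability for the full-length test. The correction step is a standard companion to split-half estimates but is not mentioned in the notes; function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a half has no variance")
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(score_matrix):
    """Odd/even split-half reliability with Spearman-Brown correction.

    score_matrix: rows are test takers, columns are items, values are 0/1.
    """
    odd = [sum(row[0::2]) for row in score_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in score_matrix]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))


# Hypothetical data: 3 test takers, 4 items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 3))
```

The same `pearson` helper would serve for test-retest or parallel-forms estimates, where the two lists are total scores from the two administrations or the two forms.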

Page 16: Reliability Factors

Influences on Reliability

The number of items, the variation in item difficulty, and the homogeneity of the sample all affect reliability estimates.

Page 17: Reliability Coefficients

Kuder-Richardson Reliability Calculation

Steps for calculating the internal consistency reliability coefficients that are reported for tests.
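The notes do not reproduce the steps, so the following is a sketch that assumes the Kuder-Richardson formula 20 (KR-20) for dichotomous items: r = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores), where k is the number of items, p is each item's facility, and q = 1 - p. Using the population variance of the total scores is an assumption here; some presentations use the sample variance. The function name and data are hypothetical.

```python
from statistics import pvariance


def kr20(score_matrix):
    """KR-20 internal consistency estimate for 0/1 scored items.

    score_matrix: rows are test takers, columns are items.
    """
    n_people = len(score_matrix)
    k = len(score_matrix[0])

    # Step 1: total score for each test taker and the variance of those totals.
    totals = [sum(row) for row in score_matrix]
    total_variance = pvariance(totals)  # population variance (assumption)
    if k < 2 or total_variance == 0:
        raise ValueError("need at least 2 items and non-zero score variance")

    # Step 2: sum of p * q over items, where p is item facility and q = 1 - p.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n_people
        sum_pq += p * (1 - p)

    # Step 3: apply the KR-20 formula.
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical data: 4 test takers, 3 items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(scores))
```

For this small example the total-score variance is 1.25 and the sum of p*q is 0.625, giving 1.5 * (1 - 0.5) = 0.75.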

Page 18: Error Measurement in Scoring

Standard Error of Measurement

Estimates how far observed scores are likely to deviate from true scores.
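In classical test theory the standard error of measurement is usually computed as SEM = SD * sqrt(1 - r), where SD is the standard deviation of the test scores and r is the reliability coefficient. The sketch below applies that formula and builds a band around an observed score; the function names and the example figures are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * sqrt(1 - reliability)


def score_band(observed, sem, z=1.96):
    """Band of +/- z SEMs around an observed score (z=1.96 is roughly 95%)."""
    return (observed - z * sem, observed + z * sem)


# Hypothetical test: SD of 10 score points, reliability of 0.91.
sem = standard_error_of_measurement(10, 0.91)
low, high = score_band(50, sem)
print(f"SEM={sem:.2f}, band for observed score 50: {low:.1f} to {high:.1f}")
```

Even with a reliability of 0.91, the band around a score of 50 spans roughly 44 to 56, which is why scores close to a cut score need cautious interpretation.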

Page 19: Score Transformations for Interpretation

Transformation Applications

Raw scores may be converted to z-scores, then to T-scores for standardized reporting.
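A minimal sketch of the two transformations, using the standard definitions z = (raw - mean) / SD and T = 50 + 10z. Using the population standard deviation of the group is an assumption; the function names and raw scores are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Convert raw scores to z-scores (mean 0, SD 1 within this group)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # population SD (assumption)
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(x - m) / sd for x in raw_scores]


def t_scores(raw_scores):
    """Convert raw scores to T-scores (mean 50, SD 10)."""
    return [50 + 10 * z for z in z_scores(raw_scores)]


# Hypothetical raw scores with mean 5 and SD 2.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
for x, z, t in zip(raw, z_scores(raw), t_scores(raw)):
    print(f"raw={x}  z={z:+.2f}  T={t:.0f}")
```

T-scores avoid the negative values and decimals of z-scores, which makes them easier to report to test users.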

Page 20: Introduction to Item Response Theory

IRT Overview

Builds on classical theories for scoring, focusing on continuous metrics and latent traits.
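To make the idea of a latent trait on a continuous scale concrete, the sketch below uses the one-parameter (Rasch) model, in which the probability of a correct response depends only on the difference between a test taker's ability and an item's difficulty. The notes do not name a specific IRT model, so choosing the Rasch model here is an assumption, as are the function name and the values.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch (1PL) model.

    ability and difficulty are on the same continuous (logit) scale.
    """
    return 1 / (1 + exp(-(ability - difficulty)))


# Hypothetical item of difficulty 0.0 and test takers of varying ability.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p = rasch_probability(theta, 0.0)
    print(f"ability={theta:+.1f}  P(correct)={p:.2f}")
```

When ability equals difficulty the probability is 0.5, and it rises smoothly as ability exceeds difficulty.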

Page 21: Advantages of IRT

Comparison with Classical Test Theory

Provides detailed error estimation independent of test samples and enhances measurement precision.

Page 22: Score Meaning and Decision Making

Endowing Scores with Significance

Iconic scores gain meaning through their application in critical contexts but must be interpreted cautiously due to measurement error.

Page 23: Cut Score Setting Approaches

Angoff and Zieky/Livingston Methods

Angoff: Expert judges define the cut score through item-by-item judgments.
Zieky/Livingston: The cut score is derived by analysing student performance against defined standards.
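As the Angoff method is usually described, each judge estimates, for every item, the probability that a minimally competent (borderline) test taker would answer correctly; a judge's estimates are summed to give that judge's cut score, and the judges' cut scores are averaged. The sketch below follows that description. The function name and the ratings are hypothetical.

```python
def angoff_cut_score(judge_ratings):
    """Angoff cut score from judges' item-level probability estimates.

    judge_ratings: one list per judge, holding that judge's estimated
    probability (0 to 1) that a borderline test taker answers each item
    correctly.
    """
    if not judge_ratings:
        raise ValueError("at least one judge is required")
    # Each judge's implied cut score is the sum of their item estimates.
    per_judge = [sum(ratings) for ratings in judge_ratings]
    # The panel's cut score is the mean across judges.
    return sum(per_judge) / len(per_judge)


# Hypothetical panel: 2 judges rating a 4-item test.
ratings = [
    [0.5, 0.5, 0.75, 0.25],   # judge A, implied cut score 2.0
    [0.75, 0.75, 0.5, 0.5],   # judge B, implied cut score 2.5
]
print(angoff_cut_score(ratings))
```

Large differences between the per-judge sums would signal that the panel disagrees about what a borderline test taker can do, which is worth resolving before the cut score is adopted.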

Page 24: Summary of Key Concepts

Conclusion on Scoring Procedures

Scoring links test performance to claims about test-taker capabilities; careful consideration is necessary when establishing cut scores and weighing their implications.