Eltad reading list
Roles of Language Testing
Tests mediate between teaching, learning, and wider societal needs
They shape curriculum and classroom practice through their consequences
Testers carry social responsibility for fairness, validity, and washback
Mistrust in Language Tests
Many tests suffer from poor quality and are misaligned with course objectives
They often fail to measure the skills they intend to assess
Transparency and rigorous design are needed to rebuild confidence
The Backwash Effect
Backwash (or washback) is the impact of a test on teaching and learning
Harmful backwash occurs when test format drives irrelevant practice
Beneficial backwash happens when test preparation reinforces course aims
Sources of Test Inaccuracy
Incongruent content prompts teachers to “teach to the test” rather than the underlying skill
Over-reliance on certain item types (e.g., multiple-choice) can distort what’s measured
Inconsistent scoring and administration undermine reliability
Defining Test Purposes
Every test must have a clearly articulated aim and target constructs
Purpose guides selection of content, format, and scoring procedures
Clarity of purpose is foundational to a test’s validity
Validity vs Reliability
Validity: the degree to which a test actually measures the intended ability
Reliability: the consistency and reproducibility of test scores across administrations, forms, and raters
Both are essential for trustworthy and defensible assessments
Test Design and Teacher Involvement
Teachers should participate in test development to align assessment with instruction
Well-designed tests foster positive washback and support learning goals
Collaborative pressure on examination bodies can raise testing standards
Adapting To Unique Contexts
Testing situations differ by learner profile, stakes, and institutional needs
Standard test models must be tailored to specific contexts and constraints
Testers must balance practicality (time, resources) with pedagogical soundness
Testing as Problem Solving
No single “best” test or technique applies to all contexts
Each testing situation presents a unique problem to be defined
Effective testing hinges on tailoring design to specific needs
Stating The Testing Problem
Begin by articulating the test’s purpose, stakeholders, and constraints
A clear problem statement guides content selection and format choices
Precision at this stage ensures alignment throughout development
Three Core Test Criteria
Accuracy: measures exactly the intended abilities (validity)
Positive backwash: encourages teaching that mirrors test goals
Economy: practical in terms of time, money, and available resources
Defining Test Purpose
Distinguish between placement, diagnostic, achievement, and proficiency aims
Purpose drives item types, task formats, and scoring methods
Well-defined objectives underpin test validity and fairness
Fitness for Purpose Principle
A test must suit its particular educational and institutional context
Matching techniques to learner profiles prevents irrelevant assessment
Avoid borrowing tests wholesale—adapt or design for local needs
Overview of the Problem Solving Cycle
Identify needs and specify constructs (Chapters 2–4)
Secure reliability and validity of measures (Chapters 4–5)
Examine washback effects and practical constraints (Chapters 6–7)
Select and trial test techniques (Chapters 8 onwards)
The Teacher as Tester
Teachers define context-specific requirements and drive alignment
Involvement in test development fosters positive washback
Collaborative design ensures practicality and classroom relevance
Integrating Testing and Teaching
Assessment and instruction form a continuous feedback loop
Tests should reinforce, not distort, curriculum aims
Thoughtful test design supports learning beyond mere exam prep
Purpose of Language Tests
Proficiency tests measure overall ability independent of any specific course.
Achievement tests assess how well learners have met defined course objectives.
Diagnostic tests identify strengths/weaknesses and placement tests assign learners to appropriate levels.
Direct vs Indirect & Discrete-Point vs Integrative
Direct tests require real-world performance (e.g. writing an email, giving a talk).
Indirect tests use surrogate tasks (e.g. multiple-choice items) to infer ability.
Discrete-point items target single language elements; integrative tasks combine grammar, vocabulary, and skills.
Norm-Referenced vs Criterion-Referenced
Norm-referenced tests compare learners’ scores against a peer group (percentiles/ranks).
Criterion-referenced tests measure performance against fixed mastery standards.
Choice influences whether results show relative standing or demonstrable competence.
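A minimal sketch (Python, with invented scores and an invented cut-off) of how the same raw score is reported under each interpretation: once as a percentile rank within a peer group, once against a fixed mastery standard.

```python
from bisect import bisect_right

# Hypothetical raw scores for a peer group and one learner (illustration only).
group_scores = [34, 41, 45, 48, 52, 55, 58, 61, 66, 72]
learner_score = 58
mastery_cutoff = 60  # criterion: a fixed standard, independent of the group

# Norm-referenced view: where does the learner stand relative to peers?
percentile_rank = 100 * bisect_right(sorted(group_scores), learner_score) / len(group_scores)

# Criterion-referenced view: has the learner met the fixed standard?
has_mastered = learner_score >= mastery_cutoff

print(f"Percentile rank: {percentile_rank:.0f}th")   # relative standing
print(f"Mastery reached: {has_mastered}")            # demonstrable competence
```

The same score can look strong in one frame (70th percentile here) and insufficient in the other, which is why the choice of reference matters for reporting.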
Objective vs Subjective Item Types
Objective items (e.g. T/F, multiple-choice) have unambiguous right/wrong answers.
Subjective tasks (e.g. essays, oral interviews) rely on rater judgment, guided by holistic or analytic rating scales.
Balancing speed and reliability (objective) with authenticity and depth (subjective) is key.
Computer-Adaptive Testing (CAT)
CAT dynamically adjusts item difficulty based on each response.
Enhances efficiency and precision by zeroing in on the test-taker’s true ability level.
Requires extensive calibrated item banks and sound algorithms to maintain validity.
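A toy sketch of the adaptive loop under a one-parameter (Rasch-style) model. The item bank, difficulties, and the crude step-size update are invented for illustration; operational CATs use calibrated banks and maximum-likelihood or Bayesian ability estimation.

```python
import math
import random

# Hypothetical calibrated item bank: item id -> difficulty on a logit scale.
item_bank = {f"item{i}": d for i, d in enumerate([-2.0, -1.2, -0.5, 0.0, 0.4, 1.0, 1.6, 2.2])}

def prob_correct(theta, difficulty):
    """Rasch model: probability of a correct response given ability theta."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def run_cat(true_theta, n_items=5, step=0.6):
    theta = 0.0                      # start from an average ability estimate
    remaining = dict(item_bank)
    for _ in range(n_items):
        # Select the unused item whose difficulty is closest to the current estimate.
        item, difficulty = min(remaining.items(), key=lambda kv: abs(kv[1] - theta))
        del remaining[item]
        # Simulate the test-taker's response from the (unknown) true ability.
        correct = random.random() < prob_correct(true_theta, difficulty)
        # Crude update: move the estimate toward the evidence and shrink the step.
        theta += step if correct else -step
        step *= 0.8
    return theta

print(f"Estimated ability: {run_cat(true_theta=1.0):.2f}")
```

Each response pulls the next item toward the current ability estimate, which is how CAT gains efficiency over fixed-form tests.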
Communicative Language Testing
Focuses on authentic, real-world tasks that mirror actual language use.
Integrates multiple skills and promotes interaction under time/processing constraints.
Strives for a balance between task authenticity and reliable, objective scoring.
Core Concept of Validity
Validity is the degree to which a test measures the specific construct it claims to measure.
Construct validity is the overarching notion, requiring both theoretical definition and empirical evidence.
It underpins meaningful interpretation of scores and defensible decision-making.
Content Validity
Ensures test content is a representative sample of the language domain or syllabus objectives.
Achieved through blueprinting and specification checklists to cover all relevant skills and structures.
Guards against construct under-representation by systematic item sampling.
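A small sketch of a specification checklist: a hypothetical blueprint states how many items each skill should receive, and a draft test is checked against it. Skill names and counts are invented; a real blueprint would follow the syllabus specification.

```python
from collections import Counter

# Hypothetical blueprint: target number of items per syllabus skill.
blueprint = {"reading_gist": 4, "reading_detail": 4, "grammar": 6, "vocabulary": 6}

# Draft test: each item is tagged with the skill it samples.
draft_items = (["reading_gist"] * 4 + ["reading_detail"] * 2 +
               ["grammar"] * 8 + ["vocabulary"] * 6)
actual = Counter(draft_items)

for skill, target in blueprint.items():
    gap = actual.get(skill, 0) - target
    status = "OK" if gap == 0 else f"{gap:+d} items"
    print(f"{skill:16s} target={target}  drafted={actual.get(skill, 0)}  {status}")
```

A simple coverage check like this flags under-sampled skills (here, reading for detail) before the test is administered.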
Criterion-Related Validity
Concurrent validity: correlation of test scores with an established measure administered at the same time.
Predictive validity: ability of test scores to forecast future performance on real-world tasks.
Strong criterion evidence boosts the test’s practical utility and stakeholder confidence.
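A minimal sketch of a concurrent-validity check: scores on the new test are correlated with scores on an established measure taken at around the same time. The score data are invented; only the Pearson correlation itself is standard.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores on the new test and on an established benchmark measure.
new_test  = [42, 55, 61, 48, 70, 66, 52, 59]
benchmark = [45, 58, 63, 50, 74, 65, 55, 60]

print(f"Concurrent validity coefficient: {pearson(new_test, benchmark):.2f}")
```

Predictive validity is computed the same way, except the criterion scores (e.g. later course grades) are collected after the test.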
Threats to Validity
Construct-irrelevant variance arises when scores reflect unrelated abilities or test-taking skills.
Construct under-representation occurs when essential facets of the target construct are omitted.
External factors (e.g., anxiety, distracting conditions) can distort test performance.
Face and Consequential Validity
Face validity: stakeholders’ perceptions of a test’s appropriateness influence motivation and acceptance.
Consequential validity examines the social and educational impact, including washback effects.
Monitoring washback helps maximize beneficial influences and mitigate harmful side-effects.
The Validation Process
Validation is iterative: define constructs, pilot items, collect data, analyze results, revise tests.
Evidence sources include statistics (item analysis, factor analysis), expert reviews, and learner feedback.
Continuous validation keeps the test aligned with evolving learner populations and contexts.
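A sketch of two classic item-analysis statistics often used as validation evidence: facility value (proportion answering correctly) and a discrimination index comparing high- and low-scoring groups. The response matrix is invented.

```python
# Hypothetical response matrix: rows = test-takers, columns = items (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]

totals = [sum(row) for row in responses]
order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
k = len(responses) // 3                       # size of the top and bottom groups
high, low = order[:k], order[-k:]

for item in range(len(responses[0])):
    scores = [row[item] for row in responses]
    facility = sum(scores) / len(scores)      # proportion answering correctly
    discrimination = (sum(responses[i][item] for i in high) -
                      sum(responses[i][item] for i in low)) / k
    print(f"item {item + 1}: facility={facility:.2f}  discrimination={discrimination:+.2f}")
```

Items that are far too easy, far too hard, or that fail to separate stronger from weaker candidates are candidates for revision in the next cycle.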
Applying Validity Evidence
Use validity findings to refine item content, instructions, and scoring rubrics for clarity and fairness.
Involve teachers, specialists, and learners in evaluation to enhance transparency and buy-in.
A strong validity framework leads to more reliable, credible, and effective language assessments.
Concept and Importance of Reliability
Reliability is the extent to which test scores are consistent and repeatable across administrations, forms, and raters.
It reflects how much of the score variance is due to true ability versus random error.
Without adequate reliability, test results cannot be trusted for decision-making.
True Score vs Error
Observed score = True score + Measurement error.
Error sources include test conditions, learner’s physical/psychological state, and scoring inconsistencies.
Identifying error helps in designing tests that minimize its impact.
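A toy simulation of the classical model (observed = true + error): observed scores are generated by adding random error to known true scores, and reliability is recovered as the proportion of observed-score variance attributable to true scores. All numbers are invented.

```python
import random
from statistics import pvariance

random.seed(1)

# Hypothetical true abilities plus a random error component per administration.
true_scores = [random.gauss(50, 10) for _ in range(2000)]   # true-score SD = 10
observed = [t + random.gauss(0, 5) for t in true_scores]    # error SD = 5

# Reliability = var(true) / var(observed); here roughly 100 / (100 + 25) = 0.80.
reliability = pvariance(true_scores) / pvariance(observed)
print(f"Simulated reliability: {reliability:.2f}")
```

The smaller the error variance relative to true-score variance, the closer the coefficient gets to 1.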
Major Types of Reliability
Test–retest reliability checks stability of scores over repeated administrations.
Parallel-forms reliability examines equivalence between two different versions of a test.
Inter-rater reliability assesses consistency across different scorers or raters.
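A sketch of one common inter-rater check, Cohen's kappa, computed from scratch on invented band ratings awarded by two raters to the same scripts.

```python
from collections import Counter

# Hypothetical band ratings (1-5) from two raters marking the same ten scripts.
rater_a = [3, 4, 2, 5, 3, 4, 3, 2, 4, 5]
rater_b = [3, 4, 3, 5, 3, 4, 2, 2, 4, 4]

n = len(rater_a)
observed_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Agreement expected by chance, from each rater's marginal distribution of bands.
count_a, count_b = Counter(rater_a), Counter(rater_b)
chance_agreement = sum(count_a[band] * count_b[band] for band in count_a) / n ** 2

kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(f"Observed agreement: {observed_agreement:.2f}   Cohen's kappa: {kappa:.2f}")
```

Kappa corrects raw agreement for chance, so it gives a more honest picture than percent agreement alone when the rating scale has few bands.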
Internal Consistency
Split-half method correlates scores from two halves of the same test to estimate consistency.
Cronbach’s alpha provides an overall estimate of how well items hang together.
High internal consistency indicates items measure the same underlying construct.
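A sketch computing Cronbach's alpha from an invented item-response matrix, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
from statistics import pvariance

# Hypothetical item-response matrix: rows = test-takers, columns = items (1 = correct, 0 = wrong).
responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]

k = len(responses[0])                                   # number of items
item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
total_var = pvariance([sum(row) for row in responses])  # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```

The split-half estimate works on the same matrix by correlating scores on two halves of the test and then correcting the result upward for the shortened length.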
Calculating and Interpreting Coefficients
Reliability coefficients range from 0 (no consistency) to 1 (perfect consistency).
A commonly accepted benchmark for high-stakes tests is ≥ .80.
Coefficients inform decisions about test length, item quality, and reporting precision.
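A sketch of how a reliability coefficient feeds into reporting precision via the standard error of measurement, SEM = SD * sqrt(1 - r). The score figures are invented.

```python
import math

# Hypothetical test statistics.
score_sd = 12.0          # standard deviation of observed scores
reliability = 0.85       # reliability coefficient for this administration
observed_score = 64

sem = score_sd * math.sqrt(1 - reliability)          # standard error of measurement
low, high = observed_score - 1.96 * sem, observed_score + 1.96 * sem

print(f"SEM = {sem:.1f}")
print(f"Approximate 95% band around the observed score: {low:.1f} to {high:.1f}")
```

Reporting a band rather than a single point reminds score users that every observed score carries measurement error.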
Enhancing Reliability
Increase the number of high-quality, representative items to average out random errors.
Standardize administration procedures and provide clear instructions to all test-takers.
Use detailed scoring rubrics and train raters to ensure consistent marking.
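A sketch of the Spearman-Brown prophecy formula, which estimates how lengthening a test with comparable items changes its reliability; the starting reliability and length factors are invented.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying test length by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current_reliability = 0.70   # hypothetical reliability of the existing test
for factor in (1.5, 2, 3):
    predicted = spearman_brown(current_reliability, factor)
    print(f"x{factor} items -> predicted reliability {predicted:.2f}")
```

The gains taper off as the test grows, which is one reason the next block weighs reliability against practicality.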
Balancing Reliability and Practicality
Longer, more homogeneous tests boost reliability but may fatigue learners and strain resources.
Authentic, integrative tasks enhance validity but can introduce scoring variability.
Test designers must negotiate trade-offs to suit context, stakes, and learner needs.