Balancing Validity & Reliability in Primary Science Teacher Assessment – Comprehensive Notes
Article & Publication Metadata
Journal: London Review of Education (LRE) – peer-reviewed, open-access (CC BY 4.0 licence)
Article: “Balancing the Demands of Validity and Reliability in Practice: Case Study of a Changing System of Primary Science Summative Assessment”
Author: Sarah Earle, Bath Spa University, UK
• Contact: s.earle@bathspa.ac.uk
Citation: Earle, S. (2020) ‘Balancing the Demands of Validity and Reliability in Practice: Case Study of a Changing System of Primary Science Summative Assessment’. London Review of Education, 18 (2): 221–235. DOI: 10.14324/LRE.18.2.06
Timeline
• Submission: 19 Aug 2019
• Acceptance: 31 Oct 2019
• Publication: 21 Jul 2020
Context: Focuses on the statutory requirement that English primary teachers report summative science judgements at age 11.
Core Purposes & Research Questions
Examine how teachers balance validity & reliability when making summative judgements during a period of national policy change (Levels → Age-Related Expectations).
Two guiding questions:
How does a school’s assessment system address the validity/reliability trade-off over time?
How can longitudinal study inform guidance for practice?
Key Definitions & Theoretical Constructs
Teacher Assessment (TA): Judgements of attainment made by teachers, often synthesising multiple evidence sources.
Validity
• Measures what it claims; contingent on purpose, use & interpretation.
• In science: requires representative sampling of attitudes, skills, knowledge, inquiry processes.
• Threats: construct under-representation & construct irrelevance.
Reliability
• Consistency/trustworthiness of scores (across raters, occasions, tasks).
• Inter-rater reliability highlighted; complicated by contextualised, practical science & young children’s varied modes of expression (a sketch of one common agreement statistic follows this block).
Trade-off Thesis (Wiliam, Halliday): Increasing reliability (standardisation, narrow tasks) may shrink validity (narrowing the sample of the construct), and vice versa.
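To make the inter-rater idea concrete, here is a minimal sketch of Cohen’s kappa, a standard statistic for agreement between two raters beyond chance. The teacher labels and judgement data below are invented for illustration; they are not data from the article.

```python
# Minimal sketch: Cohen's kappa as one way to quantify inter-rater
# reliability. The two teachers' judgements below are hypothetical.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement beyond chance between two raters of the same pupils."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(rater_a) | set(rater_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical 'working towards / at / above expectations' judgements
# for the same ten pieces of pupil work.
teacher_1 = ["at", "at", "above", "towards", "at",
             "at", "above", "towards", "at", "at"]
teacher_2 = ["at", "above", "above", "towards", "at",
             "at", "at", "towards", "at", "towards"]

print(round(cohens_kappa(teacher_1, teacher_2), 2))  # 0.5 (moderate agreement)
```

A kappa near 1 indicates the consistency that moderation aims for; values near 0 mean judgements agree no more than chance would predict, flagging the need for shared criteria and exemplars.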
Moderation
• Discussion & comparison of evidence to improve both reliability & teachers’ shared understanding (assessment literacy).
• Seen as “potentially the most effective strategy” (Johnson).
Assessment Literacy
• Teachers’ knowledge, skills & dispositions to design, enact, interpret assessment ethically & effectively.
Policy & System Context (England)
2009: National science tests removed → reliance on statutory TA at Key Stage 2 (age 11).
2014/15: Introduction of new National Curriculum & Age-Related Expectations (AREs) replacing broad “Levels”.
• AREs = detailed ‘can do’ criteria that must be met; government encouraged schools to devise their own tracking systems.
The TAPS Project Framework
Funded by Primary Science Teaching Trust; uses Design-Based Research (DBR) cycles (collaboration, iteration).
Converts Nuffield Foundation’s “formative → summative pyramid” into:
• TAPS Self-Evaluation Tool (school audit).
• Emphasis on integrating formative data into periodic summative judgements.
Case-Study School (School B) Overview
One-form entry primary; enrolment ≈183 pupils (ages 4–11).
Lower-than-average pupil-premium; high attainment at 11; stable leadership (same Head & Science Leader across 3 years).
DBR Time-frame: March 2013 → June 2016 (Phases 1–3).
• Data items: 86 (documents, interviews, observations, moderation artefacts).
Triangulation used: methods, time, investigator, source.
Ethical adherence: BERA guidelines; anonymisation; right to withdraw.
Methodological Details
Phase 1: Exploration (Mar 13–Nov 13)
• Codes B1–B21.
• Focus: mapping existing practices, teacher concerns about reliability & consistency.
Phase 2: Development (Feb 14–Jan 15)
• Codes B22–B53.
• Interventions: staff moderation meeting, trial of evidence-collection strategies, differentiation templates.
Phase 3: Implementation (Mar 15–Jun 16)
• Codes B54–B86.
• Embedding refined approaches; emergence of confidence in teacher judgement.
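As a way of picturing the corpus structure just described, the following sketch catalogues coded data items by phase using the B1–B86 ranges from these notes; the DataItem fields and the example entry’s details are illustrative assumptions, not the project’s actual records.

```python
# Minimal sketch of cataloguing the study's coded data items (B1-B86)
# by DBR phase; field names and the example entry are illustrative.
from dataclasses import dataclass

PHASE_RANGES = {1: range(1, 22),    # B1-B21
                2: range(22, 54),   # B22-B53
                3: range(54, 87)}   # B54-B86

@dataclass
class DataItem:
    code: str    # e.g. "B43"
    kind: str    # document / interview / observation / moderation artefact
    phase: int   # derived from the code ranges above

def phase_of(code: str) -> int:
    """Map a code such as 'B43' to its phase via PHASE_RANGES."""
    n = int(code.lstrip("B"))
    return next(p for p, r in PHASE_RANGES.items() if n in r)

item = DataItem(code="B43", kind="document", phase=phase_of("B43"))
print(item)  # DataItem(code='B43', kind='document', phase=2)
```

Grouping items this way supports the triangulation across methods, time, investigator and source noted above.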
Detailed Findings by Phase
Phase 1 – “Searching for Consistency”
Levelling synonymous with summative assessment; staff training focused on ‘levelling correctly’.
Multiple published criteria sets in use (Schemes A/B, Tests A/B, LA guidance).
• Manageability concerns: heavy marking (“tickled pink / green for growth”) & paper evidence folders.
Reliability weighted above validity; written work privileged, limiting practical/oral evidence capture.
Phase 2 – “Evidence & Moderation”
Staff meeting explored “What’s required to level a piece of work/child?” (Handout B43):
• Need context, support level, verbal contributions, clear criteria, knowledge of progression.
Moderation reframed as professional dialogue (making tacit reasoning explicit).
Still evidence-heavy; every task risked becoming mini-summative, blurring formative purposes.
Manageability addressed via pre-differentiated recording sheets & groupings → risk of pre-judging outcomes (closed tasks limiting validity).
Phase 3 – “Range of Information & Confidence”
Shift towards embedding criteria at planning stage; assessment seen as integral to teaching.
“Massive reduction” in reliance on external test papers.
Teachers encouraged to treat listening & informal observations as valid evidence (“hearing a child is valid”).
Less written transcription → quicker tick-based capture; described as “upskilled”, not onerous.
Tasks more open-ended; pupils self-select challenge; mixed-attainment grouping.
Moderation + shared frameworks foster confidence; reliability secured via shared language, not mountains of paperwork.
The “Teacher Assessment Seesaw” Model
Visual metaphor (Earle 2017) refined through case data.
Components:
• Validity side – breadth of curriculum sample; multiple evidence types; open inquiry.
• Reliability side – shared criteria, exemplars, moderation dialogue.
• Manageability fulcrum – workload feasibility; if overloaded, system collapses.
• Shared Understanding beam – assessment literacy & subject progression knowledge underpin equilibrium.
Phase portraits:
Phase 1: heavy reliability, low validity, low manageability.
Phase 2: reliability + manageability efforts, but some validity sacrifice.
Phase 3: closer balance; openness boosts validity while moderated dialogue protects reliability; manageable workloads.
Connections to Broader Literature
Reliability as a necessary but insufficient condition for validity (Stobart 2009).
Wiliam & Black’s distinction between formative and summative purposes; evidence here shows potential bleed-over when every task is documented.
Moderation’s dual function (Klenowski & Wyatt-Smith): consistency + professional learning – clearly evidenced in Phase 2/3.
Teacher bias concerns (Campbell 2015) mitigated by explicit criteria & sharing.
Assessment literacy viewed as developmental (DeLuca et al.); School B’s 3-year journey exemplifies staged growth.
Numerical / Statistical Highlights
Study period: 3 years.
Data corpus: 86 discrete items.
LRE volume & issue: 18 (2), pages 221–235.
Pupil cohort size: ≈183.
Timeline markers: Submission 19-Aug-2019, Acceptance 31-Oct-2019, Publication 21-Jul-2020.
Ethical & Methodological Rigour
Double-blind peer review.
BERA ethical compliance; informed consent & right to withdraw.
Multiple triangulation forms; respondent validation; prolonged engagement.
Practical Implications & Recommendations
Recognise inevitable trade-off; aim for “good-enough” reliability that does not unduly narrow validity.
Embed assessment criteria during planning; gather evidence as learning unfolds, not retrospectively.
Use moderation meetings for collaborative sense-making, not simple level verification.
Value oral & practical evidence; develop concise capture methods (e.g., tick sheets, annotated photos) to keep workload sustainable (see the sketch after this list).
Maintain shared documentation (progression maps, exemplars) to secure consistency when staff turnover occurs.
Provide sustained CPD; assessment literacy develops over years, not in one-off workshops.
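To illustrate the concise-capture recommendation above, here is a minimal sketch of a tick-sheet record for oral and practical evidence; the field names and criteria strings are hypothetical, not the case-study school’s actual format.

```python
# Minimal sketch of a concise 'tick sheet' evidence record; the fields
# and sample criteria are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class EvidenceTick:
    pupil: str
    criterion: str           # a 'can do' statement, e.g. an ARE criterion
    mode: str                # "oral" / "practical" / "written" / "photo"
    observed_on: date = field(default_factory=date.today)
    note: str = ""           # optional one-liner, kept short for manageability

ticks = [
    EvidenceTick("Pupil A", "sets up a fair test", "practical"),
    EvidenceTick("Pupil B", "uses results to answer the question", "oral",
                 note="explained during plenary"),
]
for t in ticks:
    print(f"{t.observed_on} | {t.pupil} | {t.criterion} [{t.mode}] {t.note}")
```

The point is speed of capture: a record like this takes seconds to log, keeping the manageability fulcrum of the seesaw in balance while still admitting oral and practical evidence.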
Possible Exam Checkpoints / Reflective Prompts
Explain why maximum reliability can undermine validity in practical science. Provide at least two concrete classroom examples.
Describe how moderation can simultaneously foster reliability and professional learning.
Using the seesaw model, outline an action plan for re-balancing an assessment system that over-emphasises test evidence.
Critically appraise the statement: “Teacher opinion is too subjective for high-stakes summative assessment.” Use evidence from the article.
Concluding Insights
Balancing validity & reliability is a dynamic, context-dependent process; perfection in both is unattainable.
Over a 3-year change cycle, School B progressed from evidence-hoarding for reliability to a more nuanced, confident, integrated approach.
The Teacher Assessment Seesaw offers a portable heuristic to guide other schools in negotiating this balance while safeguarding manageability and enhancing assessment literacy.