Balancing Validity & Reliability in Primary Science Teacher Assessment – Comprehensive Notes

Article & Publication Metadata
  • Journal: London Review of Education (LRE) – peer-reviewed, open-access (CC BY 4.0 licence)

  • Article: “Balancing the Demands of Validity and Reliability in Practice: Case Study of a Changing System of Primary Science Summative Assessment”

  • Author: Sarah Earle, Bath Spa University, UK
    • Contact: s.earle@bathspa.ac.uk

  • Citation format: Earle, S. (2020) London Review of Education, 18 (2): 221–235. DOI: 10.14324/LRE.18.2.06

  • Timeline
    • Submission: 19 Aug 2019
    • Acceptance: 31 Oct 2019
    • Publication: 21 Jul 2020

  • Context: Focuses on the statutory requirement that English primary teachers report summative science judgements for pupils at the end of Key Stage 2 (age 11).


Core Purposes & Research Questions
  • Examine how teachers balance validity & reliability when making summative judgements during a period of national policy change (Levels → Age-Related Expectations).

  • Two guiding questions:

    1. How does a school’s assessment system address the validity/reliability trade-off over time?

    2. How can a longitudinal study inform guidance for practice?


Key Definitions & Theoretical Constructs
  • Teacher Assessment (TA): Judgements of attainment made by teachers, often synthesising multiple evidence sources.

  • Validity
    • Measures what it claims; contingent on purpose, use & interpretation.
    • In science: requires representative sampling of attitudes, skills, knowledge and inquiry processes.
    • Threats: construct under-representation & construct irrelevance.

  • Reliability
    • Consistency/trustworthiness of scores (across raters, occasions, tasks).
    • Inter-rater reliability highlighted; complicated by contextualised, practical science & young children’s varied modes of expression.

  • Trade-off Thesis (Wiliam, Halliday): Increasing reliability (standardisation, narrower tasks) may shrink validity (narrower sample of the construct), and vice versa.

  • Moderation
    • Discussion & comparison of evidence to improve both reliability & teachers’ shared understanding (assessment literacy).
    • Seen as “potentially the most effective strategy” (Johnson).

  • Assessment Literacy
    • Teachers’ knowledge, skills & dispositions to design, enact, interpret assessment ethically & effectively.


Policy & System Context (England)
  • 2009: National science tests removed → reliance on statutory TA at Key Stage 2 (age 11).

  • 2014/15: Introduction of new National Curriculum & Age-Related Expectations (AREs) replacing broad “Levels”.
    • AREs = detailed ‘can do’ criteria that must be met; government encouraged schools to create their own tracking systems.


The TAPS Project Framework
  • Funded by the Primary Science Teaching Trust; uses Design-Based Research (DBR) cycles (collaboration, iteration).

  • Converts the Nuffield Foundation’s “formative → summative pyramid” into the TAPS Self-Evaluation Tool (school audit).
    • Emphasis on integrating formative data into periodic summative judgements.


Case-Study School (School B) Overview
  • One-form entry primary; enrolment ≈ 183 pupils (age 4–11).

  • Lower-than-average pupil premium; high attainment at age 11; stable leadership (same Head & Science Leader across 3 years).

  • DBR Time-frame: March 2013 → June 2016 (Phases 1–3).
    • Data items: 86 (documents, interviews, observations, moderation artefacts).

  • Triangulation used: methods, time, investigator, source.

  • Ethical adherence: BERA guidelines; anonymisation; right to withdraw.


Methodological Details
  • Phase 1: Exploration (Mar 2013 – Nov 2013)
    • Codes B1–B21.
    • Focus: mapping existing practices, teacher concerns about reliability & consistency.

  • Phase 2: Development (Feb 2014 – Jan 2015)
    • Codes B22–B53.
    • Interventions: staff moderation meeting, trial of evidence-collection strategies, differentiation templates.

  • Phase 3: Implementation (Mar 2015 – Jun 2016)
    • Codes B54–B86.
    • Embedding refined approaches; emergence of confidence in teacher judgement.


Detailed Findings by Phase
Phase 1 – “Searching for Consistency”
  • Levelling synonymous with summative assessment; staff training focused on ‘levelling correctly’.

  • Multiple published criteria sets in use (Schemes A/B, Tests A/B, LA guidance).
    • Manageability concerns: heavy marking (“tickled pink / green for growth”) & paper evidence folders.

  • Reliability weighted > validity; written work privileged, limiting practical/oral evidence capture.

Phase 2 – “Evidence & Moderation”
  • Staff meeting explored “What’s required to level a piece of work/child?” (Handout B43):
    • Need context, support level, verbal contributions, clear criteria, knowledge of progression.

  • Moderation reframed as professional dialogue (making tacit reasoning explicit).

  • Still evidence-heavy; every task risked becoming mini-summative, blurring formative purposes.

  • Manageability addressed via pre-differentiated recording sheets & groupings → risk of pre-judging outcomes, narrowing validity.

Phase 3 – “Range of Information & Confidence”
  • Shift towards embedding criteria at planning stage; assessment seen as integral to teaching.

  • “Massive reduction” in reliance on external test papers.

  • Teachers encouraged to treat listening & informal observations as valid evidence (“hearing a child is valid”).

  • Less written transcription → quicker tick-based capture; teachers felt “upskilled” rather than burdened.

  • Tasks more open-ended; pupils self-select challenge; mixed-attainment grouping.

  • Moderation + shared frameworks foster confidence; reliability secured via shared language, not mountains of paperwork.


The “Teacher Assessment Seesaw” Model
  • Visual metaphor (Earle 2017) refined through case data.

  • Components:
    • Validity side – breadth of curriculum sample; multiple evidence types; open inquiry.
    • Reliability side – shared criteria, exemplars, moderation dialogue.
    • Manageability fulcrum – workload feasibility; if overloaded, the system collapses.
    • Shared Understanding beam – assessment literacy & subject-progression knowledge underpin equilibrium.

  • Phase portraits:

    1. Phase 1: heavy reliability, low validity, low manageability.

    2. Phase 2: reliability + manageability efforts, but some validity sacrifice.

    3. Phase 3: closer balance; openness boosts validity while moderated dialogue protects reliability; manageable workloads.


Connections to Broader Literature
  • Reliability as a necessary but insufficient condition for validity (Stobart 2009).

  • Wiliam & Black’s distinction between formative and summative purposes; evidence here shows potential bleed-over when every task is documented.

  • Moderation’s dual function (Klenowski & Wyatt-Smith): consistency + professional learning – clearly evidenced in Phase 2/3.

  • Teacher bias concerns (Campbell 2015) mitigated by explicit criteria & sharing.

  • Assessment literacy viewed as developmental (DeLuca et al.); School B’s 3-year journey exemplifies staged growth.


Numerical / Statistical Highlights
  • Study period: 3 years.

  • Data corpus: 86 discrete items.

  • LRE volume & issue: 18 (2), pages 221–235.

  • Pupil cohort size: ≈ 183.

  • Timeline markers: Submission 19-Aug-2019, Acceptance 31-Oct-2019, Publication 21-Jul-2020.


Ethical & Methodological Rigor
  • Double-blind peer review.

  • BERA ethical compliance; informed consent & right to withdraw.

  • Multiple triangulation forms; respondent validation; prolonged engagement.


Practical Implications & Recommendations
  • Recognise inevitable trade-off; aim for “good-enough” reliability that does not unduly narrow validity.

  • Embed assessment criteria during planning; gather evidence as learning unfolds, not retrospectively.

  • Use moderation meetings for collaborative sense-making, not simple level verification.

  • Value oral & practical evidence; develop concise capture methods (e.g., tick sheets, annotated photos) to keep workload sustainable.

  • Maintain shared documentation (progression maps, exemplars) to secure consistency when staff turnover occurs.

  • Provide sustained CPD; assessment literacy develops over years, not in one-off workshops.


Possible Exam Checkpoints / Reflective Prompts
  • Explain why maximum reliability can undermine validity in practical science. Provide at least two concrete classroom examples.

  • Describe how moderation can simultaneously foster reliability and professional learning.

  • Using the seesaw model, outline an action plan for re-balancing an assessment system that over-emphasises test evidence.

  • Critically appraise the statement: “Teacher opinion is too subjective for high-stakes summative assessment.” Use evidence from the article.


Concluding Insights
  • Balancing validity & reliability is a dynamic, context-dependent process; perfection in both is unattainable.

  • Over a 3-year change cycle, School B progressed from evidence-hoarding for reliability to a more nuanced, confident, integrated approach.

  • The Teacher Assessment Seesaw offers a portable heuristic to guide other schools in negotiating this balance while safeguarding manageability and enhancing assessment literacy.