
Assessment for Learning: Validity & Reliability

Context & Background of Assessment for Learning (AfL)

  • Indonesia faces systemic educational quality issues spanning primary to higher education.

    • Cited challenges (Ramly, 2005): teacher strikes, commercial accreditation, non-accommodating evaluation systems, influx of foreign investment, weak teacher content mastery, graduate unemployment, education as “cheap business arena,” moral/behavioural neglect, absence of education taxes.

  • Traditional paradigm = Assessment of Learning (AoL) → measures outcomes only (summative).

  • Emerging paradigm = Assessment for Learning (AfL) → integrates assessment into the learning process (formative orientation).

  • AfL expected to:

    • Empower learners through feedback and reflection.

    • Improve learning processes, not merely certify results.

    • Shift focus from teacher-centred delivery to learner agency and self-regulation.

Key Concepts: Validity & Reliability

  • Reliability: consistency or stability of measurement.

    • If a learner’s actual ability remains stable, repeated testing should yield similar scores (test–retest reliability).

    • Statistical view: the degree to which individual items or observers agree with the overall scale.

    • Coefficient range 0 \le r \le 1; higher r ⇒ stronger reliability.

    • Classical methods: test–retest, split-half, parallel forms, Kuder–Richardson, inter-rater (a computational sketch follows this list).

  • Validity: degree to which an instrument measures the intended construct.

    • Requires reliability, but reliability alone ≠ validity.

    • Two intertwined aspects (Ebel & Frisbie, 1991):

    1. What is measured? (construct/content)

    2. How consistently? (procedural/empirical)

    • Encompasses formal tests, observations, interviews, questionnaires, affective/self-reports, projective techniques.
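
These classical coefficients can be computed directly from a respondents × items score matrix. A minimal Python sketch of Cronbach's alpha, using randomly generated 5-point Likert data rather than the study's responses:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items matrix of numeric scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Illustrative only: 100 respondents x 50 Likert items coded 1-5.
rng = np.random.default_rng(0)
trait = rng.normal(size=(100, 1))                        # latent trait
noise = rng.normal(scale=0.8, size=(100, 50))
responses = np.clip(np.round(3 + trait + noise), 1, 5)   # correlated items
print(f"alpha = {cronbach_alpha(responses):.2f}")
```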

Historical Development of AfL

  • Roots in Scriven (1967): the distinction between formative and summative evaluation.

  • Bloom, Hastings & Madaus (1971): Handbook on formative & summative evaluation.

  • Sadler (1989): emphasised criteria-based feedback that shows students how to close the gap between current and intended performance.

  • Black & Wiliam (1998, 2006): AfL theory, classroom strategies, active learner involvement.

  • Global research nodes: U.K., U.S.A., Hong Kong, New Zealand; key scholars include Ecclestone, Gardner, Popham, Stiggins, Cowie, Hattie, Shute.

  • Common AfL practices:

    • Sharing learning criteria.

    • Classroom dialogue & questioning.

    • Descriptive feedback (immediate, specific).

    • Peer & self-assessment.

    • Promoting learner confidence and reflection.

Research Objective & Questions

  • Main aim: Investigate validity and reliability of AfL constructs among lecturers at University of Muhammadiyah Makassar, South Sulawesi, Indonesia.

  • Implicit questions:

    1. Are AfL questionnaire items psychometrically sound for higher-education context?

    2. Can lecturers reliably distinguish and report AfL practices using the instrument’s Likert scales?

Methodology

  • Design: Quantitative descriptive survey, single administration.

  • Sampling: proportional stratified random sampling; N = 100 lecturers (allocation sketched after this list).

  • Location: University of Muhammadiyah Makassar.

  • Analysis tools: SPSS v20 for data entry; Rasch Measurement Model (WINSTEPS) for item/person analysis; inferential tests (t-test, ANOVA, \chi^{2}) planned for the broader study.
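
A proportional stratified random sample allocates draws to each stratum in proportion to its population share. A minimal sketch, assuming hypothetical faculty strata (the study's actual strata are not listed here):

```python
import random

def proportional_stratified_sample(strata, n, seed=42):
    """Draw n units total, allocated to each stratum by its population share."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        quota = round(n * len(units) / total)   # proportional allocation
        sample.extend(rng.sample(units, min(quota, len(units))))
    return sample

# Hypothetical population: 400 lecturers across four faculties of
# sizes 160, 120, 80, 40 -> quotas of 40, 30, 20, 10 for n = 100.
population = {f"faculty_{i}": [f"F{i}-L{j}" for j in range(size)]
              for i, size in enumerate([160, 120, 80, 40])}
print(len(proportional_stratified_sample(population, n=100)))  # 100
```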

Instrument Design (50 Likert Items)

  • Six AfL constructs & item counts (encoded in the sketch after this list):

    1. Sharing Learning Objectives (SLO) – 12

    2. Helping Pupils (HP) – 7

    3. Peer & Self-Assessment (PSA) – 9

    4. Providing Feedback (PF) – 8

    5. Promoting Confidence (PC) – 6

    6. Involving in Reviewing & Reflecting (IRR) – 8

  • Likert scale (5 points): Strongly Disagree (SD), Disagree (D), Uncertain (U), Agree (A), Strongly Agree (SA).

  • Development steps (Azrillah, 1996 framework):

    1. Metadata analysis

    2. Expert validation (content & construct): measurement experts at UTM; face validity by language-education experts at UMM.

    3. Pilot testing

    4. Rasch analysis for final calibration
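
As a quick consistency check, the construct item counts sum to the instrument's 50 items; a small sketch encoding the blueprint and the Likert coding:

```python
# Instrument blueprint: AfL construct -> number of Likert items.
BLUEPRINT = {
    "SLO": 12,  # Sharing Learning Objectives
    "HP": 7,    # Helping Pupils
    "PSA": 9,   # Peer & Self-Assessment
    "PF": 8,    # Providing Feedback
    "PC": 6,    # Promoting Confidence
    "IRR": 8,   # Involving in Reviewing & Reflecting
}
LIKERT = {1: "SD", 2: "D", 3: "U", 4: "A", 5: "SA"}  # 5-point anchors

assert sum(BLUEPRINT.values()) == 50  # item counts sum to the 50-item total
```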

Validation Process

  • Misfit inspection criteria (Azrillah, 1996), applied mechanically in the sketch after this list:

    • Point-Measure Correlation 0.40 < \text{PtMea Corr} < 0.85

    • Outfit MNSQ 0.50 < \text{MNSQ} < 1.50

    • Outfit ZSTD -2 < Z < 2

  • Dimensionality & separation indices computed to ensure each construct is distinct and scalable.
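
The three cut-offs above can be screened mechanically against exported item statistics. A minimal sketch, with illustrative fit values rather than the study's WINSTEPS output:

```python
from dataclasses import dataclass

@dataclass
class ItemFit:
    name: str
    ptmea_corr: float   # point-measure correlation
    outfit_mnsq: float  # outfit mean-square
    outfit_zstd: float  # outfit standardised z

def is_misfit(item: ItemFit) -> bool:
    """True if the item violates any of the three inspection criteria."""
    return not (0.40 < item.ptmea_corr < 0.85
                and 0.50 < item.outfit_mnsq < 1.50
                and -2.0 < item.outfit_zstd < 2.0)

# Illustrative values, not the study's output.
items = [ItemFit("SLO1", 0.62, 1.05, 0.4),   # within all bounds
         ItemFit("PF3", 0.31, 1.72, 2.6)]    # violates all three bounds
print([i.name for i in items if is_misfit(i)])  # ['PF3']
```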

Data Analysis Techniques

  • Raw ordinal responses coded numerically (1-5).

  • Rasch outputs interpreted at both person and item levels:

    • Reliability (analogous to Cronbach's alpha in classical test theory).

    • Separation (spread of measures) and RMSE (root mean-square error); the formulas are sketched after this list.

    • Scale calibration distances 1.5 < \Delta\text{step} < 5.0 indicate proper category functioning.
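
Separation and reliability follow from the measures and their standard errors: \text{RMSE} = \sqrt{\overline{SE^{2}}}, SD_{true} = \sqrt{SD_{obs}^{2} - \text{RMSE}^{2}}, separation G = SD_{true}/\text{RMSE}, and reliability R = G^{2}/(1+G^{2}). A minimal sketch with simulated logit measures (choosing SD ≈ 1 and SE = 0.25 lands near the study's person separation):

```python
import numpy as np

def rasch_summary(measures: np.ndarray, ses: np.ndarray) -> dict:
    """Separation and reliability from Rasch measures and standard errors."""
    rmse = np.sqrt(np.mean(ses ** 2))               # average measurement error
    obs_var = measures.var(ddof=1)                  # observed variance
    true_sd = np.sqrt(max(obs_var - rmse ** 2, 0))  # error-corrected spread
    g = true_sd / rmse                              # separation index
    return {"RMSE": round(float(rmse), 2),
            "separation": round(float(g), 2),
            "reliability": round(float(g ** 2 / (1 + g ** 2)), 2)}

# Simulated person measures in logits (not the study's data).
rng = np.random.default_rng(1)
print(rasch_summary(rng.normal(loc=2.46, scale=1.0, size=74),
                    np.full(74, 0.25)))
```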

Findings

Person Reliability

  • Initial sample (N = 100):

    • Person reliability r = 0.91 (excellent; Fisher, 2007).

    • Mean measure \bar{\theta} = 2.46 logits.

  • After removing 26 misfitting respondents (careless or erratic response patterns):

    • Remaining N = 74.

    • Person reliability rose to r = 0.94.

    • Person separation 3.87 ⇒ instrument differentiates ≈ 4 strata of lecturer ability/endorsement.

    • Cronbach's alpha \alpha = 0.94 (internal consistency).

  • Interpretation: lecturers' use and understanding of AfL are measured with high consistency; the outliers most likely answered carelessly rather than indicating flaws in the constructs.

Item Reliability

  • Item reliability r = 0.96 (excellent).

  • Item separation \approx 5.31 ⇒ roughly five discernible difficulty strata among items (consistency check below).

  • No major item misfits under Rasch criteria, supporting strong construct representation across six AfL domains.
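
As a consistency check, the standard Rasch relation between separation G and reliability reproduces both reported coefficients (item separation 5.31 taken from the Numerical Highlights below):

R = \frac{G^{2}}{1+G^{2}} \;\Rightarrow\; R_{person} = \frac{3.87^{2}}{1+3.87^{2}} \approx 0.937 \approx 0.94, \qquad R_{item} = \frac{5.31^{2}}{1+5.31^{2}} \approx 0.966 \approx 0.96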

Scale Validity (Category Functioning)

  • Category calibration table (74 respondents):

    • Step difficulties met recommended bounds except for category 2 (Disagree) and category 5 (Strongly Agree).

    • Respondents struggled to distinguish mild negative endorsement (D) from Uncertain, and did not fully utilise the extreme positive category (SA).

  • Distance anomalies suggest collapsing adjacent categories or writing clearer category descriptors in future revisions; a screening sketch follows.
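
Category functioning can be screened mechanically: the four Andrich thresholds of a 5-point scale should increase monotonically, with adjacent gaps between 1.5 and 5.0 logits. A minimal sketch with illustrative thresholds (not the study's calibration):

```python
def check_category_steps(thresholds, lo=1.5, hi=5.0):
    """Report adjacent Andrich-threshold gaps outside the (lo, hi) logit range."""
    problems = []
    for i in range(1, len(thresholds)):
        gap = thresholds[i] - thresholds[i - 1]
        if not (lo < gap < hi):
            problems.append(f"step {i}->{i + 1}: gap {gap:.2f} logits")
    return problems

# Illustrative thresholds for a 5-category scale (not the study's values):
# the first and last gaps fall below the 1.5-logit minimum and get flagged.
print(check_category_steps([-3.2, -2.4, 0.1, 1.0]))
# -> ['step 1->2: gap 0.80 logits', 'step 3->4: gap 0.90 logits']
```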

Discussion & Implications

  • High person & item reliabilities confirm that the AfL questionnaire is psychometrically sound for Indonesian higher-education lecturers.

  • Misuse of extreme categories implies a cultural response style or wording ambiguity; respondent training or scale revision is warranted.

  • Demonstrates feasibility of using Rasch analysis (objective, interval-level estimates) over traditional CTT to refine AfL instruments.

  • Findings support broader adoption of AfL, given lecturers show coherent self-reported practices.

Connections to Existing Literature

  • Aligns with Black & Wiliam’s (1998a) emphasis on learner involvement and feedback loops.

  • Echoes Shute’s (2008) argument: well-timed formative feedback enhances cognition and motivation.

  • Supports Reeves (2001) & Harris (2007): AfL focuses on student learning gains rather than instructional delivery.

  • Item reliability parallels Fisher’s (2007) criteria for high-quality rating scales.

Ethical & Practical Considerations

  • Ensuring respondent anonymity and voluntary participation essential, especially when evaluating teaching quality.

  • Instrument refinement must consider linguistic nuances; lecturers must clearly grasp Likert anchors to avoid response bias.

  • Implementation of AfL demands professional development; reliable measurement is only the first step toward pedagogical change.

Numerical & Statistical Highlights (LaTeX Notation)

  • Person Reliability (initial): r_{person} = 0.91

  • Person Reliability (after trimming): r_{person} = 0.94

  • Item Reliability: r_{item} = 0.96

  • Acceptable misfit range: 0.5 < \text{MNSQ} < 1.5,\; -2 < Z < 2

  • Category step calibration criterion: 1.5 < \Delta\text{step} < 5.0

  • Reliability coefficient bounds: 0 \le r \le 1

  • Separation indices: \text{Person Separation} = 3.87,\; \text{Item Separation} \approx 5.31

Conclusion

  • The AfL instrument exhibits excellent item reliability and high person reliability, validating its use for surveying lecturer practices in Indonesian higher education.

  • Minor scale-category ambiguities (Disagree, Strongly Agree) highlight need for refinement to enhance discriminative power.

  • Overall, the study showcases Rasch modelling as a rigorous avenue for establishing measurement validity and reliability, reinforcing the integrity of Assessment for Learning research and its practical application in classrooms.