
Assessment for Learning: Validity & Reliability

Context & Background of Assessment for Learning (AfL)

  • Indonesia faces systemic educational quality issues spanning primary to higher education.

    • Cited challenges (Ramly, 2005): teacher strikes, commercial accreditation, non-accommodating evaluation systems, influx of foreign investment, weak teacher content mastery, graduate unemployment, education as “cheap business arena,” moral/behavioural neglect, absence of education taxes.

  • Traditional paradigm = Assessment of Learning (AoL) → measures outcomes only (summative).

  • Emerging paradigm = Assessment for Learning (AfL) → integrates assessment into the learning process (formative orientation).

  • AfL expected to:

    • Empower learners through feedback and reflection.

    • Improve learning processes, not merely certify results.

    • Shift focus from teacher-centred delivery to learner agency and self-regulation.

Key Concepts: Validity & Reliability

  • Reliability: consistency or stability of measurement.

    • If a learner’s actual ability remains stable, repeated testing should yield similar scores (test–retest reliability).

    • Statistical view: the degree to which individual items or observers agree with the overall scale.

    • Coefficient range 0 \le r \le 1; higher r ⇒ stronger reliability.

    • Classical methods: test–retest, split-half, parallel forms, Kuder–Richardson, inter-rater (a computational sketch follows this list).

  • Validity: degree to which an instrument measures the intended construct.

    • Requires reliability, but reliability alone ≠ validity.

    • Two intertwined aspects (Ebel & Frisbie, 1991):

    1. What is measured? (construct/content)

    2. How consistently? (procedural/empirical)

    • Encompasses formal tests, observations, interviews, questionnaires, affective/self-reports, projective techniques.
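
These classical coefficients can be computed directly from a respondents × items score matrix. A minimal Python sketch of Cronbach's alpha, using randomly generated 5-point Likert data rather than the study's responses:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a respondents x items matrix of numeric scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Illustrative only: 100 respondents x 50 Likert items coded 1-5.
rng = np.random.default_rng(0)
trait = rng.normal(size=(100, 1))                        # latent trait
noise = rng.normal(scale=0.8, size=(100, 50))
responses = np.clip(np.round(3 + trait + noise), 1, 5)   # correlated items
print(f"alpha = {cronbach_alpha(responses):.2f}")
```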

Historical Development of AfL

  • Roots in Scriven (1967): the distinction between formative and summative evaluation.

  • Bloom, Hastings & Madaus (1971): Handbook on formative & summative evaluation.

  • Sadler (1989): emphasised criteria-based feedback that shows students how to close the gap between current and intended performance.

  • Black & Wiliam (1998, 2006): AfL theory, classroom strategies, active learner involvement.

  • Global research nodes: U.K., U.S.A., Hong Kong, New Zealand; key scholars include Ecclestone, Gardner, Popham, Stiggins, Cowie, Hattie, Shute.

  • Common AfL practices:

    • Sharing learning criteria.

    • Classroom dialogue & questioning.

    • Descriptive feedback (immediate, specific).

    • Peer & self-assessment.

    • Promoting learner confidence and reflection.

Research Objective & Questions

  • Main aim: Investigate validity and reliability of AfL constructs among lecturers at University of Muhammadiyah Makassar, South Sulawesi, Indonesia.

  • Implicit questions:

    1. Are AfL questionnaire items psychometrically sound for higher-education context?

    2. Can lecturers reliably distinguish and report AfL practices using the instrument’s Likert scales?

Methodology

  • Design: Quantitative descriptive survey, single administration.

  • Sampling: proportional stratified random sampling; N = 100 lecturers (allocation sketched after this list).

  • Location: University of Muhammadiyah Makassar.

  • Analysis tools: SPSS v20 for data entry; Rasch Measurement Model (WINSTEPS) for item/person analysis; inferential tests (t-test, ANOVA, \chi^{2}) planned for the broader study.
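
A proportional stratified random sample allocates draws to each stratum in proportion to its population share. A minimal sketch, assuming hypothetical faculty strata (the study's actual strata are not listed here):

```python
import random

def proportional_stratified_sample(strata, n, seed=42):
    """Draw n units total, allocated to each stratum by its population share."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        quota = round(n * len(units) / total)   # proportional allocation
        sample.extend(rng.sample(units, min(quota, len(units))))
    return sample

# Hypothetical population: 400 lecturers across four faculties of
# sizes 160, 120, 80, 40 -> quotas of 40, 30, 20, 10 for n = 100.
population = {f"faculty_{i}": [f"F{i}-L{j}" for j in range(size)]
              for i, size in enumerate([160, 120, 80, 40])}
print(len(proportional_stratified_sample(population, n=100)))  # 100
```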

Instrument Design (50 Likert Items)

  • Six AfL constructs & item counts (encoded in the sketch after this list):

    1. Sharing Learning Objectives (SLO) – 12

    2. Helping Pupils (HP) – 7

    3. Peer & Self-Assessment (PSA) – 9

    4. Providing Feedback (PF) – 8

    5. Promoting Confidence (PC) – 6

    6. Involving in Reviewing & Reflecting (IRR) – 8

  • Likert scale (5 points): Strongly Disagree (SD), Disagree (D), Uncertain (U), Agree (A), Strongly Agree (SA).

  • Development steps (Azrillah, 1996 framework):

    1. Metadata analysis

    2. Expert validation (content & construct): measurement experts at UTM; face validity by language-education experts at UMM.

    3. Pilot testing

    4. Rasch analysis for final calibration
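
As a quick consistency check, the construct item counts sum to the instrument's 50 items; a small sketch encoding the blueprint and the Likert coding:

```python
# Instrument blueprint: AfL construct -> number of Likert items.
BLUEPRINT = {
    "SLO": 12,  # Sharing Learning Objectives
    "HP": 7,    # Helping Pupils
    "PSA": 9,   # Peer & Self-Assessment
    "PF": 8,    # Providing Feedback
    "PC": 6,    # Promoting Confidence
    "IRR": 8,   # Involving in Reviewing & Reflecting
}
LIKERT = {1: "SD", 2: "D", 3: "U", 4: "A", 5: "SA"}  # 5-point anchors

assert sum(BLUEPRINT.values()) == 50  # item counts sum to the 50-item total
```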

Validation Process

  • Misfit inspection criteria (Azrillah, 1996), applied mechanically in the sketch after this list:

    • Point-Measure Correlation 0.40 < \text{PtMea Corr} < 0.85

    • Outfit MNSQ 0.50 < \text{MNSQ} < 1.50

    • Outfit ZSTD -2 < Z < 2

  • Dimensionality & separation indices computed to ensure each construct is distinct and scalable.
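
The three cut-offs above can be screened mechanically against exported item statistics. A minimal sketch, with illustrative fit values rather than the study's WINSTEPS output:

```python
from dataclasses import dataclass

@dataclass
class ItemFit:
    name: str
    ptmea_corr: float   # point-measure correlation
    outfit_mnsq: float  # outfit mean-square
    outfit_zstd: float  # outfit standardised z

def is_misfit(item: ItemFit) -> bool:
    """True if the item violates any of the three inspection criteria."""
    return not (0.40 < item.ptmea_corr < 0.85
                and 0.50 < item.outfit_mnsq < 1.50
                and -2.0 < item.outfit_zstd < 2.0)

# Illustrative values, not the study's output.
items = [ItemFit("SLO1", 0.62, 1.05, 0.4),   # within all bounds
         ItemFit("PF3", 0.31, 1.72, 2.6)]    # violates all three bounds
print([i.name for i in items if is_misfit(i)])  # ['PF3']
```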

Data Analysis Techniques

  • Raw ordinal responses coded numerically (1-5).

  • Rasch outputs interpreted at both person and item levels:

    • Reliability (analogous to Cronbach's alpha in classical test theory).

    • Separation (spread of measures) and RMSE (root mean-square error); the formulas are sketched after this list.

    • Scale calibration distances 1.5 < \Delta\text{step} < 5.0 indicate proper category functioning.
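
Separation and reliability follow from the measures and their standard errors: \text{RMSE} = \sqrt{\overline{SE^{2}}}, SD_{true} = \sqrt{SD_{obs}^{2} - \text{RMSE}^{2}}, separation G = SD_{true}/\text{RMSE}, and reliability R = G^{2}/(1+G^{2}). A minimal sketch with simulated logit measures (choosing SD ≈ 1 and SE = 0.25 lands near the study's person separation):

```python
import numpy as np

def rasch_summary(measures: np.ndarray, ses: np.ndarray) -> dict:
    """Separation and reliability from Rasch measures and standard errors."""
    rmse = np.sqrt(np.mean(ses ** 2))               # average measurement error
    obs_var = measures.var(ddof=1)                  # observed variance
    true_sd = np.sqrt(max(obs_var - rmse ** 2, 0))  # error-corrected spread
    g = true_sd / rmse                              # separation index
    return {"RMSE": round(float(rmse), 2),
            "separation": round(float(g), 2),
            "reliability": round(float(g ** 2 / (1 + g ** 2)), 2)}

# Simulated person measures in logits (not the study's data).
rng = np.random.default_rng(1)
print(rasch_summary(rng.normal(loc=2.46, scale=1.0, size=74),
                    np.full(74, 0.25)))
```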

Findings

Person Reliability

  • Initial sample (N = 100):

    • Person reliability r = 0.91 (excellent; Fisher, 2007).

    • Mean measure \bar{\theta} = 2.46 logits.

  • After removing 26 misfitting respondents (careless or erratic response patterns):

    • Remaining N = 74.

    • Person reliability rose to r = 0.94.

    • Person separation 3.87 ⇒ instrument differentiates ≈ 4 strata of lecturer ability/endorsement.

    • Cronbach's alpha \alpha = 0.94 (internal consistency).

  • Interpretation: lecturers' use and understanding of AfL are measured with high consistency; the outliers most likely answered carelessly rather than indicating flaws in the constructs.

Item Reliability

  • Item reliability r = 0.96 (excellent).

  • Item separation \approx 5.31 ⇒ roughly five discernible difficulty strata among items (consistency check below).

  • No major item misfits under Rasch criteria, supporting strong construct representation across six AfL domains.
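
As a consistency check, the standard Rasch relation between separation G and reliability reproduces both reported coefficients (item separation 5.31 taken from the Numerical Highlights below):

R = \frac{G^{2}}{1+G^{2}} \;\Rightarrow\; R_{person} = \frac{3.87^{2}}{1+3.87^{2}} \approx 0.937 \approx 0.94, \qquad R_{item} = \frac{5.31^{2}}{1+5.31^{2}} \approx 0.966 \approx 0.96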

Scale Validity (Category Functioning)

  • Category calibration table (74 respondents):

    • Step difficulties met recommended bounds except for category 2 (Disagree) and category 5 (Strongly Agree).

    • Respondents struggled to distinguish mild negative endorsement (D) from Uncertain, and did not fully utilise the extreme positive category (SA).

  • Distance anomalies suggest collapsing adjacent categories or writing clearer category descriptors in future revisions; a screening sketch follows.
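
Category functioning can be screened mechanically: the four Andrich thresholds of a 5-point scale should increase monotonically, with adjacent gaps between 1.5 and 5.0 logits. A minimal sketch with illustrative thresholds (not the study's calibration):

```python
def check_category_steps(thresholds, lo=1.5, hi=5.0):
    """Report adjacent Andrich-threshold gaps outside the (lo, hi) logit range."""
    problems = []
    for i in range(1, len(thresholds)):
        gap = thresholds[i] - thresholds[i - 1]
        if not (lo < gap < hi):
            problems.append(f"step {i}->{i + 1}: gap {gap:.2f} logits")
    return problems

# Illustrative thresholds for a 5-category scale (not the study's values):
# the first and last gaps fall below the 1.5-logit minimum and get flagged.
print(check_category_steps([-3.2, -2.4, 0.1, 1.0]))
# -> ['step 1->2: gap 0.80 logits', 'step 3->4: gap 0.90 logits']
```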

Discussion & Implications

  • High person & item reliabilities confirm that the AfL questionnaire is psychometrically sound for Indonesian higher-education lecturers.

  • Misuse of extreme categories implies a cultural response style or wording ambiguity; respondent training or scale revision is warranted.

  • Demonstrates feasibility of using Rasch analysis (objective, interval-level estimates) over traditional CTT to refine AfL instruments.

  • Findings support broader adoption of AfL, given lecturers show coherent self-reported practices.

Connections to Existing Literature

  • Aligns with Black & Wiliam’s (1998a) emphasis on learner involvement and feedback loops.

  • Echoes Shute’s (2008) argument: well-timed formative feedback enhances cognition and motivation.

  • Supports Reeves (2001) & Harris (2007): AfL focuses on student learning gains rather than instructional delivery.

  • Item reliability parallels Fisher’s (2007) criteria for high-quality rating scales.

Ethical & Practical Considerations

  • Ensuring respondent anonymity and voluntary participation essential, especially when evaluating teaching quality.

  • Instrument refinement must consider linguistic nuances; lecturers must clearly grasp Likert anchors to avoid response bias.

  • Implementation of AfL demands professional development; reliable measurement is only the first step toward pedagogical change.

Numerical & Statistical Highlights (LaTeX Notation)

  • Person Reliability (initial): r_{person} = 0.91

  • Person Reliability (after trimming): r_{person} = 0.94

  • Item Reliability: r_{item} = 0.96

  • Acceptable misfit range: 0.5 < \text{MNSQ} < 1.5,\; -2 < Z < 2

  • Category step calibration criterion: 1.5 < \Delta\text{step} < 5.0

  • Reliability coefficient bounds: 0 \le r \le 1

  • Separation indices: \text{Person Separation} = 3.87,\; \text{Item Separation} \approx 5.31

Conclusion

  • The AfL instrument exhibits excellent item reliability and high person reliability, validating its use for surveying lecturer practices in Indonesian higher education.

  • Minor scale-category ambiguities (Disagree, Strongly Agree) highlight need for refinement to enhance discriminative power.

  • Overall, the study showcases Rasch modelling as a rigorous avenue for establishing measurement validity and reliability, reinforcing the integrity of Assessment for Learning research and its practical application in classrooms.