18: Metacognition of Testing Effect
Metacognition of the Testing Effect: Guiding Learners to Predict the Benefits of Retrieval (Tullis, Finley, & Benjamin, 2013)
Self-Regulated Learning
Substantial amounts of learning happen outside of the classroom, requiring students to regulate their own processes.
Key areas of self-regulation include:
Allocating study time: Deciding how much time to dedicate to studying.
Selecting items for additional study: Choosing which concepts or materials need more focus.
Monitoring learning: Keeping track of one's understanding and retention of information.
Some students report using self-testing methods as a study technique.
Research questions:
Are learners sensitive to the mnemonic benefits of testing?
Can learners make predictions based on mnemonic cues rather than naïve theories?
Accuracy of Metacognition
Various desirable difficulties in learning do not appear to be reflected in Judgments of Learning (JOLs):
Spacing repetitions: The effect of distributing study sessions over time rather than cramming.
Interleaved practice: Mixing different topics or types of problems during study sessions.
Imagery: Using mental images to enhance memory retention.
Release from proactive interference: Proactive interference occurs when previously learned material hinders the learning of new information; release is the improvement in memory seen when that interference is reduced (e.g., by switching to a new category of material).
Conditions for improving the accuracy of JOLs include:
Opportunity for comparison: Using within-subject or within-list manipulations so that learners experience both conditions.
Item-by-item judgments: Making predictions based on individual items rather than overall performance.
Delay between study and JOL: Allowing a time gap before the judgment can improve JOL accuracy.
Avoid incomplete information: Ensuring that all information is available when forming JOLs.
Experiment 1 Overview
Previous studies elicited JOLs for the testing effect immediately after study, using between-subjects or block manipulations, and focused on aggregate/global JOLs.
Findings from prior work indicate:
Re-study can result in better immediate memory performance compared to testing, but leads to worse performance in delayed assessments.
Current study methods include:
Soliciting JOLs immediately post-study and after a delay.
Manipulating re-study versus test conditions within a single list.
Encouraging item-by-item JOLs and comparing cue-only and cue-target JOL conditions.
Cue-Only JOL Condition vs. Cue-Target JOL Condition
Memory for past tests (Finn & Metcalfe, 2007):
Predictions about upcoming test performance are often overconfident or well calibrated on the first trial but become underconfident on subsequent study-test trials; this shift is known as the Underconfidence with Practice (UWP) effect.
UWP effects have been observed:
In both recalled and unrecalled items from Trial 1.
Across fixed and self-paced study times.
With incentives for accuracy, and in both easy and hard materials.
Heuristic use in making immediate JOLs following a test involves:
Giving high JOLs to items recalled correctly on the prior test and low JOLs to items that were missed, without accounting for what may be learned from restudying.
The Memory for Past Test (MPT) heuristic may not be used in delayed JOLs.
UWP effects were shown for both recalled and unrecalled items.
Procedure from Finn & Metcalfe (2007, E1):
Study 48 pairs of words, make JOLs after each item (within 10 minutes).
Conduct cued-recall test (3 minutes), repeat the procedure.
In the delayed condition, study the 48 pairs again for 2.5 minutes, then make delayed JOLs from the cues (10 minutes), and take the cued-recall test again (3 minutes).
Results from Finn & Metcalfe (2007, E1):
Recall and JOL measures for the immediate and delayed JOL conditions:
Trial 1 Recall: Immediate: .22, Delayed: .11
Trial 2 Recall: Immediate: .40, Delayed: .31
Trial 1 mean JOL: Immediate: .37, Delayed: .20
Trial 2 mean JOL: Immediate: .35, Delayed: .32
Trial 1 calibration (JOL - Recall): Immediate: .15, Delayed: .09
Trial 2 calibration (JOL - Recall): Immediate: -.06, Delayed: .01 (not significant).
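The calibration scores above are simple differences (mean JOL minus mean recall). A minimal sketch of that arithmetic, using only the rounded means listed in these notes; the dictionary layout and the function name `calibration` are illustrative choices, not anything from the original study:

```python
# Mean recall and mean JOL as listed in the notes above
# (Finn & Metcalfe, 2007, Experiment 1). Keys are (JOL condition, trial).
results = {
    ("immediate", 1): {"recall": 0.22, "jol": 0.37},
    ("immediate", 2): {"recall": 0.40, "jol": 0.35},
    ("delayed", 1): {"recall": 0.11, "jol": 0.20},
    ("delayed", 2): {"recall": 0.31, "jol": 0.32},
}


def calibration(jol, recall):
    """Signed calibration: positive = overconfidence, negative = underconfidence."""
    return round(jol - recall, 2)


calibration_scores = {
    key: calibration(values["jol"], values["recall"])
    for key, values in results.items()
}

for (condition, trial), score in calibration_scores.items():
    print(f"{condition:9s} trial {trial}: {score:+.2f}")
```

Recomputing from the rounded means gives -.05 rather than the listed -.06 for the immediate condition on Trial 2; the small gap presumably comes from the reported value being based on unrounded means (an assumption, not something stated in the notes). The sign flip from positive on Trial 1 to negative on Trial 2 is the UWP pattern.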
Analyzing Contributions to UWP
Exploration of whether unrecalled items disproportionately contribute to UWP.
JOL classifications include:
Recalled on T1 and T2: RR (Recalled-Recalled)
Not recalled on T1 but recalled on T2: FR (Forgotten-Recalled)
Notably, JOLs for FR items tend to be disproportionately low.
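The classification above can be sketched as a small grouping step: label each item by its Trial 1 and Trial 2 recall outcomes, then average the Trial 2 JOLs within each label. The item values below are made up purely for illustration, the field names (`t1`, `t2`, `jol_t2`) are hypothetical, and the labels RF and FF are an extrapolation of the RR/FR naming scheme rather than labels given in the notes:

```python
from collections import defaultdict


def classify(recalled_t1, recalled_t2):
    """Label an item by recall outcome on Trial 1 then Trial 2 (R = recalled, F = forgotten)."""
    first = "R" if recalled_t1 else "F"
    second = "R" if recalled_t2 else "F"
    return first + second


def mean_jol_by_class(items):
    """Average the Trial 2 JOL within each recall-history class."""
    buckets = defaultdict(list)
    for item in items:
        buckets[classify(item["t1"], item["t2"])].append(item["jol_t2"])
    return {label: sum(jols) / len(jols) for label, jols in buckets.items()}


# Illustrative (fabricated) items, NOT data from Finn & Metcalfe (2007).
items = [
    {"t1": True, "t2": True, "jol_t2": 0.8},
    {"t1": True, "t2": True, "jol_t2": 0.9},
    {"t1": False, "t2": True, "jol_t2": 0.2},
    {"t1": False, "t2": True, "jol_t2": 0.3},
    {"t1": False, "t2": False, "jol_t2": 0.1},
]

class_means = mean_jol_by_class(items)
print(class_means)
```

In this toy data the FR items were all recalled on Trial 2 yet received low JOLs, which is the kind of pattern that would make unrecalled Trial 1 items contribute disproportionately to UWP.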
Tullis, Finley, & Benjamin (2013): Findings
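Relationship between Phase 2 test performance and Phase 3 JOLs:
Cue-only group: G(Phase2Test, Phase3JOL) = .84
Cue-target group: G(Phase2Test, Phase3JOL) = .86
These high values suggest that test performance informs JOLs.
Relationship between Phase 3 JOLs and final test performance:
Cue-only group: G(Phase3JOL, Final_Test) = .93
Cue-target group: G(Phase3JOL, Final_Test) = .54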
Higher resolution of cue-target JOLs for tested than for re-studied items is observed both immediately and after delays.
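The G values above measure resolution, that is, how well one ordinal measure tracks another across items. The notes do not define G; the sketch below assumes it is the Goodman-Kruskal gamma correlation commonly used for JOL resolution, computed from concordant and discordant item pairs. The function name and the example values are hypothetical and are not data from Tullis, Finley, & Benjamin (2013):

```python
from itertools import combinations


def goodman_kruskal_gamma(xs, ys):
    """Gamma = (C - D) / (C + D) over all item pairs; tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1  # the two measures order this pair the same way
        elif product < 0:
            discordant += 1  # the two measures order this pair in opposite ways
    if concordant + discordant == 0:
        return None  # undefined when every pair is tied on at least one measure
    return (concordant - discordant) / (concordant + discordant)


# Illustrative (fabricated) JOLs on a 0-100 scale and final recall (1 = recalled).
jols = [90, 70, 40, 20, 10]
recalled = [1, 1, 0, 1, 0]

gamma = goodman_kruskal_gamma(jols, recalled)
print(round(gamma, 2))
```

A gamma near 1 means higher JOLs were almost always given to the items that were later recalled; a gamma near 0 means the JOLs did not discriminate between recalled and unrecalled items.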
Subsequent Experiments Overview (Experiments 2, 3, and 4)
Experiments Structure:
Phase 1: Study 32 word pairs.
Phase 2: Choose to re-study or test the word pairs.
After this phase, participants were asked to predict the number of items they believed they would remember the next day (global prediction).
Phase 3: Conduct a cued recall test the following day.
Phase 4: Study a new list of word pairs.
Phase 5: The re-study or test phase was repeated for the new list, followed by a global prediction (there was no actual test on the second list).
Results from Experiment 2 (No Feedback):
Final cued recall results:
Restudy: .19
Tested: .24
Notably, 19 out of 35 participants demonstrated the testing effect.
Experiment 3: Addressing Feedback
Long delays between the original study and the practice/re-study phases can obscure participants' memory of their earlier study methods.
Participants had to recall how many items they had remembered or forgotten in order to inform their predictions.
At the cued recall test on day 2, participants received feedback on their accuracy: they were told whether each item was correctly recalled and whether it had been re-studied or tested in the prior phase.
Results from Experiment 3 (Partial Feedback):
Final cued recall results:
Restudy: .15
Tested: .21
35 out of 53 participants displayed the testing effect.
Experiment 4: Comprehensive Feedback Provided
Participants received explicit feedback about how many items they recalled correctly, immediately after testing for list 1 (before global predictions for list 2).
Results from Experiment 4 (Full Feedback):
Final cued recall results:
Restudy: .17
Tested: .27
19 out of 25 participants exhibited the testing effect.
Combined Analysis of Experiments 2, 3, and 4
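To compare the three experiments side by side, the recall figures listed above can be reduced to two numbers per experiment: the testing advantage (tested minus restudy recall) and the proportion of participants who showed the testing effect. A minimal sketch using only the values from these notes; the dictionary keys and the function name `summarize` are illustrative:

```python
# Final cued recall and participant counts as listed in the notes above.
experiments = {
    "Experiment 2 (no feedback)": {
        "restudy": 0.19, "tested": 0.24, "showed_effect": 19, "n": 35,
    },
    "Experiment 3 (partial feedback)": {
        "restudy": 0.15, "tested": 0.21, "showed_effect": 35, "n": 53,
    },
    "Experiment 4 (full feedback)": {
        "restudy": 0.17, "tested": 0.27, "showed_effect": 19, "n": 25,
    },
}


def summarize(exp):
    """Return (testing advantage, proportion of participants showing the effect)."""
    advantage = round(exp["tested"] - exp["restudy"], 2)
    proportion = round(exp["showed_effect"] / exp["n"], 2)
    return advantage, proportion


summary = {name: summarize(exp) for name, exp in experiments.items()}

for name, (advantage, proportion) in summary.items():
    print(f"{name}: advantage = {advantage:.2f}, proportion showing effect = {proportion:.2f}")
```

These are descriptive figures only; no significance test is implied by the sketch.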
Discussion points:
Under optimal conditions, metacognitive judgments can be accurate.
Learners may fail to account for the testing effect because its benefits emerge only after a delay and are therefore hard to recognize.
Providing external support (via performance feedback) aids participants in recognizing the efficacy of their encoding processes.
Promoting awareness of successful encoding strategies (like generation) can result in better choices in future learning contexts.
For making accurate judgments, participants should:
Detect differences in performance based on study conditions.
Attribute variations to the appropriate encoding method (track performance).
Acknowledge and alter their beliefs about effective strategies when given feedback.