Quantity and Diversity of Preliteracy Language Exposure: Computational Model of Reading

Core idea

Diversity of vocabulary knowledge and quantity of preliteracy language exposure both predict literacy development.
Their causal roles are hard to separate in behavioral data alone, so the study tests them with a computational triangle model of reading that learns semantic, phonological, and orthographic mappings.
Preliteracy phase: exposure to phonology–semantics relations (quantity and diversity) before reading onset; reading phase: learning orthography–phonology and orthography–semantics mappings.
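Key distinction: vocabulary diversity (breadth: how many different words are known, i.e. vocabulary size) vs. exposure quantity (how often those words are encountered, which affects the depth/fidelity of their representations).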

Model framework and research questions

Triangle model components: orthography, phonology, and semantics, with interconnections; reading maps orthography to phonology and semantics, while preliteracy learning maps phonology to semantics and vice versa.
Four research questions guiding the study:
- Do both variety (diversity) and quantity (exposure) contribute to literacy development?
- How do quantity and diversity interact across reading development?
- Do exposure and diversity differentially affect written word comprehension vs reading fluency?
- How well does the triangle model align with the Simple View of Reading (SVR) in explaining reading outcomes?
SVR vs triangle model: SVR emphasizes decoding (orthography→phonology) and oral language (phonology→semantics); triangle model allows reciprocal mappings and context integration, including orthography→semantics pathways.

Model architecture and representations

Architecture: three processing layers with five intervening hidden/attractor layers. Key layers include Orthography, Phonology, Semantics; attractors stabilize representations; context units disambiguate homophones.
Core representations:
- Orthography: 14 slots × 26 letters (364 units total) with first vowel in slot 5 to capture vowel patterns.
- Phonology: 8 slots with onset, vowel, and coda encoding; each phoneme encoded by 25 phonological features.
- Semantics: 2,446 features per WordNet representation.
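To make the slot-based orthographic code concrete, here is a minimal sketch of a vowel-aligned one-hot encoder. It assumes that "slot 5" is 1-indexed, that only a, e, i, o, u count as vowels, and that a word with no vowel is aligned on its first letter; none of these details are confirmed by the notes, and the function name is hypothetical. The 200-unit phonology size is simply 8 slots × 25 features from the list above.

```python
N_SLOTS = 14            # orthographic slots (from the text)
N_LETTERS = 26          # one unit per letter in each slot
VOWEL_SLOT = 4          # assumption: "slot 5" is 1-indexed, i.e. index 4
VOWELS = set("aeiou")   # assumption: 'y' is not counted as a vowel

ORTH_UNITS = N_SLOTS * N_LETTERS   # 364, as stated in the text
PHON_UNITS = 8 * 25                # 8 slots x 25 features = 200 (derived)
SEM_UNITS = 2446                   # semantic features (from the text)


def encode_orthography(word: str) -> list[int]:
    """Return a 364-element binary vector with the first vowel in slot 5."""
    word = word.lower()
    if not word or not all("a" <= ch <= "z" for ch in word):
        raise ValueError("expected a non-empty word of letters a-z")
    # Index of the first vowel; fall back to 0 if there is none (assumption).
    first_vowel = next((i for i, ch in enumerate(word) if ch in VOWELS), 0)
    offset = VOWEL_SLOT - first_vowel
    if offset < 0 or offset + len(word) > N_SLOTS:
        raise ValueError(f"{word!r} does not fit the 14-slot template")
    vec = [0] * ORTH_UNITS
    for i, ch in enumerate(word):
        # Flattened index: slot position times 26, plus the letter index.
        vec[(offset + i) * N_LETTERS + (ord(ch) - ord("a"))] = 1
    return vec
```

For example, "street" has its first vowel at position 4 of the word, so the whole word is shifted right by one slot and the first "e" lands in slot 5.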
Word vocabulary and frequency:
- Training corpus: 6,229 monosyllabic words with inflected forms included; frequency-weighted via log-frequency from WSJ corpus.
Context and attractor components:
- 4 context units modulate semantic disambiguation for homophones.
- 50-unit attractors stabilize phonological and semantic representations.

Training procedure and experimental manipulation

Two training phases:
- Preliteracy training: learn mappings between semantics and phonology (speaking and hearing tasks); develop stable phonological and semantic representations via attractors.
- Reading training: learn mappings from orthography to phonology and semantics.
Preliteracy vocabulary sizes (simulated exposure breadth): 1k, 2k, 3k, 4k, 5k, 6k words; word sets chosen by frequency (most frequent N words).
Exposure regimes during preliteracy:
- 400k, 800k, 1.2M, 1.6M, 2M word exposures (sampling from vocabulary according to log-frequency).
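The two manipulations can be sketched as a sampler: vocabulary size restricts the word set to the N most frequent words, and exposure sets how many samples are drawn from it. This is an illustrative sketch only; the `log(f + 1)` weighting, the function name, and the toy frequencies in the usage note are assumptions, since the notes say only that sampling follows log-frequency.

```python
import math
import random


def make_sampler(freqs: dict[str, int], vocab_size: int, seed: int = 0):
    """Build a log-frequency-weighted sampler over the top-N words."""
    # Diversity manipulation: keep only the N most frequent words.
    top = sorted(freqs, key=freqs.get, reverse=True)[:vocab_size]
    # Assumption: weight = log(frequency + 1); the exact transform is not given.
    weights = [math.log(freqs[w] + 1) for w in top]
    rng = random.Random(seed)

    def sample(n_exposures: int) -> list[str]:
        # Quantity manipulation: number of word presentations drawn.
        return rng.choices(top, weights=weights, k=n_exposures)

    return top, sample
```

With toy counts such as `{"the": 1000, "cat": 50, "dog": 40, "zebra": 2}` and `vocab_size=3`, "zebra" is never presented no matter how many exposures are drawn, which is the sense in which more exposure to a small vocabulary adds quantity but not diversity.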
Training dynamics:
- Backpropagation through time with typical learning rate ~0.05; cross-entropy as error measure.
- Interleaved preliteracy tasks: speaking (40%), hearing (40%), phonological attractor (10%), semantic attractor (10%).
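The interleaving of preliteracy tasks can be illustrated with a weighted draw per training trial. Whether the original simulations picked the task stochastically on each trial or followed a fixed schedule is not stated here, so the per-trial sampling below is an assumption, as are the identifier names.

```python
import random
from collections import Counter

# Task proportions as listed in the text.
TASK_MIX = {
    "speaking": 0.40,                # semantics -> phonology
    "hearing": 0.40,                 # phonology -> semantics
    "phonological_attractor": 0.10,
    "semantic_attractor": 0.10,
}


def task_schedule(n_trials: int, seed: int = 0) -> list[str]:
    """Draw one task label per trial according to TASK_MIX (assumed scheme)."""
    rng = random.Random(seed)
    tasks = list(TASK_MIX)
    weights = [TASK_MIX[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=n_trials)


if __name__ == "__main__":
    counts = Counter(task_schedule(10_000))
    for task, n in counts.items():
        print(f"{task}: {n / 10_000:.2%}")
```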
Simulations:
- 30 preliteracy simulations (6 vocab × 5 exposure) × 4 random initializations; total 120 runs to ensure robustness.
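The design is a full crossing of the two factors, repeated over random initializations. A short sketch enumerating the runs (the dictionary keys and the use of integer seeds are illustrative assumptions):

```python
from itertools import product

VOCAB_SIZES = [1000, 2000, 3000, 4000, 5000, 6000]             # 6 levels
EXPOSURES = [400_000, 800_000, 1_200_000, 1_600_000, 2_000_000]  # 5 levels
N_SEEDS = 4                                                     # random initializations

# One entry per simulation: every vocabulary size x exposure x seed.
runs = [
    {"vocab_size": v, "exposures": e, "seed": s}
    for v, e, s in product(VOCAB_SIZES, EXPOSURES, range(N_SEEDS))
]

conditions = {(r["vocab_size"], r["exposures"]) for r in runs}
```

Here `len(conditions)` is 30 and `len(runs)` is 120, matching the counts given above.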

Testing and evaluation metrics

Preliteracy phase assessments: accuracy in speaking (semantics→phonology) and hearing (phonology→semantics).
Literacy phase assessments: reading performance on orthography→semantics and orthography→phonology mappings; error scores and accuracy measured for semantic and phonological outputs.
Analysis approach:
- Generalised linear mixed-effects models with fixed effects for vocabulary size, exposure amount, and reading time; random effects for simulation runs and word items.
- Reading time was log-transformed to capture its nonlinear relationship with performance.
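One plausible way to write the model described above is given below. This is a sketch, not the authors' reported specification: it assumes a binary accuracy outcome with a logit link and random intercepts only, and it includes the exposure × vocabulary size interaction that appears in the findings. Here E_r and V_r are the exposure amount and vocabulary size of simulation run r, T is reading time, and u_r and w_i are the run and item random effects.

```latex
\operatorname{logit}\,\Pr(y_{ri} = 1)
  = \beta_0
  + \beta_1 E_r
  + \beta_2 V_r
  + \beta_3 \log T
  + \beta_4 \,(E_r \times V_r)
  + u_r + w_i,
\qquad
u_r \sim \mathcal{N}(0, \sigma_u^2),\quad
w_i \sim \mathcal{N}(0, \sigma_w^2)
```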

Key findings

Preliteracy performance (before reading):
- Both exposure and vocabulary size improved accuracy in speaking and hearing tasks; by 2 million exposures, accuracy > 88% for both tasks.
Reading outcomes: fluency (orthography→phonology) and written word comprehension (orthography→semantics):
- Reading fluency:
  - Predictors: exposure (β = −0.05, p < .001), vocabulary size (β = 0.25, p < .001), and log reading time (β = 1.45, p < .001).
  - Interaction: exposure × vocabulary size (β = 0.06, p < .001).
  - Early vs late reading times show different patterns: at early reading (100K), exposure and vocabulary both matter, with a significant interaction; at later reading (1M), exposure remains negative and more weakly predictive, and the interaction becomes non-significant.
- Written word comprehension:
  - Predictors: exposure (β = −0.08, p < .001), vocabulary size (β = 0.77, p < .001), and log reading time (β = 2.14, p < .001).
  - Interaction: exposure × vocabulary size (β = 0.27, p < .001).
  - Early reading shows strong exposure benefits, especially with larger vocabularies; with smaller vocabularies (<3000), exposure effects may be more complex.
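A negative exposure coefficient combined with a positive exposure × vocabulary interaction means the effect of exposure depends on vocabulary size. The arithmetic can be sketched as a simple-slope calculation using the reported coefficients. How the predictors were scaled or centered is not stated in these notes, so the vocabulary values are in the (unknown) units of the fitted model; the sketch shows only the sign logic and does not predict accuracy.

```python
def exposure_slope(beta_exposure: float, beta_interaction: float, vocab: float) -> float:
    """Conditional effect of exposure at a given (model-scaled) vocabulary value."""
    return beta_exposure + beta_interaction * vocab


def crossover(beta_exposure: float, beta_interaction: float) -> float:
    """Vocabulary value (model units) at which the exposure effect changes sign."""
    return -beta_exposure / beta_interaction


# Coefficients as reported in the text: (exposure, exposure x vocabulary).
FLUENCY = (-0.05, 0.06)
COMPREHENSION = (-0.08, 0.27)

if __name__ == "__main__":
    for name, (b_exp, b_int) in [("fluency", FLUENCY), ("comprehension", COMPREHENSION)]:
        print(name, "exposure effect changes sign at vocab =", round(crossover(b_exp, b_int), 3))
```

Below the crossover value the conditional exposure effect is negative, and above it the effect is positive, which is consistent with the pattern that added exposure helps most when vocabulary is large.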
Effects of oral language and reading fluency on written word comprehension (SVR alignment):
- When the oral language measures were included as predictors alongside reading fluency, exposure (β = −0.02, p < .001) and vocabulary size (β = 0.86, p < .001) predicted written word comprehension, and reading fluency also contributed (β = 0.81, p < .001).
- This mirrors SVR: both oral language skills and decoding contribute to comprehension.
Differential effects across tasks and time:
- Oral language has a larger impact on written word comprehension than on reading fluency; decoding mappings are easier to acquire, so phonology–semantics routes drive comprehension more when vocabulary breadth is limited.
- Lexical diversity (breadth) has a larger influence than exposure quantity, especially for later reading development.
- The interaction between vocabulary size and exposure shows that increasing exposure from a small vocabulary can hinder later reading if it expands within a limited lexical set (referred to as reduced plasticity to incorporate new words).
Practical takeaways from the model:
- Quantity and diversity of preliteracy language exposure have independent and joint effects on reading development.
- Broad vocabulary breadth supports later reading development more robustly than simply increasing the amount of exposure to a narrow vocabulary.
- Early reading benefits from both exposure and vocabulary size, but excessive exposure to a small vocabulary can impede growth when learning expands beyond that set.

Relation to SVR and theoretical implications

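SVR prediction: reading comprehension depends on both word recognition (decoding, indexed here by fluency) and oral language comprehension; the original SVR formulation treats it as the product of the two.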
The model shows that both decoding and oral language contribute to written word comprehension, aligning with SVR.
It also extends SVR: influences are bidirectional (written word comprehension affects reading fluency and vice versa) and semantic mappings play a role in fluency tasks, supporting a more interconnected view of reading development than a simple unidirectional account.

Practical implications and recommendations

Emphasize lexical diversity in early childhood exposure (e.g., shared reading, varied lexical input) rather than merely increasing the amount of exposure to a small vocabulary.
Early vocabulary breadth may be especially important for later reading comprehension development.
For children with limited oral vocabulary, expanding breadth early can yield better long-term literacy outcomes than simply increasing exposure to a limited set of words.

Limitations and directions for future work

Model limits:
- Vocabulary restricted to monosyllabic words; real language includes polysyllabic and morphologically complex forms.
- Reading tasks focused on single-word processing; discourse and syntactic context not modeled.
- No continued growth of oral vocabulary after literacy onset in the simulations; future work could model ongoing oral vocabulary development during reading.
- Cross-language generalizability remains to be tested; language-specific orthography-to-phonology mappings may alter the balance of decoding vs semantic contributions.
Future directions:
- Incorporate polysyllabic words and morphology.
- Extend to modeling sentences and discourse to capture higher-level comprehension.
- Explore cross-language predictions (e.g., more regular orthographies vs more arbitrary mappings).
- Assess how ongoing oral language growth interacts with literacy learning in the model.

Takeaway summary

Quantity and diversity of preliteracy language exposure make distinct, lasting contributions to literacy development; diversity (breadth) often exerts a stronger or more persistent influence than sheer exposure volume.
A computational triangle model can disentangle these effects and reveal their different impacts across reading outcomes (fluency vs written word comprehension) and across development (early vs later reading).
Aligning with SVR, both oral language skills and decoding contribute to reading comprehension, but their relative influence shifts as reading develops and vocabulary breadth becomes more important.
Practical emphasis for early education: prioritize broad, varied language experiences to expand children’s lexical breadth, alongside foundational decoding skills.