What is a low-resource language (LRL)?
A language with limited computational resources, particularly a lack of digital language data, corpora, annotations, tools, and institutional support
Low-resource status is not determined solely by number of speakers
Why is “low resource” not the same as “few speakers”?
A language may have millions of speakers but little digital representation (e.g., Yoruba), while a language with few speakers may have extensive digital resources (e.g., Icelandic)
What factors are often incorrectly assumed to define a low-resource language?
Small number of speakers
Endangered status
Geographic isolation
—> These factors may contribute but are largely independent of computational resourcing
What are examples of regions with high linguistic diversity?
Papua New Guinea
Indonesia
Nigeria
India
China
Papua New Guinea is particularly notable for having many languages relative to its geographic size.
Why is Hawaiian an interesting example when discussing language resources?
Although classified as critically endangered, extensive revitalisation efforts have produced:
Educational programmes
Online resources (e.g., Duolingo)
Standardised dictionaries
Digital archives
—> This shows endangered languages can still become relatively well-resourced
How can language resources be measured?
By examining
Corpus size
Number of Wikipedia articles
Digital archives
Speech corpora
NLP tools
Availability in training datasets
Why are Wikipedia article counts often used as a resource indicator?
Wikipedia is included in major LLM training datasets such as The Pile, making article count a useful proxy for digital language representation
What does the Wikipedia disparity reveal about linguistic representation?
Languages spoken in highly multilingual regions such as Papua New Guinea have extremely few Wikipedia articles, even though these regions are home to many living languages
What is Common Crawl?
A large-scale internet archive created through web-scraping that is frequently used to train language models
What does Common Crawl reveal about language inequality?
Some languages have almost no representation despite large speaker populations:
English ≈ 42.6%
Kashmiri = 0%
Yoruba ≈ 0.0015%
This demonstrates severe digital inequality.
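To make the scale of this gap concrete, here is a minimal sketch that compares the shares quoted above. The function name `share_ratio` and the dictionary layout are illustrative choices, not part of any Common Crawl tooling.

```python
# Shares of Common Crawl quoted in the card above (percent of content).
common_crawl_share = {"English": 42.6, "Yoruba": 0.0015, "Kashmiri": 0.0}

def share_ratio(shares, high, low):
    """Return how many times larger the `high` share is than the `low` share."""
    if shares[low] == 0:
        return float("inf")  # no recorded share at all, so the ratio is unbounded
    return shares[high] / shares[low]

english_vs_yoruba = share_ratio(common_crawl_share, "English", "Yoruba")
english_vs_kashmiri = share_ratio(common_crawl_share, "English", "Kashmiri")

print(f"English vs Yoruba: about {english_vs_yoruba:,.0f} times the share")
print(f"English vs Kashmiri: {english_vs_kashmiri}")
```

On these figures, English has a share tens of thousands of times larger than Yoruba, which is the inequality the card describes.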
Why is Yoruba considered low-resource despite having millions of speakers?
Yoruba has ~48 million L1 speakers but remains underrepresented in major training datasets and digital resources
Why is Icelandic not necessarily low-resource?
Although it has relatively few speakers, Icelandic has strong institutional support, digitisation efforts, and computational resources
Why is Latin not considered low-resource?
Latin has no native speakers but possesses vast amounts of written text and linguistic documentation
What key distinctions should be made when classifying languages?
low-resource vs endangered
low-resource vs underrepresented
oral vs written traditions
standardised vs non-standardised orthographies
language vs dialect
What components make up a language "resource"?
corpora
annotation
orthography
NLP tooling
institutional support
What is distributional inequality in language resources?
The unequal distribution of digital data, where languages such as English and Chinese dominate while most languages have very small digital footprints
Why do LLMs perform better for some languages than others?
Due to differences in
data availability
training scale
tokenisation quality
benchmark availability
linguistic representation
What broader issues are connected to unequal LLM performance?
Colonial language hierarchies
indigenous data governance
linguistic equity
fair evaluation practices
What are scaling laws in LLMs?
The principle that model performance generally improves as training data, model size, and compute increase
How can scaling laws disadvantage low-resource languages?
Languages with little training data cannot benefit equally from scaling, widening performance gaps between high and low-resource languages
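The widening gap can be illustrated with a toy power law. This is only a sketch: the functional form is a common simplification, and the constants `scale` and `alpha` and both token counts are invented for illustration, not taken from any real model.

```python
def toy_loss(tokens, scale=1e13, alpha=0.1):
    """Toy power law: loss = (scale / tokens) ** alpha. Both constants are invented."""
    return (scale / tokens) ** alpha

# Hypothetical token counts for two languages, not measurements.
high_tokens = 1e11
low_tokens = 1e6

# Gap in toy loss before any further scaling.
gap_before = toy_loss(low_tokens) - toy_loss(high_tokens)

# Assume the high-resource corpus grows tenfold while the low-resource one cannot.
gap_after = toy_loss(low_tokens) - toy_loss(high_tokens * 10)

print(f"gap before scaling: {gap_before:.2f}")
print(f"gap after scaling:  {gap_after:.2f}")
```

Under these assumptions the high-resource language keeps improving as data is added, while the low-resource language stays where it was, so the gap between them grows.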
Why are LLMs described as "data hungry"?
They require enormous quantities of training data to learn linguistic patterns effectively
What is data scarcity?
The lack of sufficient high-quality language data needed for NLP training and evaluation
What are examples of quantity-related data scarcity problems?
Small corpora
Lack of parallel data
Sparse benchmarks
Limited speech recognition resources
Poor OCR
What are examples of quality-related data scarcity problems?
Limited source diversity
Non-representative language samples
Overreliance on legal or religious texts
Lack of conversational data
Why can limited source diversity be problematic?
Models may learn language patterns that do not reflect everyday speech and actual language use
What is tokenisation?
The process of breaking text into smaller units (tokens) that a model can process.
What is tokenisation inequity?
When some languages are represented less efficiently than others during tokenisation, creating performance disadvantages
Why is English often tokenised efficiently?
Most tokenisers are built from data dominated by English and other major high-resource languages, so common English words are covered in few tokens
Which languages are particularly disadvantaged by tokenisation?
Morphologically rich and agglutinative languages such as:
Turkish
Swahili
Why do agglutinative languages create tokenisation challenges?
Single words can contain large amounts of grammatical information, causing them to split into many tokens
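A toy example may help here. The sketch below uses a simple greedy longest-match tokeniser with a tiny, hypothetical vocabulary skewed towards English. Real tokenisers (for example BPE-based ones) work differently in detail, so treat this only as an illustration of the splitting effect.

```python
def greedy_tokenise(word, vocab):
    """Greedy longest-match segmentation; unmatched characters become single tokens."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # no entry matched: fall back to one character
            i += 1
    return tokens

# Hypothetical vocabulary containing only English pieces.
vocab = {"from", "your", "house", "s"}

# English needs three words; Turkish packs roughly the same meaning into one:
# ev (house) + ler (plural) + iniz (your) + den (from).
english = [greedy_tokenise(w, vocab) for w in "from your houses".split()]
turkish = greedy_tokenise("evlerinizden", vocab)

print(sum(len(t) for t in english), english)
print(len(turkish), turkish)
```

With this vocabulary the English phrase needs 4 tokens, while the single Turkish word falls apart into 12 single-character tokens.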
How can non-Latin scripts be disadvantaged in tokenisation?
They may fragment into inefficient token sequences because tokenisers are often optimised for Latin-based scripts
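One way to see the fragmentation is to count UTF-8 bytes, assuming a tokeniser that falls back to raw bytes when it has no learned token for a character. The helper `bytes_per_code_point` is a hypothetical name for this sketch.

```python
samples = {
    "English (Latin)": "hello",
    # "namaste" in Devanagari, written with escapes so the code points are explicit.
    "Hindi (Devanagari)": "\u0928\u092e\u0938\u094d\u0924\u0947",
}

def bytes_per_code_point(text):
    """Average number of UTF-8 bytes needed per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

for label, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{label}: {len(text)} code points, {n_bytes} bytes, "
          f"{bytes_per_code_point(text):.1f} bytes per code point")
```

Basic Latin letters take one byte each in UTF-8, while Devanagari characters take three, so a byte-level fallback starts with about three times as many units for the Hindi word before any merging happens.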
What is transfer learning?
The process by which knowledge from one task, language, or dataset improves performance on another
How does transfer learning benefit low-resource languages?
Models can borrow:
grammatical knowledge
semantic patterns
lexical relationships
from better-resourced languages
Which language relationships improve transfer learning effectiveness?
related languages
shared orthographies
similar morphology
similar syntax
What is massive multilingual pretraining?
Training a model on many languages simultaneously so that linguistic knowledge can be shared across languages
What kinds of knowledge are learned during multilingual pretraining?
syntax
semantics
token patterns
cross-lingual relationships
What are shared representations?
Internal model representations that place semantically similar words from different languages near each other in the vector space
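"Near each other" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors. Real models use hundreds or thousands of dimensions, and these numbers do not come from any actual model.

```python
import math

def cosine(u, v):
    """Cosine similarity: close to 1.0 for same direction, close to 0.0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings keyed by (language, word).
embeddings = {
    ("en", "dog"): [0.9, 0.1, 0.0],
    ("es", "perro"): [0.8, 0.2, 0.1],
    ("en", "car"): [0.0, 0.2, 0.9],
}

translation_pair = cosine(embeddings[("en", "dog")], embeddings[("es", "perro")])
unrelated_pair = cosine(embeddings[("en", "dog")], embeddings[("en", "car")])

print(f"dog vs perro: {translation_pair:.2f}")
print(f"dog vs car:   {unrelated_pair:.2f}")
```

In a shared representation, the translation pair scores much higher than the unrelated pair, even though the two words come from different languages.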
What are aligned datasets?
Datasets containing equivalent meanings across multiple languages, allowing models to learn correspondences between languages.
When does transfer learning often fail?
Language isolates
Unrelated languages
Poor tokenisation
Data contamination
Cultural mismatches
What is linguistic equity?
The principle that all languages should have fair opportunities for digital representation, technological development, and AI support.
Why does linguistic equity matter for LLMs?
Unequal language representation can reinforce existing social, political, and technological inequalities.
What ethical concern arises from assuming "more data is always better"?
It may encourage the collection of culturally sensitive material without appropriate consent
What types of knowledge may be harmed by indiscriminate data collection?
Sacred knowledge
Ceremonial speech
Restricted narratives
Community-owned cultural information
What is Indigenous data governance?
The principle that communities should control how their language and cultural data are collected, stored, and used
What is community-centred NLP?
NLP development that actively involves language communities in decisions about data collection and model development.
What practices characterise community-centred NLP?
Co-design
Community annotation
Local governance
Participatory decision-making
What are the main implications of LLMs for low-resource languages?
Poorer performance due to data scarcity
tokenisation disadvantages
limited evaluation resources
potential benefits from transfer learning
ethical concerns around representation and data governance
What is the tension between scale and consent?
Foundation models rely on large-scale data collection, while many communities require contextual and negotiated permission for language use.
What is the tension between preservation and freezing?
Digitisation may preserve a language but may also fossilise one dialect or register as the “official” version
What is the tension between visibility and exploitation?
absence from AI may make a language invisible
inclusion may expose communities to appropriation or misuse