LING 316: Low-Resource Languages, LLMs, Transfer Learning & Linguistic Equity

Last updated 10:47 AM on 5/31/26

50 Terms


What is a low-resource language (LRL)?

A language with limited computational resources, particularly a lack of digital language data, corpora, annotations, tools, and institutional support
Low-resource status is not determined solely by number of speakers


Why is “low-resource” not the same as “few speakers”?

A language may have millions of speakers but little digital representation (e.g., Yoruba), while a language with few speakers may have extensive digital resources (e.g., Icelandic)


What factors are often incorrectly assumed to define a low-resource language?

Small number of speakers
Endangered status
Geographic isolation

—> These factors may contribute but are largely independent of computational resourcing


What are examples of regions with high linguistic diversity?

Papua New Guinea
Indonesia
Nigeria
India
China

Papua New Guinea is particularly notable for having many languages relative to its geographic size.


Why is Hawaiian an interesting example when discussing language resources?

Although classified as critically endangered, extensive revitalisation efforts have produced:

Educational programmes
Online resources (e.g., Duolingo)
Standardised dictionaries
Digital archives

—> This shows endangered languages can still become relatively well-resourced


How can language resources be measured?

By examining:

Corpus size
Number of Wikipedia articles
Digital archives
Speech corpora
NLP tools
Availability in training datasets


Why are Wikipedia article counts often used as a resource indicator?

Wikipedia is included in major LLM training datasets such as The Pile, making article count a useful proxy for digital language representation


What does the Wikipedia disparity reveal about linguistic representation?

Languages spoken in highly multilingual regions such as Papua New Guinea have extremely few Wikipedia articles despite representing many living languages


What is Common Crawl?

A large-scale internet archive created through web-scraping that is frequently used to train language models


What does Common Crawl reveal about language inequality?

Some languages have almost no representation despite large speaker populations:

English ≈ 42.6%
Kashmiri = 0%
Yoruba ≈ 0.0015%

This demonstrates severe digital inequality.
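
As a rough check on the scale of this gap, the sketch below divides the English share by the Yoruba share, using only the percentages quoted above. The dictionary and function names are illustrative.

```python
# Shares of Common Crawl quoted on this card, in percent.
shares = {"English": 42.6, "Yoruba": 0.0015, "Kashmiri": 0.0}


def ratio_to(reference: str, other: str) -> float:
    """How many times larger the reference language's share is."""
    if shares[other] == 0:
        return float("inf")  # no measurable share to compare against
    return shares[reference] / shares[other]


english_vs_yoruba = ratio_to("English", "Yoruba")
print(f"English has roughly {english_vs_yoruba:,.0f}x the share of Yoruba")
```

On these figures, English has a share roughly 28,400 times that of Yoruba, and Kashmiri's share is too small to form a ratio at all.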


Why is Yoruba considered low-resource despite having millions of speakers?

Yoruba has ~48 million L1 speakers but remains underrepresented in major training datasets and digital resources


Why is Icelandic not necessarily low-resource?

Although it has relatively few speakers, Icelandic has strong institutional support, digitisation efforts, and computational resources


Why is Latin not considered low-resource?

Latin has no native speakers but possesses vast amounts of written text and linguistic documentation


What key distinctions should be made when classifying languages?

low-resource vs endangered
low-resource vs underrepresented
oral vs written traditions
standardised vs non-standardised orthographies
language vs dialect


What components make up a language "resource"?

corpora
annotation
orthography
NLP tooling
institutional support


What is distributional inequality in language resources?

The unequal distribution of digital data, where languages such as English and Chinese dominate while most languages have very small digital footprints


Why do LLMs perform better for some languages than others?

Due to differences in

data availability
training scale
tokenisation quality
benchmark availability
linguistic representation


What broader issues are connected to unequal LLM performance?

Colonial language hierarchies
Indigenous data governance
linguistic equity
fair evaluation practices


What are scaling laws in LLMs?

The principle that model performance generally improves as training data, model size, and compute increase


How can scaling laws disadvantage low-resource languages?

Languages with little training data cannot benefit equally from scaling, widening performance gaps between high- and low-resource languages
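
One rough way to picture this is to assume loss falls as a power law in the amount of training data, L(D) = (D_c / D)^alpha, a form commonly used to describe scaling behaviour. The constants and token counts below are invented for illustration and are not fitted to any real model or language.

```python
# Illustrative only: ALPHA and D_C are hypothetical constants, not fitted values.
ALPHA = 0.1
D_C = 1e13  # hypothetical reference dataset size, in tokens


def loss(tokens: float) -> float:
    """Assumed power-law loss L(D) = (D_C / D) ** ALPHA; lower is better."""
    return (D_C / tokens) ** ALPHA


def gap(high_tokens: float, low_tokens: float) -> float:
    """Loss difference between a low- and a high-resource language."""
    return loss(low_tokens) - loss(high_tokens)


low_resource = 1e6  # hypothetical token count that stays fixed

gap_before = gap(1e10, low_resource)
gap_after = gap(1e11, low_resource)  # high-resource data grows tenfold

print(f"gap before scaling: {gap_before:.2f}")
print(f"gap after scaling:  {gap_after:.2f}")
```

Because the low-resource token count stays fixed while the high-resource count grows, the gap can only widen under this assumed curve.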


Why are LLMs described as "data hungry"?

They require enormous quantities of training data to learn linguistic patterns effectively


What is data scarcity?

The lack of sufficient high-quality language data needed for NLP training and evaluation


What are examples of quantity-related data scarcity problems?

Small corpora
Lack of parallel data
Sparse benchmarks
Limited speech recognition resources
Poor OCR


What are examples of quality-related data scarcity problems?

Limited source diversity
Non-representative language samples
Overreliance on legal or religious texts
Lack of conversational data


Why can limited source diversity be problematic?

Models may learn language patterns that do not reflect everyday speech and actual language use


What is tokenisation?

The process of breaking text into smaller units (tokens) that a model can process.
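
A minimal sketch of two simple granularities, word-level and character-level, is shown below. LLMs typically use learned subword vocabularies instead, so this only illustrates the basic idea; the function names are illustrative.

```python
import re


def word_tokenise(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenise(text: str) -> list[str]:
    """Treat every character as its own token."""
    return list(text)


print(word_tokenise("Language models read tokens."))
# ['Language', 'models', 'read', 'tokens', '.']
print(char_tokenise("cat"))
# ['c', 'a', 't']
```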


What is tokenisation inequity?

When some languages are represented less efficiently than others during tokenisation, creating performance disadvantages


Why is English often tokenised efficiently?

Most tokenisers are built from training data dominated by English and other major high-resource languages, so their vocabularies represent those languages most efficiently


Which languages are particularly disadvantaged by tokenisation?

Morphologically rich and agglutinative languages such as:

Turkish
Swahili


Why do agglutinative languages create tokenisation challenges?

Single words can contain large amounts of grammatical information, causing them to split into many tokens
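
The toy example below shows the effect with the Turkish word "evlerinizden" (ev-ler-iniz-den, "from your houses"). It uses greedy longest-match segmentation over a hand-picked, hypothetical vocabulary; real tokenisers learn their vocabularies from data and may split the word less cleanly.

```python
# Hypothetical toy vocabulary, chosen by hand for this illustration.
VOCAB = {"from", "your", "houses", "ev", "ler", "iniz", "den"}


def greedy_tokenise(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # no piece matched: emit one character
            i += 1
    return tokens


turkish = greedy_tokenise("evlerinizden", VOCAB)
english = [t for w in "from your houses".split() for t in greedy_tokenise(w, VOCAB)]

print(turkish)  # ['ev', 'ler', 'iniz', 'den']
print(english)  # ['from', 'your', 'houses']
```

One Turkish word becomes four tokens, while each English word stays a single token.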


How can non-Latin scripts be disadvantaged in tokenisation?

They may fragment into inefficient token sequences because tokenisers are often optimised for Latin-based scripts
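
One concrete source of this fragmentation can be seen in UTF-8 itself: byte-level tokenisers start from raw bytes, and basic Latin letters take one byte each while Cyrillic letters take two and Devanagari characters take three. The sketch below only measures that starting point; learned merges can narrow the gap where enough training data exists. The sample words and function name are illustrative.

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character (code point)."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "Latin (English)": "hello",
    "Cyrillic (Russian)": "привет",
    # The Hindi greeting "namaste", written as explicit code points.
    "Devanagari (Hindi)": "\u0928\u092e\u0938\u094d\u0924\u0947",
}

for label, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{label}: {len(text)} characters, {n_bytes} bytes")
```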


What is transfer learning?

The process by which knowledge from one task, language, or dataset improves performance on another


How does transfer learning benefit low-resource languages?

Models can borrow:

grammatical knowledge
semantic patterns
lexical relationships

from better-resourced languages


Which language relationships improve transfer learning effectiveness?

related languages
shared orthographies
similar morphology
similar syntax


What is massive multilingual pretraining?

Training a model on many languages simultaneously so that linguistic knowledge can be shared across languages


What kinds of knowledge are learned during multilingual pretraining?

syntax
semantics
token patterns
cross-lingual relationships


What are shared representations?

Internal model representations that place semantically similar words from different languages near each other in the vector space
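
"Near each other" is usually measured with cosine similarity. The sketch below uses invented three-dimensional vectors purely for illustration; real models use hundreds or thousands of dimensions, and these values come from no actual model.

```python
import math

# Hypothetical embeddings, invented for this example.
embeddings = {
    ("en", "water"): [0.90, 0.10, 0.20],
    ("es", "agua"): [0.85, 0.15, 0.25],
    ("en", "book"): [0.10, 0.90, 0.30],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_meaning = cosine(embeddings[("en", "water")], embeddings[("es", "agua")])
different = cosine(embeddings[("en", "water")], embeddings[("en", "book")])

print(f"water vs agua: {same_meaning:.2f}")
print(f"water vs book: {different:.2f}")
```

In a shared space, the cross-lingual pair with the same meaning scores higher than the same-language pair with different meanings.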


What are aligned datasets?

Datasets containing equivalent meanings across multiple languages, allowing models to learn correspondences between languages.


When does transfer learning often fail?

Language isolates
Unrelated languages
Poor tokenisation
Data contamination
Cultural mismatches


What is linguistic equity?

The principle that all languages should have fair opportunities for digital representation, technological development, and AI support.


Why does linguistic equity matter for LLMs?

Unequal language representation can reinforce existing social, political, and technological inequalities.


What ethical concern arises from assuming "more data is always better"?

It may encourage the collection of culturally sensitive material without appropriate consent


What types of knowledge may be harmed by indiscriminate data collection?

Sacred knowledge
Ceremonial speech
Restricted narratives
Community-owned cultural information


What is Indigenous data governance?

The principle that communities should control how their language and cultural data are collected, stored, and used


What is community-centred NLP?

NLP development that actively involves language communities in decisions about data collection and model development.


What practices characterise community-centred NLP?

Co-design
Community annotation
Local governance
Participatory decision-making


What are the main implications of LLMs for low-resource languages?

Poorer performance due to data scarcity
tokenisation disadvantages
limited evaluation resources
potential benefits from transfer learning
ethical concerns around representation and data governance


What is the tension between scale and consent?

Foundation models rely on large-scale data collection, while many communities require contextual and negotiated permission for language use.


What is the tension between preservation and freezing?

Digitisation may preserve a language but may also fossilise one dialect or register as the “official” version


What is the tension between visibility and exploitation?

Absence from AI may make a language invisible
Inclusion may expose communities to appropriation or misuse