LING 316: Low-Resource Languages, LLMs, Transfer Learning & Linguistic Equity

Last updated 10:47 AM on 5/31/26

50 Terms

1

What is a low-resource language (LRL)?

  • A language with limited computational resources, particularly a lack of digital language data, corpora, annotations, tools, and institutional support

  • Low-resource status is not determined solely by number of speakers

2

Why is “low-resource” not the same as “few speakers”?

A language may have millions of speakers but little digital representation (e.g., Yoruba), while a language with few speakers may have extensive digital resources (e.g., Icelandic)

3

What factors are often incorrectly assumed to define a low-resource language?

  • Small number of speakers

  • Endangered status

  • Geographic isolation

—> These factors may contribute but are largely independent of computational resourcing

4

What are examples of regions with high linguistic diversity?

  • Papua New Guinea

  • Indonesia

  • Nigeria

  • India

  • China

Papua New Guinea is particularly notable for having many languages relative to its geographic size.

5

Why is Hawaiian an interesting example when discussing language resources?

Although classified as critically endangered, extensive revitalisation efforts have produced:

  • Educational programmes

  • Online resources (e.g., Duolingo)

  • Standardised dictionaries

  • Digital archives

—> This shows endangered languages can still become relatively well-resourced

6

How can language resources be measured?

By examining:

  • Corpus size

  • Number of Wikipedia articles

  • Digital archives

  • Speech corpora

  • NLP tools

  • Availability in training datasets

7

Why are Wikipedia article counts often used as a resource indicator?

Wikipedia is included in major LLM training datasets such as The Pile, making article count a useful proxy for digital language representation

8

What does the Wikipedia disparity reveal about linguistic representation?

Languages spoken in highly multilingual regions such as Papua New Guinea have extremely few Wikipedia articles, even though those regions are home to many living languages

9

What is Common Crawl?

A large-scale internet archive created through web-scraping that is frequently used to train language models

10

What does Common Crawl reveal about language inequality?

Some languages have almost no representation despite large speaker populations:

  • English ≈ 42.6%

  • Kashmiri = 0%

  • Yoruba ≈ 0.0015%

This demonstrates severe digital inequality.
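
To put these shares in perspective, the ratio between them can be computed directly. The following is a minimal Python sketch that uses only the percentages quoted on this card; the variable and function names are illustrative.

```python
# Shares of Common Crawl by language, in percent, as quoted on this card.
shares_percent = {"English": 42.6, "Yoruba": 0.0015}

def share_ratio(shares, a, b):
    """Return how many times larger language a's share is than language b's."""
    return shares[a] / shares[b]

ratio = share_ratio(shares_percent, "English", "Yoruba")
print(f"English's share is roughly {ratio:,.0f} times Yoruba's share")
```

On these figures, English's share is about 28,400 times larger than Yoruba's.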

11

Why is Yoruba considered low-resource despite having millions of speakers?

Yoruba has ~48 million L1 speakers but remains underrepresented in major training datasets and digital resources

12

Why is Icelandic not necessarily low-resource?

Although it has relatively few speakers, Icelandic has strong institutional support, digitisation efforts, and computational resources

13

Why is Latin not considered low-resource?

Latin has no native speakers but possesses vast amounts of written text and linguistic documentation

14

What key distinctions should be made when classifying languages?

  • low-resource vs endangered

  • low-resource vs underrepresented

  • oral vs written traditions

  • standardised vs non-standardised orthographies

  • language vs dialect

15

What components make up a language "resource"?

  • corpora

  • annotation

  • orthography

  • NLP tooling

  • institutional support

16

What is distributional inequality in language resources?

The unequal distribution of digital data, where languages such as English and Chinese dominate while most languages have very small digital footprints

17

Why do LLMs perform better for some languages than others?

Due to differences in:

  • data availability

  • training scale

  • tokenisation quality

  • benchmark availability

  • linguistic representation

18

What broader issues are connected to unequal LLM performance?

  • Colonial language hierarchies

  • indigenous data governance

  • linguistic equity

  • fair evaluation practices

19

What are scaling laws in LLMs?

The principle that model performance generally improves as training data, model size, and compute increase

20

How can scaling laws disadvantage low-resource languages?

Languages with little training data cannot benefit equally from scaling, widening performance gaps between high- and low-resource languages
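
One way to picture this is a curve on which loss falls as the amount of training data grows. The sketch below is purely illustrative: the power-law form, every constant, and both corpus sizes are assumptions chosen for demonstration, not values fitted to any real model or language.

```python
def illustrative_loss(num_tokens, a=10.0, alpha=0.1, floor=1.5):
    """Toy power-law loss curve (lower is better). All constants are made up."""
    return a * num_tokens ** (-alpha) + floor

# Hypothetical corpus sizes, in tokens, for a high- and a low-resource language.
high_resource_tokens = 1_000_000_000_000
low_resource_tokens = 10_000_000

high_loss = illustrative_loss(high_resource_tokens)
low_loss = illustrative_loss(low_resource_tokens)

print(f"high-resource loss: {high_loss:.2f}")
print(f"low-resource loss:  {low_loss:.2f}")
```

Under these assumptions, the language with less data sits higher on the curve, and adding more data for the high-resource language alone only widens the difference.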

21

Why are LLMs described as "data hungry"?

They require enormous quantities of training data to learn linguistic patterns effectively

22

What is data scarcity?

The lack of sufficient high-quality language data needed for NLP training and evaluation

23

What are examples of quantity-related data scarcity problems?

  • Small corpora

  • Lack of parallel data

  • Sparse benchmarks

  • Limited speech recognition resources

  • Poor OCR

24

What are examples of quality-related data scarcity problems?

  • Limited source diversity

  • Non-representative language samples

  • Overreliance on legal or religious texts

  • Lack of conversational data

25

Why can limited source diversity be problematic?

Models may learn language patterns that do not reflect everyday speech and actual language use

26

What is tokenisation?

The process of breaking text into smaller units (tokens) that a model can process.
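
As a rough illustration, the same sentence can be split at two different granularities. This is a minimal sketch with hypothetical helper names; LLMs typically use learned subword tokenisers (such as byte-pair encoding) rather than either of these simple schemes.

```python
import re

def word_tokenise(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenise(text):
    """Split text into individual characters, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

sentence = "Models read tokens, not words."
print(word_tokenise(sentence))
print(char_tokenise(sentence)[:6])
```

The word-level split yields 7 tokens for this sentence, while the character-level split yields one token per non-space character.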

27

What is tokenisation inequity?

When some languages are represented less efficiently than others during tokenisation, creating performance disadvantages

28

Why is English often tokenised efficiently?

Most tokenisers are built from training data dominated by English and other major high-resource languages, so their vocabularies represent those languages efficiently

29

Which languages are particularly disadvantaged by tokenisation?

Morphologically rich and agglutinative languages such as:

  • Turkish

  • Swahili

30

Why do agglutinative languages create tokenisation challenges?

Single words can contain large amounts of grammatical information, causing them to split into many tokens
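
The effect can be sketched by counting tokens per word (sometimes called fertility). The example below assumes a hypothetical toy vocabulary and uses greedy longest-match segmentation as a simplified stand-in for a learned subword tokeniser; the Turkish word evlerimizden (ev-ler-imiz-den, "from our houses") packs into one word what English spreads over three.

```python
def greedy_subword_tokenise(word, vocab):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate piece first and shrink until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

def fertility(words, vocab):
    """Average number of tokens produced per word."""
    return sum(len(greedy_subword_tokenise(w, vocab)) for w in words) / len(words)

# Hypothetical vocabulary: English words stored whole, Turkish morphemes stored separately.
toy_vocab = {"from", "our", "houses", "ev", "ler", "imiz", "den"}

english = "from our houses".split()
turkish = ["evlerimizden"]  # ev + ler + imiz + den

print("English tokens per word:", fertility(english, toy_vocab))
print("Turkish tokens per word:", fertility(turkish, toy_vocab))
```

With this toy vocabulary, each English word maps to one token while the single Turkish word splits into four.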

31

How can non-Latin scripts be disadvantaged in tokenisation?

They may fragment into inefficient token sequences because tokenisers are often optimised for Latin-based scripts
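
One concrete source of fragmentation can be sketched as follows, assuming a tokeniser that falls back to raw UTF-8 bytes for text its vocabulary does not cover (an assumption for illustration, not something stated on this card). Characters outside the basic Latin range take more than one byte each, so short words become long byte sequences.

```python
def utf8_byte_count(text):
    """Number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))

# Informal greetings chosen purely for illustration.
greetings = {"English": "hello", "Russian": "привет", "Hindi": "नमस्ते"}

for language, word in greetings.items():
    print(language, len(word), "characters ->", utf8_byte_count(word), "bytes")
```

Basic Latin letters take one byte each, Cyrillic letters two, and Devanagari characters three, so a byte-level fallback produces two to three times as many units per character for the latter scripts.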

32

What is transfer learning?

The process by which knowledge from one task, language, or dataset improves performance on another

33

How does transfer learning benefit low-resource languages?

Models can borrow:

  • grammatical knowledge

  • semantic patterns

  • lexical relationships

from better-resourced languages

34

Which language relationships improve transfer learning effectiveness?

  • related languages

  • shared orthographies

  • similar morphology

  • similar syntax

35

What is massive multilingual pretraining?

Training a model on many languages simultaneously so that linguistic knowledge can be shared across languages

36

What kinds of knowledge are learned during multilingual pretraining?

  • syntax

  • semantics

  • token patterns

  • cross-lingual relationships

37

What are shared representations?

Internal model representations that place semantically similar words from different languages near each other in the vector space
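
"Near each other" is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins for real embeddings (which have hundreds or thousands of dimensions); the values are hypothetical and chosen only so that the Yoruba word omi ("water") lands close to English "water".

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings keyed by (language code, word).
toy_embeddings = {
    ("en", "water"): [0.9, 0.1, 0.2],
    ("yo", "omi"): [0.8, 0.2, 0.1],
    ("en", "book"): [0.1, 0.9, 0.3],
}

same_meaning = cosine_similarity(toy_embeddings[("en", "water")], toy_embeddings[("yo", "omi")])
different_meaning = cosine_similarity(toy_embeddings[("en", "water")], toy_embeddings[("en", "book")])

print(f"water vs omi:  {same_meaning:.2f}")
print(f"water vs book: {different_meaning:.2f}")
```

In a model with well-shared representations, translation pairs would be expected to score higher than unrelated words, as they do in this toy setup.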

38

What are aligned datasets?

Datasets containing equivalent meanings across multiple languages, allowing models to learn correspondences between languages.

39

When does transfer learning often fail?

  • Language isolates

  • Unrelated languages

  • Poor tokenisation

  • Data contamination

  • Cultural mismatches

40

What is linguistic equity?

The principle that all languages should have fair opportunities for digital representation, technological development, and AI support.

41

Why does linguistic equity matter for LLMs?

Unequal language representation can reinforce existing social, political, and technological inequalities.

42

What ethical concern arises from assuming "more data is always better"?

It may encourage the collection of culturally sensitive material without appropriate consent

43

What types of knowledge may be harmed by indiscriminate data collection?

  • Sacred knowledge

  • Ceremonial speech

  • Restricted narratives

  • Community-owned cultural information

44

What is Indigenous data governance?

The principle that communities should control how their language and cultural data are collected, stored, and used

45

What is community-centred NLP?

NLP development that actively involves language communities in decisions about data collection and model development.

46

What practices characterise community-centred NLP?

  • Co-design

  • Community annotation

  • Local governance

  • Participatory decision-making

47

What are the main implications of LLMs for low-resource languages?

  • Poorer performance due to data scarcity

  • tokenisation disadvantages

  • limited evaluation resources

  • potential benefits from transfer learning

  • ethical concerns around representation and data governance

48

What is the tension between scale and consent?

Foundation models rely on large-scale data collection, while many communities require contextual and negotiated permission for language use.

49

What is the tension between preservation and freezing?

Digitisation may preserve a language but may also fossilise one dialect or register as the “official” version

50

What is the tension between visibility and exploitation?

  • absence from AI may make a language invisible

  • inclusion may expose communities to appropriation or misuse