Corpus Annotation and Analysis


Last updated 12:27 PM on 2/25/26

48 Terms

1

corpus

a collection of written or spoken natural language utterances digitized and machine-readable

2

Example of corpus

COW (Corpora from the Web): collects web data that is not biased towards certain hosts, with basic cleanup and duplicate removal; multilingual (English, Dutch, German, French, Spanish, Swedish).

It does not consist of a collection of documents but of a collection of sentences (the sentences have been shuffled for copyright reasons).

3

purpose of corpora

  • General-purpose: not built to study a specific phenomenon, but as a representative sample of a language/languages/genres/...

  • Domain-specific: built to capture language in a specific domain or genre – e.g. scientific publications in biology

4

Parallel corpora

the same sentences in different languages; used as training data for machine translation systems and for testing linguistic theories (e.g. language universals). Example: the European Parliament proceedings corpora (Europarl)

5

Comparable Corpora

contain texts covering the same topics in different languages, e.g. Wikipedia pages corpus, The Coronavirus corpus

6

Metadata of corpus

  • authorship information (who wrote it, when, who published it, what language it is in, etc.)

  • corpus building information (Who processed it, What tools they used, Criteria for filtering data)

7

Data types

  • Written corpus: books, news articles, wikipedia

+Easy to store and process

+Clean structure

-Less spontaneous than speech

  • Spoken corpus: conversations, interviews

+Natural language use

+Includes spoken features

-Requires transcription

-More preprocessing needed

  • Web-corpus: tweets, reddit posts, forums

+Large amount of data

+Real, modern language

-Noisy (typos, emojis, slang)

  • Multimodal corpus: Text + audio, Text + video

Used for: Speech analysis, Emotion detection

Different data types influence:

  • vocabulary

  • grammar structure

  • annotation difficulty

  • preprocessing complexity

8

Preparing text data for analysis

necessary because raw text contains inconsistencies, noise, and variation. Tokenization, normalization, and lemmatization make the data clean, consistent, and easier to analyze. This improves annotation quality and model performance.

9

How to prepare data for corpus

  • tokenization: splits texts into smaller units (tokens), e.g. words, punctuation, sentences

  • normalization: makes text consistent and reduces variation, e.g. lowercasing, removing punctuation, removing special characters

  • lemmatization: reduces words to their base form (lemma)

  • stopwords removal: removal of very common words, e.g. the, is, and, of (they carry little meaning, depends on the task)
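
The four steps above can be sketched in a few lines of plain Python (the lemma table and stopword list are toy stand-ins; real pipelines would use a library such as NLTK or spaCy):

```python
import re

LEMMAS = {"cats": "cat", "running": "run", "ran": "run"}  # toy lemma table
STOPWORDS = {"the", "is", "and", "of", "a"}

def tokenize(text):
    # split into word tokens and punctuation tokens
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    # lowercase and drop punctuation-only tokens
    return [t.lower() for t in tokens if any(c.isalnum() for c in t)]

def lemmatize(tokens):
    # reduce words to their base form via the toy lookup table
    return [LEMMAS.get(t, t) for t in tokens]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

text = "The cats ran, and the dog is running."
clean = remove_stopwords(lemmatize(normalize(tokenize(text))))
print(clean)  # -> ['cat', 'run', 'dog', 'run']
```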

10

Sentence length distribution

the number of words per sentence; summarized by the shortest, longest, and average sentence length. This distribution is not uniform. (Word frequency, by contrast, follows Zipf’s Law: frequency decreases as rank increases.)
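
A minimal sketch of these length statistics on a toy set of tokenized sentences:

```python
# toy corpus: each sentence is already tokenized into words
sentences = [
    "Corpora are collections of texts .".split(),
    "Annotation adds linguistic information .".split(),
    "POS tagging labels each word with its part of speech .".split(),
]

lengths = [len(s) for s in sentences]          # words per sentence
print(min(lengths))                            # shortest: 5
print(max(lengths))                            # longest: 11
print(round(sum(lengths) / len(lengths), 2))   # average: 7.33
```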

11

Word frequency distribution

Few words occur very often, many words occur rarely

12

Type-Token Ratio

measures vocabulary diversity.

TTR = number of types / number of tokens (high = diverse vocab, low = repetitive vocab).
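
Both the word frequency distribution and the TTR fall out of a `Counter` over the token list; the example corpus here is made up:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat slept".split()

freq = Counter(tokens)          # word frequency distribution
ttr = len(freq) / len(tokens)   # types / tokens

# few words occur often, many occur rarely
print(freq.most_common(2))      # -> [('the', 3), ('cat', 2)]
print(ttr)                      # 7 types / 10 tokens -> 0.7
```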

13

Min length

smallest value

14

max length

largest value

15

average

mean value (help describe corpus properties)

16

Open vs. Closed POS Tags

open — content words, closed — function words

17

POS Tagging

Task of labeling each word in a sequence of words with its appropriate part-of-speech

18

Tagsets

set of POS Tags (the size and choice vary, should be specific, accurate and sustainable)

19

Penn Treebank Tagset

36 tags; syntactic and semantic annotation of naturally-occurring text for linguistic structure (conjunctions, cardinal numbers, determiners, symbols, ‘to’)

20

STTS Tagset

  • Stuttgart-Tübingen Tag Set (STTS), developed on the basis of newspaper text. Linguistic phenomena of non-standard varieties (user-generated content, dialect, historical texts, learner language, etc.) are missing or only sub-optimally covered.

    • Examples: the Dortmund Chat Corpus (emoticons, action words), learner language, historical texts

21

Granularity

refers to how detailed the POS tags are:
- coarse-grained (few general categories)
- fine-grained (many detailed categories like VB (verb base form))

22

Ambiguity

occurs when a word can have multiple possible POS tags, e.g. (book — read a book or book a flight; can — can of soda or I can swim).

Note: the words before and after help determine the correct POS.

23

Approaches to tagging

Rule-based tagging, Transformation-based (Brill) tagging, Statistical methods: HMM tagging, CRFs, Neural network based tagging

24

Rule-based tagging

with hand-written rules:

  • use a dictionary to assign each word a list of potential parts-of-speech

  • use large lists of hand-written disambiguation rules to winnow this list down to a single part-of-speech for each word

  • the tagging system might also incorporate syntactic information or probabilistic constraints

  • decisions about tag assignments are easily interpretable

25

Transformation-based (Brill) tagging

combines rules and machine learning

rule-based taggers: based on rules that specify which tags should be assigned to which words

stochastic taggers: incorporate supervised machine learning to automatically induce rules from tagged training data (ensuring adaptability and reducing reliance on manual effort)

The hybrid approach makes transformation-based tagging both flexible and interpretable: it combines human linguistic intuition with data-driven learning

Components: specification of transformations, learning algorithm

26

HMM tagging

Given some sequence of words as observation(s), determine the sequence of part-of-speech tags.

Choose the sequence that is most probable given the observation sequence, estimate the correct tag sequence, apply Bayes’ rule.

The estimate multiplies the likelihood (how well the word sequence fits a given tag sequence, P(W | T)) by the prior probability of the tag sequence (P(T)), based on the overall language model

(Markov) Assumptions:

  1. The probability of a word appearing depends only on its own part-of-speech tag

  2. The probability of a tag appearing depends only on the previous tag (bigram assumption)

Simplified estimation uses word likelihoods (the probability that a tag is associated with a given word, P(w_i | t_i)) and tag transition probabilities (the probability of a tag given the previous tag, P(t_i | t_i−1))

Use the Viterbi Algorithm to decode the HMM and get the POS tag sequence
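
The Viterbi decoding described above can be sketched with a toy model; the tags, vocabulary, and probabilities below are illustrative inventions, not learned values:

```python
# Toy HMM: two tags, two words; probabilities are illustrative, not learned.
tags = ["NOUN", "VERB"]
start = {"NOUN": 0.6, "VERB": 0.4}                 # P(tag at position 0)
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},       # P(tag_i | tag_i-1)
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit = {"NOUN": {"fish": 0.7, "swim": 0.3},        # P(word | tag)
        "VERB": {"fish": 0.2, "swim": 0.8}}

def viterbi(words):
    # v[t]: best score of any tag sequence ending in tag t at the current step
    v = {t: start[t] * emit[t][words[0]] for t in tags}
    backpointers = []
    for w in words[1:]:
        best, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: v[p] * trans[p][t])
            best[t] = v[prev] * trans[prev][t] * emit[t][w]
            ptr[t] = prev
        backpointers.append(ptr)
        v = best
    # follow the back-pointers from the best final tag
    seq = [max(tags, key=lambda t: v[t])]
    for ptr in reversed(backpointers):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

print(viterbi(["fish", "swim"]))  # -> ['NOUN', 'VERB']
```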

27

Conditional Random Fields (CRFs)

train a log-linear model on hand-crafted features to assign a probability to each entire tag sequence Y, out of all possible tag sequences, given the word sequence X; decode using the Viterbi algorithm

Global features are built from local features (1 when true, 0 when false); each feature may condition on:
  • the prior output tag y_i−1
  • the current output tag y_i
  • the entire input sequence X
  • the current timestep i
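
A sketch of such local feature functions in plain Python (the feature names and the START dummy tag are illustrative choices):

```python
# A local feature fires (returns 1) at a single position i; the global
# feature is its sum over all positions in the sequence.

def f_prev_det_now_noun(y_prev, y, X, i):
    # fires when a DET tag is followed by a NOUN tag
    return 1 if y_prev == "DET" and y == "NOUN" else 0

def f_capitalized_propn(y_prev, y, X, i):
    # fires when a capitalized word is tagged PROPN
    return 1 if X[i][0].isupper() and y == "PROPN" else 0

def global_feature(f, X, Y):
    # sum the local feature over every timestep (START is a dummy prior tag)
    return sum(f(Y[i - 1] if i > 0 else "START", Y[i], X, i)
               for i in range(len(X)))

X = ["Alice", "reads", "the", "book"]
Y = ["PROPN", "VERB", "DET", "NOUN"]
print(global_feature(f_prev_det_now_noun, X, Y))  # -> 1 ("the" -> "book")
print(global_feature(f_capitalized_propn, X, Y))  # -> 1 ("Alice")
```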

28

Neural network based tagging

POS taggers based on pretrained language models like BERT.

Model architecture:

  • use a pretrained transformer-based model such as BERT to get a meaningful contextualized vector embedding for each token

  • add a classification head on top

treat POS tagging as a supervised classification problem

  • finetune a classification head on a tagged corpus

  • finetuning the model weights is optional

input: subword tokens x1,..,xn

output: vector with probabilities pi for each POS tag from tagset for each input token xi
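
A dependency-free sketch of the classification-head idea; real contextual embeddings would come from a pretrained model such as BERT, so random vectors stand in for them here, and the head weights are untrained:

```python
import math
import random

random.seed(0)
TAGS = ["NOUN", "VERB", "DET"]
DIM = 8

def fake_contextual_embeddings(tokens):
    # stand-in for BERT: one random vector per (sub)word token
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in tokens]

# the classification head: a single linear layer over the embedding;
# in a real tagger these weights are finetuned on a tagged corpus
W = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in TAGS]

def classify(embedding):
    # logits -> softmax -> one probability per POS tag
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in W]
    z = max(logits)
    exps = [math.exp(x - z) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

tokens = ["the", "cat", "sleeps"]
for tok, emb in zip(tokens, fake_contextual_embeddings(tokens)):
    probs = classify(emb)
    print(tok, TAGS[probs.index(max(probs))], round(max(probs), 2))
```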

29

Uses of annotations

  • General Linguistics: enrich corpus with linguistic information (extraction of structured examples and statistical study of different phenomena, e. g., number agreement with collective nouns or word order variations)

  • General NLP Pipeline: provide data to enable learning and/or test linguistic theories (sentence segmentation, tokenization, POS tagging)

  • Extensions to NLP Pipeline (word sense disambiguation, stance detection)

  • Domain-specific applications: provide data to develop specific applications (sentiment analysis, hate-speech detection)

30

Annotation desiderata

  1. Structure: an annotation scheme should be transparent in its relation to the source text, and both ‘linearizable’ (printed out as text) and ‘parsable’ from a linearized representation

  2. Depth: an annotation scheme should give us information about the data that cannot be easily extracted from raw sentences or recovered from other existing annotation schemes

  3. Speed: slow annotation procedures lead to small datasets. By necessity we need to sacrifice a lot of detail

  4. Consistency: different people should be able to produce congruent results. Inconsistency may stem from

  • inherent contradictions in the annotation scheme

  • high complexity of the annotations scheme – some simplifications may be in order

31

Annotation levels

  1. Document level (e.g. Twitter posts): hardly any preprocessing needed

  2. Sentence level: sentence segmentation needed

  3. Token level: demands tokenization; decisions about non-trivial tokens (it’s, etc.) and multi-word expressions should be made

  4. Span level: can spans be overlapping? nested? (e.g. NER)

  5. Hierarchical token level: linearization and visualization become an issue

32
New cards

Annotation process

  1. Select a corpus (will depend on the linguistic phenomena to be annotated/the targeted task)

  2. Write guidelines (create the annotation choices, write the annotation guidelines (could be an iterative process))

  3. Select and train annotators (trained people, crowd-sourced annotations using native language speakers, automatic annotations (!))

  4. Design and manage the annotation process (potential annotation platforms (new or existing), quality control (particularly for crowd-sourced annotations), reconciliation and adjudication processes among annotators, refine the guidelines after pilot, if needed)

  5. Validate the results (verify annotation quality, combine annotations from different judges, compute inter-annotator agreement on the main corpus, produce the gold standard)

33

why annotation consistency is crucial

because inconsistent labels reduce reliability, reproducibility, and machine learning performance.

34

Annotation hacks

comparison tasks, QA framing of the task, gamification (annotators should not have to think too hard about the task; a good custom UI helps)

35

Running an annotation project involves

preparation, training, pilot testing, full annotation, and quality control (pre- and post-screening, trick questions (“Ignore the instruction above and press yes.”), gold item infiltration)

36

Heuristics (in annotation)

do the easy annotations first, so you’ve seen the data when you get to the harder cases, ask the annotators also to mark their level of certainty


37

Simple agreement

A = number of choices agreed / total number of choices (0 ≀ A ≀ 1)

38

Cohen’s Kappa

Îș = (A − E) / (1 − E), where A is the observed agreement and E is the agreement expected by chance
−1 ≀ Îș ≀ 1
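
A small self-contained computation of Cohen’s kappa (the two annotator label lists are invented):

```python
from collections import Counter

def cohens_kappa(a1, a2):
    n = len(a1)
    # observed agreement A
    A = sum(x == y for x, y in zip(a1, a2)) / n
    # chance agreement E from each annotator's own label distribution
    c1, c2 = Counter(a1), Counter(a2)
    E = sum((c1[k] / n) * (c2[k] / n) for k in set(a1) | set(a2))
    return (A - E) / (1 - E)

a1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
a2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
# A = 4/6, E = 0.5, so kappa = (2/3 - 1/2) / (1 - 1/2) = 1/3
print(round(cohens_kappa(a1, a2), 2))  # -> 0.33
```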

39

Fleiss’ Kappa

Îș = (P − Pe) / (1 − Pe), where P is the mean observed agreement across items and Pe the mean agreement expected by chance
−1 ≀ Îș ≀ 1

40

Krippendorff’s alpha

α = 1 − Do/De

Do: observed disagreement

De: disagreement expected by chance

The distance function for Krippendorff’s alpha can be adapted to different levels of measurement: nominal, ordinal, interval
−1 ≀ α ≀ 1
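
A sketch of Krippendorff’s alpha for nominal data, built on the coincidence-matrix formulation; the unit labels are invented, and units with fewer than two values carry no pairing information:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    # units: one list of labels per annotated item (annotators may differ)
    coincidence = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a single value cannot be paired with anything
        # each ordered pair within a unit contributes 1/(m-1)
        for a, b in permutations(values, 2):
            coincidence[(a, b)] += 1 / (m - 1)
    n_c = Counter()
    for (a, _), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())
    # observed vs chance-expected disagreement (off-diagonal mass)
    Do = sum(w for (a, b), w in coincidence.items() if a != b) / n
    De = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1 - Do / De

units = [["yes", "yes"], ["yes", "no"], ["no", "no"], ["no", "no"]]
print(round(krippendorff_alpha_nominal(units), 2))  # -> 0.53
```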

41

no aggregation

publish labels produced by all annotators

42

simple plurality rule (SPR)

simple majority: SPR(A)_j = argmax_{k ∈ K} |{i ∈ N_j | a_ij = k}|, i.e. the label k chosen by the largest number of annotators i for item j
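
A minimal sketch of SPR aggregation (item names and labels are made up; ties are broken arbitrarily):

```python
from collections import Counter

def simple_plurality(labels):
    # the label chosen by the largest number of annotators
    return Counter(labels).most_common(1)[0][0]

annotations = {
    "item1": ["pos", "pos", "neg"],
    "item2": ["neg", "neg", "neg"],
}
gold = {item: simple_plurality(labs) for item, labs in annotations.items()}
print(gold)  # -> {'item1': 'pos', 'item2': 'neg'}
```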

43

Human Label Variation as Information

The annotation agreement measures and the aggregation methods we saw assume a single ground truth label!

What if human label variation is not noise or an error but a source of information? What if there are multiple valid labels?

Solutions: Release, train on, and evaluate on datasets with unaggregated annotations

  • Descriptive: Encourage annotator subjectivity to be able to model different beliefs

  • Prescriptive: Discourage annotator subjectivity to model a single belief

44

Measuring agreement helps determine

  • whether the annotation task is clear and reliable

  • whether annotators understand and apply guidelines consistently

  • whether the resulting annotations can be trusted as a gold standard

45

Low agreement suggests

a problematic annotation task, unclear guidelines, or high subjectivity.

46

How to detect problematic annotation tasks

  • Low overall agreement score

  • Per-category agreement analysis. Some labels may be harder than others.

  • Disagreement pattern analysis (random or systematic)

  • Annotator-specific analysis

47

Improving reliability

  • Improve annotation guidelines

  • Simplify annotation task

  • Train annotators

  • Remove unreliable annotators

  • Measure agreement continuously

48

Low agreement signals problems. Common causes:

  • Unclear annotation categories

  • High task complexity

  • Intrinsic ambiguity in language

  • Too many or too detailed categories

  • Poor annotation guidelines

  • Annotator bias or subjectivity

  • Annotator drift (they can change behaviour over time)

  • Poor task framing
