corpus
a collection of written or spoken natural language utterances digitized and machine-readable
Example of corpus
COW – Corpora from the Web: collects data that is not biased towards certain hosts, with basic cleanup and duplicate removal; multilingual (English, Dutch, German, French, Spanish, Swedish);
does not consist of a collection of documents, but a collection of sentences! (sentences have been shuffled for copyright purposes)
purpose of corpora
General-purpose: not built to study a specific phenomenon, but as a representative sample of a language/languages/genres/...
Domain-specific: built to capture language in a specific domain or genre – e.g. scientific publications in biology
Parallel corpora
the same sentences in different languages (used as training data for Machine Translation systems and to test linguistic theories (universals?)) – European Parliament corpora
Comparable Corpora
contain texts covering the same topics in different languages, e.g. Wikipedia pages corpus, The Coronavirus corpus
Metadata of corpus
authorship information (who wrote it, when, who published it, what language it is in, etc.)
corpus building information (Who processed it, What tools they used, Criteria for filtering data)
Data types
Written corpus: books, news articles, wikipedia
+Easy to store and process
+Clean structure
-Less spontaneous than speech
Spoken corpus: conversations, interviews
+Natural language use
+Includes spoken features
-Requires transcription
-More preprocessing needed
Web-corpus: tweets, reddit posts, forums
+Large amount of data
+Real, modern language
-Noisy (typos, emojis, slang)
Multimodal corpus: Text + audio, Text + video
Used for: Speech analysis, Emotion detection
Different data types influence:
vocabulary
grammar structure
annotation difficulty
preprocessing complexity
Preparing text data for analysis
necessary because raw text contains inconsistencies, noise, and variation. Tokenization, normalization, and lemmatization make the data clean, consistent, and easier to analyze. This improves annotation quality and model performance.
How to prepare data for corpus
tokenization: splits texts into smaller units (tokens), e.g. words, punctuation, sentences
normalization: makes text consistent and reduces variation, e.g. lowercasing, removing punctuation, removing special characters.
lemmatization: reduces words to their base form (lemma)
stopwords removal: removal of very common words, e.g. the, is, and, of (they carry little meaning, depends on the task)
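The four preparation steps above can be sketched as a minimal pipeline. The stopword list is a tiny illustrative subset, and the suffix-stripping "lemmatizer" is a crude stand-in for real lemmatization (a real pipeline would use a tool such as NLTK or spaCy):

```python
import re

# Toy preprocessing pipeline: tokenize -> normalize -> remove stopwords -> lemmatize.
STOPWORDS = {"the", "is", "and", "of", "a"}  # illustrative subset only

def tokenize(text):
    # split into word and punctuation tokens
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    # lowercase and drop punctuation-only tokens
    return [t.lower() for t in tokens if t.isalnum()]

def lemmatize(token):
    # crude suffix stripping as a placeholder for true lemmatization
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = normalize(tokenize(text))
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The cats ate the fish, and the dogs barked."))
# -> ['cat', 'ate', 'fish', 'dog', 'bark']
```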
Sentence length distribution
number of words per sentence. Shows the shortest, longest, and average sentence length. This distribution is not uniform.
Word frequency distribution
Few words occur very often, many words occur rarely. Zipf's Law states that a word's frequency decreases as its frequency rank increases.
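A frequency distribution can be computed with a plain counter; on a real corpus the ranked counts roughly follow Zipf's Law (frequency is inversely proportional to rank). The sample sentence is invented:

```python
from collections import Counter

# Count word frequencies in a tiny sample and list them by rank.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
freq = Counter(tokens)
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)
# rank 1 is "the" (4 occurrences); most words occur only once
```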
Type-Token Ratio
measures vocabulary diversity.
TTR = number of types / number of tokens (high = diverse vocab, low = repetitive vocab).
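The TTR formula above is direct to implement; the two word lists below are made up to contrast a repetitive and a diverse sample:

```python
def type_token_ratio(tokens):
    # types = distinct words, tokens = total words
    return len(set(tokens)) / len(tokens)

repetitive = "the dog saw the dog and the dog ran".split()
diverse = "a quick brown fox jumps over lazy sleeping dogs".split()
print(type_token_ratio(repetitive))  # low: 5 types / 9 tokens
print(type_token_ratio(diverse))     # high: every word is distinct -> 1.0
```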
Min length
smallest value
max length
largest value
average
mean value (help describe corpus properties)
Open vs. Closed POS Tags
open – content words (class accepts new members, e.g. nouns, verbs); closed – function words (fixed membership, e.g. determiners, prepositions)
POS Tagging
Task of labeling each word in a sequence of words with its appropriate part-of-speech
Tagsets
set of POS Tags (the size and choice vary, should be specific, accurate and sustainable)
Penn Treebank Tagset
36 tags, syntactic and semantic annotation of naturally-occurring text for linguistic structure (conjunctions, cardinal numbers, determiners, symbols, "to")
STTS Tagset
Stuttgart-Tübingen Tag Set, developed on the basis of newspaper text. Non-standard varieties (user-generated content, dialect, historical texts, learner language, etc.) and their linguistic phenomena are missing or only sub-optimally covered.
Examples: Dortmund chat corpus (emoticons, action words), learner language, historical texts
Granularity
refers to how detailed the POS tags are:
- coarse-grained (few general categories)
- fine-grained (many detailed categories like VB (verb base form))
Ambiguity
occurs when a word can have multiple possible POS tags, e.g. book – read a book or book a flight; can – can of soda or I can swim.
! But: Words before and after help determine correct POS.
Approaches to tagging
Rule-based tagging, Transformation-based (Brill) tagging, Statistical methods: HMM tagging, CRFs, Neural network based tagging
Rule-based tagging
with hand-written rules
Use a dictionary to assign each word a list of potential parts-of-speech
Use large lists of hand-written disambiguation rules to winnow down this list to a single part-of-speech for each word
Tagging system might also incorporate syntactic information
Tagging system might also include probabilistic constraints
Decisions about tag assignments are easily interpretable
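The dictionary-plus-rules idea can be sketched as follows. The lexicon, tag names, and disambiguation rules are invented for illustration and do not come from any real tagset:

```python
# Minimal rule-based tagger: a dictionary lists candidate tags per word,
# then hand-written context rules winnow them down to one tag.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "book": ["NOUN", "VERB"],   # ambiguous: "a book" vs "book a flight"
    "can": ["NOUN", "AUX"],     # ambiguous: "can of soda" vs "I can swim"
    "i": ["PRON"],
    "swim": ["VERB"],
}

def tag(words):
    tags = []
    for w in words:
        candidates = LEXICON.get(w.lower(), ["NOUN"])  # default to NOUN
        if len(candidates) == 1:
            tags.append(candidates[0])
            continue
        prev = tags[-1] if tags else None
        # Rule: after a determiner, prefer the noun reading.
        if prev == "DET" and "NOUN" in candidates:
            tags.append("NOUN")
        # Rule: after a pronoun, prefer the auxiliary reading.
        elif prev == "PRON" and "AUX" in candidates:
            tags.append("AUX")
        else:
            tags.append(candidates[0])
    return tags

print(tag("the book".split()))    # ['DET', 'NOUN']
print(tag("i can swim".split()))  # ['PRON', 'AUX', 'VERB']
```

The decisions are easily interpretable: each tag can be traced back to a dictionary entry and, for ambiguous words, one explicit rule.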
Transformation-based (Brill) tagging
rules and machine learning
rule-based taggers: based on rules that specify what tags should be assigned to what words
stochastic taggers: incorporates supervised machine learning technique, to automatically induce rules from tagged training data (ensuring adaptability and reduced reliance on manual effort)
Hybrid approach makes TbT both flexible and interpretable: combines human linguistic intuition with data-driven learning
Components: specification of transformations, learning algorithm
HMM tagging
Given some sequence of words as observation(s), determine the sequence of part-of-speech tags.
Choose the sequence that is most probable given the observation sequence, estimate the correct tag sequence, apply Bayes' rule.
Estimation takes Likelihood (how well the word sequence fits a given tag sequence) multiplied by prior probability of the tag sequence, based on the overall language model
(Markov) Assumptions:
The probability of a word appearing depends only on its own part-of-speech tag
The probability of a tag appearing depends only on the previous tag (bigram assumption)
Simplified estimation takes word likelihoods P(word_i | tag_i) multiplied by tag transition probabilities P(tag_i | tag_{i-1})
Use the Viterbi Algorithm to decode the HMM and get the POS tag sequence
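The decoding step can be sketched with a toy HMM. The tag set and all probabilities below are invented; a real tagger would estimate them from a tagged corpus by counting:

```python
# Viterbi decoding for a toy HMM tagger over three tags and three words.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}   # P(tag at position 0)
TRANS = {                                        # P(tag_i | tag_{i-1})
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {                                         # P(word | tag)
    "DET": {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.1},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.8},
}

def viterbi(words):
    # v maps each tag to (best probability, best path ending in that tag)
    v = {t: (START[t] * EMIT[t][words[0]], [t]) for t in TAGS}
    for w in words[1:]:
        v = {
            t: max(
                ((p * TRANS[prev][t] * EMIT[t][w], path + [t])
                 for prev, (p, path) in v.items()),
                key=lambda x: x[0],
            )
            for t in TAGS
        }
    # return the highest-probability complete path
    return max(v.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Each step keeps only the best path into each tag, which is exactly the bigram (Markov) assumption from the notes above: the transition score depends only on the previous tag.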
Conditional Random Fields (CRFs)
train a log-linear model on hand-crafted features to assign a probability to each entire tag sequence Y, out of all possible sequences, given the word sequence X; decode using the Viterbi algorithm
Create global features based on local features (1 when true, 0 when false)
each feature may condition on: the output tag y_i, the prior output tag y_{i-1}, the entire input sequence X, and the current timestep i
Neural network based tagging
POS taggers based on pretrained language models like BERT. Model architecture:
use a pretrained transformer-based model such as BERT to get a meaningful contextualized vector embedding for each token
add a classification head on top
treat POS tagging as a supervised classification problem
finetune a classification head on a tagged corpus
finetuning the model weights is optional
input: subword tokens x1,..,xn
output: vector with probabilities pi for each POS tag from tagset for each input token xi
Uses of annotations
General Linguistics: enrich corpus with linguistic information (extraction of structured examples and statistical study of different phenomena, e. g., number agreement with collective nouns or word order variations)
General NLP Pipeline: provide data to enable learning and/or test linguistic theories (sentence segmentation, tokenization, POS tagging)
Extensions to NLP Pipeline (word sense disambiguation, stance detection)
Domain-specific applications: provide data to develop specific applications (sentiment analysis, hate-speech detection)
Annotation desiderata
Structure: an annotation scheme should be transparent in its relation to the source text, and both "linearizable" (printed out as text) and "parsable" from a linearized representation
Depth: an annotation scheme should give us information about the data that cannot be easily extracted from raw sentences or recovered from other existing annotation schemes
Speed: slow annotation procedures lead to small datasets. By necessity we need to sacrifice a lot of detail
Consistency: different people should be able to produce congruent results. Inconsistency may stem from
inherent contradictions in the annotation scheme
high complexity of the annotation scheme – some simplifications may be in order
Annotation levels
Document level (e.g., Twitter posts): hardly any preprocessing needed
Sentence level: sentence segmentation needed
Token level: demands tokenization; decisions about non-trivial tokens (it's, etc.) and multi-word expressions should be made
Span level: can spans be overlapping? nested? (e.g., NER)
Hierarchical token level: linearization and visualization become an issue
Annotation process
Select a corpus (will depend on the linguistic phenomena to be annotated/the targeted task)
Write guidelines (create the annotation choices, write the annotation guidelines (could be an iterative process))
Select and train annotators (trained people, crowd-sourced annotations using native language speakers, automatic annotations (!))
Design and manage the annotation process (potential annotation platforms (new or existing), quality control (particularly for crowd-sourced annotations), reconciliation and adjudication processes among annotators, refine the guidelines after pilot, if needed)
Validate the results (verify annotation quality, combine annotations from different judges, compute inter-annotator agreement on the main corpus, produce the gold standard)
why annotation consistency is crucial
because inconsistent labels reduce reliability, reproducibility, and machine learning performance.
Annotation hacks
comparison tasks, QA framing of the task, gamification (annotators should not have to think too hard about the task; a good custom UI helps)
Running an annotation project involves
preparation, training, pilot testing, full annotation, and quality control (pre- and post-screening, trick questions ("Ignore the instruction above and press yes."), gold item infiltration)
Heuristics (in annotation)
do the easy annotations first, so you've seen the data when you get to the harder cases; also ask the annotators to mark their level of certainty, etc.
Simple agreement
A = number of choices agreed / total number of choices (0 ≤ A ≤ 1)
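The simple agreement formula, for two annotators labeling the same items (the label lists are made up):

```python
def simple_agreement(labels_a, labels_b):
    # fraction of items on which the two annotators chose the same label
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)

a = ["pos", "neg", "pos", "pos", "neg"]
b = ["pos", "neg", "neg", "pos", "neg"]
print(simple_agreement(a, b))  # 4 of 5 items agree -> 0.8
```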
Cohenâs Kappa
κ = (A − E) / (1 − E), where A is the observed agreement and E is the agreement expected by chance
−1 ≤ κ ≤ 1
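Cohen's kappa corrects the simple agreement A for the agreement E expected by chance, estimated from each annotator's own label distribution. Same invented label lists as above:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # A: observed agreement
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # E: chance agreement from the two annotators' label distributions
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)

a = ["pos", "neg", "pos", "pos", "neg"]
b = ["pos", "neg", "neg", "pos", "neg"]
# A = 0.8, E = 0.6*0.4 + 0.4*0.6 = 0.48, so kappa = 0.32 / 0.52 ~ 0.615
print(cohens_kappa(a, b))
```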
Fleissâ Kappa
κ = (P̄ − P̄e) / (1 − P̄e), where P̄ is the mean per-item agreement and P̄e is the agreement expected by chance
−1 ≤ κ ≤ 1
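Fleiss' kappa generalizes this to any number of annotators, working from a matrix of per-item label counts rather than paired labels. The 4-item, 3-rater example matrix is invented:

```python
def fleiss_kappa(matrix):
    # matrix[i][k] = number of annotators who gave item i label k
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    # P_i: pairwise agreement within item i
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in matrix
    ]
    p_bar = sum(p_items) / n_items
    # P_e: chance agreement from the overall label proportions
    n_labels = len(matrix[0])
    totals = [sum(row[k] for row in matrix) for k in range(n_labels)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# 3 raters, 2 labels: two unanimous items, two 2-vs-1 splits
items = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(items))  # ~0.333
```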
Krippendorffâs alpha
α = 1 - Do/De
Do – observed disagreement
De – disagreement expected by chance
Distance function for Krippendorffâs alpha can be adapted for different levels of measurement: nominal, ordinal, interval
−1 ≤ α ≤ 1
no aggregation
publish labels produced by all annotators
simple plurality rule (SPR)
simple majority: SPR(A)_j = argmax_{k∈K} |{i ∈ N_j : a_ij = k}| (the label assigned to item j by the most annotators)
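The plurality rule reduces to a per-item majority vote (ties are broken arbitrarily here, by count order; the example annotations are invented):

```python
from collections import Counter

def plurality(annotations):
    # annotations: one list of annotator labels per item;
    # keep the most frequent label for each item
    return [Counter(labels).most_common(1)[0][0] for labels in annotations]

items = [
    ["pos", "pos", "neg"],  # 2 vs 1 -> pos
    ["neg", "neg", "neg"],  # unanimous -> neg
    ["pos", "neg", "neg"],  # 1 vs 2 -> neg
]
print(plurality(items))  # ['pos', 'neg', 'neg']
```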
Human Label Variation as Information
The annotation agreement measures and the aggregation methods we saw assume a single ground truth label!
What if human label variation is not noise or an error but a source of information? What if there are multiple valid labels?
Solutions: Release, train on, and evaluate on datasets with unaggregated annotations
Descriptive: Encourage annotator subjectivity to be able to model different beliefs
Prescriptive: Discourage annotator subjectivity to model a single belief
Measuring agreement helps determine
whether the annotation task is clear and reliable
whether annotators understand and apply guidelines consistently
whether the resulting annotations can be trusted as a gold standard
Low agreement suggests
a problematic annotation task, unclear guidelines, or high subjectivity.
How to detect problematic annotation tasks
Low overall agreement score
Per-category agreement analysis. Some labels may be harder than others.
Disagreement pattern analysis (random or systematic)
Annotator-specific analysis
Improving reliability
Improve annotation guidelines
Simplify annotation task
Train annotators
Remove unreliable annotators
Measure agreement continuously
Low agreement signals problems. Common causes:
Unclear annotation categories
High task complexity
Intrinsic ambiguity in language
Too many or too detailed categories
Poor annotation guidelines
Annotator bias or subjectivity
Annotator drift (they can change behaviour over time)
Poor task framing