corpus
a collection of written or spoken natural language utterances digitized and machine-readable
Example of corpus
COW – Corpora from the Web: collects data that is not biased towards certain hosts, with basic cleanup and duplicate removal; multilingual (English, Dutch, German, French, Spanish, Swedish);
does not consist of a collection of documents, but a collection of sentences! (sentences have been shuffled for copyright purposes)
purpose of corpora
General-purpose: not built to study a specific phenomenon, but as a representative sample of a language/languages/genres/...
Domain-specific: built to capture language in a specific domain or genre – e.g. scientific publications in biology
Parallel corpora
the same sentences in different languages (used as training data for Machine Translation systems and to test linguistic theories (universals?)) – European Parliament corpora
Comparable Corpora
contain texts covering the same topics in different languages, e.g. Wikipedia pages corpus, The Coronavirus corpus
Metadata of corpus
authorship information (who wrote it, when, who published it, what language it is in, etc.)
corpus building information (Who processed it, What tools they used, Criteria for filtering data)
Data types
Written corpus: books, news articles, wikipedia
+Easy to store and process
+Clean structure
-Less spontaneous than speech
Spoken corpus: conversations, interviews
+Natural language use
+Includes spoken features
-Requires transcription
-More preprocessing needed
Web-corpus: tweets, reddit posts, forums
+Large amount of data
+Real, modern language
-Noisy (typos, emojis, slang)
Multimodal corpus: Text + audio, Text + video
Used for: Speech analysis, Emotion detection
Different data types influence:
vocabulary
grammar structure
annotation difficulty
preprocessing complexity
Preparing text data for analysis
necessary because raw text contains inconsistencies, noise, and variation. Tokenization, normalization, and lemmatization make the data clean, consistent, and easier to analyze. This improves annotation quality and model performance.
How to prepare data for corpus
tokenization: splits texts into smaller units (tokens), e.g. words, punctuation, sentences
normalization: makes text consistent and reduces variation, e.g. lowercasing, removing punctuation, removing special characters.
lemmatization: reduces words to their base form (lemma)
stopwords removal: removal of very common words, e.g. the, is, and, of (they carry little meaning, depends on the task)
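The four preparation steps above can be sketched as a minimal pipeline. The stopword list is a tiny illustrative subset, and the suffix-stripping "lemmatizer" is a crude stand-in for real lemmatization (a real pipeline would use a tool such as NLTK or spaCy):

```python
import re

# Toy preprocessing pipeline: tokenize -> normalize -> remove stopwords -> lemmatize.
STOPWORDS = {"the", "is", "and", "of", "a"}  # illustrative subset only

def tokenize(text):
    # split into word and punctuation tokens
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    # lowercase and drop punctuation-only tokens
    return [t.lower() for t in tokens if t.isalnum()]

def lemmatize(token):
    # crude suffix stripping as a placeholder for true lemmatization
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = normalize(tokenize(text))
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The cats ate the fish, and the dogs barked."))
# -> ['cat', 'ate', 'fish', 'dog', 'bark']
```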
Sentence length distribution
number of words per sentence. Shows the shortest, longest, and average sentence length. This distribution is not uniform.
Word frequency distribution
Few words occur very often, many words occur rarely. Zipf's Law states that a word's frequency decreases as its frequency rank increases.
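A frequency distribution can be computed with a plain counter; on a real corpus the ranked counts roughly follow Zipf's Law (frequency is inversely proportional to rank). The sample sentence is invented:

```python
from collections import Counter

# Count word frequencies in a tiny sample and list them by rank.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
freq = Counter(tokens)
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)
# rank 1 is "the" (4 occurrences); most words occur only once
```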
Type-Token Ratio
measures vocabulary diversity.
TTR = number of types / number of tokens (high = diverse vocab, low = repetitive vocab).
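The TTR formula above is direct to implement; the two word lists below are made up to contrast a repetitive and a diverse sample:

```python
def type_token_ratio(tokens):
    # types = distinct words, tokens = total words
    return len(set(tokens)) / len(tokens)

repetitive = "the dog saw the dog and the dog ran".split()
diverse = "a quick brown fox jumps over lazy sleeping dogs".split()
print(type_token_ratio(repetitive))  # low: 5 types / 9 tokens
print(type_token_ratio(diverse))     # high: every word is distinct -> 1.0
```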
Min length
smallest value
max length
largest value
average
mean value (help describe corpus properties)
Open vs. Closed POS Tags
open – content words (class accepts new members, e.g. nouns, verbs); closed – function words (fixed membership, e.g. determiners, prepositions)
POS Tagging
Task of labeling each word in a sequence of words with its appropriate part-of-speech
Tagsets
set of POS Tags (the size and choice vary, should be specific, accurate and sustainable)
Penn Treebank Tagset
36 tags, syntactic and semantic annotation of naturally-occurring text for linguistic structure (conjunctions, cardinal numbers, determiners, symbols, "to")
STTS Tagset
Stuttgart-Tübingen Tag Set, developed on the basis of newspaper text. Non-standard varieties (user-generated content, dialect, historical texts, learner language, etc.) and their linguistic phenomena are missing or only sub-optimally covered.
Examples: Dortmund chat corpus (emoticons, action words), learner language, historical texts
Granularity
refers to how detailed the POS tags are:
- coarse-grained (few general categories)
- fine-grained (many detailed categories like VB (verb base form))
Ambiguity
occurs when a word can have multiple possible POS tags, e.g. book – read a book or book a flight; can – can of soda or I can swim.
! But: Words before and after help determine correct POS.
Approaches to tagging
Rule-based tagging, Transformation-based (Brill) tagging, Statistical methods: HMM tagging, CRFs, Neural network based tagging
Rule-based tagging
with hand-written rules
Use a dictionary to assign each word a list of potential parts-of-speech
Use large lists of hand-written disambiguation rules to winnow down this list to a single part-of-speech for each word
Tagging system might also incorporate syntactic information
Tagging system might also include probabilistic constraints
Decisions about tag assignments are easily interpretable
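The dictionary-plus-rules idea can be sketched as follows. The lexicon, tag names, and disambiguation rules are invented for illustration and do not come from any real tagset:

```python
# Minimal rule-based tagger: a dictionary lists candidate tags per word,
# then hand-written context rules winnow them down to one tag.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "book": ["NOUN", "VERB"],   # ambiguous: "a book" vs "book a flight"
    "can": ["NOUN", "AUX"],     # ambiguous: "can of soda" vs "I can swim"
    "i": ["PRON"],
    "swim": ["VERB"],
}

def tag(words):
    tags = []
    for w in words:
        candidates = LEXICON.get(w.lower(), ["NOUN"])  # default to NOUN
        if len(candidates) == 1:
            tags.append(candidates[0])
            continue
        prev = tags[-1] if tags else None
        # Rule: after a determiner, prefer the noun reading.
        if prev == "DET" and "NOUN" in candidates:
            tags.append("NOUN")
        # Rule: after a pronoun, prefer the auxiliary reading.
        elif prev == "PRON" and "AUX" in candidates:
            tags.append("AUX")
        else:
            tags.append(candidates[0])
    return tags

print(tag("the book".split()))    # ['DET', 'NOUN']
print(tag("i can swim".split()))  # ['PRON', 'AUX', 'VERB']
```

The decisions are easily interpretable: each tag can be traced back to a dictionary entry and, for ambiguous words, one explicit rule.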
Transformation-based (Brill) tagging
rules and machine learning
rule-based taggers: based on rules that specify what tags should be assigned to what words
stochastic taggers: incorporates supervised machine learning technique, to automatically induce rules from tagged training data (ensuring adaptability and reduced reliance on manual effort)
Hybrid approach makes TbT both flexible and interpretable: combines human linguistic intuition with data-driven learning
Components: specification of transformations, learning algorithm
HMM tagging
Given some sequence of words as observation(s), determine the sequence of part-of-speech tags.
Choose the sequence that is most probable given the observation sequence, estimate the correct tag sequence, apply Bayes' rule.
Estimation takes Likelihood (how well the word sequence fits a given tag sequence) multiplied by prior probability of the tag sequence, based on the overall language model
(Markov) Assumptions:
The probability of a word appearing depends only on its own part-of-speech tag
The probability of a tag appearing depends only on the previous tag (bigram assumption)
Simplified estimation takes word likelihoods P(word_i | tag_i) multiplied by tag transition probabilities P(tag_i | tag_{i-1})
Use the Viterbi Algorithm to decode the HMM and get the POS tag sequence
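The decoding step can be sketched with a toy HMM. The tag set and all probabilities below are invented; a real tagger would estimate them from a tagged corpus by counting:

```python
# Viterbi decoding for a toy HMM tagger over three tags and three words.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}   # P(tag at position 0)
TRANS = {                                        # P(tag_i | tag_{i-1})
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {                                         # P(word | tag)
    "DET": {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.1},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.8},
}

def viterbi(words):
    # v maps each tag to (best probability, best path ending in that tag)
    v = {t: (START[t] * EMIT[t][words[0]], [t]) for t in TAGS}
    for w in words[1:]:
        v = {
            t: max(
                ((p * TRANS[prev][t] * EMIT[t][w], path + [t])
                 for prev, (p, path) in v.items()),
                key=lambda x: x[0],
            )
            for t in TAGS
        }
    # return the highest-probability complete path
    return max(v.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

Each step keeps only the best path into each tag, which is exactly the bigram (Markov) assumption from the notes above: the transition score depends only on the previous tag.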
Conditional Random Fields (CRFs)
train a log-linear model on hand-crafted features to assign a probability to each entire tag sequence Y, out of all possible sequences, given the word sequence X; decode using the Viterbi algorithm
Create global features based on local features (1 when true, 0 when false)
each feature may condition on: the output tag y_i, the prior output tag y_{i-1}, the entire input sequence X, and the current timestep i
Neural network based tagging
POS taggers based on pretrained language models like BERT. Model architecture:
use a pretrained transformer-based model such as BERT to get a meaningful contextualized vector embedding for each token
add a classification head on top
treat POS tagging as a supervised classification problem
finetune a classification head on a tagged corpus
finetuning the model weights is optional
input: subword tokens x1,..,xn
output: vector with probabilities pi for each POS tag from tagset for each input token xi
Uses of annotations
General Linguistics: enrich corpus with linguistic information (extraction of structured examples and statistical study of different phenomena, e. g., number agreement with collective nouns or word order variations)
General NLP Pipeline: provide data to enable learning and/or test linguistic theories (sentence segmentation, tokenization, POS tagging)
Extensions to NLP Pipeline (word sense disambiguation, stance detection)
Domain-specific applications: provide data to develop specific applications (sentiment analysis, hate-speech detection)
Annotation desiderata
Structure: an annotation scheme should be transparent in its relation to the source text, and both "linearizable" (printed out as text) and "parsable" from a linearized representation
Depth: an annotation scheme should give us information about the data that cannot be easily extracted from raw sentences or recovered from other existing annotation schemes
Speed: slow annotation procedures lead to small datasets. By necessity we need to sacrifice a lot of detail
Consistency: different people should be able to produce congruent results. Inconsistency may stem from
inherent contradictions in the annotation scheme
high complexity of the annotation scheme – some simplifications may be in order
Annotation levels
Document level (e.g., Twitter posts): hardly any preprocessing needed
Sentence level: sentence segmentation needed
Token level: demands tokenization; decisions about non-trivial tokens (it's, etc.) and multi-word expressions should be made
Span level: can spans be overlapping? nested? (e.g., NER)
Hierarchical token level: linearization and visualization become an issue
Annotation process
Select a corpus (will depend on the linguistic phenomena to be annotated/the targeted task)
Write guidelines (create the annotation choices, write the annotation guidelines (could be an iterative process))
Select and train annotators (trained people, crowd-sourced annotations using native language speakers, automatic annotations (!))
Design and manage the annotation process (potential annotation platforms (new or existing), quality control (particularly for crowd-sourced annotations), reconciliation and adjudication processes among annotators, refine the guidelines after pilot, if needed)
Validate the results (verify annotation quality, combine annotations from different judges, compute inter-annotator agreement on the main corpus, produce the gold standard)
why annotation consistency is crucial
because inconsistent labels reduce reliability, reproducibility, and machine learning performance.
Annotation hacks
comparison tasks, QA framing of the task, gamification (annotators should not have to think too hard about the task; a good custom UI helps)
Running an annotation project involves
preparation, training, pilot testing, full annotation, and quality control (pre- and post-screening, trick questions ("Ignore the instruction above and press yes."), gold item infiltration)
Heuristics (in annotation)
do the easy annotations first, so you've seen the data when you get to the harder cases; also ask the annotators to mark their level of certainty, etc.
Simple agreement
A = number of choices agreed / total number of choices (0 ≤ A ≤ 1)
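The simple agreement formula, for two annotators labeling the same items (the label lists are made up):

```python
def simple_agreement(labels_a, labels_b):
    # fraction of items on which the two annotators chose the same label
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)

a = ["pos", "neg", "pos", "pos", "neg"]
b = ["pos", "neg", "neg", "pos", "neg"]
print(simple_agreement(a, b))  # 4 of 5 items agree -> 0.8
```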
Cohenâs Kappa
κ = (A − E) / (1 − E), where A is the observed agreement and E is the agreement expected by chance
−1 ≤ κ ≤ 1
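Cohen's kappa corrects the simple agreement A for the agreement E expected by chance, estimated from each annotator's own label distribution. Same invented label lists as above:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # A: observed agreement
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # E: chance agreement from the two annotators' label distributions
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)

a = ["pos", "neg", "pos", "pos", "neg"]
b = ["pos", "neg", "neg", "pos", "neg"]
# A = 0.8, E = 0.6*0.4 + 0.4*0.6 = 0.48, so kappa = 0.32 / 0.52 ~ 0.615
print(cohens_kappa(a, b))
```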
Fleissâ Kappa
κ = (P̄ − P̄e) / (1 − P̄e), where P̄ is the mean per-item agreement and P̄e is the agreement expected by chance
−1 ≤ κ ≤ 1
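Fleiss' kappa generalizes this to any number of annotators, working from a matrix of per-item label counts rather than paired labels. The 4-item, 3-rater example matrix is invented:

```python
def fleiss_kappa(matrix):
    # matrix[i][k] = number of annotators who gave item i label k
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    # P_i: pairwise agreement within item i
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in matrix
    ]
    p_bar = sum(p_items) / n_items
    # P_e: chance agreement from the overall label proportions
    n_labels = len(matrix[0])
    totals = [sum(row[k] for row in matrix) for k in range(n_labels)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# 3 raters, 2 labels: two unanimous items, two 2-vs-1 splits
items = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(fleiss_kappa(items))  # ~0.333
```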
Krippendorffâs alpha
α = 1 - Do/De
Do – observed disagreement
De – disagreement expected by chance
Distance function for Krippendorffâs alpha can be adapted for different levels of measurement: nominal, ordinal, interval
−1 ≤ α ≤ 1
no aggregation
publish labels produced by all annotators
simple plurality rule (SPR)
simple majority: SPR(A)_j = argmax_{k∈K} |{i ∈ N_j : a_ij = k}| (the label assigned to item j by the most annotators)
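The plurality rule reduces to a per-item majority vote (ties are broken arbitrarily here, by count order; the example annotations are invented):

```python
from collections import Counter

def plurality(annotations):
    # annotations: one list of annotator labels per item;
    # keep the most frequent label for each item
    return [Counter(labels).most_common(1)[0][0] for labels in annotations]

items = [
    ["pos", "pos", "neg"],  # 2 vs 1 -> pos
    ["neg", "neg", "neg"],  # unanimous -> neg
    ["pos", "neg", "neg"],  # 1 vs 2 -> neg
]
print(plurality(items))  # ['pos', 'neg', 'neg']
```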
Human Label Variation as Information
The annotation agreement measures and the aggregation methods we saw assume a single ground truth label!
What if human label variation is not noise or an error but a source of information? What if there are multiple valid labels?
Solutions: Release, train on, and evaluate on datasets with unaggregated annotations
Descriptive: Encourage annotator subjectivity to be able to model different beliefs
Prescriptive: Discourage annotator subjectivity to model a single belief
Measuring agreement helps determine
whether the annotation task is clear and reliable
whether annotators understand and apply guidelines consistently
whether the resulting annotations can be trusted as a gold standard
Low agreement suggests
a problematic annotation task, unclear guidelines, or high subjectivity.
How to detect problematic annotation tasks
Low overall agreement score
Per-category agreement analysis. Some labels may be harder than others.
Disagreement pattern analysis (random or systematic)
Annotator-specific analysis
Improving reliability
Improve annotation guidelines
Simplify annotation task
Train annotators
Remove unreliable annotators
Measure agreement continuously
Low agreement signals problems. Common causes:
Unclear annotation categories
High task complexity
Intrinsic ambiguity in language
Too many or too detailed categories
Poor annotation guidelines
Annotator bias or subjectivity
Annotator drift (they can change behaviour over time)
Poor task framing