Annotation for Machine Learning Lecture 2: Annotation model and guidelines, inter-annotator agreement

Course Progress

  • Week 1: Introduction to machine learning and annotation + Labs
  • Week 2: No lecture (Easter Monday) but labs as scheduled!
  • Week 3: Annotation model, guidelines, inter-annotator agreement + Labs
  • Week 4: No lecture (Liberation Day) but labs as scheduled!
  • Week 5: Setting up an ML experiment + Labs
  • Week 6: ML classifiers + Labs
  • Week 7: Testing, evaluation, revision and reporting but no labs (Ascension Day)
  • Week 8: Wrap up, Exam Prep, Q&A

Assignments

  • Assignment 1 (ungraded): Setting up your working environment
  • Assignment 2: Annotations and guidelines (done in pairs)
  • Assignment 3: Article review / reflection
  • Assignment 4: Annotation with a model + ML experiment
  • Assignment 5: Comparing models
  • Raise your hand if you still have no partner for the annotation assignment

Submitting Assignments

  • Keep submitting assignments
  • Assignment 1 is ungraded but compulsory; the other assignments are graded
  • To access the exam:
    • All assignments must be submitted
    • Average grade of at least 5.5 for the assignments

Last Lecture

  • Characteristics of learning
  • Examples of NLP tasks
  • Levels of linguistic description

Last Lecture: Recap

  • Discuss the following with a person sitting next to you for 3 minutes:
    1. How does a machine learning system differ from a rule-based system?
    2. Give 2 examples of NLP tasks (and describe what they are)
    3. Name (at least) 5 levels of linguistic description

Today's Lecture

  • Annotation development cycle
  • Annotation in practice
  • Inter-annotator agreement

Annotation Development Cycle

  • MATTER cycle

MATTER Cycle

  • Model
  • Annotate
  • Train
  • Test
  • Evaluate
  • Revise

MATTER Cycle Explained

  • Model the phenomenon
  • Annotate the data
  • Train the algorithms
  • Test them on unseen data
  • Evaluate the results
  • Revise the model and algorithms

MATTER: Model the Phenomenon

  • "Model" here is a conceptual model, not an ML algorithm.

MATTER: Model the Phenomenon

  • Model $M = \{T, R, I\}$
    • Term vocabulary, $T$
    • Relations between terms, $R$
    • Interpretation of terms, $I$
  • $T = \{\text{Document type}, \text{Spam}, \text{Non-spam}\}$
  • $R = \{\text{Document type} = \text{Spam} \mid \text{Non-spam}\}$
  • $I = \{\text{Spam} = \text{"text we don't want to keep"}, \text{Non-spam} = \text{"text we want to keep"}\}$
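
As a rough sketch, the spam model above could be written down as a small Python data structure. The variable names, the dictionary layout and the helper `is_valid_label` are illustrative assumptions, not part of the MATTER definition.

```python
# Hypothetical encoding of the model M = {T, R, I} for spam detection.
T = {"Document type", "Spam", "Non-spam"}        # term vocabulary
R = {"Document type": {"Spam", "Non-spam"}}      # relation: a document type is one of these
I = {                                            # interpretation of each label
    "Spam": "text we don't want to keep",
    "Non-spam": "text we want to keep",
}
M = {"T": T, "R": R, "I": I}


def is_valid_label(label):
    """Return True if the label is an allowed value of 'Document type'."""
    return label in M["R"]["Document type"]


print(is_valid_label("Spam"))      # True
print(is_valid_label("Phishing"))  # False
```

Writing the model down explicitly like this makes it easy to reject labels that are not part of the scheme.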

MATTER: Annotate the Data

  • The actual annotation process
  • Doesn't always work well from the first attempt
  • May need to do multiple iterations
  • Multiple annotators, adjudication
  • Outcome: gold standard data

MAMA Cycle

  • MA-MA (Model – Annotate – Model – Annotate)
  • [Figure: the MAMA loop shown within the MATTER cycle (Model, Annotate, Train, Test, Evaluate, Revise)]

MATTER: Train, Test, Evaluate

  • This is the machine learning part of the cycle.
  • Training a supervised model on (annotated) data
  • Testing the model on unseen (but annotated) data
  • Evaluating the model's performance

MATTER: Train and Test

  • Very important: split your data into training set and test set
    • More about this next week
  • The exact ratio depends on the data set size
    • Common for medium-size data sets: 80% training, 20% test
  • In many cases, you need an extra validation (development) set
    • 80% training, 10% validation, 10% test
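
A minimal sketch of the 80/10/10 split, using only the standard library. The function name `split_data`, the fixed seed and the toy data (the numbers 0 to 99) are assumptions made for illustration.

```python
import random


def split_data(items, train=0.8, validation=0.1, seed=42):
    """Shuffle items and split them into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],        # everything left over is the test set
    )


train_set, val_set, test_set = split_data(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before splitting avoids a test set that only contains the last documents of the corpus.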

MATTER: Evaluate

  • Once you have obtained your test results, compute evaluation metrics.
  • Common ways to evaluate classification algorithms:
    • Accuracy
    • Precision
    • Recall
    • F1-score
    • Confusion matrix
    • ROC curve
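
The first four metrics can be sketched in a few lines for a binary task. The gold and predicted labels below are invented toy data, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(gold, predicted, positive):
    """Compute accuracy, precision, recall and F1 for one positive label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (invented): 3 spam and 5 non-spam emails.
gold = ["SPAM", "SPAM", "SPAM", "NON-SPAM", "NON-SPAM", "NON-SPAM", "NON-SPAM", "NON-SPAM"]
predicted = ["SPAM", "SPAM", "NON-SPAM", "SPAM", "NON-SPAM", "NON-SPAM", "NON-SPAM", "NON-SPAM"]

scores = binary_metrics(gold, predicted, positive="SPAM")
print({name: round(value, 2) for name, value in scores.items()})
# {'accuracy': 0.75, 'precision': 0.67, 'recall': 0.67, 'f1': 0.67}
```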

MATTER: Revise

  • Possible revisions:
    • Introduce a new tag or type
    • Split an existing tag or type
    • Collect more data
  • The MATTER cycle restarts if revision is made

Annotation In Practice

Before Starting the Annotation

  • 3-minute exercise
    • Work in groups of 2-3
    • Discuss what you need to know before you start annotating data
    • Share your thoughts with the class

Before Starting the Annotation - Key Questions

  • What data set are you annotating?
  • What are the labels (or: what is the annotation model)?
  • What is the unit of annotation, e.g.:
    • Single word (what is the part-of-speech?)
    • Sentence (does this sentence express a positive/negative sentiment?)
    • Text (what is the main topic of this text?)
  • How will you store the annotations?
  • Do you need any specialized knowledge?
  • The goal: how will these annotations be used?

Storing Annotations

  • Two options:
    • inline: in the same document (simple, but changes the raw data)
    • standoff: in a different document (doesn’t affect the data, but more bookkeeping)
  • Common formats:
    • CSV
    • XML
    • JSON
    • TXT
  • NER labels in XML https://www.frontiersin.org/articles/10.3389/fdigh.2018.00002/full

Common Storage Formats

  • Inline linear formats: Mary_PER bought chocolates
  • Inline XML: <PER>Mary</PER> bought chocolates
  • Column-based formats: Mary,PER bought,0 chocolates,0
  • Standoff annotations: 0,4,PER
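
To illustrate how standoff and inline storage relate, the sketch below turns a standoff annotation into the inline linear format. It assumes that `0,4,PER` means character offsets with an exclusive end position; `standoff_to_inline` is a hypothetical helper.

```python
def standoff_to_inline(text, annotations):
    """Append _LABEL after each annotated span given as (start, end, label)."""
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, label in sorted(annotations, reverse=True):
        text = text[:end] + "_" + label + text[end:]
    return text


raw_text = "Mary bought chocolates"   # the raw data stays untouched
standoff = [(0, 4, "PER")]            # stored separately from the text

print(raw_text[0:4])                             # Mary
print(standoff_to_inline(raw_text, standoff))    # Mary_PER bought chocolates
```

The raw text is never modified; the inline version is generated only when needed, which is the main advantage of standoff storage.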

Annotation Model

  • Annotation model, or annotation scheme: Structured framework (or set of rules) that defines how data should be annotated.
  • Minimum specification:
    • What are the labels?
    • What is the exact interpretation of each label?

Annotation Model - Key Questions

  • What are the labels?
  • What is the exact interpretation of each label?
  • Example: Annotating emails for spam detection
    • Labels: ?
    • Interpretation: ?

Annotation Model - Spam Detection Example

  • What are the labels?
  • What is the exact interpretation of each label?
  • Example: Annotating emails for spam detection
    • Labels: SPAM, NON-SPAM
    • Interpretation:
      • SPAM is an email we don’t want
      • NON-SPAM is an email we want

Annotation Model - Named Entities Example

  • What are the labels?
  • What is the exact interpretation of each label?
  • Example: Annotating named entities in the text
    • Labels: ?
    • Interpretation: ?

Annotation Model - Named Entities Example

  • What are the labels?
  • What is the exact interpretation of each label?
  • Example: Annotating named entities in the text
    • Labels: PER, LOC, ORG
    • Interpretation:
      • PER is a person
      • LOC is a location
      • ORG is an organization

Annotation Model - Image Annotation Example

  • What are the labels?
  • What is the exact interpretation of each label?
  • Example: Annotating images of cats and dogs (to train an ML classifier)
    • Labels: ?
    • Interpretation: ?

Annotation Model - Image Annotation Example

  • What are the labels?
  • What is the exact interpretation of each label?
  • Example: Annotating images of cats and dogs (to train an ML classifier)
    • Labels: CAT, DOG, N/A
    • Interpretation:
      • CAT means “image of a cat”
      • DOG means “image of a dog”
      • N/A means “not an image of a cat or a dog”

Revisiting the Annotation Model

  • The annotation model can (and often should) be revisited.
    • Example: you start with labels CAT, DOG, and N/A, but after annotating several images you run into an image that shows both a cat and a dog.
    • You should either allow multiple labels (CAT, DOG) or introduce a new label, CAT+DOG.

Annotation Guidelines

  • Annotation guidelines are the instructions for the annotators.
  • They specify how to apply the annotation model to the data.
  • They should answer the following questions:
    • What is the goal of the project?
    • What is each label (tag) and how is it used?
    • What parts of the data and which units do you want to annotate?
    • How will the annotation be created?
  • Include examples!
  • In many cases, guidelines are revisited (in the MAMA cycle).

Annotation Model vs. Guidelines

  • What's the difference between model and guidelines?

Annotation Model vs. Guidelines - Discussion

  • Discuss for 4 minutes: compare annotation guidelines and the annotation model on
    • Purpose
    • Audience
    • Scope
    • Focus
    • Format

Annotation Model vs. Guidelines - Comparison

| | Annotation guidelines | Annotation model |
|---|---|---|
| Purpose | Provide practical instructions for consistent annotation | Define the conceptual structure of annotations |
| Audience | Human annotators | Researchers, data designers |
| Scope | Task-specific, often with examples | Abstract, applies to multiple datasets |
| Focus | How to annotate (instructions) | What to annotate (categories and relationships) |
| Format | A document or manual for annotators | A schema, ontology, etc. |

Revisiting Model and Guidelines

  • Set aside some data for developing labels and guidelines.
  • Annotate these documents using the initial model and guidelines.
  • Evaluate your annotation model.
  • Revisit your model and guidelines based on the experience.
  • The MAMA cycle (Model – Annotate – Model – Annotate) (Pustejovsky & Stubbs 2013)

Annotators

Who Does the Annotation?

Who Does the Annotation? - Considerations

  • Preferably not a single person:
    • Misunderstanding guidelines
    • Subjective interpretation
    • Biases (including unconscious bias – e.g., a researcher who wants their hypothesis to be true)
  • Several experts:
    • Compute inter-annotator agreement
    • Possibly discuss complicated cases
  • Many non-expert annotators (crowdsourcing)
  • Automated annotation based on machine learning

How to Choose Annotators?

  • What skills do your annotators need to have to perform your annotation task?
  • Does your annotation task require any specialized knowledge? Vice versa: can specialized knowledge cause issues?
  • What are practical considerations?
    • Money
    • Time
    • Size of dataset

Crowdsourcing

  • Large data annotation = several annotators × a lot of work = a lot of annotators × a bit of work
  • Prolific - https://www.prolific.com/
  • Amazon Mechanical Turk - https://www.mturk.com/
  • Douglas et al. (2023): https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0279720

Crowdsourcing: reCAPTCHA

  • Millions of CAPTCHAs are solved by people every day. reCAPTCHA makes positive use of this human effort by channeling the time spent solving CAPTCHAs into digitizing text, annotating images, and building machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.

Crowdsourcing: Gamification

  • Incentives for annotators:
    • Financial compensation
    • Access to a web resource
    • Enjoyment
    • Other intrinsic motivation
  • An example of gamification:
    • Challenge – develop annotation games: https://www.izdigital.fau.eu/2020/10/13/annotator-challenge-develop-a-game-to-gamify-the-annotating/

Crowdsourcing: Intrinsic Motivation

  • https://www.woordwaark.nl/
  • Interactive language data set for Gronings.
  • Gronings is a set of low-resource dialects.
  • By contributing to this project, Gronings speakers benefit themselves.

Looking for Annotated Data

  • 5-minute exercise, in groups of 4-5
  • Find an annotated data set of your choice
  • Locate information about the annotations:
    • What are the annotation labels?
    • Are the guidelines publicly available?
    • Who annotated the data?

Inter-Annotator Agreement

Inter-Annotator Agreement (IAA)

  • To what extent do your annotators agree with each other?
  • We need to measure the agreement between different annotators.
    • Can’t ask the same person twice.
  • Often measured on a data sample, as a part of the MAMA cycle.

IAA: Basic Idea

  • Measuring IAA is a way to evaluate your annotations.
  • High IAA – robust and reproducible annotation.
  • Low IAA:
    • Your guidelines are not sufficiently clear.
    • A lot of room for subjective judgements in this task.
    • Your annotators aren’t sufficiently motivated to perform the task well.

Why Measure IAA?

  • Improve your annotation model and guidelines.
  • Detect difficult cases in your data.
  • Compare annotators’ performance in this task.
  • Assess reliability of the annotation process.
    • No reference corpus / gold standard data.
    • Often no single “correct” answer.

Inter-Annotator Agreement Example

  • Task: annotating customer reviews
  • Annotation model: positive/negative

| Item | Annotator A | Annotator B |
|---|---|---|
| Review 1 | + | + |
| Review 2 | + | - |
| Review 3 | - | - |
| Review 4 | + | + |
| Review 5 | - | - |

Inter-Annotator Agreement Example - Confusion Matrix

  • Task: annotating customer reviews
  • Annotation model: positive/negative

| | B Positive | B Negative | Total |
|---|---|---|---|
| A Positive | 2 | 1 | 3 |
| A Negative | 0 | 2 | 2 |
| Total | 2 | 3 | 5 |
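
A small sketch of how such a matrix can be counted from two lists of labels. The label lists encode the five reviews with `+` for positive and `-` for negative, chosen to match the counts in the matrix; the variable names are illustrative.

```python
from collections import Counter

# Labels for Review 1 to Review 5 (+ = positive, - = negative).
annotator_a = ["+", "+", "-", "+", "-"]
annotator_b = ["+", "-", "-", "+", "-"]

# Count each (label from A, label from B) pair.
confusion = Counter(zip(annotator_a, annotator_b))

for a_label in ["+", "-"]:
    row = [confusion[(a_label, b_label)] for b_label in ["+", "-"]]
    print("A", a_label, row)
# A + [2, 1]
# A - [0, 2]
```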

Confusion Matrix

  • In the context of annotation, a confusion matrix shows how the annotators agree or disagree

Confusion Matrix - Good or Not?

  • The confusion matrix must have labeled rows and columns indicating annotators and categories.

Inter-Annotator Agreement - Raw Agreement

  • Raw, or observed, agreement $A_o$: the share of items on which annotators agree.

  • $A_o = \frac{2+2}{5} = 0.8$

| | B Positive | B Negative | Total |
|---|---|---|---|
| A Positive | 2 | 1 | 3 |
| A Negative | 0 | 2 | 2 |
| Total | 2 | 3 | 5 |

Inter-Annotator Agreement - Raw Agreement with Three Categories

  • Raw, or observed, agreement $A_o$: the share of items on which annotators agree.

  • $A_o = \frac{2+2+3}{11} \approx 0.64$

| | B Positive | B Negative | B Neutral | Total |
|---|---|---|---|---|
| A Positive | 2 | 1 | 1 | 4 |
| A Negative | 0 | 2 | 1 | 3 |
| A Neutral | 0 | 1 | 3 | 4 |
| Total | 2 | 4 | 5 | 11 |
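
Observed agreement is the sum of the diagonal divided by the total number of items, so the same sketch covers two or three categories. The function name `observed_agreement` is an assumption; the matrices hold the cell counts from the tables, without the Total row and column.

```python
def observed_agreement(matrix):
    """Share of items on the diagonal, i.e. where both annotators chose the same label."""
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    return agreed / total


two_labels = [[2, 1],
              [0, 2]]
three_labels = [[2, 1, 1],
                [0, 2, 1],
                [0, 1, 3]]

print(observed_agreement(two_labels))              # 0.8
print(round(observed_agreement(three_labels), 2))  # 0.64
```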

Issues with Observed IAA

  • What are the drawbacks of this measure?
  • Think about it for 3 minutes, discuss with others

| | B Positive | B Negative | Total |
|---|---|---|---|
| A Positive | 2 | 1 | 3 |
| A Negative | 0 | 2 | 2 |
| Total | 2 | 3 | 5 |

$A_o = (2+2) / 5 = 0.8$

Issues with Observed IAA - Chance Agreement

  • Annotators may agree by chance!
    • With 2 equally likely labels, there is a 50% chance for each data point that 2 annotators agree at random
  • Especially problematic for data with imbalanced classes

Issues with Observed IAA - Imbalanced Classes Example

  • Example: a corpus with 1000 tokens with just a few named entities.
    • Annotating whether each word is a named entity
    • Annotation model:
      • NE: entity
      • 0: non-entity
    • Example: Last_0 weekend_0, I_0 travelled_0 to_0 Berlin_NE to_0 visit_0 an_0 old_0 friend_0. We_0 spent_0 most_0 of_0 the_0 time_0 exploring_0 museums_0 and_0 drinking_0 coffee_0.

Issues with Observed IAA - High Agreement, Low Usefulness

  • Corpus with 1000 tokens

  • Very high value (almost perfect agreement) but not useful

  • $A_o = (982+1) / 1000 = 0.983$

| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |

IAA Scores - Accounting for Chance

  • We need to take into account agreement by chance.
  • IAA scores commonly used instead of raw agreement:
    • Cohen’s kappa ($\kappa$)
    • Fleiss’ kappa ($\kappa$)
    • Krippendorff’s alpha ($\alpha$)

Cohen's Kappa

  • Idea of Cohen’s $\kappa$: calculate the expected chance agreement ($A_e$) and take that into consideration.
  • Cohen’s $\kappa$ measures how much of the possible agreement beyond chance was obtained.

Cohen's Kappa - Formula

  • $\kappa = \frac{A_o - A_e}{1 - A_e}$
  • $A_o$: observed agreement
  • $A_e$: expected agreement (by chance)
  • (both values between 0 and 1)

Cohen's Kappa - Formula Explained

  • $\kappa = \frac{A_o - A_e}{1 - A_e}$
  • $A_o - A_e$: amount of agreement beyond chance
  • $1 - A_e$: maximum possible amount of agreement beyond chance
  • $\kappa$ is the ratio between these two values, a number between $-1$ and $1$

Expected Agreement

  • This is the agreement by chance, $A_e$
  • An example with two equally likely labels:
    • 2 annotators (A and B) and 2 labels (TRUE and FALSE)
      • For label TRUE:
        • Annotator A will choose TRUE half of the time, probability 0.5
        • Annotator B will choose TRUE half of the time, probability 0.5
        • Both annotators: 0.5 × 0.5 = 0.25
      • Same calculations for label FALSE: 0.5 × 0.5 = 0.25
      • $A_e$ = 0.25 + 0.25 = 0.5

Expected Agreement - Verification

  • Let's check if our calculations are correct!
  • Write down all possible label combinations in the table:
  • In 2 out of 4 rows there is agreement (both + or both -)
  • 2 / 4 = 0.5

| Item | Annotator A | Annotator B | Agreement? |
|---|---|---|---|
| Combination 1 | + | + | Yes |
| Combination 2 | + | - | No |
| Combination 3 | - | + | No |
| Combination 4 | - | - | Yes |
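
The same check can be sketched in code by enumerating every combination of two equally likely labels. The variable names are illustrative.

```python
from itertools import product

labels = ["TRUE", "FALSE"]

# All (Annotator A, Annotator B) label combinations: 2 x 2 = 4.
combinations = list(product(labels, repeat=2))
agreeing = [pair for pair in combinations if pair[0] == pair[1]]

expected_agreement = len(agreeing) / len(combinations)
print(len(combinations), len(agreeing), expected_agreement)  # 4 2 0.5
```

This shortcut only works because both labels are equally likely; with unequal label frequencies, each annotator's own label proportions have to be used instead.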

Cohen's Kappa - Unbalanced Dataset Example

  • We already know $A_o$ from before: $A_o = (982+1) / 1000 = 0.983$
  • Need to compute expected agreement $A_e$
  • $\kappa = \frac{A_o - A_e}{1 - A_e}$

| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |

Cohen's Kappa - Calculation Example (1)

  • Computed per label – first for “entity”:
    • How likely was each annotator to choose the label “entity”?
      • Annotator A: 10/1000 = 0.01
      • Annotator B: 9/1000 = 0.009
    • How likely were both to select “entity”?
      • 0.01 × 0.009 = 0.00009

Cohen's Kappa - Calculation Example (2)

  • Computed per label – now for “non-entity”:
    • How likely was each annotator to choose the label “non-entity”?
      • Annotator A: 990/1000 = 0.99
      • Annotator B: 991/1000 = 0.991
    • How likely were both to select “non-entity”?
      • 0.99 × 0.991 = 0.98109

Cohen's Kappa - Calculation Example (3)

  • Computed per label – now add them up!
    • $A_e = 0.00009 + 0.98109 = 0.98118$
  • We already know $A_o$ from before: $A_o = (982+1) / 1000 = 0.983$
  • We have computed $A_e$: $A_e = 0.98118$
  • $\kappa = \frac{A_o - A_e}{1 - A_e} = \frac{0.983 - 0.98118}{1 - 0.98118} \approx 0.097$

| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |
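
The three calculation steps can be collected into one sketch that computes Cohen's kappa from a square confusion matrix of cell counts. The function name `cohen_kappa` is an assumption, and the sketch does not handle the edge case where the expected agreement is exactly 1.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for two annotators from a square matrix of counts.

    Rows are Annotator A's labels, columns are Annotator B's labels.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    a_o = sum(matrix[i][i] for i in range(n)) / total
    # Expected agreement: per label, P(A picks it) * P(B picks it), summed.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    a_e = sum((row_totals[i] / total) * (col_totals[i] / total) for i in range(n))
    return (a_o - a_e) / (1 - a_e)


# Cell counts from the named-entity table (without totals).
entity_matrix = [[1, 9],
                 [8, 982]]
kappa = cohen_kappa(entity_matrix)
print(round(kappa, 3))  # 0.097
```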

Interpreting Cohen's Kappa

  • According to Landis & Koch (1977):
    • ≤ 0 no agreement
    • 0.01–0.20 none to slight
    • 0.21–0.40 fair
    • 0.41–0.60 moderate
    • 0.61–0.80 substantial
    • 0.81–1.00 almost perfect

Interpreting Cohen's Kappa - Modern Interpretations

  • Modern interpretations are more conservative (McHugh, 2015)

| Value of Kappa | Level of Agreement |
|---|---|
| 0-.20 | None |
| .21-.39 | Minimal |
| .40-.59 | Weak |
| .60-.79 | Moderate |
| .80-.90 | Strong |
| Above .90 | Almost Perfect |
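
The table above can be turned into a small lookup. This is only a sketch: `interpret_kappa` is a hypothetical helper, and it assumes that negative values are also read as "None" and that values falling between two printed bands (e.g. 0.205) belong to the higher band.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the level of agreement from the table above."""
    if kappa <= 0.20:      # assumption: negative values are treated as "None" too
        return "None"
    if kappa < 0.40:
        return "Minimal"
    if kappa < 0.60:
        return "Weak"
    if kappa < 0.80:
        return "Moderate"
    if kappa <= 0.90:
        return "Strong"
    return "Almost Perfect"


print(interpret_kappa(0.097))  # None
print(interpret_kappa(0.85))   # Strong
```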

Interpreting Cohen's Kappa - Example Result

  • $\kappa = \frac{0.983 - 0.98118}{1 - 0.98118} \approx 0.097$
  • There is little to no agreement between the annotators.

Fleiss' Kappa

  • Cohen's kappa only works with exactly two annotators
  • For three or more annotators: Fleiss' kappa
  • It's a generalization of Cohen's kappa to 3+ annotators
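
A minimal sketch of Fleiss' kappa using the standard formula. The input format (one row per item, one count per label), the function name `fleiss_kappa` and the toy ratings are assumptions for illustration; the sketch also assumes that every item was labeled by the same number of annotators.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators who gave item i the label j."""
    n_items = len(counts)
    n_raters = sum(counts[0])   # assumes the same number of ratings per item
    n_labels = len(counts[0])
    # Agreement per item: share of annotator pairs that chose the same label.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each label.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    expected = sum(s * s for s in shares)
    return (observed - expected) / (1 - expected)


# Toy data (invented): 4 items, 3 annotators, labels [positive, negative].
ratings = [[3, 0],
           [2, 1],
           [0, 3],
           [3, 0]]
print(round(fleiss_kappa(ratings), 3))  # 0.625
```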

Adjudication

  • Adjudication is a process of resolving disagreements.
  • Possible methods:
    • Let the annotators discuss the difficult cases.
    • Seek expert opinion.
    • Use the majority (i.e., most commonly used) label.
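
The majority method can be sketched as follows. `majority_label` is a hypothetical helper; returning `None` on a tie is a design choice made here so that tied items can be passed on to discussion or an expert.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None when there is a tie for first place."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: resolve by discussion or expert opinion
    return counts[0][0]


print(majority_label(["PER", "PER", "ORG"]))  # PER
print(majority_label(["PER", "ORG"]))         # None
```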

Summary

  • Annotation cycle
  • Annotation model and guidelines
  • Annotators
  • Inter-annotator agreement

Recommended Reading

  • Pustejovsky, J., & Stubbs, A. (2013). Natural language annotation for machine learning. O'Reilly Media.
  • https://rug.on.worldcat.org/oclc/801812987
  • Chapter 1 and Chapter 6

Next Lecture - 12 May

  • Setting up a machine learning experiment