Annotation for Machine Learning Lecture 2: Annotation model and guidelines, inter-annotator agreement

Course Progress

Week 1: Introduction to machine learning and annotation + Labs
Week 2: No lecture (Easter Monday) but labs as scheduled!
Week 3: Annotation model, guidelines, inter-annotator agreement + Labs
Week 4: No lecture (Liberation Day) but labs as scheduled!
Week 5: Setting up a ML experiment + Labs
Week 6: ML classifiers + Labs
Week 7: Testing, evaluation, revision and reporting but no labs (Ascension Day)
Week 8: Wrap up, Exam Prep, Q&A

Assignments

Assignment 1 (ungraded): Setting up your working environment
Assignment 2: Annotations and guidelines (done in pairs)
Assignment 3: Article review / reflection
Assignment 4: Annotation with a model + ML experiment
Assignment 5: Comparing models
Raise your hand if you still have no partner for the annotation assignment

Submitting Assignments

Keep submitting assignments
Assignment 1 is ungraded but compulsory; the others are graded
To access the exam:
- All assignments must be submitted
- Average grade of at least 5.5 for the assignments

Last Lecture

Characteristics of learning
Examples of NLP tasks
Levels of linguistic description

Last Lecture: Recap

Discuss the following with a person sitting next to you for 3 minutes:
1. How does a machine learning system differ from a rule-based system?
2. Give 2 examples of NLP tasks (and describe what they are)
3. Name (at least) 5 levels of linguistic description

Today's Lecture

Annotation development cycle
Annotation in practice
Inter-annotator agreement

Annotation Development Cycle

MATTER cycle

MATTER Cycle

Model
Annotate
Train
Test
Evaluate
Revise

MATTER Cycle Explained

Model the phenomenon
Annotate the data
Train the algorithms
Test them on unseen data
Evaluate the results
Revise the model and algorithms

MATTER: Model the Phenomenon

"Model" here is a conceptual model, not a ML algorithm.

MATTER: Model the Phenomenon

Model $M = \{T, R, I\}$
- Term vocabulary, $T$
- Relations between terms, $R$
- Interpretation of terms, $I$
$T = \{\text{Document type}, \text{Spam}, \text{Non-spam}\}$
$R = \{\text{Document type} = \text{Spam} \mid \text{Non-spam}\}$
$I = \{\text{Spam} = \text{"text we don't want to keep"}, \text{Non-spam} = \text{"text we want to keep"}\}$
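The spam model above can be written down as a small data structure. This is only a sketch: the class name `AnnotationModel`, its field names, and the helper `is_valid_label` are illustrative assumptions, not notation from the lecture.

```python
from dataclasses import dataclass


@dataclass
class AnnotationModel:
    """Hypothetical container for a model M = {T, R, I}."""
    terms: set            # T: term vocabulary
    relations: dict       # R: which values each term may take
    interpretation: dict  # I: what each label means

    def is_valid_label(self, term: str, value: str) -> bool:
        # A value is valid if the relations allow it for this term.
        return value in self.relations.get(term, ())


spam_model = AnnotationModel(
    terms={"Document type", "Spam", "Non-spam"},
    relations={"Document type": ("Spam", "Non-spam")},
    interpretation={
        "Spam": "text we don't want to keep",
        "Non-spam": "text we want to keep",
    },
)

print(spam_model.is_valid_label("Document type", "Spam"))   # True
print(spam_model.is_valid_label("Document type", "Maybe"))  # False
```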

MATTER: Annotate the Data

The actual annotation process
Doesn't always work well from the first attempt
May need to do multiple iterations
Multiple annotators, adjudication
Outcome: gold standard data

MAMA Cycle

MA-MA (Model – Annotate – Model – Annotate)
[Figure: the MAMA loop (Model, Annotate) within the MATTER cycle (Model, Annotate, Train, Test, Evaluate, Revise)]

MATTER: Train, Test, Evaluate

This is the machine learning part of the cycle.
Training a supervised model on (annotated) data
Testing the model on unseen (but annotated) data
Evaluating the model's performance

MATTER: Train and Test

Very important: split your data into training set and test set
- More about this next week
The exact ratio depends on the data set size
- Common for medium-size data sets: 80% training, 20% test
In many cases, you need an extra validation (development) set
- 80% training, 10% validation, 10% test
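A minimal sketch of an 80/10/10 split using only the standard library. The function name `split_dataset`, the fixed seed, and the toy data set of 1000 integers are assumptions made for illustration; in practice a library routine may be more convenient.

```python
import random


def split_dataset(items, train=0.8, validation=0.1, seed=42):
    """Shuffle the items and cut them into train / validation / test parts."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    train_part = shuffled[:n_train]
    val_part = shuffled[n_train:n_train + n_val]
    test_part = shuffled[n_train + n_val:]  # whatever remains
    return train_part, val_part, test_part


train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```

Shuffling before cutting matters: if the data is sorted (for example by label or date), an unshuffled split gives a test set that does not resemble the training set.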

MATTER: Evaluate

Once you have obtained your test results, compute evaluation metrics.
Common ways to evaluate classification algorithms:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion matrix
- ROC curve
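The first four metrics in the list can be computed directly from gold and predicted labels. The sketch below assumes a binary task with SPAM as the positive class; the function name `binary_metrics` and the five toy labels are made up for illustration.

```python
def binary_metrics(gold, predicted, positive="SPAM"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


gold = ["SPAM", "SPAM", "NON-SPAM", "NON-SPAM", "SPAM"]
pred = ["SPAM", "NON-SPAM", "NON-SPAM", "SPAM", "SPAM"]
metrics = binary_metrics(gold, pred)
print(metrics["accuracy"])  # 0.6
```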

MATTER: Revise

Possible revisions:
- Introduce a new tag or type
- Split an existing tag or type
- Collect more data
The MATTER cycle restarts if a revision is made

Annotation In Practice

Before Starting the Annotation

3-minute exercise
- Work in groups of 2-3
- Discuss what you need to know before you start annotating data
- Share your thoughts with the class

Before Starting the Annotation - Key Questions

What data set are you annotating?
What are the labels (or: what is the annotation model)?
What is the unit of annotation, e.g.:
- Single word (what is the part-of-speech?)
- Sentence (does this sentence express a positive/negative sentiment?)
- Text (what is the main topic of this text?)
How will you store the annotations?
Do you need any specialized knowledge?
The goal: how will these annotations be used?

Storing Annotations

Two options:
- inline: in the same document (simple, but changes the raw data)
- standoff: in a different document (doesn’t affect the data, but more bookkeeping)
Common formats:
- CSV
- XML
- JSON
- TXT
NER labels in XML https://www.frontiersin.org/articles/10.3389/fdigh.2018.00002/full

Common Storage Formats

Inline linear formats: Mary_PER bought chocolates
Inline XML: <PER>Mary</PER> bought chocolates
Column-based formats: Mary,PER bought,0 chocolates,0
Standoff annotations: 0,4,PER
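To see how the formats relate, the inline linear example can be converted to standoff annotations with character offsets. This is a sketch that assumes whitespace tokenisation and a single underscore between word and label; `inline_to_standoff` is a hypothetical helper name.

```python
def inline_to_standoff(inline):
    """Turn 'Mary_PER bought' into raw text plus (start, end, label) spans."""
    tokens, spans, offset = [], [], 0
    for item in inline.split():
        # 'Mary_PER' -> ('Mary', '_', 'PER'); 'bought' -> ('bought', '', '')
        word, _, label = item.partition("_")
        if label:
            spans.append((offset, offset + len(word), label))
        tokens.append(word)
        offset += len(word) + 1  # +1 for the space that follows the word
    return " ".join(tokens), spans


text, spans = inline_to_standoff("Mary_PER bought chocolates")
print(text)   # Mary bought chocolates
print(spans)  # [(0, 4, 'PER')]
```

The raw text stays untouched and the span `0,4,PER` lives separately, which is the bookkeeping trade-off that standoff storage brings.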

Annotation Model

Annotation model, or annotation scheme: Structured framework (or set of rules) that defines how data should be annotated.
Minimum specification:
- What are the labels?
- What is the exact interpretation of each label?

Annotation Model - Key Questions

What are the labels?
What is the exact interpretation of each label?
Example: Annotating emails for spam detection
- Labels: ?
- Interpretation: ?

Annotation Model - Spam Detection Example

What are the labels?
What is the exact interpretation of each label?
Example: Annotating emails for spam detection
- Labels: SPAM, NON-SPAM
- Interpretation:
  - SPAM is an email we don’t want
  - NON-SPAM is an email we want

Annotation Model - Named Entities Example

What are the labels?
What is the exact interpretation of each label?
Example: Annotating named entities in the text
- Labels: ?
- Interpretation: ?

Annotation Model - Named Entities Example

What are the labels?
What is the exact interpretation of each label?
Example: Annotating named entities in the text
- Labels: PER, LOC, ORG
- Interpretation:
  - PER is a person
  - LOC is a location
  - ORG is an organization

Annotation Model - Image Annotation Example

What are the labels?
What is the exact interpretation of each label?
Example: Annotating images of cats and dogs (to train a ML classifier)
- Labels: ?
- Interpretation: ?

Annotation Model - Image Annotation Example

What are the labels?
What is the exact interpretation of each label?
Example: Annotating images of cats and dogs (to train a ML classifier)
- Labels: CAT, DOG, N/A
- Interpretation:
  - CAT means “image of a cat”
  - DOG means “image of a dog”
  - N/A means “not an image of a cat or a dog”

Revisiting the Annotation Model

The annotation model can (and often should) be revisited.
- Example: you start with labels CAT, DOG, and N/A, but after annotating several images you run into an image that shows both a cat and a dog.
- You should either allow multiple labels (CAT, DOG) or introduce a new label, CAT+DOG.

Annotation Guidelines

Annotation guidelines are the instructions for the annotators.
They specify how to apply the annotation model to the data.
They should answer the following questions:
- What is the goal of the project?
- What is each label (tag) and how is it used?
- What parts of the data and which units do you want to annotate?
- How will the annotation be created?
Include examples!
In many cases, guidelines are revisited (in the MAMA cycle).

Annotation Model vs. Guidelines

What's the difference between model and guidelines?

Annotation Model vs. Guidelines - Discussion

Discuss for 4 minutes: compare annotation guidelines and the annotation model on:
- Purpose
- Audience
- Scope
- Focus
- Format

Annotation Model vs. Guidelines - Comparison

	Annotation guidelines	Annotation model
Purpose	Provide practical instructions for consistent annotation	Define the conceptual structure of annotations
Audience	Human annotators	Researchers, data designers
Scope	Task-specific, often with examples	Abstract, applies to multiple datasets
Focus	How to annotate (instructions)	What to annotate (categories and relationships)
Format	A document or manual for annotators	A schema, ontology, etc.

Revisiting Model and Guidelines

Set aside some data for developing labels and guidelines.
Annotate these documents using the initial model and guidelines.
Evaluate your annotation model.
Revisit your model and guidelines based on the experience.
The MAMA cycle (Model – Annotate – Model – Annotate) (Pustejovsky & Stubbs 2013)

Annotators

Who Does the Annotation?

Who Does the Annotation? - Considerations

Preferably not a single person:
- Misunderstanding guidelines
- Subjective interpretation
- Biases (including unconscious bias – e.g., a researcher who wants their hypothesis to be true)
Several experts:
- Compute inter-annotator agreement
- Possibly discuss complicated cases
Many non-expert annotators (crowdsourcing)
Automated annotation based on machine learning

How to Choose Annotators?

What skills do your annotators need to have to perform your annotation task?
Does your annotation task require any specialized knowledge? Vice versa: can specialized knowledge cause issues?
What are practical considerations?
- Money
- Time
- Size of dataset

Crowdsourcing

Large data annotation = several annotators × a lot of work = a lot of annotators × a bit of work
Prolific - https://www.prolific.com/
Amazon Mechanical Turk - https://www.mturk.com/
Douglas et al. (2023): https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0279720

Crowdsourcing: reCAPTCHA

Millions of CAPTCHAs are solved by people every day. reCAPTCHA makes positive use of this human effort by channeling the time spent solving CAPTCHAs into digitizing text, annotating images, and building machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.

Crowdsourcing: Gamification

Incentives for annotators:
- Financial compensation
- Access to a web resource
- Enjoyment
- Other intrinsic motivation
An example of gamification:
- Challenge – develop annotation games: https://www.izdigital.fau.eu/2020/10/13/annotator-challenge-develop-a-game-to-gamify-the-annotating/

Crowdsourcing: Intrinsic Motivation

https://www.woordwaark.nl/
Interactive language data set for Gronings.
Gronings is a set of low-resource dialects.
By contributing to this project, Gronings speakers benefit themselves.

Looking for Annotated Data

5-minute exercise, in groups of 4-5
Find an annotated data set of your choice
Locate information about the annotations:
- What are the annotation labels?
- Are the guidelines publicly available?
- Who annotated the data?

Inter-Annotator Agreement

Inter-Annotator Agreement (IAA)

To what extent do your annotators agree with each other?
We need to measure the agreement between different annotators.
- Can’t ask the same person twice.
Often measured on a data sample, as a part of the MAMA cycle.

IAA: Basic Idea

Measuring IAA is a way to evaluate your annotations.
High IAA – robust and reproducible annotation.
Low IAA:
- Your guidelines are not sufficiently clear.
- A lot of room for subjective judgements in this task.
- Your annotators aren’t sufficiently motivated to perform the task well.

Why Measure IAA?

Improve your annotation model and guidelines.
Detect difficult cases in your data.
Compare annotators’ performance in this task.
Assess reliability of the annotation process.
- No reference corpus / gold standard data.
- Often no single “correct” answer.

Inter-Annotator Agreement Example

Task: annotating customer reviews
Annotation model: positive/negative

Item	Annotator A	Annotator B
Review 1	+	+
Review 2	+	–
Review 3	–	–
Review 4	+	+
Review 5	–	–

Inter-Annotator Agreement Example - Confusion Matrix

Task: annotating customer reviews
Annotation model: positive/negative

	B Positive	B Negative	Total
A Positive	2	1	3
A Negative	0	2	2
Total	2	3	5

Confusion Matrix

In the context of annotation, a confusion matrix shows how the annotators agree or disagree
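The confusion matrix for the five reviews can be built by counting label pairs. This sketch assumes the two annotators' labels are stored as parallel lists and writes the labels as ASCII `+` and `-`; the function name `confusion_matrix` is illustrative.

```python
from collections import Counter

# Labels for Review 1..5, in order, as in the example table.
annotator_a = ["+", "+", "-", "+", "-"]
annotator_b = ["+", "-", "-", "+", "-"]


def confusion_matrix(labels_a, labels_b):
    """Count how often each (label of A, label of B) pair occurs."""
    return Counter(zip(labels_a, labels_b))


matrix = confusion_matrix(annotator_a, annotator_b)
for a_label in ["+", "-"]:
    row = [matrix[(a_label, b_label)] for b_label in ["+", "-"]]
    print(f"A {a_label}: {row}")
# A +: [2, 1]
# A -: [0, 2]
```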

Confusion Matrix - Good or Not?

The confusion matrix must have labeled rows and columns indicating annotators and categories.

Inter-Annotator Agreement - Raw Agreement

Raw, or observed agreement $A_o$ : the share of items on which annotators agree.
$A_o = \frac{(2+2)}{5} = 0.8$

	B Positive	B Negative	Total
A Positive	2	1	3
A Negative	0	2	2
Total	2	3	5

Inter-Annotator Agreement - Raw Agreement with Three Categories

Raw, or observed agreement $A_o$ : the share of items on which annotators agree.
$A_o = \frac{2+2+3}{11} \approx 0.64$

	B Positive	B Negative	B Neutral	Total
A Positive	2	1	1	4
A Negative	0	2	1	3
A Neutral	0	1	3	4
Total	2	4	5	11
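Observed agreement is the sum of the diagonal divided by the total number of items, for any number of categories. A short sketch, assuming the matrix is given as a nested list with annotator A in rows and annotator B in columns (totals left out); `observed_agreement` is a hypothetical helper.

```python
def observed_agreement(matrix):
    """Share of items on the diagonal, i.e. where both annotators agree."""
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    return agreed / total


two_labels = [[2, 1],
              [0, 2]]
three_labels = [[2, 1, 1],
                [0, 2, 1],
                [0, 1, 3]]

print(observed_agreement(two_labels))              # 0.8
print(round(observed_agreement(three_labels), 2))  # 0.64
```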

Issues with Observed IAA

What are the drawbacks of this measure?
Think about it for 3 minutes, discuss with others

	B Positive	B Negative	Total
A Positive	2	1	3
A Negative	0	2	2
Total	2	3	5
$A_o = (2+2) / 5 = 0.8$

Issues with Observed IAA - Chance Agreement

Annotators may agree by chance!
- With 2 equally likely labels, for each data point there is a 50% chance that 2 annotators agree at random
Especially problematic for data with imbalanced classes

Issues with Observed IAA - Imbalanced Classes Example

Example: a corpus with 1000 tokens with just a few named entities.
- Annotating whether each word is a named entity
- Annotation model:
  - NE: entity
  - 0: non-entity
Example: Last_0 weekend_0, I_0 travelled_0 to_0 Berlin_NE to_0 visit_0 an_0 old_0 friend_0. We_0 spent_0 most_0 of_0 the_0 time_0 exploring_0 museums_0 and_0 drinking_0 coffee_0.

Issues with Observed IAA - High Agreement, Low Usefulness

Corpus with 1000 tokens
Very high value (almost perfect agreement) but not useful
$A_o = (982+1) / 1000 = 0.983$

	B: entity	B: non-entity	Total
A: entity	1	9	10
A: non-entity	8	982	990
Total	9	991	1000

IAA Scores - Accounting for Chance

We need to take into account agreement by chance.
IAA scores commonly used instead of raw agreement:
- Cohen’s kappa ( $\kappa$ )
- Fleiss’ kappa ( $\kappa$ )
- Krippendorff’s alpha ( $\alpha$ )

Cohen's Kappa

Idea of Cohen’s $\kappa$ : calculate the expected chance agreement ( $A_e$ ) and take that into consideration.
Cohen’s $\kappa$ measures how much of the possible agreement beyond chance was obtained.

Cohen's Kappa - Formula

$\kappa = \frac{A_o - A_e}{1 - A_e}$
$A_o$ – observed agreement
$A_e$ – expected agreement (by chance)
(both values between 0 and 1)

Cohen's Kappa - Formula Explained

$\kappa = \frac{A_o - A_e}{1 - A_e}$
$A_o - A_e$ : amount of agreement beyond chance
$1 - A_e$ : maximum possible amount of agreement beyond chance
$\kappa$ is the ratio between these two values, a number between –1 and 1

Expected Agreement

This is the agreement by chance, $A_e$
An example with two equally likely labels:
- 2 annotators (A and B) and 2 labels (TRUE and FALSE)
  - For label TRUE:
    - Annotator A will choose TRUE half of the time, probability 0.5
    - Annotator B will choose TRUE half of the time, probability 0.5
    - Both annotators: 0.5 × 0.5 = 0.25
  - Same calculations for label FALSE: 0.5 × 0.5 = 0.25
  - $A_e$ = 0.25 + 0.25 = 0.5

Expected Agreement - Verification

Let's check if our calculations are correct!
Write down all possible label combinations in the table:
In 2 out of 4 rows there is agreement (both + or both –)
2 / 4 = 0.5

Item	Annotator A	Annotator B	Agreement?
Combination 1	+	+	Yes
Combination 2	+	–	No
Combination 3	–	+	No
Combination 4	–	–	Yes
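The same check can be done by enumerating the combinations in code. This sketch assumes, as in the table, that every combination is equally likely, which only holds when both annotators use both labels half of the time.

```python
from itertools import product

labels = ["+", "-"]

# All (Annotator A, Annotator B) label combinations.
combinations = list(product(labels, repeat=2))
agreeing = [pair for pair in combinations if pair[0] == pair[1]]

expected_agreement = len(agreeing) / len(combinations)
print(combinations)        # [('+', '+'), ('+', '-'), ('-', '+'), ('-', '-')]
print(expected_agreement)  # 0.5
```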

Cohen's Kappa - Unbalanced Dataset Example

We already know $A_o$ from before: $A_o = (982+1) / 1000 = 0.983$
Need to compute expected agreement $A_e$
$\kappa = \frac{A_o - A_e}{1 - A_e}$

	B: entity	B: non-entity	Total
A: entity	1	9	10
A: non-entity	8	982	990
Total	9	991	1000

Cohen's Kappa - Calculation Example (1)

Computed per label – first for “entity”:
- How likely was each annotator to choose the label “entity”?
  - Annotator A: 10/1000 = 0.01
  - Annotator B: 9/1000 = 0.009
- How likely were both to select “entity”?
  - 0.01 × 0.009 = 0.00009
$\kappa = \frac{A_o - A_e}{1 - A_e}$

	B: entity	B: non-entity	Total
A: entity	1	9	10
A: non-entity	8	982	990
Total	9	991	1000

Cohen's Kappa - Calculation Example (2)

Computed per label – now for “non-entity”
- How likely was each annotator to choose the label “non-entity”?
  - Annotator A: 990/1000 = 0.99
  - Annotator B: 991/1000 = 0.991
- How likely were both to select “non-entity”?
  - 0.99 × 0.991 = 0.98109
$\kappa = \frac{A_o - A_e}{1 - A_e}$

	B: entity	B: non-entity	Total
A: entity	1	9	10
A: non-entity	8	982	990
Total	9	991	1000

Cohen's Kappa - Calculation Example (3)

Computed per label – now add them up!
- $A_e$ = 0.00009 + 0.98109 = 0.98118
We already know $A_o$ from before: $A_o = (982+1) / 1000 = 0.983$
We have computed $A_e$: $A_e = 0.98118$
$\kappa = \frac{A_o - A_e}{1 - A_e} = \frac{0.983 - 0.98118}{1 - 0.98118} \approx 0.097$

	B: entity	B: non-entity	Total
A: entity	1	9	10
A: non-entity	8	982	990
Total	9	991	1000
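The three calculation steps can be combined into one function. This is a sketch rather than a reference implementation: it assumes a square confusion matrix as a nested list (annotator A in rows, annotator B in columns, totals left out), and the name `cohens_kappa` is illustrative.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix of two annotators."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Observed agreement: share of items on the diagonal.
    a_o = sum(matrix[i][i] for i in range(k)) / total

    # Expected agreement: per label, P(A picks it) * P(B picks it), summed.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    a_e = sum((row_totals[i] / total) * (col_totals[i] / total)
              for i in range(k))

    # Undefined (division by zero) if a_e == 1, e.g. only one label ever used.
    return (a_o - a_e) / (1 - a_e)


named_entities = [[1, 9],
                  [8, 982]]
print(round(cohens_kappa(named_entities), 3))  # 0.097
```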

Interpreting Cohen's Kappa

According to Landis & Koch (1977):
- ≤ 0 no agreement
- 0.01–0.20 none to slight
- 0.21–0.40 fair
- 0.41–0.60 moderate
- 0.61–0.80 substantial
- 0.81–1.00 almost perfect

Interpreting Cohen's Kappa - Modern Interpretations

Modern interpretations are more conservative (McHugh, 2015)

Value of Kappa	Level of Agreement
0-.20	None
.21-.39	Minimal
.40-.59	Weak
.60-.79	Moderate
.80-.90	Strong
Above .90	Almost Perfect
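The table above can be turned into a small lookup. The thresholds come from the table; treating negative values as "None" and deciding where values between two rows (such as 0.205) fall are assumptions of this sketch.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the agreement levels in the table above."""
    if kappa <= 0.20:
        return "None"      # assumption: negative values are also "None"
    if kappa < 0.40:
        return "Minimal"
    if kappa < 0.60:
        return "Weak"
    if kappa < 0.80:
        return "Moderate"
    if kappa <= 0.90:
        return "Strong"
    return "Almost Perfect"


print(interpret_kappa(0.097))  # None
print(interpret_kappa(0.85))   # Strong
```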

Interpreting Cohen's Kappa - Example Result

$\kappa = \frac{A_o - A_e}{1 - A_e} = \frac{0.983 - 0.98118}{1 - 0.98118} \approx 0.097$
There is little to no agreement between the annotators.

	B: entity	B: non-entity	Total
A: entity	1	9	10
A: non-entity	8	982	990
Total	9	991	1000

Fleiss' Kappa

Cohen's kappa only works with exactly two annotators
For three or more annotators: Fleiss' kappa
It's a generalization of Cohen's kappa to 3+ annotators
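A sketch of Fleiss' kappa using its standard definition. The input layout is an assumption: `counts[i][j]` is the number of annotators who gave item `i` the label `j`, and every item must be labelled by the same number of annotators. The toy data (4 items, 3 annotators, 2 labels) is made up for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item label counts."""
    n_items = len(counts)
    n_labels = len(counts[0])
    n_raters = sum(counts[0])  # assumed identical for every item

    # Per-item agreement: share of annotator pairs that chose the same label.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each label.
    label_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    expected = sum(p * p for p in label_share)

    return (observed - expected) / (1 - expected)


# Rows: items; columns: how many of the 3 annotators chose label 0 / label 1.
toy_counts = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(round(fleiss_kappa(toy_counts), 3))  # 0.625
```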

Adjudication

Adjudication is a process of resolving disagreements.
Possible methods:
- Let the annotators discuss the difficult cases.
- Seek expert opinion.
- Use the majority (i.e., most commonly used) label.
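The majority-label method is easy to automate. The sketch below returns `None` on a tie so that such items can be passed on to discussion or an expert; the function name `adjudicate` and the tie-handling policy are assumptions, not something prescribed in the lecture.

```python
from collections import Counter


def adjudicate(labels):
    """Return the majority label, or None if the top labels are tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no majority: needs discussion or an expert
    return ranked[0][0]


print(adjudicate(["PER", "PER", "ORG"]))  # PER
print(adjudicate(["PER", "ORG"]))         # None
```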

Summary

Annotation cycle
Annotation model and guidelines
Annotators
Inter-annotator agreement

Recommended Reading

Pustejovsky, J., & Stubbs, A. (2013). Natural language annotation for machine learning. O'Reilly Media.
https://rug.on.worldcat.org/oclc/801812987
Chapter 1 and Chapter 6

Next Lecture - 12 May

Setting up a machine learning experiment