Annotation for Machine Learning Lecture 2: Annotation model and guidelines, inter-annotator agreement
Course Progress
- Week 1: Introduction to machine learning and annotation + Labs
- Week 2: No lecture (Easter Monday) but labs as scheduled!
- Week 3: Annotation model, guidelines, inter-annotator agreement + Labs
- Week 4: No lecture (Liberation Day) but labs as scheduled!
- Week 5: Setting up a ML experiment + Labs
- Week 6: ML classifiers + Labs
- Week 7: Testing, evaluation, revision and reporting but no labs (Ascension Day)
- Week 8: Wrap up, Exam Prep, Q&A
Assignments
- Assignment 1 (ungraded): Setting up your working environment
- Assignment 2: Annotations and guidelines (done in pairs)
- Assignment 3: Article review / reflection
- Assignment 4: Annotation with a model + ML experiment
- Assignment 5: Comparing models
- Raise your hand if you still have no partner for the annotation assignment
Submitting Assignments
- Keep submitting assignments
- Assignment 1 is ungraded but compulsory; the other assignments are graded
- To be admitted to the exam:
- All assignments must be submitted
- The average assignment grade must be at least 5.5
Last Lecture
- Characteristics of learning
- Examples of NLP tasks
- Levels of linguistic description
Last Lecture: Recap
- Discuss the following with a person sitting next to you for 3 minutes:
- How does a machine learning system differ from a rule-based system?
- Give 2 examples of NLP tasks (and describe what they are)
- Name (at least) 5 levels of linguistic description
Today's Lecture
- Annotation development cycle
- Annotation in practice
- Inter-annotator agreement
Annotation Development Cycle
MATTER Cycle
- Model
- Annotate
- Train
- Test
- Evaluate
- Revise
MATTER Cycle Explained
- Model the phenomenon
- Annotate the data
- Train the algorithms
- Test them on unseen data
- Evaluate the results
- Revise the model and algorithms
MATTER: Model the Phenomenon
- "Model" here is a conceptual model, not a ML algorithm.
MATTER: Model the Phenomenon
- Model M = ⟨T, R, I⟩
- Term vocabulary, T
- Relations between terms, R
- Interpretation of terms, I
- T = {Document_type, Spam, Non-spam}
- R = {Document_type ::= Spam | Non-spam}
- I = {Spam = "text we don't want to keep", Non-spam = "text we want to keep"}
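The triple M = ⟨T, R, I⟩ can also be written down as a small data structure. The following is only a sketch: the class name `AnnotationModel`, its field names, and the helper `is_valid_label` are hypothetical and chosen for illustration; the terms, relation, and interpretations are the ones from the spam example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationModel:
    """Illustrative container for a model M = <T, R, I> (names are assumptions)."""
    terms: frozenset        # T: term vocabulary
    relations: dict         # R: which labels a term can expand to
    interpretations: dict   # I: what each label means

    def allowed_labels(self, term):
        """Return the labels that `term` may take according to R."""
        return self.relations[term]


spam_model = AnnotationModel(
    terms=frozenset({"Document_type", "Spam", "Non-spam"}),
    relations={"Document_type": ("Spam", "Non-spam")},
    interpretations={
        "Spam": "text we don't want to keep",
        "Non-spam": "text we want to keep",
    },
)


def is_valid_label(model, label):
    """Check that a label is one of the values allowed for Document_type."""
    return label in model.allowed_labels("Document_type")


print(is_valid_label(spam_model, "Spam"))
print(is_valid_label(spam_model, "Maybe"))
```

Writing the model down this explicitly makes it easy to reject labels that are not part of the scheme before they end up in the data.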
MATTER: Annotate the Data
- The actual annotation process
- Doesn't always work well from the first attempt
- May need to do multiple iterations
- Multiple annotators, adjudication
- Outcome: gold standard data
MAMA Cycle
- MA-MA (Model – Annotate – Model – Annotate)
- [Diagram: the MAMA loop (Model, Annotate) shown inside the MATTER cycle: Model, Annotate, Train, Test, Evaluate, Revise]
MATTER: Train, Test, Evaluate
- This is the machine learning part of the cycle.
- Training a supervised model on (annotated) data
- Testing the model on unseen (but annotated) data
- Evaluating the model's performance
MATTER: Train and Test
- Very important: split your data into training set and test set
- More about this next week
- The exact ratio depends on the data set size
- Common for medium-size data sets: 80% training, 20% test
- In many cases, you need an extra validation (development) set
- 80% training, 10% validation, 10% test
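As a rough illustration of the 80/10/10 split, the sketch below shuffles a data set and cuts it into three parts. The function name `split_dataset`, its parameters, and the fixed seed are assumptions made for this example; in practice a library routine may be used instead.

```python
import random


def split_dataset(items, train=0.8, validation=0.1, seed=42):
    """Shuffle items and split them into train, validation, and test parts.

    The test set receives whatever remains after train and validation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# 100 placeholder items standing in for annotated documents
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before splitting matters when the data is ordered (for example by date or by label), because otherwise the test set may not resemble the training set.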
MATTER: Evaluate
- Once you have obtained your test results, compute evaluation metrics.
- Common ways to evaluate classification algorithms:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion matrix
- ROC curve
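The first four metrics can be computed directly from gold and predicted labels. This is a minimal sketch for the binary case; the function name `binary_metrics` and the toy labels are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def binary_metrics(gold, predicted, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy test set: gold annotations versus classifier output
gold = ["SPAM", "SPAM", "NON-SPAM", "NON-SPAM", "SPAM"]
predicted = ["SPAM", "NON-SPAM", "NON-SPAM", "SPAM", "SPAM"]
metrics = binary_metrics(gold, predicted, positive="SPAM")
print(metrics)
```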
MATTER: Revise
- Possible revisions:
- Introduce a new tag or type
- Split an existing tag or type
- Collect more data
- The MATTER cycle restarts if a revision is made
Annotation In Practice
Before Starting the Annotation
- 3-minute exercise
- Work in groups of 2-3
- Discuss what you need to know before you start annotating data
- Share your thoughts with the class
Before Starting the Annotation - Key Questions
- What data set are you annotating?
- What are the labels (or: what is the annotation model)?
- What is the unit of annotation, e.g.:
- Single word (what is the part-of-speech?)
- Sentence (does this sentence express a positive/negative sentiment?)
- Text (what is the main topic of this text?)
- How will you store the annotations?
- Do you need any specialized knowledge?
- The goal: how will these annotations be used?
Storing Annotations
- Two options:
- inline: in the same document (simple, but changes the raw data)
- standoff: in a different document (doesn’t affect the data, but more bookkeeping)
- Common formats:
- NER labels in XML https://www.frontiersin.org/articles/10.3389/fdigh.2018.00002/full
- Inline linear formats: Mary_PER bought chocolates
- Inline XML: <PER>Mary</PER> bought chocolates
- Column-based formats: Mary,PER bought,0 chocolates,0
- Standoff annotations: 0,4,PER
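To make the difference between inline and standoff storage concrete, the sketch below converts the inline linear example into raw text plus character offsets. The function name `inline_to_standoff` and the assumption that tokens are separated by single spaces are choices made for this illustration only.

```python
def inline_to_standoff(tagged_text, outside="0"):
    """Convert 'Mary_PER bought' style text to raw text and (start, end, label) spans.

    Tokens without an underscore are treated as untagged (label `outside`).
    """
    tokens, spans, offset = [], [], 0
    for item in tagged_text.split():
        word, sep, label = item.rpartition("_")
        if not sep:                      # no tag attached to this token
            word, label = item, outside
        start, end = offset, offset + len(word)
        if label != outside:
            spans.append((start, end, label))
        tokens.append(word)
        offset = end + 1                 # +1 for the space between tokens
    return " ".join(tokens), spans


raw_text, standoff = inline_to_standoff("Mary_PER bought chocolates")
print(raw_text)
print(standoff)
```

The raw text stays untouched, and the span `(0, 4, "PER")` corresponds to the standoff line `0,4,PER` above.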
Annotation Model
- Annotation model, or annotation scheme: Structured framework (or set of rules) that defines how data should be annotated.
- Minimum specification:
- What are the labels?
- What is the exact interpretation of each label?
Annotation Model - Key Questions
- What are the labels?
- What is the exact interpretation of each label?
- Example: Annotating emails for spam detection
- Labels: ?
- Interpretation: ?
Annotation Model - Spam Detection Example
- What are the labels?
- What is the exact interpretation of each label?
- Example: Annotating emails for spam detection
- Labels: SPAM, NON-SPAM
- Interpretation:
- SPAM is an email we don’t want
- NON-SPAM is an email we want
Annotation Model - Named Entities Example
- What are the labels?
- What is the exact interpretation of each label?
- Example: Annotating named entities in the text
- Labels: ?
- Interpretation: ?
Annotation Model - Named Entities Example
- What are the labels?
- What is the exact interpretation of each label?
- Example: Annotating named entities in the text
- Labels: PER, LOC, ORG
- Interpretation:
- PER is a person
- LOC is a location
- ORG is an organization
Annotation Model - Image Annotation Example
- What are the labels?
- What is the exact interpretation of each label?
- Example: Annotating images of cats and dogs (to train a ML classifier)
- Labels: ?
- Interpretation: ?
Annotation Model - Image Annotation Example
- What are the labels?
- What is the exact interpretation of each label?
- Example: Annotating images of cats and dogs (to train a ML classifier)
- Labels: CAT, DOG, N/A
- Interpretation:
- CAT means “image of a cat”
- DOG means “image of a dog”
- N/A means “not an image of a cat or a dog”
Revisiting the Annotation Model
- The annotation model can (and often should) be revisited.
- Example: you start with labels CAT, DOG, and N/A, but after annotating several images you run into an image that shows both a cat and a dog.
- You should either allow multiple labels (CAT, DOG) or introduce a new label, CAT+DOG.
Annotation Guidelines
- Annotation guidelines are the instructions for the annotators.
- They specify how to apply the annotation model to the data.
- They should answer the following questions:
- What is the goal of the project?
- What is each label (tag) and how is it used?
- What parts of the data and which units do you want to annotate?
- How will the annotation be created?
- Include examples!
- In many cases, guidelines are revisited (in the MAMA cycle).
Annotation Model vs. Guidelines
- What's the difference between model and guidelines?
Annotation Model vs. Guidelines - Discussion
- Discuss for 4 minutes
- Compare annotation guidelines and the annotation model on:
- Purpose
- Audience
- Scope
- Focus
- Format
Annotation Model vs. Guidelines - Comparison
| | Annotation guidelines | Annotation model |
|---|---|---|
| Purpose | Provide practical instructions for consistent annotation | Define the conceptual structure of annotations |
| Audience | Human annotators | Researchers, data designers |
| Scope | Task-specific, often with examples | Abstract, applies to multiple datasets |
| Focus | How to annotate (instructions) | What to annotate (categories and relationships) |
| Format | A document or manual for annotators | A schema, ontology, etc. |
Revisiting Model and Guidelines
- Set aside some data for developing labels and guidelines.
- Annotate these documents using the initial model and guidelines.
- Evaluate your annotation model.
- Revisit your model and guidelines based on the experience.
- The MAMA cycle (Model – Annotate – Model – Annotate) (Pustejovsky & Stubbs, 2013)
Annotators
Who Does the Annotation?
Who Does the Annotation? - Considerations
- Preferably not a single person:
- Misunderstanding guidelines
- Subjective interpretation
- Biases (including unconscious bias – e.g., a researcher who wants their hypothesis to be true)
- Several experts:
- Compute inter-annotator agreement
- Possibly discuss complicated cases
- Many non-expert annotators (crowdsourcing)
- Automated annotation based on machine learning
How to Choose Annotators?
- What skills do your annotators need to have to perform your annotation task?
- Does your annotation task require any specialized knowledge? Vice versa: can specialized knowledge cause issues?
- What are practical considerations?
Crowdsourcing
- Large data annotation = several annotators × a lot of work = a lot of annotators × a bit of work
- Prolific - https://www.prolific.com/
- Amazon Mechanical Turk - https://www.mturk.com/
- Douglas et al. (2023): https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0279720
Crowdsourcing: reCAPTCHA
- Millions of CAPTCHAs are solved by people every day. reCAPTCHA makes positive use of this human effort by channeling the time spent solving CAPTCHAs into digitizing text, annotating images, and building machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.
Crowdsourcing: Gamification
- Incentives for annotators:
- Financial compensation
- Access to a web resource
- Enjoyment
- Other intrinsic motivation
- An example of gamification:
- Challenge – develop annotation games: https://www.izdigital.fau.eu/2020/10/13/annotator-challenge-develop-a-game-to-gamify-the-annotating/
Crowdsourcing: Intrinsic Motivation
- https://www.woordwaark.nl/
- Interactive language data set for Gronings.
- Gronings is a set of low-resource dialects.
- By contributing to this project, Gronings speakers benefit themselves.
Looking for Annotated Data
- 5-minute exercise, in groups of 4-5
- Find an annotated data set of your choice
- Locate information about the annotations:
- What are the annotation labels?
- Are the guidelines publicly available?
- Who annotated the data?
Inter-Annotator Agreement
Inter-Annotator Agreement (IAA)
- To what extent do your annotators agree with each other?
- We need to measure the agreement between different annotators.
- Can’t ask the same person twice.
- Often measured on a data sample, as a part of the MAMA cycle.
IAA: Basic Idea
- Measuring IAA is a way to evaluate your annotations.
- High IAA – robust and reproducible annotation.
- Low IAA:
- Your guidelines are not sufficiently clear.
- A lot of room for subjective judgements in this task.
- Your annotators aren’t sufficiently motivated to perform the task well.
Why Measure IAA?
- Improve your annotation model and guidelines.
- Detect difficult cases in your data.
- Compare annotators’ performance in this task.
- Assess reliability of the annotation process.
- No reference corpus / gold standard data.
- Often no single “correct” answer.
Inter-Annotator Agreement Example
- Task: annotating customer reviews
- Annotation model: positive/negative
| Item | Annotator A | Annotator B |
|---|---|---|
| Review 1 | + | + |
| Review 2 | + | – |
| Review 3 | – | – |
| Review 4 | + | + |
| Review 5 | – | – |
Inter-Annotator Agreement Example - Confusion Matrix
- Task: annotating customer reviews
- Annotation model: positive/negative
| | B Positive | B Negative | Total |
|---|---|---|---|
| A Positive | 2 | 1 | 3 |
| A Negative | 0 | 2 | 2 |
| Total | 2 | 3 | 5 |
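The confusion matrix can be derived mechanically from the two annotators' label lists. The sketch below uses the five reviews from the example; the helper name `confusion_counts` is hypothetical, and "-" stands in for the negative label.

```python
from collections import Counter

# Labels for Review 1..5, copied from the example table
annotator_a = ["+", "+", "-", "+", "-"]
annotator_b = ["+", "-", "-", "+", "-"]


def confusion_counts(labels_a, labels_b):
    """Count how often each (label of A, label of B) pair occurs."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items")
    return Counter(zip(labels_a, labels_b))


counts = confusion_counts(annotator_a, annotator_b)
for a_label in ["+", "-"]:
    # One row per label of annotator A, one column per label of annotator B
    print("A", a_label, [counts[(a_label, b_label)] for b_label in ["+", "-"]])
```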
Confusion Matrix
- In the context of annotation, a confusion matrix shows how the annotators agree or disagree
Confusion Matrix - Good or Not?
- The confusion matrix must have labeled rows and columns indicating annotators and categories.
Inter-Annotator Agreement - Raw Agreement
Raw, or observed agreement Ao: the share of items on which annotators agree.
Ao = (2 + 2) / 5 = 0.8
| | B Positive | B Negative | Total |
|---|---|---|---|
| A Positive | 2 | 1 | 3 |
| A Negative | 0 | 2 | 2 |
| Total | 2 | 3 | 5 |
Inter-Annotator Agreement - Raw Agreement with Three Categories
Raw, or observed agreement Ao: the share of items on which annotators agree.
Ao = (2 + 2 + 3) / 11 ≈ 0.64
| | B Positive | B Negative | B Neutral | Total |
|---|---|---|---|---|
| A Positive | 2 | 1 | 1 | 4 |
| A Negative | 0 | 2 | 1 | 3 |
| A Neutral | 0 | 1 | 3 | 4 |
| Total | 2 | 4 | 5 | 11 |
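Observed agreement is the sum of the diagonal of the confusion matrix divided by the number of items, regardless of how many categories there are. A minimal sketch (the function name `observed_agreement` is an assumption; the matrices are the two tables above without their Total row and column):

```python
def observed_agreement(matrix):
    """Share of items on the diagonal, i.e. where both annotators agree."""
    total = sum(sum(row) for row in matrix)
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    return agreed / total


# Rows: annotator A, columns: annotator B
two_labels = [[2, 1],
              [0, 2]]
three_labels = [[2, 1, 1],
                [0, 2, 1],
                [0, 1, 3]]

ao_two = observed_agreement(two_labels)
ao_three = observed_agreement(three_labels)
print(round(ao_two, 2), round(ao_three, 2))
```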
Issues with Observed IAA
- What are the drawbacks of this measure?
- Think about it for 3 minutes, discuss with others
| | B Positive | B Negative | Total |
|---|---|---|---|
| A Positive | 2 | 1 | 3 |
| A Negative | 0 | 2 | 2 |
| Total | 2 | 3 | 5 |
Ao = (2 + 2) / 5 = 0.8
Issues with Observed IAA - Chance Agreement
- Annotators may agree by chance!
- With 2 equally likely labels, there is a 50% chance for each data point that 2 annotators agree at random
- Especially problematic for data with imbalanced classes
Issues with Observed IAA - Imbalanced Classes Example
- Example: a corpus with 1000 tokens with just a few named entities.
- Annotating whether each word is a named entity
- Annotation model:
- NE: entity
- 0: non-entity
- Last_0 weekend_0, I_0 travelled_0 to_0 Berlin_NE to_0 visit_0 an_0 old_0 friend_0. We_0 spent_0 most_0 of_0 the_0 time_0 exploring_0 museums_0 and_0 drinking_0 coffee_0.
Issues with Observed IAA - High Agreement, Low Usefulness
| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |
IAA Scores - Accounting for Chance
- We need to take into account agreement by chance.
- IAA scores commonly used instead of raw agreement:
- Cohen’s kappa (κ)
- Fleiss’ kappa (κ)
- Krippendorff’s alpha (α)
Cohen's Kappa
- Idea of Cohen’s κ: calculate the expected chance agreement (Ae) and take that into consideration.
- Cohen’s κ measures how much of the possible agreement beyond chance was obtained.
- κ = (Ao − Ae) / (1 − Ae)
- Ao – observed agreement
- Ae – expected agreement (by chance)
- (both values between 0 and 1)
- Ao − Ae: amount of agreement beyond chance
- 1 − Ae: maximum possible amount of agreement beyond chance
- κ is the ratio between these two values, a number between –1 and 1
Expected Agreement
- This is the agreement by chance, Ae
- An example with two equally likely labels:
- 2 annotators (A and B) and 2 labels (TRUE and FALSE)
- For label TRUE:
- Annotator A will choose TRUE half of the time, probability 0.5
- Annotator B will choose TRUE half of the time, probability 0.5
- Both annotators: 0.5 × 0.5 = 0.25
- Same calculations for label FALSE: 0.5 × 0.5 = 0.25
- Ae = 0.25 + 0.25 = 0.5
Expected Agreement - Verification
- Let's check if our calculations are correct!
- Write down all possible label combinations in the table:
- In 2 out of 4 rows there is agreement (both + or both –)
- 2 / 4 = 0.5
| Item | Annotator A | Annotator B | Agreement? |
|---|---|---|---|
| Combination 1 | + | + | Yes |
| Combination 2 | + | – | No |
| Combination 3 | – | + | No |
| Combination 4 | – | – | Yes |
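The same check can be done programmatically: enumerate all label combinations for two annotators and count the agreeing ones, then compare with the product-of-probabilities calculation. This is only an illustrative sketch; the variable names are made up, and it assumes both labels are equally likely.

```python
from itertools import product

labels = ["TRUE", "FALSE"]

# All ways in which annotators A and B can label one item
combinations = list(product(labels, repeat=2))
agreeing = [pair for pair in combinations if pair[0] == pair[1]]
expected_by_enumeration = len(agreeing) / len(combinations)

# Same value from the per-label probabilities: sum of P(A = l) * P(B = l)
probability = {label: 1 / len(labels) for label in labels}
expected_by_formula = sum(probability[l] * probability[l] for l in labels)

print(expected_by_enumeration, expected_by_formula)
```

With unequal label probabilities the enumeration shortcut no longer applies, and the per-label products have to be computed from each annotator's own label distribution.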
Cohen's Kappa - Unbalanced Dataset Example
- We already know Ao from before: Ao=(982+1)/1000=0.983
- Need to compute expected agreement Ae
- κ = (Ao − Ae) / (1 − Ae)
| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |
Cohen's Kappa - Calculation Example (1)
- Computed per label – first for “entity”:
- How likely was each annotator to choose the label “entity”?
- Annotator A: 10/1000 = 0.01
- Annotator B: 9/1000 = 0.009
- How likely were both to select “entity”? 0.01 × 0.009 = 0.00009
- κ = (Ao − Ae) / (1 − Ae)
| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |
Cohen's Kappa - Calculation Example (2)
- Computed per label – now for “non-entity”
- How likely was each annotator to choose the label “non-entity”?
- Annotator A: 990/1000 = 0.99
- Annotator B: 991/1000 = 0.991
- How likely were both to select “non-entity”? 0.99 × 0.991 = 0.98109
- κ = (Ao − Ae) / (1 − Ae)
| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |
Cohen's Kappa - Calculation Example (3)
- Computed per label – now add them up!
- Ae = 0.00009 + 0.98109 = 0.98118
- We already know Ao from before: Ao = (982+1) / 1000 = 0.983
- We have computed Ae: Ae = 0.98118
- κ = (Ao − Ae) / (1 − Ae) = (0.983 − 0.98118) / (1 − 0.98118) ≈ 0.097
| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |
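The three calculation steps can be combined into one function that takes the confusion matrix and returns Ao, Ae, and κ. This is a minimal sketch rather than a reference implementation; the function name `cohens_kappa` is chosen for this example, and it does not handle the degenerate case Ae = 1.

```python
def cohens_kappa(matrix):
    """Return (Ao, Ae, kappa) for a square confusion matrix.

    Rows are annotator A's labels, columns are annotator B's labels.
    Raises ZeroDivisionError if Ae equals 1 (both always use one label).
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)

    # Observed agreement: share of items on the diagonal
    a_o = sum(matrix[i][i] for i in range(n)) / total

    # Expected agreement: per label, P(A picks it) * P(B picks it), summed
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    a_e = sum((row_totals[k] / total) * (col_totals[k] / total)
              for k in range(n))

    return a_o, a_e, (a_o - a_e) / (1 - a_e)


# Named-entity example: [entity, non-entity] for A (rows) and B (columns)
entity_matrix = [[1, 9],
                 [8, 982]]
a_o, a_e, kappa = cohens_kappa(entity_matrix)
print(round(a_o, 3), round(a_e, 5), round(kappa, 3))
```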
Interpreting Cohen's Kappa
- According to Landis & Koch (1977):
- ≤ 0 no agreement
- 0.01–0.20 none to slight
- 0.21–0.40 fair
- 0.41–0.60 moderate
- 0.61–0.80 substantial
- 0.81–1.00 almost perfect
Interpreting Cohen's Kappa - Modern Interpretations
- Modern interpretations are more conservative (McHugh, 2012)
| Value of Kappa | Level of Agreement |
|---|---|
| 0-.20 | None |
| .21-.39 | Minimal |
| .40-.59 | Weak |
| .60-.79 | Moderate |
| .80-.90 | Strong |
| Above .90 | Almost Perfect |
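The table can be turned into a small lookup function. The sketch below follows the ranges in the table; how values that fall between two printed ranges (for example 0.205) are assigned is an assumption made here, as is the function name `interpret_kappa`.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the agreement level from the table above."""
    if kappa > 0.90:
        return "Almost Perfect"
    if kappa >= 0.80:
        return "Strong"
    if kappa >= 0.60:
        return "Moderate"
    if kappa >= 0.40:
        return "Weak"
    if kappa > 0.20:
        return "Minimal"
    return "None"  # also used for zero and negative values


for value in (0.097, 0.35, 0.64, 0.85, 0.95):
    print(value, interpret_kappa(value))
```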
Interpreting Cohen's Kappa - Example Result
- κ = (Ao − Ae) / (1 − Ae) = (0.983 − 0.98118) / (1 − 0.98118) ≈ 0.097
- There is little to no agreement between the annotators.
| | B: entity | B: non-entity | Total |
|---|---|---|---|
| A: entity | 1 | 9 | 10 |
| A: non-entity | 8 | 982 | 990 |
| Total | 9 | 991 | 1000 |
Fleiss' Kappa
- Cohen's kappa only works with exactly two annotators
- For three or more annotators: Fleiss' kappa
- It's a generalization of Cohen's kappa to 3+ annotators
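For reference, a minimal sketch of Fleiss' kappa using the standard formula: per-item agreement is averaged over all items and compared with the chance agreement derived from the overall label proportions. The function name `fleiss_kappa` and the toy ratings are made up for this example, and the sketch assumes every item is labelled by the same number of annotators. Note that it pools all annotators' labels into one distribution, so with two annotators its value can differ slightly from Cohen's κ.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("Every item needs the same number of annotations")

    categories = sorted({label for item in ratings for label in item})
    per_item = [Counter(item) for item in ratings]

    # Agreement per item: share of annotator pairs that chose the same label
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in per_item
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each label
    proportions = [
        sum(c[cat] for c in per_item) / (n_items * n_raters)
        for cat in categories
    ]
    p_e = sum(p ** 2 for p in proportions)

    return (p_bar - p_e) / (1 - p_e)


# Toy data: 4 items, 3 annotators each
toy_ratings = [["+", "+", "+"],
               ["+", "+", "-"],
               ["-", "-", "-"],
               ["+", "-", "-"]]
toy_kappa = fleiss_kappa(toy_ratings)
print(round(toy_kappa, 3))
```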
Adjudication
- Adjudication is a process of resolving disagreements.
- Possible methods:
- Let the annotators discuss the difficult cases.
- Seek expert opinion.
- Use the majority (i.e., most commonly used) label.
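Majority-based adjudication is the easiest of these methods to automate. In the sketch below, the function name `majority_label` and the decision to return `None` on a tie (so that the item can be passed on to discussion or an expert) are assumptions for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None when the top labels are tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: needs discussion or an expert decision
    return ranked[0][0]


print(majority_label(["PER", "PER", "ORG"]))
print(majority_label(["PER", "ORG"]))
```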
Summary
- Annotation cycle
- Annotation model and guidelines
- Annotators
- Inter-annotator agreement
Recommended Reading
- Pustejovsky, J., & Stubbs, A. (2013). Natural language annotation for machine learning. O'Reilly Media.
- https://rug.on.worldcat.org/oclc/801812987
- Chapter 1 and Chapter 6
Next Lecture - 12 May
- Setting up a machine learning experiment