Text as Data - Exam Preparation Notes


Bug Bounty Winners

  • The bug bounty winners are:

    • 1st place: Ray-Gi (10 points)

    • 2nd place: Saltssaumure (6 points)

    • 3rd place: mrktrnbll (5 points)

    • 3rd place: Lewism1404 (5 points)

Sean's Research

  • Dr. Sean MacAvaney's main research areas:

    • Using NLP to improve search results.

    • Doing it efficiently.

Claims to Fame

  • Context vectors are useful for determining document relevance.

    • Previous work using static word embeddings for search engines was not very successful.

    • Using [CLS] embeddings, context embedding similarity, or both significantly improves search results.

    • Reference: MacAvaney et al. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019

  • Masked Language Modelling provides helpful document expansion tokens.

    • Allows expanding documents with other helpful terms, which can be indexed offline.

    • Reference: MacAvaney et al. Expansion via Prediction of Importance with Contextualization. SIGIR 2020

  • Document similarity can help identify relevant documents.

    • Find relevant documents by iteratively scoring documents similar to relevant ones.

    • Reference: MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022

Published Student Projects

  • Mitko Gospodinov:

    • Text generation models, like Doc2Query, can "hallucinate" content, harming retrieval effectiveness.

    • Explored in his master's project and published at ECIR 2023.

  • Expansion Queries Example:

    • Shows example expansion queries and their ranks (e.g., "5th: what nationality is the name dvorak").

    • Some generated queries are wrong or lead to the wrong person.

  • Performance Metrics:

    • RR@10 (Reciprocal Rank at 10) is used to measure retrieval effectiveness.

    • The results compare RR@10 against the total number of tokens generated and the GPU hours spent in the generation and filtering phases.

  • Prashansa Gupta:

    • Observation: ~40% of queries for MS MARCO passage ranking are discarded.

    • Investigated the potential for "survivorship bias" in the dataset.

    • Reference: Gupta et al. On Survivorship Bias in MS MARCO. SIGIR 2022.

    • Reasons for discarding queries: Assessors couldn't find an answer in the top ~10 results.

    • Double-checking annotations usually agreed (~85% of the time).

    • Very few discarded queries were "ill-formed".

    • Answers to around two-thirds of discarded queries could be found in the (v1) corpus using newer ranking models.

    • Even with BM25, answers to about half could be found in the top 10.

    • Types of queries more likely to be discarded: "Description" and "Numeric".

    • The discarded queries belong to a different distribution.

    • The majority (71%) were well-formed, but the answer was not found in the top 10 (agrees with label).

    • 13% were well-formed, and the answer was found in the top 10 (disagrees on label).

    • 13% were well-formed but answered incompletely in the top 10.

  • Andreas Chari:

    • NLP models used for retrieval are affected by regional spelling conventions (e.g., British vs American spellings).

    • Normalizing can help.

    • Reference: Chari et al. On the Effects of Regional Spelling Conventions in Retrieval Models. SIGIR 2023.

  • Andrew Parry:

    • Prompt-based retrieval models are susceptible to "injection" attacks.

    • Including special tokens like "relevant" in content can increase its retrievability.

    • LLMs can subtly inject this content, too.

    • Reference: Parry et al. Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. ECIR 2024.

NLP Techniques in Search

  • Using NLP techniques in search can lead to lots of improvements.

  • Interesting questions/challenges remain:

    • Generative models for retrieval – dealing with "hallucination"

    • Building models that are "interpretable"

    • Learning about model/dataset biases, and correcting them

    • Improving search results while maintaining efficiency

    • Multi/cross-language search

  • Research opportunities:

    • L4/Master’s project on NLP for search engines.

    • Opportunity to publish / contribute to open source software

    • No need to have taken the IR course (but it helps!)

Language Models Encode Knowledge

  • Language models encode knowledge.

    • You can ask language models to answer factual questions.

      • e.g., by completing a sentence or directly with questions for the bigger models like ChatGPT.

  • They are often wrong.

    • Example link provided: https://transformer.huggingface.co/doc/distil-gpt2

Where is the Knowledge Encoded?

  • If a language model can tell us the capital of France, where is that encoded?

    • A Transformer is a big set of parameters.

    • GPT-3/4 has billions of parameters.

    • Which of those parameters correspond to Paris being the capital of France?

  • Basically impossible to know how this knowledge is encoded.

Knowledge Editing

  • Oh no, my language model is wrong!!!

  • Retraining from scratch with new knowledge is too costly.

  • Lots of research on:

    • Making minimal edits to LLMs to update individual facts.

    • Updating lots of facts at the same time.

    • Switching to a mini LM for new facts.

Retrieval-Augmented Language Models

  • Can you get a language model to look up its sources to find information?

    • New models can search a big corpus (e.g., Wikipedia) for relevant text and use that to help complete the missing words.

  • This enables:

    • Adding new information by adding new sources.

    • Updating information.

    • Asking for the source of knowledge.

    • Reference: Izacard, Gautier, and Edouard Grave. "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering." EACL 2021

Integrating Retrieved Text to Generate Answer

  • How to integrate multiple passages when generating the answer?

    • Encode them and concatenate them.

    • Create a new Transformer architecture.

  • It's still fallible!

    • Very susceptible to the power of suggestion, even when external sources are used.

    • Example: (Bing Chat)

The Problems with Attention

  • Attention scales quadratically with the length of the sequence (N²).

  • Scaling problems for long sequences.

  • Lots of research on more efficient attention mechanisms

Optimizing Attention and Large Models

  • Storing data in smaller data formats

    • Using 16-bit floating point (instead of 32-bit or 64-bit).

  • Using compression while computing attention

    • Compressing the attention weights into smaller data formats.

  • FlashAttention

    • Optimising the hardware memory accesses to be more efficient.

    • Reference: Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness."

  • Specialised hardware

    • e.g., the Tensor Processing Unit (TPU).

Lost in the Middle

  • Big LLMs can now accept vast text as input

    • GPT-4 accepts 128,000 tokens!

  • Enables asking questions about a whole book.

  • But does the location of the important info matter in the long text?

    • Apparently yes!

    • Reference: Liu, Nelson F., et al. "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics 12 (2024).

Representing Language by Dense Vectors is a Little “Odd”

  • Dense vectors are a very powerful tool but representing meaning by numbers feels strange.

  • Is reasoning just about guessing the next word?

  • Maybe we should keep other methods in mind

  • Is focusing entirely on transformers putting all our eggs in one basket?

  • The field of NLP may change dramatically in the future.

Jake's Research

The Explosion of Biological Data Driven by New Tech

  • DNA sequencing is now fairly cheap so it can be deployed across healthcare systems and beyond

  • This means that the amount of data that scientists see is vast.

    • Biology is becoming a data science.

    • References:

      • https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

      • https://nanoporetech.com/about-us/news/oxford-nanopore-announces-ps100-million-140m-fundraising-global-investors

New Technologies Enable Precision Medicine

  • “The right drug to the right patient at the right time”

  • Relies heavily on the latest research

  • Groups around the world are manually reviewing literature constantly

Interpretation is the Bottleneck of Precision Medicine

  • We can now get a lot of data on a person & their health

    • There are ~20,000 genes in the human genome.

    • We can measure their mutations, activity levels, etc.

  • But what does it mean?!

    • Reference: Good, Benjamin M., et al. "Organizing knowledge to enable personalization of medicine in cancer."

Search Tools and Knowledge Bases

  • Search Tools vs. Knowledge Bases:

    • Search Tools: ❌ May require users to read (many) papers, ❌ Cannot easily be used by automated analyses

    • Knowledge Bases: ❌ Only good if a KB exists for your problem, ❌ Huge cost burden to create and maintain

Let’s Use Text Mining to Build Some Knowledge Bases

  • Can we use named entity recognition and relation extraction on research articles?

Example Application for Cancer Treatments

  • “In colorectal cancer, KRAS mutations were found to be associated with cetuximab resistance.”

    • Morris et al 2011. PMID: 22248868

Expert Collaborators Manually Annotate Text for Task

  • Experts were asked to read each sentence and annotate it with the appropriate cancer/gene relationship; around 1,000 sentences were tagged this way.

  • Example:

    • SNCG level in colon adenocarcinoma is potentially valuable in predicting colon adenocarcinoma patients at high risk of recurrence and shorter survival after surgery. (pubmed)

    • gene: SNCG // cancer: colon adenocarcinoma

    • Annotation options: None, Prognostic, Diagnostic, Predictive, Other

Supervised Relation Extraction at Scale

  • Biomedical literature is big!

  • Need for high-precision predictions

  • Dependency-path methods have had good success with speed/performance trade-off

  • Lots of interesting challenges with using Transformer-based methods

Resulting Resource After Mining Abstracts & Full-Text Papers

  • CIViCmine resource mines genes, cancers, genetic variation, and their associations. (http://bionlp.bcgsc.ca/civicmine/)

  • Extra applications:

    • Mutations that affect drug metabolism

    • Genes that affect cancer growth

Can We Specialise it for Childhood Cancers?

  • A lot of knowledge bases are adult-focused. Let’s do a better job for childhood cancers

  • Different cancers have different age demographics

Future Projects

  • Extracting biomedical knowledge from text (and then doing cool stuff with it)

  • Challenges:

    • Extracting complex meanings and contexts of biomedical relations

    • Cataloguing the impact of genetic variation

    • Scaling information extraction methods to the vast scale of research papers

    • Making inferences of new treatments

  • No biomedical knowledge necessary

Research Opportunities

  • Get in contact!

    • Level 4 / MSci projects

      • If you’d like to work with me, get in contact before August

    • PhDs

      • Always happy to talk about PhD opportunities

      • Masters degree is not required

Revision

Part-of-speech Tagging

  • What is part-of-speech tagging?

    • Is a word a noun, a verb, a proper noun, etc?

    • Pick the part-of-speech for each word in some text.

    • It is a sequence labelling or token classification problem

    • There are a few different label sets for part-of-speech

    • Some more detailed, some less

    • For example: Universal dependencies has the “VERB” tag. (https://universaldependencies.org/u/pos/)

Why Care About Parts-of-Speech?

  • Part-of-speech is valuable for:

    • Extracting meaning from a sentence

    • Disambiguating the meaning of a word

    • As a feature for other text processing pipelines

  • But BERT/Transformers?!

    • An entirely Transformer-based approach will implicitly use part-of-speech information, so you don’t need to handle it explicitly

The Main Methods for Part-of-Speech Tagging

  • Hidden Markov Models (HMM)

    • Uses Markov assumption - only context used is the previous token

    • Different algorithms for solving transitions between hidden states

  • BERT token classification

    • Use context vectors from pretrained language model

    • Classifier takes vector and predicts POS for each token

  • Both approaches require an annotated dataset to learn from

Hidden Markov Models

  • Model problem as a series of transitions between hidden states (i.e. the parts-of-speech) and emissions

  • Different algorithms to solve problem:

    • Greedy algorithm - takes most likely part-of-speech at each step

    • Viterbi algorithm - considers the entire sequence to find the most likely path of tags
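
  • For concreteness, here is a minimal sketch of both decoding strategies on a toy HMM; the tags, words, and probabilities below are invented purely for illustration:

```python
# Toy HMM decoding for POS tagging: greedy vs Viterbi.
# All tags, words and probabilities below are invented for illustration.
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}                        # P(tag at position 0)
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},              # P(next tag | current tag)
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1},               # P(word | tag)
          "VERB": {"dogs": 0.1, "bark": 0.5}}

sentence = ["dogs", "bark"]

def greedy(sentence):
    """Pick the locally most likely tag at each step."""
    path, prev = [], None
    for i, word in enumerate(sentence):
        scores = {}
        for tag in tags:
            p_tag = start_p[tag] if i == 0 else trans_p[prev][tag]
            scores[tag] = p_tag * emit_p[tag].get(word, 1e-6)
        prev = max(scores, key=scores.get)
        path.append(prev)
    return path

def viterbi(sentence):
    """Dynamic programming over all tag sequences to find the most likely path."""
    V = [{tag: start_p[tag] * emit_p[tag].get(sentence[0], 1e-6) for tag in tags}]
    back = [{}]
    for i in range(1, len(sentence)):
        V.append({}); back.append({})
        for tag in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p] * trans_p[p][tag])
            V[i][tag] = V[i - 1][best_prev] * trans_p[best_prev][tag] * emit_p[tag].get(sentence[i], 1e-6)
            back[i][tag] = best_prev
    # Trace the best path back from the final position.
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(sentence) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

print(greedy(sentence))   # e.g. ['NOUN', 'VERB']
print(viterbi(sentence))  # e.g. ['NOUN', 'VERB']
```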

A Final Transformer Layer for Part-of-Speech Tagging

Same approach used for Named Entity Recognition in the labs

  • Use a final layer that takes each context vector and classifies it to a part of speech.
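
  • A minimal sketch of such a final layer using the Hugging Face transformers library (assuming transformers and PyTorch are available); the tag set is a small placeholder rather than a full part-of-speech label set:

```python
# Sketch: a token-classification head on top of BERT context vectors.
# Assumes the `transformers` and `torch` packages are installed; the tag set is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

pos_tags = ["NOUN", "VERB", "ADJ", "ADP", "DET", "PUNCT"]       # hypothetical label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(pos_tags))              # adds a new, untrained final layer

inputs = tokenizer("Kelvingrove park is beautiful", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                              # shape: (1, num_tokens, num_labels)

predictions = logits.argmax(dim=-1)[0]
for token, label_id in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), predictions):
    print(token, pos_tags[int(label_id)])   # predictions are random until the head is fine-tuned
```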

Word Vectors

Types of Word Vectors

  • (1) Static and sparse data (the IBM model)

    • Represents each word as the frequencies of the words that appear around it (from a large corpus)

    • Much like how we represent sparse documents! For example (each vector has length = vocab size):

                 attack  ocean  fish  eat  mind  endure  …
      Shark ->      134    531   451  432     0       4
      Bear  ->      745      2   137  312   313     123

  • (2) Static and dense data

    • Represents each word as smaller dense vectors (usually 100-1000 dimensions)

    • Usually trained via a small neural network to predict surrounding words

    • Or can be a “compressed” version of the sparse vectors above, using a method like SVD (see the sketch after this list)

    • Easier to work with. Why?

    • Shark -> [0.8, 0.2, 0.9, 0.4, 0.2, …]

    • Bear -> [0.4, 0.5, 0.1, 0.5, 0.9, …]

    • (Length: fixed, ~100 dimensions)

  • (3) Contextualised and dense

    • Builds a representation for the specific instance of the word within a specific document.

    • Uses models like Transformers to build them

    • For example (length: fixed, ~768 dimensions):
      Shark -> …The shark attacked… -> [0.8, 0.2, 0.9, 0.4, 0.2, …]
      Bear -> …The bear ate… -> [0.4, 0.5, 0.1, 0.5, 0.9, …]
      Bear -> …bear in mind… -> [0.1, 0.2, 0.8, 0.5, 0.3, …]
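
  • As a rough sketch of how the sparse count vectors in (1) can be “compressed” into static dense vectors as in (2), here is an illustrative use of truncated SVD; the counts loosely follow the Shark/Bear example above, the extra “whale” row is made up, and numpy and scikit-learn are assumed to be installed:

```python
# Sketch: compressing sparse co-occurrence vectors into dense ones with truncated SVD.
# Counts are illustrative, loosely following the Shark/Bear example above.
import numpy as np
from sklearn.decomposition import TruncatedSVD

context_words = ["attack", "ocean", "fish", "eat", "mind", "endure"]
counts = np.array([
    [134, 531, 451, 432,   0,   4],   # "shark": co-occurrence counts (length = vocab size)
    [745,   2, 137, 312, 313, 123],   # "bear"
    [120, 498, 600, 280,   1,   2],   # "whale" (made-up row so SVD has more to work with)
])

svd = TruncatedSVD(n_components=2)     # real systems use ~100-1000 dimensions
dense = svd.fit_transform(counts)      # shape: (3 words, 2 dimensions)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(dense)
# Words with similar context counts should end up closer in the dense space.
print("shark vs whale:", cosine(dense[0], dense[2]))
print("shark vs bear: ", cosine(dense[0], dense[1]))
```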

Transformers

Stepping Through a Transformer

  • The whole process of using a Transformer can be a bit mysterious.

    • Text goes in and context vectors come out?

    • Excluding the final layers that do token classification, etc.

  • A lot of this process is done automatically but let’s work through it
    Example: "Kelvingrove park is beautiful"

First Step: Subword Tokenization

  • Tokenizer has been trained using a large corpus of text

    • Possibly using BPE or a similar method

  • Splits up uncommon words (e.g. kelvingrove -> kelvin and ##grove)

  • We’ll use the ‘bert-base-uncased’ tokenizer (notice the lowercasing)

    • which uses ## to signify a token that doesn’t start a word

  • Example:
    "Kelvingrove park is beautiful" to ['kelvin', '##grove', 'park', 'is', 'beautiful']

Second Step: Add Any Special Tokens

  • Some tokenization methods use special tokens

    • BERT adds [CLS] and [SEP] at the beginning and end of the text

  • Example:
    ['kelvin', '##grove', 'park', 'is', 'beautiful'] to [‘[CLS]’, 'kelvin', '##grove', 'park', 'is', 'beautiful', ‘[SEP]’ ]

Third Step: Convert Tokens to IDs

  • During its training, the tokenizer created a vocabulary of tokens and their IDs

  • Here we use that mapping

    • This is using ‘bert-base-uncased’ which has a vocabulary of 30,522 tokens

    • It will be different for another tokenizer (e.g. GPT)

  • Example:
    [‘[CLS]’, 'kelvin', '##grove', 'park', 'is', 'beautiful', ‘[SEP]’ ] to [101, 24810, 21525, 2380, 2003, 3376, 102]
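
  • The first three steps can be reproduced directly with the Hugging Face tokenizer (a sketch, assuming the transformers library is installed); the outputs in the comments are the ones from the example above:

```python
# Sketch of steps 1-3 with the Hugging Face 'bert-base-uncased' tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Kelvingrove park is beautiful"

tokens = tokenizer.tokenize(text)                       # step 1: subword tokenization
print(tokens)            # ['kelvin', '##grove', 'park', 'is', 'beautiful']

with_special = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]   # step 2: special tokens
print(with_special)      # ['[CLS]', 'kelvin', '##grove', 'park', 'is', 'beautiful', '[SEP]']

ids = tokenizer.convert_tokens_to_ids(with_special)     # step 3: tokens -> IDs
print(ids)               # [101, 24810, 21525, 2380, 2003, 3376, 102]

# In practice, tokenizer(text) does all three steps (plus attention masks) in one call.
print(tokenizer(text)["input_ids"])   # same IDs, with the special tokens added automatically
```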

Fourth Step: Get the Input Embeddings for Those Token IDs

  • During its training, the transformer has learned initial vectors (effectively static vectors) for each token

    • BERT models often use vectors with 768 elements

    • With a vocab of 30,522, the model has a matrix storing the input embeddings of dimension 30522x768
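
  • A short sketch of looking up those input embeddings with transformers and PyTorch; the shapes assume 'bert-base-uncased':

```python
# Sketch: looking up the learned input embeddings for the token IDs (step 4).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
embedding_matrix = model.get_input_embeddings().weight
print(embedding_matrix.shape)          # torch.Size([30522, 768]) -- vocab size x hidden size

token_ids = torch.tensor([[101, 24810, 21525, 2380, 2003, 3376, 102]])
input_embeddings = model.get_input_embeddings()(token_ids)
print(input_embeddings.shape)          # torch.Size([1, 7, 768]) -- one static vector per token
```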

Fifth Step: Add Positional Information

  • Token vectors contain no information about their order

  • Create positional embedding vectors

    • BERT learns positional vectors for all possible 512 token locations

    • Other models use some sine and cosine equations given the token location

  • Add the positional vectors to the input token embeddings
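
  • BERT's positional vectors are learned parameters, but for the "sine and cosine" approach used by other models, a minimal illustrative sketch of the original Transformer's sinusoidal encoding looks like this:

```python
# Sketch of the sinusoidal positional encodings used by the original Transformer
# (BERT instead *learns* one vector per position, for up to 512 positions).
import numpy as np

def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    positions = np.arange(num_positions)[:, None]            # (num_positions, 1)
    dims = np.arange(dim)[None, :]                            # (1, dim)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions use cosine
    return encoding

pe = positional_encoding(num_positions=7, dim=768)            # one vector per token position
# These vectors are simply added element-wise to the 7 input embeddings from the previous step.
print(pe.shape)   # (7, 768)
```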

Are We Done?

  • We’ve got vectors with position. But we are not done!

  • These vectors represent each token individually!

  • They don’t factor in the context of the whole sentence

  • What does “park” mean?

Sixth Step: A Transformer Block

  • Initial inputs are vectors that don’t have context yet

  • Transformer block uses self-attention and other mechanisms to create context vectors

Inside the Transformer Block

  • Self-attention

    • Every output vector is a weighted combination of all the input vectors

    • Weights are based on the importance of that token in interpreting the token of interest

  • Multi-head Attention

    • Do attention X times so that you get X vectors for each token

  • Feed Forward

    • Traditional neural network

    • Allows for non-linearity

  • Add & Norm

    • Little tricks for making training easier
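
  • A minimal numpy sketch of single-head self-attention, where each output vector is a weighted combination of all the input vectors; the dimensions and weights are illustrative, and multi-head attention, Add & Norm, and the feed-forward layer are omitted:

```python
# Minimal single-head self-attention sketch (illustrative dimensions, random weights).
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 7, 16                      # 7 tokens, toy hidden size
X = rng.normal(size=(num_tokens, dim))       # input vectors (one per token)

# Learned projection matrices (random here) produce queries, keys and values.
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention weights: how much each token attends to every other token.
weights = softmax(Q @ K.T / np.sqrt(dim))     # shape: (7, 7) -- the N² cost mentioned earlier
outputs = weights @ V                         # each output is a weighted combination of the values
print(weights.shape, outputs.shape)           # (7, 7) (7, 16)
```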

Differences for a Decoder Block (Versus an Encoder)

  • For decoder-only transformers (e.g. GPT):

    • Self-attention is only allowed to examine tokens earlier in the input

    • No looking ahead!

  • For encoder-decoder transformers (e.g. T5):

    • Used for translating between languages

    • Attention cannot look ahead but can examine the encoded text in the other language

Seventh (and More) Step: More Transformer Blocks

  • Take the outputted vectors from one Transformer block and feed them into the next block

  • Each Transformer block has different parameters

    • e.g. different attention matrices, etc.

    • Layer 12 will do things differently to Layer 1

Transfer Learning: The Pre-Train and Fine-Tune Paradigm

  • Train an ML system on one task and then adapt it to another new task.

  • Common paradigm: train a language model for one task (e.g., masked language modelling) and then further train it on a new task (e.g., text classification, summarization, etc.).

  • Keep many of the same model parameters (most of the transformer layers) but replace the final bits to do the desired task.
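
  • A short sketch of the "keep most parameters, replace the final bits" idea with the Hugging Face transformers library; this is an illustrative variant that freezes the pre-trained encoder and trains only the new head, whereas full fine-tuning would update everything:

```python
# Sketch: the pre-train and fine-tune paradigm with Hugging Face transformers.
# Loading a *ForSequenceClassification model reuses the pre-trained BERT layers
# and attaches a new, randomly initialised classification head for the target task.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Optionally freeze the pre-trained encoder so only the new head is trained
# (a common, cheaper variant; full fine-tuning updates all parameters).
for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```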

Why Does This Work?

  • When training a model from scratch, it needs to learn a lot about the language itself:

    • what words mean, how they modify one another, etc.

  • By training on a task with nearly limitless data (language modelling), the model learns these common language patterns, which can be reused on other language tasks.

Visualizing Attention

  • Reference: https://github.com/jessevig/bertviz

Exam Preparation

Scan the Exam

  • Get a quick sense initially of what topics are coming up in each question

    • Trickier online, but you can click through questions

  • You don’t want to be surprised by a particularly long or challenging looking question later in the exam

Budget Your Time

  • In Text As Data, there will be three 20 mark questions

    • Do them all

  • Watch your time in each question

  • Don’t get stuck spending too much time in a question with few marks

    • Spend your time where the marks are

Paper Notes

  • You can bring paper notes to the exam

    • No digital notes this year

  • Good ideas:

    • Distil the course down into ~2 pages of quick lookup notes

    • Have a section for equations

    • Work with other students to build good notes

  • Bad ideas:

    • Printing all the course slides

    • Printing screeds of ChatGPT-generated content

    • Printing all the past papers with solutions

  • Do not waste your time searching your notes

Past Exam Papers and Example Answers

  • Past exam papers and example answers are on the Moodle

  • The Masters version of the exam is slightly harder.

  • These are reasonably representative of the upcoming exams.

  • In-person exams will have less challenging arithmetic, but you should still practice the various calculations from throughout the course.

Exam Advice – Level of Detail & Marks

  • Level of detail & marks given as [#] for each question.

  • In general, each mark is roughly one idea or one point in the answer.

Too Little Detail

  • You are not fully answering the question.

  • You will probably only be able to get partial marks at best.

  • Example of an answer with too little detail: "Stopwords like 'and' do not carry much meaning."

Too Much Detail

  • You're spending a lot of time writing on a question that carries very few marks.

  • Often does not demonstrate understanding of how the course material relates to the question.

  • Example of a "just right" level of detail: "LLMs are often trained on many public datasets, so the LLM may have been trained on Circa. This is called 'data contamination'. If the LLM was contaminated, the results can't be considered zero-shot, since it was directly trained on the Circa data."

Show Your Work for Calculations

  • Many questions specify to show your work.

  • Even if they do not, you cannot be given partial marks if we do not see where you went wrong!

  • You’re doing the work anyway – you should show it!

Exam Advice – Level of Detail & Marks - Summary

  1. Use [#] as a guide on how much detail to provide.

  2. Be careful about both too much detail and too little detail.

  3. Show your work for calculations.

Exam Advice – Types of Questions

  • The exams are open book and open notes, so questions that only ask you to recall information do not indicate mastery of the subject.

  • Questions focus on application, analysis, evaluation, creation.

  • All questions are free-text and manually marked (no multiple-choice or auto-grading)

Exam Advice – Types of Questions

  • Questions often require you to make connections between different concepts in the course or other fundamental CS concepts.

  • For example, to answer a question that combines these topics, you would need to:

    • Understand how UTF works (Lecture 1)

    • Understand how BPE works (Lecture 5)

    • Make connections between these ideas

Exam Advice – Types of Questions - Example

  • Unicode: A variable-byte (2-4 bytes) encoding

    • Encodes the alphabets of many languages

    • Emojis, maths symbols and lots more

    • Is frequently updated with new characters (v15.1 released in Sept 2023)

  • Byte pair encoding (BPE)

    • Inputs:

      • Large corpus of text to learn tokenization

      • The desired size of vocabulary

    • Algorithm:

      1. Pretokenise the corpus documents into words using a tokeniser (this will remove whitespace)

      2. Create a vocabulary of symbols: all unique characters in the corpus (i.e. all letters, numbers, etc)

      3. Repeat until the desired vocab size is reached:

        • Find the most common pair of neighbouring symbols in the corpus

        • Merge all instances of that pair into a new symbol and add the new symbol to the vocabulary
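
  • A rough, unoptimised sketch of the merge loop described above (an illustrative implementation on a toy corpus, not the exact one from the lectures):

```python
# Rough sketch of byte pair encoding's merge loop (illustrative, unoptimised).
from collections import Counter

corpus = ["low", "lower", "lowest", "newer", "wider"]          # tiny toy corpus, already pre-tokenised
words = [list(w) for w in corpus]                               # each word as a list of symbols
vocab = {ch for w in words for ch in w}                         # step 2: all unique characters
desired_vocab_size = 15

while len(vocab) < desired_vocab_size:
    # Find the most common pair of neighbouring symbols across the corpus.
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        break
    (a, b), _ = pairs.most_common(1)[0]
    merged = a + b
    vocab.add(merged)                                           # add the new symbol to the vocabulary
    # Merge every occurrence of the pair into the new symbol.
    for w in words:
        i = 0
        while i < len(w) - 1:
            if w[i] == a and w[i + 1] == b:
                w[i:i + 2] = [merged]
            else:
                i += 1

print(sorted(vocab, key=len))
print(words)    # the corpus re-written with the merged symbols
```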

Exam Advice – Types of Questions

  • Due to the type of questions, content copied from notes will rarely give any/many marks.

    • (We’re not grading how well you can search your notes; we are grading your mastery of the topic!)

  • Directly answer the question!

    • Even if the answer can be inferred from the submitted text, we want to see YOU do the inference.

Exam Advice – Types of Questions - Summary

  1. Questions focus on application, analysis, evaluation, creation, not remembering.

  2. As such, text copied directly from lectures is unlikely to gain many marks.

Glasgow’s Code of Assessment

  • Don’t be worried if you are not confident in every answer! The exam questions need to be challenging for 70% to represent “exemplary range and depth of attainment” of the ILOs.

TASD Intended Learning Outcomes

By the end of this course students will be able to:

  1. Describe classical models for textual representations such as the one-hot encoding, bag-of-words models, and sequences with language modelling.

  2. Identify potential applications of text analytics in practice.

  3. Describe various common techniques for classification, clustering and topic modelling, and select the appropriate machine learning task for a potential document processing application.

  4. Represent data as features to serve as input to machine learning models.

  5. Assess machine learning model quality in terms of relevant error metrics for document processing tasks, in an appropriate experimental design.

  6. Deploy unsupervised and machine learned approaches for document/text analytics tasks.

  7. Critically analyse and critique recent developments in natural language and text processing academic literature.

  8. Evaluate and explain the appropriate application of recent research developments to real-world problems.

Glasgow’s Code of Assessment - Summary

  • The exam will be challenging because 70% indicates “exemplary range and depth of attainment”

  • Be familiar with the course ILOs

Final Word on Good Practices

Assume Nothing, Check Everything

  • Never trust that your data is clean and ready to use

  • Write defensive code

    • Use asserts a lot

    • Check the inputs to functions

    • Use Python typing where appropriate

  • If you get perfect or zero results, that’s a good indicator that things have gone wrong

    • Reasonable-looking results can still be wrong - try to check as much as you can
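
  • A small sketch of what defensive code can look like in Python (type hints, input checks, and asserts); the function and its thresholds are invented for illustration:

```python
# Sketch of defensive coding: type hints, input checks and asserts.
# The function and its sanity-check thresholds are invented for illustration.
from typing import List

def average_token_length(tokens: List[str]) -> float:
    assert isinstance(tokens, list), "expected a list of tokens"
    assert len(tokens) > 0, "got an empty token list - upstream processing probably failed"
    assert all(isinstance(t, str) and t for t in tokens), "tokens must be non-empty strings"

    result = sum(len(t) for t in tokens) / len(tokens)

    # Sanity-check the output too: absurd values usually mean bad input data.
    assert 0 < result < 100, f"suspicious average token length: {result}"
    return result

print(average_token_length(["kelvin", "##grove", "park"]))
```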

Explore Your Data

  • Do sanity checks on everything you can think of (e.g. column mins, maxes)

  • Look for duplicates

  • Check a few of the labels yourself (if you can)

  • Unsupervised approaches:

    • Cluster your data to see if there are any groups you should be aware of

    • Use dimensionality reduction methods to visualise your data in 2D

  • Plot everything you can think of (different types of plots can be very helpful)
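
  • A sketch of the unsupervised exploration idea (clustering plus a 2D projection); the dataset, vectoriser, and cluster count are placeholders, and scikit-learn and matplotlib are assumed to be installed:

```python
# Sketch: exploring text data with clustering and a 2D projection (placeholder data).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.sport.hockey"]).data
X = TfidfVectorizer(max_features=5000).fit_transform(texts)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)   # any obvious groups?
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)      # project to 2D to eyeball it

plt.scatter(coords[:, 0], coords[:, 1], c=clusters, s=5)
plt.title("Documents projected to 2D, coloured by cluster")
plt.show()
```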

Be Careful of Code Doing Things Without Your Knowledge

  • A lot of packages will try to be “helpful”. This can trip you up if you’re not careful

  • Example:

    • Pandas tries to make things numbers. Sometimes this is not appropriate, e.g. for ID-like columns in a CSV such as:

      patientid,diagnosisid
      1,010
      2,014
      3,10.0
      4,14.0
      5,

      (leading zeros are dropped and the IDs become floats, with NaN for the missing value)
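
  • A sketch of this behaviour and one way to avoid it, using the example rows above (assumes pandas is installed):

```python
# Sketch: pandas type inference can silently change ID-like columns.
import io
import pandas as pd

csv_text = "patientid,diagnosisid\n1,010\n2,014\n3,10.0\n4,14.0\n5,\n"

# Default behaviour: diagnosisid becomes a float column, so leading zeros are lost,
# "010" and "10.0" collapse to the same value, and the missing entry becomes NaN.
print(pd.read_csv(io.StringIO(csv_text)))

# Telling pandas to keep the column as strings preserves the original values.
print(pd.read_csv(io.StringIO(csv_text), dtype={"diagnosisid": str}))
```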

Use Realistic Data for Your Evaluations

  • The data that you use for evaluation needs to be realistic of the real problem

  • If not, you are creating an ML system specific to that data that cannot generalise

  • You can do all kinds of things to the training set (e.g. resampling) that you should not do to validation/test set

Follow the Standard Process

  1. Check your data

    1. Manual inspection

    2. Check label counts

    3. Use unsupervised approaches to explore your data

  2. Build and tune pipelines using the training and validation set

    1. Use appropriate evaluation metrics

    2. Inspect the mistakes to get ideas for improving the model

  3. Pick your best model and evaluate on the test set
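
  • A short sketch of steps 2 and 3 of that process with scikit-learn; the dataset, split sizes, and model are placeholders:

```python
# Sketch of the standard process with scikit-learn; data and model are placeholders.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])

# Split once into train / validation / test (e.g. 60 / 20 / 20).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Step 2: build and tune pipelines on train + validation only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
print(classification_report(y_val, pipeline.predict(X_val)))   # tune using validation results

# Step 3: only once the final model is chosen, evaluate on the held-out test set.
print(classification_report(y_test, pipeline.predict(X_test)))
```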

Consider Class Imbalance

  • Can cause problems with:

    • Training: Classifier can favor the majority class and just always predict that

    • Evaluation: Using the wrong metric (especially accuracy) can obscure what is going on

                        Predicted Positive   Predicted Negative
      Actual Positive          TP                   FN
      Actual Negative          FP                   TN

  • The metric to focus on will depend on the project
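
  • A tiny worked example (with invented numbers) of why accuracy can mislead under class imbalance, for a classifier that always predicts the majority class:

```python
# Worked example: accuracy hides what happens under class imbalance (invented numbers).
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# 95 negative examples, 5 positive examples; the classifier always predicts "negative" (0).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(confusion_matrix(y_true, y_pred))              # [[95  0]
                                                      #  [ 5  0]]  -> TP = 0, FN = 5
print("accuracy:", accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print("recall:  ", recall_score(y_true, y_pred))     # 0.0  -- never finds a positive
print("F1:      ", f1_score(y_true, y_pred))         # 0.0
```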

  • Reference: Lones, Michael A. "How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers." (https://arxiv.org/abs/2108.02497)

  • Outlines a number of the pitfalls when doing machine learning

  • Learning best practices takes time and reflection on what/why you are doing things

Best Practices for Machine Learning

  • Use training/validation/test splits of data

  • Be systematic

    • Set up experiments!

  • Watch out for class imbalance

  • Think about your evaluation metrics

    • Always better to use 2+ metrics

  • Look at your data

    • Helps you design a better classifier

    • Helps you understand the mistakes

  • Be cynical about data and the classifier

    • Data is always messy

    • Classifiers always look for an easy solution

Text as Data - Wrap Up

  • Please fill out the feedback form to let us know how we can improve the course with three quick questions. (https://tinyurl.com/textasdatafeedback)

  • Answers are anonymous. Enjoy! :D

