Text as Data - Exam Preparation Notes


Bug Bounty Winners

  • The bug bounty winners are:

    • 1st place: Ray-Gi (10 points)

    • 2nd place: Saltssaumure (6 points)

    • 3rd place: mrktrnbll (5 points)

    • 3rd place: Lewism1404 (5 points)

Sean's Research

  • Dr. Sean MacAvaney's main research areas:

    • Using NLP to improve search results.

    • Doing it efficiently.

Claims to Fame

  • Context vectors are useful for determining document relevance.

    • Previous work using static word embeddings for search engines was not very successful.

    • Using [CLS] embeddings, context embedding similarity, or both significantly improves search results.

    • Reference: MacAvaney et al. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019

  • Masked Language Modelling provides helpful document expansion tokens.

    • Allows expanding documents with other helpful terms, which can be indexed offline.

    • Reference: MacAvaney et al. Expansion via Prediction of Importance with Contextualization. SIGIR 2020

  • Document similarity can help identify relevant documents.

    • Find relevant documents by iteratively scoring documents similar to relevant ones.

    • Reference: MacAvaney et al. Adaptive Re-Ranking with a Corpus Graph. CIKM 2022

Published Student Projects

  • Mitko Gospodinov:

    • Text generation models, like Doc2Query, can "hallucinate" content, harming retrieval effectiveness.

    • Explored in his master's project and published at ECIR 2023.

  • Expansion Queries Example:

    • Shows example expansion queries and their ranks (e.g., "5th: what nationality is the name dvorak").

    • Some generated queries are wrong or lead to the wrong person.

  • Performance Metrics:

    • RR@10 (Reciprocal Rank at 10) is used to measure retrieval effectiveness.

    • The results compare RR@10 against the total number of tokens generated and the GPU hours spent in the generation and filtering phases.

  • Prashansa Gupta:

    • Observation: ~40% of queries for MS MARCO passage ranking are discarded.

    • Investigated the potential for "survivorship bias" in the dataset.

    • Reference: Gupta et al. On Survivorship Bias in MS MARCO. SIGIR 2022.

    • Reasons for discarding queries: Assessors couldn't find an answer in the top ~10 results.

    • Double-checking annotations usually agreed (~85% of the time).

    • Very few discarded queries were "ill-formed".

    • Answers to around two-thirds of discarded queries could be found in the (v1) corpus using newer ranking models.

    • Even with BM25, answers to about half could be found in the top 10.

    • Types of queries more likely to be discarded: "Description" and "Numeric".

    • The discarded queries belong to a different distribution.

    • The majority (71%) were well-formed, but the answer was not found in the top 10 (agrees with label).

    • 13% were well-formed, and the answer was found in the top 10 (disagrees on label).

    • 13% were well-formed but answered incompletely in the top 10.

  • Andreas Chari:

    • NLP models used for retrieval are affected by regional spelling conventions (e.g., British vs American spellings).

    • Normalizing can help.

    • Reference: Chari et al. On the Effects of Regional Spelling Conventions in Retrieval Models. SIGIR 2023.

  • Andrew Parry:

    • Prompt-based retrieval models are susceptible to "injection" attacks.

    • Including special tokens like "relevant" in content can increase its retrievability.

    • LLMs can subtly inject this content, too.

    • Reference: Parry et al. Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models. ECIR 2024.

NLP Techniques in Search

  • Using NLP techniques in search can lead to lots of improvements.

  • Interesting questions/challenges remain:

    • Generative models for retrieval – dealing with "hallucination"

    • Building models that are "interpretable"

    • Learning about model/dataset biases, and correcting them

    • Improving search results while maintaining efficiency

    • Multi/cross-language search

  • Research opportunities:

    • L4/Master’s project on NLP for search engines.

    • Opportunity to publish / contribute to open source software

    • No need to have taken the IR course (but it helps!)

Language Models Encode Knowledge

  • Language models encode knowledge.

    • You can ask language models to answer factual questions.

      • e.g., by completing a sentence or directly with questions for the bigger models like ChatGPT.

  • They are often wrong.

    • Example link provided: https://transformer.huggingface.co/doc/distil-gpt2

Where is the Knowledge Encoded?

  • If a language model can tell us the capital of France, where is that encoded?

    • A Transformer is a big set of parameters.

    • GPT-3/4 has billions of parameters.

    • Which of those parameters correspond to Paris being the capital of France?

  • Basically impossible to know how this knowledge is encoded.

Knowledge Editing

  • Oh no, my language model is wrong!!!

  • Retraining from scratch with new knowledge is too costly.

  • Lots of research on:

    • Making minimal edits to LLMs to update individual facts.

    • Updating lots of facts at the same time.

    • Switching to a mini LM for new facts.

Retrieval-Augmented Language Models

  • Can you get a language model to look up its sources to find information?

    • New models can search a big corpus (e.g., Wikipedia) for relevant text and use that to help complete the missing words.

  • This enables:

    • Adding new information by adding new sources.

    • Updating information.

    • Asking for the source of knowledge.

    • Reference: Izacard, Gautier, and Edouard Grave. "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering." EACL 2021

Integrating Retrieved Text to Generate Answer

  • How to integrate multiple passages when generating the answer?

    • Encode them and concatenate them.

    • Create a new Transformer architecture.

  • It's still fallible!

    • Very susceptible to the power of suggestion, even when external sources are used.

    • Example: (Bing Chat)

The Problems with Attention

  • Attention scales quadratically with the length of the sequence (N²).

  • Scaling problems for long sequences.

  • Lots of research on more efficient attention mechanisms

Optimizing Attention and Large Models

  • Storing data in smaller data formats

    • Using 16-bit floating point (instead of 32-bit or 64-bit).

  • Using compression while computing attention

    • Compressing the attention weights into smaller data formats.

  • FlashAttention

    • Optimising the hardware memory accesses to be more efficient.

    • Reference: Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness."

  • Specialised hardware

    • e.g., the Tensor Processing Unit (TPU).

Lost in the Middle

  • Big LLMs can now accept vast text as input

    • GPT-4 accepts 128,000 tokens!

  • Enables asking questions about a whole book.

  • But does the location of the important info matter in the long text?

    • Apparently yes!

    • Reference: Liu, Nelson F., et al. "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics 12 (2024).

Representing Language by Dense Vectors is a Little “Odd”

  • Dense vectors are a very powerful tool but representing meaning by numbers feels strange.

  • Is reasoning just about guessing the next word?

  • Maybe we should keep other methods in mind

  • Is focusing entirely on transformers putting all our eggs in one basket?

  • The field of NLP may change dramatically in the future.

Jake's Research

The Explosion of Biological Data Driven by New Tech

  • DNA sequencing is now fairly cheap so it can be deployed across healthcare systems and beyond

  • This means that the amount of data that scientists see is vast.

    • Biology is becoming a data science.

    • References:

      • https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

      • https://nanoporetech.com/about-us/news/oxford-nanopore-announces-ps100-million-140m-fundraising-global-investors

New Technologies Enable Precision Medicine

  • “The right drug to the right patient at the right time”

  • Relies heavily on the latest research

  • Groups around the world are manually reviewing literature constantly

Interpretation is the Bottleneck of Precision Medicine

  • We can now get a lot of data on a person & their health

    • There are ~20,000 genes in the human genome.

    • We can measure their mutations, activity levels, etc.

  • But what does it mean?!

    • Reference: Good, Benjamin M., et al. "Organizing knowledge to enable personalization of medicine in cancer."

Search Tools and Knowledge Bases

  • Search Tools vs. Knowledge Bases:

    • Search Tools: ❌ May require users to read (many) papers, ❌ Cannot easily be used by automated analyses

    • Knowledge Bases: ❌ Only good if a KB exists for your problem, ❌ Huge cost burden to create and maintain

Let’s Use Text Mining to Build Some Knowledge Bases

  • Can we use named entity recognition and relation extraction on research articles?

Example Application for Cancer Treatments

  • “In colorectal cancer, KRAS mutations were found to be associated with cetuximab resistance.”

    • Morris et al 2011. PMID: 22248868

Expert Collaborators Manually Annotate Text for Task

  • Experts were asked to read each sentence and annotate it with the appropriate cancer/gene relationship; around 1,000 sentences were tagged this way.

  • Example:

    • SNCG level in colon adenocarcinoma is potentially valuable in predicting colon adenocarcinoma patients at high risk of recurrence and shorter survival after surgery. (pubmed)

    • gene: SNCG // cancer: colon adenocarcinoma

    • Annotation options: None, Prognostic, Diagnostic, Predictive, Other

Supervised Relation Extraction at Scale

  • Biomedical literature is big!

  • Need for high-precision predictions

  • Dependency-path methods have had good success with speed/performance trade-off

  • Lots of interesting challenges with using Transformer-based methods

Resulting Resource After Mining Abstracts & Full-Text Papers

  • CIViCmine resource mines genes, cancers, genetic variation, and their associations. (http://bionlp.bcgsc.ca/civicmine/)

  • Extra applications:

    • Mutations that affect drug metabolism

    • Genes that affect cancer growth

Can We Specialise it for Childhood Cancers?

  • A lot of knowledge bases are adult-focused. Let’s do a better job for childhood cancers

  • Different cancers have different age demographics

Future Projects

  • Extracting biomedical knowledge from text (and then doing cool stuff with it)

  • Challenges:

    • Extracting complex meanings and contexts of biomedical relations

    • Cataloguing the impact of genetic variation

    • Scaling information extraction methods to the vast scale of research papers

    • Making inferences of new treatments

  • No biomedical knowledge necessary

Research Opportunities

  • Get in contact!

    • Level 4 / MSci projects

      • If you’d like to work with me, get in contact before August

    • PhDs

      • Always happy to talk about PhD opportunities

      • Masters degree is not required

Revision

Part-of-speech Tagging

  • What is part-of-speech tagging?

    • Is a word a noun, a verb, a proper noun, etc?

    • Pick the part-of-speech for each word in some text.

    • It is a sequence labelling or token classification problem

    • There are a few different label sets for part-of-speech

    • Some more detailed, some less

    • For example: Universal dependencies has the “VERB” tag. (https://universaldependencies.org/u/pos/)

Why Care About Parts-of-Speech?

  • Part-of-speech is valuable for:

    • Extracting meaning from a sentence

    • Disambiguating the meaning of a word

    • As a feature for other text processing pipelines

  • But BERT/Transformers?!

    • An entirely Transformer-based approach will implicitly use part-of-speech information, so you don’t need to handle it explicitly

The Main Methods for Part-of-Speech Tagging

  • Hidden Markov Models (HMM)

    • Uses Markov assumption - only context used is the previous token

    • Different algorithms for solving transitions between hidden states

  • BERT token classification

    • Use context vectors from pretrained language model

    • Classifier takes vector and predicts POS for each token

  • Both approaches require an annotated dataset to learn from

Hidden Markov Models

  • Model problem as a series of transitions between hidden states (i.e. the parts-of-speech) and emissions

  • Different algorithms to solve problem:

    • Greedy algorithm - takes most likely part-of-speech at each step

    • Viterbi algorithm - considers the entire sequence to find the most likely path of tags
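
  • For concreteness, here is a minimal sketch of both decoding strategies on a toy HMM; the tags, words, and probabilities below are invented purely for illustration:

```python
# Toy HMM decoding for POS tagging: greedy vs Viterbi.
# All tags, words and probabilities below are invented for illustration.
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}                        # P(tag at position 0)
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},              # P(next tag | current tag)
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1},               # P(word | tag)
          "VERB": {"dogs": 0.1, "bark": 0.5}}

sentence = ["dogs", "bark"]

def greedy(sentence):
    """Pick the locally most likely tag at each step."""
    path, prev = [], None
    for i, word in enumerate(sentence):
        scores = {}
        for tag in tags:
            p_tag = start_p[tag] if i == 0 else trans_p[prev][tag]
            scores[tag] = p_tag * emit_p[tag].get(word, 1e-6)
        prev = max(scores, key=scores.get)
        path.append(prev)
    return path

def viterbi(sentence):
    """Dynamic programming over all tag sequences to find the most likely path."""
    V = [{tag: start_p[tag] * emit_p[tag].get(sentence[0], 1e-6) for tag in tags}]
    back = [{}]
    for i in range(1, len(sentence)):
        V.append({}); back.append({})
        for tag in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p] * trans_p[p][tag])
            V[i][tag] = V[i - 1][best_prev] * trans_p[best_prev][tag] * emit_p[tag].get(sentence[i], 1e-6)
            back[i][tag] = best_prev
    # Trace the best path back from the final position.
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(sentence) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

print(greedy(sentence))   # e.g. ['NOUN', 'VERB']
print(viterbi(sentence))  # e.g. ['NOUN', 'VERB']
```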

A Final Transformer Layer for Part-of-Speech Tagging

Same approach used for Named Entity Recognition in the labs

  • Use a final layer that takes each context vector and classifies it to a part of speech.
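
  • A minimal sketch of such a final layer using the Hugging Face transformers library (assuming transformers and PyTorch are available); the tag set is a small placeholder rather than a full part-of-speech label set:

```python
# Sketch: a token-classification head on top of BERT context vectors.
# Assumes the `transformers` and `torch` packages are installed; the tag set is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

pos_tags = ["NOUN", "VERB", "ADJ", "ADP", "DET", "PUNCT"]       # hypothetical label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(pos_tags))              # adds a new, untrained final layer

inputs = tokenizer("Kelvingrove park is beautiful", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                              # shape: (1, num_tokens, num_labels)

predictions = logits.argmax(dim=-1)[0]
for token, label_id in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), predictions):
    print(token, pos_tags[int(label_id)])   # predictions are random until the head is fine-tuned
```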

Word Vectors

Types of Word Vectors

  • (1) Static and sparse data (the IBM model)

    • Represents each word as the frequencies of the words that appear around it (from a large corpus)

    • Much like how we represent sparse documents! For example (each vector has length = vocab size):

                 attack  ocean  fish  eat  mind  endure  …
      Shark ->      134    531   451  432     0       4
      Bear  ->      745      2   137  312   313     123

  • (2) Static and dense data

    • Represents each word as smaller dense vectors (usually 100-1000 dimensions)

    • Usually trained via a small neural network to predict surrounding words

    • Or can be a “compressed” version of the sparse vectors above, using a method like SVD (see the sketch after this list)

    • Easier to work with. Why?

    • Shark -> [0.8, 0.2, 0.9, 0.4, 0.2, …]

    • Bear -> [0.4, 0.5, 0.1, 0.5, 0.9, …]

    • (Length: fixed, ~100 dimensions)

  • (3) Contextualised and dense

    • Builds a representation for the specific instance of the word within a specific document.

    • Uses models like Transformers to build them

    • For example (length: fixed, ~768 dimensions):
      Shark -> …The shark attacked… -> [0.8, 0.2, 0.9, 0.4, 0.2, …]
      Bear -> …The bear ate… -> [0.4, 0.5, 0.1, 0.5, 0.9, …]
      Bear -> …bear in mind… -> [0.1, 0.2, 0.8, 0.5, 0.3, …]
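
  • As a rough sketch of how the sparse count vectors in (1) can be “compressed” into static dense vectors as in (2), here is an illustrative use of truncated SVD; the counts loosely follow the Shark/Bear example above, the extra “whale” row is made up, and numpy and scikit-learn are assumed to be installed:

```python
# Sketch: compressing sparse co-occurrence vectors into dense ones with truncated SVD.
# Counts are illustrative, loosely following the Shark/Bear example above.
import numpy as np
from sklearn.decomposition import TruncatedSVD

context_words = ["attack", "ocean", "fish", "eat", "mind", "endure"]
counts = np.array([
    [134, 531, 451, 432,   0,   4],   # "shark": co-occurrence counts (length = vocab size)
    [745,   2, 137, 312, 313, 123],   # "bear"
    [120, 498, 600, 280,   1,   2],   # "whale" (made-up row so SVD has more to work with)
])

svd = TruncatedSVD(n_components=2)     # real systems use ~100-1000 dimensions
dense = svd.fit_transform(counts)      # shape: (3 words, 2 dimensions)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(dense)
# Words with similar context counts should end up closer in the dense space.
print("shark vs whale:", cosine(dense[0], dense[2]))
print("shark vs bear: ", cosine(dense[0], dense[1]))
```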

Transformers

Stepping Through a Transformer

  • The whole process of using a Transformer can be a bit mysterious.

    • Text goes in and context vectors come out?

    • Excluding the final layers that do token classification, etc.

  • A lot of this process is done automatically but let’s work through it
    Example: "Kelvingrove park is beautiful"

First Step: Subword Tokenization

  • Tokenizer has been trained using a large corpus of text

    • Possibly using BPE or a similar method

  • Splits up uncommon words (e.g. kelvingrove -> kelvin and ##grove)

  • We’ll use the ‘bert-base-uncased’ tokenizer (notice the lowercasing)

    • which uses ## to signify a token that doesn’t start a word

  • Example:
    "Kelvingrove park is beautiful" to ['kelvin', '##grove', 'park', 'is', 'beautiful']

Second Step: Add Any Special Tokens

  • Some tokenization methods use special tokens

    • BERT adds [CLS] and [SEP] at the beginning and end of the text

  • Example:
    ['kelvin', '##grove', 'park', 'is', 'beautiful'] to [‘[CLS]’, 'kelvin', '##grove', 'park', 'is', 'beautiful', ‘[SEP]’ ]

Third Step: Convert Tokens to IDs

  • During its training, the tokenizer created a vocabulary of tokens and their IDs

  • Here we use that mapping

    • This is using ‘bert-base-uncased’ which has a vocabulary of 30,522 tokens

    • It will be different for another tokenizer (e.g. GPT)

  • Example:
    [‘[CLS]’, 'kelvin', '##grove', 'park', 'is', 'beautiful', ‘[SEP]’ ] to [101, 24810, 21525, 2380, 2003, 3376, 102]
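
  • The first three steps can be reproduced directly with the Hugging Face tokenizer (a sketch, assuming the transformers library is installed); the outputs in the comments are the ones from the example above:

```python
# Sketch of steps 1-3 with the Hugging Face 'bert-base-uncased' tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Kelvingrove park is beautiful"

tokens = tokenizer.tokenize(text)                       # step 1: subword tokenization
print(tokens)            # ['kelvin', '##grove', 'park', 'is', 'beautiful']

with_special = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]   # step 2: special tokens
print(with_special)      # ['[CLS]', 'kelvin', '##grove', 'park', 'is', 'beautiful', '[SEP]']

ids = tokenizer.convert_tokens_to_ids(with_special)     # step 3: tokens -> IDs
print(ids)               # [101, 24810, 21525, 2380, 2003, 3376, 102]

# In practice, tokenizer(text) does all three steps (plus attention masks) in one call.
print(tokenizer(text)["input_ids"])   # same IDs, with the special tokens added automatically
```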

Fourth Step: Get the Input Embeddings for Those Token IDs

  • During its training, the transformer has learned initial vectors (effectively static vectors) for each token

    • BERT models often use vectors with 768 elements

    • With a vocab of 30,522, the model has a matrix storing the input embeddings of dimension 30522x768
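
  • A short sketch of looking up those input embeddings with transformers and PyTorch; the shapes assume 'bert-base-uncased':

```python
# Sketch: looking up the learned input embeddings for the token IDs (step 4).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
embedding_matrix = model.get_input_embeddings().weight
print(embedding_matrix.shape)          # torch.Size([30522, 768]) -- vocab size x hidden size

token_ids = torch.tensor([[101, 24810, 21525, 2380, 2003, 3376, 102]])
input_embeddings = model.get_input_embeddings()(token_ids)
print(input_embeddings.shape)          # torch.Size([1, 7, 768]) -- one static vector per token
```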

Fifth Step: Add Positional Information

  • Token vectors contain no information about their order

  • Create positional embedding vectors

    • BERT learns positional vectors for all possible 512 token locations

    • Other models use some sine and cosine equations given the token location

  • Add the positional vectors to the input token embeddings
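
  • BERT's positional vectors are learned parameters, but for the "sine and cosine" approach used by other models, a minimal illustrative sketch of the original Transformer's sinusoidal encoding looks like this:

```python
# Sketch of the sinusoidal positional encodings used by the original Transformer
# (BERT instead *learns* one vector per position, for up to 512 positions).
import numpy as np

def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    positions = np.arange(num_positions)[:, None]            # (num_positions, 1)
    dims = np.arange(dim)[None, :]                            # (1, dim)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / dim)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions use cosine
    return encoding

pe = positional_encoding(num_positions=7, dim=768)            # one vector per token position
# These vectors are simply added element-wise to the 7 input embeddings from the previous step.
print(pe.shape)   # (7, 768)
```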

Are We Done?

  • We’ve got vectors with position. But we are not done!

  • These vectors represent each token individually!

  • They don’t factor in the context of the whole sentence

  • What does “park” mean?

Sixth Step: A Transformer Block

  • Initial inputs are vectors that don’t have context yet

  • Transformer block uses self-attention and other mechanisms to create context vectors

Inside the Transformer Block

  • Self-attention

    • Every output vector is a weighted combination of all the input vectors

    • Weights are based on the importance of that token in interpreting the token of interest

  • Multi-head Attention

    • Do attention X times so that you get X vectors for each token

  • Feed Forward

    • Traditional neural network

    • Allows for non-linearity

  • Add & Norm

    • Little tricks for making training easier
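
  • A minimal numpy sketch of single-head self-attention, where each output vector is a weighted combination of all the input vectors; the dimensions and weights are illustrative, and multi-head attention, Add & Norm, and the feed-forward layer are omitted:

```python
# Minimal single-head self-attention sketch (illustrative dimensions, random weights).
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 7, 16                      # 7 tokens, toy hidden size
X = rng.normal(size=(num_tokens, dim))       # input vectors (one per token)

# Learned projection matrices (random here) produce queries, keys and values.
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention weights: how much each token attends to every other token.
weights = softmax(Q @ K.T / np.sqrt(dim))     # shape: (7, 7) -- the N² cost mentioned earlier
outputs = weights @ V                         # each output is a weighted combination of the values
print(weights.shape, outputs.shape)           # (7, 7) (7, 16)
```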

Differences for a Decoder Block (Versus an Encoder)

  • For decoder-only transformers (e.g. GPT):

    • Self-attention is only allowed to examine tokens earlier in the input

    • No looking ahead!

  • For encoder-decoder transformers (e.g. T5):

    • Used for translating between languages

    • Attention cannot look ahead but can examine the encoded text in the other language

Seventh (and More) Step: More Transformer Blocks

  • Take the outputted vectors from one Transformer block and feed them into the next block

  • Each Transformer block has different parameters

    • e.g. different attention matrices, etc.

    • Layer 12 will do things differently to Layer 1

Transfer Learning: The Pre-Train and Fine-Tune Paradigm

  • Train an ML system on one task and then adapt it to another new task.

  • Common paradigm: train a language model for one task (e.g., masked language modelling) and then further train it on a new task (e.g., text classification, summarization, etc.).

  • Keep many of the same model parameters (most of the transformer layers) but replace the final bits to do the desired task.
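
  • A short sketch of the "keep most parameters, replace the final bits" idea with the Hugging Face transformers library; this is an illustrative variant that freezes the pre-trained encoder and trains only the new head, whereas full fine-tuning would update everything:

```python
# Sketch: the pre-train and fine-tune paradigm with Hugging Face transformers.
# Loading a *ForSequenceClassification model reuses the pre-trained BERT layers
# and attaches a new, randomly initialised classification head for the target task.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Optionally freeze the pre-trained encoder so only the new head is trained
# (a common, cheaper variant; full fine-tuning updates all parameters).
for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```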

Why Does This Work?

  • When training a model from scratch, it needs to learn a lot about the language itself:

    • what words mean, how they modify one another, etc.

  • By training on a task with nearly limitless data (language modelling), the model learns these common language patterns, which can be reused on other language tasks.

Visualizing Attention

  • Reference: https://github.com/jessevig/bertviz

Exam Preparation

Scan the Exam

  • Get a quick sense initially of what topics are coming up in each question

    • Trickier online, but you can click through questions

  • You don’t want to be surprised by a particularly long or challenging looking question later in the exam

Budget Your Time

  • In Text As Data, there will be three 20 mark questions

    • Do them all

  • Watch your time in each question

  • Don’t get stuck spending too much time in a question with few marks

    • Spend your time where the marks are

Paper Notes

  • You can bring paper notes to the exam

    • No digital notes this year

  • Good ideas:

    • Distil the course down into ~2 pages of quick lookup notes

    • Have a section for equations

    • Work with other students to build good notes

  • Bad ideas:

    • Printing all the course slides

    • Printing screeds of ChatGPT-generated content

    • Printing all the past papers with solutions

  • Do not waste your time searching your notes

Past Exam Papers and Example Answers

  • Past exam papers and example answers are on the Moodle

  • The Masters version of the exam is slightly harder.

  • These are reasonably representative of the upcoming exams.

  • In-person exams will have less challenging arithmetic, but you should still practice the various calculations from throughout the course.

Exam Advice – Level of Detail & Marks

  • Level of detail & marks given as [#] for each question.

  • In general, each mark is roughly one idea or one point in the answer.

Too Little Detail

  • You are not fully answering the question.

  • You will probably only be able to get partial marks at best.

  • Example of an answer with too little detail: "Stopwords like 'and' do not carry much meaning."

Too Much Detail

  • You're spending a lot of time writing on a question that carries very few marks.

  • Often does not demonstrate understanding of how the course material relates to the question.

  • Example of a "just right" level of detail: "LLMs are often trained on many public datasets, so the LLM may have been trained on Circa. This is called 'data contamination'. If the LLM was contaminated, the results can't be considered zero-shot, since it was directly trained on the Circa data."

Show Your Work for Calculations

  • Many questions specify to show your work.

  • Even if they do not, you cannot be given partial marks if we do not see where you went wrong!

  • You’re doing the work anyway – you should show it!

Exam Advice – Level of Detail & Marks - Summary

  1. Use [#] as a guide on how much detail to provide.

  2. Be careful about both too much detail and too little detail.

  3. Show your work for calculations.

Exam Advice – Types of Questions

  • The exams are open book and open notes, so questions that only ask you to recall information do not indicate mastery of the subject.

  • Questions focus on application, analysis, evaluation, creation.

  • All questions are free-text and manually marked (no multiple-choice or auto-grading)

Exam Advice – Types of Questions

  • Questions often require you to make connections between different concepts in the course or other fundamental CS concepts.

  • For example, to answer a question that combines these topics, you would need to:

    • Understand how UTF works (Lecture 1)

    • Understand how BPE works (Lecture 5)

    • Make connections between these ideas

Exam Advice – Types of Questions - Example

  • Unicode: A variable-byte (2-4 bytes) encoding

    • Encodes the alphabets of many languages

    • Emojis, maths symbols and lots more

    • Is frequently updated with new characters (v15.1 released in Sept 2023)

  • Byte pair encoding (BPE)

    • Inputs:

      • Large corpus of text to learn tokenization

      • The desired size of vocabulary

    • Algorithm:

      1. Pretokenise the corpus documents into words using a tokeniser (this will remove whitespace)

      2. Create a vocabulary of symbols: all unique characters in the corpus (i.e. all letters, numbers, etc)

      3. Repeat until the desired vocab size is reached:

        • Find the most common pair of neighbouring symbols in the corpus

        • Merge all instances of that pair into a new symbol and add the new symbol to the vocabulary
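
  • A rough, unoptimised sketch of the merge loop described above (an illustrative implementation on a toy corpus, not the exact one from the lectures):

```python
# Rough sketch of byte pair encoding's merge loop (illustrative, unoptimised).
from collections import Counter

corpus = ["low", "lower", "lowest", "newer", "wider"]          # tiny toy corpus, already pre-tokenised
words = [list(w) for w in corpus]                               # each word as a list of symbols
vocab = {ch for w in words for ch in w}                         # step 2: all unique characters
desired_vocab_size = 15

while len(vocab) < desired_vocab_size:
    # Find the most common pair of neighbouring symbols across the corpus.
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        break
    (a, b), _ = pairs.most_common(1)[0]
    merged = a + b
    vocab.add(merged)                                           # add the new symbol to the vocabulary
    # Merge every occurrence of the pair into the new symbol.
    for w in words:
        i = 0
        while i < len(w) - 1:
            if w[i] == a and w[i + 1] == b:
                w[i:i + 2] = [merged]
            else:
                i += 1

print(sorted(vocab, key=len))
print(words)    # the corpus re-written with the merged symbols
```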

Exam Advice – Types of Questions

  • Due to the type of questions, content copied from notes will rarely give any/many marks.

    • (We’re not grading how well you can search your notes; we are grading your mastery of the topic!)

  • Directly answer the question!

    • Even if the answer can be inferred from the submitted text, we want to see YOU do the inference.

Exam Advice – Types of Questions - Summary

  1. Questions focus on application, analysis, evaluation, creation, not remembering.

  2. As such, text copied directly from lectures is unlikely to gain many marks.

Glasgow’s Code of Assessment

  • Don’t be worried if you are not confident in every answer! The exam questions need to be challenging for 70% to represent “exemplary range and depth of attainment” of the ILOs.

TASD Intended Learning Outcomes

By the end of this course students will be able to:

  1. Describe classical models for textual representations such as the one-hot encoding, bag-of-words models, and sequences with language modelling.

  2. Identify potential applications of text analytics in practice.

  3. Describe various common techniques for classification, clustering and topic modelling, and select the appropriate machine learning task for a potential document processing application.

  4. Represent data as features to serve as input to machine learning models.

  5. Assess machine learning model quality in terms of relevant error metrics for document processing tasks, in an appropriate experimental design.

  6. Deploy unsupervised and machine learned approaches for document/text analytics tasks.

  7. Critically analyse and critique recent developments in natural language and text processing academic literature.

  8. Evaluate and explain the appropriate application of recent research developments to real-world problems.

Glasgow’s Code of Assessment - Summary

  • The exam will be challenging because 70% indicates “exemplary range and depth of attainment”

  • Be familiar with the course ILOs

Final Word on Good Practices

Assume Nothing, Check Everything

  • Never trust that your data is clean and ready to use

  • Write defensive code

    • Use asserts a lot

    • Check the inputs to functions

    • Use Python typing where appropriate

  • If you get perfect or zero results, that’s a good indicator that things have gone wrong

    • Reasonable-looking results can still be wrong - try to check as much as you can
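
  • A small sketch of what defensive code can look like in Python (type hints, input checks, and asserts); the function and its thresholds are invented for illustration:

```python
# Sketch of defensive coding: type hints, input checks and asserts.
# The function and its sanity-check thresholds are invented for illustration.
from typing import List

def average_token_length(tokens: List[str]) -> float:
    assert isinstance(tokens, list), "expected a list of tokens"
    assert len(tokens) > 0, "got an empty token list - upstream processing probably failed"
    assert all(isinstance(t, str) and t for t in tokens), "tokens must be non-empty strings"

    result = sum(len(t) for t in tokens) / len(tokens)

    # Sanity-check the output too: absurd values usually mean bad input data.
    assert 0 < result < 100, f"suspicious average token length: {result}"
    return result

print(average_token_length(["kelvin", "##grove", "park"]))
```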

Explore Your Data

  • Do sanity checks on everything you can think of (e.g. column mins, maxes)

  • Look for duplicates

  • Check a few of the labels yourself (if you can)

  • Unsupervised approaches:

    • Cluster your data to see if there are any groups you should be aware of

    • Use dimensionality reduction methods to visualise your data in 2D

  • Plot everything you can think of (different types of plots can be very helpful)
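
  • A sketch of the unsupervised exploration idea (clustering plus a 2D projection); the dataset, vectoriser, and cluster count are placeholders, and scikit-learn and matplotlib are assumed to be installed:

```python
# Sketch: exploring text data with clustering and a 2D projection (placeholder data).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.sport.hockey"]).data
X = TfidfVectorizer(max_features=5000).fit_transform(texts)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)   # any obvious groups?
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)      # project to 2D to eyeball it

plt.scatter(coords[:, 0], coords[:, 1], c=clusters, s=5)
plt.title("Documents projected to 2D, coloured by cluster")
plt.show()
```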

Be Careful of Code Doing Things Without Your Knowledge

  • A lot of packages will try to be “helpful”. This can trip you up if you’re not careful

  • Example:

    • Pandas tries to make things numbers. Sometimes this is not appropriate, e.g. for ID-like columns in a CSV such as:

      patientid,diagnosisid
      1,010
      2,014
      3,10.0
      4,14.0
      5,

      (leading zeros are dropped and the IDs become floats, with NaN for the missing value)
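
  • A sketch of this behaviour and one way to avoid it, using the example rows above (assumes pandas is installed):

```python
# Sketch: pandas type inference can silently change ID-like columns.
import io
import pandas as pd

csv_text = "patientid,diagnosisid\n1,010\n2,014\n3,10.0\n4,14.0\n5,\n"

# Default behaviour: diagnosisid becomes a float column, so leading zeros are lost,
# "010" and "10.0" collapse to the same value, and the missing entry becomes NaN.
print(pd.read_csv(io.StringIO(csv_text)))

# Telling pandas to keep the column as strings preserves the original values.
print(pd.read_csv(io.StringIO(csv_text), dtype={"diagnosisid": str}))
```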

Use Realistic Data for Your Evaluations

  • The data that you use for evaluation needs to be realistic of the real problem

  • If not, you are creating an ML system specific to that data that cannot generalise

  • You can do all kinds of things to the training set (e.g. resampling) that you should not do to validation/test set

Follow the Standard Process

  1. Check your data

    1. Manual inspection

    2. Check label counts

    3. Use unsupervised approaches to explore your data

  2. Build and tune pipelines using the training and validation set

    1. Use appropriate evaluation metrics

    2. Inspect the mistakes to get ideas for improving the model

  3. Pick your best model and evaluate on the test set
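
  • A short sketch of steps 2 and 3 of that process with scikit-learn; the dataset, split sizes, and model are placeholders:

```python
# Sketch of the standard process with scikit-learn; data and model are placeholders.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])

# Split once into train / validation / test (e.g. 60 / 20 / 20).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Step 2: build and tune pipelines on train + validation only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
print(classification_report(y_val, pipeline.predict(X_val)))   # tune using validation results

# Step 3: only once the final model is chosen, evaluate on the held-out test set.
print(classification_report(y_test, pipeline.predict(X_test)))
```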

Consider Class Imbalance

  • Can cause problems with:

    • Training: Classifier can favor the majority class and just always predict that

    • Evaluation: Using the wrong metric (especially accuracy) can obscure what is going on

                        Predicted Positive   Predicted Negative
      Actual Positive          TP                   FN
      Actual Negative          FP                   TN

  • The metric to focus on will depend on the project
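
  • A tiny worked example (with invented numbers) of why accuracy can mislead under class imbalance, for a classifier that always predicts the majority class:

```python
# Worked example: accuracy hides what happens under class imbalance (invented numbers).
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# 95 negative examples, 5 positive examples; the classifier always predicts "negative" (0).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(confusion_matrix(y_true, y_pred))              # [[95  0]
                                                      #  [ 5  0]]  -> TP = 0, FN = 5
print("accuracy:", accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print("recall:  ", recall_score(y_true, y_pred))     # 0.0  -- never finds a positive
print("F1:      ", f1_score(y_true, y_pred))         # 0.0
```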

  • Reference: Lones, Michael A. "How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers." (https://arxiv.org/abs/2108.02497)

  • Outlines a number of the pitfalls when doing machine learning

  • Learning best practices takes time and reflection on what/why you are doing things

Best Practices for Machine Learning

  • Use training/validation/test splits of data

  • Be systematic

    • Set up experiments!

  • Watch out for class imbalance

  • Think about your evaluation metrics

    • Always better to use 2+ metrics

  • Look at your data

    • Helps you design a better classifier

    • Helps you understand the mistakes

  • Be cynical about data and the classifier

    • Data is always messy

    • Classifiers always look for an easy solution

Text as Data - Wrap Up

  • Please fill out the feedback form to let us know how we can improve the course with three quick questions. (https://tinyurl.com/textasdatafeedback)

  • Answers are anonymous. Enjoy! :D

