  • Natural Language Processing (NLP) is covered under COM3029 & COMM061.
  • Dr. Diptesh Kanojia is the instructor, and his email is d.kanojia@surrey.ac.uk.

The Need to Communicate

  • Language is a communication tool composed of written and sound symbols that varies across regions.
  • Communication involves message exchange, with language as the primary but not the only method.
  • Non-linguistic communication includes gestures and graphical representations.
  • Humans naturally extend language-based communication to interactions with computers.

Language Challenges

  • Ambiguity in language requires context for disambiguation (interpretation, irony, culture).
  • Examples of ambiguity:
    • "I don’t really walk to the bank at night" (bank as in river bank, or financial institution)
    • "I saw someone with the telescope." (who has the telescope: the speaker, or the person seen?)
  • Language evolves through neologisms and non-standard language use (e.g., "Get Brexit done").
  • Slang varies across generations (e.g., "(no) cap, sus, tea" for Gen-Z, "Dude, kewl, fricking" for Gen-X).
  • Idioms like "Over the moon" and "Piece of cake" present interpretive challenges.
  • Entity names (e.g., "The lord of the rings") require specific recognition.

Starting with Words

  • Words are fundamental data in NLP.
  • Languages average around 200,000 unique words, ranging from 10,000 to 1 million.
  • Some words occur more frequently than others, following Zipf’s Law: frequency × rank ≈ constant. That is, a word's frequency is inversely proportional to its rank in the frequency table.
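
The rank-frequency relationship can be inspected with a few lines of Python. This is a minimal sketch, not a rigorous test of Zipf's Law: the toy `sample` text and the helper name `rank_frequency_table` are assumptions made for illustration, and the product only settles near a constant on a large corpus.

```python
import re
from collections import Counter


def rank_frequency_table(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# Illustrative toy text (an assumption); a real check needs a large corpus.
sample = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the bird sang and the cat slept"
)

for rank, word, freq, product in rank_frequency_table(sample)[:5]:
    print(f"{rank:>2}  {word:<5} freq={freq}  rank*freq={product}")
```

On real corpora the usual check is to plot log(frequency) against log(rank) and look for a roughly straight line.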

NLP Leverages Linguistics

  • Linguistics studies language, including morphology, meaning, context, and socio-cultural/historical influences.
  • Understanding language structure is crucial for building NLP systems, specifically:
    • Words
    • Morphology
    • Parts of speech
    • Syntax parsing
    • Semantics
    • Textual entailment
  • Example:
    • Phrase: NLP is a really cool subject
    • Parts of speech: NNP VBZ DT RB JJ NN
    • Phrases and clauses: NP NP VP S
    • Cool means fashionable but also hard, showing contrast.

Language Related Building Tools

  • Libraries and services available for language-related tasks include:
    • Tokenizing (NLTK, HuggingFace Tokenizers)
    • Lemmatising or Stemming
    • Grammar detection & correction (e.g., silvio/docker-languagetool)
    • Part-of-Speech (PoS) tagging (Spacy, Stanza)
    • Named Entity Recognition (NER) (Spacy, Flair)
    • Coreference resolution (Stanford CoreNLP)
    • Textual Entailment (NLI models on HuggingFace)

Natural Language Processing

  • Natural Language Processing (NLP) is communication between humans and computers using natural language.
  • Text is transformed into numerical data for computers to interpret the world.
  • Training language models requires diverse and vast text corpora from news, websites, blogs, and books.
  • Multilingual NLP systems need decent representation for each language in the corpus.

Fundamental Differences Between How Humans and Machines Process Language

  • Machines need language to be converted into numeric form.
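
One simple way to picture the conversion is to give every distinct word an integer id. The sketch below assumes whitespace splitting and lower-casing; `build_vocabulary` and `encode` are illustrative names, and real systems use richer representations such as subword ids and embeddings.

```python
def build_vocabulary(sentences):
    """Assign each distinct lower-cased word an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into a list of integer ids (every word must be in vocab)."""
    return [vocab[word] for word in sentence.lower().split()]


corpus = ["NLP is a really cool subject", "NLP is fun"]
vocab = build_vocabulary(corpus)
print(vocab)
print(encode("NLP is cool", vocab))
```

A word that never appeared in `corpus` would raise a `KeyError` here, which is exactly the out-of-vocabulary problem that tokenization has to deal with.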

Goals of Natural Language Processing

  • Understand Human Language
    • Ensure machines understand human language
  • Analyze Human Language
    • Textual analytics, extraction, and retrieval to analyze the information present in human language.
  • Generate Human Language
    • Generation of understandable human language to interface with people.

NLP Problems Intersect with Various Fields

  • Linguistics
    • Morphology
    • Finite-state machines
    • Morphology Analyzer
    • Syntactics
    • Parser
    • Semantics
    • Parsing
  • Computer Science
    • Machine/Deep learning
    • Sentiment Analysis
    • Information Retrieval
    • Summarization
    • Probability theory
  • Cognates
    • Lexicon
    • Ontology generation
    • Graphs & trees
    • Machine Translation
    • Sense Disambiguation

NLP Positioning

  • NLP lies at the intersection of Linguistics, Artificial Intelligence, Machine Learning, and Deep Learning.

Artificial Intelligence

  • Artificial Intelligence (AI) is the ability of machines to mimic human-like intelligence and cognitive functions.
  • Artificial General Intelligence (AGI) aims for computers to learn and perform tasks without explicit programming, similar to human learning.
  • ChatGPT exemplifies a step toward AGI.

Deeper AI: Machine Learning & Deep Learning

  • Machine Learning (ML) uses data examples to let computers make decisions, in contrast to conventional programming with hard-coded rules.
  • ML algorithms observe data and generate models for software systems.
  • Deep Learning (DL) is a subset of ML that "deepens" the algorithm's internal architecture for complex non-linear problems, often involving hidden layers in Artificial Neural Networks (ANN).
  • Nowadays DL is assumed even when people refer to ML.

Classical and Neural Language Modeling

  • Classical language modeling predicts the probability of a sequence of words.
  • Neural language modeling utilizes neural models for encoding text and makes similar predictions.
  • Models built on the Transformer architecture include autoencoding pre-trained language models (encoders) and autoregressive decoders.
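
As a sketch of the classical approach, the probability of a word sequence can be estimated with a bigram model: multiply the probability of each word given the one before it. The three-sentence `corpus`, the `<s>`/`</s>` boundary markers, and the function names are assumptions for illustration. The estimate is plain maximum likelihood with no smoothing, so any unseen bigram gives probability zero.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts (unigrams) and adjacent word pairs (bigrams)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])  # every token except </s> acts as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sequence_probability(sentence, unigrams, bigrams):
    """P(sentence) as the product of P(word | previous word), unsmoothed."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigrams[(prev, curr)] / unigrams[prev]
    return prob


corpus = ["i like nlp", "i like linguistics", "you like nlp"]
unigrams, bigrams = train_bigram_model(corpus)
print(sequence_probability("i like nlp", unigrams, bigrams))
print(sequence_probability("nlp like you", unigrams, bigrams))
```

Neural language models make the same kind of prediction but replace the count tables with a learned network.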

Application of NLP: Information Extraction

  • Information Extraction extracts structured data from unstructured text to build knowledge structures.
    • Extract relationships to build knowledge structures such as ontology

Application of NLP: Information Retrieval

  • Information retrieval seeks and retrieves documents or information within documents.
  • Common applications include search engines, autocomplete, and typing predictions.
  • Information retrieval systems involve retrieving relevant documents and understanding user needs, evolving based on conversation, human interaction, and large language models (LLM) for product search and recommendation.
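
A minimal sketch of document retrieval scores each document against the query and returns the best matches first. The version below uses a simple TF-IDF-style score; the `+ 1.0` in the IDF term is one of several common variants, and the three `documents` and the function names are made up for illustration. Production search engines use far more elaborate ranking.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank_documents(query, documents):
    """Rank documents by a simple TF-IDF score summed over the query terms."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    scored = []
    for index, counts in enumerate(doc_counts):
        score = 0.0
        for term in tokenize(query):
            doc_freq = sum(1 for c in doc_counts if term in c)
            if doc_freq == 0:
                continue  # term appears in no document
            idf = math.log(n_docs / doc_freq) + 1.0
            score += counts[term] * idf
        scored.append((score, index))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(documents[i], s) for s, i in scored if s > 0]


documents = [
    "machine translation converts text between languages",
    "search engines retrieve relevant documents",
    "spam filtering is a text classification task",
]
for doc, score in rank_documents("retrieve documents", documents):
    print(f"{score:.2f}  {doc}")
```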

Applications of NLP – Classification

  • Classification predicts the class or labels of a document.
  • Typical outputs are binary, multi-class, or multi-label.
  • Applications: email spam filtering, sentiment analysis, emotion/sarcasm/hate detection.
    • Example: spam classifiers.
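
A binary spam classifier can be sketched with multinomial Naive Bayes and add-one smoothing. This technique is chosen here only because it fits in a few lines; the notes do not prescribe it. The four training messages and the labels "spam" and "ham" are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lecture notes for the nlp module", "ham"),
]
model = train_naive_bayes(training)
print(predict("free prize", model))
print(predict("nlp lecture on monday", model))
```

The same structure extends to multi-class problems by adding more labels. Multi-label output needs a separate decision per label.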

Sarcasm Detection

  • Sarcasm detection identifies mocking, contemptuous, or ironic language.
  • Vital for online review summarizers, dialog systems, recommendation systems, and sentiment analyzers.

Applications of NLP – Editing

  • Assisted editing software aids humans with spelling, grammar, styling, and clarification.

Applications of NLP – Question Answering

  • Question Answering extracts and composes responses by understanding questions and retrieving relevant documents.
  • Functions are becoming a commodity, like Google’s search service.

Applications of NLP – Machine Translation

  • Machine Translation (MT) automatically transforms text between languages, focusing on syntactic and semantic accuracy, exemplified by Google Translate.

Applications of NLP – Dialogue Systems

  • Dialogue Systems / Conversation Agents create human-computer dialogues using NLP, with applications like Web chatbots and phone bots.

Applications of NLP – Summarisation

  • Text summarisation reduces text while preserving key information, using:
    • Extractive summarisation (selection of key phrases)
    • Abstractive summarisation (generating new phrases)
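
Extractive summarisation can be sketched as scoring each unit of text and keeping the top ones. The example below selects whole sentences rather than phrases and scores them by average word frequency; both choices are simplifying assumptions, and there is no stop-word handling. `extractive_summary` is an illustrative name.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


text = "NLP systems process language. Language is ambiguous. Cats sleep."
print(extractive_summary(text, num_sentences=1))
```

Abstractive summarisation cannot be reduced to selection like this, because it has to generate new text.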

Cognitive-NLP

  • Cognitive-NLP applies psycholinguistics or neurolinguistics research to NLP.
  • Methodologies include:
    • Eye tracking
    • EEG Readings
    • fMRI Readings
  • It follows the eye-mind hypothesis: the eyes maintain their gaze on a single location for as long as the mind is processing what is there.
  • Text-only training may have reached a saturation point.
  • The next step in NLP involves multi-modal sources.
    • Saccade: A rapid movement of the eye between fixation points.
    • Fixation: visual gaze on a single location, for a certain amount of time.
    • Scanpath: The path followed by the viewer's eyes when reading a document or observing a scene.

Challenges in Building NLP Systems

  • The presence of 7,117 known living languages.

Why Build Multilingual NLP Systems?

  • Multilingual NLP systems are built to match the language used by their users, so that the system can interact with people.
  • English has the highest percentage of web content.

Code-mixed Data Case Study - NLU - Aggression Detection

  • Modelled as 3-class classification problem
  • Two data sets based on social media
    • Political Aggression Platforms [Rawat et al., 2023]
    • Multi-task Aggression and Offense Detection [Nafis et al., 2023]
  • Challenges
    • Crawling User-generated Content (UGC)
    • Filtering code-mixed data
    • Manual filtering vs. LID-based thresholding
    • Pre-training with code-mixed UGC data helps performance
    • Nuanced annotation for ‘covertly aggressive’
    • Cross-dataset performance: UGC-specific challenge?
    • D1: Facebook comments
    • D2: tweets from X Political Aggression

Programming a System to Process Natural Language Is Challenging

  • Programming a system to process natural language using rules and heuristics is difficult due to the number of languages and complexity/exceptions in each.
  • To solve the problem, computers figure out the rules by observing patterns in the language.
  • Relying on statistical relationships between words
  • Vast vocabulary variations
  • Processing requires a lot of data

Scale Challenges

  • Google Search index contains billions of webpages, making data collection, storage, and processing challenging.
  • NLP methods can be applied similarly across languages despite differences in vocabulary and structure.
  • Solutions process language at character, token, or complex linguistic structure levels.

The NLP Process Is Not Straightforward or Even Consistent

  • Relies on extensive pre-processing of the data
  • Sometimes you might need to visualise and transform the data in order to understand how to process it
  • Statistical analysis alone might not be sufficient, and most of the time semantic analysis is needed to improve accuracy
  • Requires complicated mathematical calculations
  • Variations in data size or quality can have a huge impact on performance
  • Variations in data can also require a completely different architecture solution
  • Additional context or knowledge of the world or a situation might still be needed to decode and process natural language (e.g. telling someone morning as a greeting)

NLP at Scale

  • NLP processes are iterative and complex, with code and data having separate lifecycles and varying project needs, requiring defined workflows and states.
  • Plus the typical traditional operational challenges of any software project such as:
    • Infrastructure management and infrastructure as code (IaC)
    • Observability / auditability / traceability
    • Multi-environment setup
    • Automation

Machine Learning Operations

  • MLOps is defined as the practice of connecting data scientists/engineers and operations engineers in running and managing machine learning implementations.
  • Clear boundaries need to be defined so that collaboration and communication become part of the machine learning lifecycle.

NLP Lifecycle management

  • Different Personas
    • Data Engineers
    • Data Scientist
    • ML Engineers
  • ML Workflow Automation / Management / Continuous Delivery
    • Data Sourcing
    • Model Development
    • Model Deployment & Inference Production Integration
    • Feature Engineering
    • Model Evaluation
    • Model Monitoring
    • Data Quality Assurance
    • Security / Patching / Networking / etc
    • System Administrators

Diving Deeper in Language Morphology

  • Morphology is the study of words and their forms.

  • Words are put together based on rules. Those rules that govern order in a sequence of words (like a sentence) are called the grammar of a language.

  • There are structured relationships between words. For example, the word process has related forms such as:

    • processed, processes, processing
    • reprocess, preprocess
  • But we could NOT have for example processpre… so the grammar matters

Morphemes

  • The smallest unit of meaning in a word is called a morpheme. There are two types:

    • stems: processed has stem process
    • affixes:
      • prefix: reprocess
      • suffix: processed
      • infix: λαμβάνω (Greek for “I take”, from έλαβα “I took”; compare λάβω “to take”)
      • circumfix: legnagyobb (Hungarian for “biggest”, from nagy “big”; compare nagyobb “bigger”)

From Stems to Lemmas

  • Another frequently used word form is the lemma. The lemma is the canonical form (dictionary form) of a set of words:

    • The set {processed, processes, processing}: all have the lemma process
    • Lemmatization is an optional NLP pipeline task that breaks input text (a word or sequence of words) into lemmas
  • In the above example both the stem and the lemma are the same, but this is not always the case:

    • The set {create, created, creates, creating}: all have the lemma create, but the stem creat
    • The set {cry, cried, cries, crying}: all have the lemma cry, but the stem cr
    • The set {is, are, was, were}: all have the lemma be, but the stems remain is, are, wa(s), wer(e). These are irregular forms!
    • Stemming breaks input text (a word or sequence of words) into stems. It is not all that straightforward… so how do we choose?
  • Can you think of rules (as abstract as you can) to reach the lemma, if you are able to stem a word?
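
To make the stem examples concrete, here is a toy suffix-stripping stemmer. Its suffix list is hand-picked purely to reproduce the stems quoted above (process, creat, cr) and is an assumption of this sketch; it is not a published stemming algorithm and it does not handle irregular forms such as is/are/was/were.

```python
# Longest suffixes first, so "ying" is tried before "ing" and "ies" before "s".
SUFFIXES = ["ying", "ied", "ies", "ing", "ed", "es", "s", "e", "y"]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words like "process" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for group in (
    ["process", "processed", "processes", "processing"],
    ["create", "created", "creates", "creating"],
    ["cry", "cried", "cries", "crying"],
):
    print(group, "->", sorted({toy_stem(w) for w in group}))
```

Going from such a stem to the lemma then needs extra rules (for example creat + e, cr + y) or a dictionary lookup.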

Morphology - Inflectional and Derivational.

  • Inflectional : does not really change the meaning, e.g., we could inflect:

    • nouns with +s (car → cars) and for possessive case with +’s (car → car’s)
    • verbs with +ed and +ing (play → played, playing) and a special 3rd person singular present form with +s (plays)
    • adjectives with +er (clever → cleverer) and superlative +est forms (cleverest)
  • Derivational : changing the part-of-speech, so might change the meaning, e.g.,

    • game → gamify → gamifies → gamification → … → gamificationism → gamificationist
  • Compounds or Open Compounds

    • course work → coursework; web site → website; bench mark → benchmark
      • higher education; open source; artificial intelligence; object orientated programming; deep learning
  • Acronyms / Abbreviations

    • UNICEF is an agency. (i.e., formerly the United Nations International Children's Emergency Fund)
      • I wanted to WFH today (i.e., Working From Home)
      • lol, brb, afk, tbh, asap, fyi, cu, l8r, imo, myob, np, thx, btw, ty… omg so many! invented words, but syntactically correct!

Lemmatization

  • Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma.
    • Helps standardize words and improve text analysis accuracy, especially for tasks like information retrieval and machine translation
    • Dictionary-based
      • Lookup the word in a lexicon to find its lemma.
    • Algorithm-based
      • Use rules and morphological analysis to determine the lemma.
    • Hybrid
      • Combine dictionary-based and algorithm-based approaches.
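
The hybrid approach can be sketched as: look the word up in an exception dictionary first, and only then fall back to rules. The tiny `IRREGULAR_LEMMAS` table, the `KNOWN_LEMMAS` lexicon, and the suffix `RULES` below are all hypothetical stand-ins for real lexical resources.

```python
# Dictionary-based part: irregular forms are looked up directly.
IRREGULAR_LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be"}

# Small lexicon of valid lemmas, used to validate rule output.
KNOWN_LEMMAS = {"process", "create", "cry", "play", "be"}

# Algorithm-based part: (suffix, possible replacements) pairs.
RULES = [
    ("ies", ["y"]),
    ("ied", ["y"]),
    ("ing", ["", "e"]),
    ("ed", ["", "e"]),
    ("es", ["", "e"]),
    ("s", [""]),
]


def rule_based_candidates(word):
    """Generate candidate lemmas by undoing common inflectional suffixes."""
    candidates = [word]
    for suffix, replacements in RULES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            candidates.extend(base + r for r in replacements)
    return candidates


def lemmatize(word):
    word = word.lower()
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    for candidate in rule_based_candidates(word):
        if candidate in KNOWN_LEMMAS:
            return candidate
    return word  # unknown word: leave unchanged


for w in ["processed", "creating", "cries", "were", "tokenized"]:
    print(w, "->", lemmatize(w))
```

Checking rule output against the lexicon is what lets "creating" resolve to "create" rather than "creat".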

Tokenization

  • Tokenization is the process of segmenting text into smaller units called tokens.
    • Foundational step for language modelling
    • Helps work with a limited vocabulary
    • These tokens can be words, subwords, or even characters, depending on the chosen method.
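
The three granularities can be compared side by side. In this sketch the word-level tokenizer is a simple regular expression, and the subword tokenizer does a greedy longest-match split against a hand-made piece vocabulary; that vocabulary is a hypothetical stand-in for one learned from data.

```python
import re


def word_tokens(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    return list(text)


def subword_tokens(word, vocab):
    """Greedy longest-match-first split of one word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces


print(word_tokens("NLP is cool!"))
print(char_tokens("NLP"))
print(subword_tokens("preprocessing", {"pre", "process", "ing"}))
```

Word-level tokens give the largest vocabulary and character-level the smallest, with subwords in between.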

Problems With Tokenization

  • Natural languages have a vast vocabulary, and new words are constantly being created.
    • Representing every possible word can lead to extremely large vocabularies; imagine the computational cost and memory usage for NLP models if all words were in the vocabulary.
    • If the vocabulary is limited beyond a point, models may encounter words during inference that were not present in their training vocabulary: out-of-vocabulary (OOV) words.
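
The trade-off can be seen by capping the vocabulary at the most frequent words and mapping everything else to a placeholder. The `<unk>` symbol, the cap of three words, and the function names are assumptions chosen for illustration.

```python
from collections import Counter

UNK = "<unk>"  # placeholder for out-of-vocabulary words


def build_limited_vocab(sentences, max_size):
    """Keep only the max_size most frequent words."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word for word, _ in counts.most_common(max_size)}


def replace_oov(sentence, vocab):
    """Map every word outside the vocabulary to the UNK placeholder."""
    return [w if w in vocab else UNK for w in sentence.lower().split()]


training = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_limited_vocab(training, max_size=3)
print(sorted(vocab))
print(replace_oov("the cat barked", vocab))
```

Every word mapped to the placeholder loses its identity, which is one motivation for subword and character tokens.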

Tokenization

  • Word-level Ex:

      -