Natural Language Processing (NLP) is covered under COM3029 & COMM061.
Dr. Diptesh Kanojia is the instructor, and his email is d.kanojia@surrey.ac.uk.

The Need to Communicate

Language is a communication tool composed of written and sound symbols that varies across regions.
Communication involves message exchange, with language as the primary but not the only method.
Non-linguistic communication includes gestures and graphical representations.
Humans naturally extend language-based communication to interactions with computers.

Language Challenges

Ambiguity in language requires context for disambiguation (interpretation, irony, culture).
Examples of ambiguity:
- "I don’t really walk to the bank at night" (bank as in river bank, or financial institute)
- "I saw someone with the telescope."
Language evolves through neologisms and non-standard language use (e.g., "Get Brexit done").
Slang varies across generations (e.g., "(no) cap, sus, tea" for Gen-Z, "Dude, kewl, fricking" for Gen-X).
Idioms like "Over the moon" and "Piece of cake" present interpretive challenges.
Entity names (e.g., "The lord of the rings") require specific recognition.

Starting with Words

Words are fundamental data in NLP.
Languages average around 200,000 unique words, ranging from 10,000 to 1 million.
Some words occur more frequently than others, following Zipf’s Law: $\text{freq} \times \text{rank} \approx \text{constant}$. The frequency of a word has an inverse relationship to its rank in the frequency table.
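The rank-frequency table behind Zipf's Law can be sketched in a few lines of Python. The toy corpus and the helper name zipf_table are made up for illustration; on such a small text the product freq × rank will not look constant, so this only shows how the quantities are computed.

```python
from collections import Counter

def zipf_table(text):
    """Return (rank, word, freq, freq * rank) rows, most frequent word first."""
    counts = Counter(text.lower().split())
    # Sort by descending frequency; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy corpus (an assumption for illustration); real Zipf plots need far more text.
text = (
    "the cat sat on the mat and the dog sat on the log "
    "and the cat saw the dog"
)

table = zipf_table(text)
for rank, word, freq, product in table:
    print(f"{rank:>2} {word:<4} freq={freq} freq*rank={product}")
```

With a large corpus, plotting log(freq) against log(rank) should give a roughly straight line with a slope close to -1.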

NLP Leverages Linguistics

Linguistics studies language, including morphology, meaning, context, and socio-cultural/historical influences.
Understanding language structure is crucial for building NLP systems, specifically:
- Words
- Morphology
- Parts of speech
- Syntax parsing
- Semantics
- Textual entailment
Example:
- Phrase: NLP is a really cool subject
- Parts of speech: NNP VBZ DT RB JJ NN
- Phrases and clauses: NP NP VP S
- Cool means fashionable but also hard, showing contrast.

Language Related Building Tools

Libraries and services available for language-related tasks include:
- Tokenizing (NLTK, HuggingFace Tokenizers)
- Lemmatising or Stemming
- Grammar detection & correction (e.g., silvio/docker-languagetool)
- Part-of-Speech (PoS) tagging (Spacy, Stanza)
- Named Entity Recognition (NER) (Spacy, Flair)
- Coreference resolution (Stanford CoreNLP)
- Textual Entailment (NLI models on HuggingFace)

Natural Language Processing

Natural Language Processing (NLP) is communication between humans and computers using natural language.
Text is transformed into numerical data for computers to interpret the world.
Training language models requires diverse and vast text corpora from news, websites, blogs, and books.
Multilingual NLP systems need decent representation for each language in the corpus.

Fundamental differences between how humans and machines process language

Machines need language to be converted into numeric form.
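One simple way to picture "numeric form" is to give every distinct word an integer id and then represent a sentence either as a list of ids or as a vector of word counts (a bag of words). The sketch below is a minimal illustration under that assumption; the function names are hypothetical and real systems typically use learned embeddings instead.

```python
def build_vocab(sentences):
    """Map each distinct lowercase word to an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into a list of word ids (raises KeyError on unknown words)."""
    return [vocab[word] for word in sentence.lower().split()]

def bag_of_words(sentence, vocab):
    """Turn a sentence into a count vector with one position per vocabulary word."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        vector[vocab[word]] += 1
    return vector

corpus = ["NLP is a really cool subject", "NLP is cool"]
vocab = build_vocab(corpus)
print(vocab)
print(encode("NLP is cool", vocab))
print(bag_of_words("NLP is cool", vocab))
```

The id list keeps word order, while the bag-of-words vector discards it; which one is appropriate depends on the task.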

Goals of Natural Language Processing

Understand Human Language
- Ensure machines understand human language
Analyze Human Language
- Textual analytics, extraction, and retrieval to analyze the information present in human language.
Generate Human Language
- Generation of understandable human language to interface with people.

NLP Problems Intersect with Various Fields

Linguistics
- Morphology
- Finite-state machines
- Morphology Analyzer
- Syntactics
- Parser
- Semantics
- Parsing
Computer Science
- Machine/Deep learning
- Sentiment Analysis
- Information Retrieval
- Summarization
- Probability theory
Cognates
- Lexicon
- Ontology generation
- Graphs & trees
- Machine Translation
- Sense Disambiguation

NLP Positioning

NLP lies at the intersection of Linguistics, Artificial Intelligence, Machine Learning, and Deep Learning.

Artificial Intelligence

Artificial Intelligence (AI) is the ability of machines to mimic human-like intelligence and cognitive functions.
Artificial General Intelligence (AGI) aims for computers to learn and perform tasks without explicit programming, similar to human learning.
ChatGPT exemplifies a step toward AGI.

Deeper AI: Machine Learning & Deep Learning

Machine Learning (ML) uses data examples for computers to make decisions, contrasting conventional programming with hardcoded rules.
ML algorithms observe data and generate models for software systems.
Deep Learning (DL) is a subset of ML that "deepens" the algorithm's internal architecture for complex non-linear problems, often involving hidden layers in Artificial Neural Networks (ANN).
Nowadays DL is assumed even when people refer to ML.

Classical and Neural Language Modeling

Classical language modeling predicts the probability of a sequence of words.
Neural language modeling utilizes neural models for encoding text and makes similar predictions.
Transformer-based models include pre-trained autoencoding language models (encoder-only, e.g., BERT) and autoregressive decoders (e.g., GPT).
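As an illustration of classical language modeling, the sketch below estimates the probability of a word sequence with a bigram model, using unsmoothed maximum-likelihood counts. The three-sentence corpus, the boundary markers <s> and </s>, and the function names are assumptions made for this example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts (unigrams) and adjacent word pairs (bigrams)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (previous, current) pairs
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams):
    """P(sentence) as a product of P(current | previous); 0.0 for unseen pairs."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = ["i like nlp", "i like tea", "you like nlp"]
unigrams, bigrams = train_bigram(corpus)
print(sentence_prob("i like nlp", unigrams, bigrams))
print(sentence_prob("you like tea", unigrams, bigrams))
print(sentence_prob("tea like i", unigrams, bigrams))
```

Without smoothing, any sentence containing an unseen word pair gets probability zero, which is one motivation for smoothing techniques and, later, neural language models.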

Application of NLP: Information Extraction

Information Extraction extracts structured data from unstructured text to build knowledge structures.
- Extract relationships to build knowledge structures such as ontology

Application of NLP: Information Retrieval

Information retrieval seeks and retrieves documents or information within documents.
Common applications include search engines, autocomplete, and typing predictions.
Information retrieval systems retrieve relevant documents and try to understand user needs. They are evolving toward conversational and interactive approaches, and toward using large language models (LLMs) for product search and recommendation.
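A minimal sketch of document retrieval is to score each document against the query with TF-IDF weights and return the documents in descending score order. The three toy documents and the function names are assumptions for illustration; production search engines use far more elaborate ranking.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank_documents(query, docs):
    """Return document indices ordered from most to least relevant to the query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = sum(
            (term_freq[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in tokenize(query)
            if term in term_freq
        )
        scores.append((score, index))
    # Highest score first; ties keep the original document order.
    return [index for score, index in sorted(scores, key=lambda s: (-s[0], s[1]))]

docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "search engines retrieve documents",
]
print(rank_documents("cat mat", docs))
print(rank_documents("documents", docs))
```

Terms that appear in every document get an IDF of zero here, so very common words contribute nothing to the ranking.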

Applications of NLP – Classification

Classification predicts class or labels related to a document.
Typical outputs are binary, multi-class, or multi-label.
Applications: email spam filtering, sentiment analysis, emotion/sarcasm/hate detection.
- Example: spam classifiers.
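A spam classifier can be sketched as a binary text classifier. The example below uses a multinomial Naive Bayes model with add-one smoothing, which is one common choice rather than the method prescribed by these notes; the four training messages and the function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the given text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow at noon", "ham"),
]
model = train_naive_bayes(examples)
print(predict("free money", model))
print(predict("lunch at noon", model))
```

The same code handles multi-class problems unchanged, since it simply picks the best-scoring label among however many labels appear in the training data.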

Sarcasm Detection

Sarcasm detection identifies mocking, contemptuous, or ironic language.
Vital for online review summarizers, dialog systems, recommendation systems, and sentiment analyzers.

Applications of NLP – Editing

Assisted editing software aids humans with spelling, grammar, styling, and clarification.

Applications of NLP – Question Answering

Question Answering extracts and composes responses by understanding questions and retrieving relevant documents.
These functions are becoming a commodity, as with Google’s search service.

Applications of NLP – Machine Translation

Machine Translation (MT) automatically transforms text between languages, focusing on syntactic and semantic accuracy, exemplified by Google Translate.

Applications of NLP – Dialogue Systems

Dialogue Systems / Conversation Agents create human-computer dialogues using NLP, with applications like Web chatbots and phone bots.

Applications of NLP – Summarisation

Text summarisation reduces text while preserving key information, using:
- Extractive summarisation (selection of key phrases)
- Abstractive summarisation (generating new phrases)
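Extractive summarisation can be sketched by scoring each sentence by the average corpus frequency of its words and keeping the top-scoring sentences in their original order. This is a deliberately naive heuristic chosen for illustration (no stop-word removal, selection at sentence level); the function name and sample text are assumptions.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Select the highest-scoring sentences, returned in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]

text = "NLP models process text. NLP models need text data. Cats sleep."
print(extractive_summary(text, num_sentences=1))
print(extractive_summary(text, num_sentences=2))
```

Abstractive summarisation has no comparably short sketch, since it requires a model that generates new text rather than selecting existing text.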

Cognitive-NLP

Cognitive-NLP applies psycholinguistics or neurolinguistics research to NLP.
Methodologies include:
- Eye tracking
- EEG Readings
- fMRI Readings
It builds on the eye-mind hypothesis, which holds that what the eye fixates on is what the mind is currently processing.
Text-only training may have reached a saturation point.
The next step in NLP involves multi-modal sources.
- Saccade: A rapid movement of the eye between fixation points.
- Fixation: visual gaze on a single location, for a certain amount of time.
- Scanpath: The path followed by the viewer's eyes when reading a document or observing a scene.

Challenges in Building NLP Systems

There are 7,117 known living languages.

Why Build Multilingual NLP Systems?

Multilingual NLP systems are built to match the language of their users, so that the system can interact with people.
English has the highest percentage of web content.

Code-mixed Data Case Study - NLU - Aggression Detection

Modelled as a 3-class classification problem
Two data sets based on social media
- Political Aggression Platforms [Rawat et al., 2023]
- Multi-task Aggression and Offense Detection [Nafis et al., 2023]
Challenges
- Crawling User-generated Content (UGC)
- Filtering code-mixed data
- Manual filtering vs. LID-based thresholding
- Pre-training with code-mixed UGC data helps performance
- Nuanced annotation for ‘covertly aggressive’
- Cross-dataset performance: UGC-specific challenge?
- D1: facebook comments
- D2: tweets from X Political Aggression

Programming a System to Process Natural Language Is Challenging

Programming a system to process natural language using rules and heuristics is difficult due to the number of languages and complexity/exceptions in each.
To solve the problem, computers figure out the rules by observing patterns in the language.
Relying on statistical relationships between words
Vast vocabulary variations
Processing requires a lot of data

Scale Challenges

Google Search index contains billions of webpages, making data collection, storage, and processing challenging.
NLP methods can be applied similarly across languages despite differences in vocabulary and structure.
Solutions process language at character, token, or complex linguistic structure levels.

The NLP Process Is Not Straightforward or Even Consistent

Relies on extensive pre-processing of the data
Sometimes you might need to visualise and transform the data in order to understand how to process it
Statistical analysis alone might not be sufficient, and most of the time semantic analysis is needed to improve accuracy
Requires complicated mathematical calculations
Variations in data size or quality can have huge impact on performance
Variations in data can also require a completely different architecture solution
Additional context or knowledge of the world or a situation might still be needed to decode and process natural language (e.g. telling someone morning as a greeting)

NLP at Scale

NLP processes are iterative and complex, with code and data having separate lifecycles and varying project needs, requiring defined workflows and states.
In addition, there are the traditional operational challenges of any software project, such as:
- Infrastructure management and infrastructure as code (IaC)
- Observability / auditability / traceability
- Multi-environment setup
- Automation

Machine Learning Operations

MLOps is defined as the practice of connecting data scientists, engineers, and operations engineers in running and managing machine learning implementations.
Clear boundaries need to be defined so that collaboration and communication become part of the machine learning lifecycle.

NLP Lifecycle management

Different Personas
- Data Engineers
- Data Scientists
- ML Engineers
ML Workflow Automation / Management / Continuous Delivery
- Data Sourcing
- Model Development
- Model Deployment & Inference Production Integration
- Feature Engineering
- Model Evaluation
- Model Monitoring
- Data Quality Assurance
- Security / Patching / Networking / etc
- System Administrators

Diving Deeper in Language Morphology

Morphology is the study of words and their forms.
Words are put together based on rules. Those rules that govern order in a sequence of words (like a sentence) are called the grammar of a language.
There are structured relationships between words. For example, the word process has related forms such as:
- processed, processes, processing
- reprocess, preprocess
But we could NOT have for example processpre… so the grammar matters

Morphemes

The smallest meaningful unit of a word is called a morpheme. There are two types:
- stems: processed has stem process
- affixes:
  - prefix: reprocess
  - suffix: processed
  - infix: λαμβάνω (Greek for “I take” from έλαβα “I took”, else λάβω “to take”)
  - circumfix: legnagyobb (Hungarian for “biggest” from nagy “big”, else nagyobb “bigger”)

From Stems to Lemmas

Another frequently used word form is the lemma. A lemma is the canonical form (dictionary form) of a set of words:
- The set {processed, processes, processing}: all have the lemma process
- Lemmatization is an optional NLP pipeline task that breaks input text (a word or sequence of words) into lemmas
In the above example both the stem and the lemma are the same, but this is not always the case:
- The set {create, created, creates, creating}: all have the lemma create, but the stem creat
- The set {cry, cried, cries, crying}: all have the lemma cry, but the stem cr
- The set {is, are, was, were}: all have the lemma be, while the stems remain as is, are, wa(s), wer(e) → irregular forms!
- Stemming breaks input text (a word or sequence of words) into stems. It is not all that straightforward… so how do we choose?
Can you think of rules (as abstract as you can) to reach the lemma, if you are able to stem a word?
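To make the stem examples above concrete, here is a deliberately naive suffix-stripping stemmer. The suffix list is an assumption chosen only to reproduce the stems process, creat, and cr from these notes; it over-strips many other words, and real stemmers such as the Porter stemmer use much more careful rules.

```python
# Checked in order, so longer and more specific suffixes come first.
SUFFIXES = ["ying", "ing", "ied", "ies", "ed", "es", "e", "y", "s"]

def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # do not turn "process" into "proces"
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

for family in (
    ["process", "processed", "processes", "processing"],
    ["create", "created", "creates", "creating"],
    ["cry", "cried", "cries", "crying"],
):
    print(family, "->", sorted({naive_stem(w) for w in family}))
```

Note that the stems creat and cr are not dictionary words, which is exactly the gap that lemmatization is meant to close.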

Morphology - Inflectional and Derivational.

Inflectional: does not really change the meaning, e.g., we could inflect:
- nouns with +s (car → cars) and for possessive case with +’s (car → car’s)
- verbs with +ed and +ing (play → played, playing) and a special 3rd person singular present form with +s (plays)
- adjectives with +er (clever → cleverer) and superlative +est forms (cleverest)
Derivational: may change the part of speech, and so might change the meaning, e.g.,
- game → gamify → gamifies → gamification → … → gamificationism → gamificationist
Compounds or Open Compounds
- course work → coursework; web site → website; bench mark → benchmark
  - higher education; open source; artificial intelligence; object orientated programming; deep learning
Acronyms / Abbreviations
- UNICEF is an agency. (i.e., formerly the United Nations International Children's Emergency Fund)
  - I wanted to WFH today (i.e., Working From Home)
  - lol, brb, afk, tbh, asap, fyi, cu, l8r, imo, myob, np, thx, btw, ty… omg so many! invented words, but syntactically correct!

Lemmatization

Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma.
- Helps standardize words and improve text analysis accuracy, especially for tasks like information retrieval and machine translation
- Dictionary-based
  - Lookup the word in a lexicon to find its lemma.
- Algorithm-based
  - Use rules and morphological analysis to determine the lemma.
- Hybrid
  - Combine dictionary-based and algorithm-based approaches.
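The hybrid approach can be sketched as follows: look the word up in a small lexicon of irregular forms first, and otherwise try suffix-rewrite rules, accepting a candidate only if it is a known lemma. The lexicon, the list of known lemmas, and the rules are tiny, hypothetical stand-ins for the resources a real lemmatizer would use.

```python
# Dictionary-based part: irregular forms that no suffix rule can recover.
LEXICON = {"is": "be", "are": "be", "was": "be", "were": "be"}

# Toy list of valid dictionary forms, used to validate rule output.
KNOWN_LEMMAS = {"process", "create", "cry", "play", "be"}

# Algorithm-based part: (suffix to remove, text to append), tried in order.
RULES = [
    ("ies", "y"), ("ied", "y"),
    ("ing", ""), ("ing", "e"),
    ("ed", ""), ("ed", "e"),
    ("es", ""), ("es", "e"),
    ("s", ""),
]

def lemmatize(word):
    """Hybrid lemmatizer: lexicon lookup first, then validated suffix rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    if word in KNOWN_LEMMAS:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in KNOWN_LEMMAS:
                return candidate
    return word  # unknown word: leave it unchanged

for w in ["processing", "created", "cries", "crying", "were", "telescope"]:
    print(w, "->", lemmatize(w))
```

Validating candidates against a list of known lemmas is what stops the rules from producing non-words such as creat.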

Tokenization

Tokenization is the process of segmenting text into smaller units called tokens.
- Foundational step for language modelling
- Helps work with a limited vocabulary
- These tokens can be words, subwords, or even characters, depending on the chosen method.
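Word-level and character-level tokenization can be sketched with the standard library alone. The regular expression below is one simple convention (runs of word characters, with each punctuation mark as its own token) and is an assumption for illustration; libraries such as NLTK apply more refined rules.

```python
import re

def word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Split into individual characters, including spaces."""
    return list(text)

sentence = "NLP is cool!"
print(word_tokenize(sentence))
print(char_tokenize("cool"))
```

Word-level tokens give short sequences but a large vocabulary; character-level tokens give a tiny vocabulary but long sequences. Subword methods sit between the two.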

Problems With Tokenization

Natural languages have a vast vocabulary, and new words are constantly being created.
- Representing every possible word can lead to extremely large vocabularies -> imagine the computational cost and memory usage for NLP models if all words are in the vocabulary.
- Limiting the vocabulary beyond a point -> models may encounter words during inference that were not present in their training vocabulary (out-of-vocabulary, or OOV, words).
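The OOV problem and one common way around it can be sketched as follows. With a word-level vocabulary, unseen words collapse to a single unknown token; with a subword vocabulary, a greedy longest-match split can often still cover them. Both vocabularies and the function names are hypothetical, and the splitting routine is only loosely modelled on subword tokenizers, not the API of any particular library.

```python
def encode_words(tokens, vocab, unk="<unk>"):
    """Map tokens to ids; any out-of-vocabulary token gets the <unk> id."""
    return [vocab.get(token, vocab[unk]) for token in tokens]

def greedy_subwords(word, subword_vocab, unk="<unk>"):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in subword_vocab:
            end -= 1
        if end == start:
            return [unk]  # no piece matches at this position
        pieces.append(word[start:end])
        start = end
    return pieces

word_vocab = {"<unk>": 0, "nlp": 1, "is": 2, "cool": 3}
print(encode_words(["nlp", "is", "kewl"], word_vocab))

subword_vocab = {"pre", "re", "process", "ing", "ed"}
print(greedy_subwords("preprocessing", subword_vocab))
print(greedy_subwords("kewl", subword_vocab))
```

Including single characters in the subword vocabulary would remove the remaining unknown cases, at the cost of longer token sequences.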

Tokenization

Word-level Ex: