Week 11 ATA
Natural Language Processing
A subfield of AI that deals with methods to analyze, model, and understand human language.
Language Modelling
Predicting the next word in a sentence based on the context.
Useful for speech recognition, optical character recognition, handwriting recognition, machine translation, spelling correction.
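A minimal sketch of next-word prediction using a toy bigram model; the corpus and all names here are made up for illustration, and a real language model would be trained on far more text.

```python
from collections import Counter, defaultdict

# Toy training corpus (hypothetical).
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigram frequencies: counts[w1][w2] = times w2 followed w1.
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def predict_next(word):
    """Predict the most frequent word observed after `word` in training."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, more than any other word
```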
Text Classification
Assigning text to one of a known set of categories based on its content.
Useful for spam detection and sentiment analysis.
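A minimal sketch of assigning text to known categories. This keyword rule is a hypothetical stand-in for a trained classifier, just to show the input/output shape of the task; the word list is invented.

```python
# Hypothetical set of words treated as spam indicators.
SPAM_WORDS = {"free", "winner", "prize"}

def classify(text):
    """Assign a text to one of two known categories: 'spam' or 'ham'."""
    tokens = set(text.lower().split())
    return "spam" if tokens & SPAM_WORDS else "ham"

print(classify("You are a winner claim your free prize"))  # spam
print(classify("See you at lunch tomorrow"))               # ham
```

A real system would replace the keyword rule with a model learned from labeled examples.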
Information Extraction
Extracting relevant information from text (E.g. calendar events, names)
Information Retrieval
Finding relevant documents from a large collection based on a user query (e.g. Google Search).
Text Summarization
Summarizing longer documents, retaining the core content and overall meaning
Question Answering
Answering questions posed in natural language
Machine Translation
Converting a piece of text from one language to another
Topic Modelling
Uncovering the topical structure of a large collection of documents
Feature Engineering / Text Representation
Deals with representing text in a form that can be used directly by ML algorithms.
Text needs to be represented in numeric format before it can be used by an ML model.
Text is commonly represented as vectors of numbers (vectorization)
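A minimal bag-of-words sketch of vectorization: each document becomes a vector of word counts over a shared vocabulary. The documents are toy examples.

```python
# Toy document collection (hypothetical).
docs = ["the cat sat", "the dog sat", "the cat ate"]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({w for d in docs for w in d.split()})

def vectorize(doc):
    """Represent a document as a vector of word counts over `vocab`."""
    words = doc.split()
    return [words.count(v) for v in vocab]

vectors = [vectorize(d) for d in docs]
print(vocab)       # ['ate', 'cat', 'dog', 'sat', 'the']
print(vectors[0])  # [0, 1, 0, 1, 1]
```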
Tokenization
The process of breaking down a piece of text into its ‘units’
Units can be characters, words, n-grams, sub-words, etc.
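A minimal sketch of three of the unit choices above, on invented example strings:

```python
text = "unbelievable"

# Character tokens
chars = list(text)                 # ['u', 'n', 'b', ...]

# Word tokens (naive whitespace split)
words = "the quick brown fox".split()

# Character n-grams (here n = 3)
n = 3
trigrams = [text[i:i + n] for i in range(len(text) - n + 1)]
print(trigrams[:3])  # ['unb', 'nbe', 'bel']
```

Sub-word tokenization (e.g. un + believ + able) requires a learned vocabulary and is not shown here.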
Word Embedding
Dense m-dimensional vector
Captures the meaning of words
Similar words are represented as “nearby” vectors in m-dimensional space
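A minimal sketch of "nearby vectors": cosine similarity between toy 3-dimensional embeddings. The vectors below are invented; real embeddings are learned and typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings.
emb = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# In this toy space "cat" is nearer to "dog" than to "car".
print(cosine(emb["cat"], emb["dog"]) > cosine(emb["cat"], emb["car"]))  # True
```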
Context-free Word Embedding
Each word has a fixed embedding, regardless of context.
E.g. "bank" in "river bank" and "money bank" gets the same embedding, even though its meaning differs between the two phrases.
Contextual Word Embedding
Words in different contexts have different embeddings
E.g. "bank" in "river bank" and "money bank" gets different embeddings: the word is the same in both phrases, but its embedding changes because its meaning depends on the context.
Handling Out-of-Vocabulary Words
OOV words occur when the test data contains words that were not seen in the training data.
Solutions:
Use <UNK> token for unknown words
Use sub-word tokenizers
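A minimal sketch of the first solution, mapping unseen words to an <UNK> token; the training vocabulary here is invented.

```python
UNK = "<UNK>"

# Hypothetical vocabulary collected from the training data.
train_vocab = {"the", "cat", "sat", "on", "mat"}

def map_oov(tokens, vocab):
    """Replace any token not in the training vocabulary with the <UNK> token."""
    return [t if t in vocab else UNK for t in tokens]

print(map_oov("the dog sat".split(), train_vocab))
# "dog" was never seen during training, so it becomes <UNK>
```

Sub-word tokenizers avoid this information loss by splitting an unseen word into known smaller pieces instead of discarding it.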