Text Representation & Preprocessing


Week 11 ATA


15 Terms

1

Natural Language Processing

A subfield of AI that deals with methods to analyze, model, and understand human language.

2

Language Modelling

  • Predicting the next word in a sentence based on the context.

  • Useful for speech recognition, optical character recognition, handwriting recognition, machine translation, spelling correction.
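The idea behind next-word prediction can be sketched with a toy bigram model in plain Python (the corpus and counts here are made up for illustration, not part of the cards):

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus
corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams: how often each word follows each preceding word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" only once
```

Real language models replace these raw counts with learned probabilities over much larger contexts, but the prediction task is the same.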

3

Text Classification

  • Assigning text to one of a known set of categories based on its content

  • Useful for spam detection, sentiment analysis

4

Information Extraction

Extracting relevant information from text (E.g. calendar events, names)

5

Information Retrieval

Finding relevant documents from a large collection based on a user's query. (E.g. Google search)

6

Text Summarization

Summarizing longer documents, retaining the core content and overall meaning

7

Question Answering

Answering questions posed in natural language

8

Machine Translation

Converting a piece of text from one language to another

9

Topic Modelling

Uncovering the topical structure of a large collection of documents

10

Feature Engineering / Text Representation

  • Deals with representing text in a form that can be used directly by ML algorithms.

  • Text needs to be represented in numeric format before it can be used by an ML model

    • Text is commonly represented as vectors of numbers (vectorization)
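A minimal sketch of vectorization is a bag-of-words count vector, built here in plain Python (the documents are invented for illustration):

```python
docs = ["the cat sat", "the dog sat", "the cat ate"]

# Build a vocabulary: each unique word gets a fixed position in the vector.
vocab = sorted({w for d in docs for w in d.split()})

def vectorize(text):
    """Represent text as a vector of word counts (bag of words)."""
    words = text.split()
    return [words.count(w) for w in vocab]

print(vocab)                     # ['ate', 'cat', 'dog', 'sat', 'the']
print(vectorize("the cat sat"))  # [0, 1, 0, 1, 1]
```

Each document becomes a fixed-length numeric vector that an ML algorithm can consume directly, at the cost of discarding word order.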

11

Tokenization

  • The process of breaking down a piece of text into its ‘units’

  • Units can be characters, words, n-grams, sub-words, etc.
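Two of these unit choices can be sketched in a few lines of Python (whitespace word splitting and character n-grams; real tokenizers handle punctuation and casing more carefully):

```python
text = "Tokenization breaks text into units"

# Word-level tokens via simple whitespace splitting
words = text.lower().split()

# Character n-grams of a single word (here n=3, i.e. trigrams)
def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(words)                # ['tokenization', 'breaks', 'text', 'into', 'units']
print(char_ngrams("bank"))  # ['ban', 'ank']
```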

12

Word Embedding

  • Dense m-dimensional vector

  • Captures the meaning of words

    • Similar words are represented as “nearby” vectors in m-dimensional space
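"Nearby" is usually measured with cosine similarity. A sketch with toy 3-dimensional vectors (the values are made up; real embeddings are learned and have hundreds of dimensions):

```python
import math

# Toy embeddings, invented for illustration only
emb = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Similar words sit in nearby directions, so their similarity is higher.
print(cosine(emb["cat"], emb["dog"]) > cosine(emb["cat"], emb["car"]))  # True
```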

13

Context-free Word Embedding

  • Each word has a fixed embedding, regardless of context.

    • E.g. "river bank" and "money bank" use the same embedding for "bank", even though the word's meaning differs between the two phrases.

14

Contextual Word Embedding

  • Words in different contexts have different embeddings

    • For example, "river bank" and "money bank" get different embeddings for "bank": the word is identical in both phrases, but its embedding changes because its meaning depends on the context.

15

Handling Out-of-Vocabulary Words

OOV (out-of-vocabulary) words occur when the test data contains words that were not seen in the training data.

Solutions:

  • Use <UNK> token for unknown words

  • Use sub-word tokenizers
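The first solution can be sketched in a few lines of Python: any token outside the training vocabulary is mapped to a single `<UNK>` placeholder (the sentences are invented for illustration):

```python
# Vocabulary built from the training data
train = "the cat sat on the mat".split()
vocab = set(train)

def handle_oov(tokens, vocab):
    """Replace tokens not seen during training with the <UNK> token."""
    return [t if t in vocab else "<UNK>" for t in tokens]

# "dog" never appeared in training, so it becomes <UNK>
test_tokens = "the dog sat on the mat".split()
print(handle_oov(test_tokens, vocab))  # ['the', '<UNK>', 'sat', 'on', 'the', 'mat']
```

Sub-word tokenizers avoid the information loss of `<UNK>` by splitting unseen words into known smaller pieces instead.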