Text Representation & Preprocessing


Week 11 ATA


15 Terms

1

Natural Language Processing

A subfield of AI that deals with methods to analyze, model, and understand human language.

2

Language Modelling

  • Predicting the next word in a sentence based on the context.

  • Useful for speech recognition, optical character recognition, handwriting recognition, machine translation, spelling correction.
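The idea behind next-word prediction can be sketched with a toy bigram model in plain Python (the corpus and counts here are made up for illustration, not part of the cards):

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus
corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams: how often each word follows each preceding word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" only once
```

Real language models replace these raw counts with learned probabilities over much larger contexts, but the prediction task is the same.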

3

Text Classification

  • Assigning text to one of a known set of categories based on its content

  • Useful for spam detection, sentiment analysis

4

Information Extraction

Extracting relevant information from text (E.g. calendar events, names)

5

Information Retrieval

Finding relevant documents from a large collection based on a user's query. (E.g. Google search)

6

Text Summarization

Summarizing longer documents, retaining the core content and overall meaning

7

Question Answering

Answering questions posed in natural language

8

Machine Translation

Converting a piece of text from one language to another

9

Topic Modelling

Uncovering the topical structure of a large collection of documents

10

Feature Engineering / Text Representation

  • Deals with representing text in a form that can be used directly by ML algorithms.

  • Text needs to be represented in numeric format before it can be used by an ML model

    • Text is commonly represented as vectors of numbers (vectorization)
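A minimal sketch of vectorization is a bag-of-words count vector, built here in plain Python (the documents are invented for illustration):

```python
docs = ["the cat sat", "the dog sat", "the cat ate"]

# Build a vocabulary: each unique word gets a fixed position in the vector.
vocab = sorted({w for d in docs for w in d.split()})

def vectorize(text):
    """Represent text as a vector of word counts (bag of words)."""
    words = text.split()
    return [words.count(w) for w in vocab]

print(vocab)                     # ['ate', 'cat', 'dog', 'sat', 'the']
print(vectorize("the cat sat"))  # [0, 1, 0, 1, 1]
```

Each document becomes a fixed-length numeric vector that an ML algorithm can consume directly, at the cost of discarding word order.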

11

Tokenization

  • The process of breaking down a piece of text into its ‘units’

  • Units can be characters, words, n-grams, sub-words, etc.
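Two of these unit choices can be sketched in a few lines of Python (whitespace word splitting and character n-grams; real tokenizers handle punctuation and casing more carefully):

```python
text = "Tokenization breaks text into units"

# Word-level tokens via simple whitespace splitting
words = text.lower().split()

# Character n-grams of a single word (here n=3, i.e. trigrams)
def char_ngrams(word, n=3):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(words)                # ['tokenization', 'breaks', 'text', 'into', 'units']
print(char_ngrams("bank"))  # ['ban', 'ank']
```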

12

Word Embedding

  • Dense m-dimensional vector

  • Captures the meaning of words

    • Similar words are represented as “nearby” vectors in m-dimensional space
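"Nearby" is usually measured with cosine similarity. A sketch with toy 3-dimensional vectors (the values are made up; real embeddings are learned and have hundreds of dimensions):

```python
import math

# Toy embeddings, invented for illustration only
emb = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Similar words sit in nearby directions, so their similarity is higher.
print(cosine(emb["cat"], emb["dog"]) > cosine(emb["cat"], emb["car"]))  # True
```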

13

Context-free Word Embedding

  • Each word has a fixed embedding, regardless of context.

    • E.g. "river bank" and "money bank" use the same embedding for "bank", even though the word's meaning differs between the two phrases.

14

Contextual Word Embedding

  • Words in different contexts have different embeddings

    • For example, "river bank" and "money bank" get different embeddings for "bank": the word is identical in both phrases, but its embedding changes because its meaning depends on the context.

15

Handling Out-of-Vocabulary Words

OOV (out-of-vocabulary) words occur when the test data contains words that were not seen in the training data.

Solutions:

  • Use <UNK> token for unknown words

  • Use sub-word tokenizers
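The first solution can be sketched in a few lines of Python: any token outside the training vocabulary is mapped to a single `<UNK>` placeholder (the sentences are invented for illustration):

```python
# Vocabulary built from the training data
train = "the cat sat on the mat".split()
vocab = set(train)

def handle_oov(tokens, vocab):
    """Replace tokens not seen during training with the <UNK> token."""
    return [t if t in vocab else "<UNK>" for t in tokens]

# "dog" never appeared in training, so it becomes <UNK>
test_tokens = "the dog sat on the mat".split()
print(handle_oov(test_tokens, vocab))  # ['the', '<UNK>', 'sat', 'on', 'the', 'mat']
```

Sub-word tokenizers avoid the information loss of `<UNK>` by splitting unseen words into known smaller pieces instead.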