DATA MINING PRELIMS (TEXT MINING)


59 Terms

1
New cards

TEXT MINING

is the practice of analyzing vast collections of textual materials to capture key concepts, trends and hidden relationships

Deals with unstructured data – conversations, declarations or even tweets or comments

2
New cards

Structured
Multimedia
Free Text

Enumerate Text Formats

3
New cards

Lexical Analysis (POS)
Semantic Analysis
Syntactic Analysis (Parsing)
Pragmatic Analysis (Speech act)

Enumerate Basic Concepts of NLP

4
New cards

Word-level ambiguity
Syntactic Ambiguity
Anaphora resolution
Presupposition

Enumerate DIFFICULTIES OF NLP

5
New cards

Presupposition

“He has quit smoking.” implies that he smoked before

6
New cards

Anaphora resolution

“John persuaded Bill to buy a TV for himself.”

(himself = John or Bill?)

7
New cards


Syntactic Ambiguity

“natural language processing”

“A man saw a boy with a telescope.”

8
New cards

Word-level ambiguity

“design” can be a noun or a verb (ambiguous POS)

“root” has multiple meanings (Ambiguous sense)

9
New cards

Convert Accented Characters
Expand Contractions
Tokenization
Stemming
Lemmatization
Parts of Speech tagging
Stopwords removal

Data Pre-processing for Textual Data

10
New cards

Stopwords removal

Common words that occur in sentences but do not add much value to the meaning of the sentence

They act as bridges and ensure that sentences are grammatically correct.

Words that are filtered out before processing natural language data

Example: the, is, in, for, to, etc.
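A minimal Python sketch of stopword removal; the stopword list here is a small hypothetical sample (real lists, such as NLTK's, are much longer):

```python
# Hypothetical stopword list -- real lists are much longer.
STOPWORDS = {"the", "is", "in", "for", "to", "a", "an", "of"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "is", "in", "the", "garden"]))
# ['cat', 'garden']
```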

11
New cards

Parts of Speech tagging

Assigning a part of speech to each word in a sentence

Tagging is performed at the token level

12
New cards

Lemmatization

Performs morphological analysis of words
Example: Word: helps; Morphological info: third person singular, present tense; Lemma: help

13
New cards

WordNet
spaCy
TextBlob
Pattern
Stanford CoreNLP

Examples of Lemmatization tools

14
New cards

Stemming

A way to reduce a word to its root (stem)

Removes prefixes and suffixes

Fewer input dimensions

Makes training data denser

Reduces the size of the dictionary

Helps to normalize the words in the document
Word: cats, Suffix: s, Stem: cat

Word: caring, Suffix: ing, Stem: care/car
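A toy suffix-stripping sketch to illustrate the idea; it is far simpler than the Porter or Krovetz algorithms named on the next card, and the suffix list is a hypothetical sample:

```python
# Toy stemmer: strip the first matching suffix, keeping at least
# 3 characters of stem. Illustration only, not a real algorithm.
SUFFIXES = ["ing", "ed", "es", "s"]

def simple_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("cats"))    # cat
print(simple_stem("caring"))  # car
```

Note the "care/car" ambiguity from the card: naive suffix stripping yields "car", which is one reason real stemmers add recoding rules.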

15
New cards

Porter stemmer

Krovetz stemmer

Examples of Stemming algorithms

16
New cards

Tokenization

Used to split a phrase, sentence, paragraph or an entire document into smaller units such as words or terms.

It helps to interpret the meaning of the text by analyzing the words present in it

Used to count the number of words in a text

Example: “I saw a cat” → Tokens: “I”, “saw”, “a”, “cat”
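A minimal tokenizer sketch using Python's built-in re module:

```python
import re

def tokenize(text):
    """Split text into word tokens (runs of word characters)."""
    return re.findall(r"\w+", text)

tokens = tokenize("I saw a cat")
print(tokens)       # ['I', 'saw', 'a', 'cat']
print(len(tokens))  # word count: 4
```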

17
New cards

Expand Contractions

don’t – do not, can’t – can not
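A dictionary-based contraction-expansion sketch; the mapping here is a small hypothetical sample (real lists cover many more contractions):

```python
# Hypothetical contraction map -- real lists are much longer.
CONTRACTIONS = {"don't": "do not", "can't": "can not", "it's": "it is"}

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

print(expand_contractions("don't stop"))  # do not stop
```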

18
New cards

Convert Accented Characters

Latté – latte, Café – cafe
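Accent conversion can be sketched with Python's built-in unicodedata module: decompose accented characters (NFKD), then drop the combining marks:

```python
import unicodedata

def remove_accents(text):
    """Convert accented characters to their plain-ASCII equivalents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(remove_accents("Latté"))  # Latte
print(remove_accents("Café"))   # Cafe
```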

19
New cards

Machine-based
Rule-based
Hybrid Approach

TEXT CLASSIFICATION Approaches

20
New cards

Machine-based

classification is learned from pre-labeled datasets.

21
New cards

Rule-based

texts are grouped based on handcrafted linguistic rules.

Users define a list of words for each group.

Examples:

Robredo, Lacson, Marcos - categorized into politics

Christianity, Islam, Atheism - categorized into religion

22
New cards

Hybrid Approach

combination of the rule-based and machine-based approaches.

Uses the list of words to label the dataset; the classification is iteratively improved by updating the list of words

23
New cards

BAG OF WORDS (BOW) MODEL

a vector represents the frequency of words from a predefined dictionary (word list)

keeps a count of the total occurrences of the most frequently used words

one of the methods used to transform tokens into a set of features

ignores grammar and word order

24
New cards

Clean Text
Tokenize
Build Vocab
Generate Vectors

Process of BAG OF WORDS (BOW) MODEL
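The four steps above, sketched in plain Python (toy documents; "cleaning" here is just lowercasing):

```python
docs = ["the cat sat", "the cat sat in the hat"]

# 1-2. Clean and tokenize.
tokenized = [d.lower().split() for d in docs]

# 3. Build the vocabulary: sorted unique words in the corpus.
vocab = sorted({w for doc in tokenized for w in doc})

# 4. Generate count vectors: word frequency per document,
#    ignoring grammar and word order.
vectors = [[doc.count(w) for w in vocab] for doc in tokenized]

print(vocab)    # ['cat', 'hat', 'in', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```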

25
New cards

Vocabulary

list of words in a corpus

26
New cards

Feature vector

is an n-dimensional vector of numerical features that represent some object.

27
New cards

N-gram

is an N-token sequence of words

28
New cards

2-gram (bigram)

a two-word sequence of words (“really good”)

29
New cards

3-gram (trigram)

a three-word sequence of words (“not at all”)
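The bigrams and trigrams above can be generated with a short sliding-window function:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "at", "all", "bad"]
print(ngrams(tokens, 2))  # [('not', 'at'), ('at', 'all'), ('all', 'bad')]
print(ngrams(tokens, 3))  # [('not', 'at', 'all'), ('at', 'all', 'bad')]
```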

30
New cards

Binary term occurrence

considers only whether a term occurs in the document (1 if present, 0 if absent), not how often

Ex:

The quick brown fox jumped over the lazy dog.

The dog is lazy.

The fox is brown.
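A minimal sketch of binary term occurrence over the three example documents:

```python
docs = [
    "The quick brown fox jumped over the lazy dog",
    "The dog is lazy",
    "The fox is brown",
]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# 1 if the term appears in the document, 0 otherwise; counts ignored.
binary = [[1 if w in doc else 0 for w in vocab] for doc in tokenized]

print(vocab)
for row in binary:
    print(row)
```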

31
New cards

TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY

considers the weight of words across all documents

measures a score used for information retrieval (IR) or summarization

used to reflect how relevant a term is in a given document

helps to establish how important a particular word is in the context of the document corpus. TF-IDF takes into account the number of times the word appears in a document, offset by the number of documents in the corpus that contain the word

32
New cards

Term frequency (TF)

represents the number of occurrences of a term in a document

is the frequency of the term divided by the total number of terms in the document

33
New cards

Inverse Document Frequency

is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient
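The TF and IDF definitions above can be combined in a short sketch (toy corpus; natural-log IDF without smoothing, so library implementations such as scikit-learn's will differ slightly):

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf(term, doc):
    """Count of term in doc / total terms in doc."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in every document, so idf = log(3/3) = 0.
print(tf_idf("the", docs[0], docs))  # 0.0
# "cat" appears in only 2 of 3 documents, so it scores higher.
print(round(tf_idf("cat", docs[0], docs), 4))
```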

34
New cards

Count Vectorizer

gives the frequency of each word with respect to its index in the vocabulary

35
New cards

Naive Bayes
Support Vector Machine

Enumerate some MACHINE-BASED APPROACHES

36
New cards

Naive Bayes

used for text classification and text analysis

based on Bayes’ theorem

describes the relationship of conditional probabilities of statistical quantities
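A minimal multinomial Naive Bayes sketch for text classification with add-one (Laplace) smoothing; the training sentences and labels are hypothetical toy data:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data.
train = [
    ("great movie loved it", "pos"),
    ("wonderful great acting", "pos"),
    ("terrible boring movie", "neg"),
    ("awful hated it", "neg"),
]

# Word counts per class and the overall vocabulary.
class_words = defaultdict(list)
for text, label in train:
    class_words[label].extend(text.split())
counts = {label: Counter(words) for label, words in class_words.items()}
vocab = {w for words in class_words.values() for w in words}

# Prior of each class = fraction of training documents.
label_totals = Counter(label for _, label in train)
priors = {label: n / len(train) for label, n in label_totals.items()}

def predict(text):
    """Pick the class with the highest log posterior (Bayes' theorem)."""
    scores = {}
    for label in counts:
        total = sum(counts[label].values())
        score = math.log(priors[label])
        for w in text.split():
            # Laplace smoothing: add 1 so unseen words don't zero out.
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great acting"))  # pos
```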

37
New cards

Support Vector Machine

supervised learning algorithm for classification

The goal is to create the best line or decision boundary that can segregate n-dimensional space into classes

38
New cards

Hyperplane

the decision boundary that separates classes in n-dimensional space

39
New cards

Support Vectors

data points that are the closest to the hyperplane

40
New cards

Margin

the distance between the support vectors and the hyperplane

41
New cards

Linear SVM
Non-linear SVM

Two types of SVM

42
New cards

Linear SVM

Type of SVM used for linearly separable data (classified into 2 classes)

43
New cards

Non-linear SVM

Type of SVM used for non-linearly separated data

44
New cards

Random Forest Algorithm

supervised learning algorithm for classification

Based on ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

46
New cards

ensemble learning

Random Forest is based on _____ _____, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

47
New cards

higher

The greater number of trees in the forest leads to _____ accuracy and prevents the problem of overfitting.

48
New cards

Random Forest Algorithm

There should be some actual values in the feature variables of the dataset so that the classifier can predict accurate results rather than guessed results.

The predictions from each tree must have very low correlations.

WHY USE IT?
It takes less training time compared to other algorithms.

It predicts output with high accuracy and runs efficiently even on large datasets.

It can also maintain accuracy when a large proportion of data is missing.

49
New cards

Lexicon-based approach

Another name for RULE-BASED APPROACH

50
New cards

RULE-BASED/LEXICON-BASED APPROACH

uses a lexicon for sentiment analysis
words in texts are labeled as positive or negative (and sometimes as neutral) with the help of a so-called valence dictionary.

51
New cards

Sentiment Analysis

the classification of texts according to the emotion that the text appears to convey.
typically classifies texts as positive, negative or neutral
some applications: Market analysis, Social media monitoring, Customer feedback analysis, Market research

52
New cards

Lexicon

vocabulary of a group of people, language or field

53
New cards

Dictionary-based approach
Corpus-based approach

Types of Rule-based/Lexicon-based approach

54
New cards

Dictionary-based approach

is created by taking a few words initially.

Then an online dictionary, thesaurus or WordNet can be used to expand this _____ by incorporating synonyms and antonyms of those words.

is expanded until no new words can be added to the dictionary.

can be refined by manual inspection.

55
New cards

Corpus-based approach

finds the sentiment orientation of context-specific words

56
New cards

Statistical Approach
Semantic Approach

Two types of Corpus-based approach

57
New cards

Statistical Approach

The words that occur more frequently in positive text are considered to have positive polarity.

If they recur more frequently in negative text, they have negative polarity.

If the frequency is equal in both positive and negative text, then the word has neutral polarity.

58
New cards

Semantic approach

This approach assigns sentiment values to words and the words that are semantically closer to those words; this can be done by finding synonyms and antonyms with respect to that word.

59
New cards

(Num of Positive − Num of Negative) / Total Num

Formula for Sentiment Score
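The formula can be sketched in Python; the positive and negative word lists here are tiny hypothetical samples of a valence dictionary:

```python
# Hypothetical valence lists -- real valence dictionaries are far larger.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "sad", "awful"}

def sentiment_score(tokens):
    """(positive words - negative words) / total words."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / len(tokens)

print(sentiment_score(["good", "food", "bad", "service", "great"]))
# (2 - 1) / 5 = 0.2
```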