Natural Language Processing
Interaction between computers and human language
Syntax
Grammatical structure of a sentence
Semantics
Meaning of a sentence
What are the potential challenges of processing and interpreting text?
A single sentence can have different semantic meanings depending on the context
Applications of NLP
Sentiment analysis, topic modeling, question answering, named entity resolution, text summarization
Sentiment Analysis
Classifying the emotional intent of text. Gives the probability that the sentiment is positive, negative, or neutral.
1 - positive
0 - negative
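A minimal sentiment-scoring sketch, assuming NLTK's VADER analyzer (the vader_lexicon resource must be downloaded first); any classifier that outputs positive/negative/neutral probabilities would play the same role:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Fetch the VADER lexicon (no-op if already present).
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("I really enjoyed this movie!")
print(scores)  # e.g. {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```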
Named Entity Recognition (NER)
Identifies and categorizes named entities within text into predefined categories (persons, organizations, locations, etc.)
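A short NER sketch, assuming spaCy with its small English model installed (python -m spacy download en_core_web_sm):

```python
import spacy

# Load the small English pipeline (assumed to be installed).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Berlin in 2021.")

# Each entity carries its text span and a predicted category label.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Berlin GPE, 2021 DATE
```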
Sentence Segmentation
Divides the paragraph into different sentences for better understanding
What is the NLP Pipeline?
Sentence segmentation, word tokenization, stemming, lemmatization, stop word analysis
Corpus
A collection of documents
Text Cleaning
Remove unwanted characters, symbols, and noise from the raw text to make it cleaner and more uniform. Removes punctuation, special characters, and whitespace.
Tokenization
Split the text into smaller, manageable units like words or subwords, called tokens
Stop-word removal
Handle common words like “the” or “and” by removing them to focus on meaningful content
Typo Correction
Replace a word with the word in our dictionary with the nearest edit distance
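A toy sketch of this idea; the dictionary and misspelled word are made up for illustration:

```python
# Pick the dictionary word with the smallest Levenshtein (edit) distance
# to the misspelled word.
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

dictionary = ["language", "natural", "processing"]   # hypothetical dictionary
word = "procesing"
print(min(dictionary, key=lambda w: edit_distance(word, w)))  # -> processing
```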
Stemming
Reducing words to their stem/base form by removing suffixes. Crude chopping.
Lemmatization
Uses context to find the correct dictionary form. Best for tasks where meaning and context are important.
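A sketch of these preprocessing steps with NLTK (assumes the punkt, stopwords, and wordnet resources have been downloaded via nltk.download):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats were running quickly. They chased the mice."

sentences = nltk.sent_tokenize(text)                            # sentence segmentation
tokens = [t for s in sentences for t in nltk.word_tokenize(s)]  # tokenization

tokens = [t.lower() for t in tokens if t.isalpha()]             # text cleaning
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]             # stop-word removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                    # crude chopping of suffixes
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])   # dictionary forms, e.g. running -> run
```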
Feature Extraction
Converting texts into numerical vectors for machine learning tasks
Bag of Words
Representation of text that describes the occurrence of words within a document. Order doesn’t matter, just occurrences.
What are the pros of Bag of Words?
Simple, efficient, applicable to any language, and captures word importance.
What are the cons of Bag of Words?
Loss of word order, limited context, grows large with a big vocabulary size
N-Grams
Variation of Bag of Words that captures sequences of N adjacent words.
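A bag-of-words / n-gram sketch using scikit-learn's CountVectorizer (an assumption; any counting scheme works the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# ngram_range=(1, 2) keeps single words plus adjacent word pairs (bigrams).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # the unigram/bigram vocabulary
print(X.toarray())                           # occurrence counts per document
```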
Term Frequency-Inverse Document Frequency (TF-IDF)
A way of measuring how relevant a word is to a document in a collection of documents
Term Frequency (TF)
How many times a term appears in a given document
Document Frequency (DF)
Number of documents in which the word is present
How do you compute TF-IDF?
w_{x,y} = tf_{x,y} * log(N / df_x), where tf_{x,y} is the frequency of term x in document y, df_x is the number of documents containing x, and N is the total number of documents.
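A hypothetical worked example: if term x appears 3 times in document y (tf_{x,y} = 3) and occurs in 10 of N = 1,000 documents (df_x = 10), then w_{x,y} = 3 * log(1000/10) = 3 * 2 = 6, using log base 10.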
How is cosine similarity used in NLP?
Preprocessing, vectorization, similarity calculation, and ranking.
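A sketch of that workflow with scikit-learn, ranking a small made-up document collection against a query by cosine similarity over TF-IDF vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "natural language processing with python",
    "deep learning for computer vision",
    "statistical language models and text processing",
]
query = ["language processing"]

vectorizer = TfidfVectorizer(stop_words="english")   # preprocessing + vectorization
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors)[0]   # similarity calculation
ranking = scores.argsort()[::-1]                           # ranking, most similar first
print([(docs[i], round(float(scores[i]), 3)) for i in ranking])
```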
Which representation should be used for sparse graphs?
Adjacency Lists or Adjacency Dictionaries
Which representation should be used for dense graphs?
Adjacency Matrix
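A small sketch of both representations for the same 4-node graph (plain Python, no graph library assumed):

```python
# Adjacency list/dictionary: good for sparse graphs, stores only existing edges.
adj_list = {
    0: [1, 2],
    1: [0],
    2: [0, 3],
    3: [2],
}

# Adjacency matrix: good for dense graphs, O(1) edge lookup but O(V^2) space.
n = 4
adj_matrix = [[0] * n for _ in range(n)]
for u, neighbors in adj_list.items():
    for v in neighbors:
        adj_matrix[u][v] = 1

print(adj_matrix[0][2])   # 1 -> edge (0, 2) exists
print(2 in adj_list[0])   # True -> same check via the adjacency list
```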
Graph Featurization
Transforms graph structural information into numerical features that can be used for ML tasks
Graph Level Features
Features that describe the entire graph as a whole and capture global structural patterns (graph diameter, average path length, modularity)
Edge Level Features
Features that describe the relationships or connections between two nodes (edge betweenness centrality, common neighbors)
Node Level Features
Features that describe individual nodes in a graph (degree, centrality measure)
Centrality
Measures how “central” or important a node is within a graph
Centrality Analysis
Discover the most important node(s) in one network
Degree Centrality
Importance of a node based on the degree of that node.
Closeness Centrality
Importance of a node based on how close it is to all the other nodes in the graph. Computed from the shortest-path distances from the node to every other node (typically the reciprocal of their sum or average).
Steps to calculate closeness centrality
Find shortest distance
Sum shortest distances
Apply formula
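A sketch of those three steps on a small path graph, assuming NetworkX and the (n - 1) / sum-of-distances formula:

```python
import networkx as nx

G = nx.path_graph(4)                          # 0 - 1 - 2 - 3

# Manual calculation for node 1
dist = nx.shortest_path_length(G, source=1)   # step 1: shortest distances
total = sum(dist.values())                    # step 2: sum them (1 + 0 + 1 + 2 = 4)
closeness = (len(G) - 1) / total              # step 3: apply formula -> 0.75

print(closeness)
print(nx.closeness_centrality(G)[1])          # matches the built-in result: 0.75
```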
Betweenness Centrality
Identifies nodes that act as bridges along the shortest paths between pairs of nodes in a network
Vertex Betweenness
Measures the importance of nodes in a graph. Indicates how many shortest paths between other nodes pass through a particular node.
Edge Betweenness
Measures the importance of edges in a network, rather than nodes. Calculates how many shortest paths between pairs of nodes pass through a particular edge.
Steps to calculate betweenness centrality
Identify shortest paths
Count paths through node
Calculate betweenness
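A sketch with NetworkX on a small "two triangles plus a bridge" graph, showing both vertex and edge betweenness:

```python
import networkx as nx

# Two triangles joined by the single edge (2, 3); nodes 2 and 3 act as bridges.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

print(nx.betweenness_centrality(G))        # vertex betweenness: highest at nodes 2 and 3
print(nx.edge_betweenness_centrality(G))   # edge betweenness: highest for edge (2, 3)
```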
Girvan-Newman Algorithm
Community-detection algorithm that relies on the iterative removal of the edges with the highest number of shortest paths between nodes passing through them (highest edge betweenness). By removing edges from the graph one by one, the network breaks down into smaller pieces, the so-called communities.
Modularity
Measure of the strength of division of a network into modules or communities. The difference between the number of edges within modules and the expected number of edges if the edges were distributed randomly. Ranges from -1 to 1.
1 - Indicates strong division
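A community-detection sketch using NetworkX's Girvan-Newman implementation, scored with modularity (the example graph is made up):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Two dense clusters connected by a single bridge edge.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

communities = next(girvan_newman(G))   # first split: removes the bridge edge
print(communities)                     # ({0, 1, 2}, {3, 4, 5})
print(modularity(G, communities))      # closer to 1 means a stronger division
```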
Recommender System
Algorithms that recommend products a user is likely to consume, based on the user's preferences, behavior, or past interactions.
Content-based Recommendation
Predicts what a user will like based on their past likes and item features. Requires information on the content and the user profile.
Process of Content-Based Recommendation
Featurize Items
Calculate Similarity (cosine similarity)
Learn User Preferences
Recommend
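A minimal content-based sketch of that process with hypothetical genre features, assuming NumPy and scikit-learn:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Featurize items: rows = items, columns = hypothetical genre features
# [action, comedy, drama].
items = np.array([
    [1, 0, 0],   # item 0: action
    [1, 0, 1],   # item 1: action/drama
    [0, 1, 0],   # item 2: comedy
    [0, 0, 1],   # item 3: drama
])
liked = [0, 1]                                  # items this user already liked

# Learn user preferences: average the feature vectors of liked items.
user_profile = items[liked].mean(axis=0)

# Calculate similarity between the profile and every item, then recommend
# the most similar item the user has not seen yet.
scores = cosine_similarity([user_profile], items)[0]
unseen = [i for i in range(len(items)) if i not in liked]
print(max(unseen, key=lambda i: scores[i]))     # -> 3 (the drama item)
```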
Pros of Content Based Systems
Works independently of other users, personalized, supports new/unpopular items
Cons of Content Based Systems
Feature selection is difficult, cold start is difficult, limits diversity, maintenance overhead (retraining as user tastes change)
Collaborative Filtering
Recommends products to a user based on the preferences of other users with similar tastes
User-based nearest-neighbor collaborative filtering
Recommendations based on the preferences and behaviors of similar users
Utility or User-Item Matrix
Matrix that captures the interactions between N users and M items; its dimensions are N × M.
Jaccard Similarity
J(A, B) = |A ∩ B| / |A ∪ B|
Ignores rating values
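A tiny sketch of Jaccard similarity over two users' sets of rated items (item names are made up):

```python
def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B| -- only which items were rated matters, not the ratings.
    return len(a & b) / len(a | b)

user_a = {"item1", "item2", "item3"}
user_b = {"item2", "item3", "item4"}
print(jaccard(user_a, user_b))   # 2 / 4 = 0.5
```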
What is the issue with cosine similarity in collaborative based systems?
Treats missing ratings as negative
Centered Cosine Similarity
Calculate the mean of each row (user), normalize the ratings by subtracting the row means, then apply the cosine similarity formula.
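A centered-cosine sketch over a tiny user-item matrix, assuming NumPy and scikit-learn (missing ratings become 0 after centering):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User-item ratings; np.nan marks items a user has not rated.
R = np.array([
    [4.0, np.nan, 5.0, 1.0],
    [5.0, 5.0,    4.0, np.nan],
])

row_means = np.nanmean(R, axis=1, keepdims=True)       # each user's mean rating
centered = np.where(np.isnan(R), 0.0, R - row_means)   # subtract means, missing -> 0

print(cosine_similarity(centered)[0, 1])               # centered cosine between users 0 and 1
```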
Item-based collaborative filtering
Recommend items similar to those a user already likes. Assumes users will prefer items resembling their past preferences
Steps of Item-based collaborative filtering
For a given item i, find other similar items
Estimate rating for item i based on ratings for similar items
Apply algorithm
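A sketch of the rating estimate in those steps: a similarity-weighted average over the k most similar items the user has already rated (the item similarities here are hypothetical, precomputed values):

```python
def predict_rating(user_ratings, similarities, k=2):
    # user_ratings: {item: rating} for items the user has rated
    # similarities: {item: similarity of that item to the target item i}
    neighbors = sorted(
        (item for item in user_ratings if item in similarities),
        key=lambda item: similarities[item],
        reverse=True,
    )[:k]                                       # the k most similar rated items
    num = sum(similarities[j] * user_ratings[j] for j in neighbors)
    den = sum(similarities[j] for j in neighbors)
    return num / den if den else None

user_ratings = {"A": 5.0, "B": 3.0, "C": 1.0}
sim_to_i = {"A": 0.9, "B": 0.4, "C": 0.1}       # hypothetical similarities to item i
print(predict_rating(user_ratings, sim_to_i))   # weighted toward A's rating (~4.4)
```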
Pros of Collaborative Filtering
No domain knowledge needed, diverse recommendations, leverages information from other users
Cons of Collaborative Filtering
Data sparsity, cold start problems (new users, new items), popularity bias
Complementary idea
Find rules that associate the presence of one set of items with another set of items
Association rules
Discover relationships between items that frequently occur together in transactions
Support count
Frequency of occurrence of an itemset
Metrics of Association Rules
Support
Confidence
Support (association rule metric)
Fraction of transactions that contain both X and Y
Confidence
Measures how often items in Y appear in transactions that contain X
Steps of Mining Association Rules
Frequent Itemset Generation - Generate all itemsets whose support count >= minsup
Rule Generation - Generate high-confidence rules from each frequent itemset
Apriori principle
If an itemset is frequent, then all of its subsets must also be frequent
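A pure-Python sketch computing support and confidence for one candidate rule {bread} -> {butter} over a toy transaction list:

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

X, Y = {"bread"}, {"butter"}
n = len(transactions)

count_X = sum(X <= t for t in transactions)          # transactions containing X
count_XY = sum((X | Y) <= t for t in transactions)   # transactions containing X and Y

print("support =", count_XY / n)            # 2/4 = 0.5
print("confidence =", count_XY / count_X)   # 2/3 ≈ 0.67
```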
Explicit values
Values that designers intend for their products to embody
Collateral values
Values that crop up as side effects of design decisions and the way users interact with those designs. These values are not intentionally designed into the system.
Normative Language
Evaluative statements that express the speaker's opinion.