CMSC320 Topics after Midterm 2

69 Terms

1. Natural Language Processing

Interaction between computers and human language

2. Syntax

Grammatical structure of a sentence

3. Semantics

Meaning of a sentence

4. What are the potential challenges of processing and interpreting text?

A single sentence can have different semantic meanings depending on the context

5. Applications of NLP

Sentiment analysis, topic modeling, question answering, named entity resolution, text summarization

6. Sentiment Analysis

Classifying the emotional intent of text. Gives the probability that the sentiment is positive, negative, or neutral.

1 - positive

0 - negative
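
A minimal sketch with NLTK's VADER analyzer (an assumed tool, not necessarily the one from lecture); note VADER reports pos/neu/neg proportions plus a compound score in [-1, 1] rather than a single 0-1 probability:

```python
# Sketch using NLTK's VADER analyzer (an assumption; the course may use another tool).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I absolutely loved this movie!"))
# Roughly: {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...} -- high 'pos' here
```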

7. Named Entity Recognition (NER)

Identifies and categorizes named entities within text into predefined categories (persons, organizations, locations, etc.)
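
A minimal sketch with spaCy (assumed library; requires the en_core_web_sm model):

```python
# Sketch using spaCy; assumes `python -m spacy download en_core_web_sm` was run first.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in College Park, Maryland in 2023.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple/ORG, College Park/GPE, 2023/DATE
```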

8. Sentence Segmentation

Divides the paragraph into different sentences for better understanding

9. What is the NLP Pipeline?

Sentence segmentation, word tokenization, stemming, lemmatization, stop word analysis
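
A sketch of the first pipeline stages with NLTK (assumed toolkit):

```python
# Sketch: sentence segmentation -> word tokenization -> stop-word removal with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")      # tokenizer models (newer NLTK may also need "punkt_tab")
nltk.download("stopwords")  # stop-word lists

text = "NLP is fun. It lets computers read text."
sentences = sent_tokenize(text)   # ['NLP is fun.', 'It lets computers read text.']
tokens = [word_tokenize(s) for s in sentences]
stops = set(stopwords.words("english"))
content = [[w for w in sent if w.isalpha() and w.lower() not in stops]
           for sent in tokens]
print(content)                    # [['NLP', 'fun'], ['lets', 'computers', 'read', 'text']]
```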

10. Corpus

A collection of documents

11. Text Cleaning

Remove unwanted characters, symbols, and noise from the raw text to make it cleaner and more uniform. Removes punctuation, special characters, and whitespace.

12. Tokenization

Split the text into smaller, manageable units like words or subwords, called tokens

13. Stop-word removal

Handle common words like “the” or “and” by removing them to focus on meaningful content

14. Typo Correction

Replace a misspelled word with the dictionary word at the nearest edit distance
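
A sketch with a standard Levenshtein edit-distance function; the toy dictionary and example word are made up:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

dictionary = ["hello", "help", "yellow"]  # toy dictionary
print(min(dictionary, key=lambda w: edit_distance("helo", w)))  # -> 'hello'
```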

15. Stemming

Reducing words to their stem/base form by removing suffixes. Crude chopping.

16. Lemmatization

Uses context to find the correct dictionary form. Best for tasks where meaning and context are important.
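
A quick sketch contrasting the two with NLTK (assumed toolkit):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # lexicon the lemmatizer looks words up in

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # 'studi'  -- crude chopping
print(lemmatizer.lemmatize("studies", pos="v"))  # 'study'  -- dictionary form
print(stemmer.stem("better"))                    # 'better'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   -- uses POS context
```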

17. Feature Extraction

Converting texts into numerical vectors for machine learning tasks

18. Bag of Words

Representation of text that describes the occurrence of words within a document. Order doesn’t matter, just occurrences.

19. What are the pros of Bag of Words?

Simple, efficient, applicable to any language, and captures word importance.

20. What are the cons of Bag of Words?

Loss of word order, limited context, grows large with a big vocabulary size

21. N-Grams

Variation of Bag of Words that captures sequences of N adjacent words.
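
A minimal sketch of bag-of-words counts and bigrams with scikit-learn (assumed library):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

bow = CountVectorizer()                         # bag of words: unigram counts
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())              # ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())                              # [[1 0 0 1 1]
                                                #  [1 1 1 1 2]]

bigrams = CountVectorizer(ngram_range=(2, 2))   # N-grams with N = 2
bigrams.fit(docs)
print(bigrams.get_feature_names_out())
# ['cat sat' 'on the' 'sat on' 'the cat' 'the mat']
```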

22. Term Frequency-Inverse Document Frequency (TF-IDF)

A way of measuring how relevant a word is to a document in a collection of documents

23. Term Frequency (TF)

How many times a term appears in a given document

24. Document Frequency (DF)

Number of documents in which the word is present

25. How do you compute TF-IDF?

w_{x,y} = tf_{x,y} × log(N / df_x), where tf_{x,y} is the frequency of term x in document y, df_x is the number of documents containing term x, and N is the total number of documents.
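
A hand-rolled sketch that follows this exact formula on toy documents (scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers would differ):

```python
import math

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "dog", "barked"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)               # term frequency in this document
    df = sum(term in d for d in docs)  # document frequency across the corpus
    return tf * math.log(N / df)

print(tf_idf("the", docs[0]))  # 1 * log(3/3) = 0.0 -- appears everywhere, uninformative
print(tf_idf("cat", docs[0]))  # 1 * log(3/1) ≈ 1.10 -- rare term, high weight
```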

26. How is cosine similarity used in NLP?

Preprocessing, vectorization, similarity calculation, and ranking.
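
A sketch of the vectorize-then-compare steps with scikit-learn (assumed tooling):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a cat sat on a mat",
        "stock prices fell sharply"]

X = TfidfVectorizer().fit_transform(docs)  # preprocessing + vectorization
print(cosine_similarity(X[0], X))          # doc 0 vs all: 1.0 for itself,
                                           # high for the paraphrase, ~0 for the last
```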

27. Which representation should be used for sparse graphs?

Adjacency Lists or Adjacency Dictionaries

28. Which representation should be used for dense graphs?

Adjacency Matrix
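
A plain-Python sketch of the trade-off on a toy 4-node graph:

```python
# The same 4-node graph, edges (0,1), (0,2), (2,3), in both representations.

# Adjacency list/dictionary: O(V + E) space -- suits sparse graphs.
adj_list = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}

# Adjacency matrix: O(V^2) space -- fine for dense graphs, wasteful when sparse.
adj_matrix = [[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]]

print(adj_list[2])             # neighbors of node 2 in O(deg) time
print(bool(adj_matrix[0][3]))  # O(1) edge lookup: is there an edge 0-3? False
```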

29. Graph Featurization

Transforms graph structural information into numerical features that can be used for ML tasks

30. Graph Level Features

Features that describe the entire graph as a whole and capture global structural patterns (graph diameter, average path length, modularity)

31. Edge Level Features

Features that describe the relationships or connections between two nodes (edge betweenness centrality, common neighbors)

32. Node Level Features

Features that describe individual nodes in a graph (degree, centrality measure)

33. Centrality

Measures how “central” or important a node is within a graph

34. Centrality Analysis

Discovers the most important node(s) in a network

35. Degree Centrality

Importance of a node based on the degree of that node.

36. Closeness Centrality

Importance of a node based on how close it is to all the other nodes in the graph: the reciprocal of the sum (or average) of shortest-path distances from the node to every other node.

37. Steps to calculate closeness centrality

Find shortest distance

Sum shortest distances

Apply formula
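
A sketch of those steps on a toy path graph, checked against networkx's built-in (assuming the common normalization C(v) = (n − 1) / Σ d(v, u)):

```python
import networkx as nx

G = nx.path_graph(4)  # nodes 0-1-2-3 in a line

# Steps 1-2: shortest-path distances from node 1, then their sum.
dists = nx.shortest_path_length(G, source=1)  # {1: 0, 0: 1, 2: 1, 3: 2}
total = sum(dists.values())                   # 4

# Step 3: apply the formula.
print((len(G) - 1) / total)                   # 0.75
print(nx.closeness_centrality(G)[1])          # 0.75 -- matches the built-in
```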

38. Betweenness Centrality

Identifies nodes that act as bridges along the shortest paths between pairs of nodes in a network

39. Vertex Betweenness

Measures the importance of nodes in a graph. Indicates how many shortest paths between other nodes pass through a particular node.

40. Edge Betweenness

Measures the importance of edges in a network, rather than nodes. Calculates how many shortest paths between pairs of nodes pass through a particular edge.

41. Steps to calculate betweenness centrality

Identify shortest paths

Count paths through node

Calculate betweenness
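
A sketch with networkx (assumed library):

```python
import networkx as nx

G = nx.path_graph(5)  # 0-1-2-3-4: the middle node bridges everything

print(nx.betweenness_centrality(G))       # node 2 scores highest (vertex betweenness)
print(nx.edge_betweenness_centrality(G))  # the middle edges carry the most shortest paths
```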

42. Girvan-Newman Algorithm

Algorithm for detecting community structures that iteratively removes the edge with the highest number of shortest paths passing through it (the highest edge betweenness). As edges are removed one by one, the network breaks apart into smaller pieces, the so-called communities.

43. Modularity

Measure of the strength of division of a network into modules or communities: the difference between the number of edges within modules and the expected number of edges if they were distributed randomly. Ranges from -1 to 1.

1 - Indicates strong division
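
A sketch tying the last two cards together using networkx's built-in implementations (assumed library):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()  # classic toy social network

# Girvan-Newman: repeatedly remove the edge with the highest edge betweenness.
communities = next(girvan_newman(G))  # first split into two communities
print([sorted(c) for c in communities])

# Modularity of that division; closer to 1 means stronger community structure.
print(modularity(G, communities))
```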

44. Recommender System

Algorithms that recommend products a user is likely to consume, based on the user's preferences, behavior, or past interactions.

45. Content-based Recommendation

Predicts what a user will like based on their past likes and item features. Requires information on the content and the user profile.

46. Process of Content-Based Recommendation

Featurize Items

Calculate Similarity (cosine similarity)

Learn User Preferences

Recommend
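
A minimal sketch of these steps under toy assumptions (made-up item descriptions; the user profile here is just the liked item's vector):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {"m1": "space sci-fi adventure",  # made-up catalog
         "m2": "space opera sci-fi",
         "m3": "romantic comedy"}

# Step 1: featurize items.
X = TfidfVectorizer().fit_transform(items.values())

# Steps 2-3: the user liked m1, so use m1's vector as the profile
# (in general, average the vectors of all liked items).
user_profile = X[0]

# Step 4: rank items by similarity to the profile and recommend unseen ones.
sims = cosine_similarity(user_profile, X).ravel()
print(sorted(zip(items, sims), key=lambda p: -p[1]))  # m1 (seen), then m2, then m3
```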

47. Pros of Content Based Systems

Works independently of other users, personalized, supports new/unpopular items

48. Cons of Content Based Systems

Feature selection is difficult, cold start is difficult, limits diversity, maintenance overhead (must retrain as user tastes change)

49. Collaborative Filtering

Recommends products to a user based on the preferences of other users with similar tastes

50. User-based nearest-neighbor collaborative filtering

Recommendations based on the preferences and behaviors of similar users

51. Utility or User-Item Matrix

An N × M matrix that captures the interactions between N users and M items

52. Jaccard Similarity

J(A, B) = |A ∩ B| / |A ∪ B|

Ignores rating values
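
A one-function sketch on two users' sets of consumed items (toy data):

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

alice = {"item1", "item2", "item3"}
bob = {"item2", "item3", "item4"}
print(jaccard(alice, bob))  # 2 shared / 4 total = 0.5
```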

53. What is the issue with cosine similarity in collaborative filtering systems?

Treats missing ratings as negative

54. Centered Cosine Similarity

Compute the mean of each user's row and normalize the ratings by subtracting that row mean; then apply the cosine similarity formula.
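
A sketch on a toy utility matrix, assuming 0 encodes a missing rating:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated".
R = np.array([[4.0, 0.0, 5.0, 1.0],
              [5.0, 5.0, 4.0, 0.0]])

# Center each row on the mean of its *observed* ratings; missing entries stay 0.
centered = R.copy()
for row in centered:
    rated = row != 0
    row[rated] -= row[rated].mean()

print(cosine_similarity(centered[0:1], centered[1:2]))  # centered cosine (Pearson-like)
```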

55. Item-based collaborative filtering

Recommend items similar to those a user already likes. Assumes users will prefer items resembling their past preferences

56. Steps of Item-based collaborative filtering

For a given item i, find other similar items

Estimate rating for item i based on ratings for similar items

Apply algorithm
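
A sketch of the rating estimate as a similarity-weighted average; the similarity values are made up for illustration:

```python
# Estimate user u's rating of item i from u's ratings of items similar to i.
sim_to_i = {"item_a": 0.9, "item_b": 0.4}      # made-up similarities to item i
user_ratings = {"item_a": 5.0, "item_b": 3.0}  # u's known ratings of those items

estimate = (sum(sim_to_i[j] * user_ratings[j] for j in sim_to_i)
            / sum(sim_to_i.values()))
print(estimate)  # (0.9*5 + 0.4*3) / 1.3 ≈ 4.38
```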

57. Pros of Collaborative Filtering

No domain knowledge needed, diverse recommendations, uses more info

58. Cons of Collaborative Filtering

Data sparsity, cold start problems (new users, new items), popularity bias

59. Complementary idea

Find rules that associate the presence of one set of items with another set of items

60. Association rules

Discover relationships between items that frequently occur together in transactions

61. Support count

Frequency of occurrence of an itemset

62. Metrics of Association Rules

Support

Confidence

63. Support (association rule metric)

Fraction of transactions that contain both X and Y

64. Confidence

Measures how often items in Y appear in transactions that contain X

65. Steps of Mining Association Rules

Frequent Itemset Generation - Generate all itemsets whose support count >= minsup

Rule Generation - Generate high-confidence rules from each frequent itemset
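
A brute-force sketch of both steps on a toy basket dataset (a real miner would use Apriori-style pruning rather than enumerating every itemset):

```python
from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"},
                {"bread", "milk", "diapers"},
                {"bread", "milk", "beer"}]
items = sorted(set().union(*transactions))
minsup, minconf = 0.4, 0.6

def support(itemset: frozenset) -> float:
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: frequent itemset generation (brute force over all sizes).
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= minsup]

# Step 2: rule generation X -> Y with confidence = support(X u Y) / support(X).
for s in frequent:
    for k in range(1, len(s)):
        for x in combinations(s, k):
            X = frozenset(x)
            conf = support(s) / support(X)
            if conf >= minconf:
                print(set(X), "->", set(s - X), f"conf={conf:.2f}")
```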

66. Apriori principle

If an itemset is frequent, then all of its subsets must also be frequent

67. Explicit values

Values that designers intend for their products to embody

68. Collateral values

Values that crop up as side effects of design decisions and the way users interact with those designs. These values are not intentionally designed into the system.

69. Normative Language

Evaluative statements that express the speaker's opinion