CP421 Data Mining Week 4-7

0.0(0)

Studied by 0 people

Call Kai

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Knowt Play

Card Sorting

1/59

There's no tags or description

Looks like no tags are added yet.

Last updated 8:48 PM on 11/19/25

Name	Mastery	Learn	Test	Matching	Spaced	Call with Kai

No analytics yet

Send a link to your students to track their progress

60 Terms

New cards

How do online algorithms differ from offline algorithms?

Classic “offline” algorithm: you see the entire input, then compute some function of it

“online” algorithm: you don’t see entire dataset, but rather one piece at a time and you need to make irrevocable decisions along the way (similar to data stream model)

New cards

Explain the goal of the bipartite matching algorithm, and the 2 different types of matching

Bipartite matching: given 2 sets of nodes (boys and girls) and edges connecting them (preferences), your goal is to match boys to girls so that maximum number of preferences is satisfied.

M = {(1,a), (2,b), (3,d)} is an example of a matching, where we connect 3 boys (1,2,3) to 3 girls (a,b,d) with common preferences. The cardinality of matching |M| = 3 (i.e., the number of matched pairs = 3).

Perfect matching: all vertices of the graph are matched (e.g., M = {(1,c), (2,b), (3,d), (4,a)})
Maximum matching: a matching that contains the largest possible number of matches
- NOTE: every perfect matching is a maximum matching, but not the other way around

so, in other words, the goal of bipartite matching is to find the maximum matching for a given bipartite graph, or a perfect matching if one exists.

<p>Bipartite matching: given 2 sets of nodes (boys and girls) and edges connecting them (preferences), your goal is to match boys to girls so that maximum number of preferences is satisfied.</p><p></p><p>M = {(1,a), (2,b), (3,d)} is an example of a matching, where we connect 3 boys (1,2,3) to 3 girls (a,b,d) with common preferences. The cardinality of matching |M| = 3 (i.e., the number of matched pairs = 3).</p><ul><li><p>Perfect matching: all vertices of the graph are matched (e.g., M = {(1,c), (2,b), (3,d), (4,a)})</p></li><li><p>Maximum matching: a matching that contains the largest possible number of matches</p><ul><li><p>NOTE: every perfect matching is a maximum matching, but not the other way around</p></li></ul></li></ul><p></p><p>so, in other words, the goal of bipartite matching is to find the<strong> maximum matching</strong> for a given bipartite graph, or a <strong>perfect matching</strong> if one exists. </p><p></p>

New cards

What’s the best approach for “offline” bipartite matching?

There’s a polynomial-time offline algorithm based on augmenting paths.

New cards

How do “online” bipartite matching algorithms work in general?

Here, we do not know the entire bipartite graph upfront.

Instead:

we are initially given the set “boys”
in each round, one girl’s choices are revealed (i.e., one girl’s edges are revealed)
at that time, we have to decide to either pair the girl with a boy or not pair them with any boy

New cards

How does the greedy approach work for “online” bipartite matching? How good is it compared to the optimal approach?

Greedy approach: pair the new girl with any eligible boy. If there is none, do not pair the girl.

Performance analysis:

Suppose this approach produces matching M_greedy
Suppose an optimal matching is M_opt
- this represents the best offline algorithm (i.e., optimal matching when we see the whole bipartite graph)
Then, this approach’s competitive ratio = greedy’s worst performance over all possible inputs I = min_{all possible inputs I} (|M_greedy| / |M_opt|)
For cases where M_greedy ≠ M_opt, |M_greedy| / |M_opt| >= ½
- thus, the greedy approach performs, at worst, half as good as the optimal offline approach

New cards

Explain the AdWords problem in terms of its overall structure, inputs, and output

Performance-based advertising background:

advertisers bid on search keywords
when someone searches for that keyword, the highest bidder’s ad is shown
advertiser is charged only if the ad is clicked on
key problem: what ads should be shown for a given search query?

AdWords problem:

a stream of queries arrives at the search engine: q₁, q₂, …
several advertisers bid on each query
- their bid for a given search query represents how much they’re willing to pay when someone clicks on the ad after searching that query
- their click-through rate (CTR) for a given search query is how likely someone is to click their ad after searching that query
when query q_iarrives, the search engine must pick a subset of advertisers whose ads are shown
goal: maximize search engine’s revenue
- so, instead of raw bids, we use “expected revenue per click” (i.e., Bid * CTR)
the optimal solution is to sort by expected revenue per click, and only show the advertisers at the top
- unfortunately, this solution isn’t possible since the exact CTR of an ad is unknown, and advertisers have limited budgets and bid on multiple ads. So, we have to use other algorithms instead.

AdWords problem inputs:

set of bids by advertisers for search queries
CTR for each advertiser-query pair
- NOTE: we don’t exactly know these values; they’re the search engine’s best estimates based on historical data
budget for each advertiser (perhaps for 1 month, or 1 day, etc.)
limit on the number of ads to be displayed with each search query

AdWords problem output:

respond to each search query with a set of advertisers such that:
1. the size of the set <= limit on the number of ads per query
2. each advertiser shown has actually bid on the search query
3. each advertiser shown has enough budget left to pay for the ad if it is clicked upon

<p>Performance-based advertising background:</p><ul><li><p>advertisers bid on search keywords</p></li><li><p>when someone searches for that keyword, the highest bidder’s ad is shown</p></li><li><p>advertiser is charged only if the ad is clicked on</p></li><li><p>key problem: what ads should be shown for a given search query?</p></li></ul><p></p><p>AdWords problem:</p><ul><li><p>a stream of queries arrives at the search engine: q<sub>1</sub>, q<sub>2</sub>, …</p></li><li><p>several advertisers bid on each query</p><ul><li><p>their bid for a given search query represents how much they’re willing to pay when someone clicks on the ad after searching that query</p></li><li><p>their click-through rate (CTR) for a given search query is how likely someone is to click their ad after searching that query</p></li></ul></li><li><p>when query q<sub>i </sub>arrives, the search engine must pick a subset of advertisers whose ads are shown</p></li><li><p>goal: maximize search engine’s revenue</p><ul><li><p>so, instead of raw bids, we use “expected revenue per click” (i.e., Bid * CTR)</p></li></ul></li><li><p>the optimal solution is to sort by expected revenue per click, and only show the advertisers at the top</p><ul><li><p>unfortunately, this solution isn’t possible since the exact CTR of an ad is unknown, and advertisers have limited budgets and bid on multiple ads. So, we have to use other algorithms instead.</p></li></ul></li></ul><p></p><p>AdWords problem inputs:</p><ol><li><p>set of bids by advertisers for search queries</p></li><li><p>CTR for each advertiser-query pair</p><ul><li><p>NOTE: we don’t exactly know these values; they’re the search engine’s best estimates based on historical data</p></li></ul></li><li><p>budget for each advertiser (perhaps for 1 month, or 1 day, etc.)</p></li><li><p>limit on the number of ads to be displayed with each search query</p></li></ol><p>AdWords problem output:</p><ul><li><p>respond to each search query with a set of advertisers such that:</p><ol><li><p>the size of the set <= limit on the number of ads per query</p></li><li><p>each advertiser shown has actually bid on the search query</p></li><li><p>each advertiser shown has enough budget left to pay for the ad if it is clicked upon</p></li></ol></li></ul><p></p>

New cards

What are the two complications with the AdWords problem? Mention exploration and exploitation

Budget
- each advertiser has a limited budget, and the search engine must guarantee that the advertiser will not be charged more than their daily budget
CTR of a query-ad pair is unknown (i.e., each ad has a different likelihood of being clicked for a given query)
- CTR estimate for a given query-ad pair is measured historically, and it’s a very hard problem where we must balance exploration vs exploitation
- exploit: should we keep showing an ad for which we have good historical estimates of CTR?
- explore: should we show a brand new ad to get a better sense of its CTR?

New cards

Describe the greedy algorithm for the AdWords problem, and its competitive ratio.

Assume a simplified environment.

Suppose the following simplified assumptions:

there is 1 ad shown for each query
all advertisers have the same budget B
all ads are equally likely to be clicked (i.e., equal CTR)
the value of each ad is the same (=1) (i.e., expected revenue per click = bid * CTR = 1)

The greedy algorithm works as follows:

for a query, pick any advertiser who has bid 1 for that query
competitive ratio: ½
- worst case scenario: A bids on query x, B bids on queries x and y, and both have budgets of $4. Assume the query stream is x x x x y y y y. Then, the greedy approach could show ad B 4 times in a row (for query x). Then, it fails to show any ad for query y. Here, the greedy algorithm’s revenue = $4, while the optimal algorithm’s revenue = $8

NOTE: the greedy algorithm is deterministic - it always resolves draws in the same way

New cards

Describe the BALANCE algorithm for the AdWords problem, and its competitive ratio.

Assume a simplified environment.

Suppose the following simplified assumptions:

there is 1 ad shown for each query
all advertisers have the same budget B
all ads are equally likely to be clicked (i.e., equal CTR)
the value of each ad is the same (=1) (i.e., expected revenue per click = bid * CTR = 1)

The BALANCE algorithm works as follows:

for a query, pick the advertiser with the largest unspent budget
break ties arbitrarily, but in a deterministic way
competitive ratio for 2 advertisers = ¾
competitive ratio for any number of advertisers = 1-1/e = approx. 0.63

No online algorithm has a better competitive ratio

New cards

Describe 3 applications of the market-basket model

The market-basket model consists of a large set of items, and a large set of baskets where each basket is a small subset of items.

Potential applications include:

baskets = things one customer bought during one trip to the grocery store, items = things sold in grocery store
baskets = sentences, items = documents containing those sentences
- i.e., basket for sentence x contains all documents d1, d2, … in which sentence x appears
- items that appear together too often (in the same basket) could represent plagiarism
- notice items do not have to be “in” baskets; it’s the other way around here (since baskets (sentences) are “in” items (documents)), but typically its better if baskets have a small number of items while items can be in a large number of baskets
baskets = patients, items = drugs and side effects
- used to detect combinations of drugs that result in particular side effects

New cards

Describe what frequent itemsets are (in terms of support) in the Market-Basket model.

Given a large set of items (things sold in supermarket) and a large set of baskets (each being a small subset of items representing things one customer bought on one trip), we often want to find sets of items that appear together “frequently” in baskets

Given an itemset I, its support is the number of baskets containing all items in I

Given a support threshold s, the sets of items that appear in at least s baskets are called frequent itemsets

<p>Given a large set of items (things sold in supermarket) and a large set of baskets (each being a small subset of items representing things one customer bought on one trip), we often want to find sets of items that appear together “frequently” in baskets</p><p></p><p>Given an itemset I, its support is the number of baskets containing all items in I</p><p></p><p>Given a support threshold s, the sets of items that appear in at least s baskets are called<strong> frequent itemsets</strong></p>

New cards

Describe what association rules are in the Market-Basket model

Given a large set of items (things sold in supermarket) and a large set of baskets (each being a small subset of items representing things one customer bought on one trip), we often want to discover association rules:
people who bought {x,y,x} tend to buy {v,w}.

Association rules are if-then rules about the contents of baskets:

{i1,i2,…,ik} → j means “if a basket contains items i1, i2, …, ik, then it is ‘likely’ to contain item j”

NOTE: in order to find association rules, we must first find frequent itemsets.

New cards

How can we find the support, confidence, and interest of association rules? e.g., find the confidence and interest of rule {m,b} → c

Recall, association rules are structured as {i1, i2,…, ik} → j. In practice, there are many rules, but we only want to find significant/interesting ones.

Support of association rule I → j: support of I = # baskets containing items i1, i2,…, ik

Confidence of association rule I → j: probability of j given I = i1, i2,…, ik

conf(I → j) = support(IUj) / support(I) = (# baskets containing items i1, i2,…, ik, and j) / (# baskets containing items i1, i2,…, ik)
output closer to 1 = high confidence

Interest of association rule I → j: difference between its confidence and the fraction of baskets that contain j

interest(I → j) = conf(I → J) - Pr[j] = (confidence of I → j) - ((# baskets containing item j)/(# baskets))
output > 0.5 or < -0.5 = high interest

$Recall, association rules are structured as {i1, i2,…, ik} → j. In practice, there are many rules, but we only want to find significant/interesting ones.Support of association rule I → j: support of I = # baskets containing items i1, i2,…, ikConfidence of association rule I → j: probability of j given I = i1, i2,…, ik<ul><li>conf(I → j) = support(IUj) / support(I) = (# baskets containing items i1, i2,…, ik, and j) / (# baskets containing items i1, i2,…, ik)</li><li>output closer to 1 = high confidence</li></ul>Interest of association rule I → j: difference between its confidence and the fraction of baskets that contain j<ul><li>interest(I → j) = conf(I → J) - Pr[j] = (confidence of I → j) - ((# baskets containing item j)/(# baskets))</li><li>output > 0.5 or < -0.5 = high interest</li></ul>$

New cards

what’s the general process for finding good quality association rules? Mention the hard part, and what we can derive from a rule in terms of set frequency.

problem/goal: find all association rules with support >= s and confidence >= c

hard part: finding frequent itemsets

if I = {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent”

procedure for mining association rules:

find frequent itemsets I: I₁, I₂, …, I_n
generate rules
- for every subset A of a frequent itemset I_n, generate a rule A → setDiff(I_n, A)
  - since I_n is frequent, so is A
- compute the rule’s confidence
  - keep in mind that if A→D is below confidence, so is A→CD
  - we can generate “bigger” rules from smaller ones
- output rules above confidence threshold

<p>problem/goal: find all association rules with support >= s and confidence >= c</p><p></p><p>hard part: finding frequent itemsets</p><ul><li><p>if I = {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent”</p></li></ul><p></p><p>procedure for mining association rules:</p><ol><li><p>find frequent itemsets I: I<sub>1</sub>, I<sub>2</sub>, …, I<sub>n</sub></p></li><li><p>generate rules</p><ul><li><p>for every subset A of a frequent itemset I<sub>n</sub>, generate a rule A → setDiff(I<sub>n</sub>, A)</p><ul><li><p>since I<sub>n</sub> is frequent, so is A</p></li></ul></li><li><p>compute the rule’s confidence </p><ul><li><p>keep in mind that if A→D is below confidence, so is A→CD</p></li><li><p>we can generate “bigger” rules from smaller ones</p></li></ul></li><li><p>output rules above confidence threshold</p></li></ul></li></ol><p></p>

New cards

describe how data is stored when dealing with finding frequent itemsets, and what the main bottleneck is for frequent-itemset algorithms.

Data is kept in flat files (e.g., csv or txt) stored on disk. It is stored basket-by-basket, where items are positive integers and boundaries between baskets are -1. Baskets are small but we have many baskets and many items.

For each basket, we’re gonna need to find all of its subsets of a particular size. We do this in main memory; to generate all subsets of size k, we use k nested loops. Since baskets are small and k is typically small, this cost is negligible.

So, the true cost of mining this data is the number of disk I/O’s (i.e., how often we move data (baskets) between disk and main memory). In practice, all the algorithms we discuss (for finding frequent itemsets) read the data in passes, where all baskets are read in turn (once per pass). Thus, we ultimately measure the cost of finding frequent itemsets by the number of passes an algorithm takes.

NOTE: number of disk I/O’s = number of passes * number of blocks that we move in and out of disk (can contain multiple baskets if main memory size is sufficient).

Thus, for many frequent-itemset algorithms, main memory size is the critical bottleneck. As we read baskets, we need to count something (e.g., occurrences of pairs). However, this means we must be able to store counts of all possible item pairs in main memory (thus the number of different things we can count is limited). We must keep these counts in main memory since swapping counts in and out of disk is a disaster. e.g., assume each pair’s count takes up 4 bytes, so if we have 100,000 items, then there are 5 billion pairs, which means we need 20 GB in main memory to store all pair counts.

To find frequent itemsets, we need to first generate all possible itemsets, and then count them. So, we always generate all itemsets but only count (keep track) of itemsets that in the end turn out to be frequent.

New cards

what size of itemsets is hardest to deal with when finding frequent itemsets?

The hardest problem often turns out to be finding the frequent pairs because frequent pairs are common and the probability of being frequent drops exponentially with size.

frequent singletons = not too difficult since there aren’t too many items, so we can count them all in main memory pretty easily as we pass through the baskets
frequent pairs = often too many exist to be able to count in main memory
frequent triplets = rare, so we can pretty easily count them in main memory

Thus, we concentrate on pairs when discussing frequent-itemset algorithms.

New cards

Describe the naive approach to finding frequent pairs. When does it fail?

read file once, counting in main memory the occurrences of each pair:

from each basket of n items, generate its n(n-1)/2 pairs by two nested loops

this approach fails if (# items)² exceeds main memory

e.g., if each pair’s count takes up 4 bytes and we have 100,000 items, then there are 5 billion pairs, which means we need 20 GB in main memory to store all pair counts

New cards

Describe both approaches of counting pairs in memory, and when each approach is better. What’s the problem with both approaches?

Approach 1: count/store all pairs using a triangular matrix

only requires 4 bytes per pair, but requires us to store all possible pairs
- justification: n = total number of items, and we count pairs {i,j} i<j. We also keep pair counts in lexicographic order: {1,2}, {1,3}, … , {1,n}, {2,3}, {2,4}, … , {2,n}, {3,4}, …. So, pair {i,j} is at position (i-1)(n-i/2)+j-i. Total number of unique pairs is n(n-1)/2; total bytes = 2n^2. So, triangular matrix only requires 4 bytes per pair.

Approach 2: keep table of triples [i,j,c] = “count of pair {i,j} is c”

if integers and item ids are 4 bytes, we need 12 bytes per pair, but we only have to store pairs with count > 0

Therefore, approach 1 stores all possible pairs (4 bytes per pair) while approach 2 only stores pairs that actually occur (12 bytes per pair). Thus, approach 2 beats approach 1 if less than 1/3 of possible pairs actually occur.

The problem with both approaches is that if we have too many items, the pairs can’t fit in memory. So, we need another method.

$Approach 1: count/store all pairs using a triangular matrix<ul><li>only requires 4 bytes per pair, but requires us to store all possible pairs<ul><li>justification: n = total number of items, and we count pairs {i,j} i<j. We also keep pair counts in lexicographic order: {1,2}, {1,3}, … , {1,n}, {2,3}, {2,4}, … , {2,n}, {3,4}, …. So, pair {i,j} is at position (i-1)(n-i/2)+j-i. Total number of unique pairs is n(n-1)/2; total bytes = 2n^2. So, triangular matrix only requires 4 bytes per pair.</li></ul></li></ul>Approach 2: keep table of triples [i,j,c] = “count of pair {i,j} is c”<ul><li>if integers and item ids are 4 bytes, we need 12 bytes per pair, but we only have to store pairs with count > 0</li></ul>Therefore, approach 1 stores all possible pairs (4 bytes per pair) while approach 2 only stores pairs that actually occur (12 bytes per pair). Thus, approach 2 beats approach 1 if less than 1/3 of possible pairs actually occur.The problem with both approaches is that if we have too many items, the pairs can’t fit in memory. So, we need another method.$

New cards

Describe how the A-Priori algorithm works, and what 2 key ideas it relies on.

A-Priori is a 2-pass approach that limits the need for main memory (i.e., it minimizes what it counts by ignoring impossible candidates). It relies on 2 key ideas:

Monotonicity: if an itemset I appears at least s times, so does every subset J of I
Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets

A-Priori works as follows:

Pass 1: read baskets and count - in main memory - how often each individual item appears
- requires memory proportional to # of items
- NOTE: we’re only interested in whether an item’s count is >= s; thus, we only need to store its count up to s and nothing more beyond that (e.g., if s = 10,000, we only need to keep 2 bytes per count, since a count can only go up to 10,000)
Pass 2: read baskets from disk again and only count - in main memory - pairs where both items were frequent, ignoring any pair containing an infrequent item; keep only frequent pairs (>=s)
- requires memory proportional to square of frequent items only (for counts), as well as a list of the frequent items from pass 1 (so we know what must be counted)
  - by additionally storing frequent items, we can read through the items in a basket and ignore any items that aren’t in the list of frequent items; then, for all remaining items, we generate all possible pairs and increment each of their counts

<p>A-Priori is a 2-pass approach that limits the need for main memory (i.e., it minimizes what it counts by ignoring impossible candidates). It relies on 2 key ideas:</p><ol><li><p>Monotonicity: if an itemset I appears at least s times, so does every subset J of I</p></li><li><p>Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets</p></li></ol><p></p><p>A-Priori works as follows:</p><ul><li><p>Pass 1: read baskets and count - in main memory - how often each individual item appears</p><ul><li><p>requires memory proportional to # of items</p></li><li><p>NOTE: we’re only interested in whether an item’s count is >= s; thus, we only need to store its count up to s and nothing more beyond that (e.g., if s = 10,000, we only need to keep 2 bytes per count, since a count can only go up to 10,000)</p></li></ul></li><li><p>Pass 2: read baskets from disk again and only count - in main memory - pairs where both items were frequent, ignoring any pair containing an infrequent item; keep only frequent pairs (>=s)</p><ul><li><p>requires memory proportional to square of frequent items only (for counts), as well as a list of the frequent items from pass 1 (so we know what must be counted)</p><ul><li><p>by additionally storing frequent items, we can read through the items in a basket and ignore any items that aren’t in the list of frequent items; then, for all remaining items, we generate all possible pairs and increment each of their counts</p></li></ul></li></ul></li></ul><p></p>

New cards

what’s a slight modification we can make to potentially save more space in the A-Priori algorithm?

Recall, we need to store a list of all n frequent items in the second pass.

However, to save more space, we can renumber the n frequent items so the pair counts fit compactly in a triangular matrix.

So, we renumber frequent items 1,2,…n, and keep a lookup table relating these new numbers to the original item numbers. Since we’re now storing 2 numbers for each frequent item, this takes up more memory than what we previously stored (which was just the list of frequent items).

However, now we can use the triangular matrix method to store pair counts instead of the triples-table method. However, this is ONLY beneficial if more than 1/3 possible pairs actually occur; otherwise, the triples-table method actually uses less memory.

<p>Recall, we need to store a list of all n frequent items in the second pass. </p><p></p><p>However, to save more space, we can renumber the n frequent items so the pair counts fit compactly in a triangular matrix. </p><p></p><p>So, we renumber frequent items 1,2,…n, and keep a lookup table relating these new numbers to the original item numbers. Since we’re now storing 2 numbers for each frequent item, this takes up more memory than what we previously stored (which was just the list of frequent items).</p><p></p><p>However, now we can use the triangular matrix method to store pair counts instead of the triples-table method. However, this is ONLY beneficial if more than 1/3 possible pairs actually occur; otherwise, the triples-table method actually uses less memory.</p>

New cards

Describe how the general A-Priori algorithm works

For each k, we construct 2 sets of k-tuples (sets of size k):

Ck = candidate k-tuples = those that might be frequent sets (support >= s) based on the information from the pass for k-1
- we must ensure that, for each candidate here, all of their possible subsets of size k-1 are frequent (i.e., exist in Lk-1)
Lk = set of truly frequent k-tuples

New cards

What size k requires the most memory for general A-Priori?

We perform one pass (of reading the dataset) for each k (itemset size), so we need room in main memory to count each candidate k-tuple. Typically, k=2 requires the most memory.

New cards

Describe how the PCY algorithm works

The main problem with A-Priori is that Pass 1 has lots of idle memory, and storing all candidate pair counts in pass 2 needs a huge amount of memory. PCY solves this by reusing idle memory during Pass 1 to help reduce how much we have to count in Pass 1.

Pass 1:

read baskets and count - in main memory - how often each individual item appears
additionally (at the same time), maintain a hash table with as many buckets as fit in memory. For each basket, generate all pairs of items and hash each pair into one of these buckets, incrementing the bucket’s count by 1 each time a pair is hashed into it.
- NOTE: for a bucket’s count, we only need to go up to s
- Observation: if a bucket contains a frequent pair, then the bucket is surely frequent. On the other hand, for a bucket with total count less than s, none of its pairs can be frequent.

Between passes:

replace buckets by a bit vector, where 1 = frequent bucket (i.e., count >= s) and 0 = infrequent bucket
since we replaced 4-byte integer counts with bits, the bit vector only requires 1/32 memory (i.e., instead of representing each bucket’s count with a 4-byte integer, we only use 1 bit)

Pass 2:

read baskets from disk again and only count - in main memory - pairs that meet both conditions:
1. both items are frequent (>= s)
2. the pair hashes to a bucket whose bit in the bit vector is 1
only keep frequent pairs (count >= s)
- NOTE: both conditions above are necessary for a pair to have a chance of being frequent

<p><span style="background-color: transparent;"><span>The main problem with A-Priori is that Pass 1 has lots of idle memory, and storing all candidate pair counts in pass 2 needs a huge amount of memory. PCY solves this by reusing idle memory during Pass 1 to help reduce how much we have to count in Pass 1.</span></span></p><p></p><p><span style="background-color: transparent;"><span>Pass 1: </span></span></p><ul><li><p>read baskets and count - in main memory - how often each individual item appears</p></li><li><p>additionally (at the same time), maintain a hash table with as many buckets as fit in memory. For each basket, generate all pairs of items and hash each pair into one of these buckets, incrementing the bucket’s count by 1 each time a pair is hashed into it. </p><ul><li><p>NOTE: for a bucket’s count, we only need to go up to s</p></li><li><p>Observation: if a bucket contains a frequent pair, then the bucket is surely frequent. On the other hand, for a bucket with total count less than s, none of its pairs can be frequent.</p></li></ul></li></ul><p>Between passes:</p><ul><li><p>replace buckets by a bit vector, where 1 = frequent bucket (i.e., count >= s) and 0 = infrequent bucket</p></li><li><p>since we replaced 4-byte integer counts with bits, the bit vector only requires 1/32 memory (i.e., instead of representing each bucket’s count with a 4-byte integer, we only use 1 bit)</p></li></ul><p>Pass 2:</p><ul><li><p>read baskets from disk again and only count - in main memory - pairs that meet both conditions:</p><ol><li><p>both items are frequent (>= s)</p></li><li><p>the pair hashes to a bucket whose bit in the bit vector is 1</p></li></ol></li><li><p>only keep frequent pairs (count >= s)</p><ul><li><p>NOTE: both conditions above are necessary for a pair to have a chance of being frequent</p></li></ul></li></ul><p></p>

New cards

Describe the main memory details of PCY, and when it beats A-Priori

buckets require a few bytes each; # of buckets is O(main memory size)
on the second pass, a table of (item, item, count) is essential; we CANNOT use a triangular matrix approach
- why? because pairs that are eliminated because they hash to an infrequent bucket are scattered all over the place and cannot be organized into a nice triangular array
- thus, the hash table must eliminate approximately 2/3 of candidate pairs for PCY to beat A-Priori

Pass 1:

space to count each item: typically 4 bytes per item
rest of the space is used for as large of a hash table as we can, containing a 4-byte counts for each bucket
- Since each bucket requires 4 bytes, we need to know the number of buckets. Turns out, the # buckets is O(main memory size).

<ul><li><p><span style="background-color: transparent;"><span>buckets require a few bytes each; # of buckets is O(main memory size)</span></span></p></li><li><p><span style="background-color: transparent;"><span>on the second pass, a table of (item, item, count) is essential; we CANNOT use a triangular matrix approach</span></span></p><ul><li><p>why? because pairs that are eliminated because they hash to an infrequent bucket are scattered all over the place and cannot be organized into a nice triangular array</p></li><li><p><span style="background-color: transparent;"><span>thus, the hash table must eliminate approximately 2/3 of candidate pairs for PCY to beat A-Priori</span></span></p></li></ul></li></ul><p></p><p>Pass 1:</p><ul><li><p>space to count each item: typically 4 bytes per item</p></li><li><p>rest of the space is used for as large of a hash table as we can, containing a 4-byte counts for each bucket</p><ul><li><p>Since each bucket requires 4 bytes, we need to know the number of buckets. Turns out, the # buckets is O(main memory size).</p></li></ul></li></ul><p></p>

New cards

what is bi-partitioning, and what defines a “good” partition of an undirected graph G(V,E)?

Networks are often organized into modules, clusters, and communities, and our goal is to find densely linked clusters/partitions
For bi-partitioning, given an undirected graph G(V,E), our task is to divide the vertices into two disjoint groups A, B
A “good” partition is one where we:
- Maximize the number of within-group connections (i.e., edges)
- Minimize the number of between-group connections

<ul><li><p><span style="background-color: transparent;"><span>Networks are often organized into modules, clusters, and communities, and our goal is to find densely linked clusters/partitions</span></span></p></li><li><p><span style="background-color: transparent;"><span>For bi-partitioning, given an undirected graph G(V,E), our task is to divide the vertices into two disjoint groups A, B</span></span></p></li><li><p><span style="background-color: transparent;"><span>A “good” partition is one where we:</span></span></p><ul><li><p><span style="background-color: transparent;"><span>Maximize the number of within-group connections (i.e., edges)</span></span></p></li><li><p><span style="background-color: transparent;"><span>Minimize the number of between-group connections</span></span></p></li></ul></li></ul><p></p>

New cards

How do we express partition quality via cut for undirected and directed graphs?

Cut: set of edges with only one vertex in a group (i.e., sum of weights for all edges with one endpoint in group A and the other in group B)

<p><span style="background-color: transparent;"><strong><span>Cut</span></strong><span>: set of edges with only one vertex in a group (i.e., sum of weights for all edges with one endpoint in group A and the other in group B)<br></span></span></p>

New cards

what’s the minimum cut criterion for partitioning a graph? What’s the problem with it?

Minimum-cut criterion: minimize weight of connections between groups (i.e., find A and B that minimize cut(A,B); argmin_A,B cut(A,B))
- Problem: min-cut only considers external cluster connections and does not consider internal cluster connectivity
  - Degenerate cases can occur, where the optimal cut != minimum cut
  - Solution: normalized-cut criterion

<ul><li><p><span style="background-color: transparent;"><strong><span>Minimum-cut criterion</span></strong><span>: minimize weight of connections between groups (i.e., find A and B that minimize cut(A,B); argmin</span><sub><span>A,B</span></sub><span> cut(A,B))</span></span></p><ul><li><p><span style="background-color: transparent;"><span>Problem: min-cut only considers external cluster connections and does not consider internal cluster connectivity</span></span></p><ul><li><p><span style="background-color: transparent;"><span>Degenerate cases can occur, where the optimal cut != minimum cut</span></span></p></li><li><p><span style="background-color: transparent;"><span>Solution: normalized-cut criterion</span></span></p></li></ul></li></ul></li></ul><p></p>

New cards

what’s the normalized cut criterion for partitioning a graph? How do we calculate it for weighted and unweighted graphs?

Normalized-cut criterion: calculates connectivity between groups relative to the density of each group
- preferred over minimum cut criterion; produces more balanced partitions
ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B)
- vol(A) = total weight of the edges with at least one endpoint (k) in A = sum of degrees of all nodes in A
  - for weighted graphs, the degree of a node is the sum of the weights of the edges touching it
- cut(A,B) = total weight of edges Eij where i∈A and j∈B
- NOTE: lower scores are better; 0.065 identifies 10x better partition than 0.5

<ul><li><p><span style="background-color: transparent;"><strong><span>Normalized-cut criterion</span></strong><span>: calculates connectivity between groups relative to the density of each group</span></span></p><ul><li><p>preferred over minimum cut criterion; produces more balanced partitions</p></li></ul></li><li><p><span style="background-color: transparent;"><span>ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B)</span></span></p><ul><li><p><span style="background-color: transparent;"><span>vol(A) = total weight of the edges with at least one endpoint (k) in A = sum of degrees of all nodes in A</span></span></p><ul><li><p><span style="background-color: transparent;"><span>for weighted graphs, the degree of a node is the sum of the weights of the edges touching it</span></span></p></li></ul></li><li><p><span style="background-color: transparent;"><span>cut(A,B) = total weight of edges Eij where i∈A and j∈B</span></span></p></li><li><p><span style="background-color: transparent;"><span>NOTE: lower scores are better; 0.065 identifies 10x better partition than 0.5</span></span></p></li></ul></li></ul><p><br></p>

New cards

Can we actually find the optimal normalized cut for a graph in practice?

Computing the exact optimal (minimum) normalized cut is NP-hard; we can use methods like spectral clustering to relax it and find an approximate solution

New cards

Define the following: adjacency matrix, degree matrix, and Laplacian matrix

Matrix representations of graphs:

Adjacency matrix (A): nxn matrix, where aij = 1 if an edge exists between nodes i and j
- symmetric since graph is undirected
- real/orthogonal eigenvectors

Degree matrix (D): nxn diagonal matrix, where dii = degree of node i (i.e., number of edges connecting to node i)
Laplacian matrix (L): nxn symmetric matrix, where L = D - A
- eigenvalues are non-negative real numbers, and eigenvectors are real/orthogonal
- sum of each row is 0, sum of each col is 0
- trivial eigenpair: x = (1,...,1)

New cards

How do we bi-partition a graph using spectral partitioning?

Pre-processing: construct matrix representation of graph
Decomposition: compute eigenvalues (λ) and eigenvectors (x) of the matrix, and map each point to a lower-dimensional representation based on one or more eigenvectors (xi)
- Lx = λx
- NOTE: each eigenvector xi (i’th column in x) corresponds to the i’th smallest eigenvalue λi; λ1 is the trivial mode (λ1=0)
- e.g., map points/vertices to lower-dimensional representation based on eigenvector x2, where each node i corresponds to scalar x2[i] for i = 1…6
Grouping: assign points to two or more clusters based on the new representation
- e.g., sort components of the reduced 1d vector (eigenvector x2) and identify clusters by splitting the sorted vector into two (using a split of 0): nodes 1,2,3 -> cluster A, nodes 4,5,6 -> cluster B
- NOTE: splitting at 0 or the median value are naive approaches; there are more expensive approaches not covered in this course

<ol><li><p>Pre-processing: construct matrix representation of graph</p></li><li><p>Decomposition: compute eigenvalues (λ) and eigenvectors (x) of the matrix, and map each point to a lower-dimensional representation based on one or more eigenvectors (xi)</p><ul><li><p>Lx = λx</p></li><li><p>NOTE: each eigenvector xi (i’th column in x) corresponds to the i’th smallest eigenvalue λi; λ1 is the trivial mode (λ1=0)</p></li><li><p>e.g., map points/vertices to lower-dimensional representation based on eigenvector x2, where each node i corresponds to scalar x2[i] for i = 1…6</p></li></ul></li><li><p>Grouping: assign points to two or more clusters based on the new representation</p><ul><li><p>e.g., sort components of the reduced 1d vector (eigenvector x2) and identify clusters by splitting the sorted vector into two (using a split of 0): nodes 1,2,3 -> cluster A, nodes 4,5,6 -> cluster B</p></li><li><p>NOTE: splitting at 0 or the median value are naive approaches; there are more expensive approaches not covered in this course</p></li></ul></li></ol><p></p>

New cards

How do we perform K-way spectral clustering to partition a graph into k non-overlapping clusters/communities?

Approach 1: recursive bi-partitioning
- recursively apply bi-partitioning algorithm in a hierarchical divisive manner
- inefficient and unstable
Approach 2 (better): cluster multiple eigenvectors
- build a reduced space from multiple eigenvectors
- i.e., the idea is that we describe each point/node by multiple eigenvectors (e.g., point p = values from eigenvectors corresponding to 1st - 3rd smallest eigenvalues = (x1[p], x2[p], x3[p])), and then run k-means clustering on the resultant 3-dimensional point representations

New cards

What’s the purpose for overlapping communities and how can we differentiate between non-overlapping and overlapping communities in an adjacency matrix?

In cases like Facebook relationships (where nodes = Facebook users, edges = friendships), many social communities exist and overlap with each other (e.g., same work, same high school, same math class, same basketball team); for examples like this, it makes sense for clusters/communities to overlap instead of being separate.

Non-overlapping communities: adjacency matrix has distinct tiles

this is the type of clustering we’ve worked with so far (partitioning)

Overlapping communities: adjacency matrix has tiles that overlap

edge density in overlapping sections is higher

<p><span style="background-color: transparent;"><span>In cases like Facebook relationships (where nodes = Facebook users, edges = friendships), many social communities exist and overlap with each other (e.g., same work, same high school, same math class, same basketball team); for examples like this, it makes sense for clusters/communities to overlap instead of being separate.</span></span></p><p></p><p><span style="background-color: transparent;"><strong><span>Non-overlapping communities</span></strong><span>: adjacency matrix has distinct tiles</span></span></p><ul><li><p><span style="background-color: transparent;"><span>this is the type of clustering we’ve worked with so far (partitioning)</span></span></p></li></ul><p></p><p><span style="background-color: transparent;"><strong><span>Overlapping communities</span></strong><span>: adjacency matrix has tiles that overlap</span></span></p><ul><li><p><span style="background-color: transparent;"><span>edge density in overlapping sections is higher</span></span></p></li></ul><p></p>

New cards

What is the goal of AGM (Community-Affiliation Graph Model)?

goal: define a model that can generate networks

the model will have a set of parameters that we need to estimate using our data, and these estimates will implicitly detect communities

i.e.,

key question: given a set of nodes, how do communities “generate” edges of the network?
solution: use AGM (which is a generative model) to generate the network’s edges

<p>goal: define a model that can generate networks</p><ul><li><p>the model will have a set of parameters that we need to estimate using our data, and these estimates will implicitly detect communities</p></li></ul><p></p><p>i.e.,</p><ul><li><p>key question: given a set of nodes, how do communities “generate” edges of the network? </p></li><li><p>solution: use AGM (which is a generative model) to generate the network’s edges</p></li></ul><p></p>

New cards

Given an AGM (which is a generative model), how do we produce a network step-by-step?

AGM is a generative model B(V,C,M, {pc}) for graphs, where:

V = nodes (same as nodes in our network)
C = communities (same as clusters in our network)
M = memberships (connections of nodes to communities)
Pc = probability of community c

Given an AGM, we generate the network’s edges as follows:

For each pair of nodes u,v in community c, we connect them with probability Pc
- overall edge probability = P(u,v) = 1 - [product of cEM_uM_v](1-p_c) = 1 - [product of (1-Pc) for each community c that BOTH u and v belong to], where Mu = set of communities node u belongs to
  - if u,v share no communities, P(u,v) = 0
  - i.e., think of it as an OR function: if at least one community says YES, we create an edge
Once we have P(u,v), simply flip a coin with this probability to decide whether to connect an edge between these nodes or not; repeat this for all node pairs

<p>AGM is a generative model B(V,C,M, {pc}) for graphs, where:</p><ul><li><p><span style="background-color: transparent;"><span>V = nodes (same as nodes in our network)</span></span></p></li><li><p><span style="background-color: transparent;"><span>C = communities (same as clusters in our network)</span></span></p></li><li><p><span style="background-color: transparent;"><span>M = memberships (connections of nodes to communities)</span></span></p></li><li><p><span style="background-color: transparent;"><span>Pc = probability of community c</span></span></p></li></ul><p></p><p>Given an AGM, we generate the network’s edges as follows:</p><ul><li><p><span style="background-color: transparent;"><span>For each pair of nodes u,v in community c, we connect them with probability Pc</span></span></p><ul><li><p><span style="background-color: transparent;"><span>overall edge probability = P(u,v) = 1 - [product of cEM</span><u><sub><span>u</span></sub></u><sub><span> </span></sub><span>M</span><sub><span>v</span></sub><span>](1-p</span><sub><span>c</span></sub><span>) = 1 - [product of (1-Pc) for each community c that BOTH u and v belong to], where Mu = set of communities node u belongs to</span></span></p><ul><li><p><span style="background-color: transparent;"><span>if u,v share no communities, P(u,v) = 0</span></span></p></li><li><p><span style="background-color: transparent;"><span>i.e., think of it as an OR function: if at least one community says YES, we create an edge</span></span></p></li></ul></li></ul></li><li><p><span style="background-color: transparent;"><span>Once we have P(u,v), simply flip a coin with this probability to decide whether to connect an edge between these nodes or not; repeat this for all node pairs</span></span></p></li></ul><p></p>

New cards

What types of community structures can be expressed by an AGM?

AGM is very flexible; it can express a variety of community structures such as non-overlapping, overlapping, nested, etc.

<p>AGM is very flexible; <span style="background-color: transparent;"><span>it can express a variety of community structures such as non-overlapping, overlapping, nested, etc.</span></span></p>

New cards

Explain how we can generate a network given a relaxed AGM, where memberships (M) have strengths

see pic: describes how to calculate edge probability for each pair of nodes, which we use to generate the final network.

New cards

Explain how we can generate a network given a directed AGM, where community membership edges are directed

Recall, the Factor matrix F shows node community membership strengths (rows = nodes, cols = communities); can be used to calculate P(u,v)

FvA -> strength of node v’s membership to community A
Fv -> vector of community membership strengths of node v

However, now when calculating edge probabilities we use 2 matrices:

F = FuA → community membership strength of outgoing edge from node u to community A
H = HvA → community membership strength of incoming edge from community A to node v

The probability of generating a directed edge from u to v is 1 - exp(-Fu Hv^t)

<p>Recall, the <span style="background-color: transparent;"><strong><span>Factor matrix F</span></strong><span> shows node community membership strengths (rows = nodes, cols = communities); can be used to calculate P(u,v)</span></span></p><ul><li><p><span style="background-color: transparent;"><span>FvA -> strength of node v’s membership to community A</span></span></p></li><li><p><span style="background-color: transparent;"><span>Fv -> vector of community membership strengths of node v</span></span></p></li></ul><p></p><p>However, now when calculating edge probabilities we use 2 matrices:</p><ul><li><p>F = FuA → community membership strength of <strong>outgoing edge</strong> from node u to community A</p></li><li><p>H = HvA → community membership strength of <strong>incoming edge</strong> from community A to node v</p></li></ul><p></p><p>The probability of generating a directed edge from u to v is 1 - exp(-Fu Hv^t)</p>

New cards

What is the goal of trawling?

Goal: detect small, dense communities in web graphs that are highly interconnected (e.g., people that are all talking about the same set of things/topics)
Intuition: a community on the web can be modeled as a dense bipartite subgraph, where:
- left side (X) = people or sources (bloggers, users, authors)
- right side (Y) = objects or targets (webpages, topics, papers)

<ul><li><p><span style="background-color: transparent;"><span>Goal: detect small, dense communities in web graphs that are highly interconnected (e.g., people that are all talking about the same set of things/topics)</span></span></p></li><li><p><span style="background-color: transparent;"><span>Intuition: a community on the web can be modeled as a </span><strong><span>dense bipartite subgraph</span></strong><span>, where:</span></span></p><ul><li><p><span style="background-color: transparent;"><span>left side (X) = people or sources (bloggers, users, authors)</span></span></p></li><li><p><span style="background-color: transparent;"><span>right side (Y) = objects or targets (webpages, topics, papers)</span></span></p></li></ul></li></ul><p></p>

New cards

how do we define small, dense communities in highly interconnected web graphs?

We produce a complete (i.e., fully connected) bipartite subgraph K_s,t

s nodes on the left, each linking to exactly the same t other nodes on the right
i.e., K_s,t is a fully connected subgraph with s nodes on one side, each connected to t nodes on the other side

i.e., Given a frequency threshold f, we want to define Ks,t, where we have s nodes (each node si is represented as a set Si, where Si’s elements are the nodes that node si connects to), and a set of t nodes (for which all t nodes exist in each Si). In plain english, Ks,t = a set of s nodes, each with edges connecting to the same set of t nodes.

<p>We produce a complete (i.e., fully connected) bipartite subgraph K<sub>s,t</sub></p><ul><li><p>s nodes on the left, each linking to exactly the same t other nodes on the right</p></li><li><p>i.e., K<sub>s,t</sub> is a fully connected subgraph with s nodes on one side, each connected to t nodes on the other side</p></li></ul><p></p><p>i.e., <span style="background-color: transparent;"><span>Given a frequency threshold f, we want to define Ks,t, where we have s nodes (each node si is represented as a set Si, where Si’s elements are the nodes that node si connects to), and a set of t nodes (for which all t nodes exist in each Si). In plain english, Ks,t = a set of s nodes, each with edges connecting to the same set of t nodes. </span></span></p><p></p>

New cards

What’s the connection between frequent itemsets in market basket analysis and complete bipartite graphs K_s,t?

Setting:
- Market = Universe U of n items
- Baskets = m subsets of U: S1, S2, …, Sm, where Si is a set of items one person bought
- Support = frequency threshold f
Goal: find all subsets T such that T is a subset of Si for at least f sets of Si’s (i.e., items in T were bought together at least f times)
- For a given webgraph, we want to find all possible/existing bipartite subgraphs Ks,t; these indicate frequent itemsets
Connection: frequent itemsets (groups of items that appear together in many baskets) = complete bipartite graphs

New cards

Given baskets, how do we produce a complete bipartite graph for a frequent itemset?

represent each node i as a set Si of nodes that i points to
given a minimum support (s) and itemset size (t), look for all sets Y (frequent itemsets) of size t that occur in s sets Si
- each of these represents a complete bipartite graph K_s,t
for example, assume we find a frequent itemset Y={a,b,c}, meaning there are s nodes (x,y,z) that link to all of {a,b,c}
- from here, we can define K_s,t = set Y of size t that occurs in s sets Si, where nodes x,y,z on the left connect to all of nodes a,b,c on the right

<ul><li><p>represent each node i as a set Si of nodes that i points to</p></li><li><p>given a minimum support (s) and itemset size (t), look for all sets Y (frequent itemsets) of size t that occur in s sets Si</p><ul><li><p>each of these represents a complete bipartite graph K<sub>s,t</sub></p></li></ul></li><li><p>for example, assume we find a frequent itemset Y={a,b,c}, meaning there are s nodes (x,y,z) that link to all of {a,b,c}</p><ul><li><p>from here, we can define K<sub>s,t</sub> = set Y of size t that occurs in s sets Si, where nodes x,y,z on the left connect to all of nodes a,b,c on the right</p></li></ul></li></ul><p></p>

New cards

For this graph, find all frequent itemsets and corresponding complete bipartite subgraphs, assuming s=2

New cards

What is the purpose of clustering?

Given a cloud of data points (with a notion of distance between them), we can understand its structure by grouping them into clusters
- members of a cluster = similar/close to each other
- members of different clusters = dissimilar
- members of no cluster = outliers
Usually, points are in a high-dimensional space (i.e., many features), and distance is measured using a method like Euclidian, Cosine, Jaccard, etc.

New cards

What are the 2 methods of clustering?

Hierarchical:
- Agglomerative (bottom up): each point starts as a cluster and we repeatedly combine the two nearest clusters into one
- Divisive (top down): start with one cluster and recursively split it

Point assignment: maintain a set of clusters and points belong to the nearest cluster (e.g., K-means)

New cards

What are the 3 key questions for hierarchical agglomerative clustering methods?

Remember, the key operation for hierarchical agglomerative clustering is to repeatedly combine two nearest clusters. So, we have three important questions:

1. How do you represent a cluster of more than one point? (i.e., how do you represent the “location” of each cluster to tell which pair of clusters is closest?)

2. How do you determine the nearness of clusters?

3. When to stop combining clusters?

New cards

Provide the answer to the first key question for hierarchical agglomerative clustering methods:

1. How do you represent a cluster of more than one point? (i.e., how do you represent the “location” of each cluster to tell which pair of clusters is closest?)

Solution: represent each cluster by its centroid (e.g., for Euclidean, each cluster’s centroid is the average of its data points)

Solution (Non-Euclidean): when we can’t use the “average” of points but rather the points themselves, we represent each cluster by its clustroid (the data point closest to other points (e.g., smallest sum of squares distances to other points))

<p><span style="background-color: transparent;"><span>Solution: represent each cluster by its </span><strong><span>centroid </span></strong><span>(e.g., for Euclidean, each cluster’s centroid is the average of its data points)</span></span></p><p></p><p><span style="background-color: transparent;"><span>Solution (Non-Euclidean): when we can’t use the “average” of points but rather the points themselves, we represent each cluster by its </span><strong><span>clustroid </span></strong><span>(the data point closest to other points (e.g., smallest sum of squares distances to other points))</span></span></p>

New cards

Provide the answer to the second key question for hierarchical agglomerative clustering methods:

2. How do you determine the nearness of clusters?

Solution: measure cluster distances by distances of centroids
Solution (Non-Euclidean): treat clustroid as if it were the centroid

New cards

Provide the answer to the third key question for hierarchical agglomerative clustering methods:

3. When to stop combining clusters?

Approach 1: predefined k, stop when we have k clusters (makes sense when we know the data naturally falls into k classes)
Approach 2: stop when the next merge would create a cluster with low cohesion (i.e., a bad cluster)
- cohesion could be defined as:
  - (1) the diameter of the merged cluster = maximum distance between points in the cluster
  - (2) radius = maximum distance (or average distance) of a point from the centroid or clustroid
  - (3) density-based approach = diameter (or average distance) divided by number of points in the cluster

New cards

What’s the naive implementation of hierarchial clustering, and what’s the performance?

Naive implementation: compute pairwise distances between all pairs of clusters, then merge
- Performance: O(N^3),
- However, careful implementation using priority queue can reduce it to O(N^2 log(N))
  - Still too expensive for really big datasets that do not fit in memory

New cards

Describe how the K-means clustering algorithm works, step-by-step

NOTEL this algorithm assumes Euclidean distance (see formula in pic) for n dimensions and points p, q:

Pick k (number of clusters)
Initialize clusters by picking one point per cluster

Approach 1: sampling
- cluster a sample of the data (fitting in main memory) using hierarchical clustering to obtain k clusters
- pick a point from each cluster (e.g., point closest to centroid)
Approach 2: picking dispersed set of points
- pick first point at random
- pick the next point to be the one whose minimum distance from the selected points is as large as possible
- repeat until you have k points

For each remaining point, place it in the cluster whose current centroid is the nearest
After all points are assigned, update the locations of centroids of the k clusters
Reassign all points to the closest centroid (points may move between clusters)
Repeat 4 and 5 until convergence (i.e., until points don’t move between clusters and centroids stabilize)

<p><span style="background-color: transparent;"><span>NOTEL this algorithm assumes Euclidean distance (see formula in pic) for n dimensions and points p, q:</span></span></p><ol><li><p><span style="background-color: transparent;"><span>Pick k (number of clusters)</span></span></p></li><li><p><span style="background-color: transparent;"><span>Initialize clusters by picking one point per cluster</span></span></p></li></ol><ul><li><p><span style="background-color: transparent;"><span>Approach 1: sampling</span></span></p><ul><li><p><span style="background-color: transparent;"><span>cluster a sample of the data (fitting in main memory) using hierarchical clustering to obtain k clusters</span></span></p></li><li><p><span style="background-color: transparent;"><span>pick a point from each cluster (e.g., point closest to centroid)</span></span></p></li></ul></li><li><p><span style="background-color: transparent;"><span>Approach 2: picking dispersed set of points</span></span></p><ul><li><p><span style="background-color: transparent;"><span>pick first point at random</span></span></p></li><li><p><span style="background-color: transparent;"><span>pick the next point to be the one whose minimum distance from the selected points is as large as possible</span></span></p></li><li><p><span style="background-color: transparent;"><span>repeat until you have k points</span></span></p></li></ul></li></ul><ol start="3"><li><p><span style="background-color: transparent;"><span>For each remaining point, place it in the cluster whose current centroid is the nearest</span></span></p></li><li><p><span style="background-color: transparent;"><span>After all points are assigned, update the locations of centroids of the k clusters</span></span></p></li><li><p><span style="background-color: transparent;"><span>Reassign all points to the closest centroid (points may move between clusters)</span></span></p></li><li><p><span style="background-color: transparent;"><span>Repeat 4 and 5 until convergence (i.e., until points don’t move between clusters and centroids stabilize)</span></span></p></li></ol><p></p>

New cards

What’s the optimal way to select k for k-means clustering?

How to select k:
- Try different k, looking at changes in average distance to centroid as k increases
- Optimal k occurs when average falls rapidly, followed by little change

<ul><li><p><span style="background-color: transparent;"><span>How to select k: </span></span></p><ul><li><p><span style="background-color: transparent;"><span>Try different k, looking at changes in average distance to centroid as k increases</span></span></p></li><li><p><span style="background-color: transparent;"><span>Optimal k occurs when average falls rapidly, followed by little change</span></span></p></li></ul></li></ul><p></p>

New cards

What’s the time complexity of k-means clustering; how can we improve this performance?

In each round, we have to examine each input point exactly once to find closest centroid
Each round is O(kN) for N points, k clusters
Issue: number of rounds to convergence can be very large
- Solution: cluster in a single pass over the data using the BFR algorithm

New cards

what is the BFR algorithm’s purpose?

BFR algorithm: extension of k-means to large data

Variant of k-means designed to handle very large datasets stored in disk
Efficient way to summarize clusters (want memory required O(clusters) and not O(data))

New cards

What’s the key assumption made by the BFR algorithm? Because of this, what shape are the clusters?

Assumes clusters are normally distributed around a centroid in a Euclidean space (i.e., data is normally distributed in each dimension around its cluster’s centroid)
- We can quantify the likelihood of finding a point in the cluster at a given distance from the cluster’s centroid along each dimension
  - e.g., for dimension X, 68% of points in the cluster are within 1 sd of the mean (i.e., the centroid’s value (0)), and 95% are within 2 sd
  - Assumes dimensions are independent of each other

Standard deviations in different dimensions may vary; each dimension has its own mean and sd
Normal distribution assumption implies clusters are axis-aligned ellipses
- NOTE: ellipse is longer along dimensions with higher sd

<ul><li><p><span style="background-color: transparent;"><span>Assumes clusters are </span><strong><span>normally distributed</span></strong><span> around a centroid in a Euclidean space (i.e., data is normally distributed in each dimension around its cluster’s centroid)</span></span></p><ul><li><p><span style="background-color: transparent;"><span>We can quantify the likelihood of finding a point in the cluster at a given distance from the cluster’s centroid along each dimension</span></span></p><ul><li><p><span style="background-color: transparent;"><span>e.g., for dimension X, 68% of points in the cluster are within 1 sd of the mean (i.e., the centroid’s value (0)), and 95% are within 2 sd</span></span></p></li><li><p><span style="background-color: transparent;"><span>Assumes dimensions are independent of each other</span></span></p></li></ul></li></ul></li></ul><ul><li><p><span style="background-color: transparent;"><span>Standard deviations in different dimensions may vary; each dimension has its own mean and sd</span></span></p></li><li><p><span style="background-color: transparent;"><span>Normal distribution assumption implies clusters are axis-aligned </span><strong><span>ellipses</span></strong></span></p><ul><li><p><span style="background-color: transparent;"><span>NOTE: ellipse is longer along dimensions with higher sd</span></span></p></li></ul></li></ul><p></p>

New cards

Describe the BFR algorithm in-depth

start with first load in-memory and select k initial centroids using any main memory clustering algorithm
assign points to DS
assign remaining points to CS, RS
summarize all clusters in DS and CS, store summary statistics (N, SUM, SUMSQ) in main memory, discard DS and CS points
1. each cluster is fully represented by its summary statistics
store RS points in main memory
load next set of data in-memory (including previous RS points and DS/CS statistics)
add points that are sufficiently close to a DS cluster centroid into the DS
adjust statistics of the DS clusters to account for the new points
1. incrementing N, SUM, and SUMSQ is easy and not computationally complex
discard DS points
for all remaining points, run main memory clustering algorithm to assign them to CS clusters
store new CS cluster statistics, adding all unassigned points to the RS
discard CS points, store RS points in main memory
consider merging any new/existing CS clusters into one (based on variance)
repeat 6-13 until final round; in final round, merge all CS clusters into one and all RS points into their nearest DS or CS cluster

Detailed algorithm:

Algorithm:
- At any given time, main memory contains (1) current set of points read from disk, (2) metadata of clusters from previously read points

Points are read from disk one main memory-full at a time

Most points from previous memory loads are summarized by simple statistics

To begin, from the initial load, we select the initial k centroids by some sensible approach:

Take k random points
Take a small random sample
Cluster optimally (e.g., pick a random sample point, and then k-1 more points, each as far from the previously selected points as possible)

Keep track of 3 classes of points:
- Discard Set (DS): points close enough to a centroid to be summarized (points in blue ellipse)
- Compression Set (CS): groups of points that are close together but not close to any existing centroid (points in yellow ellipses)
  - these points are summarized, but not assigned to a cluster
- Retained Set (RS): isolated points waiting to be assigned to a compression set (orange dots)
  - we have to keep track of these points in memory (unlike points in the DS and CS, which we throw away and only keep metadata for)

Summarize sets of points:
- For each cluster, the Discard Set (DS) is summarized by:
  - (1) integer N, representing the number of points in the cluster
  - (2) vector SUM, whose i’th component is the sum of the coordinates of the points in dimension i
  - (3) vector SUMSQ, whose i’th component is the sum of squares of the coordinates of the points in dimension i
  - Comments:
    - 2d+1 values represent a cluster of any size, where d = number of dimensions
      - why? len(N)+len(SUM)+len(SUMSQ) = 1+d+d = 1+2d
    - SUMi / N = average in each dimension (i.e., the centroid), where SUMi = i’th component of SUM
    - (SUMSQi / N) - (SUMi / N)^2 = variance of a cluster’s discard set in dimension i, where SUMSQi = i’th component of SUMSQ
      - sd = sqrt(var)
Process the memory-load of points:

Find points that are sufficiently close to a cluster centroid and add those points to that cluster and the DS

adjust statistics of the clusters to account for the new points: N, SUM, SUMSQ

these points are summarized and then discarded
how do we decide if a point is close enough to a cluster? BFR suggestion is to take points with a Mahalanobis distance (likelihood of point belonging to a centroid) less than a threshold (i.e., take points with a high likelihood of belonging to the currently nearest centroid (fulfilling normal distribution))

Use any main memory clustering algorithm to cluster the remaining points and the old RS

clusters go to the CS, outlying points to the RS

Consider merging compressed sets in the CS

how do we decide whether two compressed sets (CS) deserve to be combined into one? BFR suggestion is to compute the variance of the combined subcluster (N, SUM, and SUMSQ allow us to calculate this quickly), and combine if the combined variance is below some threshold

If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster

New cards

What’s the issue with BFR and k-means, and what’s the solution?

Issue: BFR/k-means assumes clusters are normally distributed in each dimension, and - since axes are fixed - ellipses at an angle are NOT ok

Solution: CURE (Clustering Using REpresentatives) algorithm

<ul><li><p><span style="background-color: transparent;"><span>Issue: BFR/k-means assumes clusters are normally distributed in each dimension, and - since axes are fixed - ellipses at an angle are NOT ok</span></span></p></li></ul><ul><li><p><span style="background-color: transparent;"><span>Solution: </span><strong><span>CURE (Clustering Using REpresentatives) algorithm</span></strong></span></p></li></ul><p></p>

New cards

What does the CURE algorithm assume? What shape does it allow clusters to be, and how are clusters represented?

CURE algorithm: extension of k-means to clusters of arbitrary shapes

Assumes a Euclidean distance (i.e., between any two points, we can always find the midpoint by taking the average)
Allows clusters to assume any shape
Uses a collection of representative points to represent clusters
- Unlike BFR/k-means, which use one point to represent a cluster

New cards

How does the CURE algorithm work?

2-pass algorithm:

Pass 1:
1. Pick a random sample of points that fit in main memory
2. Create initial clusters: cluster these points hierarchically (group nearest points/clusters)
3. Pick representative points: (1) for each cluster, pick a sample of k representative points (as dispersed as possible), (2) move each representative point - for example - 20% toward the centroid of the cluster

Pass 2:
1. Rescan the whole dataset and visit each point p in the dataset
  - NOTE: visit all points, not just the sample points
2. Place each point p in the closest cluster
  - closest = find closest representative to p and assign it to the representative’s cluster