SECT 5

Last updated 12:41 PM on 3/13/26

70 Terms

New cards

Unsupervised Learning

Process of analyzing data without pre-classified labels in order to uncover hidden meaning and structure.
Algorithm creates and designs labels based on similarity

New cards

Purpose of Unsupervised learning

basis for exploration, interpretation and supervised learning
discover meaning and structure of the data

New cards

Clustering

process of grouping unlabeled data so that similar points end up in the same cluster and dissimilar points in different clusters

New cards

Purpose of Clustering

Data Understanding
Class Identification
Outlier and Noise detection

New cards

Clustering Methods

Partition based
Hierarchical Based
Density based

New cards

Partition Based

Divide data in k groups based on similarity

New cards

K means Clustering Idea

Each point is assigned to cluster with nearest centroid
each cluster is represented by a centroid
Centroid values are updated until the assignments stabilize
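
The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the random choice of data points as initial centroids, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centroids as k distinct, randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments have stabilised
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data: two compact, well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(points, k=2)
```

On this toy data the first three points share one label and the last three share the other; which group gets label 0 depends on the initialization.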

New cards

K means Setup

k, the number of clusters, must be defined in advance
Centroids must be initialized (e.g. as random points); different initializations can give different results

New cards

K means Algo

works best for compact, well-defined clusters

New cards

K means pros

efficient: O(tkn) for t iterations, k clusters and n points
easy
widely used

New cards

K means cons

sensitive to outliers and noise
must define k
assumes convex and spherical clusters
may converge to local optimum
requires a well-defined mean

New cards

k medoids idea

each cluster is centered around a medoid (most central point)
points are assigned to nearest medoid
each iteration re-selects the medoid as the most representative point of its cluster
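
A minimal sketch of this idea, assuming the simple alternating scheme (assign points, then re-pick each medoid) rather than PAM's swap search. The function name `kmedoids`, the Euclidean distance, and the toy data are assumptions for illustration.

```python
import numpy as np

def kmedoids(points, k, n_iter=100, seed=0):
    """Alternating k-medoids sketch; returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    # Pairwise distances; any other distance measure could be substituted here.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Assumption: start from k distinct, randomly chosen data points.
    medoids = rng.choice(len(points), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) > 0:
                # New medoid: the member with the smallest total distance to the others.
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # medoids no longer change
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Toy data: two groups on a line.
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                   [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
medoids, labels = kmedoids(points, k=2)
```

Unlike a centroid, each returned medoid is an index into `points`, so the cluster representative is always a real observation.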

New cards

k medoids over k means

medoids are actual data points
robust to noise and outliers

New cards

k medoid algo

useful when robustness and interpretability are more important than speed

New cards

k medoids pros

easy to implement
medoids are real datapoints
works with any distance measure
robust to noise and outliers

New cards

k medoids cons

may converge to local optimum
must predefine k
assumes convex and spherical clusters
computationally more expensive

New cards

PAM (Partitioning Around Medoids)

classic k-medoids algorithm
selects medoids by systematically swapping medoids with non-medoid points
more robust but costly

New cards

CLARA (Clustering LARge Applications)

runs PAM on a sample
reduces runtime but quality depends on sample

New cards

CLARANS (Clustering Large Applications based upon RANdomized Search)

Randomized version of PAM
examines only a random subset of the possible swaps
balances efficiency and quality

New cards

Commonalities of k means and k medoids

membership in a cluster depends on the distance to the cluster's centre element, so the resulting partition corresponds to a Voronoi diagram

New cards

Limitations of k means and k medoids

assume a convex partitioning
must predefine k

New cards

Expectation Maximization

For incomplete data: the expectation step estimates the missing (hidden) values under the current model, and the maximization step refines the model parameters given those estimates

New cards

Need for EM

provides a principled, probabilistic process
general use in clustering, handling missing values, HMMs

New cards

EM For Clustering Idea

Each cluster is a Gaussian distribution (mean, covariance, weight)
points are assigned probabilistically
best when clusters are non-spherical and a probabilistic assignment is needed

New cards

EM for clustering pros

captures elliptical/non-spherical clusters
more flexible
soft clustering
grounded in a probabilistic framework

New cards

EM Clustering Steps

1. assign points to cluster distributions (E-step)
2. re-estimate mean and variance (M-step)
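
The two steps can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions: the function name `em_gmm_1d`, the restriction to 1-D data, the initialization (means spread evenly over the data range), the small variance floor, and the toy data are not from the notes.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50):
    """EM sketch for a 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    # Assumption: deterministic start with means spread over the data range.
    means = np.linspace(x.min(), x.max(), k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: probability (responsibility) of each cluster for each point.
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the soft assignments.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        # Small floor (an assumption) keeps a variance from collapsing to zero.
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances, resp

# Toy data: two groups around 0.5 and 9.5.
x = np.array([0.0, 0.5, 1.0, 9.0, 9.5, 10.0])
weights, means, variances, resp = em_gmm_1d(x, k=2)
```

Each row of `resp` sums to 1 and gives the soft assignment of one point; taking `resp.argmax(axis=1)` turns it into a hard clustering.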

New cards

EM Clustering cons

computationally heavy
must predefine k
sensitive to initialization and can converge to local optimum
assumes all clusters follow the same type of distribution (e.g. Gaussian)

New cards

Silhouette coefficient

measures how well a point fits in its own cluster vs other clusters

used for evaluating clusters

New cards

Silhouette Coefficient Formula
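
For reference, the standard definition for a point i is given below. The symbols are introduced here and are not from the notes: a(i) is the mean distance from i to the other points in its own cluster, and b(i) is the smallest mean distance from i to the points of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values near 1 mean the point sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may belong to a neighbouring cluster.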

New cards

Hierarchical Clustering

builds a hierarchy of nested clusters without having to predefine the number of clusters

New cards

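Hierarchical Clustering idea

agglomerative: in the beginning each point is its own cluster, and the closest clusters are merged step by step
divisive: starts from one cluster containing all points, which is split step by step
produces a dendrogram to show the different clusters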

New cards

Types of hierarchical Clustering

Single linkage
complete linkage
centroid linkage

New cards

Dendrogram

is a tree diagram that shows the arrangement of clusters produced by hierarchical clustering
a cut through the dendrogram shows how to partition the data into clusters

New cards

Hierarchical clustering Algo

New cards

Single Linkage

Distance is defined as the minimum distance between any pair of points
O(n²)
sensitive to noise and outliers
good for detecting arbitrarily shaped clusters

New cards

Complete Linkage

maximum distance between any pair of points
O(n²)
favours compact, spherical clusters
still sensitive to outliers

New cards

Centroid Linkage

Distance is defined as distance between the centroids of 2 clusters
O(n)
can produce inversions
considers all points
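
The three linkage definitions can be compared side by side on a small example. The sketch below assumes Euclidean distance; the function name `cluster_distance` and the two toy clusters are hypothetical.

```python
import numpy as np

def cluster_distance(a, b, linkage):
    """Distance between clusters a and b (arrays of points) under a linkage rule."""
    # All pairwise distances between a point in a and a point in b.
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if linkage == "single":
        return pairwise.min()   # closest pair of points
    if linkage == "complete":
        return pairwise.max()   # farthest pair of points
    if linkage == "centroid":
        # Distance between the two cluster means.
        return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    raise ValueError(f"unknown linkage: {linkage}")

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [6.0, 0.0]])
single = cluster_distance(a, b, "single")      # 2.0 (from x=1 to x=3)
complete = cluster_distance(a, b, "complete")  # 6.0 (from x=0 to x=6)
centroid = cluster_distance(a, b, "centroid")  # 4.0 (from x=0.5 to x=4.5)
```

An agglomerative algorithm would repeatedly merge the two clusters with the smallest such distance, so the choice of linkage directly shapes the dendrogram.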

New cards

Density Based Clustering

Clusters are defined as dense regions of points separated by areas of low density

New cards

density based clustering idea

points inside a cluster are densely connected
sparse regions act as separators
noise and outliers are left unassigned

New cards

Density based clustering advantages

tackles noise and outliers
no need to define no. of clusters
works for arbitrarily shaped clusters

New cards

Core object

An object with at least MinPts neighbours within ε
forms the heart of a dense region

New cards

border object

lies within the ε-neighbourhood of a core object but itself has fewer than MinPts neighbours

New cards

Directly density reachable

a point p is directly density reachable from a point q if p lies within ε of q and q is a core object

New cards

density reachable

a point p is density reachable from a point q if there exists a chain of points from q to p in which each point is directly density reachable from the previous one
q is a core object; p can be a border object

New cards

Density connected

two points p and q are density connected if both are density reachable from a common core object
p and q can be border objects

New cards

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

identify core objects: points with at least MinPts neighbours within ε
grow clusters by connecting density reachable points
points that are not density reachable from any core object are labelled as noise
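
These three steps can be sketched as follows. This is a minimal illustration, not an optimized implementation: the function name `dbscan`, the label -1 for noise, counting a point as its own neighbour, and the toy data are assumptions.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """DBSCAN sketch; returns one cluster label per point, -1 for noise."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # eps-neighbourhoods (assumption: a point counts as its own neighbour).
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbours])
    labels = np.full(n, -1)  # -1 means noise / not yet assigned
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from this unassigned core object.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()  # p is always a core object
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # only core objects extend the cluster
        cluster += 1
    return labels

# Toy data: two dense squares and one isolated point.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
                   [50.0, 50.0]])
labels = dbscan(points, eps=1.5, min_pts=3)
```

Border objects receive the label of the first cluster that reaches them, while the isolated point keeps the label -1 because no core object reaches it.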

New cards

DBSCAN Algo

New cards

DBSCAN Pros

detects noise and outliers
detects arbitrarily shaped clusters
works well for evenly dense clusters

New cards

DBSCAN Cons

requires 2 parameters (ε and MinPts)
sensitive to parameter changes
works only for clusters of roughly uniform density
performance degrades in high dimensions

New cards

OPTICS (Ordering Points To Identify the Clustering Structure)

uses an ordering of points based on density
produces a reachability plot
works for varying density
generalizes DBSCAN; is more flexible

New cards

Reachability plot

shows valleys for clusters and peaks for sparse regions

New cards

core distance

minimum ε such that point becomes a core object

New cards

Reachability distance

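measures how far a point is from the previously processed points in the ordering: for a core object o, the reachability distance of p is max(core distance of o, distance between o and p)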
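
Both distances can be computed directly from their definitions. The sketch below assumes Euclidean distance and counts a point as its own neighbour; the function names, the `None` return for non-core objects, and the toy data are hypothetical.

```python
import numpy as np

def core_distance(points, i, eps, min_pts):
    """Smallest radius at which point i has min_pts neighbours, or None if > eps."""
    # Sorted distances from point i; d[0] == 0 is the point itself
    # (counting it as a neighbour is an assumption).
    d = np.sort(np.linalg.norm(points - points[i], axis=1))
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None  # point i is not a core object within eps
    return float(d[min_pts - 1])

def reachability_distance(points, p, o, eps, min_pts):
    """Reachability distance of point p with respect to point o."""
    cd = core_distance(points, o, eps, min_pts)
    if cd is None:
        return None  # undefined when o is not a core object
    return max(cd, float(np.linalg.norm(points[p] - points[o])))

points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [5.0, 0.0]])
cd0 = core_distance(points, 0, eps=3.0, min_pts=3)              # 2.0
r10 = reachability_distance(points, 1, 0, eps=3.0, min_pts=3)   # 2.0
r30 = reachability_distance(points, 3, 0, eps=3.0, min_pts=3)   # 5.0
cd3 = core_distance(points, 3, eps=3.0, min_pts=3)              # None
```

Points closer to o than its core distance all get the same reachability distance, which is what flattens the valleys in a reachability plot.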

New cards

why clustering in the deep learning era

unsupervised learning is essential: first step of exploration, reveals structure
complements deep learning: used for pretraining and representation learning
practically applicable: faster, cheaper and simpler

New cards

Drawbacks of clustering

curse of dimensionality: distances become less meaningful, clusters lose separation
partition based: Voronoi partitioning breaks down in high dimensions
hierarchical: overpowered by noise
density based: sparseness

New cards

Clustering in High dimension

Adapt: dimensionality reduction; subspace clustering (search for clusters in subsets of the dimensions); feature selection or weighting (reduce noisy dimensions)
However, distances remain unreliable and much depends on the preprocessing choices
Solution: instead of grouping into clusters, find patterns, as in association rule mining

New cards

Transaction Data