Characteristics of the Input Data are Important
• Dimensionality
• Noise and Outliers
• Type of Distribution
• Type of Data / Attributes – dictates type of similarity
Data: Pre-processing
– Normalize Data
– Eliminate Outliers
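The two pre-processing steps above can be sketched roughly as follows. This is only an illustration using the standard library: the function names `zscore_columns` and `drop_outliers` are hypothetical, z-score scaling is one of several possible normalizations, and the cutoff of 3 standard deviations is an assumed rule of thumb, not something the notes prescribe.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has standard deviation 0; fall back to 1 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def drop_outliers(rows, threshold=3.0):
    """Keep only rows whose z-score is within +/- threshold in every column (assumed rule)."""
    scaled = zscore_columns(rows)
    return [row for row, z in zip(rows, scaled)
            if all(abs(v) <= threshold for v in z)]

if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    print(zscore_columns(data))
```

Scaling before clustering keeps an attribute with a large numeric range from dominating the distance calculation.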
Data: Post-processing
– Eliminate small clusters that may represent outliers
– Split “loose” clusters; i.e., clusters with relatively high SSE
– Merge clusters that are “close” and that have relatively low SSE
– Can use these steps during the clustering process
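As a rough sketch of the post-processing checks above, the snippet below computes each cluster's SSE (sum of squared distances to its centroid) and flags clusters that are very small or "loose". The helper names and the `min_size` and `max_sse` thresholds are assumptions chosen for illustration; suitable values depend on the data.

```python
def centroid(points):
    """Mean of the points, computed column by column."""
    return [sum(col) / len(points) for col in zip(*points)]

def sse(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def review_clusters(clusters, min_size=2, max_sse=10.0):
    """Return (small, loose): indices of possible outlier clusters and of split candidates."""
    small = [i for i, pts in enumerate(clusters) if len(pts) < min_size]
    loose = [i for i, pts in enumerate(clusters)
             if len(pts) >= min_size and sse(pts) > max_sse]
    return small, loose

if __name__ == "__main__":
    clusters = [
        [[0, 0], [0, 2]],    # tight cluster
        [[10, 10]],          # single-point cluster, possibly an outlier
        [[0, 0], [10, 0]],   # spread-out cluster with high SSE
    ]
    print(review_clusters(clusters))
```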
K-Means Clustering
• Given a value of k, the k-means algorithm randomly assigns each observation to one of the k clusters.
• After all observations have been assigned to a cluster, the resulting cluster centroids are calculated.
• Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.
• The centroid calculation and reassignment steps repeat until the assignments stop changing (or a maximum number of iterations is reached).
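The procedure described above can be sketched in plain Python as below. This is a minimal illustration, not a production implementation: the function names are hypothetical, Euclidean distance is assumed, and re-seeding an empty cluster with a random observation is one assumed way of handling that edge case.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) after iterating assignment and centroid updates."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of every cluster.
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster is re-seeded with a random observation.
            centroids.append(centroid(members) if members else list(rng.choice(points)))
        # Step 3: reassign each observation to the closest centroid.
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # no change means the algorithm has converged
            break
        labels = new_labels
    return labels, centroids

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
    print(k_means(pts, 2))
```

Because the starting assignment is random, different seeds can lead to different final clusters on less cleanly separated data, so it is common to run the algorithm several times.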
How to choose “k”?
• Choose “k” based on how results will be used
– Example: “How many market segments do we want?”
• Also, experiment with slightly different k’s
• If the number of clusters, k, is not clearly established by the context of the business problem, the k-means algorithm can be repeated for several values of k to identify promising values.
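One way to "repeat for several values of k" is to record the total SSE for each k and look for the point where adding clusters stops helping much. The sketch below assumes that approach; the function names are hypothetical, and the compact k-means inside it uses a random initial assignment and Euclidean distance.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Compact k-means with random initial assignment; returns (labels, centroids)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # Assumption: re-seed an empty cluster with a random observation.
                centroids.append(list(rng.choice(points)))
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

def sse_by_k(points, ks, seed=0):
    """Run k-means once per candidate k and return {k: total SSE}."""
    results = {}
    for k in ks:
        labels, centroids = k_means(points, k, seed=seed)
        results[k] = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return results

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
    for k, err in sse_by_k(pts, [1, 2, 3]).items():
        print(k, round(err, 3))
```

A sharp drop in SSE followed by a flat stretch suggests a promising k, but the final choice should still reflect how the clusters will be used.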
Suitability of k-Means Clustering
• Suitable when you know how many clusters you want and you have a larger data set (e.g., more than 500 observations).
• Appropriate for larger tables, up to millions of rows, and accepts only numerical data.
• Partitions the observations, which is appropriate if trying to summarize the data with k “average” observations that describe the data with the minimum amount of error.
Clustering should result in groups…
made of observations that are more similar to each other than they are to observations in other clusters.
Cluster cohesion
relates to the distance between observations within the same cluster.
Cluster separation
relates to the distance between observations in different clusters.
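Cohesion and separation can each be turned into a number in several ways. The sketch below assumes one simple choice: the mean pairwise Euclidean distance within a cluster for cohesion, and the mean distance between points of two different clusters for separation. The function names are hypothetical.

```python
from itertools import combinations, product
from math import dist

def cohesion(cluster):
    """Mean distance between pairs of observations in the same cluster (lower is tighter)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a single-point cluster has no internal distances
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def separation(cluster_a, cluster_b):
    """Mean distance between observations in two different clusters (higher is better)."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    left = [(0, 0), (0, 2)]
    right = [(10, 0), (10, 2)]
    print(cohesion(left), separation(left, right))
```

A good clustering tends to show small cohesion values relative to the separation between clusters.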
Cluster interpretability
relates to how much insight the clusters provide.
Cluster stability
refers to how robust the set of clusters is with respect to slight changes in the data.
Clustering is an…
UNSUPERVISED TECHNIQUE