What is Cluster Analysis?
− Finding groups of objects such that
• the objects in a group will be similar to one another
• and different from the objects in other groups.
− Goal: Get a better understanding of the data
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Clustering to Describe the Data
Objectives: exploring the data and finding all possible meaningful groups in the data.
• Some business applications of clustering are:
1. Marketing: Finding the common groups of customers based on all past customer behaviors, potential customers’ attributes, and/or purchase patterns. This task is helpful to segment the customers, identify prototype customers (description of a typical customer of a group), and tailor a marketing message to the customers in a group.
2. Session grouping: In web analytics, clustering is helpful to understand clusters of clickstream patterns and discover different kinds of clickstream profiles. One clickstream profile may be that of customers who know what they want and proceed straight to checkout. Another profile may be that of customers who research the products, read through customer reviews, and make a purchase during a later session. Clustering the web sessions by profile helps an e-commerce company to provide features fitting each customer profile.
3. Document clustering: One common text mining task is to automatically group documents into groups of similar topics. Document clustering provides a way of identifying key topics, comprehending and summarizing these clustered groups rather than having to read through whole documents. Document clustering is used for routing customer support incidents, online content sites, forensic investigations, etc.
Major Clustering Techniques and Algorithms
• A large number of clustering algorithms exist
• The choice of cluster algorithm depends on
– the type of data available
– the particular purpose and application
• The user has to choose the clustering technique carefully
– domain knowledge required
• Clustering algorithms compare data objects with regard to their similarity or dissimilarity
• Depending on the clustering technique used, the number of groups or clusters is either user-defined or automatically determined by the algorithm from the dataset.
Types Of Clusterings
• Partitional:
• non-overlapping subsets, such that each example is in exactly one cluster
• Hierarchical:
• a set of nested clusters organised as a tree
• Density Based:
• examples in dense areas form a cluster, examples in sparse areas are not assigned to a cluster
10 Clusters in Retail Banking: A Professional Clients Behavioural Segmentation
Direct Freak
Classic Saver
Branch cash Inner
Speculator
Branch cash Outer
Tax Averse
Youngster & Starter
Passer-by
Richest
Private user
Partitional Clustering – K-Means
• k-Means clustering creates k partitions in n-dimensional space, where n is the number of attributes in a given dataset.
• To partition the dataset, a proximity measure has to be defined.
• The most commonly used measure for a numeric attribute is the Euclidean distance. In the illustrated example, k-means clustering yields a clearly separated partition space for Cluster 1 and narrower spaces for the other two clusters, Cluster 2 and Cluster 3.
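The partitioning idea above can be sketched in a few lines of Python. This is a minimal illustration of the standard iterative k-means procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points), not a production implementation. The function name `k_means`, the sample points, and the choice to start from the first k points are assumptions made for this example; real implementations usually pick starting centroids at random or with a smarter seeding scheme.

```python
import math

def k_means(points, k, max_iterations=100):
    """Partition points (tuples of numbers) into k clusters.

    Assumption for this sketch: the first k points serve as the
    initial centroids, which keeps the example deterministic.
    """
    centroids = list(points[:k])
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid
        # (Euclidean distance via math.dist).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the partition is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical two-attribute dataset with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
```

Here `assignments` holds one cluster index per point, so every example ends up in exactly one cluster, which is what makes the result a partitional clustering.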
Euclidean Distance
Typically, distances between data objects are used for the determination of similarity
Calculating the Distance
• If two objects can be represented as feature vectors, then we can compute the distance between them
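For two feature vectors a and b with n attributes, the Euclidean distance is the square root of the sum of squared attribute differences: d(a, b) = sqrt((a1 - b1)^2 + ... + (an - bn)^2). A small sketch of that calculation follows; the function name and the sample vectors are illustrative assumptions.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical objects described by two numeric attributes each.
object_a = (3.0, 4.0)
object_b = (0.0, 0.0)
distance = euclidean_distance(object_a, object_b)  # 3-4-5 triangle, so 5.0
```

Because attributes on larger scales dominate the sum, numeric attributes are often rescaled to comparable ranges before distances are computed.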
Association Analysis
• Association Analysis measures the strength of co-occurrence between one item and another. The objective of this class of data science algorithms is to find usable patterns in the co-occurrences of the items.
• Association rule learning is a branch of unsupervised learning that discovers hidden patterns in data, in the form of easily recognizable rules.
• Association algorithms are widely used in retail analysis of transactions, recommendation engines, and online clickstream analysis across web pages.
• One of the popular applications of this technique is called market basket analysis, which finds co-occurrences of one retail item with another item within the same retail purchase transaction.
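As a rough illustration of what "co-occurrence within the same transaction" means, the sketch below counts how often each pair of items appears together in a basket. The baskets are made-up data for this example, and counting pairs is only the starting point of market basket analysis, not a full association algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase transactions; each set is one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Count every unordered pair of items bought in the same transaction.
pair_counts = Counter()
for basket in transactions:
    # Sorting gives each pair one canonical order, e.g. ("beer", "diapers").
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
```

Pairs with high counts, such as `("beer", "diapers")` in this toy data, are candidates for the rules described in the following sections.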
Association Analysis Applications
• Cross-selling
– On-the-spot recommendations with recommendation engines
– Direct marketing campaigns and targeted offering
• Store layout
– To improve customer experience, facilitate purchases, or induce cross-selling opportunities
• Content or information optimization
– Catalogue optimization and personalization
– Music recommendation
– To reduce waste, increase exposure to relevant content
Rules-based Method
• It takes analytical skill and business knowledge to successfully apply the outcome of association analysis. The model outcome of an association analysis can be represented as a set of rules, like the one below: {Item A} -> {Item B}
• If Item A is found in a transaction or a basket, there is a strong propensity of occurrence of Item B within the same transaction. Here, Item A is the antecedent of the rule and Item B is the consequent of the rule.
• The antecedent and consequent of the rule can contain more than one item, like {Item A and Item C}. To mine these kinds of rules from the data, we would need to analyze all previous customer purchase transactions.
• A retail business may deal with millions of transactions in a single day.
• Two of the key considerations of association analysis are computational time and resources.
• Over the last two decades newer and more efficient algorithms have been developed to mitigate this problem.
Method for Finding Association Rules: 3 steps
• The method for finding association rules through data science involves the following sequential steps:
• Step 1: Prepare the data in transaction format. An association algorithm requires its input data in a particular format, with each transaction listed together with the items it contains.
• Step 2: Short-list frequently occurring item sets. Item sets are combinations of items. An association algorithm limits the analysis to the most frequently occurring items, so the final rule set extracted in the next step is more meaningful.
• Step 3: Generate relevant association rules from item sets. Finally, the algorithm generates and filters the rules based on the interest measure.
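The three steps can be sketched end to end on a toy dataset. Everything specific here is an assumption for illustration: the raw rows, the thresholds `MIN_SUPPORT` and `MIN_CONFIDENCE`, the restriction to item sets of at most two items, and the use of confidence as the interest measure. The brute-force enumeration in Step 2 is only workable for tiny examples; practical algorithms avoid testing every combination.

```python
from itertools import combinations

# Step 1: prepare the data in transaction format.
# Hypothetical raw rows of (transaction id, item).
raw_rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "bread"), (3, "butter"),
    (4, "milk"),
]
transactions = {}
for transaction_id, item in raw_rows:
    transactions.setdefault(transaction_id, set()).add(item)
baskets = list(transactions.values())

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

# Step 2: short-list frequently occurring item sets (sizes 1 and 2 only).
MIN_SUPPORT = 0.5  # assumed threshold
items = sorted(set().union(*baskets))
frequent = [
    frozenset(candidate)
    for size in (1, 2)
    for candidate in combinations(items, size)
    if support(frozenset(candidate)) >= MIN_SUPPORT
]

# Step 3: generate rules from the frequent item sets and filter them.
MIN_CONFIDENCE = 0.7  # assumed threshold
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue  # a rule needs both an antecedent and a consequent
    for antecedent_item in itemset:
        antecedent = frozenset([antecedent_item])
        consequent = itemset - antecedent
        rule_confidence = support(itemset) / support(antecedent)
        if rule_confidence >= MIN_CONFIDENCE:
            rules.append((set(antecedent), set(consequent), rule_confidence))
```

On this data only one rule survives both filters, {butter} -> {bread}: butter appears in half of the baskets, and every basket with butter also contains bread.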
Example: Online Clickstream Analysis across Web Pages
• Let’s consider a media website, like Yahoo News, with categories such as news, politics, finance, entertainment, sports, and arts.
• A session or transaction in this example is one visit to the website, where the same user accesses content from different categories within a certain session period.
Confidence of the Rule
• The confidence of a rule measures the likelihood of occurrence of the consequent of the rule out of all the transactions that contain the antecedent of the rule.
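Put as a ratio, the confidence of {A} -> {B} is the number of transactions containing both A and B divided by the number of transactions containing A. The sketch below applies this to the clickstream setting; the session data and the function name are invented for illustration.

```python
def confidence(sessions, antecedent, consequent):
    """Share of sessions containing the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [s for s in sessions if antecedent <= s]
    if not with_antecedent:
        return 0.0  # the rule never applies, so report zero by convention
    with_both = [s for s in with_antecedent if consequent <= s]
    return len(with_both) / len(with_antecedent)

# Hypothetical sessions: each set lists the categories visited in one session.
sessions = [
    {"news", "finance"},
    {"news", "finance", "sports"},
    {"news", "entertainment"},
    {"sports", "arts"},
    {"news", "politics", "finance"},
]

# Four sessions include news and three of those also include finance: 3 / 4.
news_to_finance = confidence(sessions, {"news"}, {"finance"})
```

Confidence is not symmetric: in this data {finance} -> {news} has a confidence of 1.0, because every session with finance also visited news.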