Analytics

Data is mined without a clear outcome in mind; exploratory in nature; patterns are not pre-defined or classified but discovered from the data

Unsupervised Learning Methods

  • Descriptive and exploratory in nature

  • Mined to uncover previously unknown patterns without outcome in mind

  • Make use of unlabeled data: no specified outcome variable

  • Assessing how good the results are is subjective

Supervised Learning Methods

  • Prediction methods

  • Mined with clear objective to predict a specific outcome

  • Make use of labeled data; outcome variable is known

  • Assessing performance of model is done with objective metrics

Association Rules

Discovering relationships among items

General Idea → Investigating co-occurrence of items, events, variables and so on

Recommendation systems (e-commerce) are the main application

Market Basket Analysis

Definitions

  • X→Y : Association rule (relationship between item sets)

    • Association rules are directional

    • Item sets cannot be overlapping

  • X : antecedent (body)

  • Y : consequent (head)

Metrics / Measures

Assessing association rules

  • (Support Count) count of transactions containing item-set [σ]

    • Support count of X, support count of Y and support count X→Y

    • Not directional

  • (Support Percentage) fraction of transactions that contain all items in the item-set; probability a randomly selected transaction contains the item set [s]

    • # of transactions with item set / total # of transactions

    • Support percent of X, support percent Y, support percent X→Y

    • Not directional

  • (Confidence) how often Y appears with transactions containing X; probability transaction including X will include Y [conf.]

    • Count (X→Y) / Count (X)

    • Directional

  • (Lift) measure of how much more likely two item sets co-occur [lift]

    • Conf. (X→Y) / S(Y)

    • No direction
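As a concrete illustration, the three percentage measures can be computed over a handful of hypothetical transactions; the items and the rule {milk} → {bread} are invented for this sketch.

```python
# Hypothetical transaction data for the rule {milk} -> {bread}
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread", "eggs"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the item-set (not directional)."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """Of the transactions containing X, the fraction that also contain Y (directional)."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Confidence of X -> Y relative to the baseline support of Y (not directional)."""
    return confidence(X, Y, transactions) / support(Y, transactions)

s = support({"milk", "bread"}, transactions)       # 3/5 = 0.6
c = confidence({"milk"}, {"bread"}, transactions)  # 0.6 / 0.8 = 0.75
l = lift({"milk"}, {"bread"}, transactions)        # 0.75 / 0.8 = 0.9375
```

Here lift is below 1, so in this made-up data buying milk makes bread slightly *less* likely than its baseline.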

Algorithms - finding good candidates

  • To find rules that are good candidates, only consider frequent item-sets; specify minimum support (minsupp) and minimum confidence (minconfidence) based on domain knowledge or business goals

  • Frequent item-sets → support >= minsupp

    • Minimum support threshold lightens computational load

Apriori Algorithm

  • Key idea → if item set X is not frequent then any item sets containing X cannot be frequent

    • Algorithm process:

      1. consider all one-item-sets, compute support, keep sets with supp >= minsupp

      2. form two-item-sets from the frequent items kept in step 1, keep sets with supp >= minsupp

      3. form three-item-sets from the frequent item-sets kept in step 2, keep sets with supp >= minsupp

      4. continue until there are no further combinations to consider; the resulting item sets all satisfy minsupp

      5. calculate confidence of item-sets, keep sets with conf. >= minconfidence; this is the final set of rules

        • Can also calculate lift of these rules
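The level-wise search above can be sketched in Python. The transactions and minsupp below are made-up examples, and the candidate-generation step is simplified relative to the full Apriori join/prune: it just forms k-item combinations from the items that survived the previous level.

```python
from itertools import combinations

def apriori_frequent(transactions, minsupp):
    """Return all frequent item-sets (support >= minsupp), level by level.
    Simplified sketch of the Apriori idea: items that fall out of a level
    never appear in any later candidate."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent one-item-sets
    frequent = [fs for fs in (frozenset([i]) for i in items) if supp(fs) >= minsupp]
    result = list(frequent)
    k = 2
    while frequent:
        # Steps 2-4: form k-item candidates from surviving items, keep frequent ones
        surviving = sorted({i for fs in frequent for i in fs})
        candidates = [frozenset(c) for c in combinations(surviving, k)]
        frequent = [c for c in candidates if supp(c) >= minsupp]
        result.extend(frequent)
        k += 1
    return result
```

With the milk/bread transactions from earlier in the notes and minsupp = 0.6, this keeps {milk}, {bread}, and {milk, bread} and discards everything involving the infrequent item eggs. Step 5 (filtering rules by confidence) would then be run only over these frequent sets.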

Interpretation of results

  • Confidence: when X happens, [conf.]% of the time Y also happens

  • Lift: when X happens, Y is [lift-1] percent more/less likely

    • Lift > 1; customers who buy X are more likely to buy Y

    • Lift = 1; customers who buy X are as likely to buy Y

    • Lift < 1; customers who buy X are less likely to buy Y

Cluster Analysis

Organizing observations into internally similar and meaningful clusters

Cluster Properties

  • High Intra-Similarity → points in same cluster are similar to each other

  • Low Inter-Similarity → points in different clusters are different from each other

Applications of Clustering

  1. Discover natural patterns in data

  2. Facilitate analysis of large datasets

Why Algorithms

  • For datasets with M observations of N dimensions (points in N-dimensional space), we need to:

    • Understand similarity between data

    • Decide which clustering method

    • How many clusters (stopping criteria)

    • Evaluate quality of clusters

Distance Measures (similarity between individual data points)

  • Numerical Data (higher distance → more different; lower distance → more similar)

    • Euclidean Distance (straight line distance)

    • Manhattan Distance (rectilinear distance)

  • Binary Data (higher distance → more different; lower distance → more similar)

    • Matching Distance (# of mismatches over total #)

    • Jaccard Distance (excludes null observations)

      • Used for asymmetric binary data (e.g. a 0 value such as "non-senior" can mean different things)

  • Data Normalization

    • Different type of attributes → mixed data

    • Attributes that take values in different ranges → non-comparable data

    • (Data Normalization) eliminate specific units of measurement to transform attributes to a common scale

      • Min-Max Normalization → rescale attributes to have values between 0 and 1 using min and max

        • New Value = (x-min) / (max-min)

      • Standardization → transform each attribute to have mean of 0 and sd of 1; Need to transform binary too

        • New Value = (x-sample mean) / sample sd

          • Step 1: compute mean + SD

          • Step 2: perform standardization
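A small stdlib-Python sketch of the numeric distance measures and both normalization formulas; the points and the ages attribute are made-up values for illustration.

```python
import math
import statistics

# Distance measures between two made-up numeric points
p, q = (1.0, 2.0), (4.0, 6.0)
euclid = math.dist(p, q)                           # straight-line: sqrt(9 + 16) = 5.0
manhattan = sum(abs(a - b) for a, b in zip(p, q))  # rectilinear: 3 + 4 = 7.0

# Min-Max normalization: (x - min) / (max - min) -> values in [0, 1]
ages = [20, 30, 40, 50]                # hypothetical attribute (years)
lo, hi = min(ages), max(ages)
minmax = [(x - lo) / (hi - lo) for x in ages]

# Standardization: (x - sample mean) / sample sd -> mean 0, sd 1
mean = statistics.mean(ages)               # step 1: compute mean + SD
sd = statistics.stdev(ages)                # sample standard deviation
zscores = [(x - mean) / sd for x in ages]  # step 2: standardize
```

After standardization the transformed attribute has mean 0 and sample standard deviation 1, which is what makes mixed-range attributes comparable before computing distances.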

Linkage Methods (distance between clusters)

  • Single Linkage: min pairwise distance between points from diff. clusters

    • Compute all distances between points + pick min

    • Sensitive to outliers, good for odd shapes

  • Complete Linkage: max pairwise distance between points from diff. clusters

    • Compute all distances between points + pick max

    • Tends to break large clusters

  • Average Linkage: average pairwise distance between points from diff. clusters

    • Compute all distances between points + take avg.

    • Tends to produce similar results to centroid linkage

  • Centroid Linkage: distance between two clusters centroids (cluster means)

    • Compute cluster’s centroids (avg. of X, avg. of Y) then compute distance

    • Tends to produce similar results to avg. linkage

  • Ward’s Method: minimize within-cluster variance (sum of squared distances between each cluster’s points and its centroid)

    • Robust to outliers, creates denser spherical clusters
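The linkage definitions can be illustrated on two tiny made-up 2-D clusters: single, complete, and average linkage all reduce to a min/max/mean over the same set of cross-cluster pairwise distances, while centroid linkage compares cluster means.

```python
import math

# Two toy 2-D clusters (invented points)
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]

# All pairwise distances between points from different clusters
pairwise = [math.dist(a, b) for a in A for b in B]

single   = min(pairwise)                  # nearest cross-cluster pair
complete = max(pairwise)                  # farthest cross-cluster pair
average  = sum(pairwise) / len(pairwise)  # mean of all cross-cluster pairs

# Centroid linkage: distance between the two cluster means
cA = tuple(sum(c) / len(A) for c in zip(*A))  # (0.0, 0.5)
cB = tuple(sum(c) / len(B) for c in zip(*B))  # (3.5, 0.0)
centroid = math.dist(cA, cB)
```

For these points single linkage is 3.0 (the pair (0,0)-(3,0)) and complete linkage is √17 (the pair (0,1)-(4,0)); average and centroid linkage land close together, matching the note that they tend to produce similar results.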

Clustering Methods

  • Hierarchical clustering (bottom-up clustering) → forming larger clusters from smaller ones

    • # of clusters is not specified, distance measure and linkage method are specified

    • (Distance Matrix) square matrix containing pairwise distances between data points

    • Algorithm process → take the two nearest clusters and merge them into a new cluster; repeat until one big cluster remains

    • (Dendrogram) Diagram showing the cluster hierarchy

      • Length of lines is proportional to distance

      • Depending on the # of clusters we need, start at the top and work down, cut the dendrogram at that level, and read off each cluster’s members (e.g. C1 = {x, x, x, x})

    • Pros → flexible, produces solution for diff. cluster #’s

    • Cons → not scalable, can be outlier sensitive

  • Partitioning-based clustering (k-means clustering) → directly partition data into K groups where K is the # of clusters, usually with Euclidean distance

    • Objective → produce specified # of clusters with lowest within-cluster variance

    • Process

      1. K points picked at random as initial centroids; all data points assigned to the nearest of these centroids based on Euclidean distance

      2. Recalculates cluster centroids and reassigns points to centroid based on distance

      3. Repeat step 2 until convergence (re-assignment doesn’t change anymore)

    • Pros → Very scalable

    • Cons → poor initialization can lead to bad results, bad for irregular shapes and outliers
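A minimal sketch of the three-step k-means loop above. For reproducibility it takes the initial centroids as an explicit argument rather than sampling them at random (the notes' step 1), and the points in the usage below are invented.

```python
import math

def kmeans(points, init_centroids, iters=100):
    """Minimal k-means sketch: assign points to nearest centroid (Euclidean),
    recompute centroids as cluster means, repeat until convergence."""
    centroids = list(init_centroids)
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Step: recompute each centroid as the mean of its assigned points
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:   # convergence: assignments no longer change
            break
        centroids = new
    return centroids, clusters

# Usage on two made-up well-separated groups
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cents, cls = kmeans(pts, [(0.0, 0.0), (10.0, 10.0)])
```

Passing deliberately bad initial centroids (e.g. two points from the same dense group) is an easy way to see the "poor initialization" con in action.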

Assessing Quality of Clusters

  • Within Sum of Squared Errors (WSS)

    • Lower the WSS → more cohesive clusters are → high intra-similarity

  • Between Sum of Squared Errors (BSS)

    • Higher the BSS → larger distance of centroids from each other → lower inter-similarity

Choosing # of Clusters

  • Dendrogram

    • Clusters can emerge naturally

  • Elbow Plot

    • Plot WSS (or BSS) as a function of the # of clusters

    • Pick # corresponding with elbow point (where curve begins to flatten)

Total Sum of Squares (TSS) = BSS + WSS
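WSS, BSS, and the TSS identity can be checked on a toy clustering; the four points below are made up, with BSS weighted by cluster size so that WSS + BSS = TSS.

```python
import math

def wss_bss(clusters):
    """WSS: squared distances of points to their own cluster centroid.
    BSS: size-weighted squared distances of centroids to the grand mean."""
    all_pts = [p for cl in clusters for p in cl]
    grand = tuple(sum(c) / len(all_pts) for c in zip(*all_pts))
    wss = bss = 0.0
    for cl in clusters:
        cen = tuple(sum(c) / len(cl) for c in zip(*cl))
        wss += sum(math.dist(p, cen) ** 2 for p in cl)
        bss += len(cl) * math.dist(cen, grand) ** 2
    return wss, bss

# Two made-up clusters: centroids (0,1) and (4,1), grand mean (2,1)
wss, bss = wss_bss([[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]])
# wss = 4.0, bss = 16.0, so TSS = 20.0
```

TSS is fixed by the data, so for a given dataset lowering WSS (more cohesive clusters) necessarily raises BSS (more separated centroids).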

Classification

Predicting Categorical values of outcome variable

Objective → Building a prediction model for a clear, defined outcome variable using data for which we already know outcome

Split Data (Randomly); normally 65%-85% training data

  • (Training Data) Used to run algorithm and learn model; model construction

    • Construct a classification model based on training data

  • (Testing Data) Used to test model and assess performance; model testing

    • Measure accuracy of your model using test data

Classification → Outcome Variable is binary/categorical

K-NN (Nearest Neighbors) - simple and heuristic

Must identify NN based on distance measure (Euclidean)

  • Data needs to be normalized/ standardized (don’t normalize outcome class)

  • NN are points with shortest distance from new point

  • Once NN identified → new observation is classified based on majority rule

K = Number of neighbors to consider

  • Chosen to maximize accuracy (i.e. lowest error rate)

    • Small k values → capture local structure of data but are very sensitive to potential outliers

    • Larger k values → less sensitive to potential outliers but miss the local structure of the data

  • If K is even # there can be ties → class is picked randomly by algorithm

    • Recommended not to use even number to avoid this

Classification can be improved by using weighted distance

  • Giving more weight to closer neighbors
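A sketch of the k-NN classification rule (Euclidean distance, unweighted majority vote). The training data is assumed to be already normalized, and the points and labels are invented for the example.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k):
    """train: list of (features, label) pairs with pre-normalized features.
    Classify new_point by majority vote among its k nearest neighbors."""
    # Sort training observations by Euclidean distance to the new point
    neighbors = sorted(train, key=lambda obs: math.dist(obs[0], new_point))[:k]
    # Majority rule over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up normalized training data
train = [((0.0, 0.0), "no purchase"),
         ((0.0, 1.0), "purchase"),
         ((1.0, 0.0), "purchase"),
         ((5.0, 5.0), "no purchase")]
label = knn_classify(train, (0.4, 0.4), k=3)  # 2 of 3 neighbors say "purchase"
```

The weighted-distance improvement from the notes would replace the `Counter` vote with weights such as 1/distance per neighbor; it is omitted here to keep the sketch minimal.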

Evaluation of Model

  • Error → among all predictions, how many did the model get wrong

  • Accuracy → among all predictions, how many did the model get right

  • Recall → among the true positive class, how many was the model able to retrieve; proportion of true positives out of all positive instances in the dataset

    • Useful when false negatives are costly; high recall indicates model is good at finding relevant positive instances

  • Precision → among the predicted positive class, how many were actually positive; proportion of true positive predictions out of all predicted positive instances

    • Useful when false positives are costly; high precision indicates that positive predictions are most likely correct

  • F1 Scores → harmonic mean of precision and recall, a more reliable measure than accuracy given unbalanced data

Accuracy is not a good measure in unbalanced datasets and high accuracy model may not be good at classifying specific classes → rely on class specific measures like recall and precision
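The class-specific measures follow directly from confusion-matrix counts; the counts below are hypothetical, chosen so the dataset is mildly unbalanced.

```python
# Hypothetical confusion-matrix counts for the positive class
tp, fp, fn, tn = 40, 10, 20, 30

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # 70/100 = 0.70
recall    = tp / (tp + fn)                   # retrieved share of actual positives
precision = tp / (tp + fp)                   # correct share of predicted positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

Here accuracy (0.70) sits between precision (0.80) and recall (≈0.67); with a heavily unbalanced dataset the gap can be far larger, which is why the class-specific measures are preferred.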

Precision + Recall Trade-off

  • Greater precision, lower recall → fewer false positives (good) but more false negatives, i.e. missed positives (bad)

  • Greater recall, lower precision → fewer false negatives (good) but more false positives (bad)

How it Works

  • k-NN will classify new observation according to predominant class (outcome variable) among the identified nearest neighbors observations

  • Need to decide K → how many neighbors to consider

    • Look at decided number of nearest neighbors and classify

      • Ex: If K=3, nn are no purchase, purchase and purchase then the outcome would be purchase

      • In a tie situation, the algorithm decides randomly

Summary

  • Pros

    • Easy to use and implement

    • Doesn’t require statistical/distribution assumptions

  • Cons

    • Lazy classifier → no real model building

    • Bad for large datasets → slow learning, not good for real time

Decision Trees - systematic and interpretable

Goal → classify data into predefined classes based on decision rules, organized as a tree

Part of a Decision Tree

  • Top node → root node

  • Middle nodes → child (decision) nodes

  • Final node → leaf (final) nodes

Decision trees split data based on attribute-value combination that minimizes entropy

  • To determine root node → algorithm assesses every attribute-value combination and chooses the combination that maximizes agreement (minimizes entropy)

  • To determine leaf node (final nodes) → majority rule

    • The class of the majority among the data-points in that node

    • Stops either when all data points have the same class or when there is no remaining attribute-value combo that could be used to further split the data

  • The same attribute can be picked multiple times if partitions aren’t overlapping

  • Algorithm does not need to use all attributes

Entropy (measure of agreement)

  • Lower entropy → more observations in partition agree in terms of class outcome

  • Entropy of 0 → all observation in partition have same class

  • Entropy of 1 → max disagreement (50:50 split in outcome values)
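The agreement measure can be sketched as base-2 Shannon entropy over the class proportions in a partition, matching the 0 and 1 endpoints noted above (for a binary outcome).

```python
import math

def entropy(class_counts):
    """Base-2 entropy of the class proportions in a partition:
    0 when all observations share one class, 1 at a 50:50 binary split."""
    total = sum(class_counts)
    props = [c / total for c in class_counts if c > 0]  # skip empty classes
    return -sum(p * math.log2(p) for p in props)

pure    = entropy([10, 0])  # all one class -> 0.0
split   = entropy([5, 5])   # max disagreement -> 1.0
partial = entropy([8, 2])   # somewhere in between
```

When the algorithm evaluates a candidate split, it compares the (weighted) entropy of the resulting partitions and picks the attribute-value combination yielding the lowest value.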

Evaluation

  • Accuracy for overall model

  • Recall + Precision for individual classes

  • F1 Scores for individual classes

Confusion Matrix

  • True Positive (TP) → # of cases for which we predict true, and the actual class is true

  • True Negative (TN) → # of cases for which we predict false, and the actual class is false

  • False Positive (FP) → # of cases for which we predict true, but actual class is false

  • False Negative (FN) → # of cases for which we predict false, but actual class is true

Classification Probabilities

  • k-NN → % of a class among k-neighbors

  • Decision tree → % of a class in leaf node

Classifications are generated from these probabilities based on a cutoff value → different cutoffs can lead to different classification results

Numeric Prediction

Predicting continuous values of outcome variable

Objective → predicting a numerical outcome for a new observation based on data containing observations whose outcome variable is already known

Split Data (Randomly); normally 65%-85% training data

  • (Training Data) Used to run algorithm and learn model; model construction

    • Construct a prediction model based on training data

  • (Testing Data) Used to test model and assess performance; model testing

    • Measure accuracy of your model using test data

K-NN (Nearest Neighbors)

Identify nearest neighbors based on distance measure (Euclidean)

  • Data needs to be normalized (standardized); outcome does not

  • Nearest neighbors are points with shortest distance from new point

  • Once NN have been identified → numeric prediction for new observation is obtained using average of the NN outcome values

K = number of neighbors to consider

  • Chosen to MIN. RMSE

  • Cannot be ties since predicted outcome is determined by taking an average

  • Predictions can be improved using weighted distance
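The numeric k-NN rule (unweighted average of the nearest outcomes) can be sketched as follows; the features and outcome values are made up for the example.

```python
import math

def knn_predict(train, new_point, k):
    """train: list of (features, numeric_outcome) pairs with pre-normalized
    features. Predict by averaging the outcomes of the k nearest neighbors."""
    neighbors = sorted(train, key=lambda obs: math.dist(obs[0], new_point))[:k]
    return sum(outcome for _, outcome in neighbors) / k

# Made-up normalized training data with numeric outcomes
train = [((0.0, 0.0), 50.0),
         ((0.0, 1.0), 5.0),
         ((1.0, 0.0), 25.0),
         ((9.0, 9.0), 100.0)]
pred = knn_predict(train, (0.3, 0.3), k=3)  # (50 + 5 + 25) / 3
```

Because the prediction is an average rather than a vote, no tie-breaking is needed; a weighted variant would divide a distance-weighted sum by the sum of the weights instead.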

Evaluation

  • ME (Mean Error)

  • MAE (Mean Absolute Error)

  • MAPE (Mean Absolute % Error)

  • RMSE (Root Mean Square Error)

How it Works

  • Need to decide K → how many neighbors to consider

    • Look at decided number of nearest neighbors and take the average of the outcomes

      • Ex: If K=3 and the NN outcomes are 50, 5 and 25, the prediction would be (50+5+25)/3 ≈ 26.67

Summary

  • Pros

    • Simple, captures complex relationship without building model

    • Doesn’t make any assumptions about data distribution

  • Cons

    • If you have a lot of attributes you need lots of observations to make reasonable predictions

    • Very slow, bad for real time predictions

Regression Trees

Splits data based on the attribute-value combination that minimizes the Sum of Squared Deviations (SSD)

  • To determine leaf node (final node) → uses average of the outcome values of data points in that node

  • Regression trees tend to discretize outcome

Sum of Squared Deviations (SSD)

  • Lower SSD → more observations in the partition agree in terms of continuous outcome

  • SSD = 0 → all observations in a partition have the same outcome value

  • Regression trees split data based on attribute-value combo that minimizes SSD
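For a candidate partition, SSD is just the squared deviations of its outcome values from their mean, mirroring the role entropy plays for classification trees.

```python
def ssd(outcomes):
    """Sum of squared deviations of a partition's outcome values from their
    mean; 0 when every observation has the same outcome value."""
    mean = sum(outcomes) / len(outcomes)
    return sum((y - mean) ** 2 for y in outcomes)

uniform = ssd([5.0, 5.0, 5.0])  # perfect agreement -> 0.0
spread  = ssd([1.0, 3.0])       # mean 2, (1-2)^2 + (3-2)^2 = 2.0
```

The tree evaluates each candidate attribute-value split by summing the SSD of the resulting partitions and picks the split with the lowest total.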

Evaluation

  • Prediction Error → difference between predicted outcome and actual outcome

  • ME (Mean Error) → average of prediction errors, indicates over or under prediction

    • Negative → under predicting

    • Positive → over predicting

  • MAE (Mean Absolute Error) → magnitude of the average error

  • MAPE (Mean Absolute % Error) → relative to actual values, gives % score of how much predictions deviate on average from actual values

  • RMSE (Root Mean Square Error) → measures average magnitude of error, penalizes larger errors
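The four error measures can be computed directly from prediction errors. Here the error is defined as predicted − actual, so a positive ME means over-prediction, matching the sign convention above; the actual and predicted values are invented.

```python
import math

actual    = [100.0, 200.0, 300.0]   # hypothetical test-set outcomes
predicted = [110.0, 190.0, 330.0]   # hypothetical model predictions

errors = [p - a for p, a in zip(predicted, actual)]  # predicted - actual

me   = sum(errors) / len(errors)                  # sign shows over/under prediction
mae  = sum(abs(e) for e in errors) / len(errors)  # magnitude of average error
mape = sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors) * 100  # in %
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # penalizes large errors
```

Note how the single large error (+30) pulls RMSE (≈19.1) well above MAE (≈16.7), illustrating why RMSE penalizes larger errors more heavily.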

Summary

  • Pros

    • No parametric assumptions

    • Good for variable selection

    • Robust to outliers

  • Cons

    • Unstable → changes to data lead to different splits

    • Can miss relationships between attributes