Analytics
Data mined without clear outcome in mind; exploratory in nature; not pre-defined or classified, discovered from the data
Unsupervised Learning Methods
Descriptive and exploratory in nature
Mined to uncover previously unknown patterns without outcome in mind
Make use of unlabeled data: no specified outcome variable
Assessing how good the results are (subjective)
Supervised Learning Methods
Prediction methods
Mined with clear objective to predict a specific outcome
Make use of labeled data; outcome variable is known
Assessing performance of model is done with objective metrics
Association Rules
Discovering relationships among items
General Idea → Investigating co-occurrence of items, events, variables and so on
Recommendation Systems are main application (e-commerce)
Market Basket Analysis
Definitions
X→Y : Association rule (relationship between item sets)
Association rules are directional
Item sets X and Y cannot overlap
X : antecedent (body)
Y : consequent (head)
Metrics / Measures
Assessing association rules
(Support Count) count of transactions containing item-set [σ]
Support count of X, support count of Y and support count X→Y
Not directional
(Support Percentage) fraction of transactions that contain the item set; probability a randomly selected transaction contains the item set [s]
# of transactions with item set / total # of transactions
Support percent of X, support percent of Y, support percent of X→Y
Not directional
(Confidence) how often Y appears in transactions containing X; probability a transaction including X will also include Y [conf.]
Count (X→Y) / Count (X)
Directional
(Lift) measure of how much more likely two item sets co-occur [lift]
Conf. (X→Y) / S(Y)
Not directional
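As a rough sketch of these four measures in pure Python (the tiny transaction list is made up for illustration):

```python
# Hypothetical mini transaction set; items are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support_count(itemset, transactions):
    """sigma: count of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """s: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = support(X union Y) / support(X); directional."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """lift(X -> Y) = conf(X -> Y) / s(Y); same value in both directions."""
    return confidence(X, Y, transactions) / support(Y, transactions)

print(support({"bread"}, transactions))               # 4/5 = 0.8
print(confidence({"bread"}, {"milk"}, transactions))  # 0.6 / 0.8 = 0.75
print(lift({"bread"}, {"milk"}, transactions))        # 0.75 / 0.8 = 0.9375
```

Here lift < 1, so buying bread makes milk slightly less likely in this made-up data.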
Algorithms - finding good candidates
To find rules that are good candidates, only consider frequent item-sets; specify minimum support (minsupp) and minimum confidence (minconfidence) based on domain knowledge or business goals
Frequent item-sets → support >= minsupp
Minimum support threshold lightens computational load
Apriori Algorithm
Key idea → if item set X is not frequent then any item sets containing X cannot be frequent
Algorithm process:
consider all one-item-sets, compute support, keep sets with supp >= minsupp
form two-item-sets from the frequent one-item-sets kept in step 1, keep sets with supp >= minsupp
form three-item-sets from the frequent two-item-sets kept in step 2, keep sets with supp >= minsupp
continue till no further combinations to consider; resulting item sets satisfy minsupp
calculate confidence of item-sets, keep sets with conf. >= minconfidence; this is the final set of rules
Can also calculate lift of these rules
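The frequent-itemset part of the process above can be sketched in pure Python. This is a simplified version: candidates are built from items still present in frequent (k-1)-item-sets rather than full subset pruning, and the transaction data is made up.

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Return all frequent item-sets (as frozensets) with support >= minsupp."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 1: frequent one-item-sets
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items if supp(frozenset([i])) >= minsupp]
    frequent = list(current)
    k = 2
    while current:
        # Key idea: only items that still appear in frequent (k-1)-item-sets
        # can appear in a frequent k-item-set, since any superset of an
        # infrequent set is itself infrequent.
        candidate_items = sorted({i for s in current for i in s})
        current = [frozenset(c) for c in combinations(candidate_items, k)
                   if supp(frozenset(c)) >= minsupp]
        frequent.extend(current)
        k += 1
    return frequent

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
print(apriori(transactions, minsupp=0.5))
```

With minsupp = 0.5, all singletons and all pairs survive, but {a, b, c} (support 0.25) is dropped. Rule generation would then filter X→Y candidates by confidence >= minconfidence.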
Interpretation of results
Confidence: when X happens, [conf.]% of the time Y also happens
Lift: when X happens, Y is (lift - 1) × 100 percent more/less likely
Lift > 1; customers who buy X are more likely to buy Y
Lift = 1; customers who buy X are as likely to buy Y
Lift < 1; customers who buy X are less likely to buy Y
Cluster Analysis
Organizing observations into internally similar and meaningful clusters
Cluster Properties
High Intra-Similarity → points in same cluster are similar to each other
Low Inter-Similarity → points in different clusters are dissimilar
Applications of Clustering
Discover natural patterns in data
Facilitate analysis of large datasets
Why Algorithms
Data lives in N-dimensional space; for a dataset with M observations of N dimensions, we need to:
Understand similarity between data
Decide which clustering method
How many clusters (stopping criteria)
Evaluate quality of clusters
Distance Measures (similarity between individual data points)
Numerical Data (higher distance → more different; lower distance → more similar)
Euclidean Distance (straight line distance)
Manhattan Distance (rectilinear distance)
Binary Data (higher distance → more different; lower distance → more similar)
Matching Distance (# of mismatches over total #)
Jaccard Distance (excludes null observations)
Used for asymmetric binary data (where a 0, e.g. "non-senior", can mean different things)
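A minimal sketch of the four distance measures above in pure Python (the example points are made up):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Rectilinear (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

def matching(p, q):
    """Binary data: # of mismatches over total # of attributes."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

def jaccard(p, q):
    """Binary data: mismatches over attributes where at least one value
    is 1 (0/0 pairs are excluded, for asymmetric binary attributes)."""
    non_null = [(a, b) for a, b in zip(p, q) if a == 1 or b == 1]
    return sum(a != b for a, b in non_null) / len(non_null)

print(euclidean((0, 0), (3, 4)))             # 5.0
print(manhattan((0, 0), (3, 4)))             # 7
print(matching((1, 0, 1, 0), (1, 1, 0, 0)))  # 2 mismatches / 4 = 0.5
print(jaccard((1, 0, 1, 0), (1, 1, 0, 0)))   # 2 mismatches / 3 non-null pairs
```

Note how the 0/0 attribute inflates agreement under matching distance but is ignored by Jaccard.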
Data Normalization
Different type of attributes → mixed data
Attributes that take values in different ranges → non-comparable data
(Data Normalization) eliminate specific units of measurement by transforming attributes to a common scale
Min-Max Normalization → rescale attributes to have values between 0 and 1 using min and max
New Value = (x-min) / (max-min)
Standardization → transform each attribute to have mean of 0 and sd of 1; Need to transform binary too
New Value = (x-sample mean) / sample sd
Step 1: compute mean + SD
Step 2: perform standardization
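Both normalizations can be sketched in a few lines of Python (the age values are made up):

```python
import statistics

def min_max(values):
    """Min-max normalization: rescale to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Standardization (z-score): (x - sample mean) / sample SD.
    Result has mean 0 and standard deviation 1."""
    mean = statistics.mean(values)     # step 1: compute mean + SD
    sd = statistics.stdev(values)      # sample SD (n - 1 denominator)
    return [(x - mean) / sd for x in values]  # step 2: standardize

ages = [20, 30, 40, 50]
print(min_max(ages))      # endpoints map to 0.0 and 1.0
print(standardize(ages))  # symmetric around 0
```

Note `statistics.stdev` uses the sample (n - 1) formula; use `pstdev` if the population formula is intended.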
Linkage Methods (distance between clusters)
Single Linkage: min pairwise distance between points from diff. clusters
Compute all distances between points + pick min
Sensitive to outliers, good for odd shapes
Complete Linkage: max pairwise distance between points from diff. clusters
Compute all distances between points + pick max
Tends to break large clusters
Average Linkage: average pairwise distance between points from diff. clusters
Compute all distances between points + take avg.
Tends to produce similar results to centroid linkage
Centroid Linkage: distance between two clusters centroids (cluster means)
Compute cluster’s centroids (avg. of X, avg. of Y) then compute distance
Tends to produce similar results to avg. linkage
Ward's Method: minimize within-cluster variance (sum of squared distances between each point and its cluster centroid)
Robust to outliers, creates denser spherical clusters
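The first four linkage methods reduce to simple pairwise-distance computations; a sketch with made-up 2D clusters:

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Minimum pairwise distance between points from different clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Maximum pairwise distance between points from different clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Average pairwise distance between points from different clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_linkage(c1, c2):
    """Distance between the two clusters' centroids (attribute-wise means)."""
    def centroid(c):
        return [sum(col) / len(c) for col in zip(*c)]
    return dist(centroid(c1), centroid(c2))

a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
print(single_linkage(a, b), complete_linkage(a, b))
print(average_linkage(a, b), centroid_linkage(a, b))
```

On the same pair of clusters the four methods give different distances, which is why the linkage choice changes the resulting hierarchy.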
Clustering Methods
Hierarchical clustering (bottom-up clustering) → forming larger clusters from smaller ones
# of clusters is not specified, distance measure and linkage method are specified
(Distance Matrix) square matrix containing pairwise distances between data points
Algorithm process → Repeatedly merges the two nearest clusters until one big cluster remains
(Dendrogram) Diagram showing clusters hierarchy
Length of lines is proportionate to distance
Depending on the # of clusters needed, start at the top and cut the dendrogram horizontally; each branch below the cut is one cluster, e.g. C1 (x, x, x, x), etc.
Pros → flexible, produces solution for diff. cluster #’s
Cons → not scalable, can be outlier sensitive
Partitioning-based clustering (k-means clustering) → directly partition data into K groups where K is # of clusters, usually with Euclidean distance
Objective → produce specified # of clusters with lowest within-cluster variance
Process
K points picked at random as initial centroids; all data points assigned to the nearest centroid based on Euclidean distance
Recalculates cluster centroids and reassigns points to centroid based on distance
Repeat step 2 until convergence (re-assignment doesn’t change anymore)
Pros → Very scalable
Cons → poor initialization can lead to bad results, bad for irregular shapes and outliers
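The three-step process above can be sketched as a plain k-means loop in pure Python (the points and seed are made up; a real run would try several initializations precisely because of the initialization sensitivity noted above):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: random initial centroids, assign each point to the
    nearest centroid (Euclidean), recompute centroids, repeat until the
    assignment stops changing (convergence)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # step 1: random initial centroids
    assignment = None
    for _ in range(iters):
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # step 3: stop when stable
            break
        assignment = new_assignment
        for j in range(k):                # step 2: recompute centroids
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                   # guard against empty clusters
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
labels, cents = kmeans(pts, 2)
print(labels)
```

The two lower-left points end up together and the two upper-right points together, regardless of which points the seed happens to pick first.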
Assessing Quality of Clusters
Within Sum of Squared Errors (WSS)
Lower the WSS → more cohesive clusters are → high intra-similarity
Between Sum of Squared Errors (BSS)
Higher the BSS → larger distance of centroids from each other → lower inter-similarity
Choosing # of Clusters
Dendrogram
Clusters can emerge naturally
Elbow Plot
Plot WSS / BSS as function of # of clusters
Pick # corresponding with elbow point (where curve begins to flatten)
Total Sum Squared (TSS) = sum of BSS and WSS
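WSS and BSS are straightforward to compute by hand; a sketch with made-up clusters, which also shows the TSS = WSS + BSS identity:

```python
def centroid(points):
    """Attribute-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wss(clusters):
    """Within sum of squares: squared distances from each point to its
    own cluster's centroid. Lower = more cohesive clusters."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def bss(clusters):
    """Between sum of squares: cluster size times squared distance from
    each cluster centroid to the overall centroid. Higher = more separated."""
    overall = centroid([p for c in clusters for p in c])
    return sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)

clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
print(wss(clusters), bss(clusters))  # 4.0 16.0; TSS = 4 + 16 = 20
```

An elbow plot is just WSS evaluated at k = 1, 2, 3, ... and plotted against k.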
Classification
Predicting Categorical values of outcome variable
Objective → Building a prediction model for a clear, defined outcome variable using data for which we already know outcome
Split Data (Randomly); normally 65%-85% training data
(Training Data) Used to run algorithm and learn model; model construction
Construct a classification model based on training data
(Testing Data) Used to test model and assess performance; model testing
Measure accuracy of your model using test data
Classification → Outcome Variable is binary/categorical
K-NN (Nearest Neighbors) - simple and heuristic
Must identify NN based on distance measure (Euclidean)
Data needs to be normalized/standardized (don't normalize outcome class)
NN are points with shortest distance from new point
Once NN identified → new observation is classified based on majority rule
K = Number of neighbors to consider
Chosen to MAXIMIZE accuracy (lowest error rate)
Small k values → capture local structure of data but are very sensitive to potential outliers
Larger k values → less sensitive to potential outliers but miss capturing the local structure of the data
If K is even # there can be ties → class is picked randomly by algorithm
Recommended not to use even number to avoid this
Classification can be improved by using weighted distance
Giving more weight to closer neighbors
Evaluation of Model
Error → among all predictions, how many did the model get wrong
Accuracy → among all predictions, how many did the model get right
Recall → among the true positive class, how many was the model able to retrieve; proportion of true positive predictions out of all positive instances in the dataset
Useful when false negatives are costly; high recall indicates model is good at finding relevant positive instances
Precision → among the predicted class, how many was the model able to retrieve; proportion of true positive predictions out of all predicted positive instances
Useful when false positives are costly; high precision indicates that positive predictions are most likely correct
F1 Scores → harmonic mean of precision and recall, a more reliable measure than accuracy given unbalanced data
Accuracy is not a good measure in unbalanced datasets and high accuracy model may not be good at classifying specific classes → rely on class specific measures like recall and precision
Precision + Recall Trade-off
Greater precision, lower recall → fewer false positives (good) but more false negatives (bad)
Greater recall, lower precision → fewer false negatives (good) but more false positives (bad)
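All of these measures follow directly from the four confusion-matrix counts; a sketch with made-up counts:

```python
def metrics(tp, tn, fp, fn):
    """Standard classification measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    error = 1 - accuracy
    recall = tp / (tp + fn)         # out of actual positives
    precision = tp / (tp + fp)      # out of predicted positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, error, recall, precision, f1

# Hypothetical counts, made up for illustration
acc, err, rec, prec, f1 = metrics(tp=40, tn=50, fp=10, fn=20)
print(acc, rec, prec, f1)  # 0.75, 2/3, 0.8, 8/11
```

Note F1 sits between precision and recall but closer to the lower of the two, which is why it is more informative than accuracy on unbalanced data.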
How it Works
k-NN will classify new observation according to predominant class (outcome variable) among the identified nearest neighbors observations
Need to decide K → how many neighbors to consider
Look at decided number of nearest neighbors and classify
Ex: If K=3, nn are no purchase, purchase and purchase then the outcome would be purchase
In a tie situation, the algorithm decides randomly
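The majority-rule procedure above can be sketched in a few lines of pure Python (the training points and labels are made up, and features are assumed to be already normalized):

```python
import math
from collections import Counter

def knn_classify(train, new_point, k):
    """train: list of (point, label) pairs. Classify new_point by majority
    vote among the k nearest neighbors (Euclidean distance)."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "no purchase"), ((1, 2), "purchase"),
         ((2, 1), "purchase"), ((8, 8), "no purchase")]
print(knn_classify(train, (1.5, 1.5), k=3))  # "purchase" (2 of 3 neighbors)
```

A weighted variant would replace the plain `Counter` vote with weights such as 1/distance, giving closer neighbors more influence.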
Summary
Pros
Easy to use and implement
Doesn’t require statistical/distribution assumptions
Cons
Lazy classifier → no real model building
Bad for large datasets → slow learning, not good for real time
Decision Trees - systematic and interpretable
Goal → classify data into predefined classes based on decision rules, organized as a tree
Part of a Decision Tree
Top node → root node
Middle nodes → child (decision) nodes
Final node → leaf (final) nodes
Decision trees split data based on attribute-value combination that minimizes entropy
To determine root node → algorithm assesses every attribute-value combination and chooses the one that maximizes agreement (minimizes entropy)
To determine leaf node (final nodes) → majority rule
The class of the majority among the data-points in that node
Stops either when all data points have the same class or when there is no remaining attribute-value combo that could be used to further split the data
The same attributes can be picked multiple times if partitions aren't overlapping
Algorithm does not need to use all attributes
Entropy (measure of agreement)
Lower entropy → more observations in partition agree in terms of class outcome
Entropy of 0 → all observation in partition have same class
Entropy of 1 → max disagreement (50:50 split in outcome values)
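Entropy of a partition is computed from its class proportions; a minimal sketch using label lists made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a partition's class labels, in bits.
    0 = all observations agree; 1 = maximum disagreement for two
    classes (a 50:50 split)."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0  (pure partition)
print(entropy(["yes", "yes", "no", "no"]))    # 1.0  (50:50 split)
print(entropy(["yes", "yes", "yes", "no"]))   # ~0.811 (mostly agree)
```

A split's quality is then the weighted average entropy of the partitions it produces; the tree picks the attribute-value combination that minimizes it.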
Evaluation
Accuracy for overall model
Recall + Precision for individual classes
F1 Scores for individual classes
Confusion Matrix
True Positive (TP) → # of cases for which we predict true, and the actual class is true
True Negative (TN) → # of cases for which we predict false, and the actual class is false
False Positive (FP) → # of cases for which we predict true, but actual class is false
False Negative (FN) → # of cases for which we predict false, but actual class is true
Classification Probabilities
k-NN → % of a class among k-neighbors
Decision tree → % of a class in leaf node
Generated based on cutoff value → can lead to different classification results
Numeric Prediction
Predicting continuous values of outcome variable
Objective → predicting a numerical outcome for a new observation based on data containing observations whose outcome variable is already known
Split Data (Randomly); normally 65%-85% training data
(Training Data) Used to run algorithm and learn model; model construction
Construct a prediction model based on training data
(Testing Data) Used to test model and assess performance; model testing
Measure accuracy of your model using test data
K-NN (Nearest Neighbors)
Identify nearest neighbors based on distance measure (Euclidean)
Data needs to be normalized (standardized); outcome does not
Nearest neighbors are points with shortest distance from new point
Once NN have been identified → numeric prediction for new observation is obtained using average of the NN outcome values
K = number of neighbors to consider
Chosen to MIN. RMSE
Cannot be ties since predicted outcome is determined by taking an average
Predictions can be improved using weighted distance
Evaluation
ME (Mean Error)
MAE (Mean Absolute Error)
MAPE (Mean Absolute % Error)
RMSE (Root Mean Square Error)
How it Works
Need to decide K → how many neighbors to consider
Look at decided number of nearest neighbors and take the average of the outcomes
Ex: If K=3 and the NN outcomes are 50, 5, and 25, the prediction is (50 + 5 + 25) / 3 ≈ 26.67
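This is the same k-NN procedure as for classification, except the vote is replaced by an average; a sketch with made-up one-dimensional training data:

```python
import math

def knn_predict(train, new_point, k):
    """train: list of (point, numeric_outcome) pairs. The prediction is
    the average outcome of the k nearest neighbors (Euclidean distance),
    so ties in the outcome are impossible."""
    neighbors = sorted(train, key=lambda py: math.dist(py[0], new_point))[:k]
    return sum(y for _, y in neighbors) / k

train = [((1,), 50), ((2,), 5), ((3,), 25), ((10,), 400)]
print(knn_predict(train, (2.5,), k=3))  # (50 + 5 + 25) / 3 ≈ 26.67
```

A weighted variant would replace the plain average with a 1/distance-weighted average, so closer neighbors contribute more.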
Summary
Pros
Simple, captures complex relationship without building model
Doesn't make any assumptions about data distribution
Cons
If you have a lot of attributes you need lots of observations to make reasonable predictions
Very slow, bad for real time predictions
Regression Trees
Splits data based on the attribute-value combination that minimizes the Sum of Squared Deviations (SSD)
To determine leaf node (final node) → uses average of the outcome values of data points in that node
Regression trees tend to discretize outcome
Sum of Squared Deviations (SSD)
Lower SSD → more observations in the partition agree in terms of continuous outcome
SSD = 0 → all observations in a partition have the same outcome value
Regression trees split data based on attribute-value combo that minimizes SSD
Evaluation
Prediction Error → difference between predicted outcome and actual outcome
ME (Mean Error) → average of prediction errors, indicates over or under prediction
Negative → under predicting
Positive → over predicting
MAE (Mean Absolute Error) → magnitude of the average error
MAPE (Mean Absolute % Error) → relative to actual values, gives % score of how much predictions deviate on average from actual values
RMSE (Root Mean Square Error) → measures average magnitude of error, penalizes larger errors
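The four error measures can be computed in a few lines; a sketch with made-up actual/predicted values. Errors here are taken as predicted minus actual, so a positive ME means over-prediction, matching the sign convention above:

```python
def prediction_errors(actual, predicted):
    """Return ME, MAE, MAPE (as a percent), and RMSE."""
    errors = [p - a for a, p in zip(actual, predicted)]  # predicted - actual
    n = len(errors)
    me = sum(errors) / n                                  # signed: over/under
    mae = sum(abs(e) for e in errors) / n                 # average magnitude
    mape = 100 * sum(abs(e / a) for a, e in zip(actual, errors)) / n
    rmse = (sum(e ** 2 for e in errors) / n) ** 0.5       # penalizes big errors
    return me, mae, mape, rmse

# Hypothetical values, made up for illustration
actual = [100, 200, 300]
predicted = [110, 190, 330]
me, mae, mape, rmse = prediction_errors(actual, predicted)
print(me, mae, mape, rmse)
```

Note RMSE (≈19.1 here) exceeds MAE (≈16.7) because the single large error of 30 is squared before averaging.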
Summary
Pros
No parametric assumptions
Good for variable selection
Robust to outliers
Cons
Unstable → changes to data lead to different splits
Can miss relationships between attributes