Analytics
Data mined without clear outcome in mind; exploratory in nature; not pre-defined or classified, discovered from the data
Unsupervised Learning Methods
Descriptive and exploratory in nature
Mined to uncover previously unknown patterns without outcome in mind
Make use of unlabeled data: no specified outcome variable
Assessing how good the results are (subjective)
Supervised Learning Methods
Prediction methods
Mined with clear objective to predict a specific outcome
Make use of labeled data; outcome variable is known
Assessing performance of model is done with objective metrics
Association Rules
Discovering relationships among items
General Idea → Investigating co-occurrence of items, events, variables and so on
Recommendation Systems are main application (e-commerce)
Market Basket Analysis
Definitions
X→Y : Association rule (relationship between item sets)
Association rules are directional
Item sets X and Y cannot overlap
X : antecedent (body)
Y : consequent (head)
Metrics / Measures
Assessing association rules
(Support Count) count of transactions containing item-set [σ]
Support count of X, support count of Y and support count X→Y
Not directional
(Support Percentage) fraction of transactions that contain the item set; probability a randomly selected transaction contains the item set [s]
# of transactions with item set / total # of transactions
Support percent of X, support percent of Y, support percent of X→Y
Not directional
(Confidence) how often Y appears in transactions containing X; probability a transaction including X will also include Y [conf.]
Count (X→Y) / Count (X)
Directional
(Lift) measure of how much more likely two item sets co-occur [lift]
Conf. (X→Y) / S(Y)
Not directional
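As a rough sketch of these four measures in pure Python (the tiny transaction list is made up for illustration):

```python
# Hypothetical mini transaction set; items are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support_count(itemset, transactions):
    """sigma: count of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """s: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = support(X union Y) / support(X); directional."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """lift(X -> Y) = conf(X -> Y) / s(Y); same value in both directions."""
    return confidence(X, Y, transactions) / support(Y, transactions)

print(support({"bread"}, transactions))               # 4/5 = 0.8
print(confidence({"bread"}, {"milk"}, transactions))  # 0.6 / 0.8 = 0.75
print(lift({"bread"}, {"milk"}, transactions))        # 0.75 / 0.8 = 0.9375
```

Here lift < 1, so buying bread makes milk slightly less likely in this made-up data.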
Algorithms - finding good candidates
To find rules that are good candidates, only consider frequent item-sets; specify minimum support (minsupp) and minimum confidence (minconfidence) based on domain knowledge or business goals
Frequent item-sets → support >= minsupp
Minimum support threshold lightens computational load
Apriori Algorithm
Key idea → if item set X is not frequent then any item sets containing X cannot be frequent
Algorithm process:
consider all one-item-sets, compute support, keep sets with supp >= minsupp
form two-item-sets from the frequent one-item-sets kept in step 1, keep sets with supp >= minsupp
form three-item-sets from the frequent two-item-sets kept in step 2, keep sets with supp >= minsupp
continue till no further combinations to consider; resulting item sets satisfy minsupp
calculate confidence of item-sets, keep sets with conf. >= minconfidence; this is the final set of rules
Can also calculate lift of these rules
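The frequent-itemset part of the process above can be sketched in pure Python. This is a simplified version: candidates are built from items still present in frequent (k-1)-item-sets rather than full subset pruning, and the transaction data is made up.

```python
from itertools import combinations

def apriori(transactions, minsupp):
    """Return all frequent item-sets (as frozensets) with support >= minsupp."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 1: frequent one-item-sets
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items if supp(frozenset([i])) >= minsupp]
    frequent = list(current)
    k = 2
    while current:
        # Key idea: only items that still appear in frequent (k-1)-item-sets
        # can appear in a frequent k-item-set, since any superset of an
        # infrequent set is itself infrequent.
        candidate_items = sorted({i for s in current for i in s})
        current = [frozenset(c) for c in combinations(candidate_items, k)
                   if supp(frozenset(c)) >= minsupp]
        frequent.extend(current)
        k += 1
    return frequent

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
print(apriori(transactions, minsupp=0.5))
```

With minsupp = 0.5, all singletons and all pairs survive, but {a, b, c} (support 0.25) is dropped. Rule generation would then filter X→Y candidates by confidence >= minconfidence.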
Interpretation of results
Confidence: when X happens, [conf.]% of the time Y also happens
Lift: when X happens, Y is (lift - 1) × 100 percent more/less likely
Lift > 1; customers who buy X are more likely to buy Y
Lift = 1; customers who buy X are as likely to buy Y
Lift < 1; customers who buy X are less likely to buy Y
Cluster Analysis
Organizing observations into internally similar and meaningful clusters
Cluster Properties
High Intra-Similarity → points in same cluster are similar to each other
Low Inter-Similarity → points in different clusters are dissimilar
Applications of Clustering
Discover natural patterns in data
Facilitate analysis of large datasets
Why Algorithms
Data lives in N-dimensional space; for a dataset with M observations of N dimensions, we need to:
Understand similarity between data
Decide which clustering method
How many clusters (stopping criteria)
Evaluate quality of clusters
Distance Measures (similarity between individual data points)
Numerical Data (higher distance → more different; lower distance → more similar)
Euclidean Distance (straight line distance)
Manhattan Distance (rectilinear distance)
Binary Data (higher distance → more different; lower distance → more similar)
Matching Distance (# of mismatches over total #)
Jaccard Distance (excludes null observations)
Used for asymmetric binary data (where a 0, e.g. "non-senior", can mean different things)
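A minimal sketch of the four distance measures above in pure Python (the example points are made up):

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Rectilinear (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

def matching(p, q):
    """Binary data: # of mismatches over total # of attributes."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

def jaccard(p, q):
    """Binary data: mismatches over attributes where at least one value
    is 1 (0/0 pairs are excluded, for asymmetric binary attributes)."""
    non_null = [(a, b) for a, b in zip(p, q) if a == 1 or b == 1]
    return sum(a != b for a, b in non_null) / len(non_null)

print(euclidean((0, 0), (3, 4)))             # 5.0
print(manhattan((0, 0), (3, 4)))             # 7
print(matching((1, 0, 1, 0), (1, 1, 0, 0)))  # 2 mismatches / 4 = 0.5
print(jaccard((1, 0, 1, 0), (1, 1, 0, 0)))   # 2 mismatches / 3 non-null pairs
```

Note how the 0/0 attribute inflates agreement under matching distance but is ignored by Jaccard.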
Data Normalization
Different type of attributes → mixed data
Attributes that take values in different ranges → non-comparable data
(Data Normalization) eliminate specific units of measurement by transforming attributes to a common scale
Min-Max Normalization → rescale attributes to have values between 0 and 1 using min and max
New Value = (x-min) / (max-min)
Standardization → transform each attribute to have mean of 0 and sd of 1; Need to transform binary too
New Value = (x-sample mean) / sample sd
Step 1: compute mean + SD
Step 2: perform standardization
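Both normalizations can be sketched in a few lines of Python (the age values are made up):

```python
import statistics

def min_max(values):
    """Min-max normalization: rescale to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Standardization (z-score): (x - sample mean) / sample SD.
    Result has mean 0 and standard deviation 1."""
    mean = statistics.mean(values)     # step 1: compute mean + SD
    sd = statistics.stdev(values)      # sample SD (n - 1 denominator)
    return [(x - mean) / sd for x in values]  # step 2: standardize

ages = [20, 30, 40, 50]
print(min_max(ages))      # endpoints map to 0.0 and 1.0
print(standardize(ages))  # symmetric around 0
```

Note `statistics.stdev` uses the sample (n - 1) formula; use `pstdev` if the population formula is intended.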
Linkage Methods (distance between clusters)
Single Linkage: min pairwise distance between points from diff. clusters
Compute all distances between points + pick min
Sensitive to outliers, good for odd shapes
Complete Linkage: max pairwise distance between points from diff. clusters
Compute all distances between points + pick max
Tends to break large clusters
Average Linkage: average pairwise distance between points from diff. clusters
Compute all distances between points + take avg.
Tends to produce similar results to centroid linkage
Centroid Linkage: distance between two clusters centroids (cluster means)
Compute cluster’s centroids (avg. of X, avg. of Y) then compute distance
Tends to produce similar results to avg. linkage
Ward's Method: minimize within-cluster variance (sum of squared distances between each point and its cluster centroid)
Robust to outliers, creates denser spherical clusters
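The first four linkage methods reduce to simple pairwise-distance computations; a sketch with made-up 2D clusters:

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Minimum pairwise distance between points from different clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Maximum pairwise distance between points from different clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Average pairwise distance between points from different clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_linkage(c1, c2):
    """Distance between the two clusters' centroids (attribute-wise means)."""
    def centroid(c):
        return [sum(col) / len(c) for col in zip(*c)]
    return dist(centroid(c1), centroid(c2))

a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
print(single_linkage(a, b), complete_linkage(a, b))
print(average_linkage(a, b), centroid_linkage(a, b))
```

On the same pair of clusters the four methods give different distances, which is why the linkage choice changes the resulting hierarchy.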
Clustering Methods
Hierarchical clustering (bottom-up clustering) → forming larger clusters from smaller ones
# of clusters is not specified, distance measure and linkage method are specified
(Distance Matrix) square matrix containing pairwise distances between data points
Algorithm process → Repeatedly merges the two nearest clusters until one big cluster remains
(Dendrogram) Diagram showing clusters hierarchy
Length of lines is proportionate to distance
Depending on the # of clusters needed, start at the top and cut the dendrogram horizontally; each branch below the cut is one cluster, e.g. C1 (x, x, x, x), etc.
Pros → flexible, produces solution for diff. cluster #’s
Cons → not scalable, can be outlier sensitive
Partitioning-based clustering (k-means clustering) → directly partition data into K groups where K is # of clusters, usually with Euclidean distance
Objective → produce specified # of clusters with lowest within-cluster variance
Process
K points picked at random as initial centroids; all data points assigned to the nearest centroid based on Euclidean distance
Recalculates cluster centroids and reassigns points to centroid based on distance
Repeat step 2 until convergence (re-assignment doesn’t change anymore)
Pros → Very scalable
Cons → poor initialization can lead to bad results, bad for irregular shapes and outliers
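The three-step process above can be sketched as a plain k-means loop in pure Python (the points and seed are made up; a real run would try several initializations precisely because of the initialization sensitivity noted above):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: random initial centroids, assign each point to the
    nearest centroid (Euclidean), recompute centroids, repeat until the
    assignment stops changing (convergence)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # step 1: random initial centroids
    assignment = None
    for _ in range(iters):
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # step 3: stop when stable
            break
        assignment = new_assignment
        for j in range(k):                # step 2: recompute centroids
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                   # guard against empty clusters
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
labels, cents = kmeans(pts, 2)
print(labels)
```

The two lower-left points end up together and the two upper-right points together, regardless of which points the seed happens to pick first.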
Assessing Quality of Clusters
Within Sum of Squared Errors (WSS)
Lower the WSS → more cohesive clusters are → high intra-similarity
Between Sum of Squared Errors (BSS)
Higher the BSS → larger distance of centroids from each other → lower inter-similarity
Choosing # of Clusters
Dendrogram
Clusters can emerge naturally
Elbow Plot
Plot WSS / BSS as function of # of clusters
Pick # corresponding with elbow point (where curve begins to flatten)
Total Sum Squared (TSS) = sum of BSS and WSS
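WSS and BSS are straightforward to compute by hand; a sketch with made-up clusters, which also shows the TSS = WSS + BSS identity:

```python
def centroid(points):
    """Attribute-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wss(clusters):
    """Within sum of squares: squared distances from each point to its
    own cluster's centroid. Lower = more cohesive clusters."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def bss(clusters):
    """Between sum of squares: cluster size times squared distance from
    each cluster centroid to the overall centroid. Higher = more separated."""
    overall = centroid([p for c in clusters for p in c])
    return sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)

clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
print(wss(clusters), bss(clusters))  # 4.0 16.0; TSS = 4 + 16 = 20
```

An elbow plot is just WSS evaluated at k = 1, 2, 3, ... and plotted against k.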
Classification
Predicting Categorical values of outcome variable
Objective → Building a prediction model for a clear, defined outcome variable using data for which we already know outcome
Split Data (Randomly); normally 65%-85% training data
(Training Data) Used to run algorithm and learn model; model construction
Construct a classification model based on training data
(Testing Data) Used to test model and assess performance; model testing
Measure accuracy of your model using test data
Classification → Outcome Variable is binary/categorical
K-NN (Nearest Neighbors) - simple and heuristic
Must identify NN based on distance measure (Euclidean)
Data needs to be normalized/standardized (don't normalize outcome class)
NN are points with shortest distance from new point
Once NN identified → new observation is classified based on majority rule
K = Number of neighbors to consider
Chosen to MAXIMIZE accuracy (lowest error rate)
Small k values → capture local structure of data but are very sensitive to potential outliers
Larger k values → less sensitive to potential outliers but miss capturing the local structure of the data
If K is even # there can be ties → class is picked randomly by algorithm
Recommended not to use even number to avoid this
Classification can be improved by using weighted distance
Giving more weight to closer neighbors
Evaluation of Model
Error → among all predictions, how many did the model get wrong
Accuracy → among all predictions, how many did the model get right
Recall → among the true positive class, how many was the model able to retrieve; proportion of true positive predictions out of all positive instances in the dataset
Useful when false negatives are costly; high recall indicates model is good at finding relevant positive instances
Precision → among the predicted class, how many was the model able to retrieve; proportion of true positive predictions out of all predicted positive instances
Useful when false positives are costly; high precision indicates that positive predictions are most likely correct
F1 Scores → harmonic mean of precision and recall, a more reliable measure than accuracy given unbalanced data
Accuracy is not a good measure in unbalanced datasets and high accuracy model may not be good at classifying specific classes → rely on class specific measures like recall and precision
Precision + Recall Trade-off
Greater precision, lower recall → fewer false positives (good) but more false negatives (bad)
Greater recall, lower precision → fewer false negatives (good) but more false positives (bad)
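All of these measures follow directly from the four confusion-matrix counts; a sketch with made-up counts:

```python
def metrics(tp, tn, fp, fn):
    """Standard classification measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    error = 1 - accuracy
    recall = tp / (tp + fn)         # out of actual positives
    precision = tp / (tp + fp)      # out of predicted positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, error, recall, precision, f1

# Hypothetical counts, made up for illustration
acc, err, rec, prec, f1 = metrics(tp=40, tn=50, fp=10, fn=20)
print(acc, rec, prec, f1)  # 0.75, 2/3, 0.8, 8/11
```

Note F1 sits between precision and recall but closer to the lower of the two, which is why it is more informative than accuracy on unbalanced data.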
How it Works
k-NN will classify new observation according to predominant class (outcome variable) among the identified nearest neighbors observations
Need to decide K → how many neighbors to consider
Look at decided number of nearest neighbors and classify
Ex: If K=3, nn are no purchase, purchase and purchase then the outcome would be purchase
In a tie situation, the algorithm decides randomly
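The majority-rule procedure above can be sketched in a few lines of pure Python (the training points and labels are made up, and features are assumed to be already normalized):

```python
import math
from collections import Counter

def knn_classify(train, new_point, k):
    """train: list of (point, label) pairs. Classify new_point by majority
    vote among the k nearest neighbors (Euclidean distance)."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "no purchase"), ((1, 2), "purchase"),
         ((2, 1), "purchase"), ((8, 8), "no purchase")]
print(knn_classify(train, (1.5, 1.5), k=3))  # "purchase" (2 of 3 neighbors)
```

A weighted variant would replace the plain `Counter` vote with weights such as 1/distance, giving closer neighbors more influence.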
Summary
Pros
Easy to use and implement
Doesn’t require statistical/distribution assumptions
Cons
Lazy classifier → no real model building
Bad for large datasets → slow learning, not good for real time
Decision Trees - systematic and interpretable
Goal → classify data into predefined classes based on decision rules, organized as a tree
Part of a Decision Tree
Top node → root node
Middle nodes → child (decision) nodes
Final node → leaf (final) nodes
Decision trees split data based on attribute-value combination that minimizes entropy
To determine root node → algorithm assesses every attribute-value combination and chooses the one that maximizes agreement (minimizes entropy)
To determine leaf node (final nodes) → majority rule
The class of the majority among the data-points in that node
Stops either when all data points have the same class or when there is no remaining attribute-value combo that could be used to further split the data
The same attributes can be picked multiple times if partitions aren't overlapping
Algorithm does not need to use all attributes
Entropy (measure of agreement)
Lower entropy → more observations in partition agree in terms of class outcome
Entropy of 0 → all observation in partition have same class
Entropy of 1 → max disagreement (50:50 split in outcome values)
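Entropy of a partition is computed from its class proportions; a minimal sketch using label lists made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a partition's class labels, in bits.
    0 = all observations agree; 1 = maximum disagreement for two
    classes (a 50:50 split)."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0  (pure partition)
print(entropy(["yes", "yes", "no", "no"]))    # 1.0  (50:50 split)
print(entropy(["yes", "yes", "yes", "no"]))   # ~0.811 (mostly agree)
```

A split's quality is then the weighted average entropy of the partitions it produces; the tree picks the attribute-value combination that minimizes it.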
Evaluation
Accuracy for overall model
Recall + Precision for individual classes
F1 Scores for individual classes
Confusion Matrix
True Positive (TP) → # of cases for which we predict true, and the actual class is true
True Negative (TN) → # of cases for which we predict false, and the actual class is false
False Positive (FP) → # of cases for which we predict true, but actual class is false
False Negative (FN) → # of cases for which we predict false, but actual class is true
Classification Probabilities
k-NN → % of a class among k-neighbors
Decision tree → % of a class in leaf node
Generated based on cutoff value → can lead to different classification results
Numeric Prediction
Predicting continuous values of outcome variable
Objective → predicting a numerical outcome for a new observation based on data containing observations whose outcome variable is already known
Split Data (Randomly); normally 65%-85% training data
(Training Data) Used to run algorithm and learn model; model construction
Construct a prediction model based on training data
(Testing Data) Used to test model and assess performance; model testing
Measure accuracy of your model using test data
K-NN (Nearest Neighbors)
Identify nearest neighbors based on distance measure (Euclidean)
Data needs to be normalized (standardized); outcome does not
Nearest neighbors are points with shortest distance from new point
Once NN have been identified → numeric prediction for new observation is obtained using average of the NN outcome values
K = number of neighbors to consider
Chosen to MIN. RMSE
Cannot be ties since predicted outcome is determined by taking an average
Predictions can be improved using weighted distance
Evaluation
ME (Mean Error)
MAE (Mean Absolute Error)
MAPE (Mean Absolute % Error)
RMSE (Root Mean Square Error)
How it Works
Need to decide K → how many neighbors to consider
Look at decided number of nearest neighbors and take the average of the outcomes
Ex: If K=3 and the NN outcomes are 50, 5, and 25, the prediction is (50 + 5 + 25) / 3 ≈ 26.67
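This is the same k-NN procedure as for classification, except the vote is replaced by an average; a sketch with made-up one-dimensional training data:

```python
import math

def knn_predict(train, new_point, k):
    """train: list of (point, numeric_outcome) pairs. The prediction is
    the average outcome of the k nearest neighbors (Euclidean distance),
    so ties in the outcome are impossible."""
    neighbors = sorted(train, key=lambda py: math.dist(py[0], new_point))[:k]
    return sum(y for _, y in neighbors) / k

train = [((1,), 50), ((2,), 5), ((3,), 25), ((10,), 400)]
print(knn_predict(train, (2.5,), k=3))  # (50 + 5 + 25) / 3 ≈ 26.67
```

A weighted variant would replace the plain average with a 1/distance-weighted average, so closer neighbors contribute more.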
Summary
Pros
Simple, captures complex relationship without building model
Doesn't make any assumptions about data distribution
Cons
If you have a lot of attributes you need lots of observations to make reasonable predictions
Very slow, bad for real time predictions
Regression Trees
Splits data based on the attribute-value combination that minimizes the Sum of Squared Deviations (SSD)
To determine leaf node (final node) → uses average of the outcome values of data points in that node
Regression trees tend to discretize outcome
Sum of Squared Deviations (SSD)
Lower SSD → more observations in the partition agree in terms of continuous outcome
SSD = 0 → all observations in a partition have the same outcome value
Regression trees split data based on attribute-value combo that minimizes SSD
Evaluation
Prediction Error → difference between predicted outcome and actual outcome
ME (Mean Error) → average of prediction errors, indicates over or under prediction
Negative → under predicting
Positive → over predicting
MAE (Mean Absolute Error) → magnitude of the average error
MAPE (Mean Absolute % Error) → relative to actual values, gives % score of how much predictions deviate on average from actual values
RMSE (Root Mean Square Error) → measures average magnitude of error, penalizes larger errors
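The four error measures can be computed in a few lines; a sketch with made-up actual/predicted values. Errors here are taken as predicted minus actual, so a positive ME means over-prediction, matching the sign convention above:

```python
def prediction_errors(actual, predicted):
    """Return ME, MAE, MAPE (as a percent), and RMSE."""
    errors = [p - a for a, p in zip(actual, predicted)]  # predicted - actual
    n = len(errors)
    me = sum(errors) / n                                  # signed: over/under
    mae = sum(abs(e) for e in errors) / n                 # average magnitude
    mape = 100 * sum(abs(e / a) for a, e in zip(actual, errors)) / n
    rmse = (sum(e ** 2 for e in errors) / n) ** 0.5       # penalizes big errors
    return me, mae, mape, rmse

# Hypothetical values, made up for illustration
actual = [100, 200, 300]
predicted = [110, 190, 330]
me, mae, mape, rmse = prediction_errors(actual, predicted)
print(me, mae, mape, rmse)
```

Note RMSE (≈19.1 here) exceeds MAE (≈16.7) because the single large error of 30 is squared before averaging.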
Summary
Pros
No parametric assumptions
Good for variable selection
Robust to outliers
Cons
Unstable → changes to data lead to different splits
Can miss relationships between attributes