Machine Learning and Association Rule Learning

Covariance Matrix
  • Definitions:

    • The covariance matrix summarizes how variables covary: each entry is the covariance between a pair of variables, which quantifies the degree to which the two change together. A positive covariance indicates that the variables tend to move in the same direction, while a negative covariance indicates they tend to move in opposite directions. A covariance close to zero suggests little or no linear relationship.

    • It is calculated as: Cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}

      • The denominator n-1 is used instead of n to provide an unbiased estimate of the population covariance, known as Bessel's correction.

  • Example Values:

    • Var(X): 1.67

    • Cov(X, Y): 3.33

    • Cov(Y, X): 3.33 (Covariance is symmetric, Cov(X, Y) = Cov(Y, X).)

    • Var(Y): 6.67

  • Variance Calculation (for both X and Y):

    • Variance for X: V_x = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

    • Variance for Y: V_y = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}

    • Covariance between X and Y: V_{xy} = V_{yx} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
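
  • Worked check (a minimal sketch): NumPy's np.cov uses the same n - 1 denominator by default, so it can be used to verify the formulas above. The data below is hypothetical; it is one small dataset (with Y = 2X) chosen because it reproduces the example values listed earlier.

    import numpy as np

    # Hypothetical data (not the original dataset): Y = 2X reproduces the example values.
    X = np.array([1, 2, 3, 4])
    Y = np.array([2, 4, 6, 8])

    # np.cov divides by n - 1 by default (Bessel's correction), matching the formulas above.
    C = np.cov(X, Y)

    print(C[0, 0])  # Var(X)    ≈ 1.67
    print(C[0, 1])  # Cov(X, Y) ≈ 3.33
    print(C[1, 0])  # Cov(Y, X) ≈ 3.33 (symmetric)
    print(C[1, 1])  # Var(Y)    ≈ 6.67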

Machine Learning Class
  • Course Details: SEIS 763, Fall 2025, Machine Learning, Lecture 8

Machine Learning Steps
  • Steps covered in the first half of a typical machine learning workflow (a code sketch follows this list):

    1. Load Data: Importing the dataset into the working environment.

    2. Drop extra columns: Removing irrelevant or redundant features from the dataset to simplify the model and improve performance.

    3. Split data into dependent and independent variables: Separating the target variable (dependent, usually denoted as y) from the predictor variables (independent, usually denoted as X).

    4. Handle Missing Values: Imputing or removing records with missing data.

    5. Feature Scaling/Normalization: Transforming numerical features to a standard range to prevent certain features from dominating the model due to their scale.

    6. Model Selection: Choosing an appropriate machine learning algorithm based on the problem type (e.g., classification, regression) and data characteristics.

    7. (Later steps include model training, evaluation, hyperparameter tuning, and deployment.)
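
  • A minimal sketch of these first steps in Python, assuming a hypothetical customer-churn table (the column names and inline data are made up; in practice the frame would come from pd.read_csv or similar):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # 1. Load data. A tiny inline frame stands in for reading a file; the column
    #    names are hypothetical.
    df = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],
        "age":         [34, 51, np.nan, 29],
        "income":      [48000, 62000, 55000, np.nan],
        "churned":     [0, 1, 0, 1],
    })

    # 2. Drop extra columns that carry no predictive signal.
    df = df.drop(columns=["customer_id"])

    # 3. Split into independent variables X and dependent variable y.
    X = df.drop(columns=["churned"])
    y = df["churned"]

    # 4. Handle missing values by imputing each column's mean.
    X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)

    # 5. Feature scaling: standardize each feature to mean 0, standard deviation 1.
    X_scaled = StandardScaler().fit_transform(X)

    # 6. Model selection: choose an algorithm that fits the problem type
    #    (binary classification here), e.g. logistic regression as a baseline.
    model = LogisticRegression()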

SciKitLearn Pipeline
  • Overview of the SciKitLearn pipeline for machine learning workflows:

    1. pipeline.fit(): This method trains the model using the provided training data. During fitting, the model learns the underlying patterns and relationships within the data.

    2. Labels: Also known as targets or dependent variables, these are the output values that the model is designed to predict. They are crucial for supervised learning tasks.

    3. Training Set: This subset of the data is used to teach the machine learning model. The model learns parameters and makes internal adjustments based on this data.

    4. pipeline.predict(): After the model has been fitted, this method is used to make predictions on new, unseen data based on the patterns learned during the training phase.

    5. .fit_transform(): A common method used for preprocessing steps like scaling, imputation, or dimensionality reduction. It first learns the parameters (e.g., mean and standard deviation for scaling) from the data (fit) and then applies those transformations to the data (transform) in one step.
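
  • A minimal sketch of these pieces working together, assuming a StandardScaler followed by a KNN classifier on synthetic data (the dataset and step names are illustrative, not from the lecture):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic data standing in for a real training set and its labels.
    X, y = make_classification(n_samples=200, n_features=5, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # Preprocessing and the estimator chained into a single object.
    pipeline = Pipeline([
        ("scaler", StandardScaler()),                  # fit_transform() is applied internally
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ])

    pipeline.fit(X_train, y_train)            # learn scaling parameters and fit the model
    predictions = pipeline.predict(X_test)    # transform new data, then predict labels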

K-fold Cross Validation
  • Importance of splitting data into two sets for training and testing: This practice helps to assess the generalization capability of a model, ensuring it performs well on unseen data.

  • Addresses performance tuning: Cross-validation is a robust technique to estimate model performance and tune hyperparameters by reducing the variability of the performance estimate compared to a single train-test split.

    • Example of using K-Nearest Neighbors (KNN) with different values of K:

      • Training/testing data points (x_1, x_2, \ldots, x_d, y)

      • K values tested: k=3, k=5, k=7, each with implications for accuracy. A smaller k value in KNN can lead to models that are sensitive to noise and highly variable (high variance, low bias), while a larger k value tends to produce smoother decision boundaries, reducing variance but potentially increasing bias if k is too large.

        • Accuracy metrics showing differential performance, demonstrating how different k values affect the model's ability to generalize.
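
  • A minimal sketch of this comparison, assuming synthetic data and scikit-learn's cross_val_score (the sample size and fold count are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic data standing in for the (x_1, ..., x_d, y) points above.
    X, y = make_classification(n_samples=300, n_features=6, random_state=0)

    # 5-fold cross validation for each candidate k; averaging accuracy across folds
    # gives a less noisy estimate than a single train/test split.
    for k in (3, 5, 7):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(f"k={k}: mean accuracy = {scores.mean():.3f}")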

Holdout Method
  • Description of holdout method for training and validation: The holdout method involves splitting the dataset into three parts: a training set, a validation set, and a test set.

    • Strategy for hyperparameter tuning and model selection: The training set is used to train the model, the validation set is used to evaluate model performance during hyperparameter tuning and to select the best model, and the test set (or holdout set) is used for a final, unbiased evaluation of the chosen model's performance on completely unseen data.

    • Select optimal model and evaluate performance on holdout set: This provides a reliable estimate of the model's real-world performance after all tuning and selection processes are complete.
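
  • A minimal sketch of a three-way split using two calls to train_test_split; the 60/20/20 proportions are an assumption for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # Split off the final holdout (test) set first, then divide the remainder
    # into training and validation sets (60% / 20% / 20% overall).
    X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

    # Train on X_train, tune hyperparameters and compare models on X_val, and touch
    # X_test only once, for the final unbiased performance estimate.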

Hyperparameter Selection Techniques
  • Grid Search:

    • Method to exhaustively search for the best combination of hyperparameters from a predefined set of values.

    • Use: GridSearchCV from sklearn.model_selection. It evaluates the model for every combination of the hyperparameter values provided, guaranteeing that the best combination within the given grid is found, though this can be computationally expensive (see the sketch after this list).

  • Random Search:

    • Method for a randomized search on hyperparameters, where a fixed number of parameter settings are sampled from specified distributions.

    • Use: RandomizedSearchCV from sklearn.model_selection. This method can be more efficient than Grid Search, especially in high-dimensional hyperparameter spaces, as it often finds nearly optimal solutions much faster by exploring a wider range of values rather than an exhaustive search.
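
  • A minimal sketch of both searches, assuming a RandomForestClassifier and illustrative parameter ranges (the grid values and distributions are not from the lecture):

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    # Grid Search: every combination of the listed values is cross-validated.
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100],
                    "max_depth": [3, 5, None],
                    "criterion": ["gini", "entropy"]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)

    # Random Search: a fixed number of settings sampled from distributions.
    rand = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={"n_estimators": randint(50, 200),
                             "max_depth": randint(2, 10)},
        n_iter=10, cv=5, random_state=0,
    )
    rand.fit(X, y)
    print(rand.best_params_, rand.best_score_)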

Hyperparameter Tuning Results
  • Example results:

    • Accuracy values ranging from 0.86 to 0.88, indicating the performance of different hyperparameter configurations.

    • Various parameters tested:

      • max_depth: A hyperparameter for tree-based models (e.g., Decision Trees, Random Forests) that controls the maximum depth of the tree. A deeper tree can capture more specific information but risks overfitting, while a shallower tree might underfit.

      • n_estimators: A hyperparameter for ensemble methods (e.g., Random Forests, Gradient Boosting) that specifies the number of individual estimators (e.g., decision trees) in the ensemble. More estimators generally improve performance but increase computation time.

      • Possible values for criterion: entropy, gini. These are impurity measures used in decision trees to determine the quality of a split. Gini impurity measures the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the distribution of labels in the dataset. Entropy measures the randomness or unpredictability in the dataset.
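
  • A minimal sketch of how the two impurity measures are computed for a node's class distribution; the 80/20 split used here is a hypothetical example:

    import numpy as np

    def gini(p):
        """Gini impurity: chance of mislabeling a random sample drawn from this node."""
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        """Entropy in bits: unpredictability of the node's class distribution."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # skip empty classes (0 * log 0 is taken as 0)
        return -np.sum(p * np.log2(p))

    print(gini([0.8, 0.2]))     # 0.32
    print(entropy([0.8, 0.2]))  # ~0.72; a 50/50 node gives the maxima of 0.5 and 1.0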

Association Rule Learning
  • Introduction to Association Learning: A data mining technique used to discover interesting relationships and associations among a large set of data items.

    • Example of products often bought together (Diapers ➔ Beer): This classic example from market basket analysis illustrates how patterns of co-occurrence can be found in transactional data.

    • Concepts introduced in the 1990s with market basket analysis: This field emerged to analyze customer buying habits by finding associations between different items that customers place in their "shopping baskets."

Key Definitions in Association Learning:
  • Rule Structure:

    • Antecedent → Consequent: An association rule in the form A \rightarrow B implies that if items in the antecedent (A) are present in a transaction, then it is likely that items in the consequent (B) will also be present.

  • Definitions to understand rules:

    • Frequent Itemset: A collection of one or more items that appear together in at least a specified minimum proportion (support) of transactions. These itemsets form the basis for generating association rules.

Apriori Association Rule Learning Algorithm:
  • Notions of support, confidence, and lift:

    • Support definition: Measures the popularity of an itemset and is defined as the proportion of transactions that contain a given itemset.
      S(B) = \frac{\text{Transactions containing B}}{\text{Total Transactions}}

    • Confidence: Measures the reliability of an inference rule. It indicates the conditional probability that a transaction containing A also contains B (i.e., P(B|A)).
      Confidence(A \rightarrow B) = \frac{\text{Transactions containing both A and B}}{\text{Transactions containing A}}

    • Lift: Measures how much more likely item B is purchased given item A, relative to the baseline probability of purchasing B independently. A lift value greater than 1 suggests a positive association, less than 1 suggests a negative association, and equal to 1 suggests independence.
      Lift(A \rightarrow B) = \frac{Confidence(A \rightarrow B)}{S(B)}

  • Example with Burgers and Ketchup: Transaction numbers and counts are given to derive support and confidence metrics, illustrating how these measures are calculated in a practical scenario to form actionable rules.
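
  • Because the actual transaction counts from that example are not reproduced here, the sketch below uses a hypothetical five-transaction list purely to show how support, confidence, and lift are computed for a Burgers ➔ Ketchup rule:

    # Hypothetical transactions (the original example's counts are not reproduced here).
    transactions = [
        {"burgers", "ketchup", "cola"},
        {"burgers", "ketchup"},
        {"burgers", "fries"},
        {"ketchup", "eggs"},
        {"bread", "milk"},
    ]
    n = len(transactions)

    both    = sum(1 for t in transactions if {"burgers", "ketchup"} <= t)  # 2
    burgers = sum(1 for t in transactions if "burgers" in t)               # 3
    ketchup = sum(1 for t in transactions if "ketchup" in t)               # 3

    support    = both / n                    # 2/5 = 0.40: how popular the pair is
    confidence = both / burgers              # 2/3 ≈ 0.67: P(ketchup | burgers)
    lift       = confidence / (ketchup / n)  # 0.67 / 0.60 ≈ 1.11 (> 1: positive association)

    print(support, confidence, lift)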

Frequent Itemset Mining:
  • The occurrences of candidate itemsets must be counted and filtered against a minimum support threshold to find the relevant (frequent) itemsets.

  • Reduction strategies using the apriori property to limit unnecessary evaluations: The Apriori property states that all non-empty subsets of a frequent itemset must also be frequent. This property is used to efficiently prune the search space by discarding candidate itemsets if any of their subsets are found to be infrequent.
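
  • A minimal sketch of level-wise frequent itemset mining with Apriori-style pruning, in plain Python (the small transaction list at the end is hypothetical):

    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        """Level-wise mining: grow itemsets one item at a time, pruning any
        candidate that has an infrequent subset (the Apriori property)."""
        n = len(transactions)

        def support(items):
            return sum(1 for t in transactions if items <= t) / n

        # Level 1: frequent single items.
        singles = {frozenset([i]) for t in transactions for i in t}
        current = {s for s in singles if support(s) >= min_support}
        frequent, k = {}, 1
        while current:
            frequent.update({s: support(s) for s in current})
            k += 1
            # Join step: unions of frequent (k-1)-itemsets that have size k.
            candidates = {a | b for a in current for b in current if len(a | b) == k}
            # Prune step: keep only candidates whose (k-1)-subsets are all frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in current for s in combinations(c, k - 1))}
            current = {c for c in candidates if support(c) >= min_support}
        return frequent

    # Hypothetical transactions; with min_support = 0.5, the itemsets {burgers},
    # {ketchup}, and {burgers, ketchup} meet the threshold; {fries} and {cola} do not.
    txns = [{"burgers", "ketchup"}, {"burgers", "fries"}, {"burgers", "ketchup", "cola"}]
    print(frequent_itemsets(txns, min_support=0.5))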

Transaction Details:
  • An example lists numbered transactions from the provided dataset, demonstrating how raw transactional data is structured for association rule mining.

  • Calculating support from a set of transactions: This involves counting how many transactions contain specific items or itemsets and dividing by the total number of transactions.

Finding Relevant Itemsets:
  • This process iteratively generates itemsets of increasing size (single items, pairs, triplets, and so on) and assesses their frequency (support). Itemsets that meet a minimum support threshold are considered "frequent." This step is crucial for identifying meaningful combinations of items that repeatedly appear together, so that valid rules can later be derived from them.

Calculating Confidence and Frequencies:
  • Once frequent itemsets are identified, association rules are generated from them and the confidence of each candidate rule (A \rightarrow B) is calculated. Rules that meet a minimum confidence threshold are kept as strong rules, filtering out less reliable associations; worked examples show these sequential confidence calculations against the defined thresholds.
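
  • A minimal sketch of generating rules from a table of frequent itemsets and filtering them by a confidence threshold (the itemset supports shown are hypothetical and would normally come from a mining step like the sketch above):

    from itertools import combinations

    def rules_from_itemsets(frequent, min_confidence):
        """Generate A -> B rules from a {frozenset: support} table of frequent
        itemsets, keeping only rules whose confidence meets the threshold."""
        rules = []
        for itemset, supp in frequent.items():
            if len(itemset) < 2:
                continue  # a rule needs both an antecedent and a consequent
            for r in range(1, len(itemset)):
                for antecedent in map(frozenset, combinations(itemset, r)):
                    # Confidence(A -> B) = support(A and B together) / support(A).
                    confidence = supp / frequent[antecedent]
                    if confidence >= min_confidence:
                        rules.append((set(antecedent), set(itemset - antecedent), confidence))
        return rules

    # Hypothetical supports for a small frequent-itemset table.
    freq = {
        frozenset({"burgers"}): 1.0,
        frozenset({"ketchup"}): 0.67,
        frozenset({"burgers", "ketchup"}): 0.67,
    }
    print(rules_from_itemsets(freq, min_confidence=0.6))
    # Keeps burgers -> ketchup (confidence 0.67) and ketchup -> burgers (confidence 1.0).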

Conclusion:
  • Association rule algorithms and their computations enable predictive analysis based on historical purchase behavior and the rules derived from it. These techniques are foundational for business applications such as recommendation systems, targeted marketing campaigns, inventory management, and understanding customer purchasing patterns.