CIS4321 Midterm Study Guide

Overview of Data Mining and Business Intelligence

Types of Patterns that Can be Mined

  • Association Rules:

    • Definition: Identify relationships among variables in large databases, aiming to understand how different items are associated with each other.

    • Application: Commonly used in market basket analysis, where businesses examine purchase patterns to enhance product placement and promotions.

    • Key Measures:

      • Support: Measures the frequency of occurrence of an itemset; essential for determining the relevance of the rule in the dataset.

      • Confidence: Refers to the likelihood that item B is purchased when item A is purchased; a metric to assess the strength of the association between items.

      • Lift: Ratio of the observed support to that expected if A and B were independent; indicates the degree to which the occurrence of one item influences the occurrence of another.

  • Classification & Regression Patterns:

    • Definition: Involves predictive modeling tasks that use historical data to make predictions about future outcomes.

    • Classification focuses on discrete labels, categorizing data points into specific classes (e.g., spam vs. not spam), while regression deals with continuous values, predicting numeric outcomes (e.g., price).

  • Cluster Analysis:

    • Definition: Groups data into clusters of similar objects to identify patterns and relationships inherent within the dataset, facilitating pattern recognition and anomaly detection.

  • Outlier Detection:

    • Definition: Identifies unusual or rare data points that do not conform to the general patterns of the dataset; essential for quality control and detecting anomalies that could indicate fraud or errors.
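The three association-rule measures listed above (support, confidence, lift) can be computed by hand on a small basket dataset. The sketch below is a minimal illustration in plain Python; the transactions and the function names are made up for this example and do not come from any particular library.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

sup = support({"bread", "milk"}, transactions)
conf = confidence({"bread"}, {"milk"}, transactions)
lft = lift({"bread"}, {"milk"}, transactions)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lft:.2f}")
```

In this toy data the lift for bread -> milk comes out slightly below 1, meaning milk is bought a little less often with bread than its overall frequency would suggest. A lift above 1 would indicate a positive association.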

Differences between Data Mining, Data Analysis & Business Intelligence

  • Data Mining:

    • Focuses on hypothesis generation and is explorative, aimed at discovering hidden patterns and relationships within large datasets.

    • Particularly adept at handling vast amounts of unstructured or semi-structured data, making it suitable for various fields, including finance, healthcare, and marketing.

  • Data Analysis:

    • Centers on hypothesis testing in a confirmative manner; primarily addresses previously established theories through smaller datasets.

    • Often employs statistical tools and methodologies to validate or challenge existing hypotheses.

  • Business Intelligence:

    • Involves reporting techniques and straightforward visualizations based on discovered relationships, aimed at decision-making and strategic planning.

    • Integrates data from various sources and uses analytics to inform business decisions, enhancing operational efficiency.

Various Types of Data Mining Processes

  • CRISP-DM (Cross-Industry Standard Process for Data Mining):

    1. Business Understanding: Define situational context, objectives, deliverables, and project plan tailored to the organization’s needs.

    2. Data Understanding: Collect raw data, conduct preliminary analysis, and define hypotheses to guide exploration.

    3. Data Preparation: Select relevant records and variables, perform data cleaning, transformation, and data wrangling to ensure data quality.

    4. Modeling: Choose and apply appropriate modeling techniques, convert data formats as necessary, and validate models against criteria to ensure reliability.

    5. Evaluation: Rigorously assess the performance of models, select best-performing ones, and interpret results to derive actionable insights.

    6. Deployment: Develop actionable insights and establish a deployment strategy for ongoing monitoring, feedback loops, and refining processes.

  • SEMMA (Sample, Explore, Modify, Model, Assess), developed by SAS Institute:

    • A systematic framework designed to streamline the data mining process, guiding practitioners in effectively managing large datasets.

  • Knowledge Discovery from Data (KDD):

    • An overarching process that encompasses data mining, describing the end-to-end approach of extracting useful knowledge from data.

Data Preprocessing & Missing Data Management

Managing Missing Data:

Types of Missing Data:
  • Missing Completely at Random (MCAR):

    • Definition: Missingness is unrelated to any observed or unobserved values; it occurs sporadically and without pattern.

    • Example: A student misses an assignment due to an unrelated family emergency.

  • Missing at Random (MAR):

    • Definition: Missingness is related to other observed data points but not to the missing values themselves.

    • Example: In health surveys, men may more readily report weight than women due to cultural factors in reporting.

  • Missing Not at Random (MNAR):

    • Definition: Missingness is related to the absent information itself, leading to bias in the dataset.

    • Example: Individuals experiencing severe depression may choose to skip questions about happiness, affecting overall analysis.

Techniques for Handling Missing Data:
  • Deletion Methods:

    • Listwise deletion removes all data for a participant if any response is missing.

    • Pairwise deletion keeps each participant in every analysis for which the required variables are present, dropping them only from analyses that need a missing value.

  • Imputation Methods:

    • Simple imputation fills gaps with the mean, median, or mode of the observed values; more advanced approaches use predictive models to estimate the missing values.

  • Interpolation Techniques:

    • Utilize surrounding data points to estimate missing values, maintaining the integrity of the dataset.

  • Imputation Process:

    • Step 1: Understand the dataset and the nature of missingness.

    • Step 2: Choose an imputation method based on the missingness type.

    • Step 3: Apply the chosen imputation technique rigorously to ensure data accuracy.
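The deletion, imputation, and interpolation techniques above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the records are hypothetical, None stands in for a missing value, and the helper names are invented for this example.

```python
from statistics import mean, median

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
    {"age": 40, "income": 61000},
]

def listwise_delete(rows):
    """Keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute(values, strategy=mean):
    """Replace None with a summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

def interpolate_linear(values):
    """Fill interior None gaps on a straight line between known neighbors."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out

complete_rows = listwise_delete(rows)
ages_mean = impute([r["age"] for r in rows], mean)
incomes_median = impute([r["income"] for r in rows], median)
series = interpolate_linear([10.0, None, None, 16.0, None, 20.0])
print(len(complete_rows), ages_mean, incomes_median, series)
```

Listwise deletion halves this tiny dataset, which shows why it is usually reserved for cases where few rows are affected. Interpolation only makes sense when the values are ordered (for example, a time series).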

High-dimensional Data and Machine Learning Concepts

High-Dimensional Data:

  • Definition: Characterized by many features, such as genomic data, leading to complex data relationships.

  • Challenges: Includes the "curse of dimensionality" causing models to overfit, higher computational costs, and difficulties in visualizing data.

  • Solutions: Employ feature selection and dimensionality reduction techniques such as PCA (Principal Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) to manage high-dimensional data effectively.

Outlier Detection Methods:

  • Techniques include Z-score, Interquartile Range (IQR), robust scaling, and isolation forest, contributing to a comprehensive understanding of data distribution and accuracy in predictions.
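Two of the techniques named above, the Z-score and the IQR rule, are simple enough to write out directly. The following is a sketch using only Python's standard library; the data, thresholds, and function names are assumptions chosen for illustration.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

z_flagged = zscore_outliers(data, threshold=2.5)
iqr_flagged = iqr_outliers(data)
print(z_flagged, iqr_flagged)
```

The lower Z-score threshold of 2.5 is deliberate here. With only ten points and the population standard deviation, no z-score can exceed sqrt(n - 1) = 3, so the common threshold of 3 would flag nothing. The IQR rule does not have this limitation because it is based on quartiles rather than on the mean and standard deviation, which the outlier itself inflates.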

Machine Learning Concepts:

Four Main Categories:
  1. Regression:

    • Definition: Predicts continuous values for outcomes, such as stock prices or weather forecasting.

  2. Classification:

    • Definition: Predicts distinct categories for outcomes, such as whether an email is spam or whether a customer will churn.

  3. Dimensionality Reduction:

    • Definition: Determines key features within data to reduce complexity while preserving essential information.

  4. Clustering:

    • Definition: Identifies multi-dimensional groupings within data, useful for market segmentation and social network analysis.

Supervised vs. Unsupervised Learning:
  • Supervised Learning:

    • Involves labeled datasets where models learn from known outcomes; applies to tasks in regression and classification.

  • Unsupervised Learning:

    • Operates on unlabeled data, identifying inherent patterns, relevant primarily to clustering and dimensionality reduction tasks.

Similarity Measures and Model Comparison

Similarity Measures:

  • Defined as methods to determine how closely two data points resemble each other; vital for clustering and similarity-based models.

Common Measures:
  • Euclidean Distance:

    • Definition: The straight-line distance between data points in a multi-dimensional space; widely used for continuous variables.

  • Manhattan Distance:

    • Definition: The sum of absolute differences across dimensions; particularly effective for structured, grid-like data patterns.

  • Cosine Similarity:

    • Definition: Measures the cosine of the angle between two vectors, providing insight into directional similarities; often applied in text analysis and recommendation systems.

  • Jaccard Similarity:

    • Definition: Compares the size of the intersection versus the union of sets, commonly used for binary attributes.
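The four measures above can each be written as a short function. The sketch below uses plain Python with hypothetical function names; note that Euclidean and Manhattan are distances (smaller means more similar), while cosine and Jaccard are similarities (larger means more similar).

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    # Two empty sets are treated as identical (a convention assumed here).
    return len(set_a & set_b) / len(union) if union else 1.0

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(cosine_similarity((1, 2), (2, 4)))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

The vectors (1, 2) and (2, 4) point in the same direction, so their cosine similarity is essentially 1 even though their Euclidean distance is not zero. This is why cosine similarity suits text data, where document length should not matter.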

Application in KNN (K-Nearest Neighbors):

  • Performance of algorithms like KNN is heavily influenced by the choice of similarity measure; selecting appropriate metrics optimizes classification and regression results.

Regression vs. Classification:
  • Regression:

    • Purpose: Predicts continuous numeric outcomes, aiding economic forecasts and scientific research.

    • Methods include linear regression and polynomial regression; evaluation metrics consist of Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²).

  • Classification:

    • Purpose: Predicts discrete classes for categorical outcomes, relevant to tasks like behavior prediction.

    • Methods: Logistic regression, decision trees, and support vector machines (SVM) are commonly applied; evaluation metrics include accuracy, precision, recall, F1-score, Receiver Operating Characteristic (ROC), and the Area Under Curve (AUC).

Key Differences:
  • Regression focuses on predicting continuous outputs, while classification targets categorical labels; each employs distinct techniques and algorithms tailored to respective problem types.

Underfitting & Overfitting, Bias & Variance

Underfitting vs. Overfitting:

  • Overfitting:

    • Occurs when models learn the noise and unnecessary complexities of training data, resulting in poor performance on unseen data.

    • Examples: Using a low 'k' in KNN increases sensitivity to noise; unpruned decision trees can capture training patterns well but struggle to generalize effectively.

  • Underfitting:

    • Lack of complexity leads to poor performance across both training and test datasets; models do not adequately represent the underlying patterns in the data.

High Variance vs. High Bias:

  • High Variance (Overfitting):

    • Characterized by high accuracy on training data and low accuracy on new data, common with complex and flexible models.

  • High Bias (Underfitting):

    • Involves low accuracy across datasets, often arising from models that are too simplistic to capture the true relationships.

Bias-Variance Trade-off:

  • Finding the optimal balance between model simplicity and complexity is crucial for effectiveness; excessive simplicity leads to ignored patterns, while excessive complexity results in noise memorization.

Model Validation and Performance Metrics

Classification Metrics:
  • Include accuracy, precision, recall, F1-score, Receiver Operating Characteristic (ROC) curve, and Area Under Curve (AUC).

Regression Metrics:
  • Incorporate Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²).
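Most of the metrics listed above follow directly from their definitions. The sketch below computes them in plain Python for small hypothetical prediction vectors; the function names and the binary label convention (1 = positive class) are assumptions for this example, and ROC/AUC are left out because they require predicted scores rather than hard labels.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R-squared for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(reg)
print(clf)
```

Precision asks "of the items predicted positive, how many were right?", while recall asks "of the truly positive items, how many were found?". F1 is their harmonic mean, so it is only high when both are high.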

Model Validation Techniques:

  • Techniques such as train/test split and K-fold cross-validation ensure models are tested rigorously, providing robust validation of model performance and stability.

  • Hyperparameter tuning is also important: it identifies the model configurations that improve predictive accuracy.

Algorithms & Techniques:

  • K-Nearest Neighbors (KNN):

    • Characteristics: An instance-based learning approach utilizing distance metrics for classification.

    • Choosing 'k': A smaller 'k' increases variance, making the model sensitive to noise, while a larger 'k' increases bias, potentially oversimplifying the model.

    • Importance of Feature Scaling: Normalization or standardization is essential because it lets all features contribute equally to distance calculations; without it, features with large numeric ranges dominate.

    • Advantages:

      • Conceptually simple and easily understandable; does not include an explicit training phase, reducing initial computational requirements.

    • Disadvantages:

      • Computationally expensive during prediction; sensitive to irrelevant features that may distort distance metrics.

    • Applications:

      • Commonly used in areas such as consumer credit ratings, loan applications, fraud detection, and medical diagnosis.
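To make the KNN points above concrete, here is a minimal from-scratch sketch of a KNN classifier with feature standardization. The data, labels, and function names are hypothetical, and the code is meant to show the mechanics rather than replace a library implementation.

```python
import math
from collections import Counter

def fit_scaler(rows):
    """Return a function that standardizes a row using the training columns' mean and std."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # avoid dividing by 0
        for col, m in zip(columns, means)
    ]

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, stds)]

    return scale

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query` (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: (age, annual income) -> spending tier.
X = [[25, 30000], [30, 32000], [28, 31000], [50, 90000], [55, 95000], [52, 88000]]
y = ["low", "low", "low", "high", "high", "high"]

scale = fit_scaler(X)
X_scaled = [scale(row) for row in X]

pred_young = knn_predict(X_scaled, y, scale([27, 31500]), k=3)
pred_older = knn_predict(X_scaled, y, scale([53, 91000]), k=3)
print(pred_young, pred_older)
```

There is no training step beyond storing the data, which matches the "no explicit training phase" advantage. The cost shows up at prediction time, where every query is compared against every stored point. Without scaling, income (in the tens of thousands) would swamp age in the distance calculation.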

Linear Regression and Decision Trees

Linear Regression:

  • Presumes a linear relationship between independent and dependent variables; commonly fit with Ordinary Least Squares (OLS), which minimizes the sum of squared residuals.

  • Critical to check assumptions such as linearity between variables and the normal distribution of errors to ensure model validity.

  • Regularization techniques, including Ridge (L2) and Lasso (L1), aid in preventing overfitting by penalizing complex models.
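For a single feature, the OLS solution has a simple closed form, and adding a Ridge (L2) penalty changes only the denominator of the slope. The sketch below assumes one predictor and an unpenalized intercept; the function name and data are hypothetical. Lasso (L1) has no comparably simple formula in general and is not shown.

```python
def fit_line(xs, ys, alpha=0.0):
    """Fit y = intercept + slope * x.

    alpha = 0 gives ordinary least squares; alpha > 0 adds a Ridge (L2)
    penalty on the slope only, which shrinks it toward zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
ridge_slope, ridge_intercept = fit_line(xs, ys, alpha=10.0)
print(slope, intercept)
print(ridge_slope, ridge_intercept)
```

With alpha = 0 the fit recovers the true line. With alpha = 10 the slope is pulled toward zero, which is the "penalizing complex models" effect: the fit to the training data gets worse in exchange for a less extreme coefficient.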

Decision Trees:

  • Employ greedy algorithms (e.g., CART) to iteratively split on features that minimize impurity; decisions based on maximizing information gain or minimizing Gini impurity.

  • Entropy Measures:

    • Quantify the uncertainty (class mix) in a node; an evenly mixed node has high entropy and a pure node has zero entropy, and splits are chosen to reduce it.

  • Gini Impurity:

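    • Gives the probability that a randomly chosen sample in a node would be misclassified if it were labeled at random according to the node's class distribution; 0 indicates a pure node.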

  • Overfitting can arise when trees branch excessively without pruning; careful validation and pruning strategies are essential.

  • Pros:

    • High interpretability, suitable for both categorical and numerical data types.

  • Cons:

    • Prone to instability from minor variations in data; requires diligent handling to avoid overfitting.
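The impurity measures that decision trees use to pick splits can be computed directly from class counts. The following sketch, with hypothetical labels and function names, shows entropy, Gini impurity, and the information gain of one candidate split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with an even class mix, and two candidate splits.
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A greedy tree builder evaluates many candidate splits like these at each node and keeps the one with the highest gain (or the lowest weighted Gini impurity). The split that separates the classes perfectly recovers all of the parent's entropy, while the split that leaves both children evenly mixed gains nothing.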

K-Fold Cross Validation

  • Partitions the data into 'k' folds, iteratively using one fold for testing while training on the remaining folds; provides more reliable performance estimates than a single train/test split.

  • Stratified K-Fold:

    • Tailored to classification tasks ensuring class distribution is maintained across folds to enhance model training and evaluation.

  • Leave-One-Out Cross-Validation:

    • A specialized form where 'k' equals the number of observations, enabling analysis of the model's behavior on each data point individually.
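The fold bookkeeping described above can be sketched as a small generator of index lists. This is an illustration in plain Python with a hypothetical function name; it does not shuffle or stratify, both of which a real evaluation would usually want.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)  # the first `extra` folds get one more sample
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Leave-one-out is the special case k == n_samples.
loo = list(k_fold_indices(5, 5))
```

In practice a model would be fit on each train_idx and scored on the matching test_idx, and the k scores would then be averaged. Because every observation lands in exactly one test fold, the average uses all of the data for evaluation without ever testing on a point the model was trained on.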

Detecting Underfitting & Overfitting:

  • Strategies include analyzing learning curves comparing training error against validation error.

  • Continuous monitoring using cross-validation techniques facilitates consistent performance assessment across data folds, ensuring models generalize well on unseen data.

  • A large gap between training and test performance (strong on training, weak on test) points to overfitting; poor performance on both points to underfitting.

Python & Scikit-Learn Workflow

Standard Machine Learning Workflow:

  • Data Preprocessing:

    • Extract features and targets to prepare the dataset for analysis and modeling.

  • Splitting the Data:

    • Use the train_test_split function to create training and testing sets; common ratios are 80/20 or 70/30, which leave sufficient data for training.

  • Model Creation:

    • Select suitable estimators and tune hyperparameters to optimize model performance.

  • Model Training:

    • Fit the model using training data, adjusting parameters as necessary based on performance feedback.

  • Prediction & Transformation:

    • Use predict to generate predictions from fitted supervised models, and transform or fit_transform for preprocessing and dimensionality reduction; clustering estimators commonly expose fit_predict.

  • Evaluation:

    • Assess models utilizing performance metrics, visualizations, and the resources available in Scikit-Learn to inform improvements and adaptations.

Data Definitions:

  • X (Feature Matrix):

    • Represents independent variables used for prediction, structured in a suitable format for analysis.

  • y (Target Variable):

    • Denotes the variable being predicted or estimated, fed into models for analysis.

  • train_test_split example:

    • train_test_split(X, y, test_size=0.2, random_state=42).
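Putting the workflow steps together, a minimal end-to-end sketch might look like the following. It assumes scikit-learn is installed and uses its bundled iris dataset purely as a stand-in; the choice of KNN with k = 5 and of StandardScaler is an illustrative assumption, not something prescribed by this guide.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Data preprocessing: extract the feature matrix X and target vector y.
X, y = load_iris(return_X_y=True)

# 2. Split: 80% training, 20% testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Model creation: choose an estimator and set its hyperparameters.
model = KNeighborsClassifier(n_neighbors=5)

# 4. Model training.
model.fit(X_train_scaled, y_train)

# 5. Prediction.
y_pred = model.predict(X_test_scaled)

# 6. Evaluation.
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the model, so the test score stays an honest estimate of performance on unseen data.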

Hyperparameters and Model Parameters

Model Parameters:

  • Values learned from the data during training by the algorithm's optimization; for example, the weights (coefficients) in linear regression are model parameters.

Hyperparameters:

  • Values that need to be set before training a model, often requiring manual tuning for optimal performance; crucial for shaping how a model operates.

  • Examples include the 'k' value in KNN and the learning rate in gradient-based algorithms, which strongly affects how quickly and reliably training converges.

Hyperparameter Tuning:

  • The process of identifying the best hyperparameter values, frequently employing strategies like grid search or random search for systematic optimization.
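As one possible illustration of grid search, the sketch below tunes the 'k' hyperparameter of KNN with scikit-learn's GridSearchCV. It assumes scikit-learn is installed; the iris dataset, the candidate values of k, and the 5-fold setting are arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameter candidates: set before training, not learned from the data.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Each candidate is scored with 5-fold cross-validation on accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print("best k:", best_k)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

Grid search tries every combination in the grid, so its cost grows quickly as more hyperparameters are added. Random search samples a fixed number of combinations instead, which is often cheaper for large grids.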
