CIS 4321 Midterm Study Guide
Association Rules:
Association rules are used to identify relationships among variables in large datasets, often applied in market basket analysis to discover how items are related to each other.
Key Measures:
Support: The fraction of transactions that include both antecedent and consequent, indicating the frequency of the rule.
Confidence: The likelihood that the consequent appears in a transaction given that the antecedent is present.
Lift: The ratio of the observed support to the support expected if the antecedent and consequent were independent; a lift above 1 indicates a positive association.
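As a quick illustration, these measures can be computed directly from transaction counts. A minimal Python sketch on a made-up set of five transactions (the items and the rule {bread} -> {milk} are purely illustrative):

    # Toy transactions (hypothetical example for illustration only)
    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
        {"bread", "milk"},
    ]
    n = len(transactions)

    # Rule under consideration: {bread} -> {milk}
    antecedent, consequent = {"bread"}, {"milk"}

    count_a = sum(1 for t in transactions if antecedent <= t)                # transactions with bread
    count_c = sum(1 for t in transactions if consequent <= t)                # transactions with milk
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = count_both / n            # frequency of the full rule: 0.6
    confidence = count_both / count_a   # P(milk | bread): 0.75
    lift = confidence / (count_c / n)   # observed vs. expected under independence: 0.9375
    print(support, confidence, lift)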
Classification & Regression Patterns:
These are predictive modeling tasks that differentiate between discrete labels (classification) and continuous values (regression).
Classification methods might include algorithms such as decision trees, SVM, and neural networks, while regression could use linear regression or polynomial regression.
Cluster Analysis:
A technique used to group data into clusters of similar objects based on features, enabling the identification of inherent structures within the data.
Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
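For reference, a minimal K-means sketch with scikit-learn; the synthetic two-cluster data and the choice of k = 2 are assumptions made for the example:

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic 2-D data for illustration: two well-separated blobs
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    # Fit K-means with an assumed k of 2; n_init controls how many random restarts are run
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)        # cluster assignment for each point
    centers = kmeans.cluster_centers_     # learned cluster centroids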
Outlier Detection:
Refers to methods used to identify unusual or rare data points typically excluded from the main data patterns, crucial in fraud detection and quality control.
Data Mining: Involves hypothesis generation and exploration of large datasets to uncover unknown patterns and relationships.
Data Analysis: Focuses on hypothesis testing using statistical techniques to validate existing theories with smaller, targeted datasets.
Business Intelligence: Encompasses reporting, visualization techniques, and data warehousing strategies to support decision-making processes based on analyzed data.
CRISP-DM (Cross-Industry Standard Process for Data Mining):
Business Understanding: Clearly define the objectives, project scope, tasks, and expected deliverables to align with business goals.
Data Understanding: Collect and analyze initial raw data to gain insights and formulate hypotheses about data characteristics.
Data Preparation: Involves selecting relevant variables, cleaning the data for missing or irrelevant entries, and treating outliers.
Modeling: Apply different data mining techniques, followed by documenting the statistical assumptions made during the process.
Evaluation: Assess model performance, select the most effective models, interpret results, and provide actionable insights to stakeholders.
Deployment: Implement the solutions and insights gained from the analysis into real-world applications while establishing a strategy for ongoing monitoring and refinement.
Managing Missing Data (MAR, MCAR, etc.):
Missing Completely At Random (MCAR): When the missingness of data is unrelated to any observed data points, indicating that we can ignore these missing values without bias.
Missing at Random (MAR): When the probability of missing data correlates to other observed data but not the missing data itself, requiring a more nuanced analysis to handle.
Missing Not at Random (MNAR): When missingness is related to the absent data itself, which can introduce significant bias if not addressed properly.
Techniques for Handling Missing Data:
Deletion: Two approaches – listwise deletion (removing entire records with any missing value) or pairwise deletion (excluding a record only from the specific analyses that need its missing values).
Imputation: Estimates missing values through statistical methods such as mean/mode substitution, or with predictive models to obtain more accurate replacements.
Interpolation: Fills in missing values based on adjacent data points, commonly used in time-series analysis to maintain data continuity.
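A minimal pandas sketch of these three approaches; the column names and values are made up for illustration:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                       "income": [50000, 62000, np.nan, 58000]})

    dropped = df.dropna()                                  # listwise deletion: remove rows with any missing value
    mean_filled = df.fillna(df.mean(numeric_only=True))    # simple imputation with column means
    interpolated = df.interpolate()                        # linear interpolation between adjacent values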
The Process of Imputation:
Understand the Data & Missingness: Analyze patterns behind missing data to choose an appropriate imputation strategy.
Choose an Imputation Method: Select methodologies based on data characteristics and observed relationships.
Apply the Method: Perform the imputation using the selected techniques to create a complete dataset.
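In scikit-learn this process maps naturally onto SimpleImputer. A short sketch, assuming mean imputation is the chosen method:

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])   # toy data with gaps

    # fit learns the per-column means from the observed values; transform fills the gaps
    imputer = SimpleImputer(strategy="mean")
    X_complete = imputer.fit_transform(X)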
High-Dimensional Data:
Refers to datasets with a large number of features or variables, such as genomic data, which present unique challenges.
Challenges:
Curse of Dimensionality: As the number of features grows, the required volume of data increases exponentially, complicating the analysis.
Overfitting: Models trained on high-dimensional data may learn noise instead of actual patterns, leading to poor generalization.
Computational Cost: Increasing dimensions demand more resources for storage, processing, and analysis.
Solutions: Techniques like feature selection to eliminate irrelevant features and dimensionality reduction methods such as PCA and t-SNE help simplify analyses and improve computational efficiency.
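A brief PCA sketch with scikit-learn; the synthetic data and the choice of two components are assumptions for the example:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))                 # 100 samples, 20 features (synthetic)

    X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)
    print(pca.explained_variance_ratio_)           # share of variance kept by each component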
Common techniques for detecting outliers include:
Z-score Method: Flags points that lie more than a chosen number of standard deviations (commonly 3) from the mean.
IQR Method: Uses the interquartile range to define boundaries (commonly 1.5 times the IQR below Q1 or above Q3) beyond which points are treated as outliers.
Robust Scaling: Transforming data to reduce sensitivity to outliers.
Isolation Forest: An advanced algorithm that effectively isolates anomalies rather than profiling normal data points.
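A sketch of the z-score rule, the IQR rule, and an Isolation Forest on synthetic data; the conventional cutoffs (3 standard deviations, 1.5 times the IQR) and the contamination setting are assumptions that would normally be tuned:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(1)
    x = np.append(rng.normal(50, 5, 200), [120, -30])   # synthetic data with two planted outliers

    # Z-score rule: flag points more than 3 standard deviations from the mean
    z = (x - x.mean()) / x.std()
    z_outliers = x[np.abs(z) > 3]

    # IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

    # Isolation Forest: returns -1 for anomalies, 1 for inliers
    iso = IsolationForest(contamination=0.01, random_state=42)
    flags = iso.fit_predict(x.reshape(-1, 1))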
Regression: Forecasting continuous values, applicable in various domains like finance and real estate.
Classification: Predicting distinct categories, widely used in applications like spam detection and customer segmentation.
Dimensionality Reduction: Methods that condense the data into fewer features while retaining its most critical information.
Clustering: Identifying natural groupings in data, essential for market segmentation and trend analysis.
Supervised Learning: Involves training algorithms on labeled input and output pairs; used mainly in regression and classification tasks where the understanding of outcomes or categories is required.
Unsupervised Learning: Deals with identifying patterns without labeled outcomes; useful in clustering and dimensionality reduction tasks that reveal insightful information about data structure.
Quantitative methods for assessing the similarity between two data points include:
Euclidean Distance: Measures straight-line distance, suitable for continuous variables.
Manhattan Distance: Sums the absolute differences along each dimension; useful for grid-like data and less sensitive to outliers than Euclidean distance.
Cosine Similarity: Evaluates the angle between vectors, frequently applied in text analysis and document similarity tasks.
Jaccard Similarity: Assesses similarity by comparing intersection over union for binary attributes.
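A short sketch using SciPy's distance functions; the example vectors are arbitrary:

    import numpy as np
    from scipy.spatial import distance

    a = np.array([1, 0, 2, 3])
    b = np.array([2, 1, 2, 1])

    euclidean = distance.euclidean(a, b)      # straight-line distance
    manhattan = distance.cityblock(a, b)      # sum of absolute differences
    cosine_sim = 1 - distance.cosine(a, b)    # SciPy returns cosine *distance*, so convert

    # Jaccard on binary attributes: intersection over union
    p = np.array([1, 1, 0, 1], dtype=bool)
    q = np.array([1, 0, 0, 1], dtype=bool)
    jaccard_sim = 1 - distance.jaccard(p, q)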
Regression:
Purpose: Predict continuous numeric outcomes like sales figures or stock prices.
Examples: Linear regression techniques, polynomial regression for non-linear relationships.
Evaluation Metrics: Include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² to evaluate the performance of the regression models.
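A short sketch computing these regression metrics with scikit-learn; y_true and y_pred are placeholder arrays:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    y_true = np.array([3.0, 5.0, 7.5, 10.0])    # actual values (illustrative)
    y_pred = np.array([2.8, 5.4, 7.0, 9.5])     # model predictions (illustrative)

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                          # RMSE is the square root of MSE
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)                # proportion of variance explained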
Classification:
Purpose: To categorize data into discrete labels, such as identifying if an email is spam.
Examples: Techniques like logistic regression, decision trees, and support vector machines (SVM).
Evaluation Metrics: Metrics for assessment include accuracy, precision, recall, F1-score, receiver operating characteristic (ROC) curve, and area under the curve (AUC).
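A companion sketch for the classification metrics; the labels and predicted probabilities below are placeholders:

    import numpy as np
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    y_true = np.array([0, 1, 1, 0, 1, 0])                 # actual classes (illustrative)
    y_pred = np.array([0, 1, 0, 0, 1, 1])                 # predicted classes
    y_prob = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.6])     # predicted probability of class 1

    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)    # of predicted positives, how many were correct
    rec = recall_score(y_true, y_pred)        # of actual positives, how many were found
    f1 = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_prob)       # AUC uses scores/probabilities, not hard labels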
Key Differences:
Output Type: Continuous values for regression and categorical labels for classification.
Problem Type: Regression involves predicting quantities, while classification is focused on grouping data into categories.
Overfitting:
Occurs when a model learns the noise and specific patterns of the training data excessively, leading to poor performance on unseen data.
Commonly arises in complex models where parameters are too flexible, for instance, when k in KNN is too low or a decision tree is excessively deep.
Underfitting:
Characterized by poor performance on both training and testing datasets, which indicates an overly simplistic model that fails to capture the underlying trends.
Bias-Variance Tradeoff:
High Variance: Leads to strong performance on training data but poor performance on test data, indicating that the model is too sensitive to fluctuations in the training set.
High Bias: Indicates low accuracy for both training and test datasets, showcasing a model that is too rigid and incapable of capturing complex data patterns.
This concept highlights the balancing act between a model’s complexity and its generalization ability, aiming to reduce both bias and variance for optimal performance.
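One way to see the tradeoff is to sweep k in KNN and compare training versus test accuracy. A sketch on a synthetic dataset (the dataset, split, and k values are all illustrative):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=400, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    for k in (1, 5, 25, 101):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        # Small k: training score near 1 but weaker test score (high variance / overfitting)
        # Large k: both scores fall toward a simplistic baseline (high bias / underfitting)
        print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))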
Classification Metrics: Include accuracy, precision, recall, F1-score, ROC curve, and area under the curve (AUC) to evaluate a model's effectiveness in classification tasks.
Regression Metrics: Metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² are essential for gauging performance in regression tasks.
Model Validation: Strategies such as train/test split, k-fold cross-validation, and hyperparameter tuning enhance model evaluation and reliability.
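Hyperparameter tuning is commonly paired with cross-validation. A sketch using GridSearchCV to choose k for a KNN classifier; the Iris data and the parameter grid are assumptions for the example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # 5-fold cross-validated search over a small, illustrative grid of k values
    grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)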
K-Nearest Neighbors (KNN):
An instance-based learning algorithm that relies on distance metrics such as Euclidean and Manhattan distance.
Choosing k: Selecting an appropriate value for k is crucial. A smaller k can lead to higher sensitivity to noise (high variance), while a larger k risks oversimplifying the model (high bias).
Advantages: Notably simple to implement and doesn't require an explicit training phase.
Disadvantages: Computationally expensive at prediction time because distances to every stored instance must be calculated; sensitive to irrelevant features, which may distort results.
Applications: Widely applied in credit ratings, loan evaluations, fraud detection, and medical diagnoses.
Linear Regression:
Assumes linear relationships between variables; typically employs the Ordinary Least Squares (OLS) method.
Check Assumptions: Evaluating the assumptions of linearity, normality of errors, and homoscedasticity is critical to ensure reliable predictions.
Regularization Techniques: Methods like Ridge (L2) and Lasso (L1) are implemented to prevent overfitting and enhance model generalization.
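A brief sketch of OLS alongside Ridge and Lasso in scikit-learn; the synthetic data and alpha values are illustrative and would normally be tuned:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    ols = LinearRegression().fit(X, y)      # ordinary least squares
    ridge = Ridge(alpha=1.0).fit(X, y)      # L2 penalty shrinks coefficients toward zero
    lasso = Lasso(alpha=0.5).fit(X, y)      # L1 penalty can set some coefficients exactly to zero

    print(ols.coef_, ridge.coef_, lasso.coef_)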
Decision Trees:
Utilize greedy algorithms to reduce impurity using metrics like Gini Index or information entropy.
Advantages: Highly interpretable, they handle both categorical and numerical data adeptly.
Disadvantages: Prone to instability; decision trees can easily overfit to training data.
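A short decision tree sketch in scikit-learn; limiting max_depth (here to an assumed value of 3) is one simple way to curb the overfitting noted above:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Gini impurity is the default splitting criterion; "entropy" is the alternative
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))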
K-Fold Cross-Validation:
This method involves splitting the dataset into k folds, providing a more reliable estimate of model performance compared to a single train/test split.
Stratified K-Fold: A variation that maintains class distribution in each fold, ensuring balanced training and testing sets.
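A sketch comparing plain and stratified k-fold with cross_val_score; the Iris data, the KNN model, and k = 5 folds are assumptions for the example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    model = KNeighborsClassifier(n_neighbors=5)

    kf = KFold(n_splits=5, shuffle=True, random_state=42)              # ignores class balance
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)   # preserves class proportions per fold

    print(cross_val_score(model, X, y, cv=kf).mean())
    print(cross_val_score(model, X, y, cv=skf).mean())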
Data Preprocessing: Focus on feature extraction and target label identification to prepare data for model training.
Splitting Data: Implementing a train/test split to segregate data for building and evaluating models, essential for performance testing.
Model Creation: Selecting suitable estimators and tuning hyperparameters to optimize model configuration.
Model Training: Executing model training with the selected method using training data.
Prediction & Transformation: Making predictions on the unseen testing dataset and transforming inputs into actionable insights.
Evaluation: Analyzing performance through scores and metrics to ascertain the effectiveness of the model.
Model Parameters: These values are learned by the model during the training phase, such as the coefficients in linear regression.
Hyperparameters: These are pre-set specifications that define the model’s architecture and learning process, making tuning essential for achieving optimal model performance.
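A small sketch contrasting the two in scikit-learn; the models chosen are illustrative:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsClassifier

    # Hyperparameter: chosen before training (here, the number of neighbors)
    knn = KNeighborsClassifier(n_neighbors=7)

    # Model parameters: learned from data during fit (here, the regression coefficients)
    X, y = make_regression(n_samples=100, n_features=3, random_state=0)
    lr = LinearRegression().fit(X, y)
    print(lr.coef_, lr.intercept_)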
KNN Quick Review:
Q: What does KNN stand for? A: K-Nearest Neighbors
Q: What type of learning does KNN belong to? A: Supervised learning
Q: What is the primary concept KNN relies on? A: Similarity and proximity
Q: What is the most commonly used distance metric in KNN? A: Euclidean distance
Q: In regression, how does KNN make predictions? A: By averaging the values of the K nearest neighbors.
Q: How does KNN handle classification tasks? A: It assigns the most frequent class among the K neighbors.
Q: What happens if K is too small? A: The model is too sensitive to noise.
Q: What is one advantage of KNN? A: No explicit training phase is required.
Q: What happens if K is too large? A: The model becomes too simplistic and loses accuracy.
Splitting the Data:
Use a training set for model training and a testing set for model evaluation. Example code (from sklearn.model_selection): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Data Scaling:
Essential for distance-based algorithms like KNN that are sensitive to feature scales; for example, StandardScaler standardizes each feature to zero mean and unit variance (see the end-to-end sketch at the end of this guide).
Model Creation:
Begin by initializing the KNN classifier and setting required hyperparameters for accurate predictions.
Model Training:
Call the fit method with the training data to train the KNN model.
Prediction:
Conduct predictions on the testing dataset using the trained model.
Evaluation:
Evaluate the model using accuracy scores and confusion matrix analysis to measure performance comprehensively.
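Putting the six steps together, an end-to-end sketch with scikit-learn; the Iris dataset stands in for the course data, and the 70/30 split and k = 5 mirror the examples in this guide:

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    # Splitting the data
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Data scaling (fit on the training set only, then apply to both sets)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Model creation and training
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # Prediction and evaluation
    y_pred = knn.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))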