Data Mining Process and Machine Learning Concepts

Data Mining Process

  • Understanding the Problem: Identify the business problem or pain points before determining what data is needed.

Phases of Data Mining

  • Business Understanding: Define objectives and requirements.
  • Data Understanding: Collection and description of data.
  • Data Preparation: Cleaning and transforming data into the required format.
  • Modeling: Applying machine learning algorithms.
  • Evaluation: Assessment of the model’s performance.
  • Deployment: Implementing the model in the business context.

Data Terminology

  • Label: Variable to predict (e.g., type of animal, spam email indicator, housing price).
  • Features: Input variables representing the data (used for prediction).
  • Examples of Features: Balance, Age, Default status of customers.
  • Example: A set of feature values representing one instance of the data.
  • Dataset: A collection of examples.

Types of Features

  • Numerical Features: Contain numerical values. Examples: age, income, revenue (can be continuous or discrete).
  • Categorical Features: Contain non-numerical values grouped into categories. Examples: color, gender, occupation (can be nominal or ordinal).

Machine Learning Terminology

  • Labeled data: Data tagged with correct outputs for algorithm learning.
  • Unlabeled data: Data without output tags, used in unsupervised learning.
  • Supervised Learning: Model learns from labeled data (e.g., classification, regression).
  • Unsupervised Learning: Learns from unlabeled data to identify patterns (e.g., clustering).
  • Reinforcement Learning: Learns through interaction with an environment (rewards/punishments).

Supervised Learning: Classification

  • Definition: Predicts a categorical label based on input features.
  • Input: Labeled data with categorical labels.
  • Output: A probability for each label, indicating how likely each category is.
  • Examples: Spam detection, image classification.
  • Models: Decision trees, logistic regression, neural networks.

Probability in Classification

  • Definition: Quantifies the uncertainty of an event, written P(event).
  • Range: Probability values range from 0 to 1 (e.g., P(Heads) = 0.5 for a fair coin toss).

Classification Types

  • Binary Classification: Two classes (positive and negative).
  • Multiclass Classification: More than two classes present.

Supervised Learning: Regression

  • Definition: Predicts a continuous numerical value based on features.
  • Input: Labeled data with numerical values.
  • Output: Numerical predictions. E.g., housing price prediction, sales forecasting.

Data Preparation Process

  • Analogy: Similar to cooking processes (collection, cleaning, integration, analysis).
  • Sources of Data: Internal (company databases), external (government data, third-party providers).
  • Data Collection Considerations: Size and quality are crucial for effective model training.

Importance of Data Annotation

  • Vital for AI breakthroughs. E.g., Google Cloud’s Data Labeling Service.
  • Examples: Netflix recommendations, ride-sharing apps.

Data Preparation Methods

Structured Data Preparation Methods

  • Feature-Level Preparation: Handle missing data, outliers, and transform features.
  • Dataset-Level Preparation: Operations on the dataset as a whole, such as feature selection, class balancing, and train/test splitting.

Loading Data from CSV Files

  • CSV Format: Standard format for storing tabular data.
  • Reading CSV in Pandas: Use the pandas library's read_csv function, as sketched below.
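
A minimal sketch, assuming a hypothetical file customers.csv:

```python
import pandas as pd

# Load a CSV file into a DataFrame (file name is hypothetical)
df = pd.read_csv("customers.csv")

# Quick sanity checks on the loaded table
print(df.head())    # first five rows
print(df.dtypes)    # inferred type of each column
print(df.shape)     # (rows, columns)
```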

Handling Missing Values

  • Definition: A missing value is a data point that is unavailable or was not recorded.
  • Methods: Delete the affected rows, or impute the missing values (e.g., mean imputation); both are sketched below.
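
Both options on a tiny hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 48000]})

# Option 1: delete rows that contain any missing value
dropped = df.dropna()

# Option 2: impute missing values with each column's mean
imputed = df.fillna(df.mean())
```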

Outlier Management

  • Definition: Outliers are values that differ markedly from the bulk of the data.
  • Handling: Assess the context; decide whether to remove or retain them.
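
One common heuristic for flagging outliers (an assumption here, not prescribed by these notes) is the 1.5 × IQR rule:

```python
import pandas as pd

s = pd.Series([12, 14, 15, 13, 16, 14, 95])  # 95 stands far from the bulk

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged; whether to drop
# or keep them still depends on the context
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```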

Transforming Categorical Features

  • Conversion: Categorical values must be converted to numerical formats for analysis.
  • One-hot Encoding: Converts each categorical value into its own binary dummy column, as sketched below.
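
A sketch with pandas' get_dummies; the color column is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary dummy column per category
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)  # columns: color_blue, color_green, color_red
```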

Transforming Numerical Features

  • Normalization: Rescales features with large or skewed ranges onto a comparable scale, which often improves model training.
  • Techniques: Scaling, clipping, log scaling, Z-score normalization.
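
A sketch of the four techniques on a hypothetical income column with one extreme value:

```python
import numpy as np
import pandas as pd

income = pd.Series([30_000, 45_000, 60_000, 1_200_000])

# Scaling (min-max): map values into [0, 1]
scaled = (income - income.min()) / (income.max() - income.min())

# Clipping: cap extreme values at a chosen bound (here the 95th percentile)
clipped = income.clip(upper=income.quantile(0.95))

# Log scaling: compress a large range
logged = np.log(income)

# Z-score normalization: mean 0, standard deviation 1
zscored = (income - income.mean()) / income.std()
```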

Feature Selection: Filter Methods

  • Purpose: Remove irrelevant or redundant features before model training.
  • Methods: Variance method, correlation method.
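
A sketch of both filter methods on a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"f1": [1, 1, 1, 1],     # zero variance: carries no signal
                   "f2": [1, 2, 3, 4],
                   "f3": [2, 4, 6, 8]})    # perfectly correlated with f2

# Variance method: drop features with (near-)zero variance
df = df.drop(columns=df.columns[df.var() < 1e-8])

# Correlation method: inspect pairwise correlations and drop one
# feature of each highly correlated pair (f2 vs. f3 here)
print(df.corr().abs())
```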

Class Distribution in Imbalanced Datasets

  • Definition: A skewed class distribution, where a majority class greatly outnumbers one or more minority classes.
  • Example Applications: Healthcare, fraud detection.

Balancing Imbalanced Datasets

  • Downsampling: Remove samples from the majority class.
  • Upsampling: Replicate samples from the minority class.
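
Both strategies sketched with pandas sampling on a hypothetical 8:2 dataset:

```python
import pandas as pd

df = pd.DataFrame({"feature": range(10),
                   "label": [0] * 8 + [1] * 2})   # 8 majority vs. 2 minority

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Downsampling: randomly drop majority-class rows
down = pd.concat([majority.sample(len(minority), random_state=0), minority])

# Upsampling: replicate minority-class rows (sampling with replacement)
up = pd.concat([majority,
                minority.sample(len(majority), replace=True, random_state=0)])
```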

Training vs. Testing

  • Purpose: Separating training from testing is essential for an honest estimate of model accuracy.
  • Method: Fit the model on a training set and evaluate it on a separate, held-out test set.
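
A sketch using scikit-learn's train_test_split (an assumption; these notes do not name a specific tool):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "y": [0, 1] * 50})  # toy data
X, y = df[["x"]], df["y"]

# Hold out 20% of the examples for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```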

Decision Trees

  • Definition: Supervised learning method for classification and regression.
  • Advantages: Easy to understand and implement, computationally inexpensive.
  • Structure: Internal nodes test feature values; leaf nodes assign labels.
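
A minimal sketch with scikit-learn's DecisionTreeClassifier on the built-in iris dataset (dataset and depth limit are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each internal node of the fitted tree tests one feature value
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict(X[:5]))  # predicted class labels
```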

Decision Tree Learning Process

Goals and Methodology

  • Goal: Build a tree that maps feature values to labels.
  • Loss Function: Choose splits that minimize impurity at the nodes, so each node's decision is as clear as possible.

Node Impurity

  • Definition: Measures label diversity within nodes.
  • Measures: Entropy, Gini Index, Variance.
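
For a node whose examples fall into classes with proportions $p_1, \dots, p_k$, the standard classification measures are:

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i \quad \text{(entropy)}, \qquad G = 1 - \sum_{i=1}^{k} p_i^2 \quad \text{(Gini index)}$$

Both equal 0 for a pure node and are largest when the classes are evenly mixed; variance plays the analogous role for regression trees.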

Information Gain

  • Purpose: Identify optimal features for data splitting.
  • Calculation: The reduction in impurity (e.g., entropy) from the parent node to the weighted average of its children, as given below.
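
With entropy $H$ as the impurity measure, the information gain of a split that sends $n_j$ of the parent's $n$ examples to child $j$ is:

$$IG = H(\text{parent}) - \sum_{j} \frac{n_j}{n}\, H(\text{child}_j)$$

The feature (and split point) with the highest gain is chosen at each node.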

Overfitting in Decision Trees

  • Definition: Model memorizes training data but fails to generalize.
  • Causes: Learning random noise as patterns.

Evaluating Overfitting

  • Error Analysis: Compare performance on the training set with performance on the test set; a large gap signals overfitting.

Avoiding Overfitting

  • Techniques: Favor simpler structures, early stopping, hyperparameter tuning.

Hyper-Parameter Tuning

  • Concept: Adjust parameters for optimized model performance.
  • Methods: Grid search, random search, K-Fold cross-validation.
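
A sketch combining grid search with 5-fold cross-validation via scikit-learn's GridSearchCV (dataset and parameter grid are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try every combination in the grid, scoring each with 5-fold CV
grid = {"max_depth": [2, 3, 5, 10], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```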

Model Evaluation Essentials

  • Points for Evaluation: Tune hyperparameters on a validation set; reserve the test set for the final evaluation only.

Confusion Matrix

  • Definition: Table for classification model performance.
  • Key Information: True Positives, True Negatives, False Positives, False Negatives.
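
A sketch with scikit-learn's confusion_matrix on hypothetical binary predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1} the layout is:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[3 1], [1 3]]
```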

Limitations of Accuracy

  • High accuracy may not reflect real performance on imbalanced datasets: a model that always predicts the majority class on a 99:1 dataset scores 99% accuracy while never detecting the minority class.

Importance of Precision and Recall

  • Precision: True positives among predicted positives, TP / (TP + FP).
  • Recall: True positives among actual positives, TP / (TP + FN).

Dynamics of Precision and Recall

  • Typically an inverse relationship: raising the classification threshold tends to increase precision and lower recall, and vice versa.

ROC Curve Analysis

  • Graphical representation of the True Positive Rate vs. the False Positive Rate across classification thresholds.
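
A sketch computing the curve's points and the area under it (AUC) with scikit-learn; the scores are hypothetical predicted probabilities:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1, 0, 1]
y_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

# One (FPR, TPR) point per threshold swept over the scores
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)
print(roc_auc_score(y_true, y_scores))  # area under the ROC curve
```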

Evaluating with Cost/Benefit Analysis

  • Cost-Sensitive Evaluation: Takes into account varying impacts of misclassifications.

F1 Score

  • Definition: Harmonic mean of precision and recall, F1 = 2 · (precision · recall) / (precision + recall); a key metric for classification evaluation, especially on imbalanced data.

Regression Overview

  • Definition: Predict continuous values based on features (e.g., housing prices).

Linear Regression

  • Definition: Predicts continuous values based on linear relationships of input features.
  • Optimization Method: Gradient descent to minimize a loss function such as mean squared error, as sketched below.
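
A minimal gradient-descent sketch for one-feature linear regression, minimizing mean squared error; the data are synthetic and the learning rate is an assumption:

```python
import numpy as np

# Synthetic data from y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.5, 100)

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_hat = w * x + b
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # should approach the true values 2 and 1
```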

Interpretability of Linear Regression

  • Each coefficient indicates how much the prediction changes per unit change in its feature, which helps gauge feature significance.

Strategies to Avoid Overfitting

  • Increase training data, select fewer features, or apply regularization (e.g., Lasso regression, which penalizes the absolute size of the coefficients).

Summary of Linear Regression

  • Pros: Simple and interpretable.
  • Cons: Assumes linear relationships, sensitive to outliers.

Introduction to Logistic Regression

  • Definition: Predicts the probability of a binary outcome by passing a linear combination of the features through the sigmoid function; foundational for more advanced models such as neural networks.

Understanding Neural Networks

  • Structure: Consists of interconnected layers that learn complex patterns.
  • Components: Neurons, weights, activation functions.
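
A sketch of a forward pass through one hidden layer; all weights and inputs are hypothetical numbers, and ReLU/sigmoid stand in for the activation functions:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x  = np.array([0.5, -1.2, 3.0])        # input features
W1 = np.array([[0.8, 0.1, -0.4],
               [0.2, -0.3, 0.5]])      # weights of 2 hidden neurons
b1 = np.array([0.1, -0.2])
W2 = np.array([0.6, -0.7])             # output-layer weights
b2 = 0.05

# Each neuron computes activation(weights . inputs + bias)
hidden = relu(W1 @ x + b1)
output = sigmoid(W2 @ hidden + b2)     # probability-like output in (0, 1)
print(output)
```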

Deep Neural Networks

  • Definition: Neural networks with multiple hidden layers, designed to handle complex tasks.
  • Applications: NLP, image recognition, speech recognition.

Summary of Neural Networks

  • Pros: High capability to learn complex patterns.
  • Cons: Computationally expensive and difficult to interpret.