10 - Decision Trees

Overview of Machine Learning
  • Machine Learning: A subset of artificial intelligence focused on enabling systems to learn from data and improve over time without explicit programming.

Types of Learning
  • Supervised Learning:

    • Uses labeled data.

    • Predicts outcomes based on input-output pairs.

    • Requires prior knowledge of classes.

  • Unsupervised Learning:

    • Uses unlabeled data.

    • Seeks to find hidden patterns or intrinsic structures without predefined categories.

Classification in Machine Learning
  • Classification: The task of predicting a data point's category from its features. The goal is to assign a label from a set of predefined categories.

  • Classifier: An algorithm that assigns a piece of data to one of multiple predefined classes.

Applications of Classification
  • Spam detection

  • Fraud detection

  • Object recognition

  • Medical diagnostics

  • Image classification

Characteristics of Classification Algorithms
  • Supervised algorithms utilize labeled training datasets to make predictions.

  • Classification is a subset of supervised learning that involves predicting categorical labels.

Classification Algorithms
  • Common algorithms include:

    • Logistic Regression

    • Decision Trees

    • Support Vector Machines

    • k-Nearest Neighbors

Introduction to Decision Trees
  • Decision Trees: A flowchart-like structure used to make decisions or predict outcomes based on data features.

    • Components:

      • Root Node: The top decision node, representing the best predictor.

      • Internal Nodes: Decision nodes that split the data further.

      • Leaf Nodes: Terminal nodes providing the final classification or decision.

    • Example:

      • Deciding whether to bring an umbrella based on cloudiness.
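The umbrella example amounts to a single root test with two leaves, which can be written as plain branching logic (the feature name is an illustrative assumption):

```python
def bring_umbrella(cloudy: bool) -> bool:
    # Root node: test the single feature "cloudiness".
    if cloudy:
        return True   # leaf: bring the umbrella
    return False      # leaf: leave it at home
```

Each path from the root to a leaf corresponds to one possible decision.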

Building a Decision Tree
  1. Select the best attribute as a splitting criterion.

  2. Split the dataset based on this attribute.

  3. Continue splitting recursively until stopping criteria are met (e.g., all data classified, no remaining features, maximum depth reached).
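The three steps above can be sketched as a short recursive function. This is a minimal, self-contained illustration, not a prescribed implementation: the data layout (a list of `(feature_dict, label)` pairs) and the entropy-based gain criterion are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature):
    """Entropy reduction from splitting rows on one feature."""
    labels = [y for _, y in rows]
    groups = {}
    for fd, y in rows:
        groups.setdefault(fd[feature], []).append(y)
    child = sum(len(ys) / len(rows) * entropy(ys) for ys in groups.values())
    return entropy(labels) - child

def build_tree(rows, features, max_depth=3, depth=0):
    """rows: list of (feature_dict, label) pairs."""
    labels = [y for _, y in rows]
    # Stopping criteria: pure node, no features left, or max depth reached.
    if len(set(labels)) == 1 or not features or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]           # leaf: majority label
    best = max(features, key=lambda f: information_gain(rows, f))   # step 1
    node = {"feature": best, "branches": {}}
    for value in sorted({fd[best] for fd, _ in rows}):              # step 2
        subset = [(fd, y) for fd, y in rows if fd[best] == value]
        node["branches"][value] = build_tree(                       # step 3
            subset, [f for f in features if f != best], max_depth, depth + 1)
    return node
```

Each recursive call handles one subset of the data, so the tree grows until every branch hits a stopping criterion.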

Types of Decision Trees
  • Binary Trees: Each node has at most two children.

  • Ternary Trees: Each node has at most three children.

  • N-ary Trees: Each node may have up to N children, for arbitrary N.

Measures of Impurity
  • Impurity measures the degree of heterogeneity in a dataset. Common measures include:

    • Gini Index: The probability of misclassifying a randomly chosen sample if it were labeled according to the node's class distribution.

    • Entropy: Measures the uncertainty in the data; goal is to minimize it during splits.
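Both measures follow directly from the class proportions p_i at a node: Gini = 1 − Σ p_i² and entropy = −Σ p_i log₂ p_i. A short sketch:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 for a pure node, higher when mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def shannon_entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)); 0 for a pure node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

For a pure node both measures are 0; a 50/50 binary node gives a Gini of 0.5 and an entropy of 1.0, so both peak when the classes are maximally mixed.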

Decision Tree Splitting
  • Splits are evaluated based on impurity measures:

    • Gini impurity and entropy guide the selection of the best attribute to split on.

  • Information Gain (IG) indicates how much entropy is reduced by a particular split.

Selecting the Root Node
  • The root node is chosen based on the feature that maximizes Information Gain, thus reducing uncertainty the most.
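A minimal sketch of that selection on toy weather data (the feature names and labels are hypothetical, chosen for illustration): compute the Information Gain of each candidate feature and take the maximum.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted entropy of the children after the split."""
    groups = {}
    for fd, y in zip(rows, labels):
        groups.setdefault(fd[feature], []).append(y)
    child = sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())
    return entropy(labels) - child

# Hypothetical data: "cloudy" perfectly separates the labels, "windy" does not.
rows = [{"cloudy": "yes", "windy": "no"},
        {"cloudy": "yes", "windy": "yes"},
        {"cloudy": "no",  "windy": "yes"},
        {"cloudy": "no",  "windy": "no"}]
labels = ["umbrella", "umbrella", "none", "none"]

root = max(["cloudy", "windy"], key=lambda f: information_gain(rows, labels, f))
# "cloudy" yields IG 1.0 (its children are pure); "windy" yields IG 0.0.
```

Here "cloudy" becomes the root because splitting on it removes all remaining uncertainty about the label.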

Characteristics of Decision Trees
  • Computationally inexpensive to build.

  • Fast classification for new records.

  • Intuitive and easy to interpret for non-technical stakeholders.

  • Robust against noise in data, especially with pruning to prevent overfitting.

  • Usable with both numerical and categorical data.

Pros of Decision Trees
  • Easy to Understand: The structure is intuitive, making it easy for non-technical stakeholders to interpret.

  • Minimal Data Preparation: They do not require normalization or scaling of features.

  • Handles Both Numerical and Categorical Data: Decision trees can work with different types of data, which makes them versatile.

  • Robust: They can handle missing values and are less affected by outliers.

  • Visual Representation: The flowchart-like structure helps to visualize the decision-making process clearly.

Cons of Decision Trees
  • Overfitting: They can create overly complex trees that do not generalize well to unseen data if not properly pruned.

  • Instability: Small changes in data can result in a completely different structure, making the model sensitive.

  • Bias with Imbalanced Data: Decision trees can be biased if one class dominates the dataset, leading to poor predictions for minority classes.

  • Limited Expressiveness: They are less effective for capturing complex relationships compared to other algorithms like ensemble methods or neural networks.

When to Use Decision Trees
  • When Interpretability is Crucial: If stakeholders need to understand the decision-making process clearly.

  • For Preliminary Data Analysis: To gather insights about the data and explore relationships before applying more complex models.

  • If the Data is Mixed-Type: When there are both categorical and numerical features in the dataset.

  • In Cases of Missing Values: They can work well when some data points are incomplete.

  • In Applications Requiring Fast Inference: When predictions need to be made quickly, such as real-time systems.