Machine Learning Notes

MACHINE LEARNING

  • Definition: Machine Learning (ML) allows computers to learn from data without being explicitly programmed.
    • Alan Turing framed the founding question in 1950: "Can machines think?"
    • Example: Turing Test - A human judge interacts with a human and a computer to distinguish between them. If the judge cannot consistently tell which is which, the computer passes the test.

KEY DEFINITIONS

  • Arthur Samuel (1959): Defined ML as the field of study that gives computers the ability to learn without being explicitly programmed.
  • T. Mitchell: A computer program is said to learn from experience E with respect to tasks T and performance measure P if its performance at T, as measured by P, improves with experience E.
  • Applications of ML:
    • Spam detection
    • Voice recognition
    • Stock trading
    • Robotics
    • Healthcare diagnostics
    • SEO and e-commerce optimization

TYPES OF MACHINE LEARNING

SUPERVISED LEARNING

  • Involves an input-output pairing where the correct output is known.
  • Types of problems:
    • Regression: Predicting a continuous output.
      • Example: Predicting house price based on size.
    • Classification: Predicting discrete categories.
      • Example: Classifying a tumor as benign or malignant.

UNSUPERVISED LEARNING

  • The model finds hidden patterns in data without labeled responses.
  • No right or wrong answers, used to explore data and discover patterns.
  • Clustering: Grouping data based on similarity.
    • Example: Classifying genes based on various characteristics.
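Clustering can be sketched with a minimal 1-D k-means loop (a toy illustration, not tied to any example in these notes): assign each point to its nearest center, then move each center to the mean of its assigned points.

```python
def kmeans_1d(points, centers, iters=10):
    """Toy 1-D k-means: alternate nearest-center assignment
    and center-update steps for a fixed number of iterations."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign p to the index of the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two clear groups near 1.0 and 9.0; centers converge near [1.0, 9.0].
print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0]))
```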

FUNCTIONALITY IN SUPERVISED LEARNING

MODELING

  • Models define the relationship between input (X) and output (Y).
  • Training set is represented as pairs of input and output: ( \{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\} ).

HYPOTHESIS

  • A function ( h: X \rightarrow Y ) that predicts the output.
  • Example of hypothesis in linear regression:
    • ( h(x) = mx + c ), where ( m ) is the slope and ( c ) is the intercept.

LOSS FUNCTION

  • Measures accuracy of predictions by comparing predicted values to actual outcomes.
  • Often minimized using techniques like gradient descent.
  • For example, the Mean Squared Error is calculated as:
    \frac{1}{n} \sum_{i=1}^{n} (h(x^{(i)}) - y^{(i)})^2
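The MSE above can be computed directly (a minimal sketch with a hypothetical hypothesis function `h`):

```python
def mse(h, xs, ys):
    """Mean squared error of hypothesis h over paired samples (xs, ys)."""
    n = len(xs)
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / n

# h(x) = 2x fits y = 2x exactly, so the error is zero.
print(mse(lambda x: 2 * x, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 0.0
```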

GRADIENT DESCENT

  • An optimization algorithm used to find the minimum of a function.
  • Iteratively adjusts parameters to reduce the loss function value.
  • Involves calculating the derivative to find the slope of the cost function and adjusting with learning rate ( \alpha ).
  • Formula:
    \theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{dJ(\theta)}{d\theta}
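The update rule can be sketched for the one-variable hypothesis ( h(x) = mx + c ) with the MSE cost; the learning rate and step count here are illustrative choices:

```python
def gradient_descent(xs, ys, alpha=0.01, steps=5000):
    """Fit h(x) = m*x + c by gradient descent on the MSE cost J(m, c)."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of J = (1/n) * sum((m*x + c - y)^2).
        grad_m = (2 / n) * sum((m * x + c - y) * x for x, y in zip(xs, ys))
        grad_c = (2 / n) * sum((m * x + c - y) for x, y in zip(xs, ys))
        # theta_new = theta_old - alpha * dJ/dtheta, per parameter.
        m -= alpha * grad_m
        c -= alpha * grad_c
    return m, c

# Data drawn from y = 2x + 1; the fit converges near m = 2, c = 1.
m, c = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])
print(m, c)
```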

MULTIVARIATE REGRESSION

  • Extends linear regression to multiple input features:
    • For multiple features: ( h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n )
  • Gradient descent then adapts accordingly to multiple variables.

POLYNOMIAL REGRESSION

  • Used when data does not fit a linear model.
  • Example: Quadratic function ( ax^2 + bx + c ).

LOSS FUNCTION IN POLYNOMIAL REGRESSION

  • Similar to linear regression; evaluates differences but considers polynomial degrees.
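Polynomial regression is linear regression on expanded features: each scalar x becomes the vector [1, x, x², …], and the hypothesis is the dot product with the coefficients. A minimal sketch (function names are illustrative):

```python
def poly_features(x, degree):
    """Expand a scalar x into [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

def h(theta, x, degree=2):
    """Polynomial hypothesis: theta dotted with the expanded features."""
    return sum(t * f for t, f in zip(theta, poly_features(x, degree)))

# theta = [c, b, a] encodes a*x^2 + b*x + c; at x = 2: 3 + 2*2 + 1*4 = 11.
print(h([3.0, 2.0, 1.0], 2))  # 11.0
```

With this expansion, the same loss function and gradient descent used for linear regression apply unchanged.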

LOGISTIC REGRESSION

  • Used for binary classification tasks (output is discrete).
  • Uses the sigmoid function to convert any value into a probability between 0 and 1:
    h(x) = g(\theta^T x) = \frac{1}{1 + e^{-z}} \text{, where } z = \theta^T x
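The sigmoid hypothesis above is a few lines of code (a sketch; the feature vector x is assumed to include a leading 1 for the intercept):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic hypothesis h(x) = g(theta^T x) for feature vector x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0))  # 0.5
```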

COST FUNCTION FOR LOGISTIC REGRESSION

  • The cost function differs from linear regression due to non-linear output, defined as:
    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]
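This cost (often called cross-entropy or log loss) averages the per-sample terms over the m training pairs; a self-contained sketch:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(theta, xs, ys):
    """Logistic-regression cost J(theta) averaged over m labelled samples."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / m

# Confident, correct predictions give a cost near zero.
print(cross_entropy([5.0], [[1.0], [-1.0]], [1, 0]))
```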

OVERFITTING & UNDERFITTING

  • Overfitting: Model learns noise in the training data and performs poorly on unseen data.
    • Addressed through regularization, which simplifies the model by penalizing large coefficients.
  • Underfitting: Model is too simplistic to capture the underlying structure of the data.

CATEGORICAL DATA HANDLING

  • Machine learning algorithms require numerical input, necessitating encoding:
    • One-Hot Encoding: Converts categorical variables into binary variables.
  • Example: For "color" variable with categories like "red," "green," and "blue," represent them as three binary variables.

MULTICLASS CLASSIFICATION

  • Extends binary classification using the one-vs-all strategy: train one binary classifier per class, then predict the class whose classifier is most confident.

REGULARIZATION TECHNIQUES

  • Used to reduce overfitting while preserving all features:
    • L1 Regularization (Lasso): Adds an absolute value penalty on the size of coefficients.
    • L2 Regularization (Ridge): Adds the squared values of the coefficients to the loss function, shrinking them toward zero and yielding a simpler hypothesis that is less prone to overfitting.
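The L2-regularized cost can be sketched by adding the penalty term to the MSE (the regularization strength `lam` is a hyperparameter; by the usual convention the bias term is not penalized):

```python
def ridge_cost(theta, xs, ys, lam):
    """MSE plus an L2 penalty lam * sum(theta_j^2) on the coefficients.

    xs are feature vectors with a leading 1 for the intercept theta[0],
    which is excluded from the penalty."""
    m = len(xs)
    mse = sum(
        (sum(t * xi for t, xi in zip(theta, x)) - y) ** 2
        for x, y in zip(xs, ys)
    ) / m
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return mse + penalty

# With lam = 0 this is plain MSE; lam > 0 adds the coefficient penalty.
print(ridge_cost([0.0, 2.0], [[1, 1], [1, 2]], [2, 4], 0.5))
```

An L1 (Lasso) variant would replace the squared terms with absolute values, `lam * sum(abs(t) for t in theta[1:])`.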