Machine Learning Notes

Module 1: Introduction to Machine Learning

Part 1: Machine Learning vs. Statistical Learning

  • Focus:

    • Statistical Learning: Hypothesis testing and interpretability.

    • Machine Learning: Predictive accuracy.

  • Driver:

    • Statistical Learning: Math, theory, hypothesis.

    • Machine Learning: Fitting data.

  • Data Size:

    • Statistical Learning: Any reasonable set.

    • Machine Learning: Big data.

  • Data Type:

    • Statistical Learning: Structured.

    • Machine Learning: Structured, unstructured, semi-structured.

  • Dimensions/Scalability:

    • Statistical Learning: Mostly low dimensional data.

    • Machine Learning: High dimensional data.

  • Model Choice:

    • Statistical Learning: Parameter significance & in-sample goodness of fit.

    • Machine Learning: Cross-validation of predictive accuracy on partitions of data.

  • Interpretability:

    • Statistical Learning: High.

    • Machine Learning: Low.

  • Strength:

    • Statistical Learning: Understand causal relationship & behavior.

    • Machine Learning: Prediction (forecasting and nowcasting).
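
The "Model Choice" contrast above can be illustrated with a small sketch: the same line is scored once on the data it was fitted to (in-sample fit) and once by k-fold cross-validation (out-of-fold error on partitions of the data). The data are synthetic, and the function names `fit_line`, `mse`, and `k_fold_mse` are hypothetical helpers introduced only for this illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(a, b, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=5):
    """Average out-of-fold MSE: fit on k-1 folds, score on the held-out fold."""
    idx = list(range(len(xs)))
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(a, b, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / k

# Synthetic data (an assumption for illustration): y = 2 + 0.5*x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

a, b = fit_line(xs, ys)
in_sample_mse = mse(a, b, xs, ys)       # statistical-learning style: goodness of fit
cv_mse = k_fold_mse(xs, ys, k=5)        # machine-learning style: predictive accuracy
print(f"slope={b:.3f}  in-sample MSE={in_sample_mse:.3f}  5-fold CV MSE={cv_mse:.3f}")
```

The cross-validated error is the more honest estimate of how the model would perform on unseen data, because each point is scored by a model that never saw it.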

The Big Picture
  • As researchers or practitioners, our goal is to solve real-world problems through inference or prediction.

  • Examples of relationships to explore:

    • Sales and advertisement/R&D expenditure/seasonality/industry.

    • Quantity demanded and price/income/technology/price of competitors.

    • Wage and education/age/gender/experience.

Simple Example: Quantifying Wage Components
  • Drivers: Education, age, experience, IQ, ethnicity, race, gender, industry, location, working hours.

  • Linear model example:
    $wage = \beta_0 + \beta_1 educ + \beta_2 age + \beta_3 exper + \beta_4 IQ + \dots + \beta_k hours + u$

  • Considerations:

    • Interpretability of the model.

    • Prediction making ability.
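
As a rough sketch of how such a linear wage model could be estimated, the snippet below fits a two-regressor version (education and experience only) by ordinary least squares using the normal equations. Everything numeric here is an assumption: the data are simulated, the "true" coefficients in `true_beta` are made up, and `solve` and `ols` are hypothetical helper names. In practice a library such as statsmodels or scikit-learn would be used instead.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(X, y):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Simulated sample; the coefficients are invented for illustration only.
rng = random.Random(42)
true_beta = [5.0, 1.2, 0.3]  # intercept, educ, exper
X, wage = [], []
for _ in range(500):
    educ = rng.uniform(8, 20)    # years of education
    exper = rng.uniform(0, 30)   # years of experience
    u = rng.gauss(0, 2)          # unobserved error term
    X.append([1.0, educ, exper])
    wage.append(true_beta[0] + true_beta[1] * educ + true_beta[2] * exper + u)

beta_hat = ols(X, wage)
print("estimated [beta_0, beta_educ, beta_exper]:", [round(b, 3) for b in beta_hat])

# Interpretability: each slope is the change in wage per extra unit of the regressor.
# Prediction: plug a new observation into the fitted equation.
predicted = beta_hat[0] + beta_hat[1] * 16 + beta_hat[2] * 5
print(f"predicted wage for educ=16, exper=5: {predicted:.2f}")
```

The two print statements mirror the two considerations: the coefficients are read for interpretation, and the fitted equation is used for prediction.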

Different Example: Cat vs. Dog Classification (Image Recognition)
  • Considerations:

    • Interpretability of the model is not as important.

    • Accuracy of predictions is critical.

Limitations of Econometrics/Structured ML
  • Econometrics/structured ML can only handle structured (tabular) data.

  • Unstructured data includes images, text, audio, and video.

More Complex Example: Stock Price Prediction
  • Classical drivers: Company's fundamentals, competitors, technical analysis, seasonality.

  • Other factors: Market sentiment (news, tweets, blogger opinions), satellite images from parking lots.

Why Learn Machine Learning?
  • Deep learning is prevalent.

  • Better career opportunities.

  • Hedge against the next recession.

Part 2: What is Machine Learning?

  • Machine Learning is a subset of AI that enables computers to learn from data.

  • A machine learning system is trained with algorithms rather than explicitly programmed.

  • ML involves automated detection of meaningful patterns in data and applying those patterns to make predictions on unseen data.

  • The goal is to maximize performance on unseen data and generalize.

Artificial Intelligence vs. Machine Learning vs. Deep Learning
  • Artificial Intelligence: Any technique that enables machines to mimic human behavior (1950s).

  • Machine Learning: A subset of AI that enables computers to learn from data, with models trained by a set of algorithms (1980s).

  • Deep Learning: A subset of ML that extracts patterns from data using neural networks (2010s).

Part 3: Different Types of Machine Learning Algorithms

  • Supervised Learning

  • Unsupervised Learning

  • Reinforcement Learning

Supervised Learning
  • Computers learn to model relationships based on training data where inputs and outputs are labeled.

  • Trained algorithms are used to predict outcomes for test data.

  • Regression:

    1. Predicting stock market returns.

    2. Predicting housing prices.

  • Classification:

    1. Generating buy, sell, hold signals.

    2. Estimating the likelihood of a successful M&A or IPO.

    3. Predicting credit defaults.

    4. Classifying funds or ETFs as winners or losers.
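
The supervised workflow described above (fit on labeled training data, then predict on test data) might look like the following minimal sketch, which uses a k-nearest-neighbors classifier written from scratch. The two features, the "buy"/"sell" labels, and the helper names `knn_predict` and `make_sample` are all assumptions made up for the illustration; the data are synthetic.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

rng = random.Random(1)

def make_sample(n):
    """Synthetic labeled data: two hypothetical features per observation."""
    X, y = [], []
    for _ in range(n):
        label = rng.choice(["buy", "sell"])
        center = 4.0 if label == "buy" else 0.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

train_X, train_y = make_sample(200)   # labeled training data
test_X, test_y = make_sample(100)     # held-out test data

predictions = [knn_predict(train_X, train_y, x) for x in test_X]
test_accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(f"test accuracy: {test_accuracy:.2f}")
```

The classes here are deliberately well separated, so accuracy is high; real trading signals would be far noisier.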

Unsupervised Learning
  • Computers are trained on unlabeled training data, without any guidance, to discover underlying patterns and find groups of samples that behave similarly.

  • Clustering:

    1. Grouping companies into peer groups based on non-standard characteristics.

    2. Client profiling and asset allocation.

    3. Portfolio diversification and stock selection based on co-movement similarities.

  • Dimensionality Reduction:

    1. Identifying the most predictive factors underlying asset price movements (to avoid the factor zoo).
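
To make the clustering idea concrete, here is a minimal k-means sketch that groups unlabeled points without ever seeing a label. The data are synthetic (two artificial "peer groups" described by two hypothetical features), the function name `k_means` is a helper invented for this example, and the farthest-first initialisation is a simplifying assumption chosen to keep the result deterministic; it is not the only or the standard initialisation.

```python
import math
import random

def k_means(points, k, iters=50):
    """Lloyd's algorithm: alternate between assigning points and updating centroids."""
    # Farthest-first initialisation (an assumption; random starts are also common).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels

# Synthetic, unlabeled data: 50 points around (0, 0) and 50 around (6, 6).
rng = random.Random(3)
points = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(50)]
points += [[rng.gauss(6, 0.5), rng.gauss(6, 0.5)] for _ in range(50)]

centroids, labels = k_means(points, k=2)
print("centroids:", [[round(c, 2) for c in cen] for cen in centroids])
```

Note that the number of clusters `k` is chosen by the user; the algorithm returns `k` groups whether or not the data really contain that many.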

Reinforcement Learning
  • A computer (agent) learns from interacting with its environment by producing actions and discovering rewards. The machine explores and exploits to maximize the reward.

  • Example: A virtual trader (agent) follows trading rules (actions) in a market (environment) to maximize profits (reward).
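
The explore/exploit trade-off can be sketched with an epsilon-greedy agent choosing among a few trading rules. This is a deliberately simplified, stateless setting (a multi-armed bandit) rather than a full market environment. The expected profits in `true_means`, the function name `run_bandit`, and all parameter values are assumptions made up for the illustration.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: explore a random rule with probability epsilon,
    otherwise exploit the rule with the best estimated reward so far."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # how often each action was taken
    estimates = [0.0] * n     # running average reward per action
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)                           # explore
        else:
            action = max(range(n), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[action], 1.0)             # noisy profit from the environment
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward

# Hypothetical expected profit of three trading rules (unknown to the agent).
true_means = [0.0, 0.5, 1.0]
estimates, counts, total_reward = run_bandit(true_means)
most_used = max(range(len(counts)), key=lambda a: counts[a])
print("times each rule was used:", counts)
print("estimated reward per rule:", [round(e, 2) for e in estimates])
```

With enough steps, the agent ends up using the most profitable rule most of the time while still occasionally testing the others.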

ML Algorithm Road Map

  • Supervised:

    • Regression: Linear/Polynomial, Penalized regression, KNN, SVR, Tree-based Regression models

    • Classification: Logistic regression, KNN, SVC, Tree-based Classification models

  • Unsupervised:

    • Dimensionality Reduction: Principal Component Analysis (PCA)

    • Clustering: K-Means, Hierarchical
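
As one illustration from the road map, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix. The three "asset return" series are simulated from a single common factor, the loadings are invented, and `first_principal_component` is a hypothetical helper name. Library implementations of PCA typically use a singular value decomposition rather than this simple approach.

```python
import math
import random

def first_principal_component(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix via power iteration,
    plus the share of total variance it explains."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Uniform start vector; assumes it is not orthogonal to the leading component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Simulated returns: one common factor with made-up loadings, plus small noise.
rng = random.Random(5)
loadings = [1.0, 0.8, 0.6]
returns = []
for _ in range(500):
    factor = rng.gauss(0, 1)
    returns.append([b * factor + rng.gauss(0, 0.1) for b in loadings])

component, explained_ratio = first_principal_component(returns)
print("first component:", [round(c, 3) for c in component])
print(f"share of variance explained: {explained_ratio:.3f}")
```

Because the three series are driven by one factor, a single component captures almost all of the variance, which is the sense in which PCA reduces dimensionality.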

GitHub Modules

  • Module 1: Introduction to Machine Learning

  • Module 2: Setting up Machine Learning Environment

  • Module 3: Linear Regression (Econometrics approach)

  • Module 4: Machine Learning Fundamentals

  • Module 5: Linear Regression (Machine Learning approach)

  • Module 6: Penalized Regression (Ridge, LASSO, Elastic Net)

  • Module 7: Logistic Regression

  • Module 8: K-Nearest Neighbors (KNN)

  • Module 9: Classification and Regression Trees (CART)

  • Module 10: Bagging and Boosting

  • Module 11: Dimensionality Reduction (PCA)

  • Module 12: Clustering (KMeans – Hierarchical)

Warning: An ML algorithm will always find a pattern, even if there is none.
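
A small experiment can make this warning tangible. In the sketch below, features and labels are pure random noise, so there is nothing to learn, yet a 1-nearest-neighbor classifier still "finds" a pattern that fits the training data perfectly. Only the test data reveal that the pattern is spurious. The helper names `random_sample`, `predict_1nn`, and `accuracy` are hypothetical, and the sample sizes are arbitrary choices.

```python
import math
import random

rng = random.Random(7)

def random_sample(n):
    """Features and labels drawn independently: no real relationship exists."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [rng.choice([0, 1]) for _ in range(n)]
    return X, y

def predict_1nn(train_X, train_y, x):
    """Return the label of the single closest training point."""
    i = min(range(len(train_X)), key=lambda j: math.dist(train_X[j], x))
    return train_y[i]

def accuracy(train_X, train_y, X, y):
    """Share of points in (X, y) that the 1-NN model labels correctly."""
    hits = sum(predict_1nn(train_X, train_y, x) == t for x, t in zip(X, y))
    return hits / len(y)

train_X, train_y = random_sample(200)
test_X, test_y = random_sample(400)

train_acc = accuracy(train_X, train_y, train_X, train_y)  # each point is its own nearest neighbor
test_acc = accuracy(train_X, train_y, test_X, test_y)     # should hover around chance
print(f"train accuracy: {train_acc:.2f}  test accuracy: {test_acc:.2f}")
```

This is why performance should always be judged on data the model has not seen.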