Mathematics for Machine Learning - Comprehensive Notes

Contents

  • Foreword

  • Part I: Mathematical Foundations

    • Introduction and Motivation

    • Linear Algebra

    • Analytic Geometry

    • Matrix Decompositions

    • Vector Calculus

    • Probability and Distributions

    • Continuous Optimization

  • Part II: Central Machine Learning Problems

    • When Models Meet Data

    • Linear Regression

    • Dimensionality Reduction with Principal Component Analysis

    • Density Estimation with Gaussian Mixture Models

    • Classification with Support Vector Machines

Foreword

  • Machine learning aims to distill human knowledge and reasoning into machines.

  • Software packages abstract low-level technical details but can lead to unawareness of design decisions and algorithm limits.

  • Practitioners need programming, data analysis, large-scale computation, mathematics, and statistics.

  • Machine learning textbooks assume mathematical and statistical competence.

  • This book narrows the mathematics skills gap, focusing on the mathematical foundations of basic machine learning.

Why Another Book on Machine Learning?

  • Mathematics formalizes intuitive concepts, clarifying problem-solving.

  • Machine learning motivates learning mathematics by providing direct relevance to practical problems.

  • This book serves as a guidebook to mathematical literature relevant to machine learning.

  • It motivates mathematical concepts by demonstrating their use in fundamental machine learning problems.

  • It provides a brief, precise glimpse of machine learning for readers with a mathematical background.

  • In contrast to other books, this book focuses on the mathematical concepts behind the models themselves.

  • The goal is to provide a deeper understanding of basic questions in machine learning and connect practical questions with mathematical modeling.

  • It intends to provide the mathematical background to make it easier to read other machine learning textbooks.

Target Audience

  • Anyone who believes that everybody should have some understanding of machine learning's underlying principles.

  • Readers should have high school mathematics and physics knowledge, including derivatives, integrals, and geometric vectors.

  • The target audience includes undergraduate university students, evening learners, and online machine learning course participants.

  • Three types of interaction with machine learning:

    • Astute Listener: Users benefit from machine learning by using open-source software and cloud-based tools to extract meaningful insights from data. They can discern between different types of machine learning and think about ethics, fairness, and privacy.

    • Experienced Artist: Skilled practitioners can plug and play different tools and libraries into an analysis pipeline, understanding machine learning interfaces and their use cases. They can extend and generalize existing machine learning algorithms.

    • Fledgling Composer: Developers need to develop new methods and extend existing algorithms, understanding the mathematical basis of machine learning and uncovering relationships between different tasks. They are often researchers who are able to propose and explore novel approaches for attacking the many challenges of learning from data.

Table of Symbols

  • Scalars: $a, b, c, \alpha, \beta, \gamma$

  • Vectors: $\mathbf{x}, \mathbf{y}, \mathbf{z}$

  • Matrices: $\mathbf{A}, \mathbf{B}, \mathbf{C}$

  • Transpose: $\mathbf{x}^T, \mathbf{A}^T$

  • Inverse: $\mathbf{A}^{-1}$

  • Inner product: $\langle \mathbf{x}, \mathbf{y} \rangle$

  • Dot product: $\mathbf{x}^T \mathbf{y}$

  • Tuple: $B = (b_1, b_2, b_3)$

  • Matrix of column vectors: $B = [b_1, b_2, b_3]$

  • Set of vectors: $B = \{b_1, b_2, b_3\}$

  • Integers/Natural numbers: $\mathbb{Z}, \mathbb{N}$

  • Real/Complex numbers: $\mathbb{R}, \mathbb{C}$

  • $n$-dimensional vector space of real numbers: $\mathbb{R}^n$

  • For all $x$: $\forall x$

  • There exists $x$: $\exists x$

  • $a$ is defined as $b$: $a := b$

  • $b$ is defined as $a$: $a =: b$

  • $a$ is proportional to $b$: $a \propto b$

  • Function composition: $g \circ f$

  • If and only if: $\Leftrightarrow$

  • Implies: $\Rightarrow$

  • Sets: $A, C$

  • $a$ is an element of the set $A$: $a \in A$

  • Empty set: $\emptyset$

  • Number of dimensions: $D$

  • Number of data points: $N$

  • Identity matrix of size $m \times m$: $\mathbf{I}_m$

  • Matrix of zeros of size $m \times n$: $\mathbf{0}_{m,n}$

  • Matrix of ones of size $m \times n$: $\mathbf{1}_{m,n}$

  • Standard/canonical vector: $\mathbf{e}_i$

  • Dimensionality of a vector space: $\dim$

  • Rank of matrix $\mathbf{A}$: $\text{rk}(\mathbf{A})$

  • Image of linear mapping $\Phi$: $\text{Im}(\Phi)$

  • Kernel (null space) of a linear mapping $\Phi$: $\ker(\Phi)$

  • Span of $b_1$: $\text{span}[b_1]$

  • Trace of $\mathbf{A}$: $\text{tr}(\mathbf{A})$

  • Determinant of $\mathbf{A}$: $\det(\mathbf{A})$

  • Absolute value or determinant: $|\cdot|$

  • Norm: $\|\cdot\|$

  • Eigenvalue or Lagrange multiplier: $\lambda$

  • Eigenspace corresponding to eigenvalue $\lambda$: $E_\lambda$

  • Parameter vector: $\boldsymbol{\theta}$

  • Partial derivative of $f$ with respect to $x$: $\frac{\partial f}{\partial x}$

  • Total derivative of $f$ with respect to $x$: $\frac{df}{dx}$

  • Gradient: $\nabla$

  • Lagrangian: $\mathfrak{L}$

  • Negative log-likelihood: $\mathcal{L}$

  • Binomial coefficient: $\binom{n}{k}$

  • Variance of $x$ with respect to the random variable $X$: $\mathbb{V}_X[x]$

  • Expectation of $x$ with respect to the random variable $X$: $\mathbb{E}_X[x]$

  • Covariance between $x$ and $y$: $\text{Cov}_{X,Y}[x, y]$

  • $X$ is conditionally independent of $Y$ given $Z$: $X \perp\!\!\perp Y \mid Z$

  • Random variable $X$ is distributed according to $p$: $X \sim p$

  • Gaussian distribution with mean $\mu$ and covariance $\Sigma$: $\mathcal{N}(\mu, \Sigma)$

  • Bernoulli distribution with parameter $\mu$: $\text{Ber}(\mu)$

  • Binomial distribution with parameters $N, \mu$: $\text{Bin}(N, \mu)$

  • Beta distribution with parameters $\alpha, \beta$: $\text{Beta}(\alpha, \beta)$
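The distribution notation above can be illustrated with a short NumPy sketch (a minimal example, assuming NumPy is available; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# X ~ N(mu, Sigma): bivariate Gaussian with mean mu and covariance Sigma
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = rng.multivariate_normal(mu, Sigma)   # one 2-D sample

# X ~ Ber(mu): a single trial with success probability mu = 0.3
b = rng.binomial(n=1, p=0.3)

# X ~ Bin(N, mu): number of successes in N = 10 trials
k = rng.binomial(n=10, p=0.3)

# X ~ Beta(alpha, beta): samples lie in (0, 1)
t = rng.beta(2.0, 5.0)

print(x.shape, b, k, round(t, 3))
```

Sampling from each distribution makes the role of the parameters concrete: the Gaussian sample is a 2-D vector, the Bernoulli sample is a single 0/1 outcome, and the Beta sample always falls in the unit interval.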

Table of Abbreviations and Acronyms

  • e.g. - Exempli gratia (for example)

  • GMM - Gaussian mixture model

  • i.e. - Id est (this means)

  • i.i.d. - Independent, identically distributed

  • MAP - Maximum a posteriori

  • MLE - Maximum likelihood estimation/estimator

  • ONB - Orthonormal basis

  • PCA - Principal component analysis

  • PPCA - Probabilistic principal component analysis

  • REF - Row-echelon form

  • SPD - Symmetric, positive definite

  • SVM - Support vector machine

Part I: Mathematical Foundations

  • Presents mathematical concepts required for machine learning.

1 Introduction and Motivation
  • Machine learning is about automatically extracting valuable information from data.

  • Core concepts: data, model, and learning.

  • Data is at the core of machine learning.

  • The goal is to extract valuable patterns from data without much domain-specific expertise.

  • Models describe the process that generates data.

  • Learning is automatically finding patterns and structure in data by optimizing the parameters of the model.

  • Mathematical foundations facilitate creating new solutions, debugging approaches, and understanding limitations.

1.1 Finding Words for Intuitions
  • Concepts and words can be ambiguous in machine learning.

  • The word “algorithm” can mean a system that makes predictions or a system that adapts internal parameters.

  • Predictors are algorithms that make predictions based on input data.

  • Training refers to a system that adapts some internal parameters of the predictor.

  • Data often needs to be converted into numerical format.

  • Data is considered as vectors, which can be arrays of numbers, arrows with direction and magnitude, or objects obeying addition and scaling.

  • Models describe a process for generating data, as simplified versions of the real data-generating process.

  • Learning involves training the model by optimizing its parameters with respect to a utility function.

  • Training data is not the same as unseen data; over-memorizing the training data usually hurts performance on unseen data.

1.2 Two Ways to Read This Book
  • Two strategies for understanding mathematics for machine learning:

    • Bottom-up: Building concepts from foundational to advanced.

    • Top-down: Drilling down from practical needs to basic requirements.

  • The book is modular to separate foundational concepts from applications.

  • Part I lays the mathematical foundations while Part II applies these foundations.

  • Part I:

    • Requires a solid mathematical foundation.

    • Introduces numerical data as vectors and tables of data as matrices (linear algebra).

    • Discusses how the idea of similarity between vectors requires operations that return numerical values (analytic geometry).

    • Introduces matrix operations and matrix decomposition, emphasizing their usefulness in machine learning.

    • Covers quantifying noise and uncertainty, the realm of probability theory.

    • Explains that gradients enable optimization using vector calculus.

  • Part II:

    • Applies the mathematical concepts.

    • Orders chapters by difficulty.

    • Restates machine learning components (data, models, parameter estimation) mathematically.

    • Discusses linear regression, including parameter estimation, maximum likelihood, and Bayesian linear regression.

    • Features dimensionality reduction using principal component analysis.

    • Focuses on Gaussian mixture models for density estimation and iterative parameter finding.

    • Concludes with support vector machines for classification.

1.3 Exercises and Feedback
  • Exercises in Part I are mostly meant to be done with pen and paper.

  • Part II provides programming tutorials.

  • Encourages readers to download the book freely at https://mml-book.com.

  • Mistakes can be reported and feedback provided using the preceding URL.

2 Linear Algebra

  • Algebra studies objects and rules, while linear algebra focuses on vectors and their manipulations.

  • Vectors are objects that can be added and multiplied by scalars.

  • Geometric vectors may be familiar from high school math and physics and are directed segments.

  • Polynomials are vectors with addition and scalar multiplication.

  • Audio signals are vectors represented as a series of numbers.

  • Elements of $\mathbb{R}^n$ (tuples of $n$ real numbers) are vectors.

  • Linear algebra focuses on the similarities between these vector concepts.

  • A vector space is the set of vectors that can be obtained by starting with a small set of vectors, adding them to each other, and scaling them.

  • Linear Algebra series by 3Blue1Brown and Gilbert Strang’s course are other resources.

  • Linear algebra helps with linear equations, geometry, vector calculus, and reducing dimensions with PCA.

2.1 Systems of Linear Equations
  • Many problems are formulated as systems of linear equations.

  • Example: optimizing production of products $N_1, \dots, N_n$ with limited resources $R_1, \dots, R_m$:

    • $a_{ij}$: units of resource $R_i$ needed to produce one unit of product $N_j$.

    • $x_j$: units of product $N_j$ to be produced.

    • $b_i$: total available units of resource $R_i$.

    • The resource constraints form the system

      $a_{11}x_1 + \dots + a_{1n}x_n = b_1$

      $\vdots$

      $a_{m1}x_1 + \dots + a_{mn}x_n = b_m$

  • $x_1, \dots, x_n$ are the unknowns of the system.

  • Any $(x_1, \dots, x_n) \in \mathbb{R}^n$ that satisfies all equations simultaneously is a solution.

  • Types of solutions: no solution, unique solution, or infinitely many solutions.

  • System of equations expressed in matrix form: $\mathbf{A}\mathbf{x} = \mathbf{b}$
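As a concrete sketch of the matrix form (using NumPy with an arbitrary 2×2 example), a system with a unique solution can be solved directly:

```python
import numpy as np

# Two equations in two unknowns:
#   x1 + 2*x2 = 5
#   3*x1 + 4*x2 = 6
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 6.0])

# det(A) = -2 != 0, so the system has a unique solution
x = np.linalg.solve(A, b)
print(x)   # [-4.   4.5]

# The solution satisfies A x = b
assert np.allclose(A @ x, b)
```

Whether the system has no solution, a unique solution, or infinitely many solutions depends on $\mathbf{A}$ and $\mathbf{b}$; `np.linalg.solve` handles only the unique-solution case (square, invertible $\mathbf{A}$).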

2.2 Matrices
  • Matrices represent linear equations and linear functions (mappings).

  • An $(m, n)$ matrix $\mathbf{A}$ is an $m \cdot n$-tuple of elements $a_{ij}$, arranged in a rectangular scheme of $m$ rows and $n$ columns:

    $$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \quad a_{ij} \in \mathbb{R}$$

  • Rows are $(1, n)$-matrices called row vectors.

  • Columns are $(m, 1)$-matrices called column vectors.

  • Matrix addition: Element-wise sum of matrices of the same size.

  • Matrix multiplication: the elements $c_{ij}$ of $\mathbf{C} = \mathbf{A}\mathbf{B}$ are computed as $c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}$

  • Matrix multiplication is not commutative: in general, $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$.

  • Identity matrix: In is the n × n-matrix with 1 on the diagonal and 0 everywhere else.

  • Matrix Properties:

    • Associativity: (AB)C = A(BC)

    • Distributivity: (A + B)C = AC + BC and A(C + D) = AC + AD

    • Multiplication with the identity matrix: $\mathbf{I}_m \mathbf{A} = \mathbf{A} \mathbf{I}_n = \mathbf{A}$
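These properties can be checked numerically; here is a small sketch with NumPy and arbitrary 2×2 matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
C = np.array([[2, 0],
              [0, 2]])
I = np.eye(2, dtype=int)

# Non-commutativity: AB != BA in general
assert not np.array_equal(A @ B, B @ A)

# Associativity: (AB)C = A(BC)
assert np.array_equal((A @ B) @ C, A @ (B @ C))

# Distributivity: (A + B)C = AC + BC
assert np.array_equal((A + B) @ C, A @ C + B @ C)

# Multiplication with the identity matrix: I A = A I = A
assert np.array_equal(I @ A, A) and np.array_equal(A @ I, A)
```

Note that the assertions verify the properties only for these particular matrices; the properties themselves hold for all matrices of compatible sizes.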

2.2.2 Inverse and Transpose
  • The inverse of a square matrix $\mathbf{A}$ is the matrix $\mathbf{A}^{-1}$ such that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_n = \mathbf{A}^{-1}\mathbf{A}$. Not every matrix has an inverse; those that do are called regular, invertible, or nonsingular.
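A minimal NumPy sketch of the inverse, using an arbitrary invertible matrix and a singular one for contrast:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # det(A) = -2 != 0, so A is invertible (regular)
A_inv = np.linalg.inv(A)

# A A^{-1} = I = A^{-1} A
assert np.allclose(A @ A_inv, np.eye(2))
assert np.allclose(A_inv @ A, np.eye(2))

# A matrix with linearly dependent rows has no inverse
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # second row = 2 * first row, det(S) = 0
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError:
    print("S is singular: no inverse exists")
```

`np.linalg.inv` raises `LinAlgError` on singular input, mirroring the fact that only regular matrices have an inverse.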