Linear Algebra and Geometric Interpretations

Dot Product

  • Encodes the angle between two vectors (made precise in the geometric interpretation below).
  • Given two vectors, take the transpose of the first vector (making it a row vector).
  • Multiply the transposed vector with the second vector.
  • This involves multiplying corresponding terms in each vector and summing the results.
  • The dot product results in a single number (scalar).
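A minimal NumPy sketch of the recipe above (the vectors are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# x^T y: multiply corresponding entries, then sum -> a single scalar
dot = x @ y            # equivalent to np.sum(x * y)
```

Here `dot` is 1*4 + 2*5 + 3*6 = 32, a single number rather than a vector.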

Geometric Interpretation

  • In two dimensions, consider vector x and vector y.
  • \theta_x is the angle vector x makes with the horizontal axis.
  • x_1 = ||x||_2 \cos(\theta_x), where ||x||_2 is the two-norm (magnitude) of x.
  • x_2 = ||x||_2 \sin(\theta_x)
  • Similarly, for vector y: y_1 = ||y||_2 \cos(\theta_y) and y_2 = ||y||_2 \sin(\theta_y)
  • x_1 y_1 + x_2 y_2 = ||x||_2 ||y||_2 \cos(\theta_x) \cos(\theta_y) + ||x||_2 ||y||_2 \sin(\theta_x) \sin(\theta_y)
  • By the cosine difference identity, this simplifies to ||x||_2 ||y||_2 \cos(\theta_x - \theta_y): the product of the magnitudes times the cosine of the angle between the vectors.
  • If the angle is zero, the dot product is the product of the magnitudes.
  • If the angle is 90 degrees (orthogonal), the dot product is zero, since \cos(90^\circ) = 0.
  • When both vectors are normalized (two-norm equal to one), the dot product is exactly the cosine of the angle between them.
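A quick NumPy check of both special cases (example vectors are illustrative):

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0]) / np.sqrt(2.0)   # normalized: ||y||_2 = 1

# For unit vectors the dot product is cos(theta) directly
cos_theta = x @ y
theta_deg = np.degrees(np.arccos(cos_theta))

z = np.array([0.0, 1.0])   # orthogonal to x, so the dot product is 0
```

Here `theta_deg` comes out to 45 degrees, and `x @ z` is zero, matching cos(90°) = 0.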

Geometric Interpretation with Unit Vector

  • If one vector is a unit vector (e.g., \beta) and the other is not (e.g., x), the dot product x^T \beta has a geometric meaning.
  • x^T \beta = ||x||_2 ||\beta||_2 \cos(\theta), where ||\beta||_2 = 1.
  • So, x^T \beta = ||x||_2 \cos(\theta)
  • This represents the length of the projection of x onto \beta.
  • Consider a right triangle formed by dropping a perpendicular from point x onto the \beta axis.
  • The dot product gives the length of the base of this triangle.
  • Projections are fundamental in many reasoning tasks involving points.
  • This concept is crucial when discussing hyperplanes and using linear surfaces for class separation.
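The projection formula can be sketched in NumPy (the point and axis below are illustrative):

```python
import numpy as np

x = np.array([3.0, 4.0])
beta = np.array([1.0, 0.0])    # unit vector: the horizontal axis

# Since ||beta||_2 = 1, x^T beta = ||x||_2 cos(theta):
# the length of the base of the right triangle dropped from x onto beta
proj_len = x @ beta
```

For this x, `proj_len` is 3: the point (3, 4) projects onto the horizontal axis at distance 3 from the origin.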

Matrices as Collections of Points

  • A matrix can be viewed as a collection of points in high-dimensional space, where each column is a point.

Pre-multiplying with a Unit Vector

  • Consider a unit vector \beta and a matrix A.
  • Pre-multiplying A with \beta^T (i.e., \beta^T A) has a geometric interpretation.
  • Algebraically, if A has columns a_1, a_2, \dots, then \beta^T A results in a row vector: [\beta^T a_1, \beta^T a_2, \dots].
  • Each element \beta^T a_i is the projection of point a_i onto the unit vector \beta.
  • This operation projects all points in the matrix onto a single vector \beta, effectively looking at those points in that dimension.
  • This principle underlies Principal Component Analysis (PCA).
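The projection of all columns at once is a single matrix product (the matrix below is illustrative):

```python
import numpy as np

beta = np.array([0.0, 1.0])               # unit vector along the second axis
A = np.array([[1.0, 4.0, 7.0],
              [2.0, 5.0, 8.0]])           # three 2-D points, one per column

# beta^T A: a row of projections [beta^T a_1, beta^T a_2, beta^T a_3]
projections = beta @ A
```

Each of the three points is reduced to its coordinate along `beta`, i.e., the second row of A.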

Matrix-Vector Multiplication Geometry

  • Given a matrix A and a vector x, the product Ax results in a new vector.
  • If A is n \times k and x is k \times 1, the result is an n \times 1 vector.
  • Each entry in the resulting vector is a dot product of a row of A with the vector x.
  • Alternatively, this can be viewed as a linear combination of the columns of A, where the coefficients are the entries of x.
  • Ax = x_1 a_1 + x_2 a_2 + \dots + x_k a_k, where a_i are the columns of A and x_i are the elements of x.
  • Geometrically, you take each column vector of A, scale it by the corresponding entry in x, and then sum all the scaled vectors to get a new point.
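Both views of the product give the same vector, as a small NumPy sketch shows (the matrix is illustrative):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])     # n x k with n = 3, k = 2
x = np.array([2.0, 3.0])

# Row view: each entry of Ax is a row of A dotted with x
row_view = A @ x
# Column view: scale each column by the matching entry of x, then sum
col_view = x[0] * A[:, 0] + x[1] * A[:, 1]
```

Both computations produce the same 3 x 1 result, here (2, 3, 5).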

Vector Space and Rank

  • If a matrix A has only two columns, every linear combination of those columns lies in a single plane, no matter how many rows A has.
  • The set of all linear combinations of the columns of A (their span) forms a vector space.
  • The rank of A is the dimension of this vector space, i.e., the number of independent directions it covers.
  • If all the points lie on a plane, the rank is 2, even if the original space is higher-dimensional.
  • This vector space, the column space of A, is the set of all points that can be reached by Ax over all possible choices of x.
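A concrete rank computation (the matrix is illustrative):

```python
import numpy as np

# Three columns in 3-D, but the third is the sum of the first two,
# so every combination Ax stays inside a 2-D plane
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])

rank = np.linalg.matrix_rank(A)
```

The rank is 2: the column space is a plane inside 3-D space, not all of it.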

Outer Product Geometry

  • The outer product of two vectors x and y (i.e., xy^T) results in a matrix.
  • If x is n \times 1 and y is m \times 1, then xy^T (an n \times 1 matrix times a 1 \times m matrix) is an n \times m matrix.
  • When m = 1, xy^T is just the vector x scaled by the scalar y_1.
  • If y has multiple entries (y_1, y_2, \dots, y_m), each column of the resulting matrix is a scaled version of the vector x: column j is y_j x.
  • All columns of the matrix will lie on the same line passing through the origin.
  • This matrix has a rank of 1 because its vector space is confined to a single dimension.
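A small NumPy check of the outer product's shape and rank (vectors are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # n = 3
y = np.array([4.0, 5.0])        # m = 2

M = np.outer(x, y)              # the n x m matrix x y^T
# Column j of M is y_j * x, so all columns lie on one line through the origin
```

Here M is 3 x 2, its first column is 4x, its second is 5x, and its rank is 1.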

Key Takeaways

  • Vectors are points in high-dimensional space.
  • Dot product and projections.
  • Matrices as collections of points.
  • Matrix-vector multiplication.
  • Vector space and rank.
  • Emphasis on geometric intuition.

Supervised Classification and Hyperplanes

  • Given two classes of points, if they are separated in space, the goal is to find a surface that separates them.
  • In two dimensions, this is a line; in three dimensions, it's a plane; in n dimensions, it's a hyperplane.
  • A hyperplane is an (n-1)-dimensional surface.
  • In n-dimensional space there is a single direction orthogonal to the hyperplane, i.e., orthogonal to every vector lying within it.
  • The hyperplane is fully specified by the unit vector \beta normal to it and the shortest distance \beta_0 from the origin to the hyperplane.

Hyperplane Definition

  • Let x_1 be a point on the hyperplane. Then, \beta^T x_1 = \beta_0.
  • The projection of any point on the hyperplane onto \beta is always \beta_0.
  • The hyperplane L is the set of all points x such that \beta^T x = \beta_0.
  • In two dimensions, this generalizes the equation of a line.

Distance from a Point to a Hyperplane

  • For a point x not on the hyperplane, the shortest distance from x to the hyperplane is given by |\beta^T x - \beta_0|.
  • \beta^T x is the projection of x onto \beta.
  • Subtracting \beta_0 gives the signed distance to the hyperplane.
  • If \beta^T x - \beta_0 = 0, the point lies on the hyperplane.
  • The sign of \beta^T x - \beta_0 indicates which side of the hyperplane the point lies on.
  • If two points have the same sign for \beta^T x - \beta_0, they are on the same side; if they have opposite signs, they are on opposite sides.
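A sketch of the signed-distance test in NumPy (the normal, offset, and test points are illustrative):

```python
import numpy as np

beta = np.array([0.6, 0.8])    # unit normal: 0.6^2 + 0.8^2 = 1
beta0 = 2.0                    # shortest distance from the origin

def signed_distance(x):
    # zero on the hyperplane; the sign says which side x falls on
    return beta @ x - beta0

d_on = signed_distance(np.array([2.0, 1.0]))    # lies on the hyperplane
d_pos = signed_distance(np.array([5.0, 5.0]))   # one side
d_neg = signed_distance(np.array([0.0, 0.0]))   # the other side
```

The point (2, 1) satisfies \beta^T x = \beta_0 and gets distance 0, while the other two points get opposite signs because they lie on opposite sides.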

Applications in Bioinformatics

  • Gene expression or proteomics experiments result in an expression matrix.
  • Samples fall into classes (e.g., ALL and AML leukemia).
  • The problem is to predict the class of a new sample based on its expression values (supervised classification).
  • Another problem is to determine if points intrinsically separate into different classes (unsupervised learning).

Single-Cell Data Analysis

  • Single-cell data provides expression vectors for individual cells.
  • The goal is to cluster cells by type and project them into lower dimensions for visualization.
  • Each cell type should ideally form a distinct cluster.
  • This allows for comparison of cell compositions between samples.

Dimensionality Reduction

  • Given a matrix (collection of points), represent each sample by a single point in a lower dimension.
  • Algebraically, find a direction \beta and pre-multiply the matrix by \beta^T.
  • Geometrically, project all points onto the direction \beta.

Principal Component Analysis (PCA) Intuition

  • The goal is to project points in a way that preserves as much information as possible.
  • If points are projected onto a direction where they cluster tightly, information is lost.
  • The best direction is the one along which the projected points have the maximum variance.
  • Find \beta such that the variance of the projected points is maximized. This \beta is the principal component.

PCA Algebra

  • Compute the mean vector m of all data points.
  • Subtract the mean from each data point to center the data at the origin: x_i \leftarrow x_i - m.
  • Project the centered data points onto \beta: \beta^T (x_i - m).
  • These projected points have a mean of zero.

Variance Calculation

  • The sample variance of the projected points is proportional to \sum_i (\beta^T (x_i - m))^2.
  • The goal is to maximize this variance over all choices of the unit vector \beta.

Maximizing Variance

  • \sum_i (\beta^T (x_i - m))^2 = \sum_i \beta^T (x_i - m) (x_i - m)^T \beta = \beta^T S \beta, where S = \sum_i (x_i - m) (x_i - m)^T is the scatter matrix (the sample covariance matrix up to a constant factor).
  • The problem becomes: Maximize \beta^T S \beta subject to ||\beta||_2 = 1.
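The maximizer of \beta^T S \beta over unit vectors \beta is the eigenvector of S with the largest eigenvalue (a standard fact about Rayleigh quotients, not derived above). A minimal NumPy sketch on synthetic data; note that, unlike the column convention used earlier, points are stored as rows here for convenience:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 synthetic 2-D points, spread mostly along the direction (1, 1)
X = rng.normal(size=(200, 1)) * np.array([1.0, 1.0]) + 0.1 * rng.normal(size=(200, 2))

m = X.mean(axis=0)                 # mean vector
Xc = X - m                         # center the data at the origin
S = Xc.T @ Xc                      # scatter matrix: sum_i (x_i - m)(x_i - m)^T

# beta^T S beta is maximized by the top eigenvector of S
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: S is symmetric
beta = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue

projected = Xc @ beta                  # the 1-D projections beta^T (x_i - m)
```

On this data the recovered `beta` points (up to sign) along (1, 1)/sqrt(2), the direction of maximum spread, and the projected points have mean zero as claimed above.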