Bioinformatics and Linear Algebra Fundamentals
Unsupervised vs. Supervised Classification
- Rows often represent genes or molecules.
- Columns represent experiments.
- Cells contain expression values.
Unsupervised Classification
- Goal: Determine if experiments naturally form classes without prior knowledge.
- Methods: clustering and other unsupervised techniques.
Supervised Classification
- Experiments are pre-classified (e.g., type A and type B).
- Given a new sample of unknown type.
- Goal: Classify the sample as either type A or type B.
Dimensionality Reduction
- Given the class labels, identify a small subset of genes (e.g., 10 genes).
- Goal: Predict class membership from the expression values of this reduced gene set.
Geometric Perspective
- The algebra can be intimidating, but concepts become clear with a geometric perspective.
- This approach aligns with modern bioinformatic thinking.
Leukemia Example: ALL vs. AML
- Classic gene expression example using microarrays (precursor to RNA-seq).
- Two forms of leukemia: ALL (Acute Lymphoblastic Leukemia) and AML (Acute Myeloid Leukemia).
- Distinction is crucial because AML is generally more aggressive, requiring different treatment.
- Gene expression values from patient blood samples are analyzed.
- Most discriminative genes (not all) are used.
- Red = high expression, Blue = low expression.
- Genes exist that are highly expressed in ALL and lowly expressed in AML, and vice versa.
- Noise exists in the data; single genes are insufficient for classification.
- A compendium of genes provides a strong signal for classification.
Classifying Orphan Samples
- Given a new (orphan) sample, how do we classify it as ALL or AML?
- One approach: Count how many genes are highly expressed in ALL vs. AML.
- Another approach: Weighted average of gene expression values in each class, comparing against a threshold.
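The weighted-average approach can be sketched as a linear score compared against a threshold. The gene weights, expression values, and threshold below are illustrative assumptions, not values from the actual ALL/AML study:

```python
# Sketch of a weighted-sum classifier for a new sample.
# Weights are positive for genes high in ALL, negative for genes high in AML.
# All numbers here are hypothetical, chosen only to illustrate the idea.

def classify(expression, weights, threshold=0.0):
    """Return 'ALL' if the weighted sum of expression values
    exceeds the threshold, otherwise 'AML'."""
    score = sum(w * x for w, x in zip(weights, expression))
    return "ALL" if score > threshold else "AML"

weights = [1.0, 0.8, -0.9, -1.1]   # hypothetical per-gene weights
sample  = [2.1, 1.7, 0.3, 0.2]     # hypothetical sample, high on ALL-associated genes
print(classify(sample, weights))   # prints "ALL"
```

In practice the weights would be learned from the pre-classified training samples; here they are fixed by hand.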
Linear Algebra Preliminaries
Two-Gene Example
- Simplify to two genes for visualization.
- Plot each sample in a two-dimensional space.
- X and Y axes represent gene expression values for gene 1 (G1) and gene 2 (G2).
- If samples from different classes are well separated, they are linearly separable.
- In two dimensions, linear separation means a line can divide the classes.
- New samples are classified based on which side of the line they fall.
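Deciding which side of a separating line a new sample falls on amounts to checking the sign of a linear function of the two expression values. The line coefficients below are made up for illustration:

```python
# Decide which side of the line w1*g1 + w2*g2 + b = 0 a sample lies on.
# The coefficients (w1, w2, b) are hypothetical, chosen only to illustrate.

def side_of_line(g1, g2, w1=1.0, w2=1.0, b=-3.0):
    value = w1 * g1 + w2 * g2 + b
    return "class A" if value > 0 else "class B"

print(side_of_line(2.5, 2.0))  # above the line g1 + g2 = 3 -> "class A"
print(side_of_line(0.5, 1.0))  # below it -> "class B"
```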
Higher Dimensions
- Three genes: three-dimensional space.
- Linear separation: a plane.
- In general n-dimensional space: a hyperplane (an (n−1)-dimensional flat subspace).
Non-Linearly Separable Data
- Classes may be separable but not linearly separable.
- Apply transformations to make non-linearly separable data linearly separable in a transformed space.
Example: Concentric Rings
- One class forms a blue ring; the other forms an orange ring.
- No single line can separate them in the original space.
- Transformation idea: Use sine or cosine waves.
- The idea is to leverage the radial structure.
- Represent each point as (x1, x2).
- Apply the transformation: map (x₁, x₂) to x₁² + x₂² (the squared distance from the center).
- By the Pythagorean theorem, the distance from the origin is √(x₁² + x₂²).
- All blue points fall within a certain radius range; all orange points fall outside that range.
- Transform to a one-dimensional space with a cutoff point separating the classes.
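The radial transform can be demonstrated directly: mapping each 2-D point to its squared distance from the origin sends the two rings to two separated intervals on a 1-D axis. The ring radii and cutoff below are illustrative assumptions:

```python
import math

# Map each 2-D point to its squared distance from the origin.
# Points on an inner ring (radius 1) and an outer ring (radius 3)
# become two separated intervals on a 1-D axis, so a single cutoff
# separates the classes. Radii and cutoff are illustrative.

def radial_feature(x1, x2):
    return x1**2 + x2**2  # squared distance from the center

inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.0, 2.0)]          # radius 1
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 1.5, 2.5)]  # radius 3

cutoff = 4.0  # any value strictly between 1**2 and 3**2 works
assert all(radial_feature(x, y) < cutoff for x, y in inner)
assert all(radial_feature(x, y) > cutoff for x, y in outer)
```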
Deep Learning
- Deep learning provides a general structure for nonlinearity.
- The machine learns the nonlinear transformations.
Course Outline
- Basics of linear separation.
- Examples where linear separation fails.
- Multilayer perceptrons for learning nonlinear structures.
- More complex structures.
Linear Algebra Basics
Column Vectors
- Represented as column vectors: x = [x₁, x₂, …, xₙ], written as a single column.
- Think of them as experiments (columns) in a gene expression matrix.
Row Vectors
- Use the transpose for row vectors: xᵀ = [x₁ x₂ … xₙ].
- Transpose: Rows become columns and columns become rows.
Vectors as Points
- A vector can be visualized as a point in high-dimensional space.
- Example: (x1, x2) in a two-dimensional plane.
- Every vector is a point in high-dimensional space.
- Also represents the direction of the line connecting the origin to the point.
Vector Magnitude: Norms
Euclidean Norm (L2 Norm)
- ||x||₂ = √(x₁² + x₂² + … + xₙ²)
- Shortest distance from the origin to the point in Euclidean space.
L1 Norm (Manhattan Distance)
- ||x||₁ = |x₁| + |x₂| + … + |xₙ|
- Sum of absolute values of the components.
- Manhattan distance: Distance traveled along grid lines (no diagonals).
Infinity Norm
- ||x||∞ = max(|x₁|, |x₂|, …, |xₙ|)
- Maximum absolute value of the vector's components.
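The three norms can be computed side by side for the same vector; the vector (3, −4, 0) is chosen here only because its values are easy to verify by hand:

```python
import math

# Compute the three norms described above for one example vector.
x = [3.0, -4.0, 0.0]

l2  = math.sqrt(sum(v**2 for v in x))  # Euclidean (L2) norm
l1  = sum(abs(v) for v in x)           # Manhattan (L1) norm
inf = max(abs(v) for v in x)           # infinity norm

print(l2, l1, inf)  # prints: 5.0 7.0 4.0
```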
Scalar Multiplication
- Multiplying a vector by a positive scalar changes its magnitude but not its direction (a negative scalar also reverses the direction).
- Example: Multiplying vector (1, 1) by 0.5 results in (0.5, 0.5).
Unit Vectors
- Dividing a vector by its magnitude (2-norm) produces a unit vector.
- x̂ = x / ||x||₂
- The direction remains the same, but the magnitude becomes 1.
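Normalization is just component-wise division by the 2-norm; the classic 3-4-5 vector makes the result easy to check:

```python
import math

# Normalize a vector: divide each component by its 2-norm,
# producing a vector of magnitude 1 in the same direction.
def unit_vector(x):
    norm = math.sqrt(sum(v**2 for v in x))
    return [v / norm for v in x]

u = unit_vector([3.0, 4.0])
print(u)  # prints: [0.6, 0.8]; its 2-norm is 1
```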
Dot Product
- Multiply corresponding components and sum the results.
- x·y = x₁y₁ + x₂y₂ + … + xₙyₙ
- Can be written as xᵀy, where xᵀ is x transposed into a row vector.
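The componentwise definition translates directly into code:

```python
# Dot product: multiply corresponding components and sum the results.
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # prints: 32.0
```

The weighted-sum classifier sketched earlier is exactly this operation: a dot product of a weight vector with an expression vector.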