Bioinformatics and Linear Algebra Fundamentals
Unsupervised vs. Supervised Classification
- Rows often represent genes or molecules.
- Columns represent experiments.
- Cells contain expression values.
Unsupervised Classification
- Goal: Determine if experiments naturally form classes without prior knowledge.
- Methods: clustering and other unsupervised techniques.
Supervised Classification
- Experiments are pre-classified (e.g., type A and type B).
- Given a new sample of unknown type.
- Goal: Classify the sample as either type A or type B.
Dimensionality Reduction
- Given the class labels, identify a small subset of genes (e.g., 10 genes).
- Goal: Predict class membership from the expression values of this reduced gene set.
Geometric Perspective
- The algebra can be intimidating, but concepts become clear with a geometric perspective.
- This approach aligns with modern bioinformatic thinking.
Leukemia Example: ALL vs. AML
- Classic gene expression example using microarrays (precursor to RNA-seq).
- Two forms of leukemia: ALL (Acute Lymphoblastic Leukemia) and AML (Acute Myeloid Leukemia).
- Distinction is crucial because AML is generally more aggressive, requiring different treatment.
- Gene expression values from patient blood samples are analyzed.
- Most discriminative genes (not all) are used.
- Red = high expression, Blue = low expression.
- Genes exist that are highly expressed in ALL and lowly expressed in AML, and vice versa.
- Noise exists in the data; single genes are insufficient for classification.
- A compendium of genes provides a strong signal for classification.
Classifying Orphan Samples
- Given a new (orphan) sample, how do we classify it as ALL or AML?
- One approach: Count how many genes are highly expressed in ALL vs. AML.
- Another approach: Weighted average of gene expression values in each class, comparing against a threshold.
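The weighted-average approach can be sketched as a linear score compared against a threshold. The gene weights, expression values, and threshold below are illustrative assumptions, not values from the actual ALL/AML study:

```python
# Sketch of a weighted-sum classifier for a new sample.
# Weights are positive for genes high in ALL, negative for genes high in AML.
# All numbers here are hypothetical, chosen only to illustrate the idea.

def classify(expression, weights, threshold=0.0):
    """Return 'ALL' if the weighted sum of expression values
    exceeds the threshold, otherwise 'AML'."""
    score = sum(w * x for w, x in zip(weights, expression))
    return "ALL" if score > threshold else "AML"

weights = [1.0, 0.8, -0.9, -1.1]   # hypothetical per-gene weights
sample  = [2.1, 1.7, 0.3, 0.2]     # hypothetical sample, high on ALL-associated genes
print(classify(sample, weights))   # prints "ALL"
```

In practice the weights would be learned from the pre-classified training samples; here they are fixed by hand.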
Linear Algebra Preliminaries
Two-Gene Example
- Simplify to two genes for visualization.
- Plot each sample in a two-dimensional space.
- X and Y axes represent gene expression values for gene 1 (G1) and gene 2 (G2).
- If samples from different classes are well separated, they are linearly separable.
- In two dimensions, linear separation means a line can divide the classes.
- New samples are classified based on which side of the line they fall.
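Deciding which side of a separating line a new sample falls on amounts to checking the sign of a linear function of the two expression values. The line coefficients below are made up for illustration:

```python
# Decide which side of the line w1*g1 + w2*g2 + b = 0 a sample lies on.
# The coefficients (w1, w2, b) are hypothetical, chosen only to illustrate.

def side_of_line(g1, g2, w1=1.0, w2=1.0, b=-3.0):
    value = w1 * g1 + w2 * g2 + b
    return "class A" if value > 0 else "class B"

print(side_of_line(2.5, 2.0))  # above the line g1 + g2 = 3 -> "class A"
print(side_of_line(0.5, 1.0))  # below it -> "class B"
```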
Higher Dimensions
- Three genes: three-dimensional space.
- Linear separation: a plane.
- In general n-dimensional space: a hyperplane (an (n−1)-dimensional flat subspace).
Non-Linearly Separable Data
- Classes may be separable but not linearly separable.
- Apply transformations to make non-linearly separable data linearly separable in a transformed space.
Example: Concentric Rings
- One class forms a blue ring; the other forms an orange ring.
- No single line can separate them in the original space.
- Transformation idea: Use sine or cosine waves.
- The idea is to leverage the radial structure.
- Represent each point as (x1, x2).
- Apply the transformation: map (x₁, x₂) to x₁² + x₂² (the squared distance from the center).
- By the Pythagorean theorem, the distance from the origin is √(x₁² + x₂²).
- All blue points fall within a certain radius range; all orange points fall outside that range.
- Transform to a one-dimensional space with a cutoff point separating the classes.
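The radial transform can be demonstrated directly: mapping each 2-D point to its squared distance from the origin sends the two rings to two separated intervals on a 1-D axis. The ring radii and cutoff below are illustrative assumptions:

```python
import math

# Map each 2-D point to its squared distance from the origin.
# Points on an inner ring (radius 1) and an outer ring (radius 3)
# become two separated intervals on a 1-D axis, so a single cutoff
# separates the classes. Radii and cutoff are illustrative.

def radial_feature(x1, x2):
    return x1**2 + x2**2  # squared distance from the center

inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.0, 2.0)]          # radius 1
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 1.5, 2.5)]  # radius 3

cutoff = 4.0  # any value strictly between 1**2 and 3**2 works
assert all(radial_feature(x, y) < cutoff for x, y in inner)
assert all(radial_feature(x, y) > cutoff for x, y in outer)
```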
Deep Learning
- Deep learning provides a general structure for nonlinearity.
- The machine learns the nonlinear transformations.
Course Outline
- Basics of linear separation.
- Examples where linear separation fails.
- Multilayer perceptrons for learning nonlinear structures.
- More complex structures.
Linear Algebra Basics
Column Vectors
- Represented as column vectors: x = [x₁, x₂, …, xₙ], written as a single column.
- Think of them as experiments (columns) in a gene expression matrix.
Row Vectors
- Use the transpose for row vectors: xᵀ = [x₁ x₂ … xₙ].
- Transpose: Rows become columns and columns become rows.
Vectors as Points
- A vector can be visualized as a point in high-dimensional space.
- Example: (x1, x2) in a two-dimensional plane.
- Every vector is a point in high-dimensional space.
- Also represents the direction of the line connecting the origin to the point.
Vector Magnitude: Norms
Euclidean Norm (L2 Norm)
- ||x||₂ = √(x₁² + x₂² + … + xₙ²)
- Shortest distance from the origin to the point in Euclidean space.
L1 Norm (Manhattan Distance)
- ||x||₁ = |x₁| + |x₂| + … + |xₙ|
- Sum of absolute values of the components.
- Manhattan distance: Distance traveled along grid lines (no diagonals).
Infinity Norm
- ||x||∞ = max(|x₁|, |x₂|, …, |xₙ|)
- Maximum absolute value of the vector's components.
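The three norms can be computed side by side for the same vector; the vector (3, −4, 0) is chosen here only because its values are easy to verify by hand:

```python
import math

# Compute the three norms described above for one example vector.
x = [3.0, -4.0, 0.0]

l2  = math.sqrt(sum(v**2 for v in x))  # Euclidean (L2) norm
l1  = sum(abs(v) for v in x)           # Manhattan (L1) norm
inf = max(abs(v) for v in x)           # infinity norm

print(l2, l1, inf)  # prints: 5.0 7.0 4.0
```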
Scalar Multiplication
- Multiplying a vector by a positive scalar changes its magnitude but not its direction (a negative scalar also reverses the direction).
- Example: Multiplying vector (1, 1) by 0.5 results in (0.5, 0.5).
Unit Vectors
- Dividing a vector by its magnitude (2-norm) produces a unit vector.
- x̂ = x / ||x||₂
- The direction remains the same, but the magnitude becomes 1.
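Normalization is just component-wise division by the 2-norm; the classic 3-4-5 vector makes the result easy to check:

```python
import math

# Normalize a vector: divide each component by its 2-norm,
# producing a vector of magnitude 1 in the same direction.
def unit_vector(x):
    norm = math.sqrt(sum(v**2 for v in x))
    return [v / norm for v in x]

u = unit_vector([3.0, 4.0])
print(u)  # prints: [0.6, 0.8]; its 2-norm is 1
```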
Dot Product
- Multiply corresponding components and sum the results.
- x·y = x₁y₁ + x₂y₂ + … + xₙyₙ
- Can be written as xᵀy, where xᵀ is x transposed into a row vector.
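The componentwise definition translates directly into code:

```python
# Dot product: multiply corresponding components and sum the results.
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # prints: 32.0
```

The weighted-sum classifier sketched earlier is exactly this operation: a dot product of a weight vector with an expression vector.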