Comprehensive Bioinformatics and Functional Genomics Notes: Combined 2021-2025 Exam Guides

Sequencing Technologies: Advantages and Disadvantages

Illumina - Advantages: Generates huge amounts of data, is currently the cheapest method, has a low sequence error rate, and features very few indels (insertions/deletions). It has established wet-lab and dry-lab (bioinformatics) protocols. - Disadvantages: Produces short reads. Requires large, expensive machines.
Pacific Biosciences (PacBio) - Advantages: Produces long reads and has a low sequence error rate compared to some other long-read tech. - Disadvantages: High cost. Requires large amounts of DNA. Wet-lab and dry-lab pipelines are less established than Illumina.
Nanopore (Oxford Nanopore Technologies) - Advantages: Produces long reads and is highly portable (small devices). The technology is rapidly improving. - Disadvantages: Traditionally high error rates, specifically a high incidence of indel errors. Less established wet/dry lab protocols.

Sequencing Depth and Genome Assembly

Sequencing Depth (D) Calculation - The formula for sequencing depth is: $D = \frac{N \times L}{G}$ - $N$ = number of reads - $L$ = length of reads - $G$ = genome size (base pairs) - Example 1: For a genome of $5 \times 10^6$ base pairs, reads of length $125$, and a target depth of $50$, rearranging the formula gives the number of reads needed: $N = \frac{D \times G}{L} = \frac{50 \times 5,000,000}{125} = 2,000,000$. - Example 2: For a genome of $3 \times 10^6$ bp, $1,000,000$ read pairs (meaning $2,000,000$ total reads), and length $150$, the depth is: $D = \frac{1,000,000 \times 2 \times 150}{3,000,000} = 100$.
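The two worked examples can be reproduced with a minimal Python sketch. The function names sequencing_depth and reads_needed are illustrative and not taken from any library.

```python
import math


def sequencing_depth(n_reads, read_length, genome_size):
    """Depth D = N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Rearranged formula N = D * G / L, rounded up to whole reads."""
    return math.ceil(target_depth * genome_size / read_length)


# Example 1: 5 Mbp genome, 125 bp reads, target depth 50
example_1 = reads_needed(50, 125, 5_000_000)

# Example 2: 1,000,000 read pairs = 2,000,000 reads of 150 bp, 3 Mbp genome
example_2 = sequencing_depth(2 * 1_000_000, 150, 3_000_000)

print(example_1, example_2)
```

Rounding up in reads_needed is a design choice: a fractional read cannot be sequenced, so the target depth is reached only with the next whole number.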
Probability of Coverage - The probability that a random read of length $L$ covers a specific position in a circular genome of length $G$ is: $P = \frac{L}{G}$.
Genome Assembly Concepts - Contigs: Pieces of the genome assembled from overlapping reads. - N50 Value: A metric for assembly quality. Given a set of contigs (lengths: $100, 200, 300, 400, 500, 600, 700$), the total length is $2800$. The N50 is the length of the smallest contig such that the sum of it and all larger contigs is at least $50\%$ of the total ($1400$). Here: $700 + 600 = 1300 < 1400$, but $700 + 600 + 500 = 1800 \geq 1400$, so N50 is $500$. - De Bruijn Graphs (DBG): A method for genome assembly that avoids comparing all reads directly (faster than overlap-layout-consensus, OLC, for many reads). Reads are cut into shorter sequences of length $k$ ($k$-mers). Each unique $k$-mer is a node; edges are created for overlaps of length $k-1$.
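A short Python sketch of the N50 rule and of the graph construction described above (k-mers as nodes, edges for overlaps of length k-1). The function names are hypothetical, and real assemblers add error correction and graph simplification that are not shown here.

```python
def n50(contig_lengths):
    """Length of the smallest contig in the set of largest contigs
    that together cover at least 50% of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    raise ValueError("no contigs given")


def de_bruijn_edges(reads, k):
    """Edges between unique k-mers whose (k-1)-suffix equals the
    (k-1)-prefix of another k-mer."""
    kmers = sorted({read[i:i + k] for read in reads
                    for i in range(len(read) - k + 1)})
    # Index k-mers by their first k-1 symbols to avoid all-vs-all comparison
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    return [(a, b) for a in kmers for b in by_prefix.get(a[1:], [])]


print(n50([100, 200, 300, 400, 500, 600, 700]))
print(de_bruijn_edges(["ACGT", "CGTA"], 3))
```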
Sequencing Strategies - Amplicon sequencing: Sequences a specific marker or region (e.g., 16S rRNA). - Shotgun sequencing: Sequences all DNA fragments in a sample. - Metabarcoding: Determining the biological composition of an environmental sample from the marker sequences found in it.

Pairwise Sequence Alignment and Scoring

Dynamic Programming Algorithms - Global Alignment (Needleman-Wunsch): Finds the best alignment over the whole length. Starts filling a matrix from the top-left (score 0, then accumulated gap penalties). The final score is in the bottom-right cell. - Local Alignment (Smith-Waterman): Finds the best matching sub-segments. Matrix initialization and the recursion rule allow cells to reset to $0$ (no negative values). The optimal score is the highest value anywhere in the matrix.
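The difference between the two algorithms can be sketched in one Python function. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions for illustration; the sketch returns only the optimal score and omits the traceback that recovers the alignment itself.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal score by Needleman-Wunsch (local=False)
    or Smith-Waterman (local=True) with a linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global: first row and column hold accumulated gap penalties
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,  # align both symbols
                       score[i - 1][j] + gap,       # gap in seq_b
                       score[i][j - 1] + gap)       # gap in seq_a
            if local:
                cell = max(cell, 0)                 # local: reset to 0
            score[i][j] = cell
            best = max(best, cell)

    # Global score sits in the bottom-right cell; local score is the maximum
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "ACT"))
print(alignment_score("TTACGTT", "GGACGGG", local=True))
```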
Scoring Logic - Log-Likelihood Ratio Score: $S(a,b) = \log_{10} \left( \frac{\Pr(a,b \mid \text{related})}{\Pr(a,b \mid \text{unrelated})} \right)$. - Positive score: Pair occurs more often in related sequences than by chance. - Negative score: Pair occurs less often in related sequences than by chance. - Example Calculation: If $\Pr(A,B) = 0.1$, $\Pr(A) = 0.01$, and $\Pr(B) = 0.1$, then the score is $\log_{10} \left( \frac{0.1}{0.01 \times 0.1} \right) = \log_{10}(100) = 2$.
Gap Penalties - Linear Gap Penalty: Fixed penalty per gap position. - Affine Gap Penalty: Distinguishes between starting a gap (Gap-open, high cost) and extending a gap (Gap-extension, lower cost). $Score_{gap} = Open + (Length - 1) \times Extension$ .
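A small Python comparison of the two penalty schemes. The numeric penalties used here (-10 to open, -1 to extend, -2 per position for the linear case) are arbitrary illustrative values.

```python
def linear_gap_score(length, gap):
    """Same penalty for every gap position."""
    return length * gap


def affine_gap_score(length, gap_open, gap_extend):
    """Open + (Length - 1) * Extension; a gap of length 0 costs nothing."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One gap of length 4 under each scheme
print(linear_gap_score(4, -2))
print(affine_gap_score(4, -10, -1))
```

With affine penalties one long gap is cheaper than several short ones of the same total length, which matches the biological observation that a single indel event often spans several positions.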

Sequence Databases and BLAST Statistics

Data Formats - FASTA: Header starts with >. Sequence follows. To count sequences, count lines starting with >. - Accession Number: A unique address/identifier for an object (e.g., a sequence) in a database. - Newick: A format describing tree topology and branch lengths.
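Counting sequences in FASTA data, as described above, can be sketched in Python as follows. The example records and the function name are made up for illustration.

```python
def count_fasta_records(lines):
    """Number of sequences = number of header lines starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


fasta_text = """>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2 another hypothetical record
TTGACA
"""

print(count_fasta_records(fasta_text.splitlines()))
```

For a file on disk, the open file handle can be passed directly, since iterating over it yields one line at a time.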
BLAST Metrics - E-value: The expected number of hits with a score at least as high as the observed score that would be found by chance in a database of a certain size. Smaller E-values indicate higher significance. - Bit-score: A normalized version of the raw score that is independent of the scoring matrix scaling. E-value can be calculated from bit-score if search sequence length and database size are known. - Null Distribution: The distribution of scores expected from aligning unrelated/independent sequences.
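The link between bit-score and E-value can be sketched in Python. The notes do not give the formula; the relation assumed here is the commonly quoted $E = m \times n \times 2^{-S'}$, with $m$ the query length, $n$ the total database length, and $S'$ the bit-score. BLAST itself applies additional length corrections that this sketch ignores.

```python
def evalue_from_bitscore(bit_score, query_length, database_length):
    """Assumed relation E = m * n * 2^(-S'), without edge-effect corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same bit-score, two database sizes: the larger search space gives
# a proportionally larger E-value.
small_db = evalue_from_bitscore(30, 100, 1_000_000)
large_db = evalue_from_bitscore(30, 100, 10_000_000)
print(small_db, large_db)
```

Each additional bit of score halves the E-value, while a tenfold larger database multiplies it by ten.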
Search Performance - Sensitivity (Recall): The proportion of actual homologs in the database that appear on the hit list: $TP / (TP + FN)$ . - Precision: The proportion of hits on the list that are actually homologs: $TP / (TP + FP)$ . - Specificity: The proportion of non-homologs correctly excluded: $TN / (TN + FP)$ .
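The three measures can be computed from the four counts with a few lines of Python. The counts in the example (8 true positives, 2 false positives, 2 false negatives, 88 true negatives) are invented for illustration.

```python
def sensitivity(tp, fn):
    """Recall: share of true homologs that were found."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of reported hits that are true homologs."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """Share of non-homologs that were correctly left out."""
    return tn / (tn + fp)


tp, fp, fn, tn = 8, 2, 2, 88
print(sensitivity(tp, fn), precision(tp, fp), specificity(tn, fp))
```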

Multiple Sequence Alignment (MSA)

Algorithms - Progressive Alignment: Greedy algorithm that builds the MSA by following a guide tree (made via UPGMA or NJ). Gaps placed at an early step are kept in all later steps (“Once a gap, always a gap”), so early mistakes are not corrected. - Iterative Methods: Repeatedly realigns groups of sequences in the MSA to improve the total score.
Guide Trees (UPGMA) - UPGMA (Unweighted Pair-Group Method with Arithmetic mean): A hierarchical clustering method using Average Linkage, where the distance between two groups is the average of distances between all pairs of members across groups. - Guide Tree Parsing: In a Newick string like ((A,(B,C)),E), the innermost bracketed pair is aligned first (B and C), then A is aligned to the BC profile, and finally E is aligned to the ABC profile.
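A compact Python sketch of average linkage and of the resulting merge order. The distance values and function names are invented for illustration, branch lengths are not computed, and the average is recomputed from the original distances at each step rather than updated incrementally.

```python
from itertools import combinations


def average_linkage(group_a, group_b, dist):
    """Mean distance over all pairs with one member from each group."""
    pairs = [(a, b) for a in group_a for b in group_b]
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


def upgma_topology(names, dist):
    """Repeatedly merge the two closest clusters; returns a bracketed
    topology string (no branch lengths)."""
    clusters = {(name,): name for name in names}  # members -> subtree text
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], dist))
        subtree = "(" + clusters.pop(a) + "," + clusters.pop(b) + ")"
        clusters[a + b] = subtree
    return next(iter(clusters.values()))


# Hypothetical pairwise distances between four sequences
dist = {frozenset(pair): d for pair, d in [
    (("A", "B"), 4), (("A", "C"), 4), (("A", "D"), 8),
    (("B", "C"), 2), (("B", "D"), 8), (("C", "D"), 8)]}

tree = upgma_topology(["A", "B", "C", "D"], dist)
print(tree)
```

In a progressive alignment the merge order of this tree is also the alignment order: the closest pair first, then each further sequence or profile as it joins.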
Scoring - Sum-of-Pairs (SP) Score: The sum of scores for all possible pairwise combinations of sequences at each position in the MSA.
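The SP score can be sketched in Python as follows. The scoring values (match +1, mismatch -1, symbol against gap -2, gap against gap 0) are assumptions chosen for illustration; in practice a substitution matrix would replace the match/mismatch values.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores of all sequence pairs in every column."""
    total = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap vs gap scores 0 (assumed convention)
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


msa = ["ACG",
       "A-G",
       "TCG"]
print(sum_of_pairs(msa))
```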

Sequence Models: PSSM and pHMM

Position Specific Scoring Matrix (PSSM) - A table with a row for each symbol (e.g., A, C, G, T) and a column for each sequence position. Values represent the log-ratio of the probability of the symbol at that position vs. background frequency. - Example: For 20 amino acids over 8 positions, a PSSM requires $20 \times 8 = 160$ values (one per symbol and position).
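Building and using such a table can be sketched in Python. Several choices here are assumptions rather than part of the notes: a uniform background frequency, a pseudocount of 1 to avoid taking the logarithm of zero, base-2 logarithms, and the three example sequences.

```python
import math


def build_pssm(sequences, alphabet="ACGT", pseudocount=1.0):
    """Rows = symbols, columns = positions, values = log2(freq / background)."""
    background = 1.0 / len(alphabet)          # uniform background (assumption)
    n = len(sequences)
    pssm = {symbol: [] for symbol in alphabet}
    for column in zip(*sequences):
        for symbol in alphabet:
            freq = (column.count(symbol) + pseudocount) / (
                n + pseudocount * len(alphabet))
            pssm[symbol].append(math.log2(freq / background))
    return pssm


def score_sequence(pssm, sequence):
    """Sum of the position-specific scores of each symbol."""
    return sum(pssm[symbol][i] for i, symbol in enumerate(sequence))


pssm = build_pssm(["ACG", "ACG", "ATG"])
print(score_sequence(pssm, "ACG"), score_sequence(pssm, "TTT"))
```

A sequence resembling the motif receives a positive total, while one made of rare symbols at each position receives a negative total.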
Profile Hidden Markov Models (pHMM) - Describes motifs that can vary in length (including gaps). Includes "Match" states (observing symbols), "Insertion" states, and "Deletion" states (skipping symbols).
PROSITE Patterns - Syntax to represent motifs. Example: A-[LK]-[IW]-X-L(3)-S means A followed by L or K, then I or W, then any symbol (X), then L three times, then S.
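One way to test sequences against such a pattern is to translate it into a regular expression. This Python sketch covers only the elements used in the example (literal symbols, bracketed alternatives, X as wildcard, and repeat counts in parentheses); the function name is hypothetical and the rest of the PROSITE syntax is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("X", ".").replace("x", ".")  # any symbol
        element = element.replace("(", "{").replace(")", "}")  # repeat count
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("A-[LK]-[IW]-X-L(3)-S")
print(regex)
print(bool(re.fullmatch(regex, "AKWQLLLS")))
print(bool(re.fullmatch(regex, "AKWQLLS")))
```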

Molecular Phylogeny

Homology Concepts - Orthologs: Homologous genes that diverged through a speciation event. Used to describe species evolution. - Paralogs: Homologous genes that arose through gene duplication. Used to describe gene family evolution. - Note: All orthologs and paralogs are homologs, but not all homologs are orthologs.
Phylogenetic Trees - Cladogram: Shows topology (branching order) only; branch lengths have no meaning. - Phylogram: Branch lengths are proportional to evolutionary distance. - Additive tree: Sum of branch lengths between nodes equals the distance in the distance matrix. - Ultrametric tree: All leaves are equidistant from the root (assumes a constant molecular clock).
Reconstruction Methods - Distance-based (NJ, UPGMA): Transform sequence data into a distance matrix. Neighbor-Joining (NJ) is a greedy method that seeks the tree with the smallest total branch length. - Discrete methods: - Maximum Parsimony (MP): Finds the tree requiring the fewest total mutations (uses informative positions where at least 2 symbols each occur at least 2 times). - Maximum Likelihood (ML): Finds the tree/model that maximizes the probability of the observed data. - Bayesian Inference (BI): Finds trees with the highest posterior probability.
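The rule for informative positions in parsimony can be expressed as a short Python check. Ignoring gap characters is an assumption of this sketch, and the function name is illustrative.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least 2 different symbols each occur at least 2 times."""
    counts = Counter(symbol for symbol in column if symbol != "-")
    return sum(1 for count in counts.values() if count >= 2) >= 2


# Columns of a hypothetical 4-sequence alignment
for column in ["AAGG", "AAAG", "ACGT", "AAAA"]:
    print(column, is_parsimony_informative(column))
```

Columns such as AAAG or ACGT need the same number of mutations on every possible tree, so they cannot help choose between topologies.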
Distance Models - P-distance: The observed proportion of mismatches: $Mismatches / (Matches + Mismatches)$. Gaps are often excluded. - Evolutionary Distance: The estimated actual number of substitutions per site. As p-distance increases, the gap between p-distance and evolutionary distance grows because multiple substitutions at the same site are observed as at most one difference. The Jukes-Cantor model corrects for this; for nucleotides the p-distance saturates at $0.75$, the value expected between unrelated sequences.
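Both distances can be computed with a short Python sketch. The notes do not write out the Jukes-Cantor correction; the standard form $d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$ is assumed here, and the example sequences are invented.

```python
import math


def p_distance(seq_a, seq_b):
    """Mismatches / (matches + mismatches), skipping positions with a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    mismatches = sum(1 for x, y in pairs if x != y)
    return mismatches / len(pairs)


def jukes_cantor(p):
    """Estimated substitutions per site; undefined once p reaches 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated (>= 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor(p))
print(jukes_cantor(0.6))
```

The corrected distance is always at least as large as the p-distance, and the difference grows quickly as p approaches 0.75.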
Bootstrapping - A statistical technique to assess clade stability. Resamples the MSA with replacement to generate many new alignments. The bootstrap value is the frequency with which a specific clade appears in the resulting trees.
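The resampling step and the final support value can be sketched in Python. Tree building itself is left out: the replicate clade sets in the example are made-up placeholders standing in for the trees that would be built from each resampled alignment, and the function names are hypothetical.

```python
import random


def bootstrap_alignment(msa, rng):
    """Draw as many columns as the MSA has, with replacement."""
    n_columns = len(msa[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(sequence[i] for i in picks) for sequence in msa]


def clade_support(replicate_clades, clade):
    """Fraction of replicate trees that contain the given clade."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)


msa = ["ACGT", "ACGA", "TCGA"]
replicate = bootstrap_alignment(msa, random.Random(1))
print(replicate)

# Placeholder clade sets for four replicate trees
replicate_clades = [{frozenset("AB")}, {frozenset("AB")},
                    {frozenset("AC")}, {frozenset("AB")}]
print(clade_support(replicate_clades, frozenset("AB")))
```

Whole columns are resampled, so each replicate keeps the relationship between sequences within a column while varying which sites contribute to the tree.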

Functional Genomics: Transcriptomics and Statistical Analysis

RNA-Seq Applications - Quantification of gene expression (coding and non-coding). - Discovery of novel (unannotated) genes. - Detection of splice variants. - Mapping mutations (SNPs) within transcripts.
Differentially Expressed Genes (DEGs) - Multiple Testing Correction: When testing 20,000 genes at a p-value threshold of $0.05$, we expect $20,000 \times 0.05 = 1,000$ false positives by chance even if no genes are differentially expressed. - Bonferroni: Controls the family-wise error rate by multiplying each p-value by the number of tests $N$ ($p_{adj} = \min(1, p \times N)$). Very strict. - False Discovery Rate (FDR/q-values): Controls the proportion of false positives among the "significant" results. An FDR of $0.05$ for $100$ selected genes means expecting $5$ false positives. - Volcano Plot: Plots the significance ($-\log_{10}(p)$) vs. effect size ($\log_{2}(\text{fold change})$). Genes in the top-left and top-right corners are significant DEGs.
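The two corrections can be compared in a short Python sketch. The notes do not name a specific FDR procedure; the Benjamini-Hochberg method is assumed here as the usual choice, and the four p-values are invented.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance hits when no gene is truly different."""
    return n_tests * alpha


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """q-values: p * n / rank, made monotone from the largest p downwards."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


p_values = [0.01, 0.04, 0.03, 0.5]
print(expected_false_positives(20_000, 0.05))
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni-adjusted values are never smaller than the corresponding q-values, which is why the FDR approach keeps more genes at the same threshold.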
Enrichment Analysis (Fisher's Exact Test) - Tests if a specific functional group of genes (e.g., cold stress genes) appears in a set of DEGs or clusters more often than expected by chance.
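For enrichment, the one-sided Fisher's exact test is the upper tail of a hypergeometric distribution, which can be sketched with the Python standard library alone. The gene counts in the example are invented and the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, group_genes, selected_genes, overlap):
    """P(at least `overlap` group genes among the selected genes)
    when the selection is random with respect to the group."""
    all_selections = comb(total_genes, selected_genes)
    max_overlap = min(group_genes, selected_genes)
    favourable = sum(
        comb(group_genes, k) * comb(total_genes - group_genes, selected_genes - k)
        for k in range(overlap, max_overlap + 1))
    return favourable / all_selections


# 10 genes in total, 4 in the functional group, 3 DEGs, 2 of them in the group
print(enrichment_p_value(10, 4, 3, 2))
```

A small p-value means the functional group is over-represented among the selected genes compared with a random draw of the same size.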
Dimensionality Reduction and Exploration - PCA (Principal Component Analysis): Used on data tables (like RPKM/TPM) to visualize gradients, groups, or outliers. - PCoA (Principal Coordinate Analysis): Used on distance/dissimilarity tables. - Clustering: K-means (requires choosing $k$ clusters, randomized) vs. Hierarchical (produces a tree, deterministic). - Diversity: - Alpha-diversity: Diversity within a single sample. - Beta-diversity: Diversity between samples.
Programming in R - Vectors vs. Lists: Vectors store elements of the same type; lists can store different types. - String Manipulation: str_detect() (logical), str_locate() (integer matrix of positions), str_extract() (text). - Tidyverse: mutate() adds columns, filter() keeps rows based on conditions, select() keeps/excludes columns.