Comprehensive Bioinformatics and Functional Genomics Notes: Combined 2021-2025 Exam Guides

Sequencing Technologies: Advantages and Disadvantages

  • Illumina
    - Advantages: Generates huge amounts of data, is currently the cheapest method, has a low sequence error rate, and features very few indels (insertions/deletions). It has established wet-lab and dry-lab (bioinformatics) protocols.
    - Disadvantages: Produces short reads. Requires large, expensive machines.

  • Pacific Biosciences (PacBio)
    - Advantages: Produces long reads and has a low sequence error rate compared to some other long-read technologies.
    - Disadvantages: High cost. Requires large amounts of DNA. Wet-lab and dry-lab pipelines are less established than for Illumina.

  • Nanopore (Oxford Nanopore Technologies)
    - Advantages: Produces long reads and is highly portable (small devices). The technology is rapidly improving.
    - Disadvantages: Traditionally high error rates, specifically a high incidence of indel errors. Less established wet-lab and dry-lab protocols.

Sequencing Depth and Genome Assembly

  • Sequencing Depth (D) Calculation
    - The formula for sequencing depth is $D = \frac{N \times L}{G}$, where:
      - $N$ = number of reads
      - $L$ = length of reads (base pairs)
      - $G$ = genome size (base pairs)
    - Example 1: For a genome of $5 \times 10^6$ base pairs, reads of length $125$, and a target depth of $50$, rearranging the formula gives the number of reads needed: $N = \frac{D \times G}{L} = \frac{50 \times 5,000,000}{125} = 2,000,000$.
    - Example 2: For a genome of $3 \times 10^6$ bp, $1,000,000$ read pairs (meaning $2,000,000$ total reads), and read length $150$, the depth is $D = \frac{1,000,000 \times 2 \times 150}{3,000,000} = 100$.
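
The two worked examples above can be checked with a small Python sketch. The function names are illustrative only and not taken from any library.

```python
def sequencing_depth(n_reads, read_length, genome_size):
    """Average depth D = N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_depth, read_length, genome_size):
    """Number of reads N = D * G / L needed to reach a target depth."""
    return target_depth * genome_size / read_length


if __name__ == "__main__":
    # Example 1: 5 Mbp genome, 125 bp reads, target depth 50
    print(reads_needed(50, 125, 5_000_000))
    # Example 2: 1,000,000 read pairs = 2,000,000 reads of 150 bp, 3 Mbp genome
    print(sequencing_depth(2 * 1_000_000, 150, 3_000_000))
```

Note that read pairs count twice: each pair contributes two reads of length L.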

  • Probability of Coverage
    - The probability that a random read of length $L$ covers a specific position in a circular genome of length $G$ is $P = \frac{L}{G}$.

  • Genome Assembly Concepts
    - Contigs: Pieces of the genome assembled from overlapping reads.
    - N50 Value: A metric for assembly quality. Given a set of contigs (lengths: $100, 200, 300, 400, 500, 600, 700$), the total length is $2800$. The N50 is the length of the smallest contig such that the sum of it and all larger contigs is at least $50\%$ of the total ($1400$). Here $700 + 600 = 1300$ is still below $1400$, while $700 + 600 + 500 = 1800$ exceeds it, so the N50 is $500$.
    - De Bruijn Graphs (DBG): A method for genome assembly that avoids comparing all reads directly (faster than OLC for many reads). Reads are cut into shorter sequences of length $K$ (k-mers). Each unique k-mer is a node; edges are created for overlaps of length $K-1$.
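
As a minimal sketch of the two ideas above, the following Python computes an N50 and builds the node and edge sets of a De Bruijn graph as described here (k-mers as nodes, edges between k-mers that follow each other in a read and therefore overlap by K-1 symbols). The function names are hypothetical, and real assemblers add error correction and graph simplification that are not shown.

```python
def n50(contig_lengths):
    """Length of the smallest contig among the largest contigs that
    together cover at least 50% of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def de_bruijn_edges(reads, k):
    """Map each k-mer (node) to the set of k-mers that directly follow it
    in some read; consecutive k-mers overlap by k - 1 symbols."""
    edges = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            edges.setdefault(kmer, set())  # register every k-mer as a node
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return edges


if __name__ == "__main__":
    print(n50([100, 200, 300, 400, 500, 600, 700]))
    print(de_bruijn_edges(["ACGTAC"], 3))
```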

  • Sequencing Strategies
    - Amplicon sequencing: Sequences a specific marker or region (e.g., 16S rRNA).
    - Shotgun sequencing: Sequences all DNA fragments in a sample.
    - Metabarcoding: Determining the biological composition of an environmental sample from marker sequences.

Pairwise Sequence Alignment and Scoring

  • Dynamic Programming Algorithms
    - Global Alignment (Needleman-Wunsch): Finds the best alignment over the whole length. Starts filling a matrix from the top-left (score 0, then accumulated gap penalties along the first row and column). The final score is in the bottom-right cell.
    - Local Alignment (Smith-Waterman): Finds the best matching sub-segments. Matrix initialization and the recursion rule allow cells to reset to $0$ (no negative values). The optimal score is the highest value anywhere in the matrix.
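
The difference between the two algorithms can be sketched in Python as follows. This is a score-only illustration (no traceback) that assumes a linear gap penalty and arbitrary example scores (match +1, mismatch -1, gap -2); none of these values come from the notes.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch), linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]
    # First column and row hold accumulated gap penalties.
    for i in range(1, rows):
        m[i][0] = i * gap
    for j in range(1, cols):
        m[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s,
                          m[i - 1][j] + gap,
                          m[i][j - 1] + gap)
    return m[-1][-1]  # bottom-right cell


def sw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Local alignment score (Smith-Waterman), linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            m[i][j] = max(0,  # reset: no negative values
                          m[i - 1][j - 1] + s,
                          m[i - 1][j] + gap,
                          m[i][j - 1] + gap)
            best = max(best, m[i][j])
    return best  # highest value anywhere in the matrix
```

For example, `nw_score("ACGT", "AGT")` has to pay for one gap, while `sw_score("TTTACGTTTT", "GGACGGG")` only scores the shared segment `ACG`.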

  • Scoring Logic
    - Log-Likelihood Ratio Score: $S(a,b) = \log_{10} \left( \frac{Pr(a,b \mid related)}{Pr(a,b \mid unrelated)} \right)$.
      - Positive score: Pair occurs more often in related sequences than by chance.
      - Negative score: Pair occurs less often in related sequences than by chance.
    - Example Calculation: For unrelated sequences the two symbols are independent, so $Pr(A,B \mid unrelated) = Pr(A) \times Pr(B)$. If $Pr(A,B \mid related) = 0.1$, $Pr(A) = 0.01$, and $Pr(B) = 0.1$, then the score is $\log_{10} \left( \frac{0.1}{0.01 \times 0.1} \right) = \log_{10}(100) = 2$.
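
The example calculation can be reproduced with a short Python sketch. The function name is illustrative, and the unrelated-pair probability is taken as the product of the two background frequencies.

```python
import math


def log_odds_score(p_joint_related, p_a, p_b):
    """log10 of how much more often the pair (a, b) is seen in related
    sequences than expected by chance, where chance = p_a * p_b."""
    return math.log10(p_joint_related / (p_a * p_b))


if __name__ == "__main__":
    print(round(log_odds_score(0.1, 0.01, 0.1), 6))
```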

  • Gap Penalties
    - Linear Gap Penalty: Fixed penalty per gap position.
    - Affine Gap Penalty: Distinguishes between starting a gap (gap-open, high cost) and extending a gap (gap-extension, lower cost): $Score_{gap} = Open + (Length - 1) \times Extension$.
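
A small Python sketch of the two penalty schemes, using made-up costs (open = 10, extension = 1, linear = 4) purely for illustration:

```python
def linear_gap_cost(length, per_position):
    """Same penalty for every gap position."""
    return length * per_position


def affine_gap_cost(length, gap_open, gap_extend):
    """The first gap position costs gap_open, each further one gap_extend."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    for length in (1, 3, 10):
        print(length, linear_gap_cost(length, 4), affine_gap_cost(length, 10, 1))
```

With these example costs, one long gap is cheaper under the affine scheme than under the linear one, while several separate short gaps are more expensive.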

Sequence Databases and BLAST Statistics

  • Data Formats
    - FASTA: The header line starts with >, and the sequence follows on the next lines. To count sequences, count the lines starting with >.
    - Accession Number: A unique address/identifier for an object (e.g., a sequence) in a database.
    - Newick: A format describing tree topology and branch lengths.
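
Counting FASTA records as described above could look like this in Python. The in-memory example text is invented; for a real file, an open file handle can be passed in the same way.

```python
import io


def count_fasta_records(handle):
    """Count sequences by counting header lines, which start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    example = io.StringIO(">seq1 first record\nACGT\nACGT\n>seq2\nGGCC\n")
    print(count_fasta_records(example))
```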

  • BLAST Metrics
    - E-value: The expected number of hits with a score at least as high as the observed score that would be found by chance in a database of a certain size. Smaller E-values indicate higher significance.
    - Bit-score: A normalized version of the raw score that is independent of the scoring matrix scaling. The E-value can be calculated from the bit-score if the query sequence length and the database size are known.
    - Null Distribution: The distribution of scores expected from aligning unrelated/independent sequences.
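
The relationship between bit-score and E-value can be sketched with the standard formula $E = m \times n \times 2^{-S'}$, where $m$ is the query length, $n$ the total database length, and $S'$ the bit-score. The Python below is a rough illustration; BLAST itself uses corrected "effective" lengths, so reported values will differ somewhat.

```python
def evalue_from_bitscore(bit_score, query_length, database_length):
    """Approximate E-value: search space (m * n) scaled by 2^(-bit score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same bit-score, larger database -> larger (less significant) E-value
    print(evalue_from_bitscore(30, 100, 1_000_000))
    print(evalue_from_bitscore(30, 100, 100_000_000))
```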

  • Search Performance
    - Sensitivity (Recall): The proportion of actual homologs in the database that appear on the hit list: $TP / (TP + FN)$.
    - Precision: The proportion of hits on the list that are actually homologs: $TP / (TP + FP)$.
    - Specificity: The proportion of non-homologs correctly excluded: $TN / (TN + FP)$.
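
These three measures translate directly into Python. The counts in the usage example are invented: a search that returns 16 hits, 8 of them true homologs, from a database holding 10 homologs and 100 non-homologs.

```python
def sensitivity(tp, fn):
    """Share of true homologs that were found."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of reported hits that are true homologs."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """Share of non-homologs that were correctly left out."""
    return tn / (tn + fp)


if __name__ == "__main__":
    tp, fp, fn, tn = 8, 8, 2, 92
    print(sensitivity(tp, fn), precision(tp, fp), specificity(tn, fp))
```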

Multiple Sequence Alignment (MSA)

  • Algorithms
    - Progressive Alignment: Greedy algorithm that builds the MSA by following a guide tree (made via UPGMA or NJ). Once a gap is placed, it is not revisited in later steps ("once a gap, always a gap").
    - Iterative Methods: Repeatedly realign groups of sequences in the MSA to improve the total score.

  • Guide Trees (UPGMA)
    - UPGMA (Unweighted Pair-Group Method with Arithmetic mean): A hierarchical clustering method using average linkage, where the distance between two groups is the average of the distances between all pairs of members across the groups.
    - Guide Tree Parsing: In a Newick string like ((A,(B,C)),E), the innermost bracketed pairs are aligned first (e.g., B and C first, then A to the BC profile, and finally E).
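
The average-linkage rule can be illustrated with a few lines of Python. The distance table and its layout (a dict keyed by unordered pairs) are assumptions made for this sketch, not part of any particular tool.

```python
from itertools import product


def average_linkage(group_a, group_b, dist):
    """Mean of all pairwise distances between members of two groups."""
    pairs = list(product(group_a, group_b))
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical pairwise distances between three sequences
    dist = {
        frozenset(("A", "B")): 4,
        frozenset(("A", "C")): 6,
        frozenset(("B", "C")): 2,
    }
    # B and C are closest and are merged first; A is then compared to the BC group
    print(average_linkage(["A"], ["B", "C"], dist))
```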

  • Scoring
    - Sum-of-Pairs (SP) Score: The sum of scores for all possible pairwise combinations of sequences at each position in the MSA.
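
A minimal sum-of-pairs sketch in Python follows. The scoring values (match +1, mismatch -1, symbol against gap -2) and the choice to score gap-gap pairs as 0 are assumptions for this example; real tools typically use a substitution matrix.

```python
from itertools import combinations


def column_sp(column, match=1, mismatch=-1, gap=-2):
    """Sum the scores of every pair of symbols in one alignment column."""
    total = 0
    for x, y in combinations(column, 2):
        if x == "-" and y == "-":
            continue  # assumption: a gap aligned to a gap scores 0
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(alignment, **scores):
    """SP score of an MSA given as equal-length strings."""
    return sum(column_sp(col, **scores) for col in zip(*alignment))


if __name__ == "__main__":
    print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```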

Sequence Models: PSSM and pHMM

  • Position Specific Scoring Matrix (PSSM)
    - A table with a row for each symbol (e.g., A, C, G, T) and a column for each sequence position. Values represent the log-ratio of the probability of the symbol at that position vs. its background frequency.
    - Example: For 20 amino acids over 8 positions, a PSSM requires $20 \times 8 = 160$ values.
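
As a sketch of how such a table could be filled in, the Python below builds a PSSM from a handful of aligned DNA sequences. It assumes a uniform background frequency, a pseudocount of 1 to avoid taking the log of zero, and log base 2; all three are choices made for this illustration.

```python
import math


def build_pssm(sequences, alphabet="ACGT", pseudocount=1.0):
    """Return {symbol: [log2(freq / background) per position]}."""
    n = len(sequences)
    length = len(sequences[0])
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = {symbol: [] for symbol in alphabet}
    for pos in range(length):
        column = [seq[pos] for seq in sequences]
        for symbol in alphabet:
            freq = (column.count(symbol) + pseudocount) / (
                n + pseudocount * len(alphabet))
            pssm[symbol].append(math.log2(freq / background))
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of each symbol in seq."""
    return sum(pssm[symbol][i] for i, symbol in enumerate(seq))


if __name__ == "__main__":
    pssm = build_pssm(["AC", "AC", "AG", "AT"])
    print(score_sequence(pssm, "AC"), score_sequence(pssm, "TG"))
```

The resulting table has one row per symbol and one column per position, so 4 x 2 values here.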

  • Profile Hidden Markov Models (pHMM)
    - Describe motifs that can vary in length (including gaps).
    - Include "Match" states (observing symbols), "Insertion" states, and "Deletion" states (skipping symbols).

  • PROSITE Patterns
    - Syntax to represent motifs. Example: A-[LK]-[IW]-x-L(3)-S means A followed by L or K, then I or W, then any symbol (x), then L three times, then S.
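
One way to experiment with this syntax is to translate a pattern into a regular expression. The Python sketch below covers only the elements used in the example (literals, [..] choices, the any-symbol wildcard, and (n) repeats) plus {..} exclusions; anchors and other PROSITE features are not handled, and the function name is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "L(3)" into its core "L" and repeat "3".
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core in ("x", "X"):  # assumption: accept either case as wildcard
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {..} = any symbol except these
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("A-[LK]-[IW]-x-L(3)-S")
    print(regex)
    print(bool(re.fullmatch(regex, "AKWGLLLS")))
```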

Molecular Phylogeny

  • Homology Concepts
    - Orthologs: Genes that originated via speciation. Used to describe species evolution.
    - Paralogs: Genes that originated via gene duplication. Used to describe gene family evolution.
    - Note: All orthologs and paralogs are homologs, but not all homologs are orthologs.

  • Phylogenetic Trees
    - Cladogram: Shows topology (branching order) only; branch lengths have no meaning.
    - Phylogram: Branch lengths are proportional to evolutionary distance.
    - Additive tree: The sum of branch lengths between two nodes equals their distance in the distance matrix.
    - Ultrametric tree: All leaves are equidistant from the root (assumes a constant molecular clock).

  • Reconstruction Methods
    - Distance-based (NJ, UPGMA): Transform sequence data into a distance matrix. Neighbor-Joining (NJ) is a greedy method that seeks the tree with the smallest total length.
    - Discrete methods:
      - Maximum Parsimony (MP): Finds the tree requiring the fewest total mutations (uses informative positions, where at least 2 symbols each occur at least 2 times).
      - Maximum Likelihood (ML): Finds the tree/model that maximizes the probability of the observed data.
      - Bayesian Inference (BI): Finds trees with the highest posterior probability.
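
The rule for parsimony-informative positions (at least 2 symbols that each occur at least 2 times) can be checked per alignment column. In this Python sketch, gap characters are ignored, which is an assumption of the example rather than something stated in the notes.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two different symbols each occur at least twice."""
    counts = Counter(symbol for symbol in column if symbol != "-")
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_positions(alignment):
    """0-based indices of parsimony-informative columns in an MSA."""
    return [i for i, col in enumerate(zip(*alignment))
            if is_parsimony_informative(col)]


if __name__ == "__main__":
    msa = ["AAGA", "AAGC", "ACTG", "ACTT"]
    print(informative_positions(msa))
```

Here the first column is constant and the last has four different symbols, so neither helps to discriminate between tree topologies.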

  • Distance Models
    - P-distance: The observed proportion of mismatches: $Mismatches / (Matches + Mismatches)$. Gaps are often excluded.
    - Evolutionary Distance: The estimated actual number of substitutions per site. As the p-distance increases, the difference between p-distance and evolutionary distance grows because multiple substitutions at the same site are only observed as one (or zero) mismatches. The Jukes-Cantor model corrects for this; under this model the expected p-distance saturates at $0.75$ for nucleotides.
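
The standard Jukes-Cantor correction for nucleotides is $d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$, where $p$ is the p-distance. The Python sketch below computes both quantities; skipping gap positions is one common convention and is assumed here.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of mismatches among aligned, gap-free positions."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    mismatches = sum(1 for x, y in pairs if x != y)
    return mismatches / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined for p >= 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated (>= 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    for p in (0.05, 0.25, 0.5, 0.7):
        print(p, round(jukes_cantor(p), 3))
```

Running the loop shows the corrected distance pulling further away from the p-distance as p approaches 0.75.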

  • Bootstrapping
    - A statistical technique to assess clade stability. Resamples the columns of the MSA with replacement to generate many new alignments of the same length, and builds a tree from each. The bootstrap value is the frequency with which a specific clade appears in the resulting trees.
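
The resampling step and the final support calculation can be sketched in Python as below. Tree building itself is left out; the sketch assumes each replicate tree has already been summarized as a set of clades (each clade a frozenset of leaf names), which is a representation chosen for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample MSA columns with replacement, keeping the original length."""
    columns = list(zip(*alignment))
    sampled = [rng.choice(columns) for _ in columns]
    return ["".join(row) for row in zip(*sampled)]


def clade_support(clade, replicate_trees):
    """Fraction of replicate trees that contain the given clade."""
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return hits / len(replicate_trees)


if __name__ == "__main__":
    msa = ["ACGT", "ACGA", "TCGA"]
    print(bootstrap_alignment(msa, random.Random(1)))
    # Hypothetical clade sets from four replicate trees
    trees = [{frozenset("AB")}, {frozenset("AC")},
             {frozenset("AB")}, {frozenset("AB")}]
    print(clade_support(frozenset("AB"), trees))
```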

Functional Genomics: Transcriptomics and Statistical Analysis

  • RNA-Seq Applications
    - Quantification of gene expression (coding and non-coding).
    - Discovery of novel (unannotated) genes.
    - Detection of splice variants.
    - Mapping mutations (SNPs) within transcripts.

  • Differentially Expressed Genes (DEGs)
    - Multiple Testing Correction: When testing 20,000 genes at a p-value threshold of $0.05$, we expect $20,000 \times 0.05 = 1,000$ false positives by chance even if no genes are differentially expressed.
      - Bonferroni: Controls the family-wise error rate by multiplying each p-value by the number of tests ($E = p \times N$). Very strict.
      - False Discovery Rate (FDR/q-values): Controls the expected proportion of false positives among the "significant" results. An FDR of $0.05$ for $100$ selected genes means expecting $5$ false positives.
    - Volcano Plot: Plots significance ($-\log_{10}(p)$, y-axis) against effect size ($\log_{2}(\text{Fold Change})$, x-axis). Genes in the top-left and top-right corners are significant DEGs with large effects.
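
The corrections above can be sketched in plain Python. The FDR function uses the Benjamini-Hochberg procedure, which is one common way to obtain q-values; the notes do not say which FDR method is meant, so treat that choice as an assumption.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance hits if no null hypothesis is false."""
    return n_tests * alpha


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    print(expected_false_positives(20_000, 0.05))
    pvals = [0.01, 0.04, 0.03]
    print(bonferroni(pvals))
    print(benjamini_hochberg(pvals))
```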

  • Enrichment Analysis (Fisher's Exact Test)
    - Tests whether a specific functional group of genes (e.g., cold stress genes) appears in a set of DEGs or in a cluster more often than expected by chance.
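
For enrichment, the one-sided Fisher's exact test is equivalent to the upper tail of a hypergeometric distribution: the probability of seeing at least the observed overlap when the selected genes are drawn at random. The Python sketch below computes it with exact integer arithmetic; the parameter names and the toy numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genes, n_in_group, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_genes, of which n_in_group belong to the functional group."""
    upper = min(n_in_group, n_selected)
    total = comb(n_genes, n_selected)
    tail = sum(
        comb(n_in_group, k) * comb(n_genes - n_in_group, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy example: 10 genes, 3 cold stress genes, 3 DEGs, 2 of them cold stress
    print(enrichment_p_value(10, 3, 3, 2))
```

A small p-value suggests the group is over-represented among the selected genes; when many groups are tested, the multiple testing issue discussed for DEGs applies here as well.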

  • Dimensionality Reduction and Exploration
    - PCA (Principal Component Analysis): Used on data tables (like RPKM/TPM) to visualize gradients, groups, or outliers.
    - PCoA (Principal Coordinate Analysis): Used on distance/dissimilarity tables.
    - Clustering: K-means (requires choosing $k$ clusters, randomized) vs. Hierarchical (produces a tree, deterministic).
    - Diversity:
      - Alpha-diversity: Diversity within a single sample.
      - Beta-diversity: Diversity between samples.

  • Programming in R
    - Vectors vs. Lists: Vectors store elements of the same type; lists can store elements of different types.
    - String Manipulation (stringr): str_detect() returns a logical, str_locate() returns an integer matrix of start and end positions, and str_extract() returns the matching text.
    - Tidyverse (dplyr): mutate() adds or modifies columns, filter() keeps rows based on conditions, select() keeps/excludes columns.