Comprehensive Bioinformatics and Functional Genomics Notes: Combined 2021-2025 Exam Guides
Sequencing Technologies: Advantages and Disadvantages
Illumina - Advantages: Generates huge amounts of data, is currently the cheapest method, has a low sequence error rate, and features very few indels (insertions/deletions). It has established wet-lab and dry-lab (bioinformatics) protocols. - Disadvantages: Produces short reads. Requires large, expensive machines.
Pacific Biosciences (PacBio) - Advantages: Produces long reads and has a low sequence error rate compared to some other long-read tech. - Disadvantages: High cost. Requires large amounts of DNA. Wet-lab and dry-lab pipelines are less established than Illumina.
Nanopore (Oxford Nanopore Technologies) - Advantages: Produces long reads and is highly portable (small devices). The technology is rapidly improving. - Disadvantages: Traditionally high error rates, specifically a high incidence of indel errors. Less established wet/dry lab protocols.
Sequencing Depth and Genome Assembly
Sequencing Depth (D) Calculation - The formula for sequencing depth is: $D = \frac{N \times L}{G}$ - $N$ = number of reads - $L$ = length of reads - $G$ = genome size (base pairs) - Example 1: For a genome of $G$ base pairs, reads of length $L$, and a target depth of $D$, the number of reads needed is: $N = \frac{D \times G}{L}$. - Example 2: For a genome of $G$ bp, $P$ read pairs (meaning $2P$ total reads), and read length $L$, the depth is: $D = \frac{2P \times L}{G}$.
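The depth relation can be checked with a few lines of Python. This is a minimal sketch; the function names and the genome size, read length, and depth used below are hypothetical values chosen only for illustration.

```python
import math

def sequencing_depth(num_reads, read_length, genome_size):
    """Average depth: D = N * L / G."""
    return num_reads * read_length / genome_size

def reads_needed(target_depth, read_length, genome_size):
    """Smallest whole number of reads that reaches the target depth: N = D * G / L."""
    return math.ceil(target_depth * genome_size / read_length)

# Hypothetical numbers, not taken from the notes.
genome_size = 5_000_000   # 5 Mbp genome
read_length = 150

print(reads_needed(30, read_length, genome_size))                  # 1000000

# Paired-end data: every read pair contributes two reads.
read_pairs = 500_000
print(sequencing_depth(2 * read_pairs, read_length, genome_size))  # 30.0
```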
Probability of Coverage - The probability that a random read of length $L$ covers a specific position in a circular genome of length $G$ is: $P = \frac{L}{G}$.
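As a rough illustration, the per-read probability can be extended to many reads. The sketch below assumes that reads start at independent, uniformly random positions (an idealization), and all numbers are hypothetical. Under that assumption, the chance that a position is missed by every read is close to e^(-D), where D is the sequencing depth.

```python
import math

def p_read_covers(read_length, genome_size):
    """Probability that one random read covers a given position (circular genome)."""
    return read_length / genome_size

def p_position_uncovered(num_reads, read_length, genome_size):
    """Probability that none of the reads covers a given position,
    assuming independent reads with uniform start positions."""
    return (1 - p_read_covers(read_length, genome_size)) ** num_reads

# Hypothetical example values.
genome_size, read_length, num_reads = 1_000_000, 100, 50_000
depth = num_reads * read_length / genome_size   # 5.0

exact = p_position_uncovered(num_reads, read_length, genome_size)
approx = math.exp(-depth)                       # Poisson approximation
print(f"{exact:.6f} vs {approx:.6f}")
```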
Genome Assembly Concepts - Contigs: Pieces of the genome assembled from overlapping reads. - N50 Value: A metric for assembly quality. Given a set of contigs, the total length is the sum of all contig lengths. The N50 is the length of the smallest contig such that the sum of it and all larger contigs is at least $50\%$ of the total. - De Bruijn Graphs (DBG): A method for genome assembly that avoids comparing all reads directly (faster than OLC for many reads). Reads are cut into shorter sequences of length $k$ ($k$-mers). Each unique $k$-mer is a node; edges are created for overlaps of length $k-1$.
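The two definitions above can be sketched in Python. The contig lengths and reads are hypothetical, and the all-against-all edge search is only meant for tiny examples; real assemblers index k-mers rather than comparing every pair.

```python
def n50(contig_lengths):
    """Length of the smallest contig such that it and all larger contigs
    together cover at least 50% of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    raise ValueError("no contigs given")

def kmers(read, k):
    """All substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_edges(reads, k):
    """Nodes are the unique k-mers; an edge a -> b exists when the last
    k-1 symbols of a equal the first k-1 symbols of b."""
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    return sorted((a, b) for a in nodes for b in nodes if a[1:] == b[:-1])

# Hypothetical contig lengths: total 300, half is 150, 100 + 80 >= 150.
print(n50([100, 80, 50, 40, 30]))   # 80

# Hypothetical reads sharing the k-mer GGC.
for edge in de_bruijn_edges(["ATGGC", "GGCAT"], k=3):
    print(edge)
```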
Sequencing Strategies - Amplicon sequencing: Sequences a specific marker or region (e.g., 16S rRNA). - Shotgun sequencing: Sequences all DNA fragments in a sample. - Metabarcoding: Determines the biological composition of an environmental sample from marker sequences.
Pairwise Sequence Alignment and Scoring
Dynamic Programming Algorithms - Global Alignment (Needleman-Wunsch): Finds the best alignment over the whole length. Starts filling a matrix from the top-left (score 0, then accumulated gap penalties). The final score is in the bottom-right cell. - Local Alignment (Smith-Waterman): Finds the best matching sub-segments. Matrix initialization and the recursion rule allow cells to reset to $0$ (no negative values). The optimal score is the highest value anywhere in the matrix.
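Both recursions can be sketched side by side. The version below is a minimal illustration that only returns the optimal score (no traceback) and assumes a simple match/mismatch score with a linear gap penalty; the values +1, -1, and -2 are illustrative defaults, not taken from the notes.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score: read from the bottom-right cell."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: accumulated gap penalties
        F[i][0] = i * gap
    for j in range(1, cols):          # first row: accumulated gap penalties
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[-1][-1]

def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Local alignment score: highest value anywhere in the matrix."""
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # The extra 0 lets a cell reset instead of going negative.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(needleman_wunsch("GATTACA", "GATTACA"))   # 7
print(smith_waterman("TTTACGTTT", "GGACGGG"))   # 3 (the shared segment ACG)
```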
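Scoring Logic - Log-Likelihood Ratio Score: $s(a,b) = \log \frac{q_{ab}}{p_a \, p_b}$, where $q_{ab}$ is the frequency of the pair in related sequences and $p_a$, $p_b$ are the background frequencies of the two symbols. - Positive score: Pair occurs more often in related sequences than by chance. - Negative score: Pair occurs less often in related sequences than by chance. - Example Calculation: Given $p_a$, $p_b$, and $q_{ab}$, the score is $\log \frac{q_{ab}}{p_a \, p_b}$.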
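A small numeric sketch of the log-likelihood ratio score follows. The frequencies are hypothetical, and base 2 is an assumption here (the base only rescales the scores).

```python
import math

def log_odds_score(q_ab, p_a, p_b, base=2):
    """Log ratio of the pair frequency in related sequences (q_ab)
    to the frequency expected by chance (p_a * p_b)."""
    return math.log(q_ab / (p_a * p_b), base)

# Hypothetical frequencies.
# Pair seen twice as often as expected by chance: positive score.
print(log_odds_score(q_ab=0.125, p_a=0.25, p_b=0.25))     # 1.0
# Pair seen half as often as expected by chance: negative score.
print(log_odds_score(q_ab=0.03125, p_a=0.25, p_b=0.25))   # -1.0
```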
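Gap Penalties - Linear Gap Penalty: Fixed penalty per gap position. - Affine Gap Penalty: Distinguishes between starting a gap (Gap-open, high cost) and extending a gap (Gap-extension, lower cost). A common form for a gap of length $n$ is $\gamma(n) = d + (n-1)\,e$, with gap-open penalty $d$ and gap-extension penalty $e$ (some tools instead charge $d + n\,e$).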
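The difference between the two penalty schemes shows up when one long gap is compared with several short ones. This sketch assumes the convention that the gap-open cost covers the first gap position; the penalty values are hypothetical.

```python
def linear_gap_penalty(length, per_position=2):
    """Same cost for every gap position."""
    return length * per_position

def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Gap-open cost for the first position, gap-extension cost for each
    further position (one of several conventions in use)."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend

# One gap of length 4 versus four separate gaps of length 1.
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))   # 13 40
print(linear_gap_penalty(4), 4 * linear_gap_penalty(1))   # 8 8
```

Under the affine scheme a single long gap is much cheaper than many short ones, while the linear scheme cannot tell the two cases apart.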
Sequence Databases and BLAST Statistics
Data Formats - FASTA: Header starts with ">". Sequence follows. To count sequences, count lines starting with ">". - Accession Number: A unique address/identifier for an object (e.g., a sequence) in a database. - Newick: A format describing tree topology and branch lengths.
BLAST Metrics - E-value: The expected number of hits with a score at least as high as the observed score that would be found by chance in a database of a certain size. Smaller E-values indicate higher significance. - Bit-score: A normalized version of the raw score that is independent of the scoring matrix scaling. E-value can be calculated from bit-score if search sequence length and database size are known. - Null Distribution: The distribution of scores expected from aligning unrelated/independent sequences.
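Two of the points above lend themselves to a short sketch: counting FASTA records by their ">" header lines, and converting a bit-score into an E-value. The conversion uses the standard relation E = m * n * 2^(-S') with query length m and database length n; real BLAST uses effective lengths, which this simplified version ignores. The sequences and numbers are hypothetical.

```python
def count_fasta_records(text):
    """Number of sequences = number of lines starting with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))

def evalue_from_bitscore(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'), ignoring effective-length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)

fasta = ">seq1 hypothetical record\nACGT\n>seq2\nGGCC\nTTAA\n"
print(count_fasta_records(fasta))               # 2

# Each extra bit of score halves the expected number of chance hits.
print(evalue_from_bitscore(20, 1024, 1024))     # 1.0
print(evalue_from_bitscore(21, 1024, 1024))     # 0.5
```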
Search Performance - Sensitivity (Recall): The proportion of actual homologs in the database that appear on the hit list: $\frac{TP}{TP + FN}$. - Precision: The proportion of hits on the list that are actually homologs: $\frac{TP}{TP + FP}$. - Specificity: The proportion of non-homologs correctly excluded: $\frac{TN}{TN + FP}$.
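The three measures can be computed from the four outcome counts of a search. The counts below are hypothetical: a database with 10 true homologs and 90 non-homologs, and a hit list of 10 entries of which 8 are true homologs.

```python
def search_performance(tp, fp, fn, tn):
    """Sensitivity, precision, and specificity from outcome counts."""
    return {
        "sensitivity": tp / (tp + fn),   # homologs that made the hit list
        "precision": tp / (tp + fp),     # hits that are real homologs
        "specificity": tn / (tn + fp),   # non-homologs kept off the list
    }

for name, value in search_performance(tp=8, fp=2, fn=2, tn=88).items():
    print(f"{name}: {value:.3f}")
```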
Multiple Sequence Alignment (MSA)
Algorithms - Progressive Alignment: Greedy algorithm that builds the MSA based on a guide tree (made via UPGMA or NJ). Once a gap is placed, it is not changed in later steps (“Once a gap, always a gap”). - Iterative Methods: Repeatedly realign groups of sequences in the MSA to improve the total score.
Guide Trees (UPGMA) - UPGMA (Unweighted Pair-Group Method with Arithmetic mean): A hierarchical clustering method using Average Linkage, where the distance between two groups is the average of distances between all pairs of members across groups. - Guide Tree Parsing: In a Newick string like ((A,(B,C)),E), the innermost bracketed pairs are aligned first (e.g., B and C first, then A to the BC profile, and finally E).
Scoring - Sum-of-Pairs (SP) Score: The sum of scores for all possible pairwise combinations of sequences at each position in the MSA.
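Average linkage and the sum-of-pairs score are both simple averages or sums over pairs, which the sketch below makes explicit. The distance values, the toy alignment, and the scoring scheme (match +1, mismatch -1, residue against gap -2, gap against gap 0) are assumptions for illustration only.

```python
from itertools import combinations

def average_linkage(dist, group_a, group_b):
    """Mean distance over all cross-group pairs (as used by UPGMA).
    dist maps frozenset({a, b}) to the pairwise distance."""
    total = sum(dist[frozenset((a, b))] for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score for one pair of symbols in an alignment column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(msa):
    """Sum pair scores over every pair of sequences in every column."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all rows of an MSA must have the same length")
    return sum(pair_score(a, b)
               for column in zip(*msa)
               for a, b in combinations(column, 2))

dist = {frozenset("AB"): 4, frozenset("AC"): 6, frozenset("BC"): 2}
print(average_linkage(dist, ["A"], ["B", "C"]))   # 5.0
print(sum_of_pairs(["AC-", "ACG", "A-G"]))        # -3
```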
Sequence Models: PSSM and pHMM
Position Specific Scoring Matrix (PSSM) - A table with a row for each symbol (e.g., A, C, G, T) and a column for each sequence position. Values represent the log-ratio of the probability of the symbol at that position vs. background frequency. - Example: For 20 amino acids over 8 positions, a PSSM requires $20 \times 8 = 160$ probability values.
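A PSSM can be built from a small ungapped alignment as follows. This is a minimal sketch under stated assumptions: a DNA alphabet, a uniform background frequency, a pseudocount of 1 per symbol (to avoid taking the log of zero), and base-2 logarithms. The input sequences are hypothetical.

```python
import math

def build_pssm(sequences, alphabet="ACGT", pseudocount=1.0):
    """Return {symbol: [log2(position frequency / background), ...]}."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must have equal length")
    background = 1 / len(alphabet)          # uniform background (assumption)
    pssm = {symbol: [] for symbol in alphabet}
    for pos in range(length):
        column = [seq[pos] for seq in sequences]
        denom = len(column) + pseudocount * len(alphabet)
        for symbol in alphabet:
            freq = (column.count(symbol) + pseudocount) / denom
            pssm[symbol].append(math.log2(freq / background))
    return pssm

def score_sequence(pssm, seq):
    """Sum of the position-specific scores of a sequence."""
    return sum(pssm[symbol][pos] for pos, symbol in enumerate(seq))

pssm = build_pssm(["ACG", "ACG", "ATG", "ACG"])
print(len(pssm), "rows x", len(pssm["A"]), "columns")
print(score_sequence(pssm, "ACG") > score_sequence(pssm, "TTT"))   # True
```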
Profile Hidden Markov Models (pHMM) - Describes motifs that can vary in length (including gaps). Includes "Match" states (observing symbols), "Insertion" states, and "Deletion" states (skipping symbols).
PROSITE Patterns - Syntax to represent motifs. Example: A-[LK]-[IW]-X-L(3)-S means A followed by L or K, then I or W, then any symbol (X), then L three times, then S.
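One way to test sequences against such a pattern is to translate it into a regular expression. The converter below is a sketch that handles only the elements used in the example (single symbols, bracketed alternatives, X for any symbol, and a fixed repeat count); other PROSITE features such as exclusions, repeat ranges, and anchors are not covered. The test sequences are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "L(3)" into symbol "L" and count "3".
        m = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        symbol, count = m.group(1), m.group(2)
        if symbol in ("x", "X"):
            symbol = "."                      # any symbol
        parts.append(symbol + (f"{{{count}}}" if count else ""))
    return "".join(parts)

regex = prosite_to_regex("A-[LK]-[IW]-X-L(3)-S")
print(regex)                                        # A[LK][IW].L{3}S
print(bool(re.search(regex, "MMAKWQLLLSP")))        # True
print(bool(re.search(regex, "ALIALLS")))            # False (only two L in a row)
```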
Molecular Phylogeny
Homology Concepts - Orthologs: Genes originated via speciation. Used to describe species evolution. - Paralogs: Genes originated via gene duplication. Used to describe gene family evolution. - Note: All orthologs and paralogs are homologs, but not all homologs are orthologs.
Phylogenetic Trees - Cladogram: Shows topology (branching order) only; branch lengths have no meaning. - Phylogram: Branch lengths are proportional to evolutionary distance. - Additive tree: Sum of branch lengths between nodes equals the distance in the distance matrix. - Ultrametric tree: All leaves are equidistant from the root (assumes a constant molecular clock).
Reconstruction Methods - Distance-based (NJ, UPGMA): Transform sequence data into a distance matrix. Neighbor-Joining (NJ) is a greedy method that seeks the tree with the smallest total length. - Discrete methods: - Maximum Parsimony (MP): Finds the tree requiring the fewest total mutations (uses informative positions where at least 2 symbols occur at least 2 times). - Maximum Likelihood (ML): Finds the tree/model that maximizes the probability of the observed data. - Bayesian inference (BI): Finds trees with the highest posterior probability.
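The rule for parsimony-informative positions (at least 2 symbols, each occurring at least 2 times) can be checked column by column. In this sketch, gap characters are ignored when counting, which is an assumption, and the alignment is hypothetical.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two different symbols each occur at least twice."""
    counts = Counter(symbol for symbol in column if symbol != "-")
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_positions(msa):
    """Indices of the parsimony-informative columns of an alignment."""
    return [i for i, column in enumerate(zip(*msa))
            if is_parsimony_informative(column)]

msa = ["AAGA",
       "AAGC",
       "GACG",
       "GCCT"]
# Column 0 (A,A,G,G) and column 2 (G,G,C,C) qualify; the others do not.
print(informative_positions(msa))   # [0, 2]
```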
Distance Models - P-distance: The observed proportion of mismatches: $p = \frac{\text{number of mismatches}}{\text{number of compared positions}}$. Gaps are often excluded. - Evolutionary Distance: The estimated actual number of substitutions. As p-distance increases, the gap between p-distance and evolutionary distance grows because of multiple substitutions at the same site. Jukes-Cantor model corrects for this; maximum p-distance is $0.75$ (for nucleotides).
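The relation between observed and corrected distance can be made concrete with the Jukes-Cantor formula for nucleotides, d = -(3/4) ln(1 - (4/3) p). The sketch below excludes gapped positions from the p-distance; the sequences are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of mismatches among ungapped aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotides."""
    if p >= 0.75:
        raise ValueError("p-distance at or above 0.75 is saturated")
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("ACGTACGT", "ACGTACGA")
print(p)                                   # 0.125
# The correction grows as the observed distance increases.
for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor(p), 3))
```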
Bootstrapping - A statistical technique to assess clade stability. Resamples the MSA with replacement to generate many new alignments. The bootstrap value is the frequency with which a specific clade appears in the resulting trees.
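The resampling step can be sketched as drawing alignment columns with replacement. Tree building is deliberately left out; the support function below only assumes that some external step has recorded, for each replicate tree, whether the clade of interest was present. The alignment and the flags are hypothetical.

```python
import random

def bootstrap_alignment(msa, rng):
    """Resample the columns of an MSA with replacement (same width)."""
    n_cols = len(msa[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in chosen) for row in msa]

def bootstrap_support(clade_present_flags):
    """Fraction of replicate trees that contain the clade."""
    return sum(clade_present_flags) / len(clade_present_flags)

rng = random.Random(0)                     # fixed seed for reproducibility
msa = ["ACGT", "ACGA", "TCGA"]
print(bootstrap_alignment(msa, rng))

# Hypothetical outcome: the clade appeared in 3 of 4 replicate trees.
print(bootstrap_support([True, True, False, True]))   # 0.75
```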
Functional Genomics: Transcriptomics and Statistical Analysis
RNA-Seq Applications - Quantification of gene expression (coding and non-coding). - Discovery of novel (unannotated) genes. - Detection of splice variants. - Mapping mutations (SNPs) within transcripts.
Differentially Expressed Genes (DEGs) - Multiple Testing Correction: When testing 20,000 genes at a p-value threshold of $\alpha$, we expect $20{,}000 \times \alpha$ false positives by chance even if no genes are differentially expressed. - Bonferroni: Controls the family-wise error rate by testing each gene at $\alpha / m$ for $m$ tests. Very strict. - False Discovery Rate (FDR/q-values): Controls the proportion of false positives among the "significant" results. An FDR of $q$ for $n$ selected genes means expecting $q \times n$ false positives. - Volcano Plot: Plots the significance ($-\log_{10}$ p-value) vs. effect size ($\log_2$ fold change). Genes in the top-left and top-right corners are significant DEGs.
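The effect of multiple testing and the two corrections can be sketched numerically. The threshold of 0.05 and the p-values are hypothetical, and the q-values are computed with the Benjamini-Hochberg procedure, which is one common way to control the FDR (the notes do not name a specific procedure).

```python
def expected_false_positives(n_tests, alpha):
    """Expected chance hits when no gene is truly differentially expressed."""
    return n_tests * alpha

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

print(expected_false_positives(20_000, 0.05))    # about 1000
print(bonferroni_threshold(0.05, 20_000))        # about 2.5e-06
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```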
Enrichment Analysis (Fisher's Exact Test) - Tests if a specific functional group of genes (e.g., cold stress genes) appears in a set of DEGs or clusters more often than expected by chance.
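For over-representation, the one-sided Fisher's exact test is equivalent to a hypergeometric tail probability: the chance of drawing at least the observed number of group members when the DEG set is drawn at random from all genes. The sketch below uses only the standard library; the gene counts are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, group_size, selected, overlap):
    """P(X >= overlap), where X is the number of functional-group genes
    among `selected` genes drawn at random from `total_genes`."""
    upper = min(group_size, selected)
    favourable = sum(
        comb(group_size, k) * comb(total_genes - group_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected)

# Hypothetical example: 20,000 genes, 200 cold stress genes,
# 500 DEGs, of which 20 are cold stress genes (5 expected by chance).
print(enrichment_p_value(20_000, 200, 500, 20))
```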
Dimensionality Reduction and Exploration - PCA (Principal Component Analysis): Used on data tables (like RPKM/TPM) to visualize gradients, groups, or outliers. - PCoA (Principal Coordinate Analysis): Used on distance/dissimilarity tables. - Clustering: K-means (requires choosing $k$ clusters, randomized) vs. Hierarchical (produces a tree, deterministic). - Diversity: - Alpha-diversity: Diversity within a single sample. - Beta-diversity: Diversity between samples.
Programming in R - Vectors vs. Lists: Vectors store elements of the same type; lists can store different types. - String Manipulation: str_detect() (logical), str_locate() (integer matrix of positions), str_extract() (text). - Tidyverse: mutate() adds columns, filter() keeps rows based on conditions, select() keeps/excludes columns.