Lecture 1: Transcriptomics - RNA-seq 🧬🔬

Learning Outcomes:

  • Define gene expression and understand why experiments are needed despite having genome sequences.

  • Describe the basic methodology of RNA-seq.

  • Explain how RNA-seq read distribution relates to mRNA quantity, structure, and splicing in eukaryotes, and interpret this data.

  • Describe the utility of normalised metrics like TPM in RNA-seq and interpret comparisons using X/Y Scatter and Volcano plots for differential gene expression.

  • Briefly describe single-cell RNA-seq (scRNA-seq) and interpret UMAP plots.


Introduction to Gene Expression

  • Gene expression is the functional "read-out" of information from a genome, typically involving the transcription of DNA into RNA, which may then be translated into protein (e.g., FUN12 mRNA to Fun12 protein) or function directly as RNA (e.g., tRNA).

  • Genomes are visualized as collections of genes and other features along chromosomes. For example, the yeast genome shows many genes like FUN12 and CLN3.

  • Despite having accurate genome sequences (e.g., human reference genome hg38), we currently cannot predict when and where genes will be expressed in an organism just from the sequence.

  • Therefore, experiments are necessary to assay gene expression.


DNA Sequencing and Functional Genomics

  • Modern DNA sequencing technology allows us to assemble and align sequence fragments to determine genome sequences. This process involves isolating cells, making genomic DNA, processing it for sequencing, sequencing, and then assembling/aligning the fragments.

  • This underpins the "'Omics Revolution". Genome sequencing provides reference genomes.

  • Functional genomics leverages DNA sequencing to perform gene expression analysis. Key methods include:

    • RNA-seq: Analyzes/maps RNA transcripts and splicing (transcriptome assay).

    • ChIP-seq: Analyzes/maps specific DNA-bound proteins (epigenome assay).

    • DNase-seq, MNase-seq, ATAC-seq: Analyze/map general DNA-bound proteins or accessible chromatin (epigenome assay).

    • Bisulfite-seq: Identifies DNA base chemical modifications (epigenome assay).

    • Hi-C: Analyzes/maps 3D in vivo DNA interactions (chromosome structure map).


Transcriptomics and RNA-seq 📜

  • The transcriptome is the complete set of RNA transcripts in a cell, tissue, or organism under specific conditions.

  • Transcriptomes are complex and change during development, disease, and in response to the environment.

  • As a working approximation, the amount of mRNA from a gene is roughly proportional to the amount of resulting protein, so mRNA levels are used as a proxy for gene activity.

  • RNA-seq is the core technology for assaying transcriptomes.

Basic Methodology of RNA-seq:

  1. Isolate RNA: Purify RNA from a living cell population.

  2. mRNA Selection (often): Most RNA-seq methods select for mRNAs because researchers are primarily interested in protein-coding genes. Since eukaryotic mRNAs have a characteristic poly-A tail, they can be selected from total RNA using beads coated with complementary poly-T sequences. Other RNAs like tRNA are typically excluded.

  3. Convert mRNA to cDNA: It's technically challenging to sequence RNA directly, so RNA is converted to a more stable DNA copy (cDNA) using the enzyme reverse transcriptase. This conversion is ideally one-to-one.

  4. Library Preparation: The cDNA molecules are processed into a sequencing library (fragmented, adaptors ligated).

  5. Next-Generation Sequencing (NGS): The cDNA library is sequenced using high-throughput methods like Illumina sequencing. For human mRNA, ~30-50 million NGS reads per sample are typically needed.

  6. Data Analysis: The resulting sequence reads are aligned to a reference genome. The frequency distribution of these aligned reads is then plotted or counted.
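The counting part of step 6 can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: the gene names, coordinates, and read positions are made up, and a read is assigned to a gene simply by where its aligned start position falls.

```python
# Hypothetical gene annotation: name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

def count_reads_per_gene(read_positions, genes):
    """Count aligned reads whose start position falls inside each gene."""
    counts = {name: 0 for name in genes}
    for chrom, pos in read_positions:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return counts

# Made-up aligned reads as (chromosome, start position) pairs.
reads = [("chr1", 120), ("chr1", 450), ("chr1", 900), ("chr1", 650), ("chr2", 150)]
gene_counts = count_reads_per_gene(reads, GENES)
print(gene_counts)
```

Reads that fall outside every annotated gene (here the ones at chr1:650 and chr2:150) are simply not counted.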


Interpreting RNA-seq Data

  • Read Frequency and Quantity: The frequency of RNA-seq reads aligning to a specific gene measures the level of that gene's transcripts in the cell population. For example, in yeast, if the VTC4 gene produces ~5 times more mRNA than the CCT3 gene, RNA-seq will show a ~5-fold higher frequency of reads aligning to VTC4 compared to CCT3.

  • Gene Structure and Splicing in Eukaryotes:

    • Higher eukaryotic protein-coding genes often contain non-coding introns that interrupt the coding exons. Gene maps usually show exons as boxes and introns as gaps between them, with chevrons or arrows indicating transcription direction.

    • During mRNA processing, introns are removed by splicing.

    • RNA-seq samples mature mRNAs, so reads align to exons but not to introns (which are spliced out). This results in a "spiky" appearance of RNA-seq data when mapped to the genome.

    • This feature allows RNA-seq data to map the exons of transcribed genes.

  • Alternative Splicing (AS): Eukaryotic cells can splice pre-mRNAs in different ways, producing multiple mRNA variants (isoforms) from a single gene, which increases coding potential.

    • RNA-seq can identify these variants, as reads will span different exon:exon junctions depending on the splice variant produced. Genome maps often show known splice variants stacked to illustrate different exon combinations from the same DNA region. For example, the GLIPR2 gene has four known splice variants using combinations of five exons. RNA-seq data from neutrophils might show reads corresponding only to exons 1, 2, and 5, indicating production of GLIPR2 Variant #1.

  • Untranslated Regions (UTRs): Not all exonic sequences encode protein. RNA-seq also maps the 5' and 3' UTRs of mRNAs, often shown as "thin boxes" in gene maps, while protein-coding regions are "fat boxes". UTRs are important for regulating mRNA localization and translation.

RNA-seq data, therefore, provides information on gene transcription levels, exon-intron structure, and alternative splicing patterns.
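The idea that exon:exon junction reads identify a splice variant can be sketched as follows. Only the exon 1, 2, 5 combination for Variant #1 comes from these notes; the other three exon combinations are placeholders invented for illustration, and the function names are hypothetical.

```python
# Exon combinations per splice variant. Only "variant_1" (exons 1, 2, 5) is
# taken from the notes; the other entries are made-up placeholders.
VARIANTS = {
    "variant_1": (1, 2, 5),
    "placeholder_a": (1, 2, 3, 5),
    "placeholder_b": (1, 3, 4, 5),
    "placeholder_c": (1, 2, 3, 4, 5),
}

def junctions(exons):
    """Return the set of exon:exon junctions implied by an ordered exon list."""
    return {(a, b) for a, b in zip(exons, exons[1:])}

def consistent_variants(observed_junctions, variants):
    """List variants whose junction set exactly matches the observed junctions."""
    observed = set(observed_junctions)
    return [name for name, exons in variants.items() if junctions(exons) == observed]

# Junction-spanning reads seen in the sample: exon 1 to 2, and exon 2 to 5.
observed = [(1, 2), (2, 5)]
matches = consistent_variants(observed, VARIANTS)
print(matches)
```

This assumes a sample expressing a single variant; a mixture of isoforms would need the junction read counts to be apportioned between variants.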


Quantitative RNA-seq Analysis and Differential Gene Expression (DGE)

  • Reads are spread over discrete exons, and genes differ in length, so raw read counts are not directly comparable between genes. Bioinformatics metrics are therefore used to derive a single normalised expression value for each gene.

  • TPM (Transcripts Per Million, sometimes written "Transcripts Per Kilobase Million") is a common normalised metric.

    • Calculation steps:

      1. Divide read counts for a gene by the length of the gene in kilobases (RPK).

      2. Sum all RPK values in the sample and divide by 1,000,000 (per million scaling factor).

      3. Divide individual gene RPK values by this scaling factor to get TPM.

    • Other metrics include RPKM (Reads Per Kilobase of transcript per Million mapped reads) and FPKM (Fragments Per Kilobase of transcript per Million mapped reads). You don't need to learn the detailed calculations for BI2234, just their utility.

  • TPM values are normalized for quantitative comparison between different genes and experiments, enabling Differential Gene Expression (DGE) analysis.
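The three TPM calculation steps can be written out directly. This is a minimal sketch with made-up read counts and gene lengths; it assumes at least one gene has a non-zero count, and the function name is hypothetical.

```python
def tpm(read_counts, gene_lengths_bp):
    """Compute TPM per gene from raw read counts and gene lengths in base pairs."""
    # Step 1: reads per kilobase (RPK) = count / (length in kb).
    rpk = {g: read_counts[g] / (gene_lengths_bp[g] / 1000) for g in read_counts}
    # Step 2: per-million scaling factor = sum of all RPK values / 1,000,000.
    scaling = sum(rpk.values()) / 1_000_000
    # Step 3: TPM = RPK / scaling factor.
    return {g: value / scaling for g, value in rpk.items()}

# Made-up example: geneB has 3x the reads of geneA but is also 3x longer.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
tpm_values = tpm(counts, lengths)
print(tpm_values)
```

In this example geneA and geneB end up with the same TPM, because the extra reads on geneB are explained by its greater length. TPM values within one sample always sum to one million, which is what makes samples comparable.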

Approaches to DGE Analysis:

  1. Comparing one gene across many conditions: Plotting TPM values for a single gene (e.g., GLIPR2) across multiple tissue types or experimental conditions reveals its expression pattern. For instance, GLIPR2 is highly transcribed in blood cell types but rarely in brain cells. Box-and-whisker plots are often used for such comparisons.

  2. Comparing all genes between two conditions (e.g., cancer cells vs. normal cells):

    • X/Y Scatter Plot: Each dot represents a gene. Its x-coordinate is the TPM in the control condition, and the y-coordinate is the TPM in the experimental condition.

      • Genes with unchanged expression lie on the x=y diagonal line (e.g., TUB).

      • Genes with significantly altered expression lie above (up-regulated, e.g., HER2) or below (down-regulated) the diagonal. Statistical significance (e.g., p-values) helps define these thresholds.

    • Volcano Plot: Plots, for each gene, the log fold change in expression between two conditions (x-axis, usually log2) against the -log10 p-value (y-axis), so that more significant genes sit higher on the plot.

      • Genes with significantly increased expression appear in the top-right quadrant.

      • Genes with significantly decreased expression appear in the top-left quadrant.

      • Genes with no significant change cluster around the bottom center. A p-value threshold (e.g., p ≤ 0.05) is typically used to define significance.
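The volcano plot coordinates and the quadrant logic can be sketched as below. The pseudocount (to avoid dividing by zero), the two-fold change cutoff, and the function names are assumptions added for illustration; only the p ≤ 0.05 threshold comes from the notes, and the TPM and p-values are made up.

```python
import math

def volcano_point(tpm_control, tpm_experiment, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p-value) for one gene."""
    # The pseudocount is an assumption that avoids division by zero.
    ratio = (tpm_experiment + pseudocount) / (tpm_control + pseudocount)
    return math.log2(ratio), -math.log10(p_value)

def classify(log2_fc, neg_log10_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene by where it falls on the volcano plot."""
    if neg_log10_p < -math.log10(p_cutoff):
        return "not significant"   # bottom of the plot
    if log2_fc >= fc_cutoff:
        return "up"                # top-right quadrant
    if log2_fc <= -fc_cutoff:
        return "down"              # top-left quadrant
    return "not significant"       # significant p-value but small change

up_gene = classify(*volcano_point(7, 31, 0.001))
down_gene = classify(*volcano_point(31, 7, 0.001))
flat_gene = classify(*volcano_point(10, 10, 0.5))
print(up_gene, down_gene, flat_gene)
```

Real DGE tools compute p-values from replicate samples and correct them for multiple testing; here the p-values are simply supplied as inputs.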

Quantitative RNA-seq comparisons allow discovery of changes in gene regulation driving development or disease.


Single-Cell RNA-seq (scRNA-seq) 🔬👤

  • Most RNA-seq is performed on bulk tissue samples or cell cultures, averaging transcriptomes across millions of cells.

  • scRNA-seq assays the transcriptome of individual single cells.

  • Process:

    1. Cells are sampled one-by-one (e.g., using microfluidics or dissection).

    2. RNAs within each cell are simultaneously converted to cDNA and tagged with a unique index sequence (cell barcode) specific to that cell.

    3. Indexed cDNAs from all cells are pooled, prepared into a library, and sequenced using NGS.

    4. Reads are assigned back to their cell of origin using the barcodes.

  • Utility: scRNA-seq allows the construction of cell atlases by classifying the transcriptomes of individual cells.

  • UMAP (Uniform Manifold Approximation and Projection) Plots:

    • Used to visualize scRNA-seq data. Each dot on a UMAP plot represents the transcriptome of a single cell, positioned in "UMAP space" based on its overall gene expression profile (which genes are active/inactive).

    • Cells with similar transcriptomic profiles cluster together, enabling the identification and characterization of different cell types and states within a heterogeneous population.

    • For example, scRNA-seq of the human pancreas identified at least 29 different cell types. Similar atlases have been generated for whole organisms like monkeys.
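Step 4 of the scRNA-seq process, assigning pooled reads back to their cell of origin, can be sketched as follows. This is a simplification: the barcode is assumed to be the first four bases of each read, and the barcodes, cell names, and sequences are all made up.

```python
from collections import defaultdict

BARCODE_LENGTH = 4  # hypothetical; chosen only to keep the example short

def demultiplex(reads, known_barcodes, barcode_length=BARCODE_LENGTH):
    """Group pooled reads by cell, using the barcode at the start of each read."""
    per_cell = defaultdict(list)
    for read in reads:
        barcode, cdna = read[:barcode_length], read[barcode_length:]
        # Reads with an unrecognised barcode are discarded.
        if barcode in known_barcodes:
            per_cell[known_barcodes[barcode]].append(cdna)
    return dict(per_cell)

# Made-up barcodes and pooled reads (barcode followed by cDNA sequence).
known = {"ACGT": "cell_1", "TTAG": "cell_2"}
pooled = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTCCCGGG", "NNNNAAAAAA"]
by_cell = demultiplex(pooled, known)
print(by_cell)
```

Once reads are grouped per cell, each cell's reads can be counted per gene just as in bulk RNA-seq, giving the per-cell expression profiles that a UMAP plot summarises.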


Summary

  • RNA-seq, in its various forms (bulk and single-cell), is the current core technology for assaying gene expression.

  • It provides quantitative information about which genes are transcribed, their expression levels, their exon-intron structures, and alternative splicing variants.

  • Understanding RNA-seq principles is fundamental to understanding most other functional genomics methods.