!! Bulk, Single Cell, and Spatial RNA-Seq

41 Terms

1
New cards

Why study RNA?

  • RNA is dynamic, capturing more information than static DNA.

  • RNA is the first quantitative link between DNA sequence and phenotype.

  • RNA captures regulatory complexity beyond gene sequence.

    • The transcriptome encodes layers of control not visible in DNA through alternative splicing and noncoding RNA regulation.

2
New cards

Gene expression

  • A gene is a short DNA section containing instructions for making proteins.

    • Exons, introns.

    • Alternative splicing.

      • Different combinations of exons.

      • Multiple transcripts.

  • Gene expression level is the number of gene transcripts produced in an organism or cell.

  • Gene expression is the process of using DNA information to create RNA and proteins.

3
New cards

Detecting gene expression

  • RNA-Seq and DNA microarray can be used to profile gene expression.

  • cDNA (complementary DNA) is used in gene expression studies to detect and quantify gene expression products.

    • RNA is not stable and can be easily degraded.

  • cDNA libraries are gene libraries containing only genes expressed in a cell or tissue.

4
New cards

RNA-Seq steps

  1. Sample preparation: Prepares the mRNA for sequencing.

    • Starting material: Total RNA is isolated, then the mRNA is enriched (e.g., by poly(A) selection).

    • Fragmentation/cDNA synthesis: The mRNA can either be fragmented directly into RNA fragments or be converted into more stable cDNA using reverse transcriptase.

    • Adapters: Sequencing adapters are ligated (attached) to the ends of fragments, creating a sequencing library.

  2. Next Generation Sequencing: The prepared library is sequenced to produce raw data.

    • Sequencing: The fragments are sequenced to generate millions of short sequence reads.

  3. Data analysis: The raw sequencing reads are processed to quantify gene expression.

    • Mapping reads: The short reads are aligned against a known reference genome or transcriptome.

      • Exonic reads map entirely within an exon.

      • Junction reads span a splice junction (connecting two exons).

      • Poly(A) end reads map to the end of the transcript, confirming the 3’ end.

    • Visualization (genome browser).

    • De novo assembly: If a reference genome is unavailable, reads can be assembled to build transcripts.

    • Quantification: The number of reads mapping to a gene or transcript is counted to determine its expression level. The final result is a base-resolution expression profile, a graph showing the abundance of RNA (read coverage) across the transcript length.

5
New cards

RNA-Seq analysis steps

  • Quality control notes: sequencing output = FASTQ files; check read quality before further processing.

  • Pre-processing notes: If the reads contain low-quality bases or adapter sequences, you might want to trim or filter them.

  • Alignment notes: map reads to reference genome/transcriptome; output = BAM files.

  • Quantitation notes: abundance quantification; gene, exon, or transcript levels.

6
New cards

RNA read alignment

  • Find the position of a sequencing read on the reference genome.

    • Reads can be unaligned or aligned to more than one location.

    • A paired-end read will only be counted as one read if both ends align to the genome.

  • Given a list of transcripts, each read can be assigned to one of four classes: exonic, partially overlapping an exon, intronic, or between genes (intergenic).

    • Why intronic reads occur: they mostly come from unspliced pre-mRNA or from functional non-coding RNAs transcribed from those genomic regions.
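
The four read classes can be sketched in code. This is only an illustration under simplifying assumptions: one gene model, half-open coordinates, and no handling of spliced (junction) reads. The function name `classify_read` and the coordinates are made up for this example.

```python
def classify_read(read_start, read_end, exons, gene_span):
    """Classify one aligned read against a single (hypothetical) gene model.

    Coordinates are half-open [start, end). `exons` is a list of
    (start, end) tuples; `gene_span` is the (start, end) of the gene.
    """
    # Exonic: the read lies entirely inside one exon.
    for ex_start, ex_end in exons:
        if ex_start <= read_start and read_end <= ex_end:
            return "exonic"
    # Partial: the read overlaps an exon but also extends beyond it.
    for ex_start, ex_end in exons:
        if read_start < ex_end and ex_start < read_end:
            return "partially overlaps an exon"
    # Intronic: inside the gene but touching no exon.
    gene_start, gene_end = gene_span
    if gene_start <= read_start and read_end <= gene_end:
        return "intronic"
    # Everything else falls between genes.
    return "between genes"


# Toy gene with two exons separated by one intron.
exons = [(100, 200), (300, 400)]
gene = (100, 400)
for read in [(120, 170), (180, 230), (220, 270), (500, 550)]:
    print(read, classify_read(*read, exons, gene))
```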

7
New cards

Abundance quantification

  • Raw counts can be biased:

    • Sequencing depth: Different total numbers of reads between samples.

      • Correction: Normalizing by total library size.

    • Gene length: Longer genes have more space for reads to map, leading to higher raw counts even if the gene's concentration is the same as a shorter gene.

      • Correction: Normalizing by transcript/gene length.

    • RNA composition: A few highly expressed genes can take up a significant fraction of the total reads, making other genes appear to have lower expression than they really do. This affects the comparison between samples.

      • Correction: Using specialized between-sample normalization methods.

8
New cards

RNA-seq normalization methods

  • Within sample:

    • Required to compare the expression of genes within an individual sample.

    • It can adjust data for two primary technical variables: transcript length and sequencing depth.

  • Counts per million (CPM).

  • FPKM (fragments per kilobase of transcript per million fragments mapped).

  • Transcripts per million (TPM).

9
New cards

Counts per million (CPM)

  • The number of raw counts mapped to a transcript, divided by the total number of sequencing reads in your sample and multiplied by one million.

  • It normalizes RNA-seq data for sequencing depth but not gene length.

    • Not yet suitable for comparing different genes, because longer genes still collect more reads.
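
A minimal sketch of the CPM calculation, assuming the total number of reads in the sample is simply the sum of the per-gene counts. The gene names and counts are invented.

```python
def cpm(counts):
    """Counts per million: raw count / total counts in the sample * 1e6."""
    total = sum(counts.values())
    return {gene: count / total * 1_000_000 for gene, count in counts.items()}


# Hypothetical sample with three genes.
sample = {"geneA": 10, "geneB": 20, "geneC": 5}
cpm_values = cpm(sample)
print(cpm_values)
```

Gene length never enters the formula, which is why CPM values of two different genes are not directly comparable.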

10
New cards

FPKM (fragments per kilobase of transcript per million fragments mapped) & RPKM (reads per kilobase of transcript per million reads mapped)

  • FPKM = paired-end data.

  • RPKM = single-end data.

  • Correct for both library size and gene length.

  • Good for comparison of gene expression within a single sample, but not for across-sample comparison.

11
New cards

Transcripts per million (TPM)

  • Represent the relative number of transcripts you would detect for a gene if you had sequenced one million full-length transcripts.

  • Correct for both sequencing depth and transcript length.

  • Suitable for within-sample comparison of gene expression, but not for across-sample comparison.

12
New cards

RPKM summary

  • Step 1 = normalize for sequencing depth: RPM = raw counts for gene / (total mapped reads / 1,000,000).

    • In the toy example below, the total mapped reads are divided by 10 instead of 1,000,000 to keep the numbers simple.

  • Step 2 = normalize for gene length: RPKM = RPM / gene length in kilobases.

  • Example:

    • Gene length = 2 kb.

    • Gene raw counts = 10.

    • Total mapped reads = 35.

    • Step 1: 10 / (35 / 10) = 2.86.

    • Step 2: 2.86 / 2 = 1.43.
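
The two steps can be checked with a few lines of code. This sketch exposes the depth scaling factor as a parameter (`scale`, a name chosen here) so the toy example can use 10 while real data would use one million.

```python
def rpkm(raw_count, total_mapped_reads, gene_length_kb, scale=1_000_000):
    """RPKM in two steps: depth normalization, then length normalization."""
    # Step 1: reads per `scale` mapped reads (RPM when scale is one million).
    rpm = raw_count / (total_mapped_reads / scale)
    # Step 2: divide by gene length in kilobases.
    return rpm / gene_length_kb


# Toy example from the card: scale of 10 instead of 1,000,000.
toy = rpkm(raw_count=10, total_mapped_reads=35, gene_length_kb=2, scale=10)
print(round(toy, 2))  # 1.43
```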

13
New cards

TPM vs. RPKM

  • RPKM:

    • Reads mapped / (gene length in kb * total mapped reads in millions).

    • Normalizes for depth first, then length.

    • The total RPKM/FPKM of all genes in a sample does NOT sum to the same value across different samples.

    • Only for comparing the expression between different genes within a single sample.

  • TPM:

    • ((Reads mapped / gene length in kb) / (sum over all genes of (reads mapped / gene length in kb))) * 10^6.

    • Normalizes for length first, then scales the length-normalized values (the RPK) by the total sum of all RPKs in the sample.

    • The sum of all TPM values in a sample is always 10^6 (one million), so every sample is on the same scale.

    • Preferred over RPKM for comparing the expression of a single gene across different samples (replicates), though RNA-composition differences can still bias the comparison.
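
A small sketch contrasting the two formulas on invented data. It assumes total mapped reads equals the sum of the per-gene counts, and the function names `rpkm_all` and `tpm_all` are made up for this example.

```python
def rpkm_all(counts, lengths_kb):
    """RPKM for every gene: depth first, then length."""
    total_millions = sum(counts.values()) / 1_000_000
    return {g: counts[g] / (lengths_kb[g] * total_millions) for g in counts}


def tpm_all(counts, lengths_kb):
    """TPM for every gene: length first (RPK), then scale by the RPK total."""
    rpk = {g: counts[g] / lengths_kb[g] for g in counts}
    rpk_total = sum(rpk.values())
    return {g: rpk[g] / rpk_total * 1_000_000 for g in counts}


# Two hypothetical samples measured over the same three genes.
lengths_kb = {"geneA": 2.0, "geneB": 4.0, "geneC": 1.0}
sample1 = {"geneA": 10, "geneB": 20, "geneC": 5}
sample2 = {"geneA": 12, "geneB": 25, "geneC": 30}

for name, sample in [("sample1", sample1), ("sample2", sample2)]:
    print(name,
          "sum RPKM =", round(sum(rpkm_all(sample, lengths_kb).values())),
          "sum TPM =", round(sum(tpm_all(sample, lengths_kb).values())))
```

The TPM totals match across the two samples while the RPKM totals do not, which is the difference the card describes.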

14
New cards

Evolution of gene expression measurements

  • Bulk RNA-seq: Measures average gene expression across many cells.

    • Cost-effective and straightforward, but masks cell-to-cell differences.

  • Single-cell RNA-seq: Profiles individual cells to reveal cellular heterogeneity, rare cell types, and dynamic states.

  • Spatial RNA-seq: Captures where genes are expressed within tissue, preserving spatial context and cell-cell interactions.

15
New cards

Why study single cells?

  • When compared to bulk RNA-seq, single-cell studies:

    • Reveal cell-to-cell differences and provide an unbiased view of cellular complexity within a tissue.

    • Allow for the discovery and precise identification of rare cell types and dynamic cellular states (e.g., transitional stages).

    • Provide the highest resolution view of the intermediate step between genotype and phenotype (where cells are the functional constituents).

16
New cards

Bulk vs. scRNA-seq

  • Bulk RNA-seq:

    • Quantifying expression signatures from ensembles.

    • Insufficient for studying a heterogeneous system.

  • scRNA-seq:

    • Inference of gene regulatory networks across the cells.

    • Heterogeneity of cell responses.

    • Cell type identification.

17
New cards

scRNA-seq analysis: cell- and gene-level analysis

  • Cell-level analysis:

    • Marker gene identification.

    • Cluster analysis.

    • Trajectory analysis.

  • Gene-level analysis:

    • Single-cell differential expression analysis.

    • Gene set analysis.

    • Gene regulatory networks.

18
New cards

scRNA-seq process

  1. Sample preparation.

  2. Single-cell RNA sequencing.

  3. Data processing.

  4. Data analysis.

19
New cards

scRNA-seq sample preparation

  • Cells are physically separated into a single-cell solution from which specific cell types can be enriched or excluded.

    • Each cell is captured by one droplet.

    • Each droplet contains a unique barcode and the necessary reagents for reverse transcription.

    • Each individual RNA molecule captured within that droplet is tagged with its own unique molecular identifier (UMI).

      • The UMI distinguishes unique RNA molecules from duplicates that arise during PCR amplification.

20
New cards

scRNA-seq single-cell RNA sequencing

  • A single cell contains an extremely small amount of RNA, which is hard to detect directly.

  • The material is therefore PCR-amplified before sequencing starts.

21
New cards

scRNA-seq data processing

  • UMI.

  • Gene counts.

  • Drop-outs in single cell.

  • Imputation method: MAGIC.

22
New cards

Unique molecular identifier (UMI)

  • UMIs are short (4-10 bp) random barcodes added to transcripts during reverse-transcription.

  • UMIs enable sequencing reads to be assigned to individual transcript molecules and thus the removal of amplification noise and biases from scRNA-Seq data.

  • They reduce the amplification noise by allowing (almost) complete de-duplication of fragments.

  • Transcript abundance is estimated by counting the number of distinct UMI sequences rather than the number of reads.

  • The UMI is copied along with its molecule, so this information does not get lost during the amplification process.

23
New cards

Gene counts

  • For each gene within each cell, the number of unique UMIs is counted and reported as the number of transcripts of that gene for that cell.
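
A minimal sketch of UMI-based gene counting. It assumes each read has already been assigned a cell barcode, a gene, and a UMI, and it ignores sequencing errors in the UMI itself. The barcodes, gene names, and UMIs are invented.

```python
from collections import defaultdict


def umi_counts(reads):
    """Count unique UMIs per (cell barcode, gene) pair.

    `reads` is an iterable of (cell_barcode, gene, umi) tuples.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        # A repeated UMI for the same cell and gene is a PCR duplicate,
        # so adding it to the set again changes nothing.
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


reads = [
    ("CELL1", "GeneA", "AACGT"),
    ("CELL1", "GeneA", "AACGT"),  # PCR duplicate of the read above
    ("CELL1", "GeneA", "TTGCA"),
    ("CELL1", "GeneB", "AACGT"),
    ("CELL2", "GeneA", "GGATC"),
]
counts = umi_counts(reads)
print(counts)
```

Five reads collapse to four counted molecules because the duplicated UMI in CELL1/GeneA is counted once.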

24
New cards

Drop-outs in single cell

  • A gene can be observed at a moderate or high expression level in one cell but not detected in another.

  • Why do dropouts occur in a single cell:

    • Technical artifacts.

    • Cell type differences.

    • Statistical sampling.

    • Biological factors.

  • Zero inflation: unusually high number of zeros (undetected gene expression values).

25
New cards

What should we do about dropouts?

  • Ignore zero inflation:

    • Let downstream statistical methods do the heavy lifting.

  • Aggregate information across similar cells.

    • Clustering or pseudobulk approaches.

    • Smooth out noise and highlight consistent expression patterns.

  • Impute scRNA-seq gene count matrix before analysis.

    • Estimate missing gene expression values and reduce sparsity caused by dropouts.
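
The aggregation option can be illustrated with a pseudobulk sketch: counts from all cells in the same cluster are summed per gene. The cluster labels are assumed to be given already, and all names and numbers are invented.

```python
from collections import defaultdict


def pseudobulk(cell_counts, cell_clusters):
    """Sum per-gene counts over all cells that share a cluster label."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in cell_counts.items():
        cluster = cell_clusters[cell]
        for gene, count in gene_counts.items():
            totals[cluster][gene] += count
    return {cluster: dict(genes) for cluster, genes in totals.items()}


# GeneA drops out in cell2 but is detected in the similar cells 1 and 3.
cell_counts = {
    "cell1": {"GeneA": 3, "GeneB": 0},
    "cell2": {"GeneA": 0, "GeneB": 1},
    "cell3": {"GeneA": 2, "GeneB": 0},
    "cell4": {"GeneA": 0, "GeneB": 7},
}
cell_clusters = {"cell1": "c1", "cell2": "c1", "cell3": "c1", "cell4": "c2"}
bulk = pseudobulk(cell_counts, cell_clusters)
print(bulk)
```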

26
New cards

Why do we need imputation methods?

  • Downstream analyses rely on the accuracy of gene expression measurements.

  • Imputation methods:

    • MAGIC.

    • Droplet.

    • DrImpute.

    • scDoc.

27
New cards

MAGIC (Markov affinity-based graph imputation of cells)

  • Denoise high-dimensional scRNA-seq data.

  • Impute missing expression values by sharing information across similar cells.

  • Transform the similarity (affinity) matrix A into a Markov transition matrix M by normalizing each row to sum to 1.

  • Raise the Markov matrix to the power of t: M^t, which determines the weight each cell contributes when imputing another cell.
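
The steps above can be sketched in plain Python. This is a heavily simplified illustration of the idea, not the MAGIC implementation: it assumes a fixed-width Gaussian kernel for the affinity matrix and skips the preprocessing the real method uses. The function name `magic_sketch`, the parameter `sigma`, and the data are made up.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def magic_sketch(data, sigma=1.0, t=3):
    """Diffuse expression values between similar cells (rows of `data`)."""
    n = len(data)
    # Affinity matrix A: Gaussian kernel on the distance between cells.
    affinity = [[math.exp(-math.dist(data[i], data[j]) ** 2 / (2 * sigma ** 2))
                 for j in range(n)] for i in range(n)]
    # Markov matrix M: normalize each row of A to sum to 1.
    markov = [[value / sum(row) for value in row] for row in affinity]
    # M^t: repeated multiplication spreads information further for larger t.
    powered = markov
    for _ in range(t - 1):
        powered = matmul(powered, markov)
    # Imputed data: each cell becomes a weighted average of all cells.
    return matmul(powered, data), powered


# Rows are cells, columns are genes; cell 1 has a zero for gene 2
# although the similar cell 0 expresses it.
cells = [[5.0, 3.0], [5.0, 0.0], [0.0, 6.0]]
imputed, weights = magic_sketch(cells, sigma=2.0, t=3)
print([[round(v, 2) for v in row] for row in imputed])
```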

28
New cards

scRNA-seq data analysis

  • Dimensionality reduction.

  • Clustering and marker identification.

  • Trajectory analysis.

29
New cards

Gene expression analysis: clustering

  • Organize objects into groups based on similarity.

  • A cluster is a collection of objects which are similar to objects in the same cluster, but are dissimilar to objects in other clusters.

30
New cards

Hierarchical clustering

  • Divisive:

    • Starts with all data points in one cluster.

    • Choose the split so that data points within each of the two clusters are most similar (maximize “distance” between clusters).

    • Continue until each data point (gene) is in its own cluster.

  • Agglomerative: union between the two nearest clusters.

    • Start with each data point in its own cluster.

    • Join the two most similar clusters.

    • Continue until all data points are in one cluster.

31
New cards

How to find the two most similar clusters

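  • Single linkage: shortest distance between any two members, one from each cluster.

  • Complete linkage: longest distance between any two members, one from each cluster.

  • Average linkage: average distance over all such pairs of members.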
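
The three linkage rules differ only in how the pairwise distances between two clusters are summarized. A small sketch, assuming one-dimensional points and absolute difference as the distance (both choices are just for illustration):

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters of 1-D points under a linkage rule."""
    # All pairwise distances with one member from each cluster.
    pair_distances = [abs(a - b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pair_distances)
    if method == "complete":
        return max(pair_distances)
    if method == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


a, b = [1, 2], [5, 9]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(a, b, method))
```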

32
New cards

!! Single-linkage example
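
One possible worked example, written as a sketch: agglomerative clustering with single linkage on five made-up one-dimensional points. At every step the two clusters whose closest members are nearest get merged.

```python
def single_linkage(points):
    """Agglomerative clustering with single linkage on 1-D points.

    Returns the merge history as (distance, merged_cluster) tuples.
    """
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        merges.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


merges = single_linkage([1, 2, 6, 8, 20])
for distance, cluster in merges:
    print(f"merge at distance {distance}: {cluster}")
```

The merge order is 1 with 2 (distance 1), 6 with 8 (distance 2), the two pairs together (distance 4, between 2 and 6), and finally 20 (distance 12, between 8 and 20).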


33
New cards

!! Biclustering

  • Clustering becomes too restrictive on large datasets:

    • Seeks a global partition of genes based on their expression similarity across ALL conditions.

  • Relevant knowledge can be revealed by identifying genes with a typical pattern across a subset of the conditions: e.g., genes co-expressed under some conditions.

34
New cards

Bi-clustering patterns

  • Constant values: might be an over/under-expression of a group of genes in a subset of experiments.

  • Constant rows: a gene signature of a subset of experiments.

  • Constant columns: a set of co-expressed genes in a subset of experiments.

  • Coherent values: a common trend in a group of genes in a subset of experiments.

    • The picture holds examples.

35
New cards

!! A good bi-cluster

36
New cards

!! δ (Delta) bi-cluster

  • Find bi-clusters with mean squared residue < δ.

  • Repetitive procedure:

    • Remove the row/col that reduces H the most.

    • Add rows/cols that do not increase H.

    • Stop when H < δ.

  • Mask bi-cluster with random values.

  • Repeat to find more bi-clusters.

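  • H is the mean squared residue score of the current bi-cluster: it is 0 when the values follow a perfectly coherent row-plus-column pattern and grows as they deviate from it. δ is the user-chosen maximum H that is still accepted as a bi-cluster.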
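
To make H concrete: for a submatrix, each entry's residue is the entry minus its row mean minus its column mean plus the overall mean, and H is the mean of the squared residues. A sketch with invented matrices and an arbitrary δ chosen only for illustration:

```python
def mean_squared_residue(matrix):
    """Mean squared residue H of a bi-cluster given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum((matrix[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


# Rows differ only by a constant shift, so the pattern is perfectly coherent.
coherent = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
# Same matrix with one value disturbed.
noisy = [[1, 2, 3], [2, 9, 4], [5, 6, 7]]

delta = 0.5  # hypothetical threshold
for name, m in [("coherent", coherent), ("noisy", noisy)]:
    h = mean_squared_residue(m)
    print(name, "H =", round(h, 3), "accepted:", h < delta)
```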

37
New cards

!! Differential gene expression

38
New cards

What to compare - differential gene expression

  • TPM: correct for sequencing depth and gene length.

  • TMM (trimmed mean of M-values): correct for differences in transcript pool composition; trims extreme outliers.

    • Tool: edgeR, suitable for small sample sizes.

  • logCPM: stabilize variance; remove dependence of variance on the mean.

    • Tool: limma, suitable for moderate to large sample sizes.

  • All aim to provide better comparison across samples.

  • Other tool: DESeq2.

    • Negative binomial model with geometric size-factor normalization; shrinkage of dispersion & fold-change.

    • Broadly used.
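
The "geometric size-factor normalization" can be sketched as a median-of-ratios calculation: each sample's counts are compared with the per-gene geometric mean across samples, and the median ratio becomes that sample's size factor. This is an illustration of the idea on invented counts, not DESeq2's own code.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Median-of-ratios size factors; `count_matrix` maps sample -> counts."""
    samples = list(count_matrix)
    n_genes = len(count_matrix[samples[0]])
    # Log geometric mean per gene; genes with a zero in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        values = [count_matrix[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = [math.log(count_matrix[s][g]) - log_geo_means[g]
                      for g in range(n_genes) if log_geo_means[g] is not None]
        factors[s] = math.exp(median(log_ratios))
    return factors


# sample2 was sequenced twice as deeply as sample1.
counts = {"sample1": [10, 20, 30, 0], "sample2": [20, 40, 60, 5]}
factors = size_factors(counts)
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale.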

39
New cards

Normalization in differential gene expression analysis

  • edgeR, DESeq2, and some others want to keep the integer read counts in the differential gene expression testing because they:

    • Use a discrete statistical model (e.g., the negative binomial).

    • Want to retain statistical power.

    • But the counts are still normalized internally as part of the differential gene expression analysis.

  • limma is fine with continuous values like FPKM.

40
New cards

!! Parametric vs. non-parametric

  • Parametric methods often work better.

  • For experiments with fewer than 12 samples per group: use edgeR.

    • Use DESeq2 otherwise.

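  • Parametric methods assume the data follow a specific distribution (e.g., the negative binomial in edgeR and DESeq2); non-parametric methods (e.g., the Wilcoxon rank-sum test) make no such assumption and work on ranks instead.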

41
New cards

!! Be careful with DE analysis

  • Too many false positives.

  • Take-away message:

    • Wilcoxon rank-sum test is recommended.

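      • A non-parametric test: the expression values from both groups are ranked together, and the test asks whether one group's ranks are systematically higher than the other's.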
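
A sketch of the Wilcoxon rank-sum test for one gene, using the normal approximation for the p-value and no tie correction in the variance. This is a simplified illustration with invented expression values; a statistics library would normally be used instead.

```python
import math


def rank_sum_test(group_a, group_b):
    """Wilcoxon rank-sum test; returns (rank sum of A, z score, two-sided p)."""
    pooled = sorted((value, label)
                    for label, group in (("a", group_a), ("b", group_b))
                    for value in group)
    # Assign 1-based ranks; tied values share the average of their ranks.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    rank_sum_a = sum(r for r, (_, label) in zip(ranks, pooled) if label == "a")
    n_a, n_b = len(group_a), len(group_b)
    # Expected rank sum and its standard deviation if the groups do not differ.
    mean = n_a * (n_a + n_b + 1) / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return rank_sum_a, z, p_value


# Hypothetical expression of one gene in two conditions.
control = [1, 2, 3, 4, 5]
treated = [6, 7, 8, 9, 10]
rank_sum, z, p = rank_sum_test(control, treated)
print(rank_sum, round(z, 2), round(p, 3))
```

Only the ordering of the values matters, which is why the test does not depend on the counts following any particular distribution.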