!! Bulk, Single Cell, and Spatial RNA-Seq

0.0(0)

Studied by 0 people

0.0(0)

Call Kai

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Knowt Play

Card Sorting

1/40

There's no tags or description

Looks like no tags are added yet.

Study Analytics

Name	Mastery	Learn	Test	Matching	Spaced

No study sessions yet.

41 Terms

New cards

Why study RNA?

RNA is dynamic, capturing more information than static DNA.
RNA is the first quantitative link between DNA sequence and phenotype.
RNA capture regulatory complexity beyond gene sequence.
- The transcriptome encodes layers of control not visible in DNA through alternative splicing and noncoding RNA regulation.

New cards

Gene expression

A gene is a short DNA section containing instructions for making proteins.
- Exons, introns.
- Alternative splicing.
  - Different combinations of exons.
  - Multiple transcripts.
Gene expression level is the number of gene transcripts produced in an organism or cell.
It’s the process of using DNA information to create RNA and proteins.

<ul><li><p>A gene is a short DNA section containing instructions for making proteins.</p><ul><li><p>Exons, introns.</p></li><li><p>Alternative splicing.</p><ul><li><p>Different combinations of exons.</p></li><li><p>Multiple transcripts.</p></li></ul></li></ul></li><li><p>Gene expression level is the number of gene transcripts produced in an organism or cell.</p></li><li><p>It’s the process of using DNA information to create RNA and proteins.</p></li></ul><p></p>

New cards

Detecting gene expression

RNA-Seq and DNA microarray can be used to profile gene expression.
cDNA (complementary DNA) is used in gene expression studies to detect and quantify gene expression products.
- RNA is not stable and can be easily degraded.
cDNA libraries are gene libraries containing only genes expressed in a cell or tissue.

New cards

RNA-Seq steps

Sample preparation: Prepares the mRNA for sequencing.
- Starting material: Total RNA is isolated, explicitly focusing on the mRNA.
- Fragmentation/cDNA synthesis: The mRNA can either be fragmented directly into RNA fragments or be converted into more stable cDNA using reverse transcriptase.
- Adapters: Sequencing adapters are ligated (attached) to the ends of fragments, creating a sequencing library.
Next Generation Sequencing: The prepared library is sequenced to produce raw data.
- Sequencing: The fragments are sequenced to generate millions of short sequence reads.
Data analysis: The raw sequencing reads are processed to quantify gene expression.
- Mapping reads: The short reads are aligned against a known reference genome or transcriptome.
  - Exonic reads map entirely within an exon.
  - Junction reads span a splice junction (connecting two exons).
  - Poly(A) end reads map to the end of the transcript, confirming the 3’ end.
- Visualization (genome browser).
- De novo assembly: If a reference genome is unavailable, reads can be assembled to build transcripts.
- Quantification: he number of reads mapping to a gene or transcript is counted to determine its expression level. The final result is a Base-resolution expression profile, a graph showing the abundance of RNA (read coverage) across the transcript length.

New cards

RNA-Seq analysis steps

Quality control sequence output = FASTQ files.
Pre-processing notes: If the reads contain low-quality bases or adapter sequences, you might want to trim or filter them.
Alignment notes: map reads to reference genome/transcriptome; output = BAM files.
Quantitation notes: abundance quantification; gene, exon, or transcript levels.

<ul><li><p>Quality control sequence output = FASTQ files.</p></li><li><p>Pre-processing notes: If the reads contain low-quality bases or adapter sequences, you might want to trim or filter them.</p></li><li><p>Alignment notes: map reads to reference genome/transcriptome; output = BAM files.</p></li><li><p>Quantitation notes: abundance quantification; gene, exon, or transcript levels.</p></li></ul><p></p>

New cards

RNA read alignment

Find the position of a sequencing read on the reference genome.
- Reads can be unaligned or aligned to more than one location.
- A paired-end read will only be counted as one read if both ends align to the genome.
Given a list of transcripts, each read can be mapped to one of the four classes: exonic, partially overlaps an exon, intronic, or between genes.
- Why intronic: They are mostly unspliced pre-mRNA or functional non-coding RNAs transcribed from those genomic regions.

<ul><li><p>Find the position of a sequencing read on the reference genome.</p><ul><li><p>Reads can be unaligned or aligned to more than one location.</p></li><li><p>A paired-end read will only be counted as one read if both ends align to the genome.</p></li></ul></li><li><p>Given a list of transcripts, each read can be mapped to one of the four classes: exonic, partially overlaps an exon, intronic, or between genes.</p><ul><li><p>Why intronic: They are mostly unspliced pre-mRNA or functional non-coding RNAs transcribed from those genomic regions.</p></li></ul></li></ul><p></p>

New cards

Abundance quantification

Raw counts can be biased:
- Sequencing depth: Different total numbers of reads between samples.
  - Correction: Normalizing by total library size.
- Gene length: Longer genes have more space for reads to map, leading to higher raw counts even if the gene's concentration is the same as a shorter gene.
  - Correction: Normalizing by transcript/gene length.
- RNA composition: A few highly expressed genes can take up a significant fraction of the total reads, making other genes appear to have lower expression than they really do. This affects the comparison between samples.
  - Correction: Using specialized between-sample normalization methods.

$<ul><li>Raw counts can be biased:<ul><li>Sequencing depth: Different total numbers of reads between samples.<ul><li>Correction: Normalizing by total library size.</li></ul></li><li>Gene length: Longer genes have more space for reads to map, leading to higher raw counts even if the gene's concentration is the same as a shorter gene.<ul><li>Correction: Normalizing by transcript/gene length.</li></ul></li><li>RNA composition: A few highly expressed genes can take up a significant fraction of the total reads, making other genes appear to have lower expression than they really do. This affects the comparison between samples.<ul><li>Correction: Using specialized between-sample normalization methods.</li></ul></li></ul></li></ul>$

New cards

RNA-seq normalization methods

Within sample:
- Required to compare the expression of genes within an individual sample.
- It can adjust data for two primary technical variables: transcript length and sequencing depth.
Counts per million (CPM).
FPKM (fragments per kilobase of transcript per million fragments mapped).
Transcripts per million (TPM).

New cards

Counts per million (CPM)

The number of raw counts mapped to a transcript, scaled by the total number of sequencing reads in your sample, multiplied by a million.
It normalizes RNA-seq data for sequencing depth but not gene length.
- Not yet ready for comparison of gene expression.

New cards

FPKM (fragments per kilobase of transcript per million fragments mapped) & RPKM (reads per kilobase of transcript per million reads mapped)

FPKM = paired-end data.
RPKM = single-end data.
Correct for both library size and gene length.
Good for comparison of gene expression within a single sample, but not for across-sample comparison.

New cards

Transcripts per million (TPM)

Represent the relative number of transcripts you would detect for a gene if you had sequenced one million full-length transcripts.
Correct for both sequencing depth and transcript length.
Suitable for within-sample comparison of gene expression, but not for across-sample comparison.

New cards

RPKM summary

Step 1 = normalize for sequencing depth: RMP = (raw counts for gene / (total mapped reads / 10)) x 1,000,000.
- Divide the total mapped reads by 10 to simplify the number.
Step 2 = normalize for gene length: RPKM = RPM / gene length in kilobases.
Example:
- Gene length = 2 kb.
- Gene raw counts = 10.
- Total mapped reads = 35.
- Step 1: (10 / (35/10)) x 1,000,000 = 2.86.
- Step 2: 2.86 / 2 = 1.43.

New cards

TPM vs. RPKM

RPKM:
- Reads mapped / (gene length * total mapped reads in millions).
- Normalizes for depth first, then length.
- The total RPKM/FPKM of all genes in a sample does NOT sum to the same value across different samples.
- Only for comparing the expression between different genes within a single sample.
TPM:
- ((Reads mapped / gene length in kb) / (sum of all genes) (reads mapped / gene length in kb)) * 10⁶.
- Normalizes for length first, then scales the length-normalized values (the RPK) by the total sum of all RPKs in the sample.
- The sum of all TPM values in a sample is always 106 (one million), making samples directly comparable.
- Recommended for comparing the expression of a single gene across different samples (replicates).

<ul><li><p>RPKM:</p><ul><li><p>Reads mapped / (gene length * total mapped reads in millions).</p></li><li><p>Normalizes for depth first, then length.</p></li><li><p>The total RPKM/FPKM of all genes in a sample does NOT sum to the same value across different samples.</p></li><li><p>Only for comparing the expression between different genes within a single sample.</p></li></ul></li><li><p>TPM:</p><ul><li><p>((Reads mapped / gene length in kb) / (sum of all genes) (reads mapped / gene length in kb)) * 10<sup>6</sup>.</p></li><li><p>Normalizes for length first, then scales the length-normalized values (the RPK) by the total sum of all RPKs in the sample.</p></li><li><p>The sum of all TPM values in a sample is always <span>106</span> (one million), making samples directly comparable.</p></li><li><p>Recommended for comparing the expression of a single gene across different samples (replicates).</p></li></ul></li></ul><p></p>

New cards

Evolution of gene expression measurements

Bulk RNA-seq: Measures average gene expression across many cells.
- Cost-effective and straightforward, but masks cell-to-cell differences.
Single-cell RNA-seq: Profiles individual cells to reveal cellular heterogeneity, rare cell types, and dynamic states.
Spatial RNA-seq: Captures where genes are expressed within tissue, preserving spatial context and cell-cell interactions.

New cards

Why study single cells?

When compared to bulk RNA-seq, single-cell studies:
- Reveals cell-to-cell differences and provides an unbiased view of cellular complexity within a tissue.
- Allows for the discovery and precise identification of rare cell types and dynamic cellular states (e.g., transitional stages).
- Provides the highest resolution view of the intermediate step between genotype and phenotype (where cells are the functional constituents).

New cards

Bulk vs. scRNA-seq

Bulk RNA-seq:
- Quantifying expression signatures from ensembles.
- Insufficient for studying a heterogeneous system.
scRNA-seq:
- Inference of gene regulatory networks across the cells.
- Heterogeneity of cell responses.
- Cell type identification.

<ul><li><p>Bulk RNA-seq:</p><ul><li><p>Quantifying expression signatures from ensembles.</p></li><li><p>Insufficient for studying a heterogeneous system.</p></li></ul></li><li><p>scRNA-seq:</p><ul><li><p>Inference of gene regulatory networks across the cells.</p></li><li><p>Heterogeneity of cell responses.</p></li><li><p>Cell type identification.</p></li></ul></li></ul><p></p>

New cards

scRNA-seq analysis cell- and gene-level analysis

Cell-level:
- Marker gene identification.
- Cluster analysis.
- Trajectory analysis.
Gene-level analysis:
- Single-cell differential expression analysis.
- Gene set analysis.
- Gene regulatory networks.

New cards

scRNA-seq process

Sample preparation.
Single-cell RNA sequencing.
Data processing.
Data analysis.

New cards

scRNA-seq sample preparation

Cells are physically separated into a single-cell solution from which specific cell types can be enriched or excluded.
- Each cell is captured by one droplet.
- Each droplet contains a unique barcode and the necessary reagents for reverse transcription.
- Each individual RNA molecule captured within that droplet is tagged with its own unique molecular identifier (UMI).
  - The UMI distinguishes unique RNA molecules from duplicates that arise during PCR amplification.

New cards

scRNA-seq single-cell RNA sequencing

Extremely small amount of RNAs within a cell → hard to detect.
PCR amplification → start sequencing.

New cards

scRNA-seq data processing

UMI.
Gene counts.
Drop-outs in single cell.
Imputation method: MAGIC.

New cards

Unique molecular identified (UMI)

UMIs are short (4-10 bp) random barcodes added to transcripts during reverse-transcription.
UMIs enable sequencing reads to be assigned to individual transcript molecules and thus the removal of amplification noise and biases from scRNA-Seq data.
They reduce the amplification noise by allowing (almost) complete duplication of fragments.
Counting the number of distinct UMI sequences is easier.
This information does not get lost during the amplification process.

New cards

Gene counts

In each gene, within each cell, the total number of unique UMI is counted and reported as the number of transcripts of that gene for a given cell.

New cards

Drop-outs in single cell

A gene can be observed at a moderate or high expression level in one cell but not detected in another.
Why do dropouts occur in a single cell:
- Technical artifacts.
- Cell type differences.
- Statistical sampling.
- Biological factors.
Zero inflation: unusually high number of zeros (undetected gene expression values).

New cards

What should we do about dropouts?

Ignore zero inflation:
- Let downstream statistical methods do the heavy lifting.
Aggregate information across similar cells.
- Clustering or pseudobulk approaches.
- Smooth out noise and highlight consistent expression patterns.
Impute scRNA-seq gene count matrix before analysis.
- Estimate missing gene expression values and reduce sparsity caused by dropouts.

New cards

Why do we need imputation methods?

Downstream analyses rely on the accuracy of gene expression measurements.
Imputation methods:
- MAGIC.
- Droplet.
- DrImpute.
- scDoc.

New cards

MAGIC (Markov affinity-based graph imputation of cells)

Denoise high-dimensional scRNA-seq data.
Impute missing expression values by sharing information across similar cells.
Transform the similary of matrix A into Markov transition matrix M.
Raise the Markov matrix to the power of t: M^t, which determines the weight of cells.

<ul><li><p>Denoise high-dimensional scRNA-seq data.</p></li><li><p>Impute missing expression values by sharing information across similar cells.</p></li><li><p>Transform the similary of matrix A into Markov transition matrix M.</p></li><li><p>Raise the Markov matrix to the power of t: M<sup>t</sup>, which determines the weight of cells.</p></li></ul><p></p>

New cards

scRNA-seq data analysis

Dimensionality reduction.
Clustering and marker identification.
Trajectory analysis.

New cards

Gene expression analysis: clustering

Organize objects into groups based on similarity.
A cluster is a collection of objects which are similar to objects in the same cluster, but are dissimilar to objects in other clusters.

New cards

Hierarchial clustering

Divisive:
- Starts with all data points in one cluster.
- Choose split so that data points in the two clusters are most similar (maximize “distance” between clusters).
- Continue until all data points are in single gene clusters.
Agglomerative: union between the two nearest clusters.
- Start with each data point in its own cluster.
- Joins the two most similar clusters.
- Continues until all data points are in one cluster.

New cards

How to find the two most similar clusters

Single linkage: shortest distance.
Complete linkage: longest distance.
Average linkage: average distance.

New cards

!! Single-linkage example

ask gemini

New cards

!! Biclustering

Clustering becomes too restrictive on large datasets:
- Seeks a global partition of genes based on their expression similarity across ALL conditions.
Relevant knowledge can be revealed by identifying genes with a typical pattern across a subset of the conditions: e.g., genes co-expressed under some conditions.

New cards

Bi-clustering patterns

Constant values: might be an over/under-expression of a group of genes in a subset of experiments.
Constant rows: a gene signature of a subset of experiments.
Constant columns: a set of co-expressed genes in a subset of experiments.
Coherent values: a common trend in a group of genes in a subset of experiments.
- The picture holds examples.

<ul><li><p>Constant values: might be an over/under-expression of a group of genes in a subset of experiments.</p></li><li><p>Constant rows: a gene signature of a subset of experiments.</p></li><li><p>Constant columns: a set of co-expressed genes in a subset of experiments.</p></li><li><p>Coherent values: a common trend in a group of genes in a subset of experiments.</p><ul><li><p>The picture holds examples.</p></li></ul></li></ul><p></p>

New cards

!! A good bi-cluster

New cards

!! δ (Delta) bi-cluster

Find bi-clusters with mean squared residue < δ.
Repetitive procedure:
- Remove the row/col that reduces H the most.
- Add rows/cols that do not increase H.
- Stop when H < δ.
Mask bi-cluster with random values.
Repeat to find more bi-clusters.

WHAT IS δ AND WHAT IS H????

New cards

!! Differential gene expression

New cards

What to compare - differential gene expression

TPM: correct for sequencing depth and gene length.
- Tool: edge, suitable for small sample sizes.
TMM: correct for differences in transcript pool composition; extreme outliers.
logCPM: stabilize variance; remove dependence of variance on the mean.
- Tool: Limma, suitable for moderate to large sample sizes.
All aim to provide better comparison across samples.
Other tool: DESeq2.
- Negative binomial model with geometric size-factor normalization; shrinkage of dispersion & fold-change.
- Broadly used.

New cards

Normalization in differential gene expression analysis

EdgeR, DESeq2, and some others want to keep the integer read counts in the differential gene expression testing because they:
- Use a discrete statistical power.
- Want to retain statistical power.
- But they are simplified as part of the differential gene expression analysis.
Limma is fine with continuous values like FPKM.

New cards

!! Parametric vs. non-parametric

Parametric methods often work better.
For experiments with fewer than 12 samples per group: use edgeR.
- UseDESeq2 otherwise.

WHAT IS PARAMETRIC VS. NON-PARAMETRIC???

New cards

!! Be careful with DE analysis

Too many false positives.
Take-away message:
- Wilcoxon rank-sum test is recommended.
  - what the hell is this