RNA sequencing

Aims of the Lecture

The primary objectives of this lecture are to:

- Understand how RNA sequencing (RNA-seq) is conducted.
- Understand the benefits and challenges associated with RNA-seq.
- Become familiar with the file formats used in RNA-seq analysis.
- Understand the steps involved in analyzing RNA-seq data.

RNA Production in Bacteria

Overview of RNA Synthesis

RNA-seq measures RNA, so it helps to start with how RNA is made in bacterial cells. In bacteria, RNA is synthesized from a DNA template by the enzyme RNA polymerase.

RNA Polymerase

RNA polymerase is a central element in the transcription process, where it catalyzes the synthesis of RNA from a DNA template. The fundamental mechanism involves the following components:

DNA: The double-stranded molecule that serves as a template for transcription, consisting of genes that encode information for RNA and protein synthesis.
RNA: The single-stranded nucleic acid synthesized from the DNA template by RNA polymerase during transcription.

cDNA Synthesis

First and Second Strand Synthesis

In the context of RNA-seq, complementary DNA (cDNA) is synthesized from RNA in two main steps:

First Strand Synthesis: The initial step where RNA is reverse-transcribed into the first strand of cDNA.
Second Strand Synthesis: In this step, the second strand of cDNA is synthesized, creating a double-stranded cDNA molecule that will be used for sequencing.
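
The two steps can be sketched on sequence strings. This is a toy model, assuming all sequences are written 5' to 3' and ignoring primers, fragmentation, and enzymatic detail; the function names and the example transcript are hypothetical.

```python
# Toy model of cDNA synthesis on sequences written 5'->3'.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def first_strand_cdna(rna: str) -> str:
    """Reverse-transcribe RNA: the cDNA is the reverse complement, with T for U."""
    return reverse_complement(rna.upper().replace("U", "T"))


def second_strand_cdna(first_strand: str) -> str:
    """Synthesize the strand complementary to the first-strand cDNA."""
    return reverse_complement(first_strand)


rna = "AUGGCUUAA"  # hypothetical transcript fragment
first = first_strand_cdna(rna)
second = second_strand_cdna(first)
print(first)   # antisense to the RNA
print(second)  # same sequence as the RNA, with T in place of U
```

In this model the first strand is antisense to the transcript and the second strand carries the transcript's own sequence, which is why knowing which strand was sequenced matters later on.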

Library Preparation for RNA-seq

Stranded vs. Non-Stranded Libraries

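RNA-seq libraries can be prepared in two forms: stranded and non-stranded. Strand information matters because it identifies which DNA strand a transcript was transcribed from, which is needed to resolve overlapping genes on opposite strands and to detect antisense transcription.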

Stranded Libraries

In stranded libraries, only cDNA from the first strand synthesis is sequenced. This means that:

The sequencing reads all derive from the same cDNA strand, so they align in a consistent orientation relative to the gene.
"Stranded" data allows for the determination of which strand (sense or antisense) the transcripts originated from.

Non-Stranded Libraries

In non-stranded libraries, cDNA from both the first and second strand synthesis is sequenced. Thus:

The sequencing reads can correspond to both strands of the gene.
"Non-stranded" data does not provide information about which strand the transcripts came from, leading to potential ambiguity in interpretation.

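The difference between the two library types can be sketched as a small function. It assumes, as described for stranded libraries, that each read comes from first-strand cDNA, which is antisense to the RNA; protocols differ in read orientation, so check the library kit before relying on this rule. The function name and arguments are hypothetical.

```python
from typing import Optional


def transcript_strand(read_strand: str, stranded: bool) -> Optional[str]:
    """Infer the strand of the originating transcript from one aligned read.

    Assumes the read is sequenced from first-strand cDNA, which is
    antisense to the RNA, so the transcript lies on the opposite strand.
    Returns None for non-stranded libraries, where it cannot be inferred.
    """
    if read_strand not in ("+", "-"):
        raise ValueError("read_strand must be '+' or '-'")
    if not stranded:
        return None
    return "-" if read_strand == "+" else "+"


print(transcript_strand("+", stranded=True))
print(transcript_strand("+", stranded=False))
```
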
Illumina RNA-seq Library Preparation

The Illumina platform is widely used for RNA-seq library preparation. The following steps are involved:

Adapter Addition:
- Two adapters are ligated to the ends of the cDNA fragments being prepared for sequencing. These adapters are required for the subsequent amplification and sequencing steps.
- Adapter 1: Specific to the first cDNA strand.
- Adapter 2: Specific to the second cDNA strand.
Tagmentation:
- This process uses a transposase to fragment the DNA and attach adapter sequences in a single reaction, allowing for efficient library preparation.
Flow Cell Attachment:
- After tagmentation, the cDNA fragments are attached to the surface of a flow cell. Each fragment adheres to a distinct location on the flow cell surface, which allows for parallel sequencing.
Bridge Amplification:
- Once attached to the flow cell, bridge amplification occurs, resulting in the formation of clusters. Each cluster consists of identical DNA fragments amplified from a single cDNA molecule.
Clonal Amplification:
- Because every copy in a cluster derives from the same original molecule, each cluster represents a single sequence and produces a signal strong enough for accurate sequencing.
Sequencing Process:
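- Sequencing proceeds in cycles. In each cycle, one fluorescently labeled base is incorporated at every cluster and imaged; the fluorescent signal identifies which nucleotide was added, and the series of calls across cycles gives the sequence of the fragment, from which the original RNA sequence is inferred.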
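
The idea behind the sequencing step can be illustrated with a deliberately simplified base caller. It assumes one intensity value per base per cycle and calls the brightest one; real instruments use more elaborate signal processing, and the intensities and function name below are hypothetical.

```python
from typing import Dict, List


def call_bases(cycles: List[Dict[str, float]]) -> str:
    """Call one base per cycle by picking the brightest channel."""
    return "".join(max(signal, key=signal.get) for signal in cycles)


# Hypothetical intensities for one cluster over three cycles.
cycles = [
    {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.1},
    {"A": 0.1, "C": 0.2, "G": 0.8, "T": 0.0},
    {"A": 0.0, "C": 0.1, "G": 0.1, "T": 0.7},
]
print(call_bases(cycles))
```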

File Formats for RNA-seq Analysis

Understanding the different file formats utilized in RNA-seq analysis is crucial for proper data handling and interpretation.

FASTA Format

The FASTA format is commonly used to store sequences of genes and genomes. It typically includes:
- Header line: Starts with a “>” followed by the name of the sequence.
- Sequence lines: The actual sequence of nucleotides.

Example:

>name_of_sequence
ACATGACGTTACTATCGCTCGTCAGTACGTACGTAGCTGATAAACATACGACTGACACTGACTGACTGTACGTGAGACTGATGATCGACTGACTGACTGGGGGGCCATGGAGGGATACGAATCAC
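
A FASTA file can be read with a few lines of code. The sketch below is a minimal parser, assuming the first word after “>” is the sequence name and that a sequence may be wrapped over several lines; the function name and the shortened example record are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before any header line")
        else:
            records[name].append(line)
    # Join wrapped sequence lines into one string per record.
    return {n: "".join(parts) for n, parts in records.items()}


example = ">name_of_sequence\nACATGACG\nTTACTATC\n"
sequences = parse_fasta(example.splitlines())
print(sequences)
```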

FASTQ Format

The FASTQ format stores the raw reads produced by a sequencing run, including RNA-seq, together with a quality score for each base. Each read is described by four lines:
- A header line beginning with “@” followed by a unique read identifier.
- A sequence line containing the nucleotides of the read.
- A separator line beginning with “+”, optionally followed by the same identifier.
- A quality line with one ASCII character per base, encoding the Phred quality score (the reliability of that base call).

Example:

@ERR385913.1 UB-NGS-01:179:C1RAFACXX:2:1101:1290:2166/1
ATTCTCAGGAGAACCCCGCCGACCCGGCGGCGTGTTTGCCGTTGTTCCGTG
+
BCCFFFFDHHHFHJJJJHJIIGJJHHIJIHDD@BDDDDDDDDDDDDDDDCD
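
The quality line can be decoded into numbers. The sketch below assumes four-line records and the Phred+33 encoding (quality = ASCII code minus 33), which is an assumption about how the file was written; the function names and the short example record are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, Phred scores) for each four-line record."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumed Phred+33: subtract 33 from each character's ASCII code.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def error_probability(q: int) -> float:
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


example = "@read1\nACGT\n+\nII5!\n"  # hypothetical record
reads = list(parse_fastq(example.splitlines()))
print(reads)
print(error_probability(20))
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.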

SAM Format

The SAM (Sequence Alignment/Map) format describes how individual reads from the FASTQ file align to the reference genome. Each alignment line has 11 mandatory tab-separated fields; the key ones are:
- QNAME: Name of the sequence read.
- FLAG: Bitwise flag encoding the read's properties (e.g., whether it is unmapped, paired, or aligned to the reverse strand).
- RNAME: Name of the reference sequence to which the read aligns.
- POS: The 1-based leftmost position of the alignment on the reference.
- MAPQ: Mapping quality score indicating confidence in the alignment.
- CIGAR: String summarizing the alignment as a series of operations (e.g., matches, insertions, deletions).
- SEQ: Actual sequence of the read.
- QUAL: Per-base quality scores of the read, encoded as in FASTQ.
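
The fields above can be extracted from one alignment line as follows. This is a minimal sketch that ignores header lines and optional tags; the function names and the example line are hypothetical, while the field order and flag bit values follow the SAM specification.

```python
import re
from typing import Dict, List, Tuple, Union

# A subset of the FLAG bits defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
}


def parse_sam_line(line: str) -> Dict[str, Union[str, int]]:
    """Extract the key mandatory fields from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return {
        "QNAME": fields[0],
        "FLAG": int(fields[1]),
        "RNAME": fields[2],
        "POS": int(fields[3]),
        "MAPQ": int(fields[4]),
        "CIGAR": fields[5],
        "SEQ": fields[9],
        "QUAL": fields[10],
    }


def decode_flag(flag: int) -> List[str]:
    """List the names of the properties whose bits are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Hypothetical alignment line: a read on the reverse strand of "chr1".
line = "read1\t16\tchr1\t100\t60\t3M1I2M\t*\t0\t0\tACGTAC\tIIIIII"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
print(parse_cigar(record["CIGAR"]))
```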

BAM Format

The BAM (Binary Alignment/Map) format is a compressed binary version of the SAM format. It contains the same alignment and quality information as the corresponding SAM file but takes up far less space, and once sorted and indexed it allows fast retrieval of the reads aligned to a given region, which makes it better suited to storing and processing large-scale sequencing data.

Conclusion

Each step outlined above, from cDNA synthesis and library preparation to sequencing and read alignment, shapes the data that RNA-seq produces. Understanding these principles, their challenges, and the associated file formats is essential for researchers using RNA-seq for gene expression analysis and other genomic studies.