W1 L3: Sequencing genes and genomes

Sequencing of Biomolecules:

DNA:

  • Work out the order of the four bases (A, C, G, and T) in fragments of DNA, usually amplified by PCR or DNA cloning

  • Possible since 1977 - Increasingly sophisticated and increasingly affordable

RNA:

  • In principle and in practice, RNA is sequenced indirectly through the sequencing of cDNA

Protein:

  • Possible since 1950 (insulin)

  • Classical protein sequencing is time-consuming and requires relatively large amounts of protein → increasingly replaced by modern methods

DNA sequencing

Single-read sequencing

  • Maxam-Gilbert method

    • Based on chemical degradation

    • NOW OBSOLETE

  • Sanger method

    • Based on primer extension chain termination

    • Dominating method until NGS – still used

  • Massively parallel sequencing = Next-generation sequencing (NGS)

    • Based on primer extension

    • Fully automated

    • A revolution in progress

Sanger Sequencing

- DNA sequencing with chain-terminating inhibitors

- dideoxy sequencing (ddNTPs)

  • Used for sequencing single genes and fragments of DNA

    • Understanding the structure of the fragment

    • Mutation screening in specific genes

    • Validations of findings from NGS

  • Highly accurate

Pre-sequencing

  1. Need to produce multiple copies of the DNA fragment to be sequenced

    • Clone the DNA fragment into a plasmid and grow in E. coli or

    • Amplify the DNA fragment by PCR

  2. Denature the sequence by heating (or by adding NaOH) to produce single-stranded DNA

  3. Prepare a DNA polymerase, primer, dNTPs and ddNTPs

Primer Extension

  • DNA synthesis does not start from scratch

  • Primer sequence needed to anneal to template strand

  • Synthesis of new strand - bases complementary to the template are added to the primer → extending the primer

  • DNA polymerase synthesises complementary strand

    • Starting from the primer

    • Forms a complementary copy to the template strand
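Primer extension can be sketched in a few lines of Python. This is a minimal illustration only: the function name `extend_primer` and the example sequences are made up for these notes, and the template is assumed to be written 5'→3' with the primer annealing at its 3' end.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def extend_primer(template: str, primer: str) -> str:
    """Return the new strand (5'->3') made by extending `primer` along `template`."""
    # The new strand is antiparallel to the template, so read the template 3'->5'.
    template_3_to_5 = template[::-1]
    # The primer must be complementary to the 3' end of the template to anneal.
    site = template_3_to_5[:len(primer)]
    if any(COMPLEMENT[t] != p for t, p in zip(site, primer)):
        raise ValueError("primer does not anneal to the 3' end of the template")
    # The polymerase adds one complementary base at a time to the primer's 3' end.
    new_strand = primer
    for base in template_3_to_5[len(primer):]:
        new_strand += COMPLEMENT[base]
    return new_strand

print(extend_primer("GATTACAGGC", "GCC"))  # GCCTGTAATC
```

The result is the reverse complement of the template, with the primer forming its first bases.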

Termination of DNA synthesis

Dideoxynucleoside triphosphates (ddNTPs) - terminator nucleotides

  • Modified version of the normal DNA building blocks (dNTPs)

  • Uses the same bases (A/C/G/T), but the sugar is modified

  • Wherever the ddNTP has been incorporated, DNA synthesis can proceed no further

  • The lack of a 3’ OH group in the dideoxynucleotide prevents the formation of a phosphodiester bond with the next nucleotide

  • Multiple strands due to DNA amplification

  • Excess of normal dNTPs over ddNTPs (100:1) - the two compete for incorporation during DNA synthesis

  • Termination happens at different places on different strands → result is a set of DNA fragments of varying length, each ending in a ddNTP
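The competition between dNTPs and ddNTPs can be simulated roughly as follows. This is a sketch under stated assumptions: the function `chain_termination` is hypothetical, and it assumes ddNTPs and dNTPs are incorporated equally efficiently, so that at 100:1 each added base is a terminator with probability 1/101.

```python
import random

def chain_termination(full_product, n_strands=10_000, ratio=100, seed=0):
    """Simulate many strands copying the same template; return (length, terminator) pairs.

    `full_product` is the strand that would be made if no ddNTP were present.
    """
    rng = random.Random(seed)
    p_terminate = 1 / (ratio + 1)  # assumption: equal incorporation efficiency
    fragments = []
    for _ in range(n_strands):
        for length, base in enumerate(full_product, start=1):
            # Each added base is either a normal dNTP or, rarely, a ddNTP.
            if rng.random() < p_terminate:
                fragments.append((length, "dd" + base))
                break  # no 3' OH, so this strand cannot be extended further
    return fragments

fragments = chain_termination("GCCTGTAATC")
print(sorted(set(fragments)))
```

With enough strands, every possible fragment length is present, and each fragment ends in the ddNTP matching the base at that position.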

Sanger sequencing

  • Primer, DNA polymerase & mix of normal (unlabelled) dNTPs & labeled ddNTPs

  • Labels used to be radioactive - fluorescent dyes are used now

  • Correct order reached through gel electrophoresis & fluorescence detection

  • In genotyping, where only a single SNP is genotyped, the process is similar, but no normal (unlabelled) dNTPs are added

Separating nucleic acids according to size: Slab gel electrophoresis

  • Nucleic acids carry many -ve charged phosphate gps

  • Migrate towards +ve electrode when placed in electric field

  • Porous gel acts as sieve → small mols. pass more easily than larger ones
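Because the gel separates fragments by size, a sequence can be read by ordering chain-terminated fragments from shortest to longest and noting the labelled terminal base of each. A minimal sketch, with a hypothetical `read_sequence` helper and made-up fragment data:

```python
def read_sequence(fragments):
    """fragments: iterable of (length, terminal_base) pairs, in any order.

    Returns the bases ordered by fragment length, i.e. the order in which
    the gel resolves them (shortest fragments migrate furthest).
    """
    calls = {}
    for length, base in fragments:
        calls[length] = base  # many copies of each fragment give the same call
    return "".join(calls[length] for length in sorted(calls))

# Unordered mixture, as it would come out of the sequencing reaction.
mixture = [(3, "C"), (1, "G"), (2, "A"), (1, "G"), (4, "T")]
print(read_sequence(mixture))  # GACT
```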

Advances in Sanger sequencing technology

1977-1985

  • Radioactive labelling, requiring 4 separate reaction tubes (one per ddNTP) - separated individually on large (30cm x 50cm) slab electrophoresis gels → dried & exposed to X-ray film

  • X-ray film, heavy X-ray film cassettes, darkroom film manipulation

  • Manual reading of autoradiogram & feeding into computer (re-reading often required)

  • 1.5 days from set up to results

1986 onwards

  • Fluorescent dye labelling - using mixed reaction tube containing all 4 ddNTPs

  • Automated, optical detection system using scanning laser

  • Direct automated entry of DNA base sequence data into computer

  • Time-consuming due to handling of gel plates

Capillary gel electrophoresis (1995 onwards)

  • Slab gel electrophoresis is manual (slow, prone to human error) - capillary gel electrophoresis is automated

  • Fluorescently-labelled DNA samples migrate through long, v. thin tubes containing polyacrylamide gel (instead of running gel for a finite time)

  • Machine uses laser to detect fluorescence at a fixed point just before end of gel

  • Allows longer reads (up to 1000 bases)

    • separation not stopped at specific point

    • each fragment is allowed to proceed to bottom of gel where resolution is highest

  • Faster & cheaper

Pros & cons of Sanger sequencing

Pros

  • Highly accurate sequences ~99.95%

  • Several 100 bases long (800-1000bp)

Cons

  • Gel electrophoresis not suitable for handling large no. of samples at a time - not fully automated

    • not suitable for genome sequencing
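To put the ~99.95% accuracy figure in context, it can be converted into an expected number of errors per read and into a Phred-style quality score (Q = -10·log10 of the error probability). The Phred scale is a common convention rather than something covered in these notes, and the function names below are made up for illustration.

```python
import math

def expected_errors(accuracy, read_length):
    """Expected number of wrong bases in a read of the given length."""
    return (1.0 - accuracy) * read_length

def phred_quality(accuracy):
    """Phred-style quality score: Q = -10 * log10(per-base error probability)."""
    return -10.0 * math.log10(1.0 - accuracy)

print(expected_errors(0.9995, 1000))      # about 0.5 errors in a 1000 bp read
print(round(phred_quality(0.9995), 1))    # about Q33
```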


1st sequenced human genomes

  • Human Genome Project 2001 → draft genome comprising DNA from several volunteers

    • BAC - based sequencing (bacterial artificial chromosome)

  • Celera company 2001 → genome of J. Craig Venter

    • Expressed Sequence Tags (ESTs)

  • Highly time-consuming & extremely expensive

    • HGP→ took 10 yrs & cost $2.7 billion (In 2022, costs ~$1000)

Next-generation sequencing (NGS)

  • Also known as massively-parallel sequencing - sequencing millions of DNA fragments simultaneously

  • From 2005 onwards a tech revolution

  • Vast ↑ in amt. of sequencing data per run (seq. throughput) → dramatic ↓in cost

  • Moving from sequencing single gene & exons to:

    • Whole-genome sequences (WGS)

    • Whole-exome sequences (targeted DNA sequencing) (WES)

    • Whole-transcriptomes (RNAseq) & targeted transcriptomes

    • Methyl-Seq, ChIP-seq

    • Ribo-seq

Key Terminology:

  • Throughput→ amt. of seq. data (in Mb) that are processed in 1 run

  • Read length→ length of sequenced DNA fragments - measured in nucleotides

    • short read lengths can be processed w/ high throughput

  • Read depth→ seq. coverage i.e. how many times each seq. is represented

    • 30x to 50x for WGS

    • 100x for WES

    • important for small read lengths for genome assembly

  • Genome assembly→ aligning & merging sequenced pieces to make sense of seq. genome

    • de novo - 1st time sequencing a species or,

    • alignment against a previously obtained reference sequence
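Read depth, read length and throughput are linked by a simple relation: average depth ≈ (number of reads × read length) / genome size. A small sketch follows. The function names are hypothetical, and the human genome size is rounded to 3 Gb and the read length set to 150 nt purely as example values.

```python
import math

def read_depth(n_reads, read_length, genome_size):
    """Average coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size

def reads_needed(target_depth, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)

GENOME = 3_000_000_000  # assumption: human genome rounded to 3 Gb

n = reads_needed(30, 150, GENOME)
print(n)                           # 600000000 reads for 30x WGS
print(n * 150 / 1_000_000)         # throughput needed, in Mb: 90000.0
print(read_depth(n, 150, GENOME))  # 30.0
```

This is an average only: real coverage varies along the genome, which is one reason why 30x to 50x is targeted rather than a bare minimum.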

Methods based on amplified DNA templates

  • 2nd-gen DNA seq.

  • From short (35 nucleotides) to medium-length seq. (up to 800 nucleotides)

  • High to v. high seq. throughput

  • Rel. high rate of seq. errors in individual reads (overcome by ↑ coverage)
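Why higher coverage overcomes errors in individual reads: if errors are random, they rarely hit the same position in most reads, so a majority vote across overlapping reads recovers the true base. A toy sketch, assuming the reads are already aligned and of equal length (the `consensus` function and the reads are illustrative only):

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote base at each position of equal-length, pre-aligned reads."""
    columns = zip(*aligned_reads)  # one tuple of bases per position
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)

reads = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",  # error at position 3
    "ACGTAC",
    "AGGTAC",  # error at position 2
]
print(consensus(reads))  # ACGTAC
```

With only one or two reads per position a single error cannot be outvoted, which is why low coverage gives unreliable base calls.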

Commonly used 2nd-gen sequencing platforms

  • Roche/454 pyrosequencing

    • 1st NGS tech in market (2005)- discontinued support in 2016

    • Rel. long reads & speedy but low throughput

    • Emulsion PCR

  • ABI SOLiD technique

    • 2007

    • checks each base independently twice → low error rate

    • “Wildfire” method for preparing seq. templates (v. similar to Bridge PCR by Illumina)

  • Illumina/Solexa sequencing

    • 2008 - now market leader

    • Bridge PCR

  • Ion Torrent systems

    • 2010

    • Emulsion PCR

    • Rel. long reads & speedy but low throughput (similar to Roche/454)

Sequencing w/ emulsion PCR

  • Uses bead surfaces, H2O & oil

  • Simultaneous amplification of each seq. w/o risk of contamination

  • Each bead acts as a microreactor for PCR - each containing 1 strand of DNA

  • Terminators are reversible - after chem deprotection, synthesis is possible

Sequencing w/ bridge amplification

Whole Process

  • Genomic DNA→ Cut DNA→ Add linkers

Bridge amplification (𝘪𝘯 𝘴𝘪𝘵𝘶 PCR in figure above)

Methods based on unamplified DNA

  • 3rd-gen DNA sequencing - ‘single-molecule sequencing’ (SMS) → long-read sequencing

  • Long seq. (1000s of nucleotides) -modest throughput

    • important for assembly of genomes from newly seq. species & for distinguishing large-scale variations e.g. copy number variations in known ones

  • Release of protons (instead of fluorophore or phosphate) - recorded as electric current

  • Avoids problems related to DNA amplification (underrepresentation or overrepresentation of DNA after amplification)

  • Simple & cheap tech (small portable machines)

  • Higher error rates

GTEx portal

  • Genotype-Tissue Expression (GTEx) project - build a comprehensive public resource to study tissue-specific gene expression & regulation

  • 54 non-diseased tissue sites across 1000 individuals - for molecular assays inc. WGS, WES & RNA-Seq

  • Remaining samples available from GTEx Biobank

  • GTEx Portal provides open access to data inc. gene expression, QTLs & histology images

Single-cell sequencing

  • Possible to perform sequencing of genome, transcriptome & epigenome in a single cell

  • Traditional sequencing works on cell pop.→ resulting data are aggregate values

  • For understanding of cell-to-cell variation & identifying new cell types

  • Catalog of human cell types

    • stable cell properties

    • transient cell features

    • cell positions

    • lineage relationships (identification of novel stem cells)

  • Widely employed in cancer research

Analysis of NGS data

  • Output files consist of millions of short (~100 bp) reads (2nd gen) or longer reads (3rd gen)

  • Reads are mapped to reference genome (if species known - as opposed to 𝘥𝘦 𝘯𝘰𝘷𝘰 sequencing)

  • Variants are identified

  • Heavily computational project requiring several software tools & computational skills
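The map-then-call-variants steps can be illustrated with a toy example. Real pipelines use dedicated aligners and variant callers with indexed references. The brute-force functions below (`map_read`, `call_variants`) are hypothetical stand-ins that only handle substitutions, and the sequences are invented.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(reference, read, max_mismatches=1):
    """Best 0-based position of `read` on `reference`, or None if too different."""
    best_position, best_distance = None, max_mismatches + 1
    for position in range(len(reference) - len(read) + 1):
        distance = hamming(reference[position:position + len(read)], read)
        if distance < best_distance:
            best_position, best_distance = position, distance
    return best_position

def call_variants(reference, reads, min_support=2):
    """Return {position: alternative base} seen in at least `min_support` reads."""
    support = Counter()
    for read in reads:
        position = map_read(reference, read)
        if position is None:
            continue  # unmapped read
        for offset, base in enumerate(read):
            if base != reference[position + offset]:
                support[(position + offset, base)] += 1
    return {pos: base for (pos, base), n in support.items() if n >= min_support}

reference = "ACGTTGCAAGCTTAGC"
reads = [
    "GTTGTAAG",  # carries C>T at reference position 6
    "TGTAAGCT",  # carries the same C>T
    "AGCTTAGC",  # matches the reference exactly
    "AGCTAAGC",  # single mismatch seen once: treated as a sequencing error
]
print(call_variants(reference, reads))  # {6: 'T'}
```

Requiring support from several reads is the same coverage idea again: a mismatch seen once is more likely an error than a true variant.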

Sanger sequencing vs. NGS

  • Sanger sequencing

    • cheap, fast, simple

    • highly accurate

    • gives an answer about a single, specific question

  • NGS

    • getting cheaper w/ more tech

    • more error-prone

    • highly versatile (WGS, WES, RNA-Seq, methyl-seq…)

    • computationally laborious
