!! DNA Sequencing


39 Terms

1

Human genome project

  • 1990-2003.

  • Goal:

    • It was a global scientific research project whose primary goal was to generate the first sequence of the human genome.

      • To determine the base pairs that make up human DNA and identify, map, and sequence all of the genes of the human genome from both a physical and functional standpoint.

  • Finding:

    • Sequenced 92% of the human genome with fewer than 400 gaps.

    • About 23,000 genes.

    • Significantly more segmental duplication than had previously been suspected.

  • Major impact:

    • Personalized health care.

    • Cancer research.

  • Cost:

    • Budgeted at about $3 billion over 15 years; the project finished early, in 13 years (1990-2003).

2

Sequencing pipeline

knowt flashcard image
3

Step 1: sample preparation

  • Shotgun sequencing:

    • Randomly fragment the entire DNA sample into millions of small, overlapping pieces.

    • Random reads.
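
A minimal sketch of random fragmentation, assuming the genome is a plain string and every read has the same length. The function name `shotgun_reads` and the toy genome are made up for illustration.

```python
import random

def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample n_reads substrings of length read_len from random start positions."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, len(genome) - read_len)  # bounds are inclusive
        reads.append(genome[start:start + read_len])
    return reads

genome = "ATGCGTACGTTAGCCGATAGGCTA"  # toy genome, not real data
reads = shotgun_reads(genome, n_reads=10, read_len=8)
print(reads)
```

Because the start positions are random, the reads overlap each other, which is what later makes assembly possible.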

4

Step 2: sequencing

  • First generation: Sanger sequencing.

  • Second generation (next generation): Illumina.

  • Third generation: Oxford nanopore.

5

Sanger sequencing

  • A chain termination method that reads DNA sequence one fragment at a time.

    • Low throughput.

    • High cost per base, so it is not cost-effective for large projects.

6

Sanger sequencing process

  • Sample Preparation: A single-stranded copy of the DNA segment to be sequenced is prepared.

  • The Sequencing Reaction: The reaction mixture contains DNA polymerase, a primer, standard deoxynucleotides (A, T, G, C), and a small amount of chain-terminating dideoxynucleotides (ddNTPs). The ddNTPs are key because they stop DNA synthesis when incorporated into a growing DNA strand.

    • A binds to T.

    • G binds to C.

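    • The number of terminated fragments for a given ddNTP depends on how many times its base occurs in the newly synthesized strand.

      • For example, if the template contains three Ts, the new strand contains three As, so the ddATP reaction yields three fragments of different lengths, each ending at one of those As.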

  • Creating Fragments: The reaction continues until all strands of DNA have undergone this reaction. Because the ddNTPs are in a low concentration, DNA polymerase will make copies of the DNA segment that terminate at every possible position where a specific nucleotide (A, T, C, or G) is found.

  • Separating the Fragments: The resulting fragments are then separated by size using electrophoresis. The longer DNA fragments move more slowly than the shorter ones.

    • Long = slow.

    • Short = fast.

  • Reading the Sequence: The sequence is then read directly from the gel, starting with the shortest fragments and ending with the longest. This process determines the order of the nucleotides and thus the DNA sequence.
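
The chain-termination idea can be sketched in a few lines, assuming an idealized reaction in which a terminated fragment is produced at every position. The helper names and the toy strand are hypothetical.

```python
def sanger_fragments(new_strand):
    """Return (length, terminal_base) for every chain-terminated fragment."""
    fragments = []
    for dd_base in "ACGT":                  # one ddNTP type at a time
        for i, base in enumerate(new_strand):
            if base == dd_base:             # ddNTP incorporated here, synthesis stops
                fragments.append((i + 1, dd_base))
    return fragments

def read_gel(fragments):
    """Shorter fragments migrate faster; read terminal bases from short to long."""
    return "".join(base for _, base in sorted(fragments))

new_strand = "TAGCTTA"                      # toy newly synthesized strand
frags = sanger_fragments(new_strand)
print(read_gel(frags))                      # TAGCTTA
```

The toy strand has three Ts, so exactly three fragments end in T, and sorting all fragments by length recovers the original order of bases.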

7

Illumina sequencing pipeline

knowt flashcard image
8

Step 1 - Illumina sequencing

  • Step 1 = library preparation.

  • Adapters attach to DNA fragments to provide priming sites for cluster amplification and sequencing on the flow cell.

    • Adapters are a pair of annealed oligonucleotides that facilitate clonal amplification and sequencing reactions. Identical duplex adapters are ligated to both ends of the library fragments so that oligos on the flow cell can recognize them for sequencing.

    • Adapter ligation is a method of attaching synthetic oligonucleotides with known sequences to the ends of the DNA fragments.

  • It can also incorporate index barcodes that allow multiplexing of multiple samples in a single run.

  • On the 8-lane flow cell:

    • Each spot has an attached oligo.

      • Oligo = oligonucleotide = single-stranded fragment of DNA or RNA.

    • DNA fragments need to bind these oligos to attach to the surface.

    • Solution: add adapters to the fragments.

  • PCR (polymerase chain reaction) is a molecular biology technique used to make many copies (amplify) of small sections of DNA or a gene.

  • Using PCR, we can generate thousands to millions of copies of a particular section of DNA from a minimal amount of starting material, so that there is enough to detect.

    • Nowadays, Illumina also provides a PCR-free library preparation option.

  • Indexes, also known as barcodes, are used after sequencing to identify bioinformatically which read came from which sample.
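
As a rough illustration of how indexes sort reads back into samples (demultiplexing), here is a sketch that assumes each read arrives as an (index, sequence) pair and that indexes must match exactly. The barcodes and sample names are invented.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample):
    """Group (index, sequence) pairs by sample; unknown indexes go to 'undetermined'."""
    by_sample = defaultdict(list)
    for index, seq in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(seq)
    return dict(by_sample)

index_to_sample = {"ACGT": "sample_A", "TTAG": "sample_B"}  # hypothetical barcodes
reads = [("ACGT", "GGCATC"), ("TTAG", "CCATGA"), ("ACGT", "TTGACA"), ("GGGG", "AAAAAA")]
groups = demultiplex(reads, index_to_sample)
print(groups)
```

Real demultiplexing tools usually tolerate a mismatch or two in the index; exact matching is used here only to keep the sketch short.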

9

Step 2 - Illumina sequencing

  • Step 2 = cluster generation (bridge amplification).

  • How it works:

    • DNA fragments are prepared with adapters that attach to oligos (short DNA molecules) on the flow cell surface.

    • The fragments bind to these oligos on the flow cell to create a “bridge.”

    • DNA polymerase copies the strand, and the new strand attaches to nearby oligos; repeated cycles create clusters of identical DNA molecules.

10

Step 3 - Illumina sequencing

  • Step 3 = sequencing through imaging.

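  • Illumina uses sequencing by synthesis: in each cycle, fluorescently labeled, reversibly terminated nucleotides are added, and exactly one base is incorporated into every growing strand.

  • After each incorporation the flow cell is imaged, then the label and terminator are removed so the next cycle can begin.

  • Each base has a unique color that the camera records, so the series of colors seen at a cluster gives the sequence of its read.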

11

Short reads

  • DNA fragments are typically longer than what can be fully sequenced.

  • Reads are short DNA sequences (A, T, G, C) obtained from individual DNA fragments.

  • Illumina sequencing reads are typically between 50 and 300 base pairs (bp).

    • Depending on the sequencing platform and the number of cycles selected.

  • Single read:

    • Sequence a DNA fragment from only one end.

    • Generate one continuous stretch of bases (typically 50 to 300 bp) per fragment.

12

Paired-end read

  • Random DNA fragment with an approximately known size.

  • Both ends of the fragments are sequenced.

    • Paired-end reads come from the two ends of the same DNA molecule: one end is sequenced, then the fragment is read from the opposite end. The two sequences are called paired-end reads (sometimes loosely “mate pairs,” although that term strictly refers to a separate long-insert library protocol).
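
A small sketch of how the two reads relate to the fragment, assuming the second read is reported as the reverse complement of the fragment's far end (the usual convention for forward/reverse read pairs). The function names and the toy fragment are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_len):
    """Read 1 starts at one end of the fragment; read 2 starts at the other end, on the opposite strand."""
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment[-read_len:])
    return read1, read2

fragment = "ATGCGTACGTTAGCCGATAG"   # toy 20 bp fragment
r1, r2 = paired_end_reads(fragment, read_len=5)
print(r1, r2)                        # ATGCG CTATC
```

The middle of the fragment is never sequenced here, but because the fragment size is approximately known, the two reads are known to lie about that far apart.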

13

Limitations of short reads

  • Illumina platforms sequence short reads (up to 300 bp).

    • Limited ability to identify structural variations.

    • Short length makes it difficult to span large insertions, deletions, inversions, or repetitive regions, leading to ambiguous or incomplete mapping.

  • As they are short, it may not be possible to map reads to the specific region of the reference genome they came from. Complex genomic regions, structural variations, and large stretches of repetitive sequences can push short read sequencing methods to their limits, leading to gaps and ambiguities in the assembled sequences.
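
The mapping ambiguity can be seen with a toy reference that contains a repeated motif. This sketch assumes exact string matching only; real aligners allow mismatches and gaps.

```python
def find_all(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions

reference = "GGATTACATTACATTACACC"   # toy reference with a repeated ATTACA motif
short_read = "ATTACA"
long_read = "GGATTACATT"

print(find_all(reference, short_read))  # [2, 7, 12]
print(find_all(reference, long_read))   # [0]
```

The short read fits three places equally well, while the longer read extends into unique flanking sequence and maps to a single position.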

14

Long reads vs. short reads

  • Short reads = Illumina (second-generation).

  • Long reads = Oxford Nanopore and Pacific Biosciences (third-generation).

15

!! Pacific Biosciences

  • Pacific Biosciences (PacBio): a company whose instruments produce long reads.

  • Long reads: on average, a few kb.

    • kb = kilobase = 1,000 bases.

  • Rely on the signal from a single molecule.

  • High error rates in raw reads: uniformly random errors, up to 15%.

  • Does single-molecule, real-time sequencing (SMRT).

16

SMRT long-read sequencing

  • SMRT detects fluorescence events corresponding to adding one specific nucleotide by a polymerase tethered to the bottom of a tiny well.

  • Every well has a polymerase molecule attached inside.

    • Polymerase helps fill in nucleotides on a single-stranded piece of DNA.

  • Each time the polymerase adds a base, a camera takes a picture.

17

Oxford Nanopore technology

  • Each nucleotide base is a different size and has different electrical properties.

  • The wells of the machine measure the changes in electrical current that occur when single-stranded DNA passes through tiny nanopores on the surface.

  • Each base has its own electrical signature that the machine measures and records.

  • Picture: output.

18

Genome assembly

  • Alignment and merging of reads to determine their original order and form a continuous representation of chromosomes.

    • De novo assembly: assemble from scratch without reference.

    • Reference-based assembly: map reads directly to an already assembled reference sequence.

19

!! Reference-based assembly

  • Reference genome: a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species.

  • The reference genome provides a consensus sequence (coordinates) to which individuals’ data can be compared.

  • Reference genome examples:

    • hg19 (GRCh37.xx).

    • hg38 (GRCh38.xx).

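    • Which one to choose: use the build that matches the annotations and datasets you need to compare against. hg38 (GRCh38) is the newer assembly and is generally preferred for new analyses; hg19 (GRCh37) is still common in older datasets. Coordinates differ between builds, so never mix them in one analysis.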

20

De novo assembly

  • Without an established reference genome.

  • Piecing together short reads into a complete genome by finding overlaps between them.

  • Like piecing together a puzzle without a picture.

  • Contigs: longer, continuous sequences formed by merging overlapping reads.

    • Gaps between contigs arise from technical limitations and biological complexities.

    • These prevent the assembler from joining all reads into one continuous sequence.

  • Scaffolds: contigs ordered and oriented using additional information (such as paired-end reads) to provide a higher-order structure.

    • Composed of one or more contigs, separated by gaps of unknown sequence.

  • Technical limitations:

    • Short reads may not span repetitive regions or complex genomic structures, making it hard for the assembler to order and orient them correctly.

    • Some regions of the genome are not well covered by reads because of sequencing bias (e.g., GC content bias), random sampling variability, or poor-quality DNA.
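
To make "finding overlaps" concrete, here is a toy greedy assembler. It is only a sketch under strong assumptions (error-free reads, all on the same strand, exact suffix-prefix overlaps); the function names and reads are invented, and real assemblers are far more sophisticated.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return seqs  # sequences that cannot be merged stay as separate contigs
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]

reads = ["ATGCGT", "GCGTAC", "TACGGA", "CCCTTT"]  # toy reads
contigs = greedy_assemble(reads)
print(contigs)  # ['CCCTTT', 'ATGCGTACGGA']
```

The first three reads chain into one contig, while the fourth shares no overlap with anything and is left on its own, which is how a gap between contigs shows up in practice.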

21

Genome assembly example

knowt flashcard image
22

Paired-end reads for genome assembly

  • In reference-based assembly, paired reads improve alignment accuracy across repetitive or ambiguous regions.

  • In de novo assembly, they help bridge gaps and resolve repeats by linking contigs based on known insert sizes and orientations.

  • Paired-end sequencing might not necessarily provide sequencing data for the entire length of the fragment, but it can help bridge the gaps between contigs, as the distance between the paired reads is known.

23

!! Common approaches for de novo assembly

  • Overlap-layout-consensus (OLC) method:

    • Create a graph where nodes are reads and edges represent sequence overlaps.

    • Suitable for long and error-prone reads.

    • Inefficient for very large datasets.

  • De Bruijn graph (DBG) framework:

    • Decomposes reads into shorter k-mers and connects them into a graph based on overlaps.

      • A k-mer is a substring of length k; for example, the 3-mers of ATGCG are ATG, TGC, and GCG.

    • Suitable for short, accurate reads.
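
A k-mer is a substring of length k, so decomposing a read into k-mers is a sliding window. The helper name `kmers` below is just for illustration.

```python
def kmers(seq, k):
    """Return every substring of length k, in order; a sequence of length n has n - k + 1 of them."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("ATGCGT", 3))  # ['ATG', 'TGC', 'GCG', 'CGT']
```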

24

De Bruijn graph construction

  1. Choose a value of k.

  2. For each k-mer that exists in any sequence, create an edge from a node labeled with its prefix (the first k-1 bases) to a node labeled with its suffix (the last k-1 bases).

  3. Glue all nodes that have the same label.
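
The three steps can be sketched as follows, assuming error-free reads and counting each distinct k-mer once. Using a dictionary keyed by node label does the "gluing" automatically; the function name and reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of nodes its edges point to."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:            # each distinct k-mer contributes one edge
                continue
            seen.add(kmer)
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].append(suffix)  # same label -> same dictionary key (glued node)
    return dict(graph)

graph = de_bruijn(["ATGCG", "GCGTA"], k=3)
print(graph)
# {'AT': ['TG'], 'TG': ['GC'], 'GC': ['CG'], 'CG': ['GT'], 'GT': ['TA']}
```

Following the edges from AT to TA spells out ATGCGTA, the sequence both reads came from.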

25

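!! De Bruijn graph example

  • Bulges = undirected cycles: two paths leave the same node and rejoin later, as when a sequencing error or variant makes reads differ in the middle of an otherwise shared region.

  • Whirls = directed cycles: a path returns to a node it has already visited, as when a short sequence repeats.

  • They occur because of sequencing errors or repeats in the genome.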

26

How to assess assembly

  • Bioinformaticians use metrics to assess an assembly's completeness, contiguity, and accuracy. For example, the commonly used N50 metric indicates the assembly's contiguity. 

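  • N50: the length of the shortest contig in the set of longest contigs that together contain at least 50% of the total assembled sequence.

    • Step 1: Calculate the total length of all contigs and order them by length, longest first.

    • Step 2: Calculate 50% of the total length.

      • Half-length = L/2, where L is the total length.

    • Step 3: Add up contig lengths from the longest down until the running total reaches L/2; the length of the contig that crosses this threshold is the N50.

    • L50 is a related but different metric: the number of contigs needed to reach that threshold.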
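
A short worked example of the N50 calculation, using made-up contig lengths. The function name `n50` is illustrative.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total first reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)  # order by length, longest first
    half = sum(lengths) / 2                         # 50% of the total length
    running = 0
    for length in lengths:                          # accumulate from the longest contig down
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Here the total is 290, so the threshold is 145; the two longest contigs (80 + 70 = 150) cross it, and the N50 is 70.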

27

Depth/coverage

  • Coverage (depth) in DNA sequencing is the number of unique reads that include a given nucleotide in the reconstructed sequence.

    • Higher depth = more confidence in assembly.

    • Low depth = risk of missing regions or errors.

  • Sequencing depth varies across the platforms and depends on the application goals.
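
Depth can be computed per position or summarized as an average. The sketch below assumes reads are given as (start, length) intervals on a toy genome; the average uses the common rule of thumb (number of reads × read length) / genome length.

```python
def depth_per_base(genome_length, alignments):
    """alignments: list of (start, read_length) with 0-based start positions."""
    depth = [0] * genome_length
    for start, read_len in alignments:
        for pos in range(start, min(start + read_len, genome_length)):
            depth[pos] += 1
    return depth

def mean_coverage(n_reads, read_len, genome_length):
    """Average coverage = (number of reads x read length) / genome length."""
    return n_reads * read_len / genome_length

depth = depth_per_base(10, [(0, 4), (2, 4), (3, 4)])
print(depth)                     # [1, 1, 2, 3, 2, 2, 1, 0, 0, 0]
print(mean_coverage(3, 4, 10))   # 1.2
```

The average of 1.2x hides the fact that the last three positions have no reads at all, which is why low average depth risks missing regions.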

28

Variant calling

  • Identify the differences in an individual’s genome compared to a reference genome.

    • Single Nucleotide Polymorphisms (SNPs).

    • Insertions/deletions (InDels).

    • Structural variants (SVs).

  • To account for the various types of error in the data, we only call variants at locations that have multiple reads.

  • Mapping to the reference assembly helps scientists identify single-nucleotide polymorphisms (SNPs) and small variations in sequences by comparing reads to known genomes. 
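
The "multiple reads" rule can be illustrated with a naive SNP caller. This is a sketch, not how production callers work: it assumes gap-free alignments, and the thresholds `min_depth` and `min_fraction` are arbitrary values chosen for the example.

```python
from collections import Counter

def call_snps(reference, alignments, min_depth=3, min_fraction=0.8):
    """alignments: list of (0-based start, read sequence). Returns [(pos, ref_base, alt_base)]."""
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            if start + offset < len(reference):
                pileup[start + offset][base] += 1
    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:            # too few reads to trust this position
            continue
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls

reference = "ACGTACGT"                   # toy reference
alignments = [(0, "ACGAAC"), (1, "CGAACG"), (2, "GAACGT"), (4, "ACGA")]
print(call_snps(reference, alignments))  # [(3, 'T', 'A')]
```

Position 3 is called because three reads agree on A. The mismatch at position 7 is seen in only one of two reads, so it is treated as a possible error and not called.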

29

Data files

  • FASTQ: sequence reads and their quality scores.

  • SAM & BAM: alignment to the genome.

    • BAM is the compressed binary form of SAM; alignment cleanup steps such as sorting are usually run on BAM files.

  • VCF: variant calls.

30

FASTQ files

  • FASTQ files store sequence reads.

    • An extension of the old FASTA format.

    • Includes both sequence and quality scores.

    • Four lines to represent each read.

  • Picture:

    • Line 1 begins with the ‘@’ character, followed by a sequence identifier and an optional description. It can contain flow cell IDs, lane numbers, and information on read pairs. 

    • Line 2 is the sequence letters.

    • Line 3 begins with a ‘+’ character; it marks the end of the sequence and is optionally followed by the same sequence identifier as in line 1.

    • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Each letter corresponds to a quality score. A standard in the field is to use “Phred quality scores”. 
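
The four-line layout can be parsed with a few lines of code. This sketch assumes the common Phred+33 encoding (quality character code minus 33) and strictly four lines per record; the record shown is invented. A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality_scores) for each four-line record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(ch) - 33 for ch in qual]  # assumes Phred+33 encoding
        yield header[1:], seq, scores

def error_probability(q):
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

example = "@read1 hypothetical\nGATTACA\n+\nIIII5+!\n"
records = list(parse_fastq(example))
print(records)
# [('read1 hypothetical', 'GATTACA', [40, 40, 40, 40, 20, 10, 0])]
```

A score of 20 means a 1 in 100 chance of error and 40 means 1 in 10,000, so the last bases of this toy read are much less reliable than the first four.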

31

SAM & BAM files

  • SAM (Sequence Alignment/Map) files store reads aligned to the reference genome.

  • BAM files are compressed SAM files.

32

VCF files

  • VCF files record and store genetic variants found at specific locations within a DNA sequence.

33

Functional elements of DNA

  • Segments of DNA with a defined biological role.

    • Protein-coding genes.

    • Non-coding regulatory elements.

      • Promoters: Regions near genes where transcription begins.

      • Enhancers: regions that control gene expression.

      • Noncoding RNA genes, whose RNA products have direct functions themselves.

34

Encyclopedia of DNA Elements (ENCODE)

  • Build a comprehensive list of functional elements in the human genome → also known as genome annotation.

  • ENCODE genome annotation is tissue-specific.

    • ENCODE’s genome annotation is tissue-specific because the functional elements of the genome, such as which genes are active or inactive, vary between different cell and tissue types.

    • Help understand how gene activity is regulated in a tissue-specific manner.

35

GenBank

  • A collection of publicly available annotated nucleotide sequences.

    • 250,000 organisms in total.

  • A primary database → updated only by submitters.

  • To get information on the files in GenBank, look at the presentation.

36

RefSeq

  • Reference sequence: a genomic sequence that has been chosen as the basis for annotations such as genes and sequence variations.

    • Genomic: gene sequence.

    • Transcript: sequence of mRNAs after alternative splicing.

    • Protein: sequence of downstream protein products corresponding to these genes.

  • It is a curated collection of DNA, RNA, and protein sequences built by NCBI.

  • There is only one example of each natural biological molecule for major organisms.

    • 4,000 organisms in total.

  • A derivative database → continually updated by NCBI; uses information from GenBank.

37

Functional divisions

  • Sequence tagged sites (STS):

    • Relatively short sequence (200 to 500 bp).

    • Occurs only once in the genome, and its location and base sequence are known.

  • Expressed sequence tags (ESTs):

    • A subset of sequence tagged sites (STSs) located within the coding region of a gene.

  • Genome survey sequence (GSS) and High-throughput Genomic (HTG):

    • Unfinished and partial genomic DNA sequences.

    • Will be moved to their respective divisions once complete.

38

International Nucleotide Sequence Database Collaboration (INSDC)

  • International collaboration:

    • DNA Data Bank of Japan (DDBJ).

    • European Nucleotide Archive (ENA).

    • GenBank.

  • Data sharing.

  • No use restriction.

  • Permanently accessible.

  • A unique identifier: accession number.

    • AAC37594.

  • Change in sequences:

    • Version number AAC37594.1.
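
The accession.version pattern is easy to split in code. A tiny sketch, assuming the identifier has at most one dot; the helper name is made up.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None

print(split_accession("AAC37594.1"))  # ('AAC37594', 1)
print(split_accession("AAC37594"))    # ('AAC37594', None)
```

The accession stays the same for the life of the record, while the version number increases each time the sequence itself changes.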

39

Visualization of the genome

  • Use the UCSC genome browser.

  • For directions on how to use the browser, look at the presentation.