From Raw Sequences to Assembled Genomes

41 Terms

1

Library Construction

  • involves ligating known DNA sequences (adapters) onto unknown genomic fragments, so that the fragments can bind to the flow cell, be amplified and sequenced from both ends

  • terminal sequences: complementary to oligos on flow cell; allow DNA fragments to attach (tether) to sequencing surface

  • adapters: short known DNA sequences attached enzymatically (ligation); enable amplification and sequencing

  • primer binding sites: where sequencing primers bind, required to initiate DNA synthesis

  • unknown DNA sequence: can obtain ~150-250 bp from the ends of each fragment

2

Library Construction Figure

knowt flashcard image
3

Why Sequence Both Ends?

  • helps resolve repetitive DNA

  • places fragments more accurately in the genome

4

Cluster Generation

  • steps: 2) generate clusters of amplified DNA sequences

  • DNA strands bind to flow cell via base pairing to 1st terminal sequence

  • flow cell oligos are extended using DNA polymerase

  • 3’ ends of the extended strands bend over and hybridize w/ nearby flow cell oligos (bridge formation)

  • the hybridized oligos are then extended by DNA polymerase (bridge amplification)

    • repeat to generate clusters

5

Remove Reverse Strands

  • after cluster generation, each cluster contains both forward and reverse strands, but sequencing by synthesis requires single-stranded templates that are all oriented the same way

    • one strand must be removed

  • restriction sites are built into the adapter; the site is present in only one strand orientation

  • restriction enzyme cuts and removes the reverse strands

6

Sequencing by Synthesis

  • steps: 3) massively parallel sequencing of clusters

  • each cycle: nucleotide analogs w/ a cleavable fluorophore and a terminating group are added, one base is incorporated and imaged, then the terminator and dye are cleaved so the next cycle can begin
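
The cycle described above can be sketched as a toy model. This assumes an idealized cluster (every strand adds exactly one base per cycle, no signal loss); the function name `sequence_by_synthesis` and the dye colors are made up for illustration and are not the instrument's actual chemistry. The template is written 3'→5' so the read comes out 5'→3'.

```python
# Toy model of sequencing by synthesis (idealized: no phasing errors, no signal decay).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Hypothetical dye assignment, for illustration only.
DYE = {"A": "green", "C": "red", "G": "blue", "T": "yellow"}

def sequence_by_synthesis(template, cycles):
    """Return (read, colors) after `cycles` rounds of incorporate -> image -> cleave."""
    read = []
    colors = []
    for position in range(min(cycles, len(template))):
        # 1) Incorporate: the terminating group allows only ONE base to be added.
        base = COMPLEMENT[template[position]]
        # 2) Image: the fluorophore color identifies the incorporated base.
        colors.append(DYE[base])
        read.append(base)
        # 3) Cleave: dye and terminator are removed, so the loop can continue.
    return "".join(read), colors

read, colors = sequence_by_synthesis("TACGGA", cycles=4)
print(read)    # ATGC
print(colors)  # ['green', 'yellow', 'blue', 'red']
```

The read length equals the number of cycles, which is why each extra base of read length costs one more round of chemistry and imaging.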

7

How is the Genome Divided?

  • into coding regions (exome) and non-coding regions

    • exons: differential splicing can occur

  • transcriptome: complete set of RNA transcripts produced in a cell (or tissue) at a given time and condition

    • can be used to compare differences between cells

    • tells us which genes/regions of the genome are transcribed

      • ome ending: studying all at once

  • enhancers: regions that affect the activity of a promoter (can be far from the gene)

  • promoters: where core RNA polymerase machinery binds to initiate transcription (has to be close to gene)

  • mutations in regulatory non-coding regions can affect gene expression

  • mutations in exons can affect the underlying amino acid sequence as well as gene expression

8

How is the Genome Divided? FIGURE

knowt flashcard image
9

What does Exome Sequencing Identify?

  • exome sequencing identifies causal mutations even in a small patient pool

  • mutations in embryonic myosin MYH3 are thought to cause Freeman-Sheldon syndrome, a severe congenital joint disorder

    • NS/SS/I (nonsynonymous, splice-site, or indel variant) = potentially damaging mutation

  • exome sequencing: MYH3 was the only gene w/ a potentially damaging coding variant in all 4 affected patients; these variants were not found in a public database of common SNPs or in 8 control individuals

  • figure: as you compare patient genome and reference genome, you narrow down which gene is the problem (where the mutation is)
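
The narrowing-down logic can be sketched with set operations. This is a simplified, gene-level illustration with made-up data (`GENE_A` to `GENE_D` and all individuals are hypothetical); a real analysis filters individual variants, not just gene names.

```python
# Hypothetical input: genes carrying a potentially damaging (NS/SS/I) variant per individual.
patients = {
    "patient1": {"MYH3", "GENE_A", "GENE_B"},
    "patient2": {"MYH3", "GENE_A", "GENE_C"},
    "patient3": {"MYH3", "GENE_A"},
    "patient4": {"MYH3", "GENE_A", "GENE_D"},
}
controls = {
    "control1": {"GENE_B"},
    "control2": {"GENE_C"},
}
# Hypothetical: genes whose variants are already listed as common SNPs.
common_snp_genes = {"GENE_A"}

def candidate_genes(patients, controls, common):
    """Genes hit in ALL patients, minus those explained by common SNPs or seen in controls."""
    shared = set.intersection(*patients.values())
    seen_in_controls = set().union(*controls.values())
    return shared - common - seen_in_controls

print(candidate_genes(patients, controls, common_snp_genes))  # {'MYH3'}
```

Each filter removes candidates, which is why even 4 patients can be enough to isolate one gene.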

10

What does Whole Genome Sequencing Reveal?

  • whole genome sequencing reveals the role of non-coding mutations

  • mutations affecting enhancers, promoters, silencers, and insulator elements can have substantial effects on gene expression and cannot be captured by exome sequencing

11

Conclusions so Far

  • the development of new technologies has dramatically lowered the cost of DNA sequencing

  • number of genes hasn’t increased that much in more complex organisms – complexity during animal evolution must be driven by increasingly sophisticated gene regulation

  • the ability to obtain massive amounts of sequencing data has helped to discover the causes of many rare genetic syndromes (even w/ very few patients)

12

Why has it taken ~20 years to complete the Human Genome Project?

Going from raw sequences to an assembled genome is not simple

  • putting short reads in the correct order requires there to be sufficient read coverage

  • sequenced genome may have deletions, insertions, rearrangements, substitutions, etc. compared to a reference genome

  • genomes have many regions that are highly repetitive
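
"Sufficient read coverage" can be made concrete with the standard relation coverage = (number of reads × read length) / genome size. The sketch below also uses the usual Poisson approximation (reads land uniformly at random) for the fraction of bases never sequenced. The genome size and read length in the example are illustrative assumptions.

```python
import math

def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

def expected_uncovered_fraction(coverage):
    """Poisson approximation: probability that a given base is hit by zero reads."""
    return math.exp(-coverage)

# Assumed example: ~3.1 Gb genome, 150 bp reads, 30x target coverage.
print(reads_needed(30, 150, 3_100_000_000))
print(expected_uncovered_fraction(mean_coverage(1000, 100, 10_000)))
```

Even at high mean coverage, repeats and rearrangements still break assembly, because coverage says nothing about whether a read can be placed uniquely.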

13

From Reads to Contigs

  • contig: a consensus sequence that results from assembling overlapping fragments (reads)

  • polymorphism: common DNA sequence variation at a specific genomic position that exists in a population

    • alleles are different versions at that site

    • typically small differences (single nucleotide polymorphisms, small indels)

  • coverage: the number of times each position is sequenced in an experiment; poor coverage can lead to assembly errors

  • major challenge for sequencing is repetitive DNA
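
A minimal sketch of how overlapping reads become a contig, assuming error-free reads from one strand and a simple greedy strategy (merge the pair with the longest suffix/prefix overlap first). Real assemblers use more robust graph-based methods; `overlap` and `greedy_assemble` are illustrative names.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b` (>= min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the best-overlapping pair; returns the remaining contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

If a repeat is longer than the reads, several reads overlap equally well in more than one place, and the greedy choice can join the wrong neighbors. That is the core reason repetitive DNA is hard to assemble.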

14

Repetitive Elements

  • make up the majority of human genome (66-69% is repetitive or derived from repeats)

Major types:

  • tandem repeats: short repeats including minisatellites (10-60 bp) found in centromeres and microsatellites (2-8 bp) found in telomeres

  • interspersed repeats: transposable elements

    • found throughout genome, enriched in centromeres, pericentric/intercalary heterochromatin

15

Satellite DNA Sedimentation

  • satellite DNA differentially sediments in a CsCl gradient due to different base composition than the “main band” DNA

  • satellite DNA doesn’t have the same A/G/C/T (base) composition as the rest of the genome

  • satellite bands are repetitive DNA that don’t co-sediment w/ main band

  • AT DNA is less dense than GC DNA
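
The link between base composition and position in the gradient can be sketched numerically. The linear relation used here (density ≈ 1.660 + 0.098 × GC fraction, in g/cm³) is the commonly cited empirical approximation for DNA in CsCl; treat the exact constants as an assumption, and the two sequences as made-up examples.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def buoyant_density(seq):
    """Approximate buoyant density in CsCl (g/cm^3); constants are an empirical approximation."""
    return 1.660 + 0.098 * gc_fraction(seq)

main_band = "ATGCGCTAGCATCGGA"   # hypothetical sequence w/ mixed composition
at_rich_satellite = "AATATAATATAATATA"  # hypothetical AT-rich repeat

print(gc_fraction(main_band), buoyant_density(main_band))
print(gc_fraction(at_rich_satellite), buoyant_density(at_rich_satellite))
```

An AT-rich satellite gets a lower density than the main band, so it forms its own lighter band instead of co-sedimenting.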

16

Transposable Elements in Maize

  • coloration of maize kernels is caused by the disruption of a pigment gene by a transposable element Dissociation (Ds)

    • if Ds stays in the pigment gene → colorless kernel; if Ds jumps out → pigment restored (purple, WT)

  • Ds transposition (jumping) requires an enzyme produced by a second transposable element Activator (Ac)

  • this showed that the genome is much more dynamic than previously thought

    • genes can literally jump out from and into genomes

17

Transposons Definitions

  • DNA transposons move by cutting and pasting themselves into different parts of the genome (cut and paste)

  • LTR (long terminal repeat) retrotransposons: transcribed into RNA and reverse-transcribed into dsDNA in the cytoplasm using a tRNA primer (copy and paste)

  • Non-LTR retrotransposons: transcribed into RNA, which is reverse-transcribed into DNA in the nucleus using nicked host DNA as a primer (copy and paste)

  • both non-LTR and LTR use RNA intermediates → reverse-transcribe to DNA
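
The difference between "cut and paste" and "copy and paste" shows up in copy number. A toy sketch, treating the genome as a list of labeled segments (the labels and function names are illustrative; the RNA intermediate is not modeled):

```python
def cut_and_paste(genome, element, target_index):
    """DNA transposon: excise the element, then reinsert it (copy number unchanged).
    `target_index` is counted in the genome AFTER excision."""
    new_genome = list(genome)
    new_genome.remove(element)            # excision from the donor site
    new_genome.insert(target_index, element)
    return new_genome

def copy_and_paste(genome, element, target_index):
    """Retrotransposon: the original stays put, a new copy is inserted (copy number +1)."""
    if element not in genome:
        raise ValueError("element not present in genome")
    new_genome = list(genome)
    new_genome.insert(target_index, element)
    return new_genome

genome = ["geneA", "TE", "geneB", "geneC"]
print(cut_and_paste(genome, "TE", 2))   # ['geneA', 'geneB', 'TE', 'geneC']
print(copy_and_paste(genome, "TE", 3))  # ['geneA', 'TE', 'geneB', 'TE', 'geneC']
```

Because each copy-and-paste event adds a copy, retrotransposons can accumulate to very high copy numbers over evolutionary time.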

18

Transposable Elements in our Genome

  • transposable elements make up almost half our genomes

    • genomic parasites; impact on genes; important for evolution

  • ~4,000 full-length LINE1 insertions and ~100,000 fragments in the human genome

    • initially thought ~40 LINE1 elements are still active in our genomes

19

Transposon Insertions: Diseases

  • study of 240 hemophilia patients showed that 2 individuals had disease mutations caused by independent LINE1 insertions into Factor VIII

  • LINE1 activity is elevated in cancers and is hypothesized to mutagenize cancer genomes, promoting cancer progression

  • estimated that 1 in 1,000 disease-causing mutations is due to a novel LINE1 insertion

20

Transposon Insertions: Evolution

  • new transposon insertions help to drive variation and the evolution of adaptive traits

21

Arc Gene

  • important for many forms of neuronal plasticity (e.g. memory formation)

  • Arc = activity-regulated cytoskeleton-associated protein

  • related to repetitive DNA b/c it evolved from a retrotransposon (type of repetitive DNA)

22

Arc1

  • Arc1 encodes a retroviral-like protein that traffics b/w neurons

    • independent domestication of retroelements happened many times in evolution (suggests a strong selective advantage)

  • Arc1 protein is made in muscle cells from Arc mRNA supplied by neurons

    • neurons produce Arc mRNA

  • Arc1 forms viral-like capsids in purified exosomes

    • these particles encapsulate Arc mRNA and are released in extracellular vesicles (exosomes)

  • our ability to form new memories requires a gene that evolved from a retro-element

    • loss of Arc → severe memory deficits

23

Why can identical repetitive DNA sequences have different biological effects?

  • even if the 2 DNA sequences are identical at the nucleotide level, their biological effect can be completely different depending on where they are in the genome

    • sequence ≠ function by itself

  • eg. different contexts: different neighboring genes, regulatory elements, cell type

  • effects can be very different depending on where they land

    • eg. lands in promoter → alters transcription; exon → disrupts protein; intron → affects splicing

    • in Arc: one copy of the repetitive retrotransposon landed in the right genomic context and became essential for memory
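
The "same sequence, different effect" idea can be sketched as a lookup from insertion position to genomic context. The coordinates and the `ANNOTATION` table are hypothetical; the effects listed are the ones named above, and anything outside the annotated gene is simply reported as unannotated.

```python
# Hypothetical annotation of one gene: (start, end, feature), end is exclusive.
ANNOTATION = [
    (0, 100, "promoter"),
    (100, 300, "exon"),
    (300, 800, "intron"),
    (800, 1000, "exon"),
]
EFFECT = {
    "promoter": "alters transcription",
    "exon": "disrupts protein",
    "intron": "affects splicing",
}

def insertion_effect(position):
    """Return (feature, likely effect) for an identical element inserted at `position`."""
    for start, end, feature in ANNOTATION:
        if start <= position < end:
            return feature, EFFECT[feature]
    return "unannotated", "effect not predicted by this toy model"

for pos in (50, 150, 500, 5000):
    print(pos, insertion_effect(pos))
```

The inserted sequence never changes in this sketch; only the landing site does, which is the whole point.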

24

How does Paired-End Sequencing Generate Reads?

Paired reads are produced thru sequential sequencing of forward and reverse strands

  • DNA is fragmented to segments of known length, adapters are ligated to both ends and contain sequencing primer binding sites

    • this defines the distance between the paired reads

  • cluster generation (contain forward and reverse strands)

  • Read 1: reverse strands are cleaved, leaving single-stranded forward templates

    • forward strands are sequenced (first paired read; forward read sequenced)

    • sequencing primer binds to the adapter

  • Read 2: forward strand is removed, reverse strand is regenerated by adding dNTPs and DNA polymerase

    • sequencing primer binds on the other adapter

    • reverse read sequenced (2nd paired read)

  • final results: reads from opposite ends/orientation, but from the same fragment and separated by a known insert size
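
A minimal sketch of what the two reads look like relative to the original fragment, assuming error-free sequencing and ignoring adapters. Read 1 is the start of the forward strand; read 2 is the start of the reverse strand, i.e. the reverse complement of the fragment's other end. The fragment sequence is made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_length):
    """Return (read1, read2, inner_distance) for one fragment."""
    read1 = fragment[:read_length]                      # forward strand, left end
    read2 = reverse_complement(fragment)[:read_length]  # reverse strand, right end
    inner_distance = len(fragment) - 2 * read_length    # unsequenced middle
    return read1, read2, inner_distance

fragment = "ATGGCGTACGTTAGCCTA"
read1, read2, inner = paired_end_reads(fragment, read_length=5)
print(read1)  # ATGGC
print(read2)  # TAGGC
print(inner)  # 8
```

Because the fragment length is known, the mapper knows roughly how far apart and in which orientations the two reads should land.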

25

How does Paired-End Sequencing Generate Reads FIGURE

  • within the two adapters that are ligated onto either end of DNA molecules are sequencing primer binding sites

  • paired end sequencing can help us to map some repetitive sequences
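
A sketch of how a read pair helps place a read that falls in a repeat: the mate in unique sequence anchors the pair, and only the repeat copy at the expected distance is accepted. The reference, reads, and insert size are made up, read 2 is assumed to be already reverse-complemented onto the forward strand, and exact matching stands in for real alignment.

```python
def find_all(reference, read):
    """All start positions where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

def place_pair(reference, read1, read2_fwd, insert_size, tolerance=2):
    """Keep only (read1, read2) placements whose outer span matches the insert size."""
    placements = []
    for p1 in find_all(reference, read1):
        for p2 in find_all(reference, read2_fwd):
            span = p2 + len(read2_fwd) - p1
            if abs(span - insert_size) <= tolerance:
                placements.append((p1, p2))
    return placements

repeat = "CCATGGTC"  # appears twice in the hypothetical reference below
reference = "ATCGTACG" + "TTGACA" + repeat + "AATTCGGATTACAGGT" + repeat + "TGCA"

print(find_all(reference, repeat))                             # [14, 38]
print(place_pair(reference, "ATCGTACG", repeat, insert_size=22))  # [(0, 14)]
```

On its own the repeat read maps to two places; paired with its unique mate, only one placement is consistent. This fails when both mates fall inside a repeat longer than the insert, which is why only some repetitive sequences can be mapped this way.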

26

Paired End Sequencing Mapping

knowt flashcard image
27

Sanger Sequencing Elegant Approach

  • increase the read length: get sequencing reactions to approach the size of chromosomes

  • long read sequencing analyzes single DNA molecules

  • long read sequencing has higher error rate (0.2-3% vs 0.1%) and is more expensive than short read sequencing

  • short and long read sequencing using all platforms can be (and are frequently) combined

    • this approach led to the full sequencing of human chromosomes

28

PacBio Sequencing

  • uses nanostructure arrays borrowing tech developed for semiconductors

    • “zero-mode waveguide nanostructured arrays”: tiny holes in a metal film; light can only penetrate a few nanometers into each well, so only fluorophores right at the polymerase active site are illuminated

    • tiny nanostructured wells confine single DNA polymerase molecules so individual nucleotide incorporation can be detected in real time

    • semiconductor fabrication allows many identical nanometer-scale wells

  • individual well vol = 1 zeptoliter (1 × 10⁻²¹ liters)

    • so small that only a few molecules are present at any time

    • one polymerase dominates the signal

    • enables single-molecule detection w/o amplification

  • rate of molecule diffusion is extremely fast compared to the rate of DNA base incorporation

    • when a nucleotide is incorporated, it is held by the polymerase, so the fluorophore stays long enough to be detected
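
A quick calculation shows why such a tiny volume matters: the average number of molecules in a volume is concentration × volume × Avogadro's number. The well volume is the 1 zeptoliter figure above; the nucleotide concentrations are assumed example values.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
ZMW_VOLUME_L = 1e-21       # 1 zeptoliter, as stated above

def mean_molecules(concentration_molar, volume_liters=ZMW_VOLUME_L):
    """Average number of molecules present in the volume at a given molar concentration."""
    return concentration_molar * volume_liters * AVOGADRO

# Assumed example concentrations of labeled nucleotide.
for label, conc in [("1 uM", 1e-6), ("1 mM", 1e-3)]:
    print(label, mean_molecules(conc))
```

Even at millimolar concentration the well holds less than one free labeled nucleotide on average, so a sustained signal almost always comes from the nucleotide held by the polymerase.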

29

PacBio Sequencing

knowt flashcard image
30

PacBio HiFi Sequencing

  • single molecule sequencing by synthesis (sequences one DNA molecule at a time)

    • no PCR amplification required

    • DNA polymerase incorporates nucleotides like normal replication and each base addition is observed in real time

  • bases are labeled with dyes thru the terminal phosphate group – incorporation will lead to cleavage of the dye

31

Real-time Detection of Base Incorporation

  • time for each base to be incorporated (milliseconds) is much slower than random diffusion (microseconds)

    • free nucleotides diffuse too quickly to produce a signal

  • DNA polymerase is highly processive (can synthesize >70 kilobases w/o falling off)

    • enables very long reads

  • get a fluorescent spike (strong signal that lasts) for correct bases

    • DNA polymerase holds each nucleotide long enough for a detectable fluorescent signal, allowing extremely long single-molecule reads
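
The timescale argument can be sketched as a duration filter: long pulses are incorporations, very short ones are nucleotides diffusing through. The pulse list and the 1 ms threshold are made-up illustration values, not instrument parameters.

```python
def call_bases(pulses, min_duration_ms=1.0):
    """Keep only pulses long enough to be a polymerase-held nucleotide.
    `pulses` is a list of (base, duration in milliseconds)."""
    return "".join(base for base, duration in pulses if duration >= min_duration_ms)

# Hypothetical trace: millisecond pulses are incorporations, microsecond blips are diffusion.
pulses = [("A", 12.0), ("G", 0.003), ("C", 8.5), ("T", 0.002), ("T", 15.0)]
print(call_bases(pulses))  # ACT
```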

32

Oxford Nanopore Sequencing

  • directly measures DNA translocation

  • similar to gel electrophoresis: applying a voltage gradient drives negatively charged DNA towards the anode, and the only route is thru a pore

  • translocation of DNA blocks the natural current of ions thru a pore

    • as each nucleotide passes thru the pore, the DNA partially blocks the ionic current

    • different bases cause distinct current changes

  • detects DNA bases by measuring changes in ionic current as DNA molecule passes thru biological pore under applied voltage

  • can detect modified bases and structural variants
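
A heavily simplified sketch of reading bases from current levels: each measured current is assigned to the base with the nearest reference level. The picoampere values are invented, and treating each base as having its own level is a simplification for illustration; real basecalling is considerably more involved.

```python
# Hypothetical reference current levels in picoamperes, one per base.
LEVELS_PA = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}

def call_base(current_pa):
    """Pick the base whose reference level is closest to the measured current."""
    return min(LEVELS_PA, key=lambda base: abs(LEVELS_PA[base] - current_pa))

def basecall(trace):
    """Convert a list of per-base current measurements into a sequence."""
    return "".join(call_base(current) for current in trace)

trace = [81.2, 124.0, 109.5, 96.1]  # made-up measurements
print(basecall(trace))  # ATGC
```

A modified base would shift the current away from all four reference levels, which hints at how modifications can be detected from the same signal.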

33

Oxford Nanopore Sequencing FIGURE

knowt flashcard image
34

Oxford Nanopore Sequencing: Shelf Life

  • because nanopore sequencers use biological channels, there is a limited shelf life

    • 1 month at room temp, 3 months at 4ºC

  • accessibility: data acquired in the lab via a laptop w/ a USB connection

35

What is the sequence of centromeric DNA?

  • centromeric DNA is composed of highly repetitive sequences, making these regions difficult to sequence and assemble (“dark matter”)

  • centromeres are essential for chromosome segregation and are flanked by heterochromatin

  • humans: alpha satellite DNA is the main repeat in centromeres

    • centromeres organized as monomeric repeat and higher order repeats (multiple monomers in a specific order repeated many times)

    • surrounding heterochromatin contains transposable elements

  • Drosophila: centromeres are also highly repetitive, but differ in sequence

    • surrounding heterochromatin contains many transposable elements

36

What is the sequence of centromeric DNA? FIGURE

knowt flashcard image
37

Satellite Repeats in Centromeres

  • give an evolutionary advantage

  • chromosomes w/ more repeats will end up inherited at a 60:40 ratio vs chromosomes w/ fewer repeats

  • more repeats = stronger centromere = higher chance of being inherited

    • during female meiosis, only 1 of the 4 meiotic products becomes the egg, and chromosomes w/ stronger centromeres (more repeats) are more likely to attach to the egg pole

    • centromere repeats aren’t evolving b/c they make the organism more fit; this may even lead to species that are less fit (b/c centromeres are cheating in meiosis)
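
A minimal sketch of how a 60:40 bias spreads a "strong" centromere through a population even with no fitness benefit. Assumptions (not from the notes): random mating, a very large population, no fitness cost, biased transmission only in heterozygous females (males transmit 50:50), and equal allele frequencies in both sexes.

```python
def next_frequency(p, female_transmission=0.6):
    """Frequency of the strong centromere in the next generation.
    Heterozygotes pass it on w/ probability `female_transmission` in females and 0.5 in males,
    so the sex-averaged transmission from heterozygotes is their mean."""
    k = (female_transmission + 0.5) / 2
    return p * p + 2 * p * (1 - p) * k

def generations_to_reach(p0, target, female_transmission=0.6, max_generations=10_000):
    """Count generations until the frequency reaches `target` (capped at max_generations)."""
    p, generation = p0, 0
    while p < target and generation < max_generations:
        p = next_frequency(p, female_transmission)
        generation += 1
    return generation

print(next_frequency(0.5, female_transmission=0.5))  # 0.5 (fair meiosis: no change)
print(generations_to_reach(0.01, 0.99))
```

With fair meiosis the frequency stays put; with biased transmission it climbs every generation regardless of whether the organism benefits.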

38

Satellite Repeats in Centromeres FIGURE

knowt flashcard image
39

What did long-read sequencing reveal about fly centromeres, and how are transposable elements involved?

  • long read sequencing led to the discovery that fly centromeres are filled w/ islands of retroelements (not just homogeneous satellite repeats)

  • centromeres are sites of evolutionary warfare since they compete for inheritance during female meiosis

    • transposons (“soldiers”) can contribute to centromere function and possibly increase centromere strength for centromere drive

40

What does comparison of human centromeres reveal about their variability and evolution?

  • gapless assemblies are needed for this comparison: standard genome assemblies have gaps in repetitive regions

  • humans are highly variable: ~50% of centromeric sequences can’t even be aligned b/w two individuals due to higher order repeat sequences

  • the positions of centromeres can differ by >500 kilobases

  • the size of centromeres can differ by up to 3-fold

  • further evidence that human chromosome evolution is shaped by complex and often non-Mendelian forces

    • human chromosome evolution is dynamic and influenced by repetitive DNA, not just coding sequences

41

Conclusions

  • repetitive DNA has important functions in biology and disease

  • the full sequencing of genomes necessitates the incorporation of long read sequencing approaches

  • several sequencing approaches are combined together to assemble an accurate genome

  • short read sequencing gives reads of the highest accuracy whereas long read sequencing enables analysis of repetitive regions