read
a single sequence generated from a technology. It can be based on a population of molecules, or a single molecule.
Read length
the number of base pairs observed per read.
Coverage
the number of times, on average, a region of DNA has been sequenced. If you have 10 reads for a given sequence, that is 10x coverage.
polymorphism
when more than one allele exists at the same locus in the population and the variant allele is not “rare” (a common convention is a frequency of at least ~1%).
single nucleotide polymorphism (SNP)
is a polymorphism where a single base has been changed.
Haplotype
Combination of alleles along a single chromosome.
Gel Electrophoresis
separating DNA of different sizes
Gels are molecular sieves
smaller molecules move more easily (faster)
Gel Electrophoresis
Current is applied - causes charged molecules to move through gel
Since DNA is negatively charged, thanks to its phosphates, it runs toward the + end (anode)
Molecules are sorted into “bands” by their size
Agarose gels contain a dye that fluoresces in presence of DNA (ethidium bromide, sybr green)
amt of dna needed for PCR
Typically, the minimum amount of template DNA needed is 0.1-10 ng of DNA for a 50 uL reaction.
cheek swab
~3.5 ug (from 1 mL)
hair
~3 ug (from 4 hair follicles)
urine
~1.5 ug (from 5 mL)
Blood
~ 4 ug (from 50 uL, i.e. drop of blood)
PCR amplification is SIGNIFICANT
Typically, the minimum amount of template DNA needed is 0.1-10 ng of DNA for a 50 uL reaction.
Sources of DNA and amounts (in extraction):
Cheek swab: ~3.5 ug (from 1 mL)
Hair: ~3 ug (from 4 hair follicles)
Urine: ~1.5 ug (from 5 mL)
Blood: ~ 4 ug (from 50 uL, i.e. drop of blood)
These samples provide >1000x more DNA than needed!!
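The size of that excess depends on which end of the 0.1-10 ng range a reaction needs. A minimal sketch of the arithmetic, using only the yields quoted above (the helper name `fold_excess` is made up for illustration):

```python
# Extraction yields in micrograms, as quoted in the cards above.
yields_ug = {"cheek swab": 3.5, "hair": 3.0, "urine": 1.5, "blood": 4.0}

def fold_excess(yield_ug, needed_ng):
    """Return how many times more DNA was extracted than one PCR needs."""
    return yield_ug * 1000.0 / needed_ng  # 1 ug = 1000 ng

for source, ug in yields_ug.items():
    low = fold_excess(ug, 10.0)   # most demanding case: 10 ng of template
    high = fold_excess(ug, 0.1)   # least demanding case: 0.1 ng of template
    print(f"{source}: {low:.0f}x to {high:.0f}x more DNA than one reaction needs")
```

The ">1000x" figure holds toward the low end of the template requirement; at the full 10 ng the excess is in the hundreds.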
Sanger DNA Sequencing requires
Requires ddNTP
Old-school Sanger Sequencing 1
Four reactions, each containing primer, polymerase, dNTPs, and trace levels of a single ddNTP
Run each reaction on a polyacrylamide gel, which has 1bp resolution
Old-school Sanger Sequencing 2
Detect radioactive DNA by exposing gel to X-ray film.
Read the autoradiogram sequence from the bottom up: ~250 bases
Capillary electrophoresis
can separate DNA at 1bp resolution
DNA sequence can be read off
the chromatogram or electropherogram – typically 800-1000 bases: ddNTPs make long strand synthesis unlikely!
Sanger ddNTP chain termination DNA sequencing 1
DNA fragments can be sequenced by the dideoxy chain termination (Sanger) method, the first automated method to be employed
Sanger ddNTP chain termination DNA sequencing 2
Modified nucleotides called dideoxyribonucleotides (ddNTPs) are incorporated into the growing DNA strands and stop synthesis, producing strands of different lengths
Sanger ddNTP chain termination DNA sequencing 3
Each type of ddNTP is tagged with a distinct fluorescent label that identifies the nucleotide at the end of each DNA fragment
Sanger ddNTP chain termination DNA sequencing 4
The DNA sequence can be read from the resulting electropherogram
Sanger ddNTP chain termination DNA sequencing
DNA fragments can be sequenced by the dideoxy chain termination (Sanger) method, the first automated method to be employed
Modified nucleotides called dideoxyribonucleotides (ddNTPs) are incorporated into the growing DNA strands and stop synthesis, producing strands of different lengths
Each type of ddNTP is tagged with a distinct fluorescent label that identifies the nucleotide at the end of each DNA fragment
The DNA sequence can be read from the resulting electropherogram
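The chain termination idea can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: the parameters `ddntp_fraction` and `n_molecules` are hypothetical, every position has the same chance of picking up a ddNTP, and the template is written in the order the polymerase copies it.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sanger_fragments(template, ddntp_fraction=0.05, n_molecules=2000, seed=0):
    """Simulate chain-terminated fragments as (length, labelled_base) pairs."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for position, base in enumerate(template, start=1):
            if rng.random() < ddntp_fraction:
                # A labelled ddNTP was incorporated: synthesis stops here.
                fragments.append((position, COMPLEMENT[base]))
                break
    return fragments

def read_sequence(fragments):
    """Sort fragments by length, as electrophoresis does, and read the end labels."""
    label_by_length = {length: base for length, base in fragments}
    return "".join(label_by_length[n] for n in sorted(label_by_length))

def fraction_longer_than(n_bases, ddntp_fraction=0.05):
    """Chance that a strand grows past n_bases without hitting a ddNTP."""
    return (1.0 - ddntp_fraction) ** n_bases

print(read_sequence(sanger_fragments("ATGCCGTA")))  # the newly synthesized strand
print(fraction_longer_than(1000))                   # long strands become rare
```

The second function shows why read length is limited: the chance of extending without termination falls geometrically with length.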
Sanger specs # well plates
Many of these sequencers use 384 well plates, with a single sample in each well.
sanger specs # times well can be sequenced
Each well can be sequenced up to ~1 kbp (10^3 bp)
human genome length
The human genome is ~3 Gbp (3x10^9 bp)
Sanger spec DNA per well
Need ~20ng/well
Sanger cost
Genewiz sequencing is ~$1-5/well, depending on volume
If I want to sequence a human genome, with 10x coverage, how many plates will I need to use? Sanger
(1 plate/400 rxn) x (1 rxn/10^3 bp) x (3x10^9 bp) x 10 ~ 10^5 plates
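The same estimate can be worked through in code. This sketch uses the exact 384 wells per plate rather than the rounded 400, and extends the arithmetic to cost and DNA input using the per-well figures from the spec cards; the function names are illustrative.

```python
import math

def sanger_wells(genome_bp, coverage, read_bp=1_000):
    """Number of wells (one read each) needed to reach the target coverage."""
    return math.ceil(genome_bp * coverage / read_bp)

def sanger_plates(genome_bp, coverage, read_bp=1_000, wells_per_plate=384):
    """Number of plates needed, rounding up to whole plates."""
    return math.ceil(sanger_wells(genome_bp, coverage, read_bp) / wells_per_plate)

wells = sanger_wells(3_000_000_000, 10)
plates = sanger_plates(3_000_000_000, 10)
print(plates)                          # on the order of 10^5 plates
print(wells * 1, "to", wells * 5)      # total cost in USD at $1-5 per well
print(wells * 20 / 1e9, "g of DNA")    # at ~20 ng per well
```

Tens of millions of wells at $1-5 each is why Sanger sequencing is reserved for small targets.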
Sanger specs
Many of these sequencers use 384 well plates, with a single sample in each well.
Each well can be sequenced up to ~1 kbp (10^3 bp)
The human genome is ~3 Gbp (3x10^9 bp)
Genewiz sequencing is ~$1-5/well, depending on volume.
Need ~20ng/well
(1 plate/400 rxn) x (1 rxn/10^3 bp) x (3x10^9 bp) x 10 ~ 10^5 plates
Sanger pros
Pros:
Cheap
Accurate
Good if you only need to sequence < ~100 kbp total. (This would be ~100 samples).
sanger cons
Cons
Gets expensive if you need to sequence >100 kbp.
Requires A LOT of DNA if you want to sequence >100 kbp.
Expensive machine; typically need to outsource this.
Read length is ~1000 bp.
illumina
Single molecules are amplified with PCR
Limitation: Sequencing length is ~200 bp, due to occasional incomplete cleavage of the dye and/or azide blocker.
illumina specs each lane
Each lane can sequence ~ 300 million (3x10^8) reads (dots).
illumina read length specs
Each read is ~100 bp
illumina cost per lane
$5,000 per lane.
illumina DNA needed
Uses ~50 ng of DNA
illumina specs
Specs
Each lane can sequence ~ 300 million (3x10^8) reads (dots).
Each read is ~100 bp
The human genome is ~3 Gbp (3x10^9 bp)
~$5,000 per lane.
Uses ~50 ng of DNA
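These specs imply that a single lane gives roughly 10x coverage of a human genome. A small sketch of that calculation, with constants taken from the card above and function names chosen for illustration:

```python
import math

READS_PER_LANE = 300_000_000   # ~3x10^8 reads per lane
READ_BP = 100                  # ~100 bp per read
COST_PER_LANE_USD = 5_000

def lanes_needed(genome_bp, coverage):
    """Whole lanes required to reach the target average coverage."""
    bp_per_lane = READS_PER_LANE * READ_BP
    return math.ceil(genome_bp * coverage / bp_per_lane)

def run_cost_usd(genome_bp, coverage):
    """Sequencing cost, counting lanes only (library prep is ignored)."""
    return lanes_needed(genome_bp, coverage) * COST_PER_LANE_USD

print(lanes_needed(3_000_000_000, 10), run_cost_usd(3_000_000_000, 10))
print(lanes_needed(3_000_000_000, 30), run_cost_usd(3_000_000_000, 30))
```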
illumina pros
Pros:
Cheap per base pair IF you sequence millions of base pairs!
Accurate
Requires VERY LITTLE DNA.
illumina cons
Cons
Overkill if you need to sequence < millions of base pairs.
Expensive machine; typically need to outsource this.
Reads are very short (~100 bp).
PacBio
Sequences SINGLE MOLECULE of DNA in every well!
DNA modifications slow down the polymerase!
This technology has quickly become THE GOLD STANDARD for long-reads due to the ability to increase accuracy with circular sequencing.
PacBio specs cell
Each SMRT cell can produce ~20 Gbp of data.
PacBio specs read length
Read length and coverage variable.
In PRINCIPLE, can produce very long reads of 10-100 kbp
PacBio accuracy
Accuracy is variable, depends on length.
PacBio cost
~$3,000 per cell.
PacBio DNA needed
Uses ~20 ug of DNA
PacBio can specs
Can observe DNA modifications*
PacBio specs
Each SMRT cell can produce ~20 Gbp of data.
Read length and coverage variable.
In PRINCIPLE, can produce very long reads of 10-100 kbp
Accuracy is variable, depends on length.
~$3,000 per cell.
Uses ~20 ug of DNA
Can observe DNA modifications*
PacBio pros
Pros:
Cheap per base pair IF you sequence millions of base pairs!
Accurate
Long reads!
Can detect DNA modifications.
PacBio cons
Cons
Overkill if you need to sequence < millions of base pairs.
Expensive machine; typically need to outsource this.
Needs several hundred times more DNA than Illumina (~20 ug vs. ~50 ng).
PacBio pros and cons
Pros:
Cheap per base pair IF you sequence millions of base pairs!
Accurate
Long reads!
Can detect DNA modifications.
Cons
Overkill if you need to sequence < millions of base pairs.
Expensive machine; typically need to outsource this.
Needs several hundred times more DNA than Illumina (~20 ug vs. ~50 ng).
Nanopore
It uses a DNA Polymerase as a BRAKE. It does NOT rely on its enzymatic activity!
Also, uses hairpin to read both strands.
Nanopore specs data
Each flow cell can produce ~10 Gbp of data.
Nanopore specs read length
Read length is variable; it depends on your sample.
In PRINCIPLE, can produce very long reads of 10-100 kbp
Nanopore specs cost
~$500 per reagent kit.
~$500 per flow cell.
Nanopore specs DNA amt needed
Uses ~1 ug of DNA
Nanopore can specs
Can observe DNA modifications*
Nanopore
It uses a DNA Polymerase as a BRAKE. It does NOT rely on its enzymatic activity!
Also, uses hairpin to read both strands.
Specs
Each flow cell can produce ~10 Gbp of data.
Read length is variable; it depends on your sample.
In PRINCIPLE, can produce very long reads of 10-100 kbp
~$500 per reagent kit.
~$500 per flow cell.
Uses ~1 ug of DNA
Can observe DNA modifications*
Nanopore pros and cons
Pros:
Cheap per base pair IF you sequence millions of base pairs!
Long reads!
Requires a somewhat small amount of DNA.
Affordable, and can be easily performed in lab (or at home!).
Can detect DNA modifications.
Cons
Overkill if you need to sequence < millions of base pairs.
Accuracy not the best.
Nanopore pros
Pros:
Cheap per base pair IF you sequence millions of base pairs!
Long reads!
Requires a somewhat small amount of DNA.
Affordable, and can be easily performed in lab (or at home!).
Can detect DNA modifications.
Nanopore cons
Cons
Overkill if you need to sequence < millions of base pairs.
Accuracy not the best.
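To compare the four technologies on one scale, the spec cards can be turned into a rough cost per Gbp. This is a back-of-the-envelope sketch: it uses the low end of the Sanger price range and assumes one $500 reagent kit is consumed per $500 Nanopore flow cell, which the cards do not state.

```python
# name: (cost in USD per unit, yield in bp per unit, DNA input in ng)
platforms = {
    "Sanger (1 well)":        (1.0, 1e3, 20.0),            # low end of $1-5 per well
    "Illumina (1 lane)":      (5000.0, 3e8 * 100, 50.0),   # 3x10^8 reads x 100 bp
    "PacBio (1 SMRT cell)":   (3000.0, 20e9, 20_000.0),    # ~20 ug input
    "Nanopore (1 flow cell)": (1000.0, 10e9, 1_000.0),     # assumed: kit + flow cell
}

def cost_per_gbp(cost_usd, yield_bp):
    """Cost of producing one billion sequenced bases."""
    return cost_usd / yield_bp * 1e9

for name, (cost, yield_bp, dna_ng) in platforms.items():
    print(f"{name}: ${cost_per_gbp(cost, yield_bp):,.0f} per Gbp, {dna_ng:,.0f} ng DNA")
```

The per-base advantage of the high-throughput platforms only applies if the whole lane or cell is actually used, which is the "overkill" caveat in the cons lists.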
SNP Array Logic
Since we have 4 features / SNP, each chip can detect ~1 million SNPs, i.e. most/all human SNPs
These chips commonly include a few hundred disease markers as well.
This is the type of technology used by many genotyping services, like 23andMe.
THESE CHIPS ONLY DETECT KNOWN SNPS!! THEY ARE INCAPABLE OF DETECTING NOVEL SNPS!!
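The "known SNPs only" limitation can be pictured as a lookup against a fixed probe set. In this toy sketch the SNP names and genotypes are made up for illustration, and `read_chip` is a hypothetical helper, not a real genotyping API.

```python
def read_chip(sample_alleles, chip_snps):
    """Report genotypes only at positions the chip was designed to probe."""
    return {snp: sample_alleles[snp] for snp in chip_snps if snp in sample_alleles}

chip_snps = {"snp_A", "snp_B"}  # hypothetical probe set fixed at manufacture
sample = {"snp_A": "A/G", "snp_B": "C/C", "novel_variant": "T/G"}

calls = read_chip(sample, chip_snps)
print(calls)  # the novel variant is never reported because no probe exists for it
```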
De novo vs. comparative assembly
De novo assembly means you do everything from scratch
Comparative assembly means you have a “reference” genome. For example, you want to sequence your own genome, and you have Craig Venter’s genome already sequenced. Or you want to sequence a neanderthal genome and you have a human already sequenced.
Simplest scenario
Reads have no error
Reads are long enough that each appears exactly once in the genome
Each read given in the same orientation (all 5’ to 3’, for example)
de novo Genome Assembly Challenges
Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome
Issues:
Coverage
Errors in reads
Reads vary from very short (35bp) to quite long (10 kbp), and are double-stranded
Non-uniqueness of solution
Running time and memory
De Novo assembly
Much easier with long reads
Need very good coverage
Generally produces fragmented assemblies
Necessary when you don’t have a closely related (and correctly assembled) reference genome
Fragment Assembly
Cover region with ~10-fold redundancy
Overlap reads and extend to reconstruct the original genomic region
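A minimal sketch of overlap-and-extend, assuming error-free reads that are all in the same orientation (the "simplest scenario"). It uses a greedy strategy, merging the best-overlapping pair first, which is one simple choice and not necessarily what a production assembler does; `min_len` is an illustrative parameter.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: the remaining pieces stay separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]))
```

The nested loop compares every read with every other read, which is the quadratic cost that motivates k-mer based methods.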
Read Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Coverage C = n l / L
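The coverage formula, and its rearrangement to find how many reads a target coverage requires, can be written as a short sketch (function names are illustrative):

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Average coverage C = n * l / L."""
    return n_reads * read_len / genome_len

def reads_for_coverage(target_coverage, read_len, genome_len):
    """Rearranged: n = C * L / l, rounded up to whole reads."""
    return math.ceil(target_coverage * genome_len / read_len)

print(coverage(10, 100, 100))                          # 10 reads over one 100 bp region
print(reads_for_coverage(10, 100, 3_000_000_000))      # 100 bp reads, 3 Gbp genome
```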
How much coverage is enough?
Lander-Waterman model: Y = number of times a given location is read,
r = a particular value of that count, C = average coverage
P(Y=r) = (C^r * e^(-C)) / r!
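This is a Poisson distribution with mean C, so the fraction of the genome that no read covers is P(Y=0) = e^(-C). A short sketch that evaluates the formula, with function names chosen for illustration:

```python
import math

def prob_depth(r, c):
    """P(Y = r): probability that a location is read exactly r times at coverage c."""
    return c ** r * math.exp(-c) / math.factorial(r)

def fraction_uncovered(c):
    """P(Y = 0): expected fraction of the genome that no read covers."""
    return math.exp(-c)

for c in (1, 5, 10):
    missed_bp = fraction_uncovered(c) * 3e9  # expected uncovered bases in a 3 Gbp genome
    print(f"C = {c}x: {fraction_uncovered(c):.2e} uncovered, ~{missed_bp:,.0f} bp")
```

Even at 10x, a small fraction of a 3 Gbp genome is still expected to be missed, which is one reason assemblies come out fragmented.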
Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
k-mer: Used to align and assemble contiguous (contig) sequences using de Bruijn graph.
Finding Overlapping Reads (cont’d)
Correct errors using multiple alignment
Score alignments
Accept alignments with good scores
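Error correction from a multiple alignment can be sketched as a majority vote per column. This assumes the reads have already been aligned and padded to the same length, with '-' marking positions a read does not cover; the alignment step itself is not shown.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over pre-aligned reads ('-' means no coverage)."""
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        # Use 'N' where no read covers the column at all.
        result.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(result)

aligned = [
    "ACGTAC--",
    "-CGTTCGA",   # the T at the fifth position disagrees with the other two reads
    "--GTACGA",
]
print(consensus(aligned))
```

With enough coverage, an isolated sequencing error is outvoted by the correct reads at that position.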
Assembly Problem Solution: de Bruijn Graphs
Problem: Finding overlapping sequences within reads is computationally expensive. The number of pairwise alignments scales as the square of the number of reads (r), i.e. ~r^2.
Solution: Chop the large number of reads into smaller sequences (size k), and build a graph of the paths through these k-mers. The goal is to find the shortest path that passes through all the reads.
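A minimal sketch of the k-mer approach, assuming error-free reads in one orientation and a genome with no repeated (k-1)-mers. Here each distinct k-mer becomes an edge between its prefix and suffix (k-1)-mers, and the sequence is recovered by walking every edge once (an Eulerian path, found with Hierholzer's algorithm). That is one common formulation, used here for illustration.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it, one edge per distinct k-mer."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_deg = defaultdict(int)
    for nexts in graph.values():
        for node in nexts:
            in_deg[node] += 1
    # Start at a node with more outgoing than incoming edges, if there is one.
    start = next((n for n in graph if len(graph[n]) > in_deg[n]), next(iter(graph)))
    edges = {node: list(nexts) for node, nexts in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the path: first node, then one new base per step."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

print(assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"], k=5))
```

Building the graph takes one pass over the reads, so the quadratic all-against-all comparison is avoided.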
Challenges for Eukaryotic Genomes
Diploid Organisms have TWO haplotypes.
Many repetitive regions.
High heterozygosity in the population.
If you are sequencing a large organism, you can sidestep this issue by sequencing ONE individual, HOWEVER:
This individual may not be representative of the entire population.
If you are sequencing a small organism (like a spider):
To have enough DNA, you often need to sequence MANY individuals, therefore you INCREASE the number of haplotypes.
Challenges in Fragment Assembly
Repeats: A major problem for fragment assembly
> 50% of human genome is repeat regions:
- over 1 million Alu repeats (about 300 bp)
- about 200,000 LINE repeats (1000 bp and longer)
Low-Complexity DNA
(e.g. ATATATATACATA…)
Microsatellite repeats
(a1…ak)^N where k ~ 3-6
(e.g. CAGCAGTAGCAGCACCAG)
Transposons/retrotransposons
SINE Short Interspersed Nuclear Elements
(e.g., Alu: ~300 bp long, 10^6 copies)
LINE Long Interspersed Nuclear Elements
~500 - 5,000 bp long, 200,000 copies
LTR retroposons
Long Terminal Repeats (~700 bp) at each end
Gene Families
genes duplicate & then diverge
Segmental duplications
~very long, very similar copies
Repeat Types
Low-Complexity DNA (e.g. ATATATATACATA…)
Microsatellite repeats (a1…ak)^N where k ~ 3-6
(e.g. CAGCAGTAGCAGCACCAG)
Transposons/retrotransposons
SINE Short Interspersed Nuclear Elements
(e.g., Alu: ~300 bp long, 10^6 copies)
LINE Long Interspersed Nuclear Elements
~500 - 5,000 bp long, 200,000 copies
LTR retroposons Long Terminal Repeats (~700 bp) at each end
Gene Families genes duplicate & then diverge
Segmental duplications ~very long, very similar copies
Link Contigs into Scaffolds
Approach #1:
Generate highly accurate reads from Illumina short-read sequencing.
Assemble contigs based on these reads.
Align contigs to long, less accurate reads from Oxford Nanopore and/or PacBio.
Cytogenetics
Presumptive Karyotype:
8 pairs of autosomes
X1X20 Sex determination system
Females have a pair of both X1 and X2
Males have one copy of X1 and X2
Genome Assembly Results
If reads are from different chromosomes, they should co-precipitate RARELY.
HOWEVER, if reads are from THE SAME chromosome, they should co-precipitate FREQUENTLY.
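This co-precipitation logic can be sketched as grouping contigs that are frequently observed together. The contact counts and the `threshold` below are made-up illustration values, and connected-component grouping is just one simple way to apply the idea.

```python
def group_contigs(contact_counts, threshold):
    """Group contigs linked by contact counts at or above the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), count in contact_counts.items():
        root_a, root_b = find(a), find(b)
        if count >= threshold:
            parent[root_a] = root_b  # frequent co-precipitation: same chromosome

    groups = {}
    for contig in parent:
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical counts of how often reads from two contigs co-precipitate.
contacts = {("c1", "c2"): 90, ("c2", "c3"): 75, ("c1", "c4"): 3, ("c4", "c5"): 80}
print(group_contigs(contacts, threshold=50))
```

Each resulting group is a candidate chromosome, which can be compared against the number expected from the karyotype.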