Genetics sequencing technology

86 Terms

1
New cards

read

a single sequence generated from a technology. It can be based on a population of molecules, or a single molecule.

2
New cards

Read length

the number of base pairs observed per read.

3
New cards

Coverage

the number of times, on average, a region of DNA has been sequenced. If you have 10 reads for a given sequence, that is 10x coverage.

4
New cards

polymorphism

when more than one allele exists at the same locus in the population and the less common allele is not “rare” (by convention, present at a frequency of at least ~1%).

5
New cards

single nucleotide polymorphism (SNP)

a polymorphism in which a single base differs between alleles.

6
New cards

Haplotype

Combination of alleles along a single chromosome.

7
New cards

Gel Electrophoresis

separating DNA of different sizes

Gels are molecular sieves

smaller molecules move more easily (faster)

8
New cards

Gel Electrophoresis

  1. Current is applied - causes charged molecules to move through gel

  2. Since DNA is negatively charged (thanks to the phosphates), it runs toward the + end (anode)

  3. Molecules are sorted into “bands” by their size

  4. Agarose gels contain a dye that fluoresces in the presence of DNA (ethidium bromide, SYBR Green)

9
New cards

Amount of DNA needed for PCR

  • Typically, the minimum amount of template DNA needed is 0.1-10 ng per 50 uL reaction.

10
New cards

cheek swab

~3.5 ug (from 1 mL)

11
New cards

hair

~3 ug (from 4 hair follicles)

12
New cards

urine

~1.5 ug (from 5 mL)

13
New cards

Blood

~ 4 ug (from 50 uL, i.e. drop of blood)

14
New cards

PCR amplification is SIGNIFICANT

  • Typically, the minimum amount of template DNA needed is 0.1-10 ng per 50 uL reaction.

  • Sources of DNA and amounts (in extraction):

  • Cheek swab: ~3.5 ug (from 1 mL)

  • Hair: ~3 ug (from 4 hair follicles)

  • Urine: ~1.5 ug (from 5 mL)

  • Blood: ~ 4 ug (from 50 uL, i.e. drop of blood)

  • These samples provide >1000x more DNA than needed!!
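As a rough check on the ">1000x" figure, the sketch below divides each extraction yield on this card by the PCR template requirement. The function name `fold_excess` is made up for illustration, and the answer depends heavily on which end of the 0.1-10 ng range is used.

```python
# Yields (in ug) and the 0.1-10 ng requirement are copied from the card above.
YIELDS_UG = {"cheek swab": 3.5, "hair": 3.0, "urine": 1.5, "blood": 4.0}
NEEDED_NG_RANGE = (0.1, 10.0)

def fold_excess(yield_ug, needed_ng):
    """How many times more DNA a sample provides than one PCR needs."""
    return yield_ug * 1000.0 / needed_ng  # 1 ug = 1000 ng

for source, yield_ug in YIELDS_UG.items():
    low_need, high_need = NEEDED_NG_RANGE
    print(f"{source}: ~{fold_excess(yield_ug, high_need):,.0f}x (vs 10 ng) "
          f"to ~{fold_excess(yield_ug, low_need):,.0f}x (vs 0.1 ng)")
```

Against the 10 ng upper bound the excess is in the hundreds (150x to 400x); against 0.1 ng it is 15,000x to 40,000x. On these numbers, ">1000x" holds whenever the reaction needs less than about 1.5 ng.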

15
New cards

Sanger DNA Sequencing requires

Requires ddNTP

16
New cards


Old-school Sanger Sequencing 1

Four reactions, each containing primer, polymerase, dNTPs, and trace levels of a single ddNTP

Run each reaction on a polyacrylamide gel, which has 1 bp resolution

17
New cards


Old-school Sanger Sequencing 2

Detect radioactive DNA by exposing gel to X-ray film. 

Read the autoradiogram sequence from the bottom up: ~250 bases

18
New cards

Capillary electrophoresis

can separate DNA at 1bp resolution

19
New cards

DNA sequence can be read off

the chromatogram or electropherogram – typically 800-1000 bases: ddNTPs make long strand synthesis unlikely!

20
New cards

Sanger ddNTP chain termination DNA sequencing 1

  1. DNA fragments can be sequenced by the dideoxy chain termination (Sanger) method, the first automated method to be employed

21
New cards

Sanger ddNTP chain termination DNA sequencing 2

  1. Modified nucleotides called dideoxyribonucleotides (ddNTPs) are incorporated into growing DNA strands and terminate synthesis, producing strands of different lengths

22
New cards

Sanger ddNTP chain termination DNA sequencing 3

  1. Each type of ddNTP is tagged with a distinct fluorescent label that identifies the nucleotide at the end of each DNA fragment

23
New cards

Sanger ddNTP chain termination DNA sequencing 4

  1. The DNA sequence can be read from the resulting electropherogram

24
New cards

Sanger ddNTP chain termination DNA sequencing

  1. DNA fragments can be sequenced by the dideoxy chain termination (Sanger) method, the first automated method to be employed

  2. Modified nucleotides called dideoxyribonucleotides (ddNTPs) are incorporated into growing DNA strands and terminate synthesis, producing strands of different lengths

  3. Each type of ddNTP is tagged with a distinct fluorescent label that identifies the nucleotide at the end of each DNA fragment

  4. The DNA sequence can be read from the resulting electropherogram

25
New cards

Sanger specs # well plates

  1. Many of these sequencers use 384 well plates, with a single sample in each well.

26
New cards

Sanger specs: bases sequenced per well

  1. Each well can be sequenced up to ~1 kbp (10^3 bp)

27
New cards

human genome length

  1. The human genome is ~3 Gbp (3x10^9 bp)

28
New cards

Sanger spec DNA per well

  1. Need ~20ng/well

29
New cards

Sanger cost

Genewiz sequencing is ~$1-5/well, depending on volume

30
New cards

If I want to sequence a human genome, with 10x coverage, how many plates will I need to use? Sanger

  1. (1 plate/400 rxn) x (1 rxn/10^3 bp) x (3x10^9 bp) x 10 ~ 10^5 plates (rounding 384 wells to ~400 reactions per plate)
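The same estimate can be worked through in a few lines. This is a minimal sketch using the figures from the Sanger cards (384-well plates, ~1 kbp per well, ~3 Gbp genome, 10x coverage); the function name `sanger_plates` is hypothetical, and it assumes every well yields a full-length read.

```python
GENOME_BP = 3_000_000_000   # ~3 Gbp human genome
READ_BP = 1_000             # ~1 kbp per well
COVERAGE = 10               # target 10x coverage
WELLS_PER_PLATE = 384

def sanger_plates(genome_bp=GENOME_BP, read_bp=READ_BP,
                  coverage=COVERAGE, wells=WELLS_PER_PLATE):
    """Plates needed, assuming every well yields one full-length read."""
    reactions = -(-genome_bp * coverage // read_bp)  # ceiling division
    return -(-reactions // wells)

print(sanger_plates())            # exact 384-well count
print(sanger_plates(wells=400))   # the card's rounded 400 rxn/plate
```

With 384 wells this gives 78,125 plates (75,000 with the rounded 400), which is the ~10^5 order of magnitude on the card. That corresponds to 3x10^7 reactions, so at ~$1-5 per well the cost would be in the tens of millions of dollars.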

31
New cards

Sanger specs

  1. Many of these sequencers use 384 well plates, with a single sample in each well.

  2. Each well can be sequenced up to ~1 kbp (10^3 bp)

  3. The human genome is ~3 Gbp (3x10^9 bp)

  4. Genewiz sequencing is ~$1-5/well, depending on volume.

  5. Need ~20ng/well

  6. (1 plate/400 rxn) x (1 rxn/10^3 bp) x (3x10^9 bp) x 10 ~ 10^5 plates (rounding 384 wells to ~400 reactions per plate)

32
New cards

Sanger pros

  • Pros:

    • Cheap

    • Accurate

    • Good if you only need to sequence < ~100 kbp total. (This would be ~100 samples).

33
New cards

sanger cons

  • Cons

    • Gets expensive if you need to sequence >100 kbp.

    • Requires A LOT of DNA if you want to sequence >100 kbp.

    • Expensive machine; typically need to outsource this.

    • Read length is ~1000 bp.

34
New cards

illumina

Single molecules are amplified with PCR

Limitation: Sequencing length is ~200 bp, due to occasional incomplete cleavage of the dye and/or the azide blocker.

35
New cards

Illumina specs each lane

Each lane can sequence ~300 million (3x10^8) reads (dots).

36
New cards

illumina read length specs

Each read is ~100 bp

37
New cards

illumina cost per lane

$5,000 per lane.

38
New cards

illumina DNA needed

Uses ~50 ng of DNA

39
New cards

illumina specs

Specs

  • Each lane can sequence ~300 million (3x10^8) reads (dots).

  • Each read is ~100 bp

  • The human genome is ~3 Gbp (3x10^9 bp)

  • ~$5,000 per lane.

  • Uses ~50 ng of DNA
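To see what these specs mean in practice, the sketch below works out the coverage one lane gives on a human genome and how many lanes a target coverage would take. All figures come from the specs above; the function names are hypothetical.

```python
READS_PER_LANE = 300_000_000   # ~3x10^8 reads
READ_BP = 100                  # ~100 bp per read
GENOME_BP = 3_000_000_000      # ~3 Gbp human genome
COST_PER_LANE_USD = 5_000

def lane_coverage():
    """Average coverage of the genome from a single lane (C = n*l/L)."""
    return READS_PER_LANE * READ_BP / GENOME_BP

def lanes_for(target_coverage):
    """Whole lanes needed to reach the target coverage."""
    bp_needed = target_coverage * GENOME_BP
    bp_per_lane = READS_PER_LANE * READ_BP
    return -(-bp_needed // bp_per_lane)  # ceiling division

print(lane_coverage())                       # coverage from one lane
print(lanes_for(10) * COST_PER_LANE_USD)     # USD for 10x coverage
```

One lane yields 3x10^10 bp, i.e. about 10x coverage of a human genome for roughly $5,000.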

40
New cards

illumina pros

  • Pros:

    • Cheap per base pair IF you sequence millions of base pairs!

    • Accurate

    • Requires VERY LITTLE DNA.

41
New cards

illumina cons

  • Cons

    • Overkill if you need to sequence < millions of base pairs.

    • Expensive machine; typically need to outsource this.

    • Reads are very short (~100 bp).

42
New cards

PacBio

Sequences SINGLE MOLECULE of DNA in every well!

DNA modifications slow down the polymerase!

This technology has quickly become THE GOLD STANDARD for long-reads due to the ability to increase accuracy with circular sequencing.

43
New cards

PacBio specs cell

Each SMRT cell can produce ~20 Gbp of data.

44
New cards

PacBio specs read length

Read length and coverage variable.

In PRINCIPLE, can produce very long reads of 10-100 kbp

45
New cards

PacBio accuracy

Accuracy is variable, depends on length.

46
New cards

PacBio cost

~$3,000 per cell.

47
New cards

PacBio DNA needed

Uses ~20 ug of DNA

48
New cards

PacBio specs: DNA modifications

Can observe DNA modifications*

49
New cards

PacBio specs

  • Each SMRT cell can produce ~20 Gbp of data.

  • Read length and coverage variable.

  • In PRINCIPLE, can produce very long reads of 10-100 kbp

  • Accuracy is variable, depends on length.

  • ~$3,000 per cell.

  • Uses ~20 ug of DNA

  • Can observe DNA modifications*

50
New cards

PacBio pros

  • Pros:

    • Cheap per base pair IF you sequence millions of base pairs!

    • Accurate

    • Long reads!

    • Can detect DNA modifications.

51
New cards

PacBio cons

  • Cons

    • Overkill if you need to sequence < millions of base pairs.

    • Expensive machine; typically need to outsource this.

    • Needs hundreds of times more DNA than Illumina (~20 ug vs. ~50 ng, about 400x).

52
New cards

PacBio pros and cons

  • Pros:

    • Cheap per base pair IF you sequence millions of base pairs!

    • Accurate

    • Long reads!

    • Can detect DNA modifications.

  • Cons

    • Overkill if you need to sequence < millions of base pairs.

    • Expensive machine; typically need to outsource this.

    • Needs hundreds of times more DNA than Illumina (~20 ug vs. ~50 ng, about 400x).

53
New cards

Nanopore

It uses a DNA Polymerase as a BRAKE. It does NOT rely on its enzymatic activity!

Also, uses hairpin to read both strands.

54
New cards

Nanopore specs data

Each flow cell can produce ~10 Gbp of data.

55
New cards

Nanopore specs read length

Read length is variable; it depends on your sample.

In PRINCIPLE, can produce very long reads of 10-100 kbp

56
New cards

Nanopore specs cost

  • ~$500 per reagent kit.

  • ~$500 per flow cell.

57
New cards

Nanopore specs DNA amt needed

Uses ~1 ug of DNA

58
New cards

Nanopore specs: DNA modifications

Can observe DNA modifications*

59
New cards

Nanopore

It uses a DNA Polymerase as a BRAKE. It does NOT rely on its enzymatic activity!

Also, uses hairpin to read both strands.

Specs

  • Each flow cell can produce ~10 Gbp of data.

  • Read length is variable; it depends on your sample.

  • In PRINCIPLE, can produce very long reads of 10-100 kbp

  • ~$500 per reagent kit.

  • ~$500 per flow cell.

  • Uses ~1 ug of DNA

  • Can observe DNA modifications*
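The "cheap per base pair" claims can be compared directly using the cost and yield figures quoted across this set. This is a back-of-the-envelope sketch only: it assumes a Nanopore run consumes one reagent kit plus one flow cell, takes the low end ($1) of the Sanger price range, and ignores instrument, labor, and library prep costs. The function name `usd_per_gbp` is hypothetical.

```python
# (cost in USD, yield in bp) for one unit of sequencing on each platform.
PLATFORMS = {
    "Sanger (1 well, at $1)": (1, 1_000),
    "Illumina (1 lane)": (5_000, 300_000_000 * 100),
    "PacBio (1 SMRT cell)": (3_000, 20_000_000_000),
    "Nanopore (kit + flow cell)": (500 + 500, 10_000_000_000),
}

def usd_per_gbp(cost_usd, yield_bp):
    """Cost of 10^9 sequenced bases at the quoted price and yield."""
    return cost_usd * 1e9 / yield_bp

costs = {name: usd_per_gbp(*spec) for name, spec in PLATFORMS.items()}
for name, usd in costs.items():
    print(f"{name}: ~${usd:,.0f} per Gbp")
```

On these numbers the three high-throughput platforms land around $100-170 per Gbp, while Sanger is about $1,000,000 per Gbp, which is why Sanger only makes sense for small totals.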

60
New cards

Nanopore pros and cons

  • Pros:

    • Cheap per base pair IF you sequence millions of base pairs!

    • Long reads!

    • Requires a somewhat small amount of DNA.

    • Affordable, and can be easily performed in lab (or at home!).

    • Can detect DNA modifications.

  • Cons

    • Overkill if you need to sequence < millions of base pairs.

    • Accuracy not the best.

61
New cards

Nanopore pros

  • Pros:

    • Cheap per base pair IF you sequence millions of base pairs!

    • Long reads!

    • Requires a somewhat small amount of DNA.

    • Affordable, and can be easily performed in lab (or at home!).

    • Can detect DNA modifications.

62
New cards

Nanopore cons

  • Cons

    • Overkill if you need to sequence < millions of base pairs.

    • Accuracy not the best.

63
New cards

SNP Array Logic

Since each SNP needs 4 features, each chip can detect ~1 million SNPs, i.e. a large panel of known, common human SNPs (not every human SNP).

These chips commonly include a few hundred disease markers as well.

This is the type of technology used by many genotyping services, like 23andMe.

THESE CHIPS ONLY DETECT KNOWN SNPS!! THEY ARE INCAPABLE OF DETECTING NOVEL SNPS!!

64
New cards

De novo vs. comparative assembly

De novo assembly means you do everything from scratch

Comparative assembly means you have a “reference” genome. For example, you want to sequence your own genome, and you have Craig Venter’s genome already sequenced. Or you want to sequence a Neanderthal genome and you have a human genome already sequenced.

65
New cards

Simplest scenario

  • Reads have no error

  • Reads are long enough that each appears exactly once in the genome

  • Each read given in the same orientation (all 5’ to 3’, for example)

66
New cards

de novo Genome Assembly Challenges

  • Given many (millions or billions of) reads, produce a linear (or perhaps circular) genome

  • Issues: 

    1. Coverage

    2. Errors in reads

    3. Reads vary from very short (35bp) to quite long (10 kbp), and are double-stranded

    4. Non-uniqueness of solution

    5. Running time and memory

67
New cards

De Novo assembly

  1. Much easier with long reads

  2. Need very good coverage

  3. Generally produces fragmented assemblies

  4. Necessary when you don’t have a closely related (and correctly assembled) reference genome

68
New cards

Fragment Assembly

Cover region with ~10-fold redundancy

Overlap reads and extend to reconstruct the original genomic region
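One simple way to "overlap and extend" is a greedy merge: repeatedly join the two sequences with the longest suffix-prefix overlap. The sketch below is illustrative only and is not how production assemblers work; it assumes error-free reads, all in the same orientation, with no repeats. The function names and the `min_len` cutoff are made up for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):       # all ordered pairs: ~r^2 work
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to join; the assembly stays fragmented
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs

print(greedy_assemble(["GGCGTG", "ATGGCG", "CGTGCA"]))
```

The three toy reads overlap by 4 bases each and merge into the single contig `ATGGCGTGCA`. Every round compares all pairs of sequences, which is the quadratic cost that motivates k-mer based approaches.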

69
New cards

Read Coverage

Length of genomic segment:  L

Number of reads:                    n        

Length of each read:               l

Coverage C = n l / L
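The formula is easy to use in both directions: to get the coverage from a read count, or the read count needed for a target coverage. A minimal sketch follows; the function names and the example numbers (a 1 Mbp segment with 100 bp reads) are hypothetical, chosen only for illustration.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """C = n * l / L"""
    return n_reads * read_len / genome_len

def reads_needed(target_coverage, read_len, genome_len):
    """Rearranged: n = C * L / l, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_len / read_len)

# Hypothetical example: 1 Mbp segment, 100 bp reads.
print(coverage(100_000, 100, 1_000_000))     # 10.0
print(reads_needed(10, 100, 1_000_000))      # 100000
```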

70
New cards

How much coverage is enough?

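Lander-Waterman model: Y = number of reads covering a given location,

r = a specific number of times that location is read

P(Y=r) = (C^r * e^(-C)) / r!, i.e. Y is Poisson-distributed with mean C (the coverage)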
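Plugging numbers into this Poisson formula shows why ~10x is a common target. The sketch below evaluates P(Y=r) and the fraction of positions expected to get no reads at all, P(Y=0) = e^(-C). The function names are hypothetical, and the model assumes reads land uniformly at random.

```python
import math

def p_read_r_times(c, r):
    """P(Y = r) = C^r * e^(-C) / r! for coverage C."""
    return c ** r * math.exp(-c) / math.factorial(r)

def fraction_uncovered(c):
    """P(Y = 0): fraction of positions with no reads."""
    return math.exp(-c)

GENOME_BP = 3_000_000_000  # ~3 Gbp human genome
for c in (1, 5, 10):
    missed = fraction_uncovered(c) * GENOME_BP
    print(f"C = {c:>2}x: ~{missed:,.0f} bp expected with zero reads")
```

At 1x coverage about 37% of positions are never read; at 10x the fraction falls to about 0.005%, which on a 3 Gbp genome is still on the order of 100,000 bases.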

71
New cards

Finding Overlapping Reads

Create local multiple alignments from the overlapping reads

k-mer: a subsequence of length k. k-mers are used to align and assemble contiguous (contig) sequences using a de Bruijn graph.

72
New cards

Finding Overlapping Reads (cont’d)

  • Correct errors using multiple alignment

  • Score alignments

  • Accept alignments with good scores

73
New cards

Assembly Problem Solution: de Bruijn Graphs

  • Problem: Finding overlapping sequences within reads is computationally expensive. The number of alignments scales as the square of the number of reads (r), i.e. ~r^2.

  • Solution: Chop the large number of reads into smaller sequences of size k (k-mers), and build a graph of the paths through these k-mers. The goal is to find the shortest path that passes through all the reads.
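A toy version of this idea is sketched below. It uses one common formulation: each distinct k-mer is an edge between its (k-1)-mer prefix and suffix, and the sequence is recovered by walking an Eulerian path (every edge used once) with Hierholzer's algorithm. This is an assumption about the formulation, not necessarily the exact one the card has in mind, and it only works here because the toy reads are error-free, on one strand, and contain no repeated (k-1)-mers. All function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Edges are distinct k-mers; nodes are their (k-1)-mer prefix/suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start where out-degree exceeds in-degree by one, else anywhere.
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1),
                 next(iter(graph)))
    edges = {n: list(targets) for n, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: emit the node
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the path: first node, then last bases."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

print(assemble(["ATGGCG", "GGCGTG", "CGTGCA"], k=4))
```

Building the graph takes one pass over the reads rather than an all-against-all comparison, which is the point of the approach. With real data, repeats create branches in the graph and errors create spurious k-mers, so the path is rarely this clean.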

74
New cards

Challenges for Eukaryotic Genomes

  • Diploid Organisms have TWO haplotypes.

  • Many repetitive regions.

  • High heterozygosity in the population.

  • If you are sequencing a large organism, you can sidestep this issue by sequencing ONE individual, HOWEVER:

    • This individual may not be representative of the entire population.

  • If you are sequencing a small organism (like a spider):

    • To have enough DNA, you often need to sequence MANY individuals, therefore you INCREASE the number of haplotypes.

75
New cards

Challenges in Fragment Assembly

  • Repeats:  A major problem for fragment assembly

  • > 50% of human genome is repeat regions:

- over 1 million Alu repeats (about 300 bp)

- about 200,000 LINE repeats (1000 bp and longer)

76
New cards

Low-Complexity DNA

(e.g. ATATATATACATA…)

77
New cards

Microsatellite repeats

(a_1…a_k)^N, i.e. a k-base unit repeated N times, where k ~ 3-6

(e.g. CAGCAGTAGCAGCACCAG)
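The unit-repeated-N-times pattern can be searched for with a back-referencing regular expression. This is a minimal sketch that only finds perfect tandem copies; the function name and the default thresholds are assumptions for illustration, and real microsatellites often contain imperfect copies that exact matching misses.

```python
import re

def find_microsatellites(seq, min_unit=3, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for each perfect tandem repeat."""
    # A unit of min_unit..max_unit bases, then at least min_copies-1 more copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

print(find_microsatellites("TTCAGCAGCAGCAGTT"))       # one CAG x4 repeat
print(find_microsatellites("CAGCAGTAGCAGCACCAG"))     # the card's example
```

The first call reports a CAG unit repeated 4 times starting at position 2. The card's own example returns nothing at these settings, because its CAG copies are interrupted by TAG and CAC.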

78
New cards

Transposons/retrotransposons

  • SINE Short Interspersed Nuclear Elements

(e.g., Alu: ~300 bp long, 10^6 copies)

  • LINE Long Interspersed Nuclear Elements

~500 - 5,000 bp long, 200,000 copies

  • LTR retroposons

Long Terminal Repeats (~700 bp) at each end

79
New cards

LTR retroposons

Long Terminal Repeats (~700 bp) at each end

80
New cards

Gene Families

genes duplicate & then diverge

81
New cards

Segmental duplications

~very long, very similar copies

82
New cards

Repeat Types

  • Low-Complexity DNA (e.g. ATATATATACATA…)

  • Microsatellite repeats: (a_1…a_k)^N, i.e. a k-base unit repeated N times, where k ~ 3-6

(e.g. CAGCAGTAGCAGCACCAG)

  • Transposons/retrotransposons

    • SINE Short Interspersed Nuclear Elements

(e.g., Alu: ~300 bp long, 10^6 copies)

    • LINE Long Interspersed Nuclear Elements

~500 - 5,000 bp long, 200,000 copies

    • LTR retroposons: Long Terminal Repeats (~700 bp) at each end

  • Gene Families: genes duplicate & then diverge

  • Segmental duplications: ~very long, very similar copies

83
New cards

Link Contigs into Scaffolds

Approach #1:

  1. Generate highly accurate reads from Illumina short-read sequencing.

  2. Assemble contigs based on these reads.

  3. Align contigs to long, less accurate reads from Oxford Nanopore and/or PacBio; long reads that span two contigs link them into a scaffold.

84
New cards

Cytogenetics

  • Presumptive Karyotype:

    • 8 pairs of autosomes

    • X1X20 Sex determination system

    • Females have a pair of each of X1 and X2

    • Males have one copy each of X1 and X2

  • Number of observations and individuals

    • Female

    • Male

85
New cards

Genome Assembly Results

If reads are from different chromosomes, they should co-precipitate RARELY.

HOWEVER, if reads are from THE SAME chromosome, they should co-precipitate FREQUENTLY.

86
New cards