MICR5846 L2: DNA Sequencing, Alignment and Assembly 7/26/25

102 Terms

New cards

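1. List the different types of sequencing technologies and describe their strengths and weaknesses.

1. Sanger sequencing: Very accurate, but only confirms one gene or plasmid at a time (low capacity) and has a high cost

2. Oxford Nanopore: Very long reads (up to 1 Mb) that can close genomes with repetitive elements, but not as accurate as Illumina

3. SMRT sequencing: Long reads with high fidelity, but high cost

4. Illumina: Low cost and high capacity, suited to draft genomes and metagenomes, but short read length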

New cards

2. Describe the chemistry of next-generation sequencing.

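Next-generation sequencing (NGS) relies primarily on sequencing by synthesis (SBS) using reversible-terminator chemistry, where engineered DNA polymerases add fluorescently labeled, reversibly blocked deoxynucleotide triphosphates (dNTPs) one at a time to a strand growing along the template DNA. After each base is imaged, the label and terminator are cleaved so the next base can be added.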

New cards

3. Describe the steps in analysing data from a sequencer.

1) Basecalling

2) Formatting

3) Trimming

4) Assembly

New cards

4. Describe the key attributes that are labelled using an annotation program.

1) Open reading frames

2) Gene locations

3) Strand orientation

4) Gene name

5) Protein coding

6) Promoter regions

7) Start and stop codons

New cards

5. How are genes that are paralogues and orthologues formed in nature?

Gene duplication: Produces paralogs (Gene 1a and Gene 1b)

Speciation: Produces orthologs (Species 1 Gene 1a, Species 2 Gene 1a)

New cards

6. How do we measure homology using a pairwise algorithm?

-Needleman-Wunsch and other alignment algorithms

-Predict which residues are identical or mutated, and where insertions or deletions (indels) have occurred

New cards

7. Explain the principles of measuring homology of DNA.

-% Identity: Percentage of nucleotides matching in an alignment to measure the homology

-This can be interpreted as evolutionary distance between the sequences.
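As a rough illustration of % identity, the sketch below counts matching columns in two already-aligned sequences. It assumes gaps are written as "-" and divides by the full alignment length, which is only one of several conventions in use; the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences have the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the residues agree and neither is a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Three of the four columns match, so identity is 75%.
print(percent_identity("ACGT", "ACGA"))
```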

New cards

8. Explain the principles of measuring homology of proteins, with a focus on how the matrices operate

-Use scoring matrix (BLOSUM) to generate % Similarity score for two amino acid sequences

-Greater scores to amino acids with similar properties, and penalties to gaps/differences.

-More useful than % identity
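A minimal sketch of how a scoring matrix turns an aligned pair of protein sequences into a score and a % similarity. The small table below is an assumed excerpt in the style of BLOSUM62 covering only four amino acids, and the flat gap penalty is a hypothetical choice; a real analysis would load the full matrix.

```python
# Assumed excerpt of BLOSUM62-style scores (symmetric); illustrative only.
SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2,
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("L", "D"): -4, ("I", "D"): -3, ("L", "E"): -3, ("I", "E"): -3,
}
GAP = -4  # hypothetical flat penalty for any gap column


def pair_score(x: str, y: str) -> int:
    """Score one alignment column: gap penalty or matrix lookup."""
    if x == "-" or y == "-":
        return GAP
    key = (x, y) if (x, y) in SCORES else (y, x)
    return SCORES[key]


def alignment_score(a: str, b: str) -> int:
    """Sum of column scores over two aligned sequences of equal length."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


def percent_similarity(a: str, b: str) -> float:
    """Percent of columns whose matrix score is positive (similar residues)."""
    positive = sum(
        1 for x, y in zip(a, b) if "-" not in (x, y) and pair_score(x, y) > 0
    )
    return 100.0 * positive / len(a)


# "LD" vs "IE": no identical residues, yet every column pairs similar residues.
print(alignment_score("LD", "IE"), percent_similarity("LD", "IE"))
```

The example pair has 0% identity but 100% similarity under these assumed scores, which is the sense in which similarity can be more informative than identity for proteins.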

New cards

Name four types of DNA sequencing technologies

1) Sanger

2) Illumina (short-read, next gen)

3) Oxford Nanopore (long-read)

4) SMRT (Single molecule real-time)

New cards

What type of DNA sequencing is this?

1) PCR with fluorescent, chain-terminating ddNTPs

2) Size separation by capillary gel electrophoresis

3) Laser excitation & detection by sequencing machine, produces chromatogram

Sanger Sequencing

New cards

What type of DNA sequencing is this?

-Sanger Sequencing

1) PCR with fluorescent, chain-terminating ddNTPs

2) Size separation by capillary gel electrophoresis

3) Laser excitation & detection by sequencing machine, produces chromatogram

New cards

What happens after ddNTPs are incorporated into DNA during the first step of Sanger Sequencing?

-Extension of that strand stops (chain termination)

-Fragments of many different lengths are produced, some only 1 base long

New cards

What type of DNA sequencing is this?

1) DNA is extracted/fragmented through mechanical breakage or with endonucleases.

2) DNA adapters are ligated to either end of each fragment

3) DNA is washed and added to the flow cell, where it binds to a complementary DNA terminus

4) DNA folds into bridge shape, adapter binds a neighboring terminus

5) Primers and polymerase are added, creates copy strand

6) DNA is denatured and strands separate.

7) The process is repeated, creates copies on the flow cell.

8) Fluorescent nucleotides/polymerase are added to extend the sequence

9) Flow cell is imaged, color indicates which base was added

10) Fluorescent tag/terminator are cleaved, next base added

11) This process continues until the whole strand is sequenced

Illumina Sequencing

New cards

What type of DNA sequencing is this?

-Illumina

1) DNA is extracted/fragmented through mechanical breakage or with endonucleases.

2) DNA adapters are ligated to either end of each fragment

3) DNA is washed and added to the flow cell, where it binds to a complementary DNA terminus

4) DNA folds into bridge shape, adapter binds a neighboring terminus

5) Primers and polymerase are added, creates copy strand

6) DNA is denatured and strands separate.

7) The process is repeated, creates copies on the flow cell.

8) Fluorescent nucleotides/polymerase are added to extend the sequence

9) Flow cell is imaged, color indicates which base was added

10) Fluorescent tag/terminator are cleaved, next base added

11) This process continues until the whole strand is sequenced

New cards

What is the first step of Illumina sequencing?

-DNA to be sequenced is extracted and fragmented

-Mechanical breakage or digestion with endonucleases

New cards

What occurs after this step in Illumina sequencing?

1) DNA to be sequenced is extracted and fragmented through mechanical breakage or digestion with endonucleases.

Specific DNA adapters are ligated to either end of each fragment

New cards

What are some special abilities of the sequences found in the DNA adapters used for Illumina sequencing?

-Allow the DNA to bind the flow cell

-Complementary to the sequencing primers

-Nucleotide barcodes allow us to distinguish between different samples

New cards

What occurs after this step in Illumina sequencing?

2) Specific DNA adapters are ligated to either end of each fragment.

-DNA is washed and added to the flow cell

-DNA binds to a complementary DNA terminus on the surface of the cell

New cards

What occurs after this step in Illumina sequencing?

3) DNA is washed and added to the flow cell

-DNA binds to a complementary DNA terminus on the surface of the cell

-DNA folds over to form a bridge shape

-The adapter binds a neighboring terminus.

New cards

What occurs after this step in Illumina sequencing?

4) DNA folds over to form a bridge shape and the adapter binds a neighboring terminus.

-Primers and polymerase are added

-Creates a copy (reverse strand) of the original template (forward strand).

New cards

True or False: The new strand that is copied during Illumina sequencing is identical to the forward strand

False, it is the reverse (complementary) strand of the original template (forward strand)
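To make the forward/reverse relationship concrete, here is a small sketch that computes the reverse complement of a DNA string. It assumes an uppercase sequence containing only A, C, G and T; the function name is hypothetical.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


forward = "ATGC"
reverse = reverse_complement(forward)
print(forward, reverse)
```

Applying the function twice returns the original forward strand, which is a quick sanity check.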

New cards

What occurs after this step in Illumina sequencing?

5) Primers and polymerase are added which create a copy (reverse strand) of the original template (forward strand).

DNA is denatured and strands separate

New cards

What occurs after this step in Illumina sequencing?

6) DNA is denatured and strands separate

The process is repeated to create a dense cluster of copies on the flow cell

New cards

True or False: Each channel of a flow cell contains millions of clusters (copies of DNA on the flow cell)

True

New cards

What occurs after this step in Illumina sequencing?

7) The process is repeated to create a dense cluster of copies on the flow cell

-Sequencing begins.

-Fluorescent nucleotides and polymerase are added

-Extend the sequence by one base

New cards

What occurs after this step in Illumina sequencing?

8) Fluorescent nucleotides and polymerase are added to extend the sequence by one base

-The flow cell is then imaged.

-The color of fluorescence indicates which base was added.

New cards

What occurs after this step in Illumina sequencing?

9) The flow cell is then imaged. The colour of fluorescence indicates which base was added.

-Fluorescent tag and terminator are cleaved

-Next base can be added

New cards

What occurs after this step in Illumina sequencing?

10) The fluorescent tag and terminator are now cleaved, allowing the addition of the next base

The process continues until the whole strand is sequenced

New cards

Why is Illumina next-gen sequencing so convenient?

Millions of clusters can be simultaneously sequenced in parallel

New cards

How do you differentiate Illumina sequences from different samples?

-Nucleotide barcodes carried in the adapter sequences

-Allows multiplexing of multiple samples in a single flow cell
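A simplified sketch of sorting multiplexed reads back into samples by barcode. It assumes, purely for illustration, that each read begins with its sample barcode and that barcodes must match exactly; the sample names and barcode sequences are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample barcode and strip the barcode from each read."""
    by_sample = defaultdict(list)
    for read in reads:
        for sample, barcode in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched: keep the read aside rather than guess.
            by_sample["unassigned"].append(read)
    return dict(by_sample)


barcodes = {"sampleA": "ACGT", "sampleB": "TTAG"}  # hypothetical barcodes
reads = ["ACGTGGGA", "TTAGCCCA", "GGGGAAAA"]
print(demultiplex(reads, barcodes))
```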

New cards

When the Illumina sequencer interprets the fluorescent signals, it performs "basecalling." What is this?

-Assigns nucleotide to each signal

-Data is converted to sequence data

New cards

What is this?

-Assigns nucleotide to each signal

-Illumina data is converted to sequence data

Basecalling

New cards

What type of DNA sequencing is this?

1) DNA molecules bind a motor protein at a flow cell pore, which feeds a single strand through the pore

2) As the strand passes through the pore, nucleotides are identified by changes in the electrical signal

3) Base-calling algorithms used to convert electrical data into DNA sequence

Oxford Nanopore Sequencing

New cards

What type of DNA sequencing is this?

-Oxford Nanopore Sequencing

1) DNA molecules bind a motor protein at a flow cell pore, which feeds a single strand through the pore

2) As the strand passes through the pore, nucleotides are identified by changes in the electrical signal

3) Base-calling algorithms used to convert electrical data into DNA sequence

New cards

What sets the Oxford Nanopore flow cell apart from other flow cells?

-Flow cell consists of pores

-Each pore sits in a membrane (phospholipid bilayer) and is paired with a motor protein

New cards

What type of DNA sequencing is this?

1) Double stranded DNA is used as the input.

2) 'SMRTbell' adapters are added to each end which join both complementary strands together.

3) The now circular ssDNA binds to a ZMW pore containing a polymerase.

4) Fluorescent nucleotides are incorporated and cleaved continuously, producing a bright flash as each base is incorporated.

5) Sequencing occurs continuously around the circle, providing coverage of both strands many times over.

SMRT Sequencing (Single-Molecule Real-Time)

New cards

What type of DNA sequencing is this?

-SMRT Sequencing (Single-Molecule Real-Time)

1) Double stranded DNA is used as the input.

2) Adapters are added to each end, join both complementary strands together.

3) Circular ssDNA binds to a ZMW pore containing a polymerase.

4) Fluorescent nucleotides are incorporated and produce a bright flash

5) Sequencing occurs continuously around the circle, providing coverage of both strands many times over.

New cards

What adapters are used during SMRT sequencing to join complementary strands together?

SMRTbell

New cards

What happens to the ssDNA and the ZMW pores during SMRT Sequencing?

-The circular ssDNA will loop around and bind to both sides of the DNA duplex

-Reads the same DNA 50-100x

New cards

True or False: SMRT fluorescent data is output in real time when nucleotides are being cleaved and incorporated

True

New cards

What do both SMRT and Illumina sequencing have in common?

Both use a detector for fluorescent probes

New cards

True or False: SMRT has low fidelity, making it somewhat inaccurate

False, it has high fidelity because it reads the same DNA 50-100x

New cards

What type of DNA sequencing is this?

Problem: Single genes/plasmids

Input: PCR amplified DNA

Read Length: 500-1000 bp

Capacity: Low

Cost: High

Accuracy: 99.9%

Sanger Sequencing

New cards

What is a downside of Sanger Sequencing?

-Only confirms single gene

-Low throughput and expensive

-Still very accurate though

New cards

What type of DNA sequencing is this?

Problem: Whole draft genomes and 16S metagenomes

Input: Fragmented, amplified DNA

Read Length: 150-600 bp

Capacity: High

Cost: Low

Accuracy: 99.9%

Illumina Sequencing

New cards

What sequencing is best suited for whole genomes and metagenomes?

Illumina Sequencing

New cards

What type of DNA sequencing is this?

Problem: Closed genomes, genomes with repetitive elements

Input: Amplified or Native DNA

Read Length: 15-100 kb

Capacity: Medium

Cost: High

Accuracy: >99%

PacBio Sequencing

New cards

What type of DNA sequencing is this?

Problem: Closed genomes, genomes with repetitive elements

Input: Amplified or Native DNA

Read Length: Up to 1Mb

Capacity: Medium

Cost: High

Accuracy: >99%

Oxford Nanopore Sequencing

New cards

What type of sequencing would you use for closed genomes or human/plant genomes with repetitive elements?

-PacBio (Outdated)

-Oxford Nanopore

New cards

True or False: Illumina is more accurate than Oxford Nanopore

True

New cards

True or False: Oxford Nanopore is more accurate than Illumina Sequencing

False

New cards

What is one advantage of using Oxford Nanopore over Illumina despite being less accurate?

-Much longer reads (up to 1 Mb), which help close genomes and span repetitive elements

-Can process amplified (PCR) DNA or native DNA from direct extraction

New cards

What type of data does modern sequencing technology output?

-Fluorescence signals (Illumina, SMRT)

-Electrical signals (Oxford Nanopore)

New cards

What is this?

-Q score/Phred score

-Basecaller outputs Quality Score

-Indicates the confidence of each base call
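The Phred scale relates a quality score Q to the probability P that a base call is wrong, via Q = -10 * log10(P). The sketch below converts in both directions and decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality score corresponding to error probability p."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into integer Q scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


# Q30 corresponds to a 1 in 1000 chance that the base is wrong.
print(phred_to_error_prob(30))
print(decode_quality("I!"))
```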

New cards

What do raw data/raw reads from basecallers represent?

Output from sequencer for each sequenced fragment of DNA

New cards

How can you tell the difference between raw reads and data that has been assigned a Q score by the basecaller?

FASTA: Raw read (header + sequence)

FASTQ: Includes quality scores
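For comparison with FASTA, a FASTQ record has four lines: an "@" header, the sequence, a "+" separator and a quality string of the same length as the sequence. The parser below is a minimal sketch that assumes strict four-line records with no line wrapping; the function name is hypothetical.

```python
def parse_fastq(text: str) -> list:
    """Return (name, sequence, quality) tuples from four-line FASTQ text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain whole four-line records")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], seq, qual))
    return records


example = "@read1\nACGT\n+\nIIII\n"
print(parse_fastq(example))
```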

New cards

What happens when data is trimmed?

-Adapters/barcodes are removed from reads

-Low-quality information at ends is removed
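The two trimming steps can be sketched as follows: cut the read at the first adapter occurrence, then remove low-quality bases from both ends. The adapter sequence, the Q20 cutoff and the function name are assumptions for illustration; real trimmers use more tolerant adapter matching and sliding windows.

```python
def trim_read(seq, quals, adapter, min_q=20):
    """Remove the adapter (and anything after it), then trim low-quality ends."""
    cut = seq.find(adapter)
    if cut != -1:
        seq, quals = seq[:cut], quals[:cut]
    # Walk in from the 3' end while bases are below the cutoff.
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    # Walk in from the 5' end while bases are below the cutoff.
    start = 0
    while start < end and quals[start] < min_q:
        start += 1
    return seq[start:end], quals[start:end]


seq = "ACGTACGT" + "AGATCGG"  # read followed by a hypothetical adapter
quals = [10, 30, 30, 30, 30, 30, 30, 12] + [30] * 7
print(trim_read(seq, quals, adapter="AGATCGG"))
```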

New cards

True or False: The cutoff Q scores for trimming are consistent across different sequencing technologies and algorithms because they will be the same

False, they differ

New cards

What happens to data during assembly?

-Algorithm looks for overlapping k-mers

-Creates long sequence with them

-Assigns confidence threshold

-Contiguous regions are combined

New cards

What is this?

-Coverage

-Number of times a given nucleotide has been sequenced

-Higher coverage = higher confidence rate, more likely to be assembled correctly
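Coverage can be estimated two ways: as a genome-wide mean (total bases sequenced divided by genome size) or per position from where reads map. The sketch below shows both, assuming reads are given as 0-based, end-exclusive (start, end) spans; the function names are hypothetical.

```python
def mean_coverage(read_lengths, genome_size):
    """Average number of times each base is expected to be sequenced."""
    return sum(read_lengths) / genome_size


def per_base_depth(read_spans, genome_size):
    """Count how many mapped reads cover each genome position."""
    depth = [0] * genome_size
    for start, end in read_spans:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


# 1000 reads of 150 bp over a 5000 bp genome gives 30x mean coverage.
print(mean_coverage([150] * 1000, 5000))
# Positions with depth 0 are where assembly gaps would be expected.
print(per_base_depth([(0, 4), (2, 6)], 8))
```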

New cards

Why do gaps form during assembly?

-Low coverage

-Repetitive/duplicate regions of DNA that are difficult for algorithm to differentiate

New cards

True or False: A high quality assembly will contain more contigs, contiguous regions of sequence data

False

New cards

True or False: A reference-based assembly is more likely to have more contigs than a de-novo assembly

False, it will have fewer

New cards

What are the advantages of mapping reads back to a reference genome?

-Known genome structure

-More complete assemblies

New cards

What are some disadvantages of mapping reads back to a reference genome?

-Biased towards reference genome structure

-Miss large indels and genome rearrangements

-Need to be careful choosing references

New cards

What are the advantages of de-novo assembly (no reference genomes)?

-Unbiased by a reference genome

-Useful for new sequences

New cards

What are some disadvantages of de-novo assembly (no reference genomes)?

-Impacted by repetitive elements

-Fragmentary, more contigs due to less complete data

New cards

What are some ways to avoid fragmentary de-novo assembly data?

-Use long-read sequence data

-Reads can cross through difficult regions, connecting the contig

New cards

What are some important metrics for quality control?

Yield: total number of bases generated.

Error rate: compare reads that cover the same region.

%Q30: the percentage of bases with a Q score of 30 or higher.

Longest contig: ideally, one very long contig should cover most of your genome.

Number of contigs: lower indicates more contiguous sequence data.

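N50: the contig length at which contigs of that length or longer contain half of the assembly (a length-weighted median, not the plain median contig length).

L50: The smallest number of contigs making up half the assembly size.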
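N50 and L50 can be computed together by sorting contigs from longest to shortest and accumulating lengths until half the assembly is reached: the contig that crosses the halfway point gives N50, and the number of contigs used gives L50. This is a minimal sketch with a hypothetical function name and made-up contig lengths.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            # This contig pushes the running total past half the assembly.
            return length, count
    raise ValueError("no contigs given")


# Total is 290, so half is 145; the two longest contigs (80 + 70) reach it.
print(n50_l50([80, 70, 50, 40, 30, 20]))
```

For this example N50 is 70 and L50 is 2, whereas the plain median contig length would be 45, so the two measures should not be confused.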

New cards

What do these metrics mean?

1) Yield

2) Longest Contig

3) Q30

1) Total number of bases generated

2) Ideally, one very long contig should cover most of your genome.

3) Percentage of bases with a Q score of 30 or higher.

New cards

What do these metrics mean?

1) Number of Contigs

2) N50

3) L50

1) Lower indicates more contiguous sequence data

2) The contig length at which contigs of that length or longer contain half of the assembly (a length-weighted median).

3) The smallest number of contigs making up half (50%) the assembly size.

New cards

How do you calculate Error Rate?

Calculated by comparing reads which cover the same region

New cards

How do you annotate sequence data?

-Search database for known sequence with regions that are homologous to data

-Compare homology via alignment (base by base)

New cards

What is this?

-Homologs

Two sequences that are predicted to have a common ancestor

New cards

True or False: Homologs can be DNA or amino acid

True

New cards

What are 2 ways that homologs can arise?

-Gene duplication

-Speciation

New cards

What does gene duplication produce?

Paralogs

New cards

What does speciation produce?

Orthologs

New cards

Gene 1 in the ancestral species undergoes a duplication event generating Gene 1a and Gene 1b.

The ancestral species splits into two species, each with its own copy of Gene 1a and Gene 1b.

What is the relationship between Gene 1a in species one and Gene 1a in species two (DIFFERENT SPECIES)?

Orthologs

New cards

Gene 1 in the ancestral species undergoes a duplication event generating Gene 1a and Gene 1b.

The ancestral species splits into two species, each with its own copy of Gene 1a and Gene 1b.

What is the relationship between Gene 1a and Gene 1b?

Paralogs

New cards

Gene 1 in the ancestral species undergoes a duplication event generating Gene 1a and Gene 1b.

The ancestral species splits into two species, each with its own copy of Gene 1a and Gene 1b.

What is the relationship between species one and species two Genes 1a and Gene 1b (all four genes)?

Homologs

New cards

How does pairwise alignment estimate the level of homology between 2 sequences?

-Needleman-Wunsch algorithm

-Consider context of all residues in both sequences
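A compact sketch of Needleman-Wunsch global alignment is given here. The match, mismatch and gap values are assumed for illustration, and a single linear gap penalty is used instead of separate gap-open and gap-extend penalties to keep the example short.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Traceback from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "ACT"))
```

Because every residue of both sequences must appear in the output, the algorithm considers the context of all residues, and the gap penalty decides when an indel is preferred over a mismatch.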

New cards

What assumptions does the model of homology make?

-Identical residue pairs are aligned

-Different amino acids with similar physicochemical properties will form a pair

-Indels/gaps are rarer than mismatches

New cards

Why is a penalty applied by the algorithm for opening a gap?

-Gaps represent indels in an evolutionary context

-It is assumed that these are rarer than mismatches

New cards

True or False: When measuring homology, the pair with the highest total score is the most likely relationship

True

New cards

What is this?

-% Identity

-Percentage of nucleotides matching in an alignment to measure the homology

-Evolutionary distance between the sequences

New cards

True or False: Measuring homology in nucleotides is easier than in amino acids because the base either matches or it doesn't

True

New cards

True or False: When measuring amino acid homology, similar amino acid matches are ranked higher than dissimilar matches

True

New cards

What is this?

-BLOSUM

-Scoring matrix used for amino acid homology

-High scores = similar properties

-Low scores, penalty = gaps, different properties

New cards

True or False: Percentage identity is less useful than a percentage similarity score for comparing amino acid sequences

True

New cards

What does it mean to say that pairwise computations are very compute heavy?

-Pairwise comparison of two 100 bp fragments would have about 10^59 possible alignments

-That is vastly more than the number of stars in the Milky Way, so no computer can score every possibility one by one.
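One common way to arrive at a figure of this size is to count the ways two sequences of length n can be interleaved, which is the binomial coefficient C(2n, n). The sketch below computes it for n = 100 and contrasts it with the number of cells a dynamic-programming table would need; treating C(2n, n) as the count of alignments is an approximation assumed here.

```python
import math


def count_alignments(n: int) -> int:
    """Approximate number of global alignments of two length-n sequences."""
    return math.comb(2 * n, n)


def dp_table_cells(n: int) -> int:
    """Cells filled by a dynamic-programming alignment of two length-n sequences."""
    return (n + 1) ** 2


n = 100
print(f"brute force: about {count_alignments(n):.2e} alignments")
print(f"dynamic programming: {dp_table_cells(n)} table cells")
```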

New cards

How do we overcome the heaviness of pairwise computations?

-Dynamic programming

-Builds the highest-scoring alignment from the best scores of smaller sub-alignments, instead of scoring every possibility

-Can be run as a local or global comparison, saving memory and time.

New cards

What algorithm would you use when the strings are of equal length?

Needleman and Wunsch (global alignment)

New cards

What algorithm would you use when comparing partial sequences with contigs or sequences of unknown length?

Smith and Waterman (local alignment)
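A minimal sketch of Smith-Waterman scoring follows. It differs from a global alignment in that cell scores are never allowed to drop below zero and the answer is the best cell anywhere in the table, so a short matching region inside longer sequences is rewarded. The scoring values are assumed, and only the score (not the traceback) is returned to keep the example short.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets a new local alignment start at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The partial sequence ACGT is found inside the longer sequence.
print(smith_waterman_score("ACGT", "TTACGTTT"))
```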

New cards

True or False: Only Smith and Waterman introduces gap penalties

False, both Smith and Waterman and Needleman and Wunsch do it

New cards

What is this?

-BLAST (Basic Local Alignment Search Tool)

-Heuristic algorithm that chops the query into short words (k-mers, e.g. 3 letters for protein searches), finds matching words in the database, and extends those matches to find the best-scoring local alignments

-Searches against the entire collection of sequences on NCBI and GenBank for matches.
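The k-mer seeding idea can be sketched as below: index every 3-letter word in the database sequences, then look up each word of the query to find seed positions worth extending. This is a simplified illustration with exact word matches only and hypothetical sequence names; it is not how the BLAST program itself is implemented.

```python
from collections import defaultdict


def kmers(seq, k=3):
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database, k=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(kmers(seq, k)):
            index[word].append((name, pos))
    return index


def seed_hits(query, index, k=3):
    """Return (query position, sequence name, database position) seed matches."""
    hits = []
    for qpos, word in enumerate(kmers(query, k)):
        for name, dpos in index.get(word, []):
            hits.append((qpos, name, dpos))
    return hits


database = {"protein1": "AAMKVLTT"}  # hypothetical database entry
index = build_index(database)
print(seed_hits("MKVLA", index))
```

Seeds that fall on the same diagonal (equal difference between query and database position) are the natural candidates to extend into a longer local alignment.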

New cards

What are the following BLAST algorithms used for?

-BLASTn

-BLASTp

-BLASTx

-BLASTn: compares nucleotide queries to the nucleotide database.

-BLASTp: compares amino acid queries to the protein database.

-BLASTx: translates nucleotide queries and compares them to the protein database (as in BLASTp).

New cards

What are the following BLAST algorithms used for?

-tBLASTn

-tBLASTx

-tBLASTn: compares amino acid queries to the translated nucleotide database.

-tBLASTx: translates nucleotide queries and compares them to the translated nucleotide database.

New cards

What do these BLAST metrics mean?

1) Query Cover

2) Max Score

3) Description/Scientific Name

1) Percentage of your search sequence which is matched by the database sequence.

2) Highest alignment score out of all aligned segments from the database.

3) Name/function of the protein and the organism of origin.

New cards

What do these BLAST metrics mean?

1) E value/Expect value

2) Ident

3) Accession

1) Number of hits you "expect" to see purely by chance given a database of a certain size.

2) Percent amino acid identity of the query amino acid sequence to the database sequence.

3) Unique identifier of the sequence in the database.