1. List the different types of sequencing technologies and describe their strengths and weaknesses.
1. Sanger sequencing: Very accurate, but low throughput and high cost; only confirms a single gene or plasmid
2. Oxford Nanopore: Very long reads (up to 1 Mb), but not as accurate as Illumina
3. SMRT sequencing: Long reads with high fidelity, but high cost
4. Illumina: High capacity and low cost (suits draft genomes and metagenomes), but short read length
2. Describe the chemistry of next-generation sequencing.
Next-generation sequencing (NGS) relies primarily on sequencing by synthesis (SBS) using reversible-terminator chemistry, where engineered DNA polymerases sequentially add labeled, terminal-blocked deoxynucleotide triphosphates (dNTPs) to a template DNA strand. Each cycle adds one base, which is imaged before its fluorescent label and terminator are cleaved so that the next base can be added.
3. Describe the steps in analysing data from a sequencer.
1) Basecalling
2) Formatting
3) Trimming
4) Assembly
4. Describe the key attributes that are labelled using an annotation program.
1) Open reading frames
2) Gene locations
3) Strand orientation
4) Gene name
5) Protein coding
6) Promoter regions
7) Start and stop codons
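As a rough illustration of how an annotation program might flag open reading frames, the sketch below scans the three forward reading frames for an ATG start codon followed by an in-frame stop codon. The function name `find_orfs` and the `min_codons` parameter are made up for this example; real annotation tools also search the reverse strand and use far more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    Hypothetical helper: coordinates are 0-based, end is exclusive,
    and the stop codon is counted as part of the ORF.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATAGGG"))
```

Only complete ORFs (a start codon with a matching in-frame stop) are reported, which is a simplification chosen to keep the sketch short.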
5. How are genes that are paralogues and orthologues formed in nature?
Gene duplication: Produces paralogs (Gene 1a and Gene 1b)
Speciation: Produces orthologs (Species 1 Gene 1a, Species 2 Gene 1a)
6. How do we measure homology using a pairwise algorithm?
-Needleman-Wunsch and other algorithms
-Predict which residues are identical or mutated, and where insertions or deletions (indels) have occurred
7. Explain the principles of measuring homology of DNA.
-% Identity: Percentage of nucleotides matching in an alignment to measure the homology
-This can be interpreted as evolutionary distance between the sequences.
8. Explain the principles of measuring homology of proteins, with a focus on how the matrices operate
-Use scoring matrix (BLOSUM) to generate % Similarity score for two amino acid sequences
-Greater scores to amino acids with similar properties, and penalties to gaps/differences.
-More useful than % identity
Name four types of DNA sequencing technologies
1) Sanger
2) Illumina (short-read, next gen)
3) Oxford Nanopore (long-read)
4) SMRT (Single molecule real-time)
What type of DNA sequencing is this?
1) PCR with fluorescent, chain-terminating ddNTPs
2) Size separation by capillary gel electrophoresis
3) Laser excitation & detection by sequencing machine, produces chromatogram
Sanger Sequencing
What type of DNA sequencing is this?
-Sanger Sequencing
1) PCR with fluorescent, chain-terminating ddNTPs
2) Size separation by capillary gel electrophoresis
3) Laser excitation & detection by sequencing machine, produces chromatogram
What happens after ddNTPs are incorporated into DNA during the first step of Sanger Sequencing?
-Chain extension stops, because a ddNTP lacks the 3'-OH group needed to add the next base
-Fragments are short, some only 1 base long
What type of DNA sequencing is this?
1) DNA is extracted/fragmented through mechanical breakage or with endonucleases.
2) DNA adapters are ligated to either end of each fragment
3) DNA is washed/added, binds to a complementary DNA terminus
4) DNA folds into bridge shape, adapter binds a neighboring terminus
5) Primers and polymerase are added, creates copy strand
6) DNA is denatured and strands separate.
7) The process is repeated, creates copies on the flow cell.
8) Fluorescent nucleotides/polymerase are added to extend the sequence
9) Flow cell is imaged, color indicates which base was added
10) Fluorescent tag/terminator are cleaved, next base added
11) This process continues until the whole strand is sequenced
Illumina Sequencing
What type of DNA sequencing is this?
-Illumina
1) DNA is extracted/fragmented through mechanical breakage or with endonucleases.
2) DNA adapters are ligated to either end of each fragment
3) DNA is washed/added, binds to a complementary DNA terminus
4) DNA folds into bridge shape, adapter binds a neighboring terminus
5) Primers and polymerase are added, creates copy strand
6) DNA is denatured and strands separate.
7) The process is repeated, creates copies on the flow cell.
8) Fluorescent nucleotides/polymerase are added to extend the sequence
9) Flow cell is imaged, color indicates which base was added
10) Fluorescent tag/terminator are cleaved, next base added
11) This process continues until the whole strand is sequenced
What is the first step of Illumina sequencing?
-DNA to be sequenced is extracted and fragmented
-Mechanical breakage or digestion with endonucleases
What occurs after this step in Illumina sequencing?
1) DNA to be sequenced is extracted and fragmented through mechanical breakage or digestion with endonucleases.
Specific DNA adapters are ligated to either end of each fragment
What are some special abilities of the sequences found in the DNA adapters used for Illumina sequencing?
-Allow the DNA to bind the flow cell
-Complementary to the sequencing primers
-Nucleotide barcodes allow us to distinguish between different samples
What occurs after this step in Illumina sequencing?
2) Specific DNA adapters are ligated to either end of each fragment.
-DNA is washed and added to the flow cell
-DNA binds to a complementary DNA terminus on the surface of the cell
What occurs after this step in Illumina sequencing?
3) DNA is washed and added to the flow cell
-DNA binds to a complementary DNA terminus on the surface of the cell
-DNA folds over to form a bridge shape
-The adapter binds a neighboring terminus.
What occurs after this step in Illumina sequencing?
4) DNA folds over to form a bridge shape and the adapter binds a neighboring terminus.
-Primers and polymerase are added
-Creates a copy (reverse strand) of the original template (forward strand).
True or False: The new strand that is copied during Illumina sequencing is identical to the forward strand
False, it is a reverse strand of the original template (forward strand)
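The relationship between the template and the copied strand can be sketched in a few lines: the copy is complementary to the template and, read 5' to 3', runs in the opposite direction. The helper name `reverse_complement` is an assumption for illustration only.

```python
# Map each base to its complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (illustrative helper)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

forward = "ATGC"
print(forward, "->", reverse_complement(forward))
```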
What occurs after this step in Illumina sequencing?
5) Primers and polymerase are added which create a copy (reverse strand) of the original template (forward strand).
DNA is denatured and strands separate
What occurs after this step in Illumina sequencing?
6) DNA is denatured and strands separate
The process is repeated to create a dense cluster of copies on the flow cell
True or False: Each channel of a flow cell contains millions of clusters (copies of DNA on the flow cell)
True
What occurs after this step in Illumina sequencing?
7) The process is repeated to create a dense cluster of copies on the flow cell
-Sequencing begins.
-Fluorescent nucleotides and polymerase are added
-Extend the sequence by one base
What occurs after this step in Illumina sequencing?
8) Fluorescent nucleotides and polymerase are added to extend the sequence by one base
-The flow cell is then imaged.
-The color of fluorescence indicates which base was added.
What occurs after this step in Illumina sequencing?
9) The flow cell is then imaged. The colour of fluorescence indicates which base was added.
-Fluorescent tag and terminator are cleaved
-Next base can be added
What occurs after this step in Illumina sequencing?
10) The fluorescent tag and terminator are now cleaved, allowing the addition of the next base
The process continues until the whole strand is sequenced
Why is Illumina next-gen sequencing so convenient?
Millions of clusters can be simultaneously sequenced in parallel
How do you differentiate Illumina sequences from different samples?
-Nucleotide barcodes contained in the adapter sequences
-Allows multiplexing of multiple samples in a single flow cell
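Demultiplexing by barcode can be sketched as below. This is a simplified illustration that assumes the barcode sits at the start of each read and must match exactly; the function name, the barcode sequences, and the sample names are all hypothetical, and real pipelines typically read the index separately and tolerate mismatches.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Group reads by sample using an exact barcode prefix match.

    reads: list of read sequences; barcodes: dict of barcode -> sample name.
    Reads matching no barcode are collected under "unassigned".
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode so only the insert sequence remains.
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)

barcodes = {"ACGT": "sampleA", "TTAG": "sampleB"}  # hypothetical barcodes
print(demultiplex(["ACGTGGGA", "TTAGCCCA", "GGGGAAAA"], barcodes))
```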
When the Illumina sequencer interprets the fluorescent signals, it performs "basecalling." What is this?
-Assigns nucleotide to each signal
-Data is converted to sequence data
What is this?
-Assigns nucleotide to each signal
-Illumina data is converted to sequence data
Basecalling
What type of DNA sequencing is this?
1) DNA molecules bind a motor protein at a flow cell pore, which unwinds the double strand
2) A single strand passes through the pore, nucleotides identified by changes in electrical current
3) Base-calling algorithms used to convert electrical data into DNA sequence
Oxford Nanopore Sequencing
What type of DNA sequencing is this?
-Oxford Nanopore Sequencing
1) DNA molecules bind a motor protein at a flow cell pore, which unwinds the double strand
2) A single strand passes through the pore, nucleotides identified by changes in electrical current
3) Base-calling algorithms used to convert electrical data into DNA sequence
What sets the Oxford Nanopore flow cell apart from other flow cells?
-Flow cell consists of pores
-Each pore sits in a membrane and works with a motor protein that feeds the DNA through
What type of DNA sequencing is this?
1) Double stranded DNA is used as the input.
2) 'SMRTbell' adapters are added to each end which join both complementary strands together.
3) The now circular ssDNA binds to a ZMW pore containing a polymerase.
4) Fluorescent nucleotides are incorporated and cleaved continuously, producing a bright flash as each base is incorporated.
5) Sequencing occurs continuously around the circle, providing coverage of both strands many times over.
SMRT Sequencing (Single-Molecule Real-Time)
What type of DNA sequencing is this?
-SMRT Sequencing (Single-Molecule Real-Time)
1) Double stranded DNA is used as the input.
2) Adapters are added to each end, join both complementary strands together.
3) Circular ssDNA binds to a ZMW pore containing a polymerase.
4) Fluorescent nucleotides are incorporated and produce a bright flash
5) Sequencing occurs continuously around the circle, providing coverage of both strands many times over.
What adapters are used during SMRT sequencing to join complementary strands together?
SMRTbell
What happens to the ssDNA and the ZMW pores during SMRT Sequencing?
-The polymerase in the ZMW travels repeatedly around the circular ssDNA, covering both strands of the original DNA duplex
-Reads the same DNA 50-100x
True or False: SMRT fluorescent data is output in real time when nucleotides are being cleaved and incorporated
True
What do both SMRT and Illumina sequencing have in common?
Both use a detector for fluorescent probes
True or False: SMRT has low fidelity, making it somewhat inaccurate
False, it reads the same DNA 50-100x
What type of DNA sequencing is this?
Problem: Single genes/plasmids
Input: PCR amplified DNA
Read Length: 500-1000 bp
Capacity: Low
Cost: High
Accuracy: 99.9%
Sanger Sequencing
What is a downside of Sanger Sequencing?
-Only confirms single gene
-Low throughput and expensive
-Still very accurate though
What type of DNA sequencing is this?
Problem: Whole draft genomes and 16S metagenomes
Input: Fragmented, amplified DNA
Read Length: 150-600 bp
Capacity: High
Cost: Low
Accuracy: 99.9%
Illumina Sequencing
What sequencing is best suited for whole genomes and metagenomes?
Illumina Sequencing
What type of DNA sequencing is this?
Problem: Closed genomes, genomes with repetitive elements
Input: Amplified or Native DNA
Read Length: 15-100 kb
Capacity: Medium
Cost: High
Accuracy: >99%
PacBio Sequencing
What type of DNA sequencing is this?
Problem: Closed genomes, genomes with repetitive elements
Input: Amplified or Native DNA
Read Length: Up to 1 Mb
Capacity: Medium
Cost: High
Accuracy: >99%
Oxford Nanopore Sequencing
What type of sequencing would you use for closed genomes or human/plant genomes with repetitive elements?
-PacBio (Outdated)
-Oxford Nanopore
True or False: Illumina is more accurate than Oxford Nanopore
True
True or False: Oxford Nanopore is more accurate than Illumina Sequencing
False
What is one advantage of using Oxford Nanopore over Illumina despite being less accurate?
-Much longer reads (up to 1 Mb versus 150-600 bp), which can span repetitive elements
-Can sequence amplified DNA or native DNA from direct extraction
What type of data does modern sequencing technology output?
-Fluorescent
-Electrical (changes in current)
What is this?
-Q score/Phred score
-Basecaller outputs Quality Score
-Indicates the confidence in each base call (higher Q = lower probability of error)
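The Phred scale relates a quality score Q to the probability p that a base call is wrong via Q = -10 * log10(p). The sketch below converts in both directions and decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding; the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality score corresponding to an error probability p."""
    return -10 * math.log10(p)

def decode_fastq_quality(qual, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - offset for char in qual]

print(phred_to_error_prob(30))       # Q30 is roughly 1 error in 1000 bases
print(decode_fastq_quality("I?5"))   # one score per character
```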
What do raw data/raw reads from basecallers represent?
Output from sequencer for each sequenced fragment of DNA
How can you tell the difference between raw reads and data that has been assigned a Q score by the basecaller?
FASTA: Raw read (header + sequence)
FASTQ: Includes quality scores (header + sequence + Q score per base)
What happens when data is trimmed?
-Adapters/barcodes are removed from reads
-Low-quality information at ends is removed
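A minimal sketch of trimming, assuming the adapter is an exact substring of the read and that low-quality bases are clipped from both ends until a base meets the cutoff. The name `trim_read`, the `min_q` default of 20, and the example adapter are assumptions for illustration, not the behaviour of any particular trimming tool.

```python
def trim_read(seq, quals, adapter=None, min_q=20):
    """Remove an adapter (and everything after it), then clip low-quality ends.

    seq: read sequence; quals: list of Q scores, one per base.
    """
    if adapter and adapter in seq:
        cut = seq.index(adapter)
        seq, quals = seq[:cut], quals[:cut]
    start, end = 0, len(seq)
    # Clip from the 5' end while quality is below the cutoff.
    while start < end and quals[start] < min_q:
        start += 1
    # Clip from the 3' end while quality is below the cutoff.
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

print(trim_read("ACGT", [30, 30, 10, 5]))
```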
True or False: The cutoff Q scores for trimming are the same across different sequencing technologies and algorithms
False, they differ
What happens to data during assembly?
-Algorithm looks for overlapping k-mers
-Creates long sequence with them
-Assigns confidence threshold
-Contiguous regions are combined
What is this?
-Coverage
-Number of times a given nucleotide has been sequenced
-Higher coverage = higher confidence rate, more likely to be assembled correctly
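Coverage can be illustrated by counting how many reads overlap each position. The sketch assumes each read is already placed on the genome as a hypothetical (start, length) pair with 0-based coordinates; the function name is made up for this example.

```python
def per_base_coverage(genome_length, alignments):
    """Count how many reads cover each position of the genome.

    alignments: list of (start, length) pairs, 0-based.
    """
    depth = [0] * genome_length
    for start, length in alignments:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth

depth = per_base_coverage(6, [(0, 4), (2, 4), (3, 2)])
print(depth)                     # coverage at each position
print(sum(depth) / len(depth))   # mean coverage across the genome
```

Positions with a depth of zero in such a profile are where gaps would be expected in the assembly.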
Why do gaps form during assembly?
-Low coverage
-Repetitive/duplicate regions of DNA that are difficult for algorithm to differentiate
True or False: A high quality assembly will contain more contigs, contiguous regions of sequence data
False
True or False: A reference-based assembly is more likely to have more contigs than a de-novo assembly
False, it will have fewer
What are the advantages of mapping reads back to a reference genome?
-Known genome structure
-More complete assemblies
What are some disadvantages of mapping reads back to a reference genome?
-Biased towards reference genome structure
-Miss large indels and genome rearrangements
-Need to be careful choosing references
What are the advantages of de-novo assembly (no reference genomes)?
-Unbiased by a reference genome
-Useful for new sequences
What are some disadvantages of de-novo assembly (no reference genomes)?
-Impacted by repetitive elements
-Fragmentary, more contigs due to less complete data
What are some ways to avoid fragmentary de-novo assembly data?
-Use long-read sequence data
-Reads can cross through difficult regions, connecting the contig
What are some important metrics for quality control?
Yield: total number of bases generated.
Error rate: compare reads that cover the same region.
%Q30: the percentage of bases with a Q score of 30 or higher.
Longest contig: ideally, one very long contig should cover most of your genome.
Number of contigs: lower indicates more contiguous sequence data.
N50: the contig length at which contigs of that length or longer make up half the assembly size (a length-weighted median).
L50: The smallest number of contigs making up half the assembly size.
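N50 and L50 can be computed together by sorting contigs from longest to shortest and adding lengths until the running total reaches half the assembly size: the length of the contig that crosses the halfway point is the N50, and the number of contigs used is the L50. The function name and the example lengths below are illustrative.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs supplied")

print(n50_l50([80, 70, 50, 40, 30, 20]))
```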
What do these metrics mean?
1) Yield
2) Longest Contig
3) Q30
1) Total number of bases generated
2) Ideally, one very long contig should cover most of your genome.
3) Percentage of bases with a Q score of 30 or higher.
What do these metrics mean?
1) Number of Contigs
2) N50
3) L50
1) Lower indicates more contiguous sequence data
2) The contig length at which contigs of that length or longer make up half the assembly size (a length-weighted median).
3) The smallest number of contigs making up half (50%) the assembly size.
How do you calculate Error Rate?
Calculated by comparing reads which cover the same region
How do you annotate sequence data?
-Search database for known sequence with regions that are homologous to data
-Compare homology via alignment (base by base)
What is this?
-Homologs
Two sequences that are predicted to have a common ancestor
True or False: Homologs can be DNA or amino acid
True
What are 2 ways that homologs can arise?
-Gene duplication
-Speciation
What does gene duplication produce?
Paralogs
What does speciation produce?
Orthologs
Gene 1 in the ancestral species undergoes a duplication event generating Gene 1a and Gene 1b.
The ancestral species splits into two species, each with its own copy of Gene 1a and Gene 1b.
What is the relationship between Gene 1a in species one and Gene 1a in species two (DIFFERENT SPECIES)?
Orthologs
Gene 1 in the ancestral species undergoes a duplication event generating Gene 1a and Gene 1b.
The ancestral species splits into two species, each with its own copy of Gene 1a and Gene 1b.
What is the relationship between Gene 1a and Gene 1b?
Paralogs
Gene 1 in the ancestral species undergoes a duplication event generating Gene 1a and Gene 1b.
The ancestral species splits into two species, each with its own copy of Gene 1a and Gene 1b.
What is the relationship between species one and species two Genes 1a and Gene 1b (all four genes)?
Homologs
How does pairwise alignment estimate the level of homology between 2 sequences?
-Needleman-Wunsch algorithm
-Consider context of all residues in both sequences
What assumptions does the model of homology make?
-Identical residue pairs are aligned
-Different amino acids with similar physicochemical properties will form a pair
-Indels/gaps are rarer than mismatches
Why is a penalty applied by the algorithm for opening a gap?
-Gaps represent indels in an evolutionary context
-It is assumed that these are rarer than mismatches
True or False: When measuring homology, the pair with the highest total score is the most likely relationship
True
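A compact sketch of Needleman-Wunsch global alignment scoring, which fills a table of best scores for every pair of prefixes and returns the highest total score for the full sequences. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, with the gap penalised more heavily than a mismatch to reflect the assumption that indels are rarer.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

print(needleman_wunsch_score("ACGT", "AGT"))
```

Only the score is returned here; recovering the alignment itself would need a traceback through the table, which is left out to keep the sketch short.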
What is this?
-% Identity
-Percentage of nucleotides matching in an alignment to measure the homology
-Evolutionary distance between the sequences
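Percent identity can be sketched as the share of alignment columns in which both sequences carry the same base. This version assumes the two sequences are already aligned (equal length, gaps written as "-") and divides by the full alignment length; tools differ in which denominator they use, so treat that choice as an assumption.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical, non-gap residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be the same length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100 * matches / len(aligned_a)

print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```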
True or False: Measuring homology in nucleotides is easier than in amino acids because the base either matches or it doesn't
True
True or False: When measuring amino acid homology, similar amino acid matches are ranked higher than dissimilar matches
True
What is this?
-BLOSUM
-Scoring matrix used for amino acid homology
-High scores = similar properties
-Low scores, penalty = gaps, different properties
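The way a substitution matrix turns an alignment into a similarity score can be sketched with a toy matrix. The values below cover only four amino acids (L, I, D, E) and are illustrative stand-ins in the style of BLOSUM, not an official matrix; the gap penalty of -4 and all function names are also assumptions.

```python
# Toy BLOSUM-style scores: similar residues score positive, dissimilar negative.
TOY_MATRIX = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("L", "I"): 2, ("D", "E"): 2,
    ("L", "D"): -4, ("L", "E"): -3, ("I", "D"): -3, ("I", "E"): -3,
}

def pair_score(x, y, gap=-4):
    """Score one alignment column; a gap on either side takes the gap penalty."""
    if x == "-" or y == "-":
        return gap
    return TOY_MATRIX[(x, y)] if (x, y) in TOY_MATRIX else TOY_MATRIX[(y, x)]

def alignment_score(a, b):
    """Total score of two pre-aligned, equal-length sequences."""
    return sum(pair_score(x, y) for x, y in zip(a, b))

def percent_similarity(a, b):
    """Percent of columns whose residue pair scores above zero."""
    positives = sum(
        1 for x, y in zip(a, b) if "-" not in (x, y) and pair_score(x, y) > 0
    )
    return 100 * positives / len(a)

# No identical residues, yet every pair is chemically similar.
print(alignment_score("LIDE", "ILED"), percent_similarity("LIDE", "ILED"))
```

The example pair has 0% identity but 100% similarity under the toy matrix, which is the kind of relationship percent identity alone would miss.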
True or False: Percentage identity is less useful than a percentage similarity score for comparing amino acid sequences
True
What does it mean to say that pairwise computations are very compute heavy?
-Pairwise comparison of 2x 100 bp fragments would have 10^59 possibilities using fragments of two nucleotides in length
-Far more than the number of stars in the Milky Way, and no computer has the capacity to test every possibility.
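One way to arrive at a number of this size, assuming the count refers to the ways two 100-residue sequences can be interleaved, is the binomial coefficient C(200, 100), which is on the order of 10^59. The helper name below is illustrative.

```python
import math

def count_alignment_paths(n):
    """Number of ways to interleave two sequences of length n: C(2n, n)."""
    return math.comb(2 * n, n)

paths = count_alignment_paths(100)
print(f"{paths:.2e}")    # order of magnitude of the search space
print(len(str(paths)))   # number of digits in the count
```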
How do we overcome the heaviness of pairwise computations?
-Dynamic programming
-Identifies the highest scoring comparisons
-Conducts local and global comparison to save memory and time.
What algorithm would you use when the strings are of equal length?
Needleman and Wunsch
What algorithm would you use when comparing partial sequences with contigs or sequences of unknown length?
Smith and Waterman
True or False: Only Smith and Waterman introduces gap penalties
False, both Smith and Waterman and Needleman and Wunsch do it
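Smith-Waterman local alignment uses the same table-filling idea as Needleman-Wunsch, with two changes: no cell may drop below zero, and the answer is the highest score anywhere in the table, so the best-matching sub-region is found regardless of the surrounding sequence. The scoring values here (match +2, mismatch -1, gap -2) are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets a new local alignment start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# The shared "ACG" core is found despite unrelated flanking sequence.
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```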
What is this?
-BLAST (Basic Local Alignment Search Tool)
-Heuristic algorithm that chops the sequence into 3-letter k-mers which are iteratively aligned and assembled to find the best fit
-Searches against the entire collection of sequences on NCBI and GenBank for matches.
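The seeding idea can be sketched as follows: chop the query into 3-letter words, index the words of a database sequence, and report where exact word matches occur. This is a heavily simplified illustration with made-up function names; the real tool also scores similar (not only identical) words and extends each seed into a longer alignment.

```python
def kmers(seq, k=3):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def seed_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for pos, word in enumerate(kmers(subject, k)):
        index.setdefault(word, []).append(pos)
    return [
        (qpos, spos)
        for qpos, word in enumerate(kmers(query, k))
        for spos in index.get(word, [])
    ]

print(kmers("MKVLA"))
print(seed_hits("KVLA", "MKVLAG"))
```

Indexing the database words first means each query word is looked up directly rather than compared against every position, which is where much of the speed-up over full pairwise alignment comes from.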
What are the following BLAST algorithms used for?
-BLASTn
-BLASTp
-BLASTx
-BLASTn: compares nucleotide queries to the nucleotide database.
-BLASTp: compares amino acid queries to the protein database.
-BLASTx: translates nucleotide queries and performs BLASTp.
What are the following BLAST algorithms used for?
-tBLASTn
-tBLASTx
-tBLASTn: compares amino acid queries to the nucleotide database translated.
-tBLASTx: translates nucleotide queries and compares to the nucleotide database translated.
What do these BLAST metrics mean?
1) Query Cover
2) Max Score
3) Description/Scientific Name
1) Percentage of your search sequence which is matched by the database sequence.
2) Highest alignment score out of all aligned segments from the database.
3) Name/function of the protein and the organism of origin.
What do these BLAST metrics mean?
1) E value/Expect value
2) Ident
3) Accession
1) Number of hits you "expect" to see purely by chance given a database of a certain size.
2) Percent amino acid identity of the query amino acid sequence to the database sequence.
3) Unique identifier of the sequence in the database.
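The dependence of the E value on database size can be sketched with the simplified relation E = m * n * 2^(-S'), where m is the query length, n is the total database length, and S' is the bit score. The function and parameter names are illustrative, and the sketch ignores the length corrections that real searches apply.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate number of chance hits with at least this bit score."""
    return query_len * db_len * 2 ** (-bit_score)

# The same score is less convincing against a larger database.
print(expect_value(bit_score=10, query_len=100, db_len=1024))
print(expect_value(bit_score=10, query_len=100, db_len=2048))
```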