Genome Assembly and Sequence Alignment

Flashcards from lecture notes on genome assembly and sequence alignment.

120 Terms

1
New cards

What is the primary computational objective of sequence alignment in comparative genomics?

To maximize the identification of homologous regions and infer phylogenetic divergence.

2
New cards

Which algorithmic paradigm is employed by the Needleman-Wunsch algorithm to ensure optimal global alignment?

Dynamic programming with a scoring matrix

3
New cards

How does the Smith-Waterman algorithm differ from Needleman-Wunsch in its optimization strategy?

It maximizes local subsequence similarity using a zero-initialized matrix

4
New cards

In the context of dynamic programming for pairwise alignment, what does a horizontal move in the scoring matrix signify?

A gap in the first sequence (the one on the vertical axis), i.e. a deletion in the first sequence relative to the second

5
New cards

What is the significance of the traceback phase in the Needleman-Wunsch algorithm?

It reconstructs the optimal global alignment path from the matrix

6
New cards

Which scoring mechanism is typically used in the Needleman-Wunsch algorithm for nucleotide matches?

A positive score for identical nucleotides, typically +4

7
New cards

How does the Smith-Waterman algorithm’s matrix initialization strategy facilitate local alignment?

It initializes the first row and column to zero and floors every cell score at zero, so an alignment can start anywhere

8
New cards

What is a critical consideration when assigning gap penalties in pairwise alignment?

Balancing penalties to optimize alignment without excessive fragmentation

9
New cards

In the context of the scoring matrix, what does the cell score represent?

The maximum score from match, mismatch, or gap moves
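
As a sketch of the recurrence behind this card, assuming a substitution score s(a, b) for match or mismatch and a linear gap penalty d > 0 (both symbols are illustrative and not taken from the lecture notes), the score of cell (i, j) for sequences x and y can be written as:

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(diagonal: match or mismatch)} \\
F(i-1,\,j) - d             & \text{(vertical: gap in } y\text{)} \\
F(i,\,j-1) - d             & \text{(horizontal: gap in } x\text{)}
\end{cases}
```

For global alignment the borders are F(i, 0) = -i d and F(0, j) = -j d. For local alignment the borders are zero and a fourth option, 0, is added to the maximum.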

10
New cards

What is a key computational challenge in pairwise alignment using dynamic programming?

The quadratic space and time complexity of the scoring matrix

11
New cards

Which of the following best describes the Needleman-Wunsch algorithm’s alignment scope?

It enforces end-to-end alignment of entire sequences

12
New cards

What distinguishes the Smith-Waterman algorithm’s traceback process?

It initiates from the highest-scoring cell and stops at zero

13
New cards

What is the role of the scoring matrix in pairwise alignment algorithms?

To compute optimal alignment scores via dynamic programming

14
New cards

Which factor most influences the accuracy of pairwise alignment?

The choice of scoring parameters for matches and gaps

15
New cards

How does the Needleman-Wunsch algorithm handle sequences of unequal length?

It introduces gaps to align sequences end-to-end

16
New cards

In the Guide for Hands-on Examples, which sequence pair is used to demonstrate pairwise alignment?

Mouse (ATAC) vs. Rat (ATACG)

17
New cards

What is the Needleman-Wunsch output for human (ATGC) and chimp (ATAC) in the guide?

ATGCC vs. AT-ACC with a single gap

18
New cards

What is the Smith-Waterman output for the same sequence pair?

ATCC vs. ATCC focusing on a conserved region

19
New cards

Why does the Smith-Waterman algorithm exclude the initial G and A in its output?

It optimizes for the highest-scoring local subsequence

20
New cards

What scoring parameters are specified for the Needleman-Wunsch alignment in the guide?

+4 for matches, -1 for mismatches

21
New cards

Which online tool is recommended for pairwise alignment in the guide?

EMBL-EBI pairwise sequence alignment tool

22
New cards

What is a critical input precaution for the EMBL-EBI pairwise alignment tool?

Avoiding extraneous whitespace in sequence input

23
New cards

What does the scoring matrix in the guide’s Needleman-Wunsch example display?

Numerical scores for matches and mismatches

24
New cards

In the Week I example with sequences "TAG" and "TACG", what is the Needleman-Wunsch output?

TA-G vs. TACG with a gap in TAG
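
The result on this card can be reproduced with a minimal Needleman-Wunsch sketch. The +4 match and -1 mismatch scores come from the guide's stated parameters. The linear gap penalty of -2 is an assumption, since the cards do not state one, and the function name is illustrative.

```python
def needleman_wunsch(a, b, match=4, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (aligned_a, aligned_b, score).

    gap=-2 is an assumed linear gap penalty, not a value from the guide.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill phase: each cell takes the best of diagonal, vertical, horizontal.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback phase: walk from the bottom-right corner back to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(needleman_wunsch("TAG", "TACG"))  # ('TA-G', 'TACG', 10)
```

Under these assumed parameters the score is 4 + 4 - 2 + 4 = 10, with the single gap placed in TAG opposite the C.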

25
New cards

In the Week I example with sequences "TAG" and "TACG", what region does Smith-Waterman prioritize?

The subsequence TA vs. TA
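
A minimal Smith-Waterman sketch shows how this local result can arise. The +4 and -1 scores follow the guide. The gap penalty of -5 is an assumption picked so that opening a gap costs more than a single match earns; the cards do not give the value the tool actually used.

```python
def smith_waterman(a, b, match=4, mismatch=-1, gap=-5):
    """Local alignment of a and b; returns (aligned_a, aligned_b, score).

    gap=-5 is an assumed linear gap penalty, not a value from the guide.
    """
    n, m = len(a), len(b)
    # First row and column stay at zero; every cell is floored at zero.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback starts at the highest-scoring cell and stops at a zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), best


print(smith_waterman("TAG", "TACG"))  # ('TA', 'TA', 8)
```

The outcome depends on the gap penalty: with a lighter penalty such as -2, the gapped alignment TA-G vs. TACG would score 4 + 4 - 2 + 4 = 10 and outrank the 8 earned by TA alone.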

26
New cards

What is the significance of comparing Needleman-Wunsch and Smith-Waterman outputs in the guide?

To contrast global and local alignment strategies

27
New cards

In the guide’s pairwise alignment, what is the role of the mismatch between G and A?

It reduces the alignment score in Needleman-Wunsch

28
New cards

How are sequences formatted for input into the EMBL-EBI tool in the guide?

In separate fields for Sequence 1 and Sequence 2

29
New cards

What does the Smith-Waterman algorithm’s focus on “ATCC” indicate in the guide?

A conserved local subsequence with maximal similarity

30
New cards

What is the purpose of recording pairwise alignment results in the guide?

To enable comparative analysis of alignment outcomes

31
New cards

What is the primary computational challenge in multiple sequence alignment compared to pairwise alignment?

The exponential increase in alignment complexity with sequence count

32
New cards

Which algorithmic strategy does ClustalW employ for MSA?

Progressive alignment guided by a hierarchical tree

33
New cards

What is the function of the guide tree in ClustalW’s alignment process?

To prioritize the order of pairwise alignments based on similarity

34
New cards

How does MUSCLE enhance efficiency in MSA compared to ClustalW?

Through iterative refinement of initial alignments

35
New cards

What is a computational trade-off in ClustalW’s progressive alignment approach?

It gains speed by aligning greedily along the guide tree but sacrifices accuracy, because early alignment errors are never revisited

36
New cards

What does the '>' symbol signify in ClustalW’s sequence input format?

The start of a FASTA header line, which carries the sequence identifier
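
To make the role of '>' concrete, here is a small sketch of a FASTA reader. The record names `seq1` and `seq2` and the function name are hypothetical; only the '>' header convention comes from the card.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped; join them under the last header.
            records[name] += line
    return records


example = ">seq1 hypothetical record\nGCTA\n>seq2\nGC\nTT\n"
print(parse_fasta(example))  # {'seq1': 'GCTA', 'seq2': 'GCTT'}
```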

37
New cards

What do stars or dots in ClustalW’s MSA output represent?

Stars mark fully conserved columns (identical residues in every sequence); colons and dots mark strongly and weakly similar columns
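
The star annotation can be sketched in a few lines. This is an illustration of the idea only, not ClustalW's own code, and the three equal-length input rows are used purely as sample data.

```python
def conservation_line(rows):
    """Return a string with '*' under columns where every row is identical."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    marks = []
    for column in zip(*rows):
        # A column is fully conserved if it holds one distinct non-gap symbol.
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)


rows = ["GCTA", "GCTT", "GCCA"]  # illustrative rows of equal length
for r in rows:
    print(r)
print(conservation_line(rows))  # '**  ' : only the first two columns agree
```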

38
New cards

Why does MUSCLE’s iterative refinement improve performance on large datasets?

It rapidly refines initial alignments for efficiency

39
New cards

What is a key limitation of MSA compared to pairwise alignment?

It relies on heuristic methods due to computational complexity

40
New cards

How does ClustalW’s guide tree influence the alignment of human and chimp sequences?

It prioritizes their alignment due to high similarity

41
New cards

What is the computational advantage of MUSCLE’s log-expectation approach?

It optimizes alignment through iterative score maximization

42
New cards

What is a potential drawback of ClustalW for complex datasets?

It may struggle with accuracy due to progressive alignment errors

43
New cards

How does MUSCLE’s initial alignment phase differ from ClustalW’s?

It generates a rapid, approximate alignment for refinement

44
New cards

What is the significance of differences between ClustalW and MUSCLE alignments?

They reflect algorithmic biases in gap placement and scoring

45
New cards

What is a key application of MSA in biological research?

Analyzing evolutionary relationships in gene families

46
New cards

Which sequences are aligned in the MSA example in the guide?

Human (ATGC), Chimp (ATAC), Mouse (ATAC), Rat (ATACG)

47
New cards

What is ClustalW’s output for the four sequences in the guide?

ATG-CC, ATACC, AT-ACA, ATACG

48
New cards

How does MUSCLE’s output differ from ClustalW’s output for the four sequences in the guide?

It introduces a gap in the chimp sequence (AT-AC)

49
New cards

What is the rationale for MUSCLE’s gap insertion in the chimp sequence?

To optimize alignment of the terminal CC region

50
New cards

What does the guide recommend for visualizing conserved regions in MSA?

Highlighting conserved regions like AT and CC with colors

51
New cards

What does the conserved AT across all four sequences in the guide suggest?

A functionally critical region in hemoglobin genes

52
New cards

How are MSA results preserved in the guide?

By copying text or capturing screenshots

53
New cards

What does ClustalW’s guide tree indicate about human and chimp sequences?

They are closely related with a single nucleotide difference

54
New cards

In the bacterial MSA example from Week I, which sequences are used?

GATTC, GATCC, GCTTC, GACTC

55
New cards

What is ClustalW’s alignment for the bacterial sequences?

GA-TTC, GATCC, GCTTC, GACTC

56
New cards

What is MUSCLE’s alignment for the bacterial sequences?

GATTC, GAT-CC, GCTTC, GACTC

57
New cards

What does the conserved G and T in the bacterial alignment indicate?

A functionally conserved region across bacteria

58
New cards

In the plant DNA example from Week I, what is MUSCLE’s refined alignment?

GCTA, GC-TT, GCCA

59
New cards

How does ClustalW’s plant sequence alignment differ from MUSCLE’s?

It places a gap in the daisy sequence (GC-TT)

60
New cards

What do differences in MSA outputs between ClustalW and MUSCLE suggest?

Algorithmic variations in gap placement and scoring priorities

61
New cards

What is the computational goal of genome assembly?

To reconstruct a contiguous genomic sequence from short reads

62
New cards

What is the approximate nucleotide count of the human genome?

3.2 billion base pairs

63
New cards

How is the human genome’s information density characterized?

As a 3.2 GB text file equivalent to 800,000 pages
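
The arithmetic behind this comparison can be checked directly. The sketch assumes one byte per nucleotide letter and decimal gigabytes; the per-page figure is derived from the card's two numbers rather than stated in the notes.

```python
BASES = 3_200_000_000  # base pairs, from the card
PAGES = 800_000        # pages, from the card

# Assumption: each base is stored as one ASCII character (1 byte).
gigabytes = BASES / 1e9
chars_per_page = BASES // PAGES

# With only four letters, 2 bits per base would be enough.
two_bit_gigabytes = BASES * 2 / 8 / 1e9

print(gigabytes)          # 3.2
print(chars_per_page)     # 4000
print(two_bit_gigabytes)  # 0.8
```

So the 800,000-page figure implies roughly 4,000 characters per page.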

64
New cards

What is the primary computational challenge in genome assembly?

Resolving overlaps in fragmented, repetitive reads

65
New cards

What is the “repeat problem” in genome assembly?

The ambiguity caused by identical repetitive sequences

66
New cards

What is the purpose of phasing in human genome assembly?

To distinguish maternal and paternal chromosome variants

67
New cards

Which data structure is central to de novo genome assembly?

A De Bruijn graph for k-mer overlaps

68
New cards

What does a node represent in a De Bruijn graph?

A k-mer subsequence of fixed length

69
New cards

How are edges defined in a De Bruijn graph?

By overlaps of k-1 nucleotides between k-mers
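
The node and edge definitions on these two cards can be sketched as follows. The second read, GCTTA, is a hypothetical addition for illustration, and the function name is not from the notes. Many assemblers use a related variant in which nodes are (k-1)-mers and each k-mer is an edge.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a graph whose nodes are k-mers.

    An edge u -> v exists when the last k-1 letters of u equal
    the first k-1 letters of v.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Index k-mers by their (k-1)-letter prefix for fast edge lookup.
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)

    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in sorted(kmers)}


graph = de_bruijn(["AGCTT", "GCTTA"], k=4)
print(graph)  # {'AGCT': ['GCTT'], 'CTTA': [], 'GCTT': ['CTTA']}
```

Following the edges AGCT -> GCTT -> CTTA spells out AGCTTA, a contig that spans both reads.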

70
New cards

What is the computational role of contigs in de novo assembly?

They form continuous reconstructed genomic segments

71
New cards

What is a key limitation of small k-mers in De Bruijn graph assembly?

They create cycles in repetitive regions

72
New cards

How do large k-mers improve de novo assembly?

They span repetitive regions to resolve ambiguities

73
New cards

What is the trade-off in k-mer size selection for de novo assembly?

Overlap detection sensitivity vs. repeat resolution specificity

74
New cards

Which sequencing technology aids in resolving repetitive regions?

Long-read Oxford Nanopore sequencing

75
New cards

What is a key metric for assessing genome assembly continuity?

The N50 value of contig lengths
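
N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch, with made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total past half is the N50.
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Hypothetical lengths: total 290, half 145; 80 + 70 = 150 reaches it.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```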

76
New cards

What computational strategy characterizes reference-based genome assembly?

Mapping reads to a pre-existing genomic reference

77
New cards

Which tool is explicitly mentioned for reference-based assembly?

Bowtie2 for read mapping

78
New cards

What is a computational advantage of reference-based assembly?

It achieves rapid assembly with a close reference

79
New cards

What was the role of reference-based assembly in the COVID-19 pandemic?

Mapping reads to identify SARS-CoV-2 mutations

80
New cards

What specific mutation type was identified in the SARS-CoV-2 example?

A single nucleotide substitution (A to T)

81
New cards

What is a critical limitation of reference-based assembly for novel pathogens?

It may overlook unique genomic features not in the reference

82
New cards

What is a prerequisite for effective reference-based assembly?

A high-quality, well-annotated reference genome

83
New cards

In the reference-based assembly example, how is a read like “BROWNFO” processed?

It is mapped to its corresponding position in the reference
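
A toy version of this mapping step is sketched below. The reference string is an assumption chosen so that the read has somewhere to land; the cards do not quote the reference used in the lecture. Real mappers such as Bowtie2 rely on compressed indexes and tolerate mismatches, which this exact-match sketch does not.

```python
def map_read(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


reference = "THEQUICKBROWNFOXJUMPS"  # hypothetical reference text
print(map_read(reference, "BROWNFO"))  # [8]
print(map_read(reference, "PURPLE"))   # [] : read absent from the reference
```

The empty result for an unmatched read mirrors the limitation noted elsewhere in this set: sequence that is not in the reference cannot be placed.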

84
New cards

What is a potential bias in reference-based assembly?

It may conform to the reference’s genomic structure

85
New cards

Why is reference-based assembly suitable for medical genomics?

It leverages high-quality human reference genomes for precision

86
New cards

What issue arose during the 2011 E. coli outbreak with reference-based assembly?

It failed to detect the pathogen’s hybrid genomic structure

87
New cards

How did de novo assembly address the E. coli outbreak challenge?

It reconstructed the hybrid genome without a reference

88
New cards

In the miniature genome example, what sequence is assembled?

AGCTTAGCTTACCT

89
New cards

What k-mers are derived from the read AGCTT in the miniature genome example?

AGCT, GCTT
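
The answer implies k = 4 (an inference from the two 4-letter k-mers; the card does not state k). A sliding window reproduces it:

```python
def kmers(read, k):
    """Return all substrings of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("AGCTT", 4))  # ['AGCT', 'GCTT']
```

A read of length L yields L - k + 1 k-mers, here 5 - 4 + 1 = 2.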

90
New cards

How does SPAdes mitigate repetitive k-mer challenges in de novo assembly?

By iterating over multiple k-mer sizes for resolution
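
Why k-mer size matters can be seen on the miniature genome from this set. The sketch below only counts repeated k-mers at different k; it illustrates the motivation for trying several sizes and is not a description of SPAdes' actual algorithm.

```python
from collections import Counter


def repeated_kmers(sequence, k):
    """Return the sorted k-mers that occur more than once in sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sorted(kmer for kmer, count in counts.items() if count > 1)


genome = "AGCTTAGCTTACCT"
print(repeated_kmers(genome, 6))  # ['AGCTTA']
print(repeated_kmers(genome, 7))  # []
```

AGCTTA occurs twice in the sequence, so at k = 6 the graph still contains an ambiguous node. At k = 7 every k-mer is unique, and the repeat no longer creates a branch.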

91
New cards

When is de novo assembly the preferred approach in genomic research?

For novel organisms lacking a close reference

92
New cards

What is an example of a novel organism requiring de novo assembly?

A bacterial strain from a deep-sea hydrothermal vent

93
New cards

What is a primary application of genome assembly in infectious disease research?

Tracing evolutionary origins of pathogens

94
New cards

How does genome assembly contribute to cancer research?

By identifying driver mutations for targeted therapies

95
New cards

What is a key application in conservation biology?

Sequencing genomes to inform biodiversity strategies

96
New cards

What ethical concern is associated with human genome assembly?

The potential exposure of sensitive genetic traits

97
New cards

What is a discussion question posed about a new bird species?

Whether to apply de novo or reference-based assembly

98
New cards

Why might de novo assembly be chosen for a new bird species?

No phylogenetically close reference may be available

99
New cards

How does genome assembly support flu vaccine development?

By identifying antigenic mutations for vaccine design

100
New cards

What critical insight was gained from de novo assembly in the 2011 E. coli outbreak?

The pathogen combined traits of two E. coli pathotypes: enteroaggregative and Shiga toxin-producing (enterohemorrhagic)