BIOL 266

274 Terms

1
New cards

blastp

input query: protein
database: protein
output: matching protein sequences

2
New cards

blastn

input query: nucleotide
database: nucleotide
output: matching nucleotide sequences

3
New cards

blastx

input query: nucleotide. translates to proteins for 6 reading frames.
database: protein
output: protein sequences that match any of the 6 frames.

4
New cards

tblastn

input query: protein
database: nucleotides. translates each nucleotide sequence to 6 possible protein sequences.
output: any of the nucleotide sequences that could've produced that protein.

5
New cards

tblastx

input query: nucleotide. translates to proteins for 6 reading frames.
database: nucleotide. translates to proteins for 6 reading frames.
output: any nucleotide sequences that will code for the same proteins as the input.

6
New cards

examples of annotations (aka secondary data)

  • gene names

  • product descriptions

  • genomic coordinates for exons/coding sequences

  • cross references (eg uniprot/GO)

  • notes on function and structure

  • evolutionary notes

7
New cards

file format for annotations

GBK/genbank

8
New cards

tabular data file format

csv

9
New cards

sequences file format

fasta (sometimes plaintext)

10
New cards

ambiguous characters for nucleotides

R: purines (A/G)
Y: pyrimidines (T/C)
N: any (A/C/T/G)

11
New cards

start codon

AUG (methionine)

12
New cards

stop codons

UAA, UAG, UGA

13
New cards

reading frames per nucleotide sequence

6 (3 reading frames x 2 directions)
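
A minimal sketch of six-frame translation, the step that blastx, tblastn and tblastx apply to nucleotide sequences. The function names (`reverse_complement`, `translate`, `six_frames`) and the frame labels (`+1` to `-3`) are illustrative assumptions; the codon table is the standard genetic code written in the DNA alphabet.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code (DNA alphabet); '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; codons not in the table become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate 3 offsets on the forward strand and 3 on the reverse strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGAGCGATCCATAG")
print(frames["+1"])  # MSDP*
```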

14
New cards

GO

Gene Ontology. a standardized vocabulary for describing gene function (molecular function, biological process, cellular component)

15
New cards

issues with text-based search

  • gene/species name is inconsistent.
  • different databases process text differently
  • database errors
  • should use homology (BLAST) search instead
16
New cards

sequence redundancy

  • duplicate sequences in a database due to multiple researcher submissions (eg SARS-CoV-2)
17
New cards

NCBI-nr

non redundant database

18
New cards

UniRef100 (UniProt reference clusters)

combines identical sequences into a single "cluster"

19
New cards

UniRef90

combines sequences with 90% identity

20
New cards

UniRef50

combines sequences with 50% identity

21
New cards

UniRefXX

the lower XX is, the less redundancy there is as more sequences are combined into a single cluster.

22
New cards

database reliability issues

  • computer error (incorrect annotations/predictions, missed relationships)
  • human error (typos, mislabels, propagation of initial human errors)
23
New cards

data growth

challenging to analyze large bioinformatic databases (eg NCBI sequence read archive)

24
New cards

sequence alignment

computational method to arrange two biological sequences to identify regions of similarity

reveals regions of homology

25
New cards

homology

similarity due to common ancestry

26
New cards

homologous

when two sequences share a common ancestor (are evolutionarily related)

27
New cards

global alignment

  • align sequences from beginning to end

  • require similar length and significant overlap

28
New cards

local alignment

  • find the best matching regions within sequences

  • used for different lengths and conserved regions

29
New cards

good alignments have:

  • many matches (character matches)

  • few mismatches

  • few gaps

30
New cards

substitution matrices

a scoring scheme for matches/mismatches

31
New cards

blosum62

a widely used protein substitution matrix. scores reflect how frequently each amino acid substitution is observed in alignments of related proteins, so likely substitutions are penalized less than unlikely ones

32
New cards

gap extension penalty

gap extensions (continuous gaps after the first) may be scored more leniently
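
A minimal sketch of scoring an existing alignment with matches, mismatches, and separate gap-opening and gap-extension penalties. The function name and all score values are illustrative assumptions, not values from the cards; real tools take match/mismatch scores from a substitution matrix.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # which sequence the previous column's gap was in
    for x, y in zip(a, b):
        gap = "a" if x == "-" else "b" if y == "-" else None
        if gap:
            # The first gap column pays gap_open; continuing the same gap
            # pays the more lenient gap_extend.
            score += gap_extend if gap == prev_gap else gap_open
        else:
            score += match if x == y else mismatch
        prev_gap = gap
    return score


print(score_alignment("ACGT--A", "ACGTTTA"))  # 4 matches, open, extend, 1 match = -1
```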

33
New cards

needleman-wunsch

  • 1970.
  • dynamic programming algorithm used for global alignments.
  • put both sequences in a matrix. first row/column are filled with multiples of gap penalty.
  • for all other cells, compute three candidates: the diagonal cell plus the match/mismatch score, the left cell plus the gap penalty, and the above cell plus the gap penalty. the cell value is the maximum of the 3.
  • to find alignment, traceback from bottom right corner.
  • multiple optimal alignments are possible.
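
The steps above can be sketched as follows. This is a minimal version with a linear gap penalty; the scores (`match=1`, `mismatch=-1`, `gap=-2`) and the function name are assumptions, and the traceback returns only one of the possibly many optimal alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment. seq1 runs along the horizontal axis, seq2 the vertical."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(cols):          # first row: multiples of the gap penalty
        F[0][j] = j * gap
    for i in range(rows):          # first column: multiples of the gap penalty
        F[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)

    # Traceback from the bottom-right corner to the top-left corner.
    a1, a2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            a1.append(seq1[j - 1]); a2.append(seq2[i - 1])   # diagonal move
            i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            a1.append(seq1[j - 1]); a2.append("-")           # horizontal: gap in seq2
            j -= 1
        else:
            a1.append("-"); a2.append(seq2[i - 1])           # vertical: gap in seq1
            i -= 1
    return F[-1][-1], "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```
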
34
New cards

smith-waterman

  • 1981.
  • dynamic programming algorithm used for local alignments.
  • based off of needleman-wunsch
  • harsher penalty for mismatches
  • instead of multiples of gap penalty, first row/column are all zeroes
  • cells are calculated the same, but any negative values will become 0 instead.
  • to find optimal alignment, start from the highest cell value in the table and traceback until you hit a 0.
  • multiple optimal alignments are possible.
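
A minimal sketch of the procedure above, again with a linear gap penalty. The scores (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions for illustration, and only one optimal local alignment is returned.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment. seq1 runs along the horizontal axis, seq2 the vertical."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]   # first row/column stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            # Same three candidates as Needleman-Wunsch, floored at 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i][j - 1] + gap, H[i - 1][j] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a 0 is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[j - 1] == seq2[i - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a1.append(seq1[j - 1]); a2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] + gap:
            a1.append(seq1[j - 1]); a2.append("-")
            j -= 1
        else:
            a1.append("-"); a2.append(seq2[i - 1])
            i -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (8, 'ACGT', 'ACGT')
```
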
35
New cards

horizontal move

gap in sequence 2 (sequence on the vertical)

36
New cards

vertical move

gap in sequence 1 (sequence on the horizontal)

37
New cards

diagonal move

align sequence 1 and 2

38
New cards

SSEARCH

  • extension of pairwise alignment, implements SW.

  • 1 vs many sequences instead of 1 vs 1

  • very slow in large databases. ok for small remote ones.

39
New cards

BLAST

basic local alignment search tool

40
New cards

blast heuristic

using words (substrings of a fixed length), but allows inexact matches between words

41
New cards

minimum blast word length for nucleotides to achieve significant score

16

42
New cards

minimum blast word length for proteins to achieve significant score

3

43
New cards

expectation (E) value

given an alignment with score S between the query and a database sequence, the number of matches with score >= S that are expected to occur in a database search by chance

44
New cards

e value properties

  • lower E value => more significant

  • larger databases => increased E for the same score
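
Both properties follow from the usual Karlin-Altschul form of the E value. The formula is not on the cards, so the symbols are assumptions: m is the query length, n the total database length, S the alignment score, and K and λ are constants that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

Doubling n doubles E for the same score S, while E falls exponentially as S rises.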

45
New cards

PSI blast

detection of remote protein homology using profiles

46
New cards

blast steps (simplified)

  1. break sequence (LMAILVPT) into words: LMAI, MAIL, AILV, etc.
  2. search for word matches in database sequences (allows inexact high scoring matches)
  3. extend sequence on both sides until score falls below a threshold
  4. merge high scoring pairs into a longer alignment (extend more and allow gaps)
  5. report statistical significance of hits
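
Steps 1 and 2 can be sketched as follows, assuming exact word matches only (real BLAST also accepts high-scoring inexact words). The helper names `words` and `seed_hits` are hypothetical.

```python
def words(seq, w=3):
    """Return every overlapping substring of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def seed_hits(query, target, w=3):
    """Return (query_pos, target_pos) pairs where a word matches exactly."""
    index = {}
    for t, word in enumerate(words(target, w)):
        index.setdefault(word, []).append(t)
    return [
        (q, t)
        for q, word in enumerate(words(query, w))
        for t in index.get(word, [])
    ]


print(words("LMAILVPT", 4))                  # ['LMAI', 'MAIL', 'AILV', 'ILVP', 'LVPT']
print(seed_hits("LMAILVPT", "XXMAILXX", 4))  # [(1, 2)]
```

Each seed hit is a starting point for the extension in step 3.
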
47
New cards

percent identity

the % of aligned positions (bases or residues) in a hit that are identical to the query
longer sequences => higher raw alignment scores
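
A minimal sketch of the percent identity calculation on an alignment. It assumes identity is counted over the full alignment length, gap columns included; some tools divide by a different length, so treat the denominator as an assumption.

```python
def percent_identity(a, b):
    """Percent of alignment columns where both sequences have the same character."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        return 0.0
    identical = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * identical / len(a)


print(percent_identity("ACGT", "ACGA"))  # 75.0
print(percent_identity("AC-T", "ACGT"))  # 75.0
```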

48
New cards

E value significance

E < 0.01 => borderline significant
E < 10^-10 => highly significant

49
New cards

issues with database alignment searches

  • raw alignment scores don't show much (longer sequence = higher score, so use E values instead)

  • sequences with repeats can artificially inflate score

  • low complexity regions + short queries can return inaccurate results

50
New cards

multiple sequence alignment heuristic

progressive alignment (align one sequence at a time, starting with most closely related ones)

51
New cards

clustal series of progressive alignment algorithms

align all pairs of sequences using pairwise alignment (eg NW). score all pairwise alignments. create guide tree based on scores using UPGMA or NJ. align progressively based on the tree.

52
New cards

scoring MSAs

sum of pairs: in each column, score every pair of sequences (# of sequences choose 2 pairs) with BLOSUM62 and add up the pair scores to get the column score. sum all column scores to get the final score.
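
A minimal sketch of sum-of-pairs scoring. To keep it self-contained, a simple match/mismatch/gap function (`simple_score`, an assumption) stands in for BLOSUM62; any function that scores two characters can be passed instead.

```python
from itertools import combinations


def simple_score(x, y, match=1, mismatch=-1, gap=-2):
    """Stand-in for a substitution matrix lookup such as BLOSUM62."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(msa, score_pair=simple_score):
    """Sum pair scores within each column, then sum over all columns."""
    total = 0
    for column in zip(*msa):
        total += sum(score_pair(x, y) for x, y in combinations(column, 2))
    return total


# Column scores: (A,A,A) = 3, (C,C,-) = -3, (G,G,T) = -1
print(sum_of_pairs(["ACG", "ACG", "A-T"]))  # -1
```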

53
New cards

msa can be coloured by:

  • property (ie physicochemical properties of the amino acids, such as charge or hydrophobicity)

  • conservation (colouring similar characters in a column)

54
New cards

measuring conservation

evolutionary conservation is plotted for each column. regions of high conservation may be meaningful.

55
New cards

warning about alignment

pairwise and multiple sequence alignment algorithms can align any sequences. an alignment on its own does not indicate biological meaning.

56
New cards

phylogeny

hypothesis of the evolutionary relationship of a group

57
New cards

phylogenetics

study of evolutionary relationships using gene sequences

58
New cards

phylogenetic tree

visual representation of a phylogeny. root, branch, and nodes.

  • leaf node: sequences/species/taxa/operational taxonomic unit (OTU)
  • non-leaf node: most recent common ancestor of this node's children
59
New cards

phylograms

shows evolutionary distance through branch length

60
New cards

cladograms

only shows tree structure, branch length is not indicative of anything

61
New cards

rooted tree

adds a common ancestor to all nodes

62
New cards

unrooted tree

no common ancestor (root) is specified, so the tree shows relationships without a direction of evolution

63
New cards

outgroup

a taxon known to be more distantly related than all the others. used to root an unrooted tree: the root is placed on the branch joining the outgroup to the rest of the tree

64
New cards

file format for phylogenetic trees

newick format

  • items within the same brackets indicate siblings
  • no root defined
65
New cards

distance based methods for computing trees from MSAs

sequence alignment -> distance matrix -> phylogenetic tree

66
New cards

UPGMA

unweighted pair group method with arithmetic mean

  • takes multiple sequences, creates a table of "distances" for all pairwise comparisons
  • distances can be # of mismatches between sequences, or normalized (divide by number of characters in sequence)
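
A minimal sketch of building the table of pairwise distances from aligned sequences, using the mismatch count described above with optional normalization. The function name and the dict-of-pairs output are assumptions; gap columns are simply counted as mismatches here.

```python
from itertools import combinations


def distance_matrix(seqs, normalize=False):
    """seqs maps name -> aligned sequence; returns {(name1, name2): distance}."""
    distances = {}
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        mismatches = sum(x != y for x, y in zip(s1, s2))
        distances[(n1, n2)] = mismatches / len(s1) if normalize else mismatches
    return distances


aligned = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
print(distance_matrix(aligned))
# {('A', 'B'): 1, ('A', 'C'): 2, ('B', 'C'): 1}
```
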
67
New cards

UPGMA steps

  1. examine the alignment and create a matrix with the distance for each pairwise comparison.
  2. group together the two species with the smallest distance. create a new matrix where they are combined.
  3. fill in the empty cells by averaging the distances from the two combined groups to each other group.
  4. repeat from step 2 until only two groups remain.
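
The steps above can be sketched as follows. This version weights each average by cluster size (the usual UPGMA rule), keeps merging until a single cluster is left (the final merge joins the last two groups), and returns the topology as a Newick string without branch lengths. The function name and input layout are assumptions.

```python
def upgma(names, dist):
    """names: leaf labels; dist: {(a, b): distance} for every pair of leaves."""
    size = {n: 1 for n in names}                        # cluster label -> leaf count
    d = {frozenset(k): v for k, v in dist.items()}
    while len(size) > 1:
        a, b = sorted(min(d, key=d.get))                # closest pair of clusters
        merged = f"({a},{b})"
        for c in size:
            if c in (a, b):
                continue
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            # Size-weighted average of the distances to the two merged clusters.
            d[frozenset((merged, c))] = (size[a] * dac + size[b] * dbc) / (size[a] + size[b])
        del d[frozenset((a, b))]
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size)) + ";"


distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], distances))  # (((A,B),C),D);
```
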
68
New cards

How many codons code for amino acids?

61

69
New cards

If two protein sequences have less than 25% identity, they are not homologous.

False

70
New cards

Which of the following databases store dna sequence information?

  1. Genbank
  2. PDB
  3. Uniref90
  4. All
  5. None

Genbank. PDB and Uniref store proteins.

71
New cards

Which of the following algorithms involve pairwise alignments?

  1. BLAST
  2. Smith-Waterman
  3. Clustal
  4. BLAST and Smith-Waterman
  5. All of the above

All of the above.

72
New cards

NCBI genbank files ___:

  1. Are equivalent to FASTA files
  2. Store sequence alignments
  3. Are flat files
  4. Correspond only to a single gene or protein sequence record.
  5. Contain only primary (raw) data

Are flat files

73
New cards

Which statement is correct? Fasta ___:

  1. is a plain text file format for storing one or more biological sequences
  2. are flat files used to store genome sequences and annotations
  3. contain the sequence of the forward strand only
  4. store the sequence of the forward and reverse strand
  5. cannot contain partial sequences

is a plain text file format for storing one or more biological sequences

74
New cards

Suppose you have the following ORF: "ATGAGCGATCCATAG" How many amino acids are there in the resulting protein?

4 (Stop codon at the end)
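
A quick check of this count, as a minimal sketch: split the ORF into codons and count them until the first stop codon. The function name is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA forms of UAA, UAG, UGA


def orf_protein_length(orf):
    """Count the codons before the first stop codon."""
    count = 0
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


print(orf_protein_length("ATGAGCGATCCATAG"))  # 4 (ATG AGC GAT CCA, then TAG)
```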

75
New cards

Variable positions within a multiple sequence alignment of a protein family ___:

  1. are functionally unimportant
  2. do not impact protein structure
  3. are just as likely to be functionally important as conserved positions
  4. all of the above
  5. none of the above

none of the above (conservation implies functional importance, but the converse does not hold: variability does not imply unimportance)

(general rule: the more conserved something is, the more likely it is to be important)

76
New cards

Which of the following statements is incorrect about blast?

  1. blast ranks identified matches in the database by their E values
  2. blast can be used to identify homologs of a query sequence
  3. blast aligns a query sequence to all database sequences
  4. blast uses heuristics to rapidly identify candidate matches in the database
  5. blast can be used for functional annotation of genomes

blast aligns a query sequence to all database sequences

(blast skips 99.9% of sequences, only properly aligns to part of the database)

Also, BLAST can be used for functional annotations.

77
New cards

The smith waterman algorithm:

  1. replaces negative values in the calculation matrix with 0
  2. uses a harder penalty for mismatches
  3. is guaranteed to compute the optimal local alignment(s)
  4. all of 1, 2 and 3
  5. 1 and 2 only

all of 1, 2 and 3

78
New cards

You have a sequence composed of 100 "ATG" repeats. A blastx search will:

  1. detect homologs of this sequence in the nr database
  2. incorrectly translate the sequence into a string of methionine amino acids
  3. translate the sequence into protein and then perform a blastp search
  4. filter out this sequence as it is a low complexity/repetitive sequence
  5. search all six reading frames against the database

filter out this sequence as it is a low complexity/repetitive sequence (theoretically)

79
New cards

multiple alignment algorithms can detect whether a family of sequences are homologous

false (the algorithm aligns whatever sequences it is given; it does not test whether they are homologous)

80
New cards

a cladogram depicts tree structure but branch lengths do not have meaning

true

81
New cards

heuristic algorithms efficiently explore all possible solutions to a problem

false. heuristic algorithms are shortcuts.

82
New cards

evolutionarily unrelated sequences can still be homologous

false

83
New cards

Suppose a blast search returns a match with E = 1E-40. This match is likely to be a homolog.

True

84
New cards

The clustal algorithm finds the best possible multiple sequence alignment.

False. It is a heuristic, so not exact or perfect.

85
New cards

The nodes of phylogenetic trees can represent most recent common ancestors.

True

86
New cards

in a blast search, if the query coverage of a hit (target sequence) is low it means

  1. the sequences share a low number of "words" in common

  2. the query aligns to a fraction of the target

  3. the target aligns to a fraction of the query

  4. the alignment contains gaps

  5. none of the above

the target aligns to a fraction of the query

87
New cards

What is the query coverage in BLAST?

The percentage of the query sequence's length that is covered by the alignment to the target sequence.

query: the sequence we input to search.
target: a match/hit sequence

88
New cards

How many rooted trees for 3 taxa?

3

89
New cards

How many unrooted trees for 3 taxa?

1

90
New cards

How many rooted trees for 2 taxa?

1

91
New cards

how many unrooted trees for 2 taxa?

1

92
New cards

How many rooted trees for 4 taxa?

15

93
New cards

How many unrooted trees for 4 taxa?

3
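
These counts follow the standard formulas for bifurcating trees with labelled leaves: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa. A minimal sketch, with hypothetical function names:

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    return double_factorial(2 * n - 3)


def unrooted_trees(n):
    return double_factorial(2 * n - 5)


for n in (2, 3, 4, 5):
    print(n, rooted_trees(n), unrooted_trees(n))
# 2 1 1
# 3 3 1
# 4 15 3
# 5 105 15
```

The number of unrooted trees for n taxa equals the number of rooted trees for n-1 taxa, because a root can be placed on any branch.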

94
New cards

UPGMA trees are guaranteed to have the smallest total branch length

false

95
New cards

UPGMA trees have no implication of underlying evolutionary mechanisms

True

96
New cards

Minimum evolution principle

  • the tree with the fewest # of evolutionary steps is the most likely
  • this is a criterion used to decide which tree is the 'best'
  • used in neighbour joining and maximum parsimony methods

97
New cards

Neighbour Joining Method

  • Improvement over UPGMA as it attempts to produce tree with the smallest sum of branch lengths
98
New cards

Neighbour joining basic principle

Find neighbours sequentially that minimize the total length of the tree.

Among all possible pairs of OTUs, the pair that gives the smallest sum of branch lengths is chosen. These OTUs are then regarded as a single OTU, and pairwise comparisons are done again to create a new distance matrix.
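
A minimal sketch of the pair-selection step only. It assumes the standard Q criterion, Q(i,j) = (n-2)·d(i,j) - Σd(i,k) - Σd(j,k), which is not stated on the cards; the pair with the lowest Q is joined. The function name and the example distances are illustrative.

```python
from itertools import combinations


def nj_pair_to_join(names, dist):
    """names: OTU labels; dist: {(a, b): distance}. Returns the pair with lowest Q."""
    d = {frozenset(k): v for k, v in dist.items()}
    n = len(names)
    # Sum of distances from each OTU to all the others.
    total = {a: sum(d[frozenset((a, b))] for b in names if b != a) for a in names}
    best_pair, best_q = None, None
    for a, b in combinations(names, 2):
        q = (n - 2) * d[frozenset((a, b))] - total[a] - total[b]
        if best_q is None or q < best_q:
            best_pair, best_q = (a, b), q
    return best_pair


distances = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
print(nj_pair_to_join(["a", "b", "c", "d", "e"], distances))  # ('a', 'b')
```

In this example d and e are the closest pair (distance 3), so UPGMA would group them first, but neighbour joining picks a and b because that choice gives the lower Q.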

99
New cards

Maximum Parsimony

Character based method that directly uses MSA instead of a distance matrix

Evaluates many possible trees to find which tree(s) are consistent with the fewest # of changes

Each site within the alignment is either phylogenetically informative or uninformative

Informative sites favour some trees over others

100
New cards

Informative Sites

Positions within the alignment that MUST HAVE:

  1. At least 2 different characters (nucleotides or amino acids)
  2. At least 2 of those characters MUST each be present more than once
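
A minimal sketch of finding informative sites, using the common definition that at least two different characters each appear at least twice in the column. Ignoring gap characters is an assumption, as are the function names.

```python
from collections import Counter


def is_informative(column):
    """True if at least two different characters each occur at least twice."""
    counts = Counter(c for c in column if c != "-")
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(msa):
    """Return the 0-based column indices of parsimony-informative sites."""
    return [i for i, column in enumerate(zip(*msa)) if is_informative(column)]


msa = ["AAGA", "AGGA", "ACTA", "ATTC"]
print(informative_sites(msa))  # [2]
```

Column 0 is constant, column 1 has four characters that each appear once, and column 3 has only a single variant, so only column 2 (G, G, T, T) favours some trees over others.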