Bioinformatics Notes
Introduction to Bioinformatics & Biology (CBIO 203)
Sequence Similarity Searching
- Sequence comparison is a powerful method for determining evolutionary relationships between genes.
- Similarity searching is based on alignment.
- Bioinformatics databases: GenBank, Swiss-Prot.
- BLAST and FastA provide rapid similarity searching, where 'rapid' implies approximate (heuristic) methods.
Why Search Databases?
- To verify if a new DNA sequence is already in databanks.
- To find proteins homologous to a putative coding Open Reading Frame (ORF).
- To identify similar non-coding DNA stretches (e.g., repeat elements, regulatory sequences).
- To locate false priming sites for PCR oligonucleotides.
Open Reading Frames (ORFs)
- ORFs are potential protein-coding regions within a DNA sequence. A double-stranded DNA sequence can be read in 6 reading frames (3 on each strand).
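A minimal sketch of how the six reading frames and their ORFs could be enumerated. It assumes uppercase DNA, the standard start codon ATG and the standard stop codons; the function names and the `min_codons` parameter are illustrative, not taken from any particular tool.

```python
def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(dna))


def six_frames(dna):
    """Return the six reading frames: three forward, then three reverse."""
    reverse = reverse_complement(dna)
    return [strand[offset:] for strand in (dna, reverse) for offset in range(3)]


def find_orfs(frame, min_codons=2):
    """Yield ATG...stop stretches found while reading `frame` codon by codon."""
    stops = {"TAA", "TAG", "TGA"}
    start = None
    for i in range(0, len(frame) - 2, 3):
        codon = frame[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first start codon opens a candidate ORF
        elif start is not None and codon in stops:
            # min_codons counts the codons before the stop (ATG included)
            if (i - start) // 3 >= min_codons:
                yield frame[start:i + 3]
            start = None


def all_orfs(dna, min_codons=2):
    """Collect ORFs from all six reading frames."""
    return [orf for frame in six_frames(dna) for orf in find_orfs(frame, min_codons)]
```

Calling `all_orfs` on a sequence scans each frame independently, which is why a putative ORF has to be checked in all six frames before translating it.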
Similarity vs. Homology
- Similarity is the measure of how alike two sequences are.
- Homology implies evolutionary descent from a common ancestor.
- Significant similarity over a stretch of 100 or more amino acids is strong evidence for homology.
- Homology is an evolutionary statement indicating descent from a common ancestor, implying:
  - Common 3D structure
  - Usually common function
- Homology is all or nothing; sequences are either homologous or not.
Available Databases
DNA (Nucleotide Sequences):
- The big databases: GenBank, EMBL, DDBJ (weekly updates, exchange info routinely).
- Genomic databases: Human (GDB), Mouse (MGD), Yeast (SGD).
- Special databases: ESTs (Expressed Sequence Tags), STSs (Sequence-Tagged Sites), EPD (Eukaryotic Promoter Database), REPBASE (Repetitive Sequence Database).
Protein (Amino Acid Sequences):
- The big databases: Swiss-Prot (high annotation level), PIR (Protein Identification Resource).
- Translated databases: SP-TREMBL (translated EMBL), GenPept (translation of coding regions in GenBank).
- Special databases: PDB (sequences derived from 3D structures).
Homologous Sequence
- A homologous sequence is similar to another sequence due to common ancestry.
- Homologous proteins share similar folding or structure.
DNA vs. Protein Searches
- DNA consists of 4 characters (A, G, C, T); identity may occur by chance.
- Proteins consist of 20 characters (amino acids), improving comparison sensitivity.
- Convergence of proteins is rare; high similarity usually implies homology.
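The effect of alphabet size on chance matches can be made concrete with a little arithmetic. This sketch assumes every character is equally likely and positions are independent, which real sequences only approximate; the function names are illustrative.

```python
def chance_identity(alphabet_size):
    """Probability that two random characters are identical."""
    return 1.0 / alphabet_size


def expected_chance_matches(length, alphabet_size):
    """Expected number of identical positions between two random sequences."""
    return length * chance_identity(alphabet_size)


def chance_word_match(alphabet_size, word_length):
    """Probability that two random words of the given length match exactly."""
    return chance_identity(alphabet_size) ** word_length


# Over 100 aligned positions, random DNA is expected to agree at about
# 25 positions, random protein at about 5.
dna_matches = expected_chance_matches(100, 4)
protein_matches = expected_chance_matches(100, 20)
```

Under these assumptions, a given level of identity is far less likely to arise by chance between proteins than between DNA sequences.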
Nucleotide vs Protein Sequences
- When searching for similarity, it's generally better to use protein sequences when possible.
- If starting with a nucleotide sequence, consider translating it to protein and searching protein databases.
- Translating to amino acids may lose information due to the degeneracy of the genetic code (multiple codons for the same amino acid).
- Very different DNA sequences can code for similar protein sequences.
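The degeneracy point can be illustrated with a small sketch: two DNA sequences that differ at most positions but translate to the same peptide. The codon table is the standard genetic code; the two example sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate uppercase DNA in frame 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percentage of identical positions over the shorter sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / min(len(a), len(b))


# Hypothetical sequences: different codons for Leu, Ser and Arg.
seq1 = "TTATCTCGT"
seq2 = "CTGAGCAGA"
dna_identity = percent_identity(seq1, seq2)
protein_identity = percent_identity(translate(seq1), translate(seq2))
```

Here the two DNA sequences share only 2 of 9 positions, yet both translate to `LSR`, so a protein-level search would detect a relationship that a DNA-level search could miss.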
Conclusion
- Proteins are better for database similarity searches:
  - DNA comparisons have more random matches.
  - DNA databases are larger, leading to more random hits.
  - Protein searches use more sensitive matrices (PAM, BLOSUM).
  - Proteins are more conserved in evolution than DNA.
Main Algorithms for Database Searching
- FastA: Better for nucleotides than proteins.
- BLAST (Basic Local Alignment Search Tool): Better for proteins than nucleotides.
- Smith-Waterman: More sensitive than FastA or BLAST.
Global vs. Local Similarity
- Global Similarity: Uses complete aligned sequences; calculates total matches.
  - Needleman-Wunsch algorithm.
  - Not suitable for database searching.
- Local Similarity: Finds the best internal matching region between two sequences.
  - Smith-Waterman algorithm.
  - BLAST and FASTA.
- Dynamic Programming (Needleman-Wunsch, Smith-Waterman): Computes the optimal alignment exactly rather than approximately.
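The difference between global and local scoring can be sketched with one dynamic programming routine. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration; real searches use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Optimal alignment score by dynamic programming.

    local=False: Needleman-Wunsch (global, end to end).
    local=True:  Smith-Waterman (best internal region).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


# Two sequences sharing an internal region but with unrelated flanks.
seq_a = "GGGGACGTACGT"
seq_b = "ACGTACGTCCCC"
local_score = alignment_score(seq_a, seq_b, local=True)
global_score = alignment_score(seq_a, seq_b, local=False)
```

The local score rewards the shared `ACGTACGT` region alone, while the global score is dragged down by the unrelated ends, which is why local methods suit database searching.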
Search with Protein, not DNA Sequences
- 20 amino acids vs. 4 DNA bases means less chance of similarity arising at random.
- Proteins have varying degrees of similarity between amino acids.
- Protein databanks are smaller than DNA databanks.
Programs for Searching
- BLAST: Fastest, easily accessed on the web.
  - Limited database sets.
  - Good translation tools (BLASTX, TBLASTN).
- FASTA: Better/more complete alignments.
  - Precise choice of databases.
  - More sensitive for DNA-DNA comparisons.
  - FASTX and TFASTX can find similarities in sequences with frameshifts.
- Smith-Waterman: Slower, more sensitive.
  - Rigorous or exhaustive search.
  - SSEARCH is part of the FASTA package.
FastA
- First rapid database search utility.
- 50 times faster than Dynamic Programming.
- Based on a heuristic approach (not guaranteed to locate the optimal solution).
FastA Algorithm steps
- Compute best diagonals from all frames of alignment derived from dot plot logic.
- Word method looks for exact matches between words in query and test sequence.
- DNA words are often 6 bases.
- Protein words are 1 or 2 amino acids.
- Only searches for diagonals in the region of word matches, which makes searching faster.
- After all diagonals are found, an attempt is made to join diagonals by adding gaps.
- Compute alignments in regions of best diagonals.
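The word-matching and diagonal steps can be sketched as follows. This is a simplified illustration of the dot plot logic, not the FastA implementation: it only counts exact word matches per diagonal, and the function names are hypothetical.

```python
from collections import defaultdict


def word_positions(sequence, k):
    """Map each k-letter word to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k):
    """Count exact word matches per diagonal (query offset minus target offset)."""
    index = word_positions(target, k)
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            counts[i - j] += 1
    return dict(counts)


def best_diagonals(query, target, k, top=10):
    """Return the `top` diagonals with the most word matches."""
    counts = diagonal_hits(query, target, k)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]


hits = best_diagonals("ACGTACGT", "TTACGTACGT", k=4)
```

Here the diagonal with offset -2 collects the most word matches, corresponding to the query aligning two positions into the target. Restricting later work to such dense diagonals is what makes the search fast.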
FastA Algorithm Overview
- FastA locates regions of high word match density between the query and search set sequences.
- The ten highest-scoring regions are rescored using a scoring matrix; the top score is saved as `init1`.
- FastA determines if initial regions from different diagonals can be joined, forming an approximate alignment with gaps; only non-overlapping regions can be joined.
- The score for joined regions is the sum of initial region scores minus a joining penalty for each gap; highest score is saved as `initn`.
- FastA then uses a variation of the Smith-Waterman algorithm to determine the best segment of similarity; the score for this alignment is the `opt` score.
- FastA uses linear regression against the natural log of the search set sequence length to compute a normalized z-score for the sequence pair.
- Using the z-score distribution, the program estimates the number of sequences expected to produce a z-score greater than or equal to the obtained z-score by chance. This is the E-score.
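The relationship between `init1` and `initn` can be sketched as a small chaining problem. The region tuples, their scores and the penalty values below are hypothetical, and the compatibility rule (a region may follow another only if it starts after it ends in both sequences) is a simplifying assumption rather than FastA's exact procedure.

```python
def init1_score(regions):
    """Best score of any single initial region."""
    return max((region[4] for region in regions), default=0)


def initn_score(regions, joining_penalty):
    """Best summed score over joined, non-overlapping initial regions.

    Each region is (q_start, q_end, s_start, s_end, score) with exclusive
    ends; each join between two regions costs `joining_penalty`.
    """
    ordered = sorted(regions)  # sorts by query start first
    best = []  # best[i]: best chain score ending with region i
    for i, (q_start, _, s_start, _, score) in enumerate(ordered):
        chained = 0
        for j in range(i):
            _, prev_q_end, _, prev_s_end, _ = ordered[j]
            if prev_q_end <= q_start and prev_s_end <= s_start:
                chained = max(chained, best[j] - joining_penalty)
        best.append(score + chained)
    return max(best, default=0)


# Hypothetical initial regions: the third overlaps the first in the query.
regions = [(0, 10, 0, 10, 30), (12, 20, 15, 23, 25), (5, 15, 40, 50, 20)]
```

With a joining penalty of 20, the first two regions are worth joining (30 + 25 - 20 = 35), so `initn` exceeds `init1`; with a penalty of 40 the join no longer pays and both scores are 30.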
FastA Algorithm - Five Steps:
- Identify common k-words between sequences I and J.
- Score diagonals with k-word matches; identify the 10 best diagonals.
- Rescore initial regions with a substitution score matrix.
- Join initial regions using gaps, penalizing for gaps.
- Perform dynamic programming to find final alignments.
FASTA Format
- Simple format used by almost all programs.
- `>header line` with a `[return]` at the end.
- Sequence (no specific requirements for line length, characters, etc.).
- Example:
    >URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
    CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
    ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
    GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
    CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
    TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
    GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
    CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
    TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
    GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
    CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
    CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
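Because the format is so simple, reading it takes only a few lines. The following is a minimal sketch of a FASTA reader, assuming the whole file fits in memory; the function name is illustrative.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the leading '>'
        elif header is not None:
            chunks.append(line)  # sequence may span any number of lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 demo\nACGT\nTTAA\n>seq2\nGG\n"
records = parse_fasta(example)
```

Sequence lines are joined regardless of their length, which reflects the lack of line-length requirements in the format.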
BLAST (Basic Local Alignment Search Tool)
- Most widely used and referenced computational biology/bioinformatics resource.
- Improves on the search speed of FASTA while retaining comparable search sensitivity.
Key Features of BLAST
- Uses word matching like FastA.
- Similarity matching of words (3 amino acids for proteins, 11 bases for DNA).
- If no words are similar, no alignment is performed.
- The original BLAST does not handle gaps effectively.
- The newer “gapped BLAST” (BLAST 2) handles gaps better.
- BLAST searches can be sent to the NCBI's server or run on a personal computer.
BLAST Algorithm Steps
- Word List Generation: For the query, find the list of high scoring words of length `w`.
- Database Comparison: Compare the word list to the database and identify exact matches.
- Extension: For each word match, extend the alignment in both directions to find alignments that score greater than a threshold value `S`.
BLAST Algorithm
- Filter out low complexity regions.
- Locate k-tuples (words) in the query sequence:
  - Word length 3 for amino acids
  - Word length 11 for nucleotides
BLAST Steps
- Seeding: Prepare a list of short, fixed-length segments (words) from the query.
- Searching: Find highly similar or exact matches for each word.
- Extension: Extend each match to a potentially longer match.
- Evaluation: Evaluate the results using E-values.
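The seeding, searching and extension steps can be sketched for nucleotide sequences as below. This is a simplified illustration, not the BLAST implementation: it uses exact word matches only, ungapped extension, and assumed scoring values (match +1, mismatch -3, drop-off 5). The evaluation step is omitted.

```python
def _walk(query, subject, i, j, step, match, mismatch, x_drop):
    """Extend from (i, j) in direction `step`; return (best gain, columns kept)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        gain += match if query[i] == subject[j] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen: stop extending
        i += step
        j += step
    return best_gain, best_length


def seed_and_extend(query, subject, w, match=1, mismatch=-3, x_drop=5):
    """Return the best ungapped hit as (score, query_start, subject_start, length)."""
    # Searching: index every word of length w in the subject.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    best = None
    # Seeding: every overlapping word of length w in the query.
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], ()):
            # Extension: grow the word hit in both directions.
            left_gain, left_len = _walk(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
            right_gain, right_len = _walk(query, subject, i + w, j + w, 1, match, mismatch, x_drop)
            score = w * match + left_gain + right_gain
            hit = (score, i - left_len, j - left_len, left_len + w + right_len)
            if best is None or hit[0] > best[0]:
                best = hit
    return best


core = "AGTCTGATCAAG"
hit = seed_and_extend("CCCC" + core + "CCCC", "TTTT" + core + "TTTT", w=4)
```

In this example every seed inside the shared core extends to the same 12-base match, and extension stops in the unrelated flanks once the score has dropped by more than the drop-off value.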
BLAST Programs
- BLASTP: Protein query against a protein database (allows gaps).
- BLASTN: DNA query against a DNA database (allows gaps).
- BLASTX: Translated DNA query (six reading frames) against a protein database (allows gaps).
- TBLASTN: Protein query against a translated DNA database (six reading frames, allows gaps).
- TBLASTX: Translated DNA query (six reading frames) against a translated DNA database (six reading frames, no gaps).
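The choice of program follows directly from the type of the query and the type of the database, which can be written as a small lookup. The function and its `translate_both` flag are hypothetical helpers for illustration, not part of BLAST itself.

```python
BLAST_PROGRAMS = {
    # (query type, database type): program
    ("protein", "protein"): "blastp",
    ("dna", "dna"): "blastn",
    ("dna", "protein"): "blastx",   # query translated in six frames
    ("protein", "dna"): "tblastn",  # database translated in six frames
}


def choose_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program name from the query and database types."""
    if translate_both:
        # tblastx compares translations of a DNA query and a DNA database.
        if (query_type, database_type) != ("dna", "dna"):
            raise ValueError("tblastx needs a DNA query and a DNA database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]
```

For example, a new nucleotide sequence searched against a protein database maps to `blastx`, in line with the advice to search at the protein level when possible.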
NCBI Resources
- NCBI (National Center for Biotechnology Information) provides public databases, conducts research in computational biology, develops software tools, and disseminates biomedical information.
- Available databases: PubMed, OMIM, Books.
- Molecular Databases: Sequences, structures.
- Entrez Tools.
- BLAST can be accessed through the NCBI website.
BLAST Usage
- The NCBI BLAST page allows searching for nucleotide or protein sequences.
- Users can enter accession numbers, GI numbers, or FASTA sequences.
- The page allows users to select the database to search (e.g., nucleotide collection).
- It provides options to optimize for highly similar or more dissimilar sequences.
BLAST Output
- BLAST results display a distribution of hits on the query sequence.
- Color keys indicate alignment scores.
- The output lists sequences producing significant alignments with scores, E-values, and percent identity.
- Links provide access to the aligned sequences.
Example BLAST Output
- The score is 26.3 bits (13), Expect = 614
- Identities = 13/13 (100%), Gaps = 0/13 (0%)
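The bit score and the Expect value in this output are linked by the Karlin-Altschul relation E = m × n × 2^(−S′), where S′ is the bit score and m and n are the query and database lengths. The sketch below applies that relation directly. BLAST itself uses effective (edge-corrected) lengths, so the search space computed here is only an approximation.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least `bit_score` bits."""
    return query_length * database_length * 2.0 ** (-bit_score)


def implied_search_space(bit_score, expect):
    """Search space (m * n) implied by a reported bit score and E-value."""
    return expect * 2.0 ** bit_score


# Values from the example output above.
space = implied_search_space(26.3, 614)

# Each additional bit of score halves the expected number of chance hits.
halved = expect_value(27.3, 1, space)
```

An Expect value of 614 means that hundreds of alignments this good would be expected by chance in a search space of this size, so a perfect 13/13 match is not evidence of homology on its own.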
```