Bioinformatics Notes

Introduction to Bioinformatics & Biology (CBIO 203)

Sequence Similarity Searching

Sequence comparison is a powerful method for determining evolutionary relationships between genes.
Similarity searching is based on alignment.
Bioinformatics databases: GenBank, Swiss-Prot.
BLAST and FastA provide rapid similarity searching, where 'rapid' implies approximate (heuristic) methods.

Why Search Databases?

To check whether a new DNA sequence is already in the databanks.
To find proteins homologous to a putative coding Open Reading Frame (ORF).
To identify similar non-coding DNA stretches (e.g., repeat elements, regulatory sequences).
To locate false priming sites for PCR oligonucleotides.

Open Reading Frames (ORFs)

ORFs are potential protein-coding regions within a DNA sequence. A double-stranded DNA sequence has 6 possible reading frames: 3 on each strand.
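As a minimal sketch of the six reading frames, the snippet below enumerates them for a short DNA string. The function names are illustrative, and the code assumes an uppercase sequence containing only A, C, G and T.

```python
# Complement of each base; assumes an unambiguous uppercase DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the six reading frames, each trimmed to whole codons."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            sub = strand[offset:]
            # Drop trailing bases that do not fill a complete codon.
            frames.append(sub[: len(sub) - len(sub) % 3])
    return frames


for frame in six_frames("ATGAAATAG"):
    print(frame)
```

The first three frames come from the given strand and the last three from its reverse complement.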

Similarity vs. Homology

Similarity is the measure of how alike two sequences are.
Homology implies evolutionary descent from a common ancestor.
$25\%$ similarity over 100 amino acids is strong evidence for homology.
Homology is an evolutionary statement indicating descent from a common ancestor, implying:
- Common 3D structure
- Usually common function
Homology is all or nothing; sequences are either homologous or not.
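Similarity, unlike homology, is a number that can be computed. The sketch below measures percent identity between two already-aligned sequences. It assumes gaps are written as "-" and that columns containing a gap are excluded from the denominator; other conventions for the denominator exist.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("ACDE", "ACDF"))  # 3 of 4 columns match: 75.0
```

The result is only a measure of similarity; deciding whether two sequences are homologous is a separate inference.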

Available Databases

DNA (Nucleotide Sequences):

The big databases: GenBank, EMBL, DDBJ (weekly updates, exchange info routinely).
Genomic databases: Human (GDB), Mouse (MGD), Yeast (SGD).
Special databases: ESTs (Expressed Sequence Tags), STSs (Sequence-Tagged Sites), EPD (Eukaryotic Promoter Database), REPBASE (Repetitive Sequence Database).

Protein (Amino Acid Sequences):

The big databases: Swiss-Prot (high annotation level), PIR (Protein Identification Resource).
Translated databases: SP-TREMBL (translated EMBL), GenPept (translation of coding regions in GenBank).
Special databases: PDB (sequences derived from 3D structures).

Homologous Sequence

A homologous sequence is similar to another sequence due to common ancestry.
Homologous proteins share similar folding or structure.

DNA vs. Protein Searches

DNA consists of 4 characters (A, G, C, T); $25\%$ identity may occur by chance.
Proteins consist of 20 characters (amino acids), improving comparison sensitivity.
Convergence of proteins is rare; high similarity usually implies homology.
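The chance-identity figure can be checked with a short calculation. The sketch below assumes every character is equally likely and positions are independent, which is a simplification: real base and amino acid frequencies are not uniform.

```python
def expected_chance_identity(frequencies):
    """Probability that two independently drawn characters are identical."""
    return sum(p * p for p in frequencies)


dna = [1 / 4] * 4        # assumed uniform base frequencies
protein = [1 / 20] * 20  # assumed uniform amino acid frequencies

print(f"DNA:     {expected_chance_identity(dna):.0%}")
print(f"Protein: {expected_chance_identity(protein):.0%}")
```

Under these assumptions two unrelated DNA sequences agree at about one position in four, while two unrelated proteins agree at about one position in twenty.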

Nucleotide vs Protein Sequences

When searching for similarity, it's generally better to use protein sequences when possible.
If starting with a nucleotide sequence, consider translating it to protein and searching protein databases.
Translating to amino acids may lose information due to the degeneracy of the genetic code (multiple codons for the same amino acid).
Very different DNA sequences can code for similar protein sequences.
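To illustrate the degeneracy of the genetic code, the sketch below translates two DNA sequences that differ at most positions yet encode the same peptide. It uses the standard genetic code; the two example sequences are made up for illustration.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate frame 1 of an uppercase DNA string; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


seq_a = "CTGAGCCGT"  # hypothetical example sequence
seq_b = "TTATCACGA"  # hypothetical example sequence
identical = sum(x == y for x, y in zip(seq_a, seq_b))
print(translate(seq_a), translate(seq_b))            # LSR LSR
print(f"DNA identity: {identical}/{len(seq_a)}")     # DNA identity: 3/9
```

A DNA-level search would see only 3 of 9 matching bases here, while a protein-level search sees a perfect match.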

Conclusion

Proteins are better for database similarity searches:
- DNA comparisons have more random matches.
- DNA databases are larger, leading to more random hits.
- Protein searches use more sensitive matrices (PAM, BLOSUM).
- Proteins are more conserved in evolution than DNA.

Main Algorithms for Database Searching

FastA: Better for nucleotides than proteins.
BLAST (Basic Local Alignment Search Tool): Better for proteins than nucleotides.
Smith-Waterman: More sensitive than FastA or BLAST.

Global vs. Local Similarity

Global Similarity: Uses complete aligned sequences; calculates the total percentage of matches.
- Needleman-Wunsch algorithm.
- Not suitable for database searching.
Local Similarity: Finds the best internal matching region between two sequences.
- Smith-Waterman algorithm.
- BLAST and FASTA.
Dynamic Programming: Finds the optimal alignment for a given scoring scheme; it is exact, not approximate.
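The difference between global and local dynamic programming can be sketched in a few lines. The code below computes only the optimal score (no traceback) with a simple linear gap penalty; the scoring values are illustrative assumptions, not defaults of any program.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of internal regions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[-1])
        prev = cur
    return best


print(global_score("AAAA", "TTTT"))  # -4: forced to align everything
print(local_score("AAAA", "TTTT"))   # 0: no matching region at all
print(local_score("TTTTACGTTTTT", "GGGACGGGG"))  # 3: the shared ACG
```

The only structural difference is the zero floor and the running maximum in the local version, which is why local alignment can pick out a short matching region inside otherwise unrelated sequences.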

Search with Protein, not DNA Sequences

20 amino acids vs. 4 DNA bases means less chance of random similarity.
Proteins have varying degrees of similarity between amino acids.
Protein databanks are smaller than DNA databanks.
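The varying degrees of similarity between amino acids are what substitution matrices encode. The sketch below scores ungapped peptide pairs using a small excerpt of BLOSUM62 values covering only I, L, D and E; a real search uses the full 20 x 20 matrix, and the helper names are illustrative.

```python
# Excerpt of BLOSUM62 scores; keys are alphabetically sorted residue pairs.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("L", "L"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("I", "L"): 2, ("D", "E"): 2,                  # chemically similar
    ("D", "I"): -3, ("E", "I"): -3,                # dissimilar
    ("D", "L"): -4, ("E", "L"): -3,
}


def pair_score(a, b):
    """Look up the score of a residue pair, in either order."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]


def ungapped_score(seq_a, seq_b):
    """Sum pair scores over two equal-length, ungapped peptides."""
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


print(ungapped_score("ILDE", "ILDE"))  # 19: identical residues
print(ungapped_score("ILDE", "LIED"))  # 8: no identities, all conservative
print(ungapped_score("IL", "DE"))      # -6: dissimilar residues
```

The second comparison has 0% identity yet still scores positively, which is information a DNA-style match/mismatch score cannot capture.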

Programs for Searching

BLAST: Fastest, easily accessed on the web.
- Limited database sets.
- Good translation tools (BLASTX, TBLASTN).
FASTA: Better/more complete alignments.
- Precise choice of databases.
- More sensitive for DNA-DNA comparisons.
- FASTX and TFASTX can find similarities in sequences with frameshifts.
Smith-Waterman: Slower, more sensitive.
- Rigorous or exhaustive search.
- SSEARCH is part of the FASTA package.

FastA

First rapid database search utility.
50 times faster than Dynamic Programming.
Based on a heuristic approach (not guaranteed to locate the optimal solution).

FastA Algorithm steps

Compute the best diagonals across all alignment offsets (frames), following dot plot logic.
The word method looks for exact matches between words in the query and the test sequence.
- DNA words are often 6 bases.
- Protein words are 1 or 2 amino acids.
- Diagonals are searched only in the region of word matches, which speeds up the search.
After all diagonals are found, FastA attempts to join them by introducing gaps.
Compute alignments in regions of best diagonals.
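The word-matching and diagonal steps can be sketched as follows. This is a simplified illustration rather than FastA's actual implementation: it only counts exact k-word hits per diagonal, where a diagonal is identified by the offset (query position minus target position). The function names are hypothetical.

```python
from collections import defaultdict


def kword_index(seq, k):
    """Map every k-word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k):
    """Count exact k-word matches on each diagonal (query pos - target pos)."""
    index = kword_index(query, k)
    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[i - j] += 1
    return dict(hits)


hits = diagonal_hits("ACGTAC", "TTACGT", k=2)
print(hits)                     # {2: 2, -2: 3}
print(max(hits, key=hits.get))  # -2: the diagonal with the most word hits
```

Because the query is indexed once and each target word is a dictionary lookup, only diagonals that actually contain word matches are ever visited.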

FastA Algorithm Overview

FastA locates regions of high word match density between the query and search set sequences.
The ten highest-scoring regions are rescored using a scoring matrix; the top score is saved as init1.
FastA determines if initial regions from different diagonals can be joined, forming an approximate alignment with gaps; only non-overlapping regions can be joined.
The score for joined regions is the sum of initial region scores minus a joining penalty for each gap; highest score is saved as initn.
FastA then uses a variation of the Smith-Waterman algorithm to determine the best segment of similarity; the score for this alignment is the opt score.
FastA uses linear regression against the natural log of the search set sequence length to compute a normalized z-score for the sequence pair.
Using the z-score distribution, the program estimates the number of sequences expected to produce a z-score greater than or equal to the obtained z-score by chance. This is the E-score.
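The length normalization can be sketched as an ordinary least-squares fit of score against the natural log of sequence length, followed by scaling each residual by the residual spread. This is a deliberately simplified stand-in: the real program's procedure is more involved (for example, it has to keep genuinely related sequences from distorting the fit), and the function name and sample numbers below are hypothetical.

```python
import math


def length_normalized_z(scores, lengths):
    """z-score of each score after regressing scores on ln(length)."""
    xs = [math.log(n) for n in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual: how far each score sits above the length-expected score.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    spread = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [r / spread for r in residuals]


# Hypothetical opt scores for five database sequences of various lengths.
z = length_normalized_z([30, 34, 95, 41, 44], [100, 200, 300, 400, 800])
print([round(value, 2) for value in z])
```

A sequence whose score is far above what its length predicts gets a large positive z-score, which is what the E-score estimate is then based on.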

FastA Algorithm - Five Steps:

Identify common k-words between sequences I and J.
Score diagonals with k-word matches; identify the 10 best diagonals.
Rescore initial regions with a substitution score matrix.
Join initial regions using gaps, penalizing for gaps.
Perform dynamic programming to find final alignments.

FASTA Format

Simple format used by almost all programs.
- A header line that starts with > and ends with a [return].
- The sequence on the following lines (no specific requirements for line length, characters, etc.).
Example:
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
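Because the format is so simple, a reader for it fits in a few lines. The sketch below is a minimal parser, assuming the whole file is available as one string; the function name is illustrative, and established libraries offer more robust readers.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span any number of lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n"
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

Sequence lines are concatenated regardless of their length, which matches the format's lack of line-length requirements.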

BLAST (Basic Local Alignment Search Tool)

Most widely used and referenced computational biology/bioinformatics resource.
Improves search speed of FASTA while retaining search sensitivity.

Key Features of BLAST

Uses word matching like FastA.
Matches similar words, not only identical ones (word length 3 for amino acids, 11 for bases).
If no words are similar, no alignment is performed.
The original BLAST does not handle gaps effectively.
The newer “gapped BLAST” (BLAST 2) handles gaps better.
BLAST searches can be sent to the NCBI's server or run on a personal computer.

BLAST Algorithm Steps

Word List Generation: For the query, find the list of high scoring words of length w.
Database Comparison: Compare the word list to the database and identify exact matches.
Extension: For each word match, extend the alignment in both directions to find alignments that score greater than a threshold value S.
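The extension step can be sketched as an ungapped extension in both directions from a word match, stopping once the running score falls too far below the best score seen so far. The scoring values, the drop-off cutoff and the function name are illustrative assumptions, not BLAST's actual defaults or code.

```python
def extend_seed(query, target, qi, ti, w, match=1, mismatch=-3, x_drop=5):
    """Extend an exact word match of length w without gaps.

    Returns (best_score, query_start, query_end) of the extended segment.
    """
    best = running = w * match
    best_right = step = 0
    # Extend to the right of the seed.
    while qi + w + step < len(query) and ti + w + step < len(target):
        same = query[qi + w + step] == target[ti + w + step]
        running += match if same else mismatch
        step += 1
        if running > best:
            best, best_right = running, step
        elif best - running > x_drop:
            break  # score dropped too far below the best seen
    running = best
    best_left = step = 0
    # Extend to the left of the seed.
    while qi - 1 - step >= 0 and ti - 1 - step >= 0:
        same = query[qi - 1 - step] == target[ti - 1 - step]
        running += match if same else mismatch
        step += 1
        if running > best:
            best, best_left = running, step
        elif best - running > x_drop:
            break
    return best, qi - best_left, qi + w + best_right


query, target = "GGACGTACGG", "TTACGTACTT"
score, start, end = extend_seed(query, target, qi=3, ti=3, w=4)
print(score, query[start:end])  # 6 ACGTAC
```

The returned score is what would then be compared against the threshold S to decide whether the segment is reported.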

BLAST Algorithm

Filter out low complexity regions
Locate k-tuples (words) in the query sequence:
- Word length 3 for amino acids
- Word length 11 for nucleotides
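Low-complexity filtering can be illustrated with a sliding-window Shannon entropy measure. This is a stand-in chosen for simplicity, not the filter BLAST itself applies (BLAST uses dedicated programs such as DUST for nucleotides and SEG for proteins); the window size, threshold and function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the character composition of a segment."""
    total = len(segment)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(segment).values())


def low_complexity_windows(seq, window=4, threshold=1.0):
    """Start positions of windows whose entropy falls below the threshold."""
    return [i for i in range(len(seq) - window + 1)
            if shannon_entropy(seq[i:i + window]) < threshold]


print(shannon_entropy("ACGT"))             # 2.0: all four bases equally used
print(low_complexity_windows("AAAAACGT"))  # [0, 1, 2]: the A-rich start
```

Regions flagged this way would be masked before word generation, so that runs such as poly-A do not seed large numbers of meaningless matches.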

BLAST Steps

Seeding: Prepare a list of short, fixed-length segments (words) from the query.
Searching: Find highly similar or exact matches for each word.
Extension: Extend each match to a potentially longer match.
Evaluation: Evaluate the results using E-values.
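For the evaluation step, the E-value of a hit can be estimated from its bit score S' and the size of the search space as E = m * n * 2^(-S'), where m is the query length and n the database length. The sketch below applies that formula directly; it ignores the edge-effect corrections a real program applies to m and n, and the lengths used are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 10 million residues.
for bits in (20, 40, 60):
    print(bits, expect_value(bits, 300, 10_000_000))
```

Each additional bit of score halves the E-value, while doubling the database size doubles it, which is one reason the same alignment is less significant in a larger database.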

BLAST Programs

BLASTP: Protein query against a protein database (allows gaps).
BLASTN: DNA query against a DNA database (allows gaps).
BLASTX: Translated DNA query (six reading frames) against a protein database (allows gaps).
TBLASTN: Protein query against a translated DNA database (six reading frames, allows gaps).
TBLASTX: Translated DNA query (six reading frames) against a translated DNA database (six reading frames, no gaps).

NCBI Resources

NCBI (National Center for Biotechnology Information) provides public databases, conducts research in computational biology, develops software tools, and disseminates biomedical information.
Available databases: PubMed, OMIM, Books.
Molecular Databases: Sequences, structures.
Entrez Tools.
BLAST can be accessed through the NCBI website.

BLAST Usage

The NCBI BLAST page allows searching for nucleotide or protein sequences.
Users can enter accession numbers, GI numbers, or FASTA sequences.
The page allows users to select the database to search (e.g., nucleotide collection).
It provides options to optimize for highly similar or more dissimilar sequences.

BLAST Output

BLAST results display a distribution of hits on the query sequence.
Color keys indicate alignment scores.
The output lists sequences producing significant alignments with scores, E-values, and percent identity.
Links provide access to the aligned sequences.

Example BLAST Output

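Score = 26.3 bits (13), Expect = 614
Identities = 13/13 (100%), Gaps = 0/13 (0%)
An Expect value this large means that a perfect 13-base match is expected to occur many times by chance, so this hit is not significant.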