Database Searching with BLAST and Heuristic Alignment Methods

Development and Origin: BLAST was developed at the National Center for Biotechnology Information (NCBI) in $1990$.
Accessibility: The tool is web-accessible via the official URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi.
Core Functionality:
- BLAST allows users to select a single nucleotide or protein sequence, referred to as the query.
- It performs pairwise alignments between that query and every sequence in a database; each database sequence is referred to as a subject.
- The algorithm is fast, accurate, and easily accessible through web interfaces.
Scientific Utility:
- BLAST finds regions of local similarity between biological sequences.
- It calculates the statistical significance of matches between the query and database sequences.
- It is used to infer functional and evolutionary relationships and to help identify members of gene families.

Heuristic vs. Dynamic Programming: Heuristic algorithms perform searches faster than regular dynamic programming because they examine only a fraction of the possible alignments.
Performance Metrics:
- Heuristic methods are approximately $50$ to $100\times$ faster than traditional dynamic programming methods.
- The trade-off for this speed is a moderate decrease in sensitivity.
Primary Heuristic Algorithms: The two major heuristic algorithms for database searching are BLAST and FASTA.
Alignment Strategy: Both BLAST and FASTA utilize the word method for pairwise sequence alignment.
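
The word method can be illustrated with a minimal sketch. It assumes exact word matches and a simple match-only ungapped extension; production BLAST and FASTA use scored word neighborhoods and score-based extension, so this is a simplification for illustration only. The function names (index_words, find_seeds, extend_seed) are hypothetical.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = index_words(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, k):
    """Extend an exact seed in both directions while letters keep matching.

    Simplifying assumption: extension stops at the first mismatch and
    never opens a gap.
    """
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + k, j + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return query[start_q:end_q]


query = "ACGTACGT"
subject = "TTACGTAA"
seeds = find_seeds(query, subject, k=4)
segment = extend_seed(query, subject, 0, 2, 4)
```

Only the diagonals that contain a seed are examined further, which is why a word-based search touches a fraction of the alignments that full dynamic programming would score.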

Functional Inference: Similarity searches are essential for determining the function of genes sequenced in a laboratory for which no biological information is currently available.
Predictive Logic: If a query sequence can be readily aligned to a database sequence with a known function, structure, or biochemical activity, it is predicted that the query sequence possesses similar properties.
Scoring: The alignments with the best matching sequences are displayed and assigned a score. Statistical evaluations of these alignment scores are performed to determine significance.

Specify the sequence of interest: Provide the query input.
Select BLAST program: Choose the specific algorithm based on the query and database types.
Select a database: Choose the specific collection of sequences to search against.
Select optional search parameters: Customize the search using filters or specific limits.

Input Formats:
- FASTA format: The standard text-based format for representing sequences.
- Accession Number: Identification numbers from RefSeq or GenBank.
- GI Number: GenInfo Identifier, a numeric sequence identifier assigned by NCBI.
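
As a rough illustration of the FASTA format, the sketch below reads records in which a header line starts with ">" and the following lines hold the sequence. The parse_fasta helper and the example records are hypothetical, not part of any BLAST tool.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, made up for illustration.
example = """>query1 hypothetical example record
ATGGCCATTG
TAATGGGCCG
>query2
MKTAYIAKQR
"""

records = list(parse_fasta(example))
```
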
Strand Consideration: If the query is a DNA sequence, BLAST algorithms will search both strands.
Sub-range Selection: The search allows users to select a specific subset of the query sequence, such as a particular domain or region of interest.
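
The two ideas above can be sketched in a few lines. Searching both strands amounts to also considering the reverse complement of the query, and a sub-range is a slice of the query. The helper names are hypothetical, and the 1-based inclusive coordinates are an assumption made for this example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def query_subrange(seq, start, end):
    """Return seq[start..end], using 1-based inclusive coordinates (assumed)."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("sub-range must lie within the query")
    return seq[start - 1:end]


query = "ATGCATGC"
minus_strand = reverse_complement("ATGC")
region = query_subrange(query, 3, 6)
```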

Standard Program Categories:
- blastn: Compares a nucleotide query against a nucleotide database.
- blastp: Compares a protein query against a protein database.
- blastx: Compares a nucleotide query (translated in all $6$ reading frames) against a protein database. This results in $6$ database searches.
- tblastn: Compares a protein query against a nucleotide database (translated in all $6$ reading frames). This results in $6$ database searches.
- tblastx: Compares a nucleotide query (translated in $6$ frames) against a nucleotide database (translated in $6$ frames). This is the most computationally intensive algorithm, performing $36$ total protein-protein database searches.
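
The program categories above can be summarized as a small lookup. This is only a sketch: choose_program and its compare_as_protein flag are hypothetical names, and the search counts are the ones stated in the list.

```python
# Number of database searches per program, as given in the list above.
SEARCH_COUNT = {"blastn": 1, "blastp": 1, "blastx": 6, "tblastn": 6, "tblastx": 36}


def choose_program(query_type, db_type, compare_as_protein=False):
    """Pick a BLAST program from the query type and database type.

    compare_as_protein only matters when both sides are nucleotide:
    it selects tblastx (translated comparison) instead of blastn.
    """
    pair = (query_type, db_type)
    if pair == ("protein", "protein"):
        return "blastp"
    if pair == ("nucleotide", "protein"):
        return "blastx"
    if pair == ("protein", "nucleotide"):
        return "tblastn"
    if pair == ("nucleotide", "nucleotide"):
        return "tblastx" if compare_as_protein else "blastn"
    raise ValueError(f"unsupported combination: {pair}")


program = choose_program("nucleotide", "protein")
searches = SEARCH_COUNT[program]
```
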
Program Selection Logic:
- Finding Encoded Proteins: Use blastx if you have a DNA sequence and want to identify the protein it might encode.
- Searching New Genomes: Use tblastn to find protein homologs in newly sequenced genomes where coding regions are not yet annotated (e.g., Expressed Sequence Tags or ESTs, and draft genomes).
- Divergent Sequences: Use tblastx when comparing highly divergent nucleotide sequences. By translating them into proteins, the tool can detect conserved coding regions and distant homologs that might be missed at the nucleotide level.
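
To make the six reading frames used by the translated programs concrete, here is a minimal sketch of six-frame translation. It assumes the standard genetic code and marks codons containing ambiguous bases as "X"; the frame labels (+1 to +3, -1 to -3) and function names are chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict of frame label -> protein string for all 6 frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each of the six protein strings would be compared against the protein database in blastx, which is where the count of 6 searches comes from; tblastx pairs 6 query frames with 6 database frames, giving 36.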

Protein Databases (for blastp and blastx):
- nr (non-redundant): The default option. It consists of combined protein records from GenBank, PDB (Protein Data Bank), Swiss-Prot, PIR (Protein Information Resource), and PRF (Protein Research Foundation).
- RefSeq: Protein sequences specifically from NCBI’s Reference Sequence Project.
DNA Databases (for blastn, tblastn, and tblastx):
- Nucleotide nr database: Includes sequences from GenBank, EMBL (European Molecular Biology Laboratory), and DDBJ (DNA Data Bank of Japan).

Entrez Query Limits: BLAST searches can be restricted using Entrez query terms.
Logical Operators: Standard terms and Boolean operators are allowed.
- Example: protease NOT hiv1[Organism] limits the search to all proteases except those belonging to HIV-1.
Organism Filtering: Users can limit searches to a specific organism in two ways:
- Selecting from a pull-down menu of common organisms.
- Entering the name in the Entrez Query field using the [Organism] qualifier.
- Example: Mus musculus[Organism].
Program Optimization: For nucleotide searches, the interface allows the search to be optimized for the expected degree of similarity:
- Megablast: For highly similar sequences.
- Discontiguous megablast: For more dissimilar sequences.
- blastn: For somewhat similar sequences.