Bioinformatics Comprehensive Notes

Bioinformatics

Bioinformatics is the application of Information Technology to store, organize, and analyze vast amounts of biological data.
The stored data includes sequences and structures of proteins and nucleic acids.
Nucleic acid information is stored as sequences.
Protein data is stored as sequences and structures.
Sequences are represented in one dimension, while structures contain three-dimensional data.

Merging Disciplines

Bioinformatics merges biology, mathematics, statistics, computer science, and IT into a single discipline to process biological data.
High-throughput instruments, such as automated DNA sequencers, are used to read biological data quickly.
The term "Bioinformatics" was coined by Paulien Hogeweg and Ben Hesper in 1970.

Necessity of Bioinformatics

Bioinformatics is necessary due to the explosion of publicly available genomic information, like that from the Human Genome Project.
It allows better understanding of gene analysis, taxonomy, and evolution.
It speeds up rational drug design and reduces manual drug development time.

Goals of Bioinformatics

To uncover biological information hidden in sequence, structure, literature, and biological data.
Used in molecular medicine.
Offers environmental benefits by identifying bacteria that can clean up waste.
In agriculture, it can produce high-yield, low-maintenance crops.

Fields of Bioinformatics

Molecular Medicine
Gene Therapy
Drug Development
Microbial genome applications
Crop Improvement
Forensic Analysis of Microbes
Biotechnology
Evolutionary Studies
Bio-Weapon Creation

Applications of Bioinformatics

Experimental Molecular Biology
Genetics and Genomics
Generating Biological Data
Analysis of gene and protein expression
Comparison of genomic data
Simulation & Modeling of DNA, RNA & Protein

Specific Bioinformatics Applications

1. Prediction of Protein Structure

Determining complex protein structures using bioinformatics tools, from primary (amino acid sequence) to secondary, tertiary, or quaternary structures.

2. Genome Annotation

Marking genomes to identify regulatory sequences and protein-coding regions, important for the Human Genome Project.
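
One small piece of annotation, locating candidate protein-coding regions, can be sketched as an open reading frame (ORF) scan. This is a minimal illustration only: it assumes the standard start codon ATG and the three standard stop codons, scans the forward strand only, and the function name `find_orfs` and the `min_codons` parameter are hypothetical. Real annotation pipelines use far richer evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start/end are 0-based, end-exclusive; min_codons counts the stop codon too.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

demo = "GGATGAAACCCTAAGG"
orfs = find_orfs(demo)
for start, end, frame in orfs:
    print(frame, demo[start:end])
```

The scan reports only ORFs that are closed by a stop codon; an ORF running off the end of the sequence is ignored in this sketch.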

3. Comparative Genomics

Determining genomic structure and function relationships between different species. Intergenomic maps trace evolutionary processes.

4. Health and Drug Discovery

Aiding in drug discovery, diagnosis, and disease management by enabling targeted medicines and drugs.

5. Preventative Medicine

Correcting mutations by gene identification and splice site prediction, particularly in cancer analysis.

6. Gene Therapy

Detecting and quantifying mutations via next-generation sequencing, enabling cost-effective precision medicine.

Bio-Weapon Concerns

Scientists have constructed poliovirus by artificial means, using genomic data from the Internet and materials from chemical suppliers.
The research was funded by the US Department of Defense to demonstrate that bioweapons are a realistic threat and to discourage the relaxation of immunization programs.

Antibiotic Resistance

Examining the genome of Enterococcus faecalis to understand antibiotic resistance.
Discovery of a virulence region, a pathogenicity island, helps in detecting pathogenic strains and preventing infection spread.

Database

A computerized archive to store and organize data for easy retrieval.
Consists of files or tables, each containing records and fields.

Organisation

Flat files: Simple databases storing nucleotide and amino acid sequences as single text files.
Relational databases: Treat data as relations, storing data in tables with records and fields (rows and columns).
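
The relational layout described above (a table whose rows are records and whose columns are fields) can be sketched with Python's built-in sqlite3 module. The table name, column names, accessions, and sequences below are all made up for illustration and do not correspond to any real database schema.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence ("
    " accession TEXT PRIMARY KEY,"   # field: unique identifier
    " organism  TEXT,"               # field: source organism
    " molecule  TEXT,"               # field: DNA, mRNA, ...
    " residues  TEXT)"               # field: the sequence itself
)

# Each tuple is one record (row); the values are invented demo data.
records = [
    ("DEMO0001", "Homo sapiens", "mRNA", "ATGGCCATTGTA"),
    ("DEMO0002", "Mus musculus", "DNA", "ATGGCAATT"),
    ("DEMO0003", "Homo sapiens", "DNA", "GGATCCAT"),
]
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?)", records)

# Retrieval by field value, which a single flat text file cannot do directly.
human = conn.execute(
    "SELECT accession, LENGTH(residues) FROM sequence"
    " WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(human)
```

A flat file would hold the same information as plain text, and selecting by organism would require reading and parsing every entry.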

GenBank Flat-File Format Example

LOCUS: Title by GenBank.
- Locus Name: Similar to accession number.
- Sequence Length: Number of bases.
- Molecule-Type: Type of nucleic acid sequence (mRNA, rRNA, snRNA, DNA).
- GB Division: Data class according to GenBank classification.
- Modification Date: Date of record modification.
DEFINITION: Name of the nucleotide sequence.
ACCESSION: Accession number, version, and GI number.
- Accession number: Unique identifier.
- VERSION: Identification number in "accession.version" format.
- GI: Sequence identification number; increases with sequence changes.
KEYWORDS: Indexed words.
SOURCE: Organism from which sequences are obtained.
ORGANISM: Scientific name and phylogenetic lineage.
REFERENCE: Citations of publications.
FEATURES: Information derived from the sequence (gene, exon, intron, promoters, CDS, alternate splice).
BASE COUNT and ORIGIN: Follow the feature table; ORIGIN introduces the sequence itself.
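
The field layout above can be read with a few lines of plain Python. The sketch below assumes only that top-level keywords start in the first column, continuation lines are indented, the sequence follows ORIGIN, and the record ends with "//". The record text is an invented demo, not a real GenBank entry, and `parse_flat_file` is a hypothetical helper; for real work a dedicated parser is the safer choice.

```python
# Invented demo record laid out like a GenBank flat file.
RECORD = """\
LOCUS       DEMO0001                  24 bp    mRNA    linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demo sequence for parsing.
ACCESSION   DEMO0001
VERSION     DEMO0001.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

def parse_flat_file(text):
    """Group lines under their top-level keyword and collect the sequence."""
    fields, key, in_origin, seq = {}, None, False, []
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-record marker
            break
        if in_origin:                          # keep letters, drop numbering/spaces
            seq.append("".join(ch for ch in line if ch.isalpha()))
        elif line and not line[0].isspace():   # new top-level keyword
            key, _, value = line.partition(" ")
            if key == "ORIGIN":
                in_origin = True
            else:
                fields[key] = [value.strip()]
        elif key is not None and line.strip(): # indented continuation line
            fields[key].append(line.strip())
    fields["SEQUENCE"] = "".join(seq)
    return fields

record = parse_flat_file(RECORD)
locus = record["LOCUS"][0].split()
print("name:", locus[0], "length:", locus[1], "molecule:", locus[3])
print("sequence:", record["SEQUENCE"])
```

Sub-keywords such as ORGANISM are simply stored under their parent (SOURCE) here; splitting them out would be a natural next step.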

Types of Databases

Primary databases: Contain experimentally derived data like sequences or structures of biological components (protein or nucleotide).
Secondary databases: Contain information derived from primary databases, such as conserved sequences and active site residues.
Composite databases: Collections of primary database resources with analysis tools.

Primary Databases

Contain raw nucleic acid and protein data submitted by researchers worldwide.
Examples:
Nucleic acid
- EMBL
- GenBank (maintained by NCBI, the National Center for Biotechnology Information)
- DDBJ (DNA Data Bank of Japan)
Protein
- PIR (Protein Information Resource)
- MIPS
- SWISS-PROT
- TrEMBL (Translated EMBL)
- PDB (Protein Data Bank)

Secondary Databases

Contain information derived from primary databases (conserved sequences, active site residues, signature sequences).
Examples:
- Class Architecture Topology Homology (CATH)
- Kyoto Encyclopedia of Genes and Genomes (KEGG)
- Protein Families (Pfam)
- Structural Classification of Proteins (SCOP)
- PROSITE
- BLOCKS
- PRINTS

Composite Databases

Collections of several primary database resources.
Provide analysis tools and software.
Suffer from high data redundancy.

Biological Databases

Classified into:
- Sequence database
- structure database
- pathway databases.
Sequence databases apply to both nucleic acid and protein sequences.
Structure databases apply mainly to proteins, although resources such as PDB also hold nucleic acid structures.

Sequence Databases

Nucleotide and protein sequence databases are widely used.
Serve as repositories for wet lab results.
Major public data banks for nucleic acid sequences:
- GenBank (USA)
- EMBL (Europe)
- DDBJ (DNA Data Bank of Japan)
Protein databases:
- ExPASy
- UniProt
- PIR
- MIPS
- SWISS-PROT
- TrEMBL
- PDB

National Center for Biotechnology Information (NCBI)

Developed at NIH in 1988.
Part of the National Library of Medicine.
Provides access to biomedical and genomic information.
Maintains databases, bioinformatics tools, and services.
GenBank is one of its most widely used databases.

NCBI Mission

Find novel techniques for dealing with complex data.
Provide better accessibility to analytical and computational tools.
Maintain biological databases (primary or secondary).
Includes GenBank.
Provides data retrieval systems like Entrez.
Provides computational resources for GenBank data analysis.

NCBI Resources

Categorized into databases and tools.

Major NCBI Databases

GenBank and PubMed.
Other databases:
- Gene
- Genome
- Epigenomics
- Gene Expression
- RefSeq
- Structure
- dbSNP
- Taxonomy

NCBI Tools

Entrez: Search engine of NCBI
Other tools:
- Genomes Browser
- BLAST
- CDTree
- Genetic Codes
- ORF Finder
- SNP Database Specialized Search Tools
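
Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds an EFetch request URL and does not contact the network. The accession `DEMO0001` is a placeholder rather than a real record, and `efetch_url` is a hypothetical helper name. Check the current E-utilities documentation for usage limits before sending real requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(accession, db="nucleotide", rettype="gb", retmode="text"):
    """Build an EFetch URL asking for one record in GenBank flat-file text."""
    params = {
        "db": db,            # Entrez database to query
        "id": accession,     # accession (placeholder in this sketch)
        "rettype": rettype,  # "gb" requests the GenBank flat-file layout
        "retmode": retmode,  # plain text rather than XML
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"

url = efetch_url("DEMO0001")
print(url)
```

With a real accession the returned URL could be passed to `urllib.request.urlopen` to download the record.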

GenBank

Genetic Sequence Databank at NCBI.
Established in 1982.
Contains publicly available nucleotide sequences.
Allows DNA sequence submission via BankIt (web-based) or Sequin (for complicated submissions).

GenBank Structure

Locus: Title given by GenBank.
- Locus Name: Similar to accession number.
- Sequence Length: Number of bases.
- Molecule-Type: Type of nucleic acid sequence.
- GB Division: Data class.
- Modification Date: Date of modification.
Definition: Name of the nucleotide sequence.
Accession: Accession number, version, and GI number.
- Accession number: Unique identifier.
- VERSION: Identification number in "accession.version" format.
- GI: Sequence identification number; updated on sequence change.
Keyword: Indexed words.
Source: Organism of origin.
Organism: Scientific name and lineage.
Reference: Citations of publications.
Features: Information derived from the sequence (gene, exon, intron, promoters, CDS).

European Molecular Biology Laboratory (EMBL)

The EMBL nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI), UK; EMBL itself was founded in 1974.
Maintains databases with free access.
Primary source of nucleotide sequences for Europe.
Accepts nucleotide sequence data from genome-sequencing projects and the European Patent Office.

EMBL Collaboration

Collaborates with GenBank (USA) and DDBJ (Japan) for data collection.
Other genomic databases: Ensembl and Genome Reviews.
Daily releases with new submissions and updated data.
Entire database released every 3 months.

DDBJ

DNA Data Bank of Japan.
Collects DNA sequences submitted by researchers.
Run by the National Institute of Genetics, Japan.
Manages data in DDBJ format (flat file).
Flat file includes sequence, submitter information, references, source organisms, and feature information.

Ensembl Genome Database

Genome browser for retrieving genomic information from various organisms.
Created by EBI and the Sanger Centre (now the Wellcome Sanger Institute), UK.

Protein Databases

Swiss-Prot: Protein sequence and knowledge database with high-quality annotation.
Pfam: Database of protein families with annotations and multiple sequence alignments.

More Protein Databases

TrEMBL: EBI database of computer-annotated entries from translated coding sequences.
PIR: Integrated public bioinformatics resource supporting genomic and proteomic research.
UniProt: Freely accessible database of protein sequences with high-quality data and functional information.

Structure Databases

Include:
- Protein Data Bank (PDB)
Established in 1971 at Brookhaven National Laboratory (BNL).
Contains structural information determined by X-ray crystallography and NMR methods.
Maintained by the Research Collaboratory for Structural Bioinformatics (RCSB).

PROSITE

Database of protein domains and families.
Contains biologically significant sites, patterns, and profiles to identify known protein families.

CATH

Hierarchical classification of protein domain structures (Class, architecture, topology, homologous superfamily).

Pathway Databases

Describe biochemical pathways, reactions, and enzymes.
Examples:
- KEGG
- BRENDA
- BioCyc

KEGG

Kyoto Encyclopedia of Genes and Genomes.
Collection of databases dealing with genomes, enzymatic pathways, and biological chemicals.
Contains three databases: PATHWAY, GENES, and LIGAND.
- PATHWAY: Molecular interaction networks.
- GENES: Sequences of genes and proteins.
- LIGAND: Chemical compounds and reactions.

BioCyc

Collection of databases holding pathway and genome information for different organisms.
Includes EcoCyc (Escherichia coli K-12) and MetaCyc (pathways for more than 300 organisms).

SEQUENCE ALIGNMENT

Sequence alignment arranges protein (or DNA) sequences to identify similarity regions indicative of evolutionary relationships.
Useful for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures.

Importance of Sequence Alignment

Identify polymorphisms and mutations between sequences.
Quantify phylogenetic distance between two sequences.
Compare an mRNA with its genomic region.
Look for functional domains.
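
Once two sequences are aligned, listing the mutations between them and quantifying their similarity is a column-by-column comparison. The sketch below assumes the inputs are already aligned to equal length with "-" marking gaps; `compare_aligned` is a hypothetical helper, and percent identity is only the simplest possible similarity measure.

```python
def compare_aligned(a, b, gap="-"):
    """Return (identity, differences) for two pre-aligned sequences.

    identity is the fraction of matching columns among gap-free columns;
    differences lists (1-based position, residue in a, residue in b).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches, compared, differences = 0, 0, []
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if x == gap or y == gap:
            continue                      # gapped columns are skipped here
        compared += 1
        if x == y:
            matches += 1
        else:
            differences.append((pos, x, y))
    identity = matches / compared if compared else 0.0
    return identity, differences

identity, differences = compare_aligned("ACGTAC-T", "ACCTACGT")
print(f"identity = {identity:.3f}")
print("substitutions:", differences)
```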

Types of Sequence Alignment

Global Alignment
Local Alignment

Global Alignment

Assumes similarity over the entire length of two sequences.
Aligns the sequences from beginning to end to find the best possible alignment across their entire length.
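
A common way to compute a global alignment is the Needleman-Wunsch dynamic programming algorithm. The notes do not name a specific algorithm, so treat this as one illustrative choice. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the function name `global_align` are assumptions made for the sketch.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap             # a[:i] against nothing: all gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = global_align("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```

Because every residue of both sequences must appear in the output, the shorter sequence is padded with gaps rather than trimmed.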

Applications of Global Alignment

Comparing two genes with the same function (in human vs. mouse).
Comparing two proteins with similar functions.

Local Alignment

Finds local regions with the highest level of similarity between two sequences, disregarding the rest.
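
Local alignment is commonly computed with the Smith-Waterman algorithm, a dynamic programming variant in which cell scores are never allowed to drop below zero. As with any such sketch, the scoring values (match +2, mismatch -1, gap -2) and the name `local_align` are assumptions chosen for illustration.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman: return (score, aligned_a, aligned_b) for the best local region."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

result = local_align("CCCCGATTACACCCC", "TTGATTACATT")
print(result)
```

Only the shared region is reported; the unrelated flanking residues on either side are disregarded.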

Applications of Local Alignment

Searching for local similarities in large sequences.
Looking for conserved domains or motifs in two proteins.

PAIR WISE SEQUENCE ALIGNMENT

Finds the best-matching piecewise (local or global) alignments of two query sequences.

Methods of Producing Pair Wise Alignments

Dot matrix method (the oldest method).
Dynamic programming (DP) method (a more advanced method).
Word or k-tuple methods.
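
The dot matrix method can be sketched in a few lines: one sequence labels the rows, the other the columns, and a dot is placed wherever the two residues are identical. Runs of dots along a diagonal indicate similar regions. The function names below are hypothetical, and practical dot plots usually add a sliding window and threshold to suppress noise, which this sketch omits.

```python
def dot_matrix(a, b):
    """Return a matrix with 1 where a[i] == b[j] and 0 elsewhere."""
    return [[1 if x == y else 0 for y in b] for x in a]

def render(a, b):
    """Format the matrix as text: '*' for a match, '.' for a mismatch."""
    lines = ["  " + " ".join(b)]
    for x, row in zip(a, dot_matrix(a, b)):
        lines.append(x + " " + " ".join("*" if v else "." for v in row))
    return "\n".join(lines)

print(render("GATTACA", "GATACA"))
```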

Tools for Pair Wise Sequence Alignment

BLAST, FASTA
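
BLAST and FASTA are heuristic tools built on the word (k-tuple) idea: short exact matches between query and subject are found first and then extended. The sketch below shows only that first seeding step, with an assumed word size of 3 and hypothetical function names. It does not reproduce the extension, scoring, or statistics of either tool.

```python
from collections import defaultdict

def index_words(seq, k):
    """Map every length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds

seeds = find_seeds("GATTACA", "CCGATTCC")
print(seeds)
# Seeds sharing the same offset (subject_pos - query_pos) lie on one diagonal
# and are candidates for extension into a longer alignment.
print({s - q for q, s in seeds})
```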

Multiple Sequence Alignment (MSA)

Alignment of three or more biological protein or nucleic acid sequences of similar length.

Methods of MSA

Dynamic Programming Approach
Progressive method
Iterative method

Tools in MSA

CLUSTAL W, CLUSTAL W2, CLUSTAL Omega, etc.

Applications of MSA

Detecting similarities between sequences.
Detecting conserved regions or motifs in sequences.
Detection of structural homologies.
Improved prediction of secondary and tertiary structures of proteins.
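
Detecting conserved regions in a finished MSA amounts to examining each alignment column. The sketch below assumes the alignment has already been produced by some MSA tool and is passed in as equal-length strings with "-" for gaps. The `threshold` parameter and the function name are hypothetical, and the toy alignment is invented.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Return (1-based column, residue) where one residue reaches the threshold fraction."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        residue, count = Counter(column).most_common(1)[0]
        # Columns dominated by gaps are not counted as conserved.
        if residue != "-" and count / len(alignment) >= threshold:
            conserved.append((col + 1, residue))
    return conserved

msa = ["ACG-T",
       "ACGAT",
       "TCG-T"]
strict = conserved_columns(msa)
relaxed = conserved_columns(msa, threshold=0.6)
print(strict)
print(relaxed)
```

Lowering the threshold admits columns where most, but not all, sequences agree.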

Phylogenetic Tree

Depicts the evolutionary descent of species, organisms, or genes from a common ancestor.
Reconstructs evolutionary ancestors.
Estimates the time of divergence from ancestors.

History of Phylogenetic Trees

Early representations included a paleontological chart by Edward Hitchcock (1840).
Charles Darwin (1859) popularized the evolutionary "tree" concept.

How Phylogenetic Trees Work

Visual representation of evolutionary relationships.

Phylogenetic Tree Components

Leaves: Current species; sequences in current species
Internal nodes: Hypothetical common ancestors
Branch (edge) length: "Time" from one speciation to the next

Dendrogram, Cladogram, Phylogram

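Dendrogram: General term for a diagrammatic representation of a phylogenetic tree.
Cladogram: Shows branching order only; branch lengths carry no meaning.
Phylogram: Branch lengths represent the amount of evolutionary change.
Chronogram: Branch lengths represent evolutionary time.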

Types of Phylogenetic Trees

Cladogram
Chronogram
Phylogram

Rooted vs. Unrooted Trees

Rooted Tree: Has a root representing the common ancestor of all leaves, so the direction of descent can be inferred.
Unrooted Tree: Shows how the leaves are related through branches, without assumptions regarding a common ancestor.

Construction of Phylogenetic Tree

Find the tree that best describes species relationships.
Two main types:
- Character based methods
- Distance based methods

Character Based Methods

Use aligned characters directly during tree inference.
Examples: Parsimony and Maximum likelihood.
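
As an illustration of a character-based method, parsimony scores a candidate tree by the minimum number of character changes it requires. One standard way to count them on a fixed rooted binary tree is Fitch's algorithm, sketched below. The tree encoding (nested tuples of taxon names), the function names, and the toy alignment are assumptions made for this example. A full parsimony analysis would also search over many candidate trees.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one character."""
    if isinstance(tree, str):                 # leaf: the observed state
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over every column of the alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for col in range(length):
        column = {taxon: seq[col] for taxon, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total

# Invented aligned characters for four taxa.
alignment = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "TCG"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
score_1 = parsimony_score(tree_1, alignment)
score_2 = parsimony_score(tree_2, alignment)
print(score_1, score_2)
```

Under the parsimony criterion the tree with the lower score is preferred.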

Distance Based Methods

Transform sequence data into pairwise distances for tree building.
Examples: UPGMA and Neighbor-joining.
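
The distance-based route can be sketched end to end: convert aligned sequences to pairwise distances, then cluster with UPGMA. The sketch assumes pre-aligned sequences, uses the simple p-distance (fraction of differing sites) rather than a corrected evolutionary model, and returns only the tree topology as nested tuples without branch lengths. The function names and sequences are invented for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of gap-free aligned sites at which the two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(seqs):
    """Build a rooted tree topology (nested tuples) by UPGMA clustering."""
    # Each cluster label maps to (subtree, number of leaves it contains).
    clusters = {name: (name, 1) for name in seqs}
    dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)            # closest pair of clusters
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"                     # label for the new cluster
        for other in clusters:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            # UPGMA: size-weighted average of the two old distances.
            dist[frozenset((merged, other))] = (
                (d_a * size_a + d_b * size_b) / (size_a + size_b))
        del dist[pair]
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    return next(iter(clusters.values()))[0]

seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTA",
    "D": "TCGAACTT",
}
tree = upgma(seqs)
print(tree)
```

Neighbor-joining follows the same overall pattern but chooses the pair to merge using a rate-corrected criterion instead of the raw minimum distance.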