Bioinformatics Comprehensive Notes

Bioinformatics

  • Bioinformatics is the application of Information Technology to store, organize, and analyze vast amounts of biological data.

  • The stored data includes sequences and structures of proteins and nucleic acids.

  • Nucleic acid information is stored as sequences.

  • Protein data is stored as sequences and structures.

  • Sequences are represented in one dimension, while structures contain three-dimensional data.

Merging Disciplines

  • Bioinformatics merges biology, mathematics, statistics, computer science, and IT into a single discipline to process biological data.

  • High-throughput instruments, such as automated DNA sequencers, are used to read biological data quickly.

  • The term "Bioinformatics" was coined by Paulien Hogeweg and Ben Hesper in 1970.

Necessity of Bioinformatics

  • Bioinformatics is necessary due to the explosion of publicly available genomic information, like that from the Human Genome Project.

  • It allows better understanding of gene analysis, taxonomy, and evolution.

  • It speeds up rational drug design and reduces manual drug development time.

Goals of Bioinformatics

  • To uncover biological information hidden in sequence, structure, literature, and biological data.

  • Used in molecular medicine.

  • Offers environmental benefits by identifying bacteria that break down waste and can be used for clean-up.

  • In agriculture, it can produce high-yield, low-maintenance crops.

Fields of Bioinformatics

  • Molecular Medicine

  • Gene Therapy

  • Drug Development

  • Microbial genome applications

  • Crop Improvement

  • Forensic Analysis of Microbes

  • Biotechnology

  • Evolutionary Studies

  • Bio-Weapon Creation

Applications of Bioinformatics

  • Experimental Molecular Biology

  • Genetics and Genomics

  • Generating Biological Data

  • Analysis of gene and protein expression

  • Comparison of genomic data

  • Simulation & Modeling of DNA, RNA & Protein

Specific Bioinformatics Applications

1. Prediction of Protein Structure

  • Determining complex protein structures using bioinformatics tools, from primary (amino acid sequence) to secondary, tertiary, or quaternary structures.

2. Genome Annotation

  • Marking genomes to identify regulatory sequences and protein-coding regions, important for the Human Genome Project.

3. Comparative Genomics

  • Determining genomic structure and function relationships between different species. Intergenomic maps trace evolutionary processes.

4. Health and Drug Discovery

  • Aiding in drug discovery, diagnosis, and disease management by enabling targeted medicines and drugs.

5. Preventative Medicine

  • Correcting mutations by gene identification and splice site prediction, particularly in cancer analysis.

6. Gene Therapy

  • Detecting and quantifying mutations via next-generation sequencing, enabling cost-effective precision medicine.

Bio-Weapon Concerns

  • Scientists have synthesized poliovirus artificially, using genomic data available on the Internet and materials from chemical suppliers.

  • The research was funded by the US Department of Defense to demonstrate that the bioweapon threat is real and to discourage relaxing immunization programs.

Antibiotic Resistance

  • Examining the genome of Enterococcus faecalis to understand antibiotic resistance.

  • Discovery of a virulence region, a pathogenicity island, helps in detecting pathogenic strains and preventing infection spread.

Database

  • A computerized archive to store and organize data for easy retrieval.

  • Consists of files or tables, each containing records and fields.

Organisation

  • Flat files: Simple databases storing nucleotide and amino acid sequences as single text files.

  • Relational databases: Treat data as relations, storing data in tables with records and fields (rows and columns).
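
The difference between the two organisations can be sketched in a few lines. The example below is only an illustration: the FASTA-like records, the table name `sequences`, and the helper names `parse_flat_file` and `load_into_table` are all invented for this sketch. It reads a small flat file and then stores the same records as rows and fields in an in-memory relational table.

```python
import sqlite3

# Hypothetical two-record flat file in a FASTA-like layout (invented sequences).
FLAT_FILE = """>seq1 example nucleotide record
ATGCGTAC
GGCTA
>seq2 example protein record
MKTAYIAK
"""

def parse_flat_file(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    ident, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if ident is not None:
                records.append((ident, desc, "".join(chunks)))
            header = line[1:].split(maxsplit=1) or [""]
            ident = header[0]
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if ident is not None:
        records.append((ident, desc, "".join(chunks)))
    return records

def load_into_table(records):
    """Store records in a relational table: one row per record, one column per field."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)
    return conn

records = parse_flat_file(FLAT_FILE)
conn = load_into_table(records)
rows = conn.execute(
    "SELECT id, LENGTH(sequence) FROM sequences ORDER BY id"
).fetchall()
print(rows)
```

With the flat file, every query means re-reading and re-parsing text; with the table, the same question ("how long is each sequence?") becomes a single SQL statement.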

GenBank Flat-File Format Example

  • LOCUS: Title by GenBank.

    • Locus Name: Similar to accession number.

    • Sequence Length: Number of bases.

    • Molecule-Type: Type of nucleic acid sequence (mRNA, rRNA, snRNA, DNA).

    • GB Division: Data class according to GenBank classification.

    • Modification Date: Date of record modification.

  • DEFINITION: Name of the nucleotide sequence.

  • ACCESSION: Accession number, version, and GI number.

    • Accession number: Unique identifier.

    • VERSION: Identification number in "accession.version" format.

    • GI: Sequence identification number; increases with sequence changes.

  • KEYWORDS: Indexed words.

  • SOURCE: Organism from which sequences are obtained.

  • ORGANISM: Scientific name and phylogenetic lineage.

  • REFERENCE: Citations of publications.

  • FEATURES: Information derived from the sequence (gene, exon, intron, promoters, CDS, alternate splice). The record then ends with the BASE COUNT and ORIGIN fields; ORIGIN is followed by the sequence itself.
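
As a rough illustration of how the LOCUS and VERSION fields described above can be pulled apart, here is a minimal sketch. The two sample lines are modelled on the sample record that NCBI publishes (accession U49845) and should be treated as illustrative; the function names `parse_locus` and `parse_version` are hypothetical, and the parser assumes simple whitespace-separated fields rather than the full format specification.

```python
# Sample lines modelled on NCBI's published sample record (treat values as illustrative).
LOCUS_LINE = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
VERSION_LINE = "VERSION     U49845.1  GI:1293613"

def parse_locus(line):
    """Split a LOCUS line into locus name, length, molecule type, division and date.

    Assumption: fields are whitespace-separated. Newer records may carry a
    topology token (linear/circular) before the division, handled below.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    fields = {
        "locus_name": tokens[1],
        "length": int(tokens[2]),
        "unit": tokens[3],
        "molecule_type": tokens[4],
    }
    rest = tokens[5:]
    if rest and rest[0] in ("linear", "circular"):
        fields["topology"] = rest.pop(0)
    fields["division"], fields["modification_date"] = rest[0], rest[1]
    return fields

def parse_version(line):
    """Split a VERSION line into accession, version number and (optional) GI."""
    tokens = line.split()
    accession, version = tokens[1].split(".")
    result = {"accession": accession, "version": int(version)}
    for token in tokens[2:]:
        if token.startswith("GI:"):
            result["gi"] = int(token[3:])
    return result

print(parse_locus(LOCUS_LINE))
print(parse_version(VERSION_LINE))
```

For real work, an established parser (for example the one in Biopython) is a safer choice than hand-splitting lines.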

Types of Databases

  • Primary databases: Contain experimentally derived data like sequences or structures of biological components (protein or nucleotide).

  • Secondary databases: Contain information derived from primary databases, such as conserved sequences and active site residues.

  • Composite databases: Collections of primary database resources with analysis tools.

Primary Databases

  • Contain raw sequence or structure data (nucleic acid or protein) submitted by researchers worldwide.

  • Examples (GenBank is maintained by NCBI, the National Center for Biotechnology Information):

    • GenBank

    • DDBJ (DNA Data Bank of Japan)

    • Swiss-Prot

    • PIR (Protein Information Resource)

    • PDB (Protein Data Bank)

    • TrEMBL (Translated European Molecular Biology Laboratory)

  • Nucleic acid

    • EMBL

    • GenBank

    • DDBJ (DNA Data Bank of Japan)

  • Protein

    • PIR

    • MIPS

    • SWISS-PROT

    • TrEMBL

Secondary Databases

  • Contain information derived from primary databases (conserved sequences, active site residues, signature sequences).

  • Examples:

    • Class Architecture Topology Homology (CATH)

    • Kyoto Encyclopedia of Genes and Genomes (KEGG)

    • Protein Families (Pfam)

    • Structural Classification of Proteins (SCOP)

    • PROSITE

    • BLOCKS

    • PRINTS

Composite Databases

  • Collections of several primary database resources.

  • Provide analysis tools and software.

  • Suffer from high data redundancy.

Biological Databases

  • Classified into:

    • Sequence databases

    • Structure databases

    • Pathway databases

  • Sequence databases apply to both nucleic acid and protein sequences.

  • Structure databases apply mainly to proteins, although PDB also holds nucleic acid structures.

Sequence Databases

  • Nucleotide and protein sequence databases are widely used.

  • Serve as repositories for wet lab results.

  • Major public data banks:

    • GenBank (USA)

    • EMBL (Europe)

    • DDBJ (Japan)

  • Protein databases:

    • ExPASy

    • UniProt

    • PIR

    • MIPS

    • Swiss-Prot

    • TrEMBL

    • PDB

National Center for Biotechnology Information (NCBI)

  • Established at the National Institutes of Health (NIH) in 1988.

  • Part of the National Library of Medicine.

  • Provides access to biomedical and genomic information.

  • Maintains databases, bioinformatics tools, and services.

  • GenBank is a popular database.

NCBI Mission

  • Find novel techniques for dealing with complex data.

  • Provide better accessibility to analytical and computational tools.

  • Maintain biological databases (primary or secondary).

  • Includes GenBank.

  • Provides data retrieval systems like Entrez.

  • Provides computational resources for GenBank data analysis.

NCBI Resources

  • Categorized into databases and tools.

Major NCBI Databases

  • GenBank and PubMed.

  • Other databases:

    • Gene

    • Genome

    • Epigenomics

    • Gene Expression

    • RefSeq

    • Structure

    • dbSNP

    • Taxonomy

NCBI Tools

  • Entrez: Search engine of NCBI

  • Other tools:

    • Genomes Browser

    • BLAST

    • CDTree

    • Genetic Codes

    • ORF Finder

    • SNP Database Specialized Search Tools

GenBank

  • Genetic Sequence Databank at NCBI.

  • Established in 1982.

  • Contains publicly available nucleotide sequences.

  • Allows DNA sequence submission via BankIt (web-based) or Sequin (for complicated submissions).

GenBank Structure

  • Locus: Title given by GenBank.

    • Locus Name: Similar to accession number.

    • Sequence Length: Number of bases.

    • Molecule-Type: Type of nucleic acid sequence.

    • GB Division: Data class.

    • Modification Date: Date of modification.

  • Definition: Name of the nucleotide sequence.

  • Accession: Accession number, version, and GI number.

    • Accession number: Unique identifier.

    • VERSION: Identification number in "accession.version" format.

    • GI: Sequence identification number; updated on sequence change.

  • Keyword: Indexed words.

  • Source: Organism of origin.

  • Organism: Scientific name and lineage.

  • Reference: Citations of publications.

  • Features: Information derived from the sequence (gene, exon, intron, promoters, CDS).

European Molecular Biology Laboratory (EMBL)

  • The EMBL nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI), UK; EMBL itself was founded in 1974.

  • Maintains databases with free access.

  • Primary source of nucleotide sequences for Europe.

  • Accepts nucleotide sequence data from genome-sequencing projects and the European Patent Office.

EMBL Collaboration

  • Collaborates with GenBank (USA) and DDBJ (Japan) for data collection.

  • Other genomic databases: Ensembl and Genome Reviews.

  • Daily releases with new submissions and updated data.

  • Entire database released every 3 months.

DDBJ

  • DNA Data Bank of Japan.

  • Collects DNA sequences submitted by researchers.

  • Run by the National Institute of Genetics, Japan.

  • Manages data in DDBJ format (flat file).

  • Flat file includes sequence, submitter information, references, source organisms, and feature information.

Ensembl Genome Database

  • Genome browser for retrieving genomic information from various organisms.

  • Created and maintained by EBI and the Wellcome Sanger Institute (formerly the Sanger Centre), UK.

Protein Databases

  • Swiss-Prot: Protein sequence and knowledge database with high-quality annotation.

  • Pfam: Database of protein families with annotations and multiple sequence alignments.

More Protein Databases

  • TrEMBL: EBI database of computer-annotated entries from translated coding sequences.

  • PIR: Integrated public bioinformatics resource supporting genomic and proteomic research.

  • UniProt: Database of freely accessible protein sequences with high-quality data and functional information.

Structure Databases

  • Include:

    • Protein Data Bank (PDB): structural data used to solve real problems in molecular biology

  • Established in 1971 at Brookhaven National Laboratory (BNL).

  • Contains structural information determined by X-ray crystallography and NMR methods.

  • Maintained by the Research Collaboratory for Structural Bioinformatics (RCSB).

PROSITE

  • Database of protein domains and families.

  • Contains biologically significant sites, patterns, and profiles to identify known protein families.

CATH

  • Hierarchical classification of protein domain structures (Class, architecture, topology, homologous superfamily).

Pathway Databases

  • Describe biochemical pathways, reactions, and enzymes.

  • Examples:

    • KEGG

    • BRENDA

    • BioCyc

KEGG

  • Kyoto Encyclopedia of Genes and Genomes.

  • Collection of databases dealing with genomes, enzymatic pathways, and biological chemicals.

  • Contains three databases: PATHWAY, GENES, and LIGAND.

    • PATHWAY: Molecular interaction networks.

    • GENES: Sequences of genes and proteins.

    • LIGAND: Chemical compounds and reactions.

BioCyc

  • Collection of databases holding pathway and genome information for different organisms.

  • Includes EcoCyc (Escherichia coli K-12) and MetaCyc (pathways for more than 300 organisms).

SEQUENCE ALIGNMENT

  • Sequence alignment arranges protein (or DNA) sequences to identify similarity regions indicative of evolutionary relationships.

  • Useful for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures.

Importance of Sequence Alignment

  • Identify polymorphisms and mutations between sequences.

  • Quantify phylogenetic distance between two sequences.

  • Compare an mRNA with its genomic region.

  • Look for functional domains.

Types of Sequence Alignment

  • Global Alignment

  • Local Alignment

Global Alignment

  • Assumes similarity over the entire length of two sequences.

  • Alignment from beginning to end to find the best possible alignment across the entire length.
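
A common dynamic programming formulation of global alignment is the Needleman-Wunsch algorithm. The sketch below is a minimal version under assumed scoring values (match +1, mismatch -1, linear gap penalty -1); the function name and parameters are chosen for illustration only, and real tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to the top-left corner,
    # so the alignment always covers both sequences end to end.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```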

Applications of Global Alignment

  • Comparing two genes with the same function (in human vs. mouse).

  • Comparing two proteins with similar functions.

Local Alignment

  • Finds local regions with the highest level of similarity between two sequences, disregarding the rest.
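
Local alignment is often computed with the Smith-Waterman algorithm, which differs from the global case in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The following is a minimal sketch with assumed scoring values (match +2, mismatch -1, gap -1); names and parameters are illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh local alignment
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached;
    # everything outside that region is disregarded.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTTACGTGGG", "CCACGTCC"))  # (8, 'ACGT', 'ACGT')
```

In the usage line, only the shared core `ACGT` is reported; the unrelated flanking residues do not affect the result.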

Applications of Local Alignment

  • Searching for local similarities in large sequences.

  • Looking for conserved domains or motifs in two proteins.

PAIRWISE SEQUENCE ALIGNMENT

  • Finds the best-matching piecewise (local or global) alignments of two query sequences.

Methods of Producing Pairwise Alignments

  • Dot matrix method (older method).

  • Dynamic programming (DP) method (more advanced method).

  • Word or k-tuple methods.
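
Two of these methods are simple enough to sketch directly. The first function below builds a dot matrix, marking every position where a window of one sequence equals a window of the other; the second shows the seeding step of a word (k-tuple) method, which indexes the words of one sequence and looks up the words of the other. Function names, the `window` and `k` parameters, and the example sequences are assumptions made for this sketch, not part of any particular tool.

```python
def dot_matrix(a, b, window=1):
    """Return text rows with '*' where a[i:i+window] == b[j:j+window], else '.'."""
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows

def shared_kmers(a, b, k=3):
    """Return (i, j) pairs where the k-letter word at a[i:] also occurs at b[j:]."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)   # word -> positions in b
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            hits.append((i, j))
    return hits

for row in dot_matrix("ACG", "ACG"):
    print(row)
print(shared_kmers("GATTACA", "TTACG", k=3))  # [(2, 0), (3, 1)]
```

In the dot matrix, similar regions show up as diagonal runs of marks. In the word method, hits that share the same offset `i - j` lie on one diagonal and are candidates for extension into a longer alignment.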

Tools for Pairwise Sequence Alignment

  • BLAST, FASTA

Multiple Sequence Alignment (MSA)

  • Alignment of three or more biological protein or nucleic acid sequences of similar length.

Methods of MSA

  • Dynamic Programming Approach

  • Progressive method

  • Iterative method

Tools in MSA

  • CLUSTAL W, CLUSTAL W2, CLUSTAL Omega, etc.

Applications of MSA

  • Detecting similarities between sequences.

  • Detecting conserved regions or motifs in sequences.

  • Detection of structural homologies.

  • Improved prediction of secondary and tertiary structures of proteins.
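
Once an MSA is available, detecting conserved regions amounts to looking down each column. The sketch below assumes the alignment is already computed and given as equal-length strings with `-` for gaps; the function name, the `threshold` parameter, and the toy alignment are invented for illustration.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Return 0-based column indices whose most common residue (gaps ignored)
    appears in at least `threshold` of the rows."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for c in range(length):
        column = [row[c] for row in alignment]
        counts = Counter(ch for ch in column if ch != "-")
        if counts and counts.most_common(1)[0][1] / len(column) >= threshold:
            columns.append(c)
    return columns

# Toy alignment (invented sequences).
msa = ["MKT-AY", "MKTGAY", "MRT-AF"]
print(conserved_columns(msa))                 # [0, 2, 4]
print(conserved_columns(msa, threshold=0.6))  # [0, 1, 2, 4, 5]
```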

Phylogenetic Tree

  • Depicts the evolutionary descent of species, organisms, or genes from a common ancestor.

  • Reconstructs evolutionary ancestors.

  • Estimates the time of divergence from ancestors.

History of Phylogenetic Trees

  • Early representations included a paleontological chart by Edward Hitchcock (1840).

  • Charles Darwin (1859) popularized the evolutionary "tree" concept.

How Phylogenetic Trees Work

  • Visual representation of evolutionary relationships.

Phylogenetic Tree Components

  • Leaves: Current species; sequences in current species

  • Internal nodes: Hypothetical common ancestors

  • Branch (edge) lengths: "Time" from one speciation to the next
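
These components map naturally onto a small recursive data structure. The sketch below uses a hypothetical `Node` class (not taken from any library) in which leaves have no children, internal nodes have children, and each node may carry a branch length. It also writes the tree in Newick notation, a widely used text format for trees.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str = ""                          # leaf label; usually empty for internal nodes
    branch_length: Optional[float] = None   # length of the edge leading to this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children

def leaves(node):
    """Collect leaf names from left to right."""
    if node.is_leaf():
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names

def to_newick(node):
    """Serialize the subtree below `node` in Newick notation (without the final ';')."""
    label = node.name
    if node.children:
        label = "(" + ",".join(to_newick(c) for c in node.children) + ")" + label
    if node.branch_length is not None:
        label += f":{node.branch_length}"
    return label

# Root with leaf A and an internal node (hypothetical ancestor of B and C).
tree = Node(children=[
    Node("A", 1.0),
    Node(branch_length=0.5, children=[Node("B", 0.5), Node("C", 0.5)]),
])
print(leaves(tree))            # ['A', 'B', 'C']
print(to_newick(tree) + ";")   # (A:1.0,(B:0.5,C:0.5):0.5);
```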

Dendrogram, Cladogram, Phylogram

  • Dendrogram: Diagrammatic representation of phylogenetic trees.

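  • Cladogram: Shows branching order only; branch lengths do not represent evolutionary time or change.

  • Phylogram: Branch lengths are proportional to the amount of evolutionary change (in a chronogram they represent time).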

Types of Phylogenetic Trees

  • Cladogram

  • Chronogram

  • Phylogram

Rooted vs. Unrooted Trees

  • Rooted Tree: Has a root node representing the most recent common ancestor, so it implies a direction of evolution.

  • Unrooted Tree: Shows the relationships among the leaves and branches, without assumptions regarding a common ancestor.

Construction of Phylogenetic Tree

  • Find the tree that best describes species relationships.

  • Two main types:

    • Character based methods

    • Distance based methods

Character Based Methods

  • Use aligned characters directly during tree inference.

  • Examples: Parsimony and Maximum likelihood.
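
To make the parsimony idea concrete, the sketch below counts the minimum number of character changes a given tree requires, using Fitch's small-parsimony procedure on a rooted binary tree. The tree encoding (nested 2-tuples of taxon names), the function names, and the toy alignment are assumptions for illustration; a real parsimony analysis would also search over many candidate trees.

```python
def fitch(tree):
    """Return (possible_states, changes) for one site on a rooted binary tree.

    Leaves are single-character strings; internal nodes are 2-tuples.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch(left)
    right_states, right_cost = fitch(right)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost            # no change needed here
    return left_states | right_states, left_cost + right_cost + 1

def site_tree(tree, alignment, site):
    """Replace each taxon name in the tree by its character at one alignment site."""
    if isinstance(tree, str):
        return alignment[tree][site]
    return tuple(site_tree(child, alignment, site) for child in tree)

def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all aligned sites."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(site_tree(tree, alignment, s))[1] for s in range(length))

# Toy aligned characters for four taxa (invented data).
aln = {"A": "AAG", "B": "AAG", "C": "ATG", "D": "CTG"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree1, aln))  # 2
print(parsimony_score(tree2, aln))  # 3
```

Under the parsimony criterion, `tree1` is preferred because it explains the aligned characters with fewer changes.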

Distance Based Methods

  • Transform sequence data into pairwise distances for tree building.

  • Examples: UPGMA and Neighbor-joining.
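
The distance-based workflow can be sketched end to end: first turn aligned sequences into pairwise distances, then cluster. The example below uses the p-distance (the proportion of differing sites) and a bare-bones UPGMA that returns only the tree topology as nested tuples. The function names, the distance dictionary keyed by `frozenset`, and the toy sequences are all assumptions made for this sketch; branch heights and tie handling are left out.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of sites that differ between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(names, dist):
    """Cluster by UPGMA and return the topology as nested tuples.

    names: list of labels; dist: dict mapping frozenset({x, y}) -> distance.
    """
    sizes = {name: 1 for name in names}     # current clusters and their sizes
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two closest clusters.
        x, y = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (x, y)
        nx, ny = sizes.pop(x), sizes.pop(y)
        for other in sizes:
            # Distance to the new cluster is the size-weighted average.
            d[frozenset((merged, other))] = (
                nx * d[frozenset((x, other))] + ny * d[frozenset((y, other))]
            ) / (nx + ny)
        sizes[merged] = nx + ny
    return next(iter(sizes))

# Toy aligned sequences (invented data).
seqs = {
    "A": "AAAAAAAAAA",
    "B": "AAAAAAAAAT",
    "C": "AAAAAAATTC",
    "D": "GGGGGAATTC",
}
names = list(seqs)
dist = {
    frozenset((x, y)): p_distance(seqs[x], seqs[y])
    for x, y in combinations(names, 2)
}
print(upgma(names, dist))  # ('D', ('C', ('A', 'B')))
```

A and B differ at a single site, so they are joined first; C joins that pair next, and D, the most distant sequence, joins last.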