Bioinformatics Comprehensive Notes
Bioinformatics
Bioinformatics is the application of Information Technology to store, organize, and analyze vast amounts of biological data.
The stored data includes sequences and structures of proteins and nucleic acids.
Nucleic acid information is stored as sequences.
Protein data is stored as sequences and structures.
Sequences are represented in one dimension, while structures contain three-dimensional data.
Merging Disciplines
Bioinformatics merges biology, mathematics, statistics, computer science, and IT into a single discipline to process biological data.
High-throughput instruments, such as automated DNA sequencers, are used to read biological data quickly.
The term "Bioinformatics" was coined by Paulien Hogeweg and Ben Hesper in 1970.
Necessity of Bioinformatics
Bioinformatics is necessary due to the explosion of publicly available genomic information, like that from the Human Genome Project.
It allows better understanding of gene analysis, taxonomy, and evolution.
It speeds up rational drug design and reduces manual drug development time.
Goals of Bioinformatics
To uncover biological information hidden in sequence, structure, literature, and biological data.
Used in molecular medicine.
Offers environmental benefits by identifying bacteria that can break down waste and clean up pollutants.
In agriculture, it supports the development of high-yield, low-maintenance crops.
Fields of Bioinformatics
Molecular Medicine
Gene Therapy
Drug Development
Microbial genome applications
Crop Improvement
Forensic Analysis of Microbes
Biotechnology
Evolutionary Studies
Bio-Weapon Creation
Applications of Bioinformatics
Experimental Molecular Biology
Genetics and Genomics
Generating Biological Data
Analysis of gene and protein expression
Comparison of genomic data
Simulation & Modeling of DNA, RNA & Protein
Specific Bioinformatics Applications
1. Prediction of Protein Structure
Determining complex protein structures using bioinformatics tools, from primary (amino acid sequence) to secondary, tertiary, or quaternary structures.
2. Genome Annotation
Marking genomes to identify regulatory sequences and protein-coding regions, important for the Human Genome Project.
3. Comparative Genomics
Determining genomic structure and function relationships between different species. Intergenomic maps trace evolutionary processes.
4. Health and Drug Discovery
Aiding in drug discovery, diagnosis, and disease management by enabling targeted medicines and drugs.
5. Preventative Medicine
Correcting mutations by gene identification and splice site prediction, particularly in cancer analysis.
6. Gene Therapy
Detecting and quantifying mutations via next-generation sequencing, enabling cost-effective precision medicine.
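Finding protein-coding regions, as in genome annotation and gene identification above, often starts with a scan for open reading frames (ORFs). The sketch below is a deliberately minimal illustration: it assumes the standard start codon ATG and stop codons TAA, TAG and TGA, scans the forward strand only, and uses a made-up toy sequence. The function name `find_orfs` and the `min_codons` parameter are illustrative, not part of any named tool.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs (end exclusive)."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):                       # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):  # step codon by codon
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                        # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return sorted(orfs)

toy = "CCATGAAATTTTAGGG"                         # made-up sequence
for begin, end in find_orfs(toy):
    print(begin, end, toy[begin:end])
```

Real annotation pipelines also scan the reverse strand, model splice sites and use statistical gene models; this sketch shows only the basic idea.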
Bio-Weapon Concerns
Scientists have constructed the poliomyelitis virus using artificial means, based on genomic data from the Internet and materials from chemical suppliers.
The research was funded by the US Department of Defense to demonstrate that the bioweapon threat is real and to discourage the relaxation of immunization programs.
Antibiotic Resistance
Examining the genome of Enterococcus faecalis to understand antibiotic resistance.
Discovery of a virulence region, a pathogenicity island, helps in detecting pathogenic strains and preventing infection spread.
Database
A computerized archive to store and organize data for easy retrieval.
Consists of files or tables, each containing records and fields.
Organisation
Flat files: Simple databases storing nucleotide and amino acid sequences as single text files.
Relational databases: Treat data as relations, storing data in tables with records and fields (rows and columns).
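The difference between the two organisations can be sketched in a few lines. The example below is only an illustration: the flat file is a made-up FASTA-like text with two toy records, and the table name `sequences` and its columns are assumptions chosen for the example. It parses the flat text into records and loads them into an in-memory relational table using Python's built-in `sqlite3` module.

```python
import sqlite3

# Two made-up records in a FASTA-like flat file (identifiers are illustrative).
flat_file = """>seq1 toy nucleotide record
ATGGCGTAA
>seq2 toy protein record
MKTAYIAK
"""

def parse_flat(text):
    """Turn the flat text into (identifier, description, sequence) tuples."""
    records = []
    for block in text.strip().split(">")[1:]:
        header, *seq_lines = block.splitlines()
        identifier, _, description = header.partition(" ")
        records.append((identifier, description, "".join(seq_lines)))
    return records

# Relational form: one table, one row per record, one column per field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (id TEXT PRIMARY KEY, description TEXT, seq TEXT)")
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", parse_flat(flat_file))

rows = conn.execute("SELECT id, LENGTH(seq) FROM sequences ORDER BY id").fetchall()
print(rows)
```

The flat file must be read from top to bottom to find a record, whereas the relational table can be queried by field.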
GenBank Flat-File Format Example
LOCUS: Title by GenBank.
Locus Name: Similar to accession number.
Sequence Length: Number of bases.
Molecule-Type: Type of nucleic acid sequence (mRNA, rRNA, snRNA, DNA).
GB Division: Data class according to GenBank classification.
Modification Date: Date of record modification.
DEFINITION: Name of the nucleotide sequence.
ACCESSION: Accession number, version, and GI number.
Accession number: Unique identifier.
VERSION: Identification number in "accession.version" format.
GI: Sequence identification number; a new GI is assigned whenever the sequence changes.
KEYWORDS: Indexed words.
SOURCE: Organism from which sequences are obtained.
ORGANISM: Scientific name and phylogenetic lineage.
REFERENCE: Citations of publications.
FEATURES: Information derived from the sequence (gene, exon, intron, promoters, CDS, alternate splice, Base Count, Origin).
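The header fields listed above can be read with simple string handling. The sketch below is a minimal illustration on a made-up record (the accession `TOY00001` and all values are hypothetical). It assumes each header keyword starts in the first column and that the LOCUS line holds whitespace-separated fields, including the topology field (linear or circular) found in current GenBank records.

```python
# Made-up header in GenBank flat-file style; every value is hypothetical.
record = """LOCUS       TOY00001                 120 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical toy mRNA, used only for illustration.
ACCESSION   TOY00001
VERSION     TOY00001.1
KEYWORDS    .
SOURCE      Homo sapiens (human)
"""

def parse_header(text):
    """Map each top-level keyword (LOCUS, DEFINITION, ...) to its value."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key.isupper():                 # indented continuation lines are skipped
            fields[key] = value.strip()
    return fields

def parse_locus(value):
    """Split the LOCUS value into its named parts."""
    name, length, _unit, molecule, topology, division, date = value.split()
    return {"name": name, "length": int(length), "molecule": molecule,
            "topology": topology, "division": division, "date": date}

header = parse_header(record)
locus = parse_locus(header["LOCUS"])
print(locus)
print(header["VERSION"].split("."))       # accession and version number
```

For real records a dedicated parser is a safer choice, since actual files contain multi-line fields and a FEATURES table that this sketch ignores.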
Types of Databases
Primary databases: Contain experimentally derived data like sequences or structures of biological components (protein or nucleotide).
Secondary databases: Contain information derived from primary databases, such as conserved sequences and active site residues.
Composite databases: Collections of primary database resources with analysis tools.
Primary Databases
Contain raw nucleic acid and protein sequence or structure data submitted by researchers worldwide.
Examples are hosted by organisations such as NCBI (National Center for Biotechnology Information).
Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)
Protein
PIR (Protein Information Resource)
MIPS
SWISS-PROT
TrEMBL (Translated EMBL)
PDB (Protein Data Bank)
Secondary Databases
Contain information derived from primary databases (conserved sequences, active site residues, signature sequences).
Examples:
Class Architecture Topology Homology (CATH)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Protein Families (Pfam)
Structural Classification of Proteins (SCOP)
PROSITE
BLOCKS
PRINTS
Composite Databases
Collections of several primary database resources.
Provide analysis tools and software.
Suffer from high data redundancy.
Biological Databases
Classified into:
Sequence database
structure database
pathway databases.
Sequence databases apply to both nucleic acid and protein sequences.
Structure databases hold mainly protein structures, although archives such as PDB also contain nucleic acid structures.
Sequence Databases
Nucleotide and protein sequence databases are widely used.
Serve as repositories for wet lab results.
Major public data banks:
GenBank (USA)
EMBL (Europe)
DDBJ (Japan)
Protein databases:
ExPASy
UniProt
PIR
MIPS
SWISS-PROT
TrEMBL
PDB
National Center for Biotechnology Information (NCBI)
Developed at NIH in 1988.
Part of the National Library of Medicine.
Provides access to biomedical and genomic information.
Maintains databases, bioinformatics tools, and services.
GenBank is a popular database.
NCBI Mission
Find novel techniques for dealing with complex data.
Provide better accessibility to analytical and computational tools.
Maintain biological databases (primary or secondary).
Includes GenBank.
Provides data retrieval systems like Entrez.
Provides computational resources for GenBank data analysis.
NCBI Resources
Categorized into databases and tools.
Major NCBI Databases
GenBank and PubMed.
Other databases:
Gene
Genome
Epigenomics
Gene Expression
RefSeq
Structure
dbSNP
Taxonomy
NCBI Tools
Entrez: Search engine of NCBI
Other tools:
Genomes Browser
BLAST
CDTree
Genetic Codes
ORF Finder
SNP Database Specialized Search Tools
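Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds request URLs and makes no network call. The `esearch.fcgi` and `efetch.fcgi` endpoints and the `db`, `term`, `id`, `rettype` and `retmode` parameters belong to E-utilities, while the helper names, the search term and the accession `TOY00001` are hypothetical placeholders.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=5):
    """Build a search URL that would return matching record IDs as JSON."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "json"})
    return f"{EUTILS}/esearch.fcgi?{query}"

def efetch_url(db, ids, rettype="gb"):
    """Build a fetch URL that would return records as GenBank flat-file text."""
    query = urlencode({"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"})
    return f"{EUTILS}/efetch.fcgi?{query}"

print(esearch_url("nuccore", "insulin AND human"))   # placeholder search term
print(efetch_url("nuccore", ["TOY00001"]))           # placeholder accession
```

The resulting URLs could be opened with any HTTP client. NCBI's usage guidelines on request rates and API keys should be checked before sending queries in bulk.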
GenBank
Genetic Sequence Databank at NCBI.
Established in 1982.
Contains publicly available nucleotide sequences.
Allows DNA sequence submission via BankIt (web-based) or Sequin (for complicated submissions).
European Molecular Biology Laboratory (EMBL)
EMBL was founded in 1974; its nucleotide sequence database is maintained by the European Bioinformatics Institute (EBI), UK.
Maintains databases with free access.
Primary source of nucleotide sequences for Europe.
Accepts nucleotide sequence data from genome-sequencing projects and the European Patent Office.
EMBL Collaboration
Collaborates with GenBank (USA) and DDBJ (Japan) for data collection.
Other genomic databases: Ensembl and Genome Reviews.
Daily releases with new submissions and updated data.
Entire database released every 3 months.
DDBJ
DNA Data Bank of Japan.
Collects DNA sequences submitted by researchers.
Run by the National Institute of Genetics, Japan.
Manages data in DDBJ format (flat file).
Flat file includes sequence, submitter information, references, source organisms, and feature information.
Ensembl Genome Database
Genome browser for retrieving genomic information from various organisms.
Created and maintained by EBI and the Sanger Centre, UK (now the Wellcome Sanger Institute).
Protein Databases
Swiss-Prot: Protein sequence and knowledge database with high-quality annotation.
Pfam: Database of protein families with annotations and multiple sequence alignments.
More Protein Databases
TrEMBL: EBI database of computer-annotated entries from translated coding sequences.
PIR: Integrated public bioinformatics resource supporting genomic and proteomic research.
Database of freely accessible protein sequences with high-quality data and functional information.
Structure Databases
Include:
Protein Data Bank (PDB)
Established in 1971 at Brookhaven National Laboratory (BNL).
Contains structural information determined by X-ray crystallography and NMR methods.
Maintained by the Research Collaboratory for Structural Bioinformatics (RCSB).
PROSITE
Database of protein domains and families.
Contains biologically significant sites, patterns, and profiles to identify known protein families.
CATH
Hierarchical classification of protein domain structures (Class, architecture, topology, homologous superfamily).
Pathway Databases
Describe biochemical pathways, reactions, and enzymes.
Examples:
KEGG
BRENDA
BioCyc
KEGG
Kyoto Encyclopedia of Genes and Genomes.
Collection of databases dealing with genomes, enzymatic pathways, and biological chemicals.
Contains three databases: PATHWAY, GENES, and LIGAND.
PATHWAY: Molecular interaction networks.
GENES: Sequences of genes and proteins.
LIGAND: Chemical compounds and reactions.
BioCyc
Collection of pathway and genome databases for different organisms.
Includes EcoCyc (Escherichia coli K-12) and MetaCyc (pathways for more than 300 organisms).
SEQUENCE ALIGNMENT
Sequence alignment arranges protein (or DNA) sequences to identify similarity regions indicative of evolutionary relationships.
Useful for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures.
Importance of Sequence Alignment
Identify polymorphisms and mutations between sequences.
Quantify phylogenetic distance between two sequences.
Compare an mRNA with its genomic region.
Look for functional domains.
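Once two sequences are aligned, the first two uses above reduce to counting. The sketch below is a minimal illustration that assumes the two sequences are already aligned to equal length with `-` as the gap character. It lists the differing positions and computes the p-distance (the fraction of compared, ungapped sites that differ), one simple distance measure among several. The function name and the toy sequences are made up.

```python
def compare_aligned(a, b):
    """Return (differences, p_distance) for two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    differences = []                       # (1-based position, residue in a, residue in b)
    compared = 0
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if x == "-" or y == "-":
            continue                       # columns with a gap are not compared
        compared += 1
        if x != y:
            differences.append((pos, x, y))
    p_distance = len(differences) / compared if compared else 0.0
    return differences, p_distance

diffs, p = compare_aligned("ATGCT-GA", "ATGTTCGA")   # toy aligned pair
print(diffs, round(p, 3))
```

Model-based corrections for multiple substitutions at one site are often applied on top of the raw p-distance and are not shown here.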
Types of Sequence Alignment
Global Alignment
Local Alignment
Global Alignment
Assumes similarity over the entire length of two sequences.
Alignment from beginning to end to find the best possible alignment across the entire length.
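A common way to compute a global alignment is the Needleman-Wunsch dynamic programming algorithm. The sketch below is a minimal version that assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1). These values are illustrative defaults, not a recommendation, and practical tools use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap              # prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap              # prefix of b aligned to gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,             # gap in b
                              score[i][j - 1] + gap)             # gap in a

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("GATTACA", "GCATGCU"))
```

When several alignments share the best score, the traceback returns only one of them, here preferring the diagonal move.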
Applications of Global Alignment
Comparing two genes with the same function (in human vs. mouse).
Comparing two proteins with similar functions.
Local Alignment
Finds local regions with the highest level of similarity between two sequences, disregarding the rest.
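Local alignment is commonly computed with the Smith-Waterman algorithm, a dynamic programming method in which cell scores are never allowed to fall below zero. The sketch below is a minimal illustration that assumes match +2, mismatch -1 and a linear gap penalty of -2. These values and the toy sequences are illustrative only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,                                  # start a new local alignment
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTTACGTAAA", "GGACGTCC"))   # shared core is ACGT
```

Unlike the global case, the flanking regions that do not match are simply left out of the result.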
Applications of Local Alignment
Searching for local similarities in large sequences.
Looking for conserved domains or motifs in two proteins.
PAIRWISE SEQUENCE ALIGNMENT
Finds the best-matching piecewise (local or global) alignments of two query sequences.
Methods of Producing Pairwise Alignments
Dot matrix method (older method).
Dynamic programming (DP) method (more advanced method).
Word or k-tuple methods.
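The dot matrix method is the easiest of the three to show in code. The sketch below is a minimal illustration: it marks a dot wherever a word of `window` letters from one sequence equals a word from the other, so similar regions appear as diagonal runs. The function names, the `window` parameter and the toy sequences are made up for the example.

```python
def dot_matrix(a, b, window=1):
    """Return a grid of booleans: True where a[i:i+window] == b[j:j+window]."""
    return [[a[i:i + window] == b[j:j + window]
             for j in range(len(b) - window + 1)]
            for i in range(len(a) - window + 1)]

def render(matrix):
    """Draw the grid as text, one row per position in the first sequence."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)

print(render(dot_matrix("GATTACA", "GATCACA")))
print()
print(render(dot_matrix("GATTACA", "GATCACA", window=2)))   # larger window, less noise
```

Increasing the window removes isolated chance matches and leaves the longer diagonals, which is the usual way to filter noise in dot plots.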
Tools for Pairwise Sequence Alignment
BLAST, FASTA
Multiple Sequence Alignment (MSA)
Alignment of three or more biological protein or nucleic acid sequences of similar length.
Methods of MSA
Dynamic Programming Approach
Progressive method
Iterative method
Tools in MSA
CLUSTAL W, CLUSTAL W2, CLUSTAL Omega, etc.
Applications of MSA
Detecting similarities between sequences.
Detecting conserved regions or motifs in sequences.
Detection of structural homologies.
Improved prediction of secondary and tertiary structures of proteins.
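Detecting conserved regions, as listed above, can be sketched as a column-by-column count over an existing alignment. The example below assumes the sequences are already aligned (for instance by one of the tools named earlier) and uses a made-up toy alignment. The function names and the `threshold` parameter are illustrative.

```python
from collections import Counter

def column_conservation(alignment):
    """For each column return (most common symbol, fraction of rows showing it)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):               # iterate over alignment columns
        symbol, count = Counter(column).most_common(1)[0]
        result.append((symbol, count / len(column)))
    return result

def conserved_positions(alignment, threshold=1.0):
    """Return 1-based columns whose top residue (not a gap) reaches the threshold."""
    return [pos for pos, (symbol, fraction)
            in enumerate(column_conservation(alignment), start=1)
            if symbol != "-" and fraction >= threshold]

toy_alignment = ["MKT-AYI",                      # made-up aligned protein fragments
                 "MKTLAYV",
                 "MRT-AYI"]
print(conserved_positions(toy_alignment))
print(conserved_positions(toy_alignment, threshold=0.6))
```

Runs of consecutive conserved positions are candidate motifs; more careful measures weight residues by physicochemical similarity rather than identity.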
Phylogenetic Tree
Depicts the evolutionary descent of species, organisms, or genes from a common ancestor.
Reconstructs evolutionary ancestors.
Estimates the time of divergence from ancestors.
History of Phylogenetic Trees
Early representations included a paleontological chart by Edward Hitchcock (1840).
Charles Darwin (1859) popularized the evolutionary "tree" concept.
How Phylogenetic Trees Work
Visual representation of evolutionary relationships.
Phylogenetic Tree Components
Leaves: Current species; sequences in current species
Internal nodes: Hypothetical common ancestors
Branch (edge) lengths: "Time" from one speciation to the next
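These components map directly onto a small data structure. The sketch below is one possible representation, not a standard one: a hypothetical `Node` class stores a name, the length of the branch to its parent, and its children, and a helper writes the tree in Newick notation, a widely used text format for trees. The species names and branch lengths are arbitrary illustrative values.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str = ""                  # leaf label, empty for internal nodes
    length: float = 0.0             # length of the branch leading to the parent
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children

def leaves(node):
    """Collect leaf names from left to right."""
    if node.is_leaf():
        return [node.name]
    return [name for child in node.children for name in leaves(child)]

def to_newick(root):
    """Write the tree as a Newick string with branch lengths."""
    def fmt(node, is_root):
        if node.is_leaf():
            text = node.name
        else:
            text = "(" + ",".join(fmt(c, False) for c in node.children) + ")" + node.name
        return text if is_root else f"{text}:{node.length}"
    return fmt(root, True) + ";"

# Internal nodes stand for hypothetical ancestors; lengths are arbitrary.
tree = Node(children=[
    Node(length=2.0, children=[Node("human", 1.0), Node("chimp", 1.0)]),
    Node("mouse", 3.0),
])
print(leaves(tree))
print(to_newick(tree))
```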
Dendrogram, Cladogram, Phylogram
Dendrogram: Diagrammatic representation of phylogenetic trees.
Cladogram: Shows branching order only; branch lengths carry no meaning.
Phylogram: Branch lengths are proportional to the amount of evolutionary change.
Chronogram: Branch lengths represent evolutionary time.
Types of Phylogenetic Trees
Cladogram
Chronogram
Phylogram
Rooted vs. Unrooted Trees
Rooted Tree: Inferences about a common ancestor.
Unrooted Tree: Illustration about the leaves or branches, without assumptions regarding a common ancestor.
Construction of Phylogenetic Tree
Find the tree that best describes species relationships.
Two main types:
Character based methods
Distance based methods
Character Based Methods
Use aligned characters directly during tree inference.
Examples: Parsimony and Maximum likelihood.
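As an illustration of a character-based criterion, the sketch below scores a given tree by parsimony using Fitch's algorithm, which counts the minimum number of state changes needed at each aligned site. It assumes a rooted binary tree written as nested tuples of taxon names and a made-up four-taxon alignment. It only scores candidate trees and does not search tree space, which a full parsimony analysis would also have to do.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):                    # leaf: its observed state, no change
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                                   # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
               for i in range(n_sites))

toy = {"A": "ACGT", "B": "ACGT", "C": "TCGA", "D": "TCGA"}   # made-up aligned sites
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "D"), ("B", "C"))
print(parsimony_score(tree_1, toy), parsimony_score(tree_2, toy))
```

The tree with the lower score requires fewer changes and is preferred under the parsimony criterion.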
Distance Based Methods
Transform sequence data into pairwise distances for tree building.
Examples: UPGMA and Neighbor-joining.
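UPGMA can be sketched compactly: repeatedly merge the two closest clusters, where the distance between clusters is the average of all leaf-to-leaf distances. The example below is a minimal, unoptimised illustration. It assumes distances are supplied in a dictionary keyed by `frozenset` pairs of leaf names, returns internal nodes as `(left, right, height)` tuples, and uses a made-up distance table. These representation choices are illustrative, not a standard.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a rooted tree by average linkage.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Leaves are returned as names, internal nodes as (left, right, height).
    """
    clusters = [(name, [name]) for name in names]        # (subtree, member leaves)

    def average(m1, m2):
        total = sum(dist[frozenset((x, y))] for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average(clusters[p[0]][1], clusters[p[1]][1]))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        merged = ((t1, t2, average(m1, m2) / 2), m1 + m2)   # node height = half the distance
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

pairs = [(("A", "B"), 2), (("A", "C"), 6), (("B", "C"), 6),
         (("A", "D"), 10), (("B", "D"), 10), (("C", "D"), 10)]   # made-up distances
toy_dist = {frozenset(pair): d for pair, d in pairs}
print(upgma(["A", "B", "C", "D"], toy_dist))
```

UPGMA assumes a constant rate of change along all lineages; Neighbor-joining drops that assumption and is not shown here.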