Bioinformatics – Databases, Tools, BLAST & Phylogenetics
Lecture Learning Outcomes
• Understand the use of bioinformatics.
• Define the different databases, tools and repositories used in bioinformatics.
• Identify the different BLAST tools used to search queries.
• Identify the tools used for genome sequence alignments.
What Is Bioinformatics?
• Formal definition: “the use of computers to store, retrieve, analyse or predict the composition or structure of bio-molecules.”
• Practical definition: application of computational techniques + information technology to organise, curate and interrogate biological data.
Core Activities
Sequence-centric analyses (most common):
• DNA/RNA sequence assembly, annotation, mutation calling.
• Promoter & enhancer mapping; epigenetic mark prediction.
• Evolution of gene families & phylogenetics.
• Protein sequence/structure/function inference.
Network-level analyses:
• Transcriptional regulatory networks.
• Pathway reconstruction & metabolic modelling.
Structural analyses: protein folding, domain/motif prediction.
Significance
• Accelerates discovery cycles—from hypothesis ⇒ data mining ⇒ wet-lab validation ⇒ personalised therapies.
• Bridges genomics, transcriptomics, proteomics, metabolomics & phenomics.
A Continuous Workflow
Research Question → Bioinformatic Analysis → Laboratory Work → Sequencing → Bioinformatic Interpretation → Personalised Therapy.
• Feedback loop: each stage informs the previous, creating an iterative refinement of hypotheses and therapies.
Genome Sequencing & Read Alignment
Sequencing outputs “short reads.”
Align reads to a reference genome to call variants (e.g., SNP A → G).
Output = VCF (variant call format) ready for downstream annotation.
Illustrative read set (excerpt):
GGTCTGGATGC
CGGTCTGGATGC
GCGGTCTGGATG …
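The align-then-call idea can be sketched in a few lines. This is a toy illustration only: the reference string, the chromosome name `chrToy` and the `min_depth` threshold are assumptions made up for the example, and real pipelines use gapped, indexed aligners and statistical variant callers rather than a brute-force ungapped scan.

```python
from collections import Counter

# Hypothetical reference (NOT from the lecture); it carries an A where the reads carry a G.
REFERENCE = "TTGCGGTCTGAATGCAA"
# The three reads shown in the excerpt above.
READS = ["GGTCTGGATGC", "CGGTCTGGATGC", "GCGGTCTGGATG"]


def best_offset(read, ref):
    """Return (offset, mismatches) for the ungapped placement with the fewest mismatches."""
    candidates = []
    for off in range(len(ref) - len(read) + 1):
        window = ref[off:off + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        candidates.append((mismatches, off))
    mismatches, off = min(candidates)
    return off, mismatches


def call_snps(reads, ref, min_depth=2):
    """Pile reads up on the reference and report positions where the majority base differs."""
    pileup = [Counter() for _ in ref]
    for read in reads:
        off, _ = best_offset(read, ref)
        for i, base in enumerate(read):
            pileup[off + i][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        majority_base, _ = counts.most_common(1)[0]
        if majority_base != ref[pos]:
            variants.append((pos + 1, ref[pos], majority_base, depth))  # 1-based position
    return variants


variants = call_snps(READS, REFERENCE)
for pos, ref_base, alt_base, depth in variants:
    # Tab-separated line loosely modelled on the VCF core columns:
    # CHROM POS ID REF ALT QUAL FILTER INFO
    print(f"chrToy\t{pos}\t.\t{ref_base}\t{alt_base}\t.\t.\tDP={depth}")
```

With the made-up reference above, all three reads overlap one position where the reference has A and the reads have G, which is the kind of A → G SNP the notes mention.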
Organisms Sequenced (June 2021)
• 3,278 unique animal nuclear genomes publicly available ≈ 0.2% of all animal species.
• Implies enormous untapped biodiversity—opportunities for evolutionary biology, drug discovery, conservation.
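As a quick sanity check on these figures, the species total implied by the two numbers can be worked out directly. The 0.2% value is rounded, so the result is only an order-of-magnitude estimate.

```python
sequenced_genomes = 3278        # figure quoted above (June 2021)
fraction_of_species = 0.002     # 0.2 %, as quoted above (rounded)

# Implied number of animal species behind the 0.2 % figure.
implied_total = sequenced_genomes / fraction_of_species
print(f"about {implied_total / 1e6:.1f} million animal species")
```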
Biological Databases
Bioinformatics relies on >3,000 distinct repositories.
Classification
Primary databases – raw, unanalysed sequences; public.
Secondary databases – data derived by analysing primary entries (e.g., motifs, domains).
Derived/Specialised databases – integrated or niche resources combining multiple sources + unique annotations.
Common Examples & Content
• NCBI – umbrella portal (GenBank, PubMed, dbVar, etc.).
• GenBank – annotated nucleotide sequences.
• dbVar – structural variations (insertions, deletions, duplications, inversions).
• Swiss-Prot – curated protein sequences.
• PDB – 3D macromolecular structures (X-ray, NMR, cryo-EM).
• MGED/GEO – microarray & expression data.
Graphical taxonomy (slide 9) links database type ↔ data class (nucleotide, protein, structure, pathways).
Bioinformatic Tools / Software
Purpose | Exemplars |
---|---|
Sequence Analysis | BLAST, ClustalW/Omega, T-Coffee, MEME, MEGA, PHYLIP |
Structure Analysis | CN3D, PyMOL, RasMol, MODELLER |
Functional / Pathway Analysis | GEO interface, InterProScan, COBRA Toolbox, Pathway Tools |
Choice depends on data type, hypothesis, required output (alignment, tree, structural model, flux map).
Visual Outputs From Bioinformatic Pipelines
• Volcano plots: log₂(fold-change) vs −log₁₀(P) (significance).
• Violin plots: distribution of expression or risk scores.
• Heat maps: sample × gene expression matrices (row Z-scores).
• Correlation matrices, colour-key histograms.
• Pathway bar-charts: gene counts, enrichment scores.
Interpretation Example
Down-regulated genes (389), up-regulated (307) relative to control; pathways such as “Cytokine–cytokine receptor interaction” flagged with high enrichment.
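The up/down counts in such a summary come from thresholding the two volcano-plot axes. The sketch below shows one common way to do that; the gene names, expression values and the cutoffs (`fc_cutoff`, `p_cutoff`) are illustrative assumptions, not values from the lecture.

```python
import math


def volcano_point(mean_treated, mean_control, p_value):
    """Return the (x, y) volcano coordinates: log2 fold-change and -log10(P)."""
    return math.log2(mean_treated / mean_control), -math.log10(p_value)


def classify(log2_fc, neg_log10_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene as up, down or not significant (cutoffs are assumed defaults)."""
    if neg_log10_p < -math.log10(p_cutoff) or abs(log2_fc) < fc_cutoff:
        return "not significant"
    return "up" if log2_fc > 0 else "down"


# Made-up example data: gene -> (mean treated, mean control, P value)
genes = {
    "geneA": (80.0, 10.0, 0.001),
    "geneB": (5.0, 40.0, 0.0001),
    "geneC": (12.0, 10.0, 0.5),
}

calls = {name: classify(*volcano_point(*values)) for name, values in genes.items()}
for name, call in calls.items():
    print(name, call)
```

Counting the "up" and "down" labels over a whole experiment gives totals of the kind quoted above.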
Basic Analysis With NCBI Gene
Query: “Homo sapiens insulin like growth factor” returns >2000 records.
Filters: genomic location, coding vs non-coding, discontinued entries.
Gene record (IGF1, Gene ID 3479) includes:
• Summary, expression atlas, orthologs, phenotypes, HIV-1 interactions.
• Navigation to genome browsers (GDV).
• Download options (FASTA, GFF, etc.).
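The same kind of search can be expressed programmatically through NCBI's E-utilities. The sketch below only builds an ESearch URL and does not send it; the helper name `gene_search_url` and the `retmax` default are assumptions for illustration, and NCBI's usage guidelines should be checked before issuing real requests.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(term, organism="Homo sapiens", retmax=20):
    """Build (but do not fetch) an ESearch URL against the Gene database."""
    params = {
        "db": "gene",
        "term": f"{term} AND {organism}[Organism]",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = gene_search_url("insulin like growth factor")
print(url)
```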
Accession Prefixes
• NC_ – complete genomic molecule.
• NG_ – gene-specific genomic region.
• NM_ – mRNA (reviewed).
• NP_ – protein (reviewed).
• XM_/XR_/XP_ – computationally predicted mRNA / ncRNA / protein.
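The prefixes listed above can be encoded as a small lookup table. This is a minimal sketch covering only those prefixes; the function name and the `XM_000000.1` accession are made up for illustration, while `NG_011713.1` is the RefSeqGene accession from the IGF1 example in these notes.

```python
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "gene-specific genomic region",
    "NM": "mRNA (reviewed)",
    "NP": "protein (reviewed)",
    "XM": "computationally predicted mRNA",
    "XR": "computationally predicted ncRNA",
    "XP": "computationally predicted protein",
}


def describe_accession(accession):
    """Map a RefSeq-style accession (PREFIX_number.version) to its molecule type."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return "not a RefSeq-style accession"
    return REFSEQ_PREFIXES.get(prefix.upper(), "prefix not in this list")


print(describe_accession("NG_011713.1"))   # accession from the IGF1 example
print(describe_accession("XM_000000.1"))   # hypothetical accession
```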
FASTA Format Essentials
>identifier optional_description
SEQUENCE (continuous, no numbers)
• Works for DNA, RNA, protein.
• One ‘>’ header line per record; sequence conventionally in uppercase letters, wrapped at ≤80 characters per line.
• Standard input for BLAST, Clustal, MSA viewers.
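Because the format is this simple, a reader for it fits in a few lines. The parser below is a minimal sketch under the rules listed above (header starts with ‘>’, sequence may wrap over several lines); the record names and sequences in `TOY_FASTA` are invented for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}, joining wrapped sequence lines."""
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split(None, 1)   # identifier, then optional description
            if not fields:
                raise ValueError("header line has no identifier")
            current = fields[0]
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


TOY_FASTA = """\
>seq1 toy DNA record
GGTCTGGATGC
CGGTCTGG
>seq2
atgc
"""

sequences = parse_fasta(TOY_FASTA)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```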
Limitations of Direct Gene Queries
• Must already know gene name/synonyms.
• Isoforms produce multiple hits.
• Homonyms/aliases differ across species/databases.
Solution for unknowns: BLAST (Basic Local Alignment Search Tool).
Five Main BLAST Flavours
Tool | Query | Target DB | Typical Use |
---|---|---|---|
BLASTN | nucleotide | nucleotide | intra-species DNA match, SNP search |
BLASTP | protein | protein | protein homology, motif detection |
BLASTX | nucleotide → 6-frame protein | protein | find coding potential in raw DNA |
TBLASTN | protein | translated nucleotide | find distant nucleotide homologs via protein seed |
TBLASTX | translated nucleotide | translated nucleotide | deep cross-species comparisons of coding regions |
Statistical output includes E-value, bit score, % identity, coverage, alignment view, taxonomy distribution.
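The translated searches in the table (BLASTX, TBLASTN, TBLASTX) rest on six-frame translation: three reading frames on each strand. The sketch below shows that step only, using the standard genetic code; it is not how BLAST itself is implemented, and the example sequence is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the six conceptual translations keyed as +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")   # made-up 9-base query
for label, protein in frames.items():
    print(label, protein)
```

Each of the six protein strings is then compared against the protein database, which is why BLASTX can reveal coding potential without knowing the correct frame in advance.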
Interpreting BLAST Output (IGF1 Example)
• Top hit: Homo sapiens RefSeqGene NG_011713.1, 100% identity, E = 0.0.
• Cross-species hits: Canis lupus ≈ 83% identity; Sus scrofa ≈ 79%.
• Graphical summary bars colour-coded by alignment score (NCBI key: red ≥200, magenta 80–200, green 50–80, blue 40–50, black <40).
• Alignment section: gaps indicated by dashes, mismatches in lowercase or red.
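The identity, gap and coverage figures can be recomputed from the two aligned rows. In the sketch below, identity is taken over the full alignment length (gap columns included), which is how BLAST reports "Identities"; the aligned rows and the query length are invented and are not taken from the IGF1 search.

```python
def alignment_stats(query_row, subject_row, query_length):
    """Summarise one pairwise alignment given its two gapped rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have equal length")
    columns = list(zip(query_row, subject_row))
    identical = sum(q == s and q != "-" for q, s in columns)
    gaps = sum(q == "-" or s == "-" for q, s in columns)
    aligned_query_residues = sum(q != "-" for q in query_row)
    return {
        "percent_identity": 100 * identical / len(columns),
        "gaps": gaps,
        "query_coverage": 100 * aligned_query_residues / query_length,
    }


# Made-up alignment: one gap in the subject; the full query is assumed to be 22 bases long.
stats = alignment_stats("GGTCTGGATGC", "GGTCT-GATGC", query_length=22)
print(stats)
```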
Multiple Sequence Alignment (MSA) & Phylogenetics
ClustalW / Clustal Omega Workflow
Paste FASTA sequences for species/orthologs.
Compute MSA:
• ‘*’ identical, ‘:’ strongly similar, ‘.’ weakly similar, ‘-’ gap.
Export alignment → Newick/PHYLIP tree.
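The consensus symbols are derived column by column. The sketch below reproduces only the ‘*’ (fully identical) mark; the ‘:’ and ‘.’ marks depend on amino-acid similarity groups that are deliberately left out here, so this is a simplification rather than Clustal's actual behaviour. The three aligned rows are made up.

```python
def conservation_line(rows):
    """Return a marker string with '*' under columns where every sequence agrees."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        fully_conserved = "-" not in column and len(set(column)) == 1
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


ALIGNED = [
    "GGTCTGGATGC",
    "GGTCT-GATGC",
    "GGACTGGATGC",
]
marker = conservation_line(ALIGNED)
for row in ALIGNED:
    print(row)
print(marker)
```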
Phylogenetic Trees
• Node: ancestral junction.
• Branch: lineage segment; branch length ≈ substitutions/site (e.g., 0.05).
• Taxon (terminal): extant gene/protein.
• Clade: ancestor + all descendants (monophyletic group).
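These terms map directly onto the Newick text format that alignment tools export. The parser below is a minimal sketch that handles plain names and branch lengths only (no quoted labels or comments); the tree, its taxon names and its branch lengths are invented for illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # skip '('
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # skip ','
                children.append(node())
            pos += 1                      # skip ')'
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()


def leaves(node):
    """Terminal taxa below a node."""
    if not node["children"]:
        return [node["name"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


def clades(node):
    """One sorted taxon list per internal node (ancestor plus all descendants)."""
    found = []
    if node["children"]:
        found.append(sorted(leaves(node)))
        for child in node["children"]:
            found.extend(clades(child))
    return found


def root_to_tip(node, so_far=0.0):
    """Sum branch lengths (substitutions/site) from the root to each taxon."""
    here = so_far + (node["length"] or 0.0)
    if not node["children"]:
        return {node["name"]: here}
    distances = {}
    for child in node["children"]:
        distances.update(root_to_tip(child, here))
    return distances


tree = parse_newick("((human:0.02,dog:0.05):0.03,pig:0.08);")   # made-up tree
print(clades(tree))
print(root_to_tip(tree))
```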
TreeView, PHYLIP and MEGA visualise trees in rooted/unrooted and rectangular/cladogram styles, with bootstrap support values (e.g., 1000 replicates giving >98% confidence).
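Bootstrap support comes from rebuilding the tree on resampled alignments: columns are drawn with replacement, a tree is inferred from each replicate, and the support for a clade is the percentage of replicate trees that contain it. The sketch below covers only the resampling step; tree inference is left to a dedicated tool, and the alignment, the function name and the seed are assumptions for the example.

```python
import random


def bootstrap_alignments(rows, replicates=1000, seed=0):
    """Yield pseudo-alignments built by sampling alignment columns with replacement."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n_columns = len(rows[0])
    for _ in range(replicates):
        picked = rng.choices(range(n_columns), k=n_columns)
        yield ["".join(row[c] for c in picked) for row in rows]


ALIGNMENT = [
    "GGTCTGGATGC",
    "GGTCTGAATGC",
    "GGACTGGATGC",
]
demo_replicates = list(bootstrap_alignments(ALIGNMENT, replicates=3))
for replicate in demo_replicates:
    print(replicate)
```

Each replicate keeps the same number of sequences and columns as the original, so any tree-building method can be applied to it unchanged.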
Real-World Applications of Bioinformatics
• Medicine: variant interpretation, pharmacogenomics, gene therapy.
• Drug design: structure-based docking, ADMET prediction.
• Agriculture: drought resistance, crop yield optimisation, transgenic design.
• Veterinary science: pathogen surveillance, breeding programs.
• Evolutionary studies: species divergence, molecular clocks.
• Forensics: species/individual identification, trace evidence.
• Environmental biotech: weather pattern modelling, waste-cleanup consortia.
• Antibiotic resistance tracking.
Interdisciplinary convergence drives personalised healthcare, sustainable agriculture and biodiversity conservation.
Ethical & Practical Considerations
• Data privacy (human genomes).
• Open vs commercial database licensing.
• Biosecurity: dual-use concerns (synthetic biology).
• Computational equity: ensuring global South access to resources.
Lecture Conclusions
Bioinformatics integrates computation & biology to manage data and answer complex questions.
Databases fall into primary, secondary, derived categories; >3,000 resources exist.
Genomic analysis spans almost all organisms, yet coverage remains <1% of animal diversity.
NCBI is the central hub for known sequences; accession prefixes encode molecule type & curation level.
BLAST family is indispensable for unknown sequence identification.
ClustalW/Omega & related MSA tools underpin phylogenetic inference.
Mastery of these foundations enables researchers to traverse the full pipeline—from raw reads to evolutionary insights and therapeutic innovation.