Bioinformatics – Databases, Tools, BLAST & Phylogenetics

Lecture Learning Outcomes

• Understand the use of bioinformatics.
• Define the different databases, tools and repositories used in bioinformatics.
• Identify the different BLAST tools used to search queries.
• Identify the tools used for genome sequence alignments.


What Is Bioinformatics?

• Formal definition: “the use of computers to store, retrieve, analyse or predict the composition or structure of bio-molecules.”
• Practical definition: application of computational techniques + information technology to organise, curate and interrogate biological data.

Core Activities
  1. Sequence-centric analyses (most common):
    • DNA/RNA sequence assembly, annotation, mutation calling.
    • Promoter & enhancer mapping; epigenetic mark prediction.
    • Evolution of gene families & phylogenetics.
    • Protein sequence/structure/function inference.

  2. Network-level analyses:
    • Transcriptional regulatory networks.
    • Pathway reconstruction & metabolic modelling.

  3. Structural analyses: protein folding, domain/motif prediction.

Significance

• Accelerates discovery cycles—from hypothesis ⇒ data mining ⇒ wet-lab validation ⇒ personalised therapies.
• Bridges genomics, transcriptomics, proteomics, metabolomics & phenomics.


A Continuous Workflow

Research Question → Bioinformatic Analysis → Laboratory Work → Sequencing → Bioinformatic Interpretation → Personalised Therapy.
• Feedback loop: each stage informs the previous, creating an iterative refinement of hypotheses and therapies.


Genome Sequencing & Read Alignment

  1. Sequencing outputs “short reads.”

  2. Align reads to a reference genome to call variants (e.g., SNP A → G).

  3. Output = VCF (variant call format) ready for downstream annotation.
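
As a rough illustration of step 3, the sketch below formats one variant call as a minimal VCF file. The eight fixed columns and the `##fileformat` header follow the VCF specification; the contig name `chr1`, the position and the depth value are made-up placeholders, not data from the lecture.

```python
# Minimal VCF writer sketch; contig, position and depth below are hypothetical.
VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def vcf_line(chrom, pos, ref, alt, depth):
    """Format one SNP as a tab-separated VCF record (1-based position)."""
    # "." marks a missing value for ID, QUAL and FILTER.
    fields = [chrom, str(pos), ".", ref, alt, ".", ".", f"DP={depth}"]
    return "\t".join(fields)

def vcf_text(records):
    """Join the header lines and the data lines into one string."""
    header = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">',
        "\t".join(VCF_COLUMNS),
    ]
    return "\n".join(header + [vcf_line(*r) for r in records]) + "\n"

print(vcf_text([("chr1", 9, "A", "G", 3)]))
```

Real pipelines also add sample columns and quality values; this sketch only shows the shape of a record.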

Illustrative read set (excerpt):

GGTCTGGATGC
CGGTCTGGATGC
GCGGTCTGGATG …
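
To make the align-then-call idea concrete, here is a toy sketch that places the three reads above on a reference and reports positions where the reads disagree with it. It assumes ungapped placement chosen by fewest mismatches, which is far simpler than the indexed, gapped alignment real aligners perform. The reference string is hypothetical and was chosen so that the reads carry an A → G change.

```python
from collections import Counter

def best_offset(read, reference):
    """Return (offset, mismatches) for the ungapped placement with fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best

def call_snps(reads, reference, min_depth=2):
    """Pile reads onto the reference and report majority bases that differ from it."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        offset, _ = best_offset(read, reference)
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1
    variants = []
    for index, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, _ = counts.most_common(1)[0]
        if base != reference[index]:
            # Report 1-based position, reference base, observed base, depth.
            variants.append((index + 1, reference[index], base, depth))
    return variants

reference = "GCGGTCTGAATGC"  # hypothetical reference, not from the lecture
reads = ["GGTCTGGATGC", "CGGTCTGGATGC", "GCGGTCTGGATG"]
print(call_snps(reads, reference))
```

With this made-up reference, all three reads support G where the reference has A, so a single SNP is reported.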

Organisms Sequenced (June 2021)

• 3,278 unique animal nuclear genomes publicly available ≈ 0.2% of all animal species.
• Implies enormous untapped biodiversity—opportunities for evolutionary biology, drug discovery, conservation.


Biological Databases

Bioinformatics relies on >3,000 distinct repositories.

Classification
  1. Primary databases – raw, unanalysed sequences; public.

  2. Secondary databases – data derived by analysing primary entries (e.g., motifs, domains).

  3. Derived/Specialised databases – integrated or niche resources combining multiple sources + unique annotations.

Common Examples & Content

• NCBI – umbrella portal (GenBank, PubMed, dbVar, etc.).
• GenBank – annotated nucleotide sequences.
• dbVar – structural variations (insertions, deletions, duplications, inversions).
• Swiss-Prot – curated protein sequences.
• PDB – 3D macromolecular structures (X-ray, NMR, cryo-EM).
• MGED/GEO – microarray & expression data.

Graphical taxonomy (slide 9) links each database type to its data class (nucleotide, protein, structure, pathways).


Bioinformatic Tools / Software

| Purpose | Exemplars |
| --- | --- |
| Sequence Analysis | BLAST, ClustalW/Omega, T-Coffee, MEME, MEGA, PHYLIP |
| Structure Analysis | CN3D, PyMOL, RasMol, MODELLER |
| Functional / Pathway Analysis | GEO interface, InterProScan, COBRA Toolbox, Pathway Tools |

Choice depends on data type, hypothesis, required output (alignment, tree, structural model, flux map).


Visual Outputs From Bioinformatic Pipelines

• Volcano plots: log₂(fold-change) on the x-axis vs -log₁₀(P) (significance) on the y-axis.
• Violin plots: distribution of expression or risk scores.
• Heat maps: sample × gene expression matrices (row Z-scores).
• Correlation matrices, colour-key histograms.
• Pathway bar-charts: gene counts, enrichment scores.
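
The volcano-plot axes can be sketched in a few lines. The example below computes the two coordinates for a gene and labels it as up-regulated, down-regulated or not significant. The cutoffs (|log₂ fold-change| ≥ 1 and P < 0.05) are common conventions assumed here, not values stated in the lecture, and the gene names and numbers are invented.

```python
import math

def volcano_point(mean_treated, mean_control, p_value):
    """Return (log2 fold-change, -log10 P) for one gene."""
    return math.log2(mean_treated / mean_control), -math.log10(p_value)

def classify(log2_fc, neg_log10_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene 'up', 'down' or 'ns' (not significant) using assumed cutoffs."""
    if neg_log10_p < -math.log10(p_cutoff):
        return "ns"
    if log2_fc >= fc_cutoff:
        return "up"
    if log2_fc <= -fc_cutoff:
        return "down"
    return "ns"

# Hypothetical mean expression values and P-values, for illustration only.
genes = {
    "geneA": (8.0, 2.0, 0.001),
    "geneB": (1.0, 4.0, 0.01),
    "geneC": (3.0, 2.5, 0.4),
}
for name, (treated, control, p) in genes.items():
    x, y = volcano_point(treated, control, p)
    print(name, round(x, 2), round(y, 2), classify(x, y))
```

Counting the "up" and "down" labels over a full gene list gives the kind of summary quoted in differential-expression reports.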

Interpretation Example

Down-regulated genes (389), up-regulated (307) relative to control; pathways such as “Cytokine–cytokine receptor interaction” flagged with high enrichment.


Basic Analysis With NCBI Gene

  1. Query: “Homo sapiens insulin like growth factor” returns >2000 records.

  2. Filters: genomic location, coding vs non-coding, discontinued entries.

  3. Gene record (IGF1, Gene ID 3479) includes:
    • Summary, expression atlas, orthologs, phenotypes, HIV-1 interactions.
    • Navigation to genome browsers (GDV).
    • Download options (FASTA, GFF, etc.).
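
The same query can also be issued programmatically. The sketch below only builds request URLs for NCBI's E-utilities (`esearch` and `esummary` against the `gene` database) and does not send them. The helper names and the `retmax` default are choices made for this example, and the exact search syntax a given analysis needs may differ.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def gene_search_url(term, organism="Homo sapiens", retmax=20):
    """Build an esearch URL that looks up a term in the Gene database."""
    query = f"{term} AND {organism}[Organism]"
    params = {"db": "gene", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

def gene_summary_url(gene_id):
    """Build an esummary URL for one Gene ID (e.g. 3479 for IGF1)."""
    params = {"db": "gene", "id": gene_id, "retmode": "json"}
    return f"{EUTILS_BASE}/esummary.fcgi?{urlencode(params)}"

print(gene_search_url("insulin like growth factor"))
print(gene_summary_url(3479))
```

Fetching these URLs (for example with `urllib.request`) returns JSON that lists matching Gene IDs or the record summary.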

Accession Prefixes

• NC_ complete genomic molecule.
• NG_ gene-specific genomic region.
• NM_ mRNA (curated).
• NP_ protein (curated).
• XM_ / XR_ / XP_ computationally predicted mRNA / ncRNA / protein.
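
A small lookup makes the prefix scheme easy to apply to a list of accessions. The sketch below covers only the RefSeq prefixes listed above; the function name and the returned dictionary layout are choices made for this example.

```python
# Prefix -> (molecule type, computationally predicted?)
REFSEQ_PREFIXES = {
    "NC": ("complete genomic molecule", False),
    "NG": ("gene-specific genomic region", False),
    "NM": ("mRNA", False),
    "NP": ("protein", False),
    "XM": ("mRNA", True),
    "XR": ("non-coding RNA", True),
    "XP": ("protein", True),
}

def describe_accession(accession):
    """Split an accession such as 'NG_011713.1' into prefix, molecule type and version."""
    prefix, separator, rest = accession.partition("_")
    prefix = prefix.upper()
    if not separator or prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unrecognised RefSeq prefix in {accession!r}")
    molecule, predicted = REFSEQ_PREFIXES[prefix]
    _, _, version = rest.partition(".")
    return {
        "prefix": prefix,
        "molecule": molecule,
        "predicted": predicted,
        "version": int(version) if version else None,
    }

print(describe_accession("NG_011713.1"))
```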


FASTA Format Essentials

>identifier optional_description
SEQUENCE (continuous, no numbers)

• Works for DNA, RNA, protein.
• One ‘>’ header line per record; sequence conventionally in uppercase letters, wrapped at ≤80 characters per line.
• Standard input for BLAST, Clustal, MSA viewers.
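
The format is simple enough to parse by hand. The sketch below reads FASTA text into (identifier, description, sequence) tuples, joining wrapped sequence lines. The two records in the usage example are invented.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())  # normalise to uppercase
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        yield _record(header, chunks)

def _record(header, chunks):
    """Split the header at the first space and join the wrapped sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

# Hypothetical two-record input for illustration.
example = ">seq1 toy DNA record\nACGTAC\nGTAC\n>seq2\nmkt\n"
for record in parse_fasta(example):
    print(record)
```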


Limitations of Direct Gene Queries

• Must already know gene name/synonyms.
• Isoforms produce multiple hits.
• Homonyms/aliases differ across species/databases.

Solution for unknowns: BLAST (Basic Local Alignment Search Tool).


Five Main BLAST Flavours

| Tool | Query | Target DB | Typical Use |
| --- | --- | --- | --- |
| BLASTN | nucleotide | nucleotide | intra-species DNA match, SNP search |
| BLASTP | protein | protein | protein homology, motif detection |
| BLASTX | nucleotide → 6-frame protein | protein | find coding potential in raw DNA |
| TBLASTN | protein | translated nucleotide | find distant nucleotide homologs via protein seed |
| TBLASTX | translated nucleotide | translated nucleotide | deep cross-species comparisons of coding regions |
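
The "6-frame" translation behind BLASTX, TBLASTN and TBLASTX can be illustrated with a short sketch: three reading frames on the forward strand and three on the reverse complement. It assumes the standard genetic code and is only a conceptual picture, not how BLAST is implemented internally. The input sequence is a toy example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for shift in range(3):
        frames[f"+{shift + 1}"] = translate(dna[shift:])
        frames[f"-{shift + 1}"] = translate(rc[shift:])
    return frames

print(six_frame_translation("ATGGCCTAA"))  # toy sequence; '*' marks a stop codon
```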

Statistical output includes E-value, bit score, % identity, coverage, alignment view, taxonomy distribution.
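
Two of these statistics are linked by the Karlin-Altschul relation E = m · n · 2^(-S′), where S′ is the bit score and m and n are the query and database lengths. The sketch below applies that relation directly. It ignores the effective-length corrections BLAST applies, so its numbers will not match real BLAST output exactly, and the lengths used are invented.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value: chance hits expected at or above this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Hypothetical search: 100-residue query against a database of one million residues.
for score in (30, 40, 50):
    print(score, expected_hits(score, 100, 1_000_000))
```

Each additional bit halves the E-value, which is why high bit scores correspond to E-values near zero.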


Interpreting BLAST Output (IGF1 Example)

• Top hit: Homo sapiens RefSeqGene NG_011713.1, 100% identity, E = 0.0.
• Cross-species hits: Canis lupus ≈ 83% identity; Sus scrofa ≈ 79%.
• Graphical summary bars colour-coded by alignment score (red ≥200, magenta 80–200, green 50–80, blue 40–50, black <40).
• Alignment section: gaps indicated by dashes, mismatches in lowercase or red.
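
Percent identity can be recomputed from the alignment view itself. The sketch below counts matches, mismatches and gap columns in a pairwise alignment and, as in BLAST's "Identities" line, divides matches by the full alignment length. The two aligned strings are invented.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Count matches, mismatches and gap columns in two equal-length aligned strings."""
    if not aligned_query or len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = mismatches = gaps = 0
    for q, s in zip(aligned_query.upper(), aligned_subject.upper()):
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            matches += 1
        else:
            mismatches += 1
    length = len(aligned_query)
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "identity_pct": 100.0 * matches / length,
    }

# Hypothetical alignment: one gap and one mismatch over ten columns.
print(alignment_stats("ACGT-ACGTA", "ACGTTACCTA"))
```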


Multiple Sequence Alignment (MSA) & Phylogenetics

ClustalW / Clustal Omega Workflow
  1. Paste FASTA sequences for species/orthologs.

  2. Compute MSA:
    • ‘*’ identical, ‘:’ strongly similar, ‘.’ weakly similar, ‘-’ gap.

  3. Export the alignment and build a tree from it (Newick or PHYLIP format).
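
The conservation line under an alignment can be approximated in a few lines. This sketch marks only fully identical, gap-free columns with ‘*’. The ‘:’ and ‘.’ marks depend on Clustal's amino-acid similarity groups and are deliberately left out. The three aligned peptides are invented.

```python
def conservation_line(aligned):
    """Return '*' under columns where every sequence has the same residue, else ' '."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if identical else " ")
    return "".join(marks)

# Hypothetical aligned peptides ('-' is a gap).
alignment = ["MKT-LLV", "MKTALLV", "MRT-LIV"]
for seq in alignment:
    print(seq)
print(conservation_line(alignment))
```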

Phylogenetic Trees

Node: ancestral junction.
Branch: lineage segment; branch length ≈ substitutions/site (e.g., 0.05).
Taxon (terminal): extant gene/protein.
Clade: ancestor + all descendants (monophyletic group).
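
These terms map directly onto a small tree data structure. The sketch below builds a three-taxon tree, lists the taxa in a clade (an internal node plus all its descendants) and writes the tree in Newick format. The `Node` class is a construct for this example, and the branch lengths are invented rather than measured.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str = ""          # empty for unnamed internal nodes
    length: float = 0.0     # branch length leading to this node
    children: list = field(default_factory=list)

def leaves(node):
    """Return the terminal taxa below a node, i.e. the members of its clade."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names

def to_newick(node, is_root=True):
    """Serialise a tree to a Newick string with branch lengths."""
    label = node.name
    if node.children:
        inner = ",".join(to_newick(child, is_root=False) for child in node.children)
        label = f"({inner}){node.name}"
    return label + ";" if is_root else f"{label}:{node.length}"

# Hypothetical branch lengths (substitutions/site).
dog_pig = Node(length=0.03, children=[Node("Dog", 0.12), Node("Pig", 0.15)])
root = Node(children=[Node("Human", 0.05), dog_pig])

print(leaves(dog_pig))   # taxa in the Dog + Pig clade
print(to_newick(root))
```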

TreeView, PHYLIP and MEGA draw trees as rooted or unrooted, in rectangular or cladogram styles, and can display bootstrap support values on branches (e.g., >98% support from 1000 replicates).


Real-World Applications of Bioinformatics

• Medicine: variant interpretation, pharmacogenomics, gene therapy.
• Drug design: structure-based docking, ADMET prediction.
• Agriculture: drought resistance, crop yield optimisation, transgenic design.
• Veterinary science: pathogen surveillance, breeding programs.
• Evolutionary studies: species divergence, molecular clocks.
• Forensics: species/individual identification, trace evidence.
• Environmental biotech: weather pattern modelling, waste-cleanup consortia.
• Antibiotic resistance tracking.

Interdisciplinary convergence drives personalised healthcare, sustainable agriculture and biodiversity conservation.


Ethical & Practical Considerations

• Data privacy (human genomes).
• Open vs commercial database licensing.
• Biosecurity: dual-use concerns (synthetic biology).
• Computational equity: ensuring global South access to resources.


Lecture Conclusions

  1. Bioinformatics integrates computation & biology to manage data and answer complex questions.

  2. Databases fall into primary, secondary, derived categories; >3,000 resources exist.

  3. Genome sequencing now spans many branches of life, yet coverage remains <1% of animal diversity (≈ 0.2% of animal species as of June 2021).

  4. NCBI is the central hub for known sequences; accession prefixes encode molecule type & curation level.

  5. BLAST family is indispensable for unknown sequence identification.

  6. ClustalW/Omega & related MSA tools underpin phylogenetic inference.

Mastery of these foundations enables researchers to traverse the full pipeline—from raw reads to evolutionary insights and therapeutic innovation.