Bioinformatics – Databases, Tools, BLAST & Phylogenetics

Lecture Learning Outcomes

• Understand the use of bioinformatics.
• Define the different databases, tools and repositories used in bioinformatics.
• Identify the different BLAST tools used to search queries.
• Identify the tools used for genome sequence alignments.

What Is Bioinformatics?

• Formal definition: “the use of computers to store, retrieve, analyse or predict the composition or structure of bio-molecules.”
• Practical definition: application of computational techniques + information technology to organise, curate and interrogate biological data.

Core Activities

Sequence-centric analyses (most common):
• DNA/RNA sequence assembly, annotation, mutation calling.
• Promoter & enhancer mapping; epigenetic mark prediction.
• Evolution of gene families & phylogenetics.
• Protein sequence/structure/function inference.
Network-level analyses:
• Transcriptional regulatory networks.
• Pathway reconstruction & metabolic modelling.
Structural analyses: protein folding, domain/motif prediction.

Significance

• Accelerates discovery cycles—from hypothesis ⇒ data mining ⇒ wet-lab validation ⇒ personalised therapies.
• Bridges genomics, transcriptomics, proteomics, metabolomics & phenomics.

A Continuous Workflow

Research Question → Bioinformatic Analysis → Laboratory Work → Sequencing → Bioinformatic Interpretation → Personalised Therapy.
• Feedback loop: each stage informs the previous, creating an iterative refinement of hypotheses and therapies.

Genome Sequencing & Read Alignment

Sequencing outputs “short reads.”
Align reads to a reference genome to call variants (e.g., SNP A → G).
Output = VCF (variant call format) ready for downstream annotation.

Illustrative read set (excerpt):

GGTCTGGATGC
CGGTCTGGATGC
GCGGTCTGGATG …
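
As a rough illustration of how aligned reads become a variant call, the sketch below piles up the bases observed at each reference position and reports positions where most reads disagree with the reference. The reference string, read offsets, thresholds (`min_depth`, `min_fraction`) and contig name `toy` are all invented for this example; real pipelines rely on a dedicated aligner and variant caller.

```python
from collections import Counter

def call_snps(reference, aligned_reads, min_depth=2, min_fraction=0.8):
    """Return (1-based position, ref base, alt base, depth) for simple SNPs.

    aligned_reads is a list of (0-based start, read sequence) pairs that are
    assumed to be already aligned to the reference without gaps.
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust this position
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            variants.append((pos + 1, reference[pos], base, depth))
    return variants

def to_vcf_line(chrom, variant):
    """Format one variant as a minimal tab-separated VCF data line."""
    pos, ref, alt, depth = variant
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", f"DP={depth}"])

# Hypothetical reference and reads carrying an A -> G change at position 10.
reference = "GCGGTCTGGATGC"
reads = [(0, "GCGGTCTGGGTG"), (1, "CGGTCTGGGTGC"), (2, "GGTCTGGGTGC")]
variants = call_snps(reference, reads)
for variant in variants:
    print(to_vcf_line("toy", variant))
```

The printed line follows the VCF column order CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, with read depth stored under the `DP` key.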

Organisms Sequenced (June 2021)

• 3,278 unique animal nuclear genomes publicly available, ≈ 0.2% of all animal species.
• Implies enormous untapped biodiversity—opportunities for evolutionary biology, drug discovery, conservation.

Biological Databases

Bioinformatics relies on >3,000 distinct repositories.

Classification

Primary databases – raw, unanalysed sequences; public.
Secondary databases – data derived by analysing primary entries (e.g., motifs, domains).
Derived/Specialised databases – integrated or niche resources combining multiple sources + unique annotations.

Common Examples & Content

• NCBI – umbrella portal (GenBank, PubMed, dbVar, etc.).
• GenBank – annotated nucleotide sequences.
• dbVar – structural variations (insertions, deletions, duplications, inversions).
• Swiss-Prot – curated protein sequences.
• PDB – 3D macromolecular structures (X-ray, NMR, cryo-EM).
• MGED/GEO – microarray & expression data.

Graphical taxonomy (slide 9) links database type ↔ data class (nucleotide, protein, structure, pathways).

Bioinformatic Tools / Software

Purpose	Exemplars
Sequence Analysis	BLAST, ClustalW/Omega, T-Coffee, MEME, MEGA, PHYLIP
Structure Analysis	CN3D, PyMOL, RasMol, MODELLER
Functional / Pathway Analysis	GEO interface, InterProScan, COBRA Toolbox, Pathway Tools

Choice depends on data type, hypothesis, required output (alignment, tree, structural model, flux map).

Visual Outputs From Bioinformatic Pipelines

• Volcano plots: log₂(fold-change) on the x-axis vs −log₁₀(P) on the y-axis (significance).
• Violin plots: distribution of expression or risk scores.
• Heat maps: sample × gene expression matrices (row Z-scores).
• Correlation matrices, colour-key histograms.
• Pathway bar-charts: gene counts, enrichment scores.
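
The row Z-scores used in heat maps can be sketched as follows: each gene (row) is centred on its own mean and scaled by its own standard deviation, so colours compare samples within a gene rather than across genes. The toy matrix is invented, and the use of the population standard deviation is an assumption; some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def row_z_scores(matrix):
    """Z-score each row (gene) across its columns (samples)."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sigma = pstdev(row)  # population SD; an assumption of this sketch
        if sigma == 0:
            # A flat row carries no contrast, so map it to zeros.
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(x - mu) / sigma for x in row])
    return scaled

# Hypothetical expression matrix: 2 genes x 3 samples.
expression = [[2.0, 4.0, 6.0], [5.0, 5.0, 5.0]]
z = row_z_scores(expression)
print(z)
```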

Interpretation Example

Down-regulated genes (389), up-regulated (307) relative to control; pathways such as “Cytokine–cytokine receptor interaction” flagged with high enrichment.
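
Up/down counts like those above usually come from thresholding the two volcano-plot axes. The sketch below computes both coordinates and labels each gene; the cut-offs (|log₂ fold-change| ≥ 1, P < 0.05), the gene names and the values are illustrative assumptions, not the settings behind the 389/307 figures.

```python
import math

def classify_genes(results, lfc_cutoff=1.0, p_cutoff=0.05):
    """Map gene -> (log2 fold-change, -log10 P, label).

    results maps gene name -> (fold_change, p_value), where fold_change is
    the treated/control expression ratio (must be > 0).
    """
    classified = {}
    for gene, (fold_change, p_value) in results.items():
        x = math.log2(fold_change)   # volcano x-axis
        y = -math.log10(p_value)     # volcano y-axis
        if p_value < p_cutoff and x >= lfc_cutoff:
            label = "up"
        elif p_value < p_cutoff and x <= -lfc_cutoff:
            label = "down"
        else:
            label = "ns"  # not significant or effect too small
        classified[gene] = (x, y, label)
    return classified

# Hypothetical differential-expression results.
toy_results = {"geneA": (4.0, 0.001), "geneB": (0.25, 0.01), "geneC": (1.1, 0.5)}
classified = classify_genes(toy_results)
for gene, (x, y, label) in classified.items():
    print(gene, round(x, 2), round(y, 2), label)
```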

Basic Analysis With NCBI Gene

Query: “Homo sapiens insulin like growth factor” returns >2000 records.
Filters: genomic location, coding vs non-coding, discontinued entries.
Gene record (IGF1, Gene ID 3479) includes:
• Summary, expression atlas, orthologs, phenotypes, HIV-1 interactions.
• Navigation to genome browsers (GDV).
• Download options (FASTA, GFF, etc.).

Accession Prefixes

• NC_ – complete genomic molecule.
• NG_ – gene-specific genomic region.
• NM_ – mRNA (reviewed).
• NP_ – protein (reviewed).
• XM_ / XR_ / XP_ – computationally predicted mRNA / ncRNA / protein.
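
A small lookup makes the prefix scheme concrete. This is only a sketch covering the prefixes listed above; the function name `describe_accession` is made up for illustration.

```python
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "gene-specific genomic region",
    "NM": "mRNA (reviewed)",
    "NP": "protein (reviewed)",
    "XM": "predicted mRNA",
    "XR": "predicted non-coding RNA",
    "XP": "predicted protein",
}

def describe_accession(accession):
    """Describe a RefSeq-style accession from the letters before the underscore."""
    prefix, separator, _rest = accession.partition("_")
    if not separator:
        return "unknown"  # no underscore, so not in the prefix_number form
    return REFSEQ_PREFIXES.get(prefix.upper(), "unknown")

print(describe_accession("NG_011713.1"))
```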

FASTA Format Essentials

>identifier optional_description
SEQUENCE (continuous, no numbers)

• Works for DNA, RNA, protein.
• One ‘>’ header line per record (a file may hold many records); sequence conventionally in uppercase letters, wrapped at ≤80 characters per line.
• Standard input for BLAST, Clustal, MSA viewers.
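
To make the layout concrete, here is a minimal parser sketch. It assumes well-formed input, splits each header at the first space into identifier and description, and joins wrapped sequence lines; the example records are invented.

```python
def _finish(header, chunks):
    """Split the header and join the wrapped sequence lines of one record."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append(_finish(header, chunks))
    return records

# Hypothetical two-record FASTA text.
example = ">seq1 toy DNA record\nGGTCTGGATGC\nCGGTCTGG\n>seq2\nacgt\n"
records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, description, len(sequence))
```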

Limitations of Direct Gene Queries

• Must already know gene name/synonyms.
• Isoforms produce multiple hits.
• Homonyms/aliases differ across species/databases.

Solution for unknowns: BLAST (Basic Local Alignment Search Tool).

Five Main BLAST Flavours

Tool	Query	Target DB	Typical Use
BLASTN	nucleotide	nucleotide	intra-species DNA match, SNP search
BLASTP	protein	protein	protein homology, motif detection
BLASTX	nucleotide → 6-frame protein	protein	find coding potential in raw DNA
TBLASTN	protein	translated nucleotide	find distant nucleotide homologs via protein seed
TBLASTX	translated nucleotide	translated nucleotide	deep cross-species comparisons of coding regions
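
The "6-frame" translation behind BLASTX, TBLASTN and TBLASTX can be sketched as below: three reading frames on the given strand plus three on its reverse complement. The sketch assumes the standard genetic code and plain A/C/G/T input; codons containing other symbols are translated as `X`, and the frame labels (`+1` to `-3`) are a naming choice for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)

def six_frame_translation(dna):
    """Return a dict mapping frame label to translated protein string."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

frames = six_frame_translation("ATGGCC")
for label in sorted(frames):
    print(label, frames[label])
```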

Statistical output includes E-value, bit score, % identity, coverage, alignment view, taxonomy distribution.
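
The link between bit score and E-value can be sketched with the Karlin-Altschul relation E = m · n · 2^(−S′), where S′ is the bit score and m and n are the query and database lengths. BLAST itself uses corrected "effective" lengths, so the raw lengths and numbers below are simplifying assumptions for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value: expected number of chance hits scoring >= bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Hypothetical search: 100-residue query against a 1,000,000-residue database.
for score in (20, 50, 100):
    print(score, expect_value(score, 100, 10**6))
```

Each additional bit halves the E-value, which is why high-scoring hits are reported with E-values at or near 0.0.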

Interpreting BLAST Output (IGF1 Example)

• Top hit: Homo sapiens RefSeqGene NG_011713.1, 100% identity, E = 0.0.
• Cross-species hits: Canis lupus ≈ 83% identity; Sus scrofa ≈ 79%.
• Graphical summary bars colour-coded by alignment score (NCBI colour key: red ≥200, magenta 80–200, green 50–80, blue 40–50, black <40).
• Alignment section: gaps indicated by dashes, mismatches in lowercase or red.
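
The identity and gap figures can be recomputed from any pairwise alignment. The sketch below counts matches, mismatches and gap columns and reports identity as matches divided by alignment length (gaps included), which is how BLAST reports its "Identities" figure; the two aligned strings are invented.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Summarise a pairwise alignment given as two equal-length gapped strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for q, s in zip(aligned_query.upper(), aligned_subject.upper()):
        if q == "-" or s == "-":
            gaps += 1          # dash = gap in one of the sequences
        elif q == s:
            matches += 1
        else:
            mismatches += 1
    length = len(aligned_query)
    identity = 100.0 * matches / length if length else 0.0
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "percent_identity": identity,
    }

# Hypothetical alignment with one gap and one mismatch.
stats = alignment_stats("GGTCTGGATGC", "GGTCTG-ATGT")
print(stats)
```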

Multiple Sequence Alignment (MSA) & Phylogenetics

ClustalW / Clustal Omega Workflow

Paste FASTA sequences for species/orthologs.
Compute MSA:
• ‘*’ identical, ‘:’ strongly similar, ‘.’ weakly similar, ‘-’ gap.
Export alignment → Newick/PHYLIP tree.
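
The consensus line under a Clustal alignment can be partly reproduced with a column scan. This sketch marks only fully identical columns with ‘*’; the ‘:’ and ‘.’ symbols depend on Clustal's residue similarity groups and are deliberately left out. The aligned sequences are invented.

```python
def conservation_line(aligned_sequences):
    """Return a string with '*' under identical, gap-free columns, else ' '."""
    if not aligned_sequences:
        return ""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned_sequences):
        residues = {residue.upper() for residue in column}
        if len(residues) == 1 and "-" not in residues:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)

# Hypothetical three-sequence alignment.
msa = ["GGTC-A", "GGTCTA", "GATCTA"]
for seq in msa:
    print(seq)
print(conservation_line(msa))
```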

Phylogenetic Trees

• Node: ancestral junction.
• Branch: lineage segment; branch length ≈ substitutions/site (e.g., 0.05).
• Taxon (terminal): extant gene/protein.
• Clade: ancestor + all descendants (monophyletic group).
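
One common way to arrive at a branch length in substitutions/site is to correct the observed proportion of differing sites (the p-distance) for multiple hits. The sketch below uses the Jukes-Cantor model, d = −(3/4) ln(1 − 4p/3), which is an assumption here (the notes do not name a model) and applies to nucleotide data only; the two sequences are invented.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [
        (a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("p-distance too large; sequences are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical 20-nt aligned sequences differing at one site.
seq_a = "GGTCTGGATGCGGTCTGGAT"
seq_b = "GGTCTGGATGCGGTCTGGAC"
p = p_distance(seq_a, seq_b)
d = jukes_cantor_distance(p)
print(p, round(d, 4))
```

For closely related sequences the corrected distance stays near the raw p-distance; the correction grows as divergence increases.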

TreeView, PHYLIP, MEGA visualise rooted/unrooted, rectangular/cladogram styles, with bootstrap support values (e.g., 1000 replicates giving >98% confidence).
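
Bootstrap support can be sketched as follows: alignment columns are resampled with replacement, the analysis is repeated on each pseudo-alignment, and support is the fraction of replicates recovering a given grouping. To stay short, this sketch replaces real tree inference with a much simpler question (which two taxa are closest by mismatch fraction); the taxon names, sequences, seed and function names are all invented for illustration.

```python
import random

def mismatch_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of replicates in which `pair` is the closest pair of taxa.

    alignment maps taxon name -> aligned sequence (all of equal length).
    Ties go to the first pair in sorted name order.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    rng = random.Random(seed)
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        resampled = {
            name: "".join(alignment[name][c] for c in columns) for name in names
        }
        closest = min(
            ((a, b) for i, a in enumerate(names) for b in names[i + 1:]),
            key=lambda ab: mismatch_fraction(resampled[ab[0]], resampled[ab[1]]),
        )
        if closest == target:
            hits += 1
    return hits / replicates

# Hypothetical alignment: taxA and taxB are identical, taxC differs everywhere.
toy_alignment = {
    "taxA": "GGTCTGGATGC",
    "taxB": "GGTCTGGATGC",
    "taxC": "CCAGACCTACG",
}
support = bootstrap_support(toy_alignment, ("taxA", "taxB"), replicates=100)
print(f"support for (taxA, taxB): {support:.0%}")
```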

Real-World Applications of Bioinformatics

• Medicine: variant interpretation, pharmacogenomics, gene therapy.
• Drug design: structure-based docking, ADMET prediction.
• Agriculture: drought resistance, crop yield optimisation, transgenic design.
• Veterinary science: pathogen surveillance, breeding programs.
• Evolutionary studies: species divergence, molecular clocks.
• Forensics: species/individual identification, trace evidence.
• Environmental biotech: weather pattern modelling, waste-cleanup consortia.
• Antibiotic resistance tracking.

Interdisciplinary convergence drives personalised healthcare, sustainable agriculture and biodiversity conservation.

Ethical & Practical Considerations

• Data privacy (human genomes).
• Open vs commercial database licensing.
• Biosecurity: dual-use concerns (synthetic biology).
• Computational equity: ensuring global South access to resources.

Lecture Conclusions

Bioinformatics integrates computation & biology to manage data and answer complex questions.
Databases fall into primary, secondary, derived categories; >3,000 resources exist.
Genomic analysis spans a wide range of organisms, yet sequenced genomes still cover <1% of animal diversity (≈ 0.2% of animal species as of June 2021).
NCBI is the central hub for known sequences; accession prefixes encode molecule type & curation level.
BLAST family is indispensable for unknown sequence identification.
ClustalW/Omega & related MSA tools underpin phylogenetic inference.

Mastery of these foundations enables researchers to traverse the full pipeline—from raw reads to evolutionary insights and therapeutic innovation.