Comprehensive Notes on Biological Databases, Genome Browsers, and NCBI Tools

Major Public DNA Databases

  • There are three primary public DNA databases worldwide that share and exchange data regularly. The underlying raw DNA sequences within these databases are identical.
     * GenBank: Housed at the NCBI (National Center for Biotechnology Information) in the United States.
     * EMBL (European Molecular Biology Laboratory): Housed at the EBI (European Bioinformatics Institute) in Europe.
     * DDBJ (DNA Data Bank of Japan): Housed in Japan.

NCBI Taxonomy and Growth of GenBank

  • The NCBI Taxonomy database represents species with sequences in GenBank. The number of represented species has grown significantly over time:
     * October 2011: More than 200,000 species. Total taxa were 437,012.
     * January 2015: More than 300,000 species. Total taxa were 443,972.
     * January 2018: More than 400,000 species. Total taxa were 547,528.
     * January 2021: More than 487,000 species. Total taxa were 685,872.
     * January 2024: More than 566,000 species. Total taxa were 793,084.
     * January 2026: More than 618,000 species. Total taxa were 660,279.

  • Snapshot of Taxa Counts (Jan 2026):
     * Archaea: 1,104 species (1,104 total).
     * Bacteria: 28,496 species (29,514 total).
     * Eukaryota: 574,814 species (615,440 total).
     * Fungi: 64,942 species (66,608 total).
     * Metazoa: 300,837 species (321,082 total).
     * Viridiplantae (Green Plants): 192,711 species (211,022 total).
     * Viruses: 14,010 species (14,203 total).

The National Center for Biotechnology Information (NCBI)

  • The NCBI is a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH).

  • Core Mission: To advance science and health by providing access to biomedical and genomic information.

  • Popular Resources:
     * PubMed: Search service of the National Library of Medicine. As of January 2026, it contains more than 39,000,000 citations from MEDLINE, life science journals, and online books.
     * BLAST: Basic Local Alignment Search Tool, used for sequence similarity searches.
     * Gene/Protein/Nucleotide: Databases for specific molecular entities.
     * PubChem: Repository for chemical information and bioactivity screening.
     * GEO (Gene Expression Omnibus): Functional genomics studies and profiles.
     * OMIM: Catalog of human genes and genetic disorders.

NCBI Key Features and Tools

  • Integrated Search and Retrieval System: The NCBI system integrates literature, DNA/protein sequences, 3D structures, clinical data (OMIM), population studies, and complete genome assemblies.

  • BLAST (Basic Local Alignment Search Tool): A sequence similarity search tool supporting DNA and protein analysis. It handles over 100,000 searches per day.

  • OMIM (Online Mendelian Inheritance in Man): Created by Dr. Victor McKusick and led by Dr. Ada Hamosh at JHMI (Johns Hopkins Medical Institutions). It serves as a comprehensive catalog of human genes and genetic phenotypes.

  • NCBI Taxonomy: A browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses). It includes data on genetic codes and molecular data from extinct organisms. It is extremely useful for locating protein or gene data by species.

  • Structure (Molecular Modeling Database - MMDB):
     * Includes biopolymer structures derived from the PDB (Protein Data Bank).
     * Cn3D: A tool for 3D structure visualization.
     * iCn3D: A newer WebGL-based viewer for interactive web viewing of macromolecular structures.
     * VAST (Vector Alignment Search Tool): Identifies similar protein 3D structures.

Sequence Accession Numbers and Identifiers

  • Definition: An accession number is a unique label (string of letters and/or numbers) used to identify a specific molecular sequence or record.

  • Examples for Beta Globin (HBB):
     * U01317.1: GenBank genomic DNA sequence.
     * NG_000007.3: RefSeqGene.
     * rs192792910: dbSNP (single nucleotide polymorphism) record.
     * AA970968.1: Expressed Sequence Tag (EST).
     * NM_000518.4: RefSeq DNA sequence derived from a transcript (mRNA).
     * NP_000509.1: RefSeq protein sequence.
     * CAA00182.1: GenBank protein sequence.
     * Q14473: SwissProt protein identifier.
     * 1YE0|B: PDB structure record for the protein.
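
Several of the identifiers above end in a ".N" suffix, which in GenBank and RefSeq records is the sequence version. A minimal sketch, assuming that convention, shows how the accession and its version can be separated. The function name split_accession is hypothetical and chosen only for illustration.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Assumes the GenBank/RefSeq convention that a trailing '.N' is a
    version number. Identifiers without such a suffix (for example
    dbSNP 'rs' IDs or PDB chain records) are returned unchanged with
    version None.
    """
    base, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return base, int(version)
    return identifier, None


for example in ["U01317.1", "NM_000518.4", "rs192792910", "1YE0|B"]:
    print(example, "->", split_accession(example))
```

Keeping the version separate is useful because the accession stays stable while the version changes whenever the underlying sequence is revised.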

The RefSeq Project and Curated Data

  • RefSeq provides expertly curated, stable reference versions of sequences. These are considered the most "agreed-upon" versions of a sequence.

  • Identifier Formats:
     * NC_######: Complete genome or complete chromosome.
     * NT_######: Genomic contig.
     * NM_######: mRNA (DNA format); e.g., NM_000518 for beta globin.
     * NP_######: Protein; e.g., NP_000509 for beta globin.
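
The prefix conventions above lend themselves to a simple lookup. The sketch below covers only the prefixes mentioned in these notes (plus NG_, listed earlier as RefSeqGene); RefSeq defines additional prefixes that are not handled here. The names REFSEQ_PREFIXES and refseq_type are hypothetical.

```python
# Only the prefixes named in these notes; RefSeq defines others.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NG": "RefSeqGene",
    "NM": "mRNA",
    "NP": "protein",
}


def refseq_type(accession):
    """Return the molecule type implied by a RefSeq-style prefix, or None."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


print(refseq_type("NM_000518.4"))  # mRNA
print(refseq_type("NP_000509.1"))  # protein
print(refseq_type("U01317.1"))     # None (GenBank, not RefSeq)
```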

Navigating NCBI Gene and Protein Databases

  • NCBI Gene is recommended as a starting point because it aggregates information from major databases for each gene/protein across organisms.

  • HBB Gene Report Details:
     * Official Symbol: HBB.
     * Location: Chromosome 11, 11p15.4.
     * Exon Count: 3.
     * Summary: The alpha (HBA) and beta (HBB) loci encode the two types of polypeptide chains in adult hemoglobin (Hb A). A mutant beta globin causes sickle cell anemia, while absence or reduced production of the beta chain leads to beta-thalassemia.
     * Variation: Links to ClinVar, dbVar, and Variation Viewer for GRCh37 and GRCh38.
     * Pathways: Includes Reactome data on O2/CO2 exchange and heme scavenging from plasma.

Amino Acid Codes and FASTA Format

  • One-Letter and Three-Letter Codes:
     * Alanine (Ala, A), Arginine (Arg, R), Asparagine (Asn, N), Aspartic acid (Asp, D), Cysteine (Cys, C).
     * Glutamic acid (Glu, E), Glutamine (Gln, Q), Glycine (Gly, G), Histidine (His, H), Isoleucine (Ile, I).
     * Leucine (Leu, L), Lysine (Lys, K), Methionine (Met, M), Phenylalanine (Phe, F), Proline (Pro, P).
     * Serine (Ser, S), Threonine (Thr, T), Tryptophan (Trp, W), Tyrosine (Tyr, Y), Valine (Val, V).

  • FASTA Format: A versatile, compact text format beginning with a single header line (identified by a ">" symbol) followed by a string of nucleotides or amino acids in single-letter code.
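
To make the two ideas in this section concrete, the sketch below converts three-letter residue codes to single-letter code and then reads a small FASTA text back into (header, sequence) pairs. It is a minimal illustration, not a full parser: the helper names (to_one_letter, parse_fasta) and the header text are hypothetical, and the eight-residue peptide is used only as sample input.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes to a single-letter string."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text.

    A record starts at a line beginning with '>'; the sequence lines
    that follow are joined until the next header or the end of input.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


peptide = to_one_letter(["Met", "Val", "His", "Leu", "Thr", "Pro", "Glu", "Glu"])
# Hypothetical header; the sequence is wrapped over two lines on purpose.
fasta_text = f">example_peptide illustrative fragment\n{peptide[:4]}\n{peptide[4:]}\n"
records = parse_fasta(fasta_text)
print(records)
```

Sequence data in real FASTA files is usually wrapped across many lines, which is why the parser joins every line up to the next header.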

Genome Browsers: Ensembl and UCSC

  • Concept: Genome browsers display "ideograms" (graphical representations) of chromosomes with selectable "annotation tracks" showing different types of data.

  • Ensembl (ensembl.org):
     * Focuses primarily on vertebrate genomes.
     * Supports research in evolution, sequence variation, and transcriptional regulation.
     * Features tools like BioMart, VEP (Variant Effect Predictor), and BLAT.

  • UCSC Genome Browser (genome.ucsc.edu):
     * Focuses on humans and other eukaryotes.
     * Allows users to select the display mode of each track: hide, dense, squish, pack, or full.
     * Permits creation of "custom tracks" by uploading formatted annotation files (for example, in BED format).
     * Table Browser: A quantitative (rather than visual) tool used to retrieve the tabular data underlying the tracks.
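
Custom tracks are commonly supplied as plain text in BED format: an optional "track" line followed by one row per feature, with a 0-based start and an exclusive end. The sketch below builds such text for four-column BED. The function name make_custom_track and the coordinates are hypothetical placeholders, not real annotations.

```python
def make_custom_track(name, description, features):
    """Build custom-track text: one track line followed by BED4 rows.

    Each feature is (chrom, start, end, label), with a 0-based start
    and an exclusive end, as in the BED convention.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical coordinates, for illustration only.
track_text = make_custom_track(
    "demo",
    "Hypothetical regions",
    [("chr11", 1000, 2000, "regionA"), ("chr11", 5000, 5500, "regionB")],
)
print(track_text)
```

The resulting text is what would be pasted or uploaded on the browser's custom track page.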

Practical Exercise: Analyzing SNPs with UCSC and Galaxy

  • Task: Determine the number of SNPs in the coding exons of human chromosome 11.
     * Step 1: Open the UCSC Table Browser. Choose the Variation group and the dbSNP 155 track. Set the position to "chr11" and the output format to BED.
     * Step 2: Check the "Send output to Galaxy" box and hit "get output."
     * Step 3: In Galaxy, rename the dataset to "SNPs."
     * Step 4: Return to the UCSC Table Browser. Change the group to Genes and Gene Predictions and the track to NCBI RefSeq. Keep the position at "chr11," choose "Coding Exons" as the BED records to create, and send the result to Galaxy.
     * Step 5: In Galaxy, click "Operate on Genomic Intervals."
     * Step 6: Run the "Intersect" tool on the "SNPs" and "Coding Exons" datasets.
     * Step 7: The resulting list contains the SNPs located within coding exons on chromosome 11; the number of lines in it answers the task.
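
The logic behind the intersect step can be sketched in a few lines. The code below is not Galaxy's implementation; it is a simple quadratic illustration of interval overlap on BED-style (0-based, half-open) coordinates. The function name snps_in_exons, the rs labels, and all coordinates are hypothetical.

```python
def snps_in_exons(snps, exons):
    """Return names of SNPs that overlap at least one exon.

    snps:  list of (chrom, start, end, name)
    exons: list of (chrom, start, end)
    Intervals are half-open, so [a, b) and [c, d) overlap when
    a < d and c < b.
    """
    hits = []
    for chrom, start, end, name in snps:
        for e_chrom, e_start, e_end in exons:
            if chrom == e_chrom and start < e_end and e_start < end:
                hits.append(name)
                break  # count each SNP once, even if exons overlap
    return hits


# Hypothetical data, for illustration only.
snps = [
    ("chr11", 100, 101, "rsA"),
    ("chr11", 200, 201, "rsB"),
    ("chr11", 300, 301, "rsC"),
]
exons = [("chr11", 50, 150), ("chr11", 250, 301)]

coding_snps = snps_in_exons(snps, exons)
print(len(coding_snps), coding_snps)
```

For chromosome-scale data a sorted sweep or an interval tree would be far more efficient than this nested loop, which is kept here only for readability.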