Comprehensive Notes on Biological Databases, Genome Browsers, and NCBI Tools

There are three primary public DNA databases worldwide that share and exchange data regularly. The underlying raw DNA sequences within these databases are identical.
* GenBank: Housed at the NCBI (National Center for Biotechnology Information) in the United States.
* EMBL (European Molecular Biology Laboratory): Housed at the EBI (European Bioinformatics Institute) in Europe.
* DDBJ (DNA Data Bank of Japan): Housed in Japan.

The NCBI Taxonomy database represents species with sequences in GenBank. The number of represented species has grown significantly over time:
* October 2011: Greater than $200,000$ species. Total taxa were $437,012$.
* January 2015: Greater than $300,000$ species. Total taxa were $443,972$.
* January 2018: Greater than $400,000$ species. Total taxa were $547,528$.
* January 2021: Greater than $487,000$ species. Total taxa were $685,872$.
* January 2024: Greater than $566,000$ species. Total taxa were $793,084$.
* January 2026: Greater than $618,000$ species. Total taxa were $660,279$.
Snapshot of Taxa Counts (Jan 2026):
* Archaea: $1,104$ species ($1,104$ total).
* Bacteria: $28,496$ species ($29,514$ total).
* Eukaryota: $574,814$ species ($615,440$ total).
* Fungi: $64,942$ species ($66,608$ total).
* Metazoa: $300,837$ species ($321,082$ total).
* Viridiplantae (Green Plants): $192,711$ species ($211,022$ total).
* Viruses: $14,010$ species ($14,203$ total).

The NCBI is a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH).
Core Mission: To advance science and health by providing access to biomedical and genomic information.
Popular Resources:
* PubMed: Search service for the National Library of Medicine. As of January 2026, it contains more than $39,000,000$ citations from MEDLINE, life science journals, and online books.
* BLAST: Basic Local Alignment Search Tool, used for sequence similarity searches.
* Gene/Protein/Nucleotide: Databases for specific molecular entities.
* PubChem: Repository for chemical information and bioactivity screening.
* GEO (Gene Expression Omnibus): Functional genomics studies and profiles.
* OMIM: Catalog of human genes and genetic disorders.

Integrated Search and Retrieval System: The NCBI system integrates literature, DNA/protein sequences, 3D structures, clinical data (OMIM), population studies, and complete genome assemblies.
BLAST (Basic Local Alignment Search Tool): A sequence similarity search tool supporting DNA and protein analysis. It handles over $100,000$ searches per day.
OMIM (Online Mendelian Inheritance in Man): Created by Dr. Victor McKusick and led by Dr. Ada Hamosh at JHMI (Johns Hopkins Medical Institutions). It serves as a comprehensive catalog of human genes and genetic phenotypes.
NCBI Taxonomy: A browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses). It includes data on genetic codes and molecular data from extinct organisms. It is extremely useful for locating protein or gene data by species.
Structure (Molecular Modeling Database - MMDB):
* Includes biopolymer structures derived from the PDB (Protein Data Bank).
* Cn3D: A tool for 3D structure visualization.
* iCn3D: A newer WebGL-based viewer for interactive web viewing of macromolecular structures.
* VAST (Vector Alignment Search Tool): Identifies similar protein 3D structures.

Definition: An accession number is a unique label (string of letters and/or numbers) used to identify a specific molecular sequence or record.
Examples for Beta Globin (HBB):
* U01317.1: GenBank genomic DNA sequence.
* NG_000007.3: RefSeqGene.
* rs192792910: dbSNP (single nucleotide polymorphism) record.
* AA970968.1: Expressed Sequence Tag (EST).
* NM_000518.4: RefSeq DNA sequence derived from a transcript (mRNA).
* NP_000509.1: RefSeq protein sequence.
* CAA00182.1: GenBank protein sequence.
* Q14473: SwissProt protein identifier.
* 1YE0|B: PDB structure record for the protein.
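Several of the identifiers above end in a dot followed by a number (for example, NM_000518.4), which is the record's version. The following is a minimal sketch of splitting such an identifier into its accession and version parts; the function name `split_accession` is hypothetical and not part of any NCBI tool.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Identifiers without a numeric version suffix are returned
    unchanged with a version of None.
    """
    base, dot, suffix = identifier.rpartition(".")
    if dot and suffix.isdigit():
        return base, int(suffix)
    return identifier, None


for example in ["U01317.1", "NM_000518.4", "NP_000509.1", "rs192792910", "Q14473"]:
    print(example, "->", split_accession(example))
```

Identifiers such as rs192792910 or Q14473 carry no version suffix in the list above, so the sketch simply passes them through.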

RefSeq provides expertly curated, stable reference versions of sequences. These are considered the most "agreed-upon" versions of a sequence.
Identifier Formats:
* NC_######: Complete genome or complete chromosome.
* NT_######: Genomic contig.
* NM_######: mRNA (DNA format); e.g., NM_000518 for beta globin.
* NP_######: Protein; e.g., NP_000509 for beta globin.
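Because the two-letter prefix encodes the molecule type, a lookup table is enough to interpret these identifiers. The sketch below covers only the four prefixes listed above; the names `REFSEQ_PREFIXES` and `describe_refseq` are illustrative, and RefSeq defines other prefixes that are not handled here.

```python
# Only the four prefixes listed in these notes; RefSeq defines more.
REFSEQ_PREFIXES = {
    "NC": "complete genome or complete chromosome",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
}


def describe_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not in the RefSeq prefix format
    return REFSEQ_PREFIXES.get(prefix)


print(describe_refseq("NM_000518"))  # mRNA
print(describe_refseq("NP_000509"))  # protein
print(describe_refseq("U01317.1"))   # None (GenBank accession, no RefSeq prefix)
```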

NCBI Gene is recommended as a starting point because it aggregates information from major databases for each gene/protein across organisms.
HBB Gene Report Details:
* Official Symbol: HBB.
* Location: Chromosome $11$, 11p15.4.
* Exon Count: $3$.
* Summary: Describes the alpha (HBA) and beta (HBB) loci as the providers of polypeptide chains in adult hemoglobin (Hb A). Mutations in HBB cause sickle cell anemia, while absence/reduction leads to thalassemia.
* Variation: Links to ClinVar, dbVar, and Variation Viewer for GRCh37 and GRCh38.
* Pathways: Includes Reactome data on $O_2/CO_2$ exchange and heme scavenging from plasma.

One-Letter and Three-Letter Codes:
* Alanine (Ala, A), Arginine (Arg, R), Asparagine (Asn, N), Aspartic acid (Asp, D), Cysteine (Cys, C).
* Glutamic acid (Glu, E), Glutamine (Gln, Q), Glycine (Gly, G), Histidine (His, H), Isoleucine (Ile, I).
* Leucine (Leu, L), Lysine (Lys, K), Methionine (Met, M), Phenylalanine (Phe, F), Proline (Pro, P).
* Serine (Ser, S), Threonine (Thr, T), Tryptophan (Trp, W), Tyrosine (Tyr, Y), Valine (Val, V).
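The code table above maps directly onto a dictionary. As a small illustration, the sketch below converts a one-letter protein string to three-letter codes; the names `THREE_LETTER` and `to_three_letter` are hypothetical, and the input is the first nine residues of human beta globin.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Convert a one-letter protein sequence to hyphen-joined three-letter codes."""
    return "-".join(THREE_LETTER[residue] for residue in sequence.upper())


print(to_three_letter("MVHLTPEEK"))
# Met-Val-His-Leu-Thr-Pro-Glu-Glu-Lys
```

A character outside the 20 standard codes raises a `KeyError` in this sketch, which is a deliberate choice to surface unexpected input rather than silently skip it.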
FASTA Format: A versatile, compact text format beginning with a single header line (identified by a ">" symbol) followed by a string of nucleotides or amino acids in single-letter code.
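A minimal sketch of reading this format follows. It assumes the input is a string that may hold one or more records, and the function name `parse_fasta` and the example header text are illustrative only. Established libraries provide more robust parsers; this version only shows the header-then-sequence structure.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>example_protein illustrative header text
MVHLTPEEK
SAVTALWGK
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```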

Concept: Genome browsers display "ideograms" (graphical representations) of chromosomes with selectable "annotation tracks" showing different types of data.
Ensembl (ensembl.org):
* Focuses primarily on vertebrate genomes.
* Supports research in evolution, sequence variation, and transcriptional regulation.
* Features tools like BioMart, VEP (Variant Effect Predictor), and BLAT.
UCSC Genome Browser (genome.ucsc.edu):
* Focuses on humans and other eukaryotes.
* Allows users to select the display mode of each track: hide, dense, squish, pack, or full.
* Permits creation of "custom tracks" by uploading formatted, spreadsheet-like annotation files (for example, in BED format).
* Table Browser: A quantitative tool (as opposed to visual) used to retrieve specific tabular data related to the tracks.
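To make the custom-track idea concrete, the sketch below writes a small BED-style text block of the kind the UCSC browser accepts: a `track` line followed by one feature per line (chromosome, 0-based start, end, name). The function name `make_bed_track` and the coordinates are hypothetical placeholders, not real annotations.

```python
def make_bed_track(name, description, features):
    """Build custom-track text: a track line followed by 4-column BED rows."""
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        # BED rows: chrom, 0-based start, end (exclusive), feature name
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical coordinates, chosen only to illustrate the layout.
features = [
    ("chr11", 1000, 1200, "feature_1"),
    ("chr11", 5000, 5300, "feature_2"),
]

print(make_bed_track("myTrack", "Example custom track", features))
```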

Task: Determine the number of SNPs in the coding exons of human chromosome $11$.
* Step 1: Open the UCSC Table Browser. Choose the Variation group and the dbSNP 155 track. Set the position to "chr11" and the output format to BED.
* Step 2: Check the "Send query to Galaxy" box and hit "get output."
* Step 3: In Galaxy, rename the dataset to "SNPs."
* Step 4: Return to the UCSC Table Browser. Change the group to Genes and Gene Predictions and the track to NCBI RefSeq. Set the output to "Coding Exons" and send it to Galaxy.
* Step 5: In Galaxy, click "Operate on Genomic Intervals."
* Step 6: Run the "Intersect" tool on the "SNPs" and "Coding Exons" datasets.
* Step 7: The resulting list represents the SNPs located specifically within coding exons on chromosome $11$.
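The intersection in Step 6 can be illustrated with a small sketch. This is a simplified stand-in for the idea behind an interval intersect, not Galaxy's implementation. It assumes BED-style 0-based, half-open intervals, and every coordinate and SNP name below is made up for illustration.

```python
def snps_in_exons(snps, exons):
    """Return names of SNPs that overlap at least one exon.

    snps:  list of (chrom, start, end, name), BED-style half-open intervals
    exons: list of (chrom, start, end), BED-style half-open intervals
    """
    hits = []
    for chrom, start, end, name in snps:
        for exon_chrom, exon_start, exon_end in exons:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == exon_chrom and start < exon_end and exon_start < end:
                hits.append(name)
                break  # count each SNP once even if exons overlap
    return hits


# Hypothetical data: two exons and five single-base SNPs.
exons = [("chr11", 90, 150), ("chr11", 300, 400)]
snps = [
    ("chr11", 100, 101, "snp_a"),  # inside the first exon
    ("chr11", 200, 201, "snp_b"),  # between exons
    ("chr11", 399, 400, "snp_c"),  # last base of the second exon
    ("chr11", 400, 401, "snp_d"),  # just past the second exon
    ("chr1", 100, 101, "snp_e"),   # different chromosome
]

coding_snps = snps_in_exons(snps, exons)
print(len(coding_snps), coding_snps)
```

The nested loop is quadratic and only suitable for toy inputs; chromosome-scale data calls for sorted intervals or an interval tree, which is the kind of work the Galaxy tool handles for you.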