Human Genome Project & DNA Polymorphisms

Learning Objectives

  • Appreciate scale & significance of Human Genome Project (HGP)
  • Discover molecular “secrets of life” via complete genetic blueprint
  • Compare human sequence variation with other species & among world populations
  • Apply genomic information for medicine, anthropology, environment, agriculture, forensics and more

Basic Cytogenetics & Terminology

  • Humans: $2n=46$ (23 homologous pairs) in somatic cells; $n=23$ in gametes
  • Diploid genome receives one set of chromosomes from each parent → phenotypic resemblance
  • OMIM (Dec-9-2011) gene entries
    • Autosomal genes: 19 690; X-linked 1 167; Y-linked 59; Mitochondrial 65; Total 20 981
  • Human Gene Map (Table 3.1b) – total loci mapped per chromosome (e.g. chr-1 = 1288; chr-Y = 46; genome total = 13 176)

Genetic Mapping & Linkage Analysis (Box 3.1)

  • Genetic map = ordered array of gene loci on chromosomes; unit = centiMorgan (cM) where $1\,\text{cM}=1\%$ meiotic recombination
  • Recombination fraction $\theta$ (or $r$) estimated via Maximum Likelihood → Log of Odds (LOD) score
    • $\text{LOD}=\log_{10}\left(\dfrac{\Pr(\text{family}\mid\theta)}{\Pr(\text{family}\mid\theta=0.5)}\right)$
    • Interpretations: $\text{LOD}\ge 3$ (linked); $\text{LOD}=2$ (strong); $\text{LOD}=1$ (tentative); $\text{LOD}\le -2$ (no linkage)
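
As a rough illustration of the LOD formula above, the sketch below assumes the simplest possible case: a phase-known family in which every meiosis can be scored as recombinant or non-recombinant, so the likelihood is binomial, θ^k (1−θ)^(n−k). The function names and the example counts (1 recombinant in 10 meioses) are invented for illustration; real linkage software also has to deal with unknown phase, penetrance and larger pedigrees.

```python
import math


def log10_likelihood(recombinants, meioses, theta):
    """log10 of theta^k * (1 - theta)^(n - k) for a phase-known family."""
    non_recombinants = meioses - recombinants
    total = 0.0
    for count, prob in ((recombinants, theta), (non_recombinants, 1.0 - theta)):
        if count == 0:
            continue  # a zero exponent contributes a factor of 1
        if prob == 0.0:
            return float("-inf")  # data impossible under this theta
        total += count * math.log10(prob)
    return total


def lod_score(recombinants, meioses, theta):
    """LOD = log10 likelihood at theta minus log10 likelihood at theta = 0.5."""
    return (log10_likelihood(recombinants, meioses, theta)
            - log10_likelihood(recombinants, meioses, 0.5))


if __name__ == "__main__":
    k, n = 1, 10  # hypothetical counts: 1 recombinant among 10 meioses
    grid = [i / 100 for i in range(0, 51)]  # theta from 0.00 to 0.50
    best_theta = max(grid, key=lambda t: lod_score(k, n, t))
    print(best_theta, round(lod_score(k, n, best_theta), 2))
```

With these made-up counts the LOD peaks near θ = 0.1 at roughly 1.6, which falls short of the LOD ≥ 3 threshold; more informative meioses would be needed before declaring linkage.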

DNA Structure Refresher (Watson–Crick 1953)

  • Double helix; strands antiparallel (5’→3’ / 3’→5’)
  • Complementary base-pairing: A::T, G::C via H-bonds
  • Nucleotides = base + deoxyribose + phosphate; A & G = purines, C & T = pyrimidines
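
The pairing and antiparallel rules above can be sketched in a few lines of Python. This is a minimal illustration only: the names `reverse_complement` and `base_class` are chosen here for convenience, and the code handles just the four unambiguous DNA bases.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}  # A::T and G::C pairing
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def reverse_complement(strand):
    """Return the partner strand, written 5'->3'.

    Because the two strands are antiparallel, the complement has to be
    read in the reverse direction to obtain a 5'->3' sequence.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def base_class(base):
    """Classify a single DNA base as a purine or a pyrimidine."""
    base = base.upper()
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"not a DNA base: {base!r}")


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
    print(base_class("G"))             # purine
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.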

Evolution of Genetic Markers

  • 1950s–70s: phenotypic markers (ABO, serum proteins, HLA)
  • 1980s: Restriction Fragment Length Polymorphisms (RFLPs)
  • Late-1980s → 1990s: Variable Number Tandem Repeats (VNTRs/minisatellites): arrays of 0.1–20 kb built from repeat units of 15 to 100s of bp
  • 1990s: Microsatellites (STRs; 1–4 bp repeats; >6 000 loci)
  • 2000s: Single Nucleotide Polymorphisms (SNPs; $>4\times10^{5}$ sites)
  • Marker comparison
    • Multi-allelic VNTR/STR → high heterozygosity, ideal for linkage & forensics
    • SNPs → bi-allelic, ultra-stable, amenable to high-throughput automated genotyping, optimal for genome-wide scans
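
One way to see why multi-allelic markers are more informative than bi-allelic SNPs is to compute expected heterozygosity, H = 1 − Σ pᵢ², which assumes random mating (Hardy–Weinberg proportions). The sketch below is illustrative only; the allele frequencies are invented, not taken from any real locus.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) under random mating."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


if __name__ == "__main__":
    snp = [0.5, 0.5]   # best case for a bi-allelic SNP (hypothetical)
    str_locus = [0.1] * 10  # hypothetical STR with 10 equally common alleles
    print(expected_heterozygosity(snp))
    print(expected_heterozygosity(str_locus))
```

A bi-allelic marker can never exceed H = 0.5, whereas the hypothetical ten-allele STR reaches about 0.9. SNPs compensate for this through their sheer number and ease of automated typing.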

Why Sequence the Whole Genome?

  • Mapping ≈30 000 genes one-by-one deemed infeasible → full nucleotide sequence preferred
  • Genome = entire DNA content (nuclear + mitochondrial)

Human Genome Project (HGP)

Historical Milestones
  • 1977 Sanger dideoxy sequencing (foundation technology)
  • 1984 DOE workshop (Alta, Utah) → need to detect low-frequency radiation-induced mutations → genome sequencing proposal
  • 1986 DOE conference (Santa Fe, New Mexico) formally conceives HGP; three objectives: refined physical maps, technology development, computational infrastructure
  • 1988 NIH creates Office of Human Genome Research; HUGO founded; US Congress funds NIH + DOE; budget sets aside 3–5 % for ELSI
  • Oct-1-1990: official launch, US$3 billion, 15-year plan
  • 1993: Francis Collins becomes NHGRI director
  • 1998: Celera Genomics (J. Craig Venter) announces private sequencing effort (budget US$300 million)
  • Jun-26-2000: simultaneous announcement of first draft genome (public + Celera)
  • 2001: drafts published (Nature & Science)
  • 2003: finished sequence completed, covering ~99 % of the euchromatic genome (~92 % of the total genome), 2 years ahead of schedule
  • Parallel model-organism sequencing: E. coli, S. cerevisiae, C. elegans, D. melanogaster, Mus musculus, etc.
Original Goals (1990)
  • Identify all ~30 000 human genes & map to chromosomes
  • Produce high-resolution physical maps (human + selected species)
  • Determine complete human nuclear DNA sequence
  • Develop data storage, analysis & distribution capacities
  • Innovate new sequencing & informatics technologies
Updated 5-Year Goals (1998)
  • Catalogue human variation (SNPs) → genotype–phenotype correlations
  • Cross-species genome comparison
  • Advanced computational infrastructure
  • Intensify ELSI research & public policy
  • Foster interdisciplinary genomics training

Sequencing Strategies

1. International Human Genome Sequencing Consortium (IHGSC)
  • Hierarchical (BAC-by-BAC) shotgun approach
    • Chromosome DNA → sheared (~150 kb) → cloned into BACs
    • BACs mapped to Sequence Tagged Sites (STS)
    • Each BAC shotgun-sequenced into ~1–2 kb fragments
    • Assembly → contigs → tiling path covering chromosome
2. Celera Genomics
  • Whole-genome shotgun (WGS) with paired-end reads
    • Entire genome fragmented, sequenced, computationally assembled into 119 000 scaffolds
    • Scaffolds positioned onto STS framework using public data
  • Faster & cheaper (≈US$300 million, vs US$3 billion budgeted for the public project) but initially proprietary access

Genome Donors

  • Public project: blood (female) & sperm (male) from 20 anonymous donors; only 4 samples advanced; ≥70 % sequence from one Buffalo (NY) male
  • Celera: 21 donors; sequences from 5 individuals

Genome Assembly Workflow

  • Millions of reads (≤1000 bp) → overlap detection → merge into contigs
  • Contigs ordered, oriented and linked using mate-pair (paired-end) distance constraints → scaffolds
  • Scaffolds anchored to chromosomal STSs to form draft sequence
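
The first step of this workflow (overlap detection followed by merging into contigs) can be illustrated with a toy greedy assembler. This is a deliberately simplified sketch and not the algorithm used by either sequencing effort: it assumes error-free reads from one strand, exact suffix/prefix overlaps, and no repeats or contained reads. The function names and the three example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]  # hypothetical reads
    print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Greedy merging breaks down as soon as the genome contains repeats longer than the reads, which is one reason mate-pair information and an STS framework are needed to build and anchor scaffolds.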

Genome Annotation

  • Structural annotation: identify ORFs, exons/introns, promoters, regulatory motifs, repeats
  • Functional annotation: infer biological role, expression pattern, pathways, interactions
  • Tools: BLAST, PHRED/PHRAP, automated pipelines; data deposited in GenBank, Ensembl, UCSC Genome Browser
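
As a small taste of structural annotation, the sketch below scans the three forward reading frames of a sequence for open reading frames (ATG followed in-frame by a stop codon). It is a simplified illustration, not how production annotation pipelines work: it ignores the reverse strand, introns and splice sites, and the function name, `min_codons` parameter and test sequence are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf_sequence) for ORFs on the forward strand.

    start/end are 0-based, end-exclusive, and the ORF includes the stop
    codon. min_codons counts the codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGCC"))  # [(2, 11, 'ATGAAATAG')]
```

In a eukaryotic genome most genes are interrupted by introns, so ORF scanning alone is insufficient; it is usually combined with similarity searches and expression evidence.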

Key Findings from the Human Genome

  • Genome size: 3.1647 Gb (haploid)
  • Gene count: ~30 000 (similar order to mouse & worm)
  • Average gene length ≈3 kb; largest = dystrophin (2.4 Mb)
  • Sequence identity among humans: 99.9 %; yet $>10^{7}$ variant sites between individuals
  • Protein-coding portion: only 1.1–1.4 % of genome; ≈98 % non-coding (regulatory, structural, repeats)
  • Segmental duplications abundant → source of primate-specific genes
  • Repeat content ≈50 % (higher than many species); rates of repeat accumulation declined over last 50 My
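
To put the percentages above into absolute terms, the short calculation below converts them to base-pair counts using the quoted haploid genome size. It only restates figures from this section; integer arithmetic in parts per thousand is used to avoid floating-point rounding, and the helper name `per_mille` is chosen here for illustration.

```python
GENOME_BP = 3_164_700_000  # haploid genome size quoted above (3.1647 Gb)


def per_mille(total_bp, parts_per_thousand):
    """Share of the genome in base pairs, given parts per thousand."""
    return total_bp * parts_per_thousand // 1000


differing_bp = per_mille(GENOME_BP, 1)     # 100 % - 99.9 % identity = 0.1 %
coding_low_bp = per_mille(GENOME_BP, 11)   # 1.1 % protein-coding
coding_high_bp = per_mille(GENOME_BP, 14)  # 1.4 % protein-coding
repeat_bp = per_mille(GENOME_BP, 500)      # ~50 % repeats

if __name__ == "__main__":
    print(f"0.1 % difference  : {differing_bp:,} bp")
    print(f"coding (1.1-1.4 %): {coding_low_bp:,} to {coding_high_bp:,} bp")
    print(f"repeats (~50 %)   : {repeat_bp:,} bp")
```

So a 0.1 % difference still corresponds to about 3.2 million bases, and the protein-coding fraction amounts to roughly 35–44 Mb out of more than 3 Gb.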

Genome Architecture

  • GC-rich, gene-dense “urban centres” vs AT-rich, gene-poor “deserts”
  • Light (euchromatic) bands are GC-rich and dark (heterochromatic) bands AT-rich on Giemsa-stained metaphase chromosomes
  • CpG islands (C/G-rich stretches of up to ~30 kb) flank gene clusters → regulatory significance
  • Alternative splicing expands proteome diversity → humans produce ~3× more protein isoforms than flies/worms despite similar gene numbers
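
The GC-rich versus AT-rich contrast described above is usually visualized by computing GC content in windows along a sequence. The sketch below shows the idea on a toy string; the window size, function names and example sequence are arbitrary choices for illustration, and real analyses use windows of kilobases or more.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=None):
    """GC fraction of consecutive windows as (start, gc_fraction) pairs."""
    step = step or window  # default: non-overlapping windows
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    toy = "GGCCGCGCAATTATAT"  # hypothetical: GC-rich half, then AT-rich half
    for start, gc in windowed_gc(toy, window=8):
        print(start, gc)
```

On the toy sequence the first window is entirely G/C and the second entirely A/T, a caricature of the gene-dense and gene-poor regions contrasted above.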

Applications / Benefits of HGP

Molecular Medicine
  • Early, presymptomatic genetic tests & risk prediction
  • DNA probes & genomic screens for monogenic disorders & complex traits
  • Gene therapy: vector-mediated replacement or correction of defective genes
  • Novel drug, immunotherapy & protein-replacement strategies
Risk Assessment
  • Identify alleles modulating susceptibility/resilience to radiation, chemicals, pathogens
  • Inform public-health interventions & personalized preventive medicine
Energy & Environment (DOE Microbial Genome Program, 1994)
  • Engineer microbes for biofuel production, bioremediation, extreme-environment bioprocesses, photosynthetic energy capture
Anthropology, Evolution, Migration
  • Trace maternal (mtDNA) & paternal (Y-chromosome) lineages
  • Date divergence events, reconstruct migration routes, correlate with linguistic/cultural history
Forensic Science
  • Individual identification (DNA fingerprinting), criminal justice, disaster victim ID, paternity, wildlife conservation, pathogen detection, transplant compatibility
Agriculture & Livestock
  • Genetically improve crop resilience (drought, pests), nutritional content, reduced pesticide requirement
  • Breed disease-resistant, high-yield livestock

Disadvantages & Concerns

  • Genetic discrimination in insurance & employment
  • Social stigma disrupting family/marriage
  • Potential misuse of genomic data for harmful purposes

Ethical, Legal & Social Implications (ELSI)

  • 3–5 % of HGP budget dedicated to ELSI research
  • Informed consent, privacy, data sharing, equitable benefit distribution

Post-Genomic Era Focus

  • Transcriptomics: global mRNA expression profiling
  • Proteomics: protein abundance, modifications, interactions → drug targets
  • Structural genomics: 3-D protein structures for each fold family
  • Functional genomics: gene knockouts/knock-downs, phenotyping, regulatory networks
  • Comparative genomics: cross-species alignment to annotate human genes

Need for Diploid Personal Genomes

  • 2007 Venter’s 6 Gb diploid sequence (HuRef)
    • Human-to-human variation 5–7× greater than earlier estimates (15–30 Mb differences)
    • 4.1 million variants per person; 22 % are non-SNP (CNVs, indels, segmental duplications) yet account for 74 % of variant bases
  • Individual haplotype phasing enables discovery of rare/private alleles → foundation for personalized medicine
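
The HuRef percentages above translate into the following approximate counts. This is plain arithmetic on the figures quoted in this section, nothing more; the variable names are chosen here for illustration.

```python
TOTAL_VARIANTS = 4_100_000  # variants per person, as quoted above
NON_SNP_PERCENT = 22        # share of variants that are not SNPs

non_snp_variants = TOTAL_VARIANTS * NON_SNP_PERCENT // 100
snp_variants = TOTAL_VARIANTS - non_snp_variants

if __name__ == "__main__":
    print(f"non-SNP variants: {non_snp_variants:,}")
    print(f"SNP variants    : {snp_variants:,}")
```

Roughly 0.9 million non-SNP events thus account for 74 % of all variant bases, which implies that on average each one spans far more bases than a single-nucleotide change.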

Major Spin-Off Projects

1000 Genomes Project (2008-)
  • Objective: dense catalogue of human variation by sequencing ≥1000 (later 2000+) individuals from multiple ancestries
  • Rapid public data release; informs disease studies, pharmacogenomics, population genetics
HapMap Project (2002-2009)
  • Map of haplotype blocks & tag-SNPs (≥1 % frequency) in four initial populations (YRI, CEU, JPT, CHB)
  • Reduced genotyping burden from ~10 M SNPs to ~500 k tag-SNPs for genome-wide association studies (GWAS)
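
Tag-SNPs work because SNPs within a haplotype block are strongly correlated, so typing one predicts its neighbours. A common measure of that correlation is the linkage disequilibrium statistic r², sketched below for two bi-allelic sites. The 0/1 haplotype encoding and the example haplotypes are assumptions made for illustration, not HapMap data.

```python
def r_squared(haplotypes):
    """Pairwise LD (r^2) between two bi-allelic sites.

    haplotypes: list of (a, b) tuples, each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # freq. of allele 1 at site A
    p_b = sum(b for _, b in haplotypes) / n   # freq. of allele 1 at site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    linked = [(0, 0), (0, 0), (1, 1), (1, 1)]       # alleles always co-occur
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]     # alleles independent
    print(r_squared(linked), r_squared(unlinked))
    print(10_000_000 // 500_000)  # fold reduction quoted above: 20
```

When r² is close to 1 one of the two SNPs can serve as a tag for the other, which is how ~10 M SNPs can be represented by ~500 k tag-SNPs, a 20-fold reduction.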
Protein Structure Initiative (PSI)
  • Large-scale determination of representative protein folds to link structure ↔ function; accelerates rational drug design
Human Epigenome Consortium
  • Comprehensive mapping of DNA methylation & histone modifications to elucidate epigenetic regulation in health & disease
Human Genome Diversity Project (HGDP)
  • Collect & preserve DNA from ~700 endangered/isolated ethnic groups (initial plan)
  • Goals: reconstruct human evolutionary history, migration, adaptation, disease resistance alleles; ensure ethical sampling & benefit sharing

Summary Key-Points

  • HGP completed high-quality reference sequence of 3.2 Gb human genome, ahead of schedule & under budget
  • Two complementary sequencing strategies (BAC-by-BAC vs WGS) proved feasible
  • Genome data revolutionized biology, medicine, biotechnology, evolutionary studies, and spawned numerous large-scale initiatives (HapMap, 1kG, personal genomes, epigenomics)
  • Only ~1.4 % of genome encodes proteins; vast non-coding landscape harbours regulatory and structural elements
  • Ethical, legal & social frameworks essential to maximize benefits & minimize harms of genomic advances