Human Genome Project & DNA Polymorphisms

Learning Objectives

Appreciate scale & significance of Human Genome Project (HGP)
Discover molecular “secrets of life” via complete genetic blueprint
Compare human sequence variation with other species & among world populations
Apply genomic information for medicine, anthropology, environment, agriculture, forensics and more

Basic Cytogenetics & Terminology

Humans: $2n=46$ (23 homologous pairs) in somatic cells; $n=23$ in gametes
Diploid genome receives one set of chromosomes from each parent → phenotypic resemblance
OMIM (Dec-9-2011) gene entries
- Autosomal genes: 19 690; X-linked 1 167; Y-linked 59; Mitochondrial 65; Total 20 981
Human Gene Map (Table 3.1b) – total loci mapped per chromosome (e.g. chr-1 = 1288; chr-Y = 46; genome total = 13 176)

Genetic Mapping & Linkage Analysis (Box 3.1)

Genetic map = ordered array of gene loci on chromosomes; unit = centiMorgan (cM) where $1\,\text{cM}=1\%$ meiotic recombination
Recombination fraction $\theta$ (or $r$ ) estimated via Maximum Likelihood → Log of Odds (LOD) score
- $\text{LOD}=\log_{10}\bigg(\dfrac{\Pr(\text{family}|\theta)}{\Pr(\text{family}|\theta=0.5)}\bigg)$
- Interpretations: $\text{LOD}\ge3$ (linked); $\text{LOD}=2$ (strong); $\text{LOD}=1$ (tentative); $\text{LOD}\le-2$ (no linkage)
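
Worked example (illustrative sketch, not from Box 3.1): the LOD formula above evaluated for the simplest case of phase-known meioses, where the family likelihood is assumed to be binomial, $\theta^{k}(1-\theta)^{n-k}$ for $k$ recombinants in $n$ meioses. The function names and the example counts (2 recombinants in 20 meioses) are hypothetical.

```python
import math

def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD = log10[ L(theta) / L(0.5) ] under a binomial likelihood (assumed model)."""
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    k, n = recombinants, meioses
    # Work in log space so large n does not underflow.
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null

def best_theta(recombinants: int, meioses: int) -> float:
    """Grid-search maximum-likelihood theta over 0.01, 0.02, ..., 0.50."""
    grid = [i / 100 for i in range(1, 51)]
    return max(grid, key=lambda t: lod_score(recombinants, meioses, t))

if __name__ == "__main__":
    theta_hat = best_theta(2, 20)
    print(theta_hat, round(lod_score(2, 20, theta_hat), 2))
```

With these made-up counts the maximum falls at $\theta = 0.10$ with a LOD just above 3, i.e. on the "linked" side of the threshold listed above. Real pedigrees with unknown phase need a full likelihood calculation rather than this shortcut.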

DNA Structure Refresher (Watson–Crick 1953)

Double helix; strands antiparallel (5’→3’ / 3’→5’)
Complementary base-pairing: A::T, G::C via H-bonds
Nucleotides = base + deoxyribose + phosphate; A & G = purines, C & T = pyrimidines
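
Illustration (a minimal sketch; the helper names are hypothetical): complementary pairing and antiparallel orientation together mean the partner strand, read 5’→3’, is the reverse complement of the given strand.

```python
# A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def reverse_complement(seq: str) -> str:
    """Return the partner strand written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_purine(base: str) -> bool:
    """True for A or G, False for C or T."""
    base = base.upper()
    if base not in PURINES | PYRIMIDINES:
        raise ValueError(f"not a DNA base: {base!r}")
    return base in PURINES

if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
```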

Evolution of Genetic Markers

1950s–70s: phenotypic markers (ABO, serum proteins, HLA)
1980s: Restriction Fragment Length Polymorphisms (RFLPs)
Late-1980s → 1990s: Variable Number Tandem Repeats (VNTRs/minisatellites) (0.1–20 kb repeats of 15–100s bp)
1990s: Microsatellites (STRs; 1–4 bp repeats; >6 000 loci)
2000s: Single Nucleotide Polymorphisms (SNPs; > $4\times10^{5}$ sites)
Marker comparison
- Multi-allelic VNTR/STR → high heterozygosity, ideal for linkage & forensics
- SNPs → bi-allelic, ultra-stable, amenable to high-throughput automated genotyping, optimal for genome-wide scans
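
Why multi-allelic markers are more informative (sketch; the formula $H = 1-\sum_i p_i^{2}$ is the standard expected heterozygosity under random mating, which the notes do not state explicitly, and the allele frequencies below are invented):

```python
def expected_heterozygosity(allele_freqs):
    """H = 1 - sum(p_i^2); frequencies must sum to 1."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)

if __name__ == "__main__":
    snp = [0.5, 0.5]          # bi-allelic SNP at its most informative
    str_locus = [0.1] * 10    # hypothetical STR with 10 equally common alleles
    print(expected_heterozygosity(snp))        # at most 0.5 for any bi-allelic marker
    print(expected_heterozygosity(str_locus))  # about 0.9
```

A bi-allelic SNP can never exceed $H = 0.5$, so a genome-wide scan compensates with marker density rather than per-marker informativeness.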

Why Sequence the Whole Genome?

Mapping ≈30 000 genes one-by-one deemed infeasible → full nucleotide sequence preferred
Genome = entire DNA content (nuclear + mitochondrial)

Human Genome Project (HGP)

Historical Milestones

1977 Sanger dideoxy sequencing (foundation technology)
1984 DOE workshop (Alta, Utah) → need to detect low-frequency radiation-induced mutations → genome sequencing proposal
1986 DOE conference in Santa Fe, New Mexico, formally conceives HGP; three objectives: refined physical maps, technology development, computational infrastructure
1988 NIH creates Office of Human Genome Research; HUGO founded; US Congress funds NIH + DOE; budget sets aside 3–5 % for ELSI
Oct-1-1990: official launch, $\$3\,\text{billion}$ , 15-year plan
1993: Francis Collins becomes NHGRI director
1998: Celera Genomics (J. Craig Venter) announces private sequencing effort (budget $\$300\,\text{million}$ )
Jun-26-2000: simultaneous announcement of first draft genome (public + Celera)
2001: drafts published (Nature & Science)
2003: finished sequence covers ~99 % of the euchromatic genome (≈92 % of the whole genome), 2 years ahead of schedule
Parallel model-organism sequencing: E. coli, S. cerevisiae, C. elegans, D. melanogaster, Mus musculus, etc.

Original Goals (1990)

Identify all ~30 000 human genes & map to chromosomes
Produce high-resolution physical maps (human + selected species)
Determine complete human nuclear DNA sequence
Develop data storage, analysis & distribution capacities
Innovate new sequencing & informatics technologies

Updated 5-Year Goals (1998)

Catalogue human variation (SNPs) → genotype–phenotype correlations
Cross-species genome comparison
Advanced computational infrastructure
Intensify ELSI research & public policy
Foster interdisciplinary genomics training

Sequencing Strategies

1. International Human Genome Sequencing Consortium (IHGSC)

Hierarchical (BAC-by-BAC) shotgun approach
- Chromosome DNA → sheared (~150 kb) → cloned into BACs
- BACs mapped to Sequence Tagged Sites (STS)
- Each BAC shotgun-sequenced into ~1–2 kb fragments
- Assembly → contigs → tiling path covering chromosome

2. Celera Genomics

Whole-genome shotgun (WGS) with paired-end reads
- Entire genome fragmented, sequenced, computationally assembled into 119 000 scaffolds
- Scaffolds positioned onto STS framework using public data
Faster & cheaper (≈ $\$300\,\text{million}$ vs $\$3\,\text{billion}$ for the public project) but initially proprietary access

Genome Donors

Public project: blood (female) & sperm (male) from 20 anonymous donors; only 4 samples advanced; ≥70 % sequence from one Buffalo (NY) male
Celera: 21 donors; sequences from 5 individuals

Genome Assembly Workflow

Millions of reads (≤1000 bp) → overlap detection → merge into contigs
Contigs linked by mate-pair spacing = scaffolds
Scaffolds anchored to chromosomal STSs to form draft sequence
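
Toy version of the first step, reads → contigs (a sketch only: greedy suffix/prefix merging is a textbook simplification, not the assembler either project actually used; it ignores sequencing errors, repeats, reverse strands and mate-pair information, and all names are hypothetical):

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

if __name__ == "__main__":
    reads = ["ACGTTGC", "TTGCAAG", "CAAGGCT"]
    print(greedy_assemble(reads))  # ['ACGTTGCAAGGCT']
```

Repeats are what break this naive approach on a real genome, which is why the notes stress mate-pair spacing and STS anchoring for the later steps.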

Genome Annotation

Structural annotation: identify ORFs, exons/introns, promoters, regulatory motifs, repeats
Functional annotation: infer biological role, expression pattern, pathways, interactions
Tools: BLAST, PHRED/PHRAP, automated pipelines; data deposited in GenBank, Ensembl, UCSC Genome Browser
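
Minimal idea of ORF identification (a sketch under strong assumptions: forward strand only, ATG start, standard stop codons, no introns; real annotation pipelines are far more elaborate, and the function name and test sequence are hypothetical):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end) half-open coordinates of ATG...stop ORFs in the 3 forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))  # include the stop codon
                start = None                      # look for the next ATG
    return orfs

if __name__ == "__main__":
    dna = "CCATGGCTAAATGAGG"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])  # 2 14 ATGGCTAAATGA
```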

Key Findings from the Human Genome

Genome size: 3.1647 Gb (haploid)
Gene count: ~30 000 (similar order to mouse & worm)
Average gene length ≈3 kb; largest = dystrophin (2.4 Mb)
Sequence identity among humans: 99.9 %; yet > $10^{7}$ variant sites between individuals
Protein-coding portion: only 1.1–1.4 % of genome; ≈98 % non-coding (regulatory, structural, repeats)
Segmental duplications abundant → source of primate-specific genes
Repeat content ≈50 % (higher than many species); rates of repeat accumulation declined over last 50 My
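
Back-of-envelope check of the percentages above in absolute bases (sketch; only the figures quoted in this list are used, and the variable names are hypothetical):

```python
GENOME_BP = 3.1647e9  # haploid genome size quoted above

def bases(fraction: float, genome_bp: float = GENOME_BP) -> float:
    """Convert a genome fraction into a number of bases."""
    return fraction * genome_bp

coding_low = bases(0.011)     # 1.1 % protein-coding
coding_high = bases(0.014)    # 1.4 % protein-coding
pairwise_diff = bases(0.001)  # 0.1 % difference implied by 99.9 % identity
repeat_bp = bases(0.50)       # ~50 % repeats

if __name__ == "__main__":
    print(f"coding: {coding_low / 1e6:.1f}-{coding_high / 1e6:.1f} Mb")
    print(f"differences between two haploid copies: {pairwise_diff / 1e6:.2f} Mb")
    print(f"repeats: {repeat_bp / 1e9:.2f} Gb")
```

So "only 1.1–1.4 %" still means roughly 35–44 Mb of coding sequence, and 99.9 % identity still leaves about 3 million differing positions between two haploid copies. The $>10^{7}$ figure in the list presumably counts sites catalogued across many individuals; that reading is an assumption.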

Genome Architecture

GC-rich, gene-dense “urban centres” vs AT-rich, gene-poor “deserts”
Light (euchromatic) vs dark (heterochromatic) bands on metaphase chromosomes correspond to GC/AT patterning
CpG islands (C/G-rich runs, up to ~30 kb) flank gene clusters → regulatory significance
Alternative splicing expands proteome diversity → humans produce ~3× more protein isoforms than flies/worms despite similar gene numbers
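
How GC-rich vs AT-rich regions can be spotted computationally (sketch; fixed-size windows and the window length are arbitrary illustrative choices, and the sequence is made up):

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G or C bases in seq."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq: str, window: int, step: int):
    """GC fraction for each full window, moving by `step` bases."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

if __name__ == "__main__":
    toy = "GGGGCCCCAAAATTTT"  # GC-rich half followed by AT-rich half
    print(windowed_gc(toy, window=8, step=8))  # [1.0, 0.0]
```

On real chromosomes the windows would be tens to hundreds of kb, and the resulting profile is what separates the "urban centres" from the "deserts".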

Applications / Benefits of HGP

Molecular Medicine

Early, presymptomatic genetic tests & risk prediction
DNA probes & genomic screens for monogenic disorders & complex traits
Gene therapy: vector-mediated replacement or correction of defective genes
Novel drug, immunotherapy & protein-replacement strategies

Risk Assessment

Identify alleles modulating susceptibility/resilience to radiation, chemicals, pathogens
Inform public-health interventions & personalized preventive medicine

Energy & Environment (DOE Microbial Genome Program, 1994)

Engineer microbes for biofuel production, bioremediation, extreme-environment bioprocesses, photosynthetic energy capture

Anthropology, Evolution, Migration

Trace maternal (mtDNA) & paternal (Y-chromosome) lineages
Date divergence events, reconstruct migration routes, correlate with linguistic/cultural history

Forensic Science

Individual identification (DNA fingerprinting), criminal justice, disaster victim ID, paternity, wildlife conservation, pathogen detection, transplant compatibility

Agriculture & Livestock

Genetically improve crop resilience (drought, pests), nutritional content, reduced pesticide requirement
Breed disease-resistant, high-yield livestock

Disadvantages & Concerns

Genetic discrimination in insurance & employment
Social stigma disrupting family/marriage
Potential misuse of genomic data for harmful purposes

Ethical, Legal & Social Implications (ELSI)

3–5 % of HGP budget dedicated to ELSI research
Informed consent, privacy, data sharing, equitable benefit distribution

Post-Genomic Era Focus

Transcriptomics: global mRNA expression profiling
Proteomics: protein abundance, modifications, interactions → drug targets
Structural genomics: 3-D protein structures for each fold family
Functional genomics: gene knockouts/knock-downs, phenotyping, regulatory networks
Comparative genomics: cross-species alignment to annotate human genes

Need for Diploid Personal Genomes

2007 Venter’s $6\,\text{Gb}$ diploid sequence (HuRef)
- Human-to-human variation 5–7× greater than earlier estimates (15–30 Mb differences)
- 4.1 million variants per person; 22 % are non-SNP (CNVs, indels, segmental duplications) yet account for 74 % of variant bases
Individual haplotype phasing enables discovery of rare/private alleles → foundation for personalized medicine
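
What the 22 % / 74 % split implies (sketch; it assumes every SNP changes exactly one base, which is the only assumption added to the figures quoted above, and the variable names are hypothetical):

```python
TOTAL_VARIANTS = 4.1e6      # variants per person (HuRef figure above)
NON_SNP_SHARE = 0.22        # share of variant events that are not SNPs
NON_SNP_BASE_SHARE = 0.74   # share of variant bases they account for

snp_count = TOTAL_VARIANTS * (1 - NON_SNP_SHARE)
non_snp_count = TOTAL_VARIANTS * NON_SNP_SHARE

# Assumption: 1 base per SNP, so SNP bases equal the SNP count,
# and they make up the remaining 26 % of all variant bases.
snp_bases = snp_count
total_variant_bases = snp_bases / (1 - NON_SNP_BASE_SHARE)
non_snp_bases = total_variant_bases * NON_SNP_BASE_SHARE
mean_non_snp_length = non_snp_bases / non_snp_count

if __name__ == "__main__":
    print(f"non-SNP variants: {non_snp_count:,.0f}")
    print(f"total variant bases: {total_variant_bases / 1e6:.1f} Mb")
    print(f"mean non-SNP variant length: {mean_non_snp_length:.1f} bp")
```

Under that assumption, fewer than a million non-SNP events cover about 9 of roughly 12 Mb of variant sequence, around 10 bp each on average, which is why SNP-only catalogues understate variation.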

Major Spin-Off Projects

1000 Genomes Project (2008-)

Objective: dense catalogue of human variation by sequencing ≥1000 (later 2000+) individuals from multiple ancestries
Rapid public data release; informs disease studies, pharmacogenomics, population genetics

HapMap Project (2002-2009)

Map of haplotype blocks & tag-SNPs (≥1 % frequency) in four initial populations (YRI, CEU, JPT, CHB)
Reduced genotyping burden from ~10 M SNPs to ~500 k tag-SNPs for genome-wide association studies (GWAS)
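
Why a tag-SNP can stand in for its neighbours (sketch; $r^{2}$ is a commonly used linkage-disequilibrium measure chosen here for illustration, the notes do not name one, and the haplotypes are invented):

```python
def r_squared(snp_a, snp_b):
    """Pairwise LD r^2 from phased haplotypes coded 0/1 (one entry per chromosome)."""
    if len(snp_a) != len(snp_b) or not snp_a:
        raise ValueError("need two equal-length, non-empty allele lists")
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("a monomorphic site has no defined r^2")
    return d * d / denom

if __name__ == "__main__":
    tag = [0, 0, 1, 1]
    same_block = [0, 0, 1, 1]    # always inherited with the tag
    unlinked = [0, 1, 0, 1]      # independent of the tag
    print(r_squared(tag, same_block))  # 1.0: genotyping the tag is enough
    print(r_squared(tag, unlinked))    # 0.0: must be genotyped separately
    print(10_000_000 / 500_000)        # 20-fold reduction quoted above
```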

Protein Structure Initiative (PSI)

Large-scale determination of representative protein folds to link structure ↔ function; accelerates rational drug design

Human Epigenome Consortium

Comprehensive mapping of DNA methylation & histone modifications to elucidate epigenetic regulation in health & disease

Human Genome Diversity Project (HGDP)

Collect & preserve DNA from ~700 endangered/isolated ethnic groups (initial plan)
Goals: reconstruct human evolutionary history, migration, adaptation, disease resistance alleles; ensure ethical sampling & benefit sharing

Summary Key-Points

HGP completed high-quality reference sequence of 3.2 Gb human genome, ahead of schedule & under budget
Two complementary sequencing strategies (BAC-by-BAC vs WGS) proved feasible
Genome data revolutionized biology, medicine, biotechnology, evolutionary studies, and spawned numerous large-scale initiatives (HapMap, 1kG, personal genomes, epigenomics)
Only ~1.4 % of genome encodes proteins; vast non-coding landscape harbours regulatory and structural elements
Ethical, legal & social frameworks essential to maximize benefits & minimize harms of genomic advances