Human Genome Project & DNA Polymorphisms
Learning Objectives
- Appreciate scale & significance of Human Genome Project (HGP)
- Discover molecular “secrets of life” via complete genetic blueprint
- Compare human sequence variation with other species & among world populations
- Apply genomic information for medicine, anthropology, environment, agriculture, forensics and more
Basic Cytogenetics & Terminology
- Humans: 2n=46 (23 homologous pairs) in somatic cells; n=23 in gametes
- Diploid genome receives one set of chromosomes from each parent → phenotypic resemblance
- OMIM (Dec-9-2011) gene entries
- Autosomal genes: 19 690; X-linked 1 167; Y-linked 59; Mitochondrial 65; Total 20 981
- Human Gene Map (Table 3.1b) – total loci mapped per chromosome (e.g. chr-1 = 1288; chr-Y = 46; genome total = 13 176)
Genetic Mapping & Linkage Analysis (Box 3.1)
- Genetic map = ordered array of gene loci on chromosomes; unit = centiMorgan (cM), where 1 cM = 1 % meiotic recombination
- Recombination fraction θ (or r) estimated via Maximum Likelihood → Log of Odds (LOD) score
- $\mathrm{LOD}(\theta) = \log_{10}\dfrac{\Pr(\text{family}\mid\theta)}{\Pr(\text{family}\mid\theta = 0.5)}$
- Interpretation: LOD ≥ 3 (linkage accepted; odds ≥ 1000:1); LOD ≈ 2 (strong evidence); LOD ≈ 1 (tentative); LOD ≤ −2 (linkage excluded)
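As a minimal sketch of the LOD calculation above, assume the simplest case of phase-known, fully informative meioses, so that the family likelihood is binomial: θ^k (1−θ)^(n−k) for k recombinants in n meioses. The function names and the example counts are illustrative and do not come from Box 3.1.

```python
import math

def lod_score(recombinants, meioses, theta):
    """LOD(theta) = log10[ L(theta) / L(0.5) ] for phase-known meioses (assumed model)."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must be between 0 and meioses")
    k, n = recombinants, meioses
    # Work in log10 space to avoid underflow for large families.
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)  # free recombination, theta = 0.5
    return log_l_theta - log_l_null

def max_lod(recombinants, meioses, grid_points=500):
    """Grid search over theta = 0.001 ... 0.5 for the maximum-likelihood estimate."""
    grid = [i / (2 * grid_points) for i in range(1, grid_points + 1)]
    best = max(grid, key=lambda t: lod_score(recombinants, meioses, t))
    return best, lod_score(recombinants, meioses, best)

# Hypothetical family data: 2 recombinants observed in 20 meioses.
theta_hat, z_max = max_lod(recombinants=2, meioses=20)
print(f"theta_hat = {theta_hat:.3f}, max LOD = {z_max:.2f}")
```

With these made-up counts the estimate is θ = 0.1 (about 10 cM) and the maximum LOD lies just above 3, so the pair of loci would be called linked under the thresholds listed above.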
DNA Structure Refresher (Watson–Crick 1953)
- Double helix; strands antiparallel (5’→3’ / 3’→5’)
- Complementary base-pairing: A·T (two H-bonds), G·C (three H-bonds)
- Nucleotides = base + deoxyribose + phosphate; A & G = purines, C & T = pyrimidines
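The pairing and antiparallel rules above can be illustrated with a short sketch that derives the partner strand of a sequence, read 5’→3’. The function names are illustrative, and the code assumes an unambiguous A/C/G/T alphabet.

```python
# Watson-Crick partners: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the partner strand written 5'->3' (complement, then reverse)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_purine(base):
    """A and G are purines; C and T are pyrimidines."""
    return base.upper() in ("A", "G")

strand = "ATGCCG"
print(reverse_complement(strand))           # CGGCAT
print([is_purine(b) for b in strand])
```

Because the strands are antiparallel, the complement has to be reversed before it reads 5’→3’. Every purine on one strand faces a pyrimidine on the other.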
Evolution of Genetic Markers
- 1950s–70s: phenotypic markers (ABO, serum proteins, HLA)
- 1980s: Restriction Fragment Length Polymorphisms (RFLPs)
- Late-1980s → 1990s: Variable Number Tandem Repeats (VNTRs/minisatellites; arrays spanning 0.1–20 kb built from repeat units of 15 to 100s of bp)
- 1990s: Microsatellites (STRs; 1–4 bp repeats; >6 000 loci)
- 2000s: Single Nucleotide Polymorphisms (SNPs; >4×10⁵ sites)
- Marker comparison
- Multi-allelic VNTR/STR → high heterozygosity, ideal for linkage & forensics
- SNPs → bi-allelic, ultra-stable, amenable to high-throughput automated genotyping, optimal for genome-wide scans
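To see why multi-allelic markers give higher heterozygosity than bi-allelic SNPs, a rough sketch can compute expected heterozygosity H = 1 − Σ pᵢ² under Hardy-Weinberg assumptions. The allele frequencies below are invented for illustration.

```python
def expected_heterozygosity(freqs):
    """H = 1 - sum(p_i^2), assuming random mating (Hardy-Weinberg)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)

# Hypothetical markers: a balanced SNP and an STR with ten equally common alleles.
snp_freqs = [0.5, 0.5]
str_freqs = [0.1] * 10

print(f"SNP: H = {expected_heterozygosity(snp_freqs):.2f}")
print(f"STR: H = {expected_heterozygosity(str_freqs):.2f}")
```

A bi-allelic marker can never exceed H = 0.5, whereas the ten-allele STR in this toy example reaches about 0.9. This is why a single STR locus is more informative for linkage and forensics, while SNPs compensate through sheer number and ease of automation.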
Why Sequence the Whole Genome?
- Mapping ≈30 000 genes one-by-one deemed infeasible → full nucleotide sequence preferred
- Genome = entire DNA content (nuclear + mitochondrial)
Human Genome Project (HGP)
Historical Milestones
- 1977 Sanger dideoxy sequencing (foundation technology)
- 1984 DOE workshop (Alta, Utah) → need to detect low-frequency radiation-induced mutations → genome sequencing proposal
- 1986 DOE conference (Santa Fe, New Mexico) formally conceives HGP; three objectives: refined physical maps, technology development, computational infrastructure
- 1988 NIH creates Office of Human Genome Research; HUGO founded; US Congress funds NIH + DOE; budget sets aside 3–5 % for ELSI
- Oct-1-1990: official launch, $3 billion, 15-year plan
- 1993: Francis Collins becomes director of the NIH genome centre (NCHGR, renamed NHGRI in 1997)
- 1998: Celera Genomics (J. Craig Venter) announces private sequencing effort (budget $300 million)
- Jun-26-2000: simultaneous announcement of first draft genome (public + Celera)
- 2001: drafts published (Nature & Science)
- 2003: finished sequence announced, covering ~99 % of the euchromatic genome (≈92 % of the whole genome), 2 years ahead of schedule
- Parallel model-organism sequencing: E. coli, S. cerevisiae, C. elegans, D. melanogaster, Mus musculus, etc.
Original Goals (1990)
- Identify all ~30 000 human genes & map to chromosomes
- Produce high-resolution physical maps (human + selected species)
- Determine complete human nuclear DNA sequence
- Develop data storage, analysis & distribution capacities
- Innovate new sequencing & informatics technologies
Updated 5-Year Goals (1998)
- Catalogue human variation (SNPs) → genotype–phenotype correlations
- Cross-species genome comparison
- Advanced computational infrastructure
- Intensify ELSI research & public policy
- Foster interdisciplinary genomics training
Sequencing Strategies
1. International Human Genome Sequencing Consortium (IHGSC)
- Hierarchical (BAC-by-BAC) shotgun approach
- Chromosomal DNA → fragmented (~150 kb pieces) → cloned into BACs
- BACs ordered along the chromosome using Sequence Tagged Sites (STS)
- Each BAC insert sheared into ~1–2 kb fragments and shotgun-sequenced
- Assembly → contigs → tiling path covering chromosome
2. Celera Genomics
- Whole-genome shotgun (WGS) with paired-end reads
- Entire genome fragmented, sequenced, computationally assembled into 119 000 scaffolds
- Scaffolds positioned onto STS framework using public data
- Faster & cheaper (≈$300 million vs the ≈$3 billion public budget) but initially proprietary access
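The “tiling path” step of the hierarchical approach can be sketched as an interval-cover problem: given BACs already placed on the STS map, choose a small set of overlapping clones that spans the region. The greedy rule, the clone names and the coordinates (in kb) are assumptions made for illustration, not the consortium's actual procedure.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover over (name, start, end) clones.

    At each step, pick the clone that starts at or before the covered
    point and extends furthest to the right.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path, covered, i = [], region_start, 0
    while covered < region_end:
        best = None
        # Scan every clone that begins inside the already-covered stretch.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path

# Hypothetical BACs positioned on a 450 kb region.
bacs = [("BAC_A", 0, 150), ("BAC_B", 40, 180), ("BAC_C", 100, 260),
        ("BAC_D", 170, 330), ("BAC_E", 250, 400), ("BAC_F", 320, 450)]
path = minimal_tiling_path(bacs, 0, 450)
print([name for name, _, _ in path])
```

Only the clones on the path need to be shotgun-sequenced. In this toy map, four of the six BACs are enough to cover the region. Whole-genome shotgun skips this mapping step and pays for it with a harder assembly problem.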
Genome Donors
- Public project: blood (female) & sperm (male) from 20 anonymous donors; only 4 samples advanced; ≥70 % sequence from one Buffalo (NY) male
- Celera: 21 donors; sequences from 5 individuals
Genome Assembly Workflow
- Millions of reads (≤1000 bp) → overlap detection → merge into contigs
- Contigs linked by mate-pair spacing = scaffolds
- Scaffolds anchored to chromosomal STSs to form draft sequence
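The overlap-and-merge step can be sketched with a toy greedy assembler: repeatedly find the pair of sequences with the longest suffix-prefix overlap and merge them into a contig. This is a deliberately simplified illustration (error-free reads, one strand, no repeats, no mate-pair information) and not how the HGP or Celera assemblers were actually implemented.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap >= min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_pair = 0, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_pair = k, (i, j)
        if best_pair is None:
            return contigs
        i, j = best_pair
        merged = contigs[i] + contigs[j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

# Hypothetical reads sampled from the sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
print(greedy_assemble(reads))
```

The three toy reads collapse into a single contig that reproduces the source sequence. Real genomes contain repeats longer than a read, which is where mate-pair spacing and the STS framework become necessary to order and orient contigs.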
Genome Annotation
- Structural annotation: identify ORFs, exons/introns, promoters, regulatory motifs, repeats
- Functional annotation: infer biological role, expression pattern, pathways, interactions
- Tools: BLAST, PHRED/PHRAP, automated pipelines; data deposited in GenBank, Ensembl, UCSC Genome Browser
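As one small piece of structural annotation, ORF identification can be sketched as a scan for ATG ... stop runs in each reading frame. The sketch below only looks at the forward strand, ignores introns, and uses an arbitrary minimum length, so it is closer to a prokaryote-style ORF scan than to a real human gene finder. All names and the test sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the ATG; end is exclusive and includes the stop codon.
    min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                     # close it and keep scanning
    return orfs

# Hypothetical sequence with one short ORF in frame 2.
dna = "CCATGAAACCCGGGTAACC"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])
```

A full scan would also search the reverse complement. For intron-containing human genes, annotation pipelines rely on splice-site models and alignment of expressed sequences rather than on uninterrupted ORFs.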
Key Findings from the Human Genome
- Genome size: 3.1647 Gb (haploid)
- Gene count: ~30 000 at the draft stage, later revised down to ~20 000–25 000 (similar order to mouse & worm)
- Average gene length ≈3 kb; largest = dystrophin (2.4 Mb)
- Sequence identity among humans: 99.9 %; yet >10⁷ variant sites between individuals
- Protein-coding portion: only 1.1–1.4 % of genome; ≈98 % non-coding (regulatory, structural, repeats)
- Segmental duplications abundant → source of primate-specific genes
- Repeat content ≈50 % (higher than many species); rates of repeat accumulation declined over last 50 My
Genome Architecture
- GC-rich, gene-dense “urban centres” vs AT-rich, gene-poor “deserts”
- Light (euchromatic) vs dark (heterochromatic) bands on metaphase chromosomes correspond to GC/AT patterning
- CpG islands (C/G-rich stretches of up to ~30 kb) flank gene clusters → regulatory significance
- Alternative splicing expands proteome diversity → humans produce ~3× more protein isoforms than flies/worms despite similar gene numbers
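The GC/AT patterning and CpG enrichment described above can be quantified with two simple statistics: GC content and the observed/expected CpG ratio. The sketch below assumes a commonly used heuristic (GC content above roughly 50 % and an observed/expected ratio above roughly 0.6 over at least a few hundred bp). Those thresholds are not taken from these notes, and the sequences are invented.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")          # "CG" cannot overlap itself
    expected = c * g / len(seq)
    return observed / expected

# Hypothetical windows: one island-like, one AT-rich "desert"-like.
island_like = "CGCGGCGCTACGCGCCGCGA"
desert_like = "ATATTTAAGCATTTAATATA"

for name, window in (("island-like", island_like), ("desert-like", desert_like)):
    print(name, round(gc_content(window), 2), round(cpg_obs_exp(window), 2))
```

In practice these statistics are computed in sliding windows along a chromosome, and runs of windows that pass the thresholds are reported as candidate CpG islands.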
Applications / Benefits of HGP
Molecular Medicine
- Early, presymptomatic genetic tests & risk prediction
- DNA probes & genomic screens for monogenic disorders & complex traits
- Gene therapy: vector-mediated replacement or correction of defective genes
- Novel drug, immunotherapy & protein-replacement strategies
Risk Assessment
- Identify alleles modulating susceptibility/resilience to radiation, chemicals, pathogens
- Inform public-health interventions & personalized preventive medicine
Energy & Environment (DOE Microbial Genome Program, 1994)
- Engineer microbes for biofuel production, bioremediation, extreme-environment bioprocesses, photosynthetic energy capture
Anthropology, Evolution, Migration
- Trace maternal (mtDNA) & paternal (Y-chromosome) lineages
- Date divergence events, reconstruct migration routes, correlate with linguistic/cultural history
Forensic Science
- Individual identification (DNA fingerprinting), criminal justice, disaster victim ID, paternity, wildlife conservation, pathogen detection, transplant compatibility
Agriculture & Livestock
- Genetically improve crop resilience (drought, pests), nutritional content, reduced pesticide requirement
- Breed disease-resistant, high-yield livestock
Disadvantages & Concerns
- Genetic discrimination in insurance & employment
- Social stigma disrupting family/marriage
- Potential misuse of genomic data for harmful purposes
Ethical, Legal & Social Implications (ELSI)
- 3–5 % of HGP budget dedicated to ELSI research
- Informed consent, privacy, data sharing, equitable benefit distribution
Post-Genomic Era Focus
- Transcriptomics: global mRNA expression profiling
- Proteomics: protein abundance, modifications, interactions → drug targets
- Structural genomics: 3-D protein structures for each fold family
- Functional genomics: gene knockouts/knock-downs, phenotyping, regulatory networks
- Comparative genomics: cross-species alignment to annotate human genes
Need for Diploid Personal Genomes
- 2007: Venter’s 6 Gb diploid sequence (HuRef) published
- Human-to-human variation 5–7× greater than earlier estimates (15–30 Mb differences)
- 4.1 million variants per person; 22 % are non-SNP (CNVs, indels, segmental duplications) yet account for 74 % of variant bases
- Individual haplotype phasing enables discovery of rare/private alleles → foundation for personalized medicine
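The HuRef percentages above imply some useful derived numbers. The back-of-envelope sketch below assumes that every SNP changes exactly one base, so the SNPs (78 % of variants) make up the remaining 26 % of variant bases. The variable names are illustrative.

```python
total_variants = 4_100_000          # variants per person (from the notes)
non_snp_pct = 22                    # % of variants that are not SNPs
non_snp_base_pct = 74               # % of variant bases they account for

non_snp_variants = total_variants * non_snp_pct // 100
snp_variants = total_variants - non_snp_variants

# Assumption: one variant base per SNP, so SNP bases = SNP count.
snp_bases = snp_variants
total_variant_bases = snp_bases * 100 / (100 - non_snp_base_pct)
non_snp_bases = total_variant_bases - snp_bases
mean_non_snp_length = non_snp_bases / non_snp_variants

print(f"SNPs:               {snp_variants:,}")
print(f"non-SNP variants:   {non_snp_variants:,}")
print(f"variant bases:      {total_variant_bases / 1e6:.1f} Mb")
print(f"mean non-SNP size:  {mean_non_snp_length:.1f} bp")
```

Under that assumption, roughly 0.9 million non-SNP events cover about 9 Mb of the roughly 12 Mb of variant sequence, at about 10 bp per event on average. This is why SNP-only surveys had underestimated human-to-human variation.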
Major Spin-Off Projects
1000 Genomes Project (2008-)
- Objective: dense catalogue of human variation by sequencing ≥1000 (later 2000+) individuals from multiple ancestries
- Rapid public data release; informs disease studies, pharmacogenomics, population genetics
HapMap Project (2002-2009)
- Map of haplotype blocks & tag-SNPs (≥1 % frequency) in four initial populations (YRI, CEU, JPT, CHB)
- Reduced genotyping burden from ~10 M SNPs to ~500 k tag-SNPs for genome-wide association studies (GWAS)
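Tag-SNP selection rests on linkage disequilibrium (LD): if two SNPs are strongly correlated, genotyping one predicts the other. A minimal sketch of the r² statistic from haplotype counts follows. The counts are invented, and the r² ≥ 0.8 tagging threshold mentioned in the comment is a common convention rather than something stated in these notes.

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two bi-allelic SNPs from counts of the four haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n             # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / n             # frequency of allele B at SNP 2
    d = p_AB - p_A * p_B                # disequilibrium coefficient D
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom

# Hypothetical haplotype counts in 100 chromosomes.
print(ld_r2(50, 0, 0, 50))      # perfect LD: one SNP fully tags the other
print(ld_r2(25, 25, 25, 25))    # independent SNPs: no tagging possible
print(ld_r2(45, 5, 5, 45))      # strong LD; tagged if the threshold is r^2 >= 0.8? (0.64, so no)

# Genotyping reduction quoted in the notes: ~10 M SNPs -> ~500 k tag-SNPs.
print(10_000_000 / 500_000)     # 20-fold fewer assays
```

Within a haplotype block most SNP pairs have high r², so a handful of tags capture the block. This is the source of the roughly 20-fold reduction in genotyping for GWAS.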
Protein Structure Initiative (PSI)
- Large-scale determination of representative protein folds to link structure ↔ function; accelerates rational drug design
Human Epigenome Consortium
- Comprehensive mapping of DNA methylation & histone modifications to elucidate epigenetic regulation in health & disease
Human Genome Diversity Project (HGDP)
- Collect & preserve DNA from ~700 endangered/isolated ethnic groups (initial plan)
- Goals: reconstruct human evolutionary history, migration, adaptation, disease resistance alleles; ensure ethical sampling & benefit sharing
Summary Key-Points
- HGP completed high-quality reference sequence of 3.2 Gb human genome, ahead of schedule & under budget
- Two complementary sequencing strategies (BAC-by-BAC vs WGS) proved feasible
- Genome data revolutionized biology, medicine, biotechnology, evolutionary studies, and spawned numerous large-scale initiatives (HapMap, 1kG, personal genomes, epigenomics)
- Only ~1.4 % of genome encodes proteins; vast non-coding landscape harbours regulatory and structural elements
- Ethical, legal & social frameworks essential to maximize benefits & minimize harms of genomic advances