The Human Genome and the Evolution of Sequencing Technologies

Foundations and History of Human Genome Sequencing

Context and Objectives
- The study of the human genome encompasses its history, current findings, sequencing technologies, and future directions.
- Primary Learning Objectives:
  - Understanding current information regarding the human genome.
  - Appreciating the variability of the human genome sequence across different populations.
The Invention and Mechanics of Sanger Sequencing
- Invention: Invented in 1977 by Fred Sanger in Cambridge, United Kingdom.
- Longevity: For decades after its invention in 1977, this was essentially the only practical method for reading DNA sequences.
- The Process:
  - The double-stranded DNA is separated.
  - A PCR-like polymerase extension reaction is performed, with fluorescently labeled dideoxynucleotides ( $ddNTPs$ ) added alongside the normal nucleotides.
  - Whenever a ddNTP is incorporated, extension stops, so the resulting products are of varying lengths.
  - Each product carries the fluorescent color of its final nucleotide, the ddNTP that terminated it.
  - The products are separated by size, and detection equipment records the fluorescent signals in order of length, providing the DNA sequence.
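
The chain-termination steps above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not a model of the real chemistry: strand orientation is ignored, each fluorescent color is represented by the base letter itself, and the function names and the `n_molecules` and `dd_fraction` parameters are hypothetical choices made for illustration.

```python
import random

# Complement lookup used to build the newly synthesized strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def chain_termination_fragments(template, n_molecules=2000, dd_fraction=0.2, seed=0):
    """Simulate extension products; each stops where a ddNTP is incorporated."""
    rng = random.Random(seed)
    # Simplification: the new strand is the position-by-position complement.
    new_strand = "".join(COMPLEMENT[base] for base in template)
    fragments = set()
    for _ in range(n_molecules):
        for i in range(len(new_strand)):
            # With probability dd_fraction the incoming base is a labeled
            # ddNTP, which terminates this molecule at position i.
            if rng.random() < dd_fraction:
                fragments.add(new_strand[: i + 1])
                break
    return fragments


def read_sequence(fragments):
    """Order fragments by length and read the terminal (labeled) base of each."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))
```

For the toy template `TACGGATCCA`, calling `read_sequence(chain_termination_fragments("TACGGATCCA"))` recovers the complementary strand `ATGCCTAGGT`, provided enough molecules are simulated that every fragment length occurs at least once.
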
Early Milestones in Sequencing
- Epstein-Barr Virus (1985): Following Sanger's invention, the Epstein-Barr virus (the cause of glandular fever) was sequenced. It consists of approximately $170,000$ base pairs (bp). This was considered a major achievement for Sanger sequencing at the time.
- Initial Genome Meeting (1985): A meeting in California served as the first serious discussion regarding sequencing the human genome.
- Funding (1987): The United States Department of Energy allocated $1,000,000,000$ to what became the Human Genome Consortium for mapping and sequencing.
- International Collaboration (1988): The first Human Genome Organization (HUGO) meeting was held. It was a global consortium involving any country or scientist wishing to participate.
- Anecdotal Lab Distribution: The job of sequencing was distributed globally. For instance, a lab in Adelaide, Australia, was responsible for sequencing a small portion at the end of chromosome 16.
The Human Genome Organization (HUGO) Strategy
- The Five-Year Plan (1990):
  - Genetic Map: Generate a complete genetic map to establish the order of known regions on chromosomes (the "backbone").
  - Physical Map: Establish a physical map with known sequences every $100\,kb$ (kilobases).
  - Model Organisms: Proof of principle projects used organisms with $20\,Mb$ (megabases, or $20,000,000$ base pairs).
- Progress Report (1994): The consortium published its first complete linkage map, providing the framework to distribute genome parts to various labs. The process took ten years of "slow and steady" work.

The Competition: Public vs. Private Efforts

Craig Venter and the Shotgun Approach (1995)
- Craig Venter, operating a private business, announced he had sequenced and assembled two small bacterial genomes (each significantly larger than a viral genome).
- Methodological Comparison:
  - Public Consortium Approach: Constructing maps of landmarks first, then breaking DNA into large chunks, then into smaller clones, sequencing those, and mapping them back to the chromosome. This was methodical and ordered.
  - Venter’s Shotgun Sequencing: A non-methodical approach leveraging high-level computing systems and bioinformatics. The entire genome was chopped into tiny pieces immediately, sequenced, and then reassembled via computer algorithms searching for overlapping sequences.
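
The reassembly step of the shotgun approach, finding overlapping sequences and merging them, can be sketched with a greedy overlap merger. This is an illustrative toy only, not the algorithm Celera used: it assumes error-free reads from a single strand with no repeats, and the function names and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; return separate contigs
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGTA", "GCGTACCTT", "ACCTTAGA"]
contigs = greedy_assemble(reads)
```

With these three toy reads, `contigs` holds the single sequence `ATGGCGTACCTTAGA`. Real genomes contain repeats and sequencing errors, which is why a greedy merge like this one is insufficient in practice and why landmark data was valuable as a backbone.
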
Ethics and the Celera Conflict
- Celera Announcement (1998): Venter created Celera and claimed he would sequence the genome in 3 years for $300,000,000$ . At that point, the public consortium had already spent ten times that amount ( $3,000,000,000$ ).
- Point of Contention: Venter used the public consortium’s landmark data as a backbone for his shotgun assembly.
- The Dilemma: Celera intended to charge researchers for access to their sequence despite it being built upon freely available public data. This caused fear that the private sector might undermine the future of publicly funded medical research.
The Joint Completion (2000–2022)
- June 26, 2000 Announcement: A joint announcement by the UK Prime Minister, the US President, Francis Collins (Public Consortium), and Craig Venter declared the completion of a working draft.
- Publication (2001):
  - Science published Venter’s sequence (derived from 5 individuals, including Venter himself).
  - Nature published the consortium sequence (derived from 8 individuals).
- Draft Quality: These 2001 publications were only draft sequences with approximately $90\,\%$ coverage.
- Completion Milestones:
  - 2004: First "completed" sequence at $99.7\,\%$ coverage.
  - April 2022: The "T2T" (Telomere-to-Telomere) sequence was published, representing the first truly complete genome sequence.

Characteristics and Variability of the Human Genome

Current Scientific Understanding
- The genome contains approximately $3,000,000,000$ nucleotides.
- Protein Coding: Only about $1\,\%$ codes for proteins. This accounts for nearly $20,000$ genes. Complexity arises through alternative splicing.
- Highly Conserved Regions: About $5\,\%$ of the genome is highly conserved through evolution (identical sequences across primates, mice, yeast, etc.), suggesting significant functional importance.
- Non-Coding RNA: Large segments between protein-coding genes are transcribed into RNA but never translated into protein. Their function is mostly unknown.
- Transcription: Nearly all (close to $100\,\%$ ) of the DNA is transcribed into RNA at some point, yet only about $1\,\%$ reaches the protein stage.
The 1000 Genomes Project (2008–2015)
- Aim: To discover all human DNA polymorphisms (from "poly", meaning more than one, and "morphism", meaning shape or form).
- Scale: Sequenced $2,500$ genomes from 26 different populations.
- Large-Scale Variation: Variation spanning more than $1,000$ base pairs (deletions, insertions, duplications) occurs in $4-5\,\%$ of the genome. These variants are found even in healthy individuals, so they are not necessarily pathogenic.
- Copy Number Variations (CNVs): CNVs have been linked to cognitive and mood disorders, including intellectual disabilities, anxiety, depression, and schizophrenia.
Individual Variation vs. Reference Genome
- A typical person compared to the reference genome has:
  - ~ $10,000$ non-synonymous changes (altering amino acids in proteins).
  - ~ $10,000$ synonymous changes (no change to amino acids).
  - ~ $100$ mutations causing premature stop codons (found in healthy individuals, so not necessarily pathogenic).
  - A total of about $300$ genes containing mutations that render the sequence non-functional.
- Key takeaway: There is no such thing as a "perfect" human genome; it is extremely variable.
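
The categories in the list above (synonymous, non-synonymous, premature stop) can be made concrete by translating a codon before and after a single change. The sketch below builds the standard genetic code table; the function name `classify_codon_change` is a hypothetical helper for illustration, and it treats any change that introduces a stop codon as a premature stop without checking where in the gene it falls.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a codon substitution by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"        # same amino acid (or stop) as before
    if alt_aa == "*":
        return "premature stop"    # an amino acid codon became a stop codon
    return "non-synonymous"        # a different amino acid is encoded


print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
print(classify_codon_change("GAA", "GTA"))  # Glu -> Val
print(classify_codon_change("TGG", "TGA"))  # Trp -> stop
```

The three calls print `synonymous`, `non-synonymous`, and `premature stop` respectively, one example for each category a typical genome carries relative to the reference.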

Technological Evolution and Costs

Cost and Efficiency Comparison
- Year 2000: $2,700,000,000$ and 13 years for one genome.
- Modern Day: ~ $600$ and a couple of hours for one genome.
- Driving Factors: Technology, competition for market utility (drug creation, disease diagnosis), and machine miniaturization.
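
The scale of the change is easier to appreciate as a ratio. The short calculation below uses only the figures quoted above; treating "a couple of hours" as 2 hours and a year as 365 days are assumptions made for the arithmetic.

```python
# Figures quoted in the comparison above.
cost_2000_usd = 2_700_000_000
cost_now_usd = 600
time_2000_hours = 13 * 365 * 24   # 13 years, assuming 365-day years
time_now_hours = 2                # assumption for "a couple of hours"

cost_fold = cost_2000_usd / cost_now_usd
time_fold = time_2000_hours / time_now_hours

print(f"Cost reduction: about {cost_fold:,.0f}-fold")
print(f"Time reduction: about {time_fold:,.0f}-fold")
```

Under these assumptions the cost has fallen about 4.5-million-fold and the time about 57,000-fold.
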
Generations of Sequencing
- Sanger (1st Generation): Highly accurate, the "gold standard" for diagnostics, but slow.
- Next-Generation Sequencing (NGS - Illumina):
  - Dominates the market.
  - Uses very short reads ( $75-150\,bp$ ) compared to Sanger's $1,000\,bp$ .
  - Limitations: Poor at handling repetitive sequences and large-scale structural changes (inversions/duplications).
- Third-Generation Sequencing (Long-Read):
  - Focuses on long reads to overcome Illumina's limitations.
  - Examples: SMRT sequencing and Oxford Nanopore.
- Massively Parallel Sequencing: The umbrella term for these modern methods where a huge genome is sequenced via many reactions occurring simultaneously.
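
The short-read limitation with repetitive sequences can be seen in a toy example: a read that lies entirely inside a repeat matches the reference at several places, while a longer read that spans the repeat and its unique flanks matches only once. The reference sequence, the reads, and the `mapping_positions` helper below are all hypothetical and use exact matching only, which real aligners do not rely on.

```python
def mapping_positions(reference, read):
    """Return every start index where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions


# Toy reference: unique flank, six copies of "CAG", unique flank.
repeat = "CAG" * 6
reference = "TTGACCTA" + repeat + "GGATCCAT"

short_read = "CAGCAG"                  # lies entirely within the repeat
long_read = "CCTA" + repeat + "GGAT"   # spans the repeat and both flanks

print(mapping_positions(reference, short_read))  # several candidate positions
print(mapping_positions(reference, long_read))   # a single position
```

Here the short read has five equally good placements, so it cannot show how many repeat copies are present, whereas the long read is placed uniquely.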

Medical and Research Applications

Exome Sequencing: Sequencing only the exons (protein-coding regions) of genes, which is more cost-efficient for clinical mutation finding.
Cancer Genomes: Sequencing tumor genomes has become a game-changer for mapping out oncological treatment.
Pharmacogenomics: Sequencing individuals to prescribe the correct drugs and dosages based on how fast they metabolize substances.
Direct-to-Consumer Genetic Testing: Companies allow individuals to submit a cheek swab in return for genetic data, which raises ethical and regulatory concerns for clinicians.
Non-Invasive Prenatal Testing (NIPT): Accessing fetal DNA from maternal blood samples (sensitive enough to distinguish maternal vs. fetal DNA).
Complex Disease Research: Advancing the understanding of the genetics behind non-Mendelian disorders.