Lecture 15 - Genetics and Molecular Biology: Genomics and Bioinformatics

Definitions and Fundamental Concepts of Genomics and Bioinformatics

Genomics is defined as the study of genomes, which encompasses the totality of an organism's genetic material and the investigation of how this genetic information translates into biological function. The field involves methods to accumulate and understand genomic data at a large scale.
Bioinformatics is the application of mathematics, statistics, and computer science to better understand and interpret genomic data. It is characterized as data science applied to biology, representing the intersection of domain knowledge (Biology), Computer Science, and Statistics.
The field of genomics generates, processes, stores, and shares large amounts of high-complexity data.

The Co-evolution of Genomics and Computing

Genomics and computer science have scaled together over time. The increase in genomic sequences necessitated better algorithms, increased computing power, and more storage space. Conversely, advancements in computing knowledge and power enabled biotechnological breakthroughs.
The development of the Internet was essential. It grew concurrently with genomic technologies, evolving from ARPANET to the modern Internet with billions of users. The Internet facilitates the dissemination of genomic information, sharing of bioinformatic software, and international research collaboration.

Detailed Chronology of Genome Sequencing and Milestones

1869: DNA, then called 'nuclein', was isolated by Friedrich Miescher.
1944: DNA was identified as the hereditary material by Avery, MacLeod, and McCarty.
1953: The DNA double-helix structure was solved by Watson, Crick, Franklin, and Wilkins.
1965: First nucleic acid sequence obtained (Yeast tRNA).
1977: Sanger Sequencing was developed, and the first DNA genome ( $\text{phiX174}$ ) was sequenced.
1983: Polymerase Chain Reaction (PCR) invented.
1984: First ancient DNA (aDNA) sequenced from the extinct Quagga.
1989: Pre-HGP meeting in Banbury.
1990: Launch of the Human Genome Project (HGP).
1995: First bacterial genome sequence ( $H. influenzae$ ).
1996: First eukaryotic genome sequence (Yeast) and first Archaea genome ( $M. jannaschii$ ).
1998: First animal genome sequence ( $C. elegans$ ).
2000: Joint statement by Blair and Clinton declaring human genome data free and public; first plant ( $Arabidopsis$ ) and fruit fly genomes sequenced.
2001: Publication of the first draft of the Human Genome.
2002: Draft of the Mouse genome released.
2003: Human Genome declared 'Complete' ( $\sim 92\%$ ) and ENCODE Phase 1 launch.
2007: Next-Generation Sequencing (NGS) emerges.
2008: 1000 Genomes Project and NIH Roadmap Epigenomics Project launch.
2009: Genome 10K (G10K) aims to sequence vertebrate diversity.
2010: Neanderthal and Denisovan genome drafts published.
2011: BLUEPRINT Epigenome (EU) and 10k Bird Genome (B10K) launch.
2012: ENCODE Phase 2 results and high-coverage Denisovan genome.
2013: United States Supreme Court (SCOTUS) rules that naturally occurring human genes cannot be patented.
2014: Long-read sequencing technology becomes prominent.
2015: 1000 Genomes Project and NIH Roadmap Epigenomics results.
2016: All of Us project launch.
2017: Vertebrate Genomes Project (VGP) launch and gnomAD release.
2018: Earth BioGenome Project and Darwin Tree of Life launch; Genomics England 100k Genomes and UK Biobank Exome release.
2020: Pan-Cancer Analysis of Whole Genomes (PCAWG).
2021: African BioGenome Project clinical launch and sequencing of $1.2$-million-year-old mammoth DNA.
2022: Telomere-to-Telomere (T2T) Human Genome ( $100\%$ complete) and ENCODE Phase 4.
2023: Human Pangenome Reference Draft and Zoonomia Project comparison of 241 mammals.
2024: UK Biobank 500k Whole Genome Sequencing (WGS) and T2T Bread Wheat genome assembly.

Significance of Model Organisms in Genomics

Bacteriophage MS2 (1976): Virus (RNA), $3,569\,nt$ . First RNA-based genome.
Bacteriophage phiX174 (1977): Virus (DNA), $5,386\,bp$ . First DNA-based genome.
Haemophilus influenzae (1995): Bacterium, $1.9\,Mb$ . First free-living organism sequenced.
Saccharomyces cerevisiae (1996): Yeast, $12\,Mb$ . First eukaryotic genome.
Methanococcus jannaschii (1996): Archaea, $1.66\,Mb$ . First Archaea genome.
Caenorhabditis elegans (1998): Roundworm, $100\,Mb$ . First multicellular animal.
Arabidopsis thaliana (2000): Plant, $135\,Mb$ . First plant genome.
Drosophila melanogaster (2000): Fruit fly, $180\,Mb$ . First insect genome.
Homo sapiens (2001): Human, $3.2\,Gb$ . First mammal (Draft).
Mus musculus (2002): Mouse, $2.5\,Gb$ . Primary mammalian model.

Genome Assembly and Evolution

Genome assembly is compared to a puzzle with millions of short fragmented reads that must be pieced together.
The human genome took about 20 years to go from draft (2001) to full completion (2022). Each new reference version requires reprocessing data that were aligned to the older one.
Specific regions are hard to assemble: centromeres, telomeres, and other repeated regions. Long-read technologies like Oxford Nanopore and PacBio were essential to fill these remaining gaps.
Transition to Pangenomes: The original reference was based on roughly one individual of blended ancestry ( $\sim 70\%$ ) and 19 others mostly of European ancestry ( $\sim 30\%$ ). Modern pangenomes collect diverse sequences to capture Single Nucleotide Variants (SNVs), duplications, deletions, insertions, and inversions.

Gene Annotation and Protein Prediction

Computer programs find protein-coding genes by detecting specific signals: TATA box ( $TATA(A/T)A(A/T)$ ), CAAT box, translation initiation sites, splice sites, exons, introns, stop codons, and poly-A addition sites ( $AATAAA$ ).
Software detects Open Reading Frames (ORFs) to annotate genomes. RNA-seq (sequencing of transcribed RNA, typically after conversion to cDNA) helps locate exons.
Because the genetic code is known, protein sequences can be predicted from gene sequences. Chains start with Methionine (Met) and end at specific stop codons.
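The ORF detection and protein prediction described above can be sketched in a few lines. This is a minimal illustration, not how production annotation software works: it assumes the standard genetic code, scans only the three forward reading frames, and ignores introns and splicing. The function names `translate` and `find_orfs` and the `min_codons` parameter are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in the conventional TCAG codon-table order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding DNA sequence codon by codon ("X" for unknown codons)."""
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))


def find_orfs(dna, min_codons=2):
    """Return (start, end, protein) for each ATG...stop ORF in the forward frames.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first Met codon opens the ORF
            elif start is not None and CODON_TABLE.get(codon) == "*":
                protein = translate(dna[start:i])
                if len(protein) >= min_codons:
                    orfs.append((start, i + 3, protein))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGGCCTTTTAAGG"))  # [(2, 14, 'MAF')]
```

Real gene finders also scan the reverse strand and combine ORFs with the other signals listed above (splice sites, promoters, poly-A sites).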
Global Human Genome Composition:
- Introns: $26\%$
- LINEs: $20\%$
- SINEs: $13\%$
- LTR retrotransposons: $8\%$
- Miscellaneous heterochromatin: $8\%$
- Segmental duplications: $5\%$
- DNA transposons: $3\%$
- Simple sequence repeats: $3\%$
- Protein-coding genes (UniProtKB): $1.5\%$
- Miscellaneous unique sequences: $12\%$
Humans have fewer protein-coding genes ( $\sim 20,000$ ) than once expected.

Bioinformatics Tools and File Formats

Genome Browsers (e.g., UCSC Genome Browser): Allow visualization of genomic data, including tracks for SNPs, transcription levels, H3K27ac marks (active regulatory elements), and conservation across 100 vertebrates using PhyloP ( $\ln(x+1)$ ).
Key File Formats:
- FASTA/FASTQ: Sequencing data (FASTQ includes quality scores).
- WIG/bigWig: Genomic continuous data.
- SAM/BAM: Mapping locations.
- VCF: Genetic variants.
- BED: Genomic intervals.
- MAF: Multiple alignments.
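To make the FASTQ format concrete, here is a minimal reader written as a sketch. It assumes the common four-lines-per-record layout and Phred+33 quality encoding, where each quality character encodes $Q = \text{ASCII code} - 33$ and the error probability is $10^{-Q/10}$. The names `read_fastq` and `phred_to_error` are hypothetical, and the record is made up for illustration.

```python
import io


def phred_to_error(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores  # drop the leading '@' from the header


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for read_id, seq, scores in read_fastq(example):
    print(read_id, seq, scores)  # read1 ACGT [40, 40, 20, 0]
```

For real data, an established parser is a safer choice than a hand-written one, since FASTQ files in the wild can be compressed or malformed.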
Searching Databases: BLAST (Basic Local Alignment Search Tool) uses heuristics for fast, fuzzy searches to identify similar sequences across species.
Genomic Intervals: Tools like 'bedtools' manipulate genomic intervals (intersecting, merging, counting overlaps) to answer questions such as whether genes encoding SET domains overlap with genes upregulated in an RNA-seq experiment.
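The core idea behind an interval intersection can be sketched in plain Python. This is not bedtools itself, only an illustration of the overlap test. It assumes BED-style coordinates (0-based, half-open) and uses a naive all-against-all scan; the gene names and coordinates are invented for the example.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, ...) intervals share at least one base.

    Assumes BED conventions: 0-based start, half-open end.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect(features_a, features_b):
    """Return the features in A that overlap any feature in B (naive O(n*m) scan)."""
    return [a for a in features_a if any(overlaps(a, b) for b in features_b)]


# Hypothetical intervals: (chrom, start, end, name)
set_domain_genes = [
    ("chr1", 100, 500, "geneA"),
    ("chr1", 1000, 1500, "geneB"),
    ("chr2", 100, 500, "geneC"),
]
upregulated_genes = [
    ("chr1", 400, 600, "geneA"),
    ("chr2", 500, 900, "geneD"),  # starts exactly where geneC ends: no overlap
]

print(intersect(set_domain_genes, upregulated_genes))
# [('chr1', 100, 500, 'geneA')]
```

Dedicated tools sort the intervals or index them so that large files do not need the quadratic scan used here.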

Functional and Comparative Genomics

ENCODE Project: A collaborative effort to decrypt genome function using multi-dimensional data (TF binding, gene expression, epigenetic modifications, 3D contacts) across cell types. It challenged the concept of "junk DNA."
NIH Roadmap: Expanded on ENCODE with integrative analysis of 111 reference human epigenomes across over 100 cell types.
Machine learning models summarize this complex data into functional categories (regulatory regions, transcribed genes).
Transcription Factor (TF) Binding: Represented by Position Weight Matrices (PWMs). Databases like JASPAR store validated motifs. Sequences are scanned for matches by scoring each window against the PWM; probabilistic models such as Hidden Markov Models (HMMs) are also used.
Conserved Sequences: Identified by aligning sequences from different species with software like Clustal Omega. This helps identify functional domains and evolutionary changes.
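A PWM scan can be illustrated with a short sketch. Note that this uses plain log-odds window scoring rather than an HMM. It also assumes a uniform background (0.25 per base) and a pseudocount of 1. The binding sites are invented for illustration and are not taken from JASPAR; `build_pwm`, `scan`, and `best_hit` are hypothetical names.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from aligned, equal-length sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence; returns a list of (position, score)."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][sequence[i + j]] for j in range(width)))
        for i in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return the (position, score) of the highest-scoring window."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


# Made-up TATA-like sites, for illustration only.
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
position, score = best_hit("GGCTATAAAGGC", pwm)
print(position)  # 3, the start of the TATAAA window
```

In practice a score threshold (or a p-value derived from the score distribution) decides which windows count as putative binding sites.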
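Alignment Algorithms:
- Needleman–Wunsch: Dynamic programming; guarantees the optimal global (end-to-end) alignment for a given scoring scheme.
- Smith–Waterman: Dynamic programming; guarantees the optimal local (best-matching subsequence) alignment for a given scoring scheme.
Gene Ontology (GO): A hierarchical classification of genes based on biological process, molecular function, and cellular component.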
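Both algorithms fill the same dynamic-programming table and differ only in initialization and in whether scores may drop below zero. The sketch below computes the optimal score only (no traceback). The scoring values (match +1, mismatch -1, linear gap -1) are an assumption, chosen for readability. The function name `align_score` is hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score: Needleman-Wunsch (global) or Smith-Waterman (local)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                # Local alignment: a negative score restarts the alignment at 0.
                score = max(score, 0)
                best = max(best, score)
            table[i][j] = score
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else table[-1][-1]


print(align_score("ACGT", "AGT"))                        # 2 (three matches, one gap)
print(align_score("TTTACGTTT", "GGACGGG", local=True))   # 3 (shared "ACG")
```

Both run in time proportional to the product of the sequence lengths, which is why database searches rely on heuristics such as BLAST instead.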

Evolutionary and Specialized Genomics

Human vs. Chimpanzee: Coding orthologues are $\sim 99\%$ identical at the DNA level, and whole-genome similarity is about $96\%$. Even so, $80\%$ of genes encode proteins that differ by at least one amino acid.
Ancient DNA (aDNA): Nobel Prize winner Svante Pääbo developed algorithms to correct sequences from degraded/chemically damaged DNA. Current record is mammoth DNA from $1.2$ million years ago.
Epidemiology: Real-time analysis of viral evolution (e.g., COVID-19, bird flu) and antibiotic resistance tracking using tools like Nextstrain.
Metagenomics: Study of microbial communities (e.g., soil or gut) by sequencing DNA directly from samples, including organisms that cannot be grown in a lab.
eDNA (Environmental DNA): Used for monitoring and control in human and veterinary parasitology.
GWAS (Genome-Wide Association Study): A statistical framework that tests variants across the genome for association with complex traits or diseases; results are typically shown as Manhattan plots. Even with 100k genomes, studies can be underpowered for rare phenotypes.
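The per-variant test at the heart of a GWAS can be sketched as a chi-square test on allele counts in cases versus controls. This is a simplified illustration: real analyses usually fit regression models with covariates. The allele counts below are invented, and the $5 \times 10^{-8}$ threshold is the conventional genome-wide significance level, assumed here rather than taken from the lecture. The function name `allelic_chi2` is hypothetical.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff, assumed for this sketch


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: alternate allele at 30% in cases vs 20% in controls.
chi2, p = allelic_chi2(300, 700, 200, 800)
print(round(chi2, 2), -math.log10(p))  # the second value is the Manhattan plot height
print(p < GENOME_WIDE_THRESHOLD)       # False
```

With these counts the frequency difference is large, yet the p-value does not reach the genome-wide threshold. This illustrates how the multiple-testing burden leaves smaller studies underpowered.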
Cancer Genomics: Uses projects like PCAWG and shared databases (Xena) to track somatic mutations and inherited risks.