Genomes and Genomics Notes

Genomes and Genomics

An organism's genome is defined as the complete haploid genetic complement of a typical cell, essentially one copy of the hereditary information required to specify the organism. For sexually reproducing organisms, the genome includes one set of autosomes and one of each type of sex chromosome. Organelles with their own DNA are not considered part of the nuclear genome. Viruses, which can have DNA or RNA, also have a genome, which is a complete copy of the nucleic acid required to specify the virus.

Diploid organisms have sequence variations between the two copies of each chromosome. These differences are used to solve crimes, define parentage, and trace inherited diseases. Sequencing the complete genetic complement of a diploid cell, with both copies of each chromosome, is becoming more common.

The modern science of genomics studies DNA on a cellular scale, propelled by improvements in sequencing technology, computer technologies, and innovative approaches to information organization and searching.

Many Genomes Have Been Sequenced in Their Entirety

The genome is the ultimate source of information about an organism. Genome sequencing is becoming routine, with thousands of genomes sequenced, ranging from bacteria to mammals.

The International Human Genome Project, along with linked projects focused on other organisms, was initiated with substantial funding in the late 1980s. The first complete genome to be sequenced was that of the bacterium $Haemophilus influenzae$, in 1995, followed by the yeast $Saccharomyces cerevisiae$ in 1996 and $Escherichia coli$ in 1997. The Human Genome Project involved 20 sequencing centers across six nations, coordinated by the Office of Genome Research at the National Institutes of Health. The draft sequence of the human genome was published in February 2001, and the completed sequence was published in April 2003. The genome sequence is a composite derived from several anonymous donors, but any given genomic region comes from one individual.

Research teams initially used restriction enzymes to digest the human genome partially, cloning the segments into BAC and YAC vectors. Overlapping clones were identified by hybridization and organized into contigs, which are contiguous stretches of chromosomal DNA. These contigs included sequence-tagged sites (STSs) or expressed sequence tags (ESTs), which were mapped along a specific chromosome. Contigs were divided among international sequencing centers, which sequenced the mapped BAC or YAC clones. Because the clones were longer than contemporary sequencing techniques could resolve in a single run, each clone was sequenced in pieces using a shotgun approach, and the pieces were then assembled by identifying overlaps. Each clone was sequenced four to six times to ensure accuracy, and the data were made available in the genome database.
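The overlap-based assembly step can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats), not the software used by the sequencing centers; the function names and the `min_overlap` parameter are made up for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two fragments that share the longest overlap."""
    fragments = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return fragments  # nothing left to merge; each item is a contig
        merged = fragments[best_i] + fragments[best_j][best_size:]
        fragments = [f for k, f in enumerate(fragments) if k not in (best_i, best_j)]
        fragments.append(merged)


# Three overlapping shotgun reads taken from one hypothetical clone.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

Reads that share no overlap of at least `min_overlap` bases are returned as separate contigs, which loosely mirrors how gaps in coverage leave an assembly in several pieces.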

In 1997, Celera Corporation, led by J. Craig Venter, initiated a competing commercial effort, using a different strategy called whole-genome shotgun sequencing, which eliminates the assembly of a physical map. The DNA segments from throughout the genome were sequenced at random, then ordered by computerized identification of sequence overlaps (with some reference to the public project's detailed and published physical map). The Celera effort used DNA from several human donors, with about 70% of the sequence from Craig Venter himself.

Advances in computer software and sequencing automation made the approach feasible by 1997. The competition substantially shortened the timeline for completing the project. Publication of the draft human genome sequence in 2001 was followed by two years of follow-up work. The genomes of many other species have been sequenced, providing a look at genomic complexity throughout the three domains of living organisms: Bacteria, Archaea, and Eukarya. Completed genomes are a resource for scientists around the world.

Multiple individual human genomes have been sequenced. Genome sequences provide a source for broad comparisons that help pinpoint both variable and highly conserved gene segments and allow the identification of genes unique to a species or group of species. These comparisons support efforts to map genes, identify new proteins and disease genes, elucidate genetic patterns of medical interest, and trace evolutionary history.

Annotation Provides a Description of the Genome

A genome sequence is a long string of A, G, T, and C residues. Genome annotation converts the raw sequence into information that scientists can use, yielding a listing of the locations and functions of genes and other critical sequences. Annotation is challenging because, in a newly sequenced genome, 40% or more of the genes are often ones about which little or nothing is known.
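One small part of annotation, locating candidate protein-coding regions, can be sketched as a scan for open reading frames (ORFs). The example below is a deliberately simplified illustration: it reads only the forward strand, ignores introns, and uses the standard stop codons. The function name and the `min_codons` threshold are assumptions for this sketch, not part of any particular annotation pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs for ATG...stop stretches in the three forward frames.

    `end` is exclusive and includes the stop codon, so dna[start:end] is the ORF.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


dna = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 17 ATGAAATTTGGGTAA
```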

Protein and RNA function can be described on three levels:

Phenotypic function: Describes the effects of a gene product on the entire organism.
Cellular function: Describes the metabolic processes in which a gene product participates and its interactions with other proteins or RNAs in the cell.
Molecular function: Refers to the precise biochemical activity of a protein or an RNA.

Each of these functions can be elucidated by computational and experimental approaches. Computational approaches involve Web-based programs that define gene locations and assign tentative gene functions, based on similarity to genes previously studied. Resources such as the BLAST (Basic Local Alignment Search Tool) algorithm allow a rapid search of all genome databases for related sequences.
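To give a feel for what a similarity search computes, here is a minimal sketch of local alignment scoring by dynamic programming (the Smith-Waterman method). BLAST itself uses a faster heuristic, so this is a stand-in for the underlying idea, not BLAST's algorithm. The scoring values, function names, and the tiny sequence "database" are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_hit(query, database):
    """Name of the database entry with the highest local alignment score."""
    return max(database, key=lambda name: local_alignment_score(query, database[name]))


# Hypothetical toy database; real searches run against millions of sequences.
database = {
    "hypothetical_seq1": "GGACGGG",
    "hypothetical_seq2": "TTTACGTAA",
    "hypothetical_seq3": "CCCCCC",
}
print(best_hit("ACGTA", database))  # hypothetical_seq2
```

The same scoring scheme applies to protein sequences if the match and mismatch values are replaced by a substitution matrix.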

Two internet resources offering public access to genome sequence information:

NCBI (National Center for Biotechnology Information).
Ensembl.

The availability of many genome sequences in online databases enables researchers to assign gene functions by genome comparisons, an enterprise referred to as comparative genomics. Sequence comparison can be done with DNA, RNA, or protein. Any two genes with demonstrable sequence similarity are called homologs. When two genes in different species possess a clear sequence and functional relationship to each other, they are known as orthologs: genes derived from an ancestral gene in the last common ancestor of the two species. Paralogs are genes similarly related but within a single species; they arise most often from gene duplication, followed by specialization.

If the function of any gene has been characterized for one species, this information can be used to tentatively assign gene function to a related gene in a second species. Gene identity is often easiest to discern when comparing genomes from closely related species, such as mouse and human. Conserved gene order, or synteny, provides additional evidence for an orthologous relationship between genes at identical locations in the related segments.
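Conserved gene order can be checked with a simple scan. The sketch below assumes that gene order along one chromosome segment in each species is already known and that candidate ortholog pairs have been assigned (for example, by sequence similarity). All gene names are hypothetical, and inversions and rearrangements are ignored.

```python
def longest_syntenic_block(order_a, order_b, orthologs):
    """Longest run of genes that are adjacent, in the same order, in both species.

    order_a, order_b: gene names in chromosomal order for species A and B.
    orthologs: maps a species-A gene to its candidate ortholog in species B.
    """
    position_b = {gene: idx for idx, gene in enumerate(order_b)}
    best, current, prev_pos = [], [], None
    for gene in order_a:
        partner = orthologs.get(gene)
        pos = position_b.get(partner)
        if pos is None:  # no ortholog on this segment: the run is broken
            current, prev_pos = [], None
            continue
        if prev_pos is not None and pos == prev_pos + 1:
            current.append((gene, partner))
        else:
            current = [(gene, partner)]
        prev_pos = pos
        if len(current) > len(best):
            best = list(current)
    return best


# Hypothetical gene orders on related segments in two species.
species_a = ["a1", "a2", "a3", "a4", "a5"]
species_b = ["b9", "b1", "b2", "b3", "b5"]
pairs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b7", "a5": "b5"}
print(longest_syntenic_block(species_a, species_b, pairs))
# [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
```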

In every newly described genome sequence, the uncharacterized genomic segments and genes represent a special challenge. Established experimental approaches, and new ones now being developed, focus mainly on protein-coding genes. For several genomes, such as those of $S. cerevisiae$ and the plant $Arabidopsis thaliana$, gene knockout (inactivation) collections have been developed by genetic engineering. In other available libraries, each gene in a specific genome is expressed as a tagged fusion protein. Tags may be designed to allow protein isolation, investigate interactions with other proteins, or explore subcellular localization.

Genome Databases Provide Information About Every Type of Organism

The available genome sequences are assisting research in all biological disciplines and enable new questions to be asked.

Viruses

Viruses are obligate intracellular parasites, each a pathogen of some host organism. They are divided into seven classes based on the genomic nucleic acid (RNA or DNA), whether it is single-stranded or double-stranded, and the mechanism used to replicate it. Viral genomes vary in complexity, and thousands of them have been sequenced.

Bacteria

Bacteria inhabit every environment and perform essential tasks without which all other life forms would perish. Thousands of representative bacterial species have been subjected to genome sequencing.

Metagenomics

The need to know more about microbial communities has given rise to metagenomics, where DNA is isolated from an entire community of microbial species.

Archaea

Discovered in 1977 and formerly known as Archaebacteria, archaea are single-celled organisms that share properties with both bacteria and eukaryotes. Many of the most interesting species are extremophiles.

Eukaryotes

Eukaryotic genomes can be larger than the genomes in the other two domains. The sequencing of large eukaryotic genomes is becoming routine. Orthologs of genes involved in important processes and disease states in humans can be found in the genomes of model organisms, facilitating laboratory research into gene function. Specialized databases have been developed for the genomes of organisms of particular interest to science, including mouse, fruit fly, mustard weed, and yeast.

Individual human genomes are also available online, including those of James Watson and Craig Venter.

The Human Genome Contains Many Types of Sequences

Humans are not as complicated as once imagined. Humans have only about 25,000 protein-encoding genes: fewer than twice the number in a fruit fly, not many more than in a nematode worm, and fewer than in a rice plant. The study of eukaryotic chromosome structure, and the sequencing of entire eukaryotic genomes, has revealed that many eukaryotic genes contain one or more intervening segments of DNA that do not code for the amino acid sequence of the polypeptide product. These nontranslated inserts are called intervening sequences, or introns, and the coding segments are called exons.

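The process of removing introns from a primary RNA transcript to generate a transcript that can be translated contiguously into a protein product is known as splicing. Humans share many types of protein domains with other organisms but use these domains in more complex arrangements. Alternative splicing, in which the exons of one gene are joined in different combinations, allows a single gene to yield more than one protein. Humans and other vertebrates engage in this process far more than bacteria, worms, or any other form of life, thereby allowing greater complexity in the proteins generated.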
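As a rough illustration, splicing can be modeled as joining the exon segments of a primary transcript and discarding everything between them. The sketch below assumes the exon coordinates are already known; the sequence and coordinates are invented for the example, and the function name is hypothetical.

```python
def splice(primary_transcript, exons):
    """Join exon segments in order; `exons` holds (start, end) pairs, end exclusive."""
    return "".join(primary_transcript[start:end] for start, end in sorted(exons))


# Invented transcript: exon 1, one intron, exon 2.
exon1, intron, exon2 = "AUGGCC", "GUAAGUUUUCAG", "UUUUAA"
transcript = exon1 + intron + exon2

mature = splice(transcript, [(0, 6), (18, 24)])
print(mature)  # AUGGCCUUUUAA
```

Passing a different subset of exon coordinates to `splice` shows how one gene could, in principle, give rise to more than one mature transcript.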

In mammals and some other eukaryotes, the typical gene has a much higher proportion of intron DNA than exon DNA; only about 1.5% of human DNA is “coding” or exon DNA. However, when the much larger introns are included in the count, as much as 30% of the human genome consists of genes.

Much of the non-gene DNA is in the form of repeated sequences. About half the human genome is made up of moderately repeated sequences derived from transposable elements (transposons). There are multiple classes of transposons in the human genome, with some being strictly DNA segments and others being retrotransposons that transpose from one genomic location to another via RNA intermediates that are reconverted to DNA by reverse transcription.

Once coding genes and transposons are accounted for, perhaps 25% of the total DNA remains. The largest portion of this consists of unique sequences found between protein-coding genes. Another 3% of the human genome consists of highly repetitive sequences referred to as simple-sequence repeats (SSRs). Long repeats of simple sequences also occur throughout the genome. Within the human population, there are millions of single-base variations, called single nucleotide polymorphisms, or SNPs.
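Simple-sequence repeats are easy to recognize computationally because a short unit is repeated in tandem. The following sketch uses a regular expression with a backreference to find them. The unit length limit, the minimum repeat count, and the function name are assumptions for illustration, and real SSR finders also tolerate imperfect repeats.

```python
import re


def find_ssrs(dna, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for each perfect tandem repeat found."""
    # Group 1 captures the shortest repeating unit; \1 requires further copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(dna):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


print(find_ssrs("TTCACACACACAGG"))  # [(2, 'CA', 5)]
```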

From these genetic differences comes human variety: differences in hair color, stature, eyesight, allergies to medication, foot size, and behavior. Genetic recombination during meiosis tends to mix and match these small genetic variations. However, groups of SNPs and other genetic differences that are close together on a chromosome are rarely affected by recombination and are usually inherited together; these groupings are known as haplotypes.

Defining a haplotype requires several steps:

Positions containing SNPs in the human population are identified.
SNPs that are inherited together are compiled into haplotypes.
Tag SNPs—a subset of the SNPs that define the entire haplotype—are chosen to uniquely identify each haplotype.
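The three steps can be sketched on a toy data set. The example assumes a handful of already aligned, equal-length sequences from one chromosomal region (all invented), and it picks tag SNPs by brute force, which is only practical for very small numbers of SNPs. Function names are hypothetical.

```python
from itertools import combinations


def snp_positions(sequences):
    """Step 1: positions where the aligned sequences do not all agree."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def haplotypes(sequences, positions):
    """Step 2: the distinct combinations of alleles seen at the SNP positions."""
    return sorted({"".join(seq[i] for i in positions) for seq in sequences})


def tag_snps(haps):
    """Step 3: smallest set of SNP indices that still tells every haplotype apart."""
    n_sites = len(haps[0])
    for size in range(1, n_sites + 1):
        for subset in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in subset) for h in haps}
            if len(signatures) == len(haps):
                return list(subset)
    return list(range(n_sites))


# Invented aligned sequences from four chromosomes.
region = ["ACGTACGTAC", "ATGTACCTAC", "ATGTACCTAA", "ACGTACGTAC"]
positions = snp_positions(region)      # [1, 6, 9]
haps = haplotypes(region, positions)   # ['CGC', 'TCA', 'TCC']
tags = [positions[i] for i in tag_snps(haps)]
print(positions, haps, tags)           # [1, 6, 9] ['CGC', 'TCA', 'TCC'] [1, 9]
```

In this toy region the alleles at positions 1 and 6 always travel together, so typing only positions 1 and 9 is enough to identify each haplotype.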

Haplotypes can be used as markers to trace human migrations.

Genome Sequencing Informs Us About Our Humanity

A primary purpose of sequencing the human genome is to discover the molecular basis of human genetic diseases and identify genes, gene alterations, and other genomic features that are unique to the human genome.

Analyses of the human lineage have not detected an enrichment of genetic changes in protein-coding genes involved in brain development or size but have found such enrichments in transcription factors.

Genome Comparisons Help Locate Genes Involved in Disease

A primary purpose of most genome sequencing projects is to identify conserved genetic elements of functional significance. One purpose of sequencing the human genome is to understand what makes humans human, which requires comparing our genome to those of other species.

One approach involves mapping the gene involved in a disease condition relative to well-characterized genetic polymorphisms that occur throughout the human genome using methods rooted in evolutionary biology. This approach involves one or more large families that include several individuals affected by a particular disease.

Modern genome databases are opening up alternative paths to the identification of disease genes. If researchers know even a little about the kinds of enzymes or other proteins likely to contribute to a disease, these databases can quickly generate a list of genes known to encode proteins with relevant functions.
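In its simplest form, such a database query is a keyword filter over functional annotations. The sketch below runs on an in-memory dictionary of made-up records; the gene names, annotation strings, and function name are all hypothetical, and real databases expose far richer query interfaces.

```python
def candidate_genes(annotations, keywords):
    """Genes whose annotated function mentions any of the keywords (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return sorted(
        gene
        for gene, function in annotations.items()
        if any(k in function.lower() for k in wanted)
    )


# Made-up annotation records standing in for a genome database.
annotations = {
    "geneA": "DNA mismatch repair protein",
    "geneB": "glucose transporter",
    "geneC": "DNA repair helicase",
}
print(candidate_genes(annotations, ["repair"]))  # ['geneA', 'geneC']
```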

Key points:

A genome is one copy of the complete genetic complement of an organism. Thousands of complete genome sequences are now available.
Sequencing of a genome is followed by genome annotation.
The human genome contains approximately 25,000 protein-encoding genes.
The sequencing of multiple primate genomes is shedding light on human evolution.
Genome sequence databases facilitate the search for genes for a particular trait, and for genes involved in disease.