Molecular Basis of Inheritance

5.1 The DNA

DNA is the genetic material for most organisms.
RNA acts as a genetic material in some viruses and functions as a messenger, adapter, structural, and catalytic molecule.

5.2 The Search for Genetic Material

Establishing the nature of the genetic material took roughly a hundred years of investigation, culminating in the realization that it is DNA.
Nucleic acids are polymers of nucleotides; DNA and RNA are the two types.
This chapter discusses the structure of DNA, its replication, transcription (making RNA from DNA), the genetic code, protein synthesis (translation), and the basis of their regulation.
The complete nucleotide sequence of the human genome has ushered in a new era of genomics.

5.3 RNA World

The essentials of human genome sequencing and its consequences are discussed.

5.4 Replication

The structure of DNA is discussed as the most interesting molecule in the living system.

5.5 Transcription

The relationship between DNA and RNA is explored, explaining why DNA is the primary genetic material.

5.6 Genetic Code

DNA is a long polymer of deoxyribonucleotides.
The length of DNA is usually expressed as the number of nucleotides (or base pairs) it contains, and this number is characteristic of an organism.
- Example: Bacteriophage $\phi$X174 has 5386 nucleotides.
- Bacteriophage lambda has 48502 base pairs (bp).
- Escherichia coli has $4.6 \times 10^6$ bp.
- Haploid content of human DNA is $3.3 \times 10^9$ bp.

5.6.1 Structure of Polynucleotide Chain

A nucleotide has three components:
- A nitrogenous base
- A pentose sugar (ribose in RNA, deoxyribose in DNA)
- A phosphate group
Nitrogenous bases are of two types:
- Purines (Adenine and Guanine)
- Pyrimidines (Cytosine, Uracil, and Thymine)
Cytosine is common to both DNA and RNA.
Thymine is present in DNA; Uracil is present in RNA instead of Thymine.
A nitrogenous base links to the 1' C of the pentose sugar through an N-glycosidic linkage to form a nucleoside (e.g., adenosine, guanosine, cytidine, uridine).
A phosphate group links to the 5' C of a nucleoside through a phosphoester linkage, forming a nucleotide.
Two nucleotides link through a 3'-5' phosphodiester linkage to form a dinucleotide; more nucleotides can join to form a polynucleotide chain.
A polynucleotide chain has a free phosphate moiety at the 5' C of the sugar at one end (the 5' end) and a free OH group at the 3' C of the sugar at the other end (the 3' end).
The backbone of the polynucleotide chain consists of sugar and phosphates, with nitrogenous bases projecting from it.
RNA has an additional -OH group at the 2' position in the ribose and contains uracil instead of thymine.
DNA was first identified as an acidic substance in the nucleus by Friedrich Miescher in 1869, who named it 'Nuclein'.
In 1953, James Watson and Francis Crick proposed the Double Helix model for DNA structure based on X-ray diffraction data from Maurice Wilkins and Rosalind Franklin.
Erwin Chargaff's observation: in double-stranded DNA, the ratios between Adenine and Thymine, and Guanine and Cytosine are constant and equal to one.
Base pairing makes the polynucleotide chains complementary; knowing the sequence of one strand allows prediction of the sequence of the other.
Each strand of parental DNA acts as a template for synthesizing a new strand, resulting in daughter DNA molecules identical to the parental DNA.
Salient features of the Double-helix structure of DNA:
- Two polynucleotide chains form the helix, with the sugar-phosphate backbone on the outside and bases projecting inside.
- The two chains have anti-parallel polarity (5'→3' and 3'→5').
- Bases pair through hydrogen bonds (H-bonds): Adenine with Thymine (two H-bonds), Guanine with Cytosine (three H-bonds).
- Purines always pair with pyrimidines to maintain a uniform distance between the two strands.
- The two chains are coiled in a right-handed fashion; the pitch of the helix is 3.4 nm, with roughly 10 bp per turn, so consecutive base pairs are approximately 0.34 nm apart.
- The plane of one base pair stacks over the other, contributing to the stability of the helical structure.
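The base-pairing rules and anti-parallel polarity listed above can be illustrated with a minimal sketch. The function names and the example sequence are hypothetical, chosen only for illustration.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(strand_5_to_3: str) -> str:
    """Return the complementary strand, also written 5'->3'."""
    # Pair each base, reading the input backwards because the
    # two strands run anti-parallel to each other.
    return "".join(PAIR[base] for base in reversed(strand_5_to_3))

def base_ratios(strand_5_to_3: str) -> tuple:
    """Return (A/T, G/C) counted over both strands of the duplex."""
    duplex = strand_5_to_3 + complementary_strand(strand_5_to_3)
    return (duplex.count("A") / duplex.count("T"),
            duplex.count("G") / duplex.count("C"))

example = "ATGCGGTAC"  # hypothetical sequence, written 5'->3'
print(complementary_strand(example))
print(base_ratios(example))
```

Because every A on one strand is matched by a T on the other (and every G by a C), both ratios come out as one for any double-stranded sequence, which is Chargaff's observation.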
Francis Crick proposed the Central Dogma of Molecular Biology: Genetic information flows from DNA to RNA to Protein.
In some viruses, the flow of information is in reverse, from RNA to DNA.

5.6.2 Packaging of DNA Helix

The distance between two consecutive base pairs is 0.34 nm ( $0.34 \times 10^{-9}$ m).
The length of DNA double helix in a typical mammalian cell is approximately 2.2 meters ( $6.6 \times 10^9$ bp $\times 0.34 \times 10^{-9}$ m/bp).
E. coli DNA is about 1.36 mm long; dividing this length by 0.34 nm per bp gives the number of base pairs.
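The two length calculations above can be reproduced with a few lines of arithmetic. This is only a sketch; the function and constant names are illustrative.

```python
RISE_PER_BP_M = 0.34e-9  # distance between consecutive base pairs, in metres

def dna_length_m(base_pairs: float) -> float:
    """Length of a double helix containing the given number of base pairs."""
    return base_pairs * RISE_PER_BP_M

def base_pairs_from_length(length_m: float) -> float:
    """Number of base pairs in a double helix of the given length."""
    return length_m / RISE_PER_BP_M

# Typical mammalian cell: 6.6 x 10^9 bp
print(f"{dna_length_m(6.6e9):.2f} m")
# E. coli: 1.36 mm of DNA
print(f"{base_pairs_from_length(1.36e-3):.2e} bp")
```

The first line gives roughly 2.2 m, and the second gives about $4 \times 10^6$ bp, the same order of magnitude as the $4.6 \times 10^6$ bp quoted for E. coli earlier.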
In prokaryotes like E. coli, DNA (negatively charged) is held with positively charged proteins in a region called the ‘nucleoid.’ DNA is organized into large loops held by proteins.
In eukaryotes, the organization is more complex: DNA is associated with positively charged, basic proteins called histones.
Histones are rich in lysine and arginine, which carry positive charges.
Histones organize into a unit of eight molecules called a histone octamer.
Negatively charged DNA wraps around the positively charged histone octamer to form a nucleosome.
A typical nucleosome contains 200 bp of DNA helix.
Nucleosomes constitute the repeating unit of chromatin, seen as ‘beads-on-string’ under an electron microscope.
The beads-on-string structure is packaged into chromatin fibers, which are further coiled and condensed during cell division to form chromosomes.
Packaging of chromatin at a higher level requires non-histone chromosomal (NHC) proteins.
Loosely packed chromatin that stains lightly is called euchromatin (transcriptionally active), while densely packed chromatin that stains dark is called heterochromatin (inactive).

5.7 The Search for Genetic Material

The discovery of nuclein by Miescher and the proposal of the principles of inheritance by Mendel occurred at roughly the same time.
However, establishing that DNA acts as the genetic material took much longer.
By 1926, the search for the mechanism of genetic inheritance had reached the molecular level.
Previous discoveries narrowed the search to chromosomes in the nucleus, but the identity of the genetic material remained unknown.

Transforming Principle

In 1928, Frederick Griffith experimented with Streptococcus pneumoniae, observing a transformation in the bacteria.
Streptococcus pneumoniae grown on a culture plate produced smooth shiny colonies (S) and rough colonies (R).
S strain bacteria have a mucous (polysaccharide) coat, while R strain bacteria do not.
Mice infected with the S strain (virulent) die from pneumonia, but mice infected with the R strain do not develop pneumonia.
Heat-killed S strain bacteria did not kill mice.
A mixture of heat-killed S and live R bacteria killed the mice, and living S bacteria were recovered from the dead mice.
Griffith concluded that the R strain bacteria were transformed by a ‘transforming principle’ from the heat-killed S strain, enabling them to synthesize a smooth polysaccharide coat and become virulent.
The biochemical nature of the genetic material remained undefined.

Biochemical Characterisation of Transforming Principle

Prior to the work of Oswald Avery, Colin MacLeod, and Maclyn McCarty (1933-44), protein was thought to be the genetic material.
They purified biochemicals (proteins, DNA, RNA, etc.) from heat-killed S cells to see which of them could transform live R cells into S cells.
DNA alone from S bacteria transformed R bacteria.
Protein-digesting enzymes (proteases) and RNA-digesting enzymes (RNases) did not affect transformation, indicating that the transforming substance was not a protein or RNA.
Digestion with DNase inhibited transformation, suggesting that DNA caused the transformation.
They concluded that DNA is the hereditary material; however, not all biologists were convinced.

5.7.1 The Genetic Material is DNA

Unequivocal proof that DNA is the genetic material came from the experiments of Alfred Hershey and Martha Chase (1952).
They worked with bacteriophages, viruses that infect bacteria.
The bacteriophage attaches to the bacteria, and its genetic material enters the bacterial cell.
The bacterial cell treats the viral genetic material as its own, manufacturing more virus particles.
Hershey and Chase determined whether protein or DNA from the viruses entered the bacteria.
They grew some viruses on a medium containing radioactive phosphorus ( $^{32}P$ ), which labels DNA but not protein, and others on a medium containing radioactive sulfur ( $^{35}S$ ), which labels protein but not DNA.
Radioactive phages were allowed to attach to E. coli bacteria.
Viral coats were removed from the bacteria by agitation in a blender, and the virus particles were then separated from the bacteria by centrifugation.
Bacteria infected with viruses containing radioactive DNA were radioactive, indicating that DNA passed from the virus to the bacteria.
Bacteria infected with viruses containing radioactive proteins were not radioactive, indicating that proteins did not enter the bacteria.
DNA is the genetic material passed from virus to bacteria.

5.7.2 Properties of Genetic Material (DNA versus RNA)

The Hershey-Chase experiment resolved the debate between proteins and DNA as the genetic material.
DNA acts as genetic material; however, RNA is the genetic material in some viruses (e.g., Tobacco Mosaic virus, Q$\beta$ bacteriophage).
The differences between the chemical structures of DNA and RNA explain why DNA is the predominant genetic material, while RNA performs dynamic functions.
A molecule that acts as a genetic material must:
- Be able to generate its replica (Replication).
- Be chemically and structurally stable.
- Provide scope for slow changes (mutation) required for evolution.
- Be able to express itself in the form of ‘Mendelian Characters’.
Both DNA and RNA can direct their duplication due to base pairing and complementarity.
Proteins fail to fulfill the replication criteria.
Genetic material should be stable and not change with the life cycle, age, or physiology of the organism.
Griffith’s ‘transforming principle’ showed that heat, which killed bacteria, did not destroy the properties of the genetic material.
Complementary DNA strands can separate by heating and come together under appropriate conditions.
The 2'-OH group in RNA makes it labile and easily degradable.
RNA is catalytic and reactive; DNA is chemically less reactive and structurally more stable than RNA.
Thymine in DNA confers additional stability compared to uracil in RNA.
Both DNA and RNA can mutate; RNA mutates faster due to its instability.
Consequently, viruses that have RNA genomes and short life spans mutate and evolve faster.
RNA can directly code for protein synthesis and easily express characters.
DNA depends on RNA for protein synthesis.
The protein synthesizing machinery has evolved around RNA.
RNA and DNA can both function as genetic material, but DNA is preferred for storing genetic information due to its stability.
RNA is better for transmitting genetic information.

5.8 RNA World

RNA was the first genetic material.
Essential life processes (metabolism, translation, splicing, etc.) evolved around RNA.
RNA acted as both genetic material and a catalyst (ribozymes).
RNA's catalytic nature made it reactive and unstable.
DNA evolved from RNA with chemical modifications that make it more stable.
Double-stranded DNA with a complementary strand further resists changes through repair processes.

5.9 Replication

While proposing the double-helical structure of DNA, Watson and Crick also suggested a scheme for its replication.
The two strands separate and act as templates for synthesizing new complementary strands.
After replication, each DNA molecule has one parental and one newly synthesized strand.
This scheme is termed semiconservative DNA replication.

5.9.1 The Experimental Proof

It is now established that DNA replicates semiconservatively, as shown first in Escherichia coli and later in higher organisms.
Matthew Meselson and Franklin Stahl’s experiment (1958):
- E. coli was grown in a medium containing $^{15}NH_4Cl$ (heavy isotope of nitrogen) as the only nitrogen source for many generations.
- $^{15}N$ was incorporated into newly synthesized DNA.
- Heavy DNA was distinguished from normal DNA by centrifugation in a cesium chloride (CsCl) density gradient.
- Cells were transferred to a medium with normal $^{14}NH_4Cl$ , and samples were taken at various time intervals as the cells multiplied.
- DNA was extracted and separated on CsCl gradients to measure densities.
- DNA extracted one generation after the transfer (after 20 minutes, since E. coli divides about every 20 minutes) had a hybrid, or intermediate, density.
- DNA extracted after another generation (40 minutes) was composed of equal amounts of hybrid DNA and ‘light’ DNA.
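The band pattern expected from semiconservative replication can be worked out with a small bookkeeping sketch. It assumes an idealised population that starts fully $^{15}N$-labelled and divides synchronously; the function name is illustrative.

```python
from fractions import Fraction

def density_classes(generations: int) -> dict:
    """Fractions of heavy, hybrid and light DNA duplexes after the given
    number of generations in 14N medium, under semiconservative replication."""
    heavy, hybrid, light = Fraction(1), Fraction(0), Fraction(0)
    for _ in range(generations):
        # A heavy duplex yields two hybrids; a hybrid yields one hybrid and
        # one light duplex; a light duplex yields two light duplexes.
        # The total doubles each generation, so fractions are halved.
        heavy, hybrid, light = (
            Fraction(0),
            heavy + hybrid / 2,
            hybrid / 2 + light,
        )
    return {"heavy": heavy, "hybrid": hybrid, "light": light}

for n in range(4):
    print(n, density_classes(n))
```

After one generation all DNA is hybrid, and after two generations it is half hybrid and half light, matching the observations listed above. The hybrid fraction keeps halving in later generations but never reaches zero.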
Taylor and colleagues (1958) performed similar experiments using radioactive thymidine on Vicia faba (faba beans), proving that DNA in chromosomes also replicates semiconservatively.

5.9.2 The Machinery and the Enzymes

Replication requires enzymes; the main enzyme is DNA-dependent DNA polymerase.
DNA polymerase uses a DNA template to catalyze the polymerization of deoxynucleotides.
These enzymes are highly efficient, catalyzing polymerization of many nucleotides quickly.
E. coli (with $4.6 \times 10^6$ bp) completes replication within 18 minutes, averaging approximately 2000 bp per second.
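The rate quoted above can be checked with a short calculation. Note that the `forks` parameter is an assumption not stated in the text: it models two replication forks working at the same time, which is one way to reconcile the 18-minute figure with a rate of about 2000 bp per second.

```python
GENOME_BP = 4.6e6              # E. coli genome size in base pairs
REPLICATION_TIME_S = 18 * 60   # 18 minutes, in seconds

def average_rate_bp_per_s(genome_bp: float, time_s: float, forks: int = 1) -> float:
    """Average polymerisation rate per fork, in bp per second."""
    return genome_bp / (time_s * forks)

# Rate if the whole chromosome were copied by a single fork.
print(round(average_rate_bp_per_s(GENOME_BP, REPLICATION_TIME_S)))
# Rate per fork if two forks are assumed (assumption, see above).
print(round(average_rate_bp_per_s(GENOME_BP, REPLICATION_TIME_S, forks=2)))
```

With a single fork the average works out to a little over 4000 bp per second; with two forks it is a little over 2000 bp per second per fork.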
Polymerases must catalyze the reaction with high accuracy to avoid mutations.
Replication is energetically expensive; deoxyribonucleoside triphosphates serve both as substrates and, through their two high-energy terminal phosphates (as in ATP), as the energy source for the polymerization reaction.
Additional enzymes are required for accurate replication.
The two strands of a long DNA molecule cannot be separated along their entire length at once because of the very high energy requirement, so replication occurs within a small opening of the helix called the replication fork.
DNA-dependent DNA polymerases catalyze polymerization only in the 5'→3' direction.
On the template with polarity 3'→5', replication is continuous.
On the template with polarity 5'→3', replication is discontinuous; the fragments are joined by DNA ligase.
DNA polymerases cannot initiate replication randomly; there is a definite origin of replication in E. coli DNA.
Vectors provide the origin of replication for propagating DNA during recombinant DNA procedures.
Eukaryotic DNA replication occurs during the S-phase of the cell cycle and is highly coordinated with cell division.
Failure in cell division after DNA replication results in polyploidy.

5.10 Transcription

Transcription is copying genetic information from one DNA strand into RNA.
The principle of complementarity governs transcription as well, except that adenine pairs with uracil instead of thymine.
Unlike replication, where the total DNA is duplicated, transcription copies only a segment of DNA, and only one of its strands, into RNA.
Boundaries must demarcate the region and strand of DNA to be transcribed.
If both strands acted as templates, they would code for RNA molecules with different sequences, leading to different proteins and complicating genetic information transfer.
Two RNA molecules produced simultaneously would be complementary, forming double-stranded RNA, preventing translation into protein.

5.10.1 Transcription Unit

A transcription unit in DNA is defined by three regions:
- A Promoter
- The Structural gene
- A Terminator
By convention, the two strands of DNA in the structural gene are named according to their role in transcription.
The strand with polarity 3'→5' acts as a template.
DNA-dependent RNA polymerase catalyzes polymerization in the 5'→3' direction.
The other strand with polarity 5'→3' has the same sequence as RNA (except thymine instead of uracil) and is called the coding strand.
The promoter and terminator flank the structural gene in a transcription unit.
The promoter is located towards the 5' end (upstream) of the structural gene.
It is a DNA sequence providing a binding site for RNA polymerase.
The presence of a promoter defines the template and coding strands.
The terminator is located towards the 3' end (downstream) of the coding strand and defines the end of transcription.
Additional regulatory sequences may be present further upstream or downstream to the promoter.
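The relationship between the template strand, the coding strand, and the RNA can be sketched as follows. The sequence and function names are hypothetical and only serve the illustration.

```python
# Pairing used during transcription: A in DNA pairs with U in RNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Given the template strand written 3'->5', return the RNA written 5'->3'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)

def coding_strand(template_3_to_5: str) -> str:
    """Return the coding strand (5'->3') opposite the given template (3'->5')."""
    return "".join(DNA_TO_DNA[base] for base in template_3_to_5)

template = "TACGGCATT"  # hypothetical template strand, written 3'->5'
print("RNA   :", transcribe(template))
print("coding:", coding_strand(template))
```

The RNA and the coding strand are identical except that the RNA carries uracil wherever the coding strand carries thymine, which is why sequences are conventionally written using the coding strand.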

5.10.2 Transcription Unit and the Gene

A gene is the functional unit of inheritance located on the DNA.
A cistron is a segment of DNA coding for a polypeptide.
The structural gene in a transcription unit can be monocistronic (eukaryotes) or polycistronic (prokaryotes).
Eukaryotic monocistronic structural genes have interrupted coding sequences; genes are split.
Coding sequences (expressed sequences) are defined as exons, which appear in mature or processed RNA.
Exons are interrupted by introns (intervening sequences), which do not appear in mature or processed RNA.
The split-gene arrangement complicates the definition of a gene based on a DNA segment.
Inheritance of a character is affected by promoter and regulatory sequences of a structural gene.
Regulatory sequences are sometimes loosely defined as regulatory genes, even though they do not code for any RNA or protein.

5.10.3 Types of RNA and the process of Transcription

In bacteria, there are three major types of RNAs: mRNA, tRNA, and rRNA.
All three RNAs are needed to synthesize a protein in a cell.
mRNA provides the template.
tRNA brings amino acids and reads the genetic code.
rRNAs play structural and catalytic roles during translation.
A single DNA-dependent RNA polymerase catalyzes transcription of all types of RNA in bacteria.
RNA polymerase binds to the promoter and initiates transcription (Initiation).
It uses nucleoside triphosphates as substrates and polymerizes in a template-dependent fashion following the rule of complementarity.
It facilitates opening of the helix and continues elongation.
Only a short stretch of RNA remains bound to the enzyme.
Once the polymerase reaches the terminator region, the nascent RNA and the RNA polymerase fall off, terminating transcription.
RNA polymerase associates transiently with initiation-factor ($\sigma$) and termination-factor ($\rho$) to initiate and terminate transcription.
In bacteria, mRNA does not require processing, and transcription and translation occur in the same compartment, allowing coupling of transcription and translation.
In eukaryotes, there are two additional complexities:
- There are at least three RNA polymerases in the nucleus.
  - RNA polymerase I transcribes rRNAs (28S, 18S, and 5.8S).
  - RNA polymerase III transcribes tRNA, 5S rRNA, and snRNAs.
  - RNA polymerase II transcribes precursors of mRNA, heterogeneous nuclear RNA (hnRNA).
- Primary transcripts contain both exons and introns and are non-functional; they undergo splicing where introns are removed and exons are joined.
- hnRNA undergoes capping (addition of methyl guanosine triphosphate to the 5' end) and tailing (addition of 200-300 adenylate residues at the 3' end).
- Fully processed hnRNA, now called mRNA, is transported out of the nucleus for translation.
Split-gene arrangements represent an ancient feature, and the presence of introns is reminiscent of the RNA-world.
Understanding RNA and RNA-dependent processes has assumed more importance.

5.11 Genetic Code

Translation requires the transfer of genetic information from a polymer of nucleotides to synthesize a polymer of amino acids.
There is no complementarity between nucleotides and amino acids.
Changes in nucleic acids were responsible for changes in amino acids in proteins, leading to the proposition of a genetic code.
Deciphering the genetic code was very challenging and required involvement of scientists from multiple disciplines.
George Gamow argued that since there are only 4 bases and they have to code for 20 amino acids, the code should constitute a combination of bases.
He suggested a triplet code: a doublet code would give only $4^2 = 16$ codons, too few for 20 amino acids, whereas three bases per codon give $4^3$ ( $4 \times 4 \times 4$ ) = 64 codons, which is more than required.
Har Gobind Khorana synthesized RNA molecules with defined combinations of bases.
Marshall Nirenberg’s cell-free system for protein synthesis helped decipher the code.
Severo Ochoa enzyme (polynucleotide phosphorylase) was helpful in polymerizing RNA with defined sequences in a template-independent manner.
A checker-board for the genetic code was prepared.
The codon is triplet: 61 codons code for amino acids, and 3 codons function as stop codons.
Some amino acids are coded by more than one codon, hence the code is degenerate.
The codon is read in mRNA in a contiguous fashion, without punctuation.
The code is nearly universal; for example, UUU codes for Phenylalanine (phe) from bacteria to humans.
Some exceptions are found in mitochondrial codons and some protozoans.
AUG has dual functions: it codes for Methionine (met) and acts as an initiator codon.
UAA, UAG, and UGA are stop (terminator) codons.
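The counts given above (64 codons, 61 coding, 3 stop) and the degeneracy of the code can be verified with a compact sketch. The one-letter amino-acid string below is the standard genetic code, which is not spelled out in these notes and is brought in here as outside knowledge; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in one-letter amino-acid symbols ("*" marks a stop),
# listed in the order UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
codons_per_amino_acid = Counter(CODON_TABLE[c] for c in sense_codons)

print(len(CODON_TABLE), len(sense_codons), stop_codons)
print(CODON_TABLE["UUU"], CODON_TABLE["AUG"])  # F = phenylalanine, M = methionine
print({aa: n for aa, n in codons_per_amino_acid.items() if n > 1})
```

The last line lists the amino acids specified by more than one codon, which is what makes the code degenerate.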

5.11.1 Mutations and Genetic Code

Relationships between genes and DNA are best understood by mutation studies.
Large deletions and rearrangements in DNA can result in the loss or gain of a gene and its function.
A point mutation is a change of a single base pair.
Example: a single base-pair change in the gene for the beta-globin chain changes the amino acid glutamate to valine, resulting in sickle cell anemia.
Insertion or deletion of a base in a structural gene can be understood with an example:
- Original statement: RAM HAS RED CAP
- Insertion of B: RAM HAS BRE DCA P (frameshift)
- Insertion of BI: RAM HAS BIR EDC AP (frameshift)
- Insertion of BIG: RAM HAS BIG RED CAP (reading frame unaltered)
- Deletion of R: RAM HAS EDC AP (frameshift)
- Deletion of RE: RAM HAS DCA P (frameshift)
- Deletion of RED: RAM HAS CAP (reading frame unaltered)
Insertion or deletion of one or two bases changes the reading frame from the point of insertion or deletion (frameshift mutations).
Insertion or deletion of three bases, or a multiple of three, adds or removes one or more codons; the reading frame beyond that point remains unaltered.
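The RAM HAS RED CAP example can be reproduced by treating the sentence as a string that is always read three letters at a time. This is a sketch with illustrative helper names.

```python
def read_in_triplets(letters: str) -> str:
    """Group letters into words of three, starting from the first letter."""
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))

def insert(letters: str, position: int, extra: str) -> str:
    """Insert extra letters before the given index."""
    return letters[:position] + extra + letters[position:]

def delete(letters: str, position: int, count: int) -> str:
    """Delete count letters starting at the given index."""
    return letters[:position] + letters[position + count:]

ORIGINAL = "RAMHASREDCAP"
SITE = 6  # index of the R in RED, where the changes are made

for extra in ("B", "BI", "BIG"):
    print(f"insert {extra:<3}:", read_in_triplets(insert(ORIGINAL, SITE, extra)))
for count in (1, 2, 3):
    print(f"delete {count}  :", read_in_triplets(delete(ORIGINAL, SITE, count)))
```

Only the changes involving three letters leave the words after the mutation site readable, which mirrors why the genetic code is read as triplets.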

5.11.2 tRNA – the Adapter Molecule

Francis Crick postulated an adapter molecule to read the code and link it to the amino acids.
tRNA has an anticodon loop with bases complementary to the codon and an amino acid acceptor end to which it binds a specific amino acid.
tRNAs are specific for each amino acid.
There is another specific tRNA for initiation: the initiator tRNA.
There are no tRNAs for stop codons.
The secondary structure of tRNA looks like a clover-leaf, but the actual structure is compact and looks like an inverted L.

5.12 Translation

Translation is the polymerization of amino acids to form a polypeptide.
The order and sequence of amino acids are defined by the sequence of bases in the mRNA.
Amino acids are joined by a peptide bond, which requires energy.
First, amino acids are activated in the presence of ATP and linked to their cognate tRNA (charging or aminoacylation of tRNA).
If two such charged tRNAs are brought close enough, the formation of a peptide bond between them would be favored energetically.
The ribosome, consisting of structural RNAs and about 80 different proteins, is responsible for protein synthesis.
In its inactive state, it exists as two subunits: a large subunit and a small subunit.
When the small subunit encounters an mRNA, translation begins.
The large subunit has two sites for subsequent amino acids to bind and form a peptide bond.
The ribosome also acts as a catalyst for the formation of the peptide bond (in bacteria, 23S rRNA is the enzyme, a ribozyme).
A translational unit in mRNA is flanked by the start codon (AUG) and the stop codon and codes for a polypeptide.
mRNA also has untranslated regions (UTR) at both the 5' end (before start codon) and the 3' end (after stop codon), required for efficient translation.
The ribosome binds to the mRNA at the start codon (AUG), recognized by the initiator tRNA.
The ribosome proceeds to the elongation phase of protein synthesis.
Complexes of amino acids linked to tRNA sequentially bind to the appropriate codon in mRNA through complementary base pairs with the tRNA anticodon.
The ribosome moves from codon to codon along the mRNA.
Amino acids are added one by one, producing a polypeptide whose sequence is dictated by DNA and represented by mRNA.
At the end, a release factor binds to the stop codon, terminating translation and releasing the complete polypeptide from the ribosome.
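The reading of a translational unit, from the start codon to the stop codon, can be sketched as below. The mRNA sequence is hypothetical, the codon table is deliberately partial (only the codons used here, taken from the standard genetic code), and treating the first AUG as the start codon is a simplification of how ribosomes actually select the start site.

```python
# Partial codon table: only the codons needed for this example.
CODONS = {"AUG": "Met", "UUU": "Phe", "GAG": "Glu", "GUG": "Val", "GGG": "Gly"}
STOP_CODONS = {"UAA", "UAG", "UGA"}

def translate(mrna: str) -> list:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")  # simplification: first AUG is taken as the start
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:  # no tRNA exists for stop codons
            break
        peptide.append(CODONS[codon])  # KeyError for codons outside the partial table
    return peptide

# 5' UTR + translational unit + 3' UTR (hypothetical sequence)
mrna = "GG" + "AUG" + "UUU" + "GAG" + "GGG" + "UAA" + "AC"
print("-".join(translate(mrna)))
```

The untranslated bases on either side of the translational unit are ignored. Changing the single base that turns GAG into GUG would replace Glu with Val in the output, the same kind of substitution described for sickle cell anemia.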

5.13 Regulation of Gene Expression

Regulation of gene expression occurs at various levels:
- Transcriptional level (formation of primary transcript).
- Processing level (regulation of splicing).
- Transport of mRNA from nucleus to the cytoplasm.
- Translational level.
Genes in a cell are expressed to perform a particular function or set of functions.
Example: beta-galactosidase in E. coli hydrolyzes lactose into galactose and glucose.
Metabolic, physiological, or environmental conditions regulate the expression of genes.
Development and differentiation of an embryo into an adult organism result from the coordinated regulation of gene expression.
In prokaryotes, the control of transcriptional initiation is the predominant site for control of gene expression.
RNA polymerase activity at a promoter is regulated by accessory proteins (activators and repressors).
Accessibility of promoter regions is regulated by the interaction of proteins with operator sequences adjacent to the promoter elements.
Each operon has its specific operator and specific repressor.
Example: lac operator in the lac operon interacts specifically with the lac repressor.

5.13.1 The Lac operon

The elucidation of the lac operon was a result of the work of geneticist François Jacob and biochemist Jacques Monod.
In the lac operon (lac refers to lactose), a polycistronic structural gene is regulated by a common promoter and regulatory genes.
Arrangements like this are common in bacteria and are referred to as operons.
Examples: lac operon, trp operon, ara operon, his operon, val operon, etc.
The lac operon consists of one regulatory gene (i gene) and three structural genes (z, y, and a).
- The i gene codes for the repressor of the lac operon.
- The z gene codes for beta-galactosidase ($\beta$-gal), which hydrolyzes lactose into galactose and glucose.
- The y gene codes for permease, which increases cell permeability to $\beta$-galactosides.
- The a gene encodes a transacetylase.
All three gene products are required for lactose metabolism.
In most operons, the genes present are needed together to function in the same or related metabolic pathway.
Lactose is the substrate for beta-galactosidase and regulates the switching on and off of the operon (inducer).
In the absence of a preferred carbon source like glucose, if lactose is provided, it is transported into the cells through permease.
A very low level of expression of lac operon must be present for lactose to enter the cells.
The repressor of the operon is synthesized constitutively from the i gene.
The repressor protein binds to the operator region, preventing RNA polymerase from transcribing the operon.
In the presence of an inducer (lactose or allolactose), the repressor is inactivated by interaction with the inducer.
This allows RNA polymerase access to the promoter and transcription proceeds.
Regulation of the lac operon can be visualized as regulation of enzyme synthesis by its substrate.
Glucose or galactose cannot act as inducers for the lac operon.
Regulation of the lac operon by a repressor is referred to as negative regulation.
The lac operon is also under positive regulation.
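The negative regulation described above amounts to simple switch logic, sketched below. This toy model covers only the repressor and inducer; it deliberately ignores positive regulation and the low basal expression mentioned earlier, and the names are illustrative.

```python
# Sugars that inactivate the lac repressor, as stated in the notes.
INDUCERS = {"lactose", "allolactose"}

def repressor_active(sugars_in_cell) -> bool:
    """The repressor stays active unless an inducer is present."""
    return not (INDUCERS & set(sugars_in_cell))

def operon_transcribed(sugars_in_cell) -> bool:
    """RNA polymerase can transcribe only when the repressor is inactivated."""
    return not repressor_active(sugars_in_cell)

for sugars in ([], ["glucose"], ["galactose"], ["lactose"]):
    print(sugars, "->", "ON" if operon_transcribed(sugars) else "OFF")
```

Glucose and galactose leave the operon switched off in this model because neither can act as an inducer.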

5.14 Human Genome Project

The sequence of bases in DNA determines the genetic information of an organism.
Differences between individuals are due to differences in their DNA sequences.
Finding the complete DNA sequence of the human genome was an ambitious project launched in 1990.
The Human Genome Project (HGP) was a mega project.
The human genome has approximately $3 \times 10^9$ bp, and the sequencing cost was estimated at 3 US dollars per bp, making the total estimated cost approximately 9 billion US dollars.
Storing the sequences in books would require 3300 books.
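Both figures above follow from simple arithmetic, sketched below. The page and book sizes are assumptions (1000 letters per page, 1000 pages per book) that are not stated in these notes, and the book count uses the haploid DNA content of $3.3 \times 10^9$ bp quoted earlier.

```python
GENOME_BP = 3_000_000_000   # approximate genome size used for the cost estimate
COST_PER_BP_USD = 3         # estimated sequencing cost per base pair

HAPLOID_BP = 3_300_000_000  # haploid DNA content quoted earlier in these notes
LETTERS_PER_PAGE = 1000     # assumption, not stated in the text
PAGES_PER_BOOK = 1000       # assumption, not stated in the text

total_cost_usd = GENOME_BP * COST_PER_BP_USD
books_needed = HAPLOID_BP // (LETTERS_PER_PAGE * PAGES_PER_BOOK)

print(f"Estimated cost: {total_cost_usd:,} US dollars")
print(f"Books needed:   {books_needed}")
```

Under these assumptions the cost is 9 billion US dollars and the sequence fills 3300 books.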
The enormous amount of data required high-speed computational devices for data storage, retrieval, and analysis.
HGP was closely associated with the rapid development of Bioinformatics.

Goals of HGP

Identify all of the approximately 20,000-25,000 genes in human DNA.
Determine the sequences of the 3 billion chemical base pairs.
Store this information in databases.
Improve tools for data analysis.
Transfer related technologies to other sectors, such as industries.
Address the ethical, legal, and social issues (ELSI) that may arise from the project.
The HGP was a 13-year project coordinated by the U.S. Department of Energy and the National Institutes of Health, with contributions from the Wellcome Trust (U.K.), Japan, France, Germany, China, and others. The project was completed in 2003.
Knowledge about the effects of DNA variations can lead to new ways to diagnose, treat, and prevent disorders.
Learning about non-human organisms' DNA sequences can lead to understanding their natural capabilities for solving challenges in health care, agriculture, energy production, and environmental remediation.
Many non-human model organisms (bacteria, yeast, Caenorhabditis elegans, Drosophila, rice, Arabidopsis, etc.) have also been sequenced.

Methodologies

Two major approaches:
- Identifying all genes expressed as RNA (Expressed Sequence Tags (ESTs)).
- Sequencing the whole genome and assigning functions to different regions (Sequence Annotation).
Total DNA is isolated, fragmented, and cloned in suitable hosts using vectors (BAC and YAC).
Cloning amplifies each DNA fragment.
Fragments were sequenced using automated DNA sequencers based on Frederick Sanger’s method.
Sequences were arranged based on overlapping regions, requiring specialized computer programs.
Sequences were annotated and assigned to each chromosome.
The sequence of chromosome 1 was completed in May 2006.
Genetic and physical maps were generated using information on polymorphism of restriction endonuclease recognition sites and repetitive DNA sequences (microsatellites).

5.14.1 Salient Features of Human Genome

The human genome contains 3164.7 million bp.
The average gene consists of 3000 bases, but sizes vary; the largest known human gene (dystrophin) is 2.4 million bases.
The estimated total number of genes is 30,000, much lower than previous estimates.
Almost all (99.9%) nucleotide bases are the same in all humans.
Functions are unknown for over 50% of the discovered genes.
Less than 2% of the genome codes for proteins.
Repeated sequences make up a very large portion of the human genome.
Repetitive sequences are stretches of DNA sequences repeated many times and are thought to have no direct coding functions but provide insights into chromosome structure, dynamics, and evolution.
Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
Scientists have identified about 1.4 million locations where single-base DNA differences (SNPs) occur in humans.
This information promises to revolutionize finding chromosomal locations for disease-associated sequences and tracing human history.

5.14.2 Applications and Future Challenges

Deriving meaningful knowledge from DNA sequences will define research through the coming decades, leading to an understanding of biological systems.
It will become possible to study all the genes in a genome, all the transcripts in a given tissue, and how genes and proteins work together in interconnected networks to orchestrate the chemistry of life.

5.15 DNA Fingerprinting

99.9% of the base sequence is the same among humans.
Differences in the DNA sequence make individuals unique.
DNA fingerprinting is a quick way to compare the DNA sequences of individuals.
It involves identifying differences in specific regions in DNA called repetitive DNA (small stretches of DNA repeated many times).
Repetitive DNA is separated from bulk genomic DNA as satellite DNA during density gradient centrifugation.
Depending on base composition, segment length, and the number of repetitive units, satellite DNA is classified into microsatellites, minisatellites, etc.
These sequences do not code for any proteins but form a large portion of the human genome.
They show high degrees of polymorphism and form the basis of DNA fingerprinting.
DNA from every tissue of an individual shows the same degree of polymorphism, making it a useful identification tool in forensic applications.
Polymorphisms are inheritable from parents to children, forming the basis of paternity testing.
Polymorphism (variation at the genetic level) arises due to mutations: new mutations may arise in somatic cells or germ cells.
Allelic sequence variation is described as DNA polymorphism if more than one variant (allele) at a locus occurs in the human population with a frequency greater than 0.01.
Mutations in non-coding DNA sequences may not have any immediate effect on an individual’s reproductive ability.
These mutations keep accumulating generation after generation, forming one of the bases of variability/polymorphism.
There are various types of polymorphisms, ranging from single nucleotide changes to large-scale changes.
The technique of DNA Fingerprinting was initially developed by Alec Jeffreys, using a satellite DNA called Variable Number of Tandem Repeats (VNTR).
The technique, as originally used, involved Southern blot hybridization with a radiolabeled VNTR as the probe.
The technique included:
- Isolation of DNA.
- Digestion of DNA by restriction endonucleases.
- Separation of DNA fragments by electrophoresis.
- Transferring (blotting) of separated DNA fragments to synthetic