Comprehensive Notes on Proteomics, Chromatin Analysis, and Gene Cloning
Challenges in Protein Sequencing and Functional Characterization
- Efficiency and Scalability Limitations: Current protein sequencing techniques, while they exist, lack the efficiency and scalability seen in DNA sequencing technologies.
- Hierarchy of Protein Identification: Identifying a protein's sequence is only a preliminary step; knowing the sequence does not inherently reveal the protein's function.
* The Structure-Function Relationship: To determine function, the structure must be known, as the protein must fold into a specific three-dimensional shape.
* Predicting Folds:
* Comparison: Folds can be guessed by comparing a sequence to known protein structures.
* AI and Computational Prediction: Tools like DeepMind's AlphaFold have revolutionized fold prediction.
* Think: An AI-based protein folding tool that was released shortly after AlphaFold and received less public attention.
* RoseTTAFold: Developed in David Baker's lab at the University of Washington. Baker shared the 2024 Nobel Prize in Chemistry with Demis Hassabis and John Jumper of DeepMind, who were recognized for AlphaFold; Baker's share was awarded for computational protein design. RoseTTAFold is considered comparable to AlphaFold in its predictive capabilities.
- Limitations of Prediction Algorithms: Despite their sophistication, computational algorithms do not provide definitive answers regarding protein function. Definitive determination usually requires biochemical experimentation.
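The idea of guessing a fold by comparing a sequence to proteins of known structure can be illustrated with a minimal sketch. It assumes the sequences are already aligned to the same length, and the sequences and fold names are made-up toy data, not real proteins. Real tools use proper alignment and scoring matrices; percent identity is only the simplest stand-in.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions with identical residues (gaps '-' never match)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def best_template(query: str, known: dict[str, str]) -> tuple[str, float]:
    """Return the name and percent identity of the closest known sequence."""
    name = max(known, key=lambda k: percent_identity(query, known[k]))
    return name, percent_identity(query, known[name])


# Hypothetical toy data: names and sequences are invented for illustration.
known_folds = {
    "fold_A": "MKTAYIAKQR",
    "fold_B": "MKTAYLAKQR",
    "fold_C": "GSHMLEDPVQ",
}
query = "MKTAYIAKQK"

name, identity = best_template(query, known_folds)
print(f"closest known structure: {name} ({identity:.0f}% identical)")
```

A high identity only suggests a similar fold; as noted above, it does not establish function.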
Biochemical and Genetic Approaches to Protein Function
- Classic Protein Biochemistry: This involves the purification of the protein to isolate it from all other cellular components (the "ultra pure thing" in a test tube).
* Assays: Specific tests are used to determine activity:
* Enzymes: Subjected to enzyme assays.
* Transcription Factors: Tested for DNA binding or protein-protein interactions.
* Limitations of In Vitro Studies: Isolating a protein takes it out of its biological context. It lacks the normal interactors and cellular environment, leading to the "open question": is the behavior observed in a test tube biologically relevant in an actual cell?
- Genetic Analysis: This approach involves introducing a mutation into the organism to observe the resulting biological influence or phenotype.
- Long-Term Scope: Generating a full-spectrum picture of a protein's function is a long-term project often spanning an entire career dedicated to a single protein or family of proteins.
Proteomics and Large-Scale Characterization Strategies
- The Bacterial Proteome: A typical bacterium may encode between 2,000 and 4,000 different proteins.
- Parallel Screens: Similar to parallel sequencing, researchers can set up parallel screens for specific protein classes.
* DNA Interacting Proteins: Crucial for gene regulation, development, and environmental response.
* Structural Genomics: A large-scale effort that attempted to solve the structure of many proteins. While it provided many structures, it did not answer as many functional questions as anticipated because a structure "floating in space" does not necessarily reveal its mechanism.
- Biotech Strategy (Extremophile Screens):
* Scenario: Studying microbes from extreme environments (high temperature, acidic conditions, or high salt) because their proteins are robust and do not unfold easily.
* Workflow: Clone the gene for each protein → Express the protein → Purify the protein → Attach the protein (or short peptides) to a reaction surface in clusters.
* Assay: Perform chemical or enzymatic assays to identify activity and "fish out" candidates for further characterization.
* Shortcomings: These isolated proteins are attached to a surface and exposed to artificial conditions, missing multi-protein complex dynamics and cellular context.
Analysis of Chromatin and DNA-Binding Proteins
- Chromatin Immunoprecipitation (ChIP): An effective proteomic approach identifying where proteins interact with the genome.
* Definition: Chromatin consists of DNA and the diverse proteins (histones, transcription factors, etc.) that manage DNA during the cell cycle, division, and transcription.
* The Crosslinking Process: Cells are grown and treated with formaldehyde, a crosslinker that fuses proteins and DNA in close proximity.
* Hazard Warning: Formaldehyde is highly toxic and damages tissues by creating tangles of crosslinked matter that cells cannot process.
* Immunoprecipitation: Antibodies (Y-shaped molecules) specific to a target protein are used to pull down that protein along with the attached DNA fragments.
* Mapping: The attached DNA is sequenced and compared to the genome to create a map of protein-DNA interactions.
- ChIP Variations:
* ChIP-Hybridization (ChIP-chip): DNA is labeled with fluorescence and hybridized to immobilized sequences representing the genome.
* ChIP-Seq: The purified DNA is directly sequenced. This is currently the standard assay for studying DNA-binding proteins.
- Hi-C Assay: Used to study the higher-level 3D organization of chromatin (e.g., toroidal structures/visible chromosomes).
* Mechanism: Chromatin is crosslinked, the DNA is cut with a restriction enzyme, and DNA ends that are held close to each other are ligated together to create a hybrid sequence. Sequencing this hybrid reveals two regions of the genome that were physically close to each other even if they are far apart in the primary sequence.
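How hybrid Hi-C reads become a picture of 3D contacts can be sketched as follows. This is a minimal illustration, assuming each hybrid read has already been mapped to two positions on the same chromosome; the coordinates, the bin size, and the function name `contact_counts` are hypothetical and not part of any specific Hi-C pipeline.

```python
from collections import Counter


def contact_counts(pairs, bin_size):
    """Count ligation junctions per pair of genomic bins (order-independent)."""
    counts = Counter()
    for pos_a, pos_b in pairs:
        # Sort the two bin indices so (0, 9) and (9, 0) are the same contact.
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1
    return counts


# Hypothetical mapped positions of the two halves of each hybrid read.
pairs = [(1_200, 98_500), (1_900, 97_100), (45_000, 46_200), (99_000, 1_050)]
counts = contact_counts(pairs, bin_size=10_000)

# Contacts between bins far apart in the primary sequence are the interesting ones.
long_range = {bins: n for bins, n in counts.items() if bins[1] - bins[0] >= 5}
print(long_range)
```

Here bins 0 and 9 are about 90 kb apart in sequence yet are joined by three hybrid reads, which is the kind of signal that indicates physical proximity.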
Functional Annotation and the ENCODE Project
- ENCODE (Encyclopedia of DNA Elements): A project aimed at identifying functional elements in the genome.
* Data Types: Transcribed regions (RNA-seq), transcription factor binding sites (ChIP-seq), 3D structure (Hi-C), DNA modifications (e.g., methylation), and histone modifications (acetylation/deacetylation).
* Scale: As of 2019, the project contained more than 14,000 individual datasets.
- General Findings from ENCODE:
* 80% of the human genome is involved in at least one chromatin-associated or biochemical event in at least one cell type.
* <5% of the genome actually encodes protein-coding genes.
* Regulation: RNA synthesis levels are tightly correlated with the presence of transcription factors and the state of the chromatin.
- Disease Implications:
* Single-Gene Disorders: Example: Glutaric acidemia, where a single base change in a protein involved in amino acid metabolism causes permanent disability.
* Regulatory Mutations: ENCODE indicates that most disease-causing mutations are not in protein-coding regions but in intergenic (non-coding) regions. This suggests that altered regulation of gene expression, rather than an altered protein, is often the primary cause of disease.
* Multigenic Interactions: Most genetic disorders involve interactions between multiple genes and nuanced mutations that are difficult to assign cause-and-effect to without massive datasets and computational analysis.
Introduction to Gene Cloning and Vectors
- Foundations: Cloning relies on knowledge of promoters, transcription factors, and biochemical signals. However, predicting how these elements behave in a new organism is difficult and requires empirical testing.
- Vectors: Movable DNA elements used to transport DNA into a host cell.
* Plasmids: Small, usually circular, supercoiled DNA elements that replicate independently of the host chromosome.
* Selection: Cloning is inefficient (only 1 in 100,000 to 1 in 1,000,000 cells takes up DNA). Selection uses toxins (antibiotics) that kill any cell lacking the plasmid.
- Common Antibiotics and Resistance Mechanisms:
* Ampicillin: Related to penicillin. Resistance is provided by beta-lactamase, which is secreted out of the cell. This can deplete the antibiotic in the medium, allowing sensitive cells to grow (non-plasmid "satellites").
* Chloramphenicol: Binds to the ribosome and inhibits translation. Resistance is provided by acetyl-transferase.
* Tetracycline: Inhibits protein synthesis. Resistance involves a pump that removes the toxin from the cell.
* Kanamycin / G418 (Aminoglycosides): Inhibit ribosomes. Resistance involves an enzyme that phosphorylates the drug. G418 works in both bacteria and eukaryotes.
- Plasmid Replicons and Copy Number:
* The replicon (origin of replication + associated proteins/RNAs) determines the "copy number" (number of plasmids per cell).
* High Copy Number: Allows for high DNA yield but imposes a metabolic burden on the cell, potentially slowing growth or leading to plasmid instability/deletions.
* Low Copy Number: Useful for large plasmids or proteins that are toxic to the cell.
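To put the uptake frequencies quoted under Selection in perspective, the arithmetic can be sketched as below. The number of cells plated is a hypothetical value chosen for illustration; only the 1 in 100,000 and 1 in 1,000,000 figures come from the notes.

```python
def expected_transformants(cells_plated: int, one_in: int) -> float:
    """Expected number of plasmid-bearing cells if 1 in `one_in` cells takes up DNA."""
    return cells_plated / one_in


cells_plated = 100_000_000  # hypothetical number of cells spread on one plate

for one_in in (100_000, 1_000_000):
    n = expected_transformants(cells_plated, one_in)
    print(f"1 in {one_in:,} uptake: about {n:.0f} colonies among {cells_plated:,} cells")
```

Under these assumptions only about 100 to 1,000 cells out of 100 million carry the plasmid, which is why an antibiotic that kills every other cell is needed to find them.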
Historical and Modern Plasmid Vectors
- pBR322: One of the first widely used plasmids (developed in the 1970s). It is 4,361 base pairs long and carries two resistance genes (Ampicillin and Tetracycline) and a low copy number replicon (≈15 to 20 copies).
- pUC19: A smaller, high copy number plasmid (≈500 to 700 copies).
* Multiple Cloning Site (MCS): A designed sequence containing many restriction enzyme sites, each of which occurs only once in the plasmid.
* Alpha-Complementation (Blue/White Screening): Utilizes the Lac Operon.
* The enzyme beta-galactosidase is split into an alpha fragment (on the plasmid) and an omega fragment (in the host E. coli strain).
* Functional enzyme: Forms only if both fragments are present.
* Screening: If a gene is successfully inserted into the MCS, it disrupts the alpha fragment, so no functional enzyme forms and the colonies appear white (the color of normal E. coli) on an X-gal plate. Non-recombinant plasmids still supply an intact alpha fragment, the enzyme is reconstituted, and the colonies turn blue.
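The combined logic of antibiotic selection and blue/white screening can be written as a small decision function. This is a sketch of the reasoning only, assuming the host strain always supplies the omega fragment and the plate contains both the antibiotic and X-gal; the function and variable names are illustrative.

```python
def colony_phenotype(has_plasmid: bool, insert_in_mcs: bool) -> str:
    """Predict what is seen on an antibiotic + X-gal plate for one cell type."""
    if not has_plasmid:
        # No resistance gene: the antibiotic kills the cell.
        return "no growth"
    omega_present = True                # assumed: encoded by the host E. coli strain
    alpha_intact = not insert_in_mcs    # an insert in the MCS disrupts the alpha fragment
    enzyme_active = alpha_intact and omega_present
    return "blue" if enzyme_active else "white"


for has_plasmid, insert in [(False, False), (True, False), (True, True)]:
    print(has_plasmid, insert, "->", colony_phenotype(has_plasmid, insert))
```

White colonies are therefore the ones worth picking: they survived selection and carry a disrupted alpha fragment, which is consistent with a successful insertion.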
Questions & Discussion
- Question: For ChIP-seq, how do you visualize a change between cell states?
- Response: You compare the sequencing signal from cells in one state (e.g., a specific tissue stage) where a transcription factor is active to the signal from another state where it is absent or inactive. The DNA sequences that appear in or disappear from the pulled-down fraction between the two conditions mark the binding sites, and therefore the candidate genes, regulated or activated by that specific transcription factor.
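The comparison described in this response can be sketched as follows. This is a simplified illustration, assuming reads are already mapped to positions on one chromosome; the read positions, bin size, fold threshold, and floor value are hypothetical, and real analyses typically rely on dedicated peak callers, replicates, and input controls rather than this simple fold-change rule.

```python
from collections import Counter


def binned_fraction(read_starts, bin_size):
    """Fraction of all reads falling in each genomic bin (normalizes for depth)."""
    counts = Counter(pos // bin_size for pos in read_starts)
    total = len(read_starts)
    return {b: n / total for b, n in counts.items()}


def differential_bins(active, inactive, bin_size, min_fold=3.0, floor=0.01):
    """Return (gained, lost) bins in the active state relative to the inactive one."""
    a = binned_fraction(active, bin_size)
    b = binned_fraction(inactive, bin_size)
    gained, lost = [], []
    for bin_id in sorted(set(a) | set(b)):
        # The floor avoids dividing by zero when a bin has no reads in one state.
        fa = max(a.get(bin_id, 0.0), floor)
        fb = max(b.get(bin_id, 0.0), floor)
        if fa / fb >= min_fold:
            gained.append(bin_id)
        elif fb / fa >= min_fold:
            lost.append(bin_id)
    return gained, lost


# Hypothetical read start positions from the pulled-down DNA in each cell state.
active_reads = [120, 150, 180, 190, 2100, 5200, 5300, 5400]
inactive_reads = [130, 2050, 2200, 2300, 2400, 7100, 7200, 7900]

gained, lost = differential_bins(active_reads, inactive_reads, bin_size=1000)
print("signal gained when factor is active:", gained)
print("signal lost when factor is active:", lost)
```

Bins listed as gained are regions where pulled-down DNA appears only when the factor is active, which makes the nearby genes candidates for regulation by that factor.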