Notes on Protein Function, Purification, and AI-Based Prediction
Genomes, transcriptomes, and the proteome
- Context: This module sits between protein–ligand binding (hemoglobin) and the start of enzyme function; it sets up core biochemical topics essential for signal transduction, metabolism, and later translation.
- Key idea: Proteins are central to biology because they execute the functions encoded by the genome; the genome is relatively fixed, but the proteome is highly dynamic.
- Genomes (parts list) vs. proteomes (expressed proteins):
- The genome contains information in DNA; generally fixed across cells, with some changes discussed later.
- Model organisms and gene counts mentioned:
- Caenorhabditis elegans (worm): ~19,000 genes.
- Drosophila melanogaster (fruit fly): ~14,000 genes.
- Homo sapiens (human): ~21,000–22,000 genes, only a few thousand more than C. elegans, despite a much larger genome (many more bases).
- Central dogma recap (DNA → RNA → protein):
- DNA carries information (genome).
- RNA (transcriptome) is the working copy; variable between cell types and conditions.
- Protein synthesis (proteome) through translation; proteins are the functional units.
- Exceptions noted: RNA can function as a ribozyme or carry information; but the general flow is DNA → RNA → protein.
- Proteome as context-dependent: the set of expressed/modified proteins varies with:
- Cell type (e.g., kidney vs. brain).
- Developmental stage.
- Environmental conditions (e.g., oxygen, nutrients, day vs. night).
- Example: fetal hemoglobin vs. adult hemoglobin
- Fetal hemoglobin (alpha chains paired with gamma chains) is downregulated after birth and replaced by adult hemoglobin, in which beta chains take the place of the gamma chains.
- Re-expression of fetal hemoglobin can alleviate some sickle cell symptoms; illustrates dynamic proteome regulation.
- Why study protein function?
- Proteins are the functional executors of life; understanding their presence and function reveals how cells/tissues/organisms work.
- Approaches to protein function
- Reductionist approach: isolate a specific protein (e.g., hemoglobin) and study its mechanism in isolation, then place it in a cellular/organismal context.
- Systems/proteomics approach: study how the entire proteome changes under a condition, then infer functional implications from those changes.
- In this course, emphasis starts with reductionist biochemistry (protein purification) with later expansion into systems-level proteomics.
- Environments for studying proteins
- In vivo: within the native environment (cell/tissue); generally physiologically relevant but high background noise due to many proteins and interactions.
- In vitro: in glass (test tubes); controlled environment, often purified protein, lower background, easier to measure kinetics and structure, but less physiological relevance.
- In silico: computational models and simulations; no background noise, can simulate entire cells or single proteins; useful for prediction and screening.
- Ambiguity between in vivo and in vitro: context matters; tissue culture can be debated as in vivo or in vitro depending on perspective.
- Nobel-level context: AI-based protein prediction (e.g., AlphaFold) is transforming predictions of structure and function, enabling rapid candidate identification for experimental validation.
Why purification is essential in biochemistry
- Purification vs. the end goal
- Purification is a means to an end: once a protein is purified, you can conduct assays to measure function, kinetics, binding, and interactions without background noise.
- Purification goals can include:
- Obtaining a protein in its functional form (active) and free of contaminants.
- Isolating a specific isoform or a protein with a particular post-translational modification.
- Preserving interactions to study the interactome or co-purifying partners.
- Sources of protein: native vs. recombinant expression
- Possible sources for protein purification include:
- Endogenous/native sources (from the organism or tissue).
- Recombinant expression in systems such as E. coli, yeast, or mammalian cell lines.
- Choice depends on:
- Post-translational modifications (which E. coli may lack for some proteins).
- Yield and cost considerations.
- Whether an N-terminal or C-terminal tag might interfere with function.
- Purification pipeline: core ideas
- Start with cell disruption (lysis) and then separate components by centrifugation to isolate the soluble protein fraction (lysate) from membranes and other debris.
- Purification strategies rely on exploiting physicochemical properties to separate proteins from contaminants:
- Size (molecular exclusion/gel filtration).
- Charge (ion exchange chromatography).
- Solubility (salting in/out).
- Specific interactions (affinity purification using tags or ligands).
- Purification is often done in sequence (e.g., size exclusion followed by ion exchange) because no single method is perfectly selective.
- A practical note: the goal is to maximize yield of the protein of interest while minimizing contaminants and preserving activity.
Purification techniques and concepts
- Solubility and salt effects
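- Solubility can be manipulated via salt concentration (salting in vs. salting out):
- Low to moderate salt can increase solubility by shielding charges on the protein surface (salting in).
- High salt can “salt out” proteins, causing precipitation; less soluble proteins precipitate at lower salt concentrations than more soluble ones.
- At high concentrations, salt ions compete with proteins for water, so different proteins precipitate at different salt levels and can be separated.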
- The Dutch saying referenced: chemistry as the art of separation; emphasizes purification as the essence of chemical analysis.
- Size-exclusion chromatography (gel filtration)
- Column with porous beads; proteins separate by size:
- Large proteins bypass pores and elute first (shorter path).
- Small proteins enter pores and elute later (longer path).
- Practical use: collect fractions corresponding to the size of the protein of interest.
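The size-dependent elution order can be illustrated with a toy calculation. This is a minimal sketch, assuming a hypothetical column (void volume `v0`, total volume `vt`) and a log-linear relationship between the partition coefficient Kav and molecular weight inside the fractionation range, which is a common calibration approximation. The function name, the column parameters, and the example masses are all illustrative assumptions, not calibration data.

```python
import math

def elution_volume(mw_kda, v0=8.0, vt=24.0, mw_min=10.0, mw_max=600.0):
    """Estimate the elution volume (mL) for a protein of mass mw_kda.

    Assumes Kav falls linearly with log10(MW) from 1 at mw_min (fully
    included in the pores) to 0 at mw_max (fully excluded). All column
    parameters are hypothetical.
    """
    # Clamp to the fractionation range: larger proteins elute in the void volume,
    # smaller ones at the total column volume.
    mw = min(max(mw_kda, mw_min), mw_max)
    span = math.log10(mw_max) - math.log10(mw_min)
    kav = (math.log10(mw_max) - math.log10(mw)) / span
    # Kav = (Ve - V0) / (Vt - V0), rearranged for Ve
    return v0 + kav * (vt - v0)

# Hypothetical sample: names and masses (kDa) are made up for illustration
proteins = {"large complex": 440.0, "mid-size enzyme": 66.0, "small protein": 12.4}
order = sorted(proteins, key=lambda name: elution_volume(proteins[name]))
for name in order:
    print(f"{name}: {elution_volume(proteins[name]):.1f} mL")
```

The largest species comes off first and the smallest last, which is why fractions are collected around the volume expected for the protein of interest.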
- Ion-exchange chromatography
- Exploits protein charge at a given pH; two main types:
- Cation exchange (positively charged proteins bind to negatively charged beads).
- Anion exchange (negatively charged proteins bind to positively charged beads).
- Mechanism: bound proteins are eluted by increasing salt concentration (competition with salt ions for binding sites).
- Key variables:
- Choice of resin (negative beads for cation exchange, positive beads for anion exchange).
- pH control: pH relative to the protein’s pI determines the net charge.
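The resin choice described above can be written as a small decision rule. This is a sketch only: the function name and the `margin` parameter (a rule-of-thumb gap of about one pH unit between buffer pH and pI) are assumptions for illustration, and charge patches or protein stability at the chosen pH can override the simple rule.

```python
def choose_ion_exchanger(pi, buffer_ph, margin=1.0):
    """Suggest a resin type from a protein's pI and the buffer pH.

    margin is an assumed rule-of-thumb distance (in pH units) between
    buffer pH and pI needed for the protein to carry a useful net charge.
    """
    if buffer_ph <= pi - margin:
        # Below its pI the protein is net positive, so it binds negative beads
        return "cation exchange (negatively charged beads)"
    if buffer_ph >= pi + margin:
        # Above its pI the protein is net negative, so it binds positive beads
        return "anion exchange (positively charged beads)"
    return "pH too close to pI: net charge is weak, adjust the buffer"

print(choose_ion_exchanger(pi=10.6, buffer_ph=7.5))
print(choose_ion_exchanger(pi=5.0, buffer_ph=8.0))
print(choose_ion_exchanger(pi=7.0, buffer_ph=7.2))
```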
- Isoelectric point (pI) and pKa concepts
- pKa: pH at which a functional group is 50% protonated.
- pI (isoelectric point): pH at which the molecule has net zero charge.
- At pH below pI, protein tends to be positively charged; at pH above pI, negatively charged.
- Example discussed: a protein with pI around 10.6 would be positively charged at pH 9–10 and neutral only near pH 10.6; at higher pH, negative.
- Charge patches: localized clusters (e.g., lysine- and arginine-rich patches) can alter behavior on ion-exchange columns despite the overall charge.
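The link between pKa, pH, and net charge can be made concrete with the Henderson–Hasselbalch relationship: each ionizable group contributes a fractional charge that depends on the gap between pH and its pKa. The sketch below sums those contributions over a sequence and finds the pI by bisection. The pKa values are approximate textbook-style numbers chosen for illustration (real values shift with local environment), the two peptide sequences are invented, and the calculation ignores charge patches and folding.

```python
# Approximate pKa values (assumed for illustration; real values vary by context)
POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "Nterm": 9.0}
NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "Cterm": 2.0}

def net_charge(seq, ph):
    """Net charge of a sequence at a given pH (Henderson-Hasselbalch)."""
    charge = 0.0
    for group, pka in POSITIVE.items():
        n = 1 if group == "Nterm" else seq.count(group)
        # Protonated fraction of a basic group carries +1
        charge += n / (1 + 10 ** (ph - pka))
    for group, pka in NEGATIVE.items():
        n = 1 if group == "Cterm" else seq.count(group)
        # Deprotonated fraction of an acidic group carries -1
        charge -= n / (1 + 10 ** (pka - ph))
    return charge

def estimate_pi(seq, lo=0.0, hi=14.0, tol=1e-3):
    """Find the pH where net charge crosses zero.

    Net charge only decreases as pH rises, so bisection is sufficient.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid   # still positive: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2

basic_peptide = "KKKKRRAG"    # hypothetical lysine/arginine-rich sequence
acidic_peptide = "DDDEEAG"    # hypothetical aspartate/glutamate-rich sequence
for seq in (basic_peptide, acidic_peptide):
    print(seq, "pI ~", round(estimate_pi(seq), 2), "charge at pH 7:", round(net_charge(seq, 7.0), 2))
```

The basic peptide stays positive at neutral pH (pH below pI) and the acidic one is negative (pH above pI), matching the rule stated above.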
- Affinity purification and tags
- Proteins can be purified using affinity tags that bind specifically to a ligand:
- His-tag (six histidines) binds to nickel-NTA columns; eluted with imidazole.
- GST-tag (glutathione S-transferase) binds to glutathione immobilized on beads; eluted by competition with free glutathione or by other strategies.
- Tags add a purification handle but can interfere with function if placed at critical regions (e.g., N-terminus).
- Monitoring purification: presence and activity at each step
- Two complementary readouts:
- Presence: typically assessed by SDS-PAGE (denaturing gel) to visualize protein size and purity.
- Activity: enzymatic assays or binding assays to confirm functionality.
- Ideally, per-mass activity should increase during purification (specific activity rises) even as total yield declines.
- Yield: fraction of starting protein recovered at each step; some loss is expected, but the aim is to retain the protein while removing contaminants.
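The bookkeeping behind these readouts is usually kept in a purification table. The sketch below computes specific activity, yield, and fold purification per step. The step names and all numbers are made up for illustration, and yield is calculated here from recovered activity (used as a proxy for how much of the target protein survives each step), which is one common convention.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    total_protein_mg: float      # all protein in the fraction, target plus contaminants
    total_activity_units: float  # activity attributable to the target protein

def purification_table(steps):
    """Return one summary row per step, relative to the first (crude) step."""
    first = steps[0]
    base_specific = first.total_activity_units / first.total_protein_mg
    rows = []
    for s in steps:
        specific = s.total_activity_units / s.total_protein_mg
        rows.append({
            "step": s.name,
            "specific_activity": specific,  # units per mg
            "yield_percent": 100 * s.total_activity_units / first.total_activity_units,
            "fold_purification": specific / base_specific,
        })
    return rows

# Hypothetical three-step purification
steps = [
    Step("crude lysate", 1000.0, 10000.0),
    Step("ion exchange", 100.0, 8000.0),
    Step("size exclusion", 10.0, 6000.0),
]
table = purification_table(steps)
for row in table:
    print(row)
```

In this invented example the yield drops from 100% to 60% while specific activity rises 60-fold, which is the expected pattern for a purification that is working.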
- Practical notes on purification workflow
- Start with cell disruption; collect soluble protein; discard membrane and debris unless membrane proteins are the target.
- Use a combination of techniques to achieve sufficient purity for downstream experiments.
- Anticipate and manage trade-offs between purity, yield, and activity.
- Antibodies and immunoprecipitation
- Antibodies can be raised against the protein (antigen) or a specific peptide epitope; antibodies enable highly specific capture.
- Immunoprecipitation (IP): use antibody-bound beads to pull down the target protein from a lysate; co-precipitated proteins can represent interactors (the interactome).
- IP is used for targeted purification and for discovering interacting partners (proteomics).
- Proteomics (bottom-up, i.e., proteins are digested into peptides and identified by mass spectrometry): compare proteins co-immunoprecipitated under different conditions to identify changes in interactions and potential functional pathways.
- Two-dimensional gel electrophoresis (2D-GE)
- First dimension: isoelectric focusing (IEF) separates proteins by their isoelectric point (pI).
- Second dimension: SDS-PAGE separates by size.
- Visualization: spots on the gel represent individual proteins; comparing gels across conditions reveals differential interactors or changes in expression.
- Proteomics and systems-level insights
- Immunoprecipitation followed by proteomic analysis provides a view of the interactome for a given protein under defined conditions.
- Changes between control and treatment or healthy vs. diseased tissue can highlight candidate proteins linked to specific functions or disease processes.
AI and computational prediction of protein function
- Context and impact
- Artificial intelligence and machine learning are transforming protein structure and function prediction.
- AlphaFold (and other models) predict structure from sequence, enabling rapid hypothesis generation about active sites, ligand-binding residues, and regulatory regions.
- AlphaFold’s impact: rapid generation of candidate structures, enabling focused experimental validation; cited as revolutionizing the field due to the scale and speed of prediction.
- How AlphaFold works (high-level)
- Input: amino acid sequence (can be whole protein, a domain, or a segment).
- Step 1: sequence similarity search across diverse species; collect related sequences.
- Step 2: multiple sequence alignment to identify conserved residues and co-evolving contacts.
- Step 3: identify coevolving residues that likely interact in 3D space; construct a distance map of contacting atoms/residues.
- Step 4: computationally assemble a 3D structure that satisfies the distance constraints while avoiding steric clashes.
- Step 5: apply AI refinement to improve the model and estimate a confidence score indicating how likely the model is to match the real structure.
- Output: a hypothetical 3D structure with an associated confidence/probability metric.
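The co-evolution idea in Steps 2 and 3 can be illustrated with a toy score. The sketch below measures mutual information between alignment columns: two positions that change together across species score high, hinting that they may be in contact. This is a deliberately simplified stand-in and not AlphaFold's actual method, which learns these relationships with a neural network; the alignment and the function names are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(col_a, col_b):
    """Mutual information (bits) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in pab.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((pa[a] / n) * (pb[b] / n)))
    return mi

def coevolving_pairs(msa, top=1):
    """Rank column pairs of an alignment by mutual information."""
    columns = list(zip(*msa))  # transpose: one tuple of residues per position
    scores = {
        (i, j): mutual_information(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical alignment: position 1 (K/R) and position 3 (D/E) always change together
msa = [
    "AKLDE",
    "AKLDE",
    "ARLEE",
    "ARLEE",
    "AKIDE",
    "ARIEE",
]
print(coevolving_pairs(msa))  # 0-based column indices of the top-scoring pair
```

Fully conserved columns score zero because they carry no variation, which is why conservation and co-variation are treated as separate signals.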
- Strengths and limitations
- Strengths: strong ability to predict well-structured regions and overall folds; provides a valuable candidate structure when experimental structures are unavailable.
- Limitations: disordered regions are harder to predict accurately; predictions are probabilistic and require experimental validation.
- The accuracy tends to be higher for proteins with well-defined tertiary structures (e.g., rigid cores) than for highly flexible or intrinsically disordered regions.
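In practice, the confidence metric is what flags the hard-to-predict regions. AlphaFold reports a per-residue score called pLDDT (0 to 100), and stretches of low pLDDT are often, though not always, flexible or disordered. The sketch below bins scores using the bands shown in the AlphaFold Protein Structure Database and lists contiguous low-confidence segments. The score list, the function names, and the `min_len` cutoff are illustrative assumptions.

```python
def confidence_band(plddt):
    """Map a pLDDT score to the band names used by the AlphaFold Database."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"

def low_confidence_segments(scores, threshold=70.0, min_len=3):
    """Return 1-based inclusive (start, end) ranges where pLDDT < threshold.

    min_len is an assumed cutoff to ignore isolated low-scoring residues.
    """
    segments, start = [], None
    for i, score in enumerate(scores, start=1):
        if score < threshold:
            if start is None:
                start = i                      # open a new low-confidence run
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None                       # close the run
    # Handle a run that extends to the C-terminus
    if start is not None and len(scores) - start + 1 >= min_len:
        segments.append((start, len(scores)))
    return segments

# Hypothetical per-residue scores for a 13-residue stretch
scores = [95, 93, 91, 88, 62, 45, 40, 48, 85, 92, 55, 30, 35]
print([confidence_band(s) for s in scores])
print(low_confidence_segments(scores))
```

Segments flagged this way are candidates for extra caution when choosing residues for mutagenesis or pocket analysis.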
- Practical use of AlphaFold predictions
- Use predicted structures to identify potential active sites, ligand-binding pockets, and regulatory sites as starting points for experimental work.
- Generate candidate residues for mutational analysis or targeted screening of ligands.
- Still requires experimental validation to confirm function and binding.
Quick recap: core concepts and equations
- Central dogma recap (DNA → RNA → protein) and proteome variability across conditions and cell types.
- Four main physical/chemical properties used for purification:
- Size (gel filtration/size exclusion).
- Charge (ion exchange; pH-dependent).
- Solubility (salting in/out).
- Specific interactions (affinity purification).
- Key definitions
- pKa: pH at which a functional group is 50% protonated.
- pI (isoelectric point): the pH at which the protein has net zero charge.
- Important quantitative concept
- Enzyme kinetics can be described by Michaelis–Menten behavior:
$$v = \frac{V_{max}[S]}{K_m + [S]}$$
- This underpins the rationale for purifying proteins to measure catalytic parameters (Km, Vmax) without cellular noise.
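The Michaelis–Menten relationship above is simple to evaluate numerically. The sketch below computes the rate at a given substrate concentration and, as one way to recover Km and Vmax from measurements, fits a straight line to the double-reciprocal (Lineweaver–Burk) form 1/v = (Km/Vmax)(1/[S]) + 1/Vmax. The parameter values and data are synthetic and noise-free; with real, noisy data a nonlinear fit is generally preferred, so treat this as an illustration rather than a recommended analysis.

```python
def michaelis_menten(s, vmax, km):
    """Initial rate v at substrate concentration s."""
    return vmax * s / (km + s)

def fit_lineweaver_burk(substrate, rates):
    """Estimate (Vmax, Km) by least squares on 1/v versus 1/[S]."""
    xs = [1 / s for s in substrate]
    ys = [1 / v for v in rates]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    vmax = 1 / intercept   # intercept = 1/Vmax
    km = slope * vmax      # slope = Km/Vmax
    return vmax, km

# Synthetic example: hypothetical enzyme with Vmax = 100 and Km = 2.0 (arbitrary units)
true_vmax, true_km = 100.0, 2.0
substrate = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
rates = [michaelis_menten(s, true_vmax, true_km) for s in substrate]

print(michaelis_menten(true_km, true_vmax, true_km))  # rate at [S] = Km is half of Vmax
print(fit_lineweaver_burk(substrate, rates))
```

The first printed value shows the defining property of Km: it is the substrate concentration at which the rate is half of Vmax.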
- Practical experimental notes
- Purification is not the goal itself; it is a means to enable measurement of presence, purity, activity, and interactions.
- Always verify presence and activity at each purification step to avoid wasted time and resources.
- When purifying with tags, ensure the tag does not disrupt function; choose expression system accordingly.
Connections to broader themes
- Bridging reductionist and systems perspectives:
- Reductionist purification yields detailed mechanistic insight into a single protein.
- Systems proteomics reveals how a protein fits into broader networks and pathways.
- Real-world relevance:
- Proteome variability under different physiological conditions underpins development, disease, and response to therapy (e.g., fetal vs. adult hemoglobin, sickle cell disease).
- AI-driven structure prediction accelerates discovery and hypothesis generation but must be complemented by empirical validation.
- Ethical and practical implications:
- Computational predictions should be validated experimentally to avoid over-interpretation.
- Use of model systems and expression hosts must consider post-translational modifications and physiological relevance.
Summary of key takeaways
- The proteome is dynamic and context-dependent; understanding protein function requires knowing which proteins are present and active under given conditions.
- Purification is a critical, multifaceted tool, chosen based on the protein’s properties, required yield, and downstream assays.
- Purification relies on exploiting size, charge, solubility, and binding affinities, with ongoing monitoring of presence and activity.
- Antibodies and proteomics extend purification into interactome analysis, enabling discovery of protein networks.
- AI-powered predictions (e.g., AlphaFold) provide powerful structure-function hypotheses that drive experimental prioritization and discovery.