Notes on Protein Function, Purification, and AI-Based Prediction

Genomes, transcriptomes, and the proteome

Context: This module sits between protein–ligand binding (hemoglobin) and the start of enzyme function; it sets up core biochemical topics essential for signal transduction, metabolism, and later translation.
Key idea: Proteins are central to biology because they execute the functions encoded by the genome; the genome is relatively fixed, but the proteome is highly dynamic.
Genomes (parts list) vs. proteomes (expressed proteins):
- The genome contains information in DNA; generally fixed across cells, with some changes discussed later.
- Model organisms and gene counts mentioned:
  - Caenorhabditis elegans (worm): ~19,000 genes.
  - Drosophila melanogaster (fruit fly): ~14,000 genes.
  - Humans: ~21,000–22,000 genes, only a few thousand more than C. elegans, but a much larger genome in base pairs.
Central dogma recap (DNA → RNA → protein):
- DNA carries information (genome).
- RNA (transcriptome) is the working copy; variable between cell types and conditions.
- Protein synthesis (proteome) through translation; proteins are the functional units.
- Exceptions noted: RNA can function as a ribozyme or carry information; but the general flow is DNA → RNA → protein.
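The DNA → RNA → protein flow can be made concrete with a minimal Python sketch. It assumes the standard genetic code and a coding-strand DNA input; the example sequence and the function names (transcribe, translate) are hypothetical, chosen only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA that matches the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    dna = coding_dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


example_dna = "ATGGTGCATCTGACTCCTGAGGAGTAA"  # hypothetical short open reading frame
print(transcribe(example_dna))  # working copy (transcriptome level)
print(translate(example_dna))   # functional unit (proteome level)
```

The same DNA always gives the same mRNA and protein here; what makes the real proteome dynamic is which genes are transcribed and translated, and when, which this sketch does not model.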
Proteome as context-dependent: the set of expressed/modified proteins varies with:
- Cell type (e.g., kidney vs. brain).
- Developmental stage.
- Environmental conditions (e.g., oxygen, nutrients, day vs. night).
Example: fetal hemoglobin vs. adult hemoglobin
- Fetal hemoglobin (α2γ2) is downregulated after birth and replaced by adult hemoglobin (α2β2), in which β chains take the place of the fetal γ chains.
- Re-expression of fetal hemoglobin can alleviate some sickle cell symptoms; illustrates dynamic proteome regulation.
Why study protein function?
- Proteins are the functional executors of life; understanding their presence and function reveals how cells/tissues/organisms work.
Approaches to protein function
- Reductionist approach: isolate a specific protein (e.g., hemoglobin) and study its mechanism in isolation, then place it in a cellular/organismal context.
- Systems/proteomics approach: study how the entire proteome changes under a condition, then infer functional implications from those changes.
- In this course, emphasis starts with reductionist biochemistry (protein purification) with later expansion into systems-level proteomics.
Environments for studying proteins
- In vivo: within the native environment (cell/tissue); generally physiologically relevant but high background noise due to many proteins and interactions.
- In vitro: in glass (test tubes); controlled environment, often purified protein, lower background, easier to measure kinetics and structure, but less physiological relevance.
- In silico: computational models and simulations; no background noise, can simulate entire cells or single proteins; useful for prediction and screening.
- Ambiguity between in vivo and in vitro: context matters; tissue culture can be debated as in vivo or in vitro depending on perspective.
Nobel-level context: AI-based protein structure prediction (e.g., AlphaFold, whose developers shared the 2024 Nobel Prize in Chemistry) is transforming predictions of structure and function, enabling rapid identification of candidates for experimental validation.

Why purification is essential in biochemistry

Purification vs. the end goal
- Purification is a means to an end: once a protein is purified, you can conduct assays to measure function, kinetics, binding, and interactions without background noise.
- Purification goals can include:
  - Obtaining a protein in its functional form (active) and free of contaminants.
  - Isolating a specific isoform or a protein with a particular post-translational modification.
  - Preserving interactions to study the interactome or co-purifying partners.
Sources of protein for purification
- Possible sources for protein purification include:
  - Endogenous/native sources (from the organism or tissue).
  - Recombinant expression in systems such as E. coli, yeast, or mammalian cell lines.
- Choice depends on:
  - Required post-translational modifications (E. coli cannot perform many eukaryotic modifications).
  - Yield and cost considerations.
  - Whether an N-terminal or C-terminal tag might interfere with function.
Purification pipeline: core ideas
- Start with cell disruption (lysis) and then separate components by centrifugation to isolate the soluble protein fraction (lysate) from membranes and other debris.
- Purification strategies rely on exploiting physicochemical properties to separate proteins from contaminants:
  - Size (size exclusion/gel filtration).
  - Charge (ion exchange chromatography).
  - Solubility (salting in/out).
  - Specific interactions (affinity purification using tags or ligands).
- Purification is often done in sequence (e.g., size exclusion followed by ion exchange) because no single method is perfectly selective.
- A practical note: the goal is to maximize yield of the protein of interest while minimizing contaminants and preserving activity.

Purification techniques and concepts

Solubility and salt effects
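- Solubility can be manipulated via salt concentration (salting in vs. salting out):
  - Salting in: at low ionic strength, adding salt shields charges on the protein surface and increases solubility.
  - Salting out: at high salt, the ions compete with proteins for water, so proteins precipitate; poorly soluble proteins “salt out” first.
  - Because different proteins precipitate at different salt concentrations, stepwise salt addition gives a crude separation.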
- The Dutch saying referenced: chemistry as the art of separation; emphasizes purification as the essence of chemical analysis.
Size-exclusion chromatography (gel filtration)
- Column with porous beads; proteins separate by size:
  - Large proteins are excluded from the pores, travel around the beads, and elute first (shorter path).
  - Small proteins enter the pores and elute later (longer path).
- Practical use: collect fractions corresponding to the size of the protein of interest.
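To decide which fractions to collect, a column is often calibrated with standards of known size. The sketch below is a rough illustration, not a protocol: it assumes the common empirical approximation that the partition coefficient K_av = (V_e - V_0) / (V_t - V_0) falls roughly linearly with log10(molecular weight) within the resin's working range. The column volumes, calibration standards, and function names are all hypothetical.

```python
import math

V0 = 8.0    # void volume in mL (hypothetical column)
VT = 24.0   # total bed volume in mL (hypothetical column)

# Hypothetical calibration standards: (molecular weight in Da, elution volume in mL).
STANDARDS = [(670_000, 9.0), (158_000, 12.5), (44_000, 15.5),
             (17_000, 17.8), (1_350, 21.5)]


def k_av(elution_volume: float) -> float:
    """Partition coefficient: 0 = fully excluded from the beads, 1 = fully included."""
    return (elution_volume - V0) / (VT - V0)


def fit_calibration(standards):
    """Least-squares line K_av = slope * log10(MW) + intercept."""
    xs = [math.log10(mw) for mw, _ in standards]
    ys = [k_av(ve) for _, ve in standards]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_elution_volume(mw: float, standards=STANDARDS) -> float:
    """Estimate where a protein of the given molecular weight should elute."""
    slope, intercept = fit_calibration(standards)
    return V0 + (slope * math.log10(mw) + intercept) * (VT - V0)


for name, mw in [("large complex", 300_000), ("target protein", 64_500),
                 ("small protein", 12_000)]:
    print(f"{name}: ~{predict_elution_volume(mw):.1f} mL")
```

The predicted volumes come out in order of decreasing size, matching the rule that large proteins elute first. Real elution also depends on shape, so the estimate only suggests which fractions to check on a gel.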
Ion-exchange chromatography
- Exploits protein charge at a given pH; two main types:
  - Cation exchange (positive proteins bind to negatively charged beads).
  - Anion exchange (negative proteins bind to positively charged beads).
- Mechanism: bound proteins are eluted by increasing salt concentration (competition with salt ions for binding sites).
- Key variables:
  - Choice of resin (negative beads for cation exchange, positive beads for anion exchange).
  - pH control: pH relative to the protein’s pI determines the net charge.
Isoelectric point (pI) and pKa concepts
- pKa: pH at which a functional group is 50% protonated.
- pI (isoelectric point): pH at which the molecule has net zero charge.
- At pH below pI, protein tends to be positively charged; at pH above pI, negatively charged.
- Example discussed: a protein with a pI of about 10.6 carries a net positive charge at pH 9–10, is neutral near pH 10.6, and is net negative at higher pH.
- Charge patches: localized clusters (e.g., lysine- and arginine-rich patches) can alter behavior on ion-exchange columns despite the overall charge.
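The link between pKa, pI, and resin choice can be sketched numerically. The code below sums Henderson–Hasselbalch fractional charges over the ionizable groups of a sequence, finds the pI by bisection, and suggests an ion exchanger for a given buffer pH. The pKa values are approximate textbook numbers (an assumption: real values shift with the local environment), the peptide is hypothetical, and, as noted above, charge patches mean the net charge is only a first guess.

```python
# Approximate textbook pKa values (assumed; they vary with local environment).
POSITIVE_PKA = {"N_TERM": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}   # +1 when protonated
NEGATIVE_PKA = {"C_TERM": 2.0, "D": 3.9, "E": 4.3, "C": 8.3, "Y": 10.1}  # -1 when deprotonated


def net_charge(sequence: str, pH: float) -> float:
    """Net charge from Henderson-Hasselbalch fractions of each ionizable group."""
    seq = sequence.upper()
    charge = 0.0
    for group, pka in POSITIVE_PKA.items():
        count = 1 if group == "N_TERM" else seq.count(group)
        charge += count / (1 + 10 ** (pH - pka))   # fraction protonated
    for group, pka in NEGATIVE_PKA.items():
        count = 1 if group == "C_TERM" else seq.count(group)
        charge -= count / (1 + 10 ** (pka - pH))   # fraction deprotonated
    return charge


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Bisection works because net charge falls monotonically as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def suggest_exchanger(sequence: str, buffer_pH: float) -> str:
    """Positive net charge -> cation exchange; negative -> anion exchange."""
    return "cation exchange" if net_charge(sequence, buffer_pH) > 0 else "anion exchange"


peptide = "MKKRLKKAGRKHE"  # hypothetical lysine/arginine-rich sequence
pI = isoelectric_point(peptide)
print(f"estimated pI: {pI:.2f}")
print(f"net charge at pH 7.0: {net_charge(peptide, 7.0):+.2f}")
print(f"suggested resin at pH 7.0: {suggest_exchanger(peptide, 7.0)}")
```

At a pH below the estimated pI the function reports a positive charge and suggests cation exchange, mirroring the pI 10.6 example. In practice the buffer is usually kept well away from the pI so that the protein binds firmly.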
Affinity purification and tags
- Proteins can be purified using affinity tags that bind specifically to a ligand:
  - His-tag (six histidines) binds to nickel-NTA columns; eluted with imidazole.
  - GST-tag binds to immobilized glutathione; eluted with free glutathione as a competitor, or by other strategies.
- Tags add a purification handle but can interfere with function if placed at critical regions (e.g., N-terminus).
Monitoring purification: presence and activity at each step
- Two complementary readouts:
  - Presence: typically assessed by SDS-PAGE (denaturing gel) to visualize protein size and purity.
  - Activity: enzymatic assays or binding assays to confirm functionality.
- Ideally, specific activity (activity per unit mass of total protein) rises at each purification step as contaminants are removed, even as total yield declines.
- Yield: fraction of the starting protein of interest recovered at each step; some loss is expected, but the aim is to retain the protein while removing contaminants.
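These readouts are usually collected in a purification table. The sketch below computes specific activity, yield, and fold purification for each step. All numbers are invented for illustration, and yield is computed from recovered activity, a common proxy for how much of the target protein remains (an assumption of this sketch).

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    total_protein_mg: float      # all protein in the fraction, target plus contaminants
    total_activity_units: float  # activity of the protein of interest


def purification_table(steps):
    """Return specific activity, percent yield, and fold purification per step."""
    first = steps[0]
    base_specific_activity = first.total_activity_units / first.total_protein_mg
    rows = []
    for step in steps:
        specific_activity = step.total_activity_units / step.total_protein_mg
        rows.append({
            "step": step.name,
            "specific_activity": specific_activity,  # units per mg
            "yield_percent": 100 * step.total_activity_units / first.total_activity_units,
            "fold_purification": specific_activity / base_specific_activity,
        })
    return rows


# Hypothetical purification run.
steps = [
    Step("crude lysate", 1000.0, 10000.0),
    Step("salt fractionation", 300.0, 8000.0),
    Step("ion exchange", 50.0, 6000.0),
    Step("affinity column", 5.0, 4500.0),
]

for row in purification_table(steps):
    print(f"{row['step']:<20} {row['specific_activity']:>8.1f} U/mg "
          f"{row['yield_percent']:>6.1f}% {row['fold_purification']:>6.1f}x")
```

In this made-up run the yield falls to 45% while specific activity rises 90-fold, which is the trade-off described above. A step where specific activity fails to rise removes target protein about as fast as contaminants and is worth reconsidering.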
Practical notes on purification workflow
- Start with cell disruption; collect soluble protein; discard membrane and debris unless membrane proteins are the target.
- Use a combination of techniques to achieve sufficient purity for downstream experiments.
- Anticipate and manage trade-offs between purity, yield, and activity.
Antibodies and immunoprecipitation
- Antibodies can be raised against the protein (antigen) or a specific peptide epitope; antibodies enable highly specific capture.
- Immunoprecipitation (IP): use antibody-bound beads to pull down the target protein from a lysate; co-precipitated proteins can represent interactors (the interactome).
- IP is used for targeted purification and for discovering interacting partners (proteomics).
- Proteomics (bottom-up, i.e., proteins are digested into peptides that are identified by mass spectrometry): compare the proteins co-immunoprecipitated under different conditions to identify changes in interactions and potential functional pathways.
Two-dimensional gel electrophoresis (2D-GE)
- First dimension: isoelectric focusing (IEF) separates proteins by their isoelectric point (pI).
- Second dimension: SDS-PAGE separates by size.
- Visualization: each spot represents an individual protein; comparing spot patterns across conditions reveals differential interactors or changes in expression.
Proteomics and systems-level insights
- Immunoprecipitation followed by proteomic analysis provides a view of the interactome for a given protein under defined conditions.
- Changes between control and treatment or healthy vs. diseased tissue can highlight candidate proteins linked to specific functions or disease processes.
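The control-versus-treatment comparison can be sketched as a simple bookkeeping step. The code assumes each co-IP experiment has already been reduced to a mapping from protein name to a positive abundance value (for example, a spectral count). The protein names, numbers, and two-fold threshold are hypothetical.

```python
def compare_interactomes(control: dict, treated: dict, min_fold: float = 2.0):
    """Split co-IP hits into gained, lost, and abundance-changed interactors."""
    gained = sorted(set(treated) - set(control))   # only seen after treatment
    lost = sorted(set(control) - set(treated))     # only seen in control
    changed = {}
    for protein in sorted(set(control) & set(treated)):
        fold = treated[protein] / control[protein]
        if fold >= min_fold or fold <= 1 / min_fold:
            changed[protein] = fold
    return gained, lost, changed


# Hypothetical abundance values for proteins pulled down with the bait.
control_ip = {"PARTNER_A": 40, "PARTNER_B": 25, "PARTNER_C": 10}
treated_ip = {"PARTNER_A": 38, "PARTNER_B": 5, "PARTNER_D": 18}

gained, lost, changed = compare_interactomes(control_ip, treated_ip)
print("gained:", gained)
print("lost:", lost)
print("changed:", changed)
```

Proteins in any of the three groups are only candidates: real analyses also need replicates, controls for nonspecific binding to the beads, and statistics, none of which are modeled here.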

AI and computational prediction of protein function

Context and impact
- Artificial intelligence and machine learning are transforming protein structure and function prediction.
- AlphaFold (and other models) predict structure from sequence, enabling rapid hypothesis generation about active sites, ligand-binding residues, and regulatory regions.
- AlphaFold’s impact: rapid generation of candidate structures, enabling focused experimental validation; cited as revolutionizing the field due to the scale and speed of prediction.
How AlphaFold works (high-level)
- Input: amino acid sequence (can be whole protein, a domain, or a segment).
- Step 1: sequence similarity search across diverse species; collect related sequences.
- Step 2: multiple sequence alignment to identify conserved residues and co-evolving contacts.
- Step 3: identify coevolving residues that likely interact in 3D space; construct a distance map of contacting atoms/residues.
- Step 4: computationally assemble a 3D structure that satisfies the distance constraints while avoiding steric clashes.
- Step 5: apply AI refinement to improve the model and estimate a confidence score for how well it is expected to match the real structure.
- Output: a hypothetical 3D structure with an associated confidence/probability metric.
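The co-evolution idea in steps 2 and 3 can be illustrated with a toy calculation. The sketch scores every pair of alignment columns by mutual information, a classical statistic for covariation. This is a teaching example only: AlphaFold itself relies on a trained neural network rather than this score, and the small alignment below is invented.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, index):
    """Residues found at one alignment position across all sequences."""
    return [seq[index] for seq in msa]


def mutual_information(col_a, col_b) -> float:
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


def top_coevolving_pairs(msa, k: int = 3):
    """Rank column pairs (0-based positions) by mutual information."""
    length = len(msa[0])
    scores = [((i, j), mutual_information(column(msa, i), column(msa, j)))
              for i, j in combinations(range(length), 2)]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical alignment: positions 1 and 4 swap charge together (K/E <-> E/K).
toy_msa = [
    "AKLGEV",
    "AELGKV",
    "AKIGEV",
    "AELGKI",
    "AKLGEV",
    "AEIGKV",
]

for (i, j), score in top_coevolving_pairs(toy_msa):
    print(f"positions {i} and {j}: MI = {score:.2f} bits")
```

Positions 1 and 4 score highest because a change at one is always matched by a compensating change at the other, the kind of signal read as a likely 3D contact. Fully conserved columns score zero, since conservation carries no pairing information.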
Strengths and limitations
- Strengths: strong ability to predict well-structured regions and overall folds; provides a valuable candidate structure when experimental structures are unavailable.
- Limitations: disordered regions are harder to predict accurately; predictions are probabilistic and require experimental validation.
- The accuracy tends to be higher for proteins with well-defined tertiary structures (e.g., rigid cores) than for highly flexible or intrinsically disordered regions.
Practical use of AlphaFold predictions
- Use predicted structures to identify potential active sites, ligand-binding pockets, and regulatory sites as starting points for experimental work.
- Generate candidate residues for mutational analysis or targeted screening of ligands.
- Still requires experimental validation to confirm function and binding.
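One practical way to use the confidence metric is to restrict site-hunting to regions the model is confident about. The sketch assumes a list of per-residue confidence scores on a 0–100 scale (AlphaFold reports such a score, called pLDDT). The cutoff of 70, the minimum segment length, and the scores themselves are assumptions for illustration.

```python
def confident_segments(scores, cutoff: float = 70.0, min_length: int = 3):
    """Return (start, end) residue ranges, 1-based inclusive, scoring at or above cutoff."""
    segments = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= cutoff:
            if start is None:
                start = position          # a confident stretch begins here
        else:
            if start is not None and position - start >= min_length:
                segments.append((start, position - 1))
            start = None                  # stretch ended (or was too short)
    if start is not None and len(scores) - start + 1 >= min_length:
        segments.append((start, len(scores)))
    return segments


# Hypothetical per-residue confidence scores for a 15-residue stretch.
example_scores = [35, 42, 88, 91, 93, 90, 55, 48, 81, 85, 40, 92, 95, 96, 94]
print(confident_segments(example_scores))
```

Residues outside the returned ranges are where the limitations above apply most: low scores often coincide with flexible or disordered regions, so pockets predicted there are weaker starting points for mutagenesis or ligand screening.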

Quick recap: core concepts and equations

Central dogma recap (DNA → RNA → protein) and proteome variability across conditions and cell types.
Four main physical/chemical properties used for purification:
- Size (gel filtration/size-exclusion chromatography).
- Charge (ion exchange; pH-dependent).
- Solubility (salting in/out).
- Specific interactions (affinity purification).
Key definitions
- pKa: pH at which a functional group is 50% protonated.
- pI (isoelectric point): the pH at which the protein has net zero charge.
Important quantitative concept
- Enzyme kinetics can be described by Michaelis–Menten behavior:
  v = \frac{V_{max} [S]}{K_m + [S]}
- This underpins the rationale for purifying proteins to measure catalytic parameters (Km, Vmax) without cellular noise.
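A short numerical sketch shows how Km and Vmax are recovered from rate measurements on a purified enzyme. It generates noise-free synthetic data from the Michaelis–Menten equation and fits it with a double-reciprocal (Lineweaver–Burk) line, 1/v = (Km/Vmax)(1/[S]) + 1/Vmax. The parameter values, units, and function names are hypothetical. The double-reciprocal fit is used only because it needs no external libraries; with real, noisy data a nonlinear fit is generally preferred.

```python
def mm_rate(substrate: float, vmax: float, km: float) -> float:
    """Michaelis-Menten rate: v = Vmax [S] / (Km + [S])."""
    return vmax * substrate / (km + substrate)


def lineweaver_burk_fit(substrate_concs, rates):
    """Estimate (Vmax, Km) from a straight-line fit of 1/v against 1/[S]."""
    xs = [1 / s for s in substrate_concs]
    ys = [1 / v for v in rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    vmax = 1 / intercept    # intercept = 1 / Vmax
    km = slope * vmax       # slope = Km / Vmax
    return vmax, km


TRUE_VMAX = 100.0  # hypothetical, e.g. micromol per minute
TRUE_KM = 2.0      # hypothetical, e.g. mM

substrate = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
measured = [mm_rate(s, TRUE_VMAX, TRUE_KM) for s in substrate]

print(f"rate at [S] = Km: {mm_rate(TRUE_KM, TRUE_VMAX, TRUE_KM):.1f} (half of Vmax)")
vmax_est, km_est = lineweaver_burk_fit(substrate, measured)
print(f"estimated Vmax = {vmax_est:.2f}, Km = {km_est:.2f}")
```

The first printed line illustrates the defining property of Km: at [S] = Km the rate is half of Vmax. In a crude lysate, competing enzymes and endogenous substrate would distort the measured rates, which is why such fits are done on purified protein.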
Practical experimental notes
- Purification is not the goal itself; it is a means to enable measurement of presence, purity, activity, and interactions.
- Always verify presence and activity at each purification step to avoid wasted time and resources.
- When purifying with tags, ensure the tag does not disrupt function; choose expression system accordingly.

Connections to broader themes

Bridging reductionist and systems perspectives:
- Reductionist purification yields detailed mechanistic insight into a single protein.
- Systems proteomics reveals how a protein fits into broader networks and pathways.
Real-world relevance:
- Proteome variability under different physiological conditions underpins development, disease, and response to therapy (e.g., fetal vs. adult hemoglobin, sickle cell disease).
- AI-driven structure prediction accelerates discovery and hypothesis generation but must be complemented by empirical validation.
Ethical and practical implications:
- Computational predictions should be validated experimentally to avoid over-interpretation.
- Use of model systems and expression hosts must consider post-translational modifications and physiological relevance.

Summary of key takeaways

The proteome is dynamic and context-dependent; understanding protein function requires knowing which proteins are present and active under given conditions.
Purification is a critical, multifaceted tool; the strategy is chosen based on the protein’s properties, required yield, and downstream assays.
Purification relies on exploiting size, charge, solubility, and binding affinities, with ongoing monitoring of presence and activity.
Antibodies and proteomics extend purification into interactome analysis, enabling discovery of protein networks.
AI-powered predictions (e.g., AlphaFold) provide powerful structure-function hypotheses that drive experimental prioritization and discovery.