Proteins: shape, function, and the folding problem
- Proteins carry out the labor in our cells; their shape dictates their function. The three-dimensional (3D) shape, or fold, is the key to how a protein works.
- When unfolded, a protein is a long string of amino acids.
- There are 20 different amino acids, each with its own chemical behaviors. When a protein folds, you get a long tangled piece of spaghetti with various chemical functionalities on it; the 3D shape has evolved over billions of years to perform specific jobs.
- Understanding the minute details of a protein’s structure yields insights into function and potentially allows changing that function.
- Researchers have been trying for many years to solve the protein folding problem.
The protein folding problem and data sources
- Question: Can we predict how a protein folds from its amino acid sequence alone?
- Approach 1: Input the amino acid sequence into a computer and test folding predictions with algorithms.
- Approach 2: Use experimental techniques like X-ray crystallography or other imaging methods to determine structure, though structures have not yet been determined this way for many proteins.
- A related, broader question asked a couple of decades ago: Could the genome sequence data—the billions of letters in the human genome and in other organisms—reveal anything about how proteins fold? The DNA code and the amino acid code are related but distinct.
- The DNA in our genes codes for RNA, which is translated into proteins. There is a relation between the 4-letter DNA code and the 20-letter amino acid sequence.
- An important observation: because a protein folds into twists and turns, amino acids that are far apart linearly (e.g., the 6th amino acid and the 18th amino acid) can end up close in 3D space, creating interactions that influence folding and function.
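The relation between the 4-letter DNA code and the 20-letter amino acid code noted above can be sketched in a few lines of Python. This is a minimal illustration, not a bioinformatics tool: it assumes the standard genetic code, reads the coding strand of DNA directly (skipping the RNA intermediate), and uses one-letter amino acid symbols with `*` for stop codons. The names `CODON_TABLE` and `translate` are illustrative, not from the notes.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 three-letter codons to one amino acid (or stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA string into a protein string.

    Reads three letters at a time from position 0 and stops at the
    first stop codon. Any trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


example_protein = translate("ATGGCCAAATGA")  # ATG GCC AAA TGA
```

Because 64 codons map onto only 20 amino acids plus stop, several codons encode the same amino acid, which is one reason the two codes are related but distinct.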
Coevolution and residue-residue interactions
- If two amino acids end up near each other in the folded protein and interact, mutations in one amino acid must often be compensated by mutations in the other to preserve the interaction. This is coevolution.
- Illustration: when a DNA mutation changes one amino acid, a second mutation in the partner amino acid is often needed to maintain the interaction and the overall fold.
- If you can collect and analyze on the order of $K \ge 100$ cases of such nearby interacting pairs across many genome sequences, you can feed these tight constraints into a folding program, improving its accuracy.
- The idea is that many genome sequences carry correlated mutations in residue pairs that are physically close in 3D space, and these correlations can guide folding predictions.
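One simple way to look for correlated mutations is to line up related sequences and measure the mutual information between pairs of columns. The sketch below assumes a tiny, made-up alignment (`TOY_MSA`) in which positions 1 and 4 swap together between K and E; the function names are hypothetical. Plain mutual information is a simplification chosen for illustration, and the notes do not say which statistic the actual methods use.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Hypothetical aligned sequences (same length, one per organism).
# Columns 1 and 4 always change together: K pairs with E, E pairs with K.
TOY_MSA = ["AKGLE", "AKGLE", "AEGLK", "AEGLK", "AKGIE", "AEGIK"]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def rank_pairs(msa):
    """Return (score, i, j) for every column pair, highest score first."""
    length = len(msa[0])
    scores = [
        (mutual_information(msa, i, j), i, j)
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scores, reverse=True)


top_score, top_i, top_j = rank_pairs(TOY_MSA)[0]
```

In this toy alignment the top-ranked pair is columns 1 and 4, the pair constructed to covary. High-scoring pairs like this are the candidates for residues that sit close together in the folded protein.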
Folding programs and computational advances
- With abundant constraints from co-evolution data, folding programs gain a much better chance of predicting accurate structures.
- The approach has proven effective: scientists can fold many proteins that were previously intractable, opening up insights into how those proteins work.
- Beyond prediction, researchers have steadily improved computer models of protein shapes, enabling the design of new proteins not found in nature.
Applications: medicine and materials
- The most obvious application is in medicine:
  - Designing proteins that target very specific parts of pathogens like the flu virus, enabling vaccines that work across multiple flu strains.
  - Creating proteins that naturally assemble into tiny cages capable of delivering different molecules inside the body.
- In materials science and engineering, designed proteins can lead to new materials and devices, such as engineered surfaces for solar cells and electronic devices.
- Protein design is versatile: it can be pursued in a thousand different directions.
Concepts, connections, and implications
- Connection to foundational biology: DNA -> RNA -> Protein; genetic information guides protein function through sequence and structure.
- Conceptual takeaway: structure determines function; understanding or controlling structure enables manipulation of function.
- Practical implication: improved predictive power for protein folding accelerates drug design, vaccine development, and new materials.
- Metaphor: proteins as evolved molecular machines whose three-dimensional shapes are the blueprint for their actions.
- Ethical and practical considerations (implied): as design capabilities grow, considerations about safety, biosecurity, and responsible use become increasingly important when creating novel proteins or delivery systems.
Key numbers, terms, and concepts to remember
- Number of amino acids: 20, each with distinct chemical behavior.
- Genome length reference: the human genome contains about 3,000,000,000 (3 billion) letters; many other genomes exist with billions of letters.
- DNA code vs. amino acid code: four-letter DNA code vs. twenty-letter amino acid code; the two are related through transcription and translation.
- Protein substructure interactions: a protein’s fold brings distant linear residues into proximity, e.g., $a_6$ and $a_{18}$ may interact in 3D space.
- Coevolution criterion: correlated mutations in interacting residue pairs can stabilize the fold; gathering roughly 100 or more such cases strengthens predictive models via constraints.
- Potential outputs of design: proteins that assemble into nanoscale cages, targeted therapeutics, and novel materials.
Hypothetical scenarios and examples
- Scenario: A researcher gathers a database of correlated mutations across many genomes, identifying a set of residue pairs with high mutual information. These pairs become constraints in a folding algorithm, reducing the number of plausible folds and revealing the correct structure faster.
- Scenario: A designed protein self-assembles into a cage that safely carries a therapeutic payload to a targeted tissue, reducing off-target effects.
- Scenario: A new protein-based material is engineered to form a surface with enhanced efficiency for solar energy capture or electron transfer in devices.
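The first scenario above, where predicted residue pairs narrow down the plausible folds, can be illustrated with a toy scoring step. This sketch assumes each candidate structure is a list of one (x, y, z) point per residue, and that a pair counts as "in contact" when the two points are within 8 Å. The cutoff, the made-up coordinates, and the names `satisfied_fraction` and `best_candidate` are all assumptions for illustration. Real folding programs typically use such constraints during the search itself rather than to filter a fixed list of candidates.

```python
from math import dist

# Predicted contacts as (residue index, residue index) pairs,
# e.g. taken from a coevolution analysis. Hypothetical values.
CONTACTS = [(0, 5), (1, 4)]

# Two made-up candidate structures for a 6-residue chain (units: angstroms).
# EXTENDED is a straight line; HAIRPIN doubles back on itself.
EXTENDED = [(3.8 * i, 0.0, 0.0) for i in range(6)]
HAIRPIN = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]


def satisfied_fraction(coords, contacts, cutoff=8.0):
    """Fraction of predicted contacts whose residues lie within `cutoff`."""
    hits = sum(1 for i, j in contacts if dist(coords[i], coords[j]) <= cutoff)
    return hits / len(contacts)


def best_candidate(candidates, contacts, cutoff=8.0):
    """Return the name of the candidate that satisfies the most contacts."""
    return max(
        candidates,
        key=lambda name: satisfied_fraction(candidates[name], contacts, cutoff),
    )


winner = best_candidate({"extended": EXTENDED, "hairpin": HAIRPIN}, CONTACTS)
```

Here the hairpin satisfies both predicted contacts and the extended chain satisfies neither, so the constraints single out the fold that brings residues 0 and 5 (far apart in sequence) close together in space.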
Summary takeaway
- The shape of a protein—the 3D arrangement of its amino acids—drives its function.
- Predicting folding from sequence is challenging, but leveraging co-evolutionary constraints derived from genome data significantly improves accuracy.
- Computational design is enabling the creation of proteins with novel functions and applications in medicine and technology, with broad potential but also important ethical and practical considerations.