Protein Structure and Function W1

Proteins: Basic Concepts

The diversity of proteins is potentially infinite, arising from the vast number of ways amino acids can be combined.
Proteins are polypeptides, which are linear heteropolymers of amino acids linked by peptide bonds. They are the most abundant macromolecules in cells and perform a wide array of functions.
Size range: Relative Molecular Mass (RMM) typically runs from a few thousand to several million, with a common range of $10^4 - 10^6$.
Twenty standard amino acids are genetically encoded and used in protein synthesis, each with a unique side chain (R-group) dictating its chemical properties.
- More amino acids, beyond the standard 20, can be created by post-translational modification (e.g., hydroxylation, phosphorylation, glycosylation), expanding functional diversity.
Point mutations, which cause a single amino acid change, often demonstrate the critical importance of the precise amino acid sequence for protein structure and function.
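The scale of this sequence diversity can be made concrete with a short calculation. The following is a minimal sketch, assuming that any of the 20 standard amino acids can occupy each position independently. The function names are illustrative, and the average residue mass of about 110 is a commonly quoted textbook approximation rather than a value taken from these notes.

```python
# Count possible sequences and estimate chain mass.
# Assumption: every position can independently hold any of the 20
# standard amino acids (post-translational modifications are ignored).
STANDARD_AMINO_ACIDS = 20

# Assumption: ~110 is a commonly quoted average residue mass; it is only
# a rough guide for converting chain length to relative molecular mass.
AVERAGE_RESIDUE_MASS = 110


def sequence_count(length: int) -> int:
    """Return 20**length, the number of distinct sequences of that length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return STANDARD_AMINO_ACIDS ** length


def approximate_rmm(length: int) -> int:
    """Rough relative molecular mass of a chain with `length` residues."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return length * AVERAGE_RESIDUE_MASS


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, sequence_count(n), approximate_rmm(n))
```

Even a modest chain of 100 residues has far more possible sequences than could ever be sampled, which is why the diversity is described as potentially infinite.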

Protein Properties and 3D Shape

Protein properties, including their biological activity and interactions, are critically dependent on their precise three-dimensional (3D) shape, also known as their native conformation.
The native (functional) shape can be destroyed by denaturation, a process that disrupts the non-covalent interactions and disulfide bonds that stabilize the 3D structure without breaking peptide bonds.
- Denaturation causes a loss of the specific spatial arrangement, leading to a loss of function.
- Examples of denaturing agents: mild heat (increases kinetic energy, breaking weak bonds), extreme pH (alters ionization states of side chains, disrupting ionic and hydrogen bonds), strong detergents (disrupt hydrophobic interactions), chaotropic agents (e.g., urea, guanidinium chloride, interfere with hydrogen bonding of water, destabilizing hydrophobic effect).
Basic 3D shapes of proteins:
- Globular: Very compact for its size, roughly spherical or ovoid. These proteins are typically soluble in aqueous solutions and perform dynamic functions (e.g., enzymes, transport proteins, regulatory proteins).
- Fibrous: Much longer than wide, often forming long filaments, rods, or sheets. They are typically water-insoluble and possess structural roles (e.g., collagen, keratin).
- Random coil: Effectively shapeless, constantly fluctuating, floppy, and less compact than globular proteins. These often represent denatured states or intrinsically disordered proteins that lack a stable 3D structure under physiological conditions.
Simple proteins: Contain only amino acid residues.
Conjugated proteins: Contain a prosthetic group of non-amino acid nature, which is an essential, tightly bound non-polypeptide component required for biological activity.

Conjugated Proteins and their Prosthetic Groups

Lipoproteins:
- Prosthetic Group: Lipids (e.g., cholesterol, phospholipids)
- Example: $\beta_1$-Lipoprotein of blood, involved in lipid transport.
Glycoproteins:
- Prosthetic Group: Carbohydrates (oligosaccharides or polysaccharides)
- Example: Immunoglobulin G, antibodies with crucial roles in immune recognition; cell surface receptors, components of extracellular matrix.
Phosphoproteins:
- Prosthetic Group: Phosphate groups (often covalently attached to serine, threonine, or tyrosine residues)
- Example: Casein of milk, a key nutrient; many regulatory enzymes are activated/deactivated by phosphorylation.
Hemoproteins:
- Prosthetic Group: Heme (iron porphyrin, a complex organic molecule containing an iron ion)
- Example: Hemoglobin (oxygen transport in blood), myoglobin (oxygen storage in muscle), cytochromes (electron transport).
Flavoproteins:
- Prosthetic Group: Flavin nucleotides (FAD or FMN, derived from riboflavin)
- Example: Succinate dehydrogenase, a crucial enzyme in the citric acid cycle and electron transport chain.
Metalloproteins: Contain specific metal ions as prosthetic groups, which are often involved in catalysis, electron transfer, or structural stabilization.
- Prosthetic Group: Iron (e.g., Ferritin, an iron storage protein; some alcohol dehydrogenases also use iron, though the zinc-dependent form is far more common).
- Prosthetic Group: Zinc (e.g., Alcohol dehydrogenase, carbonic anhydrase, and many proteases such as carboxypeptidase A).
- Prosthetic Group: Calcium (e.g., Calmodulin, a regulatory protein involved in signal transduction; parvalbumin).
- Prosthetic Group: Molybdenum (e.g., Dinitrogenase, involved in nitrogen fixation).
- Prosthetic Group: Copper (e.g., Plastocyanin, involved in photosynthesis; cytochrome c oxidase).

3D Structure and Compaction

Native proteins are generally compact macromolecular structures, often globular, and do not resemble long, floppy, extended chains. This compaction is critical for their stability and function.
In forming compact structures, the primary driving force is the hydrophobic effect:
- Hydrophobic (non-polar) side chains cluster spontaneously in the interior (center) of the protein, shielded from the aqueous environment.
- Polar (hydrophilic) side chains are predominantly located on the outside (surface), where they can interact favorably with water through hydrogen bonds and ionic interactions, making the protein soluble.
- Numerous weak, non-covalent interactions (hydrogen bonds, ionic bonds, van der Waals forces) form within the protein, stabilizing the specific 3D architecture on top of the hydrophobic core.

Globular Protein Structure Details

Surface: Characterized by an abundance of polar and charged side chains, which interact with the surrounding water molecules, making the protein soluble in physiological buffers. This hydrophilic surface allows the protein to move freely within cells and interact with other molecules.
Interior: Primarily hydrophobic and tightly packed with non-polar side chains. These non-polar residues are shielded from water, forming a stable, non-aqueous environment beneficial for specific functions (e.g., enzyme active sites).
- Few polar groups, such as those forming hydrogen bonds or participating in catalysis, are present in the interior, and if they are, they are typically involved in strong, protective interactions.
- Strong hydrogen bonds almost always exist between any peptide $C=O$ and $N-H$ groups that are buried in the interior, neutralizing their polarity and contributing significantly to structural stability (e.g., in $\alpha$-helices and $\beta$-sheets).

Membrane Proteins

Approximately 25% of human proteins are membrane proteins, embedded within or associated with cellular membranes.
Unlike cytoplasmic (globular) proteins, which are typically hydrophilic on their exterior, membrane proteins are characterized by significant hydrophobic regions on their outer surfaces. These hydrophobic surfaces interact favorably with the non-polar tails of the lipid bilayer, anchoring the protein within the membrane. Their interior may still contain hydrophilic regions for channels or active sites.

Levels of Protein Structure

Protein structure is hierarchically organized, with each level contributing to the overall 3D conformation and function:
- Primary structure: The linear sequence of amino acids linked by peptide bonds, extending from the N-terminus to the C-terminus. This sequence is predominantly genetically coded and dictates all higher levels of structure.
- Secondary structure: Regular, recurring local folding patterns of the polypeptide chain, stabilized by hydrogen bonds between the backbone peptide carbonyl ( $C=O$ ) and amide ( $N-H$ ) groups. Common examples include the $\alpha$-helix and $\beta$-sheet.
- Tertiary structure: The overall three-dimensional folding of the entire polypeptide backbone and its side chains, forming a compact, globular structure. This level is stabilized by various weak interactions (hydrophobic interactions, hydrogen bonds, ionic bonds, van der Waals forces) and disulfide bridges between side chains.
- Quaternary structure: The specific non-covalent association of multiple independent polypeptide subunits (each with its own tertiary structure) to form a larger, biologically active multimeric protein complex. Not all proteins possess a quaternary structure.

Primary Structure: The Peptide Chain

The peptide bond, formed by a condensation reaction between the carboxyl group of one amino acid and the amino group of another, has partial double-bond character. This makes it rigid and planar, restricting rotation around the $C-N$ bond.
The peptide chain, however, has limited flexibility due to rotation around two dihedral angles per amino acid residue:
- phi ( $\Phi$ ): Describes rotation around the $N-C_\alpha$ bond.
- psi ( $\Psi$ ): Describes rotation around the $C_\alpha-C$ bond.
By convention, both angles are assigned a value of $180$ degrees when the polypeptide is fully extended, which is how standard diagrams show them. The specific combinations of these angles determine the local conformation of the polypeptide backbone and thus the secondary structure.
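The two backbone dihedral angles can be computed directly from atomic coordinates. The following is a minimal sketch, assuming coordinates are supplied as plain (x, y, z) tuples; the function names are illustrative. It uses the standard four-point dihedral formula, with the convention that a cis arrangement gives 0 degrees and a fully extended (trans) arrangement gives 180 degrees.

```python
import math


def _sub(a, b):
    """Vector a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the plane p0, p1, p2
    n2 = _cross(b1, b2)  # normal to the plane p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Rotation about N-C_alpha: atoms C(i-1), N(i), C_alpha(i), C(i)."""
    return dihedral_degrees(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """Rotation about C_alpha-C: atoms N(i), C_alpha(i), C(i), N(i+1)."""
    return dihedral_degrees(n, ca, c, n_next)
```

The same four-point routine serves both angles; only the choice of atoms differs.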

Secondary Structure

Secondary structures are regular, repeating arrangements of local regions of the polypeptide backbone, characterized by repeating phi ( $\Phi$ ) and psi ( $\Psi$ ) angles at successive residues.
Stabilization is provided exclusively by hydrogen bonds between the backbone peptide carbonyl ( $C=O$ ) oxygen of one amino acid and the peptide amide ( $N-H$ ) proton of another amino acid within the polypeptide backbone.
- Crucially, nearly all peptide $C=O$ and $N-H$ groups that are part of a regular secondary structure are involved in these hydrogen bonds, contributing to the stability and regular geometry.

Main Secondary Structures

The $\alpha$ Helix:
- A helical structure, resembling a coiled ribbon, where the polypeptide backbone is coiled around an imaginary central axis.
- Stabilized by hydrogen bonds between the backbone carbonyl oxygen of residue $i$ and the amide proton of residue $(i+4)$ . These H-bonds are roughly parallel to the helix axis.
- Pitch of $5.4 \mathring{A}$ (the axial distance per turn), containing approximately $3.6$ amino acid residues per turn. It is typically right-handed.
- Many $\alpha$-helices are amphipathic, meaning they have one hydrophilic face and one hydrophobic face, which influences protein interactions and membrane insertion.
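The helix parameters quoted above imply a few derived quantities that are easy to check. The following is a minimal sketch; the function names are illustrative, and the only inputs are the pitch, the residues per turn, and the $i$ to $(i+4)$ hydrogen-bonding rule stated in these notes.

```python
# Geometry of an ideal alpha helix, using the figures quoted in the notes.
PITCH_ANGSTROM = 5.4       # axial distance per turn
RESIDUES_PER_TURN = 3.6


def rise_per_residue() -> float:
    """Axial advance per residue, in angstroms (about 1.5)."""
    return PITCH_ANGSTROM / RESIDUES_PER_TURN


def rotation_per_residue() -> float:
    """Rotation about the helix axis per residue, in degrees (about 100)."""
    return 360.0 / RESIDUES_PER_TURN


def helix_length(n_residues: int) -> float:
    """Approximate axial length of an n-residue helix, in angstroms."""
    return n_residues * rise_per_residue()


def backbone_hbond_pairs(n_residues: int):
    """(i, i+4) pairs: C=O of residue i bonded to N-H of residue i+4.

    Residues are numbered from 1.
    """
    return [(i, i + 4) for i in range(1, n_residues - 3)]


if __name__ == "__main__":
    print(rise_per_residue(), rotation_per_residue())
    print(helix_length(20))
    print(backbone_hbond_pairs(8))
```

Because each residue is rotated by roughly 100 degrees from the previous one, residues spaced three or four positions apart lie on the same face of the helix, which is how a single helix can present distinct hydrophobic and hydrophilic faces.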
$\beta$ Sheets:
- Formed by hydrogen bonding between two or more polypeptide strands that lie side-by-side. The backbone is almost fully extended in a zig-zag pattern.
- Can be antiparallel (adjacent strands run in opposite N-to-C directions), which typically results in stronger and more numerous inter-strand hydrogen bonds, often creating a pleated appearance.
- Can be parallel (adjacent strands run in the same N-to-C direction), which leads to less stable hydrogen bonding geometry.
- Hydrogen bonds form between $C=O$ groups of one strand and $N-H$ groups of an adjacent strand.
- Often form the core structure of globular proteins, creating a rigid platform.
$\beta$ Turns (also known as reverse turns or hairpins):
- Short, compact structures that enable the polypeptide chain to reverse direction abruptly, often found at the surface of globular proteins.
- Typically involve four amino acid residues, with a hydrogen bond forming between the carbonyl oxygen of residue $i$ and the amide proton of residue $(i+3)$ .
- Residues $2$ and $3$ typically do not contribute to intrachain hydrogen bonding within the turn, allowing flexibility.
- Proline (Pro), due to its cyclic side chain and rigid conformation, is often found at position $2$ in Type II $\beta$ turns, facilitating the tight bend.
- Glycine (Gly), whose side chain is a single hydrogen atom and therefore imposes little steric restriction, is often found at position $3$ in Type II $\beta$ turns, allowing for specific backbone angles.
- Residues $2$ and $3$ can form hydrogen bonds with the surrounding medium (e.g., water), further stabilizing the turn at the protein's surface.

Ramachandran Plot and Favored Angles

The Ramachandran plot is a graphical representation of the energetically allowed and favored regions for the phi ( $\Phi$ ) and psi ( $\Psi$ ) dihedral angles in the polypeptide backbone. It reveals sterically permissible combinations of these angles.
Secondary structures ($\alpha$-helices and $\beta$-sheets) correspond to specific, relatively narrow regions on the Ramachandran plot, highlighting the restricted rotational freedom due to steric clashes between atoms.
Plot analysis helps validate protein structures and understand conformational preferences of amino acids.
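The idea that each secondary structure occupies its own region of the plot can be illustrated by assigning a measured angle pair to the nearest idealized conformation. The following is a rough sketch only. The reference angles are commonly quoted idealized textbook values and are an assumption, not values from these notes, and real validation tools use empirically derived contour regions rather than single points.

```python
import math

# Assumption: idealized (phi, psi) values in degrees, as commonly quoted
# in textbooks. They stand in for the much broader allowed regions.
IDEAL_ANGLES = {
    "right-handed alpha helix": (-57.0, -47.0),
    "antiparallel beta sheet": (-139.0, 135.0),
    "parallel beta sheet": (-119.0, 113.0),
}


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_conformation(phi: float, psi: float):
    """Return (name, distance) of the closest idealized conformation."""
    best_name, best_dist = None, math.inf
    for name, (ref_phi, ref_psi) in IDEAL_ANGLES.items():
        dist = math.hypot(
            angular_difference(phi, ref_phi),
            angular_difference(psi, ref_psi),
        )
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


if __name__ == "__main__":
    print(nearest_conformation(-60.0, -45.0))
    print(nearest_conformation(-140.0, 130.0))
```

The wrap-around handling in `angular_difference` matters because the plot is periodic: an angle of 170 degrees is only 20 degrees away from -170 degrees.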

Structure-Promoting Probabilities for Amino Acids

The propensity of an amino acid to adopt a particular secondary structure is influenced by the chemical properties and steric bulk of its side chain. These propensities are derived from observing amino acids in known protein structures.
Amino Acids favoring $\alpha$ Helix: Glu, Met, Ala, Leu, Lys, Phe, Gln, Trp, Ile, Val. These often have relatively small or linear side chains that fit well into the helical structure without causing steric hindrance.
Amino Acids favoring $\beta$ Conformation: Ile, Val, Tyr, Phe, Trp, Thr, Cys. These often have bulky side chains, either branched at the $\beta$-carbon (e.g., Ile, Val, Thr) or aromatic (e.g., Tyr, Phe, Trp), which are accommodated more easily in an extended strand than in a helix.
Amino Acids favoring $\beta$ Turn: Asp, His, Arg, Thr, Ser, Cys, Asn, Tyr, Pro, Gly. Glycine and Proline are particularly prominent due to their unique conformational flexibility (Gly) or rigidity (Pro) suited for tight turns.
Note: Some amino acids show preference for multiple secondary structures, indicating their versatility or context-dependent roles.
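A very crude way to use such preference lists is to slide a window along a sequence and report what fraction of its residues belong to a given list. The following is a toy sketch, not a real prediction algorithm. It uses only the helix-favoring and turn-favoring lists given above, translated to one-letter codes, and the window length of 6 and all names are illustrative assumptions.

```python
# Residue lists taken from the notes, written in one-letter code.
HELIX_FAVORING = set("EMALKFQWIV")  # Glu Met Ala Leu Lys Phe Gln Trp Ile Val
TURN_FAVORING = set("DHRTSCNYPG")   # Asp His Arg Thr Ser Cys Asn Tyr Pro Gly


def window_fractions(sequence: str, favored: set, window: int = 6):
    """Fraction of residues in `favored` for each window along the sequence.

    Assumption: a simple unweighted count; real methods use numeric
    propensity scores and additional context rules.
    """
    seq = sequence.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        sum(residue in favored for residue in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    example = "AEELLKKAGNPDGS"  # hypothetical sequence for illustration
    print(window_fractions(example, HELIX_FAVORING))
    print(window_fractions(example, TURN_FAVORING))
```

Stretches where the helix fraction stays high are candidate helices, while a run of turn-favoring residues suggests a chain reversal.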

Protein 3D Structure Stabilization

Protein 3D structure is primarily stabilized by a vast number of individually weak, non-covalent interactions, including:
- Hydrophobic interactions: The strongest driving force, arising from the entropic gain of water molecules when non-polar groups are sequestered from the aqueous environment.
- Hydrogen bonds: Electrostatic interactions between a hydrogen atom covalently bonded to an electronegative atom (N, O) and another electronegative atom.
- Ionic bonds (salt bridges): Electrostatic interactions between oppositely charged amino acid side chains (e.g., Lys and Asp).
- Van der Waals forces: Weak, transient attractive forces between all atoms, arising from temporary fluctuations in electron distribution.
The cumulative effect of these many weak bonds provides significant stability, but makes the native structure of a protein easily denatured by relatively mild changes in environment, as individual bonds are readily broken.

The Disulfide Bridge

A disulfide bridge (or disulfide bond) is a strong covalent bond formed by the oxidation of the thiol (-SH) groups of two cysteine residues to form a cystine residue (containing the -S-S- bond).
This covalent bond is significantly stronger than any non-covalent interaction and plays a crucial role in stabilizing the tertiary and quaternary structures, particularly in extracellular proteins which face harsher environments.
The reaction is reversible: reduction of cystine yields two cysteine residues, an important factor in protein folding and unfolding processes.
Represented as:
$H_3N^+-CH(COO^-)-CH_2-SH + HS-CH_2-CH(COO^-)-NH_3^+ \rightleftharpoons H_3N^+-CH(COO^-)-CH_2-S-S-CH_2-CH(COO^-)-NH_3^+ + 2H^+ + 2e^-$

Classification of Amino Acids

A major characteristic distinguishing amino acids is the water solubility (hydrophobicity/hydrophilicity) and charge of their side chain (R-group). This property dictates their location in a folded protein and their role in interactions.
Amino acids are broadly classified into categories such as:
- Nonpolar, aliphatic: Gly, Ala, Val, Leu, Ile, Met, Pro
- Aromatic: Phe, Tyr, Trp
- Polar, uncharged: Ser, Thr, Cys, Asn, Gln
- Positively charged (basic): Lys, Arg, His
- Negatively charged (acidic): Asp, Glu
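This classification maps naturally onto a lookup table, which makes it easy to summarize the composition of a peptide by side-chain class. The following is a minimal sketch using exactly the five categories listed above; the variable and function names are illustrative.

```python
from collections import Counter

# The five side-chain classes exactly as listed in the notes.
SIDE_CHAIN_CLASSES = {
    "nonpolar, aliphatic": ["Gly", "Ala", "Val", "Leu", "Ile", "Met", "Pro"],
    "aromatic": ["Phe", "Tyr", "Trp"],
    "polar, uncharged": ["Ser", "Thr", "Cys", "Asn", "Gln"],
    "positively charged (basic)": ["Lys", "Arg", "His"],
    "negatively charged (acidic)": ["Asp", "Glu"],
}

# Reverse lookup: three-letter residue code -> class name.
CLASS_OF = {
    residue: class_name
    for class_name, members in SIDE_CHAIN_CLASSES.items()
    for residue in members
}


def class_composition(residues):
    """Count how many residues fall into each side-chain class."""
    return Counter(CLASS_OF[residue] for residue in residues)


if __name__ == "__main__":
    peptide = ["Met", "Lys", "Asp", "Glu", "Phe", "Ser"]  # hypothetical
    for class_name, count in class_composition(peptide).items():
        print(class_name, count)
```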

Secondary Structure Prediction

Predictions for secondary structure (e.g., $\alpha$ helices, $\beta$ sheets, turns) can be made with reasonable accuracy based on the known propensities and contextual information within the amino acid sequence.
Methods utilize sequence properties like hydropathy (tendency to interact with water) and propensity scores (statistical likelihood of an amino acid being in a certain structure).
Example: Prediction for adenylate kinase, showing good but not perfect agreement between observed vs. predicted helices, sheets, and turns, highlighting the complexity of sequence-to-structure mapping.
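A hydropathy profile is one of the simplest sequence-based analyses of this kind. The following is a minimal sketch, assuming the widely used Kyte-Doolittle hydropathy scale, which is not named in these notes and is introduced here only for illustration. The window length and function name are likewise assumptions.

```python
# Assumption: Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5,
    "M": 1.9, "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8,
    "W": -0.9, "Y": -1.3, "P": -1.6, "H": -3.2, "E": -3.5,
    "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_profile(sequence: str, window: int = 9):
    """Mean hydropathy of each window of residues along the sequence."""
    seq = sequence.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    example = "MKKLLVIAAVLLGGSDEKR"  # hypothetical sequence
    for position, value in enumerate(hydropathy_profile(example), start=1):
        print(position, round(value, 2))
```

Sustained positive stretches point to buried or membrane-facing segments, and negative stretches to surface-exposed ones. On its own this says little about whether a segment is a helix or a strand, which is one reason predictions agree well but not perfectly with observed structures.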

Super-Secondary Structures (Motifs)

Super-secondary structures, also known as motifs or folds, are recurring combinations of two or more secondary structural elements ($\alpha$-helices, $\beta$-sheets) that are often found connected by turns or loops.
These specific geometric arrangements are stable and recur frequently in protein structures, often associated with a particular function or binding capacity.
Examples:
- $\beta$ Hairpin: A simple motif consisting of two adjacent antiparallel $\beta$ strands connected by a short loop or turn, effectively reversing the direction of the polypeptide chain.
- Greek Key: A more complex arrangement, typically involving four antiparallel $\beta$ strands folded into a specific coiled pattern, reminiscent of the Greek key pattern found in ancient art.
- Helix-Turn-Helix: A common DNA-binding motif, often composed of two $\alpha$ helices connected by a short turn. One helix positions itself in the major groove of DNA.
- Calcium-binding motif (EF-hand): Often involves a helix-turn-helix like structure, where the loop region coordinates calcium ions.
- $\beta-\alpha-\beta$ motif: A $\beta$ strand connected to an adjacent $\beta$ strand by an intervening $\alpha$ helix, a very common arrangement found in many enzymes.

Protein Domains

A domain represents an organizational level between secondary structure and tertiary structure, defining an independent folding unit within a larger polypeptide chain.
Domains are compact, semi-independent structural and functional units with a recognizable substructure of secondary motifs. A single polypeptide chain can consist of one or multiple domains, each typically 50-200 amino acids long.
Domains often perform specific, identifiable functions (e.g., catalytic activity, ligand binding, DNA binding) and can be found in a number of different proteins, reflecting their modular nature in evolution.
While the entire tertiary structure of a protein may consist of a single domain, it frequently comprises an assembly of multiple domains linked by flexible loops.
Interestingly, domains observed in the tertiary structure of a protein in one species might exist as separate, individual polypeptide subunits in the equivalent protein in another species (e.g., components of fatty acid synthase can be single domains in mammals but separate subunits in bacteria).

Domain Families Examples

Rossmann domain: A very common nucleotide-binding domain (e.g., for NAD$^+$, NADH, FAD), characterized by a conserved $\beta-\alpha-\beta$ pattern often found in dehydrogenases and reductases.
Alpha-Beta Plait: A topology often associated with specific functions, though less broadly defined than Rossmann.
Arc repressor: A small, dimeric protein acting as a transcription factor, known for its specific DNA binding domain.
TIM barrel: A highly stable and common protein fold, consisting of an alternating $\alpha/\beta$ barrel architecture, found in many metabolically diverse enzymes.
Immunoglobulin domain: A characteristic compact $\beta$-sandwich fold found in antibodies and other immune system proteins, crucial for antigen recognition and cellular interactions.

Domains and Evolution

Homologous Proteins: Different proteins that share a common evolutionary ancestor often contain the same domains. This structural resemblance can persist even when amino acid sequences have diverged significantly, making direct sequence comparison difficult but structural comparison powerful for inferring evolutionary relationships.
Gene Rearrangement: The modular nature of domains suggests that novel proteins appear to have been assembled from domains originating in different proteins through evolutionary processes like gene fusion (joining of two genes to form a single hybrid gene), genetic transposition, or chromosome rearrangement.
- This implies that a domain can correspond to a genetic unit below the level of a complete gene, acting as building blocks for diversified protein function.
Eukaryotic Exons: In eukaryotes, protein domain boundaries are thought to often correspond to exon boundaries (coding regions) within a gene. This supports the idea that exons encode functional units that can be recombined through alternative splicing or gene rearrangement to create new proteins.

Examples of Domains and Evolution

EGF (Epidermal Growth Factor) domains: These are conserved domains homologous to EGF, a small polypeptide of $53$ amino acids. They are found in many extracellular and transmembrane proteins, mediating protein-protein interactions.
Chymotrypsin-like serine proteinase domains: Homologous to chymotrypsin (about $245$ amino acids arranged in two domains), these domains are characteristic of a large family of proteolytic enzymes involved in digestion and blood clotting.
Kringle domains: Characterized by three internal disulfide bridges within an $85$ -amino acid region, forming a distinctive loop structure. Found in blood coagulation and fibrinolytic proteins (e.g., plasminogen).
Calcium-binding domain (e.g., EF-hand): A specific domain that binds calcium ions, often regulating protein activity in response to calcium signals.

Abundance of Known Domains

The increasing number of identified protein interaction domains underscores the complexity of cellular machinery and signaling networks. Examples include: 14-3-3, ANK repeat, BAR, BEACH, BH1-BH4, BIR, BRCT, Bromo, BTB, C1, C2, CC, CARD, CALM, CH, Chr, CUE, DD, DED, DEP, DH, EH, EFh, ENTH, EVH1, F-box, FERM, FF, FH2, FHA, FYVE, GAT, GEL, GLUE, GRAM, GRIP, GYF, HEAT, hect, IQ, LIM, LRR, MBT, MH1, MH2, MIU, NZF, PAS, PB1, PDZ, Polo, PH, PTB, PUF, SPWWP, PX, RGS, RING, SAM, Box, SC, SH3, SOCS, SH2, SPRY, START, SWIRM, TIR, TPR, TRAF, tsnare, Tubby, TUDOR, UBA, UEV, UIM, VHLB, VHS, W, WW, PRIN. Each of these mediates specific molecular interactions.

Protein Structure Determination Methodologies

First Steps: Protein Preparation (Traditional Methods)

Purification: A critical initial step to obtain a homogeneous sample, often requiring >97% purity from cells or tissues. Can require large quantities (many grams) of starting material.
- Procedures include various types of fractional precipitation (e.g., ammonium sulfate cut, isoelectric precipitation) and a series of chromatographic techniques (e.g., ion-exchange, hydrophobic interaction, size-exclusion, affinity chromatography) tailored to the protein's properties.
Physical Properties Characterization: After purification, initial characterization provides crucial information:
- Mass / subunit makeup: Determined by methods like size-exclusion chromatography (determines hydrodynamic radius/molecular weight in solution), analytical centrifugation (sedimentation coefficient, molecular weight), SDS-PAGE (denaturing electrophoresis for subunit molecular weight), and mass spectrometry with/without protein digestion (highly accurate molecular weight, peptide mapping).
Protein Sequencing: Historically performed by Edman degradation, which sequentially removes and identifies N-terminal amino acids. Modern approaches often combine mass spectrometry with genomic sequence data.
Secondary Structure Prediction: From amino acid sequence, using computational algorithms that analyze amino acid propensities and patterns.
Physical Measurements: E.g., circular dichroism (CD) spectroscopy, which measures the differential absorption of left and right circularly polarized light to estimate the content of $\alpha$-helix, $\beta$-sheet, and random coil structures in solution.

Modern Methods: Gene-based Protein Preparation

These methods leverage molecular biology to produce proteins efficiently.
Cloning: The gene encoding the target protein is cloned into a suitable expression system (e.g., bacteria such as E. coli, yeast, baculovirus in insect cells, or mammalian cells) for overexpression, allowing production of large quantities (milligrams to grams) of recombinant protein.
Epitope Tagging: A short peptide tag (epitope) is genetically fused to the protein (e.g., a His-tag, typically six sequential histidines). This allows for highly efficient, one-step purification using affinity chromatography (e.g., nickel affinity chromatography for His-tags), significantly simplifying and speeding up the purification process.

Determination of 3D Structure

The primary goal of structural biology is to determine the precise 3D atomic coordinates of a protein, utilizing a range of powerful biophysical techniques:
- NMR Spectroscopy: Used for proteins in solution, particularly effective for smaller to medium-sized proteins (up to ~30-40 kDa). Provides information on dynamics and local environments.
- X-ray Crystallography: The most established and widely used technique, for proteins in well-ordered crystal form. Can determine structures of very large proteins and complexes at atomic resolution.
- Electron Microscopy (Cryo-EM): An emerging and rapidly advancing technique, particularly powerful for determining structures of large macromolecular complexes, membrane proteins, or proteins that are difficult to crystallize. It captures images of frozen protein samples.
- Computer Prediction from Sequence:
- Homology modeling (comparative modeling): Starting with approximate structures by comparison with related known structures (templates), then refining by energy minimization. This is the most accurate prediction method when a good template exists.
- Ab initio prediction (de novo prediction): Predicting structure from sequence without reliance on any known template structures. This is computationally much more challenging and generally less accurate, but rapidly improving with methods like AlphaFold.

NMR Spectroscopy

Principle: Measures transitions between nuclear spin states when nuclei (e.g., $^{1}H$, $^{13}C$, $^{15}N$) are placed in a strong magnetic field and exposed to radiofrequency pulses. The energy separation ( $\Delta E$ ) for these transitions, and thus the resulting NMR signal's frequency (chemical shift), is highly dependent on the local chemical environment of each nucleus.
2D NMR (NOESY - Nuclear Overhauser Effect Spectroscopy): A powerful technique that detects spatial interactions between nuclei. The intensity of an NOE cross-peak is inversely proportional to the sixth power of the distance between the two interacting nuclei ( $1/r^6$ ).
- This means that only nuclei within about $5 \mathring{A}$ (0.5 nm) of each other will show a detectable interaction.
- By identifying thousands of such short-distance NOE contacts across the protein, NOESY allows the calculation of 3D structures by establishing a network of spatial constraints that define the protein's folded conformation.
- Example: NOESY structure of a cellulase domain, illustrating a typical output of such studies.
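The steep $1/r^6$ dependence explains why NOE contacts are limited to short distances, and it also allows an unknown distance to be estimated from a cross-peak intensity once a reference pair of known separation is available. The following is a minimal sketch of that arithmetic. The function names and the reference distance in the example are illustrative assumptions, and real structure calculations convert intensities into loose distance ranges rather than exact values.

```python
def relative_noe_intensity(distance: float, reference_distance: float) -> float:
    """NOE intensity at `distance`, relative to the intensity at
    `reference_distance`, assuming intensity is proportional to 1/r**6."""
    if distance <= 0 or reference_distance <= 0:
        raise ValueError("distances must be positive")
    return (reference_distance / distance) ** 6


def distance_from_intensity(
    intensity: float, reference_intensity: float, reference_distance: float
) -> float:
    """Estimate a distance from a cross-peak intensity by calibrating
    against a reference pair of known separation."""
    if intensity <= 0 or reference_intensity <= 0 or reference_distance <= 0:
        raise ValueError("inputs must be positive")
    return reference_distance * (reference_intensity / intensity) ** (1.0 / 6.0)


if __name__ == "__main__":
    # Hypothetical reference pair 2.5 angstroms apart.
    for r in (2.5, 3.5, 5.0):
        print(r, relative_noe_intensity(r, 2.5))
```

Doubling the distance reduces the signal by a factor of 64, so contacts much beyond a few angstroms fall below the detection limit.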

X-ray Crystallography (Detailed Procedure)

Crystal Growth: The first and often most challenging step is to grow a well-ordered, diffraction-quality single crystal of the purified protein. This involves:
- Large-scale initial screening of crystallization conditions (e.g., $1536$ conditions tested simultaneously) to find initial