Proteomics and the Proteome 1

From Individual Proteins to the Biological System

The study of proteins can be viewed as a progression from analyzing an individual protein to understanding the "proteome" within a larger biological system. Such systems include complex intracellular signaling pathways, such as the Wnt pathway, whose components include:

Extracellular and Membrane Components: Dkk, Wnt, SFRPs, WIF-1, Cer, Kremen, LRP5/6, and Frizzled (Fz).
Cytoplasmic Components: Adherens Junctions, Cytoskeleton, $\beta\text{-Catenin}$, GBP, Frat, $\text{CKI}\varepsilon$, and the destruction complex including $\text{GSK-3}\beta$, APC, and Axin.
Post-translational events: In the $\text{-Wnt}$ state, $\beta\text{-Catenin}$ is phosphorylated and ubiquitinated (E3 ligase $\beta\text{-TrCP}$), leading to its degradation by the proteasome. Conversely, in the $\text{+Wnt}$ state, $\beta\text{-Catenin}$ is stabilized.
Nuclear Transcription: Stabilized $\beta\text{-Catenin}$ interacts with Lef-1/Tcf and TBP to initiate transcription of targets like Fibronectin, Cyclin D1, and c-myc.

Core Definitions: Proteome and Proteomics

Proteome

The entire PROTein complement expressed by a genOME (e.g., "the human proteome").
Alternatively, the entire protein complement expressed by a specific cell or tissue (e.g., "the liver proteome").
Proteomes are considered hypothetical because we do not yet know every component of a given proteome, such as the human proteome.

Proteomics

Refers to the qualitative and quantitative studies of protein expression and function.
It encompasses the techniques and methodologies used to study the proteome.
The term was defined by Wilkins et al. in Biotechnol. Genet. Eng. Rev. (1996).

The Scientific Necessity for Proteomics

While genome sequencing can predict protein sequences, proteomics is essential for several reasons:

Functional Dependence: Protein function is dependent on its three-dimensional structure, post-translational modifications, and molecular interactions. These features are NOT predictable from the primary sequence alone.
Abundance Disconnect: The abundance of mRNA in a cell may not reflect the actual level or activity of the corresponding proteins.
Localization: Protein function often depends on its specific location within the cell.
Complexity of Gene Expression: There is a one-to-many relationship between genes and the polypeptides/proteins they produce.

The Complexity and Diversity of the Proteome

The proteome is defined as "the entire complement of proteins that is or can be expressed by a cell, tissue, or organism." Its complexity arises from molecular diversity:

Protein Types: Variations include globular and membrane proteins.
Physico-chemical Properties: Differences in amino-acid compositions and domains.
Functions: Ranging from structural roles to catalytic activities.
Compartmentalization: Presence in different cellular areas such as the nucleus or cytosol.

Dynamic Nature

Proteomes are not static. While the genome remains largely the same, the proteome changes across life cycles (e.g., the transition from a caterpillar to a butterfly). Proteomes change in real time in response to:

Environment
Infectious agents
Cellular cycles (circadian rhythms, cell division)
Hormones

Post-Translational Modifications (PTMs)

PTMs on a single protein significantly increase complexity. A primary example is the p53 protein, an important tumor suppressor whose inactivating mutations are found in over $50\%$ of cancers. PTMs regulate its activity through processes such as:

Phosphorylation
Acetylation

Combinatorial Complexity of PTMs

Modifications follow combinatorial rules, as illustrated in Cell Signaling (Garland Science, 2015):

3 sites with 1 modification type/site: $2^3 = 8$ possible states.
5 sites with 1 modification type/site: $2^5 = 32$ possible states.
3 sites with 2 modification types/site: $3^3 = 27$ possible states.
3 sites with 3 modification types/site: $4^3 = 64$ possible states.
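
The four counts above follow one rule: each site is either unmodified or carries one of its possible modification types, so a protein with n sites and m types per site has (m + 1)^n states. A minimal sketch of that arithmetic, assuming every site is independent and has the same number of modification types (the function name `ptm_states` is made up for illustration):

```python
def ptm_states(sites: int, types_per_site: int) -> int:
    """Count modification states of one protein.

    Each site has (types_per_site + 1) options: unmodified, or exactly
    one of the available modification types. Sites are assumed independent.
    """
    return (types_per_site + 1) ** sites


# The four cases listed in the notes.
for sites, types in [(3, 1), (5, 1), (3, 2), (3, 3)]:
    print(f"{sites} sites, {types} type(s)/site: {ptm_states(sites, types)} states")
```

The count grows exponentially with the number of sites, which is why a heavily modified protein such as p53 can exist in a very large number of distinct forms.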

Protein Expression Variations and the Plasma Proteome

Protein molecule "copy numbers" per cell vary drastically, ranging from hundreds to over $20 \times 10^6$.

Most Abundant Proteins: Metabolic enzymes, Ribosomal proteins, Structural proteins, and Heat-shock proteins.
Least Abundant Proteins: Signaling proteins and Transcription factor proteins.

Blood Plasma Proteome

Blood plasma is the liquid part of blood (minus cells). It possesses an extremely diverse proteome because it bathes all cells in the body. However, analyzing it is challenging:

The top $10$ most abundant proteins constitute $95\%$ of the plasma proteome.
Albumin alone makes up $55\%$.
Protein concentrations in blood plasma span $9$ orders of magnitude (a factor of $1 \times 10^9$), making low-abundance proteins very difficult to detect.
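
To put these figures side by side, the short sketch below works through the arithmetic using only the numbers quoted in these notes. Taking "hundreds" as 100 copies per cell is an assumption made for the calculation, and the helper name `orders_of_magnitude` is illustrative.

```python
import math


def orders_of_magnitude(lowest: float, highest: float) -> float:
    """Dynamic range expressed as log10 of the highest/lowest ratio."""
    return math.log10(highest / lowest)


# Cellular copy numbers: "hundreds" (assumed here to mean 100) up to 20 x 10^6.
cell_range = orders_of_magnitude(100, 20e6)

# Plasma composition by percentage of total protein (figures from the notes).
top10_percent = 95
albumin_percent = 55
other_top10_percent = top10_percent - albumin_percent  # remaining 9 of the top 10
all_other_percent = 100 - top10_percent                # every other plasma protein

print(f"Cellular copy-number range: about {cell_range:.1f} orders of magnitude")
print(f"Other nine top-10 proteins: {other_top10_percent}% of plasma protein")
print(f"All remaining proteins:     {all_other_percent}% of plasma protein")
```

Under that assumption the cellular range is a little over 5 orders of magnitude, well short of the 9 orders quoted for plasma, where every protein outside the top 10 shares only 5% of the total.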

Information Derived from Proteomics and Sample Sources

Proteomics can answer several critical biological questions:

Sequence: What is the specific protein sequence (e.g., AVACCDLRDTYWP…)?
Modifications: Identification of hydroxylation, phosphorylation, acetylation, methylation, glycosylation, prenylation, nitrosylation, and carbonylation.
Interactions and Abundance: Which proteins interact and how does their abundance change?
Distribution and Turnover: Where are molecules distributed and what is the rate of protein turnover?
Immunopeptidomics: What immunopeptides are presented?
Targeted Quantitation: Verification of biomarker panels using techniques such as Multiple Reaction Monitoring (MRM).

Analyzable Sample Types:

Cell Models: Fibroblasts, Monocytes, Erythrocytes, T-lymphocytes.
Animal Models: Rat/mouse models (tissue and serum).
Human Fluids: Sputum/BALF, Oocytes/Blastocysts, and other fluids.
Pathogens: Bacteria and viruses.

Mass Spectrometry Proteomics Workflow

A typical workflow involves the following steps:

Protein Separation: Complex samples are separated into simpler "fractions."
- 1D SDS-PAGE: Separates proteins by size on a polyacrylamide gel after denaturation with heat, SDS, and a reducing agent.
- 2D Gels: Separate proteins by charge (isoelectric point) in one dimension and by mass in the other.
- Challenges: Hydrophobic (membrane) proteins and low abundance proteins.
Protein Digestion into Peptides:
- Intact proteins (high molecular weight) are challenging to analyze directly.
- Peptides are smaller, more soluble, and serve as a "tag" for the parent protein.
- Trypsin: A digestive enzyme commonly used to cleave proteins on the C-terminal side of lysine (K) and arginine (R) residues.
Tandem Mass-Spectrometry (MS/MS):
- MS1: Measures the mass of the intact tryptic peptide ("precursor" ion).
- MS2: Fragments the precursor ion (by imparting energy to break peptide bonds) and measures the masses of the resulting "product" ions.
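
The digestion and MS/MS steps above can be sketched in a few lines. This is a simplified illustration, not a real search-engine implementation. Assumptions: trypsin cleaves after every K or R unless the next residue is proline (a commonly applied rule that the notes do not state), no missed cleavages, unmodified residues (in practice cysteines are usually alkylated), and only singly charged b and y fragment ions. The toy sequence is three peptides from the alpha-tubulin example in these notes joined end to end; it is not a real protein. All function names are illustrative.

```python
# Standard monoisotopic residue masses in daltons (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def tryptic_digest(protein: str) -> list[str]:
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def precursor_mz(peptide: str, charge: int = 2) -> float:
    """MS1 view: m/z of the intact peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


def fragment_ions(peptide: str) -> tuple[list[float], list[float]]:
    """MS2 view: singly charged b ions (N-terminal) and y ions (C-terminal)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions


toy_protein = "AVFVDLEPTVIDEVRYMACCLLYRSIQFVDWCPTGFK"  # hypothetical sequence
for peptide in tryptic_digest(toy_protein):
    b_ions, y_ions = fragment_ions(peptide)
    print(peptide, f"[M+2H]2+ = {precursor_mz(peptide):.4f}",
          f"{len(b_ions)} b ions, {len(y_ions)} y ions")
```

Each b ion and its complementary y ion together account for the whole peptide, which is why a ladder of fragment masses can be read back as a sequence.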

Peptide Identification and Collation

Data analysis involves matching MS/MS spectra to peptide sequences in a protein database.

In Silico Prediction: Theoretical MS/MS patterns are predicted from proteolysis in silico.
Matching: Experimental spectra are matched against these theoretical patterns to identify peptides.
Protein Collation: Tandem MS identifies peptides, which are then matched back to their original proteins. Confidence in protein identification increases with the number of unique peptides found. "Coverage" refers to the percentage of the protein sequence covered by identified peptides; it may be low when only $1$ or $2$ peptides are identified.
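
Coverage as defined above can be computed by marking every residue that falls inside at least one identified peptide. A minimal sketch, assuming exact substring matching; the protein and peptides are invented toy sequences, and `sequence_coverage` is an illustrative name:

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Percentage of protein residues covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy example (hypothetical sequences): 20 residues, two identified peptides.
toy_protein = "AAAAKCCCCCRDDDDDKEEE"
identified = ["AAAAK", "DDDDDK"]
print(f"Coverage: {sequence_coverage(toy_protein, identified):.1f}%")  # 11 of 20 residues
```

Overlapping peptides are counted once per residue, so coverage never exceeds 100%.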

Protein Families

Protein families are sets of proteins related by sequence, often sharing the same domains and functions. They typically arise from gene duplication events. Most proteins belong to a family with more than one member, which complicates proteome analysis because peptides may be shared across multiple family members.

Example: Assignment of Peptides to the Alpha-Tubulin Family

Peptides identified (numbered 1 to 11 in the order listed):

TIGGGDDSFNTFFSETGAGK
VGINYQPPTVVPGGDLAK
AVCMLSNTTAIAEAWAR
LDHKFDLMYAK
AVFVDLEPTVIDEVR
QLFHPEQLITGKEDAANNYAR
YMACCLLYR
SIQFVDWCPTGFK
IHFPLATYAPVISAEK
AYHEQLSVAEITNACFEPANQMVK
NLDIERPTYTNLNR

These peptides are assigned to family members (proteins) based on sequence overlap:

P05209 (alpha-1): Matches peptides 1, 2, 3, 4, 5, 6, 7, 8.
Q13748-1 (alpha-2 (1)): Matches peptides 1, 2, 3, 4, 11.
Q13748-2 (alpha-2 (2)): Matches peptides 1, 2, 3, 4.
NP_006000 (alpha-3): Matches peptides 1, 2, 3, 4, 5, 6, 7, 8.
P05215 (alpha-4): Matches peptides 1, 2, 3, 4, 5, 6, 7, 8, 9.
Q9BQE3 (alpha-6): Matches peptides 1, 2, 3, 4, 5, 6, 11.
Q9NY65 (alpha-8): Matches peptides 1, 2, 3, 4, 5, 6, 7, 10.
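
The assignment above shows why shared peptides complicate identification. The sketch below takes the peptide-to-protein matches exactly as listed and asks three questions: which peptides are shared by every family member, which are unique to one member, and which members cannot be told apart by these peptides. It is an illustrative analysis of this one table, not a general protein-inference algorithm; the variable names are made up.

```python
from collections import defaultdict

# Peptide numbers matched by each protein, copied from the list above.
matches = {
    "P05209": {1, 2, 3, 4, 5, 6, 7, 8},
    "Q13748-1": {1, 2, 3, 4, 11},
    "Q13748-2": {1, 2, 3, 4},
    "NP_006000": {1, 2, 3, 4, 5, 6, 7, 8},
    "P05215": {1, 2, 3, 4, 5, 6, 7, 8, 9},
    "Q9BQE3": {1, 2, 3, 4, 5, 6, 11},
    "Q9NY65": {1, 2, 3, 4, 5, 6, 7, 10},
}

# Invert the table: peptide number -> set of proteins containing it.
proteins_by_peptide = defaultdict(set)
for protein, peptides in matches.items():
    for peptide in peptides:
        proteins_by_peptide[peptide].add(protein)

# Peptides found in every family member carry no discriminating information.
shared_by_all = sorted(
    p for p, prots in proteins_by_peptide.items() if len(prots) == len(matches)
)

# Peptides found in exactly one member identify that member unambiguously.
unique_peptides = {
    p: next(iter(prots))
    for p, prots in sorted(proteins_by_peptide.items())
    if len(prots) == 1
}

# Proteins with identical peptide sets cannot be distinguished by this data.
proteins_by_peptide_set = defaultdict(list)
for protein, peptides in matches.items():
    proteins_by_peptide_set[frozenset(peptides)].append(protein)
indistinguishable = [
    sorted(group) for group in proteins_by_peptide_set.values() if len(group) > 1
]

print("Shared by all members:", shared_by_all)
print("Unique peptides:", unique_peptides)
print("Indistinguishable groups:", indistinguishable)
```

In this table, peptides 1 to 4 occur in every member, only peptides 9 and 10 point to a single protein, and P05209 and NP_006000 match the same eight peptides.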