Proteomics 2

Applications and Capabilities of Proteomics

Proteomics provides a diverse array of information regarding cellular proteins and their functional states. This field of study goes beyond simple identification to include:

Post-translational Modifications (PTMs): Identifying chemical modifications that occur after protein synthesis, including:
- Hydroxylation
- Phosphorylation
- Acetylation
- Methylation
- Glycosylation
- Prenylation
- Nitrosylation
- Carbonylation
Protein Sequencing: Determining the specific primary structure (the linear sequence of amino acids) of a protein, e.g., $AVACCDLRDTYWP...$
Protein-Protein Interactions: Determining which proteins physically interact with one another.
Protein Abundance: Measuring how the quantity of specific proteins changes under different conditions.
Molecular Distribution: Mapping where molecules are distributed within the cell or tissue.
Protein Turnover: Measuring the rate of protein synthesis and degradation.
Immunopeptidomics: Identifying which immunopeptides are presented on the cell surface.
Biomarker Verification: Validating specific panels of biomarkers using targeted quantitation methods such as Multiple Reaction Monitoring (MRM).

Protein-Protein Interactions (PPIs)

Protein-protein interactions are integral to almost every biological process. To fully understand a protein's function, one must understand its interactions. These interactions often form complex networks rather than isolated events.

Characteristics of PPIs include:

Variability: Interactions change dynamically and range in strength from very weak to very strong.
Biological Roles: They are essential for processes including:
- Transcription
- Protein transport
- Immune response
- Metabolism
- Chromatin remodeling
- Cell signaling

The Human Protein Interactome

The interactome is defined as the complete set of protein-protein interactions within a specific biological system. While the exact size of the human interactome remains unknown, estimates are derived from the number of proteins and known interactions.

Human Genome: Approximately $21,000$ genes.
Cell Types: More than $200$ distinct cell types.
Estimated Interactions: Approximately $650,000$ interactions (based on Huttlin et al., Nature, Volume 545, 2017).
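
For a sense of scale, the figures above can be compared with the number of possible protein pairs. The sketch below assumes, purely for illustration, one protein per gene (ignoring isoforms and modified forms) and counts unordered pairs without self-pairs; the variable names are illustrative.

```python
from math import comb

# Figures quoted in the notes above.
n_genes = 21_000
estimated_interactions = 650_000

# Simplifying assumption: one protein per gene, unordered pairs, no self-pairs.
possible_pairs = comb(n_genes, 2)

# Share of all possible pairs that the estimate would correspond to.
fraction = estimated_interactions / possible_pairs

print(f"possible pairs: {possible_pairs:,}")
print(f"estimated fraction that interact: {fraction:.2%}")
```

Under these assumptions there are roughly 220 million possible pairs, so the estimated interactome would cover about 0.3% of them.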

Techniques for Mapping the Interactome

There are two primary methodologies used to identify and map protein interactions:

Yeast Two-Hybrid (Y2H): Used to identify binary (one-to-one) interactions.
Affinity-Purification Mass Spectrometry (AP-MS): Used to identify protein complexes.

Method 1: Yeast Two-Hybrid (Y2H)

This method involves fusing test proteins (labeled $X$ and $Y$) to the two functional domains of a transcription factor required for gene expression:

The DNA-Binding Domain (BD)
The Activation Domain (AD)

Mechanism:

Two hybrid proteins are created: Protein $X$ fused to the Binding Domain, and Protein $Y$ fused to the Activation Domain.
A Reporter Gene is utilized; this is a gene that, when active, allows yeast to grow.
The reporter gene is expressed only when the Binding and Activation Domains are brought close together, reconstituting a functional transcription factor.
If test proteins $X$ and $Y$ interact, they bring the two domains together, the reporter gene is turned "on," and the yeast cells grow.

High-Throughput Applications: To test many protein combinations, researchers build libraries in which every protein of interest is fused to either the activation or the binding domain. The libraries are screened by mating yeast strains that carry different fusions; crossed strains that survive and grow indicate a positive protein-protein interaction. Although the assay runs in yeast, it can test interactions between proteins from other species, such as human proteins.
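
The screening logic can be sketched as a toy model: every BD-fused strain is mated with every AD-fused strain, and a pair "grows" only if the two test proteins interact. The function names and the protein labels below are hypothetical, and the set of true interactions is invented for illustration; a real screen also produces false positives and false negatives, which this sketch ignores.

```python
from itertools import product

def reporter_active(bait, prey, interactions):
    """Reporter is on only if the BD-fused bait and AD-fused prey interact."""
    return frozenset((bait, prey)) in interactions

def screen(bd_library, ad_library, interactions):
    """Mate every BD strain with every AD strain; return the pairs that grow."""
    return [
        (bait, prey)
        for bait, prey in product(bd_library, ad_library)
        if reporter_active(bait, prey, interactions)
    ]

# Hypothetical toy data: placeholder protein names, not real measurements.
true_interactions = {frozenset(("X", "Y")), frozenset(("X", "Z"))}
bd_library = ["X", "Y", "Z", "W"]
ad_library = ["X", "Y", "Z", "W"]

hits = screen(bd_library, ad_library, true_interactions)
print(hits)
```

Each interacting pair is recovered twice here, once in each bait/prey orientation, because both proteins appear in both libraries.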

Method 2: Affinity-Purification Mass Spectrometry (AP-MS)

This method follows a biochemical approach to isolate complexes:

An antibody is used to purify a specific "bait" protein along with any associated interacting proteins (the "prey").
Mass spectrometry techniques are then employed to identify and quantify the proteins present in the purified sample.
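
One way to picture the final step is as a comparison between the bait pulldown and a control pulldown, keeping only proteins that are clearly enriched with the bait. The sketch below uses a simple fold-change filter on spectral counts as a stand-in for dedicated scoring tools; the function name, the threshold, the pseudocount, and all counts are hypothetical.

```python
def call_preys(bait, bait_counts, control_counts, min_fold=5.0, pseudocount=1.0):
    """Return proteins enriched in the bait pulldown relative to the control.

    bait_counts and control_counts map protein name -> spectral count.
    The bait itself is skipped, since it is expected to be abundant.
    """
    preys = {}
    for protein, count in bait_counts.items():
        if protein == bait:
            continue
        background = control_counts.get(protein, 0)
        # Pseudocount avoids division by zero for proteins absent from the control.
        fold = (count + pseudocount) / (background + pseudocount)
        if fold >= min_fold:
            preys[protein] = fold
    return preys

# Hypothetical spectral counts, invented for illustration.
bait_pulldown = {"BAIT": 120, "PREY_A": 40, "PREY_B": 9, "STICKY": 30}
control_pulldown = {"STICKY": 28, "PREY_B": 1}

preys = call_preys("BAIT", bait_pulldown, control_pulldown)
print(preys)
```

The protein labeled STICKY is abundant in both samples and is therefore filtered out as likely nonspecific background.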

Case Study: SARS-CoV-2 Human-Viral Interactome

Gordon et al. (2020), in "A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing," investigated how the virus takes over human cells.

Viral Context: SARS-CoV-2 is the causative agent of the COVID-19 pandemic. Its genome expresses $29$ viral proteins.
Experimental Approach: $26$ of the $29$ viral proteins were cloned, tagged, and expressed in human cells. AP-MS was used to identify host proteins interacting with these viral proteins.
Findings:
- Identified $332$ high-confidence SARS-CoV-2-human protein-protein interactions (PPIs).
- The map provides a detailed view of how the virus manipulates host cells for processes like replication.
- Identified $67$ druggable human proteins or host factors targeted by $69$ existing compounds, including FDA-approved drugs, drugs in clinical trials, and preclinical compounds.
- Many of these human proteins function in basic cellular processes and may provide targets for anti-COVID-19 therapies.

Analyzing Protein Networks and Functional Associations

Proteomics experiments typically identify thousands of proteins. Analysis focuses on whether these proteins function together in networks or biological processes (e.g., Wnt signaling, BMP signaling, PI3K pathway, Apoptosis, Cell-cell adhesion).

STRING Database

The STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins) is a resource for exploring functional networks.

Integration: It integrates direct protein-protein interactions with indirect functional associations.
Data Sources: It pulls data from large-scale proteomics experiments as well as small-scale individual studies.
Combined Score: Provides a confidence score for each interaction.
Functionality: Users can identify "cliques" in networks—sets of proteins that interact more closely than expected by chance.
Interface: Users can search by protein name, sequence, or by uploading a file containing multiple proteins. It supports search across many different organisms.
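
The combination of confidence filtering and clique finding can be sketched without the database itself. The example below assumes an edge list of (protein, protein, score) with scores on a 0 to 1 scale, applies an illustrative cutoff of 0.7, and enumerates maximal cliques in the strict graph-theoretic sense (every member connected to every other) using the basic Bron-Kerbosch recursion. The edges and protein labels are invented; this is not STRING's own algorithm or API.

```python
from collections import defaultdict

def build_graph(edges, min_score):
    """Keep edges whose score meets the threshold; return adjacency sets."""
    graph = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)

# Hypothetical edges: placeholder names and made-up scores.
edges = [
    ("P1", "P2", 0.95),
    ("P1", "P3", 0.90),
    ("P2", "P3", 0.85),
    ("P3", "P4", 0.75),
    ("P4", "P5", 0.30),
]

graph = build_graph(edges, min_score=0.7)
cliques = maximal_cliques(graph)
print(cliques)
```

Raising or lowering the cutoff changes which edges survive, and therefore which groups of proteins appear tightly connected.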

Standardizing Function: The Gene Ontology (GO)

A significant challenge in proteomics is nomenclature ambiguity:

One name, many concepts: Identical terms (e.g., "cell") can have different meanings across contexts.
One concept, many names: A single biological process (e.g., Histidine biosynthesis) may be referred to as Histidine formation, Histidine synthesis, or Histidine anabolism.

To solve this, researchers use a Standardized Ontology. An ontology is a framework for organizing information that includes a vocabulary of terms, a definition for each, and logical relationships between them.

The Gene Ontology Structure

Introduced by Ashburner et al. (2000), the Gene Ontology (GO) project aims to unify biology by describing the processes, structures, and functions that recur across diverse organisms. GO is divided into three components:

Molecular Function: What the protein does at the molecular level (e.g., "kinase activity", "catalytic activity"). This category contains approximately $10,000$ terms.
Biological Process: The broader series of events the protein contributes to (e.g., "translation", "transcription"). This category contains approximately $20,000$ terms.
Cellular Component: Where the protein acts within the cell (e.g., "nucleus", "mitochondrion", "ribosome"). This category contains approximately $3,000$ terms.

GO Terms and Graph Organization

A specific GO term (e.g., Translation) includes:

Identifier: e.g., $GO:0006412$
Division: Biological Process
Synonyms: Protein anabolism, protein biosynthesis, protein synthesis
Definition: The cellular metabolic process in which a protein is formed…
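
A term record like this maps naturally onto a small data structure, and indexing the synonyms shows how an ontology resolves the "one concept, many names" problem. The class and function names below are illustrative, not part of any GO library; the field values are copied from the example above, with the definition left truncated as given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GOTerm:
    identifier: str
    name: str
    division: str
    synonyms: tuple = ()
    definition: str = ""

translation = GOTerm(
    identifier="GO:0006412",
    name="translation",
    division="Biological Process",
    synonyms=("protein anabolism", "protein biosynthesis", "protein synthesis"),
    definition="The cellular metabolic process in which a protein is formed…",
)

def build_index(terms):
    """Map every name and synonym (lower-cased) to its term identifier."""
    index = {}
    for term in terms:
        for label in (term.name, *term.synonyms):
            index[label.lower()] = term.identifier
    return index

index = build_index([translation])
print(index["protein synthesis"])
```

Whichever synonym an author uses, the lookup returns the same identifier, so annotations from different sources can be compared directly.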

GO is organized as a Directed Acyclic Graph:

Nodes: Represent terms.
Edges: Represent relationships, primarily "is a" or "part of."
Hierarchy:
- A parent term may have multiple children (e.g., Organelle is a parent to Mitochondrion).
- A child term may have multiple parents (e.g., Mitochondrion is an organelle AND is part of the cytoplasm).
"Is a" Relation: Indicates a subtype relationship (e.g., a mitotic cell cycle is a cell cycle). In an ontology, each term represents a class of entities rather than a specific instance (e.g., "Cat" is a class, while "Garfield" is an instance). If every cat is a mammal, then every instance of a cat is also an instance of a mammal.
Increasing Specificity: Terms become more specific as you move down the hierarchy: Metabolic Process $\rightarrow$ Protein Metabolic Process $\rightarrow$ Translation.
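
Because a child can have several parents, walking up the graph is a traversal rather than a single chain. The sketch below encodes only the example edges mentioned above (a simplification, not the real GO graph) and collects every ancestor of a term; the dictionary layout and function name are illustrative.

```python
# child -> list of (relation, parent); edges taken from the examples above.
PARENTS = {
    "mitochondrion": [("is_a", "organelle"), ("part_of", "cytoplasm")],
    "translation": [("is_a", "protein metabolic process")],
    "protein metabolic process": [("is_a", "metabolic process")],
}

def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent edges upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(ancestors("translation")))
print(sorted(ancestors("mitochondrion")))
```

This upward closure is what makes annotations propagate: a protein annotated with a specific term is implicitly associated with all of that term's ancestors.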

Functional Enrichment Analysis

Proteins are annotated with multiple GO terms to describe their complexity. For example, $\beta$-catenin is annotated with terms such as "Wnt receptor signaling," "Cell-cell adhesion," "Chromatin binding," "Cadherin binding," "Apical junction," and "Transcription factor complex."

In proteomics, researchers perform Enrichment Analysis to see if specific GO terms are over-represented in their data. This involves calculating:

P-values: Indicating the statistical significance of the enrichment (e.g., values $< 10^{-9}$, $10^{-7}$ to $10^{-9}$, etc.).
FDR q-values: Corrected significance values to account for multiple testing.
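Enrichment Score: Calculated from four counts: $N$, the total number of genes in the background population; $B$, the number of those genes annotated with the GO term; $n$, the number of genes in the target list; and $b$, the number of target-list genes annotated with the term. The score is the ratio of the frequency observed in the list to the background frequency, $(b/n) / (B/N)$. Example: $(N, B, n, b) = (9089, 690, 585, 107)$ gives $(107/585) / (690/9089) \approx 2.41$.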
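
The three quantities can be illustrated with a short sketch. It reads $N$ as the background size, $B$ as the number of background genes carrying the term, $n$ as the target-list size, and $b$ as the number of target-list genes carrying the term, and computes the enrichment as $(b/n)/(B/N)$. The p-value here is a one-sided hypergeometric tail and the q-values use the Benjamini-Hochberg procedure; both are common choices assumed for illustration, and individual tools may apply different tests. The function names are hypothetical.

```python
from math import comb

def enrichment(N, B, n, b):
    """Frequency of the term in the target list divided by its background frequency."""
    return (b / n) / (B / N)

def hypergeom_p(N, B, n, b):
    """P(X >= b) when n genes are drawn without replacement from N, of which B carry the term."""
    upper = min(n, B)
    tail = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return tail / comb(N, n)  # exact integers until this final division

def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Counts from the example above.
N, B, n, b = 9089, 690, 585, 107
print(f"enrichment: {enrichment(N, B, n, b):.2f}")
print(f"p-value:    {hypergeom_p(N, B, n, b):.3g}")

# Hypothetical p-values for four GO terms, to show the correction step.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

For the example counts, the term appears about 2.4 times more often in the list than in the background, and the tail probability falls well below $10^{-9}$.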

Tools like GOrilla are used to visualize these enriched processes, revealing clusters of related terms such as "cell cycle process," "mitotic cell cycle," "chromosome segregation," and "DNA repair."