Proteomics and Protein Databases and Tools Part II (Lectures 15–16)

Context and scope of proteomics

Part II (Lectures 15–16) continues from Part I (Lectures 13–14) on proteomics and associated databases/tools.
Video example: proteomics at NASA, illustrating how genomics, transcriptomics, and proteomics are applied across biology and space exploration.
Key themes: proteome definition, why blood proteomics is informative, biomarkers, integration with DNA information, and how proteomics informs therapeutics and personalized medicine.
Blood as a diagnostic reservoir: when cells die or pathogens cause issues, diagnostic proteins are shed into blood; biomarkers reflect health status and disease states.
Practical goals of proteomics in medicine:
- identify biomarkers for diseases,
- track changes in proteins in response to stresses or therapies,
- predict drug responses/doses from DNA and proteomics data,
- enable personalized medicine, including long-duration spaceflight pharmacology.
Proteomics in space: study proteome changes in astronauts (e.g., the twin astronauts Scott Kelly in space and Mark Kelly on Earth) to understand stresses from g-forces, space travel, and re-entry, and to tailor drug dosing and avoid adverse effects.
Overall summary: proteomics enables tracking biomarkers and systemic changes across time, including in extreme environments like space.

Core concept: three steps to protein identification by mass spectrometry

There are three essential steps for identifying proteins by mass spectrometry:
1) proteolytic digestion,
2) mass spectrometry analysis,
3) database interrogation.
The sequence of these steps is crucial for correct identification.

Step 1: proteolytic digestion (trypsin)

Start with a protein (e.g., a spot from 2D gel or a protein in a solution).
Digest with the enzyme trypsin.
Trypsin specificity: it cleaves after arginine (Arg, R) or lysine (Lys, K) residues.
Result: a set of peptide fragments, referred to as tryptic peptides (tryptic fragments).
Conditions: typical digestion is carried out in buffer, around 4–6 hours at 37°C.
Example interpretation: given a protein sequence with Arg/Lys positions highlighted in red, trypsin digestion yields peptides at those cleavage sites; some fragments are long, some short.
After digestion, the protein is converted into a pool of peptides for MS analysis.
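
The cleavage rule above can be sketched in a few lines of Python. This is a minimal illustration that applies only the simple rule stated in these notes (cut after every K or R); the function name `trypsin_digest` and the toy sequence are hypothetical, and real digestion software also handles exceptions such as missed cleavages.

```python
def trypsin_digest(sequence: str) -> list[str]:
    """Split a protein sequence after every lysine (K) or arginine (R)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            # The cleaved residue stays at the C-terminus of the peptide.
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Remaining C-terminal stretch that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# Hypothetical toy sequence, not a real protein.
protein = "MASKLVERGGHKTTPLR"
print(trypsin_digest(protein))  # ['MASK', 'LVER', 'GGHK', 'TTPLR']
```

Each returned peptide ends in K or R, except possibly the last one, which carries the protein's original C-terminus.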

Step 2: mass spectrometry analysis (MS)

The peptide mixture is introduced into a mass spectrometer via a liquid handling/ionization process.
Ionization: peptides are ionized in the source (gas phase) so they carry a charge.
Separation: ions are separated by mass-to-charge ratio (m/z) in electric and/or magnetic fields; in a time-of-flight analyzer, for example, lighter ions reach the detector sooner than heavier ions of the same charge.
Mass spectrometers mentioned:
- MALDI-TOF (MALDI time-of-flight),
- Orbitrap (high-resolution MS),
- Fourier Transform Mass Spectrometer (FTMS), a high-resolution analyzer used in advanced proteomics labs.
Outputs: a mass fingerprint, which is a spectrum showing detected peptide fragments with their masses (m/z) and intensities.
Mass fingerprint interpretation:
- Each peak corresponds to a peptide fragment with a specific mass.
- Example masses: 976 Da, 954 Da, etc. (Da = Dalton, unit of molecular mass).
- The average mass of an amino acid is about $\bar{M}_{AA} \approx 110\ \text{Da}$, so a fragment with mass $m$ corresponds to roughly $n \approx \frac{m}{110}$ amino acids (e.g., a 976 Da fragment: $n \approx \frac{976}{110} \approx 8.9 \rightarrow 9\ \text{amino acids}$).
Important note on mass units:
- A Dalton (Da) is a unit of mass; the mass of one hydrogen atom is about 1 Da. In practice, peaks reflect the masses of peptide fragments; entire peptide masses are in the hundreds to thousands of Da.
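
The relationship between peptide sequence and mass can be made concrete with a short sketch. It assumes standard monoisotopic residue masses (values not given in the lecture) and the usual rule that a peptide's mass is the sum of its residue masses plus one water; the function names are illustrative only.

```python
# Standard monoisotopic residue masses in Da (an assumption; not from the lecture).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one H2O for the free N- and C-termini


def peptide_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def estimate_residue_count(mass: float, average_residue_mass: float = 110.0) -> int:
    """Rule-of-thumb length estimate: mass divided by ~110 Da per amino acid."""
    return round(mass / average_residue_mass)


print(round(peptide_mass("LVER"), 3))   # mass of a hypothetical tetrapeptide
print(estimate_residue_count(976))      # 9
```

The second function is only the rough estimate from the notes; the first shows why real peak masses rarely fall on exact multiples of 110.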
Key limitation without a genome: if the genome weren’t sequenced, a mass fingerprint alone wouldn’t identify which protein the fragments came from; identification relies on known protein sequences.
The power of a sequenced genome: allows prediction of all tryptic peptides for every gene; theoretical mass fingerprints can be generated and matched to observed spectra.

Step 3: database interrogation (protein identification from MS data)

With a known genome, we can predict for each gene the amino acid sequence and the tryptic peptide set.
The experimental mass fingerprint is matched against the theoretical digestion products from the genome/protein database.
The best match reveals the protein identity.
This database-driven step turns spectra into protein identifications.
Conceptual takeaway: proteomics relies on a mapped genome to interpret mass fingerprints and identify proteins.
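
As a rough sketch of the matching step, the snippet below scores each database entry by how many observed peaks fall within a mass tolerance of its predicted tryptic peptide masses. The protein names, masses, and the 0.5 Da tolerance are hypothetical, and real search engines use more elaborate statistical scoring than a simple match count.

```python
def count_matches(observed, theoretical, tol=0.5):
    """Number of observed peaks lying within `tol` Da of any theoretical mass."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def best_match(observed, database, tol=0.5):
    """Return the best-scoring protein name and the score of every entry."""
    scores = {name: count_matches(observed, masses, tol)
              for name, masses in database.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical theoretical digests (peptide masses in Da) for three proteins.
database = {
    "protein_A": [976.5, 954.5, 1310.6, 2101.0],
    "protein_B": [976.5, 1450.8, 880.4],
    "protein_C": [700.3, 1622.9],
}
# Hypothetical observed mass fingerprint.
observed = [976.4, 954.6, 1310.7]

name, scores = best_match(observed, database)
print(name, scores)  # protein_A {'protein_A': 3, 'protein_B': 1, 'protein_C': 0}
```

Note that a single shared peptide mass (976.5 here) is not enough to separate candidates; the identification rests on several peaks agreeing with one entry.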

Tandem mass spectrometry (MS/MS) and sequencing

If the goal is to obtain peptide sequence information, tandem MS (MS/MS) is used.
Process:
- After the initial MS step, a selected peptide ion is isolated and subjected to collision-induced dissociation in a collision cell.
- This fragmentation generates a second MS spectrum (MS/MS) consisting of fragment ions, typically related to breaks along the peptide backbone (e.g., b- and y-ions).
- The MS/MS spectrum shows mass differences corresponding to individual amino acids; for example, the difference between adjacent peaks corresponds to the mass of a single amino acid residue.
Example concept: reading off a sequence from a tandem MS spectrum by analyzing mass differences between consecutive fragment peaks.
Outcome: MS/MS provides peptide sequence data, which can be used to confirm peptide identity and, by extension, the parent protein, especially when combined with database search and/or de novo sequencing.
Database interrogation after MS/MS can further confirm protein identity, increasing confidence in the identification.
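
Reading a sequence from mass differences can be sketched as follows. The example assumes a clean ladder of singly charged fragment ions from a single ion series, standard monoisotopic residue masses (not given in the lecture), and a hypothetical peak list; leucine and isoleucine share the same mass, so they are reported together.

```python
# Standard monoisotopic residue masses in Da (an assumption; not from the lecture).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L/I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}


def residue_for(delta: float, tol: float = 0.02) -> str:
    """Residue whose mass is closest to `delta`, or '?' if none is within `tol`."""
    name, mass = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - delta))
    return name if abs(mass - delta) <= tol else "?"


def read_ladder(peaks, tol: float = 0.02):
    """Translate gaps between consecutive fragment peaks into residues."""
    ordered = sorted(peaks)
    return [residue_for(b - a, tol) for a, b in zip(ordered, ordered[1:])]


# Hypothetical fragment-ion m/z values from one series.
peaks = [175.119, 262.151, 319.172, 448.215]
print(read_ladder(peaks))  # ['S', 'G', 'E']
```

For a y-ion series the residues are read from the C-terminus toward the N-terminus, so the order must be reversed to write the peptide in the conventional direction.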

Big-picture workflow in proteomic analysis

Start with a cell culture or tissue (e.g., brain, liver) and extract proteins.
Protein extraction: detergents like SDS or DSDS; reduce disulfide bonds with beta-mercaptoethanol; remove contaminants (lipids, sugars, DNA).
Generate peptides: digest proteins with trypsin to create a population of peptides.
Analytical platform: typically LC-MS/MS (liquid chromatography coupled to tandem MS) for high-throughput analysis.
Shotgun proteomics: often, all proteins from the sample are digested together (a “soup” of peptides) and analyzed; a strong bioinformatic pipeline maps peptides back to proteins.
LC-MS/MS setup: liquid chromatography (LC) separates peptides prior to MS/MS; tandem MS provides sequencing information.
Throughput and automation: modern proteomics relies on automation (e.g., spot picking in gels, digestion robots, off-gel fractionation, LC-MS/MS workflows) to handle thousands of proteins efficiently.
Bioinformatics backbone: crucial for database searching, spectral matching, PTM assignment, and proteome reconstruction.
Timeframe: protein extraction and peptide generation can take hours; LC-MS/MS analyses often run overnight; comprehensive identifications can be obtained within about two days.
Proteomics is increasingly integrated with genomics and transcriptomics (proteogenomics) to improve identification and interpretation (see below).

Post-translational modifications (PTMs) and MS-based detection

PTMs add or modify mass on amino acids, enabling functional regulation of proteins.
Common mass shifts (examples):
- Phosphorylation: +80 Da,
- Acetylation: +42 Da,
- Methylation: +14 Da,
- Sulfation: +80 Da (nominally the same shift as phosphorylation; distinguishing the two requires high mass accuracy),
- Other PTMs exist; there are more than 200 types (including various subtypes at specific amino acids).
PTMs can be identified by MS by detecting expected mass shifts relative to the unmodified peptide, often with MS/MS confirmation.
Example: phosphorylation of p53 at Ser33
- Serine mass (free, unmodified amino acid) ≈ $m(\text{Ser}) = 105\ \text{Da}$.
- A phosphate group adds +80 Da. Within a peptide, each residue has also lost one water molecule (H2O, 18 Da) through peptide-bond formation, so the phosphoserine residue mass is:
- $m(\text{phosphoserine}) = 105 + 80 - 18 = 167\ \text{Da}.$
- In MS/MS, the mass difference corresponding to a phospho-Ser residue reveals the site of phosphorylation (e.g., Ser33 in p53).
Not every copy of a protein is modified, so the unmodified peptide is usually also present in the MS data and serves as the reference for confirming the shift.
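
Detecting a PTM from a mass shift amounts to comparing the observed peptide mass with the unmodified reference. The sketch below uses the nominal shifts for phosphorylation, acetylation, and methylation listed above; the function name, the 0.5 Da tolerance, and the example masses are hypothetical.

```python
# Nominal mass shifts in Da, taken from the list of common PTMs above.
PTM_SHIFTS = {"phosphorylation": 80.0, "acetylation": 42.0, "methylation": 14.0}


def classify_shift(unmodified_mass: float, observed_mass: float, tol: float = 0.5):
    """Return the PTM whose nominal shift explains the mass difference, else None."""
    delta = observed_mass - unmodified_mass
    for name, shift in PTM_SHIFTS.items():
        if abs(delta - shift) <= tol:
            return name
    return None


# Hypothetical peptide observed in unmodified and modified forms.
print(classify_shift(976.4, 1056.4))  # phosphorylation
print(classify_shift(976.4, 990.4))   # methylation
print(classify_shift(976.4, 1000.0))  # None
```

A matching shift only suggests the type of modification; locating the modified residue still requires the MS/MS fragment ladder.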

Protein quantification in proteomics

Two broad approaches:
- Gel-based quantification: SDS-PAGE or 2D electrophoresis; quantify based on band/spot intensity after staining (e.g., Coomassie blue). PTMs can shift isoforms and alter band/spot intensities slightly, but resolution is limited.
- MS-based quantification: quantify by peak intensities or peak areas in MS spectra; higher peaks indicate higher abundance of the corresponding peptide/protein.
2D gel limitations: thousands of proteins may be present; gel resolution may not separate all isoforms or PTMs.
2D gel and MS can be complemented by more sensitive MS-based methods for higher throughput and depth.

SILAC: stable isotope labeling by amino acids in cell culture

SILAC enables precise relative quantification of proteins across conditions.
Concept: grow cells in media containing isotopically labeled amino acids (non-radioactive stable isotopes). Three (or more) conditions can be compared by using different isotopic labels:
- Example scheme: control (zero minutes), 1 minute after EGF stimulation, 10 minutes after EGF stimulation.
- Arg or other amino acids are replaced with stable-isotope-labeled forms (e.g., Arg with ¹²C/¹⁴N as the light form, ¹³C/¹⁴N as a medium form, and ¹³C/¹⁵N as a heavy form).
Practical details: cells incorporate labeled amino acids into newly synthesized proteins, causing a predictable mass shift for peptides derived from each condition.
Analysis: peptides from different conditions are mixed and analyzed together by LC-MS/MS; the relative abundances are inferred from the relative intensities of light vs heavy isotope-labeled peptide peaks.
Advantages: precise, multiplexed quantitative comparison across conditions; widely used in contemporary proteomics.
Limitations: costs of labeled amino acids and isotope-labeled reagents; not always feasible for primary tissues (more suitable for cell culture).
Brief example interpretation: a peptide from a protein shows three peaks corresponding to three isotopic labels; differences in peak intensities across time points reveal up- or down-regulation in response to a stimulus (e.g., EGF).
SILAC is a staple technique in modern quantitative proteomics, though expensive; many labs rely on it for accurate proteome-level quantification.
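
The quantification step can be sketched as a ratio calculation on the intensities of one peptide's isotope-labeled peaks. The intensities are hypothetical, and the label names (light, medium, heavy) and their assignment to the 0, 1, and 10 minute conditions are assumptions made for illustration.

```python
import math


def silac_ratios(intensities: dict, reference: str = "light") -> dict:
    """Intensity of each labeled peak relative to the reference (control) peak."""
    ref = intensities[reference]
    if ref <= 0:
        raise ValueError("reference intensity must be positive")
    return {label: value / ref for label, value in intensities.items()}


def log2_fold_changes(ratios: dict) -> dict:
    """Log2 of each ratio: positive means up-regulated relative to control."""
    return {label: math.log2(r) for label, r in ratios.items()}


# Hypothetical peak intensities for one peptide:
# light = 0 min (control), medium = 1 min EGF, heavy = 10 min EGF.
peaks = {"light": 2.0e5, "medium": 8.0e5, "heavy": 4.0e5}

ratios = silac_ratios(peaks)
print(ratios)                     # {'light': 1.0, 'medium': 4.0, 'heavy': 2.0}
print(log2_fold_changes(ratios))  # {'light': 0.0, 'medium': 2.0, 'heavy': 1.0}
```

In practice a protein-level ratio is summarized from many peptides, but each peptide contributes ratios computed in this way.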

Technological challenges and advances in proteomics

Core challenges:
- Achieving high-throughput analysis across tens to tens of thousands of proteins per sample,
- Maintaining accuracy and sensitivity while handling complex samples,
- Automating sample preparation and data analysis to reduce manual effort and variance.
Automation in sample prep:
- Automated spot picking for 2D gels,
- Digestion robots that perform trypsin digestion in high-throughput formats,
- Off-gel electrophoresis and direct LC-MS/MS workflows to maximize throughput.
LC-MS/MS has become the standard workflow for shotgun proteomics, often paired with automated data analysis pipelines.
In situ proteomics imaging: mass spectrometry imaging (MSI) as a “proteomic microscope”
- Principle: a frozen tissue section is interrogated pixel-by-pixel by laser desorption/ionization; each pixel yields MS data reflecting local peptide/protein composition.
- Output: spatial distribution maps of specific peptides/proteins across tissue, analogous to immunohistochemistry but without specific antibodies.
- Challenges: high cost, specialized training, and current limitations in clinical adoption; projected trend toward wider use as technology matures and costs fall.
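
The pixel-by-pixel principle can be illustrated with a minimal sketch that turns per-pixel spectra into an intensity map for one target m/z. The data layout (a dictionary keyed by pixel coordinates), the function name, and all values are hypothetical; real MSI software works on instrument-specific file formats.

```python
def ion_image(pixels, target_mz, width, height, tol=0.5):
    """Build a 2D intensity map for peaks within `tol` of `target_mz`."""
    image = [[0.0] * width for _ in range(height)]
    for (x, y), spectrum in pixels.items():
        # Sum the intensity of every peak near the target m/z in this pixel.
        image[y][x] = sum(i for mz, i in spectrum if abs(mz - target_mz) <= tol)
    return image


# Hypothetical 2x2 tissue section: each pixel holds (m/z, intensity) pairs.
pixels = {
    (0, 0): [(976.5, 10.0), (1200.6, 3.0)],
    (1, 0): [(976.4, 40.0)],
    (0, 1): [(1200.6, 5.0)],
    (1, 1): [(976.6, 25.0), (976.5, 5.0)],
}

for row in ion_image(pixels, target_mz=976.5, width=2, height=2):
    print(row)
```

Repeating the call with a different `target_mz` gives a separate map for each peptide, which is what makes the approach antibody-free.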
Proteogenomics: integration of proteomics with genomics and transcriptomics
- Goals: leverage genome and transcriptome data to interpret proteome data, reducing ambiguity and enabling discovery of novel protein-coding events, splicing variants, and mutation-specific peptides.
- Approach: perform genomics, transcriptomics (e.g., RNA-seq), and proteomics on the same samples (sometimes in a single workflow) to create a model-specific proteome database.
- Benefit: can use a custom genome/proteome database for more accurate identifications, especially in cancer where genomes can deviate from the reference.
- Practical note: advances in lysis buffers enable simultaneous extraction of DNA, RNA, and proteins for multi-omics analyses from the same sample.
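
One way to picture a sample-specific database is to apply a known sequence variant to a reference protein and list the tryptic peptides that exist only in the variant. The sketch below assumes a single amino acid substitution and the simple cut-after-K/R rule; the sequence, the variant, and the function names are hypothetical.

```python
import re


def trypsin_digest(sequence: str) -> list[str]:
    """Split after every K or R (simple rule, no missed cleavages)."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)


def apply_substitution(sequence: str, position: int, ref_aa: str, alt_aa: str) -> str:
    """Replace the residue at a 1-based position after checking the reference."""
    if sequence[position - 1] != ref_aa:
        raise ValueError(f"expected {ref_aa} at position {position}")
    return sequence[:position - 1] + alt_aa + sequence[position:]


def variant_specific_peptides(sequence, position, ref_aa, alt_aa):
    """Tryptic peptides present in the variant but absent from the reference."""
    variant = apply_substitution(sequence, position, ref_aa, alt_aa)
    return sorted(set(trypsin_digest(variant)) - set(trypsin_digest(sequence)))


# Hypothetical reference protein and a hypothetical G9D variant from sequencing.
reference = "MASKLVERGGHKTTPLR"
print(variant_specific_peptides(reference, 9, "G", "D"))  # ['DGHK']
```

A search against the reference alone could never report `DGHK`; adding such peptides to the database is what allows mutation-specific identifications.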
Practical takeaways:
- The proteomics workflow is an interplay of wet-lab techniques and bioinformatics; success hinges on good experimental design and robust data analysis pipelines.
- Proteogenomics and MS-based PTM analysis are at the forefront of enabling deeper biological insights beyond simple protein identification.

Practical resources and readings

Online proteomics principles, techniques, and applications: a recommended resource that mirrors lecture content and expands on topics.
A case study: breast cancer quantitative proteome and proteogenomic analysis (PDF provided in Canvas) illustrating how proteomics data are generated, analyzed, and published.
Purpose of readings: to connect theory with data interpretation and to understand data presentation in proteomics publications.

Quick quiz and exam-style reminders

Question 1: You are asked to define the proteome of a cancer cell line. Which approach is best?
- Options: a) SDS-PAGE, b) 2D electrophoresis plus Western blot, c) mass spectrometry, d) Western blot plus mass spectrometry.
- Correct answer: c) mass spectrometry. Rationale: SDS-PAGE lacks the resolution to identify the entire proteome; 2D gels reveal a subset; Western blot targets specific proteins; MS can identify and quantify many proteins simultaneously.
Question 2: What is the role of trypsin in proteomics?
- Correct answer: b) to cut proteins into peptides before mass spectrometry analysis. Trypsin digestion produces a peptide mixture suitable for MS analysis; it is not used to extract proteins or to process lipids.
Question 3: Why are bioinformatics and databases essential in proteomics?
- Brief explanation: to store spectral data, to interpret MS data, to identify proteins by matching observed spectra to theoretical spectra derived from genome/protein databases, to annotate PTMs, and to reconstruct the proteome.
Short-answer prompts (topics you should practice):
- Explain why bioinformatics is essential in proteomics.
- Explain how proteins are identified by mass spectrometry (stepwise: digestion, MS, database search, optional MS/MS confirmation).

Final reminders for exam preparation

Master the three-step MS identification workflow and be able to explain each step in your own words.
Understand how a genome sequence enables peptide prediction and database matching for protein IDs.
Be able to discuss PTMs, their typical mass shifts, and how MS detects them (including a worked example like phosphoserine).
Know the major quantitative strategies (gel-based vs MS-based; SILAC concepts and use cases).
Be familiar with shotgun proteomics, LC-MS/MS workflows, and the rise of proteogenomics and MS imaging.
Recognize the difference between single-protein targeted methods (e.g., Western blot) and global proteome analyses (MS-based identification).