Proteomics and Protein Databases and Tools Part II (Lectures 15–16)

Context and scope of proteomics

  • Part II (Lectures 15–16) continues from Part I (Lectures 13–14) on proteomics and the associated databases and tools.

  • Video example: proteomics at NASA, illustrating how genomics, transcriptomics, and proteomics are applied across biology and space exploration.

  • Key themes: proteome definition, why blood proteomics is informative, biomarkers, integration with DNA information, and how proteomics informs therapeutics and personalized medicine.

  • Blood as a diagnostic reservoir: when cells die or pathogens cause issues, diagnostic proteins are shed into blood; biomarkers reflect health status and disease states.

  • Practical goals of proteomics in medicine:

    • identify biomarkers for diseases,

    • track changes in proteins in response to stresses or therapies,

    • predict drug responses/doses from DNA and proteomics data,

    • enable personalized medicine, including long-duration spaceflight pharmacology.

  • Proteomics in space: study proteome changes in astronauts (e.g., the twin astronauts Scott Kelly in space and Mark Kelly on Earth) to understand stresses from g-forces, space travel, and re-entry, and to tailor drug dosing and avoid adverse effects.

  • Overall summary: proteomics enables tracking biomarkers and systemic changes across time, including in extreme environments like space.

Core concept: three steps to protein identification by mass spectrometry

  • There are three essential steps for identifying proteins by mass spectrometry:
    1) proteolytic digestion,
    2) mass spectrometry analysis,
    3) database interrogation.

  • The sequence of these steps is crucial for correct identification.

Step 1: proteolytic digestion (trypsin)

  • Start with a protein (e.g., a spot from 2D gel or a protein in a solution).

  • Digest with the enzyme trypsin.

  • Trypsin specificity: it cleaves after arginine (Arg, R) or lysine (Lys, K) residues.

  • Result: a set of peptide fragments, referred to as tryptic peptides (or tryptic fragments).

  • Conditions: typical digestion is carried out in buffer, around 4–6 hours at 37°C.

  • Example interpretation: given a protein sequence with Arg/Lys positions highlighted in red, trypsin digestion yields peptides at those cleavage sites; some fragments are long, some short.

  • After digestion, the protein is converted into a pool of peptides for MS analysis.
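  The cleavage rule in Step 1 can be sketched in a few lines of Python. This is a minimal in-silico digestion that cuts after every K or R, exactly as stated above; it deliberately ignores refinements used by real search engines (such as missed cleavages or the commonly applied rule that trypsin does not cut before proline). The function name and the example sequence are made up for illustration.

```python
import re


def tryptic_digest(sequence: str) -> list[str]:
    """Cut a protein sequence after every Lys (K) or Arg (R).

    Simplified rule: no missed cleavages, no proline exception.
    """
    # Each match is a run of non-K/R residues ending in K or R,
    # or a trailing run with no K/R (the C-terminal peptide).
    return re.findall(r"[^KR]*[KR]|[^KR]+", sequence)


# Hypothetical sequence, not a real protein.
peptides = tryptic_digest("MKTAYIAKQRQISFVK")
print(peptides)  # ['MK', 'TAYIAK', 'QR', 'QISFVK']
```

  As in the lecture example, the fragments differ in length depending on how the K/R residues are spaced.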

Step 2: mass spectrometry analysis (MS)

  • The peptide mixture is introduced into a mass spectrometer via a liquid handling/ionization process.

  • Ionization: peptides are ionized in the source (gas phase) so they carry a charge.

  • Separation: ions are separated by mass-to-charge ratio (m/z) in an electric and/or magnetic field; in a time-of-flight analyzer, for example, lighter ions reach the detector sooner than heavier ions of the same charge.

  • Mass spectrometers mentioned:

    • MALDI-TOF (MALDI time-of-flight),

    • Orbitrap (high-resolution MS),

    • Fourier Transform Mass Spectrometer (FTMS), a high-resolution analyzer used in advanced proteomics labs.

  • Outputs: a mass fingerprint, which is a spectrum showing detected peptide fragments with their masses (m/z) and intensities.

  • Mass fingerprint interpretation:

    • Each peak corresponds to a peptide fragment with a specific mass.

    • Example masses: 976 Da, 954 Da, etc. (Da = Dalton, unit of molecular mass).

    • The average mass of an amino acid is about $\bar{M}_{AA} \approx 110\ \text{Da}$, so a fragment with mass $m$ corresponds to roughly $n \approx \frac{m}{110}$ amino acids (e.g., for a 976 Da fragment, $n \approx \frac{976}{110} \approx 8.9 \rightarrow 9$ amino acids).

  • Important note on mass units:

    • A Dalton (Da) is a unit of mass; the mass of one hydrogen atom is about 1 Da. In practice, peaks reflect the masses of peptide fragments; entire peptide masses are in the hundreds to thousands of Da.

  • Key limitation without a genome: if the genome weren’t sequenced, a mass fingerprint alone wouldn’t identify which protein the fragments came from; identification relies on known protein sequences.

  • The power of a sequenced genome: allows prediction of all tryptic peptides for every gene; theoretical mass fingerprints can be generated and matched to observed spectra.
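  The rule of thumb for converting a peak mass into an approximate peptide length can be written as a tiny helper. This is only a back-of-the-envelope sketch using the 110 Da average quoted above; the function name is an assumption, and the two masses are the example peaks from the notes.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # average amino acid mass used in the notes


def estimate_residue_count(peptide_mass_da: float) -> int:
    """Estimate how many amino acids a peptide of the given mass contains."""
    return round(peptide_mass_da / AVERAGE_RESIDUE_MASS_DA)


for mass in (976.0, 954.0):
    print(f"{mass:.0f} Da is roughly {estimate_residue_count(mass)} residues")
```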

Step 3: database interrogation (protein identification from MS data)

  • With a known genome, we can predict for each gene the amino acid sequence and the tryptic peptide set.

  • The experimental mass fingerprint is matched against the theoretical digestion products from the genome/protein database.

  • The best match reveals the protein identity.

  • This database-driven step turns spectra into protein identifications.

  • Conceptual takeaway: proteomics relies on a mapped genome to interpret mass fingerprints and identify proteins.
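  A toy version of the database interrogation step is sketched below. It assumes a simplified trypsin rule (cut after K or R only), neutral monoisotopic peptide masses rather than measured m/z values, a fixed mass tolerance, and a score that simply counts matching peaks. The two-entry "database", its sequences, and all function names are hypothetical; real search engines use far more elaborate scoring.

```python
import re

# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one H2O per free peptide (N-terminal H, C-terminal OH)


def digest(sequence):
    """Simplified trypsin rule: cut after every K or R."""
    return re.findall(r"[^KR]*[KR]|[^KR]+", sequence)


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def theoretical_fingerprint(sequence):
    """Predicted tryptic peptide masses for one database entry."""
    return [peptide_mass(p) for p in digest(sequence)]


def count_matches(observed, sequence, tol=0.5):
    """Number of observed masses that match a predicted mass within tol Da."""
    theoretical = theoretical_fingerprint(sequence)
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def best_match(observed, database, tol=0.5):
    """Name of the database entry explaining the most observed peaks."""
    return max(database, key=lambda name: count_matches(observed, database[name], tol))


# Hypothetical database with made-up sequences.
DATABASE = {"protA": "MKTAYIAKQRQISFVK", "protB": "GASPVTKCLNDQEKHFRYW"}
observed = theoretical_fingerprint(DATABASE["protA"])  # simulated perfect spectrum
print(best_match(observed, DATABASE))
```

  The key point mirrors the notes: without the sequences in `DATABASE` there would be nothing to compare the observed masses against.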

Tandem mass spectrometry (MS/MS) and sequencing

  • If the goal is to obtain peptide sequence information, tandem MS (MS/MS) is used.

  • Process:

    • After the initial MS step, a selected peptide ion is isolated and subjected to collision-induced dissociation in a collision cell.

    • This fragmentation generates a second MS spectrum (MS/MS) consisting of fragment ions, typically related to breaks along the peptide backbone (e.g., b- and y-ions).

    • The MS/MS spectrum shows mass differences corresponding to individual amino acids; for example, the difference between adjacent peaks corresponds to the mass of a single amino acid residue.

  • Example concept: reading off a sequence from a tandem MS spectrum by analyzing mass differences between consecutive fragment peaks.

  • Outcome: MS/MS provides peptide sequence data, which can be used to confirm peptide identity and, by extension, the parent protein, especially when combined with database search and/or de novo sequencing.

  • Database interrogation after MS/MS can further confirm protein identity, increasing confidence in the identification.
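  Reading a sequence from mass differences can be illustrated with an idealized fragment ladder. The sketch below assumes a complete, noise-free series of singly charged b-ions and a fixed tolerance, which real spectra rarely provide; function names are hypothetical. Leucine and isoleucine have identical masses, so they are reported together.

```python
from itertools import accumulate

# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # mass of the charge-carrying proton


def b_ion_ladder(peptide):
    """Idealized singly charged b-ions: cumulative residue masses plus a proton."""
    return [total + PROTON for total in accumulate(RESIDUE_MASS[aa] for aa in peptide)]


def read_residues(ladder, tol=0.02):
    """Assign a residue to each gap between adjacent ladder peaks."""
    peaks = sorted(ladder)
    calls = []
    for lower, upper in zip(peaks, peaks[1:]):
        delta = upper - lower
        hits = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)
        calls.append("/".join(hits) if hits else "?")
    return calls


# Simulate a ladder for a made-up peptide, then read it back.
print(read_residues(b_ion_ladder("PEPTIDE")))  # ['E', 'P', 'T', 'I/L', 'D', 'E']
```

  Only the gaps between peaks are read here, so the first residue (contained in the lowest peak itself) is not reported.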

Big-picture workflow in proteomic analysis

  • Start with a cell culture or tissue (e.g., brain, liver) and extract proteins.

  • Protein extraction: detergents like SDS or DSDS; reduce disulfide bonds with beta-mercaptoethanol; remove contaminants (lipids, sugars, DNA).

  • Generate peptides: digest proteins with trypsin to create a population of peptides.

  • Analytical platform: typically LC-MS/MS (liquid chromatography coupled to tandem MS) for high-throughput analysis.

  • Shotgun proteomics: often, all proteins from the sample are digested together (a “soup” of peptides) and analyzed; a strong bioinformatic pipeline maps peptides back to proteins.

  • LC-MS/MS setup: liquid chromatography (LC) separates peptides prior to MS/MS; tandem MS provides sequencing information.

  • Throughput and automation: modern proteomics relies on automation (e.g., spot picking in gels, digestion robots, off-gel fractionation, LC-MS/MS workflows) to handle thousands of proteins efficiently.

  • Bioinformatics backbone: crucial for database searching, spectral matching, PTM assignment, and proteome reconstruction.

  • Timeframe: protein extraction and peptide generation can take hours; LC-MS/MS analyses often run overnight; comprehensive identifications can be obtained within about two days.

  • Proteomics is increasingly integrated with genomics and transcriptomics (proteogenomics) to improve identification and interpretation (see below).

Post-translational modifications (PTMs) and MS-based detection

  • PTMs add or modify mass on amino acids, enabling functional regulation of proteins.

  • Common mass shifts (examples):

    • Phosphorylation: +80 Da,

    • Acetylation: +42 Da,

    • Methylation: +14 Da,

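    • Sulfation: +80 Da (nominally the same shift as phosphorylation, so the two are distinguished by accurate mass or fragmentation behavior),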

    • Other PTMs exist; there are more than 200 types (including various subtypes at specific amino acids).

  • PTMs can be identified by MS by detecting expected mass shifts relative to the unmodified peptide, often with MS/MS confirmation.

  • Example: phosphorylation of p53 at Ser33

    • Free serine has a mass of $m(\text{Ser}) \approx 105\ \text{Da}$.

    • Within a peptide, each residue has lost one water molecule (H2O, 18 Da) through peptide-bond formation, so the serine residue contributes 87 Da; phosphorylation then adds a phosphate group (HPO3, +80 Da), giving the phosphoserine residue mass:

    • $m(\text{phosphoserine}) = 105 + 80 - 18 = 167\ \text{Da}.$

    • In MS/MS, the mass difference corresponding to a phospho-Ser residue reveals the site of phosphorylation (e.g., Ser33 in p53).

  • Not every copy of a protein is modified, so the unmodified peptide is usually also present in the MS data and serves as a reference for confirming the mass shift.
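  Mass-shift detection can be sketched as a lookup of the difference between an observed peptide mass and its unmodified reference. The example uses the nominal shifts listed above (phosphorylation, acetylation, methylation) and a loose tolerance; the function name and example masses are assumptions. It also reproduces the phosphoserine arithmetic, treating the 18 Da as the water lost when serine is incorporated into a peptide chain.

```python
# Nominal mass shifts in Da, taken from the list in the notes.
PTM_SHIFTS_DA = {"phosphorylation": 80.0, "acetylation": 42.0, "methylation": 14.0}


def assign_ptm(unmodified_mass, observed_mass, tol=0.5):
    """Return the PTMs whose nominal shift explains the mass difference."""
    delta = observed_mass - unmodified_mass
    return [name for name, shift in PTM_SHIFTS_DA.items() if abs(delta - shift) <= tol]


# Worked phosphoserine example (nominal masses).
FREE_SERINE = 105          # free amino acid
WATER = 18                 # lost on peptide-bond formation
SERINE_RESIDUE = FREE_SERINE - WATER                                 # 87
PHOSPHOSERINE_RESIDUE = SERINE_RESIDUE + int(PTM_SHIFTS_DA["phosphorylation"])  # 167

# Hypothetical peptide observed 80 Da heavier than its unmodified form.
print(assign_ptm(1000.0, 1080.0))  # ['phosphorylation']
print(PHOSPHOSERINE_RESIDUE)       # 167
```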

Protein quantification in proteomics

  • Two broad approaches:

    • Gel-based quantification: SDS-PAGE or 2D electrophoresis; quantify based on band/spot intensity after staining (e.g., Coomassie blue). PTMs can shift isoforms and alter band/spot intensities slightly, but resolution is limited.

    • MS-based quantification: quantify by peak intensities or peak areas in MS spectra; higher peaks indicate higher abundance of the corresponding peptide/protein.

  • 2D gel limitations: thousands of proteins may be present; gel resolution may not separate all isoforms or PTMs.

  • 2D gels can be complemented by more sensitive MS-based methods to gain throughput and depth.
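  For MS-based quantification by peak area, one simple option is trapezoidal integration over the sampled points of a peak. The sketch below assumes the peak has already been isolated as (x, intensity) pairs sorted by x, where x could be m/z or retention time; the function name and the two example peaks are made up.

```python
def peak_area(points):
    """Trapezoidal area under a peak given (x, intensity) pairs sorted by x."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Two hypothetical peaks for the same peptide in two samples.
sample_a = [(0.0, 0.0), (1.0, 10.0), (2.0, 0.0)]
sample_b = [(0.0, 0.0), (1.0, 20.0), (2.0, 0.0)]
print(peak_area(sample_b) / peak_area(sample_a))  # 2.0, i.e. twice as abundant in B
```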

SILAC: stable isotope labeling by amino acids in cell culture

  • SILAC enables precise relative quantification of proteins across conditions.

  • Concept: grow cells in media containing isotopically labeled amino acids (non-radioactive stable isotopes). Three (or more) conditions can be compared by using different isotopic labels:

    • Example scheme: control (zero minutes), 1 minute after EGF stimulation, 10 minutes after EGF stimulation.

    • Arg or other amino acids are replaced with isotopically labeled forms (e.g., Arg carrying 12C/14N (light), 13C/14N (medium), or 13C/15N (heavy), or other stable-isotope combinations).

  • Practical details: cells incorporate labeled amino acids into newly synthesized proteins, causing a predictable mass shift for peptides derived from each condition.

  • Analysis: peptides from different conditions are mixed and analyzed together by LC-MS/MS; the relative abundances are inferred from the relative intensities of light vs heavy isotope-labeled peptide peaks.

  • Advantages: precise, multiplexed quantitative comparison across conditions; widely used in contemporary proteomics.

  • Limitations: costs of labeled amino acids and isotope-labeled reagents; not always feasible for primary tissues (more suitable for cell culture).

  • Brief example interpretation: a peptide from a protein shows three peaks corresponding to three isotopic labels; differences in peak intensities across time points reveal up- or down-regulation in response to a stimulus (e.g., EGF).

  • SILAC is a staple technique in modern quantitative proteomics, though expensive; many labs rely on it for accurate proteome-level quantification.
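  The SILAC readout reduces to comparing the intensities of the co-eluting peaks of one peptide. A minimal sketch follows, assuming the three peaks have already been extracted and that light, medium, and heavy correspond to 0, 1, and 10 minutes of EGF stimulation; that label assignment, the function name, and the intensities are illustrative assumptions.

```python
import math


def silac_log2_ratios(intensities, reference="light"):
    """log2 ratio of each labeled peak to the reference peak."""
    ref = intensities[reference]
    return {
        label: math.log2(value / ref)
        for label, value in intensities.items()
        if label != reference
    }


# Hypothetical intensities for one peptide: light = 0 min, medium = 1 min, heavy = 10 min.
peaks = {"light": 1.0e6, "medium": 2.0e6, "heavy": 0.5e6}
print(silac_log2_ratios(peaks))  # {'medium': 1.0, 'heavy': -1.0}
```

  A positive value means the peptide is more abundant than in the control, a negative value less abundant.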

Technological challenges and advances in proteomics

  • Core challenges:

    • Achieving high-throughput analysis across tens to tens of thousands of proteins per sample,

    • Maintaining accuracy and sensitivity while handling complex samples,

    • Automating sample preparation and data analysis to reduce manual effort and variance.

  • Automation in sample prep:

    • Spot-picking automation for 2D gels,

    • Digestion robots that perform trypsin digestion in high-throughput formats,

    • Off-gel electrophoresis and direct LC-MS/MS workflows to maximize throughput.

  • The rise of LC-MS/MS as a standard workflow for shotgun proteomics, often with automated data analysis pipelines.

  • In situ proteomics imaging: mass spectrometry imaging (MSI) as a “proteomic microscope”

    • Principle: a frozen tissue section is interrogated pixel-by-pixel by laser desorption/ionization; each pixel yields MS data reflecting local peptide/protein composition.

    • Output: spatial distribution maps of specific peptides/proteins across tissue, analogous to immunohistochemistry but without specific antibodies.

    • Challenges: high cost, specialized training, and current limitations in clinical adoption; projected trend toward wider use as technology matures and costs fall.
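  The pixel-by-pixel idea behind MS imaging can be sketched as follows: for a chosen m/z, sum the matching intensity in each pixel's spectrum and arrange the sums on the tissue grid. The data layout (a dict from (row, column) to a list of (m/z, intensity) pairs), the tolerance, and the function name are assumptions made for this illustration.

```python
def ion_image(pixels, target_mz, tol=0.5):
    """Build a 2D intensity map for one m/z value.

    pixels maps (row, col) to a list of (mz, intensity) pairs.
    """
    n_rows = max(r for r, _ in pixels) + 1
    n_cols = max(c for _, c in pixels) + 1
    image = [[0.0] * n_cols for _ in range(n_rows)]
    for (r, c), spectrum in pixels.items():
        image[r][c] = sum(i for mz, i in spectrum if abs(mz - target_mz) <= tol)
    return image


# Hypothetical 2 x 2 tissue section.
pixels = {
    (0, 0): [(976.0, 5.0), (954.0, 1.0)],
    (0, 1): [(976.2, 2.0)],
    (1, 0): [],
    (1, 1): [(500.0, 9.0)],
}
print(ion_image(pixels, 976.0))  # [[5.0, 2.0], [0.0, 0.0]]
```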

  • Proteogenomics: integration of proteomics with genomics and transcriptomics

    • Goals: leverage genome and transcriptome data to interpret proteome data, reducing ambiguity and enabling discovery of novel protein-coding events, splicing variants, and mutation-specific peptides.

    • Approach: perform genomics, transcriptomics (e.g., RNA-seq), and proteomics on the same samples (sometimes in a single workflow) to create a model-specific proteome database.

    • Benefit: can use a custom genome/proteome database for more accurate identifications, especially in cancer where genomes can deviate from the reference.

    • Practical note: advances in lysis buffers enable simultaneous extraction of DNA, RNA, and proteins for multi-omics analyses from the same sample.
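  One way to picture a sample-specific database is to apply a known amino acid substitution to a reference sequence and list the tryptic peptides that exist only in the variant. The sketch below assumes a single substitution, a simplified trypsin rule (cut after K or R), and a made-up reference sequence; the function names are hypothetical.

```python
import re


def digest(sequence):
    """Simplified trypsin rule: cut after every K or R."""
    return re.findall(r"[^KR]*[KR]|[^KR]+", sequence)


def variant_specific_peptides(reference, position, new_residue):
    """Tryptic peptides present in the variant but absent from the reference.

    position is 1-based, as in mutation notation such as Y5C.
    """
    variant = reference[: position - 1] + new_residue + reference[position:]
    return sorted(set(digest(variant)) - set(digest(reference)))


# Hypothetical reference protein with a Y5C substitution.
print(variant_specific_peptides("MKTAYIAKQRQISFVK", 5, "C"))  # ['TACIAK']
```

  Peptides like these would be missed by a search against the reference proteome alone, which is why the custom database helps when genomes deviate from the reference.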

  • Practical takeaways:

    • The proteomics workflow is an interplay of wet-lab techniques and bioinformatics; success hinges on good experimental design and robust data analysis pipelines.

    • Proteogenomics and MS-based PTM analysis are at the forefront of enabling deeper biological insights beyond simple protein identification.

Practical resources and readings

  • Online proteomics principles, techniques, and applications: a recommended resource that mirrors lecture content and expands on topics.

  • A case study: breast cancer quantitative proteome and proteogenomic analysis (PDF provided in Canvas) illustrating how proteomics data are generated, analyzed, and published.

  • Purpose of readings: to connect theory with data interpretation and to understand data presentation in proteomics publications.

Quick quiz and exam-style reminders

  • Question 1: You are asked to define the proteome of a cancer cell line. Which approach is best?

    • Options: a) SDS-PAGE, b) 2D electrophoresis plus Western blot, c) mass spectrometry, d) Western blot plus mass spectrometry.

    • Correct answer: c) mass spectrometry. Rationale: SDS-PAGE lacks the resolution to identify the entire proteome; 2D gels reveal a subset; Western blot targets specific proteins; MS can identify and quantify many proteins simultaneously.

  • Question 2: What is the role of trypsin in proteomics?

    • Correct answer: b) to cut proteins into peptides before mass spectrometry analysis. Trypsin digestion produces a peptide mixture suitable for MS analysis; it is not used to extract proteins or to process lipids.

  • Question 3: Why are bioinformatics and databases essential in proteomics?

    • Brief explanation: to store spectral data, to interpret MS data, to identify proteins by matching observed spectra to theoretical spectra derived from genome/protein databases, to annotate PTMs, and to reconstruct the proteome.

  • Short-answer prompts (topics you should practice):

    • Explain why bioinformatics is essential in proteomics.

    • Explain how proteins are identified by mass spectrometry (stepwise: digestion, MS, database search, optional MS/MS confirmation).

Final reminders for exam preparation

  • Master the three-step MS identification workflow and be able to explain each step in your own words.

  • Understand how a genome sequence enables peptide prediction and database matching for protein IDs.

  • Be able to discuss PTMs, their typical mass shifts, and how MS detects them (including a worked example like phosphoserine).

  • Know the major quantitative strategies (gel-based vs MS-based; SILAC concepts and use cases).

  • Be familiar with shotgun proteomics, LC-MS/MS workflows, and the rise of proteogenomics and MS imaging.

  • Recognize the difference between single-protein targeted methods (e.g., Western blot) and global proteome analyses (MS-based identification).