Peptide Sequencing & Proteomics – Core Vocabulary

Historical Evolution of Protein/Peptide Sequencing

❖ 1970s–1980s: Protein sequencing primarily relied on Edman degradation.
- Required large amounts of purified protein and a free N-terminus.
- Failed for N-terminally blocked/acetylated proteins.
❖ 1990s: Mass spectrometry (MS) largely displaced Edman degradation.
- Detects ionized biomolecules in vacuum; fragments peptides within seconds vs. hours–days for Edman.
- Handles tiny sample amounts, heterogeneous mixtures, blocked termini, post-translational modifications (PTMs).
❖ 2000s: MS becomes centerpiece of proteomics.
- Routine identification of single gel spots/bands; advanced global screens (MudPIT, GeLC, shotgun).
- Need for biologists to understand MS principles to avoid over-interpretation (e.g., mistaking minor contaminants for main band).

Core Rationale: Why Sequence Peptides, Not Intact Proteins

MS sensitivity and fragmentation efficiency are far higher for peptides (≤ 20 aa) than full proteins.
Proteins vary in solubility, stability, modifications; peptides have more uniform physico-chemical behavior.
Digestion removes issues of detergent incompatibility and membrane-protein insolubility.
Reduced complexity allows database matching; however, partial coverage misses complete PTM & processing info.
Specialized "top-down" FTICR methods can sequence intact proteins, but remain niche.

Proteomic Workflow Overview (Fig 1 Analogue)

Cell/tissue source → Sample prep (SDS–PAGE, 2-D gels, fractionation) → Protein digestion (trypsin, Lys-C, Asp-N, Glu-C) → Peptide separation (1-D/2-D LC, ion-exchange) → Ionization (Electrospray, MALDI) → Mass analysis (Quadrupole, TOF, Ion-trap, FTICR, hybrids) → Data analysis (PeptideSearch, Sequest, Mascot) → Biological interpretation.

Protein Digestion Specifics

Trypsin: Cleaves C-terminal to Arg/Lys; generates peptides in ideal mass range with basic C-terminus → rich y-ion spectra.
Lys-C: Cleaves C-terminal to Lys; even more stable than trypsin and still active in 8\,\text{M} urea, so useful as a first digestion step prior to trypsin.
Asp-N / Glu-C: Complementary specificity, lower activity.
Non-specific enzymes are avoided: they generate many overlapping peptides and overly complex spectra.
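The trypsin rule above can be sketched as an in-silico digest. This is a minimal illustration, not a reference implementation: the function name `tryptic_digest`, the example sequence, and the `missed_cleavages` parameter are hypothetical, and the optional "no cleavage before Pro" rule is a commonly used convention that the notes above do not state.

```python
def tryptic_digest(sequence, missed_cleavages=0, proline_rule=True):
    """Split a protein sequence C-terminal to K/R.

    proline_rule=True skips K/R followed by P (an assumed convention,
    not taken from the notes above).
    """
    # Positions *after* which the chain is cut (1-based cut index).
    sites = [
        i + 1
        for i, aa in enumerate(sequence[:-1])
        if aa in "KR" and not (proline_rule and sequence[i + 1] == "P")
    ]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        # Allow up to `missed_cleavages` uncut sites inside one peptide.
        last = min(start + 2 + missed_cleavages, len(bounds))
        for end in range(start + 1, last):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


if __name__ == "__main__":
    # Hypothetical toy sequence, chosen to contain one R-P bond.
    print(tryptic_digest("MKTAYRPLGKDE"))  # ['MK', 'TAYRPLGK', 'DE']
```

Every peptide except the protein C-terminus ends in K or R, which is what gives tryptic peptides their basic C-terminus and y-ion-rich spectra.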

Peptide Separation by Microscale Capillary HPLC

Inner diameter 50{-}150\,\mu\text{m}, reversed-phase C18.
Elution via increasing organic gradient; order by hydrophobicity.
- Very hydrophilic peptides may elute in the void volume; very hydrophobic ones may stick to the column.
Nano-flow rates \sim100\,\text{nL·min}^{-1}; peak width 10{-}60\,\text{s}.
Options:
- GeLC–MS: Slice SDS-PAGE lane; digest each slice → higher dynamic range & known M_w context.
- MudPIT: 2-D LC (strong cation exchange → RP) on peptide level.

Ionization Techniques (Box 1)

Electrospray Ionization (ESI)

Spray needle at \sim2\,\text{kV} potential creates charged droplets → solvent evaporation → ion desorption.
Produces multiply-protonated ions, commonly doubly charged for tryptic peptides.

Matrix-Assisted Laser Desorption/Ionization (MALDI)

Peptides co-crystallized in aromatic acid matrix; laser pulse yields mainly singly protonated ions.
Off-line coupling: LC fractions spotted on target for automated MALDI–TOF/TOF or MALDI–ion-trap sequencing.

Mass Analyzers & Resolution

Quadrupole (Q): Filters ions by stabilizing trajectories using sinusoidal RF/DC; sequential scan.
Time-of-Flight (TOF): Ions accelerated to equal kinetic energy; flight time ∝ \sqrt{m/z} → lighter ions arrive earlier.
Quadrupole Ion Trap (3-D or Linear): Traps ions in oscillating field; can isolate, fragment (MSⁿ), then eject.
FTICR (Penning trap): Ions orbit in high B-field; frequency→mass via Fourier transform. Resolution >100{,}000, mass error few ppm.
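The TOF relation (flight time ∝ \sqrt{m/z}) follows from equal kinetic energy, zeU = ½mv², which gives t = L\sqrt{m/(2zeU)}. A rough sketch under idealized assumptions (field-free linear drift tube, no reflectron, no initial energy spread); the acceleration voltage and tube length are hypothetical values chosen only for illustration.

```python
import math

DA_TO_KG = 1.6605e-27        # 1 Da in kg
E_CHARGE = 1.602176634e-19   # elementary charge in C


def flight_time_us(mz, accel_voltage=20_000.0, tube_length=1.0):
    """Idealized drift time in microseconds for an ion of given m/z (Th).

    accel_voltage (V) and tube_length (m) are illustrative assumptions.
    """
    mass_per_charge = mz * DA_TO_KG / E_CHARGE  # kg per coulomb
    seconds = tube_length * math.sqrt(mass_per_charge / (2.0 * accel_voltage))
    return seconds * 1e6


if __name__ == "__main__":
    for mz in (500.0, 1000.0, 4000.0):
        print(mz, round(flight_time_us(mz), 2))
```

Quadrupling m/z doubles the flight time, so lighter ions reach the detector first.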

m/z Calculation Example

Peptide of neutral mass M = 1232.55\,\text{Da}, doubly protonated (z = 2).
m/z = \frac{M + z\times1.0073}{z} = \frac{1232.55 + 2\times1.0073}{2} = 617.28 (observed m/z).
Isotope spacing = 1/z → 0.5\,\text{Th} spacing confirms charge 2+.
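The same arithmetic as a small sketch, using the proton mass from the worked example. Function names are illustrative only.

```python
PROTON = 1.0073  # Da, value used in the worked example above


def mz_from_mass(neutral_mass, z):
    """m/z of a peptide carrying z protons."""
    return (neutral_mass + z * PROTON) / z


def charge_from_isotope_spacing(spacing_th):
    """Charge state inferred from the spacing between isotope peaks (1/z)."""
    return round(1.0 / spacing_th)


def mass_from_mz(mz, z):
    """Neutral mass recovered from an observed m/z and charge."""
    return (mz - PROTON) * z


if __name__ == "__main__":
    print(round(mz_from_mass(1232.55, 2), 2))   # 617.28
    print(charge_from_isotope_spacing(0.5))     # 2
```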

Tandem MS & Peptide Fragmentation (Box 2)

Workflow: Survey MS scan → isolate top N precursors → collision-induced dissociation (CID) → MS² spectra.
Ion types:
- b_m: charge on N-term fragment.
- y_{n-m}: charge on C-term fragment.
- a_m = b_m - \text{CO} (−27.9949\,\text{Da}).
Amide bonds N-terminal to proline and C-terminal to aspartate are labile → intense fragment ions.
Multi-stage MSⁿ (MS³…) now feasible in linear traps for deeper sequencing.
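The ion definitions above can be turned into a small calculator for singly charged b, a and y ions. A sketch only: it assumes unmodified residues and standard monoisotopic residue masses, and `fragment_ions` is a hypothetical helper name.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # Da
WATER = 18.01056   # Da
CO = 27.9949       # Da, lost when b_m converts to a_m


def fragment_ions(peptide):
    """Singly charged b_m, a_m and y_(n-m) m/z values for each backbone bond."""
    masses = [RESIDUE[aa] for aa in peptide]
    n = len(masses)
    ions = []
    for m in range(1, n):
        b = sum(masses[:m]) + PROTON            # charge kept on N-terminal part
        y = sum(masses[m:]) + WATER + PROTON    # charge kept on C-terminal part
        ions.append({"m": m, "b": b, "a": b - CO, "y_index": n - m, "y": y})
    return ions


if __name__ == "__main__":
    for ion in fragment_ions("PEPTIDE"):
        print(ion["m"], round(ion["b"], 4), ion["y_index"], round(ion["y"], 4))
```

For every cleavage site, b_m + y_{n-m} equals the neutral peptide mass plus two protons, which is a handy consistency check when assigning peaks by hand.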

De Novo vs. Database-Driven Sequencing

De novo: Interpret mass gaps; ambiguous when spectrum incomplete.
Database Searching converts problem to pattern matching.
- Vast reduction of solution space as only biological sequences considered.
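A minimal sketch of de novo "mass gap" reading, assuming the peaks passed in already belong to one ion series and that residues are unmodified. The function names and the default tolerance are illustrative choices, not part of any published tool.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for_gap(gap, tol=0.01):
    """All residues whose mass matches a peak-to-peak gap within tol (Da)."""
    return sorted(aa for aa, mass in RESIDUE.items() if abs(mass - gap) <= tol)


def read_ladder(peaks, tol=0.01):
    """Candidate residues for each consecutive gap in a singly charged ion series."""
    peaks = sorted(peaks)
    return [residues_for_gap(hi - lo, tol) for lo, hi in zip(peaks, peaks[1:])]


if __name__ == "__main__":
    # y1..y3 of a peptide ending in ...IDE (singly charged).
    print(read_ladder([148.06043, 263.08737, 376.17143]))  # [['D'], ['I', 'L']]
```

The example shows two sources of ambiguity: Leu/Ile are isobaric, and a missing peak leaves a gap that matches no single residue (an empty list), which is why incomplete spectra are hard to read de novo.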

Major Algorithms (Box 3)

Peptide Sequence Tags (PeptideSearch)
- Short internal sequence + masses to termini → unique DB hit.
Sequest
- Correlates experimental vs. theoretical spectra by cross-correlation.
Mascot
- Probability-based; matches highest-intensity ions first → score = -10\log_{10}(P), where P is the probability that the observed match is a random event.

Others: Sonar, ProteinProspector, graph-theory approaches.
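The score formula quoted for Mascot is easy to invert. A tiny sketch, assuming P is the probability that a match is random; the function names are illustrative and do not correspond to any Mascot API.

```python
import math


def score_from_probability(p_random):
    """Score = -10 * log10(P)."""
    return -10.0 * math.log10(p_random)


def probability_from_score(score):
    """Inverse of the score formula: P = 10^(-score / 10)."""
    return 10.0 ** (-score / 10.0)


if __name__ == "__main__":
    for p in (0.05, 1e-3, 1e-6):
        print(p, round(score_from_probability(p), 1))
```

Each 10-point increase in score corresponds to a 10-fold lower probability of a chance match.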

Statistical Validation of Peptide/Protein IDs

Reported as expectation/probability scores.
Use fully tryptic peptides unless strong evidence for semi-tryptic.
False-positive estimation via decoy databases (reversed/randomized sequences).
Two-component score distribution: low-score (random) vs. high-score (true) → choose cut-off for desired \le 1\% FDR.
Protein probability combines peptide probabilities; caution with very large proteins (many theoretical peptides).
Single-peptide IDs only accepted with very high mass accuracy & manual spectrum validation (Box 4).
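Decoy-based cut-off selection can be sketched as follows. This assumes one common convention, FDR ≈ (decoy hits)/(target hits) above the threshold; other estimators exist, and ties between equal scores are not handled specially. The function name `score_cutoff` and the input layout are hypothetical.

```python
def score_cutoff(psms, max_fdr=0.01):
    """Lowest score threshold whose estimated FDR stays within max_fdr.

    psms: iterable of (score, is_decoy) pairs from a target+decoy search.
    FDR is estimated as decoys / targets among hits at or above the threshold
    (an assumed convention). Returns None if no threshold qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score  # keep walking down: a lower threshold may still pass
    return cutoff


if __name__ == "__main__":
    # Toy data: (score, is_decoy).
    hits = [(90, False), (80, False), (70, False), (60, True), (50, False)]
    print(score_cutoff(hits, max_fdr=0.2))  # 70
```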

Manual Spectrum Validation Heuristics (Box 4)

The majority of intense peaks above the precursor m/z should form a continuous y-series.
Check characteristic labile cleavages (Pro, Asp) & satellite losses (e.g., -98\,\text{Da} for \text{H}_3\text{PO}_4 from phosphopeptides, -64\,\text{Da} for \text{CH}_3\text{SOH} from oxidized Met).
Consider same-charge fragment ions below precursor (e.g., doubly charged y-ions).

Quantification Strategies

Absolute (AQUA-like)

Spike synthetic isotopically labelled peptides of known amount; compare extracted ion currents.
Without standards: averaging the signals of the three most intense peptides per protein gives an abundance estimate accurate to within about 4-fold.
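The spike-in calculation can be sketched as a ratio of extracted-ion-current peak areas. A minimal illustration, assuming the labelled standard and the endogenous peptide ionize identically and that peak areas are obtained by simple trapezoidal integration; the names and the femtomole unit are illustrative.

```python
def peak_area(trace):
    """Trapezoidal area under a chromatographic trace of (time, intensity) points."""
    return sum(
        (t1 - t0) * (i0 + i1) / 2.0
        for (t0, i0), (t1, i1) in zip(trace, trace[1:])
    )


def absolute_amount(endogenous_trace, standard_trace, standard_fmol):
    """Endogenous amount = spiked amount x (endogenous area / standard area)."""
    return standard_fmol * peak_area(endogenous_trace) / peak_area(standard_trace)


if __name__ == "__main__":
    # Toy co-eluting traces: (retention time in s, intensity).
    light = [(0.0, 0.0), (1.0, 10.0), (2.0, 0.0)]   # endogenous peptide
    heavy = [(0.0, 0.0), (1.0, 5.0), (2.0, 0.0)]    # labelled standard
    print(absolute_amount(light, heavy, standard_fmol=50.0))  # 100.0
```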

Relative Quantification via Stable Isotopes

Principle: heavy/light forms co-elute → peak ratio reflects abundance ratio.
Metabolic Labelling
- SILAC: culture cells with ^{13}\text{C}/^{15}\text{N} Arg/Lys; mix cell lysates before processing.
- \ge3 Da shift required to separate isotope clusters.
Post-Harvest Chemical Labelling
- ICAT: thiol-specific tag with biotin + light/heavy linker (FIG 4b); enrich cysteine peptides.
- Other amine-reactive tags, deuterated reagents, ^{18}\text{O} exchange.
Accuracy limited by resolution & S/N; replicate/label-swap experiments recommended.
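Where the heavy partner of a labelled pair should appear, and how the ratio is read off, can be sketched as below. The label is an assumption made for illustration (arginine with six ^{13}C atoms); the ≥3 Da criterion is taken from the notes above, and the function names are hypothetical.

```python
import math

C13_MINUS_C12 = 1.003355  # Da, mass difference between 13C and 12C


def label_shift(n_13c):
    """Mass shift (Da) from replacing n carbon atoms with 13C."""
    return n_13c * C13_MINUS_C12


def heavy_mz(light_mz, shift_da, z):
    """m/z of the heavy partner: the mass shift is divided by the charge."""
    return light_mz + shift_da / z


def clusters_separable(shift_da, min_shift=3.0):
    """Apply the >= 3 Da rule of thumb for separating isotope clusters."""
    return shift_da >= min_shift


def log2_ratio(heavy_intensity, light_intensity):
    """Heavy/light abundance ratio on a log2 scale."""
    return math.log2(heavy_intensity / light_intensity)


if __name__ == "__main__":
    shift = label_shift(6)                       # assumed 13C6-Arg label
    print(round(heavy_mz(617.28, shift, 2), 3))  # 620.29
    print(clusters_separable(shift), log2_ratio(4.0, 1.0))
```

Note that the spacing observed on the m/z axis shrinks with charge: a 6 Da label appears as a ~3 Th offset for a doubly charged peptide.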

Applications & Case Studies

Complex/Interactome Mapping: Immunoprecipitation + MS identifies protein networks (Refs 51-52). Stable-isotope pull-downs capture transient, phosphorylation-dependent binding (Refs 56-58).
Organelle Proteomics: Protein-correlation profiling distinguishes genuine organellar proteins via fractionation profiles (Ref 59).
Expression Proteomics / Biomarker Discovery: Whole-lysate quantification (SILAC, ICAT) seeks differential expression; challenges include dynamic range & data noise.
Top-Down FTICR: Partial sequences & PTM mapping on intact proteins.

Challenges & Common Pitfalls

Under-appreciating contaminants: keratins, minor co-migrating proteins.
Over-interpreting long protein lists without statistical confidence.
Missing low-abundance peptides due to ion suppression/co-elution.
Database redundancy vs. minimalism: conflicting isoform assignments.
Incomplete PTM coverage owing to limited sequence coverage.

Key Terms & Numerical Reminders

Dalton (Da): 1\,\text{Da} = 1.6605\times10^{-27}\,\text{kg}.
Thomson (Th): Proposed unit for the m/z scale; 1\,\text{Th} = 1\,\text{Da} per elementary charge.
Total Ion Current (TIC): sum of all signal intensities per scan.
Extracted Ion Chromatogram (XIC): intensity trace of single m/z across LC run.
Mass Resolution: R = \frac{m/z}{\text{FWHM}}; TOF ≈10{,}000; FTICR >100{,}000.
Isotope Spacing: \Delta (m/z)=1/z.
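The TIC, XIC and resolution definitions above, sketched on a toy data layout. The scan structure (retention time plus a list of (m/z, intensity) peaks) and the extraction tolerance are assumptions for illustration, not a real vendor or file format.

```python
def tic(peaks):
    """Total ion current: sum of all intensities in one scan."""
    return sum(intensity for _, intensity in peaks)


def xic(scans, target_mz, tol=0.01):
    """Extracted ion chromatogram: summed intensity near target_mz per scan.

    scans: list of (retention_time, [(mz, intensity), ...]).
    """
    return [
        (rt, sum(i for mz, i in peaks if abs(mz - target_mz) <= tol))
        for rt, peaks in scans
    ]


def resolution(mz, fwhm):
    """R = (m/z) / FWHM."""
    return mz / fwhm


if __name__ == "__main__":
    scans = [
        (10.0, [(617.28, 100.0), (500.00, 50.0)]),
        (11.0, [(617.285, 300.0), (700.00, 20.0)]),
    ]
    print([tic(peaks) for _, peaks in scans])  # [150.0, 320.0]
    print(xic(scans, 617.28))                  # [(10.0, 100.0), (11.0, 300.0)]
```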

Ethical / Practical Implications

Proper statistical validation prevents publication of unreliable proteomes.
SILAC avoids radioactivity and enables in-vivo dynamic studies without additional chemical manipulations.
ICAT selects for cysteine-containing proteins; researchers must report potential bias against cysteine-free proteins.
Sample handling: use gloves/lab coats to minimize keratin contamination.

Connections to Foundational Principles & Other ‘Omics’

Complementarity with transcriptomics: protein abundance correlates poorly with mRNA (Ref 60) → proteomics indispensable.
Systems biology integrates MS-based proteomics, mRNA arrays, imaging for holistic cellular maps (Refs 62-67).

Future Directions

Higher MS speed & resolution promise near-complete proteome coverage (accurate-mass-tag approaches).
Improved statistical tools (machine learning) & community standards for FDR reporting.
Clinical translation: SELDI patterns under scrutiny; need robust bioinformatics for biomarker validation.