
Beyond Mass Spectrometry – Key Proteomics Vocabulary

CENTRAL DOGMA & MOTIVATION FOR DIRECT PROTEIN SEQUENCING

  • Information flow: DNA → RNA → Protein

    • Transcription, splicing, translation determine amino-acid (AA) sequence

    • Start-site choice, open-reading-frame (ORF) selection and post-translational modifications (PTMs) add further variability

  • Proteins govern structure, signalling, catalysis; mutations or mis-processing underpin diseases (e.g. Alzheimer’s, Huntington’s)

  • Genome & transcriptome sequencing are inexpensive (≈$1000/human genome) yet insufficient:

    • Long-read genome assemblies retain ≈0.1 % error (frameshifts) ⇒ hundreds of protein errors

    • RNA levels do not correlate linearly with protein abundance because of translation efficiency, RNA lifetime, poly(A) length, RNA modifications

    • Analyses of a “well-characterised” human transcriptome found ≈1.1×10⁵ novel transcripts

  • Human proteome complexity

    • ≈2×10⁴ protein-coding genes

    • >100 isoforms per gene after alternative splicing, single-AA polymorphisms, PTMs ⇒ >10⁶ proteoforms

    • ≈60 % of proteins glycosylated; other frequent PTMs: methylation, acetylation, phosphorylation

HISTORICAL & CURRENT GOLD STANDARD: MASS SPECTROMETRY (MS)

Edman degradation (1950s–1990s)

  • Cyclic N-terminal chemistry; >99 % efficiency/AA but

    • Slow (≈1 h/cycle), limited to <30 AA, needs ≈100 pmol, fails on blocked N-termini & many PTMs

Bottom-Up MS (BU-MS)

  • Workflow: Protease digest (typically trypsin) → ionisation (ESI/MALDI) → MS→MS/MS → database search (Mascot, Sequest)

  • Generates peptide “fingerprints” rather than complete sequences

  • Ambiguities: isoleucine vs leucine (isobaric), homologous fragments, shared peptides; reconstruction uses inclusion, exclusion or parsimony strategies

  • Sensitivity limits

    • Typical detection limit ≈480 fg ⇒ ≈6 million molecules of a 50 kDa protein

    • Dynamic range ≈10⁵ (Orbitrap) vs biological range 10¹² (e.g. antibodies mg mL⁻¹; cytokines pg mL⁻¹)

    • Identification usually needs 10⁻¹⁸–10⁻¹⁵ mol (≈10⁶–10⁹ copies)

  • 75 % of acquired spectra remain unidentified; clustering rescues only ≈20 %

  • PTM site localisation problematic; phosphorylation example:

    • Human proteome: ≈2×10⁷ residues; avg. 1.5 Ser/Thr/Tyr per 10-mer ⇒ multiple candidate sites

    • Requires enrichment (IMAC, ion exchange) and mass measurement accuracy (MMA) ≤1 ppm; many instruments offer only 10–250 ppm
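
The sensitivity figures above are unit conversions through Avogadro's number. The sketch below reproduces them as a sanity check; the function names are illustrative and not taken from any proteomics package.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_from_mass(mass_g: float, mw_da: float) -> float:
    """Number of molecules in `mass_g` grams of a species weighing `mw_da` g/mol."""
    return mass_g / mw_da * AVOGADRO


def copies_from_moles(moles: float) -> float:
    """Number of molecules in `moles` moles."""
    return moles * AVOGADRO


if __name__ == "__main__":
    # 480 fg of a 50 kDa protein (the BU-MS detection limit quoted above)
    n = molecules_from_mass(480e-15, 50_000)
    print(f"480 fg at 50 kDa: {n:.2e} molecules")

    # Identification window quoted above, in mol
    for mol in (1e-18, 1e-15):
        print(f"{mol:.0e} mol = {copies_from_moles(mol):.1e} copies")

    # Gap between instrument and biological dynamic range
    print(f"dynamic-range shortfall: {1e12 / 1e5:.0e}-fold")
```

The first figure comes out at about 5.8 million molecules, consistent with the rounded "≈6 million" in the notes.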

Top-Down MS (TD-MS)

  • Analyses intact proteins (≤70 kDa) via ESI + fragmentation (CID, ECD, ETD)

  • ≈100× less sensitive than BU-MS; needs high-field magnets (7–14 T)

  • Resolution challenge: difference between trimethyl-lysine & acetyl-lysine = 0.0364 Da ⇒ need <1 ppm on 50 kDa ion

  • Fragmentation efficiency decreases with MW; generally requires ≥0.5 µg mL⁻¹ pure protein
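
The sub-ppm requirement follows from dividing the mass difference by the ion mass. A minimal sketch of that arithmetic, with illustrative helper names:

```python
def required_ppm(delta_mass_da: float, ion_mass_da: float) -> float:
    """Relative mass difference, in parts per million, that must be resolved."""
    return delta_mass_da / ion_mass_da * 1e6


def can_distinguish(instrument_ppm: float, delta_mass_da: float, ion_mass_da: float) -> bool:
    """True if the instrument's mass accuracy is finer than the required ppm."""
    return instrument_ppm < required_ppm(delta_mass_da, ion_mass_da)


if __name__ == "__main__":
    # Trimethyl-lysine vs acetyl-lysine on an intact 50 kDa ion
    need = required_ppm(0.0364, 50_000)
    print(f"required accuracy: {need:.3f} ppm")
    for ppm in (0.5, 10, 250):
        print(f"{ppm} ppm instrument resolves it: {can_distinguish(ppm, 0.0364, 50_000)}")
```

The requirement evaluates to roughly 0.73 ppm, which is why instruments in the 10–250 ppm class cannot separate the two modifications on an intact protein.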

KEY NUMERICAL RELATIONS & FORMULAE

  • Protein volume–mass relation: $V_{mol}\,[\text{nm}^3] = 1.21\times10^{-3}\,MW_{Da}$

  • Fractional current blockade (idealised): $\frac{\Delta I}{I_0} = f\,\frac{\Delta V_{mol}}{V_{pore}}\,S$

    • f = shape/orientation factor; S = field-distortion size factor
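
Both relations can be evaluated numerically. The sketch below assumes a cylindrical pore to get a pore volume; the 20 nm pore length and the default f = S = 1 are illustrative assumptions, not values from the notes.

```python
import math

NM3_PER_DA = 1.21e-3  # volume-mass coefficient from the relation above


def molecular_volume_nm3(mw_da: float) -> float:
    """Approximate protein volume in nm^3 from molecular weight in Da."""
    return NM3_PER_DA * mw_da


def fractional_blockade(delta_v_mol_nm3: float, v_pore_nm3: float,
                        f: float = 1.0, s: float = 1.0) -> float:
    """Idealised dI/I0 = f * (dV_mol / V_pore) * S."""
    return f * delta_v_mol_nm3 / v_pore_nm3 * s


def cylinder_volume_nm3(diameter_nm: float, length_nm: float) -> float:
    """Volume of a cylindrical pore (a simplifying assumption about geometry)."""
    return math.pi * (diameter_nm / 2) ** 2 * length_nm


if __name__ == "__main__":
    v_protein = molecular_volume_nm3(50_000)      # 50 kDa protein
    v_pore = cylinder_volume_nm3(30.0, 20.0)      # 30 nm wide; 20 nm length is assumed
    print(f"protein volume: {v_protein:.1f} nm^3")
    print(f"fractional blockade: {fractional_blockade(v_protein, v_pore):.4f}")
```

Under these assumptions a 50 kDa protein occupies about 60.5 nm³ and blocks well under 1 % of the open-pore current, which illustrates why bandwidth and noise dominate the measurement.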

LIMITATIONS THAT DRIVE NEW TECHNOLOGIES

  • Insufficient sensitivity & dynamic range for low-abundance proteins/isoforms/PTMs

  • Dependence on reference databases; difficulty in de novo sequencing

  • Need for large sample amounts prevents single-cell proteomics

ALTERNATIVES BEYOND MS

1. Long-Read Transcriptomics (PacBio, Oxford Nanopore)

  • Reads entire cDNAs or native RNA (10 kb+) allowing isoform resolution without assembly

  • Example: CDKN2A locus

    • Isoforms p14ARF and p16INK4a share exons but are translated in different reading frames; long reads gave 93 vs 33 reads, respectively, whereas short reads would misclassify them

  • Error correction: circular consensus (PacBio), alignment polishing ⇒ >99 % accuracy

  • Limitations: RT processivity, low depth for rare transcripts, RNA ≠ protein, cannot capture PTMs
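
The idea behind consensus error correction can be illustrated with a per-position majority vote over repeated passes of the same molecule. This is a deliberately simplified sketch: it assumes the passes are already aligned and of equal length, whereas production circular-consensus pipelines handle insertions and deletions with alignment and probabilistic models.

```python
from collections import Counter


def majority_consensus(passes: list[str]) -> str:
    """Per-column majority vote over pre-aligned, equal-length reads."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be pre-aligned to equal length")
    # zip(*passes) yields one tuple of bases per column
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    truth = "ACGTAC"
    # Five noisy passes, each with at most one substitution (toy data)
    noisy = ["ACGTAC", "ACGAAC", "ACGTAC", "TCGTAC", "ACGTAG"]
    consensus = majority_consensus(noisy)
    print(consensus, identity(consensus, truth))
```

Because the errors in each pass fall at different positions, the vote recovers the true sequence even though three of the five passes contain a mistake.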

2. Single-Cell Multi-omics – CITE-seq & Spatial Methods

  • Attach DNA barcodes to antibodies; droplet microfluidics (Drop-seq/10X): joint sequencing of

    • mRNA (transcriptome)

    • Antibody-derived tags (proxy for surface protein abundance)

  • Advantages: thousands of proteins possible (DNA barcode space > fluorescence channels)

  • Present limits: surface proteins only; antibody specificity/cross-reactivity; steric hindrance blocks multi-PTM probing

  • Spatial transcriptomics (MERFISH, STARmap, seqFISH+) adds localisation but suffers from low throughput & spectral crowding
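
The antibody-derived-tag readout in CITE-seq reduces to counting barcodes per cell. The sketch below assumes each read has already been parsed into a (cell barcode, UMI, antibody barcode) triple and collapses duplicate triples; the barcode strings and the barcode-to-protein map are hypothetical toy values.

```python
from collections import defaultdict


def count_adt(reads, barcode_to_protein):
    """Count unique (cell, UMI, antibody) observations per cell and protein.

    reads: iterable of (cell_barcode, umi, antibody_barcode) tuples.
    barcode_to_protein: dict mapping antibody barcode -> protein name.
    Reads with an unknown antibody barcode are skipped.
    """
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for cell, umi, antibody in reads:
        protein = barcode_to_protein.get(antibody)
        if protein is None:
            continue  # barcode not in the antibody panel
        key = (cell, umi, antibody)
        if key in seen:
            continue  # same molecule sequenced twice (amplification duplicate)
        seen.add(key)
        counts[cell][protein] += 1
    return {cell: dict(per_protein) for cell, per_protein in counts.items()}


if __name__ == "__main__":
    panel = {"AAAC": "CD3", "GGGT": "CD19"}  # hypothetical barcodes
    reads = [
        ("cellA", "u1", "AAAC"), ("cellA", "u1", "AAAC"),  # duplicate
        ("cellA", "u2", "AAAC"), ("cellA", "u3", "GGGT"),
        ("cellB", "u1", "GGGT"), ("cellB", "u2", "TTTT"),  # unknown barcode
    ]
    print(count_adt(reads, panel))
```

The resulting per-cell counts are only a proxy for surface protein abundance, subject to the antibody specificity limits listed above.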

3. Fluorescent Protein “Fingerprinting”

a) Fluorosequencing by Edman degradation (Swaminathan et al.)
  • Fragment proteins; covalently label specific AA types with dyes; tether C-termini on flow-cell; TIRF imaging

  • Cycles: image → Edman remove N-terminal AA → image; drops in intensity reveal labelled AA positions

  • Scalable (millions of molecules) leveraging NGS hardware; can detect fluorescently tagged PTMs

  • Challenges: slow (≈1 h/cycle), <30 AA read length due to <91 % cycle yield, limited PTM chemistry, photobleaching/dark states, dynamic range ~10⁷ required
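
The image → cleave → image cycle can be simulated for an idealised peptide to show how intensity drops map to label positions, and why imperfect cycle yield caps read length. This sketch assumes perfect labelling and no photobleaching; the peptide is a toy example, and 0.91 is used as the upper bound on yield quoted above.

```python
def intensity_trace(peptide: str, labelled: str, cycles: int) -> list[int]:
    """Dye count remaining before cycle 1 and after each Edman cycle (ideal chemistry)."""
    return [sum(aa == labelled for aa in peptide[c:]) for c in range(cycles + 1)]


def label_positions(trace: list[int]) -> list[int]:
    """1-based residue positions at which the intensity dropped."""
    return [i for i in range(1, len(trace)) if trace[i] < trace[i - 1]]


def intact_fraction(cycle_yield: float, cycles: int) -> float:
    """Fraction of molecules still in phase after `cycles` Edman steps."""
    return cycle_yield ** cycles


if __name__ == "__main__":
    peptide = "GKAKCGK"  # toy peptide with lysines at positions 2, 4 and 7
    trace = intensity_trace(peptide, "K", cycles=len(peptide))
    print("trace:", trace)
    print("lysine positions:", label_positions(trace))
    print(f"in phase after 30 cycles at 91 % yield: {intact_fraction(0.91, 30):.3f}")
```

At 91 % yield only about 6 % of molecules remain in phase after 30 cycles, which is consistent with the <30 AA read length noted above.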

b) Single-molecule FRET with ClpXP motor (van Ginkel et al.)
  • Donor-labelled ClpP14 + acceptor-labelled AAs; ClpX unfolds & feeds peptide through pore; FRET bursts encode order of labels

  • Database alignment identifies peptide; still limited by label chemistry, throughput, fluorescence noise
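
Database identification from label order can be sketched by reducing every candidate sequence to the string of labelled residues and comparing it with the observed order. The choice of cysteine and lysine as the labelled residues and the three toy sequences are assumptions made for illustration; real data would also need tolerant matching to absorb missed or spurious FRET bursts.

```python
def label_fingerprint(sequence: str, labelled: tuple[str, ...] = ("C", "K")) -> str:
    """Order of labelled residues along the sequence, ignoring all others."""
    return "".join(aa for aa in sequence if aa in labelled)


def match_database(observed: str, database: dict[str, str],
                   labelled: tuple[str, ...] = ("C", "K")) -> list[str]:
    """Names of database entries whose fingerprint equals the observed one."""
    return [name for name, seq in database.items()
            if label_fingerprint(seq, labelled) == observed]


if __name__ == "__main__":
    # Hypothetical peptide database
    database = {"pepA": "MKCAAK", "pepB": "MCKAAK", "pepC": "MKACK"}
    print(match_database("CKK", database))  # unique hit
    print(match_database("KCK", database))  # two entries share this fingerprint
```

The second query returns two candidates, showing that order alone can be ambiguous; inter-label spacing or additional label types would be needed to separate them.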

4. Nanopore-Based Approaches

a) 5-Dimensional Fingerprinting of Folded Proteins (Yusko et al.)
  • Use 30 nm pore coated with fluid lipid; tethered protein drifts slowly ⇒ measurable blockade modulation

  • Extract per molecule:

    • Volume

    • Shape (oblate/prolate)

    • Net charge

    • Rotational diffusion coefficient

    • Dipole moment

  • Provides unique fingerprint without purification; bandwidth limited, ≈1 molecule/2 s

  • Throughput solution: fabricate dense nanopore arrays (30 nm pores on 200 nm pitch; even 2 nm pores via STEM sputtering)
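
The throughput gain from a dense array can be estimated from the pitch and the per-pore event rate. The sketch below is an upper bound: it assumes a square lattice in which every pore is active and read out independently, which the notes do not claim.

```python
def pores_per_mm2(pitch_nm: float) -> float:
    """Pores per square millimetre on a square lattice with the given pitch."""
    per_side = 1e6 / pitch_nm  # 1 mm = 1e6 nm
    return per_side ** 2


def array_rate_per_s(area_mm2: float, pitch_nm: float, per_pore_rate_hz: float) -> float:
    """Idealised molecules per second for the whole array."""
    return area_mm2 * pores_per_mm2(pitch_nm) * per_pore_rate_hz


if __name__ == "__main__":
    density = pores_per_mm2(200)            # 200 nm pitch from the notes
    rate = array_rate_per_s(1.0, 200, 0.5)  # 1 molecule per 2 s per pore
    print(f"{density:.2e} pores/mm^2, {rate:.2e} molecules/s per mm^2")
```

Even at one event every two seconds per pore, a square millimetre of array could in principle exceed ten million events per second, so the practical limit shifts to readout electronics.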

b) Sequencing Denatured Proteins with Sub-Nanopores
  • Molecular dynamics (MD) of 2.2 nm pores in 2D MoS₂/graphene shows stepwise AA passage; jamming issues suggest smaller pores needed

  • Subnanopore fabrication: focused electron beam drills 0.25–0.6 nm waist in 10 nm SiN; biconical geometry focuses field over ≈1.5 nm

  • Denaturation with SDS + heat + β-ME; SDS confers uniform negative charge & rod-like shape

  • Experimental results (Timp lab):

    • Single CCL5 (67 AA) produces fluctuating blockade; consensus of 400 events correlates with AA volume model (k=3 moving average, PCC=0.75, 65 % accuracy)

    • Random-forest regression using volume + hydrophilicity removes bias, improves PCC and identification

    • Cluster of 5–10 blockades sufficient to identify a protein (search against 20 % of the human proteome) with P ≤ 10⁻⁶

    • Sensitivity: single-molecule detection; PTM or single-AA substitutions distinguished (≈0.07 nm³ volume differences)

  • Technical hurdles

    • Noise sources: thermal, 1/f, dielectric (dominant above 1 kHz), amplifier; mitigation via low-κ dielectrics, on-chip amplifiers

    • Control of translocation speed; potential aids: pore/chain friction, SDS steric steps, unfoldase coupling (ClpXP), electric field optimisation

    • Manufacturing: high-yield, CMOS-compatible subnanopore arrays still under development
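
The AA-volume model used to interpret blockades can be sketched as a k = 3 moving average of residue volumes, scored against a trace with the Pearson correlation coefficient (PCC). The volume table holds approximate, commonly tabulated residue volumes in Å³ and should be checked against a primary source; the candidate sequences and the "observed" trace are synthetic, and this is not the published analysis pipeline.

```python
import math

# Approximate residue volumes in cubic angstroms (assumed values; verify before use)
AA_VOLUME_A3 = {
    "G": 60.1, "A": 88.6, "S": 89.0, "C": 108.5, "D": 111.1, "P": 112.7,
    "N": 114.1, "T": 116.1, "E": 138.4, "V": 140.0, "Q": 143.8, "H": 153.2,
    "M": 162.9, "I": 166.7, "L": 166.7, "K": 168.6, "R": 173.4, "F": 189.9,
    "Y": 193.6, "W": 227.8,
}


def volume_model(sequence: str, k: int = 3) -> list[float]:
    """Moving average of residue volume over windows of k consecutive residues."""
    vols = [AA_VOLUME_A3[aa] for aa in sequence]
    return [sum(vols[i:i + k]) / k for i in range(len(vols) - k + 1)]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length, non-constant series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_match(observed: list[float], candidates: dict[str, str], k: int = 3) -> str:
    """Name of the candidate whose volume model correlates best with the trace."""
    return max(candidates, key=lambda n: pearson(volume_model(candidates[n], k), observed))


if __name__ == "__main__":
    candidates = {"candidate_1": "GASWKLFGA", "candidate_2": "WKLFGAGAS"}  # hypothetical
    noise = [3, -2, 1, -1, 2, -3, 1]  # synthetic perturbation
    observed = [v + d for v, d in zip(volume_model("GASWKLFGA"), noise)]
    print(best_match(observed, candidates))
```

A volume-only model like this carries the bias the notes mention; the random-forest variant adds hydrophilicity as a second feature, which is not reproduced here.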

COMPARATIVE INSIGHTS & REAL-WORLD IMPLICATIONS

  • Sensitivity hierarchy (fewest molecules needed first):
    subnanopore < fluorescence fingerprinting < BU-MS < TD-MS (TD-MS being ≈100× less sensitive than BU-MS)

  • Data-analysis parallels with genomics: alignment, consensus building, error-correction; need for proteome-specific bioinformatics

  • Ethical & clinical impact: ability to profile whole proteomes could revolutionise diagnosis (e.g., low-abundance cytokines as biomarkers), drug discovery, personalised medicine

FUTURE OUTLOOK

  • Near term (0–2 yr):

    • MS continues as core; supplemented by long-read transcriptomics & CITE-seq for isoform & surface protein insight

    • Pilot fluorosequencing instruments may deliver millions of reads albeit slowly

  • Mid term (≈5 yr):

    • Advances in fluorescence chemistry & imaging could shorten cycle times; better PTM tags broaden scope

    • Subnanopore arrays with improved noise control, high-bandwidth electronics likely to mature; potential single-cell proteome quantitation

  • Long term (>5 yr):

    • Affordable high-density nanopore chips (Moore's-law-style scaling) could offer direct, PTM-aware, database-free protein sequencing

    • Integration with single-cell and spatial omics creates holistic molecular diagnostics platforms

KEY REFERENCES TO KNOW

  • Timp & Timp, Sci. Adv. 2020 – survey discussed here

  • Swaminathan et al., Nat. Biotechnol. 2018 – fluorosequencing

  • van Ginkel et al., PNAS 2018 – ClpXP FRET fingerprinting

  • Yusko et al., Nat. Nanotechnol. 2017 – 5D nanopore fingerprinting

  • Kennedy et al., Nat. Nanotechnol. 2016 – subnanopore AA-volume reading

  • Workman et al., Nat. Methods 2019 – native RNA nanopore sequencing

End of notes