Beyond Mass Spectrometry – Key Proteomics Vocabulary

CENTRAL DOGMA & MOTIVATION FOR DIRECT PROTEIN SEQUENCING

Information flow: DNA → RNA → Protein
- Transcription, splicing, translation determine amino-acid (AA) sequence
- Start-site choice, open-reading-frame (ORF) selection and post-translational modifications (PTMs) add further variability
Proteins govern structure, signalling, catalysis; mutations or mis-processing underpin diseases (e.g. Alzheimer’s, Huntington’s)
Genome & transcriptome sequencing are inexpensive (≈$1000 per human genome) yet insufficient:
- Long-read genome assemblies retain ≈0.1 % error (frameshifts) ⇒ hundreds of protein errors
- RNA levels do not correlate linearly with protein abundance because of translation efficiency, RNA lifetime, poly(A) length, RNA modifications
- Analyses of a “well-characterised” human transcriptome found ≈1.1×10⁵ novel transcripts
Human proteome complexity
- ≈2×10⁴ protein-coding genes
- >100 isoforms per gene after alternative splicing, single-AA polymorphisms, PTMs ⇒ >10⁶ proteoforms
- ≈60 % of proteins glycosylated; other frequent PTMs: methylation, acetylation, phosphorylation
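
The proteoform estimate above is a simple product of the two quoted figures. A minimal sketch of that arithmetic, using only the numbers in these notes (the variable names are illustrative):

```python
# Back-of-envelope proteoform count from the figures quoted in the notes.
protein_coding_genes = 2e4   # ≈2×10^4 protein-coding genes
isoforms_per_gene = 100      # lower bound: >100 isoforms per gene

# Lower-bound estimate: genes × isoforms per gene.
proteoforms_lower_bound = protein_coding_genes * isoforms_per_gene
print(f"at least {proteoforms_lower_bound:.0e} proteoforms")
```

The product (2×10⁶) is consistent with the ">10⁶ proteoforms" figure; it is a floor, since the isoform count per gene is itself a lower bound.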

HISTORICAL & CURRENT GOLD STANDARD: MASS SPECTROMETRY (MS)

Edman degradation (1950s–1990s)

Cyclic N-terminal chemistry; >99 % efficiency per AA, but:
- Slow (≈1 h/cycle), limited to <30 AA, needs ≈100 pmol, fails on blocked N-termini & many PTMs

Bottom-Up MS (BU-MS)

Workflow: protease digest (typically trypsin) → ionisation (ESI/MALDI) → MS → MS/MS → database search (Mascot, SEQUEST)
Generates peptide “fingerprints” rather than complete sequences
Ambiguities: isoleucine vs leucine (isobaric), homologous fragments, shared peptides; reconstruction uses inclusion, exclusion or parsimony strategies
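
The parsimony strategy can be illustrated as a set-cover problem: choose the fewest proteins that explain every observed peptide. A minimal sketch, assuming a greedy approximation (not necessarily what Mascot, SEQUEST or any specific tool implements); protein and peptide names are hypothetical:

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover: pick few proteins that together explain all observed peptides.

    protein_to_peptides maps a protein name to the set of observed peptides
    that could have come from it (shared peptides appear under several proteins).
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Take the protein explaining the most still-unexplained peptides;
        # iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical example: ProtB's peptides are all shared with ProtA.
candidates = {
    "ProtA": {"pep1", "pep2", "pep3"},
    "ProtB": {"pep2", "pep3"},
    "ProtC": {"pep4"},
}
print(parsimonious_proteins(candidates))
```

An inclusion strategy would instead report all three proteins, and an exclusion strategy would discard the shared peptides before inference; the greedy cover sits between the two.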
Sensitivity limits
- Typical detection limit ≈480 fg ⇒ ≈6 million molecules (50 kDa)
- Dynamic range ≈10⁵ (Orbitrap) vs biological range 10¹² (e.g. antibodies mg mL⁻¹; cytokines pg mL⁻¹)
- Identification usually needs 10⁻¹⁸–10⁻¹⁵ mol (≈10⁶–10⁹ copies)
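
The mass-to-copy-number conversions above follow from Avogadro's number. A minimal sketch of the arithmetic (function names are illustrative):

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_from_mass(mass_g, mw_da):
    """Number of molecules in mass_g grams of a species of molecular weight mw_da."""
    return mass_g / mw_da * AVOGADRO


def copies_from_moles(moles):
    """Number of molecules in a given amount of substance."""
    return moles * AVOGADRO


# 480 fg of a 50 kDa protein: roughly 6 million molecules.
n_detect = molecules_from_mass(480e-15, 50_000)
print(f"{n_detect:.2e} molecules at the detection limit")

# 1 amol and 1 fmol bracket the quoted identification requirement.
print(f"{copies_from_moles(1e-18):.1e} to {copies_from_moles(1e-15):.1e} copies")
```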
75 % of acquired spectra remain unidentified; clustering rescues only ≈20 %
PTM site localisation problematic; phosphorylation example:
- Human proteome: ≈2×10⁷ residues; avg. 1.5 Ser/Thr/Tyr per 10-mer ⇒ multiple candidate sites
- Requires enrichment (IMAC, ion exchange) and mass measurement accuracy (MMA) ≤1 ppm; many instruments offer 10–250 ppm

Top-Down MS (TD-MS)

Analyses intact proteins (≤70 kDa) via ESI + fragmentation (CID, ECD, ETD)
≈100× less sensitive than BU-MS; needs high-field magnets (7–14 T)
Resolution challenge: difference between trimethyl-lysine & acetyl-lysine = 0.0364 Da ⇒ need <1 ppm on 50 kDa ion
Fragmentation efficiency decreases with MW; generally requires ≥0.5 µg mL⁻¹ pure protein
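
The trimethyl vs acetyl resolution figure can be checked from monoisotopic atomic masses. A minimal sketch, assuming trimethylation adds C3H6 and acetylation adds C2H2O to the lysine side chain (standard net compositions); names are illustrative:

```python
# Monoisotopic atomic masses in Da.
C, H, O = 12.0, 1.00782503, 15.99491462

trimethyl_shift = 3 * C + 6 * H      # net +C3H6
acetyl_shift = 2 * C + 2 * H + O     # net +C2H2O
delta_da = trimethyl_shift - acetyl_shift


def required_ppm(delta_mass_da, ion_mass_da):
    """Relative mass accuracy (ppm) needed to separate two species delta_mass_da apart."""
    return delta_mass_da / ion_mass_da * 1e6


print(f"mass difference: {delta_da:.4f} Da")
print(f"accuracy needed on a 50 kDa ion: {required_ppm(delta_da, 50_000):.2f} ppm")
```

The same 0.0364 Da gap on a 1 kDa tryptic peptide corresponds to about 36 ppm, which is one reason PTM discrimination is far easier in BU-MS than on intact proteins.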

KEY NUMERICAL RELATIONS & FORMULAE

Protein volume–mass relation: $V_{mol}\,[\text{nm}^3] = 1.21\times10^{-3}\,MW_{Da}$
Fractional current blockade (idealised): $\frac{\Delta I}{I_0} = f\,\frac{\Delta V_{mol}}{V_{pore}}\,S$
- $f$: shape/orientation factor; $S$: field-distortion size factor
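
The two relations above can be evaluated numerically. A minimal sketch; the pore volume below and the defaults f = S = 1 are hypothetical placeholders chosen only to make the example run, not values from the source:

```python
def molecular_volume_nm3(mw_da):
    """Protein volume in nm^3 from molecular weight in Da (V_mol = 1.21e-3 * MW)."""
    return 1.21e-3 * mw_da


def fractional_blockade(delta_v_mol_nm3, v_pore_nm3, f=1.0, s=1.0):
    """Idealised fractional blockade: dI/I0 = f * (dV_mol / V_pore) * S."""
    return f * delta_v_mol_nm3 / v_pore_nm3 * s


v_50kda = molecular_volume_nm3(50_000)   # about 60.5 nm^3
# Hypothetical pore volume of 1000 nm^3; assumes the whole molecule
# contributes to the excluded volume (delta V_mol = V_mol).
blockade = fractional_blockade(v_50kda, v_pore_nm3=1000.0)
print(f"V_mol = {v_50kda:.1f} nm^3, dI/I0 = {blockade:.4f}")
```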

LIMITATIONS THAT DRIVE NEW TECHNOLOGIES

Insufficient sensitivity & dynamic range for low-abundance proteins/isoforms/PTMs
Dependence on reference databases; difficulty in de novo sequencing
Need for large sample amounts prevents single-cell proteomics

ALTERNATIVES BEYOND MS

1. Long-Read Transcriptomics (PacBio, Oxford Nanopore)

Reads entire cDNAs or native RNA (10 kb+) allowing isoform resolution without assembly
Example: CDKN2A locus
- Isoforms p14ARF vs p16INK4a differ by a frameshift despite shared exons; long-read sequencing assigned 93 vs 33 reads respectively, whereas short reads would mis-classify them
Error correction: circular consensus (PacBio), alignment polishing ⇒ >99 % accuracy
Limitations: RT processivity, low depth for rare transcripts, RNA ≠ protein, cannot capture PTMs
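
The idea behind consensus error correction can be shown with a per-position majority vote over repeated passes of the same molecule. A minimal sketch, assuming the passes are already aligned and of equal length (real circular-consensus pipelines also handle insertions and deletions); the reads are made up:

```python
from collections import Counter


def majority_consensus(passes):
    """Per-position majority vote over pre-aligned, equal-length reads."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be non-empty and of equal length")
    # zip(*passes) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Three hypothetical passes, each with at most one independent error.
passes = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(majority_consensus(passes))
```

Because the errors fall at different positions in different passes, each is outvoted; accuracy therefore rises with the number of passes as long as errors are independent.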

2. Single-Cell Multi-omics – CITE-seq & Spatial Methods

Attach DNA barcodes to antibodies; droplet microfluidics (Drop-seq/10X): joint sequencing of
- mRNA (transcriptome)
- Antibody-derived tags (proxy for surface protein abundance)
Advantages: thousands of proteins possible (DNA barcode space > fluorescence channels)
Present limits: surface proteins only; antibody specificity/cross-reactivity; steric hindrance blocks multi-PTM probing
Spatial transcriptomics (MERFISH, STARmap, seqFISH+) adds localisation but suffers from low throughput & spectral crowding

3. Fluorescent Protein “Fingerprinting”

a) Fluorosequencing by Edman degradation (Swaminathan et al.)

Fragment proteins; covalently label specific AA types with dyes; tether C-termini on flow-cell; TIRF imaging
Cycles: image → Edman remove N-terminal AA → image; drops in intensity reveal labelled AA positions
Scalable (millions of molecules) leveraging NGS hardware; can detect fluorescently tagged PTMs
Challenges: slow (≈1 h/cycle), <30 AA read length due to <91 % cycle yield, limited PTM chemistry, photobleaching/dark states, dynamic range ∼10⁷ required
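
The read-out logic of fluorosequencing can be sketched with an idealised single-dye trace: intensity is the number of labelled residues still attached, and each drop marks a labelled position. This assumes perfect labelling, no photobleaching and 100 % cycle yield; the peptide is made up. The last function shows why imperfect yield caps read length, taking 91 % as an illustrative per-cycle yield:

```python
def fluorosequence_trace(peptide, labelled):
    """Ideal dye count before any cycle and after each Edman cycle."""
    return [sum(aa in labelled for aa in peptide[i:])
            for i in range(len(peptide) + 1)]


def label_positions(trace):
    """1-based residue positions at which the intensity drops."""
    return [i + 1 for i in range(len(trace) - 1) if trace[i + 1] < trace[i]]


def in_phase_fraction(cycle_yield, cycles):
    """Fraction of molecules that have completed every cycle so far."""
    return cycle_yield ** cycles


trace = fluorosequence_trace("AKCGKCA", labelled={"K"})   # lysine-labelled peptide
print(trace, label_positions(trace))
print(f"{in_phase_fraction(0.91, 30):.3f} of molecules still in phase after 30 cycles")
```

The recovered pattern (labelled residues at positions 2 and 5) is the partial "fingerprint" that is then matched against a reference database.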

b) Single-molecule FRET with ClpXP motor (van Ginkel et al.)

Donor-labelled ClpP14 + acceptor-labelled AAs; ClpX unfolds & feeds peptide through pore; FRET bursts encode order of labels
Database alignment identifies peptide; still limited by label chemistry, throughput, fluorescence noise

4. Nanopore-Based Approaches

a) 5-Dimensional Fingerprinting of Folded Proteins (Yusko et al.)

Use 30 nm pore coated with fluid lipid; tethered protein drifts slowly ⇒ measurable blockade modulation
Extract per molecule:
- Volume
- Shape (oblate/prolate)
- Net charge
- Rotational diffusion coefficient
- Dipole moment
Provides unique fingerprint without purification; bandwidth limited, ≈1 molecule/2 s
Throughput solution: fabricate dense nanopore arrays (30 nm pores on 200 nm pitch; even 2 nm pores via STEM sputtering)

b) Sequencing Denatured Proteins with Sub-Nanopores

Molecular dynamics (MD) of 2.2 nm pores in 2D MoS$_2$/graphene shows stepwise AA passage; jamming issues suggest smaller pores needed
Subnanopore fabrication: focused electron beam drills 0.25–0.6 nm waist in 10 nm SiN; biconical geometry focuses field over ≈1.5 nm
Denaturation with SDS + heat + β-ME; SDS confers uniform negative charge & rod-like shape
Experimental results (Timp lab):
- Single CCL5 (67 AA) produces fluctuating blockade; consensus of 400 events correlates with AA volume model (k = 3 moving average; Pearson correlation coefficient, PCC = 0.75; 65 % accuracy)
- Random-forest regression using volume + hydrophilicity removes bias, improves PCC and identification
- Cluster of 5–10 blockades sufficient to identify a protein in a search against 20 % of the human proteome, with P ≤ 10⁻⁶
- Sensitivity: single-molecule detection; PTM or single-AA substitutions distinguished (≈0.07 nm$^3$ volume differences)
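
The volume-model comparison above can be sketched as: map each residue to its volume, smooth with a k = 3 moving average (several residues occupy the pore waist at once), then correlate with the measured consensus. Everything below is illustrative: the residue volumes are approximate tabulated values (they vary by source), the sequence is arbitrary rather than CCL5, and the "measured" trace is synthetic noise added to the model, not experimental data:

```python
import math
import random

# Approximate residue volumes in nm^3 (illustrative; tabulated values vary by source).
AA_VOLUME_NM3 = {
    "G": 0.0601, "A": 0.0886, "S": 0.0890, "C": 0.1085, "D": 0.1111,
    "P": 0.1127, "N": 0.1141, "T": 0.1161, "E": 0.1384, "V": 0.1400,
    "Q": 0.1438, "H": 0.1532, "M": 0.1629, "I": 0.1667, "L": 0.1667,
    "K": 0.1686, "R": 0.1734, "F": 0.1899, "Y": 0.1936, "W": 0.2278,
}


def volume_model(sequence, k=3):
    """Moving average (window k) of residue volumes along the sequence."""
    vols = [AA_VOLUME_NM3[aa] for aa in sequence]
    return [sum(vols[i:i + k]) / k for i in range(len(vols) - k + 1)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


sequence = "GASWKLFEDRGYTPVNMHQCI"        # arbitrary test sequence
model = volume_model(sequence, k=3)
rng = random.Random(0)                    # seeded so the sketch is repeatable
synthetic_trace = [v + rng.gauss(0.0, 0.02) for v in model]  # synthetic, not measured
print(f"PCC(model, synthetic trace) = {pearson(model, synthetic_trace):.2f}")
```

In practice the measured blockade must first be aligned and rescaled to the model's length before correlating; that step is omitted here.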
Technical hurdles
- Noise sources: thermal, 1/f, dielectric (dominant above 1 kHz), amplifier; mitigation via low-κ dielectrics, on-chip amplifiers
- Control of translocation speed; potential aids: pore/chain friction, SDS steric steps, unfoldase coupling (ClpXP), electric field optimisation
- Manufacturing: high-yield, CMOS-compatible subnanopore arrays still under development

COMPARATIVE INSIGHTS & REAL-WORLD IMPLICATIONS

Sensitivity hierarchy (fewest molecules needed):
subnanopore < fluorescence fingerprinting < BU-MS < TD-MS
Data-analysis parallels with genomics: alignment, consensus building, error-correction; need for proteome-specific bioinformatics
Ethical & clinical impact: ability to profile whole proteomes could revolutionise diagnosis (e.g., low-abundance cytokines as biomarkers), drug discovery, personalised medicine

FUTURE OUTLOOK

Near term (0–2 yr):
- MS continues as core; supplemented by long-read transcriptomics & CITE-seq for isoform & surface protein insight
- Pilot fluorosequencing instruments may deliver millions of reads albeit slowly
Mid term (≈5 yr):
- Advances in fluorescence chemistry & imaging could shorten cycle times; better PTM tags broaden scope
- Subnanopore arrays with improved noise control, high-bandwidth electronics likely to mature; potential single-cell proteome quantitation
Long term (>5 yr):
- Affordable high-density nanopore chips (Moore's-law scaling) offer direct, PTM-aware, database-free protein sequencing
- Integration with single-cell and spatial omics creates holistic molecular diagnostics platforms

KEY REFERENCES TO KNOW

Timp & Timp, Sci. Adv. 2020 – survey discussed here
Swaminathan et al., Nat. Biotechnol. 2018 – fluorosequencing
van Ginkel et al., PNAS 2018 – ClpXP FRET fingerprinting
Yusko et al., Nat. Nanotechnol. 2017 – 5D nanopore fingerprinting
Kennedy et al., Nat. Nanotechnol. 2016 – subnanopore AA-volume reading
Workman et al., Nat. Methods 2019 – native RNA nanopore sequencing
