BCMB 322 Protein Chemistry — Primary Structure and Protein Sequencing Notes

Levels of Protein Structure

Protein structure is organized into hierarchical levels:
- Primary structure
- Secondary structure
- Tertiary structure
- Quaternary structure
Other levels include motifs and domains.
Visual cue from slide: a polypeptide chain with amino acid residues forming helices and assembled subunits.

Primary structure

Definition: the sequence of amino acids in the polypeptide chain (or chains) and the amino acid composition of the protein.
It is the basic product of DNA transcription and translation.
It determines the nature of the subsequent folding of the polypeptide chain.
Bond type emphasized: peptide bonds (covalent).

Extent of diversity of primary structure

There are 20 different amino acids.
For a protein of n residues, there are $20^n$ possible sequences.
Example: for a protein with 100 residues, the number of possible sequences is $20^{100} \,\approx\, 1.27 \times 10^{130}$ unique polypeptide chains.
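
The figure above can be checked numerically. The following is a minimal sketch, not part of the lecture material; the function name `sequence_space` is hypothetical, and logarithms are used so that large values of n do not overflow.

```python
import math

def sequence_space(n_residues, alphabet_size=20):
    """Return (mantissa, exponent) so that alphabet_size**n ~ mantissa x 10**exponent."""
    log10_total = n_residues * math.log10(alphabet_size)
    exponent = math.floor(log10_total)
    mantissa = 10 ** (log10_total - exponent)
    return mantissa, exponent

mantissa, exponent = sequence_space(100)
print(f"20^100 is about {mantissa:.2f} x 10^{exponent}")
```

For n = 100 this reproduces the mantissa of about 1.27 and the exponent of 130 quoted in the example.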

Actual size and composition of polypeptides

Most polypeptides contain between 100 and 1000 residues.
Very long polypeptides may challenge protein synthesis machinery.
Longer peptides and genes increase the likelihood of transcription/translation errors.

Why determine primary structure?

Knowing the amino acid sequence matters because:
- It is a prerequisite for determining 3D structure and understanding the molecular mechanism of action.
- Sequence comparisons reveal function and evolutionary relationships among organisms.
- Inherited diseases often arise from mutations that alter amino acids (e.g., sickle cell disease).
- It enables development of diagnostic tests and therapies.

Determination of primary structure

Protein sequencing means determining the order of amino acids in the protein.
The protein must be reasonably pure for sequencing.
Frederick Sanger first sequenced bovine insulin in 1953; it took >10 years and ~100 g of protein.
- Sanger Nobel prize awarded in 1958 for this work.
Since then procedures have been refined and automated (Sequenators).

Techniques in protein sequencing (general concept)

Break the protein into fragments small enough to sequence individually.
Reconstruct the intact protein sequence from overlapping fragment sequences.

Outline of protein sequencing procedure

Determine end groups.
Separate subunits if present.
Cleave disulfide bonds and disrupt other linkages.
Perform specific cleavage.
Sequence fragments.
Work out peptide sequences and positions of disulfide bonds.

End group analysis

N-terminal and C-terminal analysis reveals how many distinct polypeptide chains (subunits) are present, and therefore whether subunits must be separated before sequencing.
Subunits are held together by disulfide bonds or noncovalent interactions.

N-terminal analysis

Methods include:
- Dansylation of N-terminal amino acids, followed by acid hydrolysis and chromatographic identification against standards.
- Edman degradation: reacts the N-terminus with phenyl isothiocyanate under mildly alkaline conditions; the N-terminal amino acid is released as a thiazolinone derivative under acidic conditions.
- Sanger’s reagent (FDNB) treatment, followed by acid hydrolysis and chromatographic identification of the FDNB-derivative.
- Enzymatic hydrolysis with aminopeptidases to cleave N-terminal residues.

Dansylation (N-terminal analysis)

Reagent: 1-dimethylaminonaphthalene-5-sulfonyl chloride (dansyl chloride).
Reaction: primary amines (N-terminal amino group) become dansylated polypeptides.
Post-hydrolysis: the modified N-terminal residue is released and identified by intense yellow fluorescence after chromatographic separation.
Fluorescence comparison with standards determines the N-terminal amino acid.
Utility: indicates the number of polypeptide chains (e.g., two polypeptides yield two distinct dansylated products).

Edman degradation (N-terminal analysis)

Reagent: Phenylisothiocyanate (PITC).
Reaction: PITC reacts with the N-terminal amino group to form a phenylthiocarbamyl (PTC) adduct under mildly alkaline conditions.
Acidic treatment with trifluoroacetic acid cleaves the N-terminal residue as a thiazolinone derivative without cleaving other peptide bonds.
The thiazolinone derivative is converted to a more stable phenylthiohydantoin (PTH) derivative, which is identified by chromatography.
This process can be repeated (Edman degradation) to sequentially remove and identify residues.
Note: Partial Edman degradation can be used to sequence peptides when full Edman cycles are not feasible.

C-terminal analysis

There is no universally reliable chemical method for direct C-terminal residue identification.
Hydrazinolysis: hydrazine converts every residue except the C-terminal one to an aminoacyl hydrazide, so the C-terminal residue is released as the free amino acid and can be identified chromatographically; the method destroys the rest of the sample and suffers from side reactions.
Enzymatic approach: use carboxypeptidases to hydrolyze C-terminal residues one by one.
- Carboxypeptidase A (bovine pancreas): cleaves all C-terminal peptide bonds except when Arg/Lys are at C-terminus or when Pro is prior to the C-terminus.
- Carboxypeptidase B (porcine pancreas): cleaves basic C-term residues (Arg/Lys) but not if Pro precedes them.
- Carboxypeptidase C: broad C-terminal cleavage.
- Carboxypeptidase Y (yeast): broad activity, active in presence of urea/detergents.
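
The specificity rules for carboxypeptidases A and B can be written as a small predicate. This is a simplified sketch that encodes only the rules listed above; the function name `carboxypeptidase_cleaves` and the one-letter peptide strings are assumptions made for illustration.

```python
def carboxypeptidase_cleaves(peptide, enzyme):
    """Return True if the C-terminal residue would be released (simplified rules)."""
    if len(peptide) < 2:
        return False  # nothing meaningful to cleave
    c_terminal, penultimate = peptide[-1], peptide[-2]
    if penultimate == "P":
        return False  # Pro before the C-terminus blocks both enzymes
    if enzyme == "A":
        return c_terminal not in "RK"  # A does not release Arg or Lys
    if enzyme == "B":
        return c_terminal in "RK"      # B releases only Arg or Lys
    raise ValueError("enzyme must be 'A' or 'B'")

print(carboxypeptidase_cleaves("GAF", "A"))  # C-terminal Phe
print(carboxypeptidase_cleaves("GAK", "B"))  # C-terminal Lys
```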

Cleavage of disulfide (S–S) bonds

Types of S–S linkages:
- Interchain: between cysteines in different polypeptide chains.
- Intrachain: between cysteines within the same polypeptide.
Cleavage of S–S bonds can be done to separate subunits or to linearize the chain.
Methods:
- Oxidative cleavage with performic acid (an approach pioneered by Sanger): converts all cysteines to cysteic acid; also oxidizes methionine and partially destroys the tryptophan indole ring.
- Reductive cleavage with mercaptans (e.g., 2-mercaptoethanol) or DTT (dithiothreitol, Cleland’s reagent).
After reduction, free sulfhydryl groups are protected by alkylation (commonly with iodoacetic acid) to form carboxymethylated cysteine and prevent reformation of disulfide bonds.

Cleavage of non-disulfide linkages

Some oligomeric proteins are held together by noncovalent interactions rather than disulfide bonds; these subunits can be dissociated at extremes of pH, at low salt concentration, or at elevated temperature.
Denaturing agents used to dissociate subunits include:
- Urea (H2N-CO-NH2)
- Guanidinium ion ([C(NH2)3]+)
- Detergents such as SDS (sodium dodecyl sulfate)

Separation of subunits

After disulfide or non-disulfide dissociation, subunits are separated and purified by chromatography and/or electrophoresis.
This separation helps determine the number of subunits and their individual properties.

Protein sequence from DNA

When the gene is available, gene sequencing is faster and often more accurate than protein sequencing.
Genomics and proteomics are complementary approaches.
Example mapping (codons) from DNA to amino acids:
- DNA: CAG TAT CCT ACG ATT TGG
- Protein: Gln Tyr Pro Thr Ile Trp
This illustrates the genetic code linking nucleotide sequences to protein sequences.
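
The mapping in this example can be reproduced with a lookup table. The sketch below is deliberately partial: `CODON_TABLE` holds only the six codons used above (the full genetic code has 64), and the function name `translate` is hypothetical.

```python
# Partial codon table: only the six codons from the example above.
CODON_TABLE = {
    "CAG": "Gln", "TAT": "Tyr", "CCT": "Pro",
    "ACG": "Thr", "ATT": "Ile", "TGG": "Trp",
}

def translate(dna):
    """Translate a coding-strand DNA string, read in frame, into three-letter residues."""
    dna = dna.replace(" ", "").upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # A codon missing from the partial table raises KeyError.
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]

print(" ".join(translate("CAG TAT CCT ACG ATT TGG")))
```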

Determination of amino acid composition

Sometimes it is useful to know the number of each amino acid type in a protein.
Method: complete hydrolysis of the polypeptide, followed by analysis of the liberated amino acids.
Hydrolysis can be chemical (acid or base) or enzymatic; different methods complement each other.

Acid hydrolysis

Typical conditions: 6 M HCl, ~100–120 °C, ~24 h, in vacuum.
Disadvantages: Ser, Thr, and Tyr are partially degraded and Trp is largely destroyed; Asn and Gln are converted to Asp and Glu, respectively.
Faster/more complete approach: add protein to 6 M HCl in an inert N2 atmosphere, seal, and heat to ~200 °C in a microwave for 5–30 min (protein-dependent).
HCl vapour aids hydrolysis.
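
The effect of acid hydrolysis on the measured composition can be illustrated with a short sketch. It models only two of the artifacts noted above: Asn/Asp and Gln/Glu become indistinguishable (reported here with the ambiguity codes B for Asx and Z for Glx), and Trp is treated as completely lost. Partial losses of Ser, Thr, and Tyr are not modelled, and the function name is hypothetical.

```python
from collections import Counter

def acid_hydrolysate_composition(sequence):
    """Count residues as they would appear after acid hydrolysis (simplified model)."""
    observed = Counter()
    for residue in sequence.upper():
        if residue == "W":
            continue                # Trp assumed fully destroyed
        if residue in "ND":
            observed["B"] += 1      # Asn and Asp both detected as Asp (Asx)
        elif residue in "QE":
            observed["Z"] += 1      # Gln and Glu both detected as Glu (Glx)
        else:
            observed[residue] += 1
    return observed

print(dict(acid_hydrolysate_composition("NDQEWAS")))
```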

Base hydrolysis

Conditions: 2–4 M NaOH at 100 °C for 4–8 h.
Problems: decomposition of Cys, Ser, Thr, Arg; racemization of other residues.
Advantage: can be used to detect Trp, which is not reliably detected by acid hydrolysis.

Enzyme hydrolysis

Enzymatic hydrolysis often yields peptide fragments rather than free amino acids.
The peptidases are themselves proteins and are susceptible to proteolysis, so their own amino acids can contaminate the measured composition.

Other determination methods

Chemical: Cyanogen bromide (CNBr) cleaves Met on the carboxyl side.
Spectrophotometric: Trp content can be estimated from the UV absorbance of the unhydrolyzed protein.

Derivatization and amino acid analysis

After hydrolysis, the amino acids are derivatized to enable detection, usually in combination with chromatographic separation.
Common derivatization approaches before/after chromatography:
- Ninhydrin derivatization generates a blue-purple product with the α-amino group of most amino acids; Pro (a secondary amine) yields a yellow product.
- Pre-column derivatization with reagents such as o-phthalaldehyde (OPA; fluorescent derivatives) or phenylisothiocyanate (PITC; UV-absorbing derivatives).
Modern amino acid analyzers (often RP-HPLC) quantify the amino acids in a hydrolysate rapidly and with high sensitivity (as little as 1 pmol of each amino acid per run).

Polypeptide cleavage (fragmentation) for sequencing

Endopeptidases are used to cleave the polypeptide at specific residues to produce fragments.
The generated fragments are sequenced individually and then reassembled by overlap.
In some cases, specific amino acid side chains can be modified to protect or preserve particular bonds (e.g., Lys side chain modification to prevent cleavage at Lys).

Specificity of common fragmentation reagents (overview)

Trypsin (bovine pancreas): cleaves on the C-terminal side of Lys and Arg (not when the next residue is Pro).
Submaxillaris protease: cleaves on the C-terminal side of Arg.
Chymotrypsin (bovine pancreas): cleaves on the C-terminal side of Phe, Trp, Tyr.
Staphylococcus aureus V8 protease: cleaves on the C-terminal side of Glu, and also of Asp under some buffer conditions.
Asp-N-protease (Pseudomonas fragi): cleaves on the N-terminal side of Asp/Glu.
Pepsin (porcine stomach): cleaves after hydrophobic residues (specificity varies with pH).
Endoproteinase Lys-C (Lysobacter enzymogenes): cleaves on the C-terminal side of Lys.
Cyanogen bromide (CNBr): cleaves on the C-terminal side of Met.
Note: All except CNBr, which is a chemical reagent, are proteases with defined recognition sites.
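
These specificities can be tried out on a test sequence with a small in-silico digest. This is a simplified sketch: it applies only the residue and side rules for five of the reagents listed above, ignores neighbouring-residue effects (such as a following Pro), and uses a made-up peptide. The names `CLEAVAGE_RULES` and `digest` are hypothetical.

```python
# reagent -> (recognized residues, side of the residue that is cut)
CLEAVAGE_RULES = {
    "trypsin": ("KR", "C"),
    "chymotrypsin": ("FWY", "C"),
    "Lys-C": ("K", "C"),
    "CNBr": ("M", "C"),
    "Asp-N": ("DE", "N"),
}

def digest(sequence, reagent):
    """Split a one-letter sequence at every site recognized by the reagent."""
    residues, side = CLEAVAGE_RULES[reagent]
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in residues:
            cut = i + 1 if side == "C" else i
            if start < cut < len(sequence):   # skip cuts that give empty fragments
                fragments.append(sequence[start:cut])
                start = cut
    fragments.append(sequence[start:])
    return fragments

peptide = "GAKFMRWLKE"  # made-up example peptide
for reagent in ("trypsin", "chymotrypsin", "CNBr"):
    print(reagent, digest(peptide, reagent))
```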

Determination of amino acid sequence

Achieved by repeated Edman degradation cycles.
If a fragment is too large, it may be re-cut into smaller pieces and sequenced again.
Sequencing of peptides is often automated in sequenators.

Edman degradation (detailed)

Inventor: Pehr Edman.
PITC reacts with the N-terminal amino group to form a PTC adduct.
Under acidic conditions, the N-terminal residue is cleaved as a thiazolinone derivative.
The thiazolinone derivative is converted to the phenylthiohydantoin (PTH) derivative, which is identified by chromatography.
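
The cyclic logic of the procedure can be sketched as a loop that removes and reports one N-terminal residue per cycle. This is an idealized illustration only: it assumes every cycle goes to completion and models none of the chemistry. The generator name `edman_cycles` is hypothetical.

```python
def edman_cycles(peptide, max_cycles=None):
    """Yield (cycle number, residue released as its PTH derivative, remaining peptide)."""
    remaining = peptide
    cycle = 0
    while remaining and (max_cycles is None or cycle < max_cycles):
        cycle += 1
        released, remaining = remaining[0], remaining[1:]
        yield cycle, released, remaining

for cycle, residue, rest in edman_cycles("GAKF"):
    print(f"cycle {cycle}: PTH-{residue} identified, remaining peptide: {rest or '(none)'}")
```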

Reconstruction of the protein’s sequence

After sequencing multiple overlapping fragments, align overlaps to reconstruct the full sequence.
A second round of cleavage with a reagent of different specificity may be used to generate overlapping fragments.
The overlapping information allows ordering of fragments and assembly of the complete primary sequence.
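
The overlap logic can be sketched in code. The example below is a brute-force illustration suitable only for a handful of fragments: it tries every ordering of one fragment set and keeps those orderings whose concatenation contains every fragment of the second set. The fragment sequences are made up, and the function name `reconstruct` is hypothetical.

```python
from itertools import permutations

def reconstruct(fragments_a, fragments_b):
    """Return all orderings of fragments_a consistent with the overlaps in fragments_b."""
    candidates = set()
    for order in permutations(fragments_a):
        sequence = "".join(order)
        # Every fragment from the second digest must appear intact in the candidate.
        if all(fragment in sequence for fragment in fragments_b):
            candidates.add(sequence)
    return sorted(candidates)

tryptic = ["WLK", "E", "GAK", "FMR"]        # made-up fragments, arbitrary order
chymotryptic = ["MRW", "GAKF", "LKE"]
print(reconstruct(tryptic, chymotryptic))
```

Here the chymotryptic fragments bridge the junctions between tryptic fragments, which is why only one ordering survives.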

Practice question (example workflow)

Given CNBr and chymotrypsin fragments with known sequences, deduce the intact polypeptide sequence by overlap.
This exercise emphasizes the logic of fragment-based sequencing and the use of multiple cleavage specificities.

Assignment of S–S linkages (disulfide bonds)

Inter-chain disulfide linkages: identified by diagonal electrophoresis.
- Run the first dimension under non-reducing conditions to separate subunits; then cleave the disulfide bonds (by oxidation or reduction) and run the second dimension at right angles. Species that were disulfide-linked migrate off the diagonal, which maps the disulfide connectivity.
Intra-chain disulfide linkages: cleave the intact protein with its disulfides still in place to yield pairs of peptide fragments, each containing a single Cys, joined by a disulfide bond.
- Isolate the disulfide-linked fragments, cleave and alkylate the disulfide bonds, then sequence the peptides and compare to the full protein.
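
The final comparison step can be sketched as a lookup of each sequenced peptide in the known full sequence. This is a minimal sketch that assumes every peptide contains exactly one Cys and occurs exactly once in the protein; the protein, the peptide pairs, and the function name `locate_disulfides` are all made up for illustration.

```python
def locate_disulfides(protein, linked_pairs):
    """Map each pair of disulfide-linked peptides to 1-based Cys positions in the protein."""
    bonds = []
    for pair in linked_pairs:
        positions = []
        for peptide in pair:
            if peptide.count("C") != 1 or protein.count(peptide) != 1:
                raise ValueError(f"ambiguous peptide: {peptide}")
            # Position of the peptide's single Cys within the full sequence.
            positions.append(protein.find(peptide) + peptide.index("C") + 1)
        bonds.append(tuple(sorted(positions)))
    return sorted(bonds)

protein = "ACGKDCFRLCEKMCW"                          # made-up sequence
linked_pairs = [("ACGK", "LCEK"), ("MCW", "DCFR")]   # made-up linked fragment pairs
print(locate_disulfides(protein, linked_pairs))
```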

Practice questions on disulfide linkages

A problem set asks to deduce positions of disulfide bonds from given proteolytic fragmentation data and disulfide mapping results.

Post-translational modifications (PTMs)

PTMs are covalent or non-covalent alterations to a protein after translation.
Covalent modifications include those involving the peptide bond, the N-terminus, the C-terminus, and side chains.
Non-covalent modifications involve associations with metal ions or other non-covalent interactions (e.g., metalloproteins).

Modifications involving the peptide bond (limited proteolysis)

Performed by proteases (peptidases) to activate proenzymes and prohormones (e.g., digestive enzymes like trypsinogen, pepsinogen; blood clotting factors; proinsulin).
Role in producing active neuropeptides and peptide hormones from larger precursors.
Involved in the macromolecular assembly of virus particles (e.g., HIV protease).
Also involved in removal of signal sequences from nascent proteins.

Processing of pre-pro-insulin to active insulin (example)

Preproinsulin contains an N-terminal signal sequence.
Removal of the signal sequence yields proinsulin; excision of the internal connecting (C) peptide then yields mature insulin, consisting of A- and B-chains connected by disulfide bonds.

Modifications involving the amino terminus

Deformylation of formyl-Met during initiation of protein synthesis in bacteria (removal of N-formyl group).
Proteolytic removal of N-terminal Met by aminopeptidases.
Acetylation (N-acetylation): common in eukaryotes (60–90% of cytosolic/cellular proteins), often conferring resistance to proteolysis.
Lipidation, e.g., myristoylation (C14), at the N-terminus to target proteins to membranes.

Deformylation and acetylation illustrations

Deformylase removes formyl group from formyl-Met to yield Met-start proteins.
N-acetyltransferases catalyze acetylation using acetyl-CoA as the donor.
Myristoylation attaches a C14 fatty acyl group (myristoyl) from myristoyl-CoA to the N-terminus.

Modifications involving the carboxyl terminus

Amidation of C-terminal glycine is common for peptide hormones/neurotransmitters and can affect activity.
Attachment of membrane anchors to the C-terminus.
Methyl esterification can modulate localization and exopeptidase resistance.
Ubiquitination: covalent attachment of one or more ubiquitin monomers, linked through ubiquitin's own C-terminus; targets proteins for proteasomal degradation and also influences localization and interactions.

Amidation and membrane anchors (examples)

Amidation often involves removal/processing of C-terminal glycine and generation of an amide group at the terminus.
Membrane anchors can involve GPI anchors or other lipid-linked moieties that tether the protein to membranes.

Modifications involving amino acid side chains

1) Disulfide cross-linking: formation of covalent linkages between cysteine residues.
2) Lysino-norleucine cross-linking (observed in collagen) provides tensile strength and mechanical stability.
3) Phosphorylation of hydroxyl-containing residues (serine, threonine, tyrosine) by kinases is a major regulatory PTM.

Phosphorylation and its role

Example: Glycogen phosphorylase is activated by phosphorylation at Ser-14, converting inactive phosphorylase b to the active phosphorylase a.
Phosphorylation is reversible and used in many metabolic pathways to regulate enzyme activity.
Enzymes that add phosphate groups are kinases; those that remove are phosphatases.

Glycosylation

Most abundant form of PTM.
Covalent attachment of oligosaccharide chains (4–15 sugars) to the polypeptide.
Sugars can constitute 50%+ of the protein’s molecular weight.
Most glycosylated proteins are secreted or membrane-bound.
Glycosylation protects against proteolysis (steric protection) and is important for cell–cell recognition.
Two basic types: N-linked (Asn) and O-linked (Ser/Thr).

N-linked vs O-linked glycosylation

N-linked glycosylation: bond between the amide nitrogen of asparagine and the C-1 of N-acetylglucosamine (GlcNAc); occurs co-translationally in the endoplasmic reticulum (ER).
O-linked glycosylation: bond between serine/threonine hydroxyl and an amino sugar; occurs in ER or Golgi; mediated by glycosyl transferases; sugars are added one at a time from nucleotide-activated monosaccharides.
Examples: ABO blood group antigens on erythrocyte surfaces; A and B antigens are pentasaccharides; O antigen is a tetrasaccharide lacking the fifth sugar residue and is non-antigenic.

Non-covalent modifications

Example: Metalloproteins – proteins that bind metal ions as cofactors for enzymatic activity, transport, signaling, etc.
Metal ions influence structure, activity, and interactions without forming covalent bonds.

Peptide mapping (fingerprinting)

Purpose: differentiate a peptide from very closely related peptides; confirm protein identity and purity.
Approach: hydrolyze protein to amino acids or to characteristic fragments; generate fingerprints using SDS-PAGE or HPLC and compare chromatograms to standards.
Applications: protein identification, detection of genetic polymorphisms, stability of modified organisms, disease diagnostics.
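
The idea behind a fingerprint comparison can be illustrated with sequences rather than chromatograms. The sketch below is an in-silico analogy, not the experimental method: it digests two made-up sequences that differ at one residue with a simplified trypsin rule (cut after Lys/Arg) and reports the peptides unique to each. Both function names are hypothetical.

```python
def tryptic_peptides(sequence):
    """Split after every Lys or Arg (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR" and i + 1 < len(sequence):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides

def fingerprint_difference(reference, variant):
    """Return (peptides only in reference, peptides only in variant)."""
    ref, var = set(tryptic_peptides(reference)), set(tryptic_peptides(variant))
    return sorted(ref - var), sorted(var - ref)

# Made-up sequences differing by a single Glu -> Val substitution.
print(fingerprint_difference("GAEKFMRWLK", "GAVKFMRWLK"))
```

Only the peptide carrying the substitution differs between the two digests, which is the kind of change that shifts a single spot or peak in an experimental fingerprint.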

Key takeaways and connections

Primary structure determines folding and ultimately function; mutations can cause diseases.
Multiple analytical strategies (N- and C- terminal analyses, disulfide mapping, fragmentation, and Edman sequencing) work together to reveal complete sequences.
Fragment-based sequencing relies on overlapping fragments and different cleavage specificities to reconstruct the entire protein sequence.
Post-translational and chemical modifications expand functional diversity; understanding them is essential for biochemistry, cell biology, and medicine.
While DNA sequencing can infer protein sequence, proteomics and genomics are complementary in understanding gene expression, regulation, and protein variants.
