Forensic Genetics – Week 4 Lecture Notes

Topic = STRs and Statistics

STRs: Basics

STR stands for Short Tandem Repeats. Also known as microsatellites.
Type of size polymorphism: small DNA fragments (2–6 bp repeats) that occur in tandem.
Typical total length: generally only 100–400 bp in total.

STRs vs VNTRs

STRs (microsatellites)
- Repeat unit size: 2–6 base pairs.
- Repeated many times (e.g., 8–20 times).
VNTRs (minisatellites; Variable Number Tandem Repeats)
- Repeat unit size: hundreds of base pairs.
- Repeated a smaller number of times than STRs (for comparison: VNTR repeat units are hundreds of bp long, STR repeat units only 2–6 bp).

Location of STRs

Not clustered near telomeres; spread unevenly across the entire length of chromosomes.
Located in non-coding DNA generally.
Patches of high and low STR content across the genome.
Occur via mistakes during DNA replication.
Example: Human Chromosome 5 STRs near genes (contextual note in figure).

Types of STRs (repeat motifs by base count)

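Mononucleotide repeats (single base): A or C (e.g., AGATAAAAAAAAGTGTCA). G and T are not listed separately because a run of T or G is simply a run of A or C read from the complementary strand.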
Dinucleotide repeats (2 bases): AC, AG, AT, CG (AC = CA = GT = TG).
Trinucleotide repeats (3 bases): AAC, AAG, AAT, ACC, ACG, ACT, AGC, AGG, ATC, CCG.
Tetranucleotide repeats (4 bases): examples include AAAC, AAAG, AAAT, AGAT, CCCG, CCGG, etc.
Pentanucleotide repeats (5 bases).
Hexanucleotide repeats (6 bases).
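The equivalence AC = CA = GT = TG can be checked mechanically. The sketch below is illustrative only: it treats two motifs as the same repeat if one is a cyclic shift of the other on either strand, and picks the alphabetically smallest form as the representative. The function names are made up for this example.

```python
# Sketch: collapse equivalent repeat motifs to one representative form.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def rotations(motif: str) -> list[str]:
    """All cyclic shifts of the motif, e.g. AC -> [AC, CA]."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]

def canonical_motif(motif: str) -> str:
    """Alphabetically smallest cyclic shift over both strands."""
    motif = motif.upper()
    candidates = rotations(motif) + rotations(reverse_complement(motif))
    return min(candidates)

for m in ["AC", "CA", "GT", "TG", "T", "G", "TCTA"]:
    print(m, "->", canonical_motif(m))
```

With this convention T collapses to A and G to C, which is why the mononucleotide list above only needs A and C, and TCTA collapses to AGAT.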

STR nomenclature difficulties

Questions to resolve when describing STRs:
- Which DNA strand to read from?
- Which motif is the repeat unit?
- The chosen strand can affect how many repeats are counted.
To standardize, describe the repeat from the GenBank reference sequence, so that courts see one consistent nomenclature.
The first strand sequenced is usually treated as the reference sequence.
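A small sketch of why the choice of motif matters: the same stretch of DNA gives a different repeat count depending on which 4-base window is treated as the repeat unit. The sequence is made up for illustration and `longest_run` is a hypothetical helper, not a forensic standard.

```python
import re

def longest_run(seq: str, motif: str) -> int:
    """Largest number of back-to-back copies of `motif` found in `seq`."""
    # The lookahead lets a run start at every position, so a long run is not
    # missed because an earlier, shorter match consumed its first bases.
    pattern = re.compile(r"(?=((?:%s)+))" % re.escape(motif))
    runs = [len(m.group(1)) // len(motif) for m in pattern.finditer(seq)]
    return max(runs, default=0)

seq = "CCTAGATAGATAGACC"  # made-up sequence for illustration
for motif in ["TAGA", "AGAT", "GATA"]:
    print(motif, longest_run(seq, motif))
# TAGA 3
# AGAT 2
# GATA 2
```

Reading the same region as TAGA gives three repeats, while AGAT or GATA gives two, so a fixed reference sequence and motif are needed before allele numbers mean anything.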

STR typing steps — PCR

Amplify across the STR via PCR.
STRs shorter than ~400 bp are useful in forensics due to DNA quality constraints.
Flanking regions around STRs are stable across the population, enabling reliable amplification.
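Because the flanking regions are constant, amplicon length depends only on the number of repeats. A minimal sketch, assuming a hypothetical 120 bp of flanking sequence (primers included) around a tetranucleotide STR:

```python
def amplicon_size(flank_bp: int, repeat_unit_bp: int, n_repeats: int) -> int:
    """PCR product length = constant flanking sequence + repeat block."""
    return flank_bp + repeat_unit_bp * n_repeats

def repeats_from_size(size_bp: float, flank_bp: int, repeat_unit_bp: int) -> float:
    """Inverse calculation; a non-integer result would flag a partial repeat."""
    return (size_bp - flank_bp) / repeat_unit_bp

FLANK_BP = 120  # hypothetical value, for illustration only
for n in (8, 14, 20):
    print(n, "repeats ->", amplicon_size(FLANK_BP, 4, n), "bp")
# 8 repeats -> 152 bp
# 14 repeats -> 176 bp
# 20 repeats -> 200 bp
```

With these assumed numbers, alleles of 8 to 20 repeats span 152–200 bp, comfortably under the ~400 bp limit mentioned above.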

STR typing steps — Capillary Electrophoresis (CE)

After PCR, run products on a capillary electrophoresis instrument (a polymer-filled capillary rather than a slab gel).
Detection relies on lasers and fluorescence to detect DNA fragments.
CE is more accurate than traditional gel electrophoresis and high-throughput: instruments can run 12–96 samples at a time.
CE separates DNA by size in a thin capillary tube using an electrical current and a separation matrix.
A laser excites fluorescence as fragments pass a detection window; the emitted light is recorded (electropherogram).

CE: Detector and signal concepts

Detector measures RFU (Relative Fluorescence Units) as fragments pass by.
Electropherogram: graph of signal vs. time showing fragment sizes.
The capillary system includes:
- Capillary tube
- Buffer and matrix
- Laser excitation and a fluorescence detector
- PC to record the signal
The schematic shows excitation, emission, and the moving DNA fragment.
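To make the idea of an electropherogram concrete, the sketch below picks peaks out of a list of RFU readings. It is a toy example: the trace is invented, and the 50 RFU threshold is an assumed value, not an instrument setting.

```python
def call_peaks(signal, threshold_rfu=50):
    """Return (index, height) for local maxima at or above the threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        if signal[i] >= threshold_rfu and is_local_max:
            peaks.append((i, signal[i]))
    return peaks

# Invented RFU readings taken at equal time steps.
trace = [2, 3, 5, 40, 310, 620, 300, 35, 4, 3, 6, 45,
         280, 590, 270, 30, 5, 12, 20, 9, 3]
print(call_peaks(trace))  # [(5, 620), (13, 590)]
```

The small bump of 20 RFU near the end is ignored because it falls below the threshold, which is how low-level noise is kept out of a profile.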

CE fragment detection: fluorescence principles

Two methods to promote fluorescence:
1) Stain the DNA with intercalating dyes (e.g., Ethidium bromide or SYBR Safe).
- Dyes insert between DNA strands and fluoresce under UV; laser detects signal as fragments pass the window.
2) Attach a fluorescent label to one primer (5′ end) in the PCR primer pair.
- The PCR product carries the fluorescent tag; as the fragment passes the detector, it emits a coloured signal.
Multiple colours can be used in a single PCR reaction when using fluorescent primers.

DNA staining vs fluorescent primers details

DNA staining:
- Uses intercalating dyes; glow when bound to DNA under UV light; signal detected by laser as fragments pass.
Fluorescent primers:
- A fluorophore is attached to a primer; incorporated into PCR products; single/ multiple colours allow multiplexing.

STR typing steps – Allelic ladders

Allelic ladder: a reference ladder containing all possible alleles for a given STR.
Sample peaks are overlaid with the ladder to determine the exact alleles present.
Conceptual illustration: ladder shows all possible alleles; sample overlays show detected alleles.
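Overlaying a sample on the ladder amounts to finding the ladder allele whose size is closest to the sample peak. A minimal sketch, assuming a made-up ladder and a ±0.5 bp matching window (the window is an assumption for this example):

```python
def call_allele(peak_bp, ladder, window_bp=0.5):
    """Return the ladder allele closest in size to the peak,
    or None if no ladder allele lies within the window."""
    allele, size = min(ladder.items(), key=lambda item: abs(item[1] - peak_bp))
    return allele if abs(size - peak_bp) <= window_bp else None

# Hypothetical ladder for a tetranucleotide STR: allele name -> size in bp.
ladder = {"8": 152.0, "9": 156.0, "10": 160.0, "11": 164.0, "12": 168.0}

print(call_allele(160.2, ladder))  # 10
print(call_allele(162.0, ladder))  # None
```

A peak that falls between ladder rungs returns None, which in practice would prompt a closer look rather than an automatic call.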

Making an Allelic ladder

Create ladders by pooling DNA from individuals with different alleles for the STR; amplify, then combine into a single tube.
Allelic ladders exist for all forensic STRs.

Multiplex PCR

Advantage: amplify more than one STR at a time in a single tube.
Relies on size differences and/or fluorescent tags on primers to distinguish loci.

Multiplex PCR — size differences vs colour differences

Size differences: If alleles of different STRs do not overlap in size, they can be tested simultaneously with a single colour (e.g., on the QIAxcel system).
Colour differences: If allele sizes overlap between loci, use different fluorescent colours to distinguish them. Typically 4–5 colours can be detected at once.
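The size-versus-colour rule can be expressed as a simple check: two loci clash only if they share a dye and their allele size ranges overlap. The panel below is invented for illustration.

```python
from itertools import combinations

def overlapping_pairs(loci):
    """loci: name -> (dye, min_bp, max_bp).
    Return pairs of loci that share a dye AND overlap in size."""
    clashes = []
    for (a, (dye_a, lo_a, hi_a)), (b, (dye_b, lo_b, hi_b)) in combinations(loci.items(), 2):
        if dye_a == dye_b and lo_a <= hi_b and lo_b <= hi_a:
            clashes.append((a, b))
    return clashes

# Hypothetical multiplex design.
panel = {
    "LocusA": ("blue", 100, 150),
    "LocusB": ("blue", 140, 200),   # overlaps LocusA in the same dye
    "LocusC": ("green", 120, 180),  # overlaps in size, but different dye
    "LocusD": ("blue", 210, 260),
}
print(overlapping_pairs(panel))  # [('LocusA', 'LocusB')]
```

LocusC overlaps LocusA and LocusB in size but is read in a different colour, so only the LocusA/LocusB pair needs redesigning.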

Normalised intensity and colour considerations

Only 4–5 colours are practical due to spectral overlap between fluorophores.
Spectral overlap makes distinguishing very close colours challenging.
Diagrammatic note: spectral overlap reduces separation efficiency; normalisation helps but limits the number of distinguishable colours.

Multiplex PCR output and interpretation

Output: one or two peaks per STR locus.
Each peak is assigned a colour corresponding to its STR marker.
Peaks are overlaid on an allelic ladder to determine the alleles present.

Multiplex PCR — practical considerations

All primers must work under the same reaction conditions.
Increased amounts of dNTPs and Taq polymerase may be required.
Fine-tuning is essential for reliable amplification across all targets.

STR kits used in forensics

PowerPlex® 16: shows multiple loci with peak ranges around 100–300 bp; includes loci such as D3S1358, TH01, D21S11, D18S51, D5S818, D13S317, D7S820, CSF1PO, VWA, D8S1179, TPOX, FGA, etc.; includes an internal lane standard like ILS-600.
Identifiler®: includes loci such as D8S1179, D21S11, D7S820, CSF1PO, FGA, D3S1358, D13S317, D16S539, D2S1338, D19S433, VWA, TPOX, D18S51, D5S818, etc.; uses an internal lane standard (GS500) for sizing.
Each kit provides a core set of STRs with size ranges and colour-coded primers to enable multiplexing.
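The internal lane standard is what turns a migration time into a size in bp. The sketch below uses plain linear interpolation between the two standard peaks that bracket the sample peak. This is a simplification chosen for clarity, since real sizing software may fit the standard differently, and the times and sizes are invented.

```python
from bisect import bisect_right

def size_fragment(time_s, standard):
    """standard: list of (migration_time_s, size_bp), sorted by time.
    Linearly interpolate the size of a peak seen at `time_s`."""
    times = [t for t, _ in standard]
    if not times[0] <= time_s <= times[-1]:
        raise ValueError("peak lies outside the size standard")
    # Index of the standard peak just after the sample peak.
    i = min(bisect_right(times, time_s), len(times) - 1)
    (t0, s0), (t1, s1) = standard[i - 1], standard[i]
    return s0 + (s1 - s0) * (time_s - t0) / (t1 - t0)

# Invented standard: (seconds, bp).
standard = [(1000, 100), (1200, 150), (1400, 200), (1650, 250)]
print(size_fragment(1300, standard))  # 175.0
```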

Changing the size of alleles (to avoid overlaps)

Two methods to adjust amplicon size so alleles no longer overlap:
1) Moving primers closer to or farther from the STR to adjust overall amplicon size.
2) Adding non-nucleotide linkers between primer and fluorescent tag to increase product size without affecting primer performance. Roughly, 1 linker ≈ 2.5 bp.

Moving primers

Shifting primers alters amplicon length; still must consider all other primer interactions in the multiplex.

Linkers

Small non-nucleotide linkers can be added to increase size by ~2.5 bp per linker, enabling separation of alleles from different STRs that would otherwise overlap.
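A quick arithmetic sketch of the linker approach, using the ~2.5 bp per linker figure quoted above (the helper names are hypothetical):

```python
import math

BP_PER_LINKER = 2.5  # approximate figure quoted in these notes

def linkers_needed(shift_bp: float) -> int:
    """Smallest number of linkers giving at least the requested shift."""
    return math.ceil(shift_bp / BP_PER_LINKER)

def apparent_size(size_bp: float, n_linkers: int) -> float:
    """Apparent fragment size after adding linkers to the labelled primer."""
    return size_bp + n_linkers * BP_PER_LINKER

print(linkers_needed(10))     # 4
print(linkers_needed(12))     # 5
print(apparent_size(150, 4))  # 160.0
```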

What makes a good STR marker?

Narrow allele size range: limits the number of possible alleles and reduces overlap with other markers.
Narrow allele size range also helps to minimize random dropout, especially for larger alleles (e.g., 1 kb vs 200 bp have different amplification efficiencies).
Small PCR product size: enhances amplification efficiency; small alleles are more likely to remain intact in degraded DNA samples.
Older DNA samples still testable due to smaller amplicons.
Larger repeat unit (e.g., 4 bp) is generally better than 2 bp or 3 bp because it reduces slippage during copying; slippage causes stutter peaks that complicate interpretation in mixtures.

CODIS: Combined DNA Index System

FBI’s core forensic database, originally built on 13 core STR markers that were routinely tested.
Together they give a very low random match probability (a power of ten with a large negative exponent, around 10^-15 in the worked examples later in these notes).
Loci are spread across different chromosomes to exploit Independent Assortment.
2017 update added 7 new STR markers; Australia uses a core set of 17 loci plus a sex determination test.

Sex determination in STR profiling: Amelogenin

Amelogenin marker (AMEL) is used for sex identification. It is not a true STR marker; it is a deletion in a non-coding region of the Amelogenin gene.
AMEL-X vs AMEL-Y differences:
- AMEL-X has a 6 bp deletion relative to AMEL-Y.
- PCR uses the same primers for X and Y; X yields a shorter fragment, Y yields a longer fragment.
Interpretations:
- Female (XX): a single peak at 106 bp (both X chromosomes give the same 106 bp fragment).
- Male (XY): peaks at 106 bp (X) and 112 bp (Y; 106 bp plus the 6 bp that are deleted on X).
Amelogenin is also useful in detecting mixtures (e.g., rape cases). In males, peak heights for X and Y should be similar; disproportionate peaks may indicate mixtures.
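The interpretation rules can be sketched as a small function. The X fragment size is taken from these notes and the Y size follows from the 6 bp deletion. The 0.6 peak-height ratio and the ±0.5 bp tolerance are assumed values for illustration, not validated thresholds.

```python
X_BP = 106        # X fragment size quoted in these notes
Y_BP = X_BP + 6   # Y lacks the 6 bp deletion, so its fragment is 6 bp longer

def amelogenin_call(peaks, min_ratio=0.6, tol_bp=0.5):
    """peaks: list of (size_bp, height_rfu). Returns a short text call."""
    x = max((h for s, h in peaks if abs(s - X_BP) <= tol_bp), default=0)
    y = max((h for s, h in peaks if abs(s - Y_BP) <= tol_bp), default=0)
    if x and not y:
        return "X only: consistent with female"
    if x and y:
        ratio = min(x, y) / max(x, y)
        if ratio >= min_ratio:
            return "X and Y balanced: consistent with a single male"
        return "X and Y imbalanced: possible mixture"
    return "inconclusive"

print(amelogenin_call([(106.1, 1000)]))
print(amelogenin_call([(106.0, 1000), (112.1, 950)]))
print(amelogenin_call([(106.0, 2000), (112.0, 400)]))
```

The third call has an X peak five times taller than the Y peak, the kind of imbalance expected when female DNA is mixed with male DNA.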

Practical applications and implications of STRs

What to do with a profile:
- You have a DNA profile; a suspect’s DNA may match the evidence.
- To prove to the court that you have your person, you rely on statistics.
Population data forms the basis of statistics:
- Data collected from >100 individuals in different ethnic groups.
- For each STR, count the number of times an allele is observed.
- Allele counts are converted to allele frequencies.
- When a match is made, multiply allele frequencies to predict how often a particular genotype will be observed (i.e., random match probability).
Genotype definition: the combination of alleles inherited from mother and father.
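Turning allele counts into allele frequencies is a matter of counting and dividing by the total number of alleles (two per person). A minimal sketch with invented genotypes; a real database would use far more than five people:

```python
from collections import Counter

def allele_frequencies(genotypes):
    """genotypes: list of (allele1, allele2) tuples, one tuple per person."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())  # two alleles counted per person
    return {allele: n / total for allele, n in counts.items()}

# Made-up genotypes at one STR locus.
sample = [(11, 14), (11, 11), (12, 11), (12, 14), (11, 12)]
freqs = allele_frequencies(sample)
print(freqs)  # {11: 0.5, 14: 0.2, 12: 0.3}
```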

Allele frequencies: concepts and examples

Allele frequency at a locus i is the probability that a randomly chosen chromosome carries that allele.
Example: D13S317 in Caucasians is shown in a genotype-frequency table with allele pairs and their frequencies.
For a given locus, genotypes can be homozygous (AA) or heterozygous (AB).
Formulas used:
- If genotype is AA (homozygous), frequency = p^2 where p is the allele frequency for A.
- If genotype is AB (heterozygous), frequency = 2 p q, where p and q are frequencies of A and B, respectively.
- If genotype is BB (homozygous for B), frequency = q^2.
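These formulas are easy to express in code. The sketch below checks them against two values quoted in the Caucasian example in these notes (D13S317 alleles 11 and 14, and TH01 alleles 6 and 6); `genotype_frequency` is a hypothetical helper name.

```python
def genotype_frequency(p, q=None):
    """p^2 for a homozygote (q omitted), 2pq for a heterozygote."""
    return p * p if q is None else 2 * p * q

print(round(genotype_frequency(0.33940, 0.04801), 4))  # 0.0326  (D13S317: 11, 14)
print(round(genotype_frequency(0.23179), 4))           # 0.0537  (TH01: 6, 6)
```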

Caucasian example (D13S317, TH01, D18S51, D21S11, D3S1358, D5S818, D7S820, D8S1179, CSF1PO, FGA, D16S539, TPOX, VWA)

D13S317: alleles 11 and 14; p = 0.33940, q = 0.04801; genotype frequency = 2pq = 0.0326
TH01: alleles 6 and 6; p = 0.23179; genotype frequency = p^2 = 0.0537
D18S51: alleles 14 and 16; p = 0.13742, q = 0.13907; genotype frequency = 2pq = 0.0382
D21S11: alleles 28 and 30; p = 0.15894, q = 0.27815; genotype frequency = 2pq = 0.0884
D3S1358: alleles 16 and 17; p = 0.25331, q = 0.21523; genotype frequency = 2pq = 0.1090
D5S818: alleles 12 and 13; p = 0.38411, q = 0.14073; genotype frequency = 2pq = 0.1081
D7S820: alleles 9 and 9; p = 0.17715; genotype frequency = p^2 = 0.0314
D8S1179: alleles 12 and 14; p = 0.18543, q = 0.16556; genotype frequency = 2pq = 0.0614
CSF1PO: alleles 10 and 10; p = 0.21689; genotype frequency = p^2 = 0.0470
FGA: alleles 21 and 22; p = 0.18543, q = 0.21854; genotype frequency = 2pq = 0.0810
D16S539: alleles 9 and 11; p = 0.11258, q = 0.32119; genotype frequency = 2pq = 0.0723
TPOX: alleles 8 and 8; p = 0.53477; genotype frequency = p^2 = 0.2860
VWA: alleles 17 and 18; p = 0.28146, q = 0.20033; genotype frequency = 2pq = 0.1128
Product across all 13 markers (the “random match probability” calculation) ≈ 1.2 × 10^-15, i.e. a chance of about 1 in 8.37 × 10^14 (the reciprocal of the unrounded product) for this Caucasian example.

African American example (D13S317, TH01, D18S51, D21S11, D3S1358, D5S818, D7S820, D8S1179, CSF1PO, FGA, D16S539, TPOX, VWA)

D13S317: 11 and 14; p = 0.30620, q = 0.03488; genotype frequency = 2pq = 0.0214
TH01: 6 and 6; p = 0.12403; genotype frequency = p^2 = 0.0154
D18S51: 14 and 16; p = 0.07198, q = 0.15759; genotype frequency = 2pq = 0.0227
D21S11: 28 and 30; p = 0.25775, q = 0.17442; genotype frequency = 2pq = 0.0899
D3S1358: 16 and 17; p = 0.33527, q = 0.20543; genotype frequency = 2pq = 0.1377
D5S818: 12 and 13; p = 0.35271, q = 0.23837; genotype frequency = 2pq = 0.1682
D7S820: 9 and 9; p = 0.10853; genotype frequency = p^2 = 0.0118
D8S1179: 12 and 14; p = 0.14147, q = 0.30039; genotype frequency = 2pq = 0.0850
CSF1PO: 10 and 10; p = 0.25681; genotype frequency = p^2 = 0.0660
FGA: 21 and 22; p = 0.11628, q = 0.31783; genotype frequency = 2pq = 0.0739
D16S539: 9 and 11; p = 0.19574, q = 0.32119; genotype frequency = 2pq = 0.1257
TPOX: 8 and 8; p = 0.37209; genotype frequency = p^2 = 0.1385
VWA: 17 and 18; p = 0.24225, q = 0.15504; genotype frequency = 2pq = 0.0751
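Product across all 13 markers yields a random match probability of about 1 in 1.66 × 10^16 for this African American example, as quoted on the slides. Note that the rounded genotype frequencies listed above multiply to roughly 1.0 × 10^-16, i.e. about 1 in 1.0 × 10^16.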
Important point: Allele frequencies differ among ethnic groups; combined probabilities should be calculated using population-specific allele frequencies.

Population-specific allele frequencies

Allele frequencies differ across ethnic groups; e.g., Caucasian vs African American frequencies for the same loci are different (example data show marked differences in p, q values).
The effect: genotype frequencies and final random-match probabilities change depending on the assumed population.
The slide set provides a comparative table showing these differences across loci (e.g., D13S317, TH01, D18S51, etc.).
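As a one-locus illustration of the population effect, the sketch below applies 2pq to the D13S317 (11, 14) genotype using the two sets of allele frequencies quoted in these notes:

```python
# Allele frequencies for D13S317 alleles 11 and 14, as quoted in these notes.
D13S317 = {
    "Caucasian": (0.33940, 0.04801),
    "African American": (0.30620, 0.03488),
}

def het_frequency(p, q):
    """Expected frequency of a heterozygous genotype."""
    return 2 * p * q

for population, (p, q) in D13S317.items():
    print(f"{population}: {het_frequency(p, q):.4f}")
# Caucasian: 0.0326
# African American: 0.0214
```

The same genotype is roughly one and a half times as common under one set of frequencies as under the other, and such differences compound when multiplied across 13 loci.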

Multi-locus random match probability (RMP)

Concept: Random match probability is the probability that a randomly chosen, unrelated individual from the population would have the same multi-locus genotype as the evidence profile across all tested STRs.
Calculation method:
- For each locus i with observed genotype (A_i, B_i) and allele frequencies p_i = freq(A_i), q_i = freq(B_i):
  - If A_i = B_i (homozygous): f_i = p_i^2
  - If A_i ≠ B_i (heterozygous): f_i = 2 p_i q_i
- Then multiply across all loci: RMP = ∏_i f_i
Example outcome (Caucasian): RMP ≈ 1.2 × 10^-15, equivalent to about 1 in 8.37 × 10^14.
Example outcome (African American): RMP ≈ 1 in 1.66 × 10^16.
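The calculation method above can be sketched in a few lines. The allele frequencies are the Caucasian values quoted in these notes; the function names are made up, and the product is valid only under the independence assumptions already stated.

```python
import math

# (p, q) per locus for the Caucasian example; q is None for homozygotes.
CAUCASIAN = {
    "D13S317": (0.33940, 0.04801), "TH01": (0.23179, None),
    "D18S51": (0.13742, 0.13907),  "D21S11": (0.15894, 0.27815),
    "D3S1358": (0.25331, 0.21523), "D5S818": (0.38411, 0.14073),
    "D7S820": (0.17715, None),     "D8S1179": (0.18543, 0.16556),
    "CSF1PO": (0.21689, None),     "FGA": (0.18543, 0.21854),
    "D16S539": (0.11258, 0.32119), "TPOX": (0.53477, None),
    "VWA": (0.28146, 0.20033),
}

def genotype_frequency(p, q=None):
    """p^2 for a homozygote, 2pq for a heterozygote."""
    return p * p if q is None else 2 * p * q

def random_match_probability(profile):
    """Multiply the per-locus genotype frequencies across all loci."""
    return math.prod(genotype_frequency(p, q) for p, q in profile.values())

rmp = random_match_probability(CAUCASIAN)
print(f"RMP = {rmp:.3e}, i.e. about 1 in {1 / rmp:.3e}")
```

The result should land close to the 1 in 8.37 × 10^14 figure quoted for this example.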

Practical interpretation and ethical considerations

The more STR loci used, the smaller the random match probability, increasing evidentiary strength.
Population substructure and ethnicity must be considered; using the wrong population allele frequencies can misestimate the RMP.
The Amelogenin sex marker helps identify sex and serves as a check against sample contamination or mixtures, but it is not itself a STR marker.
The use of multiple markers and allelic ladders enhances the reliability of genotyping in court.
Statistical interpretation in court relies on well-established population genetics principles (Hardy–Weinberg equilibrium, independence of loci, and assumption of random mating).

Hardy–Weinberg and genotype frequencies (recap)

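For a locus with allele frequencies p and q (p + q = 1 only when the locus has exactly two alleles; STR loci usually have more, so p + q < 1 for any one pair of alleles):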
- Homozygous AA frequency: p^2
- Heterozygous AB frequency: 2 p q
- Homozygous BB frequency: q^2
Random match probability for a specific genotype is the product of locus-specific genotype frequencies across all loci tested.
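A quick check that the Hardy–Weinberg formulas cover every possible genotype: for any set of allele frequencies that sums to 1, the p^2 and 2pq terms together also sum to 1. The three-allele locus below is invented for illustration.

```python
from itertools import combinations

def genotype_table(freqs):
    """freqs: allele -> frequency (should sum to 1).
    Returns genotype -> expected Hardy-Weinberg frequency."""
    table = {(a, a): p * p for a, p in freqs.items()}    # homozygotes
    for (a, p), (b, q) in combinations(freqs.items(), 2):
        table[(a, b)] = 2 * p * q                        # heterozygotes
    return table

freqs = {"11": 0.5, "12": 0.3, "14": 0.2}  # made-up allele frequencies
table = genotype_table(freqs)
for genotype, f in table.items():
    print(genotype, round(f, 2))
print("total:", round(sum(table.values()), 10))  # total: 1.0
```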

Key takeaways

STRs are small, highly variable regions ideal for individual identification due to their high polymorphism and independence across loci.
Capillary electrophoresis with fluorescent primers enables rapid, high-throughput STR profiling and precise sizing.
Allelic ladders and multiplex PCR are essential tools for robust, efficient forensic STR analysis.
Sex determination (Amelogenin) and mixture interpretation are important practical considerations.
Population-specific allele frequencies underpin probabilistic interpretation; the strength of a match depends on the number of loci and the diversity of the loci used.
The math of random match probability relies on Hardy–Weinberg principles and simple genotype-frequency formulas, multiplied across loci to yield an overall extremely small probability of a coincidental match.

Quick reference formulas

Homozygous genotype frequency: f(AA) = p^2
Heterozygous genotype frequency: f(AB) = 2 p q, where p = freq(A) and q = freq(B) (p + q = 1 only if the locus has just two alleles)
Random match probability across n loci: RMP = ∏_{i=1}^{n} f_i
Example: For a locus with alleles 11 and 14, p = 0.33940, q = 0.04801, so f = 2pq = 0.0326