myth of the human genome
the Human Genome Project was an average over a few people and one cell type; individual variation is the main thing when we talk about personalized genomics (germline or somatic), along with pathogen genomes, programmed DNA changes (B cells and T cells diversify their genomes), and the epigenome (chemical modifications layered on top of the genome)
individual variation terms
SNV (single nucleotide variant); MAF (minor allele frequency: at a position in the human genome that is variable, the frequency of the less common allele); SNP (single nucleotide polymorphism: a common variant, typically MAF over 1%)
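The MAF definition above boils down to simple allele counting. A minimal sketch (the genotype sample below is made up for illustration):

```python
# Minimal sketch: computing minor allele frequency (MAF) at one position.
# Genotypes are counts of the alternate allele (0, 1, or 2) per diploid individual.
def minor_allele_frequency(genotypes):
    alt_alleles = sum(genotypes)           # total alternate-allele count
    total_alleles = 2 * len(genotypes)     # diploid: 2 alleles per person
    alt_freq = alt_alleles / total_alleles
    return min(alt_freq, 1 - alt_freq)    # MAF is the rarer allele's frequency

genos = [0, 0, 1, 0, 2, 0, 1, 0, 0, 0]    # 10 hypothetical individuals
print(minor_allele_frequency(genos))       # 4 alt alleles / 20 = 0.2
```

A position with MAF 0.2 would count as a SNP under the "MAF over 1%" convention; at MAF 0.001 it would be a rare variant.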
genotyping
usually refers to spot sequencing of specific positions where variation is common in humans, or less common but severe when it happens (ex: BRCA1 gene variants causing cancer); resequencing is when a species has been sequenced but a particular individual hasn't; exome sequencing specifically covers the part of the genome that encodes proteins plus splice junctions (which don't code protein but often carry mutations that matter); whole genome includes everything, including non-coding regions (more expensive but gives more info, though we don't really know how to interpret much of it yet)
variants
germline variants can be analyzed by genotyping if they're common; if rare we use exome or whole genome sequencing because we don't know where they're located; there are also somatic variants, burden-of-mutation studies, and experimental testing of rare variant function
genome wide association studies
this is how you map the gene for something; it's used for the common variant/common disease paradigm: study positions that commonly vary in the population, genotype everyone at these positions, and ask which positions are correlated with getting the disease, i.e. are there positions where MAF differs between cases and unaffected controls, which suggests something in the neighborhood of the variant has to do with the disease; the more correlated a SNP position is with being in the case group vs the control group, the higher it sits on the plot, and points above the significance line are unlikely to be chance associations; the genome-wide threshold (around 10^-8, conventionally p < 5 × 10^-8) is stringent to account for associations arising by chance across the many positions tested
GWAS challenges
imagine the disease variant is the index causal SNV, landed in a region of the genome next to a bunch of variable positions; not enough time has passed to scramble all the mutations in that region (you only get a few crossovers per chromosome per generation), so linked SNPs form a collection of linked alleles where any one of them could be the causative one; even if we know the locus, one gene has many variants
haplotype block
group of linked variants, one of which is the disease-causing one (every position in the block is variable)
common variants
have small effects, which makes sense: if a variant were really bad for you it wouldn't still be in the population
rare variants
these are the ones that are really bad for you, so we want to study them, but we can't do association tests on them because they are rare; we can easily connect disease to mutations that are common, but common mutations don't seem to be the cause of most disease
functional mutations
most are deleterious; selection indicates functional mutations, whether or not the tested trait is under selection; a pathogenic mutation doesn't have to abolish the protein's ability to fold, since most pathogenic mutations act by reducing stability
PolyPhen2
puts lots of different sources of info together to predict whether variants are functional: if we don't see the variant in cousin species, it's more likely to be damaging, and if it sits in the protein's active site it's also more likely to be damaging
somatic cancer (osteosarcoma) mutations
they are an example of burden-of-mutation analysis: sequence the tumors, and since the mutations are somatic each one is rare, but there are some hotspots; TP53 mutations have high overall prevalence even though individually each is rare; the challenge with burden-of-mutation studies is the mixing of functional mutations with neutral ones, which creates noise in the data (we want to distinguish them)
what is the ideal scenario of experimental tests of rare variants?
it would be to have an animal model of the disease so candidate mutations from a human gene can be introduced into the corresponding animal gene; if the mutation is pathogenic in humans, the animal shows a similar disease; the issue is this exists for relatively few human disease genes, and where it does, testing all the variants we find is rarely feasible (too expensive)
experimental testing of rare variation
surrogate genetics to identify pathogenic human variation, deep mutational scanning atlases by functional complementation, potential to generate atlases for most human disease genes
trans species complementation
surrogate genetics: you have a normal yeast gene that is a cousin (ortholog) of the human one, and a defective yeast mutant; the human gene rescues the defective strain, allowing normal cell growth; a human variant that still rescues is neutral, while one that no longer rescues is functional (damaging)
can yeast complementation separate disease from non disease variants?
the gene has 2 disease-associated alleles; do a 5-fold serial dilution: at the permissive temperature the yeast mutant still works, so you get normal cell growth, but at the non-permissive temperature it's too high and the cell doesn't grow (no rescue/cell death)
precision vs recall (PolyPhen-2)
precision means that at a given threshold of PolyPhen score, among variants predicted to be disease, the fraction that actually are; recall measures the fraction of true disease mutations that are predicted; based on the 138 variants in 21 genes (76 disease and 62 non-disease), there was high confidence in the results; the right level of precision is context dependent
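The precision/recall definitions above reduce to ratios over a confusion matrix. A sketch with hypothetical confusion counts (the source gives the 76/62 split but not the predictor's actual hits and misses, so the counts below are invented):

```python
# Precision/recall at a fixed score threshold for a predictor like PolyPhen-2.
# tp = true disease variants predicted disease, fp = non-disease predicted
# disease, fn = disease variants missed. Counts here are hypothetical.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of variants called "disease", fraction correct
    recall = tp / (tp + fn)     # of true disease variants, fraction recovered
    return precision, recall

# e.g. suppose 60 of the 76 disease variants were caught, with 10 false alarms:
p, r = precision_recall(tp=60, fp=10, fn=16)
print(round(p, 3), round(r, 3))
```

Raising the score threshold typically trades recall for precision, which is why "the right level of precision is context dependent".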
yeast complementation for human disease genes
complementation beats computation, even when the model organism has a billion years of divergence and the phenotype is as simple as cell growth; there are currently 176 human disease genes with a yeast complementation assay, but possibly many more using sensitized backgrounds and environments; balance tractability (whether we can run a lot of experiments to test variants) with fidelity (how closely the model organism can test the function of the human gene)
SUMO E2 conjugase UBE2I
SUMO is a protein tag that gets added onto other proteins; UBE2I is the enzyme that attaches the tag to its substrates; it's used in yeast functional complementation, with 19 Y2H interactions, co-crystal structures, and somatic cancer mutations of unknown significance; for scoring, synonymous variants showing no dropout at all makes sense, since synonymous changes shouldn't affect function, whereas nonsense ones create a premature stop and so impair protein function
genophenogram for human SUMO conjugase (UBE2I)
basically makes a heatmap of all the possible amino acid changes and how well each one functions in the yeast assay
Cys94
performs catalysis: it covalently attaches to SUMO and hands it off to the substrate; a mutation at this cysteine should be bad, and indeed all changes at that position are deleterious
CALM1 (calmodulin 1)
associated with tachycardia
forward genetics
the patient goes to the hospital and gets genotyped, then you make an expression clone for the variant, do functional assays (complementation or edgotyping), and report back to the hospital
reverse genetics
an atlas of variant effects: prepare an atlas of the functional impact of all possible missense variants before they are seen in the clinic, i.e. measure all possible amino acid changes ahead of time; when patients come into the clinic and get their genome sequenced, you just look up the table to see if the variant is functional
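The reverse-genetics idea reduces clinical interpretation to a table lookup. A toy sketch (the variant names, scores, and threshold below are all hypothetical, standing in for a real precomputed atlas):

```python
# Toy "atlas": precomputed functional scores from a complementation assay,
# keyed by missense variant. All entries here are made up for illustration.
variant_atlas = {
    "C94A": 0.05,  # near-total loss of function in the assay
    "A12T": 0.98,  # behaves like wild type
}

def interpret(variant, threshold=0.5):
    # Clinical lookup: no new experiment needed, just consult the table.
    score = variant_atlas.get(variant)
    if score is None:
        return "not in atlas"
    return "likely damaging" if score < threshold else "likely tolerated"

print(interpret("C94A"))  # likely damaging
```

The contrast with forward genetics is that the expensive functional assay happens once, up front, for all variants, instead of per patient.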
protein interaction assay
see if protein can interact with 1 or more of its partners
1000 genomes project
international project to construct a foundational data set for human genetics; discover virtually all common variation by investigating many genomes at base-pair level; consortium with multiple centers, platforms and funders; it aims to discover human genetic variation of all types (95% of variation with over 1% frequency), define haplotype structure in the human genome and variation by individual, and develop sequence analysis methods, tools and other reagents that can be transferred to other sequencing projects; applications include large-scale genome variation, population variation, patterns of selection, genome evolution, markers for GWAS studies, design of genotyping arrays, origin of disease and functional inferences
Hapmap project
had 14000 individuals across 11 populations using high-throughput genotyping chips; similar to a whole-genome project on a smaller scale; also captured haplotype variation (how SNPs are linked to each other on the same chromosome)
1000 genomes pilot project
had 14 million SNPs, 179 individuals and 4 populations; low-coverage next-gen sequencing just covers the genome a few times, and the more times we sequence a genome, the better we are at capturing SNPs (medium/deep coverage being ~50x)
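The coverage intuition above can be made concrete: under a simple Poisson model of read starts (the Lander-Waterman assumption, used here as an illustration rather than the project's actual model), a base with mean depth c is sequenced at least once with probability 1 - e^(-c):

```python
import math

# Fraction of bases seen at least `min_reads` times, assuming read depth at a
# base is Poisson-distributed with the given mean. A sketch of why low-coverage
# sequencing misses variants, not the 1000 Genomes pipeline itself.
def frac_covered(mean_depth, min_reads=1):
    # P(X >= min_reads) for X ~ Poisson(mean_depth)
    p_below = sum(math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
                  for k in range(min_reads))
    return 1 - p_below

for depth in (2, 4, 30):
    print(depth, round(frac_covered(depth), 4))
```

At 2-4x a real fraction of bases is never sequenced at all (and heterozygous sites need several reads to call confidently), while deep coverage makes missing a site very unlikely.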
1000 genomes phase 1
had way more SNPs (36.6 million), 1k individuals and 14 populations, with low-coverage whole genome plus exome next-gen sequencing
1000 genomes phase 2
1715 individuals, 19 populations, low-coverage whole genome plus exome next-gen sequencing; the project has data from 3 different providers and multiple platforms; the 454 Titanium platform has a max read length of 400 bp, so very long reads, which indicates greater per-read quality; Illumina GA II and HiSeq are used because, although the reads are shorter (~100 bp), they are more cost-effective while still being high-throughput
what are the main steps for data processing and variant calling?
these are standard tools for aligning reads to a reference and calling variants; the main steps are: read mapping, duplicate filtering, base quality value recalibration (quality filters on base calls), INDEL realignment, variant site discovery, individual genotype assignment (sometimes part of site discovery), variant filtering/call set refinement, and variant reporting
alignment data
there have been over 10 releases of alignment data (aligned to a reference, which is much faster than de novo assembly, where you sequence from scratch); the pilot project was aligned to NCBI36, phase 1 to GRCh37 (still used for assembly and alignment), and phase 2 to an extended GRCh37, leading to improvements in base quality recalibration
variant calling
developed by 1000 Genomes: early call sets used a single variant caller; an intersect approach was developed during the pilot; variant quality score recalibration (VQSR) was developed for phase 1; integrated genotype calling is based on individual variant call sets; phase 2 looks to improve site discovery and integration; one filter might be how often we see a mutation in an individual's reads, since sequencing the same position many times increases confidence that it's a real mutation, which is why coverage is important
trio pilot coverage strategy
sequence a few individuals at high coverage to capture variants and test database, looks at individual haploid genomes, phased by transmission
low coverage pilot strategy
uses common haplotypes and statistical phasing (a good strategy, but not as good as the exon approach)
exon pilot coverage strategy
uses exon variants and is unphased
what is the goal of phase 1 analysis?
an integrated view of human variation: reconstruct haplotypes including all variant types, using all datasets; so if a deletion and a SNP are on the same chromosome copy they are linked, while an insertion on a different chromosome copy than a SNP means a different haplotype
main project design
based on the results of the pilot project, the decision was to collect data on 2500 samples from 5 continental groupings: whole genome at low coverage (less than 4x), full exome data at deep coverage (over 20x), a number of deep-coverage genomes to be sequenced with details to be decided, and high-density genotyping at subsets of sites using both Illumina Omni and Affymetrix Axiom; the phase 1 integrated variant release has been made
sequencing and analysis strategy
phasing haplotypes and variant inference: construction of a haplotype scaffold from SNP microarray genotypes, using trio data where available; joint genotyping and statistical phasing of biallelic variants from sequence data onto the haplotype scaffold; independent genotyping and phasing of multi-allelic and complex variants onto the scaffold; integration of variant calls into unified haplotypes; there are major bioinformatic challenges because of the large scale (data has to move around a lot); machine-learning classifiers separate true variants from errors; there is a balance between sensitivity (discovering rare alleles) and specificity (reporting true alleles); genotyping arrays and relatives help construct haplotypes, while low-coverage WGS plus deep exome sequencing give very high power to detect variants and accurate heterozygote calls
biallelic SNPs
position where only 2 nucleotides are observed
human germline mutation rate
count up the number of mutations in the child that are absent from mom and dad to estimate the mutation rate; the rate of new mutations is not very high; males have a 1-2x higher mutation rate than females; most whole-genome estimates converge around the same value; most variants are shared across populations; there is greater diversity (absolute number of variants and proportion of private variants) in Africa because it is the oldest population, with subsequent loss of genetic diversity due to founder effects
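The trio-based estimate above is just a count divided by the number of transmitted base pairs. A sketch, with an illustrative de novo count (trio studies typically report on the order of ~60-70 de novo SNVs per child):

```python
# Per-base-pair, per-generation germline mutation rate from a trio:
# de novo mutations in the child (seen in neither parent) divided by the
# two haploid genomes the child inherited. Numbers are illustrative.
def germline_mutation_rate(de_novo_count, haploid_genome_bp=3.1e9):
    return de_novo_count / (2 * haploid_genome_bp)

mu = germline_mutation_rate(70)
print(f"{mu:.2e}")  # on the order of 1e-8 per bp per generation
```

The same counting done separately for paternally vs. maternally inherited mutations is what reveals the higher male mutation rate.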
admixture
each column is an individual; people in the west carry genetic variants from Africa that differ from people in the east; an admixed genome is one with variants from multiple sources, e.g. someone of Puerto Rican and African descent; when looking at population history, assume K ancestral populations and determine the proportion of ancestry in present-day populations; this gives an idea of how genetic variation is structured among populations (how populations are genetically disconnected or connected by gene flow)
how much of your genome is unique?
every time we sequence a new genome we capture new variants; most variation is rare, but the majority of variants in any single genome are common; estimates suggest improving rare variant discovery adds more variants per genome; most newly discovered variants are singletons (mutations seen only once in the sample)
imputation
statistical inference of unknown genotypes, used to deal with missing data; provides greater power in GWAS; similar accuracy for bi-allelic SNPs, multi-allelic SNPs and bi-allelic indels; greater accuracy for common alleles; because coverage is low we didn't sequence every position in everyone, so use 1000 Genomes data to impute variants that weren't directly observed, inferring from the background level of variation found in the project
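A toy sketch of the imputation intuition: fill in an untyped site by finding the reference haplotype (e.g. from a 1000 Genomes-style panel) that best matches the sample at its typed sites, then copying that haplotype's allele at the missing position. Real tools use probabilistic HMMs over the whole panel; this toy nearest-match version, with invented data, only shows the idea:

```python
# Hypothetical reference panel: each row is a phased haplotype over 5 sites.
reference_panel = [
    [0, 1, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
]
sample = [0, 1, 0, 1, None]  # last site untyped in this sample

def impute(sample, panel):
    typed = [i for i, g in enumerate(sample) if g is not None]
    # Pick the reference haplotype agreeing with the sample at the most typed sites.
    best = max(panel, key=lambda hap: sum(hap[i] == sample[i] for i in typed))
    # Copy the best match's alleles into the untyped positions.
    return [g if g is not None else best[i] for i, g in enumerate(sample)]

print(impute(sample, reference_panel))
```

This works because of linkage: haplotypes travel in blocks, so matching the typed sites is informative about the untyped ones, and it works best for common alleles well represented in the panel.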
genomic impact of damaging and disease causing mutations
Africa always has the most variation because it is the oldest population; smaller populations have had less time to remove potentially damaging mutations; there is a higher number of variants associated with complex traits (GWAS) or disease (ClinVar) in Europe despite lower overall diversity, reflecting ethnic bias vs fixation of deleterious alleles during the migration out of Africa; African populations were better at removing damaging mutations because of increased recombination
coldspots
lower-recombination regions; they show a higher proportion of rare and non-synonymous variants relative to regions of high recombination; common synonymous variants at neutral sites are enriched in highly recombining regions (HRR) vs coldspots (CS) in all populations; results confirmed by regression models, and they hold after correcting for GC-content, gene expression, substitution rates, types of mutations, exon size and total SNP density
accumulation of mutations in individual genomes
odds ratio computed per individual; the distribution per individual for rare variants shows significant differences between populations, with the mean shifted to the right in out-of-Africa populations and larger variance in FC; for variants with minor allele frequency less than 0.001, disease-related mutations are enriched in CS relative to HRR, an effect driven mainly by the coldest regions in the human genome; the lack of common variants implicated is notable in CS; there is higher power in GWAS to find disease risk factors in highly tagged genomic regions
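The per-individual odds ratio above is a standard 2x2 calculation; the variant counts below are invented purely to show the arithmetic:

```python
# Odds ratio for one individual: rare disease-related variants in
# recombination coldspots (CS) vs. highly recombining regions (HRR).
# hit = variant present, miss = sites without the variant class. Toy counts.
def odds_ratio(cs_hit, cs_miss, hrr_hit, hrr_miss):
    return (cs_hit / cs_miss) / (hrr_hit / hrr_miss)

# e.g. one individual's rare (MAF < 0.001) variant tallies:
or_ind = odds_ratio(cs_hit=30, cs_miss=970, hrr_hit=20, hrr_miss=980)
print(round(or_ind, 3))  # > 1 means enrichment in coldspots
```

Computing this per individual and comparing the resulting distributions across populations is what reveals the rightward shift in out-of-Africa groups.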
mapping variants associated with gene expression
we can assign function based on disease relevance, but we can also ask whether the mutations we're capturing are associated with gene expression and phenotype (ex: blood pressure); with 3 genotypes assayed at a position, the pattern shows whether the variant is associated with gene expression