myth of the human genome
the Human Genome Project was an average over a few people and one cell type; individual variation is the main thing when we talk about personalized genomics (germline or somatic), along with pathogen genomes, programmed DNA changes (B cells and T cells diversify their genomes), and the epigenome (chemical modifications layered on top of the genome)
individual variation terms
SNV (single nucleotide variant); MAF (minor allele frequency: at a position in the human genome that is variable, the frequency of the less common allele); SNP (single nucleotide polymorphism: a common variant, typically MAF over 1%)
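The MAF definition above boils down to simple allele counting. A minimal sketch (the genotype sample below is made up for illustration):

```python
# Minimal sketch: computing minor allele frequency (MAF) at one position.
# Genotypes are counts of the alternate allele (0, 1, or 2) per diploid individual.
def minor_allele_frequency(genotypes):
    alt_alleles = sum(genotypes)           # total alternate-allele count
    total_alleles = 2 * len(genotypes)     # diploid: 2 alleles per person
    alt_freq = alt_alleles / total_alleles
    return min(alt_freq, 1 - alt_freq)    # MAF is the rarer allele's frequency

genos = [0, 0, 1, 0, 2, 0, 1, 0, 0, 0]    # 10 hypothetical individuals
print(minor_allele_frequency(genos))       # 4 alt alleles / 20 = 0.2
```

A position with MAF 0.2 would count as a SNP under the "MAF over 1%" convention; at MAF 0.001 it would be a rare variant.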
genotyping
usually refers to spot sequencing of specific positions where variation is common in humans, or less common but severe when it happens (ex: BRCA1 gene variants causing cancer); resequencing is when a species has been sequenced but a particular individual hasn't; exome sequencing specifically covers the part of the genome that encodes proteins plus splice junctions (which don't code protein but often carry mutations that matter); whole genome includes everything, including non-coding regions (more expensive but gives more info, though we don't really know how to interpret much of it yet)
variants
germline variants can be analyzed by genotyping if they're common; if rare we use exome or whole genome sequencing because we don't know where they're located; there are also somatic variants, burden-of-mutation studies, and experimental testing of rare variant function
genome wide association studies
this is how you map the gene for something; it's used for the common variant/common disease paradigm: study positions that commonly vary in the population, genotype everyone at these positions, and ask which positions are correlated with getting the disease, i.e. are there positions where MAF differs between cases and unaffected controls, which suggests something in the neighborhood of the variant has to do with the disease; the more correlated a SNP position is with being in the case group vs the control group, the higher it sits on the plot, and points above the significance line are unlikely to be chance associations; the genome-wide threshold (around 10^-8, conventionally p < 5 × 10^-8) is stringent to account for associations arising by chance across the many positions tested
GWAS challenges
imagine the disease variant is the index causal SNV, landed in a region of the genome next to a bunch of variable positions; not enough time has passed to scramble all the mutations in that region (you only get a few crossovers per chromosome per generation), so linked SNPs form a collection of linked alleles where any one of them could be the causative one; even if we know the locus, one gene has many variants
haplotype block
group of linked variants, one of which is the disease-causing one (every position in the block is variable)
common variants
have small effects, which makes sense: if a variant were really bad for you it wouldn't still be in the population
rare variants
these are the ones that are really bad for you, so we want to study them, but we can't do association tests on them because they are rare; we can easily connect disease to mutations that are common, but common mutations don't seem to be the cause of most disease
functional mutations
most are deleterious; selection indicates functional mutations, whether or not the tested trait is under selection; a pathogenic mutation doesn't have to abolish the protein's ability to fold, since most pathogenic mutations act by reducing stability
PolyPhen2
puts lots of different sources of info together to predict whether variants are functional: if we don't see the variant in cousin species, it's more likely to be damaging, and if it sits in the protein's active site it's also more likely to be damaging
somatic cancer (osteosarcoma) mutations
they are an example of burden-of-mutation analysis: sequence the tumors, and since the mutations are somatic each one is rare, but there are some hotspots; TP53 mutations have high overall prevalence even though individually each is rare; the challenge with burden-of-mutation studies is the mixing of functional mutations with neutral ones, which creates noise in the data (we want to distinguish them)
what is the ideal scenario of experimental tests of rare variants?
it would be to have an animal model of the disease so candidate mutations from a human gene can be introduced into the corresponding animal gene; if the mutation is pathogenic in humans, the animal shows a similar disease; the issue is this exists for relatively few human disease genes, and where it does, testing all the variants we find is rarely feasible (too expensive)
experimental testing of rare variation
surrogate genetics to identify pathogenic human variation, deep mutational scanning atlases by functional complementation, potential to generate atlases for most human disease genes
trans species complementation
surrogate genetics: you have a normal yeast gene that is a cousin (ortholog) of the human one, and a defective yeast mutant; the human gene rescues the defective strain, allowing normal cell growth; a human variant that still rescues is neutral, while one that no longer rescues is functional (damaging)
can yeast complementation separate disease from non disease variants?
the gene has 2 disease-associated alleles; do a 5-fold serial dilution: at the permissive temperature the yeast mutant still works, so you get normal cell growth, but at the non-permissive temperature it's too high and the cell doesn't grow (no rescue/cell death)
precision vs recall (PolyPhen-2)
precision means that at a given threshold of PolyPhen score, among variants predicted to be disease, the fraction that actually are; recall measures the fraction of true disease mutations that are predicted; based on the 138 variants in 21 genes (76 disease and 62 non-disease), there was high confidence in the results; the right level of precision is context dependent
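The precision/recall definitions above reduce to ratios over a confusion matrix. A sketch with hypothetical confusion counts (the source gives the 76/62 split but not the predictor's actual hits and misses, so the counts below are invented):

```python
# Precision/recall at a fixed score threshold for a predictor like PolyPhen-2.
# tp = true disease variants predicted disease, fp = non-disease predicted
# disease, fn = disease variants missed. Counts here are hypothetical.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of variants called "disease", fraction correct
    recall = tp / (tp + fn)     # of true disease variants, fraction recovered
    return precision, recall

# e.g. suppose 60 of the 76 disease variants were caught, with 10 false alarms:
p, r = precision_recall(tp=60, fp=10, fn=16)
print(round(p, 3), round(r, 3))
```

Raising the score threshold typically trades recall for precision, which is why "the right level of precision is context dependent".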
yeast complementation for human disease genes
complementation beats computation, even when the model organism has a billion years of divergence and the phenotype is as simple as cell growth; there are currently 176 human disease genes with a yeast complementation assay, but possibly many more using sensitized backgrounds and environments; balance tractability (whether we can run a lot of experiments to test variants) with fidelity (how closely the model organism can test the function of the human gene)
SUMO E2 conjugase UBE2I
SUMO is a protein tag that gets added onto other proteins; UBE2I is the enzyme that attaches the tag to its substrates; it's used in yeast functional complementation, with 19 Y2H interactions, co-crystal structures, and somatic cancer mutations of unknown significance; for scoring, synonymous variants showing no dropout at all makes sense, since synonymous changes shouldn't affect function, whereas nonsense ones create a premature stop and so impair protein function
genophenogram for human SUMO conjugase (UBE2I)
basically makes a heatmap of all the possible amino acid changes and how well each one functions in the yeast assay
Cys94
performs catalysis: it covalently attaches to SUMO and hands it off to the substrate; a mutation at this cysteine should be bad, and indeed all changes at that position are deleterious
CALM1 (calmodulin 1)
associated with tachycardia
forward genetics
the patient goes to the hospital and gets genotyped, then you make an expression clone for the variant, do functional assays (complementation or edgotyping), and report back to the hospital
reverse genetics
an atlas of variant effects: prepare an atlas of the functional impact of all possible missense variants before they are seen in the clinic, i.e. measure all possible amino acid changes ahead of time; when patients come into the clinic and get their genome sequenced, you just look up the table to see if the variant is functional
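The reverse-genetics idea reduces clinical interpretation to a table lookup. A toy sketch (the variant names, scores, and threshold below are all hypothetical, standing in for a real precomputed atlas):

```python
# Toy "atlas": precomputed functional scores from a complementation assay,
# keyed by missense variant. All entries here are made up for illustration.
variant_atlas = {
    "C94A": 0.05,  # near-total loss of function in the assay
    "A12T": 0.98,  # behaves like wild type
}

def interpret(variant, threshold=0.5):
    # Clinical lookup: no new experiment needed, just consult the table.
    score = variant_atlas.get(variant)
    if score is None:
        return "not in atlas"
    return "likely damaging" if score < threshold else "likely tolerated"

print(interpret("C94A"))  # likely damaging
```

The contrast with forward genetics is that the expensive functional assay happens once, up front, for all variants, instead of per patient.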
protein interaction assay
see if protein can interact with 1 or more of its partners
1000 genomes project
international project to construct a foundational data set for human genetics; discover virtually all common variation by investigating many genomes at base-pair level; consortium with multiple centers, platforms and funders; it aims to discover human genetic variation of all types (95% of variation with over 1% frequency), define haplotype structure in the human genome and variation by individual, and develop sequence analysis methods, tools and other reagents that can be transferred to other sequencing projects; applications include large-scale genome variation, population variation, patterns of selection, genome evolution, markers for GWAS studies, design of genotyping arrays, origin of disease and functional inferences
Hapmap project
had 14000 individuals across 11 populations using high-throughput genotyping chips; similar to a whole-genome project on a smaller scale; also captured haplotype variation (how SNPs are linked to each other on the same chromosome)
1000 genomes pilot project
had 14 million SNPs, 179 individuals and 4 populations; low-coverage next-gen sequencing just covers the genome a few times, and the more times we sequence a genome, the better we are at capturing SNPs (medium/deep coverage being ~50x)
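The coverage intuition above can be made concrete: under a simple Poisson model of read starts (the Lander-Waterman assumption, used here as an illustration rather than the project's actual model), a base with mean depth c is sequenced at least once with probability 1 - e^(-c):

```python
import math

# Fraction of bases seen at least `min_reads` times, assuming read depth at a
# base is Poisson-distributed with the given mean. A sketch of why low-coverage
# sequencing misses variants, not the 1000 Genomes pipeline itself.
def frac_covered(mean_depth, min_reads=1):
    # P(X >= min_reads) for X ~ Poisson(mean_depth)
    p_below = sum(math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
                  for k in range(min_reads))
    return 1 - p_below

for depth in (2, 4, 30):
    print(depth, round(frac_covered(depth), 4))
```

At 2-4x a real fraction of bases is never sequenced at all (and heterozygous sites need several reads to call confidently), while deep coverage makes missing a site very unlikely.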
1000 genomes phase 1
had way more SNPs (36.6 million), 1k individuals and 14 populations, with low-coverage whole genome plus exome next-gen sequencing
1000 genomes phase 2
1715 individuals, 19 populations, low-coverage whole genome plus exome next-gen sequencing; the project has data from 3 different providers and multiple platforms; the 454 Titanium platform has a max read length of 400 bp, so very long reads, which indicates greater per-read quality; Illumina GA II and HiSeq are used because, although the reads are shorter (~100 bp), they are more cost-effective while still being high-throughput
what are the main steps for data processing and variant calling?
these are standard tools for aligning reads to a reference and calling variants; the main steps are: read mapping, duplicate filtering, base quality value recalibration (quality filters on base calls), INDEL realignment, variant site discovery, individual genotype assignment (sometimes part of site discovery), variant filtering/call set refinement, and variant reporting
alignment data
there have been over 10 releases of alignment data (aligned to a reference, which is much faster than de novo assembly, where you sequence from scratch); the pilot project was aligned to NCBI36, phase 1 to GRCh37 (still used for assembly and alignment), and phase 2 to an extended GRCh37, leading to improvements in base quality recalibration
variant calling
developed by 1000 Genomes: early call sets used a single variant caller; an intersect approach was developed during the pilot; variant quality score recalibration (VQSR) was developed for phase 1; integrated genotype calling is based on individual variant call sets; phase 2 looks to improve site discovery and integration; one filter might be how often we see a mutation in an individual's reads, since sequencing the same position many times increases confidence that it's a real mutation, which is why coverage is important
trio pilot coverage strategy
sequence a few individuals at high coverage to capture variants and test database, looks at individual haploid genomes, phased by transmission
low coverage pilot strategy
uses common haplotypes and statistical phasing (a good strategy, but not as good as the exon approach)
exon pilot coverage strategy
uses exon variants and is unphased
what is the goal of phase 1 analysis?
an integrated view of human variation: reconstruct haplotypes including all variant types, using all datasets; so if a deletion and a SNP are on the same chromosome copy they are linked, while an insertion on a different chromosome copy than a SNP means a different haplotype
main project design
based on the results of the pilot project, the decision was to collect data on 2500 samples from 5 continental groupings: whole genome at low coverage (less than 4x), full exome data at deep coverage (over 20x), a number of deep-coverage genomes to be sequenced with details to be decided, and high-density genotyping at subsets of sites using both Illumina Omni and Affymetrix Axiom; the phase 1 integrated variant release has been made
sequencing and analysis strategy
phasing haplotypes and variant inference: construction of a haplotype scaffold from SNP microarray genotypes, using trio data where available; joint genotyping and statistical phasing of biallelic variants from sequence data onto the haplotype scaffold; independent genotyping and phasing of multi-allelic and complex variants onto the scaffold; integration of variant calls into unified haplotypes; there are major bioinformatic challenges because of the large scale (data has to move around a lot); machine-learning classifiers separate true variants from errors; there is a balance between sensitivity (discovering rare alleles) and specificity (reporting true alleles); genotyping arrays and relatives help construct haplotypes, while low-coverage WGS plus deep exome sequencing give very high power to detect variants and accurate heterozygote calls
biallelic SNPs
position where only 2 nucleotides are observed
human germline mutation rate
count up the number of mutations in the child that are absent from mom and dad to estimate the mutation rate; the rate of new mutations is not very high; males have a 1-2x higher mutation rate than females; most whole-genome estimates converge around the same value; most variants are shared across populations; there is greater diversity (absolute number of variants and proportion of private variants) in Africa because it is the oldest population, with subsequent loss of genetic diversity due to founder effects
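The trio-based estimate above is just a count divided by the number of transmitted base pairs. A sketch, with an illustrative de novo count (trio studies typically report on the order of ~60-70 de novo SNVs per child):

```python
# Per-base-pair, per-generation germline mutation rate from a trio:
# de novo mutations in the child (seen in neither parent) divided by the
# two haploid genomes the child inherited. Numbers are illustrative.
def germline_mutation_rate(de_novo_count, haploid_genome_bp=3.1e9):
    return de_novo_count / (2 * haploid_genome_bp)

mu = germline_mutation_rate(70)
print(f"{mu:.2e}")  # on the order of 1e-8 per bp per generation
```

The same counting done separately for paternally vs. maternally inherited mutations is what reveals the higher male mutation rate.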
admixture
each column is an individual; people in the west carry genetic variants from Africa that differ from people in the east; an admixed genome is one with variants from multiple sources, e.g. someone of Puerto Rican and African descent; when looking at population history, assume K ancestral populations and determine the proportion of ancestry in present-day populations; this gives an idea of how genetic variation is structured among populations (how populations are genetically disconnected or connected by gene flow)
how much of your genome is unique?
every time we sequence a new genome we capture new variants; most variation is rare, but the majority of variants in any single genome are common; estimates suggest improving rare variant discovery adds more variants per genome; most newly discovered variants are singletons (mutations seen only once in the sample)
imputation
statistical inference of unknown genotypes, used to deal with missing data; provides greater power in GWAS; similar accuracy for bi-allelic SNPs, multi-allelic SNPs and bi-allelic indels; greater accuracy for common alleles; because coverage is low we didn't sequence every position in everyone, so use 1000 Genomes data to impute variants that weren't directly observed, inferring from the background level of variation found in the project
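A toy sketch of the imputation intuition: fill in an untyped site by finding the reference haplotype (e.g. from a 1000 Genomes-style panel) that best matches the sample at its typed sites, then copying that haplotype's allele at the missing position. Real tools use probabilistic HMMs over the whole panel; this toy nearest-match version, with invented data, only shows the idea:

```python
# Hypothetical reference panel: each row is a phased haplotype over 5 sites.
reference_panel = [
    [0, 1, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
]
sample = [0, 1, 0, 1, None]  # last site untyped in this sample

def impute(sample, panel):
    typed = [i for i, g in enumerate(sample) if g is not None]
    # Pick the reference haplotype agreeing with the sample at the most typed sites.
    best = max(panel, key=lambda hap: sum(hap[i] == sample[i] for i in typed))
    # Copy the best match's alleles into the untyped positions.
    return [g if g is not None else best[i] for i, g in enumerate(sample)]

print(impute(sample, reference_panel))
```

This works because of linkage: haplotypes travel in blocks, so matching the typed sites is informative about the untyped ones, and it works best for common alleles well represented in the panel.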
genomic impact of damaging and disease causing mutations
Africa always has the most variation because it is the oldest population; smaller populations have had less time to remove potentially damaging mutations; there is a higher number of variants associated with complex traits (GWAS) or disease (ClinVar) in Europe despite lower overall diversity, reflecting ethnic bias vs fixation of deleterious alleles during the migration out of Africa; African populations were better at removing damaging mutations because of increased recombination
coldspots
lower-recombination regions; they show a higher proportion of rare and non-synonymous variants relative to regions of high recombination; common synonymous variants at neutral sites are enriched in highly recombining regions (HRR) vs coldspots (CS) in all populations; results confirmed by regression models, and they hold after correcting for GC-content, gene expression, substitution rates, types of mutations, exon size and total SNP density
accumulation of mutations in individual genomes
odds ratio computed per individual; the distribution per individual for rare variants shows significant differences between populations, with the mean shifted to the right in out-of-Africa populations and larger variance in FC; for variants with minor allele frequency less than 0.001, disease-related mutations are enriched in CS relative to HRR, an effect driven mainly by the coldest regions in the human genome; the lack of common variants implicated is notable in CS; there is higher power in GWAS to find disease risk factors in highly tagged genomic regions
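The per-individual odds ratio above is a standard 2x2 calculation; the variant counts below are invented purely to show the arithmetic:

```python
# Odds ratio for one individual: rare disease-related variants in
# recombination coldspots (CS) vs. highly recombining regions (HRR).
# hit = variant present, miss = sites without the variant class. Toy counts.
def odds_ratio(cs_hit, cs_miss, hrr_hit, hrr_miss):
    return (cs_hit / cs_miss) / (hrr_hit / hrr_miss)

# e.g. one individual's rare (MAF < 0.001) variant tallies:
or_ind = odds_ratio(cs_hit=30, cs_miss=970, hrr_hit=20, hrr_miss=980)
print(round(or_ind, 3))  # > 1 means enrichment in coldspots
```

Computing this per individual and comparing the resulting distributions across populations is what reveals the rightward shift in out-of-Africa groups.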
mapping variants associated with gene expression
we can assign function based on disease relevance, but we can also ask whether the mutations we're capturing are associated with gene expression and phenotype (ex: blood pressure); with 3 genotypes assayed at a position, the pattern shows whether the variant is associated with gene expression