Difference between genetics and genomics
volume and complexity of data
C0t analysis
You heat up dsDNA to separate the two strands, then let it cool so the ssDNA can find complementary sequences and become dsDNA again.
-repeated sequences will find complementary sequences much faster than single copy (rare/unique) sequences
Cot graph
The Y axis is % ssDNA, and the X axis is log10(C0t), where C0t = initial DNA concentration (C0) x time (t)
If a C0t graph has more repeats
the steeper the curve (repeats reanneal quickly, so % ssDNA starts dropping at lower C0t values)
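A minimal sketch of why repeats show up earlier on a C0t curve, assuming ideal second-order reassociation kinetics (fraction still single-stranded = 1 / (1 + k * C0t)). The two rate constants are made-up illustration values, not measured ones.

```python
def fraction_single_stranded(c0t, k):
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * c0t)

# Hypothetical rate constants: a repeated sequence is present in many copies,
# so it finds a partner faster and behaves as if it had a larger k.
K_REPEAT = 100.0
K_SINGLE_COPY = 0.01

for log_c0t in range(-3, 4):
    c0t = 10.0 ** log_c0t
    rep = fraction_single_stranded(c0t, K_REPEAT)
    uniq = fraction_single_stranded(c0t, K_SINGLE_COPY)
    print(f"log10(C0t)={log_c0t:+d}  repeat={rep:6.1%}  single-copy={uniq:6.1%}")
```

At every C0t value the repeat fraction has less ssDNA left than the single-copy fraction, which is the "repeats reanneal faster" point from the card.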
genetic map
the order and distance between genetic markers
genetic map units
recombination frequency
Why use recombination frequency?
Recombination occurs more often in some parts of a chromosome than in others, so the perceived (genetic) distance between markers that are equally spaced physically can be different
genetic map marker
visible phenotype resulting from mutation
physical genome map
ordered collection of clones from a genomic library
most commonly used cloning vector and how long is it
BAC, ~200 kb
physical genome marker
a sequence tagged site (STS)
What is a STS?
STS is a defining/unique part of a genome; any fragment/sequence that hybridizes in only one location in the genome.
Contig
A contiguous set of clones (overlapping sequences that make continuous sense when kept together)
Finger-printed contigs (FPC)
Fingerprint = unique pattern of restriction fragments. Clones with overlapping sequences share part of their fingerprints, which shows that the clones overlap each other. The resulting map can be full of gaps, which need long-range PCR or additional genomic libraries to fix
X value
Denotes the completeness of a map; how much of the sequence has been duplicated in various clones.
Formula for coverage?
Coverage = 1 - e ^ (-X)
How does coverage work?
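If a physical map has 3X coverage, then about 95% of the genome should be covered (1 - e^(-3) ≈ 0.95). It looks weird that this doesn’t depend on the type of genome, but X is already relative to genome size (total clone length divided by genome length), so the same X gives the same fraction for any genome.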
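A quick sketch of the coverage formula from the card above (Coverage = 1 - e^(-X)). The function name is just for illustration.

```python
import math

def fraction_covered(x):
    """Expected fraction of the genome present in at least one clone at X-fold coverage."""
    return 1.0 - math.exp(-x)

for x in (1, 2, 3, 4, 5):
    covered = fraction_covered(x)
    print(f"{x}X: covered={covered:.1%}  missed={1.0 - covered:.1%}")
```

The missed fraction is simply e^(-X), so each extra X of coverage shrinks the gaps by a constant factor.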
Given 4x BAC coverage, draw a representative contig and compute how much of the genome will NOT be sampled. You may use 2 in place of e.
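Fraction NOT sampled = e^(-X) = e^(-4) ≈ 0.018 (about 2%); using 2 in place of e, 2^(-4) = 1/16 ≈ 6%. Still need to ASK about the contig drawing.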
reference genome
the complete sequence of an organism’s chromosomes.
Why is a reference genome an oversimplification of a real genome?
Genomes vary within a species (each lil guy is special). The organism might be diploid while the assembly is haploid, so the reference squashes two different copies into one sequence
Maxam-Gilbert sequencing tech
200-600 nucleotides;
0.01 MB/h (speed);
1e-4 error;
useful for footprinting
Sanger sequencing
500-1000 nucleotides;
0.1-0.2 MB/h; 1e-4 error;
useful for verification
pyrosequencing
200-500 nucleotides;
20-30 MB/h;
1e-3 error
SOLiD
25-35 nucleotides;
5-15 MB/h;
1e-2 error
EARLY Illumina
25-50 nucleotides;
20 MB/h;
1e-2 error
LATE Illumina
100-150 nucleotides;
50,000 MB/h;
3e-3 error;
useful for RNA-seq, ChIP-seq, etc.
PacBio
30 kb reads;
1300 MB/h;
10-20% error;
useful for genome assembly
Oxford Nanopore
15 kb reads;
700 MB/h;
10-20% error;
useful for genome assembly
What are the 3 eras of sequencing and how do they solve the genome assembly problem?
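Sanger era → build a physical map of clones (FPC), shotgun sequence the BACs along a minimal tiling path, assemble with OLC
Short-read era → whole genome shotgun (WGS) with PCR amplification instead of cloning; no BAC clones/subclones needed
Long-read era → single-molecule long reads (PacBio, Oxford Nanopore) that include repetitive structures, so whole chromosomes can be assembled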
sanger era of sequencing
-relied on cut and compare tactics
-pick a minimal tiling path across clones (golden path)
-you did shotgun sequencing (randomly chosen sub-clones); the sequences that aligned with each other were put into a contig
-altogether you make a consensus sequence (OLC approach: overlap, layout, consensus)
-not easy to connect contigs (lots of experimentation with PCR)
-once the sequences were complete, you could keep combining them and go on to create chromosomes
scaffolds
a natural/logical connection between contigs: contigs placed in a known order and orientation, with gaps still between them
FPC-OLC approach produced which genomes?
S. cerevisiae, C. elegans, D. melanogaster, Homo sapiens, A. thaliana, M. musculus
Issues of sanger era
1) need to create a physical map
2) need to subclone the BACs to sequence them
Short read era /NGS era
Instead of cloning DNA, they used PCR amplification, and sequencing could be done as a WHOLE GENOME SHOTGUN (WGS) without BAC clones/subclones; involved SOLiD, pyrosequencing, and Illumina
-assembly quality is described by the N50 value
Problems of NGS/short read stuff (pyro, illumina, SOLiD)?
They were…short (at most 150 bp). Genomes are full of repeats that short reads can’t span, and coverage was uneven (ex. Illumina did NOT like GC-rich sequences and didn’t sequence them).
-tons of contigs (not great for whole chromosome sequencing)
How to calculate N50 value?
-sort all contigs by size, largest first
(example contigs: 10, 9, 8, 7, 6, 5, 4, 3, 2, 1; total is 55)
-add up contig sizes IN ORDER OF SIZE until the running sum reaches at least half the total assembly size
half of 55 is 27.5
10+9+8 = 27 (not there yet); 10+9+8+7 = 34 (past 27.5)
-the contig that pushed the sum past the halfway mark is 7
(in this example, counting up from the small end, 1+2+3+4+5+6+7 = 28, also crosses at 7, but the convention is to go from largest to smallest)
So 7 is your N50.
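The N50 steps above can be sketched as a small function. This is a minimal illustration using the usual largest-first convention; the function name is made up for the example.

```python
def n50(lengths):
    """Return the contig length at which the running total, counted from the
    largest contig down, first reaches half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("need at least one contig")

# The example from the card: contigs of size 1 through 10 (total 55).
print(n50(range(1, 11)))
```

A bigger N50 means more of the assembly sits in long contigs, which is why it is used to describe short-read assemblies.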
long-read era
Current era; PacBio, Oxford nanopore.
single molecule sequencing technologies that can do long reads with high error rates, but overall better for genome assembly.
Inclusive of repetitive structures.
difficult to assemble since you’re assembling whole chromosomes rather than just one BAC.
E. coli genome size
4.6 M
D. radiodurans genome size
3.1 M
S. cerevisiae genome size
12.1 M
C. elegans genome size
100 M
D. melanogaster genome size
140 M
A. thaliana genome size
160 M
O. sativa genome size
430 M
Z. mays genome size
2.5 G
P. taeda genome size
22 G
Human genome size
3.1 G
M. musculus genome size
2.5 G
D. rerio genome size
1.7 G
T. rubripes genome size
390 M
We tend to reserve the word repeats for…
…something that doesn’t have an obvious function. For example, different members/branches of a gene family would not necessarily be called repeats.
tandem repeats
occur right next to each other.
includes microsatellites or minisatellites
can grow/shrink based on replication slippage
microsatellites / SSR / STR
really really short, like only a few nucleotides
most common in humans is AC repeat
minisatellites
when the repeat is longer (10-60 bp)
long tandem repeats can also be called duplications/inversions depending on..
…whether they are in a row or in an antiparallel setup.
Interspersed repeats
distributed across the entire genome; though more concentrated in some regions than others
can copy themselves to new locations > can interrupt genes
can help combat gene conversion and homologous recombination (identical repeats get copied over each other, which keeps sequences looking similar; combating that maintains variation in sequences)
RNA transposons
RNA retrotransposons. 42% of the genome is this. They move through an RNA intermediate, which is reverse transcribed from RNA back to DNA
LTR and non LTR
LTR transposons
long terminal repeats (LTRs) at the ends (~100 nucleotides)
but the bulk of it encodes enzymes that help it replicate and integrate into the genome (5 kb or more)
NON LTR retrotransposons are also called LINEs
LINE stands for long interspersed nuclear elements
21% of genome is composed of LINES
recall the LINE1 and SINE elements
LINE
this is a non-LTR retrotransposon; LINEs make up to 21% of ur genome.
About 6 kb long; encodes an endonuclease and a reverse transcriptase. Only a few copies are still active, because most of them are truncated (the reverse transcriptase tends to fall off before the copy is finished). Still, these can be inserted into the genome (at AT-rich sequences)
SINE
short interspersed nuclear element; the main example is the Alu element.
100-700 nucleotides
the genome has 1 million copies of the 300 nt Alu (10%)
originate from small RNAs like 7SL RNA (part of the signal recognition particle) or from tRNA variants
transcribed by RNAP III
reverse transcribed by LINE RT (DON’T NEED THEIR OWN ENZYMES to copy themselves to new locations)
DNA transposons
do NOT go through an RNA intermediate. Both proks and euks have’em
structure is a transposase (enzyme) gene flanked by terminal inverted repeats (TIRs)
recall helitrons and polintons
helitrons
eukaryotic DNA transposon that makes copies of itself using rolling circle replication (present in most eukaryotes but not common)
polintons
newest member of DNA transposons
15-20 kb in length
widely distributed in eukaryotes
Pseudogenes
“broken” protein coding genes
broken = truncation, frameshift, nonsense mutations
consider the 4 classes
Processed Pseudogenes
most common; retro-pseudogene.
mRNA is rev-transcribed to DNA and put into genome
have poly A tails and all introns removed
derived from high expression genes
common to be truncated bc the RT falls off before it reaches 5’ end
Non processed DNA pseudogenes
result of incomplete duplication; look really similar to original sequence; sometimes the copy can pick up mutations and become useless
unitary pseudogenes
a gene that became nonfunctional without being duplicated first, so no working copy is left
GULO
example of a unitary pseudogene. Broken version of L-gulonolactone oxidase in the human genome, which is why we gotta eat vitamin C and most other mammals don’t
pseudo-pseudogenes
a gene with nonsense mutations that can still function normally, or whose protein isn’t translated but whose nucleotide sequence acts like a decoy so the actual gene is unaffected. a secret agent gene if you will
where does variation come from?
replication errors happen about once per 100 kb
external mutation sources (UV, radiation, etc.)
every cell division there are about 30,000 errors per haploid genome
no two cells are alike
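A rough check of where the 30,000 figure comes from, assuming the error rate on this card (one per 100 kb) and a haploid genome of about 3 billion bp. Variable names are just for illustration.

```python
BP_PER_ERROR = 100_000             # one replication error per 100 kb (from the card)
HAPLOID_GENOME_BP = 3_000_000_000  # assumed haploid human genome size, ~3 billion bp

errors_per_division = HAPLOID_GENOME_BP // BP_PER_ERROR
print(f"{errors_per_division:,} errors per haploid genome per division")
```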
SNPs
single nucleotide polymorphisms are the most common form
mutations that segregate in a population that may or may not lead to a phenotype
can be markers in GWAS(genome wide association study)
classified as noncoding, coding, indel
non coding SNP
occurs in intergenic regions (between genes)
in intragenic regions(within a gene)
which can be a splice site, within an intron, or a UTR
coding synonymous snp
this SNP doesn’t change the amino acid
non synonymous coding SNP
this coding snp can be a missense (changes the amino acid) or nonsense (changes amino acid to a stop)
indel frameshift SNP
an insertion or deletion in a protein coding region that causes a frameshift
indel noncoding snp
an insertion or deletion that falls in a noncoding region, so no reading frame is affected. CHECK LECTURE
structural variants
structural variants (SVs) are genetic differences that make larger changes to chromosomes than single nucleotide variations (SNVs). harder to assess than SNPs, so how common they are in the population isn’t well known
includes insertions, deletions, inversions, translocations, copy number variations (CNVs)
CNV
a structural variation where a region of a chromosome differs in its number of copies.
EX. fragile x syndrome and dup15q syndrome
size of human genome
around 3 billion bp
longest chromosomes are what and how long
1 (249 Mbp),
2 (242 Mbp),
3 (198 Mbp)
smallest chromosomes are what and how long
21 (47 Mbp),
22 (51 Mbp),
Y (57 Mbp)
Tell me ur gene trivia
we got 20k protein coding genes
about 1% of the genome is protein coding DNA
we got no clue about how many RNA genes we have
How much of the genome is repetitive?
More than 50 percent is repetitive
13% SINE (11 percent Alu)
20% LINE (17 percent LINE1)
8 percent LTR transposons
3 percent DNA elements
3 percent SSR
3 percent duplications
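A quick tally of the repeat classes listed on this card, just to see how they add up. The dictionary keys are labels copied from the card; nothing here is new data.

```python
# Percent of the genome per repeat class, as listed on the card.
repeat_classes = {
    "SINE": 13,
    "LINE": 20,
    "LTR transposons": 8,
    "DNA elements": 3,
    "SSR": 3,
    "duplications": 3,
}

total_percent = sum(repeat_classes.values())
print(f"listed repeat classes add up to {total_percent}% of the genome")
```

The listed classes alone reach 50%, so "more than 50 percent" presumably counts repeats that aren’t itemized here.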
people differ from each other by about
1 SNP per 1000 bp
free living bacteria
are bacteria that can live more or less anywhere since they aren’t dependent on other bacteria; they only make up about 1 percent or less of all bacteria (no idea really)
metagenomics / environmental sequencing
sequencing literally everything in a particular environment to see how it all relates to each other
the simplest form of metagenomic analysis is sequencing 16S rDNA. what’s up with 16S that makes it cool?
it has highly conserved sequences, so you can PCR it with very universal/highly applicable primers and then look at the variable regions between them. not a very high-res technique tho
a rarefaction curve
shows how the number of different species (OTUs) found grows as you sequence more; when the curve flattens out, you have found most of the species present in the sample
What do operational taxonomic units (OTUs) refer to?
A kind of standard for differentiating species by sequence; the number of OTUs is an indication of how complex your environment is.
Higher OTU = higher complexity (less duplicates)
Low OTU = less complex (more duplicates)
X and Y axis for rarefaction curve?
x axis - number of sequences
y axis - unique OTU
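A small sketch of how the points on a rarefaction curve could be computed: shuffle the reads, then count the unique OTUs seen after every so many sequences. The OTU labels and read counts are a made-up sample, and the function name is just for illustration.

```python
import random

def rarefaction_curve(reads, step, seed=0):
    """Return (number of sequences, unique OTUs seen so far) pairs."""
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    seen = set()
    points = []
    for count, otu in enumerate(shuffled, start=1):
        seen.add(otu)
        if count % step == 0:
            points.append((count, len(seen)))
    return points

# Hypothetical sample: one dominant OTU and a few rare ones (100 reads total).
reads = ["OTU_A"] * 60 + ["OTU_B"] * 25 + ["OTU_C"] * 10 + ["OTU_D"] * 4 + ["OTU_E"]

for n_sequences, n_otus in rarefaction_curve(reads, step=20):
    print(n_sequences, n_otus)
```

The x values are the number of sequences and the y values are the unique OTUs, matching the axes on the card. The curve can only go up or stay flat, and it levels off once the rare OTUs have been found.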
Do people still target the 16s?
No. Most people are doing shotgun sequencing (randomly sequencing whole genomes). They compare the reads to known proteins and figure out what their functions might be. Sometimes you wanna know who’s in your sample, sometimes you wanna know what’s even happening in there
When studying RNA, why do we use cDNA? What is cDNA?
RNA is not stable, so we use reverse transcriptase to convert the RNA sequence into a DNA sequence. That DNA copy is cDNA (complementary DNA)
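A toy sketch of what reverse transcription does at the sequence level: the first cDNA strand is the DNA reverse complement of the RNA. The function name and the example sequence are made up for illustration, and this ignores primers and everything else about the real reaction.

```python
# RNA base -> complementary DNA base
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5' to 3') for an mRNA given 5' to 3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))

print(first_strand_cdna("AUGGCC"))
```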
reverse transcriptase - all that you know