genomics - rna biology II


55 Terms

1

RNA seq

genomic technique that uses next gen sequencing to analyze the quantity and presence of rna molecules in a biological sample

2

how to remove highly abundant rRNA

-rRNA is >90% of total RNA

-enrich for mRNA using polyA selection, which requires a higher amount of starting material and minimal degradation

-or deplete rRNA when input quantity or RIN is low (prokaryotes can only use rRNA depletion, since their mRNA lacks stable poly(A) tails)

3

polyA selection

-can be done during illumina rna seq library prep

-RNA degradation produces 3’ bias

-non-polyA RNAs are not recovered

4

ribosomal rna subtraction

-species-specific probes

-allow enrichment of non-poly(a) transcripts

5

increase in biological replication

significantly increases statistical power and the number of differentially expressed genes identified

easier outlier detection and removal

6

how many replicates do i need

-minimum 3-6 biological replicates

-statistical power increases w effect size, sequencing depth, and number of replicates per group

7

how many reads do we need

>10 reads per gene per sample is standard cutoff
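
A minimal sketch of applying this cutoff, assuming a hypothetical count table and reading ">10" as "strictly more than 10 reads in every sample" (the card does not say whether the cutoff applies to every sample or to the total):

```python
# Hypothetical count table: gene -> raw read counts, one value per sample.
counts = {
    "geneA": [120, 98, 143],
    "geneB": [3, 0, 7],
    "geneC": [15, 22, 11],
}

MIN_READS = 10  # cutoff from the card


def passes_cutoff(sample_counts, min_reads=MIN_READS):
    """Keep a gene only if every sample has more than min_reads reads."""
    return all(c > min_reads for c in sample_counts)


kept = {gene: c for gene, c in counts.items() if passes_cutoff(c)}
print(sorted(kept))
```

Here geneB is dropped because none of its samples exceeds the cutoff.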

8

reads for mrna genes

5-30 million reads per sample

9

reads for mrna transcripts

for measuring alternative splicing

30-60 million reads per sample

10

reads for transcript discovery

100-200 million short reads

long read data better

11

reads for mi-rna seq or small-rna

-varies significantly depending on the tissue type being sequenced

-most applications require 1-5 million reads per sample

12

read length

-affects the ability to determine where each read in the transcriptome came from

-longer reads do not add much value in quantification-based analysis but are valuable for isoform analysis

13

gene expression/rna profiling read length

50-75 bp

14

read length for novel transcriptome assembly and annotation

longer, paired-end reads (2 × 75 bp or 2 × 100 bp), or long read sequencing

15

read length for small rna

a single read (usually 50 bp) typically covers the entire sequence

16

paired end sequencing

-improves read mapping

-preferred for alternative-exon quantification, fusion transcript detection and de novo transcript discovery, particularly when working with poorly annotated transcriptomes

17

single end sequencing

[image]

18

DNA contamination

-can be mapped back as intergenic or intronic sequence

-cannot distinguish between contamination vs alternative splicing, unannotated or noncoding transcripts, or spurious transcription

-treat w DNase to remove dna contamination

19

illumina short read sequencing

<200 bp

-the de facto method to detect and quantify transcriptome-wide gene expression

-cheaper

-easier to implement

-comprehensive, high quality data

20

long read cDNA sequencing

-converting mrna to cdna before sequencing

-pacbio and nanopore

-up to 50 kb

generate full length isoform reads

-isoform detection

-de novo transcriptome analysis

-fusion transcript detection

21

long read direct rna sequencing (drna-seq)

-no cdna synthesis or pcr amplification during library prep

-nanopore

1-10 kb

-all analysis same as long-read rna seq

-detect base modification

-estimate poly A tail length

22

long read technologies limitations

-lower throughput

-lower sensitivity, depends on rna integrity, and cdna synthesis can be truncated

-biases inherent to sequencing platforms

-low diffusion of long library molecules onto the surface of the sequencing chip can reduce the coverage of longer transcripts

23

main sources of variation

batch effects

lane effects

24

batch effects

any errors that occur after random fragmentation of the rna until it is input to the flow cell

ex: pcr amplification and reverse transcription artifacts

variations in reagents, supplies, instruments and operators may introduce random or systematic errors at any step of rna-seq data generation

25

lane effects

any errors that occur from the point at which the sample is input to the flow cell until data are output from the sequencing machine

ex: systematically bad sequencing cycles and errors in base calling

26

randomization

randomize samples across library preparation batches and lanes so as to avoid technical factors becoming confounded with experimental factors
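
One way this could look in practice is a blocked (stratified) randomization: shuffle within each condition, then deal samples round-robin into batches so no batch is dominated by one group. The sample names, the `randomize_to_batches` helper, and the fixed seed are all illustrative assumptions:

```python
import random


def randomize_to_batches(samples, n_batches, seed=0):
    """Shuffle within each condition, then deal samples round-robin into batches.

    samples: list of (sample_name, condition) tuples.
    Returns a list of batches, each a list of (sample_name, condition).
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_condition = {}
    for name, condition in samples:
        by_condition.setdefault(condition, []).append(name)

    batches = [[] for _ in range(n_batches)]
    i = 0
    for condition in sorted(by_condition):
        names = by_condition[condition]
        rng.shuffle(names)  # random order within the condition
        for name in names:
            batches[i % n_batches].append((name, condition))
            i += 1
    return batches


# Hypothetical design: 4 control and 4 treated samples, 2 library-prep batches.
samples = [(f"ctrl_{k}", "control") for k in range(4)]
samples += [(f"trt_{k}", "treated") for k in range(4)]
batches = randomize_to_batches(samples, n_batches=2)
for b, members in enumerate(batches):
    print(b, members)
```

With this design each batch receives two control and two treated samples, so batch is not confounded with treatment. The same idea applies to assigning libraries to lanes.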

27

FastQC

assessment of data quality

-num of reads

-per base sequence quality

-per sequence quality score

-per base sequence content

-per sequence GC content

-per base N content

-sequence length distribution

-sequence duplication levels

-overrepresented sequences

-adapter content

-kmer content

28

per base sequence quality

-read quality decreases towards the 3’ end of reads

-to improve read mappability, discard low qual reads, trim adapter sequences, and eliminate poor qual bases
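
A minimal sketch of trimming poor-quality bases from the 3' end, assuming Phred+33 quality encoding and a hypothetical threshold of Q20. Dedicated trimming tools use more robust algorithms (for example sliding windows); this only illustrates the idea:

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_q=20):
    """Remove bases from the 3' end until a base with quality >= min_q is reached."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: eight Q40 bases followed by two very low-quality bases.
seq = "ACGTACGTAC"
qual = "IIIIIIII#!"  # 'I' = Q40, '#' = Q2, '!' = Q0
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, trimmed_qual)
```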

29

per sequence quality scores

[image]

30

per base sequence content

[image]

31

per sequence GC content

[image]

32

duplicate sequences

-some amount of duplication is to be expected in rna seq

-high complexity library: low level of duplication may indicate a high level of coverage of the target sequence

-highly expressed transcripts can be over-sequenced in order to be able to see lowly expressed transcripts

-a badly pcr duplicated library might have levels >90%

33

Gene annotations

-choice of a gene model has dramatic effect on both gene quantification and differential analysis

-encode

-ensembl

-refseq: oldest db

-ucsc known genes

34

mapping rna-seq reads

annotated reference is required

-to map junctions the algorithm needs to divide the sequencing reads and map portions independently

-much more complex algorithms are required to identify alternative transcripts

35

genome mapping

[image]

36

transcriptome mapping

[image]

37

SAM

standard alignment file format produced by most read mappers

Sequence Alignment Map format
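
To make the format concrete, here is a rough sketch that parses one SAM alignment line with the standard library only. A SAM record has 11 mandatory tab-separated fields followed by optional tags; the example record itself is made up, and only a subset of the FLAG bits is decoded:

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

# A subset of the bitwise FLAG meanings.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x100: "secondary",
    0x400: "duplicate",
}


def parse_sam_line(line):
    """Split one alignment line into a dict of mandatory fields plus raw tags."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = parts[11:]
    return record


def decode_flag(flag):
    """Return the names of the FLAG bits (from the subset above) that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Made-up record for illustration only.
line = "read1\t99\tchr1\t1000\t60\t50M\t=\t1200\t250\t" + "A" * 50 + "\t" + "I" * 50 + "\tNH:i:1"
record = parse_sam_line(line)
print(record["rname"], record["pos"], decode_flag(record["flag"]))
```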

38

BAM

binary version of SAM

alignments stored in bam file

indexed to be read by other tools and genome browsers

39

Alignment QC

-number of reads mapped/unmapped/paired

-uniquely mapped

-insert size distribution

-coverage

-gene body coverage

-biotype counts/chromosome counts

-counts by region (gene/intron/non-genic)

-sequencing saturation

-strand specificity

40

quantification

read counts = gene expression

quantification at different levels: exon, transcript, gene

41

multi-mapped reads

discard or probabilistic assignment

could have the largest impact on the ultimate results

42

pcr duplicates

-ignore for rna seq data

-use pcr free library prep kits

-use UMIs during library-prep
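
A simplified sketch of how UMIs let PCR duplicates be told apart from independent fragments: reads sharing the same mapping position, strand, and UMI are collapsed to one. The tuple layout and example reads are assumptions, and real deduplication tools also tolerate sequencing errors in the UMI, which this does not:

```python
def dedup_by_umi(reads):
    """Keep the first read seen for each (chrom, pos, strand, umi) combination."""
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


# Hypothetical reads: (chromosome, position, strand, UMI sequence).
reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # same position and UMI: PCR duplicate
    ("chr1", 100, "+", "TTGA"),  # same position, different UMI: distinct molecule
    ("chr2", 50, "-", "ACGT"),
]
unique_reads = dedup_by_umi(reads)
print(len(reads), "->", len(unique_reads))
```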

43

normalization

raw read counts cannot be used to compare expression levels among samples

-transcript length

-sequencing depth

-sequencing biases

-difference in rna composition

44

CPM or RPM

counts per million / reads per million

normalizes only for sequencing depth within-sample

suitable for sequencing protocols that generate reads independent of gene length
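
As a worked example with made-up counts, CPM divides each gene's count by the sample's total count and multiplies by one million:

```python
def cpm(counts):
    """Counts per million: count / total counts in the sample * 1e6."""
    total = sum(counts.values())
    return {gene: c / total * 1_000_000 for gene, c in counts.items()}


# Hypothetical sample with 10,000 mapped reads in total.
sample = {"geneA": 500, "geneB": 1500, "geneC": 8000}
sample_cpm = cpm(sample)
print(sample_cpm)
```

Gene length plays no role here, which is why CPM fits protocols where read count does not depend on gene length.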

45

FPKM

fragments per kilobase of transcript per million fragments mapped

normalize for feature length and sequencing depth within-sample

46

RPKM

reads per kilobase per million mapped reads

normalize for feature length and sequencing depth within sample
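
A small worked example of the RPKM formula, using made-up counts and gene lengths. FPKM follows the same arithmetic with fragments (read pairs) counted instead of individual reads:

```python
def rpkm(counts, lengths_bp):
    """RPKM = reads * 1e9 / (total mapped reads * feature length in bp)."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}


# Hypothetical sample: 10,000 mapped reads across three genes.
sample = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
sample_rpkm = rpkm(sample, lengths)
print(sample_rpkm)
```

geneB has three times the reads of geneA but is three times as long, so both end up with the same RPKM.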

47

TPM

transcripts per million

normalize for feature length and sequencing depth within sample
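
A worked sketch with made-up numbers: TPM first divides each count by the feature length (reads per kilobase), then rescales those rates so they sum to one million within the sample:

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}  # reads per kb
    scale = sum(rates.values())
    return {g: r / scale * 1_000_000 for g, r in rates.items()}


# Hypothetical sample with three genes.
sample = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
sample_tpm = tpm(sample, lengths)
print(sample_tpm)
```

Because the values always sum to one million in every sample, TPM values are proportions of the sample's transcript pool.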

48

TMM

trimmed mean of M values/edgeR

-accounts for differences in rna composition between samples

-effective in normalization of samples with diverse RNA repertoires

49

median-of-ratios method

DESeq2 normalization

-use the median of the ratios of observed counts to pseudo-reference sample as size factor to scale the counts

-normalize sequencing depth
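
The steps above can be sketched in plain Python. This is an illustration of the idea, not DESeq2's actual code: the pseudo-reference is taken as the per-gene geometric mean across samples, genes with a zero count in any sample are skipped, and the counts are invented:

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict of sample name -> list of raw counts (same gene order).

    Returns one size factor per sample via the median-of-ratios idea.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Pseudo-reference: log of the geometric mean of each gene across samples.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)  # genes with a zero count are skipped

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical data: s2 was sequenced twice as deeply as s1.
counts = {"s1": [100, 200, 300, 0], "s2": [200, 400, 600, 50]}
factors = size_factors(counts)
normalized = {s: [c / factors[s] for c in counts[s]] for s in counts}
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale.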

50

both TMM and median-of-ratios method

-do not consider gene length for normalization, as they assume that gene length is constant between samples

-assume that most of the genes are not differentially expressed

51

reproducibility

remove lowly expressed genes w <10 reads

-sample-sample clustering heatmap

-PCA

-batch effects

-outlier detection

52

outliers

true biological differences or technical failures during the process of sample preparation could lead to extreme deviation of a sample from samples of the same treatment group (biological replicates)

53

fold change

measure of the magnitude of change (effect size)

typically use log2(FC)
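
A short sketch of computing log2 fold change from group means of normalized expression. The pseudocount of 1 is an assumption added here to avoid dividing by zero; it is not part of the card:

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 of the treated/control ratio, with a pseudocount to avoid log(0)."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Hypothetical group means of normalized counts.
up = log2_fold_change(399, 99)    # 4-fold higher in treated
down = log2_fold_change(99, 399)  # 4-fold lower in treated
print(up, down)
```

On the log2 scale, up- and down-regulation of the same magnitude are symmetric around zero (here +2 and -2).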

54

padj

FDR (false discovery rate) adjusted p-value

aka q value
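
The card does not name the adjustment procedure. As an assumption, the sketch below uses the Benjamini-Hochberg method, one common way to obtain FDR-adjusted p-values, applied to made-up p-values:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.20]
padj = bh_adjust(pvals)
print(padj)
```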

55

biological interpretation

-gene ontology enrichment analysis

-KEGG pathway analysis

-reactome databases

-gene set enrichment analysis (GSEA)
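
Over-representation style enrichment (as commonly used for gene ontology terms) is often tested with a hypergeometric test. The sketch below shows that calculation under that assumption, with toy numbers; it does not cover GSEA, which is rank-based, and real analyses also correct for testing many terms:

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, n_in_set of which belong to the gene set."""
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 genes in the universe, 3 annotated to a term,
# 3 differentially expressed genes, all 3 annotated to the term.
p = hypergeom_enrichment_p(n_universe=10, n_in_set=3, n_selected=3, n_overlap=3)
print(p)
```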
