genomics - rna biology II


55 Terms

1

RNA seq

genomic technique that uses next-generation sequencing (NGS) to detect and quantify the RNA molecules present in a biological sample

2

how to remove highly abundant rRNA

-rRNA is >90% of total RNA

-enrich for mRNA using polyA selection, which requires a higher amount of starting material and minimal degradation

-or deplete rRNA when RNA quantity or RIN (RNA integrity number) is low; prokaryotes can only use rRNA depletion because their mRNAs lack stable polyA tails

3

polyA selection

-can be done during illumina rna seq library prep

-with degraded RNA, selection by the polyA tail produces 3’ bias in coverage

-non-polyA RNAs are not recovered

4

ribosomal rna subtraction

-uses species-specific probes to remove rRNA

-retains non-poly(A) transcripts, which polyA selection would lose

5

increase in biological replication

significantly increases statistical power and the number of differentially expressed genes identified

makes outlier detection and removal easier

6

how many replicates do i need

-minimum 3-6 biological replicates

-statistical power increases with effect size, sequencing depth, and number of replicates per group

7

how many reads do we need

>10 reads per gene per sample is a standard cutoff
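
A minimal sketch of applying this cutoff, assuming raw counts are held in a plain dict of gene -> per-sample counts. The function name, the `min_samples` parameter, and the toy counts are hypothetical; by default a gene must exceed the cutoff in every sample, which is only one possible reading of "per gene per sample".

```python
def filter_low_counts(counts, min_reads=10, min_samples=None):
    """Keep genes with more than `min_reads` reads in at least `min_samples` samples.

    If `min_samples` is None, the gene must pass in every sample (assumption).
    """
    kept = {}
    for gene, per_sample in counts.items():
        needed = len(per_sample) if min_samples is None else min_samples
        passing = sum(1 for c in per_sample if c > min_reads)
        if passing >= needed:
            kept[gene] = per_sample
    return kept


# Toy counts: three genes, three samples each (made-up numbers).
counts = {"geneA": [120, 98, 143], "geneB": [3, 0, 7], "geneC": [15, 9, 22]}
filtered = filter_low_counts(counts)
print(sorted(filtered))  # ['geneA']
```

Relaxing the rule with `min_samples=2` would also keep `geneC`, which shows how much the exact filtering rule matters for borderline genes.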

8

reads for mrna genes

5-30 million reads per sample

9

reads for mrna transcripts

for measuring alternative splicing

30-60 million reads per sample

10

reads for transcript discovery

100-200 million short reads

long read data better

11

reads for miRNA-seq or small RNA-seq

-varies significantly depending on the tissue type being sequenced

-most applications require 1-5 million reads per sample

12

read length

-affects the ability to determine where in the transcriptome each read came from

-longer reads add little value for quantification-based analysis but are valuable for isoform analysis

13

gene expression/rna profiling read length

50-75 bp

14

read length for novel transcriptome assembly and annotation

longer, paired-end reads (2 × 75 bp or 2 × 100 bp), or long read sequencing

15

read length for small rna

a single read (usually 50 bp) typically covers the entire sequence

16

paired end sequencing

-improves read mapping

-preferred for alternative-exon quantification, fusion transcript detection and de novo transcript discovery, particularly when working with poorly annotated transcriptomes

17

single end sequencing

[image]
18

DNA contamination

-reads from contaminating DNA map back as intergenic or intronic sequence

-such reads cannot be distinguished from alternative splicing, unannotated or noncoding transcripts, or spurious transcription

-treat with DNase to remove DNA contamination

19

illumina short read sequencing

-read length <200 bp

-the de facto method to detect and quantify transcriptome-wide gene expression

-cheaper

-easier to implement

-comprehensive, high quality data

20

long read cDNA sequencing

-mRNA is converted to cDNA before sequencing

-PacBio and Nanopore

-reads up to 50 kb

-generates full-length isoform reads

-isoform detection

-de novo transcriptome analysis

-fusion transcript detection

21

long read direct rna sequencing (drna-seq)

-no cDNA synthesis or PCR amplification during library prep

-Nanopore

-reads of 1-10 kb

-supports the same analyses as long-read cDNA sequencing

-detects base modifications

-estimates polyA tail length

22

long read technologies limitations

-lower throughput

-lower sensitivity; results depend on RNA integrity, and cDNA synthesis can be truncated

-biases inherent to sequencing platforms

-low diffusion of long library molecules onto the surface of the sequencing chip can reduce the coverage of longer transcripts

23

main sources of variation

batch effects

lane effects

24

batch effects

any errors that occur after random fragmentation of the rna until it is input to the flow cell

ex: pcr amplification and reverse transcription artifacts

variations in reagents, supplies, instruments and operators may introduce random or systematic errors at any step of rna-seq data generation

25

lane effects

any errors that occur from the point at which the sample is input to the flow cell until data are output from the sequencing machine

ex: systematically bad sequencing cycles and errors in base calling

26

randomization

randomize samples across library preparation batches and lanes so as to avoid technical factors becoming confounded with experimental factors
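
A rough sketch of one way to do this, assuming each sample has a single experimental group label. Samples are shuffled within each group and then dealt round-robin into batches, so no batch ends up holding only one condition. The function name, sample names, and fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def randomize_to_batches(sample_groups, n_batches, seed=0):
    """Shuffle samples within each group, then deal them round-robin into batches."""
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_group = defaultdict(list)
    for sample, group in sample_groups.items():
        by_group[group].append(sample)

    batches = [[] for _ in range(n_batches)]
    slot = 0
    for group in sorted(by_group):
        members = sorted(by_group[group])
        rng.shuffle(members)  # random order within the group
        for sample in members:
            batches[slot % n_batches].append(sample)
            slot += 1
    return batches


# Toy design: 4 control and 4 treated samples split over 2 library-prep batches.
sample_groups = {f"ctrl{i}": "control" for i in range(1, 5)}
sample_groups.update({f"trt{i}": "treated" for i in range(1, 5)})
batches = randomize_to_batches(sample_groups, n_batches=2)
for i, batch in enumerate(batches, start=1):
    print(f"batch {i}: {batch}")
```

The same function could be reused to assign libraries to sequencing lanes. With this toy design each batch receives two control and two treated samples, so batch is not confounded with treatment.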

27

FastQC

assessment of data quality

-number of reads

-per base sequence quality

-per sequence quality score

-per base sequence content

-per sequence GC content

-per base N content

-sequence length distribution

-sequence duplication levels

-overrepresented sequences

-adapter content

-k-mer content

28

per base sequence quality

-read quality decreases towards the 3’ end of reads

-to improve read mappability, discard low-quality reads, trim adapter sequences, and remove poor-quality bases
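
A minimal sketch of removing poor-quality bases from the 3’ end, assuming Phred+33 quality encoding (an assumption about the FASTQ file). This is a simple trailing trim with a hypothetical threshold of Q20, not the algorithm of any particular trimming tool.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, quality_string, min_q=20):
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    scores = phred_scores(quality_string)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality_string[:end]


# Toy read: eight high-quality bases ('I' = Q40) followed by two poor ones.
seq = "ACGTACGTAC"
qual = "IIIIIIII#!"
trimmed = trim_3prime(seq, qual)
print(trimmed)  # ('ACGTACGT', 'IIIIIIII')
```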

29

per sequence quality scores

[image]
30

per base sequence content

[image]
31

per sequence GC content

[image]
32

duplicate sequences

-some amount of duplication is to be expected in rna seq

-in a high-complexity library, a low level of duplication may indicate a high level of coverage of the target sequence

-highly expressed transcripts are over-sequenced in order to detect lowly expressed transcripts, which produces duplicate reads

-a badly PCR-duplicated library might have duplication levels >90%
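
To make "duplication level" concrete, here is a simplified sketch that reports the fraction of reads that are exact copies of an earlier read. FastQC's own estimate is computed differently (it works on a subset of the file), so the function below is an illustrative assumption rather than a reimplementation.

```python
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that duplicate a sequence already seen (exact matches only)."""
    if not reads:
        return 0.0
    distinct = len(Counter(reads))
    return 1 - distinct / len(reads)


# Toy library: five reads, three distinct sequences.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCC"]
rate = duplication_rate(reads)
print(f"{rate:.0%} of reads are duplicates")
```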

33

Gene annotations

-choice of a gene model has dramatic effect on both gene quantification and differential analysis

-encode

-ensembl

-refseq: oldest db

-ucsc known genes

34

mapping rna-seq reads

annotated reference is required

-to map junctions the algorithm needs to divide the sequencing reads and map portions independently

-much more complex algorithms are required to identify alternative transcripts
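
A toy illustration of the "divide the read and map portions independently" idea, using exact string matching on a made-up genome. Real splice-aware aligners use genome indexes, mismatch tolerance, and splice-site scoring; the function name, the `min_anchor` parameter, and the sequences here are hypothetical.

```python
def map_spliced_read(read, genome, min_anchor=4):
    """Try each split point and map the two parts independently (exact matches only).

    Returns (left_start, right_start, split) for the first split where the left
    part maps and the right part maps downstream of it, or None if none works.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The right part must start at or after the end of the left part.
        right_pos = genome.find(right, left_pos + len(left))
        if right_pos != -1:
            return left_pos, right_pos, split
    return None


# Made-up genome: exon 1, a 16 bp intron, exon 2.
exon1, intron, exon2 = "ATGGCCAAGT", "GTAAGTTTTTCCCCAG", "CTGAGGTACC"
genome = exon1 + intron + exon2
read = exon1[-6:] + exon2[:6]  # read spanning the exon-exon junction

hit = map_spliced_read(read, genome)
left_pos, right_pos, split = hit
print("intron length inferred from the gap:", right_pos - (left_pos + split))  # 16
```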

35

genome mapping

[image]
36

transcriptome mapping

[image]
37

SAM

standard alignment file format produced by most read mappers

Sequence Alignment/Map format
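
A small sketch of reading one SAM alignment line by hand, to show what the format holds. It relies on the 11 mandatory tab-separated fields and two of the standard FLAG bits (0x4 = unmapped, 0x10 = reverse strand). The example record is made up, and in practice a dedicated SAM/BAM library would normally be used instead.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Parse one alignment line (not a '@' header line) into a dict."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    record["tags"] = parts[11:]  # optional TAG:TYPE:VALUE fields
    return record


# Made-up record: a read aligned to the reverse strand of chr1.
line = "read1\t16\tchr1\t1000\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1"
rec = parse_sam_line(line)
print(rec["rname"], rec["pos"], rec["cigar"], "reverse" if rec["is_reverse"] else "forward")
```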

38

BAM

binary version of SAM

stores alignments in compressed form

can be indexed so that other tools and genome browsers can read it quickly

39

Alignment QC

-number of reads mapped/unmapped/paired

-uniquely mapped

-insert size distribution

-coverage

-gene body coverage

-biotype counts/chromosome counts

-counts by region (gene/intron/non-genic)

-sequencing saturation

-strand specificity

40

quantification

read counts are used as the measure of gene expression

quantification at different levels: exon, transcript, gene

41

multi-mapped reads

options: discard them or assign them probabilistically

the choice could have the largest impact on the final results
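
A simplified sketch contrasting the two options. The "fractional" strategy splits each multi-mapped read equally among its candidate genes, which is only a crude stand-in for true probabilistic assignment (quantifiers typically estimate the weights iteratively). The function name and read-to-gene table are hypothetical.

```python
from collections import defaultdict


def count_reads(read_hits, strategy="discard"):
    """Count reads per gene; multi-mapped reads are discarded or split equally."""
    if strategy not in ("discard", "fractional"):
        raise ValueError(f"unknown strategy: {strategy}")
    counts = defaultdict(float)
    for genes in read_hits.values():
        if len(genes) == 1:
            counts[genes[0]] += 1.0
        elif strategy == "fractional":
            for gene in genes:
                counts[gene] += 1.0 / len(genes)
    return dict(counts)


# Toy data: r2 and r4 map equally well to geneA and geneB.
read_hits = {
    "r1": ["geneA"],
    "r2": ["geneA", "geneB"],
    "r3": ["geneB"],
    "r4": ["geneA", "geneB"],
}
print(count_reads(read_hits, "discard"))     # {'geneA': 1.0, 'geneB': 1.0}
print(count_reads(read_hits, "fractional"))  # {'geneA': 2.0, 'geneB': 2.0}
```

In this toy case the counts double depending on the strategy, which illustrates why the handling of multi-mapped reads can change the results so much.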

42

pcr duplicates

-generally left in (not removed) for RNA-seq data, since highly expressed transcripts naturally produce identical reads

-use pcr free library prep kits

-use UMIs during library-prep
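
A minimal sketch of why UMIs help: reads sharing both the mapping position and the UMI are treated as PCR copies of one molecule, while reads at the same position with different UMIs are kept. The record layout is hypothetical, and UMI sequencing errors are ignored here, which dedicated tools do handle.

```python
def dedup_by_umi(alignments):
    """Keep the first alignment seen for each (chrom, pos, strand, UMI) key."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln["chrom"], aln["pos"], aln["strand"], aln["umi"])
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept


alignments = [
    {"read": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG"},
    {"read": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG"},  # PCR copy of r1
    {"read": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "TTGC"},  # different molecule
    {"read": "r4", "chrom": "chr2", "pos": 500, "strand": "-", "umi": "AACG"},
]
unique = dedup_by_umi(alignments)
print([a["read"] for a in unique])  # ['r1', 'r3', 'r4']
```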

43

normalization

raw read counts cannot be used to compare expression levels among samples because they also depend on:

-transcript length

-sequencing depth

-sequencing biases

-difference in rna composition

44

CPM or RPM

counts per million / reads per million

normalizes only for sequencing depth within-sample

suitable for sequencing protocols that generate reads independent of gene length
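
As a worked example, CPM is the count scaled by the sample's total mapped reads in millions. The sketch below assumes counts for a single sample in a plain dict; the gene names and numbers are made up.

```python
def cpm(counts):
    """Counts per million: count * 1e6 / total counts in the sample."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Toy sample with 10,000 mapped reads in total.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
print(cpm(counts))  # {'geneA': 50000.0, 'geneB': 150000.0, 'geneC': 800000.0}
```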

45

FPKM

fragments per kilobase of transcript per million fragments mapped

normalize for feature length and sequencing depth within-sample

46

RPKM

reads per kilobase per million mapped reads

normalize for feature length and sequencing depth within sample
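
A short worked sketch of the RPKM arithmetic: count × 10^9 / (total mapped reads × feature length in bp). The same formula gives FPKM when the counts are fragments (read pairs) rather than individual reads. Gene names, lengths, and counts are made up.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: count * 1e9 / (total reads * length in bp)."""
    total = sum(counts.values())
    return {
        gene: count * 1_000_000_000 / (total * lengths_bp[gene])
        for gene, count in counts.items()
    }


# Toy sample: 1,000 mapped reads over three genes of different length.
counts = {"geneA": 200, "geneB": 300, "geneC": 500}
lengths_bp = {"geneA": 2000, "geneB": 1500, "geneC": 5000}
print(rpkm(counts, lengths_bp))
# {'geneA': 100000.0, 'geneB': 200000.0, 'geneC': 100000.0}
```

Note that geneC has the most reads but the same RPKM as geneA once its greater length is taken into account.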

47

TPM

transcripts per million

normalize for feature length and sequencing depth within sample
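
A worked sketch of the TPM arithmetic, assuming the usual two steps: divide each count by feature length in kb, then scale the resulting rates so they sum to one million. Gene names, lengths, and counts are made up.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to sum to 1e6."""
    rates = {gene: count / (lengths_bp[gene] / 1000) for gene, count in counts.items()}
    total_rate = sum(rates.values())
    return {gene: rate * 1_000_000 / total_rate for gene, rate in rates.items()}


counts = {"geneA": 200, "geneB": 300, "geneC": 500}
lengths_bp = {"geneA": 2000, "geneB": 1500, "geneC": 5000}
tpm_values = tpm(counts, lengths_bp)
print(tpm_values)  # {'geneA': 250000.0, 'geneB': 500000.0, 'geneC': 250000.0}
```

Because TPM values always sum to one million within a sample, they read as proportions of the sample's transcript pool.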

48

TMM

trimmed mean of M-values (edgeR)

-accounts for differences in rna composition between samples

-effective in normalization of samples with diverse RNA repertoires

49

median-of-ratios method

DESeq2 normalization

-uses the median of the ratios of observed counts to a pseudo-reference sample (the per-gene geometric mean across samples) as the size factor to scale the counts

-normalizes for sequencing depth and RNA composition
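
The median-of-ratios idea can be sketched in plain Python as below. This is an illustrative reimplementation under stated assumptions (genes with a zero count in any sample are skipped when building the pseudo-reference; the toy counts are made up), not DESeq2's own code.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict gene -> list of raw counts (one per sample). Returns one factor per sample."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for per_sample in counts.values():
        if min(per_sample) <= 0:
            continue  # geometric mean undefined; skip this gene
        # log of the pseudo-reference = mean of logs = log geometric mean
        log_ref = sum(math.log(c) for c in per_sample) / n_samples
        for j, c in enumerate(per_sample):
            log_ratios[j].append(math.log(c) - log_ref)
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: sample 2 was sequenced twice as deeply; g3 is also truly up-regulated.
counts = {"g1": [10, 20], "g2": [100, 200], "g3": [50, 400], "g4": [0, 5]}
factors = size_factors(counts)
normalized = {g: [c / f for c, f in zip(v, factors)] for g, v in counts.items()}
print([round(f, 3) for f in factors])  # [0.707, 1.414]
```

After scaling, g1 and g2 are equal across the two samples while g3 still differs fourfold. Taking the median keeps a single changed gene from distorting the size factors, consistent with the assumption that most genes are not differentially expressed.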

50

both TMM and median-of-ratios method

-do not consider gene length for normalization, as they assume that gene length is constant between samples

-assume that most of the genes are not differentially expressed

51

reproducibility

remove lowly expressed genes with <10 reads

-sample-sample clustering heatmap

-PCA

-batch effects

-outlier detection

52

outliers

true biological differences or technical failures during the process of sample preparation could lead to extreme deviation of a sample from samples of the same treatment group (biological replicates)

53

fold change

measure of the magnitude of change (effect size)

typically use log2(FC)
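
A tiny worked example of log2(FC) from normalized group means. The pseudocount is an assumption added here to avoid division by zero; differential expression tools estimate the fold change from a statistical model rather than from this naive ratio.

```python
import math


def log2_fold_change(treated_mean, control_mean, pseudocount=1.0):
    """log2 of (treated + pseudocount) / (control + pseudocount)."""
    return math.log2((treated_mean + pseudocount) / (control_mean + pseudocount))


print(log2_fold_change(399, 99))  # 2.0  (four-fold increase)
print(log2_fold_change(49, 99))   # -1.0 (halved)
```

On the log2 scale, increases and decreases are symmetric around 0, which is one reason the log is typically used.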

54

padj

FDR (false discovery rate) adjusted p-value

aka q value
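
A sketch of how an FDR adjustment can be computed, assuming the Benjamini-Hochberg procedure (one common choice; the card does not say which method is meant). The p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvalues)
print([round(p, 4) for p in padj])  # [0.02, 0.04, 0.04, 0.02]
```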

55

biological interpretation

-gene ontology enrichment analysis

-KEGG pathway analysis

-reactome databases

-gene set enrichment analysis (GSEA)
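
A sketch of the statistic behind a simple over-representation test, as commonly used for gene ontology enrichment: a one-sided hypergeometric p-value for the overlap between a gene list and an annotation term. GSEA uses a different, rank-based statistic. The function name and the numbers are hypothetical, and `math.comb` requires Python 3.8+.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement."""
    total = comb(n_background, n_selected)
    upper = min(n_in_term, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 annotated to the term,
# 4 differentially expressed genes, 3 of which carry the annotation.
p = enrichment_p_value(n_background=20, n_in_term=5, n_selected=4, n_overlap=3)
print(f"p = {p:.4f}")  # p = 0.0320
```

In a real analysis this test is repeated for many terms, so the resulting p-values would themselves need FDR adjustment.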