Obtaining reads
The genome is fragmented before sequencing, and each fragment yields a sequencing read.
FASTQC
Tool used for assessing the quality of high-throughput sequencing data. Its primary purpose is to provide a comprehensive report on various quality metrics, allowing users to identify potential issues in the sequencing data.
Components of FASTQC report
Per base sequence quality
Per base sequence content
Per sequence GC content
Adapter content
Overrepresented sequences
Per base sequence quality
Shows an overview of the range of quality values across all bases at each position in the FastQ file. The y-axis shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y-axis into very good quality scores (green), scores of reasonable quality (orange) and scores of poor quality (red).
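To make the plot concrete, here is a minimal sketch of how per-position quality could be summarised from FASTQ quality strings. It assumes Phred+33 encoding (the common modern default, but an assumption here), uses made-up quality strings, and reports only the median, whereas the real plot also shows quartiles and whiskers. The function names are hypothetical.

```python
from statistics import median

def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def per_base_median_quality(quality_lines, offset=33):
    """Median Phred score at each read position across all reads."""
    columns = {}
    for line in quality_lines:
        for pos, score in enumerate(phred_scores(line, offset)):
            columns.setdefault(pos, []).append(score)
    return [median(columns[pos]) for pos in sorted(columns)]

# Toy quality strings: 'I' = Q40, '5' = Q20, '#' = Q2
toy_qualities = ["IIII#", "III5#", "II55#"]
print(per_base_median_quality(toy_qualities))  # [40, 40, 40, 20, 2]
```

The falling medians toward the end of the toy reads mirror the usual pattern where quality drops at the 3' end.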
Per base sequence content
Plots the % of each of the 4 nucleotides at each position across all reads in the input sequence file.
In a random library you would expect little to no difference between the bases at different positions of a sequence run, so the lines in this plot should run parallel with each other. If you see strong biases which change from position to position, this usually indicates an overrepresented sequence which is contaminating the library. A bias which is consistent across all positions indicates either that the original library was sequence biased or that there was a systematic problem during the sequencing of the library.
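The percentages behind this plot can be sketched as follows. This is an illustrative calculation on made-up reads, not FastQC's own code; the function name is hypothetical and bases other than A, C, G and T (such as N) are simply ignored.

```python
from collections import Counter

def per_base_content(reads):
    """For each read position, return the percentage of A, C, G and T."""
    length = max(len(r) for r in reads)
    result = []
    for pos in range(length):
        counts = Counter(r[pos] for r in reads if pos < len(r))
        total = sum(counts[b] for b in "ACGT")
        result.append(
            {b: (100.0 * counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return result

toy_reads = ["ACGT", "ACGA", "TCGT", "GCGC"]
for pos, pct in enumerate(per_base_content(toy_reads)):
    print(pos, pct)
```

In the toy data, positions 1 and 2 are 100% C and 100% G, the kind of position-specific bias that would show up as diverging lines in the plot.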
Good per base sequence content report
The percentage of each nucleotide at every position across all reads is shown.
The four lines should run parallel.
The proportions should be relatively constant across the reads (A=T and G=C for most genomes).
Skewed base composition = the library was generated using random primers; the adapters/priming sequences contribute non-random bases.
Per sequence GC content
Gives the GC distribution over all sequences. Good: the GC content of the central peak corresponds to the expected %GC for the organism. This distribution should be normal unless there is overrepresentation or contamination with another organism.
If the observed distribution does not correspond to the theoretical distribution, sharp peaks indicate some type of overrepresented sequence, arising from either contamination or a highly overexpressed gene.
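A rough sketch of the underlying calculation: compute the GC% of each read, then count reads per GC bin. The reads, the 10% bin width and the function names are assumptions for illustration; the real report uses finer bins and overlays a fitted theoretical curve.

```python
from collections import Counter

def gc_percent(read):
    """Percentage of G and C bases in a single read."""
    gc = sum(1 for base in read.upper() if base in "GC")
    return 100.0 * gc / len(read) if read else 0.0

def gc_histogram(reads, bin_width=10):
    """Count reads per GC% bin; keys are the lower edge of each bin."""
    hist = Counter()
    for read in reads:
        bin_start = int(gc_percent(read) // bin_width) * bin_width
        # Fold exactly 100% GC into the top bin
        bin_start = min(bin_start, 100 - bin_width)
        hist[bin_start] += 1
    return dict(sorted(hist.items()))

toy_reads = ["ATGC", "GGCC", "ATAT", "ACGT"]
print(gc_histogram(toy_reads))  # {0: 1, 50: 2, 90: 1}
```

With real data, a single smooth peak in this histogram near the organism's expected GC% is the good case; extra sharp peaks are what the card above warns about.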
Good per sequence GC content report
Normal, bell-shaped curve. GC content forms a smooth Gaussian distribution. Real sequencing datasets from unbiased libraries typically show a single symmetric peak.
Observed GC content matches the theoretical curve, meaning no contamination, no enrichment for GC-rich or AT-rich regions, no strong overrepresented sequences and no major primer or adapter influence.
The maximum of the peak is around the expected GC% for the organism (~40-50%). A peak shifted left or right could indicate possible contamination or library bias.
Adapter content
Plot shows cumulative % of reads with the different adapter sequences at each position. Once an adapter sequence is seen in a read it is counted as being present right through to the end of the read so the percentage increases with read length.
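The cumulative counting rule can be sketched like this. It is a simplified illustration using exact substring search with a toy adapter fragment and made-up reads; the function name is hypothetical and the real tool's matching strategy may differ.

```python
def cumulative_adapter_content(reads, adapter):
    """Percentage of reads in which the adapter has been seen at or before each position."""
    length = max(len(r) for r in reads)
    seen_by_pos = [0] * length
    for read in reads:
        start = read.find(adapter)
        if start == -1:
            continue  # adapter never seen in this read
        # Once seen, the read counts as adapter-containing through to the end
        for pos in range(start, length):
            seen_by_pos[pos] += 1
    return [100.0 * n / len(reads) for n in seen_by_pos]

toy_reads = ["ACGTACGTAC", "ACGTAGATCG", "ACAGATCGGA", "ACGTACGTTT"]
print(cumulative_adapter_content(toy_reads, "AGATCG"))
# [0.0, 0.0, 25.0, 25.0, 50.0, 50.0, 50.0, 50.0, 50.0, 50.0]
```

Because a read is never "uncounted" after its adapter start, the curve can only stay flat or rise along the read.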
Overrepresented sequences
Displays the sequences that occur in more than 0.1% of the total number of sequences.
The table aids in identifying contamination (if %GC was not ideal, the table helps identify the source).
A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up more than a tiny fraction of the whole. Finding that a single sequence is very overrepresented in the set means either that it is highly biologically significant or that the library is contaminated or not as diverse as you expected.
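The 0.1% rule can be illustrated with a simple exact-match count. This is a sketch on synthetic reads, not the tool's implementation; `encode`, `overrepresented` and the planted sequence are all made up for the example.

```python
from collections import Counter

def encode(i, width=8):
    """Turn an integer into a distinct toy DNA string (2 bits per base)."""
    bases = "ACGT"
    return "".join(bases[(i >> (2 * k)) & 3] for k in range(width))

def overrepresented(reads, threshold_pct=0.1):
    """Return (sequence, count, percent) for sequences above the threshold."""
    total = len(reads)
    hits = []
    for seq, n in Counter(reads).most_common():
        pct = 100.0 * n / total
        if pct > threshold_pct:
            hits.append((seq, n, pct))
    return hits

# 1995 distinct toy reads plus 5 copies of one planted sequence (2000 reads total)
toy_reads = [encode(i) for i in range(1995)] + ["AGATCGGAAGAGC"] * 5
print(overrepresented(toy_reads))  # [('AGATCGGAAGAGC', 5, 0.25)]
```

Each distinct toy read makes up 0.05% of the set and stays below the threshold, while the planted sequence at 0.25% is flagged.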
Quality profiling of raw sequencing data
When the median quality is below a Phred score of 20, we should consider trimming away bad quality bases from the sequence. The quality control and preprocessing of raw FASTQ files is critical, especially for degraded samples, and involves removing adapter sequences, filtering low-quality and low-complexity reads, error correction, etc.
Sequence-matching-based adapter trimming tools such as Trimmomatic, Cutadapt and SOAPnuke can be employed as adapter trimmers and can also perform quality filtering.
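For intuition about the Phred 20 cutoff: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base call is wrong. The sketch below shows that conversion plus a deliberately naive 3'-end trim. It assumes Phred+33 encoding and is not the algorithm used by Trimmomatic, Cutadapt or SOAPnuke; the function names are hypothetical.

```python
def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def trim_trailing(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

print(error_probability(20))  # about 0.01
# Toy read: 'I' = Q40, '5' = Q20, '#' = Q2
print(trim_trailing("ACGTACGT", "IIIII5##"))  # ('ACGTAC', 'IIIII5')
```

Dedicated trimmers typically use more robust strategies, such as sliding windows or running-sum scores, so that a single good base near the end does not stop the trim early.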