Quality parameters to check for

Last updated 6:54 PM on 3/14/26

11 Terms

Obtaining reads

The genome is fragmented before sequencing, and each fragment yields a sequencing read.

FastQC

Tool used for assessing the quality of high-throughput sequencing data. Its primary purpose is to provide a comprehensive report on various quality metrics, allowing users to identify potential issues in the sequencing data.

Components of a FastQC report

Per base sequence quality
Per base sequence content
Per sequence GC content
Adapter content
Overrepresented sequences

Per base sequence quality

Shows an overview of the range of quality values across all bases at each position in the FASTQ file. The y-axis shows the quality (Phred) scores; the higher the score, the better the base call. The background of the graph divides the y-axis into very good quality scores (green), scores of reasonable quality (orange) and scores of poor quality (red).
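As a rough illustration of what this plot summarises, the sketch below decodes FASTQ quality strings and takes the median score at each read position. It assumes Phred+33 encoding (the usual encoding for current Illumina data); the function names and example quality strings are made up for illustration, and FastQC itself reports more than the median (quartiles and mean as well).

```python
from statistics import median

def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Probability that a base call is wrong: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def per_base_median_quality(quality_lines):
    """Median Phred score at each read position across all reads."""
    by_position = {}
    for line in quality_lines:
        for pos, score in enumerate(phred_scores(line)):
            by_position.setdefault(pos, []).append(score)
    return [median(by_position[pos]) for pos in sorted(by_position)]

# Hypothetical quality strings for three 4-base reads
quals = ["IIII", "II5#", "I?5#"]
medians = per_base_median_quality(quals)
print(medians)  # quality drops towards the 3' end in this toy example
```

A Phred score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.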

Per base sequence content

Plots the % of each of the 4 nucleotides at each position across all reads in the input sequence file.

In a random library you would expect little to no difference between the bases at different positions of a sequence run, so the lines in this plot should run parallel to each other. If you see strong biases that change from position to position, this usually indicates an overrepresented sequence that is contaminating the library. A bias that is consistent across all positions indicates either that the original library was sequence biased or that there was a systematic problem during the sequencing of the library.
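The percentages behind this plot can be sketched in a few lines. This is a minimal illustration, not FastQC's implementation; the function name and the toy reads are hypothetical, and any other character (such as N) counts towards the total at that position.

```python
from collections import Counter

def per_base_content(reads):
    """Percentage of A, C, G and T at each position across all reads."""
    length = max(len(r) for r in reads)
    result = []
    for pos in range(length):
        # Only reads long enough to cover this position contribute
        counts = Counter(r[pos] for r in reads if pos < len(r))
        total = sum(counts.values())
        result.append({b: 100.0 * counts.get(b, 0) / total for b in "ACGT"})
    return result

# Hypothetical reads: position 1 is always C, which would show up as a
# strong position-specific bias in the plot
reads = ["ACGT", "ACGA", "TCGT", "GCGA"]
content = per_base_content(reads)
for pos, pct in enumerate(content):
    print(pos, pct)
```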

Good per base sequence content report

The percentage of each nucleotide at every position across all reads is shown.

The four lines should run parallel.

The proportions should be relatively constant across the reads. (A=T and G=C for most genomes).

Skewed base composition, especially at the start of reads, often means the library was generated using random primers: the adapters/priming sequences contribute non-random bases.

Per sequence GC content

Gives the GC content distribution over all sequences. Good: the GC content of the central peak corresponds to the expected %GC for the organism. The distribution should be normal unless there is overrepresentation or contamination with another organism.

If the observed distribution does not correspond to the theoretical distribution, something is off. Sharp peaks in particular suggest some type of overrepresented sequence, from either contamination or a highly overexpressed gene.
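A minimal sketch of the underlying calculation: compute the GC percentage of each read and tally how many reads fall at each value. The function names and reads are illustrative only, reads are assumed to be non-empty, and FastQC additionally fits a theoretical normal curve that this sketch leaves out.

```python
from collections import Counter

def gc_percent(read):
    """GC content of a single (non-empty) read, as a percentage."""
    gc = sum(1 for base in read.upper() if base in "GC")
    return 100.0 * gc / len(read)

def gc_histogram(reads):
    """Number of reads at each GC percentage, rounded to the nearest integer."""
    return Counter(round(gc_percent(r)) for r in reads)

# Hypothetical reads
reads = ["ATGC", "GGCC", "ATAT", "ACGT"]
hist = gc_histogram(reads)
for pct in sorted(hist):
    print(pct, hist[pct])
```

On real data the histogram would be compared with the expected %GC of the organism; a second peak or a sharp spike is what to look for.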

Good per sequence GC content report

Normal, bell-shaped curve. GC content forms a smooth Gaussian distribution. Real sequencing datasets from unbiased libraries typically show a single symmetric peak.

Observed GC content matches the theoretical curve, meaning no contamination, no enrichment for GC-rich or AT-rich regions, no strong overrepresented sequences and no major primer or adapter influence.

The maximum of the peak is around the expected GC% for the organism (~40-50%). A peak shifted left or right could indicate possible contamination or library bias.

Adapter content

Plot shows cumulative % of reads with the different adapter sequences at each position. Once an adapter sequence is seen in a read it is counted as being present right through to the end of the read so the percentage increases with read length.
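The cumulative counting rule can be sketched as follows. This is a simplified illustration rather than FastQC's own matching logic: it looks for one exact adapter string, and the six-base adapter prefix and the reads are hypothetical.

```python
def adapter_content(reads, adapter):
    """Cumulative % of reads in which the adapter has been seen, per position."""
    length = max(len(r) for r in reads)
    counts = [0] * length
    for read in reads:
        start = read.find(adapter)
        if start == -1:
            continue  # adapter not present in this read
        # Once seen, the adapter counts as present through to the read end
        for pos in range(start, len(read)):
            counts[pos] += 1
    return [100.0 * c / len(reads) for c in counts]

# Hypothetical reads; two of the four contain the illustrative adapter prefix
reads = ["ACGTAGATCG", "ACAGATCGGT", "ACGTACGTAC", "ACGTACGTTT"]
percentages = adapter_content(reads, "AGATCG")
print(percentages)
```

Because a read stays counted from the first adapter position onwards, the curve can only rise or stay flat along the read.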

Overrepresented sequences

Displays the sequences that occur in more than 0.1% of the total number of sequences.

The table aids in identifying contamination (if %GC was not ideal, the table helps identify the source).

A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up more than a tiny fraction of the whole. Finding that a single sequence is very overrepresented in the set means either that it is highly biologically significant or that the library is contaminated or not as diverse as you expected.
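The 0.1% threshold can be illustrated with a simple exact-match count. This is a sketch under simplifying assumptions: it counts whole reads exactly, whereas FastQC applies its own shortcuts for large files. The helper names, the synthetic background reads and the contaminant sequence are all invented for the example.

```python
from collections import Counter

def overrepresented(reads, threshold_pct=0.1):
    """Sequences making up more than threshold_pct percent of all reads."""
    counts = Counter(reads)
    total = len(reads)
    return {
        seq: 100.0 * n / total
        for seq, n in counts.items()
        if 100.0 * n / total > threshold_pct
    }

def index_to_dna(i, length=6):
    """Hypothetical helper: turn an integer into a unique DNA string."""
    out = []
    for _ in range(length):
        out.append("ACGT"[i % 4])
        i //= 4
    return "".join(out)

# 1997 unique background reads (0.05% each) plus one sequence seen 3 times (0.15%)
contaminant = "GATTACAGATTACA"
reads = [index_to_dna(i) for i in range(1997)] + [contaminant] * 3
hits = overrepresented(reads)
print(hits)
```

Any sequence reported this way can then be searched against known adapters or a sequence database to identify its source.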

Quality profiling of raw sequencing data

When the median quality is below a Phred score of 20, we should consider trimming away bad-quality bases from the sequence. Quality control and preprocessing of raw FASTQ files is critical, especially for degraded samples, and involves removing adapter sequences, filtering low-quality or low-complexity reads, error correction, etc.

Adapter trimming tools based on sequence matching, such as Trimmomatic, Cutadapt and SOAPnuke, can remove adapters and can also perform quality filtering.
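To make the idea of quality trimming concrete, here is a deliberately naive sketch that removes bases from the 3' end of a read until it reaches one with a Phred score of at least 20. It assumes Phred+33 encoding and is not the algorithm used by Trimmomatic, Cutadapt or SOAPnuke, which use more robust approaches; the function name and example read are hypothetical.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q (assumes Phred+33)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    # Sequence and quality string must stay the same length
    return seq[:end], qual[:end]

# Hypothetical read: the last two bases have Phred score 2 ('#'),
# the base before them has exactly 20 ('5') and is kept
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5##")
print(trimmed_seq, trimmed_qual)
```

In practice a dedicated trimming tool would be used, with FastQC rerun afterwards to confirm that the quality and adapter plots have improved.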