Whole-genome sequencing
Whole-genome sequencing allows investigation of genome structure, function and evolution.
Modern tech
Modern technologies generate many short DNA reads rather than complete genome sequences.
Genome assembly and annotation are therefore needed to convert sequencing data into meaningful information.
This involves assembling sequencing reads into contiguous sequences (contigs), identifying genes as open reading frames (ORFs) and assigning functions to these genes.
Errors at this stage can propagate into all downstream analyses.
Quality control of raw sequencing reads
Sequencing data often contain errors, adapter sequences and low-quality bases, particularly at read ends. Quality control removes adapter contamination, low-quality bases and reads that fall below the quality threshold.
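The trimming step described above can be sketched in a few lines. This is a minimal illustration, not how any particular QC tool works: it assumes Phred+33 quality encoding, trims only from the 3' end, and the function names and the thresholds (`min_quality=20`, `min_length=30`) are hypothetical choices. Adapter removal is not shown.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string, min_quality=20, min_length=30):
    """Trim low-quality bases from the 3' end of a read.

    Returns (sequence, quality_string) after trimming, or None when the
    trimmed read is shorter than min_length and should be discarded.
    """
    scores = phred_scores(quality_string)
    end = len(sequence)
    # Walk back from the read end while bases are below the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality_string[:end]
```

For example, a read whose last three bases have quality `#` (Phred 2) loses those bases, and a read that becomes too short after trimming is dropped.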
What can poor-quality reads lead to?
Incorrect overlaps, fragmented assemblies and false genomic structures. High-quality input data improves assembly accuracy and reliability.
Types of genome assembly
Reference based
De novo
Reference based
Reads are aligned to an existing reference genome.
Efficient and useful when a closely related genome is available.
It introduces bias towards the reference and can miss novel genomic regions, rearrangements, or horizontally acquired genes.
It is unsuitable for newly sequenced organisms that lack a closely related reference.
De novo
Reconstructs the genome without relying on a reference sequence.
This is important for novel organisms.
Most bacterial genomes sequenced using short-read technologies are assembled with de Bruijn graph algorithms.
Sequencing reads are broken into smaller overlapping sequences called k-mers.
Each k-mer represents a node in the graph, while overlaps between k-mers form edges.
The assembler identifies paths through this graph that correspond to the original genome sequence.
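The graph construction described above can be sketched as follows. This is a toy illustration under simplifying assumptions: reads are error-free, edges are added only between k-mers that are consecutive within a read (an overlap of k-1 bases), and the path walk stops at the first branch, dead end or cycle. Real assemblers also handle errors, reverse complements and coverage. The function names are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins two k-mers that overlap by k-1 bases in a read."""
    graph = defaultdict(set)
    for read in reads:
        parts = kmers(read, k)
        for left, right in zip(parts, parts[1:]):
            graph[left].add(right)
    return graph


def walk_unambiguous_path(graph, start):
    """Follow edges while each node has exactly one successor, then spell the sequence."""
    path = [start]
    seen = {start}
    node = start
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle rather than loop forever
            break
        seen.add(node)
        path.append(node)
    # The first k-mer contributes k bases; each later k-mer adds one new base.
    return path[0] + "".join(n[-1] for n in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATGG")
```

Here the three overlapping reads are merged into a single contig because every k-mer has exactly one successor.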
k-mer size
Small k-mers increase connectivity but can collapse repetitive regions, while large k-mers improve specificity but may fragment assemblies when coverage is low.
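The effect of k on repeats can be seen with a small example. The sketch below counts k-mers that occur more than once in a sequence. Such k-mers become a single shared node in the graph, which is how repeats get collapsed. The sequence and the function name are made up for illustration.

```python
from collections import Counter


def repeated_kmers(sequence, k):
    """Return the sorted k-mers that occur more than once in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sorted(kmer for kmer, n in counts.items() if n > 1)


# Toy sequence containing the 5-base repeat ACGTT twice.
sequence = "ACGTTGACGTTC"
short_k = repeated_kmers(sequence, 3)  # k shorter than the repeat
long_k = repeated_kmers(sequence, 6)   # k longer than the repeat
```

With k = 3 the repeat produces shared k-mers, whereas with k = 6 every k-mer spans unique flanking sequence and none is shared.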
Output of de novo assembly
Contigs, which are contiguous sequences assembled from overlapping reads.
They vary in length depending on genome complexity, sequencing depth and read quality.
Assembly quality of contigs
Assessed using N50, the contig length at which half of the assembled genome is contained in contigs of that length or longer.
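The N50 definition above translates directly into a short calculation: sort contigs from longest to shortest and add up lengths until half of the total assembly length is reached. A minimal sketch follows. The function name is a placeholder, and returning 0 for an empty assembly is an assumption made here.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that takes the running sum to half the
        # assembly length defines the N50.
        if running >= half_total:
            return length
    return 0
```

For contigs of 100, 80, 60, 40 and 20 bases (300 in total), the two longest contigs already cover 180 bases, so the N50 is 80.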
What can contigs be joined into?
To improve assemblies, contigs can be joined into scaffolds using additional information from paired-end reads, mate-pair libraries, or long reads.
Scaffolds provide information about contig order and orientation.
The final product is a draft genome, which may contain unresolved regions but represents most of the organism’s genomic content.
ORF
Once a draft genome is made, the next step is gene prediction through ORFs. An ORF is a stretch of DNA that begins with a start codon, ends with a stop codon and is capable of encoding a protein.
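The ORF definition above can be turned into a simple scan. This sketch makes several assumptions that real gene predictors do not: it checks only the three forward-strand reading frames, treats ATG as the only start codon (bacteria also use alternatives such as GTG and TTG), and keeps the first start codon in each open stretch. The function name and the `min_length` default are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_length=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the start codon; end is exclusive and
    includes the stop codon. ORFs shorter than min_length bases are skipped.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

Lowering `min_length` recovers more short ORFs at the cost of more chance hits, which is the trade-off behind short ORFs being either real genes or false positives.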
Identifying ORFs
Tools scan the genome sequence to identify ORFs and evaluate features such as codon usage bias, the presence of ribosome binding sites and gene length.
These features distinguish true protein-coding genes from random open reading frames that arise by chance.
Short ORFs may represent real genes or false positives.
Functional annotation
Assigns putative functions to predicted genes.
Involves comparing predicted protein sequences against reference databases using sequence similarity searches.
If a protein shares high similarity with a protein of known function, a putative annotation can be assigned.
Other ways proteins may be annotated
Based on conserved domains, protein families and participation in metabolic pathways.
Genome annotation also includes
Identification of non-coding RNA genes such as transfer RNAs and ribosomal RNAs, which are essential for cellular function.
Annotation pipelines may also identify regulatory regions, promoters and mobile genetic elements such as plasmids, transposons and prophages.
Together these steps transform a raw genome sequence into a biologically interpretable dataset that can be used for comparative genomics and functional analysis.
Challenges
Repetitive sequences - cannot be easily resolved using short reads and often lead to fragmented assemblies
Uneven sequencing coverage - results in missing genomic regions
Sequencing errors and contamination - introduce incorrect or foreign sequences into the assembly
Limitations
Annotation is limited by the completeness and accuracy of reference databases. Many predicted ORFs are annotated as hypothetical proteins due to the absence of homologues with known function.
Errors during assembly can propagate into annotation, resulting in incorrect gene prediction.