Genome assembly, Annotation and ORFs

New cards

Whole-genome sequencing

Whole-genome sequencing allows investigation of genome structure, function and evolution.

New cards

Modern sequencing technologies

Modern technologies generate many short DNA reads rather than complete genome sequences.

Genome assembly and annotation are therefore needed to convert raw sequencing data into biologically meaningful information.

This involves assembling sequencing reads into contiguous sequences, identifying genes as open reading frames (ORFs) and assigning functions to these genes.

Errors at this stage can propagate into all downstream analyses.

New cards

Quality control of raw sequencing reads

Sequencing data often contain errors, adapter sequences and low-quality bases, particularly at read ends. Quality control removes adapter contamination, trims low-quality bases and discards reads that fall below the quality threshold.
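The trimming step described above can be sketched in a few lines. This is a minimal illustration only, not how any particular QC tool works: the function names, the Phred+33 encoding, and the thresholds `min_q=20` and `min_len=30` are assumptions chosen for the example, and adapter removal is not shown.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(seq, quals, min_q=20, min_len=30):
    """Trim low-quality bases from the 3' end of a read.

    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read
    is shorter than min_len and should be discarded.
    """
    end = len(seq)
    # Walk back from the read end while bases are below the threshold.
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quals[:end]


# Toy read: two good bases followed by two poor ones.
quals = phred_scores("II##")
print(trim_read("ACGT", quals, min_q=20, min_len=2))
```

Real pipelines usually also trim the 5' end, use sliding-window averages, and remove adapters before assembly.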

New cards

What can poor-quality reads lead to?

Incorrect overlaps, fragmented assemblies and false genomic structures. High-quality input data improves assembly accuracy and reliability.

New cards

Types of genome assembly

Reference-based

De novo

New cards

Reference-based

Reads are aligned to an existing reference genome.

Efficient and useful when a closely related genome is available.

It introduces bias towards the reference and cannot detect novel genomic regions, rearrangements, or horizontally acquired genes that are absent from the reference.

It is unsuitable for newly sequenced organisms that lack a closely related reference genome.

New cards

De novo

Reconstructs the genome without relying on a reference sequence.

This is important for novel organisms.

Most bacterial genomes sequenced using short-read technologies are assembled with de Bruijn graph algorithms.

Sequencing reads are broken into smaller overlapping sequences called k-mers.

Each k-mer represents a node in the graph, while overlaps between k-mers form edges.

The assembler identifies paths through this graph that correspond to the original genome sequence.
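The k-mer graph described above can be sketched as follows. This is a toy illustration under simplifying assumptions (error-free reads, no repeats, a single strand); the function names are hypothetical, and real assemblers add error correction, graph simplification and branch handling.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Nodes are k-mers; an edge links k-mers that overlap by k-1 bases."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def walk_unambiguous_path(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GCGTGCA"]  # two toy reads overlapping by "GCGT"
graph = build_graph(reads, k=4)
print(walk_unambiguous_path(graph, "ATGG"))  # ATGGCGTGCA
```

The walk stops as soon as a node has zero or several successors, which is where repeats and sequencing errors break contigs in practice.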

New cards

k-mer size

Small k-mers increase graph connectivity but can collapse repetitive regions, while large k-mers improve specificity but may fragment the assembly when coverage is low.
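A small sketch can show why short k-mers collapse repeats. The sequence and the function name `repeated_kmers` are made up for this example: a 4-base repeat ("ACGT") is ambiguous at k=3 but every k-mer is unique at k=5.

```python
from collections import Counter


def repeated_kmers(seq, k):
    """Return the set of k-mers that occur more than once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}


seq = "GGACGTTTACGTCC"  # toy sequence containing "ACGT" twice
print(repeated_kmers(seq, 3))  # k-mers shared by both repeat copies
print(repeated_kmers(seq, 5))  # empty set: k now spans past the repeat
```

The trade-off is that a longer k needs more reads covering each position, so low-coverage regions lose k-mers and the graph breaks apart.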

New cards

Output of de novo assembly

Contigs, which are contiguous sequences assembled from overlapping reads.

Contigs vary in length depending on genome complexity, sequencing depth and read quality.

New cards

Assembly quality of contigs

Assessed using N50, the contig length at which contigs of that length or longer together contain at least half of the total assembled sequence.
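The N50 definition above translates directly into a short calculation. The function name and the contig lengths below are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Add contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


lengths = [80, 70, 50, 40, 30, 20]  # total = 290, half = 145
print(n50(lengths))  # 70, because 80 + 70 = 150 >= 145
```

A higher N50 suggests a more contiguous assembly, but it says nothing about correctness, so it is usually read alongside other quality measures.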

New cards

What can contigs be joined into?

To improve assemblies, contigs can be joined into scaffolds using additional information from paired-end reads, mate-pair libraries, or long reads.

Scaffolds provide information about contig order and orientation.

The final product is a draft genome, which may contain unresolved regions but represents most of the organism’s genomic content.

New cards

ORF

Once the draft genome is produced, the next step is gene prediction through the identification of ORFs. An ORF is a stretch of DNA that begins with a start codon, ends with a stop codon, and is capable of encoding a protein.
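The ORF definition above can be sketched as a simple scan. This is a simplified illustration, not a gene predictor: it assumes ATG as the only start codon, searches the three forward-strand frames only (the reverse complement would need the same scan), and the `min_codons` cut-off is an arbitrary example value.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf_sequence) for forward-strand ORFs.

    Coordinates are 0-based, end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 'ATGAAATTTTAG')]
```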

New cards

Identifying ORFs

Tools scan the genome sequence to identify ORFs and evaluate features such as codon usage bias, the presence of ribosome binding sites and gene length.

These features distinguish true protein-coding genes from random open reading frames that arise by chance.

Short ORFs may represent real genes or false positives.

New cards

Functional annotation

Assigns putative functions to predicted genes.

It involves comparing predicted protein sequences against reference databases using sequence similarity searches.

If a protein shares high similarity with a protein of known function, a putative annotation can be assigned.
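The idea of transferring a function from the most similar known protein can be sketched as below. This is a deliberately simplified stand-in for a real similarity search: it assumes the sequences are already aligned and of equal length, and the function names, the toy reference entries and the 70% identity threshold are all hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def annotate(query, reference, min_identity=70.0):
    """Return (annotation, identity) for the best hit above the threshold.

    Falls back to 'hypothetical protein' when no reference is similar enough.
    """
    best_name, best_identity = "hypothetical protein", 0.0
    for name, seq in reference.items():
        if len(seq) != len(query):
            continue  # toy version: only compare equal-length sequences
        identity = percent_identity(query, seq)
        if identity >= min_identity and identity > best_identity:
            best_name, best_identity = name, identity
    return best_name, best_identity


# Made-up reference entries for illustration only.
reference = {"toy kinase": "MKTAYIAKQR", "toy transporter": "MSLLNVPAGK"}
print(annotate("MKTAYIAKQK", reference))  # ('toy kinase', 90.0)
print(annotate("MWWWWWWWWW", reference))  # ('hypothetical protein', 0.0)
```

Real annotation tools compute alignments and statistical significance rather than a raw identity over fixed positions.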

New cards

Other ways proteins may be annotated

Based on conserved domains, protein families and participation in metabolic pathways.

New cards

Genome annotation also includes

Identification of non-coding RNA genes, such as transfer RNAs and ribosomal RNAs, which are essential for cellular function.

Annotation pipelines may also identify regulatory regions, promoters and mobile genetic elements such as plasmids, transposons and prophages.

Together these steps transform a raw genome sequence into a biologically interpretable dataset that can be used for comparative genomics and functional analysis.

New cards

Challenges

Repetitive sequences - cannot be easily resolved using short reads and often lead to fragmented assemblies.

Uneven sequencing coverage - can result in missing genomic regions.

Sequencing errors and contamination - introduce incorrect or foreign sequences into the assembly.

New cards

Limitations

Functional annotation is limited by the completeness and accuracy of reference databases. Many predicted ORFs are annotated as hypothetical proteins due to the absence of homologues with known function.

Errors during assembly can propagate into annotation, resulting in incorrect gene prediction.