Platform | Sequencing Principle | Highest Capacity per run; format | Accuracy | Direct Cost ($/GB) | Machine cost | Highlights/error type |
---|---|---|---|---|---|---|
Illumina (2nd Gen) | SBS, fluorescent dNTPs, reversible terminators, 4 Ns at a time | NovaSeqP25B: 150 bp x 2 x 3B x 8 = 7.2 Tbp; SE/PE/MP; PCR | >99%, fading at the ends | $5 or >$50? (72 hrs) | Versatile; >750K | High capacity & accuracy, PE and MP reads, short reads; substitution |
IonTorrent/Proton (2nd Gen) | SBS, regular dNTPs, pH changes, 1 N at a time | S5/S5 L540: 200 bp x 80M = 16 Gbp; SR; PCR | >99%? High error rate for long homopolymers | ~$1000 (~4 hrs) | 80K | Cheap reagents, quick sequencing, lower capacity/high on indels |
PacBio (3rd Gen) | SBS, fluorescent dNTPs, all together, continuously, no terminators | Sequel II: 20 kb x 1.5M = 30 Gbp; SR; single molecule, real time/SMRT | 87% in single round read, 99.9% (HiFi) | $2000 (2-30 hrs) | 700K | Long reads, high error rate, lower capacity, DNA methylation |
Nanopore (3rd Gen) | Sequencing by the current profile of the passing Ns on the DNA through the pore | PromethION 48: 1 Mb x 50 x 200K = ~10 Tbp; SR; single molecule, real time | ~85%? | $2-16 (1 min to 72 hrs) | MinION: $1K; PromethION 48 CapEx: 600K | Extremely long reads, excellent mobility, DNA methylation, low capacity/high error on substitution & indels |
Comparison of NGS platforms quality vs read length (bp)
DNA sequencing overview:
Cost per human genome ($3B to $300 in 30 yrs)
Sequencing cheaper than data analysis
Ø Cost
v Cost of sequencing library/sample: very little room for decreasing
v Cost of sequencing ($/GB): decreases with increasing throughput, but is limited by the maximal capacity of indexing
v $300 as the wall?
Ø Read length
v Read length limit for the platforms: 1MB for ONT
v DNA integrity limit: <50kb? Up to 1Mb?
Ø Accuracy
v Low accuracy for the long read platforms, especially for ONT
Ø The relationship between read length vs. coverage.
v Only 2x coverage is required if …..
v The shorter the reads, the higher the coverage required.
v Long reads are not always advantageous
= Illumina is most effective when long reads are not advantageous
NOTE: RAW DATA WILL PRODUCE MORE DISCREPANCIES
Ø Opportunities:
§ Cheaper: down to $1K, now $300 per human genome
§ Faster: less than 1 week or even 1 day per genome
Ø Challenges:
§ Data management:
§ Storage: raw data at hundreds of GB, total to TB level per genome
§ Data analysis:
§ difficult to assemble;
§ How to identify variations? SAM/BAM/CRAM format
APPLICATIONS OF NGS:
Ø Genome sequencing of new species:
Ø more suitable for smaller genomes with less repetitive sequences
Ø Re-sequencing of model organisms:
Ø identifying genome variations based on references
Ø RNA sequencing for gene expression
Ø replacing microarray? YES
Ø Metagenomics:
Ø Study the dynamics of environmental biological communities
PAIR END, MATE PAIR SEQUENCING
YOU CAN RECONNECT THE PAIRED ENDS
Pair end, mate pair sequencing (applications)
Ø ENHANCING ASSEMBLY QUALITY FROM SHORT READS
v The paired reads extend the length of single reads
v gDNA fragments can be longer than the sum of the two reads
v Mate pairs help in joining contigs
Ø DETECT STRUCTURAL VARIANTS with PE reads
v Deletions
v Insertions
v Translocations
v Inversions
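How paired-end reads reveal structural variants can be sketched in a few lines. This is only an illustration, not a real variant caller: the `ReadPair` record, the expected insert size of 500 bp, and the tolerance of 100 bp are all assumptions made for the example. It also assumes a standard paired-end library, where the two mates normally map to the same chromosome on opposite strands.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical summary of one aligned read pair (not a real library type)."""
    chrom1: str
    chrom2: str
    strand1: str      # '+' or '-'
    strand2: str      # '+' or '-'
    insert_size: int  # distance between the outer ends on the reference, in bp


def classify_pair(pair, expected=500, tolerance=100):
    """Return the structural-variant signal suggested by a single read pair."""
    if pair.chrom1 != pair.chrom2:
        # Mates land on different chromosomes.
        return "translocation"
    if pair.strand1 == pair.strand2:
        # Mates should face each other; same strand suggests a flipped segment.
        return "inversion"
    if pair.insert_size > expected + tolerance:
        # Mates map too far apart: reference sequence is missing in the sample.
        return "deletion"
    if pair.insert_size < expected - tolerance:
        # Mates map too close: the sample carries extra sequence between them.
        return "insertion"
    return "concordant"


pairs = [
    ReadPair("chr1", "chr1", "+", "-", 510),
    ReadPair("chr1", "chr1", "+", "-", 2500),
    ReadPair("chr1", "chr1", "+", "-", 150),
    ReadPair("chr1", "chr1", "+", "+", 480),
    ReadPair("chr1", "chr7", "+", "-", 0),
]
signals = [classify_pair(p) for p in pairs]
print(signals)
```

In practice a single discordant pair is not evidence on its own; callers look for clusters of pairs that agree on the same signal at the same location.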
two shorter regions can be covered by
if you have a test genome in the test genome
THE USE OF HI-C SEQ:
Method: The method is known as chromosome conformation capture, which has different versions such as 3C, 4C, 5C, and Hi-C.
Application: It is used for unbiased identification of DNA regions in physical proximity in the nucleus, including loops and topologically associating domains (TADs). The method is also applied in assigning sequences to chromosomes.
SCAFFOLDS IN THE GENOME TEND TO BE CONNECTED
Genome Sequencing & assembly
Motivation
Sequencing strategies: (Historical & Current)
Historical: BAC cloning, chromosomal walking
Current: whole genome shotgun sequencing
Steps:
Make DNA libraries
Sequencing (SEQUENCE, THEN ASSEMBLE TOGETHER)
Assembly
Annotation (HOW DOES THIS GENOME DIFFER IN STRUCTURE FROM ANOTHER GENOME?)
COST OF HUMAN GENOME SEQUENCING:
WAS $3 billion for the completed Human Genome Project (1990s)
Current reachable cost (goal): $1 thousand per genome
HOW TO MAKE A GENOMIC LIBRARY:
USE A VECTOR ( SUCH AS A PLASMID FOR EXAMPLE)
BREAK UP THE DNA !!!
THEN GET A PRIMER AND DUPLICATE
HOW TO SEQUENCE? THE ULTIMATE STRATEGY:
An idealized representation of the hierarchical shotgun sequencing strategy is shown to the right: The genomic DNA fragments represented in the BAC library are organized into a physical map, and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome. (Lander, et al, 2001, Nature) |
---|
SHOTGUN SEQUENCING:
Hierarchical shotgun sequencing
Shotgun Sequencing - Another definition:
Shotgun sequencing is a sequencing method that involves randomly breaking down the DNA or RNA molecule into smaller fragments. The fragmented DNA or RNA is then sequenced, generating short reads of the nucleotide sequence. These reads are then assembled using computational algorithms to reconstruct the original sequence of the DNA or RNA molecule.
The process of shotgun sequencing involves several steps. First, the DNA or RNA sample is extracted and randomly sheared into small fragments. Then, adapters are added to the ends of the fragments, which enables them to bind to a sequencing platform. Next, the fragments are amplified and sequenced using high-throughput sequencing platforms, such as Illumina or PacBio. Finally, computational algorithms are used to assemble the short reads into longer contiguous sequences, known as contigs, and then into larger genomic scaffolds.
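The random-fragmentation step can be illustrated with a small simulation. This is a toy sketch under simplifying assumptions: the genome is a random 1000-base string, every read has the same length, and there are no sequencing errors. The function names `shotgun_reads` and `depth_per_base` are made up for this example.

```python
import random


def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample fixed-length reads from random start positions in the genome."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]


def depth_per_base(genome_len, reads):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            depth[i] += 1
    return depth


# Toy genome: 1000 random bases (an assumption for illustration only).
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(1000))

reads = shotgun_reads(genome, read_len=50, n_reads=200)
depth = depth_per_base(len(genome), reads)

# Mean depth = total read bases / genome size = 200 * 50 / 1000
mean_depth = sum(depth) / len(genome)
uncovered = depth.count(0)
print(f"mean depth: {mean_depth}, positions with no reads: {uncovered}")
```

Because the start positions are random, individual positions are covered unevenly even when the mean depth is high, which is one reason sequencing projects aim for many-fold coverage.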
traditional methods
Below are the levels of clone and sequence coverage:
Note: contigs linked across the gaps between them form scaffolds
Whole Genome Shotgun:
Whole Genome Shotgun Sequencing Method: DNA was cut into small pieces and sequenced completely. These fragments were organized into contigs (a contiguous stretch of DNA or RNA sequence that has been assembled from overlapping sequencing reads) |
---|
Terms to note:
Ø Sequence contigs: Contigs produced by merging overlapping sequence reads;
THUS Sequence contigs = continuous sequence
Ø Sequence scaffolds: Scaffolds produced by joining contigs on the basis of linking information, with gaps ("NNNNNNN") at estimated sizes.
Ø Sequencing coverage: The number of sequences covering any given point of a genome; =total sequence length/genome size.
What would be a good coverage?
Ø N50: A measure of contig/scaffold length in a genome assembly. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.
N50 is a midpoint measure and a standard summary of assembly contiguity (the larger the N50, the more contiguous the assembly)
Ø L50: the smallest number of contigs (or scaffolds) whose combined length reaches 50% of the assembly
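The coverage, N50, and L50 definitions above can be checked with a short calculation. This is a minimal sketch: the function names and the contig lengths are made up for illustration, and N50 is computed relative to the total assembly length rather than to an independent estimate of genome size.

```python
def coverage(total_read_bases, genome_size):
    """Sequencing coverage = total sequence length / genome size."""
    return total_read_bases / genome_size


def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are included.
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# 90 Gb of reads over a 3 Gb genome
print(coverage(90_000_000_000, 3_000_000_000))  # 30.0

# Seven hypothetical contigs, 300 bases in total
contigs = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(contigs)
print(n50, l50)  # 70 2
```

Here the two longest contigs (80 + 70 = 150) already hold half of the 300 bases, so N50 is 70 and L50 is 2.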
Large Sequencing Projects:
Genomes:
v 1000 Genomes Project: 2008-2015; The Genome 10K Project (G10KP)
v Personal genome projects (PGP): 2005--, PGP-Canada (2012)
v Non-human: Earth BioGenome Project (EBP): 2017
Exomes:
Exome Aggregation Consortium (ExAC): exome sequencing for >60,000 individuals; published in 2016
1000 Genomes Project
The 1000 Genomes Project aimed to sequence at least 1000 individuals from various populations worldwide and catalogue human genetic variation down to variants occurring at a frequency of about 1%. The project was completed in 2015 with 2504 individuals from 26 populations and identified 88 million genetic variants, with the most variation found in populations of African ancestry. The project used a combination of whole-genome sequencing, deep exome sequencing, and high-density SNP microarrays to obtain data. The results include thousands of variants associated with complex traits and rare diseases, along with overlapping regulatory regions.
Summarized:
Began in 2007 and completed in 2015
Predecessor: the HapMap project (haplotype map of the human genome)
Millions of SNPs were discovered, and GWAS (genome-wide association studies) used the dataset for research in disease association
2007 GOALS: sequence a minimum of 1000 volunteers from populations worldwide
RESULT: the greatest number of variant sites was found in populations of African ancestry
RESULT: 88 million variants
Personal Genome Project
The Personal Genome Project (PGP) aims to publicly share the complete genomes and medical records of thousands of participants. It provides researchers with genomic, environmental, and human trait data to study the relationships between genotype, environment, and phenotype.
The PGP raises ethical, legal, and social issues regarding privacy, informed consent, and data accessibility.
Initiated by George Church in 2005
The Personal Genome Project Canada launched in 2012 and sequenced DNA from whole blood using the Illumina HiSeq X system.
LECTURE 8: GENE PREDICTION/GENOME ANNOTATION – March 16, 2023
REMEMBER: GENOME SEQUENCING & ASSEMBLY PROCESS
Motivation: find out the proper sequencing strategy (historical or current)
Historical: BAC cloning, chromosomal walking
Current: whole genome shotgun sequencing (WGS)
Follow the steps of whole genome sequencing:
Make DNA libraries
Sequencing
Assembly
Annotation (predict the genome? what does it mean? where are the transposable elements?)
Bioinformatics: Steps in genome assembly
Preprocessing ( clean up reads remove low quality parts)
Contiging (reads to contigs)
Polishing (error correction)
Scaffolding (longer pieces)
THEN, GENOME ANNOTATION
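The contiging step (reads to contigs) can be illustrated with a toy greedy merger that repeatedly joins the two sequences sharing the longest suffix-prefix overlap. This is a sketch for intuition only, assuming error-free reads from a single strand and no repeats. Production assemblers rely on overlap-layout-consensus or de Bruijn graph methods and must handle sequencing errors, reverse complements, and repeated regions. The function names and the minimum overlap of 3 bases are assumptions for this example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 if the longest match is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contigs(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return seqs  # whatever is left are the contigs
        # Join the pair, writing the shared overlap only once.
        merged = seqs[best_i] + seqs[best_j][best_k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)


# Three overlapping reads taken from the toy sequence ATGGCGTACGTTAGC
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
contigs = greedy_contigs(reads)
print(contigs)  # ['ATGGCGTACGTTAGC']
```

With a repeated region, two different merges can look equally good, which is the source of the contig-joining errors that polishing and scaffolding try to catch.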
REMAINING CHALLENGES IN GENOME SEQUENCING:
Ø Obtaining accurate continuous sequences for individual chromosomes
v Errors in joining contigs (e.g. highly repeated regions)
v Lack of sequences for certain regions (e.g. centromeres)
Ø Obtaining assembled sequences representing the diploid nature of genomes
v Difficulties in obtaining long DNA molecules
v Lack of diploid (flattened consensus) phasing with long reads
v Short haplotype structure
• Solutions: hybrid between long read and short read NGS, chromosomal imaging…
• Would the use of Hi-Seq help here?
Note: HiSeq is a short-read sequencing technology and may not be sufficient on its own to address these issues. Instead, a combination of different sequencing technologies and approaches is often necessary to obtain accurate and complete diploid genome sequences.
Ploidy, Haplotype, Phasing
(top image) At the end of assembly, you can generate 2 sets of sequences ( each haplotype - one unphased and one phased: complete picture of genetic variation)
(bottom left image)
A) the first image may be insufficient to carry on to the next generation.
B) the second pair may bring diversity to the couple and all favorable portions should ( theoretically) be involved
A TERM TO KNOW:
Haplotype: a group of variants in a section of a chromosome that tend to stay together in transmission across generations (Piang Liang)
Haplotypes are important for assessing the functional impact of variants (genetic variations, WGA and pop studies)
Haplotypes date back to the Human HapMap era (2002 – 2010)
(TOP) Future Challenges:
Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome
Precise, predictive model of RNA splicing/ alternative splicing: ability to predict the splicing pattern of any primary transcript in any tissue
Accurate ab initio protein structure prediction
CAN YOU IDENTIFY EXONS VS INTRONS IN A GENOME SEQ?
Introns in BLACK
Exons in PINK
WHAT IS A GENE? A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions (promoter), transcribed regions and/or other functional sequence regions.
HOW HAVE SCIENTISTS DEFINED IT THROUGH THE YEARS?
A gene is the most basic unit of inheritance
Gregor Mendel: traits are determined by discrete units that are passed from one generation to the next (1860)
Wilhelm Johannsen coined the word "gene" for the unit associated with an inherited trait (1909)
Thomas Morgan: "genes as beads on a string" (1910)
George Beadle: "one gene, one enzyme" (1941)
Avery, MacLeod, and McCarty: "genes are made of DNA" (1944)
Watson and Crick: "Info flows from DNA to RNA" (1953)
Richard Roberts and Philip Sharp: gene splicing (1977)
Discovery of microRNA & RNA interference (1993)
WHAT ARE THE COMPONENTS OF A GENE?
Introns
Exons
Promoters
Enhancers
Silencers
Operons