4P06 final

Platform	Sequencing principle	Highest capacity per run; format	Accuracy	Direct cost ($/GB)	Machine cost	Highlights/error type
Illumina (2nd Gen)	SBS, fluorescent dNTPs, reversible terminators, 4 Ns at a time	NovaSeqP25B: 150 bp x 2 x 3B x 8 = 7.2 Tbp; SE/PE/MP; PCR	>99%, fading at the ends	$5 or >$50? (72 hrs)	Versatile; >750K	High capacity & accuracy, PE and MP reads, short reads; substitution
IonTorrent/Proton (2nd Gen)	SBS, regular dNTPs, pH changes, 1 N at a time	S5/S5 L540: 200 bp x 80M = 16 Gbp; SR; PCR	>99%? High error rate for long homopolymers	~$1000 (~4 hrs)	80K	Cheap reagents, quick sequencing, lower capacity/high on indels
PacBio (3rd Gen)	SBS, fluorescent dNTPs, all together, continuously, no terminators	Sequel II: 20 kb x 1.5M = 30 Gbp; SR; single molecule, real time (SMRT)	87% in single round read, 99.9% (HiFi)	$2000 (2-30 hrs)	700K	Long reads, high error rate, lower capacity, DNA methylation
Nanopore (3rd Gen)	Sequencing by the current profile of the passing Ns on the DNA through the pore	PromethION 48: 1 Mb x 50 x 200K = ~10 TB; SR; single molecule, real time	~85%?	$2-16 (1 min to 72 hrs)	MinION: $1K; PromethION 48 CapEx: 600K	Extremely long reads, excellent mobility, DNA methylation, low capacity/high error on substitution & indels
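
The capacity figures in the table are simple products (read length x number of reads x any further multipliers). A minimal sketch for checking that arithmetic follows; the numbers are the ones quoted in the table, while the dictionary layout and the function names are illustrative assumptions.

```python
# Yield per run = read length (bp) multiplied by every further factor
# (reads, ends, lanes/flow cells ...). Names here are hypothetical.

def run_yield_bp(read_length_bp, *multipliers):
    """Multiply the read length by each remaining factor in turn."""
    total = read_length_bp
    for factor in multipliers:
        total *= factor
    return total

# Factors copied from the "highest capacity per run" column of the table.
PLATFORM_RUNS = {
    "Illumina": (150, 2, 3_000_000_000, 8),  # 150 bp x 2 x 3B x 8
    "Ion Torrent": (200, 80_000_000),        # 200 bp x 80M
    "PacBio": (20_000, 1_500_000),           # 20 kb x 1.5M
    "Nanopore": (1_000_000, 50, 200_000),    # 1 Mb x 50 x 200K
}

def yields_gbp():
    """Return the yield of each platform in Gbp (1 Gbp = 1e9 bp)."""
    return {name: run_yield_bp(*factors) / 1e9
            for name, factors in PLATFORM_RUNS.items()}

if __name__ == "__main__":
    for name, gbp in yields_gbp().items():
        print(f"{name}: {gbp:,.0f} Gbp")
```

The Illumina and Nanopore products come out in the Tbp range, the other two in the tens of Gbp, which matches the table.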

Comparison of NGS platforms: quality vs. read length (bp)

DNA sequencing overview:

Cost per human genome ($3B to $300 in 30 yrs)

Sequencing cheaper than data analysis

Cost

v Cost of sequencing library/sample: very little room for decreasing

v Cost of sequencing ($/GB): decreases with increasing throughput, but is limited by the maximal capacity of indexing

v $300 as the wall?

Ø Read length

v Read length limit for the platforms: 1 Mb for ONT

v DNA integrity limit: <50kb? Up to 1Mb?

Ø Accuracy

v Low accuracy for the long read platforms, especially for ONT

Ø The relationship between read length vs. coverage.

v Only 2x coverage is required if …..

v The shorter the reads, the higher the coverage required.

v Long read is not always advantageous

=> Illumina is most effective when long reads are not advantageous

NOTE: RAW DATA WILL PRODUCE MORE DISCREPANCY

Ø Opportunities:

§ Cheaper: down to $1 k, now $300 per human genome

§ Faster: less than 1 week or even 1 day per genome

Ø Challenges:

§ Data management:

§ Storage: raw data at hundreds of GB, total to TB level per genome

§ Data analysis:

§ difficult to assemble;

§ How to identify variations? SAM/BAM/CRAM format
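
As a rough illustration of what one alignment record looks like, the sketch below parses a single SAM line into its 11 mandatory tab-separated columns and decodes two FLAG bits (0x4 = unmapped, 0x10 = reverse strand). The example record, the class name and the function name are made up for this illustration; real pipelines would normally rely on an existing SAM/BAM library rather than hand-written parsing.

```python
from dataclasses import dataclass

@dataclass
class SamRecord:
    """The 11 mandatory SAM columns, in file order."""
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description (e.g. 8M)
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (Phred+33 text)

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)

def parse_sam_line(line):
    """Parse one alignment line; header lines (starting with '@') give None."""
    if line.startswith("@"):
        return None
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])

# Hypothetical record: read1 aligned to the reverse strand of chr1 at 100.
EXAMPLE_LINE = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"

if __name__ == "__main__":
    print(parse_sam_line(EXAMPLE_LINE))
```

Variant identification then amounts to comparing the aligned bases of many such records against the reference at each position.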

APPLICATIONS OF NGS:

Ø Genome sequencing of new species:

Ø more suitable for smaller genomes with less repetitive sequences

Ø Re-sequencing of model organisms:

Ø identifying genome variations based on references

Ø RNA sequencing for gene expression

Ø replacing microarray? YES

Ø Metagenomics:

Ø Study the dynamics of environmental biological communities

PAIR END, MATE PAIR SEQUENCING

Diagram

YOU CAN RECONNECT THE PAIRED ENDS

Pair end, mate pair sequencing (applications)

Ø ENHANCING ASSEMBLY QUALITY FROM SHORT READS

v The paired reads extend the effective length of single reads

v gDNA fragments can be longer than the sum of the two reads

v Mate pairs help join contigs

Ø DETECT STRUCTURAL VARIANTS with PE reads

v Deletions

v Insertions

v Translocations

v Inversions
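
The structural variant types listed above leave different signatures in how a read pair maps to the reference. The sketch below is a simplified illustration of that idea, assuming a standard paired-end library where the two reads should map to the same chromosome, on opposite strands, roughly one insert size apart. The function name, the expected insert size (500 bp) and the tolerance (100 bp) are hypothetical values chosen for the example.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  expected_insert=500, tolerance=100):
    """Label one mapped read pair with the variant type it hints at.

    Simplified rules (assumptions, not a full SV caller):
      - mates on different chromosomes  -> translocation candidate
      - mates on the same strand        -> inversion candidate
      - mates much further apart than
        the library insert size         -> deletion candidate in the sample
      - mates much closer than expected -> insertion candidate in the sample
    """
    if chrom1 != chrom2:
        return "translocation"
    if strand1 == strand2:
        return "inversion"
    span = abs(pos2 - pos1)
    if span > expected_insert + tolerance:
        return "deletion"
    if span < expected_insert - tolerance:
        return "insertion"
    return "concordant"

if __name__ == "__main__":
    print(classify_pair("chr1", 1000, "+", "chr1", 1500, "-"))  # normal pair
    print(classify_pair("chr1", 1000, "+", "chr1", 5000, "-"))  # too far apart
```

In practice a call is only made when many independent pairs support the same signature, since a single discordant pair can also come from a mapping error.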

two shorter regions can be covered by
if you have a test genome in the test genome

THE USE OF HI-C SEQ:

Method: The method is known as chromosome conformation capture, which has different versions such as 3C, 4C, 5C, and Hi-C.

Application: It is used for unbiased identification of DNA regions in physical proximity in the nucleus, including loops and topologically associating domains (TADs). The method is also applied in assigning sequences to chromosomes.

SCAFFOLDS IN THE GENOME TEND TO BE CONNECTED

Genome Sequencing & assembly

Motivation

Sequencing strategies: (Historical & Current)

Historical: BAC cloning, chromosomal walking

Current: whole genome shotgun sequencing

Steps:

Make DNA libraries
Sequencing (SEQUENCING THEN ASSEMBLE TOGETHER )
Assembly
Annotation (HOW THIS GENOME IN TERMS OF STRUCTURE DIFFERS FROM ANOTHER GENOME?)

COST OF HUMAN GENOME SEQUENCING:

WAS: $3 billion for the completed Human Genome Project (1990s)
Current reachable cost (goal): $1 thousand per genome

HOW TO MAKE A GENOMIC LIBRARY:

USE A VECTOR ( SUCH AS A PLASMID FOR EXAMPLE)
BREAK UP THE DNA !!!
THEN GET A PRIMER AND DUPLICATE

HOW TO SEQUENCE ?? THE ULTIMATE STRATEGY:

An idealized representation of the hierarchical shotgun sequencing strategy is shown to the right: The genomic DNA fragments represented in the BAC library are organized into a physical map, and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome. (Lander, et al, 2001, Nature)

SHOTGUN SEQUENCING:

Hierarchical shotgun sequencing

Diagram

Shotgun Sequencing - Another definition:

Shotgun sequencing is a sequencing method that involves randomly breaking down the DNA or RNA molecule into smaller fragments. The fragmented DNA or RNA is then sequenced, generating short reads of the nucleotide sequence. These reads are then assembled using computational algorithms to reconstruct the original sequence of the DNA or RNA molecule.

The process of shotgun sequencing involves several steps. First, the DNA or RNA sample is extracted and randomly sheared into small fragments. Then, adapters are added to the ends of the fragments, which enables them to bind to a sequencing platform. Next, the fragments are amplified and sequenced using high-throughput sequencing platforms, such as Illumina or PacBio. Finally, computational algorithms are used to assemble the short reads into longer contiguous sequences, known as contigs, and then into larger genomic scaffolds.
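
The fragment-then-assemble idea can be sketched in a few lines. This is a toy illustration under strong assumptions, not how production assemblers work: the "genome" is a made-up 40 bp string with no long repeats, reads are error-free, they are taken as a fixed overlapping tiling (then shuffled) instead of truly random shearing, and assembly is a naive greedy merge of the pair with the longest suffix/prefix overlap. All names and parameter values are hypothetical.

```python
import random

# Made-up toy genome (40 bp, no repeated 5-mers).
GENOME = "ATGGCGTACGCTTAGCAGGTCAATCCGTAAGCTTGGACAT"

def make_reads(genome, read_len=12, step=4, seed=0):
    """Cut overlapping reads and shuffle them so their order is lost."""
    reads = [genome[i:i + read_len]
             for i in range(0, len(genome) - read_len + 1, step)]
    random.Random(seed).shuffle(reads)
    return reads

def overlap(a, b, min_overlap):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_overlap=5):
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs

if __name__ == "__main__":
    print(greedy_assemble(make_reads(GENOME)))
```

With repeats longer than the minimum overlap, or with sequencing errors, this greedy merge can join the wrong reads, which is one reason repetitive genomes are harder to assemble.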

traditional methods

Below are the levels of clone and sequence coverage:

Diagram, timeline

Note: contigs joined across the gaps in between form scaffolds

Whole Genome Shotgun:

Whole Genome Shotgun Sequencing Method: DNA was cut into small pieces and sequenced completely. These fragments were organized into contigs (a contiguous stretch of DNA or RNA sequence that has been assembled from overlapping sequencing reads).

Diagram

Terms to note:

Ø Sequence contigs: Contigs produced by merging overlapping sequence reads;

THUS Sequence contigs = continuous sequence

Ø Sequence scaffolds: Scaffolds produced by joining contigs on the basis of linking information with gaps ("NNNNNNN") at estimated sizes.

Ø Sequencing coverage: The number of sequences covering any given point of a genome; = total sequence length/genome size.

What would be a good coverage?

Ø N50: A measure of contig/scaffold length in a genome assembly. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.

N50 is a midpoint-style, standard measure of assembly contiguity (the higher the N50, the better the quality of the assembly)

Ø L50: the smallest number of contigs (or scaffolds) whose combined length reaches 50% of the total assembly length
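
The terms above are easy to compute directly. The sketch below follows the definitions in these notes (coverage = total sequence length / genome size; N50 and L50 from the sorted contig lengths). The extra function for the expected uncovered fraction uses the Poisson (Lander-Waterman) approximation e^(-c), which is an addition not taken from the notes; the function names and example numbers are hypothetical.

```python
import math

def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp

def expected_uncovered_fraction(c):
    """Poisson approximation: chance a given base is hit by no read."""
    return math.exp(-c)

def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the N50; 'count' contigs were needed, so it is the L50
            return length, count
    raise ValueError("no contig lengths given")

if __name__ == "__main__":
    # 600M reads of 150 bp on a 3 Gbp genome
    c = coverage(600_000_000, 150, 3_000_000_000)
    print(c, expected_uncovered_fraction(c))
    print(n50_l50([80, 70, 50, 40, 30, 20, 10]))
```

For the example contig set the total is 300, so the two largest contigs (80 + 70 = 150) already reach half of it: N50 is 70 and L50 is 2.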

Large Sequencing Projects:

Genomes:

v 1000 Genomes Project: 2008-2015; The Genome 10K Project (G10KP)

v Personal genome projects (PGP): 2005--, PGP-Canada (2012)

v Non-human: Earth BioGenome Project (EBP): 2017

Exomes:

Exome Aggregation Consortium (ExAC): exome sequencing for >60,000 individuals; published in 2016

1000 Genomes Project

The 1000 Genomes Project aimed to sequence at least 1000 individuals from various populations worldwide and catalogue human genetic variation down to variants occurring at 1% frequency or less. The project was completed in 2015 with 2504 individuals from 26 populations and identified 88 million genetic variants, with the most variation found in populations of African ancestry. The project used a combination of whole-genome sequencing, deep exome sequencing, and high-density SNP microarrays to obtain data. The results include thousands of variants associated with complex traits and rare diseases, along with overlapping regulatory regions.

Summarized:

 Began in 2007 and completed in 2015

 HapMap (Haplotype map of the human genome) was the predecessor project

 Millions of SNPs were discovered and GWAS (genome wide association studies) used the dataset for research in disease association

 2007 GOALS: sequence min. 1000 volunteers from populations worldwide

 RESULT: the greatest number of variant sites was found in populations of African ancestry

 RESULT: 88 million variants

Personal Genome Project

The Personal Genome Project (PGP) aims to publicly share the complete genomes and medical records of thousands of participants. It provides researchers with genomic, environmental, and human trait data to study the relationships between genotype, environment, and phenotype.

The PGP raises ethical, legal, and social issues regarding privacy, informed consent, and data accessibility.

 Initiated by George Church in 2005

The Personal Genome Project Canada launched in 2012 and sequenced DNA from whole blood using the Illumina HiSeq X system.

LECTURE 8: GENE PREDICTION/GENOME ANNOTATION – March 16, 2023

REMEMBER: GENOME SEQUENCING & ASSEMBLY PROCESS

 Motivation

 Find out the proper sequencing strategy (historical or current)

Historical: BAC cloning, chromosomal walking

Current: whole genome shotgun sequencing (WGS)

 Follow the steps of whole genome sequencing

 Make DNA libraries

 Sequencing

 Assembly

 Annotation (predict the genome? what does it mean? where are the transposable elements?)

Bioinformatics: Steps in genome assembly

Preprocessing ( clean up reads  remove low quality parts)
Contiging (reads to contigs)
Polishing (error correction)
Scaffolding (longer pieces) THEN, GENOME ANNOTATION
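
As an illustration of the preprocessing step (cleaning up reads by removing low-quality parts), the sketch below trims low-quality bases from the 3' end of a read. It assumes FASTQ-style Phred+33 quality strings and a cutoff of Q20; both the cutoff and the function names are hypothetical choices for this example, and dedicated trimming tools use more refined rules.

```python
def phred33_scores(quality_string):
    """Convert a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]

def trim_3prime(seq, qual, min_q=20):
    """Drop bases from the 3' end until one reaches the quality cutoff."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    scores = phred33_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 in Phred+33
    print(trim_3prime("ACGTACGT", "IIIII###"))
```

The trimmed reads then go into contiging, polishing and scaffolding as listed above.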

REMAINING CHALLENGES IN GENOME SEQUENCING:

 Obtaining accurate continuous sequences for individual chromosomes

v Errors in joining contigs (e.g. highly repeated regions)

v Lack of sequences for certain regions (e.g. centromeres)

 Obtaining assembled sequences representing the diploid nature of genomes

v Difficulties in obtaining long DNA molecules

v Lack of diploid (flattened consensus) phasing with long reads

v Short haplotype structure

• Solutions: hybrid between long read and short read NGS, chromosomal imaging…

• Would the use of HiSeq help here?

Note: The use of HiSeq, which is a short-read sequencing technology, may not be sufficient on its own to address these issues. Instead, a combination of different sequencing technologies and approaches is often necessary to obtain accurate and complete diploid genome sequences.

Ploidy, Haplotype, Phasing

(top image) At the end of assembly, you can generate 2 sets of sequences ( each haplotype - one unphased and one phased: complete picture of genetic variation)

(bottom left image)

 A) the first image may be insufficient to carry on to the next generation.

 B) the second pair may bring diversity to the couple and all favorable portions should ( theoretically) be involved

A TERM TO KNOW:

Haplotype  A haplotype is a group of variants in a section of a chromosome that tend to stay together in transmission across generations (Piang Liang)

 they are important for assessing the functional impact of variants (genetic variations, WGA and pop studies)

 date back to the Human HapMap era ( 2002 – 2010)

(TOP) Future Challenges:

Precise, predictive model of transcription initiation and termination: ability to predict where and when transcription will occur in a genome
Precise, predictive model of RNA splicing/ alternative splicing: ability to predict the splicing pattern of any primary transcript in any tissue
Accurate ab initio protein structure prediction

CAN YOU IDENTIFY EXONS VS INTRONS IN A GENOME SEQ?? Introns in BLACK, exons in PINK

WHAT IS A GENE?  A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions (promoter), transcribed regions and/or other functional sequence regions.

HOW HAVE SCIENTISTS DEFINED IT THROUGH THE YEARS??

 A gene is the most basic unit of inheritance

 Gregor Mendel: traits are determined by discrete units that are passed from one generation to the next (1860)

 Wilhelm Johannsen coined the word “gene” for the unit associated with an inherited trait (1909)

 Thomas Morgan: “genes as beads on a string” (1910)

 George Beadle: “one gene, one enzyme” (1941)

 Avery, MacLeod and McCarty: “genes are made of DNA” (1944)

 Watson and Crick: “info flows from DNA to RNA” (1953)

 Richard Roberts and Philip Sharp: gene splicing (1977)

 Discovery of microRNA & RNA interference (1993)

WHAT ARE THE COMPONENTS OF A GENE? (i.e. introns, exons, promoters, enhancers, silencers) Introns  Exons  Promoters  Enhancers  Silencers  Operons 