human genome project
-in 1995 the first complete sequence of a bacterial genome (Haemophilus influenzae) was published, setting a precedent for more complex organisms
-project declared complete in April 2003, covering ~99% of the gene-containing regions of the human genome with an accuracy of 99.99%
-sequence data were placed in the public domain within 24 hours of being generated and remain freely available
why sequence the genome
-human genome – 3,000,000,000 bp
-coding and non-coding sequence
-regulatory sequences
-higher order structure
-chromosome maintenance
how is the genome sequence obtained
-obtain the organism's genomic DNA
-break the DNA into small fragments
-sequence each fragment
-a computer aligns the fragment sequences and searches for overlaps to ‘reconstruct’ the genome sequence
-problem: the fragment ends have to be unique for the overlapping regions to be identified, so repetitive sequence is hard to assemble
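The overlap step can be illustrated with a toy greedy assembler. This is only a minimal sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the example fragments are made up for illustration and real assemblers work quite differently at scale.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical fragments of the toy sequence ATGGCGTACGTTAGC
fragments = ["ATGGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(fragments))
```

If two different regions ended in the same sequence, `overlap` would score them equally and the merge could join the wrong pair, which is the uniqueness problem noted above.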
why are model organisms helpful
-small genome = value for money
-easy organisms to manipulate
-provide information on fundamental biological processes
-technology development
-useful for comparative genomics
example of pathogenic bacterial genome
-in early summer 2011 an outbreak of bacterial infection appeared in Europe
-genome sequence was obtained in less than two days
-outcomes: revealed a previously unidentified strain of E. coli; gave insight into its antibiotic resistance characteristics; gave insight into why the bacterium was so virulent and why it seemed to be targeting adults
outcome of the HGP
- Comprehensive genome map
-provided a detailed map of the human genome, identifying approximately 20,000-25,000 genes
- Technological advancements
-spurred the development of high-throughput sequencing technologies, bioinformatics tools and data storage solutions
-reduced the cost and time required for DNA sequencing dramatically
- Scientific discoveries
-identified genetic variants associated with various diseases, enhancing our understanding of the genetic basis of health and disease
- Collaborative efforts
-fostered international collaboration among scientists, leading to the establishment of global databases and resources like the Ensembl genome browser
what are the major issues identifying genes within genomes
- How big is a valid ORF
-searching for start codon, common start codon = ATG
-in any piece of double-stranded DNA there are 6 possible reading frames (3 on each strand)
-the DNA is read in triplets, each triplet encodes an amino acid
-if the stop codon is shortly after the start codon, is it still a valid open reading frame
-random ORFs of up to ~50 codons occasionally occur by chance
-the closer an ORF gets to 50 codons in length, the more its validity is questioned
- Identification of RNA splice sites
-pre-mRNA splicing: the ATG/AUG start codon can be separated from the rest of the coding sequence by an intron; once the intron is removed the exons are brought together and form a potential ORF
-a computer scanning the genomic DNA alone will miss these
-looking at RNA analyses can help, but this depends on the gene being expressed
- Incomplete coverage
-some highly repetitive and structurally complex regions of the genome, such as centromeres and telomeres, were not sequenced
-major ‘gaps’ in the 2003 sequencing data; recent advances in sequencing technology have filled most gaps
- Genetic variation
-the reference sequence was derived from only a few different individuals, and therefore does not capture the full extent of genetic diversity
- Functional understanding
-sequencing the genome is only the first step; understanding the function of every gene and regulatory element remains a significant challenge
-many identified genes have unknown functions, and the regulatory mechanisms governing gene expression are complex and not fully understood
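Returning to the ORF-size question in this card, a naive six-frame ORF scan can be sketched as below. It is a rough illustration only: it assumes ATG starts, the three standard stop codons, no introns and equal base frequencies for the chance calculation; `find_orfs`, `prob_no_stop` and the test sequence are hypothetical names and data, not a real gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 50) -> list[tuple[str, int, int, int]]:
    """Return (strand, frame, start, n_codons) for each ATG...stop ORF in all six frames.

    `start` is the 0-based position of the ATG on the strand being read;
    `n_codons` counts the ATG but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, n_codons))
                    start = None
    return orfs


def prob_no_stop(n_codons: int) -> float:
    """Chance that n random codons contain no stop (3 of the 64 codons are stops)."""
    return (61 / 64) ** n_codons


toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"   # made-up sequence with one 5-codon ORF
print(find_orfs(toy, min_codons=5))
print(round(prob_no_stop(50), 3))
```

Under these assumptions roughly 9% of random 50-codon stretches contain no stop codon, which is why short ORFs are treated with suspicion; the scan also cannot see ORFs that only exist after splicing.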
S. cerevisiae genomics
-problems in gene identification are emphasised by genome analyses in yeast
-genome of ~12 Mb, roughly 0.4% (about 1/250) of the size of the human genome
-in contrast to the human genome, genes are tightly packed in the S. cerevisiae genome with very little repetitive DNA
-in contrast to the human genome, alternative RNA splicing rarely occurs in S. cerevisiae, so it rarely complicates gene identification
-simple genetics can be performed in S. cerevisiae to analyse potential gene function
-despite these advantages nearly 30 years after obtaining the yeast genome sequence…
~10% of the ~6600 ORFs are only classified as ‘dubious’ and a further ~10% are classified as ‘uncharacterised’
~26% of ORFs have not yet been linked with any biological processes
advancements in research and medicine
-foundation for precision medicine – cancer treatment, pharmacogenomics
-gene therapy
-genetic screening
-several openly accessible databases
-comparative genomics – widely popular
HGS of rare childhood conditions
~350 babies/children in intensive care had their genome sequenced, together with the genomes of both parents
- 1 in 4 of the patients had a genetic disorder
- in ~66% of cases the mutation occurred spontaneously
- the child's symptoms were only rarely a good predictor of a genetic condition
- diagnosis within 2-3 weeks avoided further invasive tests and sometimes led to a change in treatment
- in 2019 any baby/child admitted to intensive care with an unexplained condition became eligible for whole genome sequencing
prediction of function - roles for model organisms
-functional characterisation of mutant proteins
-prediction of protein localisation
-prediction of protein domains/modifications
-identification of regulatory sequences
-characterisation of protein families
functional characterisation of mutant proteins
-the predicted catalytic defects of mutant Msh2 proteins from human colon cancer were confirmed by expressing the proteins in yeast
Analysis of many other mutant proteins revealed:
- defects in critical protein-protein interactions
- reduced steady state levels of Msh2
- mutations affected the activity of the mismatch repair complex
-insights were totally unpredicted from the human studies
understanding human genome variation
- genetic variation
-any 2 unrelated humans have around 3 million differences in DNA sequence
-~10,000 of these differences cause changes in proteins
- bottleneck in personalised medicine
-interpreting genetic variation and its role in disease is challenging
-crucial for developing personalised drug treatments
- mutation effects
-many single amino acid changes in proteins causing human diseases are believed to be due to protein instability
- potential treatments
-simple diet supplements might restore protein function for specific mutations
e.g. vitamin B6-dependent enzyme defects linked to neuronal disorders
-studies using yeast to express human genes reveal unpredicted defects, aiding in understanding genetic variation
prediction of protein localisation
-can analyse the protein sequence using the PSORT II program to predict where the protein is located in the cell
-can then test this prediction in the lab
prediction of protein domains/modifications
-use various programmes e.g. BLAST to identify conserved domains
-in Yox1 there is a homeodomain
-this gives the predicted structure of the domain within the protein
-then identify potential genes regulated by Yox1 and identify its potential binding site on DNA
-can look at potential phosphorylation sites (serine/threonine/tyrosine) using programs such as NetPhos
1. investigate whether the protein is phosphorylated in vivo and, if so, whether the identified threonine is important
2. test genetically and biochemically the potential role of the kinase predicted by the program
3. does the phosphorylation change in a cell cycle dependent manner?
4. mutate the threonine residue to a glutamic acid or an aspartic acid (negatively charged, mimic phosphorylation) or to an alanine (cannot be phosphorylated) to investigate the role of phosphorylation
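Step 4 can be sketched in code as a helper that writes out the three mutant protein sequences for a chosen site. This is only an illustration: `phospho_mutants` is a hypothetical helper and the peptide below is made up, not the real Yox1 sequence.

```python
def phospho_mutants(protein: str, position: int) -> dict[str, str]:
    """Build A/D/E substitutions at a 1-based S/T/Y position.

    Alanine (A) cannot be phosphorylated; aspartic acid (D) and glutamic acid (E)
    are negatively charged and are used as phosphomimetics.
    """
    residue = protein[position - 1]
    if residue not in "STY":
        raise ValueError(f"residue {residue}{position} is not S, T or Y")
    variants = {}
    for new in "ADE":
        name = f"{residue}{position}{new}"          # e.g. T3A
        variants[name] = protein[:position - 1] + new + protein[position:]
    return variants


# Made-up peptide with a threonine at position 3
for name, seq in phospho_mutants("MKTPLS", 3).items():
    print(name, seq)
```

The resulting sequences would then be introduced by site-directed mutagenesis and tested in vivo; the code only keeps track of which residue changes.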
identification of regulatory sequences
-identify all promoters containing a transcription factor binding site
-looking for a specific sequence that can bind to the transcription factor
-say the computer comes up with a match e.g. a sequence that is present between gene 2 and gene 3
-this would suggest that gene 3 is probably regulated by the transcription factor that you are looking at
-to confirm, take gene 3 along with its promoter sequence, express them in cells and test whether the transcription factor of interest binds to that sequence
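The computer search described above can be sketched as a motif scan over promoter sequences on both strands. This is a minimal sketch: the motif `TAATKR`, the gene names and the promoter sequences are all invented for illustration (not a known binding site for any real factor), and real tools usually score position weight matrices rather than exact patterns.

```python
import re

# IUPAC codes used in the example motif (subset, for illustration)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_to_regex(motif: str) -> str:
    return "".join(IUPAC[base] for base in motif.upper())


def scan_promoters(promoters: dict[str, str], motif: str) -> dict[str, list[tuple[int, str]]]:
    """Return {gene: [(0-based position on the given strand, strand), ...]} for every match."""
    # A lookahead lets overlapping matches be reported too
    pattern = re.compile(f"(?=({motif_to_regex(motif)}))")
    hits = {}
    for gene, seq in promoters.items():
        seq = seq.upper()
        found = [(m.start(), "+") for m in pattern.finditer(seq)]
        rc = seq.translate(COMPLEMENT)[::-1]
        # Convert reverse-strand coordinates back to the given strand
        found += [(len(seq) - m.start() - len(motif), "-") for m in pattern.finditer(rc)]
        if found:
            hits[gene] = sorted(found)
    return hits


promoters = {                       # hypothetical upstream sequences
    "gene2": "GGCCGGCCGGCC",
    "gene3": "GGCCTAATTGCCGG",
    "gene4": "GGTCATTAGG",
}
print(scan_promoters(promoters, "TAATKR"))
```

A match only marks a candidate target; as noted above, binding still has to be confirmed experimentally.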
characterisation of protein families
-kinases have well-characterised homology within their catalytic domains
-genome analysis allows inference of the function of uncharacterised kinases by family studies
-genome analysis allows identification of conserved and organism-specific families of protein kinases
functional genomics
-functional genomics experiments describe gene functions and interactions
-includes: protein/DNA interactions, DNA methylation, gene expression, protein-protein interactions, loss-of-function
microarrays
-many functional genomics experiments depended on microarrays
-give a measurement of hybridisation of the sample to probes on the array (each probe has a specific sequence complementary to the genomic sequence that you are interested in)
-range of samples and probes for different experiment types
-gives quantitative data on how much sample DNA binds to each probe
-measured as the fluorescence from the hybridisation
expression experiment
-sample = cDNA from mRNA
-probe is complementary to coding sequence of known genes
ChIP
-sample = protein-bound DNA (immunoprecipitation)
-probe is the whole genome
-cross-link proteins to DNA, usually in the cell
-isolate DNA and shear; sonication for ‘random’ shearing
-immunoprecipitate protein of interest
-reverse cross-linking
-purify DNA
-sequence
SNP
-sample = whole genome
-probe = known SNPs
methylation
-sample = whole genome
-probe = known CpG islands
CGH
-sample = whole genome
-probe = whole genome
why has it changed from microarrays → sequencing
-direct sequencing can substitute for hybridisation and gives greater resolution and generally greater accuracy
-the issues with sequencing were cost and throughput
-developments in sequencing have addressed both
High-throughput sequencing
-sometimes called next-generation sequencing
-refers to a range of technologies
-competition has driven down cost and increased throughput
-Illumina sequencing dominates
- Fragments of DNA (library) bound to solid surface (flow cell), binding enabled by special sequences ligated to fragments (adaptors)
- Solid-phase bridge PCR forms clonal clusters, ~ 1000 copies per cluster
- Sequencing proceeds in cycles
- Modified nucleotides carry a fluorescent group and a reversible terminator that blocks extension, so only 1 base can be added per cycle; a different fluorophore is used for each base
- Reversible termination allows sequencing to proceed to the next cycle
RNA sequencing
-use of HTS technologies to get information about a sample's RNA content
-mRNA and other RNAs are converted to cDNA by reverse transcription
-cDNA used for sequencing library generation
-allows quantification, profiling and discovery of RNA
-RNA sequencing is now generally used in preference to a microarray
• Extract RNA from a sample; most of it is rRNA, which isn’t very helpful
• Want to enrich for the signal that you are interested in
• Polyadenylation can help pull the mRNA out of the mix: a magnetic bead attached to a polyT sequence (complementary to polyA) binds the mRNA, and a magnet then pulls the beads out to give highly enriched mRNA
• If there is no polyA tail, e.g. in bacteria, then you have to selectively degrade the RNA you are not interested in
• Then fragment the RNAs so they are small enough to sequence
• Random priming to enable cDNA synthesis
• Then make double-stranded cDNA (ds cDNA)
• This is then used to make the sequencing library
how does RNA sequencing work
-sequences in final library are derived from RNA population in sample
-representation is proportional to abundance in the original sample: more abundant RNA species will be present more frequently in the library
-random priming is an attempt to avoid introducing bias
-actual randomness is debatable
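Because read counts are proportional to abundance, raw counts are usually rescaled before samples or genes are compared. The sketch below shows two common rescalings, counts per million (CPM) and transcripts per million (TPM); the gene names, counts and lengths are invented, and the length correction assumes that longer transcripts yield proportionally more fragments.

```python
def counts_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale raw read counts so they sum to one million (corrects for library size)."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def transcripts_per_million(counts: dict[str, int], lengths_bp: dict[str, int]) -> dict[str, float]:
    """Divide by transcript length first, then scale to one million."""
    rate = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rate.values())
    return {gene: r * 1_000_000 / total for gene, r in rate.items()}


# Hypothetical read counts and transcript lengths
counts = {"geneA": 300, "geneB": 100, "geneC": 600}
lengths = {"geneA": 3000, "geneB": 1000, "geneC": 2000}

print(counts_per_million(counts))
print(transcripts_per_million(counts, lengths))
```

In this made-up example geneA has three times the reads of geneB but is three times longer, so the two come out equal once length is taken into account.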
considerations for RNA-sequencing
- Big data sets require expert processing
- Expression can be noisy, careful experimental design is important
- Easy for confounding factors to dominate
- Good practice is the same as for any statistical approach
ATAC-seq
-assay for transposase-accessible chromatin
-similar to the older DNase-seq
-relies on the Tn5 transposase: a high-activity transposase that cuts exposed DNA very efficiently and attaches adaptors to the ends
-adaptor-tagged fragments are isolated, amplified and sequenced
bisulfite sequencing
-used to determine the methylation state of DNA
-methylated cytosine is protected from deamination
-unmethylated cytosine is deaminated to uracil, which is read as thymine after amplification
-high depth sequencing can provide quantitative estimates of methylation
-hypo- and hypermethylated regions can then be identified
-hypermethylation is associated with transcriptional silencing
reduced representation BS-seq (RRBS)
-human genome has ~28 million CpG sites
-assaying all of these in a single experiment is problematic; required depth and multiple testing correction
-RRBS utilises the MspI restriction enzyme to enrich for CpGs; recognition site = C/CGG, results in fragments which begin/end with CpG
-only need to sequence 1% of the genome
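The MspI digestion step can be sketched in silico as below. This is a toy illustration on a made-up sequence: it cuts one strand at every C/CGG site and then keeps fragments in a size window; the window limits are arbitrary assumptions, not protocol values.

```python
def mspi_digest(seq: str) -> list[str]:
    """Cut at every CCGG site between the first C and CGG (C/CGG) and return the fragments."""
    seq = seq.upper()
    cuts = []
    i = seq.find("CCGG")
    while i != -1:
        cuts.append(i + 1)              # cut falls after the first C of the site
        i = seq.find("CCGG", i + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments: list[str], min_len: int, max_len: int) -> list[str]:
    """Keep fragments within a length window, as a size-selection step would."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


toy = "AACCGGTTTTCCGGAA"                # made-up sequence with two MspI sites
fragments = mspi_digest(toy)
print(fragments)
print(size_select(fragments, 6, 10))
```

Every internal fragment starts with CGG, so each sequenced fragment end reports on at least one CpG.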