What is gene annotation
The process of identifying and describing regions of biological interest within a genome – BOTH functionally and structurally
What are three main steps in gene annotation
Identifying noncoding regions
Identifying coding regions (=gene prediction)
Attaching biological information to these elements
What are the main approaches for identifying genes within the genome
Intrinsic methods: based on DNA sequence alone
Open reading frame (ORF)
Gene codon bias
Splicing sites
Extrinsic methods: comparing to known data
Gene homology and related genomes
Comparison to RNA expression
Describe the Open reading frame (ORF)
ORF is a segment of DNA starting with a start codon (usually ATG) and ending with a stop codon (TAA, TAG, TGA)
6 reading frames because DNA is double stranded and each strand can be read in 3 different ways depending on which base position of the codon you start at.
search in all 6 reading frames 5’→3’
Identifying ORFs helps find the parts of the DNA that code for a protein → reveals the functional part of the DNA
To find genes, we look for long stretches (=segments) without stop codons:
In random DNA a stop codon appears on average about every 64 bases (3 of the 64 codons are stops, so roughly one stop per 21 codons). A stop-free stretch much longer than that is unlikely by chance and suggests a gene.
More difficult for eukaryotes: genes are often split into exons separated by introns (exons can be short and scattered → harder to detect as one long ORF)
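The six-frame ORF search described above can be sketched in Python. This is a minimal illustration rather than a real gene finder: the function names and the `min_codons` cutoff are assumptions made for the example, and it only reports simple ATG-to-stop ORFs with no handling of introns.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, orf_sequence). Positions are
    0-based on the strand that was scanned, and the ORF includes the stop
    codon. min_codons (start to stop inclusive) is an arbitrary cutoff.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG", min_codons=3)` reports one ORF, `ATGAAATTTTAG`, on the forward strand in the third frame.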
Describe how exons and introns are identified using gene codon bias
Codon bias is an organism's preference for certain codons among the synonymous codons for an amino acid (the preference differs between organisms)
Codon bias is seen only in the exons → helps identify exons
ORF scanning is effective for prokaryotes but less so for eukaryotes → codon bias helps identify protein-coding segments in the DNA
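As a rough illustration of how codon bias can be measured, the sketch below counts how often each synonymous codon is used for one amino acid in an in-frame coding sequence. The function name and the leucine example are assumptions chosen for the illustration; real gene finders compare such frequencies against organism-specific codon usage tables.

```python
from collections import Counter

# The six synonymous codons for leucine in the standard genetic code.
LEUCINE_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]


def synonymous_usage(cds, synonymous_codons):
    """Return the fraction of each synonymous codon among all uses of them.

    cds is read in frame from its first base; trailing bases that do not
    fill a complete codon are ignored.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in synonymous_codons)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in synonymous_codons}
    return {codon: counts[codon] / total for codon in synonymous_codons}
```

A sequence in which one codon dominates (for example CTG for leucine) shows a strong bias, whereas roughly equal fractions look more like non-coding DNA.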
Describe how exons and introns are identified using the splicing sites
Exon-intron boundaries are not random – they are marked by specific sequences.
By comparing many exon-intron boundaries, consensus sequences for the boundaries have been identified
Consensus sequence = a typical sequence that is often found at the same position in many genes. Consensus sequences are useful markers for finding the edges of genes and locating the exons in the genome.
Example: consensus sequence for vertebrates
Py = T or C
N = any nucleotide
What are the specific elements that help to identify genes
Start/stop codon
Poly-A signals and terminators (important for ending transcription)
Promoter:
CpG islands
Binding sites for regulatory proteins
What are CpG islands
A region of about 1 kbp that is rich in CpG dinucleotides (the C in CpG can be methylated)
Found upstream of many genes
In humans, 70% of proximal promoters contain a CpG island
Not all promoters contain a CpG island, but where one is present a gene most often starts just downstream of it
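A minimal sketch of how a candidate region could be scored for CpG-island character. The thresholds used here (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6) are widely used criteria but are not taken from these notes, so treat them and the function names as assumptions.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")            # observed CpG dinucleotides
    gc_fraction = (c + g) / n
    expected = c * g / n             # CpG count expected from base composition
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to one candidate region."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_length and gc_fraction > min_gc and obs_exp > min_obs_exp
```

In practice the test is applied in a sliding window along the genome; the sketch only scores a single region.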
How do you identify binding sites through ChIP-seq
ChIP-seq = Chromatin Immunoprecipitation sequencing
This method shows where regulatory proteins bind to DNA to control gene expression, which is especially useful when the binding sites have no clear sequence pattern.
Steps:
Crosslink proteins to DNA in the chromatin (so they stay stuck together)
Break DNA into fragments (sonication)
Use antibodies to pull out only the DNA fragments bound to the protein of interest.
Break the crosslinks, discard proteins.
Sequence the DNA → You now know where that protein was bound!
How do you locate genes for noncoding RNA
Not all genes encode proteins. Some genes produce RNA molecules that function directly as RNAs — they are not translated into proteins, but still play essential roles in the cell.
Noncoding RNA includes:
tRNA
rRNA
Other short and long RNA molecules participating in e.g:
Alternative splicing
Posttranscriptional gene regulation
Chromatin remodelling
Protein interactions
Noncoding RNA is typically conserved in structure rather than in sequence
There are programs that search for candidate structures by testing stem length, loop size, stability, etc.
Describe homology search (extrinsic method)
Homology = shared ancestry. If two sequences are homologous, they likely came from the same gene in an ancestor and might have similar functions
BLAST: compare sequence to known sequences in a database. It tells you:
is this gene similar to anything known
does it exist in other organisms
Why are protein comparisons better than DNA in homology search
The genetic code is redundant: multiple DNA codons can code for the same amino acid. So, two DNA sequences can look different but still produce the same protein. That’s why comparing protein sequences often gives more meaningful results for function and annotation.
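The redundancy argument can be made concrete with a small sketch: two DNA sequences that differ at most positions still translate to the same peptide. The codon table is the standard genetic code; the two example sequences and the function name are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acids."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - len(dna) % 3, 3)
    )


seq_a = "CTGAGCCGT"   # hypothetical sequence 1
seq_b = "TTATCAAGA"   # hypothetical sequence 2, synonymous codons throughout
dna_mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
same_protein = translate(seq_a) == translate(seq_b)
```

Here the two sequences differ at 7 of 9 bases, yet both encode Leu-Ser-Arg, so a DNA-level comparison would look weak while the protein-level match is perfect.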
Describe why using related genomes can help with gene annotation (extrinsic method)
If you're not sure whether a DNA sequence is a real gene:
Check if the same or similar gene exists in a closely related species.
If yes, it adds confidence that this gene is real and functional, not just a random ORF (open reading frame).
Why it's important:
Prevents false positives in gene annotation: sometimes short DNA sequences look like genes but aren't real (called "spurious ORFs").
Seeing conservation across species (especially functionally important genes) helps validate annotations.
How can transcriptome comparison help with gene annotation (extrinsic method)
Compare all RNA (all transcripts) from different cells to see which genes are active in different cell types.
If a region of the genome is being transcribed into RNA, it's likely a gene (or regulatory RNA).
Mapping the RNA back onto the genome helps confirm gene locations.
How it's used:
In genome annotation pipelines, these data help validate predictions made from ORF finding and homology.
You get a functional confirmation that the gene is actually expressed.
Why should annotation exclude pseudogenes
Pseudogenes = broken versions of real genes
They may look like genes but can’t make functional proteins.
How can annotation tools detect and filter out pseudogenes (not labelling them as real, functional genes)
Common problems in pseudogenes:
Missing promoter: no way to start transcription.
Missing start codon: can’t begin translation.
Frameshifts: small insertion/deletion that throws off the reading frame.
Early stop codons: translation ends too early = nonfunctional protein.
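Missing introns: suggests a processed pseudogene (a reverse-transcribed mRNA copy inserted back into the genome).
Partial deletion: part of the gene is missing.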
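Some of the problems listed above can be checked directly on a predicted coding sequence. The sketch below is a simplified filter with assumed names and messages; it covers only the start codon, reading frame, and premature stops, and real annotation tools combine many more lines of evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def pseudogene_flags(cds):
    """Return a list of warning strings for a predicted coding sequence.

    An empty list means none of the checked problems were found.
    """
    cds = cds.upper()
    flags = []
    if not cds.startswith("ATG"):
        flags.append("missing start codon")
    if len(cds) % 3 != 0:
        flags.append("length not a multiple of 3 (possible frameshift)")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Any stop codon before the final complete codon ends translation early.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        flags.append("premature stop codon")
    return flags
```

A missing promoter or missing introns cannot be seen from the coding sequence alone; those checks need the surrounding genomic sequence and a comparison with the functional parent gene.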
How do we know if we have done a good annotation
Use BUSCO (= Benchmarking Universal Single-Copy Orthologs)
BUSCO on the assembly → checks whether the DNA sequence contains all the important genes = measures the quality of the genome.
BUSCO on the annotation → checks whether the annotated genes are correct and complete = measures the quality of the annotation.
For example: all mammals should have the same core set of metabolic genes
What is functional annotation
= determining what the protein encoded by the gene does (done after the genes themselves have been annotated)
Homology can give important clues to protein function
Often includes:
Domain/motif searches
Orthology searches
Homology searches
What is the domain/motif search in functional annotation
Search for specific parts of a protein → indicates what the protein can do
Motif: a small, recurring functional recognition pattern, e.g. a DNA-binding motif (the TATA box is an example of a motif at the DNA level)
Domain: a larger, independent functional unit, e.g. a catalytic domain or a DNA-binding domain
Databases and tools such as Pfam, InterPro, and SignalP help identify these features.
What is the orthology search in functional annotation
This looks for equivalent genes (orthologs) in other organisms. Since orthologs often maintain the same function through evolution, this can offer strong functional clues.
What is the homology search in functional annotation
Tools like BLAST compare sequences against large databases to find similar proteins. Functional predictions can then be made based on known roles of matching proteins.
What can the information from a gene sequence help determine about protein functions
Functions based on homology:
Secondary structure determination: Determines whether regions of a protein are likely to form alpha helices, beta sheets, or coils/loops. These structures influence how the protein folds and interacts.
Transmembrane domain prediction: Identifies hydrophobic regions that may embed into cell membranes. These regions are typical in membrane-bound receptors or transporters. Helps determine if the protein is cytosolic or membrane bound.
Signal peptides: Short sequences at the start of proteins that direct them to specific locations in the cell (e.g. mitochondria, endoplasmic reticulum).
3D structure modeling: Gives a complete shape of the protein, which is critical for understanding how it interacts with other molecules.
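Transmembrane domain prediction is often introduced with a hydropathy plot. The sketch below uses the Kyte-Doolittle hydropathy scale with a sliding window; the window of 19 residues and the cutoff of 1.6 are a commonly quoted rule of thumb rather than something stated in these notes, and the function names are assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(protein, window=19):
    """Return (start, mean hydropathy) for every window along the protein."""
    protein = protein.upper()
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        (start, sum(scores[start:start + window]) / window)
        for start in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Return start positions of windows hydrophobic enough to span a membrane."""
    return [
        start
        for start, mean in hydropathy_windows(protein, window)
        if mean > threshold
    ]
```

A protein with no window above the cutoff is more likely to be soluble (e.g. cytosolic); dedicated predictors use more elaborate models than this single-scale average.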
What is forward genetics
This approach starts with an observable trait (phenotype), such as a visible defect or disease. Researchers then try to find which gene is responsible. This method is useful when the phenotype is known, but the genetic basis is not.
What gene causes the phenotype?
What is reverse genetics
In this approach, scientists start with a specific gene of interest and then alter it (e.g. by knocking it out, mutating it, or overexpressing it) to see what effect that has on the organism. The goal is to deduce the gene’s function by observing what happens when it is changed.
Reverse genetics is often carried out using:
mutagenesis of the protein
gene knock-out
down regulation
overexpression
in vivo
What is the gene’s function?
How do you detect a transcript through Northern blot
A method to detect RNA in a sample
Same technique as Southern blot but for RNA
How it works:
RNA is separated by gel electrophoresis.
Transferred from the gel to a membrane.
A labeled probe on the membrane binds to a specific RNA sequence.
The probe shows where (and if) the RNA is present.
Determine where/when a gene is expressed.
Measure transcript length.
Detect splice variants.
What is Northern blot used for
Determine where/when a gene is expressed.
Measure transcript length.
Detect splice variants.
When do you analyze a transcript
Don’t have a complete gene sequence (or don’t trust it)
Need to know up/downstream UTR (untranslated) regions
May need to verify the sequence of your favorite gene before trying to investigate it.
Why do you need to detect start/end of a transcript
Helps identify regulatory regions and transcription boundaries
What methods are used to detect start/end of a transcript
S1 nuclease mapping
Primer extension
RACE-PCR
What is S1 nuclease mapping
What it does:
Detects the 5' and 3' ends of RNA.
Also identifies exons and potential splice variants (= different combinations of exons from one gene).
How it works:
RNA hybridizes with DNA → forms a DNA-mRNA heteroduplex (hybrid).
Introns form loops that do not hybridize (base-pair) with the RNA.
S1 nuclease cuts unpaired single-stranded DNA (loops).
RNA is degraded by alkali.
The ssDNA fragments that were protected by the RNA remain → can be analyzed by sequencing or electrophoresis.
What is RACE-PCR
What it does:
Amplifies the start or end of an RNA sequence.
Helps find exact 5' or 3' ends of mRNAs.
Steps:
Reverse transcriptase + primer → reverse transcription of RNA to cDNA.
RNA is denatured.
A poly-A tail is added to the 3′ end of the cDNA with terminal transferase.
A second primer binds to the A-tail.
Second-strand synthesis with Taq polymerase
PCR amplifies the cDNA sequence.
The amplified cDNA fragments are sequenced to determine start or end of the RNA.
What does identification of regulation of gene expression show
Finding where regulatory proteins (like transcription factors) bind to DNA.
What methods are used to identify regulation of gene expression
(ChIP-seq)
Gel retardation (Electrophoretic mobility shift assay, EMSA)
Footprinting with DNase I
Modification interference assay
What is the principle of gel retardation (EMSA)
DNA bound to a protein moves slower in a gel than DNA alone.
Shifted band = DNA-protein complex.
Shifted = the band ends up higher in the gel
Used to confirm if a protein binds a specific DNA fragment.
Shows whether the protein binds the DNA and how strongly.
What is DNase footprinting
Principle:
Identifies exact binding sites of DNA-binding proteins.
Detects protected regions where proteins block DNase I from cutting DNA.
Steps:
DNA is end-labeled and mixed with the regulatory protein that binds it.
DNase I is added → it cuts DNA except where the protein is bound.
Fragments are separated by gel → the “gap” or “footprint” on the gel shows where the protein was bound.
End-label: marks one end of DNA so fragment sizes can be seen clearly.
Protein binds DNA → protects a region from DNase cutting.
DNase I cuts only unprotected (protein-free) DNA.
Footprint = gap in band pattern on gel → shows where protein was bound.
Only DNA fragments run in the gel, not the proteins.
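Reading a footprint off the gel amounts to finding the run of fragment lengths that is missing when the protein is present. The toy sketch below assumes a control lane (DNase I without protein) for comparison and represents each lane as a list of fragment lengths; both the control lane and the function name are assumptions made for the illustration.

```python
def footprint(control_lengths, protein_lengths):
    """Return (first, last) fragment length of the longest missing run.

    Lengths are measured from the labeled end. Bands present in the control
    lane but absent in the protein lane mark positions protected from
    DNase I. Returns None if no band is missing.
    """
    missing = sorted(set(control_lengths) - set(protein_lengths))
    if not missing:
        return None
    runs = [[missing[0], missing[0]]]
    for length in missing[1:]:
        if length == runs[-1][1] + 1:
            runs[-1][1] = length             # extend the current gap
        else:
            runs.append([length, length])    # start a new gap
    first, last = max(runs, key=lambda run: run[1] - run[0])
    return first, last
```

With control bands at every length from 1 to 20 and protein-lane bands missing from 8 to 12, the function returns `(8, 12)`, i.e. the protein covered positions 8 to 12 from the labeled end.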
What is modification interference assay
Determines which DNA bases are essential for protein binding
Steps:
DNA is chemically modified (e.g., G bases are methylated).
Protein is added. If the modification prevents binding, that base is important.
Compare modified fragments that did or did not bind the protein via gel electrophoresis.
How do you find out what the identified sequence does (identifying the regulation of gene expression)
Deletion analysis
What is deletion analysis
Tests what happens when control elements that we have found (enhancers/silencers of gene expression) are deleted.
The result should be:
Delete enhancer → gene expression decreases
Delete silencer → gene expression increases
What is a reporter gene
A gene used to mimic the expression pattern of the original gene
How can reporter genes facilitate deletion analysis
A reporter is used to track gene expression easily, because it gives a clear and measurable signal, unlike many ordinary genes.
How do you analyze proteins and their functions
Change your protein of interest - study the effect:
Introduce a point mutation - In vitro or site-directed mutagenesis
Change larger parts of your protein – swaps and truncations
Name two ways to introduce mutations in proteins
Artificial gene synthesis
Mutations by PCR
How does artificial gene synthesis work
Create overlapping short DNA fragments = oligonucleotides (you design them yourself).
Assemble them into a full gene with DNA polymerase and ligase.
How do you mutate by PCR
Use primers that contain mutations (forward and reverse).
Run PCR to amplify DNA with the mutation.
Used to precisely alter DNA and test effects.
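Designing the mutation-carrying primers can be sketched as follows: take the template around the target base, substitute the new base, and use that stretch and its reverse complement as the forward and reverse primers. The function name, the `flank` length, and the example template are hypothetical; real primer design also considers melting temperature and GC content.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mutagenic_primers(template, position, new_base, flank=10):
    """Return (forward, reverse) primers carrying a single-base substitution.

    position is 0-based on the template; flank is the number of unchanged
    bases kept on each side of the mutation.
    """
    template = template.upper()
    if not flank <= position < len(template) - flank:
        raise ValueError("mutation is too close to the end of the template")
    mutated = template[:position] + new_base.upper() + template[position + 1:]
    forward = mutated[position - flank:position + flank + 1]
    return forward, reverse_complement(forward)
```

For the made-up template `AAAAACCCCCGTTTTTGGGGG`, changing the G at position 10 to A with `flank=5` gives the forward primer `CCCCCATTTTT` and the reverse primer `AAAAATGGGGG`.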
How do you analyze a protein through swaps and truncations
Strategies:
Promoter swap: test how a different promoter affects expression.
Motif swap: exchange a domain or motif between genes.
Truncations: cut out parts of the gene/protein to study the function of specific regions (if they are important for the function of the gene/protein).
Name ways to change gene expression in vivo (in organism)
Gene editing using well-developed Gene Editing Tools:
TALENs, Zinc Finger Nucleases, CRISPR/Cas9
→ Allow precise changes in DNA.
Gene knockout/downregulation: turn off or decrease the expression of a gene
Homologous recombination (deletion cassettes)
CRISPR interference
RNA interference, or recombination
Gene overexpression: make more of a gene’s product, from the same or different species.
Deletion Cassette is a way to knockout a gene by homologous recombination. How does it work
Example with mice:
Insert modified DNA with selectable markers.
Replace the gene via homologous recombination: the original gene is replaced by the modified DNA.
Select cells in which the replacement succeeded → insert them into embryos → obtain chimeric mice (= mice with both normal and knockout cells).
Crossbreed to get homozygous knockout mice (= mice in which both copies of the gene are knocked out).
RNA interference is a way to knock down a gene. How does it work
A post-transcriptional method to silence genes using double-stranded RNA (dsRNA).
The dsRNA matches a gene’s mRNA and triggers its degradation (the dsRNA is perceived as a threat, e.g. viral RNA → the matching mRNA is degraded as a defense), reducing protein production.
The genes are temporarily silenced, not permanently removed (unlike in a gene knockout)
What is the mechanism of RNA interference
Dicer enzyme chops long dsRNA into small interfering RNAs (siRNA).
siRNA joins a complex called RISC (RNA-induced silencing complex).
RISC uses one siRNA strand to find and bind mRNA with a matching sequence.
Bound mRNA is cleaved and degraded, stopping translation.
→ This is a natural defense against viruses, which often have dsRNA.
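The two sequence-level steps of this mechanism, Dicer cutting and RISC target recognition, can be mimicked with a toy sketch. It assumes siRNAs of 21 nucleotides (a typical size, not stated in these notes) and perfect complementarity; the function names and example sequences are made up.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def rna_reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(RNA_COMPLEMENT)[::-1]


def dice(strand, size=21):
    """Chop one strand of a long dsRNA into consecutive siRNA-sized pieces."""
    strand = strand.upper()
    return [strand[i:i + size] for i in range(0, len(strand) - size + 1, size)]


def find_target_sites(guide, mrna):
    """Return 0-based positions where the mRNA is fully complementary to the guide."""
    guide, mrna = guide.upper(), mrna.upper()
    site = rna_reverse_complement(guide)  # mRNA sequence the guide pairs with
    return [
        i for i in range(len(mrna) - len(site) + 1)
        if mrna[i:i + len(site)] == site
    ]
```

An mRNA with at least one target site would be cleaved by RISC in this simplified picture; an mRNA with no site is left alone.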
How do you introduce dsRNA for RNAi
Antisense RNA: Make RNA that is reverse-complementary to your target → binds to target RNA and creates dsRNA.
Hairpin RNA: RNA is engineered to fold back on itself, creating internal dsRNA.
CRISPR interference is a way to knock down a gene. How does it work
Interference = the gene is still present, but its expression (protein production) is blocked or significantly reduced.
It uses dCas9, a modified form of Cas9 that cannot cut DNA due to inactivating mutations.
A guide RNA (sgRNA) directs dCas9 to a specific DNA sequence.
What is the mechanism of CRISPR interference
Blocking gene expression:
1. Initiation block:
dCas9 binds near the promoter region.
This prevents RNA polymerase from binding, so transcription cannot start.
2. Elongation block:
dCas9 binds within the gene body (= the coding part of the gene).
RNA polymerase may bind, but is blocked during transcription, so transcription stops partway.
Key points:
CRISPRi blocks transcription, while RNA interference (RNAi) acts after transcription: it degrades the mRNA, so translation is blocked.
The gene itself is not cut or removed, only silenced.
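The link between where the guide RNA matches and which kind of block results can be sketched as below. The coordinates, the `promoter_length` value, and the function names are hypothetical, and the sketch ignores practical requirements such as the PAM sequence that must lie next to the target site.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_guide_sites(genome, guide):
    """Return 0-based top-strand positions where the guide matches either strand."""
    genome, guide = genome.upper(), guide.upper()
    targets = {guide, reverse_complement(guide)}
    k = len(guide)
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] in targets]


def blocking_mode(site, tss, gene_end, promoter_length=100):
    """Classify a dCas9 binding site relative to a gene.

    tss is the transcription start site; the promoter is assumed to be the
    promoter_length bases directly upstream of it.
    """
    if tss - promoter_length <= site < tss:
        return "initiation block"
    if tss <= site < gene_end:
        return "elongation block"
    return "outside promoter and gene body"
```

A site just upstream of the transcription start is classified as an initiation block, a site inside the gene as an elongation block.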