Exam 1 review Genomics

96 Terms

New cards

De Novo assembly

Building the genome from scratch only using the reads you sequenced
No reference genome
Needs high coverage (sequence more)

New cards

For a de novo assembly using Illumina, how much coverage would you need?

~80x

New cards

For a de novo assembly using PacBio, how much coverage would you need?

~20-30x

New cards

Reference-guided assembly

You already have a closely related genome to help guide the assembly.
The existing genome acts like a map to help put the puzzle pieces together faster and more accurately.
Needs less coverage because you have help:
- ~30-50x for Illumina

New cards

When would you use reference-guided assembly instead of de novo?

When you have a closely related genome to use as a guide — it’s faster and requires less data.

New cards

Why is high coverage needed for de novo assembly?

To make sure you have enough overlapping reads to piece the genome together accurately.

New cards

Which sequencing technology is better for assembling complex genomes from scratch?

PacBio or Oxford Nanopore (long reads)

New cards

If you wanted to analyze gene expression across samples, which sequencing method would you use?

Illumina (short-read) is most commonly used

New cards

Which platform is best for identifying small mutations like SNPs?

Illumina for SNPs

New cards

Why are long reads useful for de novo assembly?

Because they can span large repetitive regions, making assembly easier and more accurate.

New cards

Can you sequence a full human genome at 30X on one Illumina flow cell?

No. One flow cell yields only about 0.625 of the data needed for a full human genome at 30X.

New cards

What would happen if you only had 10X coverage?

Your data might be less accurate — harder to detect mutations or assemble the genome.
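
The coverage figures on these cards follow from one piece of arithmetic: average coverage is the total number of sequenced bases divided by the genome size. A minimal sketch, assuming a hypothetical 5 Mb genome and 150 bp reads (neither number comes from the cards):

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)

# Hypothetical example values, chosen only for illustration.
GENOME_SIZE = 5_000_000   # bp
READ_LENGTH = 150         # bp

print(coverage(1_000_000, READ_LENGTH, GENOME_SIZE))   # 30.0
print(reads_needed(80, READ_LENGTH, GENOME_SIZE))      # 2666667
```

Coverage is an average, so at 10X some positions will be covered by only a few reads or by none, which is why variant calls and assemblies become less reliable.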

New cards

Global Alignment

What it does: Aligns the entire length of two sequences from beginning to end.
Best when:
- Sequences are similar in length
- Sequences are closely related

New cards

Local Alignment

What it does: Finds the most similar region between sequences. Doesn’t try to align everything — just the best matching parts.
Best when:
- Sequences are not the same length
- Only parts of them are similar
- Good for finding conserved motifs/domains
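
The difference between the two alignment types shows up clearly in the scores. Below is a minimal sketch of the classic dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The match, mismatch, and gap values are arbitrary assumptions chosen for illustration, and only the score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: all gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over any pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

long_seq, motif = "TTTTACGTGGGG", "ACGT"
print(global_score(long_seq, motif))  # -12: eight forced gaps swamp the 4 matches
print(local_score(long_seq, motif))   # 4: only the matching region is scored
```

The global score is dragged down by the gaps needed to cover the length difference, while the local score reports only the shared region.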

New cards

Which alignment type is better for comparing short reads to a reference genome?

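Global (end-to-end) alignment: every base of the read is aligned, but only against the matching stretch of the reference (often called semi-global alignment)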

New cards

If you were comparing a new protein to a huge database of unrelated sequences, which method should you use?

Local alignment — to find the most similar regions

New cards

Why local alignment is usually more useful:

Most real-world biological sequences are not 100% the same
Only parts might be conserved (similar across species)
Mutations, insertions, deletions make full alignments messy
Local alignment focuses on the good parts only
Sequences are distantly related
You're looking for a motif or conserved domain inside a larger sequence
You’re aligning a short read or fragment to a long reference genome

New cards

Can global alignment still produce full-length results if the sequences are highly similar?

Yes! If the sequences are similar and of equal length, global alignment can work well

New cards

What is something unique to Global Alignment when trying to match sequences?

Global alignment tries to match every position, even if it means adding gaps to make things fit.

New cards

Why is global alignment better for short reads?

Because short reads (such as Illumina reads) are:
- Very short (usually 100–300 base pairs)
- Expected to match the reference exactly or almost exactly
- From one location in the genome you're sequencing, so the whole read should align end to end

New cards

What tool uses local alignment to compare sequences in a database?

BLAST

New cards

What does a progressive Multiple Sequence Alignment do?

It aligns sequences one at a time, adding each new sequence to the existing alignment

New cards

Why use multiple sequence alignment?

To identify conserved regions and study evolutionary or functional relationships
- Like building phylogenetic trees

New cards

What’s one challenge with MSA?

Errors made early on can affect the final alignment, especially if sequences are very different

New cards

What is a consensus sequence?

A sequence that shows the most common base/amino acid at each position from a multiple alignment
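
As an illustration, a consensus can be computed column by column from sequences that are already aligned. This is a minimal sketch with made-up sequences; it assumes all rows have the same length, treats a gap character like any other symbol, and breaks ties by whichever symbol appears first in the column.

```python
from collections import Counter

def consensus(aligned):
    """Most common symbol at each column of an existing alignment."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]   # top symbol in this column
        for column in zip(*aligned)
    )

# Hypothetical aligned sequences.
alignment = ["ACGT", "ACGA", "ATGT"]
print(consensus(alignment))  # ACGT
```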

New cards

What does a consensus sequence help identify?

Conserved and functionally important regions shared across multiple sequences
It helps identify regions that are evolutionarily conserved or functionally critical

New cards

What is the molecular clock used for?

What kind of mutation does the molecular clock rely on?

To estimate how long ago two species or genes diverged from a common ancestor

Neutral mutations that accumulate at a constant rate

New cards

How is the rate of evolution calculated?

By dividing the number of sequence changes by the time since divergence
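
A worked sketch of that division, with hypothetical numbers. Two choices here are assumptions not stated on the card: changes are expressed per site, and for a pairwise comparison the time is doubled because changes accumulate along both lineages since the split (a common convention; set `both_lineages=False` to get the plain changes-over-time version).

```python
def substitution_rate(num_changes, sites, divergence_time_years, both_lineages=True):
    """Substitutions per site per year."""
    per_site = num_changes / sites
    branches = 2 if both_lineages else 1
    return per_site / (branches * divergence_time_years)

def divergence_time(num_changes, sites, rate, both_lineages=True):
    """Molecular clock in reverse: estimate years since divergence from a known rate."""
    per_site = num_changes / sites
    branches = 2 if both_lineages else 1
    return per_site / (branches * rate)

# Hypothetical example: 20 differences across 1,000 aligned sites,
# species that split 10 million years ago.
rate = substitution_rate(20, 1000, 10_000_000)
print(rate)                              # about 1e-09 substitutions/site/year
print(divergence_time(20, 1000, rate))   # about 10,000,000 years
```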

New cards

BLOSUM (BLOcks SUbstitution Matrix)

Based on observed substitutions in blocks of real protein sequences
Used for local alignment (e.g., in BLAST)
BLOSUM number = % identity cutoff used to cluster the sequences the matrix was built from
- BLOSUM80 = for closely related proteins (strict)
- BLOSUM62 = standard default
- BLOSUM45 = for distantly related proteins (more tolerant)

Lower number = more tolerant of change

New cards

PAM (Point Accepted Mutation)

Based on evolutionary models — how proteins evolve over time
Used more for global alignments
PAM number = evolutionary time
- PAM30 = short time, very similar sequences
- PAM250 = long time, lots of changes allowed

Higher number = more time has passed = more differences allowed

New cards

What does BLOSUM62 mean?

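It’s a scoring matrix built from protein blocks in which sequences sharing 62% identity or more were clustered together, so it reflects substitutions between sequences that are up to 62% identical. Good for moderately related sequences.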

New cards

Which matrix would you use to compare distantly related proteins?

BLOSUM45 or PAM250 — they tolerate more differences

New cards

Substitution rate

How often one amino acid replaces another

New cards

Ungapped alignment

Sequence alignment without gaps (used in BLOSUM building)

New cards

Which matrix would you use to align very similar protein sequences?

BLOSUM80 or PAM30

New cards

FASTA Format

Used for: storing sequences without quality scores
Mostly used for:
- Assembled genomes
- Contigs
- Protein sequences

New cards

FASTQ Format

Used for: raw sequencing reads with quality scores
Used during preprocessing, assembly, and alignment because quality matters at that stage
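
To make the difference concrete: a FASTA record is a `>` header line followed by sequence, while a FASTQ record is four lines (`@` header, sequence, `+`, quality string). A minimal parsing sketch, assuming well-formed four-line records and the Phred+33 quality encoding; the read shown is made up.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue                      # skip blank lines between records
        seq = next(it).strip()
        next(it)                          # '+' separator line, ignored
        qual = next(it).strip()
        # Phred+33: quality score = ASCII code of the character minus 33
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]

# Hypothetical single-read FASTQ record.
fastq_text = ["@read1", "ACGT", "+", "II#5"]
for read_id, seq, quals in parse_fastq(fastq_text):
    print(read_id, seq, quals)   # read1 ACGT [40, 40, 2, 20]
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the third base here (Q = 2) is very likely wrong while the first two (Q = 40) are very reliable.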

New cards

What is the E-value?

It is the number of matches with a score this good (or better) that you would expect to find just by chance in a database of that size.
A low E-value = very unlikely the match happened randomly → good hit!
A high E-value = could just be a fluke → probably not meaningful

New cards

What is the Bit Score?

It’s a number based on the raw alignment score (matches, mismatches, gaps)
But it also accounts for the scoring system and statistical background
Because it’s normalized, you can compare bit scores across different searches, even with different databases or matrices

New cards

Why use both E-value and Bit Score?

E-value tells you how likely the match is to be a chance hit (lower = more likely real)
Bit score tells you how strong the match is
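
The relationship between raw score, bit score, and E-value can be sketched with the standard Karlin-Altschul formulas: bit score S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m is the query length and n the database length. The λ and K values below are illustrative assumptions (they depend on the scoring matrix and gap penalties), and real BLAST uses corrected "effective" lengths, so this will not reproduce BLAST output exactly.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score using the scoring system's lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2 ** (-bits)

# Illustrative parameters, not taken from any particular search.
LAMBDA, K = 0.267, 0.041
bits = bit_score(raw_score=200, lam=LAMBDA, k=K)
print(round(bits, 1))
print(e_value(bits, query_len=300, db_len=50_000_000))
```

Doubling the database size doubles the E-value for the same bit score, which is why the bit score is the better number to compare across searches.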

New cards

Was the Yersinia pestis genome assembled into one piece?

No.
The reference-guided assembly made 130,556 contigs (pieces of genome)
The N50 was only 288 bp → this means most contigs were very small

This tells us ancient DNA was highly degraded, so they couldn’t stitch it together cleanly

New cards

Was increased virulence due to mutations?

No unique mutations were found in the medieval DNA that aren’t also found in modern strains

So, the extreme death toll was probably not due to the bacteria being more deadly genetically

New cards

What might have caused the high death toll if not bacterial mutations?

Environmental conditions, vector behavior, and host vulnerability

New cards

What does a low N50 tell us about a genome assembly?

It means the assembly is fragmented — mostly short contigs

New cards

Why don’t scaffolds always cover the whole genome perfectly?

Gaps can result from low coverage, repetitive regions, or sequencing errors

New cards

Why use OLC?

Overlap – finding matching regions between reads
Layout – ordering and connecting those reads
Consensus – choosing the final base sequence from the overlaps

Good for long reads like PacBio or Nanopore, but slow for short reads because every read must be compared against every other read
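
The overlap step can be sketched as a suffix-prefix comparison run on every ordered pair of reads. This is a simplified illustration with made-up reads and exact matching only; real OLC assemblers tolerate mismatches and use indexing to avoid the naive all-vs-all loop shown here.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """All-vs-all comparison: one edge per ordered read pair that overlaps."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges

# Hypothetical reads.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(overlap_graph(reads))
# {('ACGTAC', 'TACGGA'): 3, ('TACGGA', 'GGATTT'): 3}
```

The number of pairs grows with the square of the read count, which is manageable for a modest number of long reads but costly for hundreds of millions of short reads.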

New cards

Why are de Bruijn graphs efficient for short reads?

Because they avoid all-vs-all read comparison and just work with k-mer overlaps
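
A minimal sketch of how such a graph is built: each read is chopped into k-mers, and each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. No read is ever compared with another read. The read and the value of k are made up, and edge multiplicity is dropped here for brevity, although real assemblers track k-mer counts.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # prefix node -> suffix node
    return graph

# Hypothetical read, k = 3.
graph = de_bruijn(["ACGTC"], k=3)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
# AC -> ['CG']
# CG -> ['GT']
# GT -> ['TC']
```

Building the graph takes a single pass over the reads, so the cost grows linearly with the amount of data.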

New cards

What’s a drawback of using de Bruijn graphs?

They are sensitive to sequencing errors and lose long-range information

New cards

Which method better resolves long repetitive regions, OLC or de Bruijn graph (DBG)?

OLC — because long reads can span the repeats

New cards

What type of assembly method do most modern genome projects use (OLC or DBG)?

Hybrid assembly — combining short and long read strategies

It combines the accuracy of short reads with the long-range context of long reads for better genome coverage

New cards

N50 calculation and logic

Total assembly size = 100 million bp
Add contigs from longest to shortest:
- 30 Mb, 15 Mb, 10 Mb, 7 Mb, 5 Mb…
- Stop once the running total reaches 50 million bp (half the assembly)
The shortest contig used to reach that halfway mark is your N50
Here 30 + 15 + 10 = 55 Mb passes the halfway mark, so N50 = 10 Mb

New cards

What is the difference between N50 and NG50?

N50 is based on the assembly size; NG50 is based on the known/reference genome size
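
Both statistics use the same procedure and differ only in what "half" refers to. A minimal sketch: the first five contig lengths follow the 30, 15, 10, 7, 5 Mb example, the remaining contigs are hypothetical filler so the assembly totals 100 Mb, and the 120 Mb reference size is also an assumption.

```python
def n50(contig_lengths, genome_size=None):
    """N50 of an assembly, or NG50 when a known genome size is supplied."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length        # shortest contig needed to reach the halfway mark
    return None                  # assembly covers less than half the genome size

# Lengths in Mb; the eleven 3 Mb contigs are made up so the total is 100 Mb.
contigs = [30, 15, 10, 7, 5] + [3] * 11

print(n50(contigs))                    # 10  (30 + 15 + 10 = 55 >= 50)
print(n50(contigs, genome_size=120))   # 7   (30 + 15 + 10 + 7 = 62 >= 60)
```

NG50 is lower here because the assembly is smaller than the assumed genome, so more contigs are needed to reach half of the larger target.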

New cards

If your N50 is very low, what does that mean about your assembly?

It’s fragmented — made up of mostly short contigs

New cards

What is an inversion in genome assembly?

A region that is flipped in direction compared to the reference

New cards

What does a high mismatch rate indicate?

Many single-base differences from the reference — possibly sequencing or assembly errors

New cards

What is a translocation error in genome assembly?

A sequence is assembled into the wrong location — possibly the wrong chromosome

New cards

What should you do to reads before running them through an assembler like SPAdes (a de Bruijn graph assembler)?

Pre-process reads by trimming low-quality sequences
Remove adapters and contaminants
Remove human contaminants
Normalize or subsample to even out read coverage (if you have too many reads)
Error correction (fix base mistakes)
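
The trimming step can be illustrated with a deliberately simplified sketch that cuts low-quality bases from the 3' end of a read. The threshold of 20 and the read are assumptions for illustration; dedicated trimming tools typically use sliding windows and also handle adapters, which this does not.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

# Hypothetical read whose quality drops off at the end, as is common for short reads.
seq = "ACGTAC"
quals = [38, 37, 35, 30, 12, 8]
print(trim_3prime(seq, quals))   # ('ACGT', [38, 37, 35, 30])
```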

New cards

Why might long k-mers fail in some assemblies?

If coverage is too low, longer k-mers can’t find overlaps and assemblies become fragmented

New cards

What’s an ab initio method?

A computational method that predicts genes based on sequence patterns and signals

New cards

What do homology-based methods rely on?

Sequence similarity to known genes in other species

New cards

Why is RNA-seq useful for gene finding?

It provides experimental evidence of which DNA regions are being actively transcribed

New cards

Which method is best when no related genome is available when it comes to gene finding?

Ab initio methods

New cards

Gene annotation

Once you’ve assembled a genome and maybe identified genes, you still need to label what those regions do. That’s annotation — and it can be done in two ways:

manual or computational

New cards

Manual Annotation (by experts):

Done by trained scientists, often using genome browsers (like UCSC or Ensembl)
They look at:
- Sequence context
- Homologs
- Expression data
- Functional evidence

New cards

Pros and cons of manual gene annotation

✅ Pros:

Accurate (humans can catch weird or subtle things computers miss)

❌ Cons:

Very slow
Labor-intensive
Results can be inconsistent between people

New cards

Computational Annotation (fully automated):

Uses algorithms to assign gene functions, locations, structures, etc.
May use ab initio models, homology, or RNA-seq data

New cards

Computational Annotation pros and cons

✅ Pros:

Fast and consistent
Great for annotating many genomes at once

❌ Cons:

Can be wrong or imprecise (e.g., missing start codons, splitting/merging genes incorrectly)

New cards

What’s the recommended strategy for genome annotation?

Combine computational methods with manual curation

New cards

Why is ab initio gene prediction easier in prokaryotes?

Because their genes don’t have introns and are often long, continuous ORFs
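
With no introns to worry about, a first-pass gene finder can simply scan for open reading frames. A minimal sketch that looks for ATG-to-stop stretches in the three forward reading frames; the sequence and minimum length are made up, the reverse strand is ignored, and alternative start codons are not considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each forward-strand ORF, stop codon included."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                                  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:     # length counts the stop codon
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                               # close it and keep scanning
    return orfs

# Hypothetical sequence.
print(find_orfs("CCATGAAATTTTAGGG"))   # [(2, 14, 'ATGAAATTTTAG')]
```

In eukaryotes the same scan would break at every intron, which is why those gene finders rely on statistical models of coding sequence and splice signals instead.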

New cards

How do ab initio methods work in eukaryotes despite introns?

They use models based on sequence patterns like nucleotide composition and codon usage

New cards

Which gene elements are hard to predict computationally?

Promoters and UTRs

New cards

Which approach can provide a full transcription unit (TU)?

Evidence-based gene finding (e.g., RNA-seq). It recovers the transcribed region, but not promoters.

New cards

Why is promoter prediction unreliable?

It requires experimental data to detect transcription start sites accurately

New cards

BLASTP = Protein BLAST

It compares your unknown protein to known proteins in a database
If your sequence looks a lot like a protein with a known function → you can guess its function

New cards

Reference Genome Approach: Annotation

Use when you have a closely related organism with a well-annotated genome
You compare your protein to the proteins in that organism’s database
If it's very similar, you can even use DNA-level comparisons

New cards

Universal Protein Databases

Use if no good reference genome exists
These databases have highly annotated proteins from many species

Popular databases:

✅ SwissProt – curated, very high quality
✅ UniProt – big database, includes SwissProt + unreviewed entries
✅ NCBI nr (non-redundant) – massive, less curated, but very inclusive

New cards

What is functional gene annotation?

Assigning a likely biological function to a predicted gene or protein

New cards

Why are domains important?

They tell you what a part of the protein does, even if the whole protein is new
They’re more evolutionarily stable than full sequences
Think of them like Lego blocks used in different proteins

New cards

What conclusion can you draw if a protein has a receptor domain and a kinase domain?

It may function as a receptor kinase, signaling from the outside to the inside of the cell

New cards

What is a motif in protein annotation?

A short, highly conserved sequence important for function, often part of the catalytic center

New cards

Ontology (in bioinformatics)

A fancy way to describe what parts exist in a biological system and how they’re related, in a way computers can understand.

Think of it like a biological family tree or mind map, but for:

Genes
Proteins
Cell parts
Functions

New cards

Who maintains the most widely used ontology in biology?

The Gene Ontology Consortium

New cards

What is a DAG?

DAG = Directed Acyclic Graph
↳ It’s like a hierarchy, but better:
- Arrows show direction (like from general to specific).
- It doesn’t loop back (acyclic).
- A term can have multiple parents (not just one!).
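
A toy example of the multiple-parent property. The term names below are simplified stand-ins chosen for illustration, not actual GO identifiers or the real GO structure. Each term points to its parents, and collecting every ancestor means following all of those arrows upward.

```python
# Child term -> set of parent terms (a toy graph, not real GO data).
PARENTS = {
    "mitochondrial membrane": {"mitochondrion", "membrane"},  # two parents
    "mitochondrion": {"cellular component"},
    "membrane": {"cellular component"},
}

def ancestors(term, parents):
    """All terms reachable by following parent links upward from term."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:      # a shared ancestor is visited only once
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("mitochondrial membrane", PARENTS)))
# ['cellular component', 'membrane', 'mitochondrion']
```

In a strict tree the child would have to pick one parent; the DAG lets it sit under both, so a gene annotated to the child is implicitly annotated to every ancestor.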

New cards

What does “part_of” mean?

It means one term is a component of another, but the bigger part doesn't always require the smaller one to exist.

New cards

Robust (Strong Experimental Evidence):

These evidence codes come from direct biological experiments — trusted!

New cards

Less Robust (Computational or Indirect):

These evidence codes may be useful, but they rely more on inference or author notes — less reliable.

New cards

Why aren't gene-to-GO mappings unique?

One gene can be linked to many GO terms (within different or even the same ontology categories: molecular function, cellular component, or biological process).

New cards

What’s a downside of in silico GO annotations?

They are often lower quality and less reliable than manually curated annotations.

New cards

What does RNA-Seq measure?

It measures the level of gene expression across the genome.

New cards

What must be normalized to compare RNA-seq expression?

To compare expression fairly, we must adjust for:

Sequencing Depth – Did one sample just get sequenced more?
Transcript Length – Longer genes get more hits just by size.
RNA Composition – Highly expressed genes can dominate the total reads.

New cards

Why can’t you directly compare raw read counts between genes? (RNA-seq)

Because differences might be due to gene length, sequencing depth, or RNA composition, not actual gene expression.

New cards

What does TPM stand for and why is it preferred?

TPM = Transcripts Per Million. It's preferred because the TPM values in every sample add up to the same total (one million), so each value is a gene's share of that sample's transcripts and can be compared more fairly across samples.

New cards

When would you use RPKM instead of TPM?

Historically, RPKM was used for single-end sequencing, but TPM is now favored even in those cases due to better comparability.

New cards

The Problem with RPKM:

normalizes for:

gene length ✅
sequencing depth ✅
BUT…

It does NOT ensure that the total expression across all genes in a sample is the same.

So, if two samples have different total transcript amounts (like in a disease vs healthy state), RPKM values can be distorted and not comparable across samples.
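
The difference comes down to the order of the two normalization steps. A minimal sketch with made-up counts and gene lengths: RPKM divides by sequencing depth using raw counts, while TPM first divides by gene length and then rescales so that every sample sums to one million.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [c / (l / 1e3) / (total_reads / 1e6) for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

# Hypothetical data: three genes, two samples; gene 3 is strongly induced in sample B.
lengths = [1000, 2000, 3000]
sample_a = [100, 200, 300]
sample_b = [100, 200, 3000]

for name, counts in (("A", sample_a), ("B", sample_b)):
    print(name, "RPKM total:", round(sum(rpkm(counts, lengths))),
          "TPM total:", round(sum(tpm(counts, lengths))))
```

The TPM totals are identical in both samples, while the RPKM totals differ, so the same RPKM value does not represent the same share of the transcriptome in each sample.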

New cards

Normalization methods by goal

Goal	Best Method
Comparing expression levels within a single sample	TPM
Comparing expression levels between samples or conditions	DESeq2 (or edgeR with TMM)
Visualizing general abundance, not for stats	TPM
Accurate DE analysis (statistical tests)	DESeq2 (requires raw counts + proper transformation)

New cards

Use CPM when:

You're comparing gene expression between replicates of the same group (e.g., how consistent is Gene X in 3 untreated samples?).
You want a quick visualization or filtering lowly expressed genes before DE analysis.
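
CPM (counts per million) corrects only for sequencing depth, which is why it suits quick checks and pre-filtering. A minimal sketch of both uses, with made-up counts; the cutoffs (CPM of at least 1 in at least 2 samples) are arbitrary assumptions, not a recommended setting.

```python
def cpm(counts):
    """Counts per million: scale each gene's count by the sample's library size."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def keep_expressed(samples, min_cpm=1.0, min_samples=2):
    """Indices of genes reaching min_cpm in at least min_samples samples."""
    per_sample = [cpm(sample) for sample in samples]
    n_genes = len(samples[0])
    return [
        g for g in range(n_genes)
        if sum(values[g] >= min_cpm for values in per_sample) >= min_samples
    ]

# Hypothetical counts: three replicates (rows) by three genes (columns).
replicates = [
    [0, 500, 999_500],
    [1, 400, 999_599],
    [0, 600, 999_400],
]
print(keep_expressed(replicates))   # [1, 2]  (gene 0 is too lowly expressed)
```

Only the surviving genes would then be passed, as raw counts, to a differential expression tool.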