Bioinformatics Study Guide Exam 2

Last updated 2:21 PM on 3/26/26

122 Terms

New cards

5 Stages of Phylogenetic Analysis:

selection of sequences: BLAST search results, protein families from Pfam, NCBI HomoloGene etc
multiple sequence alignment of homologous protein or nucleic acid sequences
specifying models of nucleotide or amino acid substitution
tree building: distance-based methods, max parsimony, max likelihood & Bayesian inference
tree evaluation

New cards

Distance-based methods:

analyze pairwise sequence alignments & use the distances to infer the relationships between all the taxa

UPGMA (unweighted-pair group method with arithmetic mean)
Neighbor-joining
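
A minimal sketch of UPGMA, one of the distance-based methods listed above. The taxa and distances are made up for illustration, and the function name `upgma` and the nested-tuple tree layout are assumptions of this sketch, not part of any package named in this guide.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa with UPGMA; `dist` maps frozenset({a, b}) -> distance.

    Returns a nested tuple (left, right, height) describing the tree.
    """
    # Each active cluster: key -> (subtree, number of leaves)
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of active clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = a + "+" + b
        # Distance to every other cluster: size-weighted arithmetic mean
        for c in clusters:
            d[frozenset((merged, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[merged] = ((tree_a, tree_b, height), n_a + n_b)
    return next(iter(clusters.values()))[0]

# Made-up pairwise distances between four taxa
pairwise = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
tree = upgma(["A", "B", "C", "D"], pairwise)
```

A and B are joined first because they are the closest pair; C joins next and D last. Each height is half the distance between the two clusters being merged.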

New cards

Maximum parsimony:

a character-based method in which columns of residues are analyzed to identify the tree with the shortest overall branch length that can account for the observed character differences

New cards

Maximum likelihood & Bayesian inference:

model-based statistical methods to infer the best tree that can account for the observed data

New cards

Popular Software Tools for Phylogeny:

MEGA (molecular evolutionary genetics analysis)
PHYLIP (PHYLogeny inference package)
PAUP (phylogenetic analysis using parsimony)
TREE-PUZZLE (max likelihood method)
MrBayes (bayesian estimation of phylogeny)

New cards

__ comprises all the RNA transcripts synthesized by an organism

transcriptome

New cards

Proteome:

the entire set of proteins translated

New cards

Metabolome:

refers to the sum total of all the low-molecular-weight metabolites

New cards

Experimental approaches of gene expression

DNA microarrays
RNA-seq

New cards

“Large p, small n” problem:

gene expression studies typically measure the expression levels of tens of thousands of genes (large p) in only a few samples (small n)

New cards

DNA microarrays & RNA-seq have been widely used to __

identify which genes are significantly up/down-regulated (differentially expressed)

New cards

Hypothesis testing:

inferential statistics; assign confidence to the discovery of regulated genes

New cards

Exploratory statistics:

define distances between genes; perform unsupervised analyses (clustering, PCA)

New cards

Classification:

perform supervised analyses (linear discriminants, support vector machines)

New cards

Affymetrix platforms for human microarrays

HG-U133 Plus 2.0:
- 54,120 probe sets
- multiple probe sets for some genes
HG-U133A:
- 27,722 probe sets
- well-characterized genes (RefSeq)
HG-U133B:
- 22,577 probe sets
- representing EST clusters

New cards

In RNA sequencing, lower # of reads = __

lower expression

New cards

In RNA sequencing, higher # of reads = _

higher expression

New cards

Sample-level/global normalization:

to remove the systematic bias in the data so that meaningful biological comparisons can be made; sources of bias include:

unequal quantities of starting RNA
experimental/technical variations

New cards

Normalization is based on the assumption that __

the total intensity distribution is comparable between 2 samples & the expression of a subset of genes is assumed to be constant

New cards

Why do we use sample-level/global normalization?

to allow valid cross-sample comparison & to minimize non-biological variation

New cards

Sample-level/global normalization method:

normalize all values for a sample so that median = 1
normalize to positive control genes
normalize to a constant value
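
A minimal sketch of the first method above (scaling a sample so that its median equals 1). The intensity values are made up, and `median_normalize` is a hypothetical helper name.

```python
from statistics import median

def median_normalize(sample):
    """Scale one sample's expression values so that their median equals 1."""
    m = median(sample)
    if m == 0:
        raise ValueError("median is zero; cannot scale this sample")
    return [value / m for value in sample]

sample_a = [50.0, 100.0, 200.0, 400.0, 800.0]     # made-up intensities
sample_b = [100.0, 200.0, 400.0, 800.0, 1600.0]   # same profile, twice the signal
norm_a = median_normalize(sample_a)
norm_b = median_normalize(sample_b)
```

After scaling, the two samples are identical, which is the intended effect when the only difference between them is the amount of starting material.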

New cards

Gene-level normalization:

rescales all genes to the same normalized value range & thus enables comparison of relative expression levels

New cards

Z transformation:

for each gene, calculate the z scores of the expression values

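z_i = \frac{x_i - \bar{x}}{\sigma_x}, where \bar{x} is the mean and \sigma_x is the standard deviation of the gene's expression values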
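
A short sketch of the z transformation for a single gene. The expression values are made up, and the use of the population standard deviation (`pstdev`) is an assumption; the sample standard deviation is also commonly used.

```python
from statistics import mean, pstdev

def z_transform(values):
    """Return z scores for one gene's expression values across samples."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; stdev() is also common
    if sigma == 0:
        raise ValueError("gene has constant expression; z scores are undefined")
    return [(x - mu) / sigma for x in values]

gene = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up expression values
z_scores = z_transform(gene)
```

The transformed values have mean 0 and standard deviation 1, so genes with very different absolute expression levels end up on the same scale.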

New cards

__ normalizes distribution

log transformation

New cards

In RNA-Seq Data Preprocessing and Normalization, experimental design needs to include __

sufficient replicates to measure biological variability

New cards

RNA-Seq Data Preprocessing and Normalization steps:

experimental design
RNA acquisition
data acquisition
mapping
summarization
normalization

New cards

RPKM/FPKM (Reads/Fragments per Kilobase of transcript per Million mapped reads):

counts are first normalized for sequencing depth
counts are then normalized for gene length
FPKM for paired-end RNA-seq data
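
A minimal sketch of the two RPKM steps listed above. The gene names, counts, and lengths are toy values, and the sketch assumes the counts shown are all of the mapped reads in the sample.

```python
def rpkm(counts, lengths_bp):
    """Return RPKM per gene from raw read counts and gene lengths in bp."""
    total_mapped = sum(counts.values())      # assumption: every mapped read is listed
    per_million = total_mapped / 1_000_000   # sequencing-depth scaling factor
    result = {}
    for gene, count in counts.items():
        reads_per_million = count / per_million                         # step 1: depth
        result[gene] = reads_per_million / (lengths_bp[gene] / 1_000)   # step 2: length
    return result

toy_counts = {"geneA": 100, "geneB": 300, "geneC": 600}      # made-up read counts
toy_lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}  # made-up lengths (bp)
rpkm_values = rpkm(toy_counts, toy_lengths)
```

For paired-end data the same arithmetic applies with fragments counted instead of reads, which gives FPKM.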

New cards

TPM (transcripts per million):

proposed as an alternative to RPKM/FPKM
technology-independent measure of expression
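
A minimal sketch of a TPM calculation, using toy counts and lengths. TPM normalizes for gene length first and for sequencing depth second, so the values in each sample sum to one million. The helper name `tpm` is hypothetical.

```python
def tpm(counts, lengths_bp):
    """Return TPM per gene from raw read counts and gene lengths in bp."""
    # Step 1: reads per kilobase of transcript (length normalization)
    rates = {gene: count / (lengths_bp[gene] / 1_000) for gene, count in counts.items()}
    # Step 2: rescale so that all rates in the sample sum to one million
    scale = sum(rates.values()) / 1_000_000
    return {gene: rate / scale for gene, rate in rates.items()}

toy_counts = {"geneA": 100, "geneB": 300, "geneC": 600}      # made-up read counts
toy_lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}  # made-up lengths (bp)
tpm_values = tpm(toy_counts, toy_lengths)
```

Because every sample sums to the same total, a TPM value can be read as a gene's share of the transcript pool in that sample.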

New cards

Mapping software used to align unmapped reads to a reference genome is __ in RNA-Seq Data Preprocessing and Normalization

TopHat

New cards

Mapping software used to align millions of short reads to a reference genome is __ in RNA-Seq Data Preprocessing and Normalization

Bowtie

New cards

Advantages of RNA-seq

not limited to detection of known gene transcripts
little to no background signal
can detect large dynamic range of expression levels

New cards

RNA-seq can reveal information about __ resulting from alternative splicing methods

different transcript isoforms

New cards

RNA-seq can be used to discover __ such as lncRNAs

novel transcripts

New cards

Bowtie:

extremely fast, general purpose short read aligner

New cards

TopHat:

fast splice junction mapper for RNA-seq reads; aligns reads to the genome using Bowtie & discovers splice sites

New cards

Cufflinks:

assembles transcripts

New cards

Cuffcompare:

compares transcript assemblies to annotation

New cards

Cuffmerge:

merges 2 or more transcript assemblies

New cards

Cuffdiff:

finds differentially expressed genes and transcripts & detects differential splicing and promoter use

New cards

Bioconductor

an open source software project, based on R, to provide tools for the analysis of high-throughput genomic data

New cards

GEO2R

a web-based tool for comparing two or more groups of Samples in a GEO Series to identify differentially expressed genes. It performs comparisons on the original submitter-supplied processed data using R packages from Bioconductor

New cards

MeV (MultiExperiment Viewer)

a versatile tool (web-based or standalone) for expression data analysis, with sophisticated algorithms for statistical analysis, clustering, visualization, and classification

New cards

Experimental design

Compare normal vs diseased tissue, cells ± drug, early vs late development

New cards

RNA preparation

Isolate total RNA or mRNA

New cards

Microarrays

Fluorescently label cRNA samples and preprocess the data (normalization, scatter plots)

New cards

RNA-seq

Make a cDNA library for each sample and align reads to the genome or gene models; assemble transcripts

New cards

Inferential statistics

Identify significantly regulated transcripts, e.g., using ANOVA

New cards

Exploratory analyses

Scatter plots, principal components analysis

New cards

Other analyses

Classification, co-regulated genes

New cards

Biological confirmation

Independently confirm that genes are regulated e.g. by RT-PCR

New cards

Deposit data in a database

GEO, ArrayExpress, ENA, SRA

New cards

Fold change

Uses a fold change threshold (e.g., 2-fold) to select genes; does not take into account the biological and experimental variability

New cards

Statistical tests

Such as t test and ANOVA; require a number of replicates for each condition

New cards

Bonferroni correction

Set the significance cutoff p' = α / N, where α is the desired overall false positive rate and N is the number of genes tested
Example: with 10,000 genes and α = 0.05, p' = 0.05 / 10,000 = 0.000005 (5 × 10^-6)
Equivalently, calculate an adjusted p value for each gene: p_adjusted = min(1, p × N)
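
A short sketch of the Bonferroni arithmetic above. The p values are made up, the helper name `bonferroni` is hypothetical, and capping adjusted p values at 1 is a common convention assumed here.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (cutoff, adjusted p values, significance flags)."""
    n = len(p_values)
    cutoff = alpha / n                               # p' = alpha / N
    adjusted = [min(1.0, p * n) for p in p_values]   # p_adjusted = p * N, capped at 1
    significant = [p <= cutoff for p in p_values]
    return cutoff, adjusted, significant

toy_p = [0.001, 0.01, 0.02, 0.5]   # made-up raw p values for four genes
cutoff, adjusted, significant = bonferroni(toy_p)
```

Comparing raw p values with the cutoff and comparing adjusted p values with α select the same genes.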

New cards

False Discovery Rate (FDR)

Rank all the genes by significance (p value) so that the top gene has the most significant p value
Start from the top of the list, and accept gene i if: p_i ≤ (i/N) × q
i = the rank of the gene in the list, N = the number of genes in the dataset, q = the desired FDR
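
A minimal sketch of the ranking rule above, written as the Benjamini-Hochberg step-up procedure, which this card appears to describe. That identification is an assumption. In the step-up form, every gene ranked at or above the largest rank satisfying p ≤ (i/N)q is accepted, even if a gene higher in the list missed its own threshold. The p values are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of the genes accepted at FDR level q."""
    n = len(p_values)
    # Rank genes from most to least significant
    order = sorted(range(n), key=lambda idx: p_values[idx])
    last_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / n) * q:
            last_passing_rank = rank
    # Accept every gene ranked at or above the last passing rank
    return sorted(order[:last_passing_rank])

toy_p = [0.004, 0.028, 0.025, 0.9, 0.035]   # made-up p values for five genes
accepted = benjamini_hochberg(toy_p, q=0.05)
```

Here the gene with p = 0.025 sits at rank 2, where the threshold is 0.02, yet it is accepted because the gene at rank 4 (p = 0.035, threshold 0.04) passes.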

New cards

ANOVA (ANalysis Of VAriance)

Used to find significant genes in more than two conditions

New cards

Clustering Analysis

Divide a dataset into a few groups (clusters)

New cards

Homogeneity

Objects in the same cluster are similar to each other

New cards

Separation

Dissimilar objects are placed in different clusters

New cards

Expression Vector

Each gene can be represented as a vector in the N-dimensional hyperspace, where N is the number of samples

New cards

Euclidean distance

New cards

Vector angle

New cards

Pearson’s correlation coefficient
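
The three cards above name measures for comparing expression vectors without defining them. The sketch below uses the standard textbook definitions, which is an assumption about what the course intends; the vectors are made up.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_of_angle(x, y):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson_correlation(x, y):
    """Pearson's r: the cosine of the angle between the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_of_angle([a - mean_x for a in x], [b - mean_y for b in y])

gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_1, twice the magnitude
gene_3 = [4.0, 3.0, 2.0, 1.0]   # opposite trend
```

gene_1 and gene_2 are far apart by Euclidean distance but point in the same direction, so the angle-based and correlation-based measures treat them as highly similar.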

New cards

Initialization (Hierarchical Clustering Algorithm)

Each object is a cluster

New cards

Iteration (Hierarchical Clustering Algorithm)

Merge two clusters which are most similar to each other until all objects are merged into a single cluster

New cards

Hierarchical Clustering

Results are often visualized using a tree (called dendrogram) with color-coded gene expression levels. Can be applied to genes, samples, or both

New cards

Initialization (k-Means Clustering)

The user defines k (the number of clusters); k vectors (called centroids) are then placed randomly in the data space

New cards

Iteration (k-Means Clustering)

Assign each object to its closest centroid, then recompute each centroid as the mean of the data vectors currently assigned to its cluster; repeat until the cluster centroids no longer change
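
A minimal sketch of the k-means loop described above. Starting from k randomly chosen data points (rather than arbitrary positions in the data space) and keeping the old centroid when a cluster becomes empty are assumptions of this sketch; the data points are made up.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until the centroids stop moving."""
    rng = random.Random(seed)
    # Assumption: initialize centroids from k randomly chosen data points
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its closest centroid
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = k_means(points, k=2)
```

The result depends on the random starting centroids, and k-means can settle into a poor local optimum, so in practice it is often run several times with different seeds.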

New cards

Self-Organizing Map (SOM)

The user defines an initial geometry of nodes (reference vectors) for the partitions such as a 3 x 2 rectangular grid
During the iterative “training” process, the nodes migrate to fit the gene expression data
The genes are mapped to the most similar reference vector

New cards

Gene Co-expression Network Analysis

A gene co-expression network is an undirected graph, in which each node is a gene, and each edge represents a significant co-expression relationship between two nodes
It can be constructed by looking for pairs of genes which show a similar expression pattern across various samples
Does not attempt to infer the causal relationships between genes, and thus is different from a gene regulatory network
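
A minimal sketch of building a co-expression network from an expression table. Using a fixed Pearson correlation threshold (0.9 here) to stand in for "significant" co-expression is an assumption, and the gene names and values are made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def coexpression_edges(expression, threshold=0.9):
    """Return undirected edges (gene pairs) whose correlation meets the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        if pearson(expression[g1], expression[g2]) >= threshold:
            edges.append((g1, g2))
    return edges

expression = {                      # made-up expression across five samples
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [3, 1, 4, 1, 5],
}
edges = coexpression_edges(expression)
```

Each edge is an unordered pair, which matches the undirected graph in the definition: the network records that two genes vary together, not which one regulates the other.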

New cards

How to Assess Whether a Gene Ontology (GO) Term Is Enriched in a Gene List?

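Compare the fraction of genes in the list that are annotated with the GO term against the fraction annotated with it in a background set (e.g., all genes in the annotation database), and test whether the difference is statistically significant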

New cards

Database for Annotation, Visualization, and Integrated Discovery (DAVID)

Can be used to extract biological meaning from large lists of genes
DAVID’s functional enrichment analysis of a gene list is based on a modified version of the Fisher’s exact test
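
A minimal sketch of an enrichment test for one term, using the plain one-sided Fisher's exact test (hypergeometric tail probability). This is not DAVID's modified version, and the gene counts in the example are made up.

```python
from math import comb

def enrichment_p_value(background_size, background_hits, list_size, list_hits):
    """P(X >= list_hits) when list_size genes are drawn at random from the background.

    background_size: number of genes in the background set
    background_hits: background genes annotated with the term
    list_size:       number of genes in the gene list
    list_hits:       genes in the list annotated with the term
    """
    total = comb(background_size, list_size)
    upper = min(background_hits, list_size)
    favorable = sum(
        comb(background_hits, i) * comb(background_size - background_hits, list_size - i)
        for i in range(list_hits, upper + 1)
    )
    return favorable / total

# Made-up example: 20 background genes, 5 carry the term; a list of 4 contains 3 of them
p = enrichment_p_value(20, 5, 4, 3)
```

A small p value means the gene list contains more genes with the term than random sampling from the background would be expected to give.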

New cards

How to Represent a Promoter Motif?

Multiple sequence alignment
Consensus: e.g., TATAAAA (the TATA box)
Position Weight Matrix (PWM): relative frequencies of nucleotides at different positions
Sequence logo: information content of each site (a measure of intolerance for substitution)

New cards

PWM Representation of a Motif

A motif is assumed to have a fixed width, W
In the PWM, p_{n,k} is the probability (relative frequency) of nucleotide n in column k
Background probability: p_{n,0} is the probability of n in the background (i.e., outside the motif)
Equal distribution: p_{A,0} = p_{C,0} = p_{G,0} = p_{T,0} = 1/4
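
A minimal sketch of building a PWM from aligned sites of fixed width and scoring a sequence against the equal background distribution. The binding sites are made up, and the pseudocount (added so that no probability is zero) and the log-odds score are assumptions of this sketch.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Return one {nucleotide: relative frequency} dict per motif column."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for k in range(width):
        column = [site[k] for site in sites]
        pwm.append({n: (column.count(n) + pseudocount) / total for n in "ACGT"})
    return pwm

def log_odds_score(pwm, seq, background=0.25):
    """Sum over columns of log2(motif probability / background probability)."""
    return sum(math.log2(col[n] / background) for col, n in zip(pwm, seq))

sites = ["TATAAAA", "TATAAAT", "TATATAA", "TATAAAA"]   # made-up aligned sites
pwm = build_pwm(sites)
motif_like = log_odds_score(pwm, "TATAAAA")
unrelated = log_odds_score(pwm, "GCGCGCG")
```

A positive score means the sequence is more likely under the motif model than under the background; a negative score means the opposite.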

New cards

Pattern matching

Scanning a nucleotide or protein sequence for matches to a known pattern
How to get better sensitivity and specificity is the major consideration

New cards

Pattern discovery

Given a set of sequences, discovering a pattern that is shared by the sequences.
The pattern is not known in advance
Using search or learning approaches
A much harder problem than pattern matching

New cards

Multiple EM for Motif Elicitation (MEME)

Widely used for discovery of DNA and protein sequence motifs
It is based on the Expectation Maximization (EM) algorithm with several extensions
MEME is now complemented by the GLAM2 algorithm which allows discovery of motifs containing gaps

New cards

Protein Structure

Proteins are very complex molecules with diverse functions
Levels of protein structure:

Primary structure
Secondary structure
Tertiary structure
Quaternary structure

Simple images highlighting specific features are useful:

Space-filling models
Ribbon cartoon models

New cards

Protein Primary Structure

Amino acid sequence of a polypeptide chain
20 amino acids, each with a different side chain (R)
Peptide units are building blocks of protein structures
The angle of rotation around the N−Cα bond is called phi, and the angle around the Cα−C′ bond from the same Cα atom is called psi

New cards

Protein Secondary Structures

Local structures as a result of hydrogen bond formation between the carbonyl and N-H groups in the polypeptide backbone (backbone interactions)
Types of secondary structures:

Alpha helix
Beta sheet
Loop or random coil

Secondary structure formation is influenced by several properties (e.g., size and charge) of amino acid side chains

New cards

Alpha Helix

Most abundant secondary structure
3.6 amino acid residues per turn, with a hydrogen bond formed between each residue and the residue four positions further along the chain (i and i+4)
Proline (with no N-H group) and glycine (too small) do not foster alpha helix formation

New cards

Beta Sheet

Two or more polypeptide chains line up side by side
Hydrogen bonds formed between adjacent strands
The chain directions can be same (parallel sheet), opposite (antiparallel), or mixed
Antiparallel beta sheets are more stable than parallel beta sheets

New cards

Loop or Coil

Regions between alpha helices and beta sheets
Various lengths and 3D configurations
Often functionally significant (e.g., part of an active site)

New cards

Supersecondary Structures (Motifs)

Many proteins contain supersecondary structures (motifs) with combinations of alpha-helices and beta-sheets
In the beta-alpha-beta motif, two beta-strands and one alpha-helix are connected by loops

New cards

Protein Tertiary Structure

The unique three-dimensional structure formed by a globular protein
Stabilized by hydrophobic interactions, hydrogen bonds, and other interactions

New cards

Important Features of Tertiary Structures

Many polypeptides fold in a way to bring distant amino acid residues in the primary structure into close proximity
Globular proteins are compact because of efficient packing as the polypeptide folds
Enough hydrophobic surface must be buried, and the interior must be sufficiently packed
Buried polar atoms must be hydrogen-bonded to other buried polar atoms
Large globular proteins often contain several compact units called domains

New cards

Protein Domains

Structurally independent segments that have specific functions
The core 3D structure of a domain is called a fold
A certain type of 3D arrangement of secondary structures
Domains are classified on the basis of their core structure:
Alpha: composed exclusively of alpha-helices
Beta: consists of antiparallel beta-strands
Alpha/beta: contains various combinations of alpha-helices and beta-strands

New cards

Protein Quaternary Structure

Two or more polypeptide chains (subunits) form a larger protein complex
Protein subunits are often held together by non-covalent interactions, including hydrophobic interactions (most important), electrostatic interactions, and hydrogen bonds
Important for understanding protein-protein interactions

New cards

Unstructured Proteins (Regions)

Some proteins are partially or completely unstructured
Unstructured proteins (regions) are referred to as intrinsically disordered proteins (regions)
Over 30% of eukaryotic proteins are partially or completely disordered and have a variety of functions
The disordered segments (e.g., KID domain of CREB) may be involved in searching out binding partners

New cards

X-Ray Crystallography

Basic steps: Expression/purification, Crystallization, X-ray diffraction, Structure solution
Advantages: High-resolution structures, large protein complexes or membrane proteins
Disadvantages: Requirement for crystals, molecules in a solid-state (crystal) environment

New cards

Nuclear Magnetic Resonance (NMR)

Reveals information on the distances between atoms in a molecule, and these distances can be used to derive a 3D model of the molecule
Advantages: No requirement for crystals, proteins in a liquid state (near physiological state)
Disadvantages: Limited by molecule size (up to 30 kD), inherently less precise than X-ray crystallography, membrane proteins may not be studied

New cards

Cryogenic Electron Microscopy (Cryo-EM)

A beam of electrons is fired at a frozen protein solution. The emerging scattered electrons pass through a lens to create a magnified image on the detector, from which the protein's structure can be worked out
Advantages: No crystal requirement for large complexes, structure remains in native state (no dehydration)
Disadvantages: Relatively low resolution (but improving), 3D structure reconstruction from 2D images

New cards

Protein Data Bank (PDB)

Established at Brookhaven National Laboratory in 1971, initially with seven structures
Was moved to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998
The RCSB PDB (https://www.rcsb.org/) has been the primary repository for 3D structural data of proteins, nucleic acids, and complexes
The Worldwide PDB (wwPDB, http://www.wwpdb.org/) was formed in 2003 to maintain a single PDB archive of macromolecular structural data

New cards

RCSB PDB

PDB supports services for structure submission, search, retrieval, and visualization
As of 3/6/2026, the PDB contained 250,441 experimental structures and 1,068,577 computed structure models

New cards

Search RCSB PDB

Basic search using a PDB identifier or keywords: a classic PDB ID consists of one digit followed by three letters or digits (e.g., 4HHB for a human hemoglobin structure; extended form pdb_00004hhb)
Advanced search with specific attributes or data types, sequence search (BLAST/PSI-BLAST, or FASTA)
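
A small sketch of checking a classic four-character PDB ID and converting it to the extended form shown in the example above. The pattern follows the card's description (one digit, then three letters or digits), the helper names are hypothetical, and the conversion rule (lowercase, left-padded with zeros to eight characters) is inferred from the single example pdb_00004hhb.

```python
import re

# One digit followed by three letters or digits, as the card describes
CLASSIC_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")

def is_classic_pdb_id(text):
    """True if `text` has the shape of a classic four-character PDB ID."""
    return CLASSIC_ID.fullmatch(text) is not None

def to_extended_id(classic_id):
    """Convert e.g. '4HHB' to 'pdb_00004hhb' (rule inferred from the card's example)."""
    if not is_classic_pdb_id(classic_id):
        raise ValueError(f"not a classic PDB ID: {classic_id!r}")
    return "pdb_" + classic_id.lower().zfill(8)

extended = to_extended_id("4HHB")
```

A check like this only validates the shape of an identifier; it does not confirm that an entry with that ID exists in the archive.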

New cards

Protein Structure File Formats

PDB supports the download of protein structural data in the following text file formats:

PDB file format: outdated but human-readable
PDBx / mmCIF: simple and consistent data representation for exchanging and archiving structural data, used by PDB to store its files
PDBML / XML: a modern and robust file format

New cards

Access to Structures through NCBI

MMDB (Molecular Modeling Database)

Structures obtained from PDB
Data in NCBI’s ASN.1 format
Integrated into NCBI’s Entrez system

Cn3D (“see in 3D”): NCBI’s protein structure viewer
VAST (Vector Alignment Search Tool): for direct comparison of 3D protein structures to identify structural neighbors

New cards

RasMol and RasTop

RasMol: An open-source software package, which was a breakthrough in 3D structure visualization. It is widely used to view 3D protein structures.

Structure file formats supported by RasMol:

PDB file format
mmCIF file format

RasTop: Provides a graphical user interface to RasMol

New cards

Other 3D Visualization Tools

Jmol: An interactive web-browser Java applet to view chemical structures in 3D, and JSmol is a JavaScript-based extension
Cn3D: Can be used for interactive exploration of 3D structures, sequences, and alignments
Swiss-Pdb Viewer (DeepView): Probably the most powerful freely available molecular modeling and visualization package. Supports homology modeling, site-directed mutagenesis, structure superposition, etc.

New cards

Root-mean-square deviation (RMSD)