Bioinformatics Study Guide Exam 2

Last updated 2:21 PM on 3/26/26

122 Terms

1
New cards

5 Stages of Phylogenetic Analysis:

  • selection of sequences: BLAST search results, protein families from Pfam, NCBI HomoloGene, etc.

  • multiple sequence alignment of homologous protein or nucleic acid sequences

  • specifying models of nucleotide or amino acid substitution

  • tree building: distance-based methods, max parsimony, max likelihood & Bayesian inference

  • tree evaluation

2
New cards

Distance-based methods:

analyze pairwise sequence alignments & use the distances to infer the relationships between all the taxa

  • UPGMA (unweighted-pair group method with arithmetic mean)

  • Neighbor-joining
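
As a rough sketch of the first step in a distance-based method, the snippet below computes pairwise p-distances (the fraction of sites that differ) from a toy alignment. The function name `p_distance`, the taxon names, and the sequences are made up for illustration; real tools such as MEGA or PHYLIP would normally correct these distances with a substitution model before building a UPGMA or neighbor-joining tree.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, gap-free sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical toy alignment (all sequences have the same length).
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTACGTTC",
    "taxonC": "ACGAACGTTG",
}

# Pairwise distance table: the input a distance-based method works from.
distances = {
    (s, t): p_distance(alignment[s], alignment[t])
    for s, t in combinations(alignment, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 2))
```

The closest pair in the table (here taxonA and taxonB) is the one a clustering method such as UPGMA would join first.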

3
New cards

Maximum parsimony:

a character-based method in which columns of residues are analyzed to identify the tree with the shortest overall branch length that can account for the observed character differences
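
To make the idea concrete, here is a minimal sketch of Fitch's small-parsimony count for a single alignment column, assuming a rooted binary tree written as nested tuples. It counts the minimum number of character changes a given tree needs, which is how tree length is scored under parsimony. The tree shapes and leaf names are illustrative only.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for a rooted binary tree.

    tree:   a leaf name (str) or a 2-tuple of subtrees
    states: maps each leaf name to its character in one alignment column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1


# One column of a hypothetical alignment for four taxa.
column = {"A": "G", "B": "G", "C": "T", "D": "T"}

tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))

print(fitch(tree1, column)[1])  # changes required by tree1
print(fitch(tree2, column)[1])  # changes required by tree2
```

Summing this count over all columns gives the length of a tree; maximum parsimony prefers the tree with the smallest total.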

4
New cards

Maximum likelihood & Bayesian inference:

model-based statistical methods to infer the best tree that can account for the observed data

5
New cards

Popular Software Tools for Phylogeny:

  • MEGA (molecular evolutionary genetics analysis)

  • PHYLIP (PHYLogeny inference package)

  • PAUP (phylogenetic analysis using parsimony)

  • TREE-PUZZLE (max likelihood method)

  • MrBayes (Bayesian estimation of phylogeny)

6
New cards

__ comprises all the RNA transcripts synthesized by an organism

transcriptome

7
New cards

Proteome:

the entire set of proteins translated

8
New cards

Metabolome:

refers to the sum total of all the low-molecular-weight metabolites

9
New cards

Experimental approaches of gene expression

  • DNA microarrays

  • RNA-seq

10
New cards

“Large p, small n” problem:

gene expression studies typically measure the expression levels of tens of thousands of genes (p) in only a few samples (n)

11
New cards

DNA microarrays & RNA-seq have been widely used to __

identify which genes are significantly up/down-regulated (differentially expressed)

12
New cards

Hypothesis testing:

inferential statistics; assign confidence to the discovery of regulated genes

13
New cards

Exploratory statistics:

define distances between genes; perform unsupervised analyses (clustering, PCA)

14
New cards

Classification:

perform supervised analyses (linear discriminants, support vector machines)

15
New cards

Affymetrix platforms for human microarrays

  • HG-U133 Plus 2.0:

    • 54,120 probe sets

    • multiple probe sets for some genes

  • HG-U133A:

    • 27,722 probe sets

    • well-characterized genes (RefSeq)

  • HG-U133B:

    • 22,577 probe sets

    • representing EST clusters

16
New cards

In RNA sequencing, lower # of reads = __

lower expression

17
New cards

In RNA sequencing, higher # of reads = _

higher expression

18
New cards

Sample-level/global normalization:

to remove the systematic bias in the data so that meaningful biological comparisons can be made

  • unequal quantities of starting RNA

  • experimental/technical variations

19
New cards

Normalization is based on the assumption that __

the total intensity distribution is comparable between 2 samples & the expression of a subset of genes is assumed to be constant

20
New cards

Why do we use sample-level/global normalization?

to allow valid cross-sample comparison & to minimize non-biological variation

21
New cards

Sample-level/global normalization method:

  • normalize all values for a sample so that median = 1

  • normalize to positive control genes

  • normalize to a constant value
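
The first option in the list above (scale a sample so that its median is 1) can be sketched as follows. The function name `median_normalize` and the intensity values are hypothetical.

```python
from statistics import median


def median_normalize(values):
    """Divide every value in one sample by that sample's median."""
    m = median(values)
    if m == 0:
        raise ValueError("median is zero; cannot scale")
    return [v / m for v in values]


# Hypothetical raw intensities for five genes in one sample.
sample = [120.0, 80.0, 100.0, 400.0, 50.0]
normalized = median_normalize(sample)

print(normalized)
print(median(normalized))
```

Applying the same scaling to each sample puts all samples on a common scale before they are compared.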

22
New cards

Gene-level normalization:

rescales all genes to the same normalized value range & thus enables comparison of relative expression levels

23
New cards

Z transformation:

for each gene, calculate the z scores of the expression values

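$z_{x_i} = (x_i - \bar{x}) / \sigma_x$, where $\bar{x}$ is the gene's mean expression across samples and $\sigma_x$ is its standard deviation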
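
A minimal sketch of the z transformation for one gene, assuming the population standard deviation (`pstdev`) is intended; the card does not say whether the sample or population form is used. The expression values are invented.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Convert one gene's expression values to z scores."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("gene has constant expression; z score undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical expression values for one gene across eight samples.
gene = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = z_transform(gene)

print(z_scores)
```

After the transformation every gene has mean 0 and standard deviation 1, so genes with very different absolute levels can be compared on the same scale.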

24
New cards

__ normalizes distribution

log transformation

25
New cards

In RNA-Seq Data Preprocessing and Normalization, experimental design needs to include __

sufficient replicates to measure biological variability

26
New cards

RNA-Seq Data Preprocessing and Normalization steps:

  • experimental design

  • RNA acquisition

  • data acquisition

  • mapping

  • summarization

  • normalization

27
New cards

RPKM/FPKM (Reads/Fragments per Kilobase of transcript per Million mapped reads):

  • counts are first normalized for sequencing depth

  • counts are then normalized for gene length

  • FPKM for paired-end RNA-seq data
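
The two normalization steps can be written out as below. This is only a sketch: the function name, gene names, counts, lengths, and the total of 2,000,000 mapped reads are all hypothetical.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    per_million = count / (total_mapped_reads / 1_000_000)  # step 1: sequencing depth
    return per_million / (length_bp / 1_000)                # step 2: gene length


# Hypothetical read counts and gene lengths (bp).
counts = {"geneA": 500, "geneB": 1000, "geneC": 500}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 500}
total_mapped = 2_000_000

rpkm_values = {g: rpkm(counts[g], lengths[g], total_mapped) for g in counts}
print(rpkm_values)
```

Note that geneB has the most reads but the lowest RPKM, because its reads are spread over a longer transcript.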

28
New cards

TPM (transcripts per million):

  • proposed as an alternative to RPKM/FPKM

  • technology-independent measure of expression
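
For comparison with RPKM/FPKM, a sketch of the usual TPM calculation: normalize for gene length first, then scale so that the values in a sample sum to one million. The function name and the toy counts and lengths are assumptions for illustration.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample."""
    # Step 1: reads per kilobase for each gene.
    rates = {g: counts[g] / (lengths_bp[g] / 1_000) for g in counts}
    # Step 2: scale so that all values sum to one million.
    scale = sum(rates.values()) / 1_000_000
    return {g: r / scale for g, r in rates.items()}


# Hypothetical read counts and gene lengths (bp).
counts = {"geneA": 500, "geneB": 1000, "geneC": 500}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 500}

tpm_values = tpm(counts, lengths)
print(tpm_values)
print(sum(tpm_values.values()))
```

Because every sample sums to the same total, TPM values are easier to compare across samples than RPKM values.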

29
New cards

Mapping software used to align reads that span splice junctions (reads Bowtie leaves unmapped) to a reference genome is __ in RNA-Seq Data Preprocessing and Normalization

TopHat

30
New cards

Mapping software used to align millions of short reads to a reference genome is __ in RNA-Seq Data Preprocessing and Normalization

Bowtie

31
New cards

Advantages of RNA-seq

  • not limited to detection of known gene transcripts

  • little to no background signal

  • can detect large dynamic range of expression levels

32
New cards

RNA-seq can reveal information about __ resulting from alternative splicing

different transcript isoforms

33
New cards

RNA-seq can be used to discover __ such as lncRNAs

novel transcripts

34
New cards

Bowtie:

extremely fast, general purpose short read aligner

35
New cards

TopHat:

fast splice junction mapper for RNA-seq reads; aligns reads to the genome using Bowtie & discovers splice sites

36
New cards

Cufflinks:

assembles transcripts

37
New cards

Cuffcompare:

compares transcript assemblies to annotation

38
New cards

Cuffmerge:

merges 2 or more transcript assemblies

39
New cards

Cuffdiff:

finds differentially expressed genes and transcripts & detects differential splicing and promoter use

40
New cards

Bioconductor

an open source software project, based on R, to provide tools for the analysis of high-throughput genomic data

41
New cards

GEO2R

a web-based tool for comparing two or more groups of Samples in a GEO Series to identify differentially expressed genes. It performs comparisons on the original submitter-supplied processed data using R packages from Bioconductor

42
New cards

MeV (MultiExperiment Viewer)

a versatile tool (web-based or standalone) for expression data analysis, with sophisticated algorithms for statistical analysis, clustering, visualization, and classification

43
New cards

Experimental design

Compare normal vs diseased tissue, cells ± drug, early vs late development

44
New cards

RNA preparation

Isolate total RNA or mRNA

45
New cards

Microarrays

Fluorescently label cRNA samples, then preprocess the data (normalization, scatter plots)

46
New cards

RNA-seq

Make a cDNA library for each sample and align the sequenced reads to the genome or gene models; assemble transcripts

47
New cards

Inferential statistics

Identify significantly regulated transcripts, e.g., using ANOVA

48
New cards

Exploratory analyses

Scatter plots, principal components analysis

49
New cards

Other analyses

Classification, co-regulated genes

50
New cards

Biological confirmation

Independently confirm that genes are regulated e.g. by RT-PCR

51
New cards

Deposit data in a database

GEO, ArrayExpress, ENA, SRA

52
New cards

Fold change

Uses a fold change threshold (e.g., 2-fold) to select genes; does not take into account the biological and experimental variability
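
A small sketch of fold-change selection using a 2-fold cutoff (|log2 fold change| >= 1). The gene names and expression values are invented, and `log2_fold_change` is a hypothetical helper.

```python
import math
from statistics import mean


def log2_fold_change(control, treated):
    """log2 of the ratio of mean expression (treated over control)."""
    return math.log2(mean(treated) / mean(control))


# Hypothetical replicate measurements: (control, treated).
expression = {
    "geneA": ([100.0, 110.0, 90.0], [400.0, 420.0, 380.0]),
    "geneB": ([50.0, 55.0, 45.0], [52.0, 49.0, 55.0]),
}

# Keep genes that change at least 2-fold in either direction.
selected = [
    gene
    for gene, (control, treated) in expression.items()
    if abs(log2_fold_change(control, treated)) >= 1.0
]
print(selected)
```

The cutoff only looks at the means, so a noisy gene and a tightly reproducible gene with the same ratio are treated identically; that is the weakness the card points out.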

53
New cards

Statistical tests

Such as t test and ANOVA; require a number of replicates for each condition

54
New cards

Bonferroni correction

  • Set the significance cutoff, p' = α / N, where α is the desired false positive rate and N is the number of genes tested

  • If you have 10,000 genes in your dataset and α = 0.05, p' = 0.05 / 10000 = 0.000005 (5E-6)

  • Equivalently, calculate the adjusted p value, p_adjusted = p * N (capped at 1), and compare it with α
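
The cutoff and the adjusted p values can be computed as in the sketch below. The function name and the four p values are illustrative; capping the adjusted value at 1 is a common convention rather than something stated on the card.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (cutoff, adjusted p values, significance flags)."""
    n = len(p_values)
    cutoff = alpha / n
    adjusted = [min(1.0, p * n) for p in p_values]  # cap at 1 (convention)
    significant = [p <= cutoff for p in p_values]
    return cutoff, adjusted, significant


# Hypothetical raw p values for four genes.
p_values = [0.001, 0.01, 0.02, 0.5]
cutoff, adjusted, significant = bonferroni(p_values)

print(cutoff)
print(adjusted)
print(significant)
```

Testing `p <= cutoff` and testing `adjusted <= alpha` select the same genes; the adjusted form is simply easier to report.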

55
New cards

False Discovery Rate (FDR)

  • Rank all the genes by significance (p value) so that the top gene has the most significant p value

  • Start from the top of the list and check whether $p_i \le (i/N)\,q$; in the Benjamini-Hochberg procedure, every gene ranked at or above the largest i that passes this test is accepted

  • i = the rank of the gene in the list, N = the number of genes in the dataset, q = the desired FDR
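
A minimal sketch of this ranking rule, written as the Benjamini-Hochberg step-up procedure (accept every gene up to the largest rank i whose p value satisfies p <= (i/N)q). The function name and the p values are made up.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one True/False flag per gene, in the original order."""
    n = len(p_values)
    # Gene indices ordered from most to least significant.
    order = sorted(range(n), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            max_rank = rank  # remember the largest rank that passes
    accepted = [False] * n
    for idx in order[:max_rank]:
        accepted[idx] = True
    return accepted


# Hypothetical raw p values for eight genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
flags = benjamini_hochberg(p_values, q=0.05)
print(flags)
```

With these values only the two smallest p values pass, whereas an uncorrected cutoff of 0.05 would have accepted five genes.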

56
New cards

ANOVA (ANalysis Of VAriance)

Used to find significant genes in more than two conditions

57
New cards

Clustering Analysis

Divide a dataset into a few groups (clusters)

58
New cards

Homogeneity

Objects in the same cluster are similar to each other

59
New cards

Separation

Dissimilar objects are placed in different clusters

60
New cards

Expression Vector

Each gene can be represented as a vector in the N-dimensional hyperspace, where N is the number of samples

61
New cards

Euclidean distance

$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$, for expression vectors x and y measured over N samples
62
New cards

Vector angle

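$\cos\theta = \dfrac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\,\sqrt{\sum_{i=1}^{N} y_i^2}}$, where θ is the angle between expression vectors x and y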
63
New cards

Pearson’s correlation coefficient

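$r = \dfrac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$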
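
The three similarity measures on the preceding cards (Euclidean distance, vector angle, Pearson's correlation coefficient) can be compared on a pair of toy expression vectors. This is a sketch using the standard textbook definitions; the function names and data are illustrative.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def vector_angle(x, y):
    """Angle (radians) between two expression vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = dot / (norm_x * norm_y)
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error


def pearson(x, y):
    """Pearson's correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Two hypothetical genes with the same pattern at different magnitudes.
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [2.0, 4.0, 6.0, 8.0]

print(euclidean(gene1, gene2))
print(vector_angle(gene1, gene2))
print(pearson(gene1, gene2))
```

The two genes are far apart by Euclidean distance but identical in pattern according to the angle and the correlation, which is why the choice of measure matters for clustering.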
64
New cards

Initialization (Hierarchical Clustering Algorithm)

Each object is a cluster

65
New cards

Iteration (Hierarchical Clustering Algorithm)

Merge two clusters which are most similar to each other until all objects are merged into a single cluster
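
The initialization and iteration steps can be sketched as a small agglomerative loop. Average linkage is assumed here as the similarity between clusters, since the cards do not name a linkage rule; the points are toy 2D data and the function name is hypothetical.

```python
import math
from itertools import combinations


def hierarchical_clustering(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Initialization: each object is its own cluster (stored as index tuples).
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def linkage(c1, c2):
        # Assumption: average linkage (mean pairwise distance between members).
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Iteration: merge the two most similar clusters until one remains.
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical 2D data: two tight pairs that are far from each other.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
merges = hierarchical_clustering(points)

for a, b, d in merges:
    print(a, b, round(d, 2))
```

The merge history, read in order, is exactly the information a dendrogram draws: early merges sit low in the tree and the final merge forms the root.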

66
New cards

Hierarchical Clustering

Results are often visualized using a tree (called a dendrogram) alongside color-coded gene expression levels. Can be applied to genes, samples, or both

67
New cards

Initialization (k-Means Clustering)

The user defines k (the number of clusters); k vectors (called centroids) are placed randomly in the data space

68
New cards

Iteration (k-Means Clustering)

Each object is assigned to its closest centroid; each centroid is then recomputed as the mean of the data vectors currently assigned to its cluster. Repeat until the cluster centroids no longer change
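
A compact sketch of the k-means loop described on these two cards. One assumption to flag: the starting centroids are drawn from the data points themselves (a common choice) rather than placed anywhere in the data space. The function name, seed, and points are illustrative.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, clusters) after the centroids stop changing."""
    rng = random.Random(seed)
    # Initialization (assumption): use k randomly chosen data points.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assign each object to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members.
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: centroids no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2D data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)

print(centroids)
print(clusters)
```

On less clear-cut data the result can depend on the random start, so k-means is often run several times with different seeds.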

69
New cards

Self-Organizing Map (SOM)

  • The user defines an initial geometry of nodes (reference vectors) for the partitions such as a 3 x 2 rectangular grid

  • During the iterative “training” process, the nodes migrate to fit the gene expression data

  • The genes are mapped to the most similar reference vector

70
New cards

Gene Co-expression Network Analysis

  • A gene co-expression network is an undirected graph, in which each node is a gene, and each edge represents a significant co-expression relationship between two nodes

  • It can be constructed by looking for pairs of genes which show a similar expression pattern across various samples

  • Does not attempt to infer the causal relationships between genes, and thus is different from a gene regulatory network
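
As an illustration of how such a network might be assembled, the sketch below links two genes when the Pearson correlation of their expression profiles reaches a threshold. The threshold of 0.9 stands in for a proper significance test, and the gene names and profiles are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient of two expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, threshold=0.9):
    """Undirected edges between genes whose profiles correlate strongly."""
    return [
        (g, h)
        for g, h in combinations(profiles, 2)
        if pearson(profiles[g], profiles[h]) >= threshold
    ]


# Hypothetical expression profiles across four samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 3.0],
}

edges = coexpression_edges(profiles)
print(edges)
```

Each edge only says that two genes vary together; it carries no direction, which is the difference from a gene regulatory network.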

71
New cards

How to Assess Whether a Gene Ontology (GO) Term Is Enriched in a Gene List?

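Compare the fraction of genes annotated with the GO term in the gene list with the fraction in a background set (e.g., all genes in the genome), and test whether the difference is significant (e.g., with Fisher's exact test)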
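
One common way to put a number on GO term enrichment is a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below is not DAVID's modified test; the function name and the counts are hypothetical.

```python
from math import comb


def enrichment_p_value(background_size, background_hits, list_size, list_hits):
    """P(X >= list_hits) when list_size genes are drawn at random from a
    background in which background_hits genes carry the GO term."""
    upper = min(background_hits, list_size)
    favorable = sum(
        comb(background_hits, i)
        * comb(background_size - background_hits, list_size - i)
        for i in range(list_hits, upper + 1)
    )
    return favorable / comb(background_size, list_size)


# Toy numbers: 20 background genes, 5 annotated with the term;
# a list of 4 genes contains 3 of the annotated ones.
p = enrichment_p_value(background_size=20, background_hits=5,
                       list_size=4, list_hits=3)
print(round(p, 4))
```

A small p value means the gene list contains more genes with the term than random sampling from the background would be expected to produce.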

72
New cards

Database for Annotation, Visualization, and Integrated Discovery (DAVID)

  • Can be used to extract biological meaning from large lists of genes

  • DAVID’s functional enrichment analysis of a gene list is based on a modified version of the Fisher’s exact test

73
New cards

How to Represent a Promoter Motif?

  • Multiple sequence alignment

  • Consensus: e.g., TATAAAA (the TATA box)

  • Position Weight Matrix (PWM): relative frequencies of nucleotides at different positions

  • Sequence logo: information content of each site (a measure of intolerance for substitution)
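
The information content plotted in a sequence logo can be sketched for a single DNA column as 2 bits minus the column's Shannon entropy. This omits the small-sample correction that logo software usually applies; the function name and frequencies are illustrative.

```python
import math


def information_content(column_freqs):
    """Information content (bits) of one DNA motif column."""
    entropy = -sum(p * math.log2(p) for p in column_freqs.values() if p > 0)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA


# Hypothetical columns: fully conserved, two-way split, and uniform.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
split = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(information_content(conserved))
print(information_content(split))
print(information_content(uniform))
```

A fully conserved position carries the maximum of 2 bits (no substitution tolerated), while a position where all four nucleotides are equally likely carries none.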

74
New cards

PWM Representation of a Motif

  • A motif is assumed to have a fixed width, W

  • In the PWM, $p_{n,k}$ is the probability (relative frequency) of nucleotide n in column k

  • Background probability: $p_{n,0}$ is the probability of n in the background (i.e., outside the motif)

  • Equal distribution: $p_{A,0} = p_{C,0} = p_{G,0} = p_{T,0} = 1/4$
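
A small sketch of building a PWM from aligned sites of fixed width W and scoring a candidate window against the equal-distribution background. The pseudocount (to avoid zero probabilities) and the log-odds score are additions not stated on the card; the sites and function names are invented.

```python
import math
from collections import Counter


def build_pwm(sites, pseudocount=1.0):
    """One dict per column giving the relative frequency of A, C, G, T."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for k in range(width):
        counts = Counter(site[k] for site in sites)
        # Assumption: add a pseudocount so unseen nucleotides are not zero.
        pwm.append({n: (counts[n] + pseudocount) / total for n in "ACGT"})
    return pwm


def log_odds_score(pwm, window, background=0.25):
    """Sum over columns of log2(motif probability / background probability)."""
    return sum(math.log2(pwm[k][n] / background) for k, n in enumerate(window))


# Hypothetical aligned binding sites (W = 6).
sites = ["TATAAA", "TATAAT", "TATATA", "TACAAA"]
pwm = build_pwm(sites)

print(round(log_odds_score(pwm, "TATAAA"), 2))
print(round(log_odds_score(pwm, "GCGCGC"), 2))
```

A positive score means the window looks more like the motif than like background sequence; a negative score means the opposite.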

75
New cards

Pattern matching

  • Scanning a nucleotide or protein sequence for matches to a known pattern

  • How to get better sensitivity and specificity is the major consideration
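
Pattern matching with a known pattern can be as simple as a regular-expression scan. In the sketch below the degenerate pattern `TATA[AT]A[AT]` is an assumed, commonly quoted form of the TATA box consensus, and the sequence is made up; the lookahead lets overlapping matches be reported.

```python
import re


def find_motif(sequence, pattern):
    """Return (start, matched_text) for every match, including overlaps."""
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Assumed degenerate TATA box pattern and a hypothetical promoter fragment.
TATA_PATTERN = "TATA[AT]A[AT]"
sequence = "GGCTATAAAAGGCTATATATGC"

hits = find_motif(sequence, TATA_PATTERN)
print(hits)
```

Loosening the pattern finds more true sites (higher sensitivity) but also more chance matches (lower specificity), which is the trade-off the card refers to.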

76
New cards

Pattern discovery

  • Given a set of sequences, discovering a pattern that is shared by the sequences.

  • The pattern is not known in advance

  • Using search or learning approaches

  • A much harder problem than pattern matching

77
New cards

Multiple EM for Motif Elicitation (MEME)

  • Widely used for discovery of DNA and protein sequence motifs

  • It is based on the Expectation Maximization (EM) algorithm with several extensions

  • MEME is now complemented by the GLAM2 algorithm which allows discovery of motifs containing gaps

78
New cards

Protein Structure

  • Proteins are very complex molecules with diverse functions

  • Levels of protein structure:

  1. Primary structure

  2. Secondary structure

  3. Tertiary structure

  4. Quaternary structure

  • Simple images highlighting specific features are useful:

  1. Space-filling models

  2. Ribbon cartoon models

79
New cards

Protein Primary Structure

  • Amino acid sequence of a polypeptide chain

  • 20 amino acids, each with a different side chain (R)

  • Peptide units are building blocks of protein structures

  • The angle of rotation around the N−Cα bond is called phi, and the angle around the Cα−C′ bond from the same Cα atom is called psi

80
New cards

Protein Secondary Structures

  • Local structures as a result of hydrogen bond formation between the carbonyl and N-H groups in the polypeptide backbone (backbone interactions)

  • Types of secondary structures:

  1. Alpha helix

  2. Beta sheet

  3. Loop or random coil

  • Secondary structure formation is influenced by several properties (e.g., size and charge) of amino acid side chains

81
New cards

Alpha Helix

  • Most abundant secondary structure

  • 3.6 amino acid residues per turn; a hydrogen bond forms between the backbone C=O of residue i and the N-H of residue i+4

  • Proline (with no N-H group) and glycine (too small) do not foster alpha helix formation

82
New cards

Beta Sheet

  • Two or more polypeptide segments (beta strands) line up side by side

  • Hydrogen bonds formed between adjacent strands

  • The chain directions can be same (parallel sheet), opposite (antiparallel), or mixed

  • Antiparallel beta sheets are more stable than parallel beta sheets

83
New cards

Loop or Coil

  • Regions between alpha helices and beta sheets

  • Various lengths and 3D configurations

  • Often functionally significant (e.g., part of an active site)

84
New cards

Supersecondary Structures (Motifs)

  • Many proteins contain supersecondary structures (motifs) with combinations of alpha-helices and beta-sheets

  • In the beta-alpha-beta motif, two beta-strands and one alpha-helix are connected by loops

85
New cards

Protein Tertiary Structure

  • The unique three-dimensional structure formed by a globular protein

  • Stabilized by hydrophobic interactions, hydrogen bonds, and other interactions

86
New cards

Important Features of Tertiary Structures

  • Many polypeptides fold in a way to bring distant amino acid residues in the primary structure into close proximity

  • Globular proteins are compact because of efficient packing as the polypeptide folds

  • Enough hydrophobic surface must be buried, and the interior must be sufficiently packed

  • Buried polar atoms must be hydrogen-bonded to other buried polar atoms

  • Large globular proteins often contain several compact units called domains

87
New cards

Protein Domains

  • Structurally independent segments that have specific functions

  • The core 3D structure of a domain is called a fold: a certain type of 3D arrangement of secondary structures

  • Domains are classified on the basis of their core structure:

    • Alpha: composed exclusively of alpha-helices

    • Beta: consists of antiparallel beta-strands

    • Alpha/beta: contains various combinations of alpha-helices and beta-strands

88
New cards

Protein Quaternary Structure

  • Two or more polypeptide chains (subunits) form a larger protein complex

  • Protein subunits are often held together by non-covalent interactions, including hydrophobic interactions (most important), electrostatic interactions, and hydrogen bonds

  • Important for understanding protein-protein interactions

89
New cards

Unstructured Proteins (Regions)

  • Some proteins are partially or completely unstructured

  • Unstructured proteins (regions) are referred to as intrinsically disordered proteins (regions)

  • Over 30% of eukaryotic proteins are partially or completely disordered and have a variety of functions

  • The disordered segments (e.g., KID domain of CREB) may be involved in searching out binding partners

90
New cards

X-Ray Crystallography

  • Basic steps: Expression/purification, Crystallization, X-ray diffraction, Structure solution

  • Advantages: High-resolution structures, large protein complexes or membrane proteins

  • Disadvantages: Requirement for crystals, molecules in a solid-state (crystal) environment

91
New cards

Nuclear Magnetic Resonance (NMR)

  • Reveals information on the distances between atoms in a molecule, and these distances can be used to derive a 3D model of the molecule

  • Advantages: No requirement for crystals, proteins in a liquid state (near physiological state)

  • Disadvantages: Limited by molecule size (up to 30 kD), inherently less precise than X-ray crystallography, membrane proteins may not be studied

92
New cards

Cryogenic Electron Microscopy (Cryo-EM)

  • A beam of electrons is fired at a frozen protein solution. The scattered electrons pass through a lens to create a magnified image on the detector, from which the protein's structure can be worked out

  • Advantages: No crystal requirement for large complexes, structure remains in native state (no dehydration)

  • Disadvantages: Relatively low resolution (but improving), 3D structure reconstruction from 2D images

93
New cards

Protein Data Bank (PDB)

  • Established at Brookhaven National Laboratory in 1971, initially with seven structures

  • Was moved to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998

  • The RCSB PDB (https://www.rcsb.org/) has been the primary repository for 3D structural data of proteins, nucleic acids, and complexes

  • The Worldwide PDB (wwPDB, http://www.wwpdb.org/) was formed in 2003 to maintain a single PDB archive of macromolecular structural data

94
New cards

RCSB PDB

  • PDB supports services for structure submission, search, retrieval, and visualization

  • As of 3/6/2026, PDB contained 250,441 experimental structures and 1,068,577 computed structure models

95
New cards

Search RCSB PDB

  • Basic search using a PDB identifier or keywords: a PDB ID consists of one digit followed by three alphanumeric characters (e.g., 4HHB for a human hemoglobin structure; extended form pdb_00004hhb)

  • Advanced search with specific attributes or data types, sequence search (BLAST/PSI-BLAST, or FASTA)
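
The identifier format can be checked with a short regular expression. This is a sketch based only on the example on the card (4HHB and pdb_00004hhb); the helper names are hypothetical, and the zero-padding rule for the extended form is inferred from that single example.

```python
import re

# One digit followed by three letters or digits, as described on the card.
CLASSIC_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def is_classic_pdb_id(text):
    """True if the text has the four-character PDB ID shape."""
    return CLASSIC_ID.fullmatch(text) is not None


def to_extended_id(pdb_id):
    """Hypothetical helper: pad a classic ID to the extended form."""
    if not is_classic_pdb_id(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    # Assumption: lowercase the ID and left-pad with zeros to 8 characters.
    return "pdb_" + pdb_id.lower().rjust(8, "0")


print(is_classic_pdb_id("4HHB"))
print(is_classic_pdb_id("HHB4"))
print(to_extended_id("4HHB"))
```

A check like this only validates the shape of an identifier; whether the entry exists still has to be confirmed by searching RCSB PDB.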

96
New cards

Protein Structure File Formats

PDB supports the download of protein structural data in the following text file formats:

  • PDB file format: outdated but human-readable

  • PDBx / mmCIF: simple and consistent data representation for exchanging and archiving structural data, used by PDB to store its files

  • PDBML / XML: a modern and robust file format

97
New cards

Access to Structures through NCBI

  • MMDB (Molecular Modeling Database)

  1. Structures obtained from PDB

  2. Data in NCBI’s ASN.1 format

  3. Integrated into NCBI’s Entrez system

  • Cn3D (“see in 3D”): NCBI’s protein structure viewer

  • VAST (Vector Alignment Search Tool): for direct comparison of 3D protein structures to identify structural neighbors

98
New cards

RasMol and RasTop

  • RasMol: An open-source software package, which was a breakthrough in 3D structure visualization. It is widely used to view 3D protein structures.

  • Structure file formats supported by RasMol:

  1. PDB file format

  2. mmCIF file format

  • RasTop: Provides a graphical user interface to RasMol

99
New cards

Other 3D Visualization Tools

  • Jmol: An interactive web-browser Java applet to view chemical structures in 3D, and JSmol is a JavaScript-based extension

  • Cn3D: Can be used for interactive exploration of 3D structures, sequences, and alignments

  • Swiss-Pdb Viewer (DeepView): Probably the most powerful freely available molecular modeling and visualization package. Supports homology modeling, site-directed mutagenesis, structure superposition, etc.

100
New cards

Root-mean-square deviation (RMSD)

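$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2}$, where $d_i$ is the distance between the i-th pair of equivalent atoms in two superimposed structures and N is the number of atom pairs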
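
A minimal sketch of the RMSD calculation, assuming the two structures are already superimposed and their atoms are listed in matching order. The coordinates are toy values, not taken from a real PDB entry.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical atom coordinates (angstroms); structure_b is shifted 1 A along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]

print(rmsd(structure_a, structure_b))
print(rmsd(structure_a, structure_a))
```

In practice the structures are first optimally superimposed (rotated and translated), since RMSD computed on unaligned coordinates mostly reflects their arbitrary placement in space.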
