Genomics - DNA-binding proteins and motif analysis

34 Terms

1
Jacob & Monod

operon model: the repressor encoded at the lac locus (lacI) binds the operator and regulates the rate of protein synthesis

2
protein-dna interaction sequence specificity

-there is no universal code

-electrostatics, hydrogen bonds, water-mediated contacts, and hydrophobic packing

-sequence-specific DNA deformations (indirect readout)

-requires the determination of binding preferences for many members of a TF family

3
Sequence motif

-subsequence with some specific function

-may be in DNA, RNA, protein

-function may be context dependent: e.g. a ribosome binding site only functions once it has been transcribed

-may be gapped or ungapped

4
Consensus sequence pattern

-may include degenerate bases and allow for mismatches

-search space is over possible patterns

-difficult to obtain an optimal consensus for identifying novel sites

-relative frequencies of the bases at each position are lost

5
weight matrix

-search space is over possible alignments

-more information than a consensus sequence

-many ways to determine the weights

-assumes positional independence (might go to higher-order models to relax this)

-requires significant data

6
pattern based algorithms

-can use motif length, number of mismatches allowed, and number of sequences as parameters

-4^l possible patterns of length l; search for the most common or most significant
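
A minimal sketch of the pattern-based idea, assuming ungapped patterns over A/C/G/T and "most common" meaning "found in the most sequences". The function names and the toy sequences are made up for illustration; a real tool would also test for statistical significance.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_pattern_support(seqs, length, max_mismatches):
    """For each of the 4**length patterns, count how many sequences contain
    at least one window within max_mismatches of the pattern."""
    support = {}
    for letters in product("ACGT", repeat=length):
        pattern = "".join(letters)
        n = 0
        for seq in seqs:
            windows = (seq[i:i + length] for i in range(len(seq) - length + 1))
            if any(hamming(pattern, w) <= max_mismatches for w in windows):
                n += 1
        support[pattern] = n
    return support

# Toy sequences (invented), each containing TGACGT somewhere.
seqs = ["TTGACGTCAT", "CCTGACGTCA", "GGGTGACGTA", "ATGACGTCCC"]
support = count_pattern_support(seqs, length=6, max_mismatches=0)
best = max(support, key=support.get)
print(best, support[best])
```

The search space grows as 4^l, so this brute-force enumeration is only practical for short patterns.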

7
N(b,j)

raw counts for a weight matrix model, i.e. the number of times each base showed up in each position

b = base (A, C, T, G), row index

j = jth position in a sequence, column index

8
F(b,j)

frequency (weighted) scores for weight matrix model

F(b,j) = N(b,j) / (sum of N over all bases in column j): divide the raw counts in each position by the column total to get the fraction of each base in each position (each column should sum to 1)

9
S(b,j)

log-odds score in weight matrix model, normalized by the background probability

S(b,j) = log2[F(b,j)/P(b)]

P(b) = background base distribution
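
The three matrices N(b,j), F(b,j) and S(b,j) can be sketched as follows. The aligned sites are invented, the background P(b) is assumed uniform, and the helper names are hypothetical. No pseudocounts are used here, so unseen bases get a score of minus infinity.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """N(b,j): number of times base b occurs at position j."""
    width = len(sites[0])
    return {b: [sum(s[j] == b for s in sites) for j in range(width)] for b in BASES}

def frequency_matrix(counts):
    """F(b,j): counts divided by the column total, so each column sums to 1."""
    width = len(counts["A"])
    totals = [sum(counts[b][j] for b in BASES) for j in range(width)]
    return {b: [counts[b][j] / totals[j] for j in range(width)] for b in BASES}

def log_odds_matrix(freqs, background):
    """S(b,j) = log2[F(b,j) / P(b)]; -inf where F(b,j) is 0."""
    return {
        b: [math.log2(f / background[b]) if f > 0 else float("-inf") for f in freqs[b]]
        for b in BASES
    }

def score_site(site, S):
    """Score of a candidate site: sum of S(b,j) over its positions."""
    return sum(S[b][j] for j, b in enumerate(site))

sites = ["TGACGT", "TGACGT", "TGACGA", "TGTCGT"]  # made-up aligned sites
background = {b: 0.25 for b in BASES}              # assumed uniform P(b)
N = count_matrix(sites)
F = frequency_matrix(N)
S = log_odds_matrix(F, background)
print(score_site("TGACGT", S))
```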

10
Information content

IC = sum over columns j and rows b of F(b,j) * log2[F(b,j)/P(b)]

measures the divergence of the empirical distribution F(b,j) from the background base distribution P(b)

aka relative entropy, Kullback-Leibler divergence (distance)
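
As a small worked example of the sum, assuming a uniform background and the convention that terms with F(b,j) = 0 contribute nothing: a column that is always A contributes 2 bits, and a column that matches the background contributes 0 bits. The function name and the toy matrix are illustrative only.

```python
import math

def information_content(freqs, background):
    """Sum over columns j and bases b of F(b,j) * log2[F(b,j) / P(b)]."""
    width = len(next(iter(freqs.values())))
    total = 0.0
    for j in range(width):
        for b, row in freqs.items():
            f = row[j]
            if f > 0:  # treat 0 * log(0) as 0
                total += f * math.log2(f / background[b])
    return total

# Column 0 is always A; column 1 is uniform (same as the background).
F = {"A": [1.0, 0.25], "C": [0.0, 0.25], "G": [0.0, 0.25], "T": [0.0, 0.25]}
P = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ic = information_content(F, P)
print(ic)
```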

11
pseudocounts

entries of 0 in the count matrix cause problems because log(0) is undefined

There are often not enough observations to see every possible base at every position

Can add pseudocounts to the matrix to ensure there are no zeros
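
One simple way to do this, sketched under the assumption that the same pseudocount is added to every cell (other schemes weight the pseudocount by the background distribution). The function name and the counts are made up.

```python
def smoothed_frequencies(counts, pseudocount=1.0):
    """Add a pseudocount to every cell, then normalize each column to sum to 1."""
    bases = list(counts)
    width = len(counts[bases[0]])
    freqs = {b: [] for b in bases}
    for j in range(width):
        total = sum(counts[b][j] for b in bases) + pseudocount * len(bases)
        for b in bases:
            freqs[b].append((counts[b][j] + pseudocount) / total)
    return freqs

# Invented count matrix with several zero entries.
counts = {"A": [4, 0], "C": [0, 2], "G": [0, 2], "T": [0, 0]}
F = smoothed_frequencies(counts, pseudocount=1.0)
print(F["A"][0], F["C"][0])
```

Every smoothed frequency is strictly positive, so log2[F(b,j)/P(b)] is defined for all cells.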

12
protein binding microarrays

-can be used for defining PWM

-custom arrays of 60-mer DNA sequences (~44,000 probes)

-contain all possible 10 bp sequences

-each probe contains 27 10-mers

-8-mers guaranteed to occur 16 times

13
ChIP-seq

-can be used for defining PWM

-cross link protein to DNA

-affinity purify protein-DNA complexes

-reverse cross-links

-identify bound sequences by high-throughput sequencing (ChIP-seq) or by hybridization to a microarray (ChIP-chip)

14
bacterial one-hybrid

-method for defining PWM

-genetic selection: survival is dependent on DNA binding

-TF of interest is fused to the alpha subunit of RNA polymerase

-randomized library of binding sites is created and screened for autoactivation (self-activating sequences are removed)

-library is co-transformed with the TF and cells are selected

-library complexity is limited by transformation efficiency

15
high-throughput SELEX

method for defining PWM

-incubate pure protein with high complexity DNA library

-pull down DNA-protein complexes

-amplify and sequence

16
JASPAR

open-access repository for DNA-protein binding data

ChIP, PBM, SELEX

>50 different species spanning most clades

extract PWMs for downstream analysis

17
HOCOMOCO

HOmo sapiens COmprehensive MOdel COllection

public repository of human-specific DNA-protein binding data

coverage of nearly all human DNA-binding domain classes

18
statistical definition of motif finding

given some sequences, find over-represented substrings (motif discovery)

19
biological definition of motif finding

given some co-regulated promoters, find transcription factor binding model

20
class I motif finding algorithms

planted motif problem: single species, multiple genes

-random background sequences

-proper description of a consensus motif gives better models

-randomly plant copies of the motif into sequences

-define an objective function, and use a search algorithm to find the copies that give a good score

21
exhaustive algorithm

not very tractable

construct every possible combination of alignments and keep the one with the highest information content

given a motif of width w, and k sequences of length l, there are L = (l-w+1) possible locations in each sequence, and L^k alignments to check
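
A brute-force sketch of the search described above, workable only for tiny inputs. It assumes a uniform background, exactly one site per sequence, and invented toy sequences; `exhaustive_search` and `alignment_ic` are hypothetical helper names.

```python
import math
from itertools import product

def alignment_ic(sites, bg=0.25):
    """Information content of an ungapped alignment of equal-length sites."""
    n, width = len(sites), len(sites[0])
    ic = 0.0
    for j in range(width):
        for b in "ACGT":
            f = sum(s[j] == b for s in sites) / n
            if f > 0:
                ic += f * math.log2(f / bg)
    return ic

def exhaustive_search(seqs, w):
    """Try every combination of one start per sequence; keep the highest IC."""
    ranges = [range(len(s) - w + 1) for s in seqs]
    best_ic, best_starts = -1.0, None
    for starts in product(*ranges):
        sites = [s[a:a + w] for s, a in zip(seqs, starts)]
        ic = alignment_ic(sites)
        if ic > best_ic:
            best_ic, best_starts = ic, starts
    return best_starts, best_ic

seqs = ["AATGACGT", "TGACGTCC", "CTGACGTA"]  # toy sequences, l = 8
w = 6
starts, ic = exhaustive_search(seqs, w)
n_alignments = math.prod(len(s) - w + 1 for s in seqs)  # L**k with L = l - w + 1
print(starts, ic, n_alignments)
```

Here L = 3 and k = 3, so only 27 alignments are checked. With 20 sequences of length 500 and w = 10 the count would be 491^20, which is why the exhaustive approach is not tractable.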

22
greedy algorithm (consensus)

-assume every sequence contains at least one true binding site

-for each l-mer, find its best match in another sequence to generate 2-sequence alignments

-use the top K PWMs to search the remaining sequences and add a new sequence to each alignment

-repeat until all sequences contribute
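
The greedy idea can be sketched as a simple beam search. This is a simplification, not the CONSENSUS program itself: sequences are added in a fixed order, alignments are ranked by information content with a uniform background, and `top_k` and the toy sequences are assumptions.

```python
import math

def alignment_ic(sites, bg=0.25):
    """Information content of an ungapped alignment of equal-length sites."""
    n, width = len(sites), len(sites[0])
    ic = 0.0
    for j in range(width):
        for b in "ACGT":
            f = sum(s[j] == b for s in sites) / n
            if f > 0:
                ic += f * math.log2(f / bg)
    return ic

def greedy_consensus(seqs, w, top_k=5):
    """Grow alignments one sequence at a time, keeping only the top_k best."""
    # Seed: every window of the first sequence is a 1-site alignment.
    beam = [[a] for a in range(len(seqs[0]) - w + 1)]
    for idx in range(1, len(seqs)):
        candidates = [
            starts + [a]
            for starts in beam
            for a in range(len(seqs[idx]) - w + 1)
        ]
        # Rank each extended alignment by its information content.
        candidates.sort(
            key=lambda st: alignment_ic([s[a:a + w] for s, a in zip(seqs, st)]),
            reverse=True,
        )
        beam = candidates[:top_k]
    return beam[0]

seqs = ["AATGACGT", "TGACGTCC", "CTGACGTA"]  # toy sequences
best_starts = greedy_consensus(seqs, w=6)
print(best_starts)
```

Because only `top_k` partial alignments survive each round, the work grows linearly with the number of sequences instead of exponentially, at the cost of possibly discarding the true alignment early.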

23
MEME

Multiple EM for Motif Elicitation (EM = expectation maximization)

-initial “seed” PWM

-use the current PWM to determine the probability of every position being a site

-re-estimate the PWM based on the full set of those probabilities

-continue until convergence; always converges, but possibly only to a local maximum

-EM is deterministic, so the result depends on the initial seed and may not be the global maximum

-for this reason, EM should be run multiple times with different seeds
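
A minimal EM sketch of the steps above. It is not MEME itself: it assumes exactly one site per sequence, a uniform background, a fixed number of iterations instead of a convergence test, and a seed PWM built from a seed word (0.7 for the seed base, 0.1 for the others). All names and sequences are made up.

```python
BASES = "ACGT"

def em_motif(seqs, seed, iterations=20, pseudocount=0.1, bg=0.25):
    w = len(seed)
    # Seed PWM: one dict of base probabilities per motif position.
    pwm = [{b: (0.7 if b == c else 0.1) for b in BASES} for c in seed]
    posts = []
    for _ in range(iterations):
        # E-step: probability that each position of each sequence is the site.
        posts = []
        for s in seqs:
            likes = []
            for a in range(len(s) - w + 1):
                like = 1.0
                for j in range(w):
                    like *= pwm[j][s[a + j]] / bg
                likes.append(like)
            z = sum(likes)
            posts.append([x / z for x in likes])
        # M-step: re-estimate the PWM from ALL positions, weighted by probability.
        counts = [{b: pseudocount for b in BASES} for _ in range(w)]
        for s, post in zip(seqs, posts):
            for a, p in enumerate(post):
                for j in range(w):
                    counts[j][s[a + j]] += p
        pwm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
    return pwm, posts

seqs = ["AATGACGTCC", "CTGACGTAAG", "GGCATGACGT", "TGACGTTCAA"]  # toy data
pwm, posts = em_motif(seqs, seed="TGACGT")
consensus = "".join(max(col, key=col.get) for col in pwm)
site_starts = [max(range(len(p)), key=p.__getitem__) for p in posts]
print(consensus, site_starts)
```

Rerunning with a different `seed` word is the simplest way to see the seed sensitivity: the same input always gives the same answer for a given seed.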

24
Gibbs sampling

Similar to EM, but some differences:

-initial “seed” pwm

-use the current PWM to determine the probability of every position being a site

-at each iteration, pick one site on each sequence, sampled according to its probability, and use it to update the PWM, rather than updating with the full set of probabilities

-not guaranteed to converge, but tends to increase the objective (IC) and plateau

-can escape local maxima, and is therefore less sensitive to the seed

25
gibbs sampling approach to motif discovery

-given “sites”, estimate pattern matrix

-given “matrix”, pick likely sites according to their probability

-iterate between those steps until “convergence”

important: use pseudocounts, and sample sites from the estimated probability distribution

26
how gibbs sampling works

initialization: random assignment of motif locations a1-ak

pick a “held-out” sequence

construct the matrix S from the alignment of the remaining sequences

score all possible motif locations in the held-out sequence to get A(i,j)

then select a new motif placement randomly based on the probability distribution A(i,j)

hold out a new sequence, and repeat until the matrix converges on better motif window placements

however, a sequence with no real site still gets a placement, and a sequence with more than one site gets only 1 placement
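
The procedure above can be sketched as follows. This is a simplified site sampler under stated assumptions: sequences are held out in round-robin order rather than picked at random, the background is uniform, the run uses a fixed number of iterations, and the pseudocount, random seed and toy sequences are arbitrary choices.

```python
import random

BASES = "ACGT"

def build_pwm(sites, w, pseudocount=0.5):
    """Frequency matrix (with pseudocounts) from the currently chosen sites."""
    pwm = []
    for j in range(w):
        col = {b: pseudocount for b in BASES}
        for s in sites:
            col[s[j]] += 1
        total = sum(col.values())
        pwm.append({b: col[b] / total for b in BASES})
    return pwm

def gibbs_sampler(seqs, w, iterations=200, seed=0, bg=0.25):
    rng = random.Random(seed)
    # Initialization: random motif location in every sequence.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for it in range(iterations):
        i = it % len(seqs)  # hold out one sequence per iteration
        others = [s[a:a + w] for k, (s, a) in enumerate(zip(seqs, starts)) if k != i]
        pwm = build_pwm(others, w)
        # Score every possible location in the held-out sequence.
        held = seqs[i]
        weights = []
        for a in range(len(held) - w + 1):
            score = 1.0
            for j in range(w):
                score *= pwm[j][held[a + j]] / bg
            weights.append(score)
        # Sample (not argmax) a new location in proportion to its score.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["AATGACGTCC", "CTGACGTAAG", "GGCATGACGT", "TGACGTTCAA"]  # toy data
starts = gibbs_sampler(seqs, w=6)
print(starts)
```

Sampling rather than taking the best-scoring location is what lets the sampler leave a poor alignment; the pseudocounts keep every location's weight above zero so any placement remains reachable.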

27
DNA binding key takeaways

-genome encodes much of its own regulation in protein binding sites

-a full description of the regulatory networks will require identifying these sites

-compact descriptions of the DNA-binding preferences of TFs are afforded by weight matrices

-the information content of an alignment is a measure of specificity

-weight matrix information for a TF is not enough to rule out false positives

-multiple experimental techniques exist for identifying sequences harboring binding sites

-a variety of algorithms can be used to identify motifs in unaligned data

-most predicted binding sites are false positives

28
class II motif finding algorithms

phylogenetic footprinting

single gene, multiple species

-orthologous background sequences

-sequences linked by a phylogenetic tree

-identify the “best conserved” motif that is under selective pressure

29
class III motif finding algorithms

multiple genes, multiple species

-combination of phylogenetic data and gene regulation

-use phylogenetic data to reduce search space

-use correlation of motif occurrences among orthologous genes to increase signal strength

30
CNNs

convolutional neural networks

-sequences are filtered through multiple convolutional layers (filters learned from training sequences) and scored

-filtered sequence scores are pooled and max score retained

-many rounds of convolution → pooling can occur

-fully connected hidden layer used to score sequence

inputs:

-PBM/SELEX

-chromatin accessibility

-ChIP-seq or CUT&RUN/Tag

-principle is to represent sequences that are biologically meaningful in training set
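
To make the convolution and pooling steps concrete, here is a forward pass through a single filter in plain Python. The filter is hand-set to reward the word TGAC instead of being learned, the bias is chosen so only a perfect match survives the ReLU, and there is no fully connected layer; all of these are illustrative assumptions, not how a trained network would look.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]

def conv1d(x, filt, bias=0.0):
    """Slide one filter (width x 4 weights) along the one-hot sequence."""
    w = len(filt)
    out = []
    for a in range(len(x) - w + 1):
        total = bias
        for j in range(w):
            for c in range(4):
                total += x[a + j][c] * filt[j][c]
        out.append(total)
    return out

def relu(values):
    return [max(0.0, v) for v in values]

def max_pool(values, size):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

filt = one_hot("TGAC")  # hand-set weights: 1 for the motif base, 0 otherwise
activations = relu(conv1d(one_hot("AATGACGT"), filt, bias=-3.0))
pooled = max_pool(activations, size=2)
print(activations, pooled, max(pooled))
```

A first-layer filter behaves much like a weight matrix scanned along the sequence, which is why wide first filters can capture full motifs while narrow ones capture only parts of them.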

31
large first filters

represent full motifs

32
small first filters

learn partial motifs

33
back propagation

determine the features being learned in early filter layers

-learn discrete sequence contributions to signal

34
saliency maps

can be used to identify real features that CNN deems important for prediction