structural bioinformatics
field that predicts and models function from 3D structure
genomics
field that predicts phenotype from genotype
molecular novelty
field that discovers novel genes and gene functions from newly sequenced genomes and metagenomes
evolutionary adaptation
field that analyzes how changes in genotype lead to changes in phenotype and evolutionary adaptation
computational biology
the application of computational tools to solve biological problems (many disciplines, e.g. genomics, biophysics, ecology, molecular bio, etc.)
bioinformatics
places more emphasis on analysis of high-throughput data, notably genome-sequence data. A subset of computational biology.
pattern discovery
subtask of comp bio - learn patterns from biological data
prediction
subtask of comp bio - use patterns to predict biological function
integration
subtask of comp bio - develop models that connect levels of info
simulation
subtask of comp bio - model behavior of biological systems on a computer
engineering
subtask of comp bio - design novel biological systems for specific purposes
therapy
subtask of comp bio - design molecular therapeutics to combat disease
wet lab
data generation, testing
dry lab
interpretation, prediction, model-building, hypothesis-generation
algorithm
a set of rules or instructions specifying how to solve a problem
tool
implementation of an algorithm (software)
sequence analysis
determining the optimal alignment between sequences, searching databases for homologs, organizing and interpreting data
phylogenetic analysis
organization of sequences according to their evolutionary relationships
genome analysis
finding and analyzing genes within the context of an entire genome
transcriptomics and proteomics
examines levels of gene or protein expression
network and systems biology
analyzing a biological system as a network of interacting components
synthetic biology
the use of computers to design new biological systems
dna polymers
specific sequences of nucleotides
nitrogenous base
nucleotides differ by which _________ they contain
genome
organism's DNA-based genetic instructions; composed of genes
genes
DNA instructions for making proteins
central dogma
DNA --> RNA --> protein
transcription
assisted by RNA polymerase
translation
assisted by ribosomes
mRNA, tRNA, rRNA
three main types of RNA: messenger, transfer, ribosomal
RNA
-primary structure similar to DNA
-can be single or double stranded
-can exhibit different conformations (unlike DNA)
-contains U instead of T
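A minimal sketch of the "U instead of T" point, assuming plain uppercase A/C/G/T strings written 5' to 3'. The function names are made up for illustration; real transcription also involves promoters, RNA polymerase and processing, none of which is modeled here.

```python
# Map each DNA base to the RNA base that pairs with it.
COMPLEMENT_TO_RNA = str.maketrans("ACGT", "UGCA")

def transcribe_coding(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def transcribe_template(template_strand):
    """mRNA is the reverse complement of the template strand (both given 5'->3')."""
    return template_strand.upper()[::-1].translate(COMPLEMENT_TO_RNA)

print(transcribe_coding("ATGGCC"))
print(transcribe_template("GGCCAT"))
```

Both calls describe the same double-stranded region, so they give the same mRNA.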
gene expression
process of using DNA info to make mRNA and proteins
promoter sequence
what RNA polymerases look for to recognize beginning of genes
enhancers
Modulate gene expression and can be far from the gene. Needed on top of promoters for eukaryotes.
prokaryotes
use +ve and -ve regulation for transcription
genetic code
RNA to amino acids
ORF
long stretches of DNA that are uninterrupted by stop codons and therefore encode protein
genes
ORFs + additional regulatory info
start codon
Met, AUG; signals the start of translation (mRNA to protein)
stop codons
UAA, UAG, UGA; expected about once every 21 codons in random sequence (3 of the 64 codons are stops)
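The genetic code, ORF, start codon and stop codon cards can be tied together in a short sketch. It assumes the standard genetic code, an mRNA string of A/C/G/U only, and the three forward reading frames; `find_orfs` and `min_codons` are illustrative names, not part of any tool mentioned in these notes.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G ('*' = stop).
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def find_orfs(mrna, min_codons=2):
    """Translate every AUG...stop stretch in the three forward frames."""
    orfs = []
    for frame in range(3):
        protein = None  # None means we are not inside an ORF
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            aa = CODON_TABLE[codon]
            if protein is None:
                if codon == "AUG":      # start codon opens an ORF
                    protein = "M"
            elif aa == "*":             # stop codon closes it
                if len(protein) >= min_codons:
                    orfs.append(protein)
                protein = None
            else:
                protein += aa
    return orfs

print(find_orfs("GGAUGGCCUUUUAAGG"))
```

An ORF that runs off the end of the sequence without a stop codon is dropped in this sketch; a real gene finder would also scan the reverse strand.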
hydrophobic amino acids
amino acids with long alkyl side chains. More likely to be found in the interior of proteins. A I L M P V F W
hydrophilic (polar) amino acids
C N Q S T Y G
Charged amino acids
(-) D E, (+) K R H
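A small sketch that encodes the three amino acid cards above as sets and reports the composition of a protein. The grouping is exactly the one in these notes (other textbooks group some residues differently); `class_fractions` is an illustrative name.

```python
# Residue classes as listed on the cards above (one-letter codes).
CLASSES = {
    "hydrophobic": set("AILMPVFW"),
    "polar": set("CNQSTYG"),
    "negative": set("DE"),
    "positive": set("KRH"),
}

def class_fractions(protein):
    """Fraction of residues in each class; unknown letters are ignored."""
    counts = {name: 0 for name in CLASSES}
    for aa in protein.upper():
        for name, members in CLASSES.items():
            if aa in members:
                counts[name] += 1
    total = len(protein)
    return {name: (n / total if total else 0.0) for name, n in counts.items()}

print(class_fractions("MKDS"))
```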
sequencing
determining the exact nucleotide sequence of DNA
sequencing methods
Maxam-Gilbert: chemical degradation
Dideoxy (Sanger): chain termination
next gen (high throughput): many types
next gen methods
Illumina (Solexa)
-up to 1 Tb/run (>5 human genomes)
-125 base pairs (paired end)
454 pyrosequencing
-up to 600 Mb/run
-400-500 base pairs (single direction)
evolution
changes in inherited characteristics of biological populations over successive generations
genotype
organism's genetic information
phenotype
observable features of an organism, encoded by genotype
mutations
-point, duplication, insertion, deletion
-drive differences between species (variation)
homology
similarity due to common ancestry
homologous
evolutionarily related
NCBI GenBank
sequence database of the National Center for Biotechnology Information (NCBI), part of the US National Institutes of Health (NIH)
protein sequencing timeline
1955: first complete sequence (insulin, Ryle et al.)
1965: ~20
1980:~1500
today: ~200 million
nucleotide sequencing timeline
1953: DNA structure, Watson and Crick
60s-70s: small RNAs, cloning, then PCR
1982: creation of GenBank, simpler, democratization of data
sequence revolution
1980s/90s
-development of more efficient computer hardware and software
-birth of bioinformatics (term was coined in the 70s as the study of info processes in biotic systems)
DNA databases
NCBI GenBank: WGS, CoreNucleotide, dbGSS
RNA databases
NCBI: GEO, dbEST, UniGene
protein databases
NCBI and others: NCBI protein, UniProt, Protein Data Bank
flow of info
curation --> annotation --> release
core data
-key info in the db entry and minimal info req'd to identify it
-includes data derived from experimental results (ex. sequence, structural data)
annotations
-all additional info, 2ndary info, may change over time
-ex. known or predicted functional info
purines
guanine and adenine
pyrimidines
thymine and cytosine (plus uracil, which replaces thymine in RNA)
flatfile db
-data (ex. sequences) are stored as a text file or a collection of text files
-flat, as in sheet of paper
-easy to input, distribute, search, and retrieve data
relational databases
-data stored within a number of tables linked together by a shared field, the key (which must be unique to each record)
-handles huge amounts of data, reducing data in memory, faster search and retrieval
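A minimal sketch of the relational idea using Python's built-in `sqlite3` module: two tables linked by a shared field, with a unique key per record. The table names, column names and the `SEQ001`-style identifiers are invented for illustration and are not real database records.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.executescript("""
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,                       -- the unique key
    organism_id INTEGER REFERENCES organism(organism_id),  -- the shared field
    residues    TEXT
);
""")
con.executemany("INSERT INTO organism VALUES (?, ?)",
                [(1, "Escherichia coli"), (2, "Homo sapiens")])
con.executemany("INSERT INTO sequence VALUES (?, ?, ?)",
                [("SEQ001", 1, "ATGGCC"), ("SEQ002", 2, "ATGTTT"), ("SEQ003", 2, "ATGAAA")])

# Join the two tables on the shared field to answer a query.
rows = con.execute(
    "SELECT s.accession, o.name FROM sequence s "
    "JOIN organism o ON s.organism_id = o.organism_id "
    "WHERE o.name = ? ORDER BY s.accession", ("Homo sapiens",)).fetchall()
print(rows)
```

Because `accession` is the primary key, inserting a second record with the same accession raises `sqlite3.IntegrityError`, which is the "must be unique to each record" rule in practice.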
fasta file type
-.fa, .faa, .fna, .fasta
-header followed by raw data
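A minimal sketch of reading the FASTA layout (a `>` header line followed by raw sequence lines), assuming the whole file fits in memory as a string. `parse_fasta` and the sample records are illustrative only.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the '>' marker
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

example = """>seq1 made-up example
ATGGCC
TTTTAA
>seq2 another made-up example
ATGAAATAG
"""
print(parse_fasta(example))
```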
ncbi genbank file type
-header, features, sequence
-each sequence filed with an accession number
-any revisions made, version number changes
accession number
~4-10 numbers/letters to identify specific DNA and protein sequence records
feature key
keyword indicating functional group
location
instructions for finding the feature
qualifiers
auxiliary info about a feature
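A sketch of how a feature location can be used to pull a feature out of the sequence. It assumes only the two simplest location forms, `start..end` and `complement(start..end)`, with 1-based inclusive coordinates; joins, partial ends and other operators are not handled. `extract_feature` is an illustrative name.

```python
import re

DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def extract_feature(sequence, location):
    """Return the subsequence described by a simple feature location."""
    m = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", location)
    on_reverse_strand = m is not None
    if m is None:
        m = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if m is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(m.group(1)), int(m.group(2))
    sub = sequence[start - 1:end]          # 1-based inclusive -> Python slice
    if on_reverse_strand:
        sub = sub.translate(DNA_COMPLEMENT)[::-1]  # reverse complement
    return sub

print(extract_feature("AACCGGTT", "3..6"))
print(extract_feature("AACCGGTT", "complement(1..4)"))
```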
protein dbs
NCBI protein, UniProtKB, Protein Information Resource (PIR), SWISS-PROT, TrEMBL
entries
databases are composed of
common queries
gene/protein name/function, db identifiers, species names, raw sequences
logic
these are operators that indicate relationships among searches. Ex. AND, OR, NOT, NOR, NAND, XOR, XNOR
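The logic operators map directly onto set operations, which is one way to picture what a combined search returns. The record ids and the two search terms below are made up.

```python
# Each set holds the ids of records matched by one search term (made-up data).
kinase = {"rec1", "rec2", "rec3"}
human = {"rec2", "rec3", "rec4"}

both = kinase & human        # kinase AND human
either = kinase | human      # kinase OR human
only_kinase = kinase - human # kinase NOT human
exactly_one = kinase ^ human # kinase XOR human

print(sorted(both), sorted(either), sorted(only_kinase), sorted(exactly_one))
```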
homology searches
best way to find genes/proteins related to your sequence of interest
data quality and info content
redundancy, efficiency, automatic and manual quality control
computer error
incorrect annotations, missed relationships (insufficient info extraction)
human error
multiple contributions, vector sequence left in, PCR chimeras, taxonomic misidentification, trivial data entries
quality control
manual approach to deal with errors, ex. 20% of fungi sequences were misidentified
sequence alignment
identification of character matches preserving character order
true alignment
reflects evolutionary relationship between 2+ sequences that share a common ancestor (homology)
global alignment
attempts to align the entire length of both sequences, ex. Needleman-Wunsch (NW)
local alignment
stretches of sequence with the highest density of matches are aligned, ex. Smith-Waterman (SW)
function, structure, evolutionary information
aligning sequences useful to discover
similarity, patterns, relationships
alignments reveal
score and compute
to understand alignments we need to:
a good alignment has
many matches, few mismatches, few gaps
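"Score" can be made concrete with a simple scheme that rewards matches and penalizes mismatches and gaps. The values below (+1, -1, -2) are arbitrary assumptions for illustration; real tools use substitution matrices and separate gap-open and gap-extend penalties.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two already-aligned rows of equal length ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

print(score_alignment("GATTACA", "GA-TACA"))  # 6 matches, 1 gap
print(score_alignment("GATTACA", "GACTACA"))  # 6 matches, 1 mismatch
```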
dynamic programming
Used by both NW and SW, solves the problem by breaking it down into subproblems
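A compact sketch of the dynamic programming idea shared by NW and SW: each cell of the table holds the best score for aligning two prefixes, built from its three neighbouring subproblems. It returns only the score (no traceback) and assumes a linear gap penalty with arbitrary values. The `local` flag switches between the NW-style and SW-style recurrences.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (NW-style) or local (SW-style) alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalized along the first row and column.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)   # local: a bad prefix can be discarded
                best = max(best, cell)
            H[i][j] = cell
    return best if local else H[-1][-1]

print(align_score("GATTACA", "GATACA"))
print(align_score("TTTGATTACATTT", "CCGATTACACC", local=True))
```

The global call has to pay for one gap across the full length, while the local call scores only the shared GATTACA stretch and ignores the flanking mismatches.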
db search
needed to find closest homolog of a given sequence
-important for predicting sequence function, genome annotation, phylogenetics, determining taxonomic identity of a sequence
SSEARCH
-extension of pairwise alignment
-instead of 1 v 1, 1 v many
-problem: speed
-use for comparing local dbs
BLAST
-basic local alignment search tool
-faster than SW
-word-based (k-tuples), ungapped, locally optimal
-larger word length permits inexact matches between words
-heuristic procedure
-classic default word length: 3 for proteins, 11 for nt (blastn)
E value
expectation; the number of matches with scores equivalent to or better than S that are expected to occur in a db search by chance.
-borderline significant <0.01
-highly significant <1e-10
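A sketch of how an E value relates to score and search-space size, using the standard bit-score relation E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. This formula is not stated in these notes, and real BLAST uses edge-corrected effective lengths, so treat the numbers as rough. The cutoffs in `significance` are the two from the card above; the "not significant" label is an assumption.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)

def significance(e_value):
    """Apply the two cutoffs from the notes."""
    if e_value < 1e-10:
        return "highly significant"
    if e_value < 0.01:
        return "borderline significant"
    return "not significant"

e = expect_value(bit_score=50, query_len=300, db_len=1_000_000)
print(e, significance(e))
```

Doubling the database size doubles E for the same score, which is why the same hit looks less significant in a larger database.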
BLAST programs
blastp: protein query v protein db
blastn: nt query v nt db
blastx: translated nt v protein db
tblastn: protein v translated nt db
tblastx: translated nt v translated nt db
PSI-BLAST: detection of remote protein homology using profiles
BLAST process
1. break query into words
2. search for matches
3. extend matches in both directions until score falls below threshold
4. merge HSPs into a longer alignment (further extend and allow gaps)
5. report statistical significance
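Steps 1 to 3 above can be sketched as a toy seed-and-extend routine. This is not BLAST itself: it uses exact word matches only, a simple match/mismatch score, and stops extending once the running score drops `drop` below the best seen. All names and parameter values are illustrative assumptions.

```python
from collections import defaultdict

def seed_hits(query, subject, k=3):
    """Steps 1-2: break the query into k-letter words and find exact matches."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j) for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]

def extend(query, subject, i, j, k=3, match=1, mismatch=-1, drop=2):
    """Step 3: ungapped extension of the seed at query[i], subject[j]."""
    def walk(qi, sj, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= drop:
                break                      # score fell too far below the best
            qi += step
            sj += step
        return best, best_len
    right, right_len = walk(i + k, j + k, +1)
    left, left_len = walk(i - 1, j - 1, -1)
    # Return (score, query start, query end) of the extended segment pair.
    return k * match + left + right, i - left_len, i + k + right_len

hits = seed_hits("ACGTTGCA", "TTACGTTGCATT")
print(hits[0], extend("ACGTTGCA", "TTACGTTGCATT", *hits[0]))
```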
BLAST artifacts
-longer the sequence higher the score (this is natural)
-query sequence w/ repeats artificially inflates score
-low complexity regions
-conservative with short query sequences
multiple sequence alignment
>2 sequences, basis of phylogenetic reconstruction, indicates patterns of conservation and variation (for finding functional residues, motifs), greater accuracy of overall alignment
order
can affect end result of MSA
MSA challenges
-finding best alignment that takes mutations/gaps for ALL sequences into account
-scoring entire alignment
-placement and scoring of gaps
-cannot easily extend dynamic programming algorithms like SW or NW
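A quick arithmetic sketch of why pairwise dynamic programming does not scale to many sequences: the table needs one dimension per sequence, and each cell depends on every combination of "advance or gap" across the sequences. The function names are illustrative.

```python
def dp_cells(lengths):
    """Number of cells in a full DP table over sequences of these lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1   # one extra row/column for the empty prefix
    return cells

def predecessors(n_seqs):
    """Neighbouring cells consulted per cell: every non-empty subset advances."""
    return 2 ** n_seqs - 1

for n_seqs in (2, 3, 5):
    print(n_seqs, dp_cells([100] * n_seqs), predecessors(n_seqs))
```

Two sequences of length 100 need about ten thousand cells with 3 neighbours each; five such sequences already need about ten billion cells with 31 neighbours each.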