origins of bioinformatics
earliest foundations (1950-1970) focused primarily on protein sequence analysis
COMPROTEIN
first known bioinformatics software (early 1960s)
developed by Margaret Dayhoff
designed to assemble whole protein sequences (de novo) from small Edman peptide fragments
paradigm shift (1970-1980)
period when bioinformatics shifted its focus from protein analysis to DNA analysis, following the invention of Sanger sequencing (1977)
needleman-wunsch (1970)
developed the first dynamic programming algorithm for performing pairwise protein sequence alignments
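The dynamic programming idea can be illustrated with a minimal sketch of global alignment. The scoring values (match +1, mismatch -1, gap -1) and the function name `needleman_wunsch` are assumptions chosen for illustration; real protein alignments typically score substitutions with a matrix such as PAM rather than a flat match/mismatch value.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative defaults, not taken from the original notes.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def sub(i, j):
        """Score for pairing a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell depends on its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

The table has (len(a) + 1) x (len(b) + 1) cells, so time and memory both grow with the product of the two sequence lengths.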
homology: orthology
homology resulting from a speciation event
defined by walter m. fitch (1970)
dayhoff/pam matrix
developed the first probabilistic model of amino acid substitutions (point accepted mutations) in 1978, using probability to measure evolutionary change
de novo sequencing
the determination of a full-genome sequence without using a known template or reference sequence
massively parallel
multiple processors working simultaneously
multiplexing
combining multiple inputs/samples into a single sequencing run
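After a multiplexed run, reads have to be sorted back to their samples (demultiplexing). The sketch below assumes each read starts with a short inline barcode; the barcodes, sample names, and the `demultiplex` helper are all hypothetical, and many platforms read the index separately instead of inline.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by leading barcode and strip the barcode.

    reads:    iterable of sequence strings
    barcodes: dict mapping barcode string -> sample name (hypothetical values)
    Reads that match no barcode go to the "undetermined" group.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            by_sample["undetermined"].append(read)
    return dict(by_sample)


barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}  # made-up barcodes
reads = ["ACGTGGGAAA", "TGCACCCTTT", "ACGTTTTCCC", "NNNNGGGGGG"]
groups = demultiplex(reads, barcodes)
print(groups)
```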
overfitting
when a model built on training data shows high accuracy but significantly decreased accuracy when applied to separate validation data, indicating the model is too specific to the initial dataset features
sanger (dideoxy)
long reads (~600-1000 bp)
low throughput, typically single samples
quality loss at the beginning and end
based on chain-terminating dideoxynucleotides (ddNTPs)
illumina (MiSeq)
short reads (100-300 bp)
high/massively parallel throughput
high accuracy
bridge amplification/sequencing by synthesis where fragments attached to a flow cell are amplified into clusters
oxford nanopore (MinION)
ultra long reads
high throughput, portable
moderate error rate
DNA passes through a nanopore, changes in electrical current are decoded into the DNA sequence (basecalling)
fastq file
file format that incorporates both the nucleotide sequence and associated quality scores
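A FASTQ record is commonly laid out as four lines: an `@` header, the sequence, a `+` separator, and a quality string with one character per base. The sketch below assumes that simple four-line layout and the Phred+33 character encoding; the helper names `read_fastq` and `phred_scores` are illustrative, not from any library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break                      # end of file (or blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()              # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def phred_scores(quality_string, offset=33):
    """Decode quality characters to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


example = "@read1\nGATTACA\n+\nIIIIIII\n"
records = list(read_fastq(io.StringIO(example)))
for name, seq, qual in records:
    print(name, seq, phred_scores(qual))
```

With the Phred+33 encoding, the character `I` (ASCII 73) decodes to a score of 40.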
phred score (q)
measure of base-call quality: Q = -10 log10(P), where P is the probability that a base was called incorrectly
Q20 = 1% probability of error per base (1 in 100), meaning 99% accuracy
Q30 = 99.9% accuracy
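The Q20 and Q30 figures follow from the standard Phred definition Q = -10·log10(P), where P is the per-base error probability. A small sketch of the conversion in both directions (function names are illustrative):

```python
import math


def error_probability(q):
    """Per-base error probability for a Phred score q."""
    return 10 ** (-q / 10)


def phred_quality(p):
    """Phred score for a per-base error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.4%}")
```

Each 10-point increase in Q corresponds to a tenfold drop in the error probability.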
coverage
the average number of reads that align to, or “cover,” known reference bases
50x genome coverage is recommended
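Average coverage is often estimated as (number of reads × read length) / genome size. The sketch below applies that estimate; the genome size, read length, and read counts are made-up example values, and the estimate ignores unmapped or duplicate reads.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Estimated average coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Minimum number of reads to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


genome_size = 5_000_000   # hypothetical 5 Mb genome
read_length = 150         # hypothetical short-read length in bp

print(mean_coverage(1_000_000, read_length, genome_size))
print(reads_needed(50, read_length, genome_size))
```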
single-end reads
sequence a DNA fragment from one end only
paired-end reads
report sequences from both directions of a DNA fragment
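To make the single-end/paired-end difference concrete, the sketch below simulates both read types from one fragment. It assumes the second read of a pair is reported as the reverse complement of the fragment's far end, which is the usual convention for paired-end data; the fragment and read length are made-up values.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def single_end_read(fragment, read_length):
    """One read from one end of the fragment."""
    return fragment[:read_length]


def paired_end_reads(fragment, read_length):
    """Two reads: one from each end of the fragment, facing inward."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


fragment = "AACCGGTTAC"   # toy fragment
print(single_end_read(fragment, 4))
print(paired_end_reads(fragment, 4))
```

Because both reads come from the same fragment, the known approximate distance between them helps place reads in repetitive regions.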