Comprehensive vocabulary flashcards covering bioinformatics topics including sequencing technologies, assembly, alignment algorithms, statistics, phylogeny, and transcriptomics.
Illumina Sequencing Errors
A characteristic where read quality typically decreases towards the end of each sequence read.
Metatranscriptomics Strategy
Extracting RNA, fragmenting, sequencing, and using reads to assemble all transcripts in a sample to identify expressed genes, especially from unknown bacteria.
Nanopore Sequencing
A sequencing method recommended for marker genes like 16S rRNA because it provides long reads, which help determine gene copy numbers in prokaryotic genomes.
Read Pair Calculation
To sequence an E. coli genome of $5 \times 10^{6}$ bp with paired-end 150 bp reads to a depth of 30, the required number of read pairs is $\frac{5 \times 10^{6} \times 30}{2 \times 150} = 500{,}000$.
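The arithmetic on this card can be checked with a short sketch. The function name and parameters are illustrative, and it assumes every base of both reads in a pair counts toward depth.

```python
import math

def read_pairs_needed(genome_size_bp, read_length_bp, depth):
    """Read pairs required for a target average depth (two reads per pair)."""
    bases_needed = genome_size_bp * depth      # total bases to sequence
    bases_per_pair = 2 * read_length_bp        # each pair contributes two reads
    return math.ceil(bases_needed / bases_per_pair)

print(read_pairs_needed(5_000_000, 150, 30))  # 500000
```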
Read Coverage Probability
In a circular genome of 3,000,000 bp, the probability that a random read of length 150 bp covers a specific position is $\frac{150}{3{,}000{,}000}$.
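One way to see why the probability is read length over genome size: on a circular genome every start position is equally likely, and exactly as many start positions as the read has bases place the read over a given site. The sketch below (hypothetical helper names, uniform random start assumed) computes the fraction and confirms the count by brute force on a tiny genome.

```python
from fractions import Fraction

def coverage_probability(genome_size_bp, read_length_bp):
    """Probability that one uniformly placed read covers a fixed position."""
    return Fraction(read_length_bp, genome_size_bp)

def covering_starts(genome_size_bp, read_length_bp, position):
    """Count read start positions on a circular genome that cover `position`."""
    return sum(
        1
        for start in range(genome_size_bp)
        if (position - start) % genome_size_bp < read_length_bp
    )

print(coverage_probability(3_000_000, 150))  # 1/20000
print(covering_starts(30, 5, 0))             # 5
```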
Assembly Mapping Disparity
If a region has double the expected mapping depth (e.g., 200 reads vs. an average of 100), it indicates the region occurs in twice as many copies in the sequenced genome compared to the reference.
Contig
A segment of the genome that has been assembled from overlapping sequence reads.
N50 Value
A statistical measure of genome assembly quality; for contig lengths of 100, 200, 300, 400, 500, 600, and 700, the N50 value is 500.
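A minimal sketch of the N50 calculation, using the common definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name is illustrative.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Total is 2800, half is 1400; 700 + 600 + 500 = 1800 is the first sum to reach it.
print(n50([100, 200, 300, 400, 500, 600, 700]))  # 500
```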
Hash Table
A data structure optimized for fast retrieval of a stored element by its key, typically in constant time on average.
Computational Complexity $O(N^3)$
An algorithm property where doubling the problem size $N$ results in an eightfold ($2^3$) increase in processing time.
Sensitivity (Homology Search)
The ratio of correctly identified homologs to the total number of true homologs in the database (e.g., 35/50).
Specificity (Homology Search)
The ratio of correctly identified non-homologs to the total number of non-homologs in the database (e.g., 945/950).
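The two ratios on the sensitivity and specificity cards can be written out as a short sketch. The counts are read off the card examples: 35 of 50 true homologs found (so 15 missed) and 945 of 950 non-homologs correctly rejected (so 5 false hits).

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of true homologs that the search found."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of non-homologs that the search correctly rejected."""
    return true_negatives / (true_negatives + false_positives)

print(sensitivity(35, 15))  # 35/50
print(specificity(945, 5))  # 945/950
```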
BLAST Word Length
A parameter where increasing the length results in fewer total hits within the database.
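The effect of word length can be illustrated with a toy sketch. It counts exact word matches only, which is a simplification: real BLAST also uses neighborhood words and score thresholds for proteins. The sequences and function name are made up for illustration.

```python
def count_word_hits(query, database, word_length):
    """Count database positions whose word exactly matches any query word."""
    query_words = {
        query[i:i + word_length]
        for i in range(len(query) - word_length + 1)
    }
    return sum(
        1
        for j in range(len(database) - word_length + 1)
        if database[j:j + word_length] in query_words
    )

query = "ACGTAC"
database = "ACGTTTACGAAC"
for w in (2, 3, 4):
    print(w, count_word_hits(query, database, w))  # hits: 7, 4, 1
```

Longer words are more specific, so fewer positions seed a hit.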
Affine Gap Penalty
A scoring system that applies a higher penalty for initiating a gap than for extending an existing one.
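A small sketch of an affine gap penalty. The open and extend costs (10 and 1) are illustrative defaults, not values from the card, and the convention assumed here is that the opening cost covers the first gap position.

```python
def affine_gap_penalty(gap_length, gap_open=10, gap_extend=1):
    """Penalty for one gap: opening cost plus a smaller cost per extra position."""
    if gap_length <= 0:
        return 0
    return gap_open + (gap_length - 1) * gap_extend

# One gap of length 3 is much cheaper than three separate gaps of length 1.
print(affine_gap_penalty(3))      # 12
print(3 * affine_gap_penalty(1))  # 30
```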
Extreme Value Distribution
A statistical distribution used to model the scores of the best alignments expected by chance between unrelated sequences.
Protein Sequence Identity
A measure of similarity that is discouraged for protein sequences because it treats every mismatch alike, whereas different amino acid substitutions have different substitution scores.
Sum-of-Pairs Score
The total score of a multiple sequence alignment, calculated by summing the scores of all pairwise alignments between its sequences.
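A minimal sketch of a sum-of-pairs score. The match, mismatch, and gap values are placeholder assumptions; in practice a substitution matrix would supply the residue scores. A column where both sequences have a gap is scored as 0 here, which is one common convention.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned column for a pair of sequences."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment, **scores):
    """Sum the column scores over every pair of rows in the alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(pair_score(a, b, **scores) for a, b in zip(row_a, row_b))
    return total

print(sum_of_pairs(["ACGT", "ACGT", "A-GT"]))  # 4 + 1 + 1 = 6
```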
Progressive Alignment Method
A multiple sequence alignment approach that adds sequences step by step, typically following a guide tree, and that cannot correct errors made in early steps.
Newick Format
A standard data format used to describe the topology and branch lengths of a phylogenetic tree.
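To make the format concrete, the sketch below writes a small tree as a Newick string. The nested-tuple representation `(name, branch_length, children)` and the function name are assumptions chosen for illustration, not a standard API.

```python
def to_newick(node):
    """Serialize a (name, branch_length, children) tuple as a Newick subtree."""
    name, length, children = node
    label = name
    if children:
        # Children go inside parentheses, separated by commas.
        label = "(" + ",".join(to_newick(child) for child in children) + ")" + name
    if length is not None:
        label += f":{length}"
    return label

tree = ("", None, [
    ("A", 0.1, []),
    ("B", 0.2, []),
    ("", 0.5, [("C", 0.3, []), ("D", 0.4, [])]),
])
print(to_newick(tree) + ";")  # (A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
```

Parentheses encode the topology, the numbers after colons are branch lengths, and the semicolon ends the tree.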
PSSM Probabilities
For a protein pattern covering 6 positions, a Position-Specific Scoring Matrix requires 120 individual probabilities ($20 \text{ amino acids} \times 6 \text{ positions}$).
PROSITE Model
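A syntax for protein motifs; for example, the pattern G-[LI]-[CHK]-H-L-x-C(2)-F-[YR]-W describes specific conserved and variable residues: brackets list the residues allowed at a position, x matches any residue, and (2) means the residue occurs twice in a row.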
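A PROSITE pattern maps closely onto a regular expression, which the following sketch illustrates using the card's pattern written with ASCII hyphens. The converter is a partial, hypothetical helper: it handles only `x`, bracket sets, `{}` exclusions, and `(n)` repeats, and the test sequence is invented.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".").replace("X", ".")    # any residue
        element = element.replace("{", "[^").replace("}", "]")   # excluded residues
        element = element.replace("(", "{").replace(")", "}")    # repeat counts
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("G-[LI]-[CHK]-H-L-x-C(2)-F-[YR]-W")
print(regex)  # G[LI][CHK]HL.C{2}F[YR]W

match = re.search(regex, "MGLCHLACCFYW")
print(match.start())  # 1
```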
PHI-BLAST
A variant of BLAST that utilizes a PROSITE-pattern during the database search.
PSI-BLAST
A variant of BLAST that creates a PSSM from hit sequences to perform iterative searches.
Gene Enrichment (Over-representation)
A statistical result indicating that a set of upregulated genes contains more genes related to a specific function (e.g., cold stress) than would be expected by chance.
Volcano Plot Outlier
A data point representing a gene with high fold-change but no statistical significance, often caused by high variance (spread) between samples within the same treatment group.
False Discovery Rate (q-value)
A p-value corrected for multiple testing; a q-value threshold of 0.05 implies that about 5% of the genes called significant are expected to be false positives.
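The card does not say which correction is used. One widely used choice is the Benjamini-Hochberg procedure, sketched below as an assumption; the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in q_values]
print(q_values, significant)
```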
Principal Component Analysis (PCA)
A technique used to identify outliers, groups, or gradients within transcriptomics data tables.
Principal Coordinate Analysis (PCoA)
A dimensionality reduction technique typically applied to distance tables rather than raw data tables.
Fisher's Exact Test
A test used to determine if the overlap between two groupings of the same genes is significantly larger than expected.
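A sketch of the one-sided version of this test for gene-set overlap, computed directly from the hypergeometric distribution with only the standard library. The function name, argument names, and example numbers are illustrative assumptions.

```python
from math import comb

def fisher_exact_overlap(overlap, size_a, size_b, total_genes):
    """P(overlap >= observed) when set B is drawn at random from all genes."""
    max_overlap = min(size_a, size_b)
    denominator = comb(total_genes, size_b)
    tail = sum(
        comb(size_a, k) * comb(total_genes - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denominator

# 10 genes in total, two groupings of 5 genes each that overlap completely.
print(fisher_exact_overlap(5, 5, 5, 10))  # 1/252, about 0.004
```

A small p-value means an overlap this large would rarely arise by chance, which is the same logic behind over-representation tests for gene enrichment.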
Metabarcoding
The process of mapping the biological composition of an environment by sequencing specific marker genes.
Alpha Diversity vs. Beta Diversity
Alpha diversity refers to the diversity within a single sample, while Beta diversity measures the diversity difference between samples.
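The card does not name specific measures. As an illustration only, the sketch below uses the Shannon index for alpha diversity and Bray-Curtis dissimilarity for beta diversity, two common choices. The count vectors are made up and must list the same taxa in the same order.

```python
import math

def shannon_index(counts):
    """Alpha diversity of one sample from its per-taxon counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(counts_a, counts_b):
    """Beta diversity between two samples (0 = identical, 1 = no shared taxa)."""
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared / (sum(counts_a) + sum(counts_b))

sample_1 = [10, 10, 0]
sample_2 = [0, 10, 10]
print(shannon_index(sample_1))          # ln(2) for two equally abundant taxa
print(bray_curtis(sample_1, sample_2))  # 0.5
```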
Maximum Likelihood (ML) Advantage
A phylogenetic reconstruction method that, compared to distance-based methods, uses the sequence data more effectively and incorporates explicit evolutionary models.
Taxonomic Classification
The process of recognizing a sequence variant in metagenomics and assigning it a scientific name.
BLAST Bit-score
A normalized score, derived from the raw alignment score using parameters that depend on the scoring table, that is used to calculate the E-value via a simple formula.
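The "simple formula" is commonly written as E = m × n × 2^(−S'), where S' is the bit-score, m the query length, and n the database length. The sketch below uses that form together with the usual normalization S' = (λS − ln K) / ln 2. The function and parameter names are illustrative, and λ and K must come from the scoring system in use.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score with the scoring system's lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_length, database_length):
    """Expected number of chance hits scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)

# Each additional bit halves the expected number of chance hits.
print(e_value(10, 1024, 1))  # 1.0
print(e_value(11, 1024, 1))  # 0.5
```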