Biological Databases - Lecture Notes
Introduction to Biological Databases
Class Overview
- The class focuses on the semantics of biological data rather than database technology (relational databases, object-oriented, columnar databases, etc.).
- The main goal is to understand how data is represented and queried in biological databases.
- Requires a basic understanding of biology, especially for those in the bioinformatics program.
- The course is designed to synthesize knowledge from other courses such as BIM 181 (algorithmic methods), BIM 186 (distributions and statistics), and BIM 100/183 (genomic technologies).
- The emphasis is on the interpretation of data, not on quizzing algorithmic special cases.
Prerequisites
- BIM 180 is a prerequisite, and comfort with Python is required since it will be used as a scripting language.
- Some experience handling large datasets is expected.
Biological Data
- Bioinformatics can start at different levels of biological organization (organism, organs, tissues, cells).
- Cells are convenient for experiments because they are autonomous.
- Focus on three key molecules: proteins, DNA, and RNA.
Proteins
- Proteins are considered the cellular machinery, responsible for enzymatic work, signaling, and gene regulation.
DNA
- DNA carries all the information on how cells work and includes instructions for making proteins.
- Cell transformation: DNA released by dead cells can be taken up by other cells and transform them.
- Transformation can also refer to a cell becoming cancerous due to DNA mutations.
RNA
- RNA was initially thought of as a messenger molecule for translating DNA information into proteins.
- RNA is now known to have other functions, including acting as an enzyme and as a signaling molecule.
- The course will focus on RNA's role in quantifying gene activity.
- Biological molecules (DNA, RNA, proteins) can be represented as strings over a small alphabet of building blocks (letters).
- DNA: chain over 4 nucleotides (A, C, G, T).
- RNA: chain over 4 nucleotides (A, C, G, U).
- Proteins: strings over 20 amino acids.
- Walter Goad recognized that patterns in these molecules could be studied as strings, marking the beginning of bioinformatics.
- GenBank started as a database maintained by Goad, who manually entered sequences.
- GenBank is now maintained by NCBI (National Center for Biotechnology Information).
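The string view of DNA, RNA, and proteins described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names and the example sequence are made up, and only the standard one-letter codes are assumed.

```python
# Alphabets for the three molecule types (standard one-letter codes).
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def is_valid(sequence: str, alphabet: set) -> bool:
    """Return True if every letter of the sequence belongs to the alphabet."""
    return all(letter in alphabet for letter in sequence)


def transcribe(dna: str) -> str:
    """Rewrite a DNA string as an RNA string by replacing T with U."""
    return dna.replace("T", "U")


dna = "ATGGCCATTG"  # made-up example sequence
rna = transcribe(dna)
print(rna, is_valid(dna, DNA_ALPHABET), is_valid(rna, RNA_ALPHABET))
```

Once sequences are plain strings, ordinary string operations (search, comparison, pattern matching) become tools for studying them.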
Biological Databases: NCBI and GenBank
- NCBI hosts GenBank, which contains biological sequence data.
- Older database entries consisted of metadata surrounding a sequence of interest (e.g., protein sequence).
- The availability of genome sequencing has caused the sequence data itself to dominate the databases.
- Sequence data has grown to petabytes, requiring efficient querying methods.
- Sequencing costs have dropped dramatically (to a small fraction of a cent per base pair), enabling widespread genome sequencing.
- A human genome can be sequenced 30 times over for about $200.
Querying Biological Databases
- A common task is to query a database to find sequences similar to a given query sequence.
- Example: Determining if an orphan sequence is human by searching for exact matches in a database of human sequences.
- Naive approach: searching for a sequence of length m in a database of length n can be costed in several ways:
- O(m+n) or O(n−m)
- These answers depend on whether you can skip the final m−1 base pairs of the string of length n, where a full match can no longer begin.
- O(m∗n)
- This answer takes into account that up to m base pairs have to be compared at each start position.
- The answers differ in run time because they count different work, which is why all of them can be defended; trade-offs and clever combinations are possible.
- These differences in orders highlight the need for efficient algorithms.
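The naive exact-match search discussed above can be sketched as follows. This is an illustrative implementation rather than the course's reference code. The function name and the comparison counter are assumptions, added to make the O(n−m) start positions and the O(m∗n) worst case visible.

```python
def naive_search(query: str, database: str):
    """Return (match positions, number of character comparisons)."""
    m, n = len(query), len(database)
    positions = []
    comparisons = 0
    # Only n - m + 1 start positions can hold a full match;
    # the final m - 1 positions of the database are skipped.
    for start in range(n - m + 1):
        matched = True
        for offset in range(m):
            comparisons += 1
            if database[start + offset] != query[offset]:
                matched = False
                break  # stop at the first mismatch
        if matched:
            positions.append(start)
    return positions, comparisons


# Typical case: mismatches are found quickly, so few comparisons are needed.
hits, _ = naive_search("GATTACA", "ACGATTACAGATTACA")
print(hits)  # [2, 9]

# Worst case: every start position needs all m comparisons.
hits, work = naive_search("AAAB", "AAAAAAAAAB")
print(hits, work)  # [6] 28, i.e. (n - m + 1) * m = 7 * 4
```

On databases measured in petabytes even a linear scan per query is expensive, which is why practical tools first filter or index the database.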
Sequence Similarity and Homology
- Identifying a human homolog of a mouse gene (e.g., orexin receptor) can be important for studying human diseases like obesity.
- Sequence alignment methods (allowing for mismatches, insertions, and deletions) are used to find functional equivalents.
- Tools like BLAST, BWA, and MiniMap employ database filtering to enable fast similarity searches.
- If no clearly similar sequences are found (e.g., the best match has <60% identity), interpreting the match becomes a statistical rather than an algorithmic question.
Statistical Significance
- The question of statistical significance comes up when low sequence identity is observed.
- It must be determined whether a match with, for example, 60% identity, is significant or just by chance.
- There needs to be a way to systematically probe statistical significance during querying.
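One simple way to probe significance is a shuffle test: compare the observed identity with the identities obtained after randomly shuffling one sequence. The sketch below is an assumption-laden illustration, not the method used by BLAST or by the course. It assumes equal-length, ungapped sequences, and the function names, trial count, and seed are all hypothetical.

```python
import random


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions with the same letter (equal length, no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def shuffle_p_value(a: str, b: str, trials: int = 1000, seed: int = 0) -> float:
    """Estimate how often a shuffled b matches a at least as well as b does."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = percent_identity(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if percent_identity(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (trials + 1)


a = "ACGTTGCAAGCTTAGGCTAACGTTAGCCTA"  # made-up sequences
b = "ACGTTGCATGCTTAGGCAAACGTTAGGCTA"
print(percent_identity(a, b), shuffle_p_value(a, b))
```

Shuffling keeps the letter composition of b fixed, so a small estimate suggests that the match reflects more than shared composition.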
Tricky Question of Similarity
- What if your query is 90% similar to one human sequence (h2), 70% similar to a mouse sequence (m1), and 50% similar to a fruit fly sequence (f), but experiments find that their functions are not all the same? Which one do you trust more?
- The naive answer is to trust the most similar sequence, but a true human counterpart should be more than 90% similar, so the human sequence (h2) cannot be trusted.
- Gene duplication and paralogs: After gene duplication events and sequence divergence, paralogs can have very different functions.
- Orthologs: In this example the fruit fly sequence is the true ortholog. Orthologs conserve function across long evolutionary time scales.
- Cases like this call into question how much sequence alignment alone can tell us about function.
More Questions Regarding Sequence Similarity
- Suppose sequences A and B are 40% identical and functionally similar.
- If A is also 40% identical to a sequence C, can we infer that C is functionally similar too?
- Cannot infer function without more information.
- Functional regions may differ even when structural regions are conserved.
- The matches between A and B and those between A and C may occur in different locations of the sequences.
- The question becomes systematically studying what very low levels of identity mean.
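The point about matches falling in different locations can be made concrete with a toy example. The three 10-letter sequences below are made up for illustration and are compared position by position without gaps. The function name is hypothetical.

```python
def identical_positions(a: str, b: str) -> set:
    """Positions where two equal-length, ungapped sequences agree."""
    return {i for i, (x, y) in enumerate(zip(a, b)) if x == y}


A = "ACDEFGHIKL"
B = "ACDEWWWWWW"  # agrees with A only in the first four positions
C = "YYYYYYHIKL"  # agrees with A only in the last four positions

print(sorted(identical_positions(A, B)))  # [0, 1, 2, 3]  -> 40% identity
print(sorted(identical_positions(A, C)))  # [6, 7, 8, 9]  -> 40% identity
print(sorted(identical_positions(B, C)))  # []            -> 0% identity
```

Here A is 40% identical to both B and C, yet B and C share nothing. If the function of A and B resides in the first four positions, the match to C says nothing about it.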
Protein Structure Databases
- Protein structures are three-dimensional, but they can also be described at a simpler level by secondary-structure elements (alpha helices, beta sheets).
- Motifs and sequence patterns can be identified using regular expressions based on amino acid properties.
- Databases like PROSITE store motifs as patterns that work like regular expressions.
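As a sketch of how motif patterns map onto regular expressions, the code below translates a simple pattern in PROSITE's notation into a Python regular expression and scans a sequence with it. It assumes the database meant above is PROSITE. The converter is a hypothetical helper that handles only the basic syntax (x, [..], {..}, repeat counts, and the < and > anchors), and the protein sequence is made up. The pattern N-{P}-[ST]-{P} is PROSITE's N-glycosylation site motif (PS00001).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")                           # optional trailing period
    regex = regex.replace("-", "")                        # element separators
    regex = regex.replace("x", ".")                       # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")     # terminus anchors
    return regex


def find_motif(pattern: str, protein: str) -> list:
    """Return (start, matched text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(protein)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))           # N[^P][ST][^P]
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))  # [(2, 'NGSA')]
```

The second asparagine (position 7) is not reported because it is followed by a proline, which the {P} element excludes.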
Central Dogma of Biology
- The central dogma describes the flow of information from DNA to RNA to protein.
- Genes are regions of DNA that contain instructions for making proteins.
- Genes consist of exons (coding regions) and introns (non-coding regions).
- Upon transcription, RNA is spliced to remove introns.
- Bioinformaticians may be given an orphan genomic sequence and asked to identify the gene (coordinates of exons, etc.).
- Machine learning and AI methods might be required to identify genes.
- Algorithms are needed to identify gene features in orphan genomic sequences.
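Once exon coordinates are known, splicing itself is a simple string operation. The sketch below assumes 0-based, half-open (start, end) coordinates and a made-up genomic sequence. It does not attempt the hard part described above, which is finding those coordinates in an orphan sequence.

```python
def splice(genomic: str, exons: list) -> str:
    """Join the exon segments, given as 0-based half-open (start, end) pairs."""
    return "".join(genomic[start:end] for start, end in sorted(exons))


def introns_between(exons: list) -> list:
    """Return the (start, end) gaps between consecutive exons."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


# Made-up gene: three exons separated by two introns (each intron GT...AG).
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "TTTGGA" + "GTCCCCAG" + "TAA"
exons = [(0, 6), (18, 24), (32, 35)]  # hypothetical annotation

print(splice(genomic, exons))   # ATGGCCTTTGGATAA
print(introns_between(exons))   # [(6, 18), (24, 32)]
```

A gene-finding algorithm would have to output a list like `exons` from the genomic string alone.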
What percentage of the human genome is genes?
- Exonic sequence (the parts that actually code for protein) makes up only about 1.7% of the genome.
- The genic region (entire genes, including all introns and exons) is larger, between 30% and 40%.
- Debate continues about protein-coding genes.
Dynamic Processes
- This class covers both the static and the dynamic aspects of bioinformatics.
- Static portion: genome sequenced, genes identified.
- Dynamic portion includes everything functional going on in the cell.
- A gene is transcribed into RNA, with the exons spliced together to form the mature RNA.
- The RNA is then used to make proteins.
- You can take a cell, extract all the RNA sequences, and map them back to where they came from, asking for every gene how many copies of RNA are present.
- Each time you sample a cell you see a different set of proteins, because proteins are constantly being degraded and assembled.
- A simple experiment involves isolating all RNA sequences and counting copies of each gene.
- Samples include a collection of all those RNA sequences.
- Counts will indicate which genes are active in that area.
- The counts can change depending on the state of the cell.
- These counts allow for some observations.
- Example: Given samples from two individuals, one with cancer and one without, could the collected data determine which individual has cancer?
- The length of a gene also matters, because the sequenced RNA fragments can come from anywhere along the entire gene.
- The counts therefore need normalization, although biologists often just do simple counting to find the number of copies.
- Real-life example: a matrix of genes whose counts correlate with whether individuals have leukemia or not.
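The need for length normalization can be illustrated with transcripts per million (TPM), one common scheme. It is named here as an example and is not necessarily the metric used in the course. The gene names, read counts, and gene lengths below are made up.

```python
def tpm(counts: dict, lengths: dict) -> dict:
    """Convert raw read counts per gene into transcripts per million."""
    # Reads per base: longer genes collect more reads at equal expression.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no reads were counted")
    # Scale so the values of one sample sum to one million.
    return {gene: 1e6 * rate / total for gene, rate in rates.items()}


counts = {"geneA": 100, "geneB": 100, "geneC": 300}      # reads mapped per gene
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}  # gene lengths in bases

for gene, value in tpm(counts, lengths).items():
    print(gene, round(value))
```

In this toy sample geneC has three times the raw count of geneA but the same normalized value, because it is three times as long. geneB has the same raw count as geneA but half the normalized value.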
Proteins
- RNA has a shorter lifespan. Proteins are longer-lived and get modified.
- Protein sequencing is different from RNA sequencing: mass spectrometry is needed to identify proteins and their modifications, along with quantification methods.
Population Aspects
- Population studies analyze large numbers of samples that differ at individual nucleotides.
- Responses to drugs and susceptibility to diseases have a genetic basis, which is studied through nucleotide differences between individuals.
- Polymorphic nucleotides such as SNPs (single-nucleotide polymorphisms) are collected in dedicated databases, and their variation allows analyses that identify ethnicity or demographics.
- These differences between individuals arise from differences in their genomes.
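A minimal sketch of spotting polymorphic positions is shown below. It assumes that every individual's sequence is already aligned to the reference and has the same length. The sample names and sequences are made up, and real variant databases also record population frequencies, which this toy omits.

```python
def variant_sites(reference: str, samples: dict) -> dict:
    """Map each polymorphic position to the base observed in every sample."""
    sites = {}
    for position, ref_base in enumerate(reference):
        observed = {name: seq[position] for name, seq in samples.items()}
        # Keep the position if at least one sample differs from the reference.
        if any(base != ref_base for base in observed.values()):
            sites[position] = observed
    return sites


reference = "ACGTACGTAC"
samples = {
    "person1": "ACGTACGTAC",
    "person2": "ACGAACGTAC",
    "person3": "ACGAACGTTC",
}

for position, bases in variant_sites(reference, samples).items():
    print(position, reference[position], bases)
```

Positions 3 and 8 are reported. The table of bases per individual at those sites is the raw material for the population-level analyses described above.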