Biological Databases - Lecture Notes

Introduction to Biological Databases

Class Overview

The class focuses on the semantics of biological data rather than database technology (relational databases, object-oriented, columnar databases, etc.).
The main goal is to understand how data is represented and queried in biological databases.
A basic understanding of biology is required, especially for students in the bioinformatics program.
The course is designed to synthesize knowledge from other courses like BIM 181 (algorithmic methods), BIM 186 (distributions and statistics), BIM 100/183 (genomic technologies).
The emphasis is on the interpretation of data, not on quizzing algorithmic special cases.

Prerequisites

BIM 180 is a prerequisite, and comfort with Python is required since it will be used as a scripting language.
Some experience handling large datasets is expected.

Biological Data

Bioinformatics can start at different levels of biological organization (organism, organs, tissues, cells).
Cells are convenient for experiments because they are autonomous.
Focus on three key molecules: proteins, DNA, and RNA.

Proteins

Proteins are considered the cellular machinery, responsible for enzymatic work, signaling, and gene regulation.

DNA

DNA carries all the information on how cells work and includes instructions for making proteins.
Cell transformation: DNA released from dead cells can be taken up by living cells and change their traits.
Transformation can also refer to a cell becoming cancerous because of DNA mutations.

RNA

RNA was initially thought of as a messenger molecule for translating DNA information into proteins.
Now known to have other functions, including acting as enzymes and signaling molecules.
The course will focus on RNA's role in quantifying gene activity.

Bioinformatics and Strings

Biological molecules (DNA, RNA, proteins) can be represented as strings over a small alphabet of building blocks (letters).
- DNA: chain over 4 nucleotides (A, C, G, T).
- RNA: chain over 4 nucleotides (A, C, G, U).
- Proteins: strings over 20 amino acids.
Walter Goad recognized that patterns in these molecules could be studied as strings, marking the beginning of bioinformatics.
GenBank started as a database maintained by Goad, who manually entered sequences.
GenBank is now maintained by NCBI (National Center for Biotechnology Information).

Biological Databases: NCBI and GenBank

NCBI hosts GenBank, which contains biological sequence data.
Older database entries consisted of metadata surrounding a sequence of interest (e.g., protein sequence).
The availability of genome sequencing has caused the sequence data itself to dominate the databases.
Sequence data has grown to petabytes, requiring efficient querying methods.
Sequencing costs have dropped dramatically, to a tiny fraction of a cent per base pair, enabling widespread genome sequencing.
- A human genome can be sequenced 30 times over for about $200.

Querying Biological Databases

A common task is to query a database to find sequences similar to a given query sequence.
Example: Determining if an orphan sequence is human by searching for exact matches in a database of human sequences.
Naive approach: search for a query of length m in a database of length n. Possible answers for the running time:
- $O(m+n)$ or $O(n-m)$
- These answers depend on whether you skip the final m-1 positions of the database, which cannot start a full-length match.
- $O(m \cdot n)$
- This answer takes into account that, at each start position, up to m base pairs have to be compared.
All of these answers can be defended because they count different things, and there are trade-offs and clever combinations between them.
The gap between these orders of growth highlights the need for efficient algorithms.
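The naive search described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function name `naive_find` and the toy sequences are made up for the example. The comparison counter makes the $O(m \cdot n)$ worst case visible.

```python
def naive_find(query, database):
    """Return (start positions of exact matches, number of character comparisons)."""
    m, n = len(query), len(database)
    hits = []
    comparisons = 0
    # Only n - m + 1 start positions can hold a full-length match,
    # so the final m - 1 positions of the database are skipped.
    for start in range(n - m + 1):
        matched = True
        for offset in range(m):
            comparisons += 1
            if database[start + offset] != query[offset]:
                matched = False
                break  # stop at the first mismatch
        if matched:
            hits.append(start)
    return hits, comparisons


# Toy example (hypothetical sequences).
hits, comparisons = naive_find("ACG", "TTACGACGT")
print(hits, comparisons)
```

On a repetitive input such as query "AAA" against database "AAAAA", every start position needs all m comparisons, which is where the product of m and the number of start positions comes from.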

Sequence Similarity and Homology

Identifying a human homolog of a mouse gene (e.g., orexin receptor) can be important for studying human diseases like obesity.
Sequence alignment methods (allowing for mismatches, insertions, and deletions) are used to find functional equivalents.
Tools like BLAST, BWA, and MiniMap employ database filtering to enable fast similarity searches.
If only weakly similar sequences are found (e.g., <60% identity), deciding whether a match is real becomes a statistical rather than an algorithmic question.

Statistical Significance

The question of statistical significance comes up when low sequence identity is observed.
It must be determined whether a match with, for example, 60% identity, is significant or just by chance.
There needs to be a way to systematically probe statistical significance during querying.
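One simple way to probe significance is a shuffling test: compare the observed identity with the identities obtained after randomly shuffling one sequence. The sketch below makes strong simplifying assumptions that are not from the lecture: the two sequences are the same length and are compared without gaps, and the helper names `percent_identity` and `empirical_p_value` are invented here. Search tools such as BLAST report their own significance statistics (E-values) rather than this procedure.

```python
import random


def percent_identity(a, b):
    """Fraction of matching positions between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def empirical_p_value(query, target, trials=1000, seed=0):
    """Estimate how often a shuffled target matches the query at least as well."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = percent_identity(query, target)
    letters = list(target)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps nucleotide composition, destroys order
        if percent_identity(query, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (trials + 1)


# Toy example (hypothetical sequences).
observed, p = empirical_p_value("ACGTACGTACGTACGTACGT", "ACGTACGAACGTACGTTCGT")
print(observed, p)
```

A small estimated p-value means random sequences of the same composition rarely reach the observed identity, which is the sense in which a match is "not just by chance".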

Tricky Question of Similarity

What if your query is 90% similar to one human sequence (h2), 70% similar to a mouse sequence (m1), and 50% similar to a fruit fly sequence (f)? If experiments show that their functions are not all the same, which match do you trust more?
The naive answer is to trust the most similar sequence, but a true human counterpart should be more than 90% similar, so the human hit cannot be trusted.
Gene duplication and paralogs: after a gene duplication event the copies diverge in sequence, and the resulting paralogs can have very different functions.
Orthologs: in this example the fruit fly sequence is the true ortholog. Orthologs tend to conserve function across long evolutionary time scales.
Cases like this limit how far sequence alignment alone can be trusted as a guide to function.

Protein Structure Databases

Protein structures are three-dimensional but can also be described through secondary-structure elements (alpha helices, beta sheets).
Motifs and sequence patterns can be identified using regular expressions based on amino acid properties.
Databases like PROSITE store motifs as patterns that work like regular expressions.
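Motif patterns of this kind can be matched with ordinary regular expressions. The sketch below assumes the pattern syntax used by the PROSITE database (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats) and handles only that subset. The function names and the toy protein sequence are made up for illustration. The example pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        core, _, repeat = element.partition("(")
        if core == "x":
            regex = "."                      # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any residue except these
        else:
            regex = core                     # single letter or [allowed set]
        if repeat:
            regex += "{" + repeat.rstrip(")") + "}"
        parts.append(regex)
    return "".join(parts)


def find_motif(pattern, protein):
    """Return 0-based start positions of all (possibly overlapping) motif matches."""
    # A lookahead lets the regex engine report overlapping occurrences.
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [match.start() for match in regex.finditer(protein)]


# Toy protein sequence (hypothetical).
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTLNASQ"))
```

Storing motifs as patterns keeps database entries compact, and any regex engine can scan a sequence for them in a single pass.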

Central Dogma of Biology

The central dogma describes the flow of information from DNA to RNA to protein.
Genes are regions of DNA that contain instructions for making proteins.
Genes consist of exons (coding regions) and introns (non-coding regions).
Upon transcription, RNA is spliced to remove introns.
Bioinformaticians may be given an orphan genomic sequence and asked to identify the gene (coordinates of exons, etc.).
Machine learning and AI methods might be required to identify genes.
Algorithms are needed to identify gene features in orphan genomic sequences.
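Once exon coordinates are known, splicing itself is simple string manipulation. The sketch below assumes 0-based, end-exclusive coordinates on the coding strand. The function name `splice` and the toy gene (exons in upper case, introns in lower case) are invented for the example. Finding the coordinates in the first place is the hard part and is not attempted here.

```python
def splice(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) into a mature transcript."""
    previous_end = 0
    pieces = []
    for start, end in sorted(exons):
        if start < previous_end or end > len(genomic) or start >= end:
            raise ValueError(f"invalid or overlapping exon: ({start}, {end})")
        pieces.append(genomic[start:end])
        previous_end = end
    # RNA uses U where DNA has T.
    return "".join(pieces).replace("T", "U")


# Toy gene (hypothetical): three exons separated by two introns.
genomic = "ATGGCC" + "gtaagtttcag" + "TTTAAA" + "gtgagcag" + "GGGTGA"
exons = [(0, 6), (17, 23), (31, 37)]
print(splice(genomic, exons))
```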

What percentage of the human genome is genes?

Exonic sequence (the parts that actually code for protein) makes up only about 1.7% of the genome.
The genic region (all exons and introns together) is larger, between 30% and 40% of the genome.
Debate continues about protein-coding genes.

Dynamic Processes

This class covers both the static and the dynamic aspects of bioinformatics.
Static portion: genome sequenced, genes identified.
Dynamic portion includes everything functional going on in the cell.
A gene is transcribed into RNA, and its exons are spliced together to form the mature RNA.
RNAs are then used to make proteins.
One can take a cell, extract all of its RNA sequences, and map them back to the genes they came from, asking for every gene how many copies of RNA are present.
Each time a cell is sampled, a different set of proteins is seen, because proteins are constantly being degraded and assembled.
This simple experiment yields a count of RNA copies for each gene.
- A sample is the collection of all RNA sequences from the cells.
- Counts indicate which genes are active in that sample.
- The counts change with the state of the cell.
- Comparing counts between samples allows observations about cell state.
Example: given samples from two individuals, one with cancer and one without, could the collected data determine which individual has cancer?
Gene length also matters: sequenced fragments can come from anywhere along a gene, so longer genes yield more of them.
Counts therefore need normalization, although biologists often just use simple counts.
Real-life example: a matrix of genes whose activity is correlated with having leukemia versus not having it.
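The need for length normalization can be illustrated with transcripts per million (TPM), one common normalization that the notes do not name specifically. The gene names, counts, and lengths below are hypothetical, and the function `tpm` is a minimal sketch rather than any particular tool's implementation.

```python
def tpm(counts, lengths):
    """Convert raw read counts per gene into transcripts per million (TPM)."""
    # Reads per kilobase: corrects for longer genes producing more reads.
    rate = {gene: counts[gene] / (lengths[gene] / 1000) for gene in counts}
    total = sum(rate.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Rescale so the values within one sample sum to one million.
    return {gene: value / total * 1_000_000 for gene, value in rate.items()}


# Two hypothetical genes with equal raw counts but different lengths (in bases).
counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 4000}
print(tpm(counts, lengths))
```

Both genes have the same raw count, yet the shorter gene ends up with the higher normalized value, because the same number of reads spread over a shorter gene implies more RNA copies.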

Proteins

RNA has a shorter lifespan than protein. Proteins last longer and are often chemically modified.
Proteins cannot be sequenced the way RNA is. Mass spectrometry is needed to identify proteins and their modifications, together with quantification methods.

Population Aspects

Population-level analysis compares nucleotides across large numbers of samples.
Responses to drugs and susceptibility to diseases have a genetic basis, which is studied through nucleotide differences between individuals.
Positions that vary are called polymorphic. Single-nucleotide polymorphisms (SNPs) are collected in dedicated databases, and their patterns can be analyzed to identify ethnicity or demographics.
These differences reflect how individual genomes differ from one another.
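Finding polymorphic positions can be sketched as a column-by-column comparison of aligned sequences. This is a toy illustration that assumes every sample has already been aligned to the same coordinates. The sample names, sequences, and the function `polymorphic_sites` are hypothetical.

```python
from collections import Counter


def polymorphic_sites(samples):
    """Map each variable position to its allele counts across aligned samples."""
    lengths = {len(seq) for seq in samples.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    sites = {}
    for position, column in enumerate(zip(*samples.values())):
        alleles = Counter(column)
        if len(alleles) > 1:  # more than one nucleotide observed here
            sites[position] = dict(alleles)
    return sites


# Four hypothetical individuals, same region of the genome.
samples = {
    "s1": "ACGTAC",
    "s2": "ACGTAC",
    "s3": "ACATAC",
    "s4": "ACGTTC",
}
print(polymorphic_sites(samples))
```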
