Biological Databases - Lecture Notes

Introduction to Biological Databases

Class Overview

  • The class focuses on the semantics of biological data rather than database technology (relational databases, object-oriented, columnar databases, etc.).
  • The main goal is to understand how data is represented and queried in biological databases.
  • Requires a basic understanding of biology, especially for those in the bioinformatics program.
  • The course is designed to synthesize knowledge from other courses like BIM 181 (algorithmic methods), BIM 186 (distributions and statistics), BIM 100/183 (genomic technologies).
  • The emphasis is on the interpretation of data, not on quizzing algorithmic special cases.

Prerequisites

  • BIM 180 is a prerequisite, and comfort with Python is required since it will be used as a scripting language.
  • Some experience handling large datasets is expected.

Biological Data

  • Bioinformatics can start at different levels of biological organization (organism, organs, tissues, cells).
  • Cells are convenient for experiments because they are autonomous.
  • Focus on three key molecules: proteins, DNA, and RNA.
Proteins
  • Proteins are considered the cellular machinery, responsible for enzymatic work, signaling, and gene regulation.
DNA
  • DNA carries all the information on how cells work and includes instructions for making proteins.
  • Cell transformation: DNA released from dead cells can be taken up by living cells and change their traits.
  • Transformation can also refer to a cell becoming cancerous due to DNA mutations.
RNA
  • RNA was initially thought of as a messenger molecule for translating DNA information into proteins.
  • Now known to have other functions, including acting as enzymes and signaling molecules.
  • The course will focus on RNA's role in quantifying gene activity.

Bioinformatics and Strings

  • Biological molecules (DNA, RNA, proteins) can be represented as strings over a small alphabet of building blocks.
    • DNA: chain over 4 nucleotides (A, C, G, T).
    • RNA: chain over 4 nucleotides (A, C, G, U).
    • Proteins: strings over 20 amino acids.
  • Walter Goad recognized that patterns in these molecules could be studied as strings, marking the beginning of bioinformatics.
  • GenBank started as a database maintained by Goad, who manually entered sequences.
  • GenBank is now maintained by NCBI (National Center for Biotechnology Information).
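
The string view of molecules described above can be sketched in a few lines of Python. This is only an illustration: the helper name classify_sequence is hypothetical, and the alphabets use the standard one-letter codes without ambiguity symbols such as N or X.

```python
# Alphabets for the three molecule types (standard one-letter codes).
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def classify_sequence(seq):
    """Return the molecule types whose alphabet contains every letter of seq.

    Hypothetical helper for illustration; real records carry the molecule
    type as metadata instead of guessing it from the letters.
    """
    letters = set(seq.upper())
    names = []
    if letters <= DNA_ALPHABET:
        names.append("DNA")
    if letters <= RNA_ALPHABET:
        names.append("RNA")
    if letters <= PROTEIN_ALPHABET:
        names.append("protein")
    return names


print(classify_sequence("ACGT"))  # letters alone cannot rule out a protein
print(classify_sequence("ACGU"))
print(classify_sequence("MKVLW"))
```

Note that A, C, G, and T are also amino acid codes, so the letters alone do not always identify the molecule type.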

Biological Databases: NCBI and GenBank

  • NCBI hosts GenBank, which contains biological sequence data.
  • Older database entries consisted of metadata surrounding a sequence of interest (e.g., protein sequence).
  • The availability of genome sequencing has caused the sequence data itself to dominate the databases.
  • Sequence data has grown to petabytes, requiring efficient querying methods.
  • Sequencing costs have dropped dramatically (to a small fraction of a cent per base pair), enabling widespread genome sequencing.
    • A human genome can be sequenced 30 times over for about $200.

Querying Biological Databases

  • A common task is to query a database to find sequences similar to a given query sequence.
  • Example: Determining if an orphan sequence is human by searching for exact matches in a database of human sequences.
  • Naive approach: search for a query of length m in a database of length n. Several running-time answers came up:
    • O(m+n) or O(n-m)
    • These answers count the start positions that are tried; the final m-1 positions of the length-n string can be skipped because a full match cannot start there.
    • O(m*n)
    • This answer takes into account that up to m base pairs have to be compared at each start position.
  • Each answer can be defended because it rests on a different assumption about the work done per position; trade-offs and clever combinations are possible.
  • The gap between these orders of growth highlights the need for efficient algorithms.
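
A minimal sketch of the naive search discussed above, assuming plain Python strings for both the database and the query. The function name naive_search is made up for illustration. The outer loop tries n - m + 1 start positions and the inner loop compares up to m letters, which is where the O(m*n) worst case comes from.

```python
def naive_search(text, pattern):
    """Return every start index at which pattern occurs exactly in text."""
    n, m = len(text), len(pattern)
    hits = []
    # Only n - m + 1 start positions can hold a full match,
    # so the final m - 1 positions are never tried.
    for start in range(n - m + 1):
        matched = 0
        # Compare letter by letter and stop at the first mismatch.
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == m:
            hits.append(start)
    return hits


print(naive_search("ACGTACGT", "CGT"))
```

On random sequences the inner loop usually stops after a letter or two, so the typical cost is much closer to n than to m*n. The worst case shows up on repetitive inputs such as a long run of A's.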

Sequence Similarity and Homology

  • Identifying a human homolog of a mouse gene (e.g., orexin receptor) can be important for studying human diseases like obesity.
  • Sequence alignment methods (allowing for mismatches, insertions, and deletions) are used to find functional equivalents.
  • Tools like BLAST, BWA, and MiniMap employ database filtering to enable fast similarity searches.
  • If only weakly similar sequences are found (e.g., <60% identity), interpreting the match becomes a statistical rather than an algorithmic question.

Statistical Significance

  • The question of statistical significance comes up when low sequence identity is observed.
  • It must be determined whether a match with, for example, 60% identity, is significant or just by chance.
  • There needs to be a way to systematically probe statistical significance during querying.
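
One simple way to probe whether an observed identity could arise by chance is a shuffle test, sketched below. This is an assumption-laden illustration, not the statistic used by any particular search tool: it assumes two equal-length, non-empty sequences that are already aligned without gaps, and the names percent_identity and shuffle_p_value are hypothetical.

```python
import random


def percent_identity(a, b):
    """Percent of positions carrying the same letter in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def shuffle_p_value(query, subject, trials=1000, seed=0):
    """Estimate how often a shuffled subject scores at least as well as the real one."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    observed = percent_identity(query, subject)
    letters = list(subject)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys order
        if percent_identity(query, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (trials + 1)


query = "ACGTTGCAAGCTTAGCCGATATCGGCTA"
print(percent_identity("ACGT", "ACGA"))
print(shuffle_p_value(query, query, trials=200))
```

A small value suggests the match is unlikely to be explained by letter composition alone. Shuffling preserves composition, which matters because sequences rich in the same letters match each other often by chance.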

Tricky Question of Similarity

  • What if your query is 90% similar to a human sequence (h2), 70% similar to a mouse sequence (m1), and 50% similar to a fruit fly sequence (f), and experiments show that the functions are not all the same? Which match do you trust more?
  • The naive answer is to trust the most similar sequence. But a human query should match its true human counterpart at well above 90%, so the 90% human hit cannot be trusted as the functional equivalent.
  • Gene duplication and paralogs: after a gene duplication event and sequence divergence, paralogs can have very different functions.
  • Orthologs: in this example the fruit fly sequence is the true ortholog, with function conserved across evolutionary time scales.
  • Cases like this call into question how much sequence alignment alone can tell us about function.

More Questions Regarding Sequence Similarity

  • If sequences A and B are 40% identical and functionally similar, and A is also 40% identical to C, can we infer that C is functionally similar as well?
  • Not without more information.
  • The functional regions may differ even where structural regions are conserved.
  • The matches between A and B and those between A and C may fall in different parts of the sequences.
  • The question becomes how to study systematically what very low levels of identity mean.
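
The point about matches falling in different locations can be made concrete with a toy example. The three sequences below are invented for illustration and assumed to be aligned without gaps. A shares 40% identity with both B and C, yet B and C share nothing.

```python
def identity_positions(a, b):
    """Return the positions at which two aligned, equal-length sequences agree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]


# Invented toy sequences, already aligned position by position.
A = "MKTAYIAKQR"
B = "MKTAGGGGGG"  # agrees with A only at the start
C = "PPPPPPAKQR"  # agrees with A only at the end

for name, (s, t) in {"A-B": (A, B), "A-C": (A, C), "B-C": (B, C)}.items():
    positions = identity_positions(s, t)
    print(name, f"{100 * len(positions) / len(s):.0f}%", positions)
```

If the functional site of A lies in the region shared with B, then C carries none of it despite having the same percent identity to A.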

Protein Structure Databases

  • Protein structures are three-dimensional but also have simpler secondary-structure descriptions (alpha helices, beta sheets).
  • Motifs and sequence patterns can be identified using regular expressions based on amino acid properties.
  • Databases like PROSITE store motif information as regular-expression-like patterns.
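
As a sketch of how such patterns are used, the example below searches a protein string with Python's re module. The pattern is the N-glycosylation site motif, written N-{P}-[ST]-{P} in PROSITE notation and translated by hand into a Python regular expression. The protein string is invented and find_motif is a hypothetical helper name.

```python
import re

# PROSITE notation N-{P}-[ST]-{P}: N, then anything but P, then S or T,
# then anything but P. Hand-translated to Python regex syntax.
N_GLYCOSYLATION = "N[^P][ST][^P]"


def find_motif(protein, pattern):
    """Return (start, matched_text) for every occurrence, overlaps included."""
    # The lookahead makes each match zero-width so overlapping sites are reported.
    scanner = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(protein)]


protein = "MNGSAANPTLNNTS"  # invented sequence for illustration
print(find_motif(protein, N_GLYCOSYLATION))
```

The site starting at position 6 (NPTL) is rejected because a proline follows the asparagine, which is exactly the kind of amino-acid-level rule a regular expression can encode.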

Central Dogma of Biology

  • The central dogma describes the flow of information from DNA to RNA to protein.
  • Genes are regions of DNA that contain instructions for making proteins.
  • Genes consist of exons (coding regions) and introns (non-coding regions).
  • After transcription, the RNA is spliced to remove the introns.
  • Bioinformaticians may be given an orphan genomic sequence and asked to identify the gene (coordinates of exons, etc.).
  • Machine learning and AI methods might be required to identify genes.
  • Algorithms are needed to identify gene features in orphan genomic sequences.
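
Once exon coordinates are known, splicing itself is simple string work, as the sketch below shows. It assumes exon coordinates are given as 0-based, end-exclusive (start, end) pairs on the genomic string. The function names and the toy sequence are made up for illustration. Finding the coordinates in the first place is the hard part that the algorithms above address.

```python
def splice(genomic, exons):
    """Join the exon segments of a genomic sequence, dropping the introns.

    exons: iterable of (start, end) pairs, 0-based and end-exclusive.
    """
    pieces = []
    previous_end = 0
    for start, end in sorted(exons):
        if start < previous_end or start >= end or end > len(genomic):
            raise ValueError(f"bad or overlapping exon: {(start, end)}")
        pieces.append(genomic[start:end])
        previous_end = end
    return "".join(pieces)


def transcribe(dna):
    """Write a DNA coding-strand string in the RNA alphabet (T becomes U)."""
    return dna.replace("T", "U")


# Toy gene: exon, intron (starting GT, ending AG), exon.
genomic = "ATGGCC" + "GTAAGTCCAG" + "GATTAA"
exons = [(0, 6), (16, 22)]
print(transcribe(splice(genomic, exons)))
```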

What percentage of the human genome is genes?

  • Exonic sequence (the part that actually codes for protein) makes up only about 1.7% of the genome.
  • The genic region (entire gene regions, including all introns and exons) is larger, between 30 and 40%.
  • Debate continues about protein-coding genes.

Dynamic Processes

  • This class covers both the static and the dynamic aspects of bioinformatics.
  • Static portion: genome sequenced, genes identified.
  • Dynamic portion includes everything functional going on in the cell.
  • A gene is transcribed into RNA, and its exons are spliced together to form the mature RNA.
  • The RNAs are then used to make proteins.
  • You can take a cell, extract all of its RNA sequences, and map them back to the genes they came from. For every gene, this gives the number of RNA copies present.
  • Each time you sample a cell you see a different set of proteins, because proteins are constantly being degraded and assembled.
  • A simple experiment involves isolating all RNA sequences and counting the copies for each gene.
    • A sample is the collection of all those RNA sequences.
    • The counts indicate which genes are active in that sample.
    • The counts can change depending on the state of the cell.
    • Comparing counts between samples therefore allows observations about cell state.
  • Example: given samples from two individuals, one with cancer and one without, could the collected data determine which individual has cancer?
  • Gene length also matters: reads can come from anywhere along a gene, so longer genes tend to collect more reads.
  • Counts therefore need normalization, even though the measurement itself is simple counting of copies.
  • Real-life example: a matrix of genes whose counts correlate with whether individuals have leukemia or not.
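
The counting and normalization steps above can be sketched as follows. The example assumes each read has already been mapped to a gene name. It uses transcripts per million (TPM) as one common normalization, chosen here for illustration and not necessarily the one used in the course. Gene names, lengths, and read assignments are invented.

```python
from collections import Counter


def count_reads(read_genes):
    """Count how many mapped reads were assigned to each gene."""
    return Counter(read_genes)


def tpm(counts, gene_lengths):
    """Transcripts per million: length-normalize, then scale to sum to 1e6."""
    # Reads per base corrects for long genes collecting more reads.
    rates = {g: counts.get(g, 0) / gene_lengths[g] for g in gene_lengths}
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in rates}
    return {g: 1e6 * r / total for g, r in rates.items()}


# Invented data: both genes get 4 reads, but geneB is twice as long.
reads = ["geneA"] * 4 + ["geneB"] * 4
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1500}
counts = count_reads(reads)
for gene, value in tpm(counts, lengths).items():
    print(gene, counts.get(gene, 0), round(value))
```

With equal raw counts, the shorter gene ends up with twice the normalized value, which is why raw counts alone can mislead when comparing genes.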

Proteins

  • RNA has a shorter lifespan. Proteins last longer and are chemically modified.
  • Protein sequencing differs from RNA sequencing: mass spectrometry is needed to identify protein modifications, together with quantification methods.

Population Aspects

  • Analysis of large numbers of samples that differ at various nucleotides.
  • Responses to drugs and susceptibility to diseases have a genetic basis, studied through nucleotide differences between individuals.
  • Polymorphic nucleotides such as SNPs (single-nucleotide polymorphisms) are collected in dedicated databases; their variation allows analyses that identify ethnicity or demographics.
  • These differences between individuals arise from differences in their genomes.
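
A minimal sketch of spotting polymorphic positions, assuming a non-empty list of sequences from different individuals that are already aligned and of equal length. The function name variable_sites and the sample sequences are invented for illustration. Real SNP databases are built from far larger cohorts with proper variant calling.

```python
from collections import Counter


def variable_sites(sequences):
    """Map each position where the aligned sequences differ to its letter counts."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        column = Counter(s[pos] for s in sequences)
        if len(column) > 1:  # more than one letter seen: polymorphic
            sites[pos] = column
    return sites


# Invented sample of four individuals.
individuals = ["ACGTA", "ACGTA", "ATGTA", "ACGTC"]
for pos, column in variable_sites(individuals).items():
    print(pos, dict(column))
```

The per-position letter counts are the raw material for allele frequencies, which is what population-level comparisons build on.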