Lecture 3 - Sequences and Databases

BIOL 266

New cards

What are biological databases

A computerized archive of biological information (often specialized) in which:
• Data is stored and organized
• Data is sometimes further analyzed, annotated, and visualized
The goal: provide methods for researchers to find what they are looking for
• Queries with search criteria
• Downloads

New cards

What are primary databases

Contain raw, original data, usually generated from experimental results. These databases are repositories for unprocessed or minimally processed data such as nucleotide sequences, protein sequences, and gene expression data.

Relatively little to no curation for some

Another name: core data

New cards

What are some examples of primary databases

GenBank: Stores nucleotide sequences from various organisms (genes/genomes). (Part of NCBI, under the National Institutes of Health)

European Nucleotide Archive (ENA): Provides nucleotide sequences submitted by the scientific community. (The European counterpart of GenBank)

Protein Data Bank (PDB): Contains 3D structural data of proteins and nucleic acids. (Each entry is identified by a unique 4-character code)

UniProtKB (Swiss-Prot/TrEMBL): Stores protein sequences

All have highly variable quality; there is no guarantee that every entry is high quality

New cards

What is the purpose of primary databases and when would you use it

To serve as repositories of original experimental data, allowing researchers to access and share biological data. These databases are often the starting point for further analysis.

You’d use a primary database if you wanted to search for the raw nucleotide sequence of a gene or obtain the protein structure from experimental data

New cards

What is the data type/content of primary databases

The data is usually directly submitted by researchers, often without any extensive analysis or interpretation. It’s raw or annotated only with basic information like species, sequence, or structure

Content: Nucleotide sequences, protein sequences, structural data, expression data, mass-spec data, and more…

New cards

What are secondary databases

Derived from primary databases but contain curated, processed, and interpreted information. They include data that has been analyzed, refined, and sometimes merged with other data sources to provide more insights.

Highly curated and processed

Another name: annotations

New cards

What are examples of secondary databases

Pfam: Stores protein families and domains. (Analyzes protein sequences and groups them into families based on evolution from a common ancestor)

InterPro: Provides functional analysis of proteins by classifying them into families and predicting domains and important sites. (Similar to Pfam; can be searched by sequence, text, or domain architecture)

PROSITE: Contains information about protein domains, families, and functional sites. (Identifies key amino acids and functional sites)

KEGG: Provides curated information on biological pathways and molecular interactions. (Groups enzymes into metabolic pathways)

GTDB: Genome Taxonomy Database; stores a curated set of representative genomes for all available bacterial/archaeal species, taxonomically labeled and phylogenetically analyzed. (One representative genome per species, covering bacteria and archaea. Idea: create a giant tree of life)

AnnoTree: Stores functionally annotated genomes from the GTDB

New cards

What is the data type/content of secondary databases

The data in secondary databases is curated, analyzed, and interpreted by experts. It often includes functional annotations, structural predictions, evolutionary relationships, and interactions.

Content: Functional annotations, classifications, pathways, and other bioinformatic predictions

New cards

What is the purpose and example of usage of secondary databases

To provide added value by analyzing and interpreting the data from primary databases. This helps users understand the biological significance of raw data.

Example Usage: You’d use a secondary database when you need functional annotation of a protein, domain predictions, or pathway information for a set of genes or proteins.

New cards

Are all databases divided into primary or secondary databases

No, many biological databases contain both primary and secondary data (NCBI, for example)

New cards

What are the types of data annotation in the UniProtKB database

Swiss-Prot is the subset of the database in which human experts validate the data to make sure it is high quality and correct (marked with a star)

TrEMBL is the automatically annotated subset (no star)

New cards

What are flat file databases

They store data in a plain text format with no relationships between data entries
• Data is organized in a simple, sequential manner without complex indexing

New cards

What are some characteristics of flat file databases

Human-readable: Files can be opened and viewed in basic text editors.

No relational structure: Each entry is independent; no relationships or links between entries.

Simple storage: Suitable for small to moderately sized datasets

New cards

What are the advantages and limitations of flat file databases

Advantages:
• Easy to create and share across different systems.
• Flexible and simple for basic data storage and retrieval tasks.

Limitations:
• Scalability: Inefficient for very large datasets or complex queries.
• Data retrieval: Slow when extracting specific information compared to relational databases.
• No built-in data validation: Errors in data entry or formatting can easily occur

New cards

What are example formats in computational biology of flat file databases

GFF/GTF and GenBank (GBK) formats: For genomic feature annotations (e.g., gene positions)

FASTA: For storing DNA, RNA, or protein sequences

CSV/TSV: For tabular data (e.g., gene expression, variant data)

Work in “plain text” mode; other text formats (e.g., docx) will not work

New cards

What are some characteristics of the GenBank format

Header: descriptive, non-computational information about the record

Features: gives predictions/annotations

Origin: the primary raw sequence data

Difficult to work with for very large genomes

New cards

What are some characteristics of the FASTA format

Header: a title/label line starting with ">", most of the time something like mySequence or sequence 1, 2, etc.

Sequence: on the line(s) under the header
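The header-then-sequence layout on this card is the FASTA layout (a ">" header line followed by sequence lines). A minimal sketch of reading it in Python follows; the function name `parse_fasta` and the example records are made up for illustration, and real projects would more likely use an established parser.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # new record: remember its label
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


# Hypothetical two-record FASTA text, for illustration only.
example = """>mySequence
ATGGCC
TTAA
>sequence2
GGGCCC
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```

Because the file is plain text, nothing beyond basic string handling is needed, which is the main appeal of flat file formats.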

New cards

What are relational databases and examples

Data is stored in a number of tables linked together by a shared identifier or key
- the key must be unique to each record

Handles huge amounts of data
- reduces the data held in memory
- faster search and retrieval

Examples
- Most large bioinformatic databases use relational DB tools such as MySQL or PostgreSQL
- e.g., AnnoTree DB developed here at UW uses a MySQL database
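To make the idea of tables linked by a shared key concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for MySQL or PostgreSQL. The table names, column names, and rows are invented for illustration and do not reflect the schema of AnnoTree or any real database.

```python
import sqlite3

# In-memory database; sqlite3 stands in here for MySQL/PostgreSQL.
conn = sqlite3.connect(":memory:")

# Two tables linked by the shared key genome_id (hypothetical schema).
conn.execute("CREATE TABLE genomes (genome_id TEXT PRIMARY KEY, species TEXT)")
conn.execute(
    "CREATE TABLE annotations ("
    "protein_id TEXT PRIMARY KEY, genome_id TEXT, family TEXT, "
    "FOREIGN KEY (genome_id) REFERENCES genomes (genome_id))"
)

conn.executemany(
    "INSERT INTO genomes VALUES (?, ?)",
    [("g1", "Escherichia coli"), ("g2", "Bacillus subtilis")],
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [("p1", "g1", "kinase"), ("p2", "g1", "protease"), ("p3", "g2", "kinase")],
)

# Join the tables on the shared key to answer: which species have a kinase?
rows = conn.execute(
    "SELECT g.species, a.protein_id FROM annotations AS a "
    "JOIN genomes AS g ON a.genome_id = g.genome_id "
    "WHERE a.family = ? ORDER BY a.protein_id",
    ("kinase",),
).fetchall()
print(rows)
conn.close()
```

The species name is stored once per genome rather than once per protein, which is how the relational layout keeps the stored data compact.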

New cards

How are biological sequences represented

Can be represented as “strings”

In the form of DNA, RNA or Protein sequences

New cards

What are the six standard nucleotide/nucleotide ambiguity codes

Cytosine - C

Guanine - G

Adenine - A

Thymine - T

Uracil - U

Any (A,G,C,T) - N

New cards

What website can you use to translate DNA or RNA sequences into a protein sequence

Expasy operated by the SIB Swiss Institute of Bioinformatics

New cards

What do reading frames imply for DNA/RNA sequencing

Reading frame 5′ to 3′ - 1: start from the first letter, reading forward

Reading frame 5′ to 3′ - 2: start from the second letter, reading forward

Reading frame 5′ to 3′ - 3: start from the third letter, reading forward

Reading frame 3′ to 5′ - 1: start from the first letter of the reverse (complementary) strand

Reading frame 3′ to 5′ - 2: start from the second letter of the reverse (complementary) strand

Reading frame 3′ to 5′ - 3: start from the third letter of the reverse (complementary) strand
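The six frames can be sketched in Python as follows. This is an illustration, not how any particular website implements it: it assumes the standard genetic code, treats the three reverse frames as offsets 0, 1, and 2 on the reverse complement, and marks codons it cannot translate (e.g., containing N) as "X". The function and variable names are made up for this example.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' is a stop, 'X' is unknown."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return translations for the 3 forward and 3 reverse reading frames."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    frames = {}
    for strand, seq in (("5'3'", dna), ("3'5'", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand} frame {offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Shifting the start position by one or two letters changes every codon downstream, which is why the same DNA string can give six different protein strings.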

New cards

What are strings in biological sequencing

A sequence of characters or symbols, used in computer science to represent data

Biological sequences are represented as strings of letters, where each letter corresponds to a nucleotide or amino acid

New cards

What are DNA/RNA strings in biological sequencing

Composed of the characters A,T (or U), C, G representing nucleotides

New cards

What are protein strings in biological sequencing

Composed of one-letter codes representing amino acids

New cards

Why do we use strings for biological sequences

Efficient Representation: Strings allow biological sequences to be efficiently stored, processed, and analyzed in computational tools.

Manipulatable: Using algorithms, strings can be searched, aligned, compared, and transformed to study biological patterns and relationships.

New cards

What are some operations we can do on strings

Alignment: Compare two or more sequences (strings) to find similarities.

Search: Locate specific patterns or motifs within a biological sequence.

Mutation: Simulate biological mutations by modifying characters in a string
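These operations map directly onto ordinary string handling. The sketch below uses made-up helper names and toy sequences. Note that `percent_identity` only compares two equal-length strings position by position, which is a simplification and not a real alignment algorithm.

```python
def find_motif(sequence, motif):
    """Search: return every 0-based start position of motif (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def point_mutation(sequence, position, new_base):
    """Mutation: return a copy with one character replaced (strings are immutable)."""
    return sequence[:position] + new_base + sequence[position + 1:]


def percent_identity(seq_a, seq_b):
    """Comparison of two pre-aligned, equal-length sequences (not a true alignment)."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


original = "ATGC"
mutant = point_mutation(original, 2, "T")

print(find_motif("ATGATGCATG", "ATG"))
print(mutant)
print(percent_identity(original, mutant))
```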

New cards

What are common biological string formats

FASTA (.faa, .fna, .fa, .fasta): A common format for storing biological sequences as strings.

Plain Text: Biological sequences are often stored in plain text as strings for easy manipulation and sharing

New cards

What are some advanced search options

Options for narrowing down results based on factors like organism, function, or location

New cards

What are controlled vocabularies/Ontologies

Controlled vocabularies standardize terms, helping researchers define and refine their search criteria.

These standardized terms ensure consistent interpretation and improve the accuracy of search results

New cards

What does ‘AND’ mean in terms of logic operators in searches

Retrieves records that match all criteria, representing the intersection of two sets.

Example: Genes involved in cancer AND immune response (only genes in both categories)


New cards

What does ‘OR’ mean in terms of logic operators in searches

Retrieves records that match any of the criteria, representing the union of two sets.

Example: Genes involved in cancer OR immune response (genes in either category)


New cards

What does ‘NOT’ mean in terms of logic operators in searches

Excludes records that match certain criteria, representing the difference between sets.

Example: Genes involved in cancer NOT in the immune response (only cancer-related genes, excluding immune-related ones)


New cards

What does ‘NOR’ mean in terms of logic operators in searches

Excludes records matching any of the criteria, representing data outside of the union of two sets
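The four operators correspond to standard set operations, which Python's built-in `set` type can demonstrate. The gene names below are placeholders, not real annotations.

```python
# Placeholder gene sets, invented for illustration.
cancer = {"geneA", "geneB", "geneC"}
immune = {"geneC", "geneD"}
all_genes = {"geneA", "geneB", "geneC", "geneD", "geneE"}  # everything in the database

and_result = cancer & immune                 # AND: intersection
or_result = cancer | immune                  # OR: union
not_result = cancer - immune                 # NOT: difference
nor_result = all_genes - (cancer | immune)   # NOR: outside the union

print("AND:", sorted(and_result))
print("OR: ", sorted(or_result))
print("NOT:", sorted(not_result))
print("NOR:", sorted(nor_result))
```

NOR needs the full set of records as a reference point, since "outside the union" is only defined relative to everything the database contains.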


New cards

What is the best way to find genes/proteins related to gene/protein of interest

Through homology search (e.g., BLAST), not from text searches

New cards

What is sequence redundancy in biological databases

Biological databases (sequence databases in particular) have to deal with many duplicate (“redundant”) sequences

Can occur due to researcher submissions involving identical sequences

Databases are computationally filtered to reduce or remove redundancy

New cards

How can databases be computationally filtered to reduce or remove redundancy

UniRef100 combines identical sequences into a single “cluster”

• identifies 100% identical sequences

UniRef90 combines sequences with 90+% identity

• clusters highly (90% or greater) similar sequences

UniRef50 combines sequences with 50+% identity

• clusters sequences with 50% or greater similarity

UniRef50 redundancy < UniRef90 redundancy < UniRef100 redundancy
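The effect of the identity threshold can be illustrated with a toy greedy clustering. This is not the actual UniRef pipeline: it assumes equal-length sequences compared position by position, assigns each sequence to the first cluster whose representative it matches, and uses invented sequences and function names.

```python
def identity(a, b):
    """Fraction of matching positions; sequences of unequal length score 0."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Put each sequence into the first cluster whose representative
    (its first member) is at least `threshold` identical; else start a new one."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no cluster matched: seq becomes a representative
    return clusters


# Toy sequences, invented for illustration.
seqs = ["AAAAAAAAAA", "AAAAAAAAAA", "AAAAAAAAAT", "AAAAATTTTT", "CCCCCCCCCC"]

counts = {t: len(greedy_cluster(seqs, t)) for t in (1.0, 0.9, 0.5)}
for threshold, n in counts.items():
    print(f"identity >= {threshold:.0%}: {n} clusters")
```

Lowering the threshold merges more sequences into each cluster, so fewer representatives remain and less redundancy is left in the set.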

New cards

What are some examples of computer error in database entries

Incorrect annotations or predictions

Missed relationships (insufficient information extraction)

New cards

What are some examples of human error in database entries

Highly variable quality of deposited sequence data (e.g., vector sequences left in, PCR chimeras)

Taxonomic misidentifications and mislabeling

Inconsistency in researcher labeling of gene/protein names and functional descriptions

Simple typos

Database propagation of initial human misannotations

New cards

What are the three broad classes of errors in database entries

Sequence - errors within the sequences

Metadata - incorrect identifying information or names attached to the data entry

Propagation - carrying over errors from other records/databases, and not updating records when the source changes

New cards

Why is there no BLAST search capable of searching the SRA

Some bioinformatic databases are growing so large that they are becoming difficult to analyze

The NCBI Sequence Read Archive (SRA) holds raw sequencing data for virtually all types of sequencing projects. It is the world’s largest repository of sequencing information

New cards