Bioinformatics
Bioinformatics
is the science of storing, retrieving and analysing large amounts of biological information
Where can I find nucleotide sequence datasets?
International Sequence Database Collaboration
Where can I find protein sequences datasets?
UniProt Consortium
Where can I find macromolecular structure datasets?
Worldwide Protein Data Bank
Where can I find molecular interaction datasets?
The International Molecular Exchange Consortium
Where can I find protein identification datasets?
The ProteomeXchange Consortium
Where can I find genomic and clinical datasets?
Global Alliance for Genomics and Health
What are primary datasets?
Datasets populated with experimentally derived data
What are secondary datasets?
Datasets comprising data derived from the results of analysing primary data
What are some examples of secondary datasets?
InterPro (protein families, motifs and domains)
UniProt Knowledgebase (sequence and functional information on proteins)
Ensembl (variation, function, regulation and more layered onto whole genome sequences)
What are some examples of primary datasets?
ENA, GenBank and DDBJ (nucleotide sequence)
ArrayExpress and GEO (functional genomics data)
Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)
What is metadata and what is an example of it?
Essentially, data about the data. For example, if you sequence samples from the environment, perhaps to understand biodiversity in different conditions or to investigate associations between crop yield and differences in soil flora, it would be useful to know when and where your samples were collected.
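A minimal sketch of how "when and where" metadata might be kept alongside a sample, in Python. The class, field names and values are hypothetical and do not follow any particular database schema.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class SampleMetadata:
    """Data about a sample, kept separate from the sequence data itself."""
    sample_id: str          # hypothetical local identifier
    collection_date: date   # when the sample was collected
    latitude: float         # where it was collected (decimal degrees)
    longitude: float
    environment: str        # short description of the sampling site


# Illustrative values only.
sample = SampleMetadata(
    sample_id="soil-001",
    collection_date=date(2021, 5, 14),
    latitude=52.08,
    longitude=0.19,
    environment="wheat field topsoil",
)

# A plain dictionary is easy to export or attach to a submission.
record = asdict(sample)
```

Recording these fields at collection time is usually easier than reconstructing them later.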
What is an example of a data library that has metadata in bioinformatics?
BioSamples database
Minimum information standards
Their purpose is to ensure the data generated by these methods can be easily verified, analysed and interpreted by the wider scientific community. Ultimately, they facilitate the transfer of data from journal articles (unstructured data) into databases (structured data) in a form that enables data to be mined across multiple data sets.
What is the simplest type of controlled vocabulary?
The simplest controlled vocabularies are non-hierarchical lists of terms, such as a list of countries. Annotating data with these lists makes it easier to filter or search for related records in a database. For example, if you use Europe PMC’s advanced literature search and filter the results by language, you are choosing from items in a list determined by a controlled vocabulary.
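As a rough illustration, a flat controlled vocabulary can be sketched in Python as a set of allowed terms that is used to check annotations and filter records. The vocabulary, records and function names below are hypothetical and unrelated to how Europe PMC implements its search.

```python
# Hypothetical controlled vocabulary: a flat, non-hierarchical list of terms.
LANGUAGES = {"English", "French", "German", "Spanish"}


def validate_term(term, vocabulary):
    """Return the term if it is in the vocabulary, otherwise raise ValueError."""
    if term not in vocabulary:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term


def filter_by_term(records, field, term, vocabulary):
    """Keep only the records whose field equals a valid vocabulary term."""
    validate_term(term, vocabulary)
    return [r for r in records if r.get(field) == term]


# Illustrative records annotated with terms from the vocabulary.
articles = [
    {"title": "Paper A", "language": "English"},
    {"title": "Paper B", "language": "French"},
    {"title": "Paper C", "language": "English"},
]

english_titles = [
    a["title"] for a in filter_by_term(articles, "language", "English", LANGUAGES)
]
```

Because every annotation comes from the same list, a misspelt filter term such as "Englsh" is rejected instead of silently returning no results.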
Thesaurus (in IT)
is defined as a controlled and structured vocabulary in which concepts are represented by terms.
Where can I learn more about PubMed databases?
The NLM provides a series of webinars and tutorials about MeSH; one that may be of particular interest is Searching Drugs or Chemicals in PubMed
What is an ontology in IT?
A representation of the shared background knowledge for a community (7). An ontology describes the categories of objects described in a body of data, the relationships between those objects, and the relationships between those categories.
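A toy sketch of the idea in Python: categories linked by "is a" relationships, with a helper that walks up the hierarchy. The terms and structure are illustrative and are not taken from any published ontology.

```python
# Hypothetical "is a" relationships: each term maps to its parent categories.
IS_A = {
    "hepatocyte": ["epithelial cell"],
    "epithelial cell": ["cell"],
    "neuron": ["cell"],
    "cell": [],
}


def ancestors(term, is_a):
    """Return every category reachable from the term via 'is a' links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_kind_of(term, category, is_a):
    """True if the term is the category itself or one of its descendants."""
    return term == category or category in ancestors(term, is_a)
```

With relationships like these, a search for "cell" can also return records annotated with the more specific term "hepatocyte", which a flat list of terms cannot do.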
What are some tips for managing and collecting accurate data?
Start early – begin collecting data and metadata at the beginning of your experiment
Consider creating a data management plan, using tools such as DMPonline and the Data Stewardship Wizard
Identify the correct database (see ‘Where do I submit my data?’ on the next page)
Speak to the curators who work with that database – check what you need to submit!
Learn about the metadata requirements and data standards used in your field. You can look these up on FAIRsharing.org.
Use an ontology to annotate the data, for example the Experimental Factor Ontology.
How do I submit data to EMBL-EBI?
Through the EMBL-EBI submission portal
What does InterPro do?
Classifies protein sequences into families and predicts the presence of domains and important sites
What website can, based on a provided sequence, build a protein model?
SWISS-MODEL
What does a QMEAN z-score mean in SWISS-MODEL?
It represents an estimate of how comparable the model is to experimentally derived structures of similar size. QMEAN z-scores around zero indicate good agreement between the model structure and experimental structures of similar size. Models of low quality typically have scores of -4.0 or lower. The “thumbs-up” and “thumbs-down” symbols next to the score indicate whether or not the model is of good quality (9). Another approach is to factor in observations of the quality of the alignment and template search method; this is represented in the GMQE (Global Model Quality Estimation) score. The GMQE score reflects the expected accuracy of that alignment and is expressed as a number between 0 and 1, where higher numbers indicate higher reliability (9). For more information see the SWISS-MODEL documentation pages.
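The rules of thumb above can be sketched as a small Python helper. This is only an illustration: the function names are hypothetical, the `near_zero` tolerance is an assumption (the text gives no numeric bound for "around zero"), and none of it is part of SWISS-MODEL itself.

```python
def describe_qmean(z_score, near_zero=1.0):
    """Label a QMEAN z-score using the rules of thumb from the text.

    near_zero is an assumed tolerance for "around zero"; it is not a
    SWISS-MODEL parameter.
    """
    if z_score <= -4.0:
        return "low quality"
    if abs(z_score) <= near_zero:
        return "good agreement"
    return "intermediate"


def rank_by_gmqe(models):
    """Sort model names from highest to lowest GMQE (valid range 0 to 1)."""
    for name, gmqe in models.items():
        if not 0.0 <= gmqe <= 1.0:
            raise ValueError(f"GMQE for {name!r} must be between 0 and 1")
    return sorted(models, key=models.get, reverse=True)


# Illustrative scores for three hypothetical models.
labels = [describe_qmean(z) for z in (-0.3, -2.5, -4.6)]
ranking = rank_by_gmqe({"model_1": 0.42, "model_2": 0.81, "model_3": 0.67})
```

In practice the two scores are read together, since a model built from a reliable alignment can still contain poorly modelled regions.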
What are some tools to help integrate data?
Tools such as UniProt ID Mapping and Ensembl BioMart allow you to convert a set of identifiers from one format to another. There are also mappings between different controlled vocabularies, but take care that you don’t lose data. For example, a term in one ontology might be mapped to a term that is less granular, so you might lose specificity. At EMBL-EBI we use application ontologies, the archetypal example of which is the Experimental Factor Ontology, to solve this problem.
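The core of identifier conversion can be sketched in Python with a local lookup table. This does not call UniProt ID Mapping or BioMart; the table and function name are illustrative assumptions, and real mappings should come from the services themselves.

```python
# Illustrative local lookup from UniProt accessions to Ensembl gene IDs.
ID_MAP = {
    "P04637": "ENSG00000141510",  # TP53
    "P38398": "ENSG00000012048",  # BRCA1
}


def convert_ids(identifiers, mapping):
    """Convert identifiers and report any that could not be mapped."""
    converted = {}
    unmapped = []
    for identifier in identifiers:
        if identifier in mapping:
            converted[identifier] = mapping[identifier]
        else:
            unmapped.append(identifier)
    return converted, unmapped


# "X00000" is a made-up identifier used to show the unmapped case.
converted, unmapped = convert_ids(["P04637", "X00000", "P38398"], ID_MAP)
```

Returning the unmapped identifiers explicitly makes it harder to lose records without noticing during integration.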
What does EMBL-EBI’s Embassy Cloud do?
EMBL-EBI’s Embassy Cloud provides EMBL-EBI’s collaborators with direct access to their datasets hosted at EMBL-EBI, and to the institute’s powerful computing resources. This shared, high-performance workspace allows project partners in many locations to analyse their data alongside public offerings, using their own approaches.
What is a good guideline for data standardization?
Toni Kazic’s guide for data provenance
Where can I see drug targets and disease data?
Open Targets
What are the four steps of a bioinformatics experiment?
– Search
– Compare
– Model
– Integrate