Examples of raw data: "yes, no, no, yes, no", "8.5, 7.5", "123, 11.1"
Without context, data is useless
Adding meaning to data makes it useful and allows for further actions and analysis
Processing rules and inference rules can be applied to turn information into knowledge
Knowledge allows for decision-making and taking actions based on the information
Contextualization increases the usefulness and interpretability of information
The more context given to data, the higher it goes in the knowledge pyramid
Interpretable information is more reusable
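The climb from data to information to knowledge described above can be sketched in a few lines of code. All values, labels, and the threshold rule below are hypothetical illustrations, not part of the original notes.

```python
# Climbing the knowledge pyramid: raw data -> information -> knowledge.
raw_data = [8.5, 7.5]  # meaningless on its own, like the examples above

# Information: attach context (what was measured, on which scale).
information = [
    {"measurement": "review score", "scale": "0-10", "value": v}
    for v in raw_data
]

# Knowledge: a processing rule that turns information into a decision.
def recommend(item):
    """A hypothetical inference rule: scores of 7 or higher are 'good'."""
    return "recommend" if item["value"] >= 7 else "skip"

decisions = [recommend(item) for item in information]
```

The same numbers that were uninterpretable as raw data now support a decision, which is exactly the step from information to actionable knowledge.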
Knowledge involves more subjective and psychological aspects
Knowledge can be explicit (written down) or implicit (in someone's head)
Experts possess valuable knowledge that is difficult to formalize
Explicit knowledge is written down in rules, databases, schemas, etc.
Formal languages are used to represent explicit knowledge
Knowledge graphs are used to combine data, information, and formalized knowledge
The amount of written knowledge is limited compared to the knowledge in people's minds
Intuition and experience are challenging to capture in written form
Contextualizing data and information makes it more useful and valuable
Knowledge can be used to contextualize information
There is a two-way cycle: knowledge contextualizes information, and contextualized information in turn produces new knowledge
Data scientists spend a significant amount of time cleaning and organizing data
Using knowledge to automatically contextualize and interpret data is valuable from a business perspective
Formal knowledge allows for interpretation of data, making data science easier
Formal knowledge is written down and explicit, not the knowledge in the data scientist's head
Formal knowledge is not an alternative to machine learning, but they can work together
Knowledge graphs are a common way of writing down information, data, and knowledge
Knowledge graphs are a way to represent data, information, and knowledge.
They are useful when dealing with heterogeneous data from different sources.
Knowledge graphs make the semantics and meaning of information explicit.
They are represented in a network-like structure.
Knowledge graphs are an alternative to relational databases.
Databases are represented in tables with columns and rows, while knowledge graphs have nodes and edges.
The two models have fundamental differences.
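The contrast between the two models can be made concrete with a small sketch. The sample person and properties below are invented for illustration.

```python
# Relational style: a fixed schema of columns; every row has the same shape.
table = [
    {"id": 1, "name": "Ada Lovelace", "born": 1815},
]

# Graph style: (subject, predicate, object) edges between nodes.
# There is no fixed schema, so adding a new property is just one more edge,
# and relations to other nodes are first-class citizens.
graph = [
    ("person:1", "name", "Ada Lovelace"),
    ("person:1", "born", 1815),
    ("person:1", "knows", "person:2"),
]

# Querying the graph: collect all edges about person:1.
facts = [(p, o) for s, p, o in graph if s == "person:1"]
```

The table needs a schema change to record a new kind of fact; the graph simply gains an edge, which is why graphs suit heterogeneous data.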
There is an increasing amount of data available online, such as medical, government, and museum data.
Data is spread out and shaped differently, creating silos.
Silos are geographically and semantically distributed.
Connecting these silos would allow for more comprehensive analysis and understanding.
Connecting silos is particularly important in domains like cultural heritage.
Tim Berners-Lee invented the World Wide Web and wrote a document outlining his idea while working at CERN.
He proposed the idea of sharing information and documents between researchers.
His boss responded with the famous annotation, "Vague, but exciting."
The first page of the document shows familiar elements of the World Wide Web, such as documents and links.
It allowed for the connection of remotely hosted documents.
It also included hierarchies, concepts, and relations between people and documents.
In 2001, Tim Berners-Lee recognized the need for a web of data (instead of web of documents) and proposed the concept of the Semantic Web.
The current web consists of applications hosted at different locations, but there is no web of data.
The vision is to build a web of data that allows for the integration of different applications.
The analogy between a web of documents and a web of data is powerful.
The web of documents allows for easy linking to external sources of information.
It relieves the burden of having to know everything about a topic and allows for information reuse.
The web of data would provide users with connections and access to information from multiple contributors.
The web of data is similar to the World Wide Web but with data instead of websites.
It involves using databases and data items instead of web pages.
The goal is to increase the usefulness of data and enable the reuse of existing data.
The web of data is a network of data points or datasets.
Two challenges need to be resolved: integrating heterogeneous information and dealing with physical distribution.
The physical integration is solved by the web, while the semantic integration is solved by understanding how to write down knowledge.
Linked data is the idea of linking data sets from multiple sources.
Adding meaning to linked data results in the semantic web.
Knowledge graphs are the formalism used to build the Semantic Web.
The web of data is intended to be interpreted by machines.
Data is linked and comes from different sources.
The power of linked data and knowledge graphs is evident when dealing with heterogeneous data.
Data integration becomes a challenge when data of different types and structures comes from different sources.
Knowledge graphs and linked data help solve the challenges of connecting distributed information and writing down diverse information.
Tim Berners-Lee and others proposed four principles for building knowledge graphs.
A data provider gives names to the things they want to talk about
Names depend on the domain and task at hand
Not all details need to be named, only relevant ones
Names can be addresses on the web (URIs, Uniform Resource Identifiers)
URIs provide globally unique identifiers for objects
Browsers or applications can follow URIs to find the object
Relations can be established between different things
Relations create a network or graph data model
Adding relations allows for the creation of graphs
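The first two principles, web URIs as names and relations between them, can be sketched with plain triples. The URIs below use real vocabularies (schema.org, GeoNames), but the specific statements are invented examples.

```python
# (subject, predicate, object) triples: URIs name things globally,
# and each triple is one edge in the resulting graph.
triples = [
    ("https://example.org/person/ada",
     "https://schema.org/name",
     "Ada Lovelace"),
    ("https://example.org/person/ada",
     "https://schema.org/birthPlace",
     "https://sws.geonames.org/2643743/"),  # GeoNames URI for London
]

# Because names are globally unique URIs, anyone anywhere can add
# new facts about the same entity without naming collisions.
subjects = {s for s, p, o in triples}
```

A browser or application can dereference any of these URIs to find out more about the object, which is what makes the names web addresses rather than local labels.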
Emphasizes the importance of explicitly defining the meaning of things in knowledge graphs.
Adding semantics makes connections and relationships between entities more meaningful and understandable.
Semantics clarify the context and purpose of information stored in the knowledge graph.
Explicitly defining the meaning of things enables extraction of valuable insights and various data science tasks.
Allows for effective communication and utilization of knowledge within a web setting.
Graphs can be distributed across different sources
Different sources can have their own identifiers for the same concept
Relations between different things can be established across sources
Different sources can focus on different aspects of the data
Schema.org defines the concept of a person
RDF provides the "type" relation (rdf:type) that states what kind of thing something is
Information is distributed and outsourced across different sources
Creates a globally distributed graph of linked data
Allows for the outsourcing of information to different sources
Enables the creation of smaller graphs for specific purposes
In the past, the speaker would have to manually input information about Harlem into their own database.
Now, they can make a link to another dataset, specifically one maintained by GeoNames.
By linking their data to the GeoNames dataset, they gain access to a wealth of information for free.
This allows for network queries and various data science tasks such as visualization, data analysis, and data querying.
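The Harlem example can be sketched as merging two graphs that share an identifier. The photo URI, the GeoNames-style placeholder URI, and all facts below are hypothetical.

```python
# Placeholder GeoNames-style identifier (not a real GeoNames ID).
HARLEM = "https://sws.geonames.org/XXXXXXX/"

# Our own data: we only state the link, not the facts about Harlem.
our_graph = [
    ("https://example.org/photo/42", "depicts", HARLEM),
]

# Facts published by the external source under the same identifier.
external_graph = [
    (HARLEM, "name", "Harlem"),
    (HARLEM, "parentFeature", "New York City"),
]

# Because both graphs use the same URI, merging is trivial and we get
# the external facts "for free" for querying and analysis.
merged = our_graph + external_graph
names = [o for s, p, o in merged if s == HARLEM and p == "name"]
```

Once merged, queries can traverse from our photo to facts we never entered ourselves, which is the payoff of linking rather than copying.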
The speaker introduces the first three principles of linked data:
Use web URIs for names.
Put relations between things to create a web of linked data.
Use knowledge graphs to enable data science tasks.
They mention that there is a fourth principle, which will be discussed after a break.
Computers struggle to understand textual information and context
Humans are good at reading and interpreting English text
Computers need a formal representation of information to understand it
Formal representation of information allows for predictable inferencing
Example: Using a formalism to understand the meaning of a statement
Computers can derive new facts based on the formal representation
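Predictable inferencing over a formal representation can be shown with a toy example. The facts and the two rules below are invented for illustration; real systems use standardized formalisms such as RDFS or OWL.

```python
# Explicit facts in (subject, relation, object) form.
facts = {
    ("cat", "subclass_of", "mammal"),
    ("mammal", "subclass_of", "animal"),
    ("felix", "type", "cat"),
}

def infer(facts):
    """Apply two simple rules until nothing new appears:
    1. subclass_of is transitive;
    2. type propagates along subclass_of."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(facts):
            for c, r2, d in list(facts):
                new = None
                if r1 == r2 == "subclass_of" and b == c:
                    new = (a, "subclass_of", d)
                elif r1 == "type" and r2 == "subclass_of" and b == c:
                    new = (a, "type", d)
                if new and new not in facts:
                    facts.add(new)
                    changed = True
    return facts

derived = infer(facts)
# The machine now also knows felix is a mammal and an animal,
# facts that were never written down explicitly.
```

Because the rules are formal, every derived fact is predictable: the same input always yields the same conclusions, which is what the notes mean by predictable inferencing.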
Semantic web combines naming, graph relationships, and explicit semantics
Giving things names and addressing them on the web
Representing relationships between things using graphs of data
Adding explicit semantics for predictable inferencing
Semantic web enables the web of data
Machines can understand and derive information from machine-readable formats
Knowledge graphs play a role in the semantic web
Knowledge graphs may not be widely known or discussed in the news
There are many knowledge graphs available for searching and use in various domains.
Life sciences, government data, media data, publication, social networking, etc.
These are public and open knowledge graphs.
Combining different databases in life sciences is important for drug research and discovery.
Understanding how enzymes work, existing drugs, and genomic pathways.
There has been an opening up and connecting of data in life sciences and other domains.
Some generic knowledge graphs mentioned:
YAGO: a general-purpose knowledge graph with various facts.
DBpedia: a project that converts Wikipedia infoboxes into a knowledge graph format.
Freebase: a combination of several knowledge graphs used in commercial applications.
Google Knowledge Graph: one of the largest knowledge graphs used by Google in search results.
Google Knowledge Graph provides information boxes about entities searched for.
These boxes are derived from Google's own knowledge graph.
Many companies and organizations are using knowledge graphs in their infrastructure.
Netflix for recommendations.
German National Library.
Elsevier for better services for scientists.
IKEA for product catalog and user information.
Knowledge graphs are used for various tasks in different industries.
Amazon for a product graph in their web tool.
Uber for a food ontology used in food recommendations.
AI is not limited to machine learning, it encompasses various tasks and approaches
AI includes machine learning, knowledge representation, and other theories
Machine learning relies on statistics and signal processing
Knowledge representation is based on logics
AI can be divided into symbolic AI and sub-symbolic AI
Symbolic AI focuses on knowledge representation, while sub-symbolic AI focuses on deep learning
There are interesting connections between machine learning and knowledge representation
Symbolic methods, such as knowledge graphs, can be combined with machine learning
Knowledge graphs can provide formal knowledge to enhance machine learning results
Symbolic methods can be used for explainability and data input/output
The combination of symbolic and sub-symbolic AI is an interesting approach
The course will not cover this combination, but students should be equipped to explore it after completing their AI bachelor's degree