What are Knowledge Graphs? An Overview
In 2001 the founder of the W3C, Tim Berners-Lee, stated: “The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation”. Since the publication of that paper, the Semantic Web has taken its first steps by representing knowledge in the form of web standards. The following phase, more recent, is enabling a vast repertoire of applications thanks to the emergence of machine learning methods, while keeping the notion of knowledge organized as a graph: the Knowledge Graph (KG).
The common denominator of these applications is the leverage and momentum they gain from the expressiveness that KGs provide. In industrial scenarios, the so-called Industry 4.0, semantic technologies generate business value for process automation, risk management and even the monitoring of the entire factory. The Building DigitalTwin, to mention one use case, uses KGs to integrate and represent knowledge in order to create a full virtual copy of real buildings and ultimately run applications for facility management. External knowledge is integrated to augment internal data in order to discover conflicts of interest, keep KPIs on track and, more generally, detect risks never seen before. The reliability and safety requirements of industrial manufacturing raise the demand for shared terminology, standards and best practices, collected in an enormous dataset (I40KG) of accepted terminology from international institutions such as ISO, IEC and ETSI, together with national organizations.
At first glance the Industry Graph Ontology, like its counterparts in healthcare or finance, appears to cover every structure of knowledge that shapes the business domain. The expertise (in terms of nomenclature) collected by veterans in operations, development, sales and business is made available in a computable form, ready to be used for machine learning, analytics and more. One of the qualities that makes a KG useful and functional is its size: the bigger the better. ETL pipelines and BI systems are employed to extract value from diverse data sources at large scale, but the complexity siloed in legacy databases remains an open challenge for such integration. Domain-specific databases might contain thousands of tables, with complicated relationships and naming that is often impossible to understand. Unavailable domain experts and missing documentation are further factors that open the way for a new figure in organizations: the knowledge expert, a person who builds bridges between business requirements, questions, and data.
Concepts and their semantic relations are not only the result of long assemblies of experts; they are also extracted automatically from raw unstructured data. The latter is actually the norm and accounts for most of the data available, both free and proprietary. DBpedia, for example, is built by mining Wikipedia through a series of information extraction techniques, while Wikidata, which counts 55 million entities and is used by Amazon Alexa and Apple Siri, is populated by automated imports alongside community edits.
As the name Knowledge Graph suggests, the data is stored in graph structures. Graph DBs emphasize the relations between entities, and their arrangement is extremely simple. At their foundation lies one single data structure, the triple:
Subject → Relation → Object
It might be seen as a generalization of tabular data on one hand, and as a more expressive language than a key-value store on the other. While relational DBs reduce a relation to a single field holding a unique row identifier, in graph DBs such relations do not lose any of their semantics, and at the same time the data can be arranged freely without any predefined schema. While in the Industry 4.0 ontology a shared schema is very welcome, in more general-purpose KGs such as DBpedia the underlying schema is less of a prerequisite. The simple triple-based structure has something in common with software design principles, in particular composability: in other words, the ability of the data structure to include other KGs and so expand the scope of the applications running over them. The more inclusive the KG is, the better.
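To make the triple and its composability more concrete, here is a minimal sketch in plain Python. It assumes a KG can be approximated by a set of (subject, relation, object) tuples; the `Triple` class, the `match` helper and all entity names (`Robot_7`, `SafetyStandard_X`, and so on) are hypothetical and chosen only for illustration. A real system would more likely rely on an RDF store or a graph database.

```python
from typing import NamedTuple, Optional, Set


class Triple(NamedTuple):
    """One edge of the graph: Subject -> Relation -> Object."""
    subject: str
    relation: str
    obj: str


def match(graph: Set[Triple],
          subject: Optional[str] = None,
          relation: Optional[str] = None,
          obj: Optional[str] = None) -> Set[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return {
        t for t in graph
        if (subject is None or t.subject == subject)
        and (relation is None or t.relation == relation)
        and (obj is None or t.obj == obj)
    }


# Two small, made-up graphs maintained independently of each other.
factory_kg = {
    Triple("Robot_7", "locatedIn", "Hall_A"),
    Triple("Robot_7", "hasType", "WeldingRobot"),
}
standards_kg = {
    Triple("WeldingRobot", "compliesWith", "SafetyStandard_X"),
    Triple("Robot_7", "hasType", "WeldingRobot"),  # duplicate fact
}

# Composability: with no predefined schema, merging is a set union,
# and the duplicate triple collapses automatically.
merged_kg = factory_kg | standards_kg

# A two-hop question that neither graph could answer alone:
# which standards apply to Robot_7 through its type?
robot_types = {t.obj for t in match(merged_kg, subject="Robot_7", relation="hasType")}
standards = {
    t.obj for t in match(merged_kg, relation="compliesWith")
    if t.subject in robot_types
}
print(standards)
```

The point of the sketch is that merging needs no schema migration: the union of the two sets is already a valid graph, and new questions become answerable as soon as the graphs share an entity name.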
All that glitters is not gold. Maintaining a workable KG is not just about injecting information but also about executing specific routines that preserve the consistency of the dataset. Take for example the presidency of a country, which is assigned to a person for a limited time. Several people have been president for a while, but at any given moment only one is in charge, and he or she will remain president until the end of the mandate. The KG maintainer should employ restrictions, or rules, that enforce such constraints. Without going into the topic of reasoning over KGs, I’d advance the idea that the maintainer, in addition to the time-based constraints mentioned above, could infer the new president by picking the candidate who scored highest in public elections or, depending on the statute, in a closed selection in the chamber of deputies. Although maintaining time series data might seem unnatural in graphs, dynamic KGs are anything but unusual.
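The presidency constraint can be sketched as a small consistency check. This is only one possible approach, under the assumption that each triple carries a validity interval; the `TimedTriple` class, the `presidentOf` relation and the names `Alice`, `Bob`, `Carol`, `Dave` and `Country_X` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TimedTriple:
    subject: str
    relation: str
    obj: str
    start: date                 # first day of validity (inclusive)
    end: Optional[date] = None  # first day no longer valid; None = still valid


def overlaps(a: TimedTriple, b: TimedTriple) -> bool:
    """True if the half-open intervals [start, end) of a and b intersect."""
    a_end = a.end if a.end is not None else date.max
    b_end = b.end if b.end is not None else date.max
    return a.start < b_end and b.start < a_end


def violations(facts: List[TimedTriple],
               relation: str) -> List[Tuple[TimedTriple, TimedTriple]]:
    """Pairs of facts that give the same object two holders at the same time."""
    pairs = []
    for a, b in combinations(facts, 2):
        same_office = a.relation == b.relation == relation and a.obj == b.obj
        if same_office and a.subject != b.subject and overlaps(a, b):
            pairs.append((a, b))
    return pairs


def holder_on(facts: List[TimedTriple], relation: str,
              obj: str, day: date) -> Optional[str]:
    """Return who held the relation towards obj on the given day, if anyone."""
    for f in facts:
        end = f.end if f.end is not None else date.max
        if f.relation == relation and f.obj == obj and f.start <= day < end:
            return f.subject
    return None


# Made-up mandates: each one starts on the day the previous one ends.
mandates = [
    TimedTriple("Alice", "presidentOf", "Country_X", date(2010, 1, 1), date(2015, 1, 1)),
    TimedTriple("Bob", "presidentOf", "Country_X", date(2015, 1, 1), date(2020, 1, 1)),
    TimedTriple("Carol", "presidentOf", "Country_X", date(2020, 1, 1)),
]

# An inconsistent fact that a maintenance routine should reject.
bad_fact = TimedTriple("Dave", "presidentOf", "Country_X", date(2019, 6, 1), date(2021, 6, 1))

print(violations(mandates, "presidentOf"))
print(len(violations(mandates + [bad_fact], "presidentOf")))
```

A maintainer could run such a check before every insertion and refuse facts that produce a violation. Standards-based stacks typically express the same idea declaratively, for example with shape or rule languages, rather than in application code.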
A general, concise, informal, non-doctrinal definition of KGs that I would propose is:
KG = Data at scale + Integration + Concepts AND Relationships
Concepts are real-world entities of interest, and relationships are the edges between these entities.
What do KGs really know?
In this post I have given a brief overview of what a KG is, its scope and why it could be useful for information services organizations. But representing knowledge is just one part of the picture. We should care much more about how much knowledge a system actually has than about how that knowledge is represented, and it is not just about what you put into a graph. When knowledge turns into understanding it creates observable behaviour, and that behaviour is the problem-solving proficiency attributable to that system.
An interesting analogy is the music score: this is what actual knowledge representation looks like.
People might say this is music, but it is not music at all. This is a bunch of symbols, notes, pauses, concepts and relations. We know there is a process that can be applied to this representation that generates something we can hear and appreciate as music. In the following posts, I will point out how an agent might use KGs to select actions in a reasonable way, using logical deductive reasoning, answer set programming and graph neural networks.