The internet inventor Tim Berners-Lee had the following vision of the semantic web:
"I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web - the content, links, and transactions between people and computers. A ‘Semantic Web', which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents' people have touted for ages will finally materialize."
I can't predict the future and I won't try it to foresay when this vision will be reality but in the last years I learned the following:
And this is exactly what my blog series is about. I will introduce common semantic web technologies and discuss applications:
The web is about resources that are identified using URIs - and RDF was invented to express metadata about resources. Why do we need a special format? RSS is a good example of a vocabulary that started as RDF based format and specialized during its evolution.
This raises the question why we need a general format. RDF has some advantages: We can express facts about resources (identified using URIs) in a generic way. Perhaps there are facts that are expressed using a vocabulary we don't know but this won't stop us from querying. Let me give an example: a database contains information about different countries and their culture. We are asking for geo coordinates and have no interest in historical or cultural things and in fact don't even the used vocabulary - so we won't get those information. But if we want that kind of information then we have determine which vocabulary the database supports and learn it to ask those questions. This is nothing extraordinary: a database knows only facts and a user needs wisdom to ask the right questions - and of course some intelligence to use a query language correctly.
RDF is a data model that consists of triples: subject predicate object. There are different serializations for the RDF and first I will express some facts about Afghanistan in the so called Turtle notation:
@prefix dbpprop: <http://dbpedia.org/property/>
<http://dbpedia.org/resource/Afghanistan> a http://dbpedia.org/class/yago/LandlockedCountries .
<http://dbpedia.org/resource/Afghanistan> dbpprop:populationEstimate 31889923 .
Let me explain this notation: at first the namespace dbpprop of properties of entities in DBPedia database is defined. The first expression says that Afghanistan belongs to landlocked countries and the second contains an estimations of the population. Of course there is an XML serialization which is used by internet applications, and here is a part of it:
<rdf:Description rdf:about="http://dbpedia.org/resource/Afghanistan">
<rdf:type rdf:resource="http://dbpedia.org/class/yago/LandlockedCountries"/>
</rdf:Description>
<rdf:Description rdf:about="http://dbpedia.org/resource/Afghanistan">
<dbpprop:populationEstimate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">31889923</dbpprop:populationEstimate>
</rdf:Description>
What are the differences between RDF data sets and XML data defined using XML Schema?
Up to now we know how to express facts in RDF. These RDF data can be extracted from existing data bases (using transformation or data mining techniques) or created in special use cases (think of meta information of certain business objects like customers or contacts). You can imbed RDF in XML/XHTML documents or expose using REST web services.
Now we learn how to query RDF data. Therefore a new standard was invented: SPARQL that is a kind of SQL for RDF data.
SPARQL can operate on local as well as remote data called SPARQL endpoints. You'll find a lot of resources on the web, see f.i. http://www.cambridgesemantics.com/2008/09/sparql-by-example/#(1) which I take as example. Please open http://dbpedia.org/sparql and enter following query:
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>
SELECT ?country_name ?population
WHERE {
?country a type:LandlockedCountries ;
rdfs:label ?country_name ;
prop:populationEstimate ?population .
FILTER (?population > 15000000 && langMatches(lang(?country_name), "EN")) .
} ORDER BY DESC(?population)
The meaning is quite obvious: find all landlocked countries with a population greater than 15 million people with the highest population country first. The SPARQL expression is quite similar to SQL: We have a SELECT with some select options (country name and population) that are variables that begin with a question mark. The WHERE condition contains three conditions about the country variable: it must be a landlocked country, the variable for the country name has got to be its name andthere must be an estimation about its population. We have an additional FILTER condition that specifies a filter about the size of the population and the language of the output and an ORDER BY clause. As a result you will see the following list of the biggest landlocked countries:
country_name | population |
Ethiopia | 7825490 |
Afghanistan | 31889923 |
Uganda | 30900000 |
Nepal | 29519114 |
Uzbekistan | 27372000 |
Kazakhstan | 15217711 |
Ethiopia | 78254090 |
Now change the SELECT clause and add the country variable to get the URIs of the resources found within this query:
SELECT ?country_name ?population ?country
Now the output contains the URIs of the resources found. Select the resource http://dbpedia.org/resource/Afghanistan from the result set and you will see all facts about Afghanistan stored in http://dbpedia.org and then here are the RDF facts in XML syntax: http://dbpedia.org/data/Afghanistan.rdf .
In my opinion querying RDF dataset is quite cool but it gets even better. In a Twitter conversation a SPARQL expert told me that SPARQL is much more powerful and can tackle difficult problems like Sudoku and I decided to take the challenge and solve it using SPARQL. "Difficulty" doesn't mean that some 9 times 9 puzzles are hard to solve (of course lots of them are!) - it means that with increasing size of the squares there is probably no efficient solution algorithm. So a general solver has to try most of the combinations and those "explode" in exponential order.
So I got curious and installed a SPARQL enginehttp://joseki.sourceforge.net/ and started with a simple 2x2 Sudoku. My input RDF graph are four fields (the square) and for every field are two values allowed:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ex="http://example.org/">
<rdf:Description rdf:about="http://example.org/field/11">
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
</rdf:Description>
<rdf:Description rdf:about="http://example.org/field/12">
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
</rdf:Description>
<rdf:Description rdf:about="http://example.org/field/21">
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
</rdf:Description>
<rdf:Description rdf:about="http://example.org/field/22">
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
</rdf:Description>
</rdf:RDF>
My SPARQL query is very simple. With statement below the SPARQL engine constructs a partial graph which corresponds to a solutionof the Sudoku puzzle: an assignment of a number to each field. The WHERE condition describes the allowed values and in my FILTER I define the connection between field and value: the game rules of a 2x2 Sudoku. The last FILTER condition is about the leftmost upper field: it is a "2" (i.e. an integer in the meaning of XML Schema):
PREFIX ex:<http://example.org/>
CONSTRUCT {
?11 ex:allowed ?w11 .
?12 ex:allowed ?w12 .
?21 ex:allowed ?w21 .
?22 ex:allowed ?w22 .
}
WHERE {
?11 ex:allowed ?w11 .
?12 ex:allowed ?w12 .
?21 ex:allowed ?w21 .
?22 ex:allowed ?w22 .
FILTER(
(?11 = <http://example.org/field/11> ) &&
(?12 = <http://example.org/field/12> ) &&
(?21 = <http://example.org/field/21> ) &&
(?22 = <http://example.org/field/22> ) &&
(?w11 != ?w12) && (?w21 != ?w22) &&
(?w11 != ?w21) && (?w12 != ?w22) &&
(?w11 = "2"^^<http://www.w3.org/2001/XMLSchema#int> )
)
}
It's easy to extend this solution to 4x4 and even higher Sudoku puzzles but the running time will be terrible. So here's another approach: my RDF input graph is simpler but the complexity of the SPARQL query increases. The following graph corresponds to a single field of a Sudoku square that can have 4 possible values:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ex="http://example.org/">
<rdf:Description rdf:about="http://example.org/field">
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">3</ex:allowed>
<ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">4</ex:allowed>
</rdf:Description>
</rdf:RDF>
The RDF query asks only for the 16 values allowed for the above defined field: The FILTER condition defines the game rules of the 4x4 Sudoku: every column and row must contain exactly the values 1 to 4 and the 4 2x2 squares, too:
PREFIX ex:<http://example.org/>
SELECT
?w11 ?w12 ?w13 ?w14
?w21 ?w22 ?w23 ?w24
?w31 ?w32 ?w33 ?w34
?w41 ?w42 ?w43 ?w44
WHERE {
ex:field ex:allowed ?w11 .
ex:field ex:allowed ?w12 .
ex:field ex:allowed ?w13 .
ex:field ex:allowed ?w14 .
ex:field ex:allowed ?w21 .
ex:field ex:allowed ?w22 .
ex:field ex:allowed ?w23 .
ex:field ex:allowed ?w24 .
ex:field ex:allowed ?w31 .
ex:field ex:allowed ?w32 .
ex:field ex:allowed ?w33 .
ex:field ex:allowed ?w34 .
ex:field ex:allowed ?w41 .
ex:field ex:allowed ?w42 .
ex:field ex:allowed ?w43 .
ex:field ex:allowed ?w44
FILTER(
(?w11 != ?w12) && (?w11 != ?w13) && (?w11 != ?w14) && (?w12 != ?w13) && (?w12 != ?w14) && (?w13 != ?w14) &&
(?w21 != ?w22) && (?w21 != ?w23) && (?w21 != ?w24) && (?w22 != ?w23) && (?w22 != ?w24) && (?w23 != ?w24) &&
(?w31 != ?w32) && (?w31 != ?w33) && (?w31 != ?w34) && (?w32 != ?w33) && (?w32 != ?w34) && (?w33 != ?w34) &&
(?w41 != ?w42) && (?w41 != ?w43) && (?w41 != ?w44) && (?w42 != ?w43) && (?w42 != ?w44) && (?w43 != ?w44) &&
(?w11 != ?w21) && (?w11 != ?w31) && (?w11 != ?w41) && (?w21 != ?w31) && (?w21 != ?w41) && (?w31 != ?w41) &&
(?w12 != ?w22) && (?w12 != ?w32) && (?w12 != ?w42) && (?w22 != ?w32) && (?w22 != ?w42) && (?w32 != ?w42) &&
(?w13 != ?w23) && (?w13 != ?w33) && (?w13 != ?w43) && (?w32 != ?w33) && (?w23 != ?w43) && (?w33 != ?w43) &&
(?w14 != ?w24) && (?w14 != ?w43) && (?w14 != ?w44) && (?w42 != ?w34) && (?w24 != ?w44) && (?w43 != ?w44) &&
(?w11 != ?w22) && (?w12 != ?w21 ) &&
(?w13 != ?w24) && (?w23 != ?w14 ) &&
(?w31 != ?w42) && (?w32 != ?w41 ) &&
(?w33 != ?w44) && (?w34 != ?w43 ) &&
(?w14 = "3"^^<http://www.w3.org/2001/XMLSchema#int> ) &&
(?w24 = "4"^^<http://www.w3.org/2001/XMLSchema#int> ) &&
(?w13 = "2"^^<http://www.w3.org/2001/XMLSchema#int> ) &&
(?w14 = "3"^^<http://www.w3.org/2001/XMLSchema#int> )
)
}
The last 4 FILTER conditions contain values for 4 already filled fields. The solver gives out all results and one of this is:
4 | 1 | 2 | 3 |
3 | 2 | 1 | 4 |
2 | 4 | 3 | 1 |
1 | 3 | 4 | 2 |
Yes we can. There are more sophisticated Sudoku queries in SPARQL. I give you some hints: Try to simplify the conditions. And if you are really smart you can get rid of FILTER conditions.
We can use RDF to store metadata about resources in a generic way. We can query those data using Semantic Web technologies like SPARQL. SPARQL is able to solve difficult problems but this means that SPARQL has a horrible worst case running time. SPARQL is a kind of general problem solver but we shouldn't use is that way. If we want to solve optimization or decision problems we should use specialized tools for defining problems in an appropriate way and that use more specialized solvers.
For "simple" "SQL like" queries RDF combined with SPARQL is a quite powerful technology: It is indeed an SQL to retrieve knowledge for web and non-web applications.
In the next blog entry we will learn about much more powerful tools of the Semantic Web world. With OWL we can define ontologies and perform more advanced reasoning: classification of resources and properties and defined classes using properties.