Semantic Web Technologies Part 1 – SPARQL

ttrapp · ‎04-15-2009

The internet inventor Tim Berners-Lee had the following vision of the semantic web:

"I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web - the content, links, and transactions between people and computers. A ‘Semantic Web', which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents' people have touted for ages will finally materialize."

I can't predict the future and I won't try it to foresay when this vision will be reality but in the last years I learned the following:

If we add metadata to (web) resources we can create applications that build bridges. A good example are RSS feeds that can be evaluated by newsreaders and can be spread using twitter and so on. So there is a truth in the common saying "a little bit semantics can take you far".
People started to extract data from web resources (think of Wikipedia) and stored them in a way so that we can do queries. In fact you will learn about the DBpedia project that extracts data from Wikipedia and exposits it so we can query questions like the following: "Which rivers are longer than the Nile?"
People are starting to use semantic web technologies in a creative way and apply them to other branches - think of software engineering.

And this is exactly what my blog series is about. I will introduce common semantic web technologies and discuss applications:

How can we express knowledge in a generic way using open standards?
How can we query those data using open standards?
How can we ask questions and do reasoning about ressources?

RDF - Resource Description Framework

The web is about resources that are identified using URIs - and RDF was invented to express metadata about resources. Why do we need a special format? RSS is a good example of a vocabulary that started as RDF based format and specialized during its evolution.

This raises the question why we need a general format. RDF has some advantages: We can express facts about resources (identified using URIs) in a generic way. Perhaps there are facts that are expressed using a vocabulary we don't know but this won't stop us from querying. Let me give an example: a database contains information about different countries and their culture. We are asking for geo coordinates and have no interest in historical or cultural things and in fact don't even the used vocabulary - so we won't get those information. But if we want that kind of information then we have determine which vocabulary the database supports and learn it to ask those questions. This is nothing extraordinary: a database knows only facts and a user needs wisdom to ask the right questions - and of course some intelligence to use a query language correctly.

Some Basic Facts About Afghanistan

RDF is a data model that consists of triples: subject predicate object. There are different serializations for the RDF and first I will express some facts about Afghanistan in the so called Turtle notation:

@prefix dbpprop: <http://dbpedia.org/property/>   
<http://dbpedia.org/resource/Afghanistan>     a   http://dbpedia.org/class/yago/LandlockedCountries .
<http://dbpedia.org/resource/Afghanistan> dbpprop:populationEstimate 31889923 .

Let me explain this notation: at first the namespace dbpprop of properties of entities in DBPedia database is defined. The first expression says that Afghanistan belongs to landlocked countries and the second contains an estimations of the population. Of course there is an XML serialization which is used by internet applications, and here is a part of it:

<rdf:Description rdf:about="http://dbpedia.org/resource/Afghanistan"> 
  <rdf:type rdf:resource="http://dbpedia.org/class/yago/LandlockedCountries"/> 
</rdf:Description>

<rdf:Description rdf:about="http://dbpedia.org/resource/Afghanistan"> 
  <dbpprop:populationEstimate   rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">31889923</dbpprop:populationEstimate> 
</rdf:Description>

What are the differences between RDF data sets and XML data defined using XML Schema?

XML documents correspond to trees and XML Schemas define sets of trees. RDF has the model of a graph: resources are linked to other resources and literals using expressions (=properties).
XML Schema was invented to define having complex data types - RDF was invented as a syntax for knowledge.
In RDF there is no need for a schema agreement: RDF datasets can evolve, extend and adapted. So the RDF approach is very flexible.
Of course you can define schemas for RDF data that define semantics like classes and properties, but they can grow incrementally without impacting existing databases.

SPARQL - an RDF Query Language

Up to now we know how to express facts in RDF. These RDF data can be extracted from existing data bases (using transformation or data mining techniques) or created in special use cases (think of meta information of certain business objects like customers or contacts). You can imbed RDF in XML/XHTML documents or expose using REST web services.

Now we learn how to query RDF data. Therefore a new standard was invented: SPARQL that is a kind of SQL for RDF data.

SPARQL can operate on local as well as remote data called SPARQL endpoints. You'll find a lot of resources on the web, see f.i. http://www.cambridgesemantics.com/2008/09/sparql-by-example/#(1) which I take as example. Please open http://dbpedia.org/sparql and enter following query:

PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>

SELECT ?country_name ?population
WHERE {
            ?country a type:LandlockedCountries ;
             rdfs:label ?country_name ;
             prop:populationEstimate ?population .
FILTER (?population > 15000000 && langMatches(lang(?country_name), "EN")) .
} ORDER BY DESC(?population)

The meaning is quite obvious: find all landlocked countries with a population greater than 15 million people with the highest population country first. The SPARQL expression is quite similar to SQL: We have a SELECT with some select options (country name and population) that are variables that begin with a question mark. The WHERE condition contains three conditions about the country variable: it must be a landlocked country, the variable for the country name has got to be its name andthere must be an estimation about its population. We have an additional FILTER condition that specifies a filter about the size of the population and the language of the output and an ORDER BY clause. As a result you will see the following list of the biggest landlocked countries:

country_name	population
Ethiopia	7825490
Afghanistan	31889923
Uganda	30900000
Nepal	29519114
Uzbekistan	27372000
Kazakhstan	15217711
Ethiopia	78254090

Now change the SELECT clause and add the country variable to get the URIs of the resources found within this query:

SELECT ?country_name ?population ?country

Now the output contains the URIs of the resources found. Select the resource http://dbpedia.org/resource/Afghanistan from the result set and you will see all facts about Afghanistan stored in http://dbpedia.org and then here are the RDF facts in XML syntax: http://dbpedia.org/data/Afghanistan.rdf .

Solving Sudoku using SPARQL

In my opinion querying RDF dataset is quite cool but it gets even better. In a Twitter conversation a SPARQL expert told me that SPARQL is much more powerful and can tackle difficult problems like Sudoku and I decided to take the challenge and solve it using SPARQL. "Difficulty" doesn't mean that some 9 times 9 puzzles are hard to solve (of course lots of them are!) - it means that with increasing size of the squares there is probably no efficient solution algorithm. So a general solver has to try most of the combinations and those "explode" in exponential order.

So I got curious and installed a SPARQL engine http://joseki.sourceforge.net/ and started with a simple 2x2 Sudoku. My input RDF graph are four fields (the square) and for every field are two values allowed:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ex="http://example.org/">   
  <rdf:Description rdf:about="http://example.org/field/11"> 
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
   </rdf:Description>

   <rdf:Description rdf:about="http://example.org/field/12"> 
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
   </rdf:Description>

   <rdf:Description rdf:about="http://example.org/field/21">
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
   </rdf:Description>

   <rdf:Description rdf:about="http://example.org/field/22">
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
     <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed> 
   </rdf:Description>
</rdf:RDF>

My SPARQL query is very simple. With statement below the SPARQL engine constructs a partial graph which corresponds to a solutionof the Sudoku puzzle: an assignment of a number to each field. The WHERE condition describes the allowed values and in my FILTER I define the connection between field and value: the game rules of a 2x2 Sudoku. The last FILTER condition is about the leftmost upper field: it is a "2" (i.e. an integer in the meaning of XML Schema):

PREFIX ex:<http://example.org/>

CONSTRUCT {
  ?11 ex:allowed ?w11 .
  ?12 ex:allowed ?w12 .
  ?21 ex:allowed ?w21 .
  ?22 ex:allowed ?w22 .
          }
WHERE {  
  ?11 ex:allowed ?w11 .
  ?12 ex:allowed ?w12 .
  ?21 ex:allowed ?w21 .
  ?22 ex:allowed ?w22 .
FILTER(
        (?11 = <http://example.org/field/11> ) &&
        (?12 = <http://example.org/field/12> ) &&
        (?21 = <http://example.org/field/21> ) &&
        (?22 = <http://example.org/field/22> ) &&
        (?w11 != ?w12) && (?w21 != ?w22) && 
        (?w11 != ?w21) && (?w12 != ?w22) && 
        (?w11 = "2"^^<http://www.w3.org/2001/XMLSchema#int> ) 
      )
}

It's easy to extend this solution to 4x4 and even higher Sudoku puzzles but the running time will be terrible. So here's another approach: my RDF input graph is simpler but the complexity of the SPARQL query increases. The following graph corresponds to a single field of a Sudoku square that can have 4 possible values:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"  xmlns:ex="http://example.org/">
    <rdf:Description rdf:about="http://example.org/field">
      <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</ex:allowed>
      <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2</ex:allowed>
      <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">3</ex:allowed>
      <ex:allowed rdf:datatype="http://www.w3.org/2001/XMLSchema#int">4</ex:allowed>
   </rdf:Description>
</rdf:RDF>

The RDF query asks only for the 16 values allowed for the above defined field: The FILTER condition defines the game rules of the 4x4 Sudoku: every column and row must contain exactly the values 1 to 4 and the 4 2x2 squares, too:

PREFIX ex:<http://example.org/>

SELECT
  ?w11 ?w12 ?w13 ?w14 
  ?w21 ?w22 ?w23 ?w24 
  ?w31 ?w32 ?w33 ?w34 
  ?w41 ?w42 ?w43 ?w44

WHERE {  
  ex:field ex:allowed ?w11 .
  ex:field ex:allowed ?w12 .
  ex:field ex:allowed ?w13 .
  ex:field ex:allowed ?w14 .
  ex:field ex:allowed ?w21 .
  ex:field ex:allowed ?w22 .
  ex:field ex:allowed ?w23 .
  ex:field ex:allowed ?w24 .
  ex:field ex:allowed ?w31 .
  ex:field ex:allowed ?w32 .
  ex:field ex:allowed ?w33 .
  ex:field ex:allowed ?w34 .
  ex:field ex:allowed ?w41 .
  ex:field ex:allowed ?w42 .
  ex:field ex:allowed ?w43 .
  ex:field ex:allowed ?w44 
FILTER( 
  (?w11 != ?w12) && (?w11 != ?w13) && (?w11 != ?w14) && (?w12 != ?w13) && (?w12 != ?w14) && (?w13 != ?w14) &&
  (?w21 != ?w22) && (?w21 != ?w23) && (?w21 != ?w24) && (?w22 != ?w23) && (?w22 != ?w24) && (?w23 != ?w24) &&
  (?w31 != ?w32) && (?w31 != ?w33) && (?w31 != ?w34) && (?w32 != ?w33) && (?w32 != ?w34) && (?w33 != ?w34) &&
  (?w41 != ?w42) && (?w41 != ?w43) && (?w41 != ?w44) && (?w42 != ?w43) && (?w42 != ?w44) && (?w43 != ?w44) &&

  (?w11 != ?w21) && (?w11 != ?w31) && (?w11 != ?w41) && (?w21 != ?w31) && (?w21 != ?w41) && (?w31 != ?w41) &&
  (?w12 != ?w22) && (?w12 != ?w32) && (?w12 != ?w42) && (?w22 != ?w32) && (?w22 != ?w42) && (?w32 != ?w42) &&
  (?w13 != ?w23) && (?w13 != ?w33) && (?w13 != ?w43) && (?w32 != ?w33) && (?w23 != ?w43) && (?w33 != ?w43) &&
  (?w14 != ?w24) && (?w14 != ?w43) && (?w14 != ?w44) && (?w42 != ?w34) && (?w24 != ?w44) && (?w43 != ?w44) &&

  (?w11 != ?w22) && (?w12 != ?w21 ) &&
  (?w13 != ?w24) && (?w23 != ?w14 ) &&
  (?w31 != ?w42) && (?w32 != ?w41 ) &&
  (?w33 != ?w44) && (?w34 != ?w43 ) &&

  (?w14 = "3"^^<http://www.w3.org/2001/XMLSchema#int> ) &&
  (?w24 = "4"^^<http://www.w3.org/2001/XMLSchema#int> ) &&
  (?w13 = "2"^^<http://www.w3.org/2001/XMLSchema#int> ) &&
  (?w14 = "3"^^<http://www.w3.org/2001/XMLSchema#int> ) 
     )
}

The last 4 FILTER conditions contain values for 4 already filled fields. The solver gives out all results and one of this is:

4	1	2	3
3	2	1	4
2	4	3	1
1	3	4	2

Can We Do Better?

Yes we can. There are more sophisticated Sudoku queries in SPARQL. I give you some hints: Try to simplify the conditions. And if you are really smart you can get rid of FILTER conditions.

Summary

We can use RDF to store metadata about resources in a generic way. We can query those data using Semantic Web technologies like SPARQL. SPARQL is able to solve difficult problems but this means that SPARQL has a horrible worst case running time. SPARQL is a kind of general problem solver but we shouldn't use is that way. If we want to solve optimization or decision problems we should use specialized tools for defining problems in an appropriate way and that use more specialized solvers.

For "simple" "SQL like" queries RDF combined with SPARQL is a quite powerful technology: It is indeed an SQL to retrieve knowledge for web and non-web applications.

In the next blog entry we will learn about much more powerful tools of the Semantic Web world. With OWL we can define ontologies and perform more advanced reasoning: classification of resources and properties and defined classes using properties.