Over the past couple of years I have become convinced of what researchers have been telling us for quite a long time: unstructured data, or, simply put, plain text in natural language, is on the rise. Some analysts go as far as predicting that the volume of unstructured data available to companies will exceed that of 'traditional' structured data, which we read from database tables and are accustomed to building intelligence on. The analytical potential of unstructured data is well discussed (to mention a couple of uses, consumer sentiment analysis and entity extraction), so why don't we talk about some practicalities of text data processing, and how SAP products could be of use?
I would like to start this series of blogs with a discussion of how unstructured data from social media can be loaded into a data warehouse.
Have a glance at this piece of JSON:
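(The original snippet describes a plant; a minimal JSON document in that spirit might look like this:)

```json
{
  "plant": {
    "colour": "green",
    "height": 50
  }
}
```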
As you might guess, it describes a green plant 50 cm in height.
Just for comparison: in XML, a well-known standard for data interchange, one possible way to express the same would be:
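(Again, one plausible rendering of the same plant record:)

```xml
<plant>
  <colour>green</colour>
  <height>50</height>
</plant>
```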
On large data volumes the JSON representation is more lightweight than XML, mainly because it has no closing tags. JSON has become widely used in web development ("Ajax programming" is your keyword for further reading), which brings us to the topic of those famous social web applications, including Twitter, Facebook, FriendFeed, or RSS feeds collected by Google Reader.
So this is why JSON looks like a good candidate medium for extracting unstructured data into a data warehouse: it is supported by the source web applications, and it can be converted relatively easily into XML, which is understood by major ETL tools, in our case SAP BusinessObjects Data Services.
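To make the JSON-to-XML step concrete, here is a minimal sketch in Python of such a conversion. This is not the adapter's actual code, just an illustration of the principle; the way arrays are mapped to repeated elements is my own assumption.

```python
import json
import xml.etree.ElementTree as ET

def json_to_xml(value, tag):
    """Recursively convert a parsed JSON value into an XML element named `tag`."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, list):
                # Assumption: a JSON array becomes one repeated element per item
                for item in child:
                    elem.append(json_to_xml(item, key))
            else:
                elem.append(json_to_xml(child, key))
    else:
        # Scalars (string, number, boolean, null) become element text
        elem.text = "" if value is None else str(value)
    return elem

doc = json.loads('{"plant": {"colour": "green", "height": 50}}')
root = json_to_xml(doc["plant"], "plant")
print(ET.tostring(root, encoding="unicode"))
# prints <plant><colour>green</colour><height>50</height></plant>
```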
Even out of the box, Data Services can integrate with a wide range of databases and data formats from various vendors, and that range can be extended by using plugins called Adapters. SAP provides a Java Software Development Kit to create such Adapters. A very helpful introduction to the Data Services Adapter SDK can be found on the SDN website; sample code is also provided as part of the product installation.
With that SDK I have developed a JSONAdapter that obtains data via HTTP in JSON format, converts it to XML and passes it to Data Services. Below I describe the steps to configure such an interface with Twitter, assuming the JSONAdapter has been installed and its Datastore has been created in the Local Repository.
Twitter provides a search interface, where search parameters are included in a URL. For example, the URL “http://search.twitter.com/search.json?q=intelligence&rpp=50” makes Twitter return a JSON-formatted result with the 50 most recent tweets containing the word “intelligence”.
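In abbreviated form, the search response is itself a JSON document along these lines (field names as I recall them from Twitter's search API; the actual response carries more fields, and the placeholder values are mine):

```json
{
  "query": "intelligence",
  "results_per_page": 50,
  "results": [
    {
      "created_at": "(timestamp)",
      "from_user": "(user name)",
      "text": "(tweet text)"
    }
  ]
}
```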
The first step in the ETL setup is to describe the data structure (metadata) in XML DTD form, which Data Services understands. Unfortunately, it is not always possible to obtain a JSON Schema, used to describe the structure of a JSON document; nor is it actually standardised yet. That is why the Adapter makes an attempt to derive metadata from sample data. While the sample may be incomplete, and there may be several ways to describe the same XML data with a DTD, it is still a good start.
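To illustrate, for the small plant document mentioned earlier, a derived DTD might read as follows (a hand-written example, not the Adapter's actual output):

```dtd
<!ELEMENT plant (colour, height)>
<!ELEMENT colour (#PCDATA)>
<!ELEMENT height (#PCDATA)>
```

If the sample happened to omit the height, one would adjust the first declaration to `(colour, height?)` to mark the element as optional.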
So, let’s use the JSONAdapter’s “Import By Name” functionality for metadata import.
…which leads to:
While creating a Function Call that can be used straight away, the import process also (optionally) generates a DTD file. That file can be reviewed and adjusted, for example to mark some XML elements as optional rather than required. The “Import By Name” functionality should then be used again, but this time to import metadata from the adjusted DTD file, not a sample URL.
With or without that second import, the final result is a Function returning data in the Nested Relational Data Model (NRDM), Data Services’ internal representation of XML data (the screenshot below is partial):
In the new Function, the input parameters are always URL and EXTRA_PARAM; together they form the URL submitted to the web application, but at development time the split provides more clarity about the constant and variable parts of the request.
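For the Twitter search above, one way to divide the request (my own illustration of the split) would be:

```
URL         = http://search.twitter.com/search.json
EXTRA_PARAM = q=intelligence&rpp=50
```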
The output is a nested schema that can be processed using standard Data Services tools. Let’s have a closer look at the Data Flow design for that.
A Function cannot be placed into a Data Flow by itself; it has to be inserted into a Query transform, which, in turn, should be preceded by a Row_Generation transform to provide an input. Here, the Row_Generation transform generates exactly one row, and the Query transform Twitter_Search calls the JSONAdapter-based Function:
Greater flexibility can be added to the Function call’s input parameters by using variables:
These variables may be global variables or the data flow’s input parameters, which is how Function calls can be parameterised from outside of Data Services.
The NRDM data received by Data Services can then be unnested and stored in table format. Voilà!
From here, the social media data can be used and reused for analytical processing as part of the data warehouse.
In the next blogs I will discuss data acquisition with the JSONAdapter from Facebook and RSS feeds, and, using the collected tweets as a model, discuss text data processing options in Data Services.