
Over the past couple of years I have become convinced of what researchers have been telling us for quite a long time: unstructured data, or, simply put, plain text in natural language, is on the rise. Some analysts go as far as predicting that the volume of unstructured data available to companies will exceed that of ‘traditional’ structured data, the kind we read from database tables and are accustomed to building intelligence on. The analytical potential of unstructured data is well discussed (consumer sentiment analysis and entity extraction, to mention a couple of uses), so why don’t we talk about some practicalities of text data processing, and how SAP products can be of use.

I would like to start this series of blogs with a discussion of how unstructured data from social media can be loaded into a data warehouse.

This is where JSON comes in: an open, text-based data interchange format. The acronym stands for JavaScript Object Notation; the name points to the roots of the format, but the standard is actually used well beyond JavaScript, with implementations of JSON for various platforms referenced at its homepage, http://json.org.

Have a glance at this JSON code:

{
   "Plant": {
      "Colour": "green",
      "Height": {
         "Measure": "50",
         "Units": "cm"
      }
   }
}

As you might guess, it describes a green plant, 50 cm in height.

Just for comparison: in XML, a well-known standard for data interchange, one possible way to express the same would be:

<Plant>
     <Colour>green</Colour>
     <Height>
         <Measure>50</Measure>
         <Units>cm</Units>
     </Height>
</Plant>

On large data volumes the JSON representation is more lightweight than XML, mainly because there are no closing tags. JSON became widely used in web development (“Ajax programming” is your keyword for further reading), which brings us to the topic of those famous social web applications, including Twitter, Facebook, FriendFeed, and RSS feeds collected by Google Reader.

So this is why JSON looks like a good candidate to be the medium for extracting unstructured data into a data warehouse: it is supported by the source web applications, and it can be converted relatively easily into XML, which is understood by major ETL tools; in our case, SAP BusinessObjects Data Services.

Even out of the box, Data Services can integrate with a wide range of databases and data formats from various vendors, and that range can be extended with plugins called Adapters. SAP provides a Java Software Development Kit (SDK) to create such Adapters. A very helpful introduction to the Data Services Adapter SDK can be found on the SDN website; sample code is also provided as part of the product installation.

With that SDK I have developed a JSONAdapter that obtains data in JSON format via HTTP, converts it to XML, and passes it to Data Services. Below I will describe the steps to configure such an interface with Twitter, assuming the JSONAdapter has been installed and its Datastore has been created in the Local Repository.
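The adapter internals are beyond the scope of this blog, but to make its core job concrete, here is a minimal sketch of the fetch-and-convert step: request JSON over HTTP and render it as XML. This is not the adapter’s actual code; it leans on the org.json reference library (its XML.toString helper) purely for illustration, using the Twitter search URL introduced in the next paragraph, whereas the real adapter is built with the Data Services Adapter SDK.

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;
import org.json.JSONObject;
import org.json.XML;

public class JsonToXmlSketch {
    public static void main(String[] args) throws Exception {
        // The Twitter search request discussed below
        String url = "http://search.twitter.com/search.json?q=intelligence&rpp=50";

        // Read the whole HTTP response into a string
        InputStream in = new URL(url).openStream();
        String json = new Scanner(in, "UTF-8").useDelimiter("\\A").next();

        // Parse the JSON and render it as XML, wrapped in a root element
        // so the result is a well-formed document
        JSONObject parsed = new JSONObject(json);
        System.out.println("<results>" + XML.toString(parsed) + "</results>");
    }
}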

Twitter provides a search interface where search parameters are included in the URL. For example, the URL “http://search.twitter.com/search.json?q=intelligence&rpp=50” would make Twitter return a JSON-formatted result with the 50 most recent tweets containing the word “intelligence”.
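For orientation, the body of that response looks roughly like the following. This is an abridged, illustrative sketch of the v1 Search API output: the live response carries many more fields, and real values in place of the ellipses.

{
   "query": "intelligence",
   "results": [
      {
         "created_at": "…",
         "from_user": "…",
         "text": "…"
      }
   ]
}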

The first step in the ETL setup is to describe the data structure (metadata) in XML DTD form, one that Data Services understands. Unfortunately, it is not always possible to obtain a JSON Schema, used to describe the structure of a JSON document; nor is it actually standardized yet. That is why the Adapter makes an attempt to derive the metadata from sample data. While the sample may not be comprehensive, and there may be several ways to describe the same XML data in DTD, it is still a good start.
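To make the DTD idea concrete: for the small Plant document shown earlier, a derived description could look like this (hand-written here; the file the Adapter derives from real sample data will be larger and may structure things differently):

<!ELEMENT Plant (Colour, Height)>
<!ELEMENT Colour (#PCDATA)>
<!ELEMENT Height (Measure, Units)>
<!ELEMENT Measure (#PCDATA)>
<!ELEMENT Units (#PCDATA)>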

So, let’s use the JSONAdapter’s “Import By Name” functionality for the metadata import.

[Screenshot 1-1: choosing “Import By Name” on the JSONAdapter Datastore]

…which leads to:

[Screenshot 1-2: the “Import By Name” dialog, taking a sample URL]

While actually creating a Function Call that may be used straight away, the import process also (optionally) generates a DTD file. That file may be reviewed and adjusted, for example, to denote some XML elements as optional rather than required (see the one-line sketch below). The “Import By Name” functionality should then be used again, but this time to import metadata from the adjusted DTD file rather than from a sample URL.
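Sticking with the Plant example, marking an element as optional is a one-character adjustment in the DTD, using the ? occurrence indicator:

<!ELEMENT Plant (Colour?, Height)>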

With or without that second import, the final result is a Function returning data in the Nested Relational Data Model (NRDM), Data Services’ internal representation of XML data (the screenshot below is partial):

[Screenshot 1-3: the imported Function and its nested (NRDM) output schema, partial view]

In the new Function, the input parameters are always URL and EXTRA_PARAM; together they form the URL submitted to the web application, but at development time this split provides more clarity about the constant and variable parts of the request.
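For the Twitter search above, a natural split (illustrative; I am assuming the two parts are simply concatenated into the final request, as the description suggests) is to keep the stable endpoint in URL and the changing query string in EXTRA_PARAM:

URL         = http://search.twitter.com/search.json
EXTRA_PARAM = ?q=intelligence&rpp=50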

The Output is a nested schema that may be processed using standard Data Services tools. Let’s have a closer look at the Data Flow design for that.

[Screenshot 1-4: the Data Flow design]

A Function cannot be placed into a Data Flow by itself; rather, it should be inserted into a Query transform, which, in turn, should be preceded by a Row_Generation transform to provide an input. Here, the Row_Generation transform generates exactly one row, and the Query transform Twitter_Search calls the JSONAdapter-based Function:

[Screenshot 1-5: Row_Generation feeding the Twitter_Search Query transform, which calls the Function]

Greater flexibility may be added to the Function call’s input parameters using variables:

[Screenshot 1-6: Function call input parameters supplied via variables]

These variables may be globals or the data flow’s input parameters; that is how Function calls can be parameterised from outside of Data Services.
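As a minimal sketch, such a parameterised call could be fed from a Script object placed before the Data Flow; the syntax is Data Services script, but the global variable names ($G_QUERY, $G_EXTRA_PARAM) are hypothetical:

# Script object executed before the Data Flow:
# build the variable part of the request from a search term
$G_QUERY = 'intelligence';
$G_EXTRA_PARAM = '?q=' || $G_QUERY || '&rpp=50';

The Query transform would then pass $G_EXTRA_PARAM into the Function’s EXTRA_PARAM input.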

The NRDM data received by Data Services may then be unnested and stored in table format, voilà!

[Screenshot 1-7: the unnested data stored in table format]

From here, the social media data can be used and reused for analytical processing as part of the data warehouse.

In the next blogs I will discuss data acquisition by the JSONAdapter from Facebook and RSS feeds and, using the collected tweets as a model, discuss text data processing options in Data Services.



15 Comments


  1. varada santosh

    Hi

    Thanks for the blog, I was looking for exactly this. I saw your blog on Innogence, but the site changed and I could not find it there. I am also starting to develop an Adapter; can you guide me and share some links on the topic? As of now I am going through the link below, but I am not finding much help there. If you have any guidelines, I would appreciate your help.

    http://wiki.sdn.sap.com/wiki/display/EIM/Adapter+SDK+Tutorial

    Thanks

    Santosh Varada

  2. Durairaju A

    Hi Roman,

    Good one.

    Also, I want to know how to create a JSON adapter, since I am working on a sentiment analysis project.

    Please share any documents on JSON adapter creation at ardurairaj@gmail.com.

    Looking forward to your response.

    Thanks

    Durai

      1. Durairaju A

        Hi Thomas,

        Thanks a lot for your valuable response.

        Will the blueprint you are talking about help us extract data from Twitter without a custom adapter?

        Looking forward to your response.

        Regards,

        Durai

        1. Justin Martinson

          Durai,

          Yes, there is a new Twitter job available in the DS 4.1 SP1 Text Data Processing English blueprint package, which is available from http://scn.sap.com/docs/DOC-8820. You can find information about installing and configuring this job in the TDP_Lang.pdf available on the same page.

          This job uses a User-Defined Transform in Python to execute a search against Twitter for terms or hashtags you define and bring back tweets in one or more languages. A companion job then processes each tweet using the TDP Entity Extraction transform configured for sentiment analysis (English by default, although you can configure it for French, German, or Spanish). Once the sentiments are extracted from the tweets, they are stored in a database table on which you can perform further analytics using BI tools such as Visual Intelligence.

          Thanks.

          Justin-

          1. ashutosh dixit

            Hi Justin,

            Can you please help me with this? I am trying to do the same but have had no luck.

            If you have a step-by-step guide for this, please share it.

            Thanks

            Ashutosh

  3. Raja Mahesh

    Hi Roman,

    Thank you very much. That’s great information to start with. Could you please provide some more information on how to create a JSON Adapter with the help of the SDK?

    We are stuck on creating such an adapter, one that can obtain data via HTTP in JSON format and convert it into XML. Please provide a mode of contact, or send a mail to maheshm@intellicompute.com.

    Awaiting your response.

    Thanks & Regards

    Raj Mahesh

  4. Amala Varkey

    I am in great need of help to find out how I can use SAP BODS to process JSON web services.

    Can anyone kindly let me know the process of configuring the JSON Adapter and reading the file?

    What alternatives do we have for processing JSON web services besides writing an adapter ourselves, since SAP currently does not ship a JSON adapter? I’m using Data Services 4.1; unfortunately I can’t use 4.2, since the client systems are on 4.1.

    TIA

  5. Anupam Choudhary

    I have a PostgreSQL source system in which data is stored in a JSON data type; after importing that table, BODS shows it as varchar.

    How can the JSONAdapter be used to read that data type?

    Please help.

    Thank you

    Anupam

