ThorstenHa
Advisor

Introduction


There are more ways to connect to SAP Data Intelligence than the connection types offered in the Connection Management. The Connection Management has two objectives:

  1. Storing the access details of the data source

  2. Exposing the data sources to the Metadata Explorer for analysing technical metadata, providing previews, etc.


Admittedly, the Metadata Explorer does not provide the same features for all connection types, but these are compromises within the overarching strategy: there is, for example, no preview for the HTTP connection type, yet it was added because of its importance and versatility. I will come back to this later in the blog.

Although we try to provide a connection type in the Connection Management for the most important data sources, there will always be kinds that are not supported. Despite this, SAP Data Intelligence provides all the functionality needed to achieve the essential targets of

  1. Storing the access details

  2. Connecting to the data source with Modeler

  3. Managing the metadata of this data source


This blog is all about proving this statement with an example I recently came across: connecting to the semi-structured database CouchDB.

Spoiler of the resulting pipeline, intended to be picked up automatically when this blog is referred to in social media:


 

I am planning a separate blog about managing the metadata of data sources that are not connected via the Connection Management.

Use Case


A company has hundreds of rather independent outlets, each with access control. The outlets have no uniform device for capturing the access data, therefore it was decided to use a CouchDB for storing the data. In addition, CouchDB has a small implementation footprint, it is robust, and the data can easily be replicated into a central repository. For analysing the data in a data warehouse and displaying it in a central cockpit, an SQL database like HANA is indispensable. Therefore they would like to extract the essential data into a HANA table.

For this we need to create a generic custom operator that reads from a CouchDB and passes the content to another custom operator that prepares the data specifically for the target table.

Semi-structured Database CouchDB


CouchDB is, like MongoDB, a document-oriented NoSQL database for storing semi-structured data. You just create a database based on the context, without defining any structure up front. You then upload data together with some information that later lets you find the data again. In CouchDB you store data in JSON format with an optional attachment. When you store a JSON document you either provide a key or use a generated one. The document itself is stored as the value without any further restrictions on its format. Later you can create views with different key-value pairs based on the stored documents; this is done with a JavaScript code snippet. For retrieving documents you can either use the keys you created with a view or explicitly search for values in data fields. You interact with CouchDB through simple REST APIs.

This is basically all you need to know, and the learning curve is short once you get rid of your SQL bias.
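
To make this concrete, here is a minimal sketch of the typical REST calls using Python's requests library. Host, credentials, database and view names are placeholders, and the view is assumed to already exist in a design document.

import json
import requests

COUCH = "http://localhost:5984"      # placeholder CouchDB host
AUTH = ("admin", "secret")           # placeholder credentials
DB = "access_events"                 # placeholder database name

# Create the database (one-time setup)
requests.put(f"{COUCH}/{DB}", auth=AUTH)

# Store a JSON document; CouchDB generates the key (_id) when you POST
doc = {"outlet": "OUT-042", "badge": "4711", "ts": "2022-03-01T08:15:00"}
requests.post(f"{COUCH}/{DB}", auth=AUTH, json=doc)

# Query a view (here assumed to live in design document "access" as view "by_outlet");
# the key parameters must be JSON-encoded, hence json.dumps
params = {"startkey": json.dumps("OUT-000"), "endkey": json.dumps("OUT-999")}
r = requests.get(f"{COUCH}/{DB}/_design/access/_view/by_outlet", auth=AUTH, params=params)
print(r.json()["rows"])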

Connection Type: HTTP


For storing the connection details I am using the HTTP connection type. It provides all the fields that are basically needed (see the sketch after the list for how they appear in the operator):

  • URL to the CouchDB

  • Name of the Database

  • User Credentials
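
For orientation: based on the reader script later in this blog, these fields surface in the operator as the connectionProperties of the HTTP connection, roughly like this (the api object is provided by the SAP Data Intelligence runtime):

# Mapping of the HTTP connection fields as used by the reader script below
couchdb = api.config.couchdb['connectionProperties']
host = couchdb['host']        # URL to the CouchDB
db   = couchdb['path']        # name of the database, stored in the connection's path
user = couchdb['user']        # user credentials
pwd  = couchdb['password']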



Connection Management



Custom Operators


I recommend separating the retrieval of data from a CouchDB into two processing steps:

  1. A generic CouchDB query operator that sends the unprocessed data to its outport

  2. A specific CouchDB processing operator with a view-specific outport


With this approach you have to build a specific operator for each "source view" to "target table" pipeline. When using a kind of template, this takes hardly any effort.

This is only true when using Generation 2 operators with their strict type requirements. When using Generation 1 operators you could leave the final validation to the database. I have not checked this option because of the following advantages of the Generation 2 approach:

  1. In case of type errors the information you get from SAP Data Intelligence is more elaborate

  2. Fixing an error might only impact the pipeline of the affected database view

  3. When replicating large amounts of data into an SQL database the option of resiliency of generation 2 operators is a great asset.


For the sake of focus I have not implemented resiliency in the following operators, but with the help of my blog about the development of Generation 2 operators it should not be that difficult. If you need additional hints, let me know in a comment.

CouchDB Reader


My proposal for a basic CouchDB reader is to get the access information from the Connection Management (connection type "HTTP") and to add the CouchDB view, startkey and endkey as configuration parameters.




This operator is designed as a "generator" operator without an inport. For that I use the function
api.set_prestart(gen).

If you would like to use this operator in a continuously running pipeline that queries the CouchDB periodically, you might rather opt for the function
api.add_timer(gen),

and add an additional configuration parameter that specifies the frequency of the call.

The outport type is just of the vtype
scalar - com.sap.core.string,

i.e. the unprocessed response of the REST API call.

The "couchDB" configuration parameter refers to the standard configuration type that you cannot define in the "Operator" creation wizard. For this you need to go to the repository tab and open all the folders until the one that defines your operator and open the "configSchema.json". The path in my case:

subengines > com > sap > python3 (for gen2 operators with python 3.9) >  operators > couchdb (my package name) > get_data (custom operator CouchDB Reader)


After having opened the configSchema.json file you need to replace the value of ["properties"]["couchdb"] with the following code:
        "couchdb": {
"title": "HTTP Connection",
"description": "HTTP Connection",
"type": "object",
"properties": {
"connectionID": {
"title": "ConnectionID",
"type": "string",
"format": "com.sap.dh.connection.id",
"sap_vflow_valuehelp": {
"url": "/app/datahub-app-connection/connections?connectionTypes=HTTP",
"valuepath": "id",
"displayStyle": "autocomplete"
}
}
}
},

This was the most challenging part of the whole project.

Because I always like to pass header information along with the data, I also define a vtype for the header:


Finally, all the preparations are used in the script:
from urllib.parse import urljoin
import requests

def gen():

    # Get access data from the HTTP connection
    host = api.config.couchdb['connectionProperties']['host']
    user = api.config.couchdb['connectionProperties']['user']
    pwd = api.config.couchdb['connectionProperties']['password']
    db = api.config.couchdb['connectionProperties']['path']
    view = api.config.view

    # Create the final URL from host, db and view
    url = urljoin(host, db) + '/'
    url = urljoin(url, view)

    # Prepare the startkey and endkey parameters
    selector = {'startkey': api.config.startkey, 'endkey': api.config.endkey}
    selector = {k: "\"" + str(v) + "\"" for k, v in selector.items()}

    # Call the REST API
    r = requests.get(url, auth=(user, pwd), params=selector)

    # Send result to outport
    header = api.Record([host, db, view, api.config.startkey, api.config.endkey, 0, 0])
    header = {"diadmin.couchdb.header": header}
    api.outputs.json.publish(r.content, header=header)


api.set_prestart(gen)

As you can see, the script is pretty straightforward. The only tricky part was that the values of the selector need to be explicitly wrapped in double quotes.
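
The background is that CouchDB expects the key parameters to be JSON-encoded, so a plain string has to arrive wrapped in double quotes; json.dumps would achieve the same as the manual quoting. A small sketch to illustrate this, with placeholder key values and a placeholder view URL:

import json
import requests

startkey, endkey = "OUT-000", "OUT-999"     # placeholder key values

# What the script builds manually ...
selector = {'startkey': '"' + startkey + '"', 'endkey': '"' + endkey + '"'}

# ... is equivalent to JSON-encoding the values:
selector_alt = {k: json.dumps(v) for k, v in {'startkey': startkey, 'endkey': endkey}.items()}
assert selector == selector_alt

# requests then encodes the quotes, e.g. ...?startkey=%22OUT-000%22&endkey=%22OUT-999%22
req = requests.Request('GET', 'http://localhost:5984/db/_design/d/_view/v', params=selector).prepare()
print(req.url)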

You can download the development objects from my personal GitHub diadmin:

  1. CouchDB Reader

  2. vtype structures.diadmin.couchdb


 

JSON To Table Transformer


The second custom operator basically converts the JSON string into a flat DataFrame structure before providing the data in the table structure of the outport.

This operator is designed semi-generically, which means you do not have to change the script but only the vtype of the outport. Everything else runs independently of the processed data. The structure validation and mapping is done by reading the outport vtype definition.

The assumption is that a CouchDB view has been created that provides a flat structure of the data, with key values corresponding to the columns of the output table. If this is not the case, the highly specific logic has to be built into this custom operator.
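
For illustration, here is a sketch of the shape the operator expects from such a view and of the flattening the script below performs; the field names are made up and would correspond to the columns of the outport vtype:

import pandas as pd

# Hypothetical result of a CouchDB view that already emits flat records as its value
query_result = {
    "total_rows": 2,
    "offset": 0,
    "rows": [
        {"id": "a1", "key": "OUT-042", "value": {"outlet": "OUT-042", "badge": "4711", "ts": "2022-03-01T08:15:00"}},
        {"id": "a2", "key": "OUT-042", "value": {"outlet": "OUT-042", "badge": "4712", "ts": None}},
    ],
}

# The same flattening as in the operator: take the 'value' of each row
records = [row["value"] for row in query_result["rows"]]
df = pd.DataFrame(records)
print(df)    # -> two rows with the columns outlet, badge, ts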

This operator has no configuration parameters.

The ports:

  • inport: JSON of type string

  • outport: table of specific vtype


import json
import pandas as pd

def on_input(msg_id, header, body):

    query_result = json.loads(body.get())

    # Extract the values from the couchdb records
    records = [v['value'] for v in query_result['rows']]

    # Create a DataFrame
    df = pd.DataFrame(records)

    # Fill empty values with '' or 0
    for c in df.select_dtypes(include=['float64', 'int64']):
        df[c].fillna(0, inplace=True)
    for c in df.select_dtypes(include=['object']):
        df[c].fillna('', inplace=True)

    # Ensure data sequence is the same as required by outport
    outport_vtype_ref = api.DataTypeReference(api.outputs.table.type_ref.type_, api.outputs.table.type_ref.type_id)
    outport_info = api.type_context.get_vtype(outport_vtype_ref)
    outport_col_names = list(outport_info.columns.keys())
    table_list = df[outport_col_names].values.tolist()

    # Send data to outport
    api.outputs.table.publish(api.Table(table_list), header=header)

api.set_port_callback('JSON', on_input)

 

The operator can be downloaded:

 

 

Pipeline


Finally, you can create a pipeline that reads the data from a CouchDB and stores it into a HANA DB using the standard operator "Table Producer" or the operator "Write HANA Table".



Conclusion


There is no real restriction on connecting data sources, and in particular it is not the list of connection types in the Connection Management. As long as there is an API, you can connect and process the data. As you have seen, with little effort and code you can write a pipeline that is as generic as it can be for a semi-structured database. This small project was done for a presentation, not as a PoC.