Import External Datasets to SAP Data Intelligence Catalog
Introduction
One of the hidden secrets of SAP Data Intelligence is that you can manage datasets of external data sources (i.e. sources not connected via Connection Management) with the SAP Data Intelligence Catalog.
I have already discussed the openness of the Catalog in a previous blog, Open Catalog of SAP Data Intelligence, but did not cover this aspect there. In my most recent blog about connecting to external data sources, Connecting Semi-structured Data Sources to SAP Data Intelligence, I announced this blog because it complements the processing-focused connection: before you access the data, you need to know that the dataset exists, together with some basic metadata. This blog is about how to achieve this.
There are two steps we need to take:
- Create a connection with a specific connection type in Connection Management. This is essential because in SAP Data Intelligence datasets are organised by Connection ID.
- Upload the metadata information to the Catalog.
Then you can manage the external datasets as usual, within limits, by adding tags or linking glossary terms.
As a use case, I take up the example from my previous blog of integrating a CouchDB data source.
Create “METADATA_IMPORT” Connection
Before you are able to upload dataset information, you need to create a connection of connection type METADATA_IMPORT. You cannot add this connection using Connection Management; the only way to create it is via the metadata API.
For adding the connection you use the REST API /catalog/connections described on SAP API Business Hub, passing JSON-formatted data. For this you can either use a tool like Postman or my command-line script. For the latter you need a Python installation on your computer and the package diadmin:
pip install "diadmin>=0.0.72"
and run the script:
dicatalog connections <json-file> --upload
You need a config file config.yaml with the credentials and the URL of your DI system:
PWD: pwd123
TENANT: default
URL: https://vsystem.ingress.xxx.shoot.live.k8s-hana.ondemand.com
USER: user
and a folder "connections" in your working directory containing a JSON file with your connection details, e.g.:
{
  "id": "COUCHDB_IMPORT",
  "description": "CouchDB Metadata Import",
  "type": "METADATA_IMPORT",
  "contentData": {
    "type": "METADATA_IMPORT"
  },
  "tags": [
    "import"
  ],
  "gatewayId": "",
  "cloudConnectorLocationId": "",
  "licenseRelevant": false,
  "readOnly": true
}
This file contains the very basic data you need for creating a new connection. It is important to include the element "contentData"/"type": "METADATA_IMPORT", although it seems redundant.
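If you prefer to call the REST API directly instead of using Postman or the script, the following Python sketch posts exactly this JSON file. Note that the API prefix /app/datahub-app-metadata/api/v1 and the "tenant\user" basic-authentication format are assumptions based on common SAP Data Intelligence conventions, and the file name is illustrative; check SAP API Business Hub for the authoritative endpoint definition.

# Sketch: create the METADATA_IMPORT connection via the REST API.
# ASSUMPTIONS: the API prefix and the "tenant\user" basic-auth format;
# verify both against the SAP API Business Hub documentation.
import json
import requests
import yaml

with open("config.yaml") as f:                      # credentials file from above
    cfg = yaml.safe_load(f)

url = cfg["URL"] + "/app/datahub-app-metadata/api/v1/catalog/connections"
auth = (cfg["TENANT"] + "\\" + cfg["USER"], cfg["PWD"])

with open("connections/couchdb_import.json") as f:  # hypothetical file name
    connection = json.load(f)

resp = requests.post(url, json=connection, auth=auth)
print(resp.status_code, resp.text)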
In case you want to delete a hidden connection, call:
dicatalog connections <connection_id> --remove
To verify that you have added the connection successfully, you can download all connections:
dicatalog connections all --download
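Alternatively, you can query the API directly to check that the connection is there (same assumed API prefix and auth format as in the sketch above):

# Sketch: list the catalog connections to verify the new connection exists.
import requests
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

url = cfg["URL"] + "/app/datahub-app-metadata/api/v1/catalog/connections"
auth = (cfg["TENANT"] + "\\" + cfg["USER"], cfg["PWD"])

resp = requests.get(url, auth=auth)
print(resp.status_code)
print(resp.text)   # look for "COUCHDB_IMPORT" in the returned list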
If you want to see the details of how I implemented the script, you can have a look at my personal public GitHub repository: thhapke/diadmin.
After this preparation we can start uploading the actual dataset information.
Upload Datasets
Again we call a REST API to upload data to the SAP Data Intelligence Catalog: /catalog/datasets. As a response you get a status code and a task ID. If the status code is 202, you might have to wait a couple of seconds until the job is completed. You can either send another request to /catalog/importTasks/{taskid} to check whether the job was successful, or check in SAP Data Intelligence "Browse Catalog" whether the datasets appear under the connection created in the first step.
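If you want to call this API directly, here is a minimal Python sketch of the upload-and-poll flow (same assumed API prefix and authentication format as in the earlier sketch; the name of the response field carrying the task ID is an assumption, so inspect the actual response):

# Sketch: upload dataset metadata and poll the import task.
import json
import time
import requests
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

base = cfg["URL"] + "/app/datahub-app-metadata/api/v1"
auth = (cfg["TENANT"] + "\\" + cfg["USER"], cfg["PWD"])

with open("metadata_datasets/couchdb_datasets.json") as f:  # hypothetical file name
    datasets = json.load(f)

resp = requests.post(base + "/catalog/datasets", json=datasets, auth=auth)
print(resp.status_code)

if resp.status_code == 202:
    task_id = resp.json()["taskId"]   # field name assumed; inspect resp.json()
    time.sleep(5)                     # give the import job a moment to finish
    status = requests.get(base + "/catalog/importTasks/" + task_id, auth=auth)
    print(status.status_code, status.text)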
Again you can use my script:
dicatalog datasets <json-file> --upload
In this case you need to create a folder "metadata_datasets" in your working directory with a JSON file that contains the metadata information of the datasets. In my example it is:
[
  {
    "remoteObjectReference": {
      "qualifiedName": "/visits",
      "remoteObjectType": "VIEW",
      "connection": {
        "id": "COUCHDB_IMPORT",
        "type": "METADATA_IMPORT"
      },
      "name": "visits",
      "size": 2345789
    },
    "schema": {
      "genericType": "TABLE",
      "tableBasedRepresentation": {
        "attributes": [
          {
            "name": "KEY",
            "datatype": "STRING",
            "templateType": "string",
            "length": 50,
            "labelAttribute": "String"
          },
          {
            "name": "LOCATIONID",
            "datatype": "STRING",
            "templateType": "string",
            "length": 30,
            "labelAttribute": "String"
          },
          {
            "name": "VISITID",
            "datatype": "STRING",
            "templateType": "string",
            "length": 30,
            "labelAttribute": "string"
          },
          {
            "name": "TIMESTAMP",
            "datatype": "DATETIME",
            "templateType": "timestamp",
            "labelAttribute": "DateTime"
          },
          {
            "name": "YEAROFBIRTH",
            "datatype": "INTEGER",
            "templateType": "int16",
            "labelAttribute": "Integer"
          }
        ],
        "uniqueKeys": [
          {
            "attributeReferences": [
              "KEY"
            ]
          }
        ]
      }
    },
    "descriptions": [
      {
        "value": "CouchDB Visits",
        "type": "SHORT",
        "origin": "REMOTE"
      }
    ]
  },
  {
    "remoteObjectReference": {
      "qualifiedName": "/booking",
      "remoteObjectType": "VIEW",
      "size": 73923,
      "connection": {
        "id": "COUCHDB_IMPORT",
        "type": "METADATA_IMPORT"
      }
    },
    "schema": {
      "genericType": "TABLE",
      "tableBasedRepresentation": {
        "attributes": [
          {
            "qualifiedName": "timestamp",
            "name": "timestamp",
            "datatype": "DATETIME",
            "templateType": "timestamp"
          },
          {
            "qualifiedName": "location_id",
            "name": "location_id",
            "datatype": "STRING",
            "templateType": "string"
          },
          {
            "qualifiedName": "name",
            "name": "name",
            "datatype": "STRING",
            "templateType": "string"
          },
          {
            "qualifiedName": "booking_time",
            "name": "booking_time",
            "datatype": "DATETIME",
            "templateType": "timestamp"
          }
        ]
      }
    },
    "descriptions": [
      {
        "value": "CouchDB Booking",
        "type": "SHORT",
        "origin": "REMOTE"
      }
    ]
  }
]
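If you do not want to write such schema definitions by hand, you can derive the attribute list from a sample document. The following is a purely hypothetical helper with a crude type mapping limited to the types used in this example; the names and heuristics are mine, not part of the API:

# Hypothetical helper: derive "attributes" entries for the
# tableBasedRepresentation from a sample (CouchDB) document.
TYPE_MAP = {
    str: ("STRING", "string"),
    int: ("INTEGER", "int16"),
}

def attributes_from_doc(doc):
    attrs = []
    for name, value in doc.items():
        datatype, template = TYPE_MAP.get(type(value), ("STRING", "string"))
        attr = {"name": name, "datatype": datatype, "templateType": template}
        if datatype == "STRING":
            attr["length"] = max(50, len(value))  # crude length heuristic
        attrs.append(attr)
    return attrs

sample = {"KEY": "a1b2", "LOCATIONID": "L001", "YEAROFBIRTH": 1980}
print(attributes_from_doc(sample))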
You can find the details of the OpenAPI definitions for each of the elements on SAP API Business Hub in the "Models View" tab.
Once you have uploaded the datasets, you will see them when you browse the catalog.
In my example I have uploaded two datasets under connection ID COUCHDB_IMPORT: booking and visits.
Conclusion
With the metadata API you can add datasets to the Catalog for connection types that are not supported in Connection Management. Once you have a dataset in the Catalog, you can add tags and link glossary terms.
This allows you to manage external datasets that you might further process with pipelines, as described in my previous blog: Connecting Semi-structured Data Sources to SAP Data Intelligence.
Comments
Many thanks, Thorsten, for your post! Does it mean that we can bind rulebooks to these datasets?
No, because this only manages the metadata; you cannot access the data. To do that, you need to connect the data source using Connection Management.
But when I create an OPENAPI connection in Connection Management, I will not see it in Metadata Explorer -> Browse Connections. That means I cannot use rules in Metadata Explorer for data that comes in via API. In this case, I suppose we need to use the Modeler.
Yes. The main reason, though, is that rules only apply to certain connection/data source types (see the help for details). OpenAPI is not supported. Because OpenAPI is mostly used for external data sources that are not maintained by the company running SAP Data Intelligence, it might anyway make more sense to choose a stream-like quality check (pipeline) rather than the static "rule" approach.
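To illustrate what such a stream-like check could look like, here is a minimal plain-Python sketch; it is not tied to any specific Modeler operator API, and the checks are made up for the booking example above:

# Minimal illustration of an in-pipeline quality check: validate each
# record with simple predicates instead of static catalog rules.
CHECKS = {
    "location_id": lambda v: isinstance(v, str) and v != "",
    "booking_time": lambda v: v is not None,
}

def failed_checks(record):
    # Return the names of all checks the record fails.
    return [name for name, ok in CHECKS.items()
            if name in record and not ok(record[name])]

record = {"location_id": "", "name": "Alice", "booking_time": None}
print(failed_checks(record))   # -> ['location_id', 'booking_time']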