Today, developers and data experts can choose from an impressive number of public datasets for various purposes.
Of course, these datasets are also a perfect starting point for building demos and prototypes! ⌨
In this post I will describe an analytics demo, based on a public dataset, which is built with the following components:
- SAP Data Hub
  - Metadata Explorer for data exploration
  - Pipelines for data ingestion
  - Vora for persistency and SQL access
- SAP HANA
  - Virtualization layer for integration with business data
- SAP Analytics Cloud
The public dataset used for this demo is the Deutsche Börse Public Dataset, which is provided via AWS S3 (Link).
From an architecture pattern perspective, this public dataset could represent a generic data lake based on cloud object storage.
Demo video
The storyline of the demo consists of these main building blocks:
The SAP Data Hub Metadata Explorer will be used to graphically browse the content of the S3 bucket.
Manage & Persist
A basic Data Hub pipeline will be used to read the data from the S3 bucket and persist it in SAP Data Hub Vora.
Combine with Business Data
In the demo, a HANA system will contain business data or, more precisely, additional master data for analysis purposes.
Besides combining the trade data from S3 with business data, the calculation view in this scenario is also used to virtually access the data stored in Vora.
From an architecture perspective, HANA could be used as a virtualization layer to combine multiple distributed big data engines with In-Memory processing capabilities.
One reason to consider SAP HANA as a virtualization layer is the potential reuse of existing SAP data models and authorizations by generating them as HANA views. (Link)
In this scenario, SAP BW master data (InfoObjects) and the corresponding analysis authorizations could be exported from a BW/4HANA system and virtually combined with data from S3 using HANA views.
Last but not least, it is always a pleasure to build some beautiful visualizations with the SAP Analytics Cloud.
The SAC HANA live data connection (Link) is also an important building block of this demo.
“Some benefits of live data connection are:
- No data replication and prevents transfer of large datasets from source systems
- Automatically updated with current data – “live” data
- Create complex models and calculation in source systems and leverage them within SAC
- Sensitive data can stay in local network, behind your firewall”
1. SAP Data Hub Metadata Explorer
Maintain the S3-Bucket connection in “Connection Management”
First, the S3 bucket needs to be maintained in the Data Hub (DH) Connection Management.
Relevant parameters are:
- Custom endpoint = “s3.eu-central-1.amazonaws.com”
- Region = “eu-central-1”
- Root Path = “/deutsche-boerse-xetra-pds”
The AWS Access and Secret Keys have to be maintained.
Leaving the access and secret key empty does not work for this demo.
Picture: Connection to S3-Bucket
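To make the parameters above a bit more tangible, here is a minimal Python sketch that collects them in one place and rejects empty keys, mirroring the observation that empty keys did not work for this demo. The class `S3Connection` and the function `demo_connection` are hypothetical names for illustration and not part of any Data Hub API. The path-style URL layout is likewise an assumption about how an object in the bucket could be addressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Connection:
    """Illustrative container for the connection parameters (not a Data Hub API)."""

    endpoint: str
    region: str
    root_path: str
    access_key: str
    secret_key: str

    def validate(self) -> None:
        # In this demo, leaving the keys empty did not work, so reject that early.
        if not self.access_key or not self.secret_key:
            raise ValueError("access key and secret key must both be set")

    def bucket(self) -> str:
        # The first segment of the root path is treated as the bucket name.
        return self.root_path.strip("/").split("/")[0]

    def path_style_url(self, object_key: str) -> str:
        # Assumption: path-style addressing, i.e. https://<endpoint>/<bucket>/<key>.
        return f"https://{self.endpoint}/{self.bucket()}/{object_key.lstrip('/')}"


def demo_connection(access_key: str, secret_key: str) -> S3Connection:
    """Build the connection with the parameter values listed in this post."""
    conn = S3Connection(
        endpoint="s3.eu-central-1.amazonaws.com",
        region="eu-central-1",
        root_path="/deutsche-boerse-xetra-pds",
        access_key=access_key,
        secret_key=secret_key,
    )
    conn.validate()
    return conn
```

In practice the two keys would come from a secret store or environment variables rather than from source code.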
Browse content of S3 bucket
Once the S3 connection has been maintained, the data is available for exploration. The first findings:
One folder per day:
Picture: Folder structure S3 bucket
Each folder contains CSV files:
Picture: CSV files in folder of S3 bucket
The data types and content of the CSV files:
Picture: Columns and data type description of CSV file with trading data
Picture: Content and data distribution in the CSV file.
After exploring the content of the S3 bucket, building the DH pipeline is the next step.
2. SAP Data Hub Pipeline
The pipeline consists of these main elements:
S3 Consumer
This operator is used to access an S3 instance to read a file or periodically poll a directory for its contents.
JavaScript Operator
In this example, the JavaScript iterates over the content of the folder, i.e. it passes the file names to the next S3 Consumer.
Vora Avro Ingestor
This operator allows you to dynamically ingest data into SAP Vora based on incoming Avro messages or other text messages in CSV or JSON format.
Picture: DH pipeline from S3 to Vora Avro Ingestor
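The flow of the pipeline can be sketched in a few lines of Python: list the files of one folder, hand each file name to a reader, and pass every row to an ingest step. This is only an illustration of the data flow and not the Data Hub operators themselves. The demo uses JavaScript inside the pipeline, and the in-memory "bucket" as well as the function names `list_files`, `read_rows` and `run_pipeline` are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

# Hypothetical in-memory stand-in for the bucket: object key -> CSV text.
FakeBucket = Dict[str, str]


def list_files(bucket: FakeBucket, folder: str) -> List[str]:
    """Return the CSV object keys inside one folder, in sorted order."""
    prefix = folder.rstrip("/") + "/"
    return sorted(k for k in bucket if k.startswith(prefix) and k.endswith(".csv"))


def read_rows(bucket: FakeBucket, key: str) -> Iterator[dict]:
    """Parse one CSV object into dict rows, using its header line as field names."""
    yield from csv.DictReader(io.StringIO(bucket[key]))


def run_pipeline(bucket: FakeBucket, folder: str, ingest: Callable[[dict], None]) -> int:
    """Iterate over the folder, read each file and pass every row to `ingest`.

    Returns the number of rows handed to the ingest step.
    """
    count = 0
    for key in list_files(bucket, folder):
        for row in read_rows(bucket, key):
            ingest(row)
            count += 1
    return count
```

Calling `run_pipeline(bucket, "some-folder", rows.append)` with a plain list `rows` collects all rows of that folder, which is roughly what the ingestor receives in the graphical pipeline.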
2. SAP Data Hub Vora
After running the pipeline, the trading data for the selected day is saved in a Vora disk-based streaming table.
Streaming tables support SQL statements like INSERT, UPDATE, or DELETE.
Streaming tables also persist their content in the distributed log (DLog).
This enables the cluster to recover the data after a restart or failure.
Picture: Trading data persisted in Vora streaming table
In addition to persisting the data in an updatable disk-based table, a Vora table could also be created directly on an S3 bucket:
3. SAP HANA
As described, in this demo the HANA In-Memory Engine provides two relevant features:
- Virtual access to the Vora tables without data replication
Picture: Vora table connected to SAP HANA as remote source
- Graphical modeling to combine the trading data with business data or master data.
In this demo, additional information about the listed companies will be added for later SAC visualizations.
Picture: Calculation View based on remote data combined with company information.
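Conceptually, the calculation view behaves like a join between the trading data and the company master data. The following Python sketch uses an in-memory SQLite database purely as a stand-in to show that idea. It is not HANA or Vora SQL, and all table names, column names and sample values are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a view that joins trades to companies.

    `trades` stands in for the remote Vora table, `companies` for the business
    master data kept in HANA. Both schemas are illustrative assumptions.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE trades (
            isin TEXT,
            trade_time TEXT,
            end_price REAL,
            traded_volume INTEGER
        );
        CREATE TABLE companies (
            isin TEXT PRIMARY KEY,
            company_name TEXT,
            sector TEXT
        );
        -- LEFT JOIN keeps trades even when no master data exists for the ISIN.
        CREATE VIEW trades_with_company AS
        SELECT t.isin, c.company_name, c.sector,
               t.trade_time, t.end_price, t.traded_volume
        FROM trades AS t
        LEFT JOIN companies AS c ON c.isin = t.isin;
        """
    )
    return con
```

A left join is used here so that trades without matching master data still show up. Whether the real calculation view uses an inner or a left outer join is a modeling decision that this post does not specify.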
3. SAP Analytics Cloud (SAC)
For this demo the folder and trading data files of November 15th were loaded and visualized.
The first visualization is a basic time series or line chart per minute:
Picture: Trade volume line chart
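The aggregation behind such a per-minute chart is a simple sum of the traded volume per time stamp. Here is a small Python sketch of that step, assuming each row is a dict with the keys `Time` and `TradedVolume`. These column names are an assumption about the CSV layout and are not taken from this post.

```python
from collections import defaultdict
from typing import Dict, Iterable


def volume_per_minute(rows: Iterable[dict]) -> Dict[str, int]:
    """Sum the traded volume per minute.

    Assumes every row carries a "Time" value at minute granularity (for
    example "08:00") and a numeric "TradedVolume"; both keys are assumptions.
    """
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["Time"]] += int(row["TradedVolume"])
    # Sort by time so the result can be plotted directly as a time series.
    return dict(sorted(totals.items()))
```

In the demo itself this aggregation is done by SAC on top of the HANA live connection, so no such code is needed there.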
The next chart compares start- and end-price per minute of the SAP share:
And a basic analysis to identify outliers based on the trade volume and price change:
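One simple way to express such an outlier check in code is to flag rows whose volume is far above the typical volume or whose price moved strongly within the minute. The rule below is only an illustrative assumption and not what SAC computes. The thresholds and the column names `StartPrice`, `EndPrice` and `TradedVolume` are hypothetical as well.

```python
from statistics import median
from typing import List


def price_change_pct(start: float, end: float) -> float:
    """Relative price change between start and end price, in percent."""
    return (end - start) / start * 100.0


def flag_outliers(rows: List[dict], volume_factor: float = 5.0,
                  change_pct: float = 1.0) -> List[dict]:
    """Return rows with unusually high volume or a large price change.

    A row is flagged if its volume exceeds `volume_factor` times the median
    volume, or if its absolute price change exceeds `change_pct` percent.
    Rows with a start price of zero are skipped to avoid a division by zero.
    """
    volumes = [int(r["TradedVolume"]) for r in rows]
    if not volumes:
        return []
    typical = median(volumes)
    flagged = []
    for r in rows:
        start = float(r["StartPrice"])
        if start == 0:
            continue
        change = price_change_pct(start, float(r["EndPrice"]))
        if int(r["TradedVolume"]) > volume_factor * typical or abs(change) > change_pct:
            flagged.append({**r, "PriceChangePct": change})
    return flagged
```

The median is used instead of the mean so that a single very large trade does not shift the baseline it is compared against.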
On the morning of November 15th, an international retail holding company had a major share of the overall trade volume:
The intention of this blog and demo is to show, based on a public dataset, how plug-and-play-like integration between several SAP analytics architecture components works in practice.
What is important to emphasize from my perspective:
- No downloads or installations required, this demo was built using only the web-browser as IDE
- Except for a few lines of JavaScript, the demo is implemented graphically.
- The DH pipelines are a great tool to visually model dataflows which combine technologies like Kafka, SAP or cloud-based object storage
- Vora, in combination with SAP HANA, then makes the data accessible to users who have a background in relational databases or SQL
- And finally the relevant information is made available for a larger audience using SAP Analytics Cloud (SAC)
Many thanks for reading this blog to the end!