Automating Web Data Acquisition With SAP Data Intelligence
Here I want to build a pipeline in SAP Data Intelligence that retrieves economic indicators from the internet, processes them, and then loads them into the SAP Vora engine.
The operators I will use for this pipeline are:
- Message Generator – Pass the URL to the HTTP Client
- HTTP Client – Download CSV file with Exchange Rates
- JavaScript – Remove 5 Header Rows
- Multiplexer – Split the pipeline to provide multiple outputs
- Wiretap – View the pipeline data on the screen
- Vora Avro Ingestor – Load the data into Vora
- Write File – Persist the data into HDFS, S3 or other storage
- Graph Terminator – Stop the graph from continuously running
The European Central Bank (ECB) provides the Statistical Data Warehouse that we will use in our pipeline.
Figure 1: European Central Bank Website
We can then take the desired feed. In my case, I’ve chosen the daily USD-EUR exchange rate:
http://sdw.ecb.europa.eu/quickviewexport.do?SERIES_KEY=120.EXR.D.USD.EUR.SP00.A&type=csv
Figure 2: European Central Bank, CSV Data, with 5 header rows
We can pass the URL and method using the code below in a simple JavaScript Message Generator.
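// Build the request message; the HTTP Client reads the URL and method from its attributes
generateMessage = function() {
    var msg = {};
    msg.Attributes = {};
    msg.Attributes["http.url"] = "http://sdw.ecb.europa.eu/quickviewexport.do?SERIES_KEY=120.EXR.D.USD.EUR.SP00.A&type=csv";
    msg.Attributes["http.method"] = "GET";
    msg.Body = {};
    return msg;
};

// Register the generator callback
$.addGenerator(doTick);

// Send one request message to the output port
function doTick(ctx) {
    $.output(generateMessage());
}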
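If we later want to pull more than one series, the URL can be built from the series key instead of being hard-coded. The following is only a minimal sketch: buildEcbUrl and buildRequestMessage are hypothetical helper names (not part of the operator API), and it assumes the export endpoint accepts other series keys with the same query parameters.

```javascript
// Hypothetical helpers (not part of the SAP operator API): build the request
// message for any ECB series key, using the URL pattern shown above.
var ECB_EXPORT_URL = "http://sdw.ecb.europa.eu/quickviewexport.do";

function buildEcbUrl(seriesKey) {
    // encodeURIComponent leaves letters, digits and "." untouched,
    // so the example key comes out unchanged.
    return ECB_EXPORT_URL + "?SERIES_KEY=" + encodeURIComponent(seriesKey) + "&type=csv";
}

function buildRequestMessage(seriesKey) {
    var msg = {};
    msg.Attributes = {};
    msg.Attributes["http.url"] = buildEcbUrl(seriesKey);
    msg.Attributes["http.method"] = "GET";
    msg.Body = {};
    return msg;
}

// Inside the operator, "$" is provided by the runtime; the guard lets the
// helpers also run in a plain JavaScript engine for testing.
if (typeof $ !== "undefined") {
    $.addGenerator(function (ctx) {
        $.output(buildRequestMessage("120.EXR.D.USD.EUR.SP00.A"));
    });
}
```

Keeping the message construction in a plain function means it can be checked outside the pipeline before the graph is run.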
Next, we use the Data Hub Pipeline HTTP Client operator.
We do not need to specify any of its fields, as they are populated by the JavaScript above.
I have increased the Request timeout (ms) setting to 25,000 ms (25 seconds).
Figure 3: HTTP Client
We can now test this with the WireTap which accepts any input type.
Figure 4: Skeleton Pipeline
After saving our pipeline, we can run it.
Figure 5: Save Pipeline
Creating the file name as above prefixed with “.” will automatically create the desired repository folder structure.
Figure 6: Repository Structure
We can press Run and we should see it running at the bottom
Figure 7: Running Pipeline
With the pipeline running, we open the UI for the WireTap
Figure 8: Open Wiretap UI
We can see our data being returned to the screen
Figure 9: CSV in Wiretap output
The pipeline will continue to run forever, so you should stop it.
We extend the pipeline with a Multiplexer and a Write File operator.
Figure 10: HDFS Configuration
The directory structure is automatically created if it does not already exist. <counter>, <date> and <time> are built-in variables that can be used to create filenames. We can re-use connections that are already defined.
Re-running the pipeline will save the CSV file into HDFS.
We can browse the output in HDFS with the Data Intelligence Metadata Explorer.
We can see that there are a number of header rows that should be handled.
A simple piece of JavaScript will allow us to do this.
Much of the code below is the framework for the JavaScript Operator 2; we just need the lines that modify inbody to actually strip out the header rows.
$.setPortCallback("input", onInput);

function isByteArray(data) {
    return (typeof data === 'object' && Array.isArray(data)
        && data.length > 0 && typeof data[0] === 'number');
}

function onInput(ctx, s) {
    var msg = {};
    var inbody = s.Body;

    if (isByteArray(inbody)) {
        inbody = String.fromCharCode.apply(null, inbody);
    }

    // remove first 5 lines
    // break the textblock into an array of lines
    inbody = inbody.split('\n');
    // remove 5 lines, starting at the first position
    inbody.splice(0, 5);
    // join the array back into a single string
    inbody = inbody.join('\n');

    msg.Body = inbody;
    $.output(msg);
}
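The two text-handling steps above can also be pulled out into small helpers, which makes them easy to try outside the pipeline. This is a sketch under assumptions rather than a drop-in replacement: bytesToString and stripHeaderRows are hypothetical names, the chunk size is arbitrary, and the byte decoding treats each byte as one character, so it is only suitable for plain ASCII or Latin-1 content such as this CSV.

```javascript
// Sketch only: helper names are hypothetical, not part of the operator API.

// Decode a byte array in chunks, so that a large download does not pass
// too many arguments to String.fromCharCode in a single call.
// Each byte becomes one character (fine for ASCII/Latin-1, not for UTF-8
// multi-byte sequences).
function bytesToString(bytes) {
    var chunkSize = 8192;
    var parts = [];
    for (var i = 0; i < bytes.length; i += chunkSize) {
        parts.push(String.fromCharCode.apply(null, bytes.slice(i, i + chunkSize)));
    }
    return parts.join("");
}

// Drop the first `count` lines, accepting both "\n" and "\r\n" line endings.
function stripHeaderRows(text, count) {
    var lines = text.split(/\r?\n/);
    lines.splice(0, count);
    return lines.join("\n");
}
```

Inside onInput, the body would then become stripHeaderRows(bytesToString(inbody), 5) when the input is a byte array.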
We can save and re-run our job, and check the output with either the WireTap or Discovery
We can now extend the pipeline further, by loading into SAP Vora using the Vora Avro Ingestor.
Despite what the name suggests, this operator actually works with JSON, CSV and Avro file formats.
The Vora Avro Ingestor expects an Avro schema, which tells Vora the table name, columns, data types and specification. We need to supply this schema, which is shown below.
DefaultAvroSchema:
{
    "name": "ECB_EXCHANGE_RATE",
    "type": "record",
    "fields": [
        {
            "name": "ER_DATE",
            "type": [
                "null",
                "date"
            ]
        },
        {
            "name": "EXCHANGE_RATE",
            "type": [
                "null",
                "double"
            ]
        }
    ]
}
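Each column type in the schema is a union of "null" and a value type, which is what allows an empty CSV field to end up as a NULL. To make that mapping concrete, here is a small illustration in plain JavaScript. It is not how the ingestor works internally (the operator does its own parsing), and toRecord is a hypothetical name used only for this sketch.

```javascript
// Hypothetical helper, for illustration only: turn one CSV line into a
// record shaped like the schema, using null for empty fields.
function toRecord(line) {
    var fields = line.split(",");
    var date = fields[0].trim();
    var rate = fields.length > 1 ? fields[1].trim() : "";
    return {
        ER_DATE: date === "" ? null : date,                // "null" or "date"
        EXCHANGE_RATE: rate === "" ? null : parseFloat(rate) // "null" or "double"
    };
}
```

A line with a missing rate therefore produces a record whose EXCHANGE_RATE is null rather than a number.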
The other parameters for the Vora Avro Ingestor with the TA image would look like the following.
dsn: v2://dqp:50000/?binary=true
engineType: Disk
tableType: empty (not Streaming)
databaseSchema: WEBDATA
csvHeaderIncluded: false
user: admin
password: I can't tell you that
Rerunning this pipeline now produces an error, as we attempt to insert a “-” into a numeric field.
We can add the following additional line of JavaScript to correct this.
// Replace "-" characters with Nulls in Exchange Rate data
inbody = inbody.replace(/,-/g,",");
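The replacement above is enough for exchange rates, which are never negative. For a feed that can contain negative values, that pattern would also strip the minus sign from a value such as ",-0.5". A slightly stricter variant, sketched here with the hypothetical helper name blankMissingValues, only blanks a field that consists of the "-" alone.

```javascript
// Hypothetical helper: blank a field only when it is exactly "-",
// leaving negative numbers such as "-0.5" untouched.
function blankMissingValues(csv) {
    // ",-" must be followed by another comma, a line break, or the end of the text.
    return csv.replace(/,-(?=,|\r?\n|$)/g, ",");
}
```

The dashes inside the date column are not affected, because they are never directly preceded by a comma.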
If we rerun our pipeline now, we can check Vora Tools to see that the schema and table have been created and the data has been loaded.
However, the pipeline will continue to run, re-downloading the same data and inserting it into our Vora table continuously. To change this behavior, we need to supply a commit token within our existing JavaScript operator. The Vora Avro Ingestor will then pass this on to the Graph Terminator.
msg.Attributes = {};
// Add Commit Token to DataHub message header to stop pipeline from running continuously
msg.Attributes["message.commit.token"] = "stop-token";
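Note that the lines above start from an empty Attributes object, so anything the HTTP Client put into the message header is dropped. If those attributes are wanted downstream (an assumption; this pipeline does not need them), they can be copied across before the token is added. The helper name withCommitToken is hypothetical.

```javascript
// Hypothetical helper: copy the incoming attributes, then add the commit token.
function withCommitToken(inMsg, body, token) {
    var out = { Attributes: {}, Body: body };
    var attrs = inMsg.Attributes || {};
    for (var key in attrs) {
        if (Object.prototype.hasOwnProperty.call(attrs, key)) {
            out.Attributes[key] = attrs[key];
        }
    }
    // Same attribute name as used above to stop the pipeline
    out.Attributes["message.commit.token"] = token;
    return out;
}
```

In onInput this would be used as $.output(withCommitToken(s, inbody, "stop-token")).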
The completed pipeline would then look like this.
Rerunning it now, we can see that the pipeline completes and all is good 🙂
I have written an alternative solution where I use Python instead of JavaScript, which can be found here.
Hi Ian - Thanks for sharing and I'm sure it will help many others including me.
Swapan
No problem Swapan, it's good to share our learnings 🙂
Very nice tutorial, illustration & use case! I wasn't familiar with the Wiretap operator but had used Terminal operator similarly. I now see the Wiretap operator is indeed in the SDH Dev Edition. Thanks for putting this together, Ian!
Doug
Thanks Doug,
I guess my Dev edition needs an update as I couldn't see it listed.
That’s very helpful Ian! Thanks for putting it together.
Excellent tutorial! Thanks for sharing it.
Nice tutorial Ian. Thumbs up!!
Hello, thanks for sharing this article.
I am using SAP Data Hub 2.3 Dev Edition. The Http client used in this example is now called - Old Http Client and it works fine with that. However, I could not get the new Http Client working.
Is it possible to get this tutorial using the new Http client?
As an example, I am generating a message like this and passing to the http client. In the Http client configuration I have tried supplying/not supplying the host name and path.
However, the graph just keeps running. The same function runs successfully using the old Http client.
generateMessage = function() {
var msg = {};
msg.Attributes = {
"http.url":"https://api_url.companywebsite.net/api/call_function"
,"http.method":"GET"
,"http.host":"api_url.companywebsite.net"
};
msg.Body = generatePayload();
counter++;
return msg;
}
Hi Santanu,
if you haven't found a solution until now: I am following this tutorial using also SAP Data Hub 2.3 Dev Edition. The new HTTP Client operator is working in my scenario.
Could it be, that $.output(generateMessage()) is missing in the script of your Message Generator? (You can have a look at the code in the example Graph "Message Generator", I also used the $.addTimer function.)
In the HTTP Client connector I chose "Configuration Manager" as Configuration Type and pasted the whole URL into the path field.
Maybe this might help you.
Ian, thanks for this blog post!
Hi Santanu,
I have updated the blog for the new HTTP Client, where I supply the url similar to your code above.
I used the message generator to do this.
Thanks, Ian.
Hi Ian Henry ,
Thanks for a very informative blog. I am trying to acquire the csv data from https url with your code but it runs indefinitely and does not give any output to the wiretap. Is it not correct to use the same code for https url? Please provide your inputs.
SAP DI: HTTP Client
Regards
Achin Kimtee