Scraping RSS Feeds with SAP Data Hub
I had a request to retrieve RSS data using SAP Data Hub and store this in SAP Vora.
There are many ways to achieve this; here’s how I did it.
Data Hub Pipeline
- Docker with Beautiful Soup 4 & Pandas
- Python Operator using Beautiful Soup 4
- Vora Avro Ingestor
- Vora Disk Table
Figure 1: Data Intelligence Pipeline
Python is great for scraping RSS feeds. We can wrap our code in a custom operator and then associate it with a suitable Docker image that contains the required libraries.
Create a Docker Image
First we need to create a Docker image that contains the required Python libraries and give it some appropriate tags, which we will then use to link the image to our operator.
Figure 2: Docker Image
# Use an official Python 3.6 image as a parent image
FROM python:3.6.4-slim-stretch
# Data Intelligence requires Tornado
RUN python3 -m pip --no-cache install tornado==5.0.2
RUN python3 -m pip install requests
RUN python3 -m pip install pandas
RUN python3 -m pip install beautifulsoup4
RUN python3 -m pip install lxml
# Add vflow user and vflow group to prevent error
# container has runAsNonRoot and image will run as root
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow
If the docker build fails, you can get more details through the Diagnostic Information.
Figure 3: Download Diagnostic Logs
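Once the image builds, it can be worth confirming that the libraries installed in the Dockerfile can actually be imported. The following is a minimal sketch, not part of the original pipeline: the helper name `find_missing_modules` is hypothetical, and the module list simply mirrors the pip packages above (note that `beautifulsoup4` is imported as `bs4`).

```python
import importlib

# Module names matching the pip packages installed in the Dockerfile.
# beautifulsoup4 is imported under the name "bs4".
REQUIRED_MODULES = ["tornado", "requests", "pandas", "bs4", "lxml"]


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be imported."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules can be imported.")
```

Running this script inside the container (or temporarily from a test operator) should report any package that failed to install.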
Custom SAP Data Hub Python Operator
I have tested the operator with various RSS feeds and it appears to be reliable.
Figure 4: Create Custom Python Operator
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Download the feed; fail early on network or HTTP errors
url = "http://feeds.bbci.co.uk/news/rss.xml"
resp = requests.get(url, timeout=30)
resp.raise_for_status()

# Parse the XML (requires lxml) and collect one dict per <item>
soup = BeautifulSoup(resp.content, features="xml")
items = soup.find_all('item')
news_items = []
for each_item in items:
    news_item = {}
    news_item['RSS_TITLE'] = each_item.title.text
    news_item['RSS_DESC'] = each_item.description.text
    news_item['RSS_LINK'] = each_item.link.text
    news_item['RSS_DATE'] = each_item.pubDate.text
    news_items.append(news_item)

# Use a Pandas DataFrame to build a semicolon-separated CSV string
df = pd.DataFrame(news_items)
csv_body = df.to_csv(index=False, header=True, sep=";")

# Create Data Hub Message; `api` is provided by the Python operator runtime
attr = dict()
attr["message.commit.token"] = "stop-token"
messageout = api.Message(body=csv_body, attributes=attr)
api.send("outmsg", messageout)
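One thing to watch with the operator code: an expression such as `each_item.description.text` raises an `AttributeError` if a feed item has no `<description>` tag, because Beautiful Soup returns `None` for a missing child. The sketch below shows one way to tolerate missing tags. To keep it runnable offline and outside Data Hub, it swaps in the standard library (`xml.etree.ElementTree` and `csv`) instead of Beautiful Soup and Pandas, and it parses a made-up sample feed. `SAMPLE_RSS`, `FIELD_TAGS`, `parse_items` and `rows_to_csv` are hypothetical names used only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up feed for illustration; the second item lacks description and pubDate.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <description>Has every field; including a semicolon</description>
      <link>http://example.com/1</link>
      <pubDate>Mon, 06 Jan 2020 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""

# Output column name -> RSS child tag, in the same order as the operator code.
FIELD_TAGS = {
    "RSS_TITLE": "title",
    "RSS_DESC": "description",
    "RSS_LINK": "link",
    "RSS_DATE": "pubDate",
}


def parse_items(rss_text):
    """Return one dict per <item>, using "" for any missing child tag."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        row = {}
        for column, tag in FIELD_TAGS.items():
            # findtext returns the default when the child tag is absent
            row[column] = (item.findtext(tag, default="") or "").strip()
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise rows as semicolon-separated CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELD_TAGS), delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(parse_items(SAMPLE_RSS)))
```

Values that contain the delimiter are quoted by the `csv` module, so a semicolon inside a description does not shift the columns. The same defaulting idea could be applied in the Beautiful Soup version by checking each child for `None` before reading `.text`.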
If we connect this to the WireTap component we can quickly see that data is being retrieved and structured as required.
Figure 5: WireTap Output
Vora Avro Ingestor
Using the Vora Avro Ingestor is a great way to receive structured information into Vora.
I needed to use fixed-length fields in the schema below; this has the advantage of working with HANA Smart Data Access (SDA).
{
  "name": "RSS_FEED",
  "type": "record",
  "fields": [
    {
      "name": "RSS_TITLE",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DESC",
      "type": "fixed",
      "size": 2500
    },
    {
      "name": "RSS_LINK",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DATE",
      "type": "fixed",
      "size": 16
    }
  ]
}
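Because the fields are fixed length, it can help to check how real feed values compare with the declared sizes before loading them. The sketch below reads the sizes from the schema and reports any value whose encoded length exceeds its field. It assumes sizes are counted in bytes (as in the Avro `fixed` type) and that values are UTF-8 encoded. The helper names `fixed_sizes` and `oversized_fields` and the sample row are hypothetical, and the article does not state how the Vora Avro Ingestor treats over-length values, so this only flags them.

```python
import json

# The schema from the article, compacted.
AVRO_SCHEMA = """
{
  "name": "RSS_FEED",
  "type": "record",
  "fields": [
    {"name": "RSS_TITLE", "type": "fixed", "size": 128},
    {"name": "RSS_DESC", "type": "fixed", "size": 2500},
    {"name": "RSS_LINK", "type": "fixed", "size": 128},
    {"name": "RSS_DATE", "type": "fixed", "size": 16}
  ]
}
"""


def fixed_sizes(schema_text):
    """Map each fixed-type field name to its declared size."""
    schema = json.loads(schema_text)
    return {
        field["name"]: field["size"]
        for field in schema["fields"]
        if field.get("type") == "fixed"
    }


def oversized_fields(row, sizes, encoding="utf-8"):
    """Return {field: encoded_length} for values longer than the declared size."""
    too_long = {}
    for name, size in sizes.items():
        length = len(str(row.get(name, "")).encode(encoding))
        if length > size:
            too_long[name] = length
    return too_long


if __name__ == "__main__":
    # Made-up row; the date uses the usual RSS pubDate layout.
    sample_row = {
        "RSS_TITLE": "First story",
        "RSS_DESC": "A short description",
        "RSS_LINK": "http://example.com/1",
        "RSS_DATE": "Mon, 06 Jan 2020 10:00:00 GMT",
    }
    print(oversized_fields(sample_row, fixed_sizes(AVRO_SCHEMA)))
```

A full RSS `pubDate` string such as the one in the sample row is 29 characters long, so it is reported against the 16-byte `RSS_DATE` field. Depending on what you need from the date, you may want to shorten the value in the operator or declare a larger size.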
For completeness I have captured the properties of the Vora Avro Ingestor, and highlighted the fields that I changed.
Figure 6: Vora Avro Ingestor Configuration
Executing this pipeline will now retrieve the RSS data and automatically create the table within SAP Vora. We can easily verify that the table has been created with the SAP Vora Tools or the Metadata Explorer.
Figure 7: Metadata Explorer Fact Sheet of RSS_FEED table
The Data Preview shows us what is now stored in the SAP Vora disk engine.
Figure 8: Metadata Data Preview