Scraping RSS Feeds with SAP Data Hub

I had a request to retrieve RSS data using SAP Data Hub and store this in SAP Vora.
There are many ways to achieve this; here's how I did it.

Data Hub Pipeline

  • Docker with Beautiful Soup 4 & Pandas
  • Python Operator using Beautiful Soup 4
  • Vora Avro Ingestor
  • Vora Disk Table

Python is well suited to scraping RSS feeds. We can wrap our code in a custom operator and then associate that operator with a suitable Docker image that contains the required libraries.

Create a Docker Image

First we need to create a Docker image that contains the required Python libraries, and give it some appropriate tags that we will later link to our operator.

# Use an official Python 3.6 image as a parent image
FROM python:3.6.4-slim-stretch

# Install python libraries
RUN pip install requests
RUN pip install pandas
RUN pip install beautifulsoup4
RUN pip install lxml

Custom SAP Data Hub Python Operator

I have tested the operator with various RSS feeds and it appears to be reliable.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = ""

resp = requests.get(url)
soup = BeautifulSoup(resp.content, features="xml")

items = soup.find_all('item')

news_items = []

for each_item in items:
    news_item = {}
    news_item['RSS_TITLE'] = each_item.title.text
    news_item['RSS_DESC'] = each_item.description.text
    news_item['RSS_LINK'] = each_item.link.text
    news_item['RSS_DATE'] = each_item.pubDate.text
    # Collect the item so that it ends up in the DataFrame
    news_items.append(news_item)

# Use a Pandas DataFrame to serialise the items as CSV
df = pd.DataFrame(news_items)
csv_data = df.to_csv(index=False, header=True, sep=";")

# Create Data Hub Message
attr = dict()
attr["message.commit.token"] = "stop-token"
messageout = api.Message(body=csv_data, attributes=attr)
api.send("outmsg", messageout)

If we connect this to the WireTap component we can quickly see that data is being retrieved and structured as required.
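
The parsing step can also be checked outside Data Hub before the operator is deployed. The following is a minimal sketch of that idea, not the operator itself: it swaps Beautiful Soup and Pandas for the standard library's xml.etree.ElementTree and csv modules so that it runs without extra packages. The sample feed, the rss_to_csv helper and the choice to emit an empty string for a missing element are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for this sketch.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <description>Hello; world</description>
      <link>https://example.com/1</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""

# Output column names and the RSS child elements they are read from.
COLUMNS = ["RSS_TITLE", "RSS_DESC", "RSS_LINK", "RSS_DATE"]
TAGS = ["title", "description", "link", "pubDate"]


def rss_to_csv(xml_text, sep=";"):
    """Convert RSS 2.0 text into a CSV string with one row per <item>."""
    root = ET.fromstring(xml_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator="\n")
    writer.writerow(COLUMNS)
    for item in root.iter("item"):
        # findtext returns the default when the child element is absent,
        # so an incomplete item yields an empty field instead of an error.
        writer.writerow(
            [(item.findtext(tag, default="") or "").strip() for tag in TAGS]
        )
    return buffer.getvalue()


if __name__ == "__main__":
    print(rss_to_csv(SAMPLE_RSS))
```

Note that the second sample item has no description or pubDate. With the attribute-style access used in the operator (each_item.description.text), a missing element would raise an AttributeError, so a real feed with incomplete items may need a similar guard.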

Vora Avro Ingestor

Using the Vora Avro Ingestor is a great way to receive structured information into Vora.
I needed to use fixed-length fields in the schema below; this has the advantage of working with HANA Smart Data Access (SDA).

{
  "name": "RSS_FEED",
  "type": "record",
  "fields": [
    {
      "name": "RSS_TITLE",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DESC",
      "type": "fixed",
      "size": 2500
    },
    {
      "name": "RSS_LINK",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DATE",
      "type": "fixed",
      "size": 16
    }
  ]
}
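
Because the fields are fixed length, it can be useful to know in advance which feed values exceed the declared sizes. The sketch below trims values in Python before they are sent. It assumes the sizes are counted in UTF-8 bytes and that over-long values should be truncated; how the Vora Avro Ingestor itself treats over-long values is not covered here. The fit_to_size and fit_record helpers are hypothetical names, not part of Data Hub.

```python
# Field sizes copied from the schema above.
FIXED_SIZES = {
    "RSS_TITLE": 128,
    "RSS_DESC": 2500,
    "RSS_LINK": 128,
    "RSS_DATE": 16,
}


def fit_to_size(value, size, encoding="utf-8"):
    """Truncate value so that its encoded form is at most `size` bytes.

    Assumption: the schema sizes are byte counts, not character counts.
    """
    encoded = value.encode(encoding)
    if len(encoded) <= size:
        return value
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    return encoded[:size].decode(encoding, errors="ignore")


def fit_record(record):
    """Return a copy of record with every schema field trimmed to its size."""
    return {
        name: fit_to_size(record.get(name, ""), size)
        for name, size in FIXED_SIZES.items()
    }


if __name__ == "__main__":
    print(fit_to_size("Mon, 01 Jan 2018 00:00:00 GMT", 16))  # Mon, 01 Jan 2018
```

With a size of 16, a typical RSS pubDate such as "Mon, 01 Jan 2018 00:00:00 GMT" is reduced to its date part, "Mon, 01 Jan 2018".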

For completeness I have captured the properties of the Vora Avro Ingestor, and highlighted the fields that I changed.

Executing this pipeline will now retrieve the RSS data and automatically create the table within SAP Vora. We can easily verify that the table has been created with the SAP Vora Tools or the new Metadata Explorer.

The Data Preview shows us what is now stored in the SAP Vora disk engine.
