Scraping RSS Feeds with SAP Data Hub

I had a request to retrieve RSS data using SAP Data Hub and store it in SAP Vora.
There are many ways to achieve this; here is how I did it.

Data Hub Pipeline

  • Docker with Beautiful Soup 4 & Pandas
  • Python Operator using Beautiful Soup 4
  • Vora Avro Ingestor
  • Vora Disk Table

Figure 1: Data Intelligence Pipeline

Python is great for scraping RSS feeds. We can wrap our code in a custom operator and then associate that operator with a suitable Docker image that contains the required libraries.

 

Create a Docker Image

First we need to create a Docker image that contains the required Python libraries, and give it some appropriate tags that we will later link to our operator.

Figure 2: Docker Image

# Use an official Python 3.6 image as a parent image
FROM python:3.6.4-slim-stretch

# Data Intelligence requires Tornado
RUN python3 -m pip --no-cache install tornado==5.0.2
RUN python3 -m pip install requests
RUN python3 -m pip install pandas
RUN python3 -m pip install beautifulsoup4
RUN python3 -m pip install lxml

# Add vflow user and vflow group to prevent the error
# "container has runAsNonRoot and image will run as root"
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow

If the docker build fails, you can get more details through the Diagnostic Information.

Figure 3: Download Diagnostic Logs
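
A build can succeed while a library is still missing at import time, or is imported under a different name than the one it was installed with (Beautiful Soup installs as beautifulsoup4 but imports as bs4). As a rough check, a short script like the following could be run inside the container. This is only a sketch: `missing_modules` is a hypothetical helper, and the module list is taken from the Dockerfile above.

```python
import importlib.util

# Import names for the packages installed in the Dockerfile above.
REQUIRED = ["tornado", "requests", "pandas", "bs4", "lxml"]


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", missing if missing else "none")
```

An empty result means every library the operator imports is visible to the Python interpreter in the image.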

Custom SAP Data Hub Python Operator

I have tested the operator with various RSS feeds and it appears to be reliable.

Figure 4: Create Custom Python Operator

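import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://feeds.bbci.co.uk/news/rss.xml"

# Fetch the feed and fail early on HTTP errors
resp = requests.get(url, timeout=30)
resp.raise_for_status()
# The "xml" parser requires lxml, which is installed in the Docker image
soup = BeautifulSoup(resp.content, features="xml")

# Each <item> element is one news entry
items = soup.find_all('item')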

news_items = []

for each_item in items:
    news_item = {}
    news_item['RSS_TITLE'] = each_item.title.text
    news_item['RSS_DESC'] = each_item.description.text
    news_item['RSS_LINK'] = each_item.link.text
    news_item['RSS_DATE'] = each_item.pubDate.text
    news_items.append(news_item)

# Use a pandas DataFrame to serialise the items as semicolon-separated CSV
df = pd.DataFrame(news_items)
csv_data = df.to_csv(index=False, header=True, sep=";")

# Create Data Hub Message
attr = dict()
attr["message.commit.token"] = "stop-token"
messageout = api.Message(body=csv_data, attributes=attr)
api.send("outmsg", messageout)

If we connect this to the WireTap component we can quickly see that data is being retrieved and structured as required.

Figure 5: WireTap Output
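
The operator above reads `.text` from each child element directly, so an item that lacks, say, a description would raise an AttributeError. A minimal sketch of a more defensive extraction follows. To keep it runnable without extra packages, it swaps Beautiful Soup and pandas for the standard library's `xml.etree.ElementTree` and `csv`; `SAMPLE_RSS`, `rss_to_rows` and `rows_to_csv` are hypothetical names used only for illustration, and the sample feed is hand-written rather than real data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny hand-written feed; the second item has no <description>.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Sample</title>
  <item>
    <title>First; with a semicolon</title>
    <description>Desc one</description>
    <link>https://example.com/1</link>
    <pubDate>Mon, 01 Jan 2024 12:30:00 GMT</pubDate>
  </item>
  <item>
    <title>Second</title>
    <link>https://example.com/2</link>
    <pubDate>Tue, 02 Jan 2024 08:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

# Output column -> RSS child element, same columns as the operator.
COLUMNS = {
    "RSS_TITLE": "title",
    "RSS_DESC": "description",
    "RSS_LINK": "link",
    "RSS_DATE": "pubDate",
}


def rss_to_rows(xml_text):
    """Return one dict per <item>; missing elements become empty strings."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({col: (item.findtext(tag) or "").strip()
                     for col, tag in COLUMNS.items()})
    return rows


def rows_to_csv(rows):
    """Serialise rows as semicolon-separated CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(COLUMNS),
                            delimiter=";", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(rows_to_csv(rss_to_rows(SAMPLE_RSS)))
```

The title containing a semicolon is quoted by the CSV writer, so the separator inside the text does not shift the columns.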

Vora Avro Ingestor

Using the Vora Avro Ingestor is a great way to receive structured information into Vora.
I needed to use fixed-length fields, as shown below; this has the advantage of working with HANA Smart Data Access (SDA).

{
  "name": "RSS_FEED",
  "type": "record",
  "fields": [
    {
      "name": "RSS_TITLE",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DESC",
      "type": "fixed",
      "size": 2500
    },
    {
      "name": "RSS_LINK",
      "type": "fixed",
      "size": 128
    },
    {
      "name": "RSS_DATE",
      "type": "fixed",
      "size": 16
    }
  ]
}
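
Because each field has a fixed size, it can be worth checking the scraped values against the schema before they reach the ingestor. How the ingestor treats values longer than the declared size is not covered here, so the sketch below only reports them. `oversized_fields` and `short_date` are hypothetical helpers, measuring length in UTF-8 bytes is an assumption, and the 16-character date format is simply one choice that fits the size of RSS_DATE (a typical RSS pubDate is longer than that).

```python
import json
from email.utils import parsedate_to_datetime

# Field sizes copied from the Avro schema above.
SCHEMA = json.loads("""
{"name": "RSS_FEED", "type": "record", "fields": [
  {"name": "RSS_TITLE", "type": "fixed", "size": 128},
  {"name": "RSS_DESC",  "type": "fixed", "size": 2500},
  {"name": "RSS_LINK",  "type": "fixed", "size": 128},
  {"name": "RSS_DATE",  "type": "fixed", "size": 16}
]}
""")
SIZES = {field["name"]: field["size"] for field in SCHEMA["fields"]}


def oversized_fields(row):
    """Return {field: byte_length} for values longer than the schema size."""
    too_long = {}
    for name, size in SIZES.items():
        # Assumption: sizes are compared against the UTF-8 encoded length.
        length = len(str(row.get(name, "")).encode("utf-8"))
        if length > size:
            too_long[name] = length
    return too_long


def short_date(pub_date):
    """Convert an RFC 822 pubDate to 'YYYY-MM-DD HH:MM' (16 characters)."""
    return parsedate_to_datetime(pub_date).strftime("%Y-%m-%d %H:%M")


row = {
    "RSS_TITLE": "Example headline",
    "RSS_DESC": "Example description",
    "RSS_LINK": "https://example.com/news/1",
    "RSS_DATE": "Mon, 01 Jan 2024 12:30:00 GMT",
}
print(oversized_fields(row))                   # the raw pubDate is reported
row["RSS_DATE"] = short_date(row["RSS_DATE"])
print(oversized_fields(row))                   # nothing reported after shortening
```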

For completeness I have captured the properties of the Vora Avro Ingestor, and highlighted the fields that I changed.

Figure 6: Vora Avro Ingestor Configuration

Executing this pipeline now retrieves the RSS data and automatically creates the table within SAP Vora. We can easily verify that the table has been created with the SAP Vora Tools or the Metadata Explorer.

Figure 7: Metadata Explorer Fact Sheet of RSS_FEED table

The Data Preview shows us what is now stored in the SAP Vora disk engine.

Figure 8: Metadata Data Preview
