Web-site Scraping with SAP Data Intelligence
Introduction
A lot of data lies scattered and unstructured on web-sites but can nonetheless carry worthwhile information; product reviews are the most prominent example. Yet not all web-sites provide APIs for easy access.
For a proof-of-concept project we had to scrape articles from online media web-sites and do a text analysis of the content. In this blog I briefly describe how we implemented a web-scraper; in a follow-up blog I will write about our text analysis approach. You can find all the code (operators, pipelines and README) for both parts of the project in the public GitHub repository di_textanalysis.
First of all, a big thank you to my colleague Lijin Lan, who introduced me to scrapy and created all the spiders.
Scrapy
For the web-site scraping we used the open source framework scrapy. It is for good reasons the most popular tool among developers who want to extract data from web-sites without too much effort. It is Python-based and you have to maintain only 5 configuration files:
- items.py
- middlewares.py
- pipelines.py
- settings.py
- spider.py
Adjustments to items.py, middlewares.py and settings.py are only necessary for advanced usage. In pipelines.py you set up the hand-over of the extracted data to the receiving application, whereas in spider.py you define what kind of data you want to extract. The spider.py file contains the so-called spiders (Python classes) that define how a web-site is scraped.
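For illustration, a minimal pipelines.py could simply serialise every scraped item as one JSON line on stdout, which is handy when the crawl is later started as a sub-process and its output stream is captured. This is only a sketch of the idea, not necessarily the pipeline used in the project; the class also has to be registered in the ITEM_PIPELINES setting of settings.py.
# pipelines.py -- minimal sketch: hand over every scraped item as one JSON
# line on stdout so that the process that started the crawl can capture it
import json

class StdoutExportPipeline:
    def process_item(self, item, spider):
        print(json.dumps(dict(item), ensure_ascii=False))
        return item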
The following example shows how to get all the articles from Spiegel.de, the most popular German news portal.
# newsItem is assumed to be the item class defined in items.py of the scrapy project
from onlinemedia.items import newsItem
import scrapy

class SpiegelSpider(scrapy.Spider):
    '''this spider crawls the news on the website of Spiegel'''
    name = "Spiegel_spider"

    def start_requests(self):
        # the schema of the rubrics pages is identical to that of the homepage,
        # therefore all URLs are put in the same list
        homepage_urls = ["https://www.spiegel.de",
                         "https://www.spiegel.de/politik/deutschland",
                         "https://www.spiegel.de/politik/ausland",
                         "https://www.spiegel.de/wirtschaft/"]
        for url in homepage_urls:
            yield scrapy.Request(url = url, callback = self.parse_homepage)

    def parse_homepage(self, response):
        article_urls_raw = response.xpath("//a[@class = 'text-black block']//@href").extract()
        article_urls = [url for url in article_urls_raw if url[0:5] == "https"]
        for ii, article_url in enumerate(article_urls):
            # in principle, only one kind of item should be generated,
            # because all items are treated equally by the pipeline
            article = newsItem()
            article['index'] = ii
            article['url'] = article_url
            # derive the rubric from the URL path
            if 'politik' in article_url.split('/'):
                article['rubrics'] = 'politics'
            elif 'wirtschaft' in article_url.split('/'):
                article['rubrics'] = 'economics'
            else:
                article['rubrics'] = 'homepage'
            article_request = scrapy.Request(article_url, callback = self.parse_article)
            # the parser can only catch responses, not items;
            # the item is therefore stored in the request/response meta data
            # and transferred to the next parser
            article_request.meta['item'] = article
            yield article_request

    def parse_article(self, response):
        article = response.meta['item']
        article['title'] = response.xpath("//article//header//span[contains(@class, 'align-middle')]//text()").extract()
        article['text'] = response.xpath("//article//p//text()").extract()
        if 'www' not in response.url.split('.')[0]:
            article['website'] = response.url.split('.')[0][8:]
        else:
            article['website'] = response.url.split('.')[1]
        # check if the article is behind the paywall:
        # Spiegel uses a 'Paywall' div to indicate premium articles
        paywall_test = response.xpath("//div[@data-component='Paywall']//text()").extract()
        if len(paywall_test) == 0:
            article['paywall'] = False
        else:
            article['paywall'] = True
        # each time an item is yielded, it is passed on to the pipeline
        yield article
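The newsItem class used in the spider is the item definition from items.py. The field names follow directly from the spider above; the class itself is sketched here as an assumption, not copied from the project:
# items.py -- sketch of the item definition used by the spider;
# the field names are taken from the spider, the class itself is an assumption
import scrapy

class newsItem(scrapy.Item):
    index = scrapy.Field()
    url = scrapy.Field()
    rubrics = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
    website = scrapy.Field()
    paywall = scrapy.Field()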
For more details on how to use scrapy, have a look at one of the numerous tutorials, e.g. the introduction on scrapy.org.
Implementation of Scrapy
To run the scrapy crawl on SAP Data Intelligence you have to encapsulate it in a Docker container. Luckily no binary installation is necessary and the Python base image provided by SAP can be used. With SAP Data Intelligence 3.0 you cannot run Docker containers as root, therefore you have to add a new group and user. Finally you have to set up the scrapy environment by running 'scrapy startproject …'.
FROM §/com.sap.datahub.linuxx86_64/sles:15.0-sap-007
# basic setup with additional user
RUN groupadd -g 1972 textanalysis && useradd -g 1972 -u 1972 -m textanalysis
USER 1972:1972
WORKDIR "/home/textanalysis"
ENV HOME=/home/textanalysis
ENV PATH="${PATH}:${HOME}/.local/bin"
###### packages needed
# for output data
RUN python3 -m pip install pandas --user
# utilities package from Thorsten Hapke
RUN python3 -m pip --no-cache-dir install sdi_utils --user
###### scrapy
# package
RUN python3 -m pip install scrapy --user
# create the scrapy environment
RUN scrapy startproject onlinemedia
# additional packages ...
You could provide all the configuration files as part of the Docker image definition. This would mean, however, that for any change to the configuration, in particular when adding spiders to spider.py, you would have to rebuild the image. In order to avoid this we developed a 'scrapy' operator with inports to which all configuration files can be sent.
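As a simplified sketch (not the actual operator code from the repository), an inport callback of a Python operator could write a received spider definition into the scrapy project created in the Docker image; the target path is an assumption derived from the Dockerfile above.
# Sketch of an inport callback in a SAP Data Intelligence Python operator:
# write the spider definition received as a message into the scrapy project
# created in the Docker image (the target path is an assumption)
import os

SPIDER_PATH = "/home/textanalysis/onlinemedia/onlinemedia/spiders/spider.py"

def on_spider_file(msg):
    # msg.body is assumed to carry the content of spider.py as a string
    os.makedirs(os.path.dirname(SPIDER_PATH), exist_ok=True)
    with open(SPIDER_PATH, "w", encoding="utf-8") as file:
        file.write(msg.body)

# registered in the operator script, e.g.:
# api.set_port_callback("spiderfile", on_spider_file)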
Scrapy-Operator
The operator incorporates the following tasks:
- Save configuration files to the respective container folders
- Start the web crawling as a sub-process: ‘scrapy crawl <spider>’
- Capture output of sub-process
- Transform output into required data format, e.g. dictionary, pandas DataFrame
- Send logging information to log-outport
- Send data output to outport
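Stripped of all logging and error handling, the core of these steps could look like the following sketch. It assumes that the scrapy pipeline writes one JSON object per line to stdout (as sketched earlier) and is not the actual operator code from the repository.
# Sketch of the operator core: start 'scrapy crawl <spider>' as a sub-process,
# capture its output and turn the JSON lines into a pandas DataFrame
import json
import subprocess

import pandas as pd

def crawl_to_dataframe(spider_name, project_dir="/home/textanalysis/onlinemedia"):
    result = subprocess.run(
        ["scrapy", "crawl", spider_name],
        cwd=project_dir,          # the scrapy project created in the Docker image
        capture_output=True,
        text=True,
        check=True,
    )
    items = []
    for line in result.stdout.splitlines():
        line = line.strip()
        if line.startswith("{"):  # skip everything that is not a scraped item
            items.append(json.loads(line))
    return pd.DataFrame(items)

# inside the operator the result is then sent to the outport, e.g.:
# df = crawl_to_dataframe("Spiegel_spider")
# api.send("output", df.to_csv(index=False))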
Of course someone could argue that it would make sense to split the operator into 3 more generic operators, like
- Set up the scrapy configuration files in the container (generic, but configurable)
- Start scrapy and send the output stream as batches to the outport (generic, without configuration)
- Transform the output stream into a specific format (template, adjustable to the specifics of the scraped data and the required output)
and one would be right. But this is left to projects that want to use this kind of data retrieval productively and want to minimise error-proneness by splitting the code into smaller pieces while gaining more flexibility.
Pipeline
The final pipeline could then look like the following, where the bulk of the data (the article texts) is stored in an object store and the metadata is written to a HANA database:
The logs from scrapy and the other operators are collected, shown on-the-fly in a wiretap operator and then stored in an object store. The alternative is to channel the logging into the standard SAP Data Intelligence logging, where it is mixed with all the other logs and follows the standard logging lifecycle.
Some final numbers:
- We scraped 7 online media web-sites.
- A full scraping run takes roughly 30 seconds.
- Each web-site produces 300-500 kB of data.
Summary
Web-scraping is still the business of a huge number of small service providers who specialise in particular knowledge areas and in the usage of freely available web sources. As shown in this blog, with some basic skills and little effort you can tap this resource yourself and add valuable data to your data analysis.
Not least, this is another example of how easy it is to use open source solutions with SAP Data Intelligence.