
      We have a Chinese version of this document.

R is an open-source programming language and software environment for statistical computing and graphics. It has become very popular among statisticians and data miners for developing statistical software and is widely used for advanced data analysis.

The goal of the integration of the SAP HANA database with R is to enable the embedding of R code in the SAP HANA database context. That is, the SAP HANA database allows R code to be processed in-line as part of the overall query execution plan. This scenario is suitable when an SAP HANA-based modeling and consumption application wants to use the R environment for specific statistical functions.

1. Environment Preparation

In addition to preparing the SAP HANA database, we need to prepare a separate R runtime environment with Rserve installed, and then configure the SAP HANA indexserver.ini. For the detailed configuration, please see:

SAP_HANA_R_Integration_Guide_en.pdf

2. HANA Database Preparation

First, we need to create the schema, the tables, and the related table types:


CREATE SCHEMA RTEST;
SET SCHEMA RTEST;
CREATE TABLE URL(
URL VARCHAR(500)
);
CREATE TABLE URL_CONTENT(
  "URL" VARCHAR(500),
  "CONTENT" NCLOB,
  "LANGU" VARCHAR(5)
);
CREATE TYPE URL_TYPE AS TABLE ( "URL" VARCHAR(500));
CREATE TYPE URL_CONTENT_TYPE AS TABLE (
  "URL" VARCHAR(500),
  "CONTENT" NCLOB,
  "LANGU" VARCHAR(5)
);



Table URL stores the set of URLs to be crawled, and URL_CONTENT stores the crawled web content. URL_TYPE and URL_CONTENT_TYPE are the corresponding table types.


                                   Table 1.0 URL

3. RLANG CRAWLER

We will use the R package RCurl to crawl the web pages, so we first need to install the package with the following command:

                       install.packages("RCurl")

The main RLANG procedure is as follows:


CREATE PROCEDURE CRAWLER_R (IN url URL_TYPE, OUT result URL_CONTENT_TYPE)
language rlang as
begin
  setwd("/tmp/")
  # user-defined function for crawling the web pages
  crawler <- function(url) {
    tryCatch({
      library("RCurl")
      # getURL returns the HTML source code of the page
      html <- getURL(url[1], .encoding = "utf-8", .mapUnicode = FALSE)
      # user-defined function for extracting the page content
      html <- extractContent(html)
      return(html)
    }, error = function(err) {
      return("")
    }, finally = {
      gc()
    })
  }
  content <- as.character(apply(url, 1, crawler))
  # save the results; the data frame must match URL_CONTENT_TYPE, so LANGU is included
  result <- as.data.frame(cbind(URL = url$URL, CONTENT = content, LANGU = ""),
                          stringsAsFactors = FALSE)
end;


The above R procedure takes one input table parameter holding the URLs to be crawled and one output table parameter for the web contents. The function crawler is a user-defined function that performs the crawling: it calls getURL from the RCurl package to fetch the HTML source files, and then calls a user-defined function extractContent to extract the page content from them. There are many ways to extract content from HTML source files. The most commonly used method is to parse the HTML into an XML document and select nodes with XPath; after removing the non-text nodes and other useless nodes, the values of the remaining nodes are the page content. The content can also be extracted based on the line and block distribution function.
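The post does not show the implementation of extractContent. As an illustration of the XPath approach described above, here is a minimal sketch, assuming the XML package is installed; the function name matches the one used in the procedure, but the body is our own example, not the original author's code:

```r
library(XML)

# Illustrative sketch of extractContent: parse the HTML, drop nodes that
# never contain page text, and concatenate the remaining text nodes.
extractContent <- function(html) {
  doc <- htmlParse(html, asText = TRUE, encoding = "utf-8")
  # script and style nodes carry no page content
  removeNodes(getNodeSet(doc, "//script | //style"))
  # the values of the remaining text nodes form the page content
  text <- xpathSApply(doc, "//body//text()", xmlValue)
  free(doc)
  paste(trimws(text[nzchar(trimws(text))]), collapse = " ")
}
```

A real implementation would typically also filter navigation and boilerplate nodes, which is where the XPath expressions become site-specific.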

The above R procedure is single-threaded and cannot take advantage of multiple cores, so it is very slow when there is a large number of URLs to be crawled. In this case, we can use the R package parallel to rewrite the code into a multi-process version.
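A minimal sketch of that parallel variant, using the built-in parallel package: the apply() call is replaced by parLapply() over a small worker cluster. The names crawl_many and fetch_one are our own; in the real procedure, fetch_one would be the crawler function calling getURL and extractContent:

```r
library(parallel)

# Hypothetical stand-in for the crawler function; the real version
# would fetch and extract the page as in the single-threaded procedure.
fetch_one <- function(u) paste("content of", u)

crawl_many <- function(urls, workers = 4) {
  cl <- makeCluster(workers)       # spawn worker processes
  on.exit(stopCluster(cl))         # always shut the cluster down
  clusterExport(cl, "fetch_one")   # ship the helper to the workers
  unlist(parLapply(cl, urls, fetch_one))
}
```

Each worker crawls its share of the URLs independently, so the speedup is roughly linear in the number of workers until network bandwidth becomes the bottleneck.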

We can invoke the procedure with the following command:

CALL CRAWLER_R(URL,URL_CONTENT) with overview;

The following picture shows the crawl results.


                                        Pic 1.1 Html contents
