SAP HANA Chinese Text Processing (1): RLANG Real-Time Web Crawling
We have a Chinese version of this document.
R is an open source programming language and software environment for statistical computing and graphics. The R language has become very popular among statisticians and data miners for developing statistical software and is widely used for advanced data analysis.
The goal of the integration of the SAP HANA database with R is to enable the embedding of R code in the SAP HANA database context. That is, the SAP HANA database allows R code to be processed in-line as part of the overall query execution plan. This scenario is suitable when an SAP HANA-based modeling and consumption application wants to use the R environment for specific statistical functions.
1. Environment Preparation
We not only need to prepare the SAP HANA database, but we also need to prepare an additional R runtime environment with Rserve installed. Then we need to configure the SAP HANA indexserver.ini. For the detailed configuration, please see:
SAP_HANA_R_Integration_Guide_en.pdf
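The guide describes the full setup. As a minimal sketch, the R side is typically prepared by installing the required packages and starting Rserve; the startup options below are only an example and must match what is configured in indexserver.ini:
# run on the R host, not inside SAP HANA
install.packages("Rserve")
install.packages("RCurl")      # needed later for the crawler
library(Rserve)
# start the Rserve daemon so the SAP HANA calculation engine can reach it
Rserve(args = "--no-save")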
2. HANA Database Preparation
First, we need to create the schema, tables, and related table types.
CREATE SCHEMA RTEST;
SET SCHEMA RTEST;
CREATE TABLE URL(
URL VARCHAR(500)
);
CREATE TABLE URL_CONTENT(
"URL" VARCHAR(500),
"CONTENT" NCLOB,
"LANGU" VARCHAR(5)
);
CREATE TYPE URL_TYPE AS TABLE ( "URL" VARCHAR(500));
CREATE TYPE URL_CONTENT_TYPE AS TABLE (
"URL" VARCHAR(500),
"CONTENT" NCLOB,
"LANGU" VARCHAR(5)
);
Table URL stores the set of URLs to be crawled, and URL_CONTENT stores the crawled web content. URL_TYPE and URL_CONTENT_TYPE are the corresponding table types used as procedure parameters.
Table 1.0 URL
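For testing, the URL table can be filled with a few pages to crawl; the URLs below are only placeholders:
INSERT INTO URL VALUES('http://example.com/news/1.html');
INSERT INTO URL VALUES('http://example.com/news/2.html');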
3. RLANG CRAWLER
We will use the R package RCurl to crawl the web pages, so we first need to install the package in the R runtime with the following command:
install.packages("RCurl")
The main RLANG procedure is shown as follows:
CREATE PROCEDURE CRAWLER_R (IN url URL_TYPE,OUT result URL_CONTENT_TYPE)
language rlang as
begin
setwd("/tmp/")
--user defined function, for crawling the webpages
crawler<-function(url){
tryCatch({
library("RCurl")
--invoke the function getURL, the function will return the html souce code
html<-getURL(url[1],.encoding="utf-8",.mapUnicode = FALSE)
--user defined function, for extracting the url content
html<-extractContent(html)
return (html)
},error=function(err){
return("")
},finally={
gc()
}
)
}
content<-as.character(Apply(url,1,crawler))
# build the result data frame; its columns must match URL_CONTENT_TYPE
# (the LANGU value "zh" is used here as an example to mark Chinese pages)
result <- data.frame(URL = url$URL, CONTENT = content,
                     LANGU = rep("zh", length(content)), stringsAsFactors = FALSE)
end;
The above R procedure needs one input table type holding the URLs to be crawled and one output table type for the web content. The function crawler is a user-defined function that does the crawling. We use the function getURL from the package RCurl to fetch the HTML source of each page, and then a user-defined function extractContent to extract the text content from that HTML source. There are many ways to extract the content from HTML source files. The most commonly used method is to parse the HTML source into an XML document and select nodes with XPath; after removing the non-text nodes and other useless nodes, the values of the remaining nodes are the page content. We can also extract the content based on the line and block distribution function.
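The original procedure does not show extractContent; the following is only a minimal sketch of the XPath approach described above, based on the R package XML. The node selection below is an assumption and usually has to be tuned per site:
# sketch of extractContent: parse the HTML, drop non-text nodes, keep the rest
extractContent <- function(html){
  library(XML)
  doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
  # script, style and comment nodes never contain visible text
  removeNodes(getNodeSet(doc, "//script | //style | //comment()"))
  # the values of the remaining text nodes under <body> are the page content
  text <- xpathSApply(doc, "//body//text()", xmlValue)
  free(doc)
  paste(trimws(text), collapse = " ")
}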
The above R procedure is single-threaded, so it cannot take advantage of multiple cores and becomes very slow when a large number of URLs has to be crawled. In this case, we can use the R package parallel to rewrite the code as a multi-threaded version.
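A minimal sketch of such a rewrite with the parallel package is shown below; the worker count and error handling are assumptions, and the content extraction step would then be applied to the fetched pages as before:
# sketch: fetch the URLs in parallel instead of one by one
library(parallel)
cl <- makeCluster(detectCores())       # one worker per available core on the R host
clusterEvalQ(cl, library(RCurl))       # load RCurl on every worker
content <- unlist(parLapply(cl, url$URL, function(u){
  tryCatch(getURL(u, .encoding = "utf-8", .mapUnicode = FALSE),
           error = function(err) "")
}))
stopCluster(cl)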
We can use the following command to invoke the R procedure:
CALL CRAWLER_R(URL,URL_CONTENT) with overview;
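Because the procedure is called WITH OVERVIEW, the output is written into the physical table URL_CONTENT, so the crawled pages can be checked directly:
SELECT * FROM URL_CONTENT;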
The following picture shows the crawl results.
Pic 1.1 Html contents