Note: This blog is not fully up to date, as new functionality has been introduced in SPS10.

Hadoop offers large-scale, low-cost distributed storage through the Hadoop Distributed File System (HDFS). When a large file is stored in Hadoop, it is broken into fixed-size blocks and replicated across data nodes. Because redundancy is handled at the level of these relatively small blocks, recovery is fast and data distribution is close to optimal. To process these blocks coherently, Hadoop introduced a programming paradigm called MapReduce (MR). By integrating with Hadoop, you can combine the in-memory processing power of SAP HANA with Hadoop's ability to store and process huge amounts of data, regardless of structure.

Before SAP HANA SPS09, you could use Smart Data Access (SDA) to create virtual tables through the Hive ODBC driver for data federation scenarios. Starting with SPS09, SAP HANA also supports integration with Hadoop MapReduce (MR) jobs written in Java. You can create a new type of user-defined function with direct access to HDFS, and these virtual user-defined functions (vUDFs) can be invoked directly from SQL. This helps customers reuse their existing MapReduce investments and solve problems that do not fit the typical Hive usage patterns available via SDA.

Here is the architecture diagram of the integration:

/wp-content/uploads/2015/01/1_626945.png

1. Install Hadoop adapter controller

You should already have a Hadoop system installed. SAP provides an adapter controller that is installed as a delivery unit in the HANA XS engine and later pushed down to Hadoop. This installation has already been done in the system used here, but it is worth understanding how it works.

The controller can be downloaded from the SAP Service Marketplace. After downloading it, assign the sap.hana.xs.lm.roles::Administrator role to your HANA user, then open the HANA Application Lifecycle Manager to import it as a delivery unit.

The HANA Application Lifecycle Manager URL follows the pattern below; replace the host and instance number (for example, instance 00 gives port 8000). After logging in, click Delivery Unit -> Import -> Browse, locate the .tgz file, and import it.

http://<yourhost>:80<instanceNumber>/sap/hana/xs/lm/

/wp-content/uploads/2015/01/2_626946.png

On the Hadoop side, controller.jar should then be deployed at /sap/hana/mapred/<Revision>/controller/package in HDFS.

/wp-content/uploads/2015/01/3_626947.png

You also need to put these two jar files under /sap/hana/mapred/lib in HDFS:

/wp-content/uploads/2015/01/4_626948.png
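If the directory or the jars are missing, they can be uploaded with standard HDFS commands. The sketch below assumes the two jars shown above are joda-time-2.3.jar and solr-commons-csv-3.5.0.jar (the names mentioned in the comments at the end of this post); adjust the file names and paths to whatever your controller package actually requires.

# create the library directory and upload the two jars
hdfs dfs -mkdir -p /sap/hana/mapred/lib
hdfs dfs -put joda-time-2.3.jar /sap/hana/mapred/lib
hdfs dfs -put solr-commons-csv-3.5.0.jar /sap/hana/mapred/lib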

2. Create the MapReduce java project

You can download the complete project via the link at the end of this blog. If you want to start from scratch, follow the steps below.

In HANA Studio/Eclipse, create a new Java project named song_count, add the jar files shown below into a lib subdirectory, and add them to the class path of the project. You should see something like this afterwards.

/wp-content/uploads/2015/01/5_626949.png

Create the mapper class SongCountMapper.java in the package com.sfp and copy in the source code below. The mapper emits one record per song, keyed by the artist (taken from the fourth field of each comma-separated input line).

package com.sfp;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SongCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context output) throws IOException, InterruptedException {
        String[] song = value.toString().split(",");
        output.write(new Text(song[3]), one);
    }

}


Create the reducer class SongCountReducer.java in the package com.sfp and copy in the source code below. The reducer aggregates the number of songs for each artist.

package com.sfp;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SongCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context output) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        output.write(key, new IntWritable(count));
    }

}

Create the application class SongCountApplication.java (also in com.sfp) and copy in the source code below. The application wires the mapper and reducer together and launches a MapReduce job that reads data from the input path /sfp/song/data and writes the result to /sfp/job_result in HDFS.

package com.sfp;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SongCountApplication extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SongCountApplication(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration());

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(SongCountMapper.class);
        job.setReducerClass(SongCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("/sfp/song/data"));
        FileOutputFormat.setOutputPath(job, new Path("/sfp/job_result"));

        job.setJarByClass(SongCountApplication.class);

        job.submit();
        return 0;
    }

}
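Note that job.submit() returns immediately rather than waiting for the job to finish. If you want to run the class standalone from the command line and block until the job completes, a common variation (purely optional, not required for the HANA integration) is to replace the last two lines of run() with:

        // wait for the job to finish and report success or failure via the exit code
        return job.waitForCompletion(true) ? 0 : 1;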

Compile the Java code and make sure there are no syntax errors. The project should now look like this.

/wp-content/uploads/2015/01/6_626950.png

3. Upload the MapReduce Jobs Archive into HANA repository

To make HANA aware of the MapReduce code, we need to upload the jar file into the HANA repository. HANA Studio can generate the jar file on the fly from your source code.

First, create a new package in the HANA repository to hold the jar file: go to your HANA repository workspace (create one if it does not exist yet) and create a new package sfp.song. Then create a new general project in HANA Studio/Eclipse named song and share the project with the HANA repository. You should see a project like the one below. The project is really just a placeholder for the MapReduce job archive; unlike other XS-based projects deployed in the HANA repository, it does not need .xsapp or .xsaccess files.

/wp-content/uploads/2015/01/7_626951.png

Select the project, click New -> Other... and choose SAP HANA -> Database Development -> Hadoop MR Jobs Archive in the pop-up window.

/wp-content/uploads/2015/01/8_626952.png

Click Next, enter song_count as the file name, and click Next again.

/wp-content/uploads/2015/01/9_626953.png

Select the Java project song_count, set the target schema to your own schema (SFP in this example), and click Finish.

/wp-content/uploads/2015/01/10_626954.png

This creates a song_count.hdbmrjobs file. Activate it, and you can verify that it has been uploaded as a jar file into the HANA repository by executing a query, as shown in the screenshots below.
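As a minimal sketch of such a check (assuming the activated archive is registered in the VIRTUAL_FUNCTION_PACKAGES system view, which is also used in step 5), you can run:

SELECT * FROM "SYS"."VIRTUAL_FUNCTION_PACKAGES";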

/wp-content/uploads/2015/01/11_626955.png

/wp-content/uploads/2015/01/12_626956.png

4. Create the Remote Data Source

Now it is time to create the remote data source by running the SQL statement below in the SQL console of HANA Studio. Replace <FQDN> with the fully qualified domain name of your Hadoop host, which you can find by running the command hostname -f.


CREATE REMOTE SOURCE HADOOP_SOURCE
ADAPTER "hadoop"
CONFIGURATION 'webhdfs_url=http://<FQDN>:50070;webhcat_url=http://<FQDN>:50111'
WITH CREDENTIAL TYPE 'PASSWORD'
USING 'user=hdfs;password=hdfs';


Now you should see the Remote Source.
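If you prefer to check from SQL instead of the catalog tree, a quick sanity query (assuming the standard REMOTE_SOURCES system view is available on your revision) is:

SELECT * FROM "SYS"."REMOTE_SOURCES" WHERE REMOTE_SOURCE_NAME = 'HADOOP_SOURCE';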

/wp-content/uploads/2015/01/13_626957.png


5. Create the Virtual Function

To create the virtual function, run the following SQL statement in the SQL console of HANA Studio. The RETURNS TABLE structure must use the same data types as the output of the reducer; the package is the one you can find in the view "SYS"."VIRTUAL_FUNCTION_PACKAGES"; and the CONFIGURATION clause must specify the HDFS input path where you put your data files as well as the class names of the mapper and reducer.

CREATE VIRTUAL FUNCTION HADOOP_SONG_COUNT()
RETURNS TABLE ("song_artist" NVARCHAR(400), "count" INTEGER)
PACKAGE SYSTEM."sfp.song::song_count_job"
CONFIGURATION 'enable_caching=true;mapred_jobchain=[{"mapred_input":"/sfp/song/data","mapred_mapper":"com.sfp.SongCountMapper","mapred_reducer":"com.sfp.SongCountReducer"}]'
AT HADOOP_SOURCE;


6. Run the Virtual Function

Now you can simply run the virtual function as shown below.

SELECT * FROM HADOOP_SONG_COUNT();


It triggers the MapReduce job, executes it in the Hadoop cluster, and writes the result into the output path of the job, which in this example is /sfp/job_result in HDFS.
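Since the virtual function behaves like a table function, you can also combine it with regular SQL. For example, a small illustrative query that lists the artists with the most songs first:

SELECT "song_artist", "count"
FROM HADOOP_SONG_COUNT()
ORDER BY "count" DESC;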


You can also find the job records in the JobHistory UI at http://<your_hadoop_host>:19888/jobhistory

/wp-content/uploads/2015/01/14_626959.png

Click the link to download the source code.


Comments


    1. Eric Du Post author

      Hi Sravan,

      I have sent it to your email address. I also uploaded the file to SCN; they replied that it will take them some time to review and publish it. I will then update the blog to include a link to download the file.

      Regards,

      Eric Du

      1. Eric Du Post author

        Hi Chris,

        There is a link at the end of the blog where you can download the source code now.

        Regards,

        Eric Du

  2. Aran Liu

    Hi Eric,

    Could you please tell me which version of HANA Studio you used?

    Also, what does "webhcat_url" mean in the Create the Remote Data Source part?

    Regards,

    Aran

  3. Ram Lokeswara

    Hi Eric,

    I have created the remote source without credentials:

    CREATE REMOTE SOURCE HADOOP_SOURCE
    ADAPTER "hadoop"
    CONFIGURATION 'webhdfs_url=http://XX.XX.XXX.XX:9101;webhcat_url=http://XX.XX.XXX.XX:50111';

    and I have created a virtual user-defined function:

    CREATE VIRTUAL FUNCTION HADOOP_PRODUCT_UDF()
    RETURNS TABLE ("code" VARCHAR(255), "description" VARCHAR(255), "total_emp" INTEGER, "salary" INTEGER)
    CONFIGURATION 'datetime_format=yyyy-MM-dd HH:mm:ss;date_format=yyyy-mm-dd HH:mm:ss;time_format=HH:mm:ss;enable_remote_caching=true;cache_validity=3600;hdfs_location=/user/hive/warehouse/sample_07/sample_07'
    AT HADOOP_SOURCE;

    where the sample_07 file contains the following columns:

    {"columns":[{"name":"code","type":"string"},{"name":"description","type":"string"},{"name":"total_emp","type":"int"},{"name":"salary","type":"int"}],"database":"default","table":"sample_07"}

    When I try to run the user-defined function, I get the error below:

    SAP DBTech JDBC: [2048]: column store error: search table error: [2620] executor: plan operation failed;{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=hduser, access=WRITE, inode=\"/\":hadoop:supergroup:drwxr-xr-x"}}

    Can you please let me know what is wrong in my process and what this error is about.

    Best regards,

    Ram

    1. Ram Lokeswara

      Now we have tried using credentials as well, but we are still facing the same issue:

      CREATE REMOTE SOURCE HADOOP_SOURCE
      ADAPTER "hadoop"
      CONFIGURATION 'webhdfs_url=http://XX.XX.XX:9101;webhcat_url=http://XX.XX.XX:50111'
      WITH CREDENTIAL TYPE 'PASSWORD'
      USING 'user=hduser;password=XXXXXXX';

      SAP DBTech JDBC: [2048]: column store error: search table error: [2620] executor: plan operation failed;{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=hduser, access=WRITE, inode=\"/\":hadoop:supergroup:drwxr-xr-x"}}

      Can you please let me know what I am missing?

      Regards,

      Ram

      1. Nenggong Du

        Hi Ram,

        Before you run the SQL, you need to create the /sap/hana/mapred directories in HDFS with a command like this:

        $ hdfs dfs -mkdir -p /sap/hana/mapred/lib

        You also need to put joda-time-2.3.jar and solr-commons-csv-3.5.0.jar into HDFS:

        $ hdfs dfs -put joda-time-2.3.jar /sap/hana/mapred/lib
        $ hdfs dfs -put solr-commons-csv-3.5.0.jar /sap/hana/mapred/lib

        After that, change the folder owner:

        $ hdfs dfs -chown -R hduser /sap

        BTW, HANA does not seem to work with every Hadoop version: Hadoop 2.6.0 works, but Hadoop 2.7.x does not.

        Good luck.

  4. Matthias Schmidt

    Hi all, I am facing the following error message in HANA Studio: SAP DBTech JDBC: [2048]: column store error: search table error: [2620] executor: plan operation failed;NA. I can see in Hadoop that the job has started, but for some reason it only starts one map task and there is no error message in the Hadoop logs. When I execute the code using a jar file directly in Hadoop, everything works fine. WebHCat is configured properly and access rights are also no longer an issue. I am using Hadoop 2.6. Does anybody have an idea about or experience with this problem? Thanks, Matthias

    1. Lilian Liu

      hi Matthias Schmidt,

      I have the same problem as you, but my job is not running at all; I see no logs for my job in the Hadoop job history. Could you tell me if you have found the reason?


      Thanks.

      1. Matthias Schmidt

        Hi Lilian,

        Unfortunately not, but I am currently in contact with SAP support; they will try to reproduce the issue. I will keep you up to date.

        Best regards,

        Matthias

