Hadoop MapReduce as Virtual Functions in HANA
Note: This blog is not up to date, as new functionality has been introduced in SPS10.
Hadoop offers large-scale, low-cost distributed storage called the Hadoop Distributed File System (HDFS). When a large file is stored in Hadoop, it is broken into multiple fixed-size blocks and replicated across data nodes. Since redundancy is handled at the block level, recovery is faster and the distribution across nodes is more even. In order to process these blocks coherently, Hadoop introduced a programming paradigm called MapReduce (MR). By integrating with Hadoop, you can combine the in-memory processing power of SAP HANA with Hadoop’s ability to store and process huge amounts of data, regardless of structure.
Before SAP HANA SPS09, you could use Smart Data Access (SDA) to create virtual tables through the Hive ODBC driver for data federation scenarios. Starting with SPS09, SAP HANA also supports integration with Hadoop MapReduce (MR) jobs written in Java. You can create a new type of user-defined function with direct access to HDFS, and these virtual user-defined functions (vUDFs) can be invoked directly from SQL. This helps customers reuse their existing MapReduce investments and solve problems that don’t fit the typical Hive usage patterns available via SDA.
Here is the architecture diagram of the integration:
1. Install Hadoop adapter controller
You should already have a Hadoop system installed. SAP provides an adapter that is installed as a delivery unit in the HANA XS engine and later pushed to Hadoop. This installation has already been done in the system, but it is good to understand how it works.
The controller can be downloaded from the SAP Service Marketplace. After that, assign the sap.hana.xs.lm.roles::Administrator role to your HANA user and start the HANA Application Lifecycle Manager to import it as a delivery unit.
The HANA Application Lifecycle Manager URL looks like the one below; replace the host and instance number. After login, click Delivery Unit->Import->Browse, locate the tgz file and import it.
http://<yourhost>:80<instanceNumber>/sap/hana/xs/lm/
On the Hadoop side, the controller.jar should then have been deployed at /sap/hana/mapred/<Revision>/controller/package in HDFS.
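To verify the deployment, you can list that directory in HDFS (keep the <Revision> placeholder matching your installed revision):
$ hdfs dfs -ls /sap/hana/mapred/<Revision>/controller/package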
You will also need to put two jar files, joda-time-2.3.jar and solr-commons-csv-3.5.0.jar, at the path /sap/hana/mapred/lib in HDFS.
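If the directory does not exist yet, the commands below (also referenced in the comments at the end of this blog) create it and upload the two libraries:
$ hdfs dfs -mkdir -p /sap/hana/mapred/lib
$ hdfs dfs -put joda-time-2.3.jar /sap/hana/mapred/lib
$ hdfs dfs -put solr-commons-csv-3.5.0.jar /sap/hana/mapred/lib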
2. Create the MapReduce java project
You can download the project from the link at the end of this blog. If you want to start from scratch, follow the steps below.
In HANA Studio/Eclipse, create a new Java project named song_count, add the required jar files into a subdirectory lib and add them to the class path of the project. You should see something like the screenshot below after that.
Create the Mapper class named SongCountMapper.java in the package com.sfp and copy in the source code below. The mapper emits a count of 1 for the artist of each song.
package com.sfp;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SongCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context output) throws IOException,
            InterruptedException {
        // Each input line is a comma-separated song record; field 3 is the artist.
        String[] song = value.toString().split(",");
        // Emit (artist, 1) so the reducer can count songs per artist.
        output.write(new Text(song[3]), one);
    }
}
Create the Reducer class named SongCountReducer.java in the package com.sfp and copy in the source code below. The reducer aggregates the number of songs for each artist.
package com.sfp;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SongCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context output)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        output.write(key, new IntWritable(count));
    }
}
Create the Application class named SongCountApplication.java and copy in the source code below. The application wires together the mapper and reducer and launches a MapReduce job that processes data from the input path /sfp/song/data and writes the result to the path /sfp/job_result in HDFS.
package com.sfp;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SongCountApplication extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SongCountApplication(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(SongCountMapper.class);
        job.setReducerClass(SongCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Read the song data from HDFS and write the aggregated counts back to HDFS.
        FileInputFormat.setInputPaths(job, new Path("/sfp/song/data"));
        FileOutputFormat.setOutputPath(job, new Path("/sfp/job_result"));
        job.setJarByClass(SongCountApplication.class);
        job.submit();
        return 0;
    }
}
Compile the Java code and make sure there are no syntax errors. The project should look like below now.
3. Upload the MapReduce Jobs Archive into HANA repository
In order to make HANA aware of the MapReduce code, we need to upload the jar file into the HANA repository. HANA Studio can generate the jar file on the fly from your source code.
We first need to create a new package in the HANA repository to store the jar file. Go to the HANA repository (create a new one if you haven't already) and create a new package sfp.song. Then create a new general project in HANA Studio/Eclipse named song and share the project with the HANA repository. You should see a project like this afterwards. The project is really just a placeholder for the MapReduce job archive; it is not necessary to create the .xsapp or .xsaccess files like other XS-based projects deployed in the HANA repository.
Select the project, click New->Other... and choose SAP HANA->Database Development->Hadoop MR Jobs Archive in the pop-up window.
Click Next, enter the file name song_count, then click Next.
Select the Java project song_count, set the target schema to your schema (here it is SFP) and click Finish.
That will create a song_count.hdbmrjobs file. Activate it, and you can verify that it has been uploaded as a jar file into the HANA repository by executing a query.
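The exact query is not preserved in this post; a minimal sketch, assuming the system view referenced in step 5, would be:
SELECT * FROM "SYS"."VIRTUAL_FUNCTION_PACKAGES";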
4. Create the Remote Data Source
Now it's time to create the remote data source by running the SQL statement below in the SQL console of HANA Studio. Replace <FQDN> with the fully qualified domain name of your Hadoop host, which you can find by running the command hostname -f.
CREATE REMOTE SOURCE HADOOP_SOURCE
ADAPTER "hadoop"
CONFIGURATION 'webhdfs_url=http://<FQDN>:50070;webhcat_url=http://<FQDN>:50111'
WITH CREDENTIAL TYPE 'PASSWORD'
USING 'user=hdfs;password=hdfs';
Now you should see the Remote Source.
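You can also verify it from SQL; a minimal check, assuming the standard system view (not shown in the original post):
SELECT * FROM "SYS"."REMOTE_SOURCES";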
5. Create the Virtual Function
To create the virtual function, run the following SQL statement in the SQL console of HANA Studio. The returned table structure needs to use the same data types as the output structure of the reducer; the package under SYSTEM is the one you can find in the view "SYS"."VIRTUAL_FUNCTION_PACKAGES"; and the configuration needs to specify the input path in HDFS where you put the data files for the MapReduce job, plus the class names of the mapper and reducer.
CREATE VIRTUAL FUNCTION HADOOP_SONG_COUNT()
RETURNS TABLE ("song_artist" NVARCHAR(400), "count" INTEGER)
PACKAGE SYSTEM."sfp.song::song_count_job"
CONFIGURATION 'enable_caching=true;mapred_jobchain=[{"mapred_input":"/sfp/song/data","mapred_mapper":"com.sfp.SongCountMapper","mapred_reducer":"com.sfp.SongCountReducer"}]'
AT HADOOP_SOURCE;
6. Run the Virtual Function
Now you can simply run the virtual function like below.
SELECT * FROM HADOOP_SONG_COUNT();
It will trigger the MapReduce job, execute it in the Hadoop cluster, and populate the result into the output path of the reducer, which in this example is /sfp/job_result in HDFS.
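If you want to inspect the raw reducer output in HDFS as well, something like the command below should work (the part file name depends on the number of reduce tasks):
$ hdfs dfs -cat /sfp/job_result/part-r-00000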
You can also find the job records in the job history UI at http://<your_hadoop_host>:19888/jobhistory
Click the link to Download Source Code
Awesome Eric!
Can you please share the project with me? I would like to test it out.
Please let me know your email address so that I can forward the source code.
Hi Eric,
Can you please forward the code to sivasaphana@gmail.com
Regards,
Siva
I have sent it to your email address. Please check.
Thank you so much.
Regards,
Siva.
Hi
great article! Could you share the code with me as well?
Chris
Hi Chris,
Please let me know your email address. I can forward you the same code which Eric sent me.
Regards,
Siva.
Thanks Siva for the help. I was kind of busy yesterday 🙂
@Chris, just let me know if you need anything else.
Hi Eric,
Can you please forward the code to tvsravan@gmail.com
I would like to test it out.
Regards,
Sravan
Hi Sravan,
I have sent it to your email address. I also uploaded the file to SCN, and they replied that it will take them some time to review and publish it. I will then update the blog to include a link to download the file.
Regards,
Eric Du
Hi Eric,
I am unable to configure SPS09 in AWS. Can you please guide me through the procedure?
Thanks in advance.
Regards,
Siva.
Hi Siva,
Which configuration were you referring to?
Regards,
Eric Du
Hi Eric,
I practice on an Amazon EC2 cloud instance and cannot configure the SPS09 server.
Hence I cannot explore this scenario.
Regards,
Siva.
Hi Siva,
How about following the steps described in the blog below to launch an SPS09 instance?
SAP HANA Developer Edition
Regards,
Eric Du
Hi S
My address is bezuidenhout.chris@gmail.com
Thanks!
Chris
Hi Chris,
There is a link at the end of the blog where you can download the source code now.
Regards,
Eric Du
Hi Eric:
Where can I get the controller.jar file?
Thanks.
You need to download it from SAP Service Marketplace.
Hi Eric,
Could you please tell me the version of HANA Studio?
And what does "webhcat_url" mean in the Create the Remote Data Source part?
Regards,
Aran
Hi Aran,
webhcat/hcatalog normally comes with Hive (0.11 or later) and provides a REST API for accessing Hadoop.
You could refer to this link (https://cwiki.apache.org/confluence/display/Hive/WebHCat)
Regards,
Jonathan
Are you using Hortonworks or AWS EMR, and which version?
Hi Eric,
I have created the remote source without credentials
and I have created the virtual user-defined function,
where the sample_07 file contains the columns below.
When I tried to run the user-defined function, I got the error below:
SAP DBTech JDBC: [2048]: column store error: search table error: [2620] executor: plan operation failed;{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException",
"message":"Permission denied: user=hduser, access=WRITE, inode=\"/\":hadoop:supergroup:drwxr-xr-x"}}
Can you please let me know what is wrong in my process and what this error is about?
Best regards,
Ram
Now we have tried using the credentials as well, but we are still facing the same issue.
Can you please let me know what I am missing?
Regards,
Ram
Hi Ram,
Before you run the SQL, you need to create the /sap/hana/mapred directories in HDFS with a command like this:
$ hdfs dfs -mkdir -p /sap/hana/mapred/lib
You also need to put joda-time-2.3.jar and solr-commons-csv-3.5.0.jar into HDFS:
$ hdfs dfs -put joda-time-2.3.jar /sap/hana/mapred/lib
$ hdfs dfs -put solr-commons-csv-3.5.0.jar /sap/hana/mapred/lib
After that, change the folder owner:
$ hdfs dfs -chown -R hduser /sap
BTW, HANA does not seem to work with every Hadoop version: Hadoop 2.6.0 is OK, but Hadoop 2.7.x does not work.
Good luck.
I met the same problem. Does anyone know where the password comes from and how to set it in Hadoop?
Hi all, I am facing the following error message in HANA Studio: SAP DBTech JDBC: [2048]: column store error: search table error: [2620] executor: plan operation failed;NA
I can see in Hadoop that the job has started, but for some reason it only starts 1 map task and there is no error message in the Hadoop logs. When I execute the code using a jar file directly in Hadoop, everything works fine. WebHCat is configured properly and access rights are also not an issue anymore. I am using Hadoop 2.6. Does anybody have an idea/experience with this problem?
Thanks,
Matthias
Hi Matthias Schmidt,
I have the same problem as you, but my job is not running; I see no logs of my job in the Hadoop job history. Could you tell me if you have found the reason?
Thanks.
Hi Lilian,
unfortunately not, but I am currently in contact with SAP support; they will try to reproduce the issue. I will keep you up to date.
Best regards,
Matthias