
I recently read someone suggesting that Hadoop is the Swiss Army knife for solving Big Data problems.

It certainly offers a plethora of tools.

The speed at which these open-source tools are developing and evolving is amazing.

If I needed to prepare external data files for HANA my first thought would be Excel.

As the size of the data and frequency of loading increased I might start thinking SAP DataServices (BODS).

There is usually more than one way to crack an egg though, so my next thought is to consider using Hadoop.

The following diagram illustrates just a few of Hadoop's tools:


In this blog I will primarily explore the use of Pig, Sqoop and Oozie to insert delta records into HANA (scenarios B and C below).

For more details on using SQOOP & OOZIE with HANA see:

Exporting and Importing DATA to HANA with HADOOP SQOOP

Creating a HANA Workflow using HADOOP Oozie



For a great intro to Hadoop (including Pig), try out the Hortonworks Sandbox and follow some of their useful tutorials (Hadoop Tutorial: How to Process Data with Pig).


I don’t want to reinvent the wheel completely, so please do check out the Hortonworks tutorials. They also have videos if you don’t want to get your hands dirty.


Below I will briefly cover 3 scenarios:

A) Manually using PIG to reformat a file

B) Using PIG to compare files and generate a DELTA file

C) Use OOZIE, PIG & SQOOP to transfer a delta to HANA



A) Manually using PIG to reformat a file


1) Load your raw file using the Hadoop User Experience (HUE) interface

NOTE: Pig can also read some compressed file formats (such as gzip) directly.


2) Run a Pig script to FILTER out rows with repeated headers, and remove a column not required in the final file.
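A minimal sketch of such a script might look like the following (the file paths, three-column schema, and the header literal 'ID' are my assumptions, since the original script isn't reproduced here):

```pig
-- Load the raw CSV from HDFS (path and schema are assumptions)
raw = LOAD '/user/hue/raw_data.csv' USING PigStorage(',')
      AS (id:chararray, name:chararray, notes:chararray);

-- FILTER out the repeated header rows (assumes each header row starts with 'ID')
data = FILTER raw BY id != 'ID';

-- Drop the column not required in the final file
trimmed = FOREACH data GENERATE id, name;

-- Write the reformatted file back to HDFS
STORE trimmed INTO '/user/hue/clean_data' USING PigStorage(',');
```

The FILTER/FOREACH pair is the idiomatic Pig way to express "remove rows, then project columns"; each statement just defines a relation, and nothing runs until the STORE triggers the job.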



End result



B) Using PIG to compare 2 files and generate a basic DELTA file


In this example I will load a new file and compare it with the above file. Where the new file has a new key (ID), I want to generate a DELTA file containing only the new key records.


The new file is:

Note from above that we have previously received the record with ID 3, so the new delta record should only be (4,dddd)


So let's use a Pig script to determine the simple DELTA.

If you look closely at the logic, it conceptually resembles a SQL right outer join statement such as:

SELECT NEW.* FROM PRIOR RIGHT OUTER JOIN NEW ON PRIOR.ID = NEW.ID WHERE PRIOR.ID IS NULL
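In Pig that logic can be sketched as an outer join followed by a null filter (the file paths and two-column schema are assumptions; the original script is in the screenshot):

```pig
-- Load the previously received file and the new file
prior = LOAD '/user/hue/prior.csv' USING PigStorage(',') AS (id:int, val:chararray);
new_f = LOAD '/user/hue/new.csv'   USING PigStorage(',') AS (id:int, val:chararray);

-- Right outer join keeps every row of new_f, with nulls where prior has no match
joined = JOIN prior BY id RIGHT OUTER, new_f BY id;

-- Keep only rows whose key did not exist in the prior file
delta = FILTER joined BY prior::id IS NULL;

-- Project back to the original two columns
result = FOREACH delta GENERATE new_f::id, new_f::val;

STORE result INTO '/user/hue/delta' USING PigStorage(',');
```

With the sample data above, the stored relation would contain only the (4,dddd) record. The `prior::id` prefix is Pig's disambiguation syntax, needed because both inputs have a field named `id` after the join.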


The end result is a new file with just the Delta record:



Finally, let's combine this Pig script with Oozie and Sqoop to schedule and load the DELTA into HANA.



C) Use Oozie, Pig & Sqoop to transfer the Delta to HANA


Prior to running a new Oozie workflow, let's first check the target table in HANA, which I've previously loaded with the results of the first simple Pig script.

Note: it contains only 3 rows


Now let's create and run an Oozie workflow as follows:


Step 1 – Use a Pig script to create the Delta file

NOTE: This will execute the same script used earlier.



Step 2 – Use Sqoop to export the Delta file to HANA
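The Sqoop step boils down to a single export command. A sketch is below; the host, port, credentials, schema, and table name are all assumptions for illustration, and the HANA JDBC driver (ngdbc.jar) must be available on Sqoop's classpath:

```shell
sqoop export \
  --connect "jdbc:sap://hanahost:30015/" \
  --driver com.sap.db.jdbc.Driver \
  --username MYUSER \
  --password '***' \
  --table DELTA_TARGET \
  --export-dir /user/hue/delta \
  --input-fields-terminated-by ','
```

`--export-dir` points at the HDFS directory the Pig script wrote, and `--input-fields-terminated-by` must match the delimiter used by PigStorage.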


Step 3 – Move the new Delta and overwrite the previous Delta
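Stitched together, the workflow definition is roughly the following XML. This is a hedged sketch, not the exact workflow from the screenshots: the action names, paths, part-file name, and the Sqoop command line are assumptions, and error handling is trimmed to a single kill node for brevity.

```xml
<workflow-app name="hana-delta" xmlns="uri:oozie:workflow:0.4">
  <start to="pig-delta"/>

  <!-- Step 1: run the Pig script that produces the delta file -->
  <action name="pig-delta">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>delta.pig</script>
    </pig>
    <ok to="sqoop-export"/>
    <error to="fail"/>
  </action>

  <!-- Step 2: export the delta file to HANA via Sqoop -->
  <action name="sqoop-export">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>export --connect jdbc:sap://hanahost:30015/ --driver com.sap.db.jdbc.Driver --table DELTA_TARGET --export-dir /user/hue/delta --input-fields-terminated-by ,</command>
    </sqoop>
    <ok to="move-delta"/>
    <error to="fail"/>
  </action>

  <!-- Step 3: replace the previous file with the new delta output -->
  <action name="move-delta">
    <fs>
      <delete path="${nameNode}/user/hue/prior.csv"/>
      <move source="${nameNode}/user/hue/delta/part-r-00000"
            target="${nameNode}/user/hue/prior.csv"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail"><message>Delta workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```

The fs action's delete-then-move is what makes the job rerunnable: the next execution compares incoming data against the delta just loaded.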



Now let's execute the workflow and see the results.



Finally, let's check whether the DELTA made it to HANA.


SUCCESS 

ID 4 was correctly inserted.


If you give it a try then please do let me know how you get on.

