Access Hadoop via the SAP HANA Smart Data Integration Hive Adapter from SAP Agile Data Preparation
Earlier, in my SAP Agile Data Preparation Tutorial, I described how to use the File Adapter to read a file of UEFA Champions League players and analyze them with the SAP HANA Rules Framework.
Later, I described how to Leverage your SAP Data Hub Connections with SAP Agile Data Preparation. This allows me to access big data stores like HDFS, SAP Vora, Microsoft Azure Data Lake or Google Cloud Storage from SAP Agile Data Preparation:
However, accessing big data via the SAP Data Hub differs from direct data access: only a representative sample of the actual data is loaded into SAP Agile Data Preparation, and the transformations are applied to the whole data set only afterwards. While this makes perfect sense for big data, it does not allow, for example, Functional Dependencies or SAP HANA Rules Management rules to be applied. Since regular-sized data is sometimes stored in data lakes as well, in this blog I will describe how to configure the Hive Adapter to access data in HDFS directly.
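To illustrate why a sample is not enough for Functional Dependencies, here is a minimal Python sketch. It is not how SAP Agile Data Preparation computes dependencies; the helper name `find_fd_violations` and the rows are hypothetical and only show that a dependency such as country → countrycode can look valid on a sample while being violated in the full data.

```python
from collections import defaultdict


def find_fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical rows, not taken from the actual players file.
full_data = [
    {"country": "Scotland", "countrycode": "SCO"},
    {"country": "Spain", "countrycode": "ESP"},
    {"country": "Scotland", "countrycode": "SCO"},
    {"country": "Spain", "countrycode": "SPA"},  # conflicting code
]
sample = full_data[:3]  # a sample that happens to miss the conflicting row

# The sample shows no violation, the full data does.
print(find_fd_violations(sample, "country", "countrycode"))
print(find_fd_violations(full_data, "country", "countrycode"))
```

A dependency confirmed on a sample is therefore only a hint, which is why direct access to the full table matters for this kind of check.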
My data file, stored in HDFS, is the same one I used when describing how to Profile your big data with the SAP Data Hub. Accessing its Fact Sheet, as described in Leverage your SAP Data Hub Connections with SAP Agile Data Preparation, shows, for example, the title distribution by country:
To access this data directly and apply the rules from my SAP Agile Data Preparation Tutorial, I configure the Hive Adapter. Please check the SAP HANA Smart Data Integration and SAP HANA Smart Data Quality Installation and Configuration Guide for details, but most importantly:
Before registering the adapter with the SAP HANA system, ensure you have downloaded and installed the correct JDBC libraries. See the SAP HANA smart data integration Product Availability Matrix (PAM) for details. Place the files in the <DPAgent_root>/lib/hive folder.
Then, I get the necessary connection information from my Ambari dashboard:
And configure my Remote Connection accordingly:
I described a similar configuration in Leverage your SAP Data Hub Adapter with SAP Agile Data Preparation, but there the focus was on the SAP Data Hub Adapter, whereas now I am using a standard SAP Data Provisioning Agent.
To access my file, I create a corresponding external table that accounts for its semicolon delimiter ('\u003b'), its header row and its ANSI (windows-1252) encoding:
CREATE EXTERNAL TABLE IF NOT EXISTS players (
    firstname   STRING,
    lastname    STRING,
    country     STRING,
    club        STRING,
    titles      INT,
    countrycode STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u003b'
STORED AS TEXTFILE
LOCATION '/tmp/players'
TBLPROPERTIES ("skip.header.line.count"="1", "serialization.encoding"="windows-1252")
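To make the effect of these table settings more tangible, the following Python sketch mimics them outside of Hive: it decodes the bytes as windows-1252, splits on the semicolon and skips one header line. The sample record and the function name `read_players` are hypothetical and only assume the column layout of the table definition above.

```python
import csv
import io

COLUMNS = ["firstname", "lastname", "country", "club", "titles", "countrycode"]

# Hypothetical file content: a header row plus one made-up record.
raw_bytes = (
    "firstname;lastname;country;club;titles;countrycode\n"
    "José;Müller;Spain;Example FC;2;ESP\n"
).encode("windows-1252")


def read_players(data: bytes):
    """Parse semicolon-delimited, windows-1252 encoded player records."""
    text = data.decode("windows-1252")  # mirrors serialization.encoding
    reader = csv.reader(io.StringIO(text), delimiter="\u003b")  # '\u003b' is ';'
    next(reader)  # mirrors skip.header.line.count = 1
    players = []
    for fields in reader:
        record = dict(zip(COLUMNS, fields))
        record["titles"] = int(record["titles"])  # titles is declared as INT
        players.append(record)
    return players


print(read_players(raw_bytes))
```

Without the encoding setting, accented characters in such a file would not be read correctly, since these bytes are not valid UTF-8.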
Interestingly, the Suggestions for the full table vary from the suggestions for the subset:
But most importantly, I see my Functional Dependencies again:
And I can apply the same rule as in my SAP Agile Data Preparation Tutorial. The result of 611 failed records (99 %) means that, out of 617 players, only 6 are Scottish players with more than one title in my data set.
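The numbers in the rule result can be checked with a few lines of Python. This is only the arithmetic behind the figures quoted above; the function name `rule_summary` is hypothetical.

```python
def rule_summary(total: int, failed: int):
    """Return the number of passed records and the failed share in percent."""
    passed = total - failed
    failed_share = failed / total * 100
    return passed, failed_share


passed, failed_share = rule_summary(total=617, failed=611)
print(passed)                   # 6
print(f"{failed_share:.1f} %")  # 99.0 %
```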
And here they are:
While the SAP Data Hub ADP Connector is great for accessing big data in HDFS, the data sampling it requires has limitations. These can be avoided by accessing the full data via the Hive Adapter, provided the data is small enough to allow for it.