One big data concept in Lumira is that we work on sample data while designing the analytics document. This is desirable because it is not viable to load an extremely large data set, at the terabyte or petabyte level, on a desktop machine, and we want user workflows to stay responsive. Once the user has designed their report, we replay the data wrangling operations from the sample on the full data set on the Hadoop platform. The resulting Lumira document will contain the real data you need for your analytics. Think of this strategy as retrieving a manageable subset of data from your enterprise data lake and focusing only on the data that is relevant to your business question. As a joke on our team, we refer to this as “Bonsai Tree Big Data”.
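To make the "design on a sample, replay on the full data set" idea concrete, here is a toy sketch in Python. It is not how Lumira implements the replay (on Hadoop the operations run as a Hive job); the `Recipe` class, the helper functions, and the rows are all made up for illustration.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Operation = Callable[[List[Row]], List[Row]]


class Recipe:
    """Hypothetical recorder: stores wrangling steps so they can be replayed."""

    def __init__(self) -> None:
        self.steps: List[Operation] = []

    def add(self, step: Operation) -> "Recipe":
        self.steps.append(step)
        return self

    def run(self, rows: Iterable[Row]) -> List[Row]:
        # Apply every recorded step, in order, to whatever data set is passed in.
        result = list(rows)
        for step in self.steps:
            result = step(result)
        return result


def keep_quarter(quarter: str) -> Operation:
    # Filter step: keep only the rows for one quarter.
    return lambda rows: [r for r in rows if r["quarter"] == quarter]


def add_revenue(rows: List[Row]) -> List[Row]:
    # Derived-column step: revenue = units * price.
    return [{**r, "revenue": r["units"] * r["price"]} for r in rows]


# Made-up data: the "full" table and a small sample of it.
full = [
    {"quarter": "Q3", "units": 2, "price": 5.0},
    {"quarter": "Q4", "units": 1, "price": 10.0},
    {"quarter": "Q4", "units": 3, "price": 4.0},
]
sample = full[:2]

# Design the steps against the sample, then replay them on the full data.
recipe = Recipe().add(keep_quarter("Q4")).add(add_revenue)
sample_result = recipe.run(sample)
full_result = recipe.run(full)
```

The point of the design is that the steps are recorded independently of the data, so the same recipe can be applied to the small sample on the desktop and later to the full table on the cluster.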
In my example, I have already sampled 0.1% of my Hive table and have pruned the data set down to focus only on last quarter's data. I have created some visualizations that will help answer the questions I would like to ask of my data. I am now ready to schedule my Lumira document as an Oozie Workflow on the Hadoop platform.
The first step is to click on the “Generate Full Dataset” button under the sample label.
You will be given the choice to create a Hive table or a Lumira Document. Creating a Hive table will generate a new Hive table with the full data set that represents all the data wrangling operations executed in the Prepare room of Lumira. This option is useful if your resulting table is extremely large (>100M records) or you want to consume the resulting Hive table with another application. Generating a Lumira document is useful if you want to do some ad-hoc analysis on your full data set and then share your insights with your co-workers via Lumira Cloud or Lumira Team Server.
For the output options, you will need to specify an Output Directory on HDFS where your scheduled Lumira Document will be created and where your Hive data will reside. The “Full Dataset Lumira Document Name” will be used as the name of both your scheduled Lumira document and the Hive table that holds the resulting wrangled data.
If you click on the Preference button, you can enter the details for your WebHDFS (or HttpFS) server. You can find this information in Ambari (Hortonworks) or Cloudera Manager. If you do not have access to these sites, you will need to speak with your IT department. Remember that the user must have the appropriate authorization to read from and write to the HDFS Output Directory.
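If you want to sanity-check the server details before scheduling, one option is to call the WebHDFS REST API yourself. The sketch below only builds the URL for a `GETFILESTATUS` request; the host, port, path, and user are placeholders, and it assumes simple (`user.name`) authentication rather than Kerberos. HttpFS exposes the same `/webhdfs/v1` path, usually on a different port.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, hdfs_path: str, op: str, user: str,
                scheme: str = "http") -> str:
    """Build a WebHDFS REST URL for one operation on an absolute HDFS path."""
    if not hdfs_path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, "user.name": user})
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Placeholder values; take the real ones from Ambari or Cloudera Manager.
status_url = webhdfs_url(
    host="namenode.example.com",
    port=50070,
    hdfs_path="/user/lumira/output",
    op="GETFILESTATUS",
    user="lumira",
)
print(status_url)
```

Opening the printed URL with `curl` or `urllib.request.urlopen` should return a JSON `FileStatus` object whose `owner`, `group`, and `permission` fields show whether your user is likely to be able to read and write the directory.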
Now comes the fun of Hadoop configuration under the Oozie Setting pane. We need to enter the information required to schedule a Hive action in an Oozie Workflow. We tried to make this configuration as painless as possible: the user only has to enter the values once, and they are kept after the first successful scheduling. We also tried to give the configuration keys the same names you would find in Ambari or Cloudera Manager. There are two exceptions. The “Oozie Hive ShareLibs Directory” can be found in HDFS. This is the directory where all your Hive jars are located when running a Hive action in Oozie. I always like to run the Apache Oozie Hive action sample to verify that the Hive action has been properly installed in the Oozie workflow scheduler. The “Oozie hive-site.xml” value is the Hive configuration file stored on HDFS. It isn’t ideal to have another copy of the Hive site configuration stored in HDFS, but we currently need it so that the created Hive table can be discovered on the remote Hive metastore.
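For readers who have not configured an Oozie Hive action by hand, the following sketch renders the kind of `job.properties` file such a workflow typically needs. It is not the file Lumira generates. `oozie.libpath` and `oozie.wf.application.path` are standard Oozie properties, `nameNode` and `jobTracker` follow the naming used in the Apache Oozie examples, and `hiveSiteXml` is a hypothetical variable name that a workflow could reference from the Hive action's `<job-xml>` element. All hosts and paths are placeholders.

```python
def render_job_properties(name_node: str, job_tracker: str, sharelib_dir: str,
                          hive_site_xml: str, app_path: str) -> str:
    """Render a minimal job.properties for a workflow with a Hive action."""
    props = {
        "nameNode": name_node,
        "jobTracker": job_tracker,
        # HDFS directory holding the Hive jars used by the Hive action.
        "oozie.libpath": "${nameNode}" + sharelib_dir,
        # Hypothetical variable: copy of hive-site.xml stored on HDFS.
        "hiveSiteXml": "${nameNode}" + hive_site_xml,
        # HDFS directory that contains workflow.xml.
        "oozie.wf.application.path": "${nameNode}" + app_path,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Placeholder values for illustration only.
job_properties = render_job_properties(
    name_node="hdfs://namenode.example.com:8020",
    job_tracker="resourcemanager.example.com:8050",
    sharelib_dir="/user/oozie/share/lib/hive",
    hive_site_xml="/user/lumira/conf/hive-site.xml",
    app_path="/user/lumira/workflows/full-dataset",
)
print(job_properties)
```

Seeing the values side by side makes it easier to match them against what Ambari or Cloudera Manager reports for your cluster.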
Now that the Lumira Document is scheduled, you can monitor the job in the “Hadoop HDFS” tab on the opening page of Lumira Desktop. The “Submitted Workflows” section shows the Oozie Workflows scheduled from this application. The “On Server” section lists the scheduled Lumira Documents, which can be opened from the HDFS directory by double-clicking.
If there is an Oozie scheduling error, you can double-click on the job to investigate the problem. Of course, you are free to use other tools already available in Hadoop, such as Hue or the Oozie Web Console, to monitor your Oozie Workflow.
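Another option is the Oozie web services API, which the Oozie Web Console also relies on. The sketch below builds the URL for a job's `info` view and pulls failed actions out of the returned JSON. The base URL and job ID are placeholders, and the sample payload is a trimmed, made-up response used only to exercise the helper.

```python
import json
from urllib.parse import urlencode


def oozie_job_url(base_url: str, job_id: str, show: str = "info") -> str:
    """Build the Oozie REST URL for one view (info, log, ...) of a job."""
    return f"{base_url.rstrip('/')}/v1/job/{job_id}?{urlencode({'show': show})}"


def failed_actions(job_info: dict) -> list:
    """Return (action name, error message) pairs for actions that did not succeed."""
    bad_states = {"ERROR", "FAILED", "KILLED"}
    return [
        (action.get("name"), action.get("errorMessage"))
        for action in job_info.get("actions", [])
        if action.get("status") in bad_states
    ]


# Placeholder server and job ID.
info_url = oozie_job_url(
    "http://oozie.example.com:11000/oozie/",
    "0000001-150101000000000-oozie-oozi-W",
)

# Made-up, trimmed payload standing in for the JSON returned by info_url.
sample_payload = json.loads("""
{
  "id": "0000001-150101000000000-oozie-oozi-W",
  "status": "KILLED",
  "actions": [
    {"name": ":start:", "status": "OK", "errorMessage": null},
    {"name": "hive-node", "status": "ERROR", "errorMessage": "Main class exited with error"}
  ]
}
""")
problems = failed_actions(sample_payload)
```

Requesting the same URL with `show=log` instead of `show=info` returns the job log as plain text, which is often the quickest way to see why a Hive action failed.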
As you know, configuring Hadoop is challenging because of all the different options available. Hopefully this blog makes it less painful for you to successfully schedule a Hadoop job using SAP Lumira.
If you are trying to schedule large data sets in Hadoop, check out these tips