DataServices Text Analysis and Hadoop – the Details
I have already used the text analysis feature within SAP DataServices in various projects (the transform in DataServices is called Text Data Processing or TDP in short). Usually, the TDP transform runs in the DataServices engine, means that DataServices first loads the source text in its own memory and then runs the text analysis on its own server / engines. The text sources are usually unstructured text or binary files such as Word, Excel, PDF files etc. If these files reside on a Hadoop cluster as HDFS files, DataServices can also push down the TDP transform as a MapReduce job to the Hadoop cluster.
Why running DataServices Text Analysis
within Hadoop ?
Running the text analysis within Hadoop (means as MapReduce jobs) can be an appealing approach if the total volume of source files is big and at the same time the Hadoop cluster has enough resources. Then the text analysis might run much quicker inside Hadoop than within DataServices. The text analysis extracts entities from unstructured text and as such it will transform unstructured data into structured data. This is a fundamental pre-requisite in order to be able to run any kind of analysis with the data in the text sources. You only need the text sources as input for the text analysis. Afterwards you could theoretically delete the text sources, because they are no longer needed for the analysis process. But in reality you still want to keep the text sources for various reasons:
- Quality assurance of the text analysis process: doing manual spot checks by reviewing the text source. Is the generated sentiment entity correct? Is the generated organisation entity Apple correct or does the text rather refer to the fruit apple? …
- As an outcome of the QA you might want to improve the text analysis process and therefore rerun the text analysis with documents that already had been analyzed previously.
- You might want to run other kinds of text analysis with the same documents. Let’s say you did a sentiment analysis on text documents until now. But in future you might have a different use case and you want to analyze customer requests with the same documents.
Anyway, in all these scenarios the text documents are only used as a source. Any BI solution will rely on the structured results of the text analysis. But it will not need to lookup any more information from these documents. Keeping large volumes of these documents in a Hadoop cluster can therefore be a cost-effective solution. If at the same time the text analysis process can run in parallel on multiple nodes on that cluster the overall performance might be improved – again on comparative cheap hardware.
I was therefore curious how the push-down of the DataServices TDP transform to Hadoop works technically. Unfortunately my test cluster is too small to test the performance. Instead my test results focus on technical functionality only. Below are my experiences and test results:
For my tests I am using a Hortonworks Data Platform HDP 2.1 installation on just one cluster. All HDP services are running on this node. The cluster is installed on a virtual machine with CentOS 6.5, 8 GB memory and 4 cores.
DataServices 4.2.3 is installed on another virtual machine based on CentOS 6.5, 8 GB memory and 4 cores.
My Test Cases
- Text analysis on 1000 Twitter tweets in one CSV file:
The tweets are a small sample from a bigger file extracted from Twitter on September 9 using a DataSift stream. The tweets in the stream were filtered, they all mention Apple products like iPhone, iPad,AppleWatch and so on.
- 3126 binary MSGfiles. These files contain emails that I had sent out in 2013:
- 3120 raw text files. These files contain emails that I had sent out in 2014
I have been running each of these test cases in two variants:
- Running the TDP transform within the DS engine. In order to achieve this, the source files were saved on a local filesystem
- Running the TDP transform within Hadoop. In order to achieve this, the source files were saved on the Hadoop cluster.
The TDP transform of both variants of the same test case is basically the same. This means they are copies of each other with exactly the same configuration options. For all the test cases I have configured standard entity extraction in addition to a customer provided dictionary which defines the names of Apple products. Furthermore, I specified sentiment and request analysis for English and German languages:
TDP configuration options
Whether a TDP transform runs in the DS engine or is pushed down to Hadoop depends solely on the type of source documents: on a local file system or in HDFS. Finally, I wanted to compare the results between both variants. Ideally, there should be no differences.
DataServices vs. Hadoop: comparing the results
Surprisingly, there are significant differences between the two variants. When the TDP transform runs in the DS engine the transform produces much more output than when running the same transform inside Hadoop as a MapReduce job. More precise, for some source files the MapReduce job did not generate any output at all, while the DS engine generates output for the same documents:
For example, the DS engine produced 4647 entities from 991 documents. This makes sense, because 9 tweets do not contain any meaningful content at all, so that the text analysis does not generate any entities for these documents. But the MapReduce job generated 2288 entities from only 511 documents. This does no longer make sense. The question is what happened with the other 502 documents?
The picture is similar for the other two test cases analyzing email files. So, the problem is not related to the format of the source files. I tried to narrow down this problem within Hadoop – please see the section Problem Tracking –> Missing entities at the end of this blog for more details.
On the other hand, for those documents when both, the DS engine and the MapReduce job generated entities, the results between the two variants are nearly identical. For example, these are the results of both variants for one example tweet.
The minor differences are highlighted in the Hadoop version. They can actually be ignored, because they won’t have any impact on a BI solution processing this data. Please note that the entity type APPLE_PRODUCT has been derived from the custom dictionary that I provided, whereas the entity type PRODUCT has been derived from the system provided dictionary.
We can also see that the distribution of entity types is nearly identical between the two variants, just the number of entities is different. The chart below shows the distribution of entity types across all three test cases:
Preparing Hadoop for TDP
It is necessary to read the relevant section in the SAP DataServices reference manual. Section 12.1.2 Configuring Hadoop for text data processing describes all the necessary prerequisites. The most important topics here are:
Transferring TDP libraries to the Hadoop cluster
The $LINK_DIR/hadoop/bin/hadoop_env.sh -c script will simply transfer some required libraries and the complete $LINK_DIR/TextAnalysis/languages directory from the DS server to the Hadoop cluster. The target location in the Hadoop cluster is set in the environment variable $HDFS_LIB_DIR (this variable will be set in the same script when called with the -e option).You need to provide this directory in the Hadoop cluster with appropriate permissions. I recommend that the directory in the Hadoop cluster is owned by the same login under which that the Data Services engines are running (in my environment this is the user ds). After $LINK_DIR/hadoop/bin/hadoop_env.sh -c has been executed successfully the directory in HDFS will look something like this:
Important: If you are using the Hortonworks Data Platform (HDP) distribution version 2.x you have to use another script instead: hadoop_HDP2_env.sh. hadoop_env.sh does not transfer all the required libraries for HDP 2.x installations.
Optimizing text data processing for use in the Hadoop framework
The description in the SAP DataServices reference manual, section 12.1.2 Configuring Hadoop only works for older Hadoop distributions as the referred Hadoop parameters are deprecated. For HDP 2.x I had to use different parameters. See the section Memory Configuration in YARN and Hadoop in this blog below for more details.
HDFS source file formats
Reading the SAP DataServices reference manual word-for-word, it says that only unstructured text files work for the push-down to Hadoop and furthermore that the source file must be connected (directly?) to the TDP transform:
My tests – fortunately – showed that the TDP MapReduce job can handle more formats:
- HDFS Unstructured Text
- HDFS Unstructured Binary
I just tested MSG and PDF files, but I assume that the same binary formats as documented in the SAP DataServices 4.2 Designer Guide – see section 6.10 Unstructured file formats will work.The TDP transform is able to generate some metadata for unstructured binary files (the TDP option Document Properties) : this feature does not work when the TDP transform is pushed down to Hadoop. The job still analyzes the text in the binary file. But the MapReduce job will temporary stage the entity records for the document metadata. When reading this staging file and passing the records back to DataServices file format, warnings will be thrown. Thus you will still receive all generated text entities but not those for the document metadata.
- HDFS CSV files:
I was also able to place a query transform between the HDFS source format and the TDP transform and specify some filters, mappings and so in the query transform. All these settings still got pushed down to Hadoop via a Pig script. This is a useful feature, the push-down seems to work very similar to the push-down to relational databases: as long as DataServices knows about corresponding functions in the source system (here HDFS and Pig Latin language) it pushes as much as possible to the underlying system instead of processing the functions in its own engine.
TDP pushed down to Hadoop
It is very useful to understand roughly how the TDP push down to Hadoop works, because it will ease performance and problem tracking. Once all perquisites for the push-down are met (see previous section Preparing Hadoop for TDP –> HDFS source file formats) DataServices will generate a Pig script which in turn starts the TDP MapReduce job on the Hadoop cluster. The generated Pig script will be logged in the DataServices trace log file:
The Pig script runs on the DataServices server. But the commands in the Pig script will be executed against the Hadoop cluster. This means – obviously – that the Hadoop client libraries and tools must be installed on the DataServices server. The Hadoop client does not necessarily need to be part of the Hadoop cluster, but it must have access to the Hadoop cluster. More details on the various installation options are provided in the SAP DataServices reference manual or on SCN – Configuring Data Services and Hadoop.
In my tests I have noticed two different sorts of Pig scripts that can be generated, depending on the source file formats.
- CSV files
The Pig scripts for analyzing text fields in CSF files usually looks similar to this:Obviously,DataServices provides its own Pig functions for loading and storing data. They are also used for staging temporary results within the Pig script. During problem tracking it might be useful to check these files. All required files and information are located in one local directory on the DataServices server:
Given all these information, you might also run performance tests or problem analysis by executing or modifying the Pig script directly on the DataServices server.
- Unstructured binary or text files
The Pig script for unstructured binary or text files looks a bit different. This is obviously because theMapReduce jar file provided byDataServices can read these files directly. In contrast, in case of CSV files, some additional pre-processing need to happen in order to extract the text fields from the CSV file.In case of unstructured binary or text files the Pig script calls another Pig script which in turn runs a Unix shell command (I don’t know what the purpose of these two wrapper Pig scripts is, though?). The Unix shell command simply runs the MapReduce job within the Hadoop framework:
What if DS does not push down the TDP transform to Hadoop?
Unstructured binary or text HDFS files:
If unstructured binary or text HDFS files are directly connected to the TDP transform, the transform should get pushed down to Hadoop. In most cases it does not make sense to place other transforms between the HDFS source file format and the TDP transform. If you do have one or more query transforms after the HDFS source file format, you should only use functions in the query transforms that DataServices is able to push down to Pig. As long as you are using common standard functions such as substring() and so on, the chances are hight that DataServices will push them down to Pig. In most cases you will anyway have to find out on your own which functions are candidates for a push-down because they are not documented in the SAP manuals.
Basically the same rules apply as for unstructured binary or text files. In addition there are potential pitfalls with the settings in the CSV file format:
keep in mind that DataServices provides its own functions for reading HDFS CSV files (see previous section TDP pushed down to Hadoop). We don’t know about the implemented features of the DSLoad Pig function and they are not documented in the SAP manuals. It may be that some of the settings in the CSV file format are not supported by the DSLoad function. In this case DataServices cannot push down the TDP transform to Hadoop, because it first need to read the complete CSV file from HDFS into its own engine. It will then process the specific CSV settings locally and also process the TDP transform locally in its own engine.
A typical example for this behaviour are the row skipping options in the CSV file format. They are apparently not supported by the DSLoad function. If you set these options, DataServices will not push down the TDP transform to Hadoop:
Memory Configuration in YARN and Hadoop
I managed to run a MapReduce job using standard entity extraction and english sentiment analysis using the default memory settings for YARN and MapReduce (means the settings initially provided when I setup the cluster). But when using the German voice-of-customer libraries (for sentiment or request analysis) I had to increase some memory settings. According to SAP the german language is more complex so that these libraries require much more memory. Please note that the memory required for the MapReduce jobs depends on these libraries and not on the size of the source text files that will be analyzed.
If you do not configure enough memory the TDP MapReduce job will fail. Such kind of problems cannot be tracked using the DataServices errorlogs. See the section Problem Tracking in this blog below.
I followed the descriptions in the Hortonworks manuals about YARN and MapReduce memory configurations. This way I had overridden the following default settings in my cluster (just as an example!):
Using these configurations I managed to run the TDP transform with German VOC libraries as MapReduce jobs.
Problems while running a TDP transform as a MapReduce job are usually more difficult to track. The errorlog of the DataServices job will simply contain the errorlog provided by the generated Pig script. It might already point you to the source of the problem. On the other hand, other kind of problems might not be listed here. Instead you need to review the relevant Hadoop or MapReduce log files within Hadoop.
For example, in my case I identified a memory issue within MapReduce like this:
If a TDP MapReduce jobs fails due to insufficient memory the DataServices errorlog might look similar to this:
In this DS errorlog excerpt you just see that one MapReduce job from the generated Pig script failed, but you don’t see any more details about the failure. Also, the other referred log files pig_*.err and hdfsRead.err do not provide any more details. In such cases you ned to review the errorlog of the MapReduce job in Hadoop to get more details.
In general it is helpful to roughly understand how logging works inside Hadoop. In this specific case of YARN and/or MapReduce I found the chapter Log files in Hortonworks’ YARN and MR2 documentation useful (especially section MapReduce V2 Container Log Files).
It is also good to know about YARN log aggregation, see HDP configuration files for more information. Depending on the setting of the
yarn.log-aggregation-enable parameter the most recent log files are either still on the local file system of the NameNode or they are compressed within the HDFS cluster.
In anyway, you can start the error log review with tools such as Hue or the JobHsitory server provided with Hadoop. In my example I used the Job browser view within Hue. Clicking on the TDP MapReduce job lists various failed attempts of the job to obtain a so-called memory container from YARN:
The log file of one of these attempts will show the reason for failure, for instance low memory:
In this case the MapReduce job couldn’t start at all and an error will be returned to the DataServices engine which started the Pig script. So in this case the DataServices job will also get a failed status.
Other type of errors to watch out
Once a task attempt succeeds it does not necessarily mean that the task itself will succeed. It means that a task has successfully started. Any errors during task execution will then be logged in the task log file. In case of TDP, a task on one node will sequentially analyze all the text source files. If the task fails when analyzing one of the source documents the error will be logged but the task will go ahead with the next source document. This is a typical fault tolerant behaviour of MapReduce jobs in Hadoop.
Important: although that errors with individual documents or records get logged in the log file of the MapReduce TDP task, the overall result of the MapReduce job will be SUCCEDED and no error or even warnings get returned to DataServices. The DataServices job will get a successful status in such situations.
It is therefore important to monitor and analyze the MapReduce log files in addition to the DataServices error logs!
During my tests I found some other typical errors in the MapReduce log files that hadn’t been visible in the DataServices errorlog. For example:
- Missing language libraries:
I configured automatic language detection in the TDP transform, but forgot to install all the available TDP languages during DS installation. After installing all the languages and re-executing $LINK_DIR/hadoop/bin/hadoop_env.sh -c these errors disappeared.
(BTW: when running the same job with the TPD transform in the DS engine instead of having it pushed-down to Hadoop DataServices did not print any warnings in the errorlog, so I probably would have never caught this issue)
- Non-ASCII characters in text sources:
In the case of the Twitter test case some tweets may contain special characters such as the © or ™ signs. The TDP transform (or MapReduce job) cannot interpret such characters as raw text and then aborts the analysis of this document.
If the TDP runs within the DS engine these errors will be printed as warnings on the DS errorlog.
If the TDP runs as MapReduce job there will be no warnings in the DS errorlog, but the errorlog of the MapReduce will contain these errors.
- Missing entities:
As described above I found that the TDP transform running as a MapReduce job generates much less entities than when running the same TDP transform (with the same input files) within the DS engine. To narrow down the problem I checked the stdout log file of the MapReduce job. For each text source file (or for each record in a CSV file) it prints a message like this if the analysis succeeds:
In order to understand how many text sources had been successfully analyzed I grepped the stdout file:
Well, in this particular case it actually didn’t help me to solve the problem: there had been 3126 text files as source for the TDP transform. According to stdout all of them had been successfully analyzed by the TDP job. Nevertheless, text entities had been generated for only 2159 documents. This would actually mean that 967 documents are empty or have no meaningful content so that the text analysis does not generate text entities. But because the same TDP transform when running in the DS engine generates entities for 2662 documents, there seem to be something wrong in the MapReduce job of the TDP transform!
Having a closer look at how the generated entities distribute across various metadata I recognized that entities were missing only for documents for which the auto language detection mechanism has determined English language:
With these findings I manage to narrow down the problem to the language settings in the TDP transform. If I change settings from Auto detection (and English as default language) to English as fix language both jobs (DS engine and Hadoop MapReduce – will produce the exact same number of entities.
I have passed this issue to the SAP – hopefully there will be fix available soon.