Unstructured Data Analytics using SAP Netweaver Cl...

rahul_aware · ‎12-18-2012

It’s been 4 months since I published my last post on this topic. As promised, here’s the next post with some incremental information on this endeavor. Let’s see how a simple UIMA annotator can be deployed to the cloud as a REST service.

Information Extraction

The first and foremost challenge in implementing any unstructured data analytics is information extraction. That’s where the Apache UIMA framework comes in handy. It’s a Java-based open source framework that can be used to develop complex components that extract information from a variety of unstructured data (text, voice, etc.).

UIMA can analyze large volumes of text and extract information based on custom rules written as Annotators. A fairly large community contributes to creating and enhancing annotators that can identify words, nouns, phrases, sentiments, etc. You can download these annotators and process their output as your requirements dictate. The UIMA framework is also flexible about feeding the output of one annotator into the input of another. So, you can use a whitespace annotator to tag each word in the text and then feed these words to regular expression annotators that detect email addresses, URLs, phone numbers, ZIP codes, etc. Such chains of annotators are called aggregate analysis engines.
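To make the chaining idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the UIMA API: the class AnnotatorChainSketch, the Span type and the two regular expressions are hypothetical and simplified, and only illustrate how the output of a whitespace tokenizing step can become the input of a pattern-matching step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-JDK illustration of chaining two analysis steps; this is not the UIMA API. */
final class AnnotatorChainSketch {

    /** A typed span over the input text, loosely mirroring what an annotation carries. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        final String coveredText;

        Span(String type, int begin, int end, String coveredText) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.coveredText = coveredText;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + "] " + coveredText;
        }
    }

    // Deliberately simplified patterns, for illustration only.
    private static final Pattern NON_WHITESPACE = Pattern.compile("\\S+");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");
    private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

    /** Step 1: tag every run of non-whitespace characters as a token. */
    static List<Span> tokenize(String text) {
        List<Span> tokens = new ArrayList<>();
        Matcher matcher = NON_WHITESPACE.matcher(text);
        while (matcher.find()) {
            tokens.add(new Span("Token", matcher.start(), matcher.end(), matcher.group()));
        }
        return tokens;
    }

    /** Step 2: consume the tokens from step 1 and label those that match a pattern. */
    static List<Span> classify(List<Span> tokens) {
        List<Span> entities = new ArrayList<>();
        for (Span token : tokens) {
            if (EMAIL.matcher(token.coveredText).matches()) {
                entities.add(new Span("Email", token.begin, token.end, token.coveredText));
            } else if (ZIP.matcher(token.coveredText).matches()) {
                entities.add(new Span("ZipCode", token.begin, token.end, token.coveredText));
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        String text = "Write to jane.doe@example.com or visit us in 94304 today";
        // The second step only ever sees what the first step produced.
        for (Span entity : classify(tokenize(text))) {
            System.out.println(entity);
        }
    }
}
```

In UIMA itself the two steps would be separate annotators wired together by an aggregate descriptor; the sketch only shows the data flow between them.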

So where does SAP Netweaver Cloud come into the picture? Because it is based on open standards, it’s fairly simple to run an annotator on the cloud as a REST service. In this post I will show how to deploy the White Space annotator on the cloud as a REST service.

Broad outline of how this works:

The annotator is packaged as a UIMA PEAR file. We will be using the White Space tokenizer; its source code can be downloaded from the SVN repository. Use the Eclipse UIMA plugin to create the component PEAR file.
Create a mapping file that tells the server which analysis results to show and in which format.
Create a WAR file for deployment into the servlet container with an appropriate web.xml.
Deploy the WAR file to the cloud.

Initial Setup:

Download the latest UIMA SDK and the Annotator add-ons. I suggest version 2.3.1, as it has been around for some time and is used by many without any issues.

Create a UIMA_HOME environment variable that points to the directory where you uncompressed the SDK. You can run some of the examples packaged with the SDK; refer to http://uima.apache.org/doc-uima-examples.html
Install the UIMA Eclipse plugins as per http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.eclipse_setup.installation
Get the White Space tokenizer source code from the SVN link
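Before moving on, it can help to confirm that UIMA_HOME is actually visible to Java processes. The following is a small sketch for that; the class name UimaHomeCheck and the helper isUsableHome are hypothetical, and the check only verifies that the variable names an existing directory, not that the directory really contains the SDK.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper that sanity-checks the UIMA_HOME environment variable. */
final class UimaHomeCheck {

    /** Returns true when the value is non-empty and names an existing directory. */
    static boolean isUsableHome(String value) {
        if (value == null || value.trim().isEmpty()) {
            return false;
        }
        return Files.isDirectory(Paths.get(value.trim()));
    }

    public static void main(String[] args) {
        String home = System.getenv("UIMA_HOME");
        if (isUsableHome(home)) {
            System.out.println("UIMA_HOME points to " + home);
        } else {
            System.out.println("UIMA_HOME is not set or is not a directory");
        }
    }
}
```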

Creating a PEAR file:

A PEAR file is a standard package for UIMA components that can be distributed and reused. The UIMA Eclipse plugin lets you add the UIMA nature to your project and create a PEAR file for your annotator using the PEAR generation wizard. For complete documentation on the PEAR generator, please refer to http://uima.apache.org/d/uimaj-2.3.1/tools.html#ugr.tools.pear.packager

Add the UIMA nature to the project from the context menu. This is a prerequisite for generating the PEAR file. After adding the UIMA nature, you will find the ‘Generate PEAR File’ option in the context menu (right-click the project). Select this option.

The wizard will pop up with a default Component ID name

Search for the component descriptor in the ‘desc’ folder of the project structure. It’s an XML file that holds key values for the component.

On the next screen, provide the operating system and JRE version of the target system that will use this PEAR file
Provide the directory and file name for the PEAR file

On completion of the wizard, the PEAR file for the UIMA project is generated in the target directory you provided.
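A PEAR file is a ZIP archive, so the result of the wizard can be inspected with the JDK alone. The sketch below lists the archive entries and checks for an installation descriptor. The class name PearInspector is hypothetical, and the entry name metadata/install.xml is my assumption about the standard PEAR layout; please verify it against the PEAR documentation linked above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper that lists the contents of a PEAR (ZIP) archive. */
final class PearInspector {

    /** Assumed location of the installation descriptor inside a PEAR archive. */
    static final String INSTALL_DESCRIPTOR = "metadata/install.xml";

    /** Reads the archive from the stream and returns the names of all entries. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Returns true when the entry list contains the assumed installation descriptor. */
    static boolean looksLikePear(List<String> entryNames) {
        return entryNames.contains(INSTALL_DESCRIPTOR);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PearInspector <file.pear>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            List<String> names = listEntries(in);
            for (String name : names) {
                System.out.println(name);
            }
            System.out.println(looksLikePear(names)
                    ? "Found " + INSTALL_DESCRIPTOR
                    : "No " + INSTALL_DESCRIPTOR + " entry found");
        }
    }
}
```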

Adding UIMA PEAR to SAP Netweaver Cloud Application:

Create a Dynamic web project

Give a name to the project and select SAP Netweaver Cloud as Target Runtime

Next we need to add some JAR files to the WEB-INF/lib folder. They are required to run the UIMA functionality. You get them from the SimpleServer project that comes with the Sandbox add-on, which you downloaded along with the UIMA SDK.

Create a folder named ‘resources’ under WEB-INF and add to it the PEAR file that we generated. The project structure will look something like this:

Modify the web.xml file to point to the PEAR file location and provide the servlet class name

If everything goes well, you will be able to deploy this web project to the local server and to the cloud. The application will look something like this:

This page also has the following information about the service:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

No description provided

Usage

In order to use this service, a POST- or GET-request should be sent to the server with the following URL:

http://localhost:8080/WSTRest/

The following request parameters are expected:

POST-parameters

POST request should be sent to use the service

text -- the value of this parameter is the text to analyze. Expected encoding is UTF-8. This parameter must always be set.
lang -- This parameter sets the language of the text. If this parameter is not set, the value "en" will be used.
mode -- This parameter defines which view of the analysis result the servlet should return. If this parameter is not set, XML output will be produced.

Possible values:

inline -- returns inline-xml containing the analyzed text in which all found entities are represented by tags
xml -- means to output the result as a XML-document containing a list of found entities

GET-parameters

GET request should be sent to obtain information about the service

mode -- This parameter should define, what the servlet should return. Some options are available.

Possible values:

xmldesc -- will show a specification of this service in XML format
form -- will show a form with input fields, which will allow you to try out this service
description -- will return a description of a service in HTML (human-readable) format. This description is partially automatically generated, and partially created by the author of this service.
xsd -- will return a XSD schema definition of the text analysis results

Result

If XML or inline-XML output is requested, it will contain the tags listed below. The XSD-definition of the output in XML-format can be downloaded here.

XML elements of result

Example of usage

// Requires imports from java.io (BufferedReader, InputStreamReader, OutputStreamWriter)
// and java.net (URL, URLConnection, URLEncoder)
String text = "Hello Mr. John Smith !";
// Form-encode the request parameters
String parameters = "text=" + URLEncoder.encode(text, "UTF-8") + "&mode=inline";
URL url = new URL("http://localhost:8080/WSTRest/");
URLConnection connection = url.openConnection();
// Enabling output turns the request into a POST
connection.setDoOutput(true);
OutputStreamWriter writer = new OutputStreamWriter(connection.getOutputStream());
writer.write(parameters);
writer.flush();
// Read and print the analysis result line by line
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

It neatly points out how to consume the service programmatically from another component or application.
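The generated example sets only the text and mode parameters. A small helper for assembling the POST body from all three documented parameters (text, lang, mode) might look like the sketch below. The class name AnalysisRequestBody and its build method are hypothetical; only the parameter names and the allowed mode values come from the service description above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the form-encoded POST body for the service. */
final class AnalysisRequestBody {

    /**
     * Builds the request body. The text is mandatory; lang and mode may be null,
     * in which case the service defaults ("en" and XML output) apply.
     */
    static String build(String text, String lang, String mode) {
        if (text == null) {
            throw new IllegalArgumentException("text must always be set");
        }
        StringBuilder body = new StringBuilder("text=").append(encode(text));
        if (lang != null) {
            body.append("&lang=").append(encode(lang));
        }
        if (mode != null) {
            // Only the two POST modes listed in the service description are accepted.
            if (!mode.equals("inline") && !mode.equals("xml")) {
                throw new IllegalArgumentException("mode must be 'inline' or 'xml'");
            }
            body.append("&mode=").append(mode);
        }
        return body.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("Hello Mr. John Smith !", "en", "inline"));
    }
}
```

The returned string can be written to the connection's output stream in place of the hand-built parameters string in the example above.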

For now, we will test the service with the simple HTML form that comes with this service:

Output of the query:

You can try out this service on cloud at:

https://wstrests0007950666trial.nwtrial.ondemand.com/WSTRest/?mode=form
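The ?mode=form suffix in the URL above is one of the four GET modes listed on the service page (xmldesc, form, description, xsd). As a rough sketch, the corresponding URLs can be generated like this. The class ServiceInfoUrl and its Mode enum are hypothetical names, and the base URL is the local one from the service description.

```java
import java.util.Locale;

/** Hypothetical helper that builds the informational GET URLs of the service. */
final class ServiceInfoUrl {

    /** The GET modes listed in the service description. */
    enum Mode {
        XMLDESC, FORM, DESCRIPTION, XSD;

        /** The value expected by the mode request parameter. */
        String value() {
            return name().toLowerCase(Locale.ROOT);
        }
    }

    /** Appends the mode parameter to the base URL, adding a trailing slash if needed. */
    static String forMode(String baseUrl, Mode mode) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return base + "?mode=" + mode.value();
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(forMode("http://localhost:8080/WSTRest/", mode));
        }
    }
}
```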

Next Steps:

The White Space tokenizer forms the basis of any complex UIMA annotation engine. There are many out-of-the-box annotators in the UIMA sandbox, such as the Dictionary Annotator, the Concept Mapper annotator, the Snowball annotator and the Hidden Markov Model Tagger annotator. It will be interesting to see them working on the cloud, with their output saved to HANA for analytics.