SAP HANA Text Analysis

Mining social media data for customer feedback is one of the greatest untapped opportunities for customer analysis in many organizations today.

As many are aware, twenty-first century corporations are facing a crisis. Many corporations have been accurately and comprehensively storing data for years. The data is in variety of forms like social media posts, email, blogs, news, feedback, tweets, business documents etc.

It is very important to extract meaningful information without having to read every single sentence. Now, what is meaningful information? The extraction process should identify the “who”, “what”, “where”, “when” and “how much” (among other things) from these data.
For example, use social media data to find out –

  • What people are saying about my brand or products?
  • How many people recommend my brand vs. advocate against it?

Text Analysis is the solution of all this problem.

In this article we will explain:

  • What is Text Analysis?
  • Why Text Analysis is so important for business?
  • How does SAP HANA support text analysis?

Before understanding Text Analysis, you will have to first understand Structured Data and Unstructured Data.

Structured and Unstructured Data:

Structured Data:

Data that resides in a fixed field within a record or file is called structured data. This includes data contained in relational databases and spreadsheets .
For example data stored in database tables are structured data.

Structured data has the advantage of being easily entered, stored, queried and analyzed.

Unstructured Data:

The phrase “unstructured data” usually refers to information that doesn’t reside in a traditional row-column database.

Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents.
Digging through unstructured data can be cumbersome and costly. Email is a good example of unstructured data. It’s indexed by date, time, sender, recipient, and subject, but the body of an email remains unstructured. Other examples of unstructured data include books, documents, medical records, and social media posts.

Why unstructured data is so important for business?

Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly — often many times faster than structured databases are growing.

What is Text Analysis?

Text Analysis is the process of analyzing unstructured text, extracting relevant information and then transforming that information into structured information that can be leveraged in different ways.

Text Analysis refers to the ability to do Natural Language Processing, linguistically understand the text and apply statistical techniques to refine the results.
With the help of text analysis we can model and structure the information content of unstructured data for the purpose of business analysis, research and investigation.

TA.jpg

There are few important techniques being used in Text Analysis.

    • • Full Text Indexing
    • Full Text Search
    • • Fuzzy Search

Let’s have a look into them one by one.

Full Text Search:

The primary function of full-text search is to optimize linguistic searches.

Full text search is designed to perform linguistic (language-based) searches against text and documents stored in your database.
In a full-text search, the search engine examines all of the words in every stored document as it tries to match search criteria (text specified by a user).

Full Text Indexing:

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called “serial scanning.” This is what some rudimentary tools, such as grep, do when searching.

However, when the number of documents to search is potentially large, the problem of full-text search is often divided into two tasks: indexing and searching.

The indexing stage will scan the text of all the documents and build a list of search terms (often called an index). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.
The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document.

Conceptually, full-text indexes support searching on columns in the same way that indexes support searching through books.

Fuzzy Search:

Also known as approximate string matching.
Fuzzy search is the technique of finding strings that match a pattern approximately (rather than exactly).
It is a type of search that will find matches even when users misspell words or enter in only partial words for the search.

A Real World Example:

If a user types “SAP HANA Tutorl” into Yahoo or Google (both of which use fuzzy matching), a list of hits is returned along with the question, “Did you mean “SAP HANA Tutorial”?”

Now let’s create a table to undersatnd this process.

Sentiment Text Analysis:-

————————————————————————

create column table “DIBYENDU”.”TEXT_ANALYSIS”

(

ID INTEGER PRIMARY KEY,

STRING nvarchar (200)

);

insert into “DIBYENDU”.”TEXT_ANALYSIS” values(1,’Dibu really likes playing cricket’);

/wp-content/uploads/2014/04/1_429607.png

/wp-content/uploads/2014/04/2_429641.jpg

/wp-content/uploads/2014/04/3_429642.jpg

We can find out this index has few columns with TA_Token, TA_Type etc.

2 columns of the table $TA_* is very important for us.
TA_TOKEN

This column contains the extracted entity or element (for example, an identifiable person, place, topic, organization, or sentiment).

TA_TYPE

This is the category the entity falls under. For example PERSON, PLACE, PRODUCT etc.

In the TA_Token it takes all the possible breaking of the table’s data and in TA_Type it gives the possible SENTIMENT of the data which stored in TA_Token.

Without creating separate index like above we can create a text analysis index at the time of creation of the table itself as shown below by shorttext datatype.

Create table for text analysis with shorttext:-

—————————————————

create column table “DIBYENDU”.”TEXT_ANALYSIS_SHORT_TEXT”

(

ID INTEGER PRIMARY KEY,

STRING shorttext (200)

text analysis on

async

);

/wp-content/uploads/2014/04/6_429656.png

/wp-content/uploads/2014/04/5_429643.png

/wp-content/uploads/2014/04/7_429657.png

Here as we have not used sentiment analysis so TA_TYPE will not give our sentiment rather it will give noun / preposition etc.

How to use the Fuzzy Search in SAP HANA

In this article we will talk about

    • • What is Fuzzy Search?
    • • Why Fuzzy Search is important?
    • • Real Time Example of Fuzzy Search Based Applications.
    • • How to Implement Fuzzy Search in SAP HANA?

What is Fuzzy Search?

Also known as approximate string matching.
Fuzzy search is the technique of finding strings that match a pattern approximately (rather than exactly).
It is a type of search that will find matches even when users misspell words or enter in only partial words for the search.
purpose:

With the help of Fuzzy Search Misspellings and typos still provide relevant results.

A Real World Example:

If a user types “SAP HANA Tutorl” into Yahoo or Google (both of which use fuzzy matching), a list of hits is returned along with the question, “Did you mean “SAP HANA Tutorial”?”

Fuzzy Search in SAP HANA:

In SAP HANA, you can call the fuzzy search by using the CONTAINS predicate with the FUZZY option in the WHERE clause of a SELECT statement.
Syntax:

SELECT * FROM <tablename>
WHERE CONTAINS (<column_name>, <search_string>, FUZZY (0.8))

A search with FUZZY(x) returns all values that have a fuzzy score greater than or equal to x.

The SCORE() Function

The fuzzy search algorithm calculates a fuzzy score for each string comparison. The higher the score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of 0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function.

You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). When a fuzzy search of multiple columns is used in a SELECT statement, the score is returned as an average of the scores of all columns used.

So not only does it find a “fault tolerant” match, it also puts a score behind it.

Example:
When searching with ‘SAP’, a record like ‘SAP AG’ gets a high score, because the term ‘SAP’ exists in the texts. A record like ‘BSAP Corp’ gets a lower score, because ‘SAP’ is only a part of the longer term ‘BSAP Corp’.

Create the table and data:

CREATE COLUMN TABLE “DIBYENDU”.“COMPANIES” (
               ID INTEGER PRIMARY KEY,
               COMPANY_NAME SHORTTEXT(200)

FUZZY SEARCH INDEX ON);

INSERT INTO “DIBYENDU”.“COMPANIES” VALUES (1, ‘SAP’);

INSERT INTO “DIBYENDU”.“COMPANIES” VALUES (2, ‘SAP in Walldorf’);

INSERT INTO “DIBYENDU”.“COMPANIES” VALUES (3, ‘SAP AG’);

INSERT INTO “DIBYENDU”.“COMPANIES” VALUES (4, ‘ASAP Corp’);

INSERT INTO “DIBYENDU”.“COMPANIES” VALUES (5, ‘BSAP orp’);

INSERT INTO “DIBYENDU”.“COMPANIES” VALUES (6, ‘IBM Corp’);

/wp-content/uploads/2014/04/8_429677.png

Perform the search on one column:

SELECT SCORE() AS score, * FROM “DIBYENDU”.“COMPANIES”

        WHERE CONTAINS(COMPANY_NAME,‘SAP’,

                FUZZY(0.7,‘textSearch=compare,bestMatchingTokenWeight=0.7’))

        ORDER BY score DESC
The output of fuzzy search contains 5 entries. Based on the fuzzy search factor (which is 0.7 in this case), it will also consider the similar words. In this case “SAP AG”, “BSAP orp” etc.

/wp-content/uploads/2014/04/9_429678.png

Another Example:-

/wp-content/uploads/2014/04/10_429679.png

Without Fuzzy Search:

Suppose you want to search a customer with name “Jimi”.
SQL Query:

SELECT * FROM <Schema_Name>.”CUSTOMERS”
                WHERE CONTAINS(FIRST_NAME, ‘Jimi’)
                ORDER BY “CUSTOMER_ID” DESC;


The output will contain only one entry which contains exact match of “Jimi”.

/wp-content/uploads/2014/04/10_1_429683.png

Now let us try the fuzzy search function.
SQL Query:

SELECT SCORE() AS score, * FROM “DIBYENDU”.“CUSTOMERS”

                WHERE

               CONTAINS(FIRST_NAME, ‘Jimi’, FUZZY(0.7))

               ORDER BY score DESC


The output of fuzzy search contains 4 entries. Based on the fuzzy search factor (which is  0.7 in this case), it will also consider the similar words. In this case “Jimy”.

We can also do fuzzy search on 2 columns. For example First Name and Last Name.
SQL Query:

SELECT SCORE() AS score, * FROM “DIBYENDU”.“CUSTOMERS”

                WHERE

                          CONTAINS(FIRST_NAME, ‘Jimi’, FUZZY(0.7))

                          and CONTAINS(LAST_NAME, ‘Hendricks’, FUZZY(0.7))

                ORDER BY score DESC;

The output contains 3 entries. Based on the fuzzy search factor (which is 0.7 in this case).

/wp-content/uploads/2014/04/11_429684.png

To report this post you need to login first.

4 Comments

You must be Logged on to comment or reply to a post.

Leave a Reply