Skip to Content

One of the guys I respect the most in the HANA jungle is Juergen Schmerder. Not only is he a great help in the forums when it comes to developing in HANA, he also writes awesome blogs. And so I find myself on my comfortable couch, weaponed with my iPad, looking at the latest HANA Development Center Blog by Mr. Schmerder: The Road to HANA for Software Developers. Amazing piece of work, referencing a ton of documents, hitting that sweet spot of “all you need to know” when developing in SAP HANA.  Bookmark it now, it is the HANA development bible.

The following section in Juergens blog gets me interested:

Did you know that HANA supports full-text search and hence can bridge the gap between what is commonly referred to as “structured” (tables with fields as you typically find it in relational databases) and “unstructured” (just text) data? The Fuzzy Search Developer Guide provides information on how to – well, the title says it all – how to do fuzzy searches on text data


Now, that sounds cool. Time to find out what’s so fuzzy about fuzzy searches!

So what’s a fuzzy search?

According to the Fuzzy Search Developer Guide:


Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA.


The term ”fault-tolerant search” means that a database query returns records even if the

search term (the user input) contains additional or missing characters or other types of

spelling error. Fuzzy search can be used in various applications, for example:


Fault-tolerant search in text columns (for example, html or pdf): Search for

documents on ‘Driethanolamyn’ and find all documents that contain the term

‘Triethanolamine’.


Fault-tolerant search in structured database content: Search for a product called

‘coffe krisp biscuit’ and find ‘Toffee Crisp Biscuits’.


Fault-tolerant check for duplicate records: Before creating a new customer record in

a CRM system, search for similar customer records and verify that there are no

duplicates already stored in the system. When, for example, creating a new record

‘SAB Aktiengesellschaft & Co KG Deutschl.’ in ‘Wahldorf’, the system shall bring up

‘SAP Deutschland AG & Co. KG’ in ‘Walldorf’ as a possible duplicate.


So in short it can find values in big data sets without actually hitting the search term exactly right. Think Google (“did you mean”?)

It gets better:

The fuzzy search algorithm calculates a fuzzy score for each string comparison. The higher

the score, the more similar the strings are. A score of 1.0 means the strings are identical. A

score of 0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function. You

can sort the results of a query by score in descending order to get the best records first (the

best record is the record that is most similar to the user input). When a fuzzy search of

multiple columns is used in a SELECT statement, the score is returned as an average of the

scores of all columns used.

So not only does it find a “fault tolerant” match, it also puts a score behind it:

For example, when searching with ‘SAP’, a record like ‘SAP Deutschland AG & Co. KG’ gets

a high score, because the term ‘SAP’ exists in both texts. A record like ‘SAPPHIRE NOW

Orlando’ gets a lower score, because ‘SAP’ is only a part of the longer term ‘SAPPHIRE’ (3 of

8 characters match)


Time to hit the system.

It always starts with a download

So where do I find a nice big unstructured dataset. Well that’s easy enough. The guys over at infoChimps surely have something great available. And lo and behold, there it is, a Twitter data set:

Twitter Census – Conversation Metrics: One year of URLs, Hashtags, Smileys usage (monthly)

Twitter data from millions of tweets! This is a download of Twitter data from March 2006 to November 2009. The data set consists of “tokens,” which are hashtags (#data), URLs, or emoticons (Twitter smileys or other “faces” created using keyboard characters). The data comes from analysis on the full set of tweets during that time period, which is 35 million users, over 500 million tweets, and more than 1 billion relationships between users.

Infochimps uses a tool to “scrape” user profiles, extracting Twitter data. This data set exclusively consists of token usage (Twitter smileys, hashtags or URLs).


Download file is almost 200 MB and after extraction I have a file of 1.8 gigabyte. First thing to do is take a peek inside the datafile. I recommend Large text file viewer for that. Completely free and loads a file in seconds. Examining the file, two things get me worried. The file is tab delimited and the first 100.000 records are garbage. I could load them into HANA, using the “Import” function without cleansing, but this will give me some errors for sure. I choose to clean in advance.  Getting rid of the garbage in a file of 2 gigs is a little complicated but an easy solution is available. A file editor for Windows is available which can load up to 10 GB files and let you edit them easily by doing it hex based (don’t worry, you don’t feel a thing). Her name is 010 editor and you can use her for free for 30 days.

Now to convert my file to CSV (the format HANA can handle with import/file load). Converting is easy when your on the Mac (and wouldn’t you be ;-). You can easily convert huge files from tab to CSV (comma separated) by using a command in a terminal session:

tr ‘\t’ ‘,’ < tokens_by_month-20091111.tsv > output.csv

All tab values (‘\t’) are replaced by a ‘,’. Nice and simple!

Lock and Load

So it’s time to create a model in HANA and load my file. When doing a fuzzy search, my table needs to be structured in a certain way. This can easily be done by using SQL to create your table. I end up with the following command:

CREATE COLUMN TABLE “S0007457730”.“TWITTERINFOCHIMPS3” (“TOKEN_TYPE” VARCHAR(10) NOT NULL ,

       “YEAR_MONTH” VARCHAR(6),

       “COUNT” INTEGER CS_INT,

       “TOKEN_TEXT” SHORTTEXT(200) LANGUAGE DETECTION (‘EN’) PHRASE INDEX RATIO 0.200000 FUZZY SEARCH INDEX ON SEARCH ONLY ON FAST PREPROCESS ON)

In general fuzzy searches can be performed on:

  • TEXT
  • SHORTTEXT
  • VARCHAR, NVARCHAR
  • DATE
  • All data types with a full-text index

I choose SHORTTEXT as Text type fuzzy searches can be the most sophisticated. Note that the ‘FUZZY SEARCH INDEX’ structure will improve performance as it increases the memory of the loaded table. Set it in advance as you cannot modify it later.

Structure is created:

Naamloos.png

And after loading (see manual here), we have the following result):

Naamloos.png

40 million records, summarized by month, hashtag/tweet_url and a count of occurrence!

Bring on the not so fuzzy fuzziness

You can find a ton of examples in the development guide. I really urge you to take a look and come up with some use cases. For this blog I will stick to a fuzzy search on a single column:

Give me all records containing ‘SAP’ as part of the hashtag:

SELECT SCORE() AS score, * FROM TWITTERINFOCHIMPS3

WHERE CONTAINS(TOKEN_TEXT,’SAP’,

FUZZY(0.7,’textSearch=compare,bestMatchingTokenWeight=0.7′))

AND TOKEN_TYPE = ‘hashtag’

ORDER BY score DESC;

My dataset comes back with over 800 records and my highest ranked scored values are:

SCORE             

TOKEN_TYPE

TOKEN_TEXT

                 1,00

hashtag

sap

                 0,94

hashtag

sap_wt

                 0,94

hashtag

sap_inovation                                            

                 0,94

hashtag

sap_innovation                                           

                 0,94

hashtag

sap_bw

                 0,94

hashtag

sap_beratung

                 0,94

hashtag

sap_08susrprt                                            

                 0,94

hashtag

sap::rfc

                 0,94

hashtag

sap-tip5

                 0,94

hashtag

sap-tip

                 0,94

hashtag

sap-test

                 0,94

hashtag

sap-software                                             

                 0,94

hashtag

sap-rfc

                 0,94

hashtag

sap-beratungshaus                                        

It worked beautifully. High ranked scores give me values which make sense (containing “SAP”), lower level values show me no values which add value to my analysis:

                 0,76

hashtag

slap-a-hater

                 0,76

hashtag

blip-on-asap                                             

                 0,76

hashtag

swap-o-rama

                 0,71

hashtag

sapat&atilde

That list bye the way goes on and on 😉

Plotting  the results in a nice ViSi document give me:

Naamloos.png

Seems the majority of the Hashtags used was a simple #SAP.  Guess social media back in 2009 wasn’t up to par to what we have today. Interesting  to see how things have progressed the last couple of years. A lot of discussion on hashtags is going on at the moment. I highly recommend you to follow some of the influencers in the SAP social media space. A prime example would be @Jonerp.

Back to ViSi,  some interesting analysis when we look at the counted hashtags per month:

Naamloos.png

So why the sudden increase in Hashtag “SAP” in October? I guess it must have been SAPPHIRE time!

See you there. I’ll be on stage during the TechEd Demo Jam in Madrid. Cheer me on (I need it to get my nerves down 😉

Thanks for reading this and take care,

Ronald.

To report this post you need to login first.

13 Comments

You must be Logged on to comment or reply to a post.

  1. Juergen Schmerder

    Ronald!

    Let me be the first one to comment 🙂

    First of all, thanks for the compliments – I usually don’t like being complimented, but yours is so nice, I would actually put it in my resume…

    Still feeling touched, I have to give the compliments back – I’m just a hunter an gatherer, without people who actually contribute genuine content, I’m nothing. And you are really a role model when it comes to taking from and giving to the community – thanks for your engagement!

    Hey, and I will be there to cheer for you in Madrid!

    @all HANA developers, read this blog and try it yourself – it’s those use cases beyond boring SELECT … FROM … WHERE … GROUP BY … ORDER BY … that make HANA so compelling!

    cheers

    –juergen

    (0) 
  2. Niels Weigel

    Hi Ronald,

    thanks for blogging about the fuzzy search and your nice use case.

    Indeed I can not wait until SAP Business Suite runs on HANA and when a new business partner is created, the user is immediately getting the weighted list of SIMILAR existing business partners within the system to avoid creating a new business partner similar to an existing one.

    Or imagine a Call Center agent in CRM who does not have to ask the customer “Maria Dsipanovski” at the phone to search for her customer number or having her exactly spelling the correct name. The call center agent enters just the way he understands the name and get the right record back.

    –> Fast, reliable error-tolerant search and lookup functionality thanks to FUZZY! technology.

    Niels

    (0) 
  3. User Union

    Hi Ronald,

    thank you very much for your interesting blog about “HANA and Fuzzy Search”, I have been searching for such a detailled document a very long time:)

    I still have got a question:

    Is it possible to use the fuzzy-search to find duplicates e.g. in an CRM-System?

    Example: I have got a customer in my CRM-system, correctyl written “Meier”. But wrongly there is also an entry “Maier” and “Meyer” in the CRM-System.

    Now: Is it possible to replicate the “customer”-table into HANA, to correct / to purge these duplicates and then transfering the adjusted customer-table back into my CRM-system?

    Thank you for replying.

    Best Regards.

    (0) 
    1. Ronald Konijnenburg Post author

      Thank for the question!

      Getting that CRM data into HANA and doing the check should be fairly easy. Getting the changes back in might be challenging. Take a look at the current MDM solution and BOBJ data quality tools. You’ll get some inspiration from those I’m sure!

      (0) 
  4. Ethan Zhang

    Awesome. Our team currently are evaluating HANA fuzzy search, this excellent content help us a lot.

    Thanks again, I’ve already add it into my bookmark.

    Best Regards

    Ethan

    (0) 
  5. Rama Shankar

    Good stuff – thanks for sharing. Wish I had seen this earlier. Learnt and did fuzzy search the hard way few months ago !

    Thanks!

    Rama

    (0) 
  6. Jemin Tanna

    Thanks a lot for sharing this blog…

    Can you help me understand the parameters used….for instance the WEIGHT seems to be of now use, following two queries yield exactly the same result

    Thanks again!

    SELECT “TEXT”,  SCORE() AS RELEVANCE

                                                      FROM “SNWD_TEXTS”

                                                      WHERE client=’000′ AND

                                                                  CONTAINS (TEXT, ‘FLASH’,   FUZZY(0.7), WEIGHT(0.9));

    SELECT “TEXT”,  SCORE() AS RELEVANCE

                                                      FROM “SNWD_TEXTS”

                                                      WHERE client=’000′ AND

                                                                  CONTAINS (TEXT, ‘FLASH’,   FUZZY(0.7), WEIGHT(0.00001));

    (0) 

Leave a Reply