Technology Blogs by Members
Explore a vibrant mix of technical expertise, industry insights, and tech buzz in member blogs covering SAP products, technology, and events. Get in the mix!
RonaldKonijnenb
Contributor

One of the guys I respect the most in the HANA jungle is juergen.schmerder. Not only is he a great help in the forums when it comes to developing in HANA, he also writes awesome blogs. And so I find myself on my comfortable couch, armed with my iPad, looking at the latest HANA Development Center blog by Mr. Schmerder: The Road to HANA for Software Developers. An amazing piece of work, referencing a ton of documents and hitting that sweet spot of “all you need to know” when developing in SAP HANA. Bookmark it now; it is the HANA development bible.

The following section in Juergen’s blog gets me interested:

Did you know that HANA supports full-text search and hence can bridge the gap between what is commonly referred to as "structured" (tables with fields as you typically find it in relational databases) and "unstructured" (just text) data? The Fuzzy Search Developer Guide provides information on how to - well, the title says it all - how to do fuzzy searches on text data


Now, that sounds cool. Time to find out what’s so fuzzy about fuzzy searches!

So what’s a fuzzy search?

According to the Fuzzy Search Developer Guide:


Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA.

The term ”fault-tolerant search” means that a database query returns records even if the search term (the user input) contains additional or missing characters or other types of spelling error. Fuzzy search can be used in various applications, for example:

  • Fault-tolerant search in text columns (for example, html or pdf): Search for documents on 'Driethanolamyn' and find all documents that contain the term 'Triethanolamine'.
  • Fault-tolerant search in structured database content: Search for a product called 'coffe krisp biscuit' and find 'Toffee Crisp Biscuits'.
  • Fault-tolerant check for duplicate records: Before creating a new customer record in a CRM system, search for similar customer records and verify that there are no duplicates already stored in the system. When, for example, creating a new record 'SAB Aktiengesellschaft & Co KG Deutschl.' in 'Wahldorf', the system shall bring up 'SAP Deutschland AG & Co. KG' in 'Walldorf' as a possible duplicate.
So, in short, it can find values in big data sets without actually hitting the search term exactly right. Think Google (“Did you mean?”).

It gets better:

The fuzzy search algorithm calculates a fuzzy score for each string comparison. The higher the score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of 0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function. You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). When a fuzzy search of multiple columns is used in a SELECT statement, the score is returned as an average of the scores of all columns used.

So not only does it find a “fault tolerant” match, it also puts a score behind it:

For example, when searching with 'SAP', a record like 'SAP Deutschland AG & Co. KG' gets a high score, because the term 'SAP' exists in both texts. A record like 'SAPPHIRE NOW Orlando' gets a lower score, because 'SAP' is only a part of the longer term 'SAPPHIRE' (3 of 8 characters match).
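As a rough analogy only (this is not HANA’s actual scoring algorithm), Python’s standard-library difflib computes a normalized similarity on the same 0.0–1.0 scale, which illustrates why a partial match like 'SAPPHIRE' scores lower than an exact 'SAP' match:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # 1.0 = identical, 0.0 = nothing in common (the same scale the guide describes).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("SAP", "SAP"))       # identical strings -> 1.0
print(similarity("SAP", "SAPPHIRE"))  # partial match -> noticeably lower
```

The exact numbers differ from HANA’s fuzzy scores, but the ordering behaves the same way: the closer the candidate is to the search term, the higher the score.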


Time to hit the system.

It always starts with a download

So where do I find a nice big unstructured dataset? Well, that’s easy enough. The guys over at infoChimps surely have something great available. And lo and behold, there it is, a Twitter data set:

Twitter Census - Conversation Metrics: One year of URLs, Hashtags, Smileys usage (monthly)

Twitter data from millions of tweets! This is a download of Twitter data from March 2006 to November 2009. The data set consists of “tokens,” which are hashtags (#data), URLs, or emoticons (Twitter smileys or other “faces” created using keyboard characters). The data comes from analysis on the full set of tweets during that time period, which is 35 million users, over 500 million tweets, and more than 1 billion relationships between users.

Infochimps uses a tool to “scrape” user profiles, extracting Twitter data. This data set exclusively consists of token usage (Twitter smileys, hashtags or URLs).


The download file is almost 200 MB, and after extraction I have a file of 1.8 GB. First thing to do is take a peek inside the data file. I recommend Large Text File Viewer for that. Completely free, and it loads a file in seconds. Examining the file, two things get me worried. The file is tab delimited, and the first 100,000 records are garbage. I could load them into HANA using the “Import” function without cleansing, but this would give me some errors for sure. I choose to clean in advance. Getting rid of the garbage in a file of 2 gigs is a little complicated, but an easy solution is available. A file editor for Windows can load files of up to 10 GB and lets you edit them easily, hex based (don’t worry, you don’t feel a thing). Her name is 010 Editor and you can use her for free for 30 days.
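For the simple case where the garbage is just the first N lines of the file, a few lines of Python can do the cleanup without a hex editor. This is a sketch under that assumption; drop_leading_lines is my own helper name, not part of any tool mentioned here:

```python
def drop_leading_lines(src_path: str, dst_path: str, n: int) -> None:
    # Copy src to dst, skipping the first n lines.
    # Streams line by line, so a multi-gigabyte file never has to fit in memory.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i >= n:
                dst.write(line)
```

If the garbage rows are scattered rather than at the top, a hex editor or a filter on the expected column count would be needed instead.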

Now to convert my file to CSV (the format HANA can handle with import/file load). Converting is easy when you’re on the Mac (and wouldn’t you be ;-). You can easily convert huge files from tab to CSV (comma separated) by using a command in a terminal session:

tr '\t' ',' < tokens_by_month-20091111.tsv > output.csv

All tab values (‘\t’) are replaced by a ‘,’. Nice and simple!
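One caveat: if a field ever contains a comma itself, plain character replacement with tr will corrupt that row. As a hedged alternative, Python’s standard csv module streams the file and quotes such fields properly (tsv_to_csv is a hypothetical helper name, not an existing tool):

```python
import csv

def tsv_to_csv(src_path: str, dst_path: str) -> None:
    # Stream row by row so a multi-gigabyte file never has to fit in memory.
    # Unlike tr, the csv writer quotes fields that themselves contain commas.
    with open(src_path, newline="", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            writer.writerow(row)
```

For this Twitter token data the simple tr approach is likely fine, but for messier datasets proper quoting saves a lot of load errors.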

Lock and Load

So it’s time to create a model in HANA and load my file. When doing a fuzzy search, my table needs to be structured in a certain way. This can easily be done by using SQL to create your table. I end up with the following command:

CREATE COLUMN TABLE "S0007457730"."TWITTERINFOCHIMPS3" (
  "TOKEN_TYPE" VARCHAR(10) NOT NULL,
  "YEAR_MONTH" VARCHAR(6),
  "COUNT" INTEGER CS_INT,
  "TOKEN_TEXT" SHORTTEXT(200) LANGUAGE DETECTION ('EN') PHRASE INDEX RATIO 0.200000 FUZZY SEARCH INDEX ON SEARCH ONLY ON FAST PREPROCESS ON
)

In general fuzzy searches can be performed on:

  • TEXT
  • SHORTTEXT
  • VARCHAR, NVARCHAR
  • DATE
  • All data types with a full-text index

I choose SHORTTEXT, as fuzzy searches on text types can be the most sophisticated. Note that the ‘FUZZY SEARCH INDEX’ structure improves performance at the cost of additional memory for the loaded table. Set it in advance, as you cannot modify it later.

Structure is created:

And after loading (see manual here), we have the following result:

40 million records, summarized by month, hashtag/tweet_url and a count of occurrence!

Bring on the not so fuzzy fuzziness

You can find a ton of examples in the development guide. I really urge you to take a look and come up with some use cases. For this blog I will stick to a fuzzy search on a single column:

Give me all records containing ‘SAP’ as part of the hashtag:

SELECT SCORE() AS score, * FROM TWITTERINFOCHIMPS3
WHERE CONTAINS(TOKEN_TEXT, 'SAP',
  FUZZY(0.7, 'textSearch=compare,bestMatchingTokenWeight=0.7'))
AND TOKEN_TYPE = 'hashtag'
ORDER BY score DESC;
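To see the shape of this query outside HANA, here is a toy sketch using Python’s difflib in place of HANA’s fuzzy algorithm. The 0.5 cutoff is arbitrary, since difflib’s ratios are not comparable to HANA’s 0.7 threshold, and the hashtag list is made up for illustration:

```python
from difflib import SequenceMatcher

def score(term: str, text: str) -> float:
    # Normalized similarity, 0.0 to 1.0, case-insensitive.
    return SequenceMatcher(None, term.lower(), text.lower()).ratio()

hashtags = ["sap", "sap_bw", "slap-a-hater", "knitting"]

# Keep candidates above a cutoff and return the best matches first,
# mimicking FUZZY(0.7, ...) with ORDER BY score DESC.
results = sorted(
    ((score("sap", h), h) for h in hashtags if score("sap", h) >= 0.5),
    reverse=True,
)
print(results)
```

The exact match floats to the top with a perfect score, near matches follow, and weak matches like 'slap-a-hater' fall below the cutoff, which is exactly the behavior the HANA query produces at scale.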

My dataset comes back with over 800 records, and the highest-ranked values are:

SCORE   TOKEN_TYPE   TOKEN_TEXT
1.00    hashtag      sap
0.94    hashtag      sap_wt
0.94    hashtag      sap_inovation
0.94    hashtag      sap_innovation
0.94    hashtag      sap_bw
0.94    hashtag      sap_beratung
0.94    hashtag      sap_08susrprt
0.94    hashtag      sap::rfc
0.94    hashtag      sap-tip5
0.94    hashtag      sap-tip
0.94    hashtag      sap-test
0.94    hashtag      sap-software
0.94    hashtag      sap-rfc
0.94    hashtag      sap-beratungshaus
It worked beautifully. The high-ranked scores give me values that make sense (containing “SAP”), while the lower-scored values add nothing to my analysis:

SCORE   TOKEN_TYPE   TOKEN_TEXT
0.76    hashtag      slap-a-hater
0.76    hashtag      blip-on-asap
0.76    hashtag      swap-o-rama
0.71    hashtag      sapat&atilde

That list, by the way, goes on and on 😉

Plotting the results in a nice ViSi document gives me:

Seems the majority of the hashtags used was a simple #SAP. Guess social media back in 2009 wasn’t up to par with what we have today. Interesting to see how things have progressed over the last couple of years. A lot of discussion on hashtags is going on at the moment. I highly recommend you follow some of the influencers in the SAP social media space. A prime example would be @Jonerp.

Back to ViSi, some interesting analysis when we look at the counted hashtags per month:

So why the sudden increase in Hashtag “SAP” in October? I guess it must have been SAPPHIRE time!

See you there. I’ll be on stage during the TechEd Demo Jam in Madrid. Cheer me on (I need it to calm my nerves 😉).

Thanks for reading this and take care,

Ronald.
