Text Analysis of IPL Match using Twitter Data (Part 2)
I am back with a continuation of my earlier blog post.
In this part, we will focus on custom dictionaries.
The screenshot below shows that when the SQL query is executed for
TA_TYPE = ‘PERSON’, Virat Kohli and Ashwin each appear several times, in separate rows.
Why this happens:
When different users post comments on Twitter, how a name is written depends on the individual.
It is common for the same cricketer's name to be entered in several different ways, and each variant is returned as its own row.
To standardize the names and make analysis easier, we need to create custom dictionaries so that the system returns one uniform name when the SQL is executed.
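The idea behind a custom dictionary can be illustrated outside HANA with a small Python sketch. This is only a conceptual illustration, not how HANA implements it: the variant spellings, the `STANDARD_NAMES` mapping, and the `standardize` helper are all made up for this example.

```python
from collections import Counter

# Hypothetical variant spellings mapped to one standard form.
# In HANA this role is played by the custom dictionary, not by Python code.
STANDARD_NAMES = {
    "virat kohli": "Virat Kohli",
    "kohli": "Virat Kohli",
    "virat": "Virat Kohli",
    "ashwin": "Ashwin",
    "r ashwin": "Ashwin",
}


def standardize(token: str) -> str:
    """Return the standard form of a known variant, else the token unchanged."""
    key = " ".join(token.lower().split())  # ignore case and extra spaces
    return STANDARD_NAMES.get(key, token)


# Made-up extracted tokens, as they might appear in separate result rows.
extracted = ["Virat Kohli", "Kohli", "virat", "Ashwin", "R Ashwin"]

# Without standardization every spelling is counted on its own.
raw_counts = Counter(extracted)

# With standardization the variants collapse into one row per player.
standard_counts = Counter(standardize(t) for t in extracted)

print(raw_counts)
print(standard_counts)
```

With the sample tokens above, the raw count has five separate entries, while the standardized count has only two, one per player.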
Now let’s see how this can be achieved:
Create a custom HANA Text Analysis configuration file
In HANA Studio, create a workspace, then create and share a project.
Under this project, create a new file with the extension “hdbtextconfig”.
Copy all the contents of one of the predefined configurations delivered by SAP into this file. The predefined configurations are located in the HANA repository.
For this exercise, let’s copy the contents of the configuration file “EXTRACTION_CORE_VOICEOFCUSTOMER”.
For details, see “Creating a Text Analysis Configuration”, Section 10.1.3.2.1 of the
HANA Developer Guide SPS07: http://help.sap.com/hana/SAP_HANA_Developer_Guide_en.pdf
In the next document, I will show how to create a custom dictionary and reference it in the custom configuration we just created, so that the analysis of the Twitter data no longer returns repeated names when the SQL is run.