SAP HANA Full Text Index
In this blog, we will cover Full Text Index. Full Text Index is created to enable text search features on specific column in a table. Conceptually, for efficient operational processing, full-text indexes perform search on columns in the same way that indexes support searching through books.
Full Text index is bound to a defined column of a table. DML (Insert, Update and Delete) and DDL (Create, Drop and Alter) commands on the table can impact/update the full text index. Full Text Index is automatically dropped when you drop a table or column on which index is created.
Additionally, to achieve high search performance, it is recommended to create a Full Text Index.
Figure 1 below shows the relationship between database table and full text index.
Figure 1 Relation between database table and full text index
*******************************************************************************
Below is the list of currently supported data types, mime types and supported languages:
Supported data types
- explicit full-text index: (N)VARCHAR, (N)CLOB, BLOB, ALPHANUM
- implicit full-text index: SHORTTEXT, TEXT, BINTEXT
Supported mime types
- pdf, Microsoft Office, Open Office, plain text, html, rtf, xml, and more. Supported MIME types are listed in “SYS”.”M_TEXT_ANALYSIS_MIME_TYPES” view.
Supported languages
- English, French, German, Chinese… 34 languages. Supported languages are listed in “SYS”.”M_TEXT_ANALYSIS_LANGUAGES” view.
*******************************************************************************
Two Types of Full Text Indexing are available:
- Explicit Full Text Index
- Implicit Full Text Index
Explicit Full Text Index
Index is separately created after creation of table. In this case, content of table is stored in its original format. Manual creation of index on specific column creates a hidden column which contains pre-processed information for that specified column. Hidden column is referred for search queries.
Figure 2 Explicit Full Text index
Processing for manually created indexes can be performed either synchronously [Insert/Update calls do not return the control to application until all documents have been pre-processed and inserted into the full text index] or asynchronously [Insert/Update calls return the control to application before all documents are pre-processed and inserted into the index] which is discussed in detail later in this blog.
Explicit Full Text Index supports all the standard SQL functions.
Find below the syntax for creating Explicit Full Text Index
CREATE FULLTEXT INDEX <index-name> ON <table> (<column>) <options>
Implicit Full Text Index
When a table is created using TEXT, BINTEXT or SHORTTEXT column, SAP HANA automatically creates a corresponding full-text index on specified column. This is used in scenarios where search operation is dominant requirement and memory consumption is critical. Figure 3 shows the Implicit Full Text Index.
Figure 3 Implicit Full Text Index
In this case, data is not stored in original format while pre-processed form is like an index on end of book, each keyword located at end of book with page number in Appendix. However, if you fire a query to select data on that column, original data is reconstructed out of full text index.
In contrast to Explicit Full text index, Implicit Full text index doesn’t support all the standard SQL functions. Few limitations exist with Implicit Full text index. For example: you can’t use the column to define the join between two tables, and you can’t use string operator “like”.
Find below the syntax for creating Implicit Full Text Index
CREATE COLUMN TABLE <table-name> (<columns> <TEXT/BINTEXT/SHORTTEXT>);
**************************************************************************************
Pre-processing and Synchronization
Pre-processing is triggered when a full-text index is created on a column with existing data and when new data is written to the column – either by an insert or an update operation.
Processing steps involving below operations are explained in previous blog[https://blogs.sap.com/2018/02/01/sap-hana-text-analysis-3/].
- Document filtering
- Language detection
- Tokenization
- Stemming
Full-text indexes in a SAP HANA database must be created and updated in synchronization with the corresponding column of database table.
Two modes of doing synchronization
- Synchronous Processing
- Asynchronous Processing
Synchronous Processing
In SAP HANA, indexes of the type TEXT and SHORTTEXT(n) are synchronous in nature. Text pre-processing is automatically performed when a column is created or new text is inserted into column results in immediate update of the full-text index with latest information.
Figure 4 shows synchronous processing wherein the system cannot transport any data to the database tables until text pre-processing is complete. After the completion of pre-processing task, control is returned to application/triggering statement for further execution.
Figure 4 Synchronous Processing
When you manually create a full-text index, you can specify whether the index is synchronous or asynchronous.
Asynchronous Processing
Text pre-processing is not initially performed when the table or column is created or whenever new text is inserted. In this case, inserting the results of the text pre-processing and writing the original data do not occur at the same time.
With asynchronous pre-processing the execution of the triggering statement does not wait. When the application writes data into a column with a full-text index, SAP HANA puts the data itself into the column and control goes back to application immediately for further execution. Along with this, SAP HANA puts requests for pre-processing into the pre-processing queue which can be monitored using System View M_FULLTEXT_QUEUES as shown in Figure 5.
Figure 5 Asynchronous Processing
After the pre-processing, the results are put into a result queue and written to the full-text index in a separate transaction. You can specify that the results are collected in the result queue and written to the index in one batch, for example after a specified period, or after a given number of records have been processed. This can be configured with the FLUSH EVERY option. Note that this option only controls when the results are taken from the queue after pre-processing.
Below is the syntax for configuring result queue scheduling
CREATE FULLTEXT INDEX <index-name> ON <table> (<column>) FAST PREPROCESS OFF ASYNC FLUSH EVERY <x> MINUTES
Fast pre-processing is the default behaviour. With fast pre-processing option, you can speed up pre-processing, however with a simplified and less powerful processing. With fast pre-processing, the language detection just returns the default language, which is ‘EN’. Linguistic analysis is skipped, and only a simple tokenizer is used. This simple tokenizer does not work for languages which do not use spaces as word separators (like Japanese). It cannot handle binary documents either.
Differences between Synchronous and Asynchronous processing highlighted in table below:
*********************************************************************************
Monitoring Full Text Index
System views are available in SAP HANA to query information about the system status using SQL commands. They exist in SYS schema. To monitor the full text index, refer to below listed system views
- M_FULLTEXT_QUEUES System View: Provides full text index queue status (active or suspended). Number of pending, completed and failed text analysis request can be monitored via this system view.
- FULLTEXT_INDEXES System View: Provides information about full text indexes on table columns like scheduling mode, flush scheduling details.
- M_TEXT_ANALYSIS_LANGUAGES System View: Provides information on supported valid languages.
- M_TEXT_ANALYSIS_MIME_TYPES System View: Provides information on supported MIME types.
To summarize Full Text Index is created to enable text search features on specific column in a table.
Can we build entirely new custom configuration in text analytics?
I do not want to use any of the five standard configuration provided by SAP.
I have built my own .hdbtextconfig file but unable to remove the standard configurations/analysis.
Hi,
I have just found that the columns created with implicit text index allows wild card search ( using like operator ) and usage in an inner join. Has some thing changed recently?
We have created implcit text indices on our table using short text data type already. We are just re evaluating this design. We have around 171 such columns. Should we really neeed to create text index explicitely? Ours is a search dominant application.
Thanks for the article,
Ajith
Hi Esha,
Thank you for writing such a helpful blog!
can you or anyone here please answer some of my queries as
I have searched everything, but I am unable to find the answers, please help.
Thank you,
Regards,
Jagesh