SAP HANA Full Text Index

Esha1 · ‎02-15-2018

In this blog, we will cover Full Text Index. Full Text Index is created to enable text search features on specific column in a table. Conceptually, for efficient operational processing, full-text indexes perform search on columns in the same way that indexes support searching through books.

Full Text index is bound to a defined column of a table. DML (Insert, Update and Delete) and DDL (Create, Drop and Alter) commands on the table can impact/update the full text index. Full Text Index is automatically dropped when you drop a table or column on which index is created.

Additionally, to achieve high search performance, it is recommended to create a Full Text Index.

Figure 1 below shows the relationship between database table and full text index.

Figure 1 Relation between database table and full text index

*******************************************************************************

Below is the list of currently supported data types, mime types and supported languages:

Supported data types

explicit full-text index: (N)VARCHAR, (N)CLOB, BLOB, ALPHANUM

implicit full-text index: SHORTTEXT, TEXT, BINTEXT

Supported mime types

pdf, Microsoft Office, Open Office, plain text, html, rtf, xml, and more. Supported MIME types are listed in "SYS"."M_TEXT_ANALYSIS_MIME_TYPES" view.

Supported languages

English, French, German, Chinese… 34 languages. Supported languages are listed in "SYS"."M_TEXT_ANALYSIS_LANGUAGES" view.

*******************************************************************************

Two Types of Full Text Indexing are available:

Explicit Full Text Index

Implicit Full Text Index

Explicit Full Text Index

Index is separately created after creation of table. In this case, content of table is stored in its original format. Manual creation of index on specific column creates a hidden column which contains pre-processed information for that specified column. Hidden column is referred for search queries.

Figure 2 Explicit Full Text index

Processing for manually created indexes can be performed either synchronously [Insert/Update calls do not return the control to application until all documents have been pre-processed and inserted into the full text index] or asynchronously [Insert/Update calls return the control to application before all documents are pre-processed and inserted into the index] which is discussed in detail later in this blog.

Explicit Full Text Index supports all the standard SQL functions.

Find below the syntax for creating Explicit Full Text Index

CREATE FULLTEXT INDEX <index-name> ON <table> (<column>) <options>

Implicit Full Text Index

When a table is created using TEXT, BINTEXT or SHORTTEXT column, SAP HANA automatically creates a corresponding full-text index on specified column. This is used in scenarios where search operation is dominant requirement and memory consumption is critical. Figure 3 shows the Implicit Full Text Index.

Figure 3 Implicit Full Text Index

In this case, data is not stored in original format while pre-processed form is like an index on end of book, each keyword located at end of book with page number in Appendix. However, if you fire a query to select data on that column, original data is reconstructed out of full text index.

In contrast to Explicit Full text index, Implicit Full text index doesn’t support all the standard SQL functions. Few limitations exist with Implicit Full text index. For example: you can’t use the column to define the join between two tables, and you can’t use string operator “like”.

Find below the syntax for creating Implicit Full Text Index

CREATE COLUMN TABLE <table-name> (<columns> <TEXT/BINTEXT/SHORTTEXT>);

**************************************************************************************

Pre-processing and Synchronization

Pre-processing is triggered when a full-text index is created on a column with existing data and when new data is written to the column – either by an insert or an update operation.

Processing steps involving below operations are explained in previous blog[https://blogs.sap.com/2018/02/01/sap-hana-text-analysis-3/].

Document filtering

Language detection

Tokenization

Stemming

Full-text indexes in a SAP HANA database must be created and updated in synchronization with the corresponding column of database table.

Two modes of doing synchronization

Synchronous Processing

Asynchronous Processing

Synchronous Processing

In SAP HANA, indexes of the type TEXT and SHORTTEXT(n) are synchronous in nature. Text pre-processing is automatically performed when a column is created or new text is inserted into column results in immediate update of the full-text index with latest information.

Figure 4 shows synchronous processing wherein the system cannot transport any data to the database tables until text pre-processing is complete. After the completion of pre-processing task, control is returned to application/triggering statement for further execution.

Figure 4 Synchronous Processing

When you manually create a full-text index, you can specify whether the index is synchronous or asynchronous.

Asynchronous Processing

Text pre-processing is not initially performed when the table or column is created or whenever new text is inserted. In this case, inserting the results of the text pre-processing and writing the original data do not occur at the same time.

With asynchronous pre-processing the execution of the triggering statement does not wait. When the application writes data into a column with a full-text index, SAP HANA puts the data itself into the column and control goes back to application immediately for further execution. Along with this, SAP HANA puts requests for pre-processing into the pre-processing queue which can be monitored using System View M_FULLTEXT_QUEUES as shown in Figure 5.

Figure 5 Asynchronous Processing

After the pre-processing, the results are put into a result queue and written to the full-text index in a separate transaction. You can specify that the results are collected in the result queue and written to the index in one batch, for example after a specified period, or after a given number of records have been processed. This can be configured with the FLUSH EVERY option. Note that this option only controls when the results are taken from the queue after pre-processing.

Below is the syntax for configuring result queue scheduling

CREATE FULLTEXT INDEX <index-name> ON <table> (<column>) FAST PREPROCESS OFF ASYNC FLUSH EVERY <x> MINUTES

Fast pre-processing is the default behaviour. With fast pre-processing option, you can speed up pre-processing, however with a simplified and less powerful processing. With fast pre-processing, the language detection just returns the default language, which is 'EN'. Linguistic analysis is skipped, and only a simple tokenizer is used. This simple tokenizer does not work for languages which do not use spaces as word separators (like Japanese). It cannot handle binary documents either.

Differences between Synchronous and Asynchronous processing highlighted in table below:

*********************************************************************************

Monitoring Full Text Index

System views are available in SAP HANA to query information about the system status using SQL commands. They exist in SYS schema. To monitor the full text index, refer to below listed system views

M_FULLTEXT_QUEUES System View: Provides full text index queue status (active or suspended). Number of pending, completed and failed text analysis request can be monitored via this system view.

FULLTEXT_INDEXES System View: Provides information about full text indexes on table columns like scheduling mode, flush scheduling details.

M_TEXT_ANALYSIS_LANGUAGES System View: Provides information on supported valid languages.

M_TEXT_ANALYSIS_MIME_TYPES System View: Provides information on supported MIME types.

To summarize Full Text Index is created to enable text search features on specific column in a table.

SAP HANA Full Text Index

Useful documents on SCN

Evolution of ABAP

Analytics in S/4HANA - real shape of embedded analytics and beyond embedded analytics