
In this blog, we’ll discuss the remaining Text Mining functions: the function for classifying documents and the functions for finding top-ranked related, relevant and suggested terms.

Document Classification or Categorization

One category of Text Mining functions is document classification, or categorization. The SQL function in HANA that performs this operation is TM_CATEGORIZE_KNN.

TM_CATEGORIZE_KNN

The k-nearest neighbor (KNN) algorithm predicts or classifies objects based on their similarity and closeness to the available labelled data. This function classifies an input document with respect to sets of categories.

In document categorization, K-nearest neighbor classification works as follows:

  • Requires a “reference set” of previously classified documents
  • Takes an input document and returns the most likely categories for it by comparing it to the documents in the reference set
  • The KNN classifier determines the K nearest neighbors (the most similar documents) in the reference set, then sums and normalizes their similarities per category value to determine the winning category value
  • The return table contains the suggested categorizations for the target document, each with a weight (score value)
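
The scoring step in the list above can be sketched in a few lines of Python. This is only an illustration, not HANA's implementation: the source does not say which similarity measure HANA uses, so cosine similarity over simple word counts is an assumption, and the names `vectorize`, `cosine`, `categorize_knn` and the `REFERENCE` data are hypothetical. Every document is mapped onto the same vocabulary (a term that is missing counts as zero), so documents of different lengths can still be compared.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def categorize_knn(document, reference_set, k=3):
    """Return (category, score) pairs, best first.

    reference_set is a list of (text, category) pairs, i.e. the
    previously classified documents.
    """
    query = vectorize(document)
    scored = [(cosine(query, vectorize(text)), category)
              for text, category in reference_set]
    # Keep the k most similar reference documents.
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Sum the similarities per category value ...
    totals = defaultdict(float)
    for similarity, category in neighbors:
        totals[category] += similarity
    overall = sum(totals.values())
    if overall == 0:
        return []
    # ... and normalize so that the scores add up to 1.
    ranked = [(category, total / overall) for category, total in totals.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical reference set of already classified documents.
REFERENCE = [
    ("ocean currents and marine biology", "oceanography"),
    ("deep ocean temperature measurements", "oceanography"),
    ("galaxy formation and dark matter", "astronomy"),
    ("telescope survey of distant galaxy clusters", "astronomy"),
]

result = categorize_knn("ocean temperature and marine life", REFERENCE, k=3)
for category, score in result:
    print(category, round(score, 3))
```

With this toy data, two of the three nearest neighbors belong to the oceanography category, so it receives the larger share of the normalized score.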

When a new data point is to be classified, its distance from each of the labelled data points is computed. This is one of the simplest machine learning algorithms. Figure 1 illustrates how a new data point is predicted with this algorithm.

Figure 1: k-nearest neighbor algorithm

In the figure, we need to predict the category of the new data point (the triangle shown in green). On the left-hand side, the new data point is classified as category ‘1’ because the majority of the nearest neighbors inside the circle belong to category 1. On the right-hand side, the new data point is classified as category ‘0’ because the majority of the nearest neighbors inside the circle belong to category 0. The classification changes (from category 1 to category 0) with the neighborhood distance that is considered.
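
The effect of the neighborhood size can be reproduced with a small majority-vote sketch. The coordinates and labels below are made up for illustration and are not taken from Figure 1; which neighborhood size yields which category is likewise an arbitrary choice, and `predict` is a hypothetical helper name.

```python
import math
from collections import Counter


def predict(point, labelled_points, k):
    """Majority vote among the k labelled points closest to `point`."""
    by_distance = sorted(labelled_points,
                         key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: ((x, y), category)
LABELLED = [
    ((0.5, 0.0), 1),
    ((0.0, 0.6), 1),
    ((0.7, 0.0), 0),
    ((0.0, 1.5), 0),
    ((1.6, 0.0), 0),
    ((0.0, -1.7), 0),
    ((-1.8, 0.0), 1),
]

new_point = (0.0, 0.0)
small_neighborhood = predict(new_point, LABELLED, k=3)  # nearest labels: 1, 1, 0
large_neighborhood = predict(new_point, LABELLED, k=6)  # nearest labels: 1, 1, 0, 0, 0, 0
print(small_neighborhood, large_neighborhood)
```

The same point is assigned to category 1 when only the three closest neighbors vote and to category 0 when six neighbors vote, which is why the choice of K matters.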

Syntax: 
TM_CATEGORIZE_KNN ( 
<tm_document>
<tm_search_categorize_knn>
{ <tm_return_category>, ….}
)

Where
<tm_document> := 
DOCUMENT { <string>  [ LANGUAGE <string> ] [ MIME TYPE <string> ]    
| ( <subquery> )   [ LANGUAGE <string> ] [ MIME TYPE <string> ]
| IN FULLTEXT INDEX WHERE <condition> }

Specify the document that you want to categorize in <tm_document>. You can provide the text as a string, provide a SELECT subquery, or identify a document that is already part of the full-text index by restricting it with a WHERE clause.

<tm_search_categorize_knn> := 
SEARCH NEAREST NEIGHBORS
{ <knn_int> | DEFAULT } 
<reference_column> FROM <reference_table> [ WHERE <condition> ] 
[ WITH TERM TYPE <string>, ... ]
….

In <tm_search_categorize_knn>, you specify how to search for the nearest neighbors, using the syntax above. Specify the reference documents through the reference column and table. The set of reference documents can be restricted with a WHERE condition or a WITH TERM TYPE clause.

<tm_return_category> := 
RETURN TOP { <top_int> | DEFAULT } 
<category_column> FROM <category_table>
[ JOIN <reference_column>  ON <primary key of category table> = <primary key of reference table>  ]

<tm_return_category> specifies the maximum number of category results to return for the specified category column. The category column may be in the same table as the reference documents, or in a different table that is connected through the JOIN clause. More than one <tm_return_category> clause can be given, separated by commas.

In this example, we pin the input down to one document with “Federal_award_id_number = 1304684”. The input document is identified within the full-text index by its document number and is run against the term-document matrix (the text mining index) to fetch the top 5 nearest neighbors. The score is the weight of each suggested category: the higher the value, the better the category fits the document.

Figure 2 shows the result of function TM_CATEGORIZE_KNN.

Figure 2: Result Set of TM_CATEGORIZE_KNN

*******************************************************************************

Term Functions

TM_GET_RELATED_TERMS: This text mining function returns the top-ranked related terms for a query term, based on a set of reference documents.

Syntax
TM_GET_RELATED_TERMS ( 
<tm_term> 
<tm_search> 
<tm_return_term> )

Where
<tm_term> :=  
 TERM <string>  [ LANGUAGE <string> ]

Specifies the term and the language to be processed. The input can be a single term or multiple terms, with optional term types and wildcards, for example: sap, sap:noun (sap used as a noun).

<tm_search> := 
SEARCH <column>   FROM <table>   [ WHERE <condition> ]
[ WITH TERM TYPE <string>, ... ]

Specifies the set of reference documents in <column> and <table>. The specified column must be of type text or must have a full-text index. The set of documents can be restricted with WHERE conditions and with the WITH TERM TYPE clause.

<tm_return_term> :=    
RETURN     
[ PRINCIPAL COMPONENTS <pc int> ]    -- output FACTORS, ROTATED_FACTORS
[ CLUSTERING [<string>] ]            -- output CLUSTER_LEVEL, CLUSTER_LEFT,
                                     -- CLUSTER_RIGHT
[ CORRELATION ]                      -- output CORRELATIONS
TOP { <top int> | DEFAULT }

For an explanation of the options above, refer to the previous section. If specified, the options PRINCIPAL COMPONENTS, CLUSTERING and CORRELATION must be used in this order. TOP must always be specified as the last option.

In this example, the input is the term “ocean”, which is run against the term-document matrix (the text mining index) to return the 5 top-ranked related terms. The ranking is based on co-occurrences. Figure 3 shows the result of function TM_GET_RELATED_TERMS.
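
To make "based on co-occurrences" concrete, here is a minimal sketch that ranks terms by the number of reference documents in which they appear together with the query term. How HANA actually weights co-occurrences is not described in this blog, so plain document counts are an assumption; `related_terms` and the `DOCS` data are hypothetical.

```python
from collections import Counter


def related_terms(term, documents, top=5):
    """Rank terms by the number of documents in which they co-occur with `term`."""
    term = term.lower()
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        if term in tokens:
            # Every other term in this document co-occurs with the query term.
            counts.update(tokens - {term})
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:top]


# Hypothetical reference documents.
DOCS = [
    "ocean circulation climate",
    "ocean acidification coral reefs",
    "climate models ocean circulation",
    "galaxy formation dark matter",
]

related = related_terms("ocean", DOCS, top=3)
print(related)
```

Terms from the fourth document never appear in the result, because that document does not contain the query term.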

Figure 3: Result Set of TM_GET_RELATED_TERMS

TM_GET_RELEVANT_TERMS: This text mining function returns the top-ranked relevant terms that describe a document.

Syntax:
TM_GET_RELEVANT_TERMS ( 
<tm_document> 
<tm_search> 
<tm_return_term> )

<tm_document> := 
 DOCUMENT { <string>  [ LANGUAGE <string> ] [ MIME TYPE <string> ]   
 | ( <subquery> )   [ LANGUAGE <string> ] [ MIME TYPE <string> ]   
 | IN FULLTEXT INDEX WHERE <condition> }

You can provide the text as a string, provide a SELECT subquery, or identify a document that is already part of the full-text index by restricting it with a WHERE clause.

<tm_search> := 
SEARCH <column>   FROM <table>   [ WHERE <condition> ]
[ WITH TERM TYPE <string>, ... ]

Specifies the set of reference documents in <column> and <table>. The specified column must be of type text or must have a full-text index. The set of documents can be restricted with WHERE conditions and with the WITH TERM TYPE clause.

<tm_return_term> :=   
RETURN      
[ PRINCIPAL COMPONENTS <pc int> ]    -- output FACTORS, ROTATED_FACTORS
[ CLUSTERING [<string>] ]            -- output CLUSTER_LEVEL, CLUSTER_LEFT,
                                     -- CLUSTER_RIGHT
[ CORRELATION ]                      -- output CORRELATIONS
TOP { <top int> | DEFAULT }

For an explanation of the options above, refer to the previous section. If specified, the options PRINCIPAL COMPONENTS, CLUSTERING and CORRELATION must be used in this order. TOP must always be specified as the last option.

In this example, we again pin the input down to one document with “Federal_award_id_number = 1304684”. The document is run against the term-document matrix (the text mining index) to fetch its top 5 relevant terms. The result contains the relevant terms, the normalized terms (with capitalization and diacritics removed) and the term type, which gives the part of speech determined by text analysis. This example shows how text mining and text analysis complement each other. Figure 4 shows the result of function TM_GET_RELEVANT_TERMS.
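
The two ideas in this example, term normalization and relevance ranking, can be sketched as follows. The normalization step mirrors what is described above (lower-casing and removing diacritics). The ranking uses a common TF-IDF weighting as a stand-in, because this blog does not state how HANA weights terms. The names `normalize`, `relevant_terms` and the sample texts are hypothetical.

```python
import math
import unicodedata
from collections import Counter


def normalize(term):
    """Lower-case a term and strip diacritics, e.g. 'Café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def relevant_terms(document, reference_documents, top=5):
    """Rank the terms of `document` by a TF-IDF score (an assumed weighting)."""
    tokenized = [[normalize(t) for t in text.split()]
                 for text in reference_documents]
    # Number of reference documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    term_freq = Counter(normalize(t) for t in document.split())
    scores = {
        term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:top]


# Hypothetical reference documents and input document.
REFERENCE_DOCS = ["Ocean circulation study", "Ocean café survey", "Galaxy survey"]
ranking = relevant_terms("Ocean Café café study", REFERENCE_DOCS, top=3)
print([term for term, _ in ranking])
```

Terms that are frequent in the input document but rare in the reference set rank highest, while a term such as "ocean", which appears in most reference documents, is pushed down.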

Figure 4: Result Set of TM_GET_RELEVANT_TERMS

TM_GET_SUGGESTED_TERMS: This text mining function returns the top-ranked terms that match an initial substring. It can be used for type-ahead or auto-completion features.

Syntax:
TM_GET_SUGGESTED_TERMS ( 
<tm_term> 
<tm_search> 
<tm_return_top> )

Where
<tm_term> := 
TERM <string>  [ LANGUAGE <string> ]

Specifies the term and the language to be processed.

<tm_search> := 
SEARCH <reference column>   FROM <reference table>   [ WHERE <condition> ]
[ WITH TERM TYPE <string>, ... ]

Specifies the set of reference documents in <reference column> and <reference table>. The specified column must be of type text or must have a full-text index. The set of documents can be restricted with WHERE conditions and with the WITH TERM TYPE clause.

<tm_return_top> :=
RETURN TOP { <top int> | DEFAULT }

Specifies the number of top returned terms.

In this example, the input is a term that is run against the term-document matrix (the text mining index) to return the top 5 suggestions. Figure 5 shows the result of function TM_GET_SUGGESTED_TERMS.
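
A type-ahead lookup of this kind can be sketched as a prefix match over the terms of the reference documents. Ranking the matches by how often they occur is an assumption, since this blog does not describe HANA's ranking; `suggested_terms` and the `DOCS` data are hypothetical.

```python
from collections import Counter


def suggested_terms(prefix, documents, top=5):
    """Return the most frequent terms that start with `prefix`."""
    prefix = prefix.lower()
    counts = Counter(
        token
        for text in documents
        for token in text.lower().split()
        if token.startswith(prefix)
    )
    # Most frequent first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:top]


# Hypothetical reference documents.
DOCS = [
    "ocean circulation climate",
    "oceanography of the ocean floor",
    "coral reefs and climate",
]

suggestions = suggested_terms("oc", DOCS)
print(suggestions)
```

Typing a longer prefix narrows the list, which is the behavior an auto-completion field needs.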

Figure 5: Result Set of TM_GET_SUGGESTED_TERMS

For Document Functions, refer to the previous blog [https://blogs.sap.com/2018/02/18/sap-hana-text-mining-functions-part1/]. For details on SAP HANA Text Mining, refer to the blog [https://blogs.sap.com/2018/02/16/sap-hana-text-mining/].


1 Comment


  1. Hobart Liu

    Esha, good blog, very helpful.

     

    There is one thing not very clear to me, when I read KNN classifier part. Because each document has different number of words, they probably have different dimensions, how does KNN compute distance between two documents in this case?

     

    Many thanks!

    Hobart
