This post was originally written by Glenn Paulley and posted to sybase.com in May of 2008
In a recent post, James Hamilton of Microsoft Research espoused his enthusiasm for full text search, and I couldn’t agree more: extending relational database systems with full text search capability adds considerable functionality and flexibility, and permits classes of applications that previously could not be supported efficiently.
In SQL Anywhere, text search – part of the SQL Anywhere Panorama release – is fully integrated into the server proper. For system administrators and DBAs, this means that there isn’t yet another software package to install, maintain, and coordinate with the other server components: it’s all contained within the server, which means that you have access to full text search, using Google-style search terms, on any platform, including Windows CE. This integration also means that application developers can take full advantage of both “ordinary” relational operators and full-text search in the same query, and the entire request is processed within a single, unified access plan.
A useful feature of full-text search is document (row) ranking. Rather than (simply) return all matching rows to the search in an arbitrary order, ranked search permits annotation of results with scores that are computed using an inverse-document frequency (IDF) function, an idea originally developed by Karen Sparck Jones and her colleagues at City University in London in the 1980’s and refined over time. The IDF function utilized by SQL Anywhere is a variant of the BM25 algorithm, which ranks documents (rows) using a probabilistic weighting scheme based on the inverse frequency of the query terms appearing in each document (that is, terms that appear rarely in a document are more meaningful than if they appear more often). The weighting scheme disregards any implied relationship between query terms within a single row, for example due to their relative proximity. The formula to compute the “score” of a document (row) is as follows:
In the above formula, q represents the query terms, nq is the vocabulary of the collection being searched, N is the number of documents (rows) in the collection, f(q,D) is the frequency of the qterms in the specific document (row) D, || D || is the length of the document, and ADL is the average document length. The factors k1 and b are (somewhat arbitrary) constants. In SQL Anywhere, k1 = 1.2 and b = 0.75.
Row scores are returned when the query specifies the
CONTAINS stored procedure, rather than using the
CONTAINS predicate in the query’s
WHERE clause. Scores, accompanied by other constructions such as
SELECT TOP N offer a richer means of selecting the “right” rows for a specific database request.