Functionality overview about PostgreSQL full text search
According to definition in wikipedia, full text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).
In other words, full text search is to find all documents containing given query terms and return them in order of their similarity to the query.
PostgreSQL supports full text search with following features.
- Highlighting results
- Multiple languages
- Fuzzy search for misspelling
- Accent support
1. Parsing Documents
A document is the unit of searching in a full-text search system; for example, a magazine article or email message.
PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type which stores processed documents. The function calls a parser which parses the textual document into tokens, reduces the tokens to lexemes, and returns a vector which lists the lexemes together with their positions in the document.
During the process of breaking the document text into tokens, the parser assigns a type to each token. Depending on the token type, each token is processed by a different list of dictionaries.
With help of dictionaries, some words are recognized as stop words, for example a, on, the. These words will be ignored since they are unuseful in searching.
Some words are recognized as normalized lexemes to represent the token. For example, rats becomes rat.
If no dictionary recognizes the token, the token is ignored.
The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration. It is possible to have many different configurations in the same database, and predefined configurations are available for various languages.
2. Parsing Queries
PostgreSQL provides the functions to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery for converting a query to the tsquery data type.
These functions normalizes each token in the query input into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration.
The query input may consist of single tokens separated by the operators AND, OR, NOT, and FOLLOWED BY, possibly grouped using parentheses.
With FOLLOWED BY operator, you can do phrase search to find words next to each other.
Also, * can be attached to a lexeme to specify prefix matching. Such a lexeme will match any word that begins with the given string.
The to_tsquery can accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.
3. Ranking Search Results
PostgreSQL supports to get search results ordered by relevance. The ranking of documents is based on many factors, e.g. how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.
Different applications might require additional information for ranking, e.g., document modification time. Besides the built-in ranking functions, you can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
PostgreSQL provides two predefined ranking functions.
- ts_rank, ranks vectors based on the frequency of their matching lexemes.
- ts_rank_cd, is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.
The function setweight is to assign a weight value to a tsvector. The value can be ‘A’, ‘B’, ‘C’ or ‘D’. Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
Based on the weights assigned to parts of document the ts_rank() returns a floating number which represents the relevancy of document against the query.
4. Highlighting Results
To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL pro- vides a function ts_headline that implements this functionality.
The function ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted.
5. Triggers for Automatic Updates
One common practice is to use a separate column to store the tsvector representation of documents. Then it is necessary to create a trigger to update the tsvector column when the document content columns change. Two built-in trigger functions are available for this.
However, this method has been obsoleted by the use of stored generated columns, as described below.
A stored generated column can keep this column automatically up to date with its source data.
Then you can create an index to speed up the search.
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.
The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme. Alternatively, the dictionary can be configured to report non-stop-words as unrecognized, allowing them to be passed on to the next dictionary in the list.
The synonym dictionary template is used to create dictionaries that replace a word with a synonym.
A thesaurus dictionary is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc. Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term.
The Ispell dictionary template supports morphological dictionaries, which can normalize many different linguistic forms of a word into the same lexeme.
Snowball now provides stemming algorithms for many languages (see the Snowball site for more information). Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language. A Snowball dictionary requires a language parameter to identify which stemmer to use, and optionally can specify a stopword file name that gives a list of words to eliminate.
Postgres ships with an extension call unaccent which is useful to unaccentuate content. For example the word était will be coverted as etait by the function.
PostgreSQL comes with extenstion pg_trgm which provides support for trigram. It can find strings with similar characters. The function similarity can calculate the similarity between two strings. In this way you can detect misspelling.
It’s common not to look for misspellings on every query. Instead, you could query for misspellings only when the search returns no results. It is also possible that your data may contain misspellings, for example social posts. Then you may obtain good results by appending the similar lexeme to your tsquery.
Some missing features are as below.
- Additional built-in language support. eg: Chinese, Japanese
- More advanced features for the ranking of results
- A way to do fuzzy search without having to use trigram would be nice. Elasticsearch offers a simple way to do fuzzy search queries.