PostgreSQL Search
Ever since I saw that you can search in Postgres with regular expressions, I have wanted to build a search feature with it. I also need to improve the search feature for this site and learn more about search in Postgres, so I am going to try to build a search feature for this site that can utilize regex.
Preface
- The goal is to build a search feature that can direct the user to the proper page / section of a page.
- The layout of pages looks like this:
<html>
<head>
<!-- Links and stuff -->
</head>
<body>
<header role="banner">
<!-- Navbar -->
</header>
<section role="complementary">
<nav>
<!-- Left Sidebar Navigation -->
</nav>
</section>
<div>
<main role="page">
<!-- Main Page -->
</main>
</div>
<footer><!-- Footer --></footer>
<aside>
<!-- Right Sidebar -->
</aside>
</body>
</html>
- Given the layout of the page, only elements / text inside main[role="page"] should be used for search.
- The description and page title should probably be used for search as well.
- When getting the inner text / elements for a page, you probably want to exclude content inside a <form> element.
I think offering Regex Search in PostgreSQL could open the doorway to ReDoS (Regular Expression Denial of Service) attacks.
Definitions
Denial of Service
The Denial of Service (DoS) attack is focused on making a resource (site, application, server) unavailable for the purpose it was designed for. There are many ways to make a service unavailable to legitimate users by manipulating network packets, or by exploiting programming, logical, or resource-handling vulnerabilities, among others. If a service receives a very large number of requests, it may cease to be available to legitimate users.
Sometimes the attacker can inject and execute code while performing a DoS attack in order to access critical information or execute commands on the server. Denial-of-service attacks significantly degrade the quality of service experienced by legitimate users. These attacks introduce large response delays, excessive losses, and service interruptions, resulting in a direct impact on availability.
Risk Factors
- The two principal risk factors for DoS attacks are inadequate resources and non-technical threat motivators.
- The first example of a risk factor requires attention if system architecture was not designed to meet traffic demand overflows.
- The second risk factor means that an organization should avoid taking action that can make them a target of a DoS attack unless the benefits of doing so outweigh the potential costs or mitigating controls are in place.
Example - Storing too Much Data in Session
- Care must be taken not to store too much data in a user session object. Storing too much information in the session, such as large quantities of data retrieved from the database, can cause denial of service issues. The problem is exacerbated if session data is also tracked prior to a login, as a user can launch the attack without the need of an account.
Regular Expression Denial of Service (ReDoS)
The Regular expression Denial of Service (ReDoS) is a Denial of Service attack that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size). An attacker can then cause a program using a Regular Expression (Regex) to enter these extreme situations and then hang for a very long time.
Description
- The naive Regex algorithm builds a Nondeterministic Finite Automaton (NFA), which is a finite state machine where for each pair of state and input symbol there may be several possible next states. The engine then makes transitions until the end of the input. Since there may be several possible next states, a deterministic algorithm is used: it tries all the possible paths (if needed) until a match is found (or all the paths are tried and fail).
- A Regex pattern is called an Evil Regex if it can get stuck on crafted input.
- Evil Regex contains:
- Grouping with repetition
- Inside the repeated group:
- Repetition
- Alternation with overlapping
- Examples of Evil Regex:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
Risk Factors
- The Web is Regex-Based:
- In every layer of the web there are Regular Expressions that might contain an Evil Regex. An attack can hang a web browser (on a computer or potentially a mobile device), hang a web application firewall, attack a database, and even attack a vulnerable web server.
- Don't use the same Regex to validate the client and server side.
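One mitigation that does not depend on the regex itself is to cap statement runtime in Postgres. A minimal sketch, assuming the regex search runs against the pgweb example table used later in these notes, and that 200ms is an acceptable (arbitrary) ceiling:
-- Cap query runtime for this session so a pathological pattern cannot hang the backend indefinitely
SET statement_timeout = '200ms';
-- A user-supplied pattern ($1) that runs longer than the timeout is cancelled with an error
-- instead of tying up the connection
SELECT title FROM pgweb WHERE body ~* $1;
-- Reset when done (or use SET LOCAL inside a transaction to scope the setting)
RESET statement_timeout;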
About Search in Postgres
Introduction
Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document.
- Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack some essential properties:
  - No linguistic support
  - They provide no ordering (ranking) of search results, which makes them ineffective when many matches are found
  - They tend to be slow because there is no index support, so they must process all documents for every search
- Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
- Parsing documents into tokens. PostgreSQL uses a parser to perform this step.
- Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. This allows searches to find variant forms of the same word, without tediously entering all the possible variants.
- Storing preprocessed documents optimized for searching
- Dictionaries allow fine-grained control over how tokens are normalized. With dictionaries, you can:
- Define stop words that should not be indexed
- Map synonyms to a single word using Ispell
- Map phrases to a single word using a thesaurus
- Map different variations of a word to a canonical form using an Ispell dictionary
- Map different variations of a word to canonical form using Snowball stemmer rules
- A data type tsvector is provided for storing preprocessed documents, along with a type tsquery for representing processed queries.
What is a Document?
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes with their parent document. Later, these associations are used to search for documents that contain query words.
- A document is usually a TEXT column in a table in the database.
- For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document - the original text need only be retrieved when the document has been selected for display to a user.
Basic Text Matching
- Full text searching in PostgreSQL is based on the match operator @@, which returns true if a tsvector matches a tsquery. It doesn't matter which data type is written first:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
-- true
SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- false
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
-- true
- A tsquery contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
- There are functions to_tsquery, plainto_tsquery, and phraseto_tsquery that are helpful in converting user-written text into a proper tsquery, primarily by normalizing words appearing in the text. Similarly, to_tsvector is used to parse and normalize a document string.
- The elements of a tsvector are lexemes, which are assumed to be already normalized.
- The @@ operator also supports text input, allowing explicit conversion of a text string to tsvector or tsquery to be skipped in simple cases.
- Within a tsquery, the & (AND) operator specifies that both its arguments must appear in the document to have a match. The | (OR) operator specifies that at least one of its arguments must appear, while the ! (NOT) operator specifies that its argument must not appear in order to have a match.
- The <-> (FOLLOWED BY) operator matches only if its arguments have matches that are adjacent and in the given order.
  - The FOLLOWED BY operator also has the form <N>, where N is an integer standing for the difference between the positions of the matching lexemes.
- Parentheses can be used to control nesting of the tsquery operators. Without parentheses, | binds least tightly, then &, then <->, and ! most tightly.
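A few examples of these operators, mirroring the PostgreSQL documentation and assuming the default english configuration:
SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');
-- true
SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error');
-- false ('error' and 'fatal' are not adjacent)
SELECT to_tsquery('fat & rat & ! cat');
-- 'fat' & 'rat' & !'cat'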
Configurations
- Full text search functionality includes the ability to do many things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing. This functionality is controlled by text search configurations.
- A configuration is built up from simpler database objects:
- Text Search parsers break documents into tokens and classify each token
- Text search dictionaries convert tokens to normalized form and reject stop words
- Text search templates provide the functions underlying dictionaries
- Text search configurations select a parser and a set of dictionaries to use to normalize the tokens produced by the parser
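The configuration in effect is governed by the default_text_search_config setting; a quick sketch of inspecting and overriding it for a session:
-- In psql, \dF lists the available configurations
SHOW default_text_search_config;
SET default_text_search_config = 'pg_catalog.english';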
Tables and Indexes
Searching a Table
- You can use to_tsvector on a column or concatenation of columns in a table:
SELECT title FROM pgweb WHERE to_tsvector('english',body) @@ to_tsquery('english','friend');
-- You can leave off the english configuration parameter
SELECT title FROM pgweb WHERE to_tsvector(body) @@ to_tsquery('friend');
Creating Indexes
- You can create a GIN index to speed up searches:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
-- You should use the two-argument version of to_tsvector so that index contents are unaffected by default_text_search_config
- You should probably create a STORED GENERATED column and create an index on that column to speed up searches:
ALTER TABLE pgweb
ADD COLUMN textsearchable_index_col tsvector
GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);
Controlling Text Search
To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It's also important to be able to display the results nicely.
Parsing Documents
to_tsvector
- Converts a document to the tsvector data type.
- Parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document.
- The function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token - e.g., rats becomes rat.
- Some words are recognized as stop words, which causes them to be ignored since they occur too frequently to be useful in searching.
- If no dictionary in the list recognizes the token, then it is also ignored.
to_tsvector([ config regconfig, ] document text) returns tsvector
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
--------------
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
- The function setweight can be used to label the entries of a tsvector with a given weight, where weight is one of the letters A, B, C, or D:
UPDATE tt SET ti =
setweight(to_tsvector(coalesce(title,'')), 'A') ||
setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
setweight(to_tsvector(coalesce(abstract,'')), 'C') || -- merge the ts vectors with concatenation
setweight(to_tsvector(coalesce(body,'')), 'D');
Parsing Queries
- PostgreSQL provides the functions to_tsquery, plainto_tsquery, phraseto_tsquery, and websearch_to_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than either plainto_tsquery or phraseto_tsquery, but it is less forgiving about its input. websearch_to_tsquery is a simplified version of to_tsquery with an alternative syntax, similar to the one used by web search engines.
to_tsquery
to_tsquery([ config regconfig, ] querytext text) returns tsquery
- to_tsquery creates a tsquery value from querytext, which must consist of single tokens separated by the tsquery operators & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY), possibly grouped using parentheses. The input to to_tsquery must already follow the general rules for tsquery input. The difference is that while basic tsquery input takes tokens at face value, to_tsquery normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration.
- Weights can be attached to each lexeme to restrict it to match only tsvector lexemes of those weights.
- Also, * can be attached to a lexeme to specify prefix matching.
- It can accept single-quoted phrases - useful for when the configuration includes a thesaurus dictionary that may trigger on such phrases.
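A couple of examples of those extra features, assuming the english configuration (the weight and prefix labels follow the lexeme after a colon):
SELECT to_tsquery('english', 'The & Fat & Rats');
-- 'fat' & 'rat'  (stop word discarded, remaining tokens normalized)
SELECT to_tsquery('english', 'Fat | Rats:AB');
-- 'fat' | 'rat':AB  (match 'rat' only where it carries weight A or B)
SELECT to_tsquery('english', 'postgres:*');
-- 'postgr':*  (prefix match)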
plainto_tsquery
plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
- Transforms the unformatted text querytext to a tsquery value. The text is parsed and normalized much as for to_tsquery, then the & (AND) tsquery operator is inserted between surviving words.
- plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input.
phraseto_tsquery
phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery
- phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <-> (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators.
- This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order, not just the presence of all the lexemes.
- phraseto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input.
websearch_to_tsquery
websearch_to_tsquery([ config regconfig, ] querytext text) returns tsquery
- Creates a tsquery value from querytext using an alternative syntax in which simple unformatted text is a valid query. Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain operators. Moreover, this function will never raise syntax errors, which makes it possible to use raw user-supplied input for search. The following syntax is supported:
  - unquoted text: text not inside quote marks will be converted to terms separated by & operators, as if processed by plainto_tsquery
  - "quoted text": text inside quote marks will be converted to terms separated by <-> operators, as if processed by phraseto_tsquery
  - OR: the word or will be converted to the | operator
  - -: a dash will be converted to the ! operator
  - Other punctuation is ignored
Ranking Search Results
Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document is where they occur.
- The two ranking functions currently available:
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
- Ranks vectors based on the frequency of their matching lexemes
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
- ts_rank_cd computes the cover density ranking for the given document vector.
- This function requires lexeme positional information to perform its calculation.
The weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled.
{D-weight, C-weight, B-weight, A-weight}
-- Default Values:
{0.1, 0.2, 0.4, 1.0}
- weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
- Both functions take an integer normalization option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using |
  - 0 (the default) ignores the document length
  - 1 divides the rank by 1 + the logarithm of the document length
  - 2 divides the rank by the document length
  - 4 divides the rank by the mean harmonic distance between extents (implemented only by ts_rank_cd)
  - 8 divides the rank by the number of unique words in the document
  - 16 divides the rank by 1 + the logarithm of the number of unique words in the document
  - 32 divides the rank by itself + 1
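A sketch of ranking in practice, reusing the pgweb table and generated tsvector column from the indexing example above; the normalization value 32 (rank / (rank + 1)) scales results into the 0..1 range, and the query text is just illustrative:
SELECT title,
       ts_rank_cd(textsearchable_index_col, query, 32) AS rank
FROM pgweb, websearch_to_tsquery('english', 'postgres full text search') query
WHERE query @@ textsearchable_index_col
ORDER BY rank DESC
LIMIT 10;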
Highlighting Results
To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.
ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
- ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. Specifically, the function will use the query to select relevant text fragments, and then highlight all words that appear in the query, even if those word positions do not match the query's restrictions. The configuration to be used to parse the document can be specified by config. If config is omitted, the default_text_search_config configuration is used.
- If an options string is specified, it must consist of a comma-separated list of one or more option=value pairs. The available options are:
  - MaxWords, MinWords (integers): these numbers determine the longest and shortest headlines to output. The default values are 35 and 15.
  - ShortWord (integer): words of this length or less will be dropped at the start and end of a headline, unless they are query terms. The default value of three eliminates common English articles.
  - HighlightAll (boolean): if true, the whole document will be used as the headline, ignoring the preceding three parameters. The default is false.
  - MaxFragments (integer): maximum number of text fragments to display. The default value of zero selects a non-fragment-based headline generation method. A value greater than zero selects fragment-based headline generation.
  - StartSel, StopSel (strings): the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. The default values are <b> and </b>, which can be suitable for HTML output.
    - Might want to replace these with <mark>
  - FragmentDelimiter (string): when more than one fragment is displayed, the fragments will be separated by this string. The default is " ... ".
Additional Features
Manipulating Documents
tsvector || tsvector
- The concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to performing to_tsvector on the concatenation of the two original document strings.
setweight(vector tsvector, weight "char") returns tsvector
- setweight returns a copy of the input vector in which every position has been labeled with the given weight, either A, B, C, or D. Note that the weight applies to positions, not lexemes.
length(vector tsvector) returns integer
- Returns the number of lexemes stored in the vector.
strip(vector tsvector) returns tsvector
- Returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information. The result is usually much smaller than an unstripped vector, but it is also less useful.
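A few quick examples of these functions, assuming the english configuration:
SELECT to_tsvector('english', 'fat cat') || to_tsvector('english', 'fat rat');
-- 'cat':2 'fat':1,3 'rat':4  (positions in the right-hand vector are offset)
SELECT setweight(to_tsvector('english', 'fat cat'), 'A');
-- 'cat':2A 'fat':1A
SELECT length(to_tsvector('english', 'a fat cat sat on a mat'));
-- 4
SELECT strip(to_tsvector('english', 'fat cats ate fat rats'));
-- 'ate' 'cat' 'fat' 'rat'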
Manipulating Queries
tsquery && tsquery
- Returns the AND-combination of two given queries
tsquery || tsquery
- Returns the OR-combination of two given queries
!! tsquery
- Returns the negation (NOT) of the given query
tsquery <-> tsquery
- Returns a query that searches for a match to the first given query immediately followed by a match to the second given query, using the <-> (FOLLOWED BY) tsquery operator.
tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery
- Returns a query that searches for a match to the first given query followed by a match to the second given query at a distance of exactly distance lexemes, using the <N> tsquery operator.
numnode(query tsquery) returns integer
- Returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful to determine if the query is meaningful, or contains only stop words (returns 0)
querytree(query tsquery) returns text
- Returns the portion of a tsquery that can be used for searching an index. This function is useful for detecting unindexable queries, for example those containing only stop words or only negated terms.
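Short examples of these query-manipulation functions; note that casting a literal to tsquery takes the lexemes at face value, with no normalization:
SELECT 'fat | rat'::tsquery && 'cat'::tsquery;
-- ( 'fat' | 'rat' ) & 'cat'
SELECT numnode('(fat & rat) | cat'::tsquery);
-- 5  (three lexemes plus two operators)
SELECT querytree('fat & ! rat'::tsquery);
-- 'fat'  (the negated term cannot be used to search an index)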
Query Rewriting
- The ts_rewrite family of functions search a given tsquery for occurrences of a target subquery and replace each occurrence with a substitute subquery. In essence, this operation is a tsquery-specific version of substring replacement. A target and substitute combination can be thought of as a query rewrite rule. A collection of such rewrite rules can be a powerful search aid.
ts_rewrite(query tsquery, target tsquery, substitute tsquery) returns tsquery
- This form of ts_rewrite simply applies a single rewrite rule: target is replaced by substitute wherever it appears in query.
ts_rewrite(query tsquery, select text) returns tsquery
- This form of ts_rewrite accepts a starting query and an SQL select command, which is given as a text string. The select must yield two columns of tsquery type. For each row of the select result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current query value.
Triggers for Automatic Updates
tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])
- These methods have been obsoleted by the use of stored generated columns.
Gathering Document Statistics
ts_stat(sqlquery text, [ weights text, ]
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record
- The function ts_stat is useful for checking your configuration and for finding stop-word candidates.
- sqlquery is a text value containing an SQL query which must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned are:
  - word text - the value of a lexeme
  - ndoc integer - the number of documents (tsvectors) the word occurred in
  - nentry integer - the total number of occurrences of the word
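For example, to find the most frequent lexemes indexed in the pgweb example table from earlier (assuming its generated textsearchable_index_col column):
SELECT word, ndoc, nentry
FROM ts_stat('SELECT textsearchable_index_col FROM pgweb')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;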
Parsers
Text search parsers are responsible for splitting raw document text into tokens and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser doesn't modify the text at all - it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications.
- The built-in parser is named pg_catalog.default. It recognizes 23 token types - shown below:
Alias | Description | Example |
---|---|---|
asciiword | Word, all ASCII letters | elephant |
word | Word, all letters | mañana |
numword | Word, letters and digits | beta1 |
asciihword | Hyphenated word, all ASCII | up-to-date |
hword | Hyphenated word, all letters | lógico-matemática |
numhword | Hyphenated word, letters and digits | postgresql-beta1 |
hword_asciipart | Hyphenated word part, all ASCII | postgresql in the context postgresql-beta1 |
hword_part | Hyphenated word part, all letters | lógico or matemática in the context lógico-matemática |
hword_numpart | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1 |
email | Email address | foo@example.com |
protocol | Protocol head | http:// |
url | URL | example.com/stuff/index.html |
host | Host | example.com |
url_path | URL path | /stuff/index.html, in the context of a URL |
file | File or path name | /usr/local/foo.txt, if not within a URL |
sfloat | Scientific notation | -1.234e56 |
float | Decimal notation | -1.234 |
int | Signed integer | -1234 |
uint | Unsigned integer | 1234 |
version | Version number | 8.3.0 |
tag | XML tag | <a href="dictionaries.html"> |
entity | XML entity | & |
blank | Space symbols | (any whitespace or punctuation not otherwise recognized) |
Dictionaries
The format and location of dictionary files are described in the PostgreSQL documentation - I don't know if you can implement your own dictionary files in a managed database like Amazon RDS, in which case the database might need to be moved to an EC2 instance.
Dictionaries are used to eliminate words that should not be considered in a search (stop words) and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.
Examples of Normalization:
- Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
- Color names can be replaced by their hexadecimal values
- A dictionary is a program that accepts a token as input and returns:
- an array of lexemes if the input token is known to the dictionary
- a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary)
- an empty array if the dictionary knows the token, but it is a stop word
- NULL if the dictionary does not recognize the input token
- A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. Stop words are discarded. Further dictionaries are usually not consulted after the first dictionary returns a non-NULL result.
- The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything.
Stop Words
- Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. Stop words do affect the positions in a tsvector, which in turn affect ranking.
Simple Dictionary
- The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file, then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.
Synonym Dictionary
- This dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported (use the thesaurus template for that). A synonym dictionary can be used to overcome linguistic problems - e.g., to prevent an English stemmer dictionary from reducing the word Paris to pari.
Thesaurus Dictionary
- A thesaurus dictionary is a collection of words that includes information about the relationships of words and phrases. Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well.
Ispell Dictionary
- The Ispell dictionary template supports morphological dictionaries, which can normalize many different linguistic forms of a word into the same lexeme. For example, an English Ispell dictionary can match all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks', and bank's.
Snowball Dictionary
- The Snowball dictionary is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language.
Configuration Example
- This section shows an example of a text search configuration.
Testing and Debugging Text Search
The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.
Configuration Testing
- ts_debug allows for easy testing of a text search configuration:
ts_debug([ config regconfig, ] document text,
OUT alias text,
OUT description text,
OUT token text,
OUT dictionaries regdictionary[],
OUT dictionary regdictionary,
OUT lexemes text[])
returns setof record
- The function displays information about every token of document as produced by the parser and processed by the configured dictionaries.
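For example, mirroring the documentation's example:
SELECT * FROM ts_debug('english', 'The Brightest supernovaes');
-- one row per token, showing which dictionary handled it and the resulting lexemes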
Parser Testing
- The following functions allow direct testing of a text search parser:
ts_parse(parser_name text, document text,
OUT tokid integer, OUT token text) returns setof record
ts_parse(parser_oid oid, document text,
OUT tokid integer, OUT token text) returns setof record
- ts_token_type returns a table which describes each type of token the specified parser can recognize:
ts_token_type(parser_name text, OUT tokid integer,
OUT alias text, OUT description text) returns setof record
ts_token_type(parser_oid oid, OUT tokid integer,
OUT alias text, OUT description text) returns setof record
Dictionary Testing
- The ts_lexize function facilitates dictionary testing:
ts_lexize(dict regdictionary, token text) returns text[]
- It returns an array of lexemes if the input token is known to the dictionary, an empty array if the token is known to the dictionary but is a stop word, or NULL if it is an unknown word.
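For example, against the built-in english_stem Snowball dictionary:
SELECT ts_lexize('english_stem', 'stars');
-- {star}
SELECT ts_lexize('english_stem', 'a');
-- {}  (known, but a stop word)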
Preferred Index Types for Text Search
There are two types of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
- GIN (Generalized Inverted Index) indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels.
- A GiST index is lossy, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature. The signature length in bytes is determined by the value of the optional integer parameter siglen. The signature is generated by hashing each word into a single bit in an n-bit string, with all these bits OR-ed together to produce an n-bit document signature.
- Partitioning of big collections and the proper use of GIN and GiST indexes allows the implementation of very fast searches with online update. Partitioning can be done at the database level using table inheritance, or by distributing documents over servers and collecting external search results.
Limitations
- The current limitations of PostgreSQL's text search features are:
  - The length of each lexeme must be less than 2 kilobytes
  - The length of a tsvector (lexemes + positions) must be less than 1MB
  - The number of lexemes must be less than 2^64
  - Position values in a tsvector must be greater than 0 and no more than 16,383
  - The match distance in a <N> tsquery operator cannot be more than 16,384
  - No more than 256 positions per lexeme
  - The number of nodes (lexemes + operators) in a tsquery must be less than 32,768
Coding Search Implementations
- In this section I am going to try to go over some common problems which require search solutions and create some boilerplate tables / queries so that I do not have to think so much about search every time I want to implement search functionality.
Different Methods of Searching
- In the above section, we went over full text search in PostgreSQL
- There are a few other methods of search that you might want to employ:
- String Similarity
- Support for string similarity can be provided using the pg_trgm extension
- Functions like similarity() and word_similarity() can be used to score how similar two strings are (see the sketch after this list)
- Semantic Search
- Semantic search is a search engine technology that interprets the meaning of words and phrases. The results of a semantic search will return content matching the meaning of a query, as opposed to content that literally matches words in a query.
- You can use APIs like those provided by OpenAI and Google to generate embeddings, and then you can store and query the embeddings in PostgreSQL using the pgvector extension.
- Regex Search
- Probably best used for searches that include relatively few possible search results that are not too long.
- I.e., it would be reasonable to employ Regex search for this site, but not for a larger site that might have millions of possibilities.
- A very long possible search result could create a situation in which the database hangs.
- In the situation that the database hangs on a query, you will want something like a statement_timeout in place so the query is cancelled rather than holding the connection (see the mitigation sketch in the ReDoS section above).
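A sketch of the pg_trgm functions mentioned above; the pages table and title column are hypothetical placeholders:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT similarity('postgres', 'postgresql');
-- a score between 0 and 1
SELECT word_similarity('postgres', 'notes on postgresql search');
-- The % operator returns true when similarity exceeds pg_trgm.similarity_threshold
SELECT title FROM pages WHERE title % 'postgres serch' ORDER BY similarity(title, 'postgres serch') DESC;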
Different Factors Related to Search
- The number of possible search results
  - If there are a large number of possible search results, then you probably want to use an index and impose a LIMIT on the number of results
    - If you are using string similarity or semantic search, then you may want to define a similarity score above which a search result will be returned, e.g., WHERE similarity > 0.95
- The length of each possible search result
- For search results that are likely to be short, it may be best to use string similarity
- For longer search results, it may be best to use full text search, semantic search, or some combination of the two.
- Whether or not you need to highlight the specific terms in the query
- Whether or not you want to highlight matching terms in the results you return to the user
- The likelihood of the user knowing exactly what they are looking for
- For example, if a user is searching for a specific user - they should probably know how to spell the username of the user they are looking for; therefore, it is probably best to use string similarity.
- How often the search occurs
  - If the search occurs often, you may want to add an index.
Indexes To Use for Different Kinds of Searches
- For full text search, see the section above. If you are using full text search, it is best to create a STORED GENERATED column that automatically computes a tsvector for a given document, and to create a GIN or GiST index on that column to speed up searches.
- For semantic search using pgvector, it is best to use the HNSW or IVFFlat index types (see the pgvector documentation on indexes).
- An HNSW index creates a multilayer graph. It has better query performance than IVFFlat, but has slower build times and uses more memory. I currently use the HNSW index for embeddings.
- For string similarity search with pg_trgm, it is best to use GiST or GIN indexes. These index types support the pg_trgm similarity operators and support trigram-based index searches for LIKE, ILIKE, ~, ~*, and = queries.
- The integer parameter siglen for the GiST index can be set to give a more precise search; a sketch of these indexes follows.
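A sketch of the trigram index options, using this document's table_name / document placeholders; the siglen value is arbitrary and requires PostgreSQL 13 or later:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX document_trgm_gin_idx ON table_name USING GIN (document gin_trgm_ops);
-- Or a GiST index, where siglen can be tuned:
CREATE INDEX document_trgm_gist_idx ON table_name USING GIST (document gist_trgm_ops(siglen = 256));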
Search Cases
- I am going to list out some common search scenarios here and create boilerplate solutions for them.
- Case: Searching medium-to-large size documents (> 25 characters, many words)
- For searches of this type, string similarity search is usually off the table
- Case: There are enough keywords in the documents to eliminate the need for semantic search.
CREATE TABLE table_name (
...,
document text NOT NULL,
tsquery_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', document)) STORED
);
-- Search Table
-- $1 is the query from the user, $2 is the max number to return
SELECT ... FROM table_name WHERE tsquery_vector @@ websearch_to_tsquery('english',$1) ORDER BY ts_rank(tsquery_vector, websearch_to_tsquery('english',$1)) DESC LIMIT $2;
-- Search table with highlighting
WITH search_results AS (SELECT document, ... FROM table_name WHERE tsquery_vector @@ websearch_to_tsquery('english',$1) ORDER BY ts_rank(tsquery_vector, websearch_to_tsquery('english',$1)) DESC LIMIT $2) SELECT ts_headline('english', document, websearch_to_tsquery('english',$1), 'StartSel=<mark>, StopSel=</mark>') AS highlighted_text FROM search_results;
If you need the search to be fast
CREATE INDEX full_text_gin_search_idx ON table_name USING GIN (tsquery_vector);
- Case: There needs to be some combination of semantic and full text search.
CREATE TABLE table_name (
...,
document text NOT NULL,
tsquery_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', document)) STORED,
embedding vector(1536) NOT NULL
);
-- Search table
-- <=> is the cosine distance operator, so 1 - (embedding <=> $3) is the cosine similarity
-- 0.75 is an arbitrary value for the similarity score
SELECT ... FROM table_name WHERE tsquery_vector @@ websearch_to_tsquery('english',$1) AND 1 - (embedding <=> $3) >= 0.75 ORDER BY ts_rank(tsquery_vector, websearch_to_tsquery('english',$1)) DESC LIMIT $2;
If you need the search to be fast
CREATE INDEX full_text_gin_search_idx ON table_name USING GIN (tsquery_vector);
CREATE INDEX embedding_search_hnsw_idx ON table_name USING hnsw (embedding vector_cosine_ops); -- vector_cosine_ops matches the <=> operator used above
- Case: Documents are very similar, so the meaning of the search is much more important than the keywords.
CREATE TABLE table_name (
...,
document text NOT NULL,
embedding vector(1536) NOT NULL
);
-- <=> is the cosine distance operator; 1 - (embedding <=> $1) is the cosine similarity, with 0.75 as an arbitrary threshold
SELECT ... FROM table_name WHERE 1 - (embedding <=> $1) >= 0.75 ORDER BY embedding <=> $1 LIMIT $2;
If you need the search to be fast
CREATE INDEX embedding_search_hnsw_idx ON table_name USING hnsw (embedding vector_cosine_ops);
- Case: There are many documents to search (millions of rows in a table)
- You probably just want to use full text search in this case, like above.
- Case: Searching small documents (<25 characters, few words)
- For searches of this type, full text search is usually off the table.
- Case: The users should know how to spell the potential search results.
CREATE TABLE table_name (
...,
document text NOT NULL
);
SELECT ... FROM table_name WHERE similarity(document,$1) >= 0.9 ORDER BY similarity(document,$1) DESC;
If you need the search to be fast
CREATE INDEX gist_search_idx ON table_name USING GIST (document gist_trgm_ops);
- Case: The users don't know what the options are or don't know what they are looking for, but they know how to describe it.
CREATE TABLE table_name (
...,
document text NOT NULL,
embedding vector(1536) NOT NULL
);
-- $1 is an embedding of the user's description, $2 is the max number to return; 0.75 is an arbitrary similarity threshold
SELECT ... FROM table_name WHERE 1 - (embedding <=> $1) >= 0.75 ORDER BY embedding <=> $1 LIMIT $2;
If you need the search to be fast
CREATE INDEX gist_search_idx ON table_name USING GIST (document gist_trgm_ops);
CREATE INDEX embedding_search_hnsw_idx ON table_name USING hnsw (embedding vector_cosine_ops);
- Case: There are many documents to search (millions of rows in a table)
- In this case you should probably avoid using embeddings (maybe, for now) and use a LIMIT