PostgreSQL Search

Ever since I saw that you can search in Postgres with regular expressions, I have wanted to try building a search feature with it. I also need to improve the search feature for this site and learn more about search in Postgres, so I am going to try to build a search feature for this site that can utilize regex.

Date Created:
Last Edited:

References



Preface


  • The goal is to build a search feature that can direct the user to the proper page / section of a page.
  • The layout of pages looks like this:
<html>
<head>
<!-- Links and stuff -->
</head>
<body>
<header role="banner">
<!-- Navbar -->
</header>
<section role="complementary">
<nav>
<!-- Left Sidebar Navigation -->
</nav>
</section>
<div>
<main role="page">
<!-- Main Page -->
</main>
</div>
<footer><!-- Footer --></footer>
<aside>
<!-- Right Sidebar -->
</aside>
</body>
</html>
  • Given the layout of the page, only elements / text inside main[role="page"] should be used for search.
  • The description and page title should probably be used for search.
  • When getting the inner text / elements for a page, you probably want to exclude content inside a <form> element.

I think offering Regex Search in PostgreSQL could open the doorway to ReDoS (Regular Expression Denial of Service) attacks.


Definitions


Denial of Service


A Denial of Service (DoS) attack is focused on making a resource (site, application, server) unavailable for the purpose it was designed for. There are many ways to make a service unavailable for legitimate users: manipulating network packets, or exploiting programming, logical, or resource-handling vulnerabilities, among others. If a service receives a very large number of requests, it may cease to be available to legitimate users.

Sometimes the attacker can inject and execute code while performing a DoS attack in order to access critical information or execute commands on the server. Denial-of-service attacks significantly degrade the service quality experienced by legitimate users. These attacks introduce large response delays, excessive losses, and service interruptions, resulting in a direct impact on availability.

Risk Factors

  • The two principal risk factors for DoS attacks are inadequate resources and non-technical threat motivators.
  • The first risk factor requires attention if the system architecture was not designed to handle overflows in traffic demand.
  • The second risk factor means that an organization should avoid taking actions that could make it the target of a DoS attack unless the benefits of doing so outweigh the potential costs or mitigating controls are in place.

Example - Storing too Much Data in Session

  • Care must be taken not to store too much data in a user session object. Storing too much information in the session, such as large quantities of data retrieved from the database, can cause denial of service issues. The problem is exacerbated if session data is also tracked prior to a login, as a user can launch the attack without the need of an account.


Regular Expression Denial of Service (ReDoS)


Regular Expression Denial of Service (ReDoS) is a Denial of Service attack that exploits the fact that most Regular Expression implementations can reach extreme situations that cause them to work very slowly (exponentially relative to input size). An attacker can then cause a program using a Regular Expression (Regex) to enter these extreme situations and hang for a very long time.

Description

  • The naive Regex algorithm builds a Nondeterministic Finite Automaton (NFA), which is a finite state machine where for each pair of state and input symbol there may be several possible next states. The engine then makes transitions until the end of the input. Since there may be several possible next states, a deterministic algorithm is used: it tries all the possible paths (if needed) until a match is found (or all the paths have been tried and fail).
  • A Regex pattern is called an Evil Regex if it can get stuck on crafted input.
  • Evil Regex contains:
    • Grouping with repetition
    • Inside the repeated group:
      • Repetition
      • Alternation with overlapping
  • Examples of Evil Regex:
    • (a+)+
    • ([a-zA-Z]+)*
    • (a|aa)+
    • (a|a?)+
    • (.*a){x} for x > 10

Risk Factors

  • The Web is Regex-Based:

  • In every layer of the web there are Regular Expressions that might contain an Evil Regex. An attacker can hang a web browser (on a computer or potentially a mobile device), hang a web application firewall, attack a database, and even attack a vulnerable web server.
  • Don't use the same Regex to validate on both the client and the server side.


About Search in Postgres


Introduction

Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document.
  • Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE and ILIKE operators for textual data types, but they lack some essential properties:
    • no linguistic support
    • they provide no ordering (ranking) of search results, which makes them ineffective when many matches are found
    • They tend to be slow because there is no index support, so they must process all documents for every search
  • Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
    • Parsing documents into tokens. PostgreSQL uses a parser to perform this step.
    • Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. This allows searches to find variant forms of the same word, without tediously entering all the possible variants.
    • Storing preprocessed documents optimized for searching
  • Dictionaries allow fine-grained control over how tokens are normalized. With dictionaries, you can:
    • Define stop words that should not be indexed
    • Map synonyms to a single word using Ispell
    • Map phrases to a single word using a thesaurus
    • Map different variations of a word to a canonical form using an Ispell dictionary
    • Map different variations of a word to canonical form using Snowball stemmer rules
  • A data type tsvector is provided for storing preprocessed documents, along with a type tsquery for representing processed queries.
What is a Document?
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes with their parent document. Later, these associations are used to search for documents that contain query words.
  • A document is usually a TEXT column in a table in the database.
  • For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document - the original text need only be retrieved when the document has been selected for display to a user.
Basic Text Matching
  • Full text searching in PostgreSQL is based on the match operator @@ which returns true if a tsvector matches a tsquery (query). It doesn't matter which data type is written first:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
-- true
SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- false
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
-- true
  • A tsquery contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
  • There are functions to_tsquery, plainto_tsquery, and phraseto_tsquery that are helpful in converting user-written text into a proper tsquery, primarily by normalizing words appearing in the text. Similarly, to_tsvector is used to parse and normalize a document string.
  • The elements of a tsvector are lexemes, which are assumed already normalized.
  • The @@ operator also supports text input, allowing explicit conversion of a text string to tsvector or tsquery to be skipped in simple cases.
  • Within a tsquery, the & operator specifies that both its arguments must appear in the document to have a match. The | OR operator specifies that at least one of its arguments must appear, while the ! NOT operator specifies that its argument must not appear in order to have a match.
  • The <-> FOLLOWED BY operator matches only if its arguments have matches that are adjacent and in the given order.
    • The FOLLOWED BY operator has the form <N>, where N is an integer standing for the difference between the positions of the matching lexemes.
  • Parentheses can be used to control nesting of the tsquery operators. Without parentheses, | binds least tightly, then &, then <->, and ! most tightly.
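For example, the following queries illustrate the FOLLOWED BY operator and operator nesting (the positions assume the english configuration, which drops stop words like a and on):
SELECT to_tsvector('english', 'a fat cat sat on a mat') @@ to_tsquery('english', 'fat <-> cat');
-- true ('fat' is at position 2 and 'cat' at position 3)
SELECT to_tsvector('english', 'a fat cat sat on a mat') @@ to_tsquery('english', 'fat <2> mat');
-- false ('mat' is at position 7, not two positions after 'fat')
SELECT to_tsvector('english', 'a fat cat sat on a mat') @@ to_tsquery('english', '!dog & (fat | rat)');
-- true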
Configurations
  • Full text search functionality includes the ability to do many things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing. This functionality is controlled by text search configurations.
  • A configuration is built up from simpler database objects:
    • Text Search parsers break documents into tokens and classify each token
    • Text search dictionaries convert tokens to normalized form and reject stop words
    • Text search templates provide the functions underlying dictionaries
    • Text search configurations select a parser and a set of dictionaries to use to normalize the tokens produced by the parser


Tables and Indexes

Searching a Table
  • You can use to_tsvector on a column or concatenation of columns in a table:
SELECT title FROM pgweb WHERE to_tsvector('english',body) @@ to_tsquery('english','friend');
-- You can leave off the english configuration parameter
SELECT title FROM pgweb WHERE to_tsvector(body) @@ to_tsquery('friend');
Creating Indexes
  • You can create a GIN index to speed up searches:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english',body)); -- You should use the two-argument version of to_tsvector so that index contents are unaffected by default_text_search_config
  • You should probably create a STORED GENERATED column and create an index on that column to speed up searches:
ALTER TABLE pgweb
ADD COLUMN textsearchable_index_col tsvector
GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);
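A search against the generated column can then use that index. A sketch, ordering by ts_rank (covered later in these notes):
SELECT title FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('english', 'create & table')
ORDER BY ts_rank(textsearchable_index_col, to_tsquery('english', 'create & table')) DESC
LIMIT 10;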


Controlling Text Search

To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It's also important to be able to display the results nicely.
Parsing Documents

to_tsvector

  • converts a document to the tsvector data type
  • parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document.
  • The function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token - e.g., rats becomes rat.
    • Some words are recognized as stop words which causes them to be ignored since they occur too frequently to be useful in searching.
    • If no dictionary in the list recognizes the token then it is also ignored
to_tsvector([ config regconfig, ] document text) returns tsvector
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
--------------
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
  • The function setweight can be used to label the entries of a tsvector with a given weight, where weight is one of the letters A, B, C, or D.
UPDATE tt SET ti =
setweight(to_tsvector(coalesce(title,'')), 'A') ||
setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
setweight(to_tsvector(coalesce(abstract,'')), 'C') || -- merge the ts vectors with concatenation
setweight(to_tsvector(coalesce(body,'')), 'D');
Parsing Queries
  • PostgreSQL provides the functions to_tsquery, plainto_tsquery, phraseto_tsquery, and websearch_to_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than either plainto_tsquery or phraseto_tsquery, but it is less forgiving about its input. websearch_to_tsquery is a simplified version of to_tsquery with an alternative syntax, similar to the one used by web search engines.

to_tsquery

to_tsquery([ config regconfig, ] querytext text) returns tsquery
  • to_tsquery creates a tsquery value from querytext which must consist of single tokens separated by the tsquery operators & AND, | OR, ! NOT, and <-> FOLLOWED BY, possibly grouped using parentheses. The input to to_tsquery must already follow the general rules for tsquery input. The difference is that while basic tsquery input takes tokens at face value, to_tsquery normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration.
  • Weights can be attached to each lexeme to restrict it to match only tsvector lexemes of those weights.
  • Also, * can be attached to a lexeme to specify prefix matching
  • It can accept single-quoted phrases - useful when the configuration includes a thesaurus dictionary that may trigger on such phrases (examples below)
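Examples (the first shows stop-word removal, the second weight labels, the third prefix matching):
SELECT to_tsquery('english', 'The & Fat & Rats');
-- 'fat' & 'rat'
SELECT to_tsquery('english', 'Fat | Rats:AB');
-- 'fat' | 'rat':AB
SELECT to_tsquery('supern:*A & star:A*B');
-- 'supern':*A & 'star':*AB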

plainto_tsquery

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery
  • Transforms the unformatted text querytext to a tsquery value. The text is parsed and normalized much as for to_tsquery, then the & (AND) tsquery operator is inserted between surviving words.
  • The plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input
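Examples (note that the & and :C in the second query are treated as plain text, not as operators):
SELECT plainto_tsquery('english', 'The Fat Rats');
-- 'fat' & 'rat'
SELECT plainto_tsquery('english', 'The Fat & Rats:C');
-- 'fat' & 'rat' & 'c'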

phraseto_tsquery

phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery
  • phraseto_tsquery behaves much like plainto_tsquery except that it inserts the <-> FOLLOWED BY operator between surviving words instead of the & AND operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators.
  • This function is useful when searching for exact lexeme sequences, since FOLLOWED BY operators check lexeme order not just the presence of all the lexemes
  • The phraseto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input
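Examples (in the second query the stop word "and" is accounted for with a <2> operator):
SELECT phraseto_tsquery('english', 'The Fat Rats');
-- 'fat' <-> 'rat'
SELECT phraseto_tsquery('english', 'The Cat and Rats');
-- 'cat' <2> 'rat'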

websearch_to_tsquery

websearch_to_tsquery([ config regconfig, ] querytext text) returns tsquery
  • Creates a tsquery value from querytext using an alternative syntax in which simple unformatted text is a valid query. Unlike phrase- and plain- to_tsquery, it also recognizes certain operators. Moreover, this function will never raise syntax errors, which makes it possible to use raw user-supplied input for search. The following syntax is supported:
    • unquoted text: text not inside quote marks will be converted to terms separated by & operators, as if processed by plainto_tsquery
    • "quoted text": text inside quote marks will be converted to terms separated by <-> operators, as if processed by phraseto_tsquery
    • OR: the word or will be converted to the | operator
    • -: a dash will be converted to the ! operator
    • Other punctuation is ignored
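Examples:
SELECT websearch_to_tsquery('english', 'The fat rats');
-- 'fat' & 'rat'
SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
-- 'supernova' <-> 'star' & !'crab'
SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
-- 'signal' & !( 'segment' <-> 'fault' )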


Ranking Search Results
Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.
  • The two ranking functions currently available:
ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer]) returns float4
  • Ranks vectors based on the frequency of their matching lexemes
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
  • Computes the cover density ranking, which also considers how close together the matching lexemes are; it requires positional information, so it returns zero for a stripped tsvector

The weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled.

{D-weight, C-weight, B-weight, A-weight}
-- Default Values:
{0.1, 0.2, 0.4, 1.0}
  • weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
  • Both functions take an integer normalization option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4)
    • 0 (the default) ignores the document length
    • 1 divides the rank by 1 + the logarithm of the document length
    • 2 divides the rank by the document length
    • 4 divides the rank by the mean harmonic distance between extents (only implemented by ts_rank_cd)
    • 8 divides the rank by the number of unique words in document
    • 16 divides the rank by 1+ the logarithm of the number of unique words in document
    • 32 divides the rank by itself + 1
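A typical ranked query looks like the following; the apod table and its textsearch tsvector column are placeholders borrowed from the documentation's example, not part of this site's schema:
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino | (dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
-- pass a normalization option to penalize long documents, e.g. ts_rank_cd(textsearch, query, 32)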
Highlighting Results
To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.
ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
  • ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. Specifically the function will use the query to select relevant text fragments, and then highlight all words that appear in the query, even if those word positions do not match the query's restrictions. The configuration to be used to parse the document can be specified by config. If config is omitted, the default_text_search_config configuration is used.
  • If an options string is specified, it must consist of a comma separated list of one or more option=value pairs. The available options are:
    • MaxWords, MinWords (integers): these numbers determine the longest and shortest headlines to output. The default values are 35 and 15.
    • ShortWord (integer): words of this length or less will be dropped at the start and end of a headline, unless they are query terms. The default value of three eliminates common English articles.
    • HighlightAll (boolean): if true the whole document will be used as the headline, ignoring the preceding three parameters. The default is false
    • MaxFragments (integer): maximum number of text fragments to display. The default value of zero selects a non-fragment based headline generation method. A value greater than zero selects fragment-based headline generation
    • StartSel, StopSel (string): the string with which to delimit query words appearing in the document, to distinguish them from other excerpted words. The default values are <b> and </b>, which can be suitable for HTML output.
      • Might want to replace with <mark>
    • FragmentDelimiter (string): when more than one fragment is displayed, the fragments will be separated by this string. The default is " ... ".
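A small example using <mark> tags; the document string is just inline text here, but in practice you would pass a column:
SELECT ts_headline('english',
'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.',
to_tsquery('english', 'query & similarity'),
'StartSel=<mark>, StopSel=</mark>, MaxWords=20, MinWords=10');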

Additional Features

Manipulating Documents

tsvector || tsvector

  • The concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions appearing in the right hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to performing to_tsvector on the concatenation of the two original document strings.

setweight(vector tsvector, weight "char") returns tsvector

  • setweight returns a copy of the input vector in which every position has been labeled with a given weight, either A, B, C, or D. The weight applies to positions, not lexemes.

length(vector tsvector) returns integer

  • Returns the number of lexemes stored in the vector.

strip(vector tsvector) returns tsvector

  • Returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information. The result is usually much smaller than an unstripped vector, but it is also less useful.
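Quick examples of concatenation and strip (assuming the default configuration is english):
SELECT to_tsvector('english', 'fat cat') || to_tsvector('english', 'ate a rat');
-- 'ate':3 'cat':2 'fat':1 'rat':5  (positions from the right-hand vector were offset by 2)
SELECT strip(to_tsvector('english', 'fat cats ate fat rats'));
-- 'ate' 'cat' 'fat' 'rat'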
Manipulating Queries

tsquery && tsquery

  • Returns the AND-combination of two given queries

tsquery || tsquery

  • Returns the OR-combination of two given queries

!! tsquery

  • Returns the negation (NOT) of the given query

tsquery <-> tsquery

  • Returns a query that searches for a match to the first given query immediately followed by a match to the second given query, using the <-> (FOLLOWED BY) tsquery operator.

tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery

  • returns a query that searches for a match to the first given query followed by a match to the second given query at a distance of exactly distance lexemes, using the <N> tsquery operator.

numnode(query tsquery) returns integer

  • Returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful to determine if the query is meaningful, or contains only stop words (returns 0)

querytree(query tsquery) returns text

  • Returns the portion of a tsquery that can be used for searching an index. This function is useful for detecting unindexable queries, for example those containing only stop words or only negated terms.
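Examples:
SELECT numnode('foo & bar'::tsquery);
-- 3
SELECT numnode(plainto_tsquery('english', 'the any'));
-- 0 (the query contains only stop words)
SELECT querytree(to_tsquery('english', 'defined'));
-- 'defin'
SELECT querytree(to_tsquery('english', '!defined'));
-- T (nothing indexable)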
Query Rewriting
  • The ts_rewrite family of functions search a given tsquery for occurrences of a target subquery and replace each occurrence with a substitute subquery. In essence, this operation is a tsquery-specific version of substring replacement. A target and substitute combination can be thought of as a query rewrite rule. A collection of such rewrite rules can be a powerful search aid.

ts_rewrite(query tsquery, target tsquery, substitute tsquery) returns tsquery

  • This form of ts_rewrite simply applies a single rewrite rule: target is replaced by substitute wherever it appears in query.

ts_rewrite(query tsquery, select text) returns tsquery

  • This form of ts_rewrite accepts a starting query and an SQL select command, which is given as a text string. The select must yield two columns of tsquery type. For each row of the select result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current query value.
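Examples of both forms; the aliases table in the second form is a hypothetical rewrite-rule table with tsquery columns t (target) and s (substitute):
SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
-- 'b' & 'c'
SELECT ts_rewrite('a & b'::tsquery, 'SELECT t, s FROM aliases');
-- applies every target/substitute pair returned by the SELECT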
Triggers for Automatic Updates
tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])
  • These methods have been obsoleted by the use of stored generated columns.
Gathering Document Statistics
ts_stat(sqlquery text, [ weights text, ]
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record
  • The function ts_stat is useful for checking your configuration and for finding stop-word candidates.
  • sqlquery is a text value containing an SQL query which must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned:
    • word text - the value of a lexeme
    • ndoc integer - number of documents (tsvectors) the word occurred in
    • nentry integer - total number of occurrences of the word
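For example, the ten most frequent lexemes in the pgweb table from the indexing example earlier:
SELECT * FROM ts_stat('SELECT textsearchable_index_col FROM pgweb')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;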

Parsers

Text search parsers are responsible for splitting raw document text into tokens and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all - it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications.
  • The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown below:


Alias | Description | Example
asciiword | Word, all ASCII letters | elephant
word | Word, all letters | mañana
numword | Word, letters and digits | beta1
asciihword | Hyphenated word, all ASCII | up-to-date
hword | Hyphenated word, all letters | lógico-matemática
numhword | Hyphenated word, letters and digits | postgresql-beta1
hword_asciipart | Hyphenated word part, all ASCII | postgresql in the context postgresql-beta1
hword_part | Hyphenated word part, all letters | lógico or matemática in the context lógico-matemática
hword_numpart | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1
email | Email address | foo@example.com
protocol | Protocol head | http://
url | URL | example.com/stuff/index.html
host | Host | example.com
url_path | URL path | /stuff/index.html, in the context of a URL
file | File or path name | /usr/local/foo.txt, if not within a URL
sfloat | Scientific notation | -1.234e56
float | Decimal notation | -1.234
int | Signed integer | -1234
uint | Unsigned integer | 1234
version | Version number | 8.3.0
tag | XML tag | <a href="dictionaries.html">
entity | XML entity | &amp;
blank | Space symbols | (any whitespace or punctuation not otherwise recognized)


Dictionaries

The format and location of dictionary files can be seen by following the link above - I don't know whether you can install your own dictionary files in a managed database like Amazon RDS, in which case the database might need to be moved to an EC2 instance.

Dictionaries are used to eliminate words that should not be considered in a search (stop words) and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on the application semantics.

Examples of Normalization:

  • Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings
  • Color names can be replaced by their hexadecimal values


  • A dictionary is a program that accepts a token as input and returns:
    • an array of lexemes if the input token is known to the dictionary
    • a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary)
    • an empty array if the dictionary knows the token, but it is a stop word
    • NULL if the dictionary does not recognize the input token
  • A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. Stop words are discarded. Further dictionaries are usually not consulted after the first dictionary returns a non-NULL result.
  • The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything.


Stop Words
  • Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. Stop words do affect the positions in tsvector, which in turn affect ranking.


Simple Dictionary
  • The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.
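A sketch of creating and testing a dictionary based on the simple template; ts_lexize is covered under Dictionary Testing below, and the english stop word file ships with PostgreSQL:
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
TEMPLATE = pg_catalog.simple,
STOPWORDS = english
);
SELECT ts_lexize('public.simple_dict', 'YeS');
-- {yes}
SELECT ts_lexize('public.simple_dict', 'The');
-- {}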


Synonym Dictionary
  • This dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported (use the thesaurus template for that). A synonym dictionary can be used to overcome linguistic problems - i.e., prevent an English stemmer dictionary from reducing the word Paris to pari.


Thesaurus Dictionary
  • A thesaurus dictionary is a collection of words that includes information about the relationships of words and phrases. Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well.


Ispell Dictionary
  • The Ispell dictionary supports morphological dictionaries, which can normalize many different linguistic forms of a word into the same lexeme. For example, an English Ispell dictionary can match all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks', and bank's.


Snowball Dictionary
  • The Snowball dictionary is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language.


Configuration Example

  • This section shows an example of a text search configuration
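A sketch of a custom configuration along the lines of the documentation's example; it assumes an english_ispell dictionary has already been created from installed Ispell files:
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
-- consult the narrow Ispell dictionary first, then fall back to the Snowball stemmer
ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
WITH english_ispell, english_stem;
-- do not index or search some token types
ALTER TEXT SEARCH CONFIGURATION pg
DROP MAPPING FOR email, url, url_path, sfloat, float;
SET default_text_search_config = 'public.pg';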


Testing and Debugging Text Search

The behavior of a custom text search configuration can easily become confusing. The functions described in this section are useful for testing text search objects. You can test a complete configuration, or test parsers and dictionaries separately.
Configuration Testing
  • ts_debug allows for easy testing of a text search configuration
ts_debug([ config regconfig, ] document text,
OUT alias text,
OUT description text,
OUT token text,
OUT dictionaries regdictionary[],
OUT dictionary regdictionary,
OUT lexemes text[])
returns setof record
  • The function displays information about every token of document as produced by the parser and processed by the configured dictionaries.
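For example:
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'a fat cat sat on a mat');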
Parser Testing
  • The following functions allow direct testing of a text search parser:
ts_parse(parser_name text, document text,
OUT tokid integer, OUT token text) returns setof record
ts_parse(parser_oid oid, document text,
OUT tokid integer, OUT token text) returns setof record
  • ts_token_type returns a table which describes each type of token the specified parser can recognize.
ts_token_type(parser_name text, OUT tokid integer,
OUT alias text, OUT description text) returns setof record
ts_token_type(parser_oid oid, OUT tokid integer,
OUT alias text, OUT description text) returns setof record
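For example:
SELECT * FROM ts_parse('default', '123 - a number');
-- returns (tokid, token) rows: 123 is a uint token, 'a' and 'number' are asciiword tokens
SELECT * FROM ts_token_type('default');
-- lists the 23 token types from the table in the Parsers section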
Dictionary Testing
  • The ts_lexize function facilitates dictionary testing
ts_lexize(dict regdictionary, token text) returns text[]
  • It returns an array of lexemes if the input token is known to the dictionary, or an empty array if the token is known to the dictionary but a stop word, or NULL if it is an unknown word.
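For example, using the built-in english_stem Snowball dictionary:
SELECT ts_lexize('english_stem', 'stars');
-- {star}
SELECT ts_lexize('english_stem', 'a');
-- {}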

Preferred Index Types for Text Search

There are two types of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
  • GIN (Generalized Inverted Index) indexes are the preferred search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels.
  • A GiST index is lossy meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature. The signature length in bytes is determined by the value of the optional integer parameter siglen. The signature is generated by hashing each word into a single bit in an n-bit string, with all these bits OR-ed together to produce an n-bit document signature.
  • Partitioning of big collections and the proper use of GIN and GiST indexes allows the implementation of very fast searches with online update. Partitioning can be done at the database level using table inheritance, or by distributing documents over servers and collecting external search results.


Limitations

  • The current limitations of PostgreSQL's text search features are:
    • The length of each lexeme must be less than 2 kilobytes
    • The length of a tsvector (lexemes + positions) must be less than 1MB
    • The number of lexemes must be less than 2^64
    • Position values in tsvector must be greater than 0 and no more than 16,383
    • The match distance in a <N> tsquery operator cannot be more than 16,384
    • No more than 256 positions per lexeme
    • The number of nodes (lexemes + operators) in a tsquery must be less than 32,768


Coding Search Implementations


  • In this section I am going to go over some common problems that require search solutions and create some boilerplate tables / queries so that I do not have to think so much about search every time I want to implement search functionality.


Different Methods of Searching

  • In the above section, we went over full text search in PostgreSQL
  • There are a few other methods of search that you might want to employ:
    • String Similarity
      • Support for string similarity can be provided using the pg_trgm extension
      • Functions like similarity() and word_similarity() can be used to compute a similarity score between two strings based on trigram matching
    • Semantic Search
      • Semantic search is a search engine technology that interprets the meaning of words and phrases. The results of a semantic search will return content matching the meaning of a query, as opposed to content that literally matches words in a query.
      • You can use APIs like those provided by OpenAI and Google to generate embeddings and then you can store and query the embeddings in PostgreSQL using the pgvector extension.
    • Regex Search
      • Probably best used for searches that include relatively few possible search results that are not too long.
        • I.e., it would be reasonable to employ Regex search for this site, but not for a larger site that might have millions of possibilities.
        • A very long possible search result could create a situation in which the database hangs.
          • In the situation that the database hangs on a query, you would have to cancel it manually or rely on a statement_timeout (a sketch of a guarded regex search follows this list).
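A minimal sketch of what a guarded regex search might look like; the pages table, its title and body columns, and the 200ms timeout are assumptions for illustration, not an existing schema:
BEGIN;
-- SET LOCAL only applies for the rest of this transaction
SET LOCAL statement_timeout = '200ms';
-- $1 is the user-supplied pattern; ~* is case-insensitive regex matching
SELECT title FROM pages WHERE body ~* $1 LIMIT 20;
COMMIT;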


Different Factors Related to Search

  • The number of possible search results
    • If there are a large number of possible search results, then you probably want to use an index and impose a LIMIT on the number of results
      • If you are using string similarity or semantic search, then you may want to define a similarity score above which the search result will be returned, i.e. WHERE similarity > 0.95
  • The length of each possible search result
    • For search results that are likely to be short, it may be best to use string similarity
    • For longer search results, it may be best to use full text search, semantic search, or some combination of the two.
  • Whether or not you need to highlight the specific terms in the query
    • Whether or not you want to highlight the matching terms in the results you return to the user
  • The likelihood of the user knowing exactly what they are looking for
    • For example, if a user is searching for a specific user - they should probably know how to spell the username of the user they are looking for; therefore, it is probably best to use string similarity.
  • How often the search occurs
    • If the search occurs often, you may want to add an index.


Indexes To Use for Different Kinds of Searches

  • For full text search, see the section above. If you are using full text search, it is best to create a STORED GENERATED column that automatically creates a tsvector column for a given document and create an index on that column to speed up searches. It is best to create a GIN or GiST index for full text searches.
  • For semantic search using pgvector, it is best to use the HNSW or IVFFlat index.
    • An HNSW index creates a multilayer graph. It has better query performance than IVFFlat, but has slower build times and uses more memory. I currently use the HNSW index for embeddings.
    • Indexes for pgvector
  • For string similarity search with pg_trgm, it is best to use GIST or GIN indexes. These index types support pg_trgm similarity operators and support trigram based index searches like LIKE, ILIKE, ~, ~*, and = queries.
    • The integer parameter siglen for the GiST index can be set to give a more precise search
    • Index Support pg_trgm
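A sketch of the extension and index setup for the last two bullets; the items table and its name and embedding columns are placeholders:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS vector;
-- trigram index: supports the similarity operators plus LIKE, ILIKE, ~, ~* and = searches
CREATE INDEX items_name_trgm_idx ON items USING GIN (name gin_trgm_ops);
-- HNSW index using cosine distance for embedding queries with <=>
CREATE INDEX items_embedding_hnsw_idx ON items USING hnsw (embedding vector_cosine_ops);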


Search Cases


  • I am going to list out some common search scenarios here and create boiler plate solutions to them.
  1. Case: Searching medium-to-large size documents (> 25 characters, many words)
  • For searches of this type, string similarity search is usually off the table
    1. Case: There are enough keywords in the documents to eliminate the need for semantic search.
CREATE TABLE table_name (
...,
document text NOT NULL,
tsquery_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', document)) STORED
);
-- Search Table
-- $1 is the query from the user, $2 is the max number of rows to return
SELECT ... FROM table_name WHERE websearch_to_tsquery('english',$1) @@ tsquery_vector ORDER BY ts_rank(tsquery_vector, websearch_to_tsquery('english',$1)) DESC LIMIT $2;

-- Search table with highlighting
WITH search_results AS (SELECT document, ... FROM table_name WHERE websearch_to_tsquery('english',$1) @@ tsquery_vector ORDER BY ts_rank(tsquery_vector, websearch_to_tsquery('english',$1)) DESC LIMIT $2) SELECT ts_headline('english', document, websearch_to_tsquery('english',$1), 'StartSel=<mark>, StopSel=</mark>') AS highlighted_text FROM search_results;
If you need the search to be fast
CREATE INDEX full_text_gin_search_idx ON table_name USING GIN (tsquery_vector);
    2. Case: There needs to be some combination of semantic and full text search.
CREATE TABLE table_name (
...,
document text NOT NULL,
tsquery_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', document)) STORED,
embedding vector(1536) NOT NULL
);
-- Search table
-- <=> is the cosine distance, so 1 - (embedding <=> $3) is the cosine similarity
-- 0.75 is an arbitrary threshold for the similarity score, $3 is the query embedding
SELECT ... FROM table_name WHERE websearch_to_tsquery('english',$1) @@ tsquery_vector AND (1 - (embedding <=> $3)) >= 0.75 ORDER BY ts_rank(tsquery_vector, websearch_to_tsquery('english',$1)) DESC LIMIT $2;
If you need the search to be fast
CREATE INDEX full_text_gin_search_idx ON table_name USING GIN (tsquery_vector);
CREATE INDEX embedding_search_hnsw_idx ON table_name USING hnsw (embedding vector_cosine_ops); -- cosine ops to match the <=> queries above
    3. Case: Documents are very similar, so the meaning of the search is much more important than the keywords.
CREATE TABLE table_name (
...,
document text NOT NULL,
embedding vector(1536) NOT NULL
);
-- order by cosine distance so the closest matches come first
SELECT ... FROM table_name WHERE (1 - (embedding <=> $1)) >= 0.75 ORDER BY embedding <=> $1 LIMIT $2;
If you need the search to be fast
CREATE INDEX embedding_search_hnsw_idx ON table_name USING hnsw (embedding vector_cosine_ops);
    4. Case: There are many documents to search (millions of rows in a table)
      1. You probably just want to use full text search in this case, like above.
  2. Case: Searching small documents (< 25 characters, few words)
  • For searches of this type, full text search is usually off the table.
    1. Case: The users should know how to spell the potential search results.
CREATE TABLE table_name (
...,
document text NOT NULL
);
SELECT ... FROM table_name WHERE similarity(document, $1) >= 0.9 ORDER BY similarity(document, $1) DESC;
If you need the search to be fast
CREATE INDEX gist_search_idx ON table_name USING GIST (document gist_trgm_ops); -- requires the pg_trgm extension
    2. Case: The users don't know what the options are or don't know what they are looking for, but they know how to describe it.
CREATE TABLE table_name (
...,
document text NOT NULL,
embedding vector(1536) NOT NULL
);
-- $1 is the user's query text, $2 is its embedding, $3 is the max number of rows
SELECT ... FROM table_name WHERE similarity(document, $1) >= 0.5 OR (1 - (embedding <=> $2)) >= 0.75 ORDER BY embedding <=> $2 LIMIT $3;
If you need the search to be fast
CREATE INDEX gist_search_idx ON table_name USING GIST (document gist_trgm_ops);
CREATE INDEX embedding_search_hnsw_idx ON table_name USING hnsw (embedding vector_cosine_ops);
    3. Case: There are many documents to search (millions of rows in a table)
      1. In this case you should probably avoid using embeddings (maybe, for now) and use a LIMIT

