Creating Full Text Search For The Website Using PostgreSQL and Maybe Word Embeddings

Why Implement Search

I got tired of trying to create the SVG editor, and it seems like it is going to be more difficult than I thought, so I decided to implement search for this site instead, just to get something done.
I have also tried to implement search in the past, and this time I would like to make it faster.
I also need to figure out the best way to implement Lexical with search.

I am using
  1. Part 1 and
  2. Part 2
of this xata.io blog to guide me. Also, this Medium blog article has some good tips.

Key Terms

tsvector
tsvector is a PostgreSQL data type that stores a sorted list of lexemes
lexeme
A lexeme is a string, like a token, but it has been normalized so that different forms of the same word match (e.g., by stripping suffixes from words and by removing overly common words - e.g. I, to, an - from the string). Below is an example that shows an English text string being converted to a tsvector.
SELECT * FROM unnest(to_tsvector('english',
'I''m going to make him an offer he can''t refuse. Refusing is not an option.'));

lexeme | positions | weights
--------+-----------+---------
go     | {3}       | {D}
m      | {2}       | {D}
make   | {5}       | {D}
offer  | {8}       | {D}
option | {17}      | {D}
refus  | {12,13}   | {D,D}
(6 rows)

tsquery
The tsquery data type represents a normalized search query: lexemes combined with the operators & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY).
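For example, to_tsquery normalizes each term the same way to_tsvector does, and the @@ operator tests a tsvector against a tsquery. A small sketch (standard PostgreSQL, nothing site-specific):

```sql
-- 'refusing' and 'refused' both normalize to the lexeme 'refus',
-- so this match succeeds even though the exact words differ.
SELECT to_tsvector('english', 'He refused the offer')
       @@ to_tsquery('english', 'refusing & offer');
```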
GIN (Generalized Inverted Index)
GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. This means that GIN can be used for more than just text search, notably for JSON querying.
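For text search specifically, a rough sketch of how this might look (the posts table, its columns, and the index name here are all hypothetical; the generated column requires PostgreSQL 12+):

```sql
-- Hypothetical posts table with a precomputed tsvector column
CREATE TABLE posts (
    id     serial PRIMARY KEY,
    title  text,
    body   text,
    search tsvector GENERATED ALWAYS AS
        (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED
);

-- GIN index over the lexemes so @@ queries don't have to scan every row
CREATE INDEX posts_search_idx ON posts USING GIN (search);

SELECT id, title
FROM posts
WHERE search @@ to_tsquery('english', 'offer & refuse');
```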
ts_rank
PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs. The two ranking functions mentioned are ts_rank and ts_rank_cd. The difference between them is that while they both take into account the frequency of the term, ts_rank_cd also takes into account the proximity of matching lexemes to each other.
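Assuming a hypothetical posts table with a tsvector column named search, a ranked query might look roughly like this:

```sql
SELECT id, title,
       ts_rank(search, query)    AS rank,     -- term frequency only
       ts_rank_cd(search, query) AS rank_cd   -- cover density: also rewards proximity
FROM posts,
     to_tsquery('english', 'offer & refuse') AS query
WHERE search @@ query
ORDER BY rank_cd DESC
LIMIT 10;
```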

Things to remember

  • When you use tsvector in PostgreSQL, each lexeme is stored along with its position(s) in the original string and an optional weight.
  • To find the different language configurations that you can use for to_tsvector, run the following:
    SELECT cfgname FROM pg_ts_config;
  • There are functions like to_tsquery, plainto_tsquery, and websearch_to_tsquery that are helpful in converting user-written text into a proper tsquery, primarily by normalizing the words appearing in the text.
  • You don't have to rank search results by word frequency alone (ts_rank); you can also take into account other variables by coming up with your own algorithm. This is called Relevancy Tuning.
  • You can add typo tolerance / fuzzy search for queries that may contain misspelled words by creating a Materialized View of all the lexemes in the database and then correcting the spelling of a word when its Levenshtein distance to a known lexeme is below some threshold (2 or 3 is good). Should probably try to find a node module that can do this faster.
  • Use Faceted Search to allow users to only search for a certain type of information.
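A quick way to see the difference between the query-parsing functions from the list above is to run them side by side:

```sql
-- plainto_tsquery: normalizes the words and ANDs them together
SELECT plainto_tsquery('english', 'refusing the offer');
-- websearch_to_tsquery: supports web-style syntax such as -term for exclusion
SELECT websearch_to_tsquery('english', 'refuse offer -option');
```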
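The fuzzy-search idea from the list above can be sketched with the fuzzystrmatch extension's levenshtein function and ts_stat, which collects every distinct lexeme from a tsvector column (the posts table and its search column are hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Materialized view of every distinct lexeme in the corpus
CREATE MATERIALIZED VIEW unique_lexemes AS
SELECT word FROM ts_stat('SELECT search FROM posts');

-- Given a word the user typed, find the closest known lexeme
-- within an edit distance of 2
SELECT word
FROM unique_lexemes
WHERE levenshtein(word, 'refusse') <= 2
ORDER BY levenshtein(word, 'refusse'), word
LIMIT 1;
```

One thing to remember with this approach: the view goes stale as content changes, so it needs a periodic REFRESH MATERIALIZED VIEW unique_lexemes;.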