PostgreSQL Notes
Creating this note to keep track of some things about PostgreSQL that I want to remember / learn about. So far, I have taken notes mainly on PostgreSQL indexes and keys.
Listing All Columns in Table
- Sometimes you want to do this to get information about the table in a quick manner
-- Select most of the relevant information from the table
SELECT column_name, column_default, data_type, is_nullable
FROM information_schema.columns
WHERE table_name='TABLE_NAME';
A Note on Database Indexes
See the documentation for detailed notes on PostgreSQL indexes.
Indexes are a common way to enhance database performance. An index allows the database server to find and retrieve specific rows much faster than it could do without an index. But indexes also add overhead to the database system as a whole, so they should be used sensibly.
Index Operations:
- index_name: Whatever you want to name the index
- table_name: the name of the table on which to create the index
- column_name: the name of the column to create the index on
-- Create Index
CREATE INDEX index_name ON table_name (column_name);
-- Drop Index
DROP INDEX index_name;
In general, PostgreSQL indexes can be used to optimize queries that contain one or more WHERE or JOIN clauses of the form indexed-column indexable-operator comparison-value, where the indexed-column is whatever column or expression the index has been defined on. The indexable-operator is an operator that is a member of the index's operator class for the indexed column, and the comparison-value can be any expression that is not volatile and that does not reference the index's table.
Index Types:
PostgreSQL provides several index types: B-tree, Hash, GiST, SP-GiST, GIN, BRIN, and the extension bloom. Each index type uses a different algorithm that is best suited to different types of indexable clauses.
B-Tree indexes are the default. The other index types are selected using the syntax below. (Replace index_type with one of the types listed above.)
CREATE INDEX index_name ON table_name USING index_type (column_name);
- B-Tree
- B-trees can handle equality and range queries on data that can be sorted into some ordering. Good for queries that use these operators:
<, <=, =, >=, >
- Hash
- Stores a 32-bit hash code derived from the value of the indexed column. Such indexes can only handle simple equality comparisons (=).
- GiST
- The standard distribution of PostgreSQL includes GiST operator classes for several two-dimensional data types, which support indexed queries using these operators:
<<, &<, &>, >>, &<|, |&>, |>>, @>, <@, ~=, &&
- See Section 9.11 of the documentation for the meaning of these operators
- SP-GiST
- The standard distribution of PostgreSQL includes SP-GiST operator classes for two-dimensional points, which support indexed queries using these operators:
<<, >>, ~=, <@, <<|, |>>
- GIN
- GIN indexes are "inverted indexes", which are appropriate for data values that contain multiple component values, such as arrays. An inverted index contains a separate entry for each component value, and can effectively handle queries that test for the presence of specific component values.
- The standard distribution of PostgreSQL includes a GIN operator class for arrays, which supports indexed queries using these operators:
@>, <@, =, &&
- BRIN
- Short for Block Range Indexes, they are effective for columns whose values are well-correlated with the physical order of the table rows. Supported index queries:
<, <=, =, >=, >
You can create multicolumn indexes using the syntax below. Currently, only the B-tree, GiST, GIN, and BRIN index types support multiple-key-column indexes. Multicolumn indexes should be used sparingly; in most situations, an index on a single column is sufficient and saves space and time.
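The multicolumn form is just an ordinary CREATE INDEX with a comma-separated column list; the table and column names here are placeholders.
-- Create a multicolumn index (B-tree by default)
CREATE INDEX index_name ON table_name (column_one, column_two);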
The following index types, provided by the pgvector extension, can be used on columns to speed up queries involving vector embeddings.
- HNSW
- Creates a multilayer graph
- It has better query performance than IVFFlat, but uses more memory and has slower build time. Also, the index can be created without any data in the table. You can add an index for each distance function you want to use. Replace the distance_function below with one of the following distance functions:
- L2 Distance: vector_l2_ops
- The L2 distance (also called Euclidean distance) measures the straight-line distance between two vectors.
- Inner Product: vector_ip_ops
- The inner product (dot product) sums the products of the corresponding components of the two vectors.
- Cosine Distance: vector_cosine_ops
- Cosine similarity is commonly used in Natural Language Processing; it measures the similarity between two vectors regardless of their magnitude.
- L1 Distance: vector_l1_ops
- L1 distance, AKA Manhattan distance, is calculated as the sum of the absolute differences in two vectors' Cartesian coordinates.
- Hamming Distance: bit_hamming_ops
- Counts the number of positions at which two bit vectors differ (how many bit flips convert one into the other). Also used for string similarity.
- Jaccard Distance: bit_jaccard_ops
- Jaccard distance is a statistic used for gauging the similarity and diversity of sample sets. Best used for binary vectors.
CREATE INDEX ON table_name USING hnsw (column_name distance_function);
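As a concrete sketch of the template above (the items table and embedding column are hypothetical), an HNSW index for cosine-distance queries might look like:
-- HNSW index using the cosine distance operator class
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);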
NOTE: There are more considerations for index options, supported types, index build time, and index progress that are not included here (For both HNSW and IVFFlat indexes.)
- IVFFlat
- An IVFFlat index divides vectors into lists, and then searches a subset of those lists that are closest to the query vector. It has faster build times and uses less memory than HNSW, but has lower query performance.
- The keys to getting good recall are:
- Create index after the database has some data
- Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1 million rows and sqrt(rows) for over 1 million rows. (Note: append WITH (lists = lists_number) after the last parenthesis in the query above to specify the number of lists.)
Recall, in database research, is the ratio of the number of relevant records retrieved to the total number of relevant records in the database.
- You can use the same syntax as above (replace hnsw with ivfflat), but you cannot use the L1 distance or Jaccard distance operator classes.
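Putting the lists advice together, here is a sketch for a hypothetical items table with roughly 100,000 rows (rows / 1000 = 100 lists):
-- IVFFlat index over L2 distance with an explicit number of lists
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);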
Negatives of Indexes:
- Indexes can take a while to create on big tables
- They block INSERT, UPDATE, and DELETE operations on the table while the index is being built, so use on production databases is often unacceptable.
- Indexes add overhead to data manipulation operations
- Indexes add size to the database
Indexes that are rarely used should be removed.
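One way to spot rarely used indexes is the pg_stat_user_indexes statistics view, where idx_scan counts index scans since statistics were last reset. This query is a sketch for finding candidates, not a definitive audit:
-- Indexes that have never been scanned since the last stats reset
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;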
PostgreSQL Keys
- Link to PostgreSQL documentation on Indices and Keys
- A primary key is the field(s) used to identify a specific row
- Differences between primary keys and unique field(s):
- Primary key:
- Is used for identifying the row and relating to it
- Is impossible (or hard) to update
- Should not allow NULLs.
- Is not guaranteed by the SQL specification to create an index to make the query faster (though in practice, PostgreSQL does create a unique index to enforce a primary key)
In fact, when you create a primary key or a unique combination of fields, nowhere in the SQL specification does it say that an index is created, nor that the retrieval of data by the key is going to be more efficient than a sequential scan!
- Unique Field(s):
- Are used as an alternative access to the row
- Are updateable, so long as they are kept unique
- NULLs are acceptable
- No secondary keys in PostgreSQL
- You will have to create indexes to make the query more efficient
PostgreSQL Extensions
Taking notes on some of the extensions that I have used and what they are used for.
pg_trgm
- Here is a link to pg_trgm documentation on the PostgreSQL website
The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.
- A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This idea is very effective for measuring the similarity of words in many natural languages.
-- Function for returning trigrams in string: show_trgm(text) -> text[]
show_trgm('word') --> {" w"," wo",ord,"rd ",wor}
show_trgm('two words') --> {" t"," w"," tw"," wo","ds ",ord,rds,two,"wo ",wor}
- The pg_trgm extension offers many functions for measuring similarity between strings.
- The pg_trgm extension offers indexes that can be used to speed up similarity searches, though the indexes may not be too helpful for equality searches.
Trigram matching is a very useful tool when used in conjunction with a full text search index. In particular it can help to recognize misspelled input words that will not be matched directly by the full text search mechanism.
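A minimal sketch of trigram similarity search (the documents table and body column are hypothetical; similarity() and the % operator come from pg_trgm):
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Similarity score between 0 and 1
SELECT similarity('word', 'words');
-- GIN trigram index to speed up similarity searches
CREATE INDEX ON documents USING gin (body gin_trgm_ops);
-- % is true when similarity exceeds pg_trgm.similarity_threshold
SELECT * FROM documents WHERE body % 'misspeled word';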
pgcrypto
- Here is a link to the pgcrypto documentation on the PostgreSQL website
- pgcrypto provides cryptographic functions for PostgreSQL.
- So far, I have mainly used the gen_random_uuid() function from this extension to generate random UUIDs.
- The extension offers implementations of various hashing algorithms.
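A couple of small examples (digest() takes the data and an algorithm name, and returns bytea):
CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT gen_random_uuid();              -- random v4 UUID
SELECT digest('some text', 'sha256');  -- SHA-256 hash as bytea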
plpgsql
- Here is a link to the plpgsql documentation on the PostgreSQL website
- I don't really use plpgsql functions that much any more for queries since complex queries can be more easily implemented (in terms of writing / remembering the function) as Common Table Expressions.
- plpgsql functions can be useful for complex queries though - such as updates to comments
- plpgsql is a procedural language for the PostgreSQL database system. The design goals of PL/pgSQL were to create a loadable procedural language that:
- can be used to create functions, procedures, and triggers,
- adds control structures to the SQL language,
- can perform complex computations,
- inherits all user-defined types, functions, procedures, and operators,
- can be defined to be trusted by the server
- is easy to use
- Functions created with PL/pgSQL can be used anywhere that a built-in function could be used.
Advantages of using PL/pgSQL
- SQL is the language PostgreSQL and most other relational databases use as their query language. It's portable and easy to learn. But every SQL statement must be executed individually by the database server.
That means your client application must send each query to the database server, wait for it to be processed, receive and process the results, do some computation, then send further queries to the server. All this incurs inter-process communication and will also incur network overhead if your client is on a different machine than your database server.
- With plpgsql, you can save considerably on client/server communication overhead.
- This can result in considerable performance increase
- plpgsql functions can accept scalars, arrays, rows, and records and return these values as well
- These functions can return void
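A minimal PL/pgSQL sketch (the function name and logic are made up for illustration):
CREATE OR REPLACE FUNCTION add_integers(a integer, b integer)
RETURNS integer AS $$
BEGIN
    -- Control structures and computation happen server-side
    RETURN a + b;
END;
$$ LANGUAGE plpgsql;

-- Usable anywhere a built-in function is
SELECT add_integers(2, 3);  -- 5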
Here is a video that I watched that helped me first learn plpgsql:
vector
- Here is a link to pgvector on GitHub
- pgvector is an open-source vector similarity search for Postgres
- It can be used to store embedding vectors generated using embedding models
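A hedged sketch of storing and querying embeddings (the items table is hypothetical; <->, <#>, and <=> are pgvector's L2 distance, negative inner product, and cosine distance operators):
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
-- Five nearest neighbors by L2 distance
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;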
postgis
PostGIS extends the capabilities of the PostgreSQL relational database by adding support for storing, indexing, and querying geospatial data.
- I use PostGIS to store geospatial data.
- I plan to use PostGIS to query geospatial data in order to make it such that polls can be restricted to certain geographic areas.
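A sketch of the kind of geographic restriction mentioned above (the polls table, its allowed_area column, and the coordinates are hypothetical; ST_Contains, ST_SetSRID, and ST_MakePoint are standard PostGIS functions):
CREATE EXTENSION IF NOT EXISTS postgis;
-- Polls whose allowed area contains a given longitude/latitude point
SELECT id FROM polls
WHERE ST_Contains(allowed_area, ST_SetSRID(ST_MakePoint(-122.4, 37.8), 4326));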
Aggregate Functions
I sometimes struggle with using the GROUP BY clause in PostgreSQL, so I am going to go over the docs on aggregate functions.
- Like most other relational database products, PostgreSQL supports aggregate functions. An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the count, sum, avg, max, and min over a set of rows.
Example: Find the highest low-temperature reading anywhere with:
SELECT max(temp_lo) FROM weather;
- Aggregate functions cannot be used in the WHERE clause, except within a subquery:
-- WRONG
SELECT city FROM weather WHERE temp_lo = max(temp_lo);
-- Right: use a subquery
SELECT city FROM weather WHERE temp_lo = (SELECT max(temp_lo) FROM weather);
- The subquery is fine because it is an independent computation that computes its own aggregate separately from what is happening in the outer query.
- Aggregates are useful when used with the GROUP BY clause.
- Get the number of readings and the maximum low temperature observed in each city with:
SELECT city, count(*), max(temp_lo) FROM weather GROUP BY city;
city | count | max
---------------+-------+-----
Hayward | 1 | 37
San Francisco | 2 | 46
(2 rows)
- You can filter the grouped rows using HAVING:
SELECT city, count(*), max(temp_lo) FROM weather GROUP BY city HAVING max(temp_lo) < 40;
- The fundamental difference between WHERE and HAVING: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine which rows will be inputs to the aggregates. On the other hand, the HAVING clause always contains aggregate functions. (Strictly speaking, you are allowed to write a HAVING clause that doesn't use aggregates, but it's seldom useful. The same condition could be used more efficiently at the WHERE stage.)
- Another way to select the rows that go into an aggregate computation is to use FILTER, which is a per-aggregate option:
SELECT city, count(*) FILTER (WHERE temp_lo < 45), max(temp_lo) FROM weather GROUP BY city;
- FILTER is much like WHERE, except that it removes rows only from the input of the particular aggregate function that it is attached to.
PostgreSQL Transactions
I am about to do some big updates to the database to change how articles are stored, and I am going to use transactions to keep the code readable and simple. I also need to get some responses from the database intermittently, so I am going to look over transactions in the PostgreSQL documentation.
Transactions are a fundamental concept of all database systems. The essential point of a transaction is that it bundles multiple steps together into a single, all-or-nothing scenario. The intermediate states between the steps are not visible to other concurrent transactions, and if some failure occurs that prevents the transaction from completing, then none of the steps affect the database at all.
- A transaction is said to be atomic: from the point of view of other transactions, it either happens completely or not at all.
- We want a guarantee that once a transaction is completed and acknowledged by the database system, it has been permanently recorded.
- Another important property of transactional databases is closely related to the notion of atomic updates: when multiple transactions are running concurrently, each should not be able to see the incomplete changes made by others.
- Transactions are set up by surrounding SQL statements with BEGIN and COMMIT commands. If, partway through the transaction, you decide you don't want to commit, you can issue the ROLLBACK command.
- PostgreSQL treats every SQL statement as being executed within a transaction. If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT is sometimes called a transaction block.
- It's possible to control the statements in a transaction in a more granular fashion through the use of savepoints. Savepoints allow you to selectively discard parts of the transaction while committing the rest. After defining a savepoint and rolling back to it, all the transaction's changes made in between are discarded, but changes earlier than the savepoint are kept.
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
SAVEPOINT my_savepoint;
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Bob';
-- oops ... forget that and use Wally's account
ROLLBACK TO my_savepoint;
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Wally';
COMMIT;
Pattern Matching
Link to Pattern Matching Documentation
There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE, the more recent SIMILAR TO operator, and POSIX-style regular expressions. Aside from the basic "does this string match this pattern" operators, functions are available to extract or replace matching substrings and to split a string at matching locations.
LIKE
string LIKE pattern [ESCAPE escape-character]
string NOT LIKE pattern [ESCAPE escape-character]
- The LIKE expression returns true if the string matches the supplied pattern.
- If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE acts like the equals operator.
- An underscore (_) in pattern stands for (matches) any single character.
- A percent sign (%) matches any sequence of zero or more characters
SELECT 'abc' LIKE 'abc'; -- true
SELECT 'abc' LIKE 'a%'; -- true
SELECT 'abc' LIKE '_b_'; -- true
SELECT 'abc' LIKE 'c'; -- false
- The LIKE pattern always covers the entire string.
- To match a literal underscore or percent sign, the character must be preceded by an escape character: '\%'
- It is possible to disable escaping by writing ESCAPE ''
- The keyword ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale.
- The operator ~~ is equivalent to LIKE, and the operator ~~* corresponds to ILIKE.
- There are also the !~~ and !~~* operators, which correspond to NOT LIKE and NOT ILIKE respectively.
SIMILAR TO Regular Expressions
string SIMILAR TO pattern [ESCAPE escape-character]
string NOT SIMILAR TO pattern [ESCAPE escape-character]
- The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common (POSIX) regular expression notation.
- The SIMILAR TO operator only succeeds if the regular expression matches the entire string, whereas normal regular expression behavior is that the pattern can match any part of the string.
- SIMILAR TO uses _ and % as wildcard characters denoting any single character and any string, respectively (these are comparable to . and .* in POSIX regular expressions).
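A few examples adapted from the pattern matching docs:
SELECT 'abc' SIMILAR TO 'abc';      -- true
SELECT 'abc' SIMILAR TO 'a';        -- false
SELECT 'abc' SIMILAR TO '%(b|d)%';  -- true
SELECT 'abc' SIMILAR TO '(b|c)%';   -- false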
The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern.
substring(string, pattern, escape-character)
As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null. To indicate the part of the pattern for which the matching data substring is of interest, the pattern should contain two occurrences of the escape character followed by a double quote ("). The text matching the portion of the pattern between these separators is returned when the match is successful.
substring('foobar' similar '%#"o_b#"%' escape '#') -- oob
substring('foobar' similar '#"o_b#"%' escape '#')  -- NULL
POSIX Regular Expressions
Operator | Description | Example
---|---|---
~ | String matches regular expression, case sensitive | 'thomas' ~ 't.*ma' -- true
~* | String matches regular expression, case insensitive | 'thomas' ~* 'T.*ma' -- true
!~ | String does not match regular expression, case sensitive | 'thomas' !~ 't.*max' -- true
!~* | String does not match regular expression, case insensitive | 'thomas' !~* 'T.*ma' -- false
- The POSIX Regular Expressions are similar to regular expressions in other languages.
- There are many useful functions that you should look into that are built into PostgreSQL for string matching with regular expressions.