PostgreSQL Notes
Creating this note to keep track of some things about PostgreSQL that I want to remember / learn about. So far, I have taken notes mainly on PostgreSQL indexes and keys.
Listing All Columns in Table
- Sometimes you want to do this to get information about the table in a quick manner
-- Select most of the relevant information from the table
SELECT column_name, column_default, data_type, is_nullable
FROM information_schema.columns
WHERE table_name='TABLE_NAME';
A Note on Database Indexes
See the documentation for detailed notes on PostgreSQL indexes.
Indexes are a common way to enhance database performance. An index allows the database server to find and retrieve specific rows much faster than it could do without an index. But indexes also add overhead to the database system as a whole, so they should be used sensibly.
Index Operations:
- index_name: Whatever you want to name the index
- table_name: the name of the table on which to create the index
- column_name: the name of the column to create the index on
-- Create Index
CREATE INDEX index_name ON table_name (column_name);
-- Drop Index
DROP INDEX index_name;
In general, PostgreSQL indexes can be used to optimize queries that contain one or more WHERE or JOIN clauses of the form indexed-column indexable-operator comparison-value, where the indexed-column is whatever column or expression the index has been defined on. The indexable-operator is an operator that is a member of the index's operator class for the indexed column, and the comparison-value can be any expression that is not volatile and that does not reference the index's table.
Index Types:
PostgreSQL provides several index types: B-tree, Hash, GiST, SP-GiST, GIN, BRIN, and the extension bloom. Each index type uses a different algorithm that is best suited to different types of indexable clauses.
B-Tree indexes are the default. The other index types are selected using the syntax below. (Replace index_type with one of the types listed above.)
CREATE INDEX index_name ON table_name USING index_type (column_name);
- B-Tree
- B-trees can handle equality and range queries on data that can be sorted into some ordering. Good for queries that use these operators:
<, <=, =, >=, >
- Hash
- Stores a 32-bit hash code derived from the value of the indexed column. Such indexes can only handle simple equality comparisons (=).
- GiST
- The standard distribution of PostgreSQL includes GiST operator classes for several two-dimensional data types, which support indexed queries using these operators:
<<, &<, &>, >>, &<|, |&>, |>>, @>, <@, ~=, &&
- See Section 9.11 of the documentation for the meaning of these operators
- SP-GiST
- The standard distribution of PostgreSQL includes SP-GiST operator classes for two-dimensional points, which support indexed queries using these operators:
<<, >>, ~=, <@, <<|, |>>
- GIN
- GIN indexes are "inverted indexes", which are appropriate for data values that contain multiple component values, such as arrays. An inverted index contains a separate entry for each component value, and can effectively handle queries that test for the presence of specific component values.
- The standard distribution of PostgreSQL includes a GIN operator class for arrays, which supports indexed queries using these operators:
@>, <@, =, &&
- BRIN
- Short for Block Range Indexes, they are effective for columns whose values are well-correlated with the physical order of the table rows. Supported index queries:
<, <=, =, >=, >
You can create multicolumn indexes using the syntax below. Currently, only the B-tree, GiST, GIN, and BRIN index types support multiple-key-column indexes. Multicolumn indexes should be used sparingly; in most situations, an index on a single column is sufficient and saves space and time.
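The multicolumn form is just an ordinary CREATE INDEX with a comma-separated column list; the table and column names here are placeholders.
-- Create a multicolumn index (B-tree by default)
CREATE INDEX index_name ON table_name (column_one, column_two);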
The following index types, provided by the pgvector extension, can be used on columns to speed up queries involving vector embeddings.
- HNSW
- Creates a multilayer graph
- It has better query performance than IVFFlat, but uses more memory and has slower build time. Also, the index can be created without any data in the table. You can add an index for each distance function you want to use. Replace the distance_function below with one of the following distance functions:
- L2 Distance: vector_l2_ops
- The L2 distance (also called Euclidean distance) measures the straight-line distance between two vectors.
- Inner Product: vector_ip_ops
- The inner product (dot product) sums the products of the corresponding components of the two vectors.
- Cosine Distance: vector_cosine_ops
- Cosine similarity is commonly used in Natural Language Processing; it measures the similarity between two vectors regardless of their magnitude.
- L1 Distance: vector_l1_ops
- L1 distance, AKA Manhattan distance, is calculated as the sum of the absolute differences in two vectors' Cartesian coordinates.
- Hamming Distance: bit_hamming_ops
- Counts the number of positions at which two bit vectors differ (how many bit flips convert one into the other). Also used for string similarity.
- Jaccard Distance: bit_jaccard_ops
- Jaccard distance is a statistic used for gauging the similarity and diversity of sample sets. Best used for binary vectors.
CREATE INDEX ON table_name USING hnsw (column_name distance_function);
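As a concrete sketch of the template above (the items table and embedding column are hypothetical), an HNSW index for cosine-distance queries might look like:
-- HNSW index using the cosine distance operator class
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);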
NOTE: There are more considerations for index options, supported types, index build time, and index progress that are not included here (For both HNSW and IVFFlat indexes.)
- IVFFlat
- An IVFFlat index divides vectors into lists, and then searches a subset of those lists that are closest to the query vector. It has faster build times and uses less memory than HNSW, but has lower query performance.
- The keys to getting good recall are:
- Create index after the database has some data
- Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1 million rows and sqrt(rows) for over 1 million rows. (Note: append WITH (lists = lists_number) after the last parenthesis in the query above to specify the number of lists.)
Recall, in database research, is the ratio of the number of relevant records retrieved to the total number of relevant records in the database.
- You can use the same syntax as above (replace hnsw with ivfflat), but you cannot use the L1 distance or Jaccard distance operator classes.
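Putting the lists advice together, here is a sketch for a hypothetical items table with roughly 100,000 rows (rows / 1000 = 100 lists):
-- IVFFlat index over L2 distance with an explicit number of lists
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);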
Negatives of Indexes:
- Indexes can take a while to create on big tables
- They block INSERT, UPDATE, and DELETE operations on the table while the index is being built, so use on production databases is often unacceptable.
- Indexes add overhead to data manipulation operations
- Indexes add size to the database
Indexes that are rarely used should be removed.
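One way to spot rarely used indexes is the pg_stat_user_indexes statistics view, where idx_scan counts index scans since statistics were last reset. This query is a sketch for finding candidates, not a definitive audit:
-- Indexes that have never been scanned since the last stats reset
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;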
PostgreSQL Keys
- Link to PostgreSQL documentation on Indices and Keys
- A primary key is the field(s) used to identify a specific row
- Differences between primary keys and unique field(s):
- Primary key:
- Is used for identifying the row and relating to it
- Is impossible (or hard) to update
- Should not allow NULLs.
- Is not guaranteed by the SQL specification to create an index to make the query faster (though in practice, PostgreSQL does create a unique index to enforce a primary key)
In fact, when you create a primary key or a unique combination of fields, nowhere in the SQL specification does it say that an index is created, nor that the retrieval of data by the key is going to be more efficient than a sequential scan!
- Unique Field(s):
- Are used as an alternative access to the row
- Are updateable, so long as they are kept unique
- NULLs are acceptable
- No secondary keys in PostgreSQL
- You will have to create indexes to make the query more efficient
PostgreSQL Extensions
Taking notes on some of the extensions that I have used and what they are used for.
pg_trgm
- Here is a link to pg_trgm documentation on the PostgreSQL website
The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.
- A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This idea is very effective for measuring the similarity of words in many natural languages.
-- Function for returning trigrams in string: show_trgm(text) -> text[]
show_trgm('word') --> {" w"," wo",ord,"rd ",wor}
show_trgm('two words') --> {" t"," w"," tw"," wo","ds ",ord,rds,two,"wo ",wor}
- The pg_trgm extension offers many functions for measuring similarity between strings.
- The pg_trgm extension offers indexes that can be used to speed up similarity searches, though the indexes may not be too helpful for equality searches.
Trigram matching is a very useful tool when used in conjunction with a full text search index. In particular it can help to recognize misspelled input words that will not be matched directly by the full text search mechanism.
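A minimal sketch of trigram similarity search (the documents table and body column are hypothetical; similarity() and the % operator come from pg_trgm):
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Similarity score between 0 and 1
SELECT similarity('word', 'words');
-- GIN trigram index to speed up similarity searches
CREATE INDEX ON documents USING gin (body gin_trgm_ops);
-- % is true when similarity exceeds pg_trgm.similarity_threshold
SELECT * FROM documents WHERE body % 'misspeled word';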
pgcrypto
- Here is a link to the pgcrypto documentation on the PostgreSQL website
- pgcrypto provides cryptographic functions for PostgreSQL.
- So far, I have mainly used the gen_random_uuid() function from this extension to generate random UUIDs.
- The extension offers implementations of various hashing algorithms.
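A couple of small examples (digest() takes the data and an algorithm name, and returns bytea):
CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT gen_random_uuid();              -- random v4 UUID
SELECT digest('some text', 'sha256');  -- SHA-256 hash as bytea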
plpgsql
- Here is a link to the plpgsql documentation on the PostgreSQL website
- I don't really use plpgsql functions that much any more for queries since complex queries can be more easily implemented (in terms of writing / remembering the function) as Common Table Expressions.
- plpgsql functions can be useful for complex queries though - such as updates to comments
- plpgsql is a procedural language for the PostgreSQL database system. The design goals of PL/pgSQL were to create a loadable procedural language that:
- can be used to create functions, procedures, and triggers,
- adds control structures to the SQL language,
- can perform complex computations,
- inherits all user-defined types, functions, procedures, and operators,
- can be defined to be trusted by the server
- is easy to use
- Functions created with PL/pgSQL can be used anywhere that a built-in function could be used.
Advantages of using PL/pgSQL
- SQL is the language PostgreSQL and most other relational databases use as their query language. It's portable and easy to learn. But every SQL statement must be executed individually by the database server.
That means your client application must send each query to the database server, wait for it to be processed, receive and process the results, do some computation, then send further queries to the server. All this incurs inter-process communication and will also incur network overhead if your client is on a different machine than your database server.
- With plpgsql, you can save considerably on client/server communication overhead.
- This can result in considerable performance increase
- plpgsql functions can accept scalars, arrays, rows, and records and return these values as well
- These functions can return void
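A minimal PL/pgSQL sketch (the function name and logic are made up for illustration):
CREATE OR REPLACE FUNCTION add_integers(a integer, b integer)
RETURNS integer AS $$
BEGIN
    -- Control structures and computation happen server-side
    RETURN a + b;
END;
$$ LANGUAGE plpgsql;

-- Usable anywhere a built-in function is
SELECT add_integers(2, 3);  -- 5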
Here is a video that I watched that helped me first learn plpgsql:
vector
- Here is a link to pgvector on GitHub
- pgvector is an open-source vector similarity search for Postgres
- It can be used to store embedding vectors generated using embedding models
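A hedged sketch of storing and querying embeddings (the items table is hypothetical; <->, <#>, and <=> are pgvector's L2 distance, negative inner product, and cosine distance operators):
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
-- Five nearest neighbors by L2 distance
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;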
postgis
PostGIS extends the capabilities of the PostgreSQL relational database by adding support for storing, indexing, and querying geospatial data.
- I use PostGIS to store geospatial data.
- I plan to use PostGIS to query geospatial data in order to make it such that polls can be restricted to certain geographic areas.
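A sketch of the kind of geographic restriction mentioned above (the polls table, its allowed_area column, and the coordinates are hypothetical; ST_Contains, ST_SetSRID, and ST_MakePoint are standard PostGIS functions):
CREATE EXTENSION IF NOT EXISTS postgis;
-- Polls whose allowed area contains a given longitude/latitude point
SELECT id FROM polls
WHERE ST_Contains(allowed_area, ST_SetSRID(ST_MakePoint(-122.4, 37.8), 4326));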
Aggregate Functions
I sometimes struggle with using the GROUP BY clause in PostgreSQL, so I am going to go over the docs on aggregate functions.
- Like most other relational database products, PostgreSQL supports aggregate functions. An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the count, sum, avg, max, and min over a set of rows.
Example: Find the highest low-temperature reading anywhere with:
SELECT max(temp_lo) FROM weather;
- Aggregate functions cannot be used in the WHERE clause, except within a subquery:
-- WRONG
SELECT city FROM weather WHERE temp_lo = max(temp_lo);
-- Right: use a subquery
SELECT city FROM weather WHERE temp_lo = (SELECT max(temp_lo) FROM weather);
- The subquery is fine because it is an independent computation that computes its own aggregate separately from what is happening in the outer query.
- Aggregates are useful when used with the GROUP BY clause.
- Get the number of readings and the maximum low temperature observed in each city with:
SELECT city, count(*), max(temp_lo) FROM weather GROUP BY city;
city | count | max
---------------+-------+-----
Hayward | 1 | 37
San Francisco | 2 | 46
(2 rows)
- You can filter the grouped rows using HAVING:
SELECT city, count(*), max(temp_lo) FROM weather GROUP BY city HAVING max(temp_lo) < 40;
- The fundamental difference between WHERE and HAVING: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine which rows will be inputs to the aggregates. On the other hand, the HAVING clause always contains aggregate functions. (Strictly speaking, you are allowed to write a HAVING clause that doesn't use aggregates, but it's seldom useful. The same condition could be used more efficiently at the WHERE stage.)
- Another way to select the rows that go into an aggregate computation is to use FILTER, which is a per-aggregate option:
SELECT city, count(*) FILTER (WHERE temp_lo < 45), max(temp_lo) FROM weather GROUP BY city;
- FILTER is much like WHERE, except that it removes rows only from the input of the particular aggregate function that it is attached to.
PostgreSQL Transactions
I am about to do some big updates to the database to change how articles are stored, and I am going to use transactions to keep the code readable and simple. I also need to get some responses from the database intermittently, so I am going to look over transactions in the PostgreSQL documentation.
Transactions are a fundamental concept of all database systems. The essential point of a transaction is that it bundles multiple steps together into a single, all-or-nothing scenario. The intermediate states between the steps are not visible to other concurrent transactions, and if some failure occurs that prevents the transaction from completing, then none of the steps affect the database at all.
- A transaction is said to be atomic: from the point of view of other transactions, it either happens completely or not at all.
- We want a guarantee that once a transaction is completed and acknowledged by the database system, it has been permanently recorded.
- Another important property of transactional databases is closely related to the notion of atomic updates: when multiple transactions are running concurrently, each should not be able to see the incomplete changes made by others.
- Transactions are set up by surrounding SQL statements with BEGIN and COMMIT commands. If, partway through the transaction, you decide you don't want to commit, you can issue the ROLLBACK command.
- PostgreSQL treats every SQL statement as being executed within a transaction. If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT is sometimes called a transaction block.
- It's possible to control the statements in a transaction in a more granular fashion through the use of savepoints. Savepoints allow you to selectively discard parts of the transaction while committing the rest. After defining a savepoint and rolling back to it, all the transaction's changes made in between are discarded, but changes earlier than the savepoint are kept.
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
SAVEPOINT my_savepoint;
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Bob';
-- oops ... forget that and use Wally's account
ROLLBACK TO my_savepoint;
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Wally';
COMMIT;
Pattern Matching
Link to Pattern Matching Documentation
There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE, the more recent SIMILAR TO operator, and POSIX-style regular expressions. Aside from the basic "does this string match this pattern" operators, functions are available to extract or replace matching substrings and to split a string at matching locations.
LIKE
string LIKE pattern [ESCAPE escape-character]
string NOT LIKE pattern [ESCAPE escape-character]
- The LIKE expression returns true if the string matches the supplied pattern.
- If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE acts like the equals operator.
- An underscore (_) in pattern stands for (matches) any single character.
- A percent sign (%) matches any sequence of zero or more characters
SELECT 'abc' LIKE 'abc'; -- true
SELECT 'abc' LIKE 'a%'; -- true
SELECT 'abc' LIKE '_b_'; -- true
SELECT 'abc' LIKE 'c'; -- false
- The LIKE pattern always covers the entire string.
- To match a literal underscore or percent sign, the character must be preceded by an escape character: '\%'
- It is possible to disable escaping by writing ESCAPE ''
- The keyword ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale.
- The operator ~~ is equivalent to LIKE, and the operator ~~* corresponds to ILIKE.
- There are also the !~~ and !~~* operators, which correspond to NOT LIKE and NOT ILIKE respectively.
SIMILAR TO Regular Expressions
string SIMILAR TO pattern [ESCAPE escape-character]
string NOT SIMILAR TO pattern [ESCAPE escape-character]
- The SIMILAR TO operator returns true or false depending on whether its pattern matches the given string. It is similar to LIKE, except that it interprets the pattern using the SQL standard's definition of a regular expression. SQL regular expressions are a curious cross between LIKE notation and common (POSIX) regular expression notation.
- The SIMILAR TO operator only succeeds if the regular expression matches the entire string, whereas normal regular expression behavior is that the pattern can match any part of the string.
- SIMILAR TO uses _ and % as wildcard characters denoting any single character and any string, respectively (these are comparable to . and .* in POSIX regular expressions).
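A few examples adapted from the pattern matching docs:
SELECT 'abc' SIMILAR TO 'abc';      -- true
SELECT 'abc' SIMILAR TO 'a';        -- false
SELECT 'abc' SIMILAR TO '%(b|d)%';  -- true
SELECT 'abc' SIMILAR TO '(b|c)%';   -- false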
The substring function with three parameters provides extraction of a substring that matches an SQL regular expression pattern.
substring(string, pattern, escape-character)
As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null. To indicate the part of the pattern for which the matching data substring is of interest, the pattern should contain two occurrences of the escape character followed by a double quote ("). The text matching the portion of the pattern between these separators is returned when the match is successful.
substring('foobar' similar '%#"o_b#"%' escape '#') -- oob
substring('foobar' similar '#"o_b#"%' escape '#')  -- NULL
POSIX Regular Expressions
Operator | Description | Example
---|---|---
~ | String matches regular expression, case sensitive | 'thomas' ~ 't.*ma' -- true
~* | String matches regular expression, case insensitive | 'thomas' ~* 'T.*ma' -- true
!~ | String does not match regular expression, case sensitive | 'thomas' !~ 't.*max' -- true
!~* | String does not match regular expression, case insensitive | 'thomas' !~* 'T.*ma' -- false
- The POSIX Regular Expressions are similar to regular expressions in other languages.
- There are many useful functions that you should look into that are built into PostgreSQL for string matching with regular expressions.