Learning PostgreSQL Textbook
I've been writing PostgreSQL commands for a while now, but I sometimes make mistakes that takes a minute or two to clear up and I also don't know much about how to really make sure performance is oin point. I am going to read Learning PostgreSQL to try to learn more about PostgreSQL so that I can more effectively manage a PostgreSQL database and write better queries.
References
- Learning PostgreSQL - Create, develop, and manage rekational databases in real-world applications using PostgreSQL, by Salahaldin Juba, Achim Vannahme, and Andrey Volkov
Relational Databases
- A database can be defined as a collection or a repository of data, which has a certain structure, managed by a database management system (DBMS).
- The CAP theorem states that it is impossible for a distributed computing system to simultaneously provide all three of the following guarantees:
- Consistency - all clients see (immediately) the latest data even in the case of updates
- Availability - all clients can find a replica of some data even in the case of a node failure. That means even if some part of the system goes down, the clients can still access the data.
- Partition Tolerance - The system continues to work regardless of arbitrary message loss or failure of part of the system.
- A relational database enforces ACID. ACID is the acronym for the following properties: Atomicity, Consistency, Isolation, and Durability.
- Atomicity - all or nothing, means that if a part of a transaction fails, then the transaction fails as a whole
- Consistency - Any transaction gets the database from one valid state to another valid state.
- Isolation - Concurrent execution of transactions results in a system that would be obtained if the transactions were executed serially.
- Durability - The transactions which are committed, that is executed successfully, are persistent even with power loss or some server crashes. This is done normally by a technique called write-ahead log.
- A relational database is formally described by relational algebra, and is modeled on the relational model. Object-relational database (ORD) are similar to relational databases. They support object-oriented model concepts such as:
- User defined complex data types
- Inheritance
- In a relational database, a single logical operation is called a transaction. The technical translation of a transaction is a set of database operations, which are create, read, update, and delete (CRUD).
- Relational databases are often linked to the Structured Query Language (SQL). SQL is a declarative programming language, and is the standard relational database language. American National Standard Institute (ANSI) and International Standard Organization (ISO) published the SQL standard for the first time in 1986, followed by many revisions.
- The SQL language has several parts:
- Data definition language (DDL): It defines and amends relational structure.
- Data manipulation language (DML): It retrieves and extracts information from the relations.
- Data control language (DCL): It controls the access rights to relations.
- A relational model is a first-order predicate logic, which was first introduced by Edgar F. Codd. A database is represented as a collection of relations. The state of the whole database is defined by the state of all the relations in the database.
- Think of a relation as a table with a header, columns, and rows. The table name and the header help in interpreting the data in the rows. Each row represents a group of related data, which points to a certain object.
- A relation is represented by a set of tuples. Tuples should have the same set of ordered attributes. Attributes have a domain, that is, a type and a name. Tuple order or position in the relation is not important, and the relation is not sensitive to tuple order.
- Separating the data in different relations is a key concept in relational database modeling. This concept called normalization is the process of organizing relational columns and relations to reduce data redundancy.
- A tuple is a set of ordered attributes. They are written by listing the elements within parentheses
()
and separated by commas. Tuples have the following properties:
- A tuple is not a set, and the order of attributes matters.
In the formal relational model, multi-values attributes as well as composite attributes are not allowed. This is important to reduce data redundancy and increasing data consistency. This isn't strictly true in modern relational database systems because of the utilization of complex data types such as JSON and key-value stores. There is a lot of debate regarding the application of normalization; the rule of thumb is to apply normalization unless there is a good reason not to do so.
- Another important concept is that of the unknown values, that is,
NULL
values. - Predicates in relational databases use three-valued logic (3VL), where there are three truth values: true, false, and unknown. In a relational database, the third value, unknown, can be interpreted in many ways, such as unknown data, missing data, or not applicable.
- Each attribute has a name and a domain, and the name should be distinct within the relation. The domain defines the possible set of values that the attribute can have. One way to define the domain is to define the data type and a constraint on this data type.
- The relational model defines many constraints in order to control data integrity, redundancy, and validity.
- Redundancy - Duplicate tuples are not allowed in the relation
- Validity - Domain constraints control data validity
- Integrity - The relations within a single database are linked to each other. An action on a relation such as updating or deleting a tuple might leave the other relations in an invalid state.
- We could classify constraints in a relational database roughly into two categories:
- Inherited constraints from the relational model: Domain integrity, entity integrity, and referential integrity constraints
- Semantic constraints, business rules, and application specific constraints: These constraints cannot be expressed explicitly by the relational model. However, with the introduction of procedural SQL languages such as PL / pgSQL for PostgreSQL, relational databases can also be used to model these constraints,.
- Domain Integrity Constraint - ensures data validity. Determine the appropriate data type and then check constraints.
- Check constraint - A check constraint can be applied to a single attribute or a combination of many attributes in a tuple.
- Default constraint - an attribute can have a default value
- Unique Constraint - a unique constraint guarantees that the attribute has a distinct value in each tuple
NOT NULL
Constraint - By default, an attribute can be null. TheNOT NULL
constraint restricts an attribute from having a null value.
- Entity Integrity Constraint
- A relation is defined a s set of tuples. By definition, the element of the set is distinct. This means that all the tuples in a relation must be distinct.
- The entity integrity constraint is enforced by having a primary key which is an attribute / set of attributes having the following characteristics:
- The attribute should be unique
- The attributes should be not null
- Each relation must have only one primary key, but can have many unique keys.
- In practice, we often pick up a single attribute to be a primary key instead of a compound key (key that consists of two or more attributes that uniquely identify a tuple) to reduce data redundancy, and to ease the joining of the relations with each other.
- If the primary key is generated by the DBMS, then it is called a surrogate key. Otherwise, it is called a natural key. The surrogate key candidates can be sequences and universal unique identifiers (UUID).
- Referential Integrity Constraints
- Relations are associated with each other via common attributes. Referential integrity constraints govern the association between two relations, and ensure data consistency between tuples. If a tuple in one relation references a tuple in another relation, then the referenced tuple must exist.
- The lack of referential integrity constraints can lead to problems such as:
- Invalid data in the common attributes
- Invalid information during joining of data from different relations
- Referential integrity constraints are achieved via foreign keys. A foreign key is an attribute or set of attributes that can identify a tuple in the referenced relation. Foreign keys are generally primary keys in the referenced relation.
- To ensure data integrity, foreign keys can be used to define several behaviors when a tuple in the referenced relation is updated or deleted. The following behaviors are called referential actions:
- Cascade: When a tuple is deleted or updated in the referenced relation, the tuples in the referencing relation are also updated or deleted.
- Restrict: The tuple cannot be deleted or the referenced attribute cannot be updated if it is referenced by another relation
- No action: Similar to restrict, but it is deferred to the end of the transaction
- Set default: When a tuple in the referenced relation is deleted or the referenced attribute is updated, then the foreign key value is assigned the default value
- Set null: The foreign key attribute is set to null when the referenced tuple is deleted
- Semantic Constraints
- Semantic integrity constraints or business logic constraints describe the database application constraints in general. Those constraints are either enforced by the business logic tier of the application program or by SQL procedural languages.
- Performance: RDBMSs often have complex analyzers to generate efficient execution plans.
- Last Minute Change: For the SQL procedural languages, one could deploy bug fixes without service disruption.
Relational algebra is the formal language of the relational model. It defines closed operations over relations, that is, the result of each operation is a new relation. Relational algebra inherits many operators from set algebra. Relational algebra could be categorized into two groups:
- The first one is a group of operations which are inherited from set theory, such as
UNION
,INTERSECTION
,SET DIFFERENCE
, andCARTESION PRODUCT
, also known asCROSS PRODUCT
. - The second one is a group of operations which are specific to the relational model such as
SELECT
andPROJECT
.
SELECT
is used to restrict tuples from the relation.- The SQL equivalent of the
PROJECT
operator in SQL isSELECT DISTINCT
. TheDISTINCT
keyword is used to eliminate duplicates. - The
RENAME
operation is used to alter the attribute name of the resultant relation, or to give a specific name to the resultant relation. TheRENAME
operation is used to: - Remove confusion is two or more relations have attributes with the same name.
- Provide user-friendly names for attributes, especially when interfacing with reporting engines
- Provide a convenient way to change the relation definition, and still be backward compatible
- The
AS
keyword in SQL is equivalent to theRENAME
operator in relational algebra. - The
CROSS JOIN
operation is used to combine tuples from two relations into a single relation. The number of attributes in a single relation equals the sum of the number of attributes of the two relations. The number of tuples in the single relation equals the product of the number of tuples in the two relations. - Data models describe real-world entities such as customer, service, products, and the relation between those entities. Data models provide an abstraction for the relations in the database.
- Data models as defined by ANSI:
- Conceptual data model: Described the domain semantics, and is used to communicate the main business rules, actors, and concepts. It describes the business requirements at a high level and is often called a high-level data model. The conceptual model is the chain between developers and business departments in the application development lifecycle.
- Logical data model: Describes the semantics for a certaintechnology
- Physical data model: Describes how data is actually stored and manipulated at the hardware level such as storage area network, table space, CPUs, and so on.
- The following pitfalls of database design should be avoided:
- Database redundancy - bad database designs elicit redundant data. Redundant data can cause several other problems including data inconsistency and performance degradation.
- Null saturation
- Tight coupling
PostgreSQL In Action
PostgreSQL or postgres is an open source object relational database management system. It emphasizes extensibility, creativity, as well as compatibility.
Advantages of PostgreSQL
- PostgreSQL is a free open source software (OSS).
- No associated licensing cost of PostgreSQL
- Unlimited number of deployments of PostgreSQL
- PostgreSQL is SQL standards compliant.
- Easy to earn
- Cross Platform
- Scalable
- High Performance
- Very reliable, rarely crashes, ACID complaint.
- It can be configured and installed as a cluster to ensure high availability (HA).
- New release almost every year
- very good documentation
- rich extension repository
- database admin tasks can be automated
PostgreSQL Applications
- The main PostgreSQL application domains can be classified into two categories:
- Online Transaction Processing (OLTP): OLTP is characterized by a large amount of CRUD operations, very fast processing of operations, and the maintaining of data integrity in a multi-access environment. Performance is measured in the number of transactions per second.
- Online Analytic Processing (OLAP): OLAP is characterized by a small amount of requests, complex queries which involve data aggregation, huge amounts of data from different sources and with different formats, data mining, and historical data analysis.
- OLTP is used to model business operations such as customer relationship management (CRM). OLAP applications are used for business intelligence, decision support, reporting, and planning. An OLTP database is relatively small compared to OLAP database. Unlike OLTP, OLAP's main operation is data retrieval. OLAP data is often generated by a process called ETL (extract, transform, and load). ETL is used to load data in to the OLAP database from different data sources and different formats.
PostgreSQL Forks
- HadoopDB is a hybrid between the PostgreSQL RDBMS and MapReduce technologies to target analytical workload.
- Amazon red shift is a popular data warehouse management system based on PostgreSQL. It is mainly designed for OLAP applications.
PostgreSQL Architecture
- PostgreSQL uses the client/server model, where the client and server programs can be on different hosts. The communication between the client and server is normally done via TCP/IP protocols or via Linux sockets.
- Client process or program (frontend) - the database frontend application performs a database action. The frontend can be a web server that wants to display a web page or a command line tool to do maintenance tasks. PostgreSQL provides frontend tools such as
psql
,createb
,dropdb
, andcreateuser
- Server process (backend) - the server process manages database files, accepts connections from client applications, and performs actions on behalf of the client. The server process name is postgres.
- Client process or program (frontend) - the database frontend application performs a database action. The frontend can be a web server that wants to display a web page or a command line tool to do maintenance tasks. PostgreSQL provides frontend tools such as
- PostgreSQL forks a new process for each new connection; thus, client and server processes communicate with each other without the intervention of the server main process (postgres), and they have a certain lifetime that is determined by accepting and terminating a client connection.
- The PostgreSQL server could be divided into four subsystems, as follows:
- Process Manager - the process manager manages client connections such as forking and the terminating process
- Query processor - when a client sends a query to PostgreSQL, the query is parsed by the parser, and then the traffic cop subsystem determines the query type. A utility query is passed to the utilities subsystem. Select, insert, update, and delete queries are rewritten by the rewriter following which an execution plan is generated by the planner. Finally, the query is executed and the result is returned to the client.
- Utilities - The utilities subsystem provides a means for maintaining the database such as claiming storage, updating statistics, and exporting and importing data with a certain format and logging.
- Storage manager - the storage manager handles the memory cache, disk buffers, and storage allocation
- Almost all PostgreSQL components can be configured, including a logger, planner, statistical analyzer, and store manager. PostgreSQL configuration is governed by the nature of the application, such as OLAP and OLTP.
PostgreSQL Capabilities
Replication
- Replication allows data from one database server to be replicated to another server. Replication is used mainly to achieve the following:
- High availability - a second server can take over if the primary server fails
- Load balancing - several servers can serve the same request
- Faster execution - a query is executed on several machines at once to gain performance
- PostgreSQL supports replication out of the box via streaming replication. Streaming replication is a master-slave replication that uses file-based log shipping. Streaming replication is a binary replication technique, because SQL statements are not analyzed.
Security
- PostgreSQL can control the database object access at several levels including database, table, view, function, sequence, and column. This enables PostgreSQL to have a great authentication control.
- PostgreSQL can use encryption to protect data by hardware encryption. Also, one can encrypt certain information by utilizing the
pgcrypto
extension.
Extension
- PostgreSQL can be extended to support new data types. PostgreSQL provides the
CREATE EXTENSION
command to load extensions to the current database. - The ability of PostgreSQL to support extensions is a result of the following features:
- PostgreSQL Data Types: PostgreSQL has very rich data types. It supports primitive data types as well as some primitive data structures, such as arrays, out of the box. In addition, such complex data types:
- Geometric Data Types- Including point, line segment (
lseg
), path, polygon and path - Network Address Types- Including
cidr
,inet
, andmacaddr
- tsvector- This is a sorted list of lexemes that enables postgres to perform full text search
- Universal Unique Identifiers (UUID)- UUID solves many problems related to databases, such as offline data generation
- NoSQL- it supports several NoSQL data types including XML, Hstore, and JSONB
- Enum, range, and domain are user-defined data types with specific constraints, such as a set of allowed values, data range constraint, and check constraints.
- Supported Languages: PostgreSQL allows functions to be written in several languages. The PostgreSQL community supports the following languages: SQL, C, Python, PL / pgSQL, Perl, and Tcl. In addition to these, there are many externally maintained procedural languages, including Java, R, PHP, Ruby, and UNIX shell.
- Composite type is a user defined data type, where an attribute is composed of several attributes.
- Geometric Data Types- Including point, line segment (
-- This example shows how to create a composite data type
CREATE TYPE phone_number AS (
area_code varchar(3),
line_number varchar(7)
);
NoSQL Capabilities
PostgreSQL is more than a relational database and an SQL language. PostgreSQL is now home to different NoSQL data types. The power of PostgreSQL and schemeless data stores enable developers to build reliable and flexible applications in an agile way.
- PostgreSQL supports JavaScript Object Notation (JSON) data type, which is often used for sharing data across different systems in modern RESTful web applications.
- Key value pairs with Hstore and XML (Extensible Markup Language)
Foreign Data Wrapper
- SQL/MED defines foreign data wrappers (FDW) to allow the relational database to manage external data.
- A simple use case for FDW s to have one database server for analytical purposes, and then ship the result of this server to another server that works as a caching layer.
Performance
- PostgreSQL has proven performance. It employs several techniques to improve concurrency and scalability, including:
- PostgreSQL Locking System - PostgreSQL provides several types of locks at the table and row levels. PostgreSQL is able to use more granular locks that prevent locking / blocking more than necessary; this increases concurrency and decreases the blocking time.
- Indexes - PostgreSQL provides four types of indexes: B-Tree, hash, generalized inverted index (GIN), and the Generalized Search Tree (GiST) index. Each index type can be used for certain scenarios
- Explain, Analyze, Vacuum, Cluster - PostgreSQL provides several commands to boost performance and provide transparency. The
EXPLAIN
command shows the execution plan of an SQL statement. One can change some parameter settings such as memory settings, and then compare the execution plan before and after the change. TheANALYZE
command is used to collect the statistics on tables and columns. TheVACUUM
command is used for garbage collection to reclaim unused hard disk space. TheCLUSTER
command is used for arranging data physically on the hard disk. - Table Inheritance and Constraint Exclusion - Table inheritance allows the creation of tables with the same structure easily. Those tables are used to store subsets of data based on certain criteria. This allows a very fast retrieval of information in certain scenarios, because only a subset of data is accessed when answering a query.
Very Rich SQL Constructs
- PostgreSQL supports very rich SQL constructs. It supports correlated and non-correlated subqueries. It supports Common Table Expression (CTE), window functions, and recursive queries.
Basic Server Configuration
- When PostgreSQL is installed, an operating system user as well as a database user, with the name
postgres
is created. - Client authentication is controlled by a configuration file named
pg_hba.conf
(Postgres Host Based Authentication).
# Take a look at peer authentication
$ grep -v '^#' /etc/postgresql/9.4/main/pg_hba.conf|grep 'peer'
local all postgres peer
local all all peer
- Interpretation of first line:
postgres
user can connect to all the databases by using Unix-socket and peer authentication method. - To connect to db servers using the
postgres
user, switch the OS current user topostgres
and then invokepsql
:
$ sudo su postgres
$ psql
psql (9.4.4)
Type "help" for help
postgres=# SELECT version();
...
- psql is configurable - you can select your favorite editor to show the output of the query
- Common psql commands:
\s+ [pattern]
- Describes all the relevant information for a relation. In PostgreSQL, the term relation can refer to a table, view, sequence, or index.
\df+ [pattern]
- Describes a function
\z [pattern]
- Shows the relation access privileges
- The psql client can be customized. The
psqlrc
file is used to store the user preference for later use.
PostgreSQL Basic Building Blocks
In the database world, compound object names often use underscore but not camel case due to the ANSI SQL standard specifications regarding identifiers quotation and case sensitivity. In the ANSI SQL standard, non-quoted identifiers are case-insensitive.
- If you wanted to use the camel case for renaming database objects, you could achieve that by putting the identifier in double quotes. PostgreSQL identifier names have the following constraints:
- The identifier name should start with an underscore or a letter
- The identifier name can be composed of letters, digits, underscores and the dollar sign. For compatibility reasons, the use of the dollar sign is not recommended.
- The minimum length of the identifier is typically one, and the maximum length is 63.
- For generating documentation for your database and its relations, look into yEd and schemaspy.
- PostgreSQL uses
--
and/**/
for single line and multi-line comments respectively. PostgreSQL allows the developer to store the database object description via theCOMMENT ON
command.
PostgreSQL Objects Hierarchy
- PostgreSQL has two template databases:
template1
, which can be modified to allow modification to all newly created databases, andtemplate0
, which is a safeguard database in casetemplate1
gets corrupted.
User Databases
- One can have as many databases as one wants in a database cluster. A client connection to the PostgreSQL server can access only the data in a single database that is specified in the connection string. That means that data is not shared between the databases, unless the postgres foreign data wrapper or
dblink
extensions are used. - Every database in the db cluster has an owner and a set of associated permissions to control the actions allowed for a particular role.
- The
psql
command\l
is used to list all the databases in the database cluster with their associated attributes. - The database access privileges are the following:
Create (-C)
: Create access privilege allows the specified role to create new schemas in the databaseConnect (-c)
: When a role tries to connect to a database, the connect permissions is checked.Temporary (-T)
: The temporary access privilege allows the specified role to create temporary tables. Temporary tables are like tables but not persistent and are destroyed after the user session is terminated.
- Encoding allows you to store text in a variety of character sets - PostgreSQL supports a rich set of character encodings.
- PostgreSQL has several other attributes for additional purposes:
- Maintenance: The attribute
datfrozenxid
is used to determine if a database vacuum is required - Storage Management: The
dattablespace
attribute is used to determine the databasetablespace
- Concurrency: The
datconnlimit
attribute is used to determine the number of allowed connections.-1
means no limit. - Protection: The
datallowconn
attribute disables the connection to a database. This is used mainly to protecttemplate0
from being altered.
- Maintenance: The attribute
Roles
Roles belong to the PostgreSQL server cluster and not to a certain database. A role can either be a database user or a database group. The role concept subsumes the concepts of users and groups in the old PostgreSQL versions.
- Roles have several attributes:
- Super user: A super user role can bypass all permission checks except the login attribute.
- Login: A role with the login attribute can be used by the client to connect to the database.
- Create Database: A role with the create database attribute can create databases.
- Initiating Replication: A role with this attribute can be used for streaming replication.
- Password: The role password can be used with the
md5
authentication method. Also, it can be encrypted. The password expiration can be controlled by specifying the validity period. - Connection Limit: Connection limit specifies the number of concurrent connections that the user can initiate.
- Inherit: If specified, the role will inherit the privileges assigned to the roles that it is a member of. If not specified, Inherit is the default.
- When a database cluster is created, the postgres super user role is created by default.
Tablespace
- Tablespace is defined storage location for a database or database objects. Tablespaces are used by administrators to achieve the following:
- Maintenance: If the hard disk partition runs out of space where the database cluster is created and cannot be extended, a tablespace on another partition can be created to solve this problem by moving the data to another location.
- Optimization: Heavily accessed data should be stored in fast media such as a solid state drive (SSD). At the same time, tables that are not performance critical should be stored on a slow disk.
Template Procedural Languages
- Template procedural language is used to register a new language in a convenient way.
Settings
- The PostgreSQL settings control different aspects of the PostgreSQL server, including replication, write ahead logs, resource consumption, query planning, logging, authentication, statistic collection, garbage collection, client connections, lock management, error handling, and debug options.
- The developer, in general, is concerned with two settings categories, which are:
- Client Connection Defaults: These settings control the statement, locale, and formatting
- Query Planning: These settings control the planner configuration, and give hints to the developer on how to rewrite SQL queries
PostgreSQL Database Components
- A PostgreSQL database could be considered as a container for database schema; the database must contain at least one schema. A database schema is used to organize the database objects in a manner similar to namespaces in high programming languages.
Schema
- The schema contains all the database named objects, including tables, functions, aggregates, indexes, sequences, triggers, data types, domains, and ranges.
- By default, there is a schema called public in the template databases.
- When a user wants to access a certain object, he needs to specify the schema name and the object name separated by a period (.).
- Schema Usages:
- Control Authorization - one can use schemas to group objects based on roles
- Organize Database Objects - one can organize the database objects in groups based on business logic
- Maintain Third-Part SQL Code - the extensions available in the contribution package can be used with several applications.
Table
- Tables can be of different types:
- Permanent Table - the table life cycle starts with table creation and ends with table dropping.
- Temporary Table - the table life cycle is the user session. This is used often with procedural languages to model some business logic.
- Unlogged Table - operations on unlogged tables are much faster than permanent tables, because data is not written into the WAL files. These tables are not crash-safe.
- Child Table - a table that inherits one or more tables
Native Data Types
- Be careful when selecting data types for table. When a db goes into production, changing the data type of a table often comes with locking the table, and in some case, rewriting it. Keep in mind:
- Extensibility: Can the maximum length of a type be increased of decreased without a full table rewrite and a full table scan?
- Data type size: Going for a safe option such as choosing big integers instead of integers will cause more storage consumption.
Numeric Types
- Serial types, namely
smallserial
,serial
, andbigserial
are wrappers on top ofsmallint
,int
, andbiginteger
respectively. Serial types are often used as surrogate keys, and by default, they are not allowed to have anull
value. The seral type utilizes sequences behind the scenes. A sequence is a db object that is used to generate sequences by specifying the min, max, and increment value. - Precision is the total number of digits, while scale is the number of digits of the fraction part.
Character Types
- PostgreSQL uses two general text types, which are
char(n)
andvarchar(n)
data types, where n is the number of characters allowed. In thechar
data type, if a value is less than the specified length, then trailing spaces are padded at the end of the value. - The maximum text size that can be stored is 1GB, which is the maximum column size.
- In PostgreSQL, there is no difference in performance between the different character types, so it is recommended to use the
text
data type. - You can use a
CHECK
constraint to constrain the length of the text, and then you can change theCHECK
constraint without having to lock and re-write the table.
- You can use a
Date and Time Types
- PostgreSQL stores timestamp with and without time zone in the universal coordinated time (UTC) format, and only time is stored without the time zone.
- The function
now()
returns the current timestamp with the time zone in the UTC format.
PostgreSQL Advanced Building Blocks
Views
Aview
can be seen as a named query, or as a wrapper around aSELECT
statement. Views are essential building blocks of relational databases from the UML modeling perspective; a view can be thought of as a method for a UML class. Views share several advantages over procedures, so the following benefits are shared between views and stored procedures:
- Simplifying complex queries and increasing code modularity
- Tuning performance by caching the view results for later use
- Decreasing the amount of SQL code
- Bridging the gap between relational databases and Object Oriented languages - especially updatable views
- Implementing authorization at the row level by leaving out rows that do not meet a certain predicate
- Implementing interfaces and the abstraction layer between high level languages and relational databases
- Implementing last minute changes without redeploying the software
Handling Views Properly
- The
view
definition should be crisp: the view should meet the current business need instead of potential future business needs. It should be designed to provide a certain functionality or service. - In PostgreSQL, the view is internally modeled as a table with a
_RETURN
rule.
View Categories
- Temporary Views: Dropped automatically at the end of the user session.
- Recursive Views: A recursive view is similar to the recursive functions in high level languages.
- Updateable views: Updateable views allow the user to see the view as a table.
- By default, simple PostgreSQL views are auto-updateable. Auto-updateable means that one could use the view with the
DELETE
,INSERT
, andUPDATE
statements to manipulate the data of the underlying table.
- By default, simple PostgreSQL views are auto-updateable. Auto-updateable means that one could use the view with the
- Materialized views: A materialized view is a table whose contents are periodically refreshed based on a certain query.
- Materialized views are often used in Business Analytics. In general a materialized view can be used for:
- Generating summary tables and keeping summary tables up-to-date
- Caching the results of recurring queries
Indexes
An index is a physical database object that is defined on a table column or a list of columns. In PostgreSQL, there are many types of indexes and several ways to use them.
- Indexes can be used, in general to:
- Optimize Performance
- Validate Constraints, instead of checking constraints. An index can be used to validate the constraints on several rows. For example, the
UNIQUE
check constraint creates a unique index on the column behind the scenes.
Index Types
- B-tree Index: The default index.
B
stands for balanced, which means that the data on both sides of the tree is roughly equal. B-tree can be used for equality, ranges, and null predicates. - Hash Index: They are not well supported in PostgreSQL. They are used for equality predicates - it is not recommended to use Hash indexes since B-tree supports equality predicates.
- Generalized Inverted index (GIN): The GIN index is useful when several values need to map to one row. Can be used for arrays and full text search.
- Generalized Search Tree (GiST): Allows the building of general balanced tree structures. They are useful in indexing geometric data types, as well as full text search.
- Block Range Index (BRIN) : Used for very large tables where size is limited. Slower than a B-tree index, but requires less space.
Partial Index
- A partial index indexes only a subset of the table data that meets a certain predicate; the
WHERE
clause is used with the index.
An index can be created on a single column or multiple columns. It also can be created on expressions and function results.
CREATE INDEX on account(lower(first_name));
SELECT * FROM account WHERE lower(first_name) = 'foo';
- Currently, only GIN, B-tree, and GiST support multicolumn indexes. When creating a multicolumn index, the column order is important.
- It is often useful to index columns that are used with predicates and foreign keys. This enables PostgreSQL to use an index scan instead of a sequential scan. The benefits of indexes are not limited to
SELECT
statements, but theDELETE
andUPDATE
statements can also benefit from it. - When creating an index, make sure that the index does not exist, otherwise one could end up with duplicate indexes.
Functions
A PostgreSQL function is used to provide a distinct service, and is often composed of a set of declarations, expressions, and statements. PostgreSQL has very rich built-in functions for almost all existing data types.
- Some developers use functions as an abstract interface with higher programming languages to hide the data model. Functions can have several other usages, such as:
- Performing complex logic that is difficult to perform with SQL
- In Dynamic SQL, a function argument can be used to pass table and view's names via the
EXECUTE
statement - Performing actions before or after the execution of an SQL statement via the trigger system
- Performing exception handling and additional logging via the
EXCEPTION
blocks andRAISE
statement respectively - Cleaning the SQL code by reusing the common code, and bundling the SQL codes in modules
- When creating a function, it is marked as volatile by default if the volatility classification is not specified. If the created function is not volatile, it is important to mark it as stable or immutable, because this will help the optimizer to generate the optimal extract plans. PostgreSQL function volatile classifications:
- Volatile
- Can return a different result on successive calls even if the function argument did not change, or it can change the data in the database.
- Stable and Immutable
- These functions cannot modify the database, and are guaranteed to return the same result for the same argument.
- If one would like to create indexes on partitioned tables, shell scripting is a better alternative
PostgreSQL User-Defined Data Types
- Two methods for implementing user-defined data types:
CREATE DOMAIN
- this command allows developers to create a user-defined data type with constraintsCREATE TYPE
-used to create a composite type, which is useful in procedural languages, and is used as the return data type
Triggers and Rules
- PostgreSQL provides triggers and rules systems to automatically perform a certain function when an event like
INSERT
,UPDATE
, orDELETE
is performed. - From the functionality point of view, the trigger system is more generic; it can be used to implement complex actions more easily than rules. However, both trigger and rule systems can be used to implement the same functionality in several cases. From the performance point of view, rules tend to be faster than triggers; but triggers tend to be simpler and more compatible with other RDBMs, since the rule system is a PostgreSQL extension.
- Creating a rule will either rewire the default rule, or create a new rule for a specific action on a specific table or view. In other words, a rule on an insert action can change the insert action behavior, or can create a new action for the insert.
CREATE TABLE car_log (LIKE car);
ALTER TABLE car_log
ADD COLUMN car_log_action varchar(1) NOT NULL,
ADD COLUMN car_log_time TIMESTAMP WITH TIME ZONE NOT NULL;
CREATE RULE car_log AS
ON INSERT TO car
DO ALSO
INSERT INTO car_log (car_id, number_of_owners, registeration_number, ... ) VALUES (new.car_id, new.number_of_owners, new.registeration_number, ...);
- Writing rules is often quicker than writing triggers, and the code is crisper and shorter. A common case of using rules is to ignore the CRUD operation on the table in order to protect the table against data changes. This scenario can be used to protect lookup tables or to ignore CRUD operations.
CREATE RULE account_search_log_insert AS
ON INSERT TO account_search_history
DO INSTEAD nothing;
- Be careful with infinite loops with rules
- PostgreSQL triggers a function when a certain event occurs on a table, view, or foreign table, Triggers are executed when a user tries to modify the data through any of the data manipulation language (DML) events, including
INSERT
,UPDATE
,DELETE
, orTRUNCATE
. - When a trigger is marked for each row, then the trigger function will be executed for each row that has been affected by the CRUD operation. A statement trigger is only executed once per operation.
I am not going to go over the exact syntax for creating triggers, rules, or any other SQL statement in these notes. I plan to go over the syntax for these operations in a future note, where I will be reviewing in depth common PostgreSQL SQL commands.
SQL Language
- Structured Query Language (SQL) is used to set up the structure of the database, to manipulate the data in the database and to query the database. This chapter will be dedicated to the Data Manipulation Language (DML).
- The complete reference of SQL can be found in the official PostgreSQL documentation.
- SQL is used to manipulate the data in the database and to query the database. Also, SQL is used to define the structure of the data. In general, SQL consists of:
- Data Definition Language (DDL) - create and manage the structure of the data
- Data Manipulation Language (DML) - manage the data itself
- DML has only four types of statements:
INSERT
- used to put new data into the databaseUPDATE
- used to change dataDELETE
- used to delete dataSELECT
- used to retrieve data
- Data Control Language (DCL) - control access to the data
- SQL is a declarative language. In SQL, the developer writes only one major sentence and leaves the details for the database.
- The minimal SQL instruction that can be executed by the database engine is called a statement (or command /query). SQL commands are terminated by a semicolon (;).
- SQL statements can contain:
- Keywords determine exactly what it required from the database to be done
- Identifiers refer to the objects in the database - tables, their fields, functions, and so on
- Constants are parts of expressions whose values are specified directly in the code
- Operators determine how the data is processed in the expressions
- Special characters, such as parenthesis, brackets, commas, and so on, which have other meanings than simply being an operator
- Whitespaces separate words from each other
- Comments are used to describe a particular line of code
- Identifiers are the names of the database objects. Objects such as tables or views can be referred by the name of the schema where it belongs to followed by the dot symbol (
.
) and the name of the object. This is called a qualified object name. - SQL is not case sensitive,
- Constants in SQL are also called literals. PostgreSQL supports three types of implicitly typed constants: numbers, strings, and bit strings.
- String constants should be quoted. There are two kinds of syntax string constants in PostgreSQL: single quoted constants like in SQL standard, and PostgreSQL - specific dollar-quoted constants. Putting the letter
E
before the string constant makes it possible to use C-style backslash escaped characters such as\n
and\t
. - Bit strings are preceded by the letter
B
and can only contain the digits 0 or 1. Alternatively, they can be preceded by a letterX
and be hexadecimal strings. - The sequence of operations performed by the
SELECT
query: - Take all records from all the sources tables. If there are subqueries in the
FROM
clause, they are evaluated first - Build all possible combinations of those records and discard combinations that do not follow the
JOIN
conditions or set some fieldsNULL
in case of outer joins - Filter out the combinations that do not match the condition of the
WHERE
clause - Build groups based on the values of the expressions of the
GROUP BY
list - Filter the groups that match the
HAVING
conditions - Evaluate the expressions of the Select-list
- Eliminate duplicated rows if
DISTINCT
is specified - Apply the set operations
UNION
,EXCEPT
, orINTERSECT
- Sort rows according to the
ORDER BY
clause - Discard records according to
OFFSET
andLIMIT
- Take all records from all the sources tables. If there are subqueries in the
- After the
SELECT
keyword, one should specify the list of fields or expressions to get from the database. This is called the Select-list. - Expressions in the Select-list are called value expressions or scalar expressions. That is because each expression in the Select-list always returns only one value (but the value can be an array).
- By default, PostgreSQL assigns to the output column the same name as the function.
- A conditional expression is an expression returning different results depending on some condition:
CASE WHEN <condition1> THEN <expression1> [WHEN <condition2> THEN <expression2> ...] [ELSE <expression n>] END
- The length of an SQL expression has no limits in language specification.
- The source of rows for the query is specified after the
FORM
keyword. It is called the FROM clause. The source of the rows for the query can be any of: - Table
- View
- Function
- Subquery
- Values
- Joins
- Logical operators are usually used in conditional expressions. They are
AND
,OR
, andNOT
. - The
GROUP BY
clause is used for grouping. Grouping means splitting the whole input of records into several groups with a view to have only one result row for each group. - It is almost useless just to group rows. Usually, it is necessary to do some computing on the groups.
- Aggregation means performing calculation, on a group of records returning a single value for the whole group.
- There are several aggregating functions available in PostgreSQL. The most frequently used are
count
,sum
,max
,min
, andavg
for computing, respectively the number of records in the group, the sum of any numeric expression for all the records in the group, the biggest, the lowest, and the average value of any expression. - When grouping and aggregation are used, the records are grouped. This means several records become one.
OFFSET
andLIMIT
can be used to offset and limit results.- Subqueries are a very powerful feature of SQL. They can be used almost everywhere in the query. The most obvious way to use subqueries is in the
FROM
clause as a source for the main query. - Subqueries can be nested.
- Set operations are used to combine the results of several queries. It is different from joining, although the same results can often be achieved by joining.
- Joining means placing the records of two tables, one table with another table horizontally. The result of joining is that the number of columns equal to the sum of the numbers of columns of the source tables, and the number of records will depend on the join conditions.
- Combining means putting the result of one query on top of the result of another query. The number of columns stays the same, but the number of rows is the sum of the rows from the sources.
- There are three set operations:
UNION
: This appends the result of the query to the result of another queryINTERSECT
: This returns the records that exist in the results of both queriesEXCEPT
: This returns the records from the first query that do not exist in the result of the second query - the difference
- The syntax of set operations:
<query1> UNION <query2>;
<query1> INTERSECT <query2>;
<query1> EXCEPT <query2>;
NULL
means the absence of any value. In relation to logical values,NULL
is neither true nor false.- There are a couple of functions designed to deal with
NULL
values:COALESCE
andNULLIF
. TheCOALESCE
function takes any number of arguments of the same data type or compatible types. It returns the value of the first of its arguments that is not null: COALESCE(a, b, c)
NULLIF
takes two arguments and returnsNULL
if they are equal. Otherwise, it returns the value of the first argument.
Advanced Query Writing
- Common Table Expressions (CTE) is a feature that makes it possible to define a subquery, give it a name, and then use it at several places in the main query.
- PostgreSQL executes the subqueries from the
WITH
clause only once, saves the results in the memory or on disk, depending on the size, and then reuses them. - It is possible to refer a substatement from itself. Statements like that are called recursive queries. Recursive queries must have a special structure that indicates to the database that the subquery is recursive. The structure of the subquery is recursive.
WITH RECURSIVE <subquery_name> (<field list>) AS (
<non-recursive term>
UNION [ALL|DISTINCT]
<recursive term>
)
[,...]
<main query>
A non-recursive or anchor subquery is a starting point of the execution of a recursive query. It cannot refer to the name of the recursive query. It is executed only once. The results of the non-recursive term are passed to the recursive term, which can reference the whole recursive query by its name. If the recursive term returns rows, they are passed to the recursive term again. This is called iteration. Iteration is repeated as long as the result of the recursive term is not empty. The result of the whole query is all the rows returned by the non-recursive term and all the iterations of the recursive term. If the keywords UNION ALL are used, then all the rows are returned. If UNION DISTINCT or just UNION is used, then the duplicated rows are removed from the result set.
- What if several CTEs change the same table or use the results produced by each other? There are some principles for their isolation and interaction:
- For sub-statements:
- All the sub-statements work with the data as it was at the moment of the start of the whole
WITH
query - They don't see the results of each other's work. You can't
DELETE
a row that wasINSERT
ed by a previous subquery. - The only way for passing information of the processed records from a data changing CTE to another CTE is using the
RETUNING
clause.
- All the sub-statements work with the data as it was at the moment of the start of the whole
- For triggers defined on the table before being changed:
- For the BEFORE triggers: Statement-level triggers are executed before the execution of each sub-statements. Row-level triggers are executed just before the changing of each record.
- For the AFTER triggers: Both, statement-level and row-level triggers are executed after the whole
WITH
query. They are executed in groups per every sub-statement: first row-level then statement level. - The statements inside the code of the triggers do see the changes in data that were made by other sub-statements.
- For constraints defined on the tables being changed:
PRIMARY KEy
andUNIQUE
constraints are validated for every record at the moment of insert or update of the recordCHECK
constraints are validated for every record at the moment of insert or update of the record.FOREIGN KEY
constraints are validated at the end of the execution of the wholeWITH
query.
- Window functions are another way to perform computations based on the values of several records. Window functions can do similar things, but they are executed for every record, and the number of records in the output and the input is the same.
- Syntax of window functions:
<function_name>(<function_arguments>)
OVER (
[PARTITION BY <expression_list>]
[ORDER BY <order_by_list>]
[{ROWS | RANGE} <frame_start> | {ROWS | RANGE} BETWEEN <frame_start> AND <frame_end>])
Example:
SELECT employee_number, department, salary, AVG(salary) OVER(PARTITION BY department), RANK() OVER(ORDER BY salary DESC) as overall_rank, RANK() OVER(PARTITION BY department ORDER BY salary DESC) as overall_rank, AS dept_avg FROM employees;
- The typical use cases for the window functions are computing statistical values of different kinds.
- The
OVER()
clause constructs a window. When it's empty, the window will include all records. PARTITION BY
- inside of theOVER()
, usePARTITION BY
to form rows into groups of row- Use
ORDER BY
inside of theOVER()
clause to re-order rows within each window - PostgreSQL window functions
- The PostgreSQL
DISTINCT ON
keywords are used to select the first record within each group.
There is a fundamental concept that is called transaction in relational databases. Transaction is a unit of data change. PostgreSQL implements a set of principles related to transaction processing, called ACID:
- Atomicity - Every transaction can be fixed (saved or committed) only as a whole. If one part of a transaction fails, the whole transaction is aborted and no changes are saved at all.
- Consistency - All data integrity constraints are checked and all triggers are processed before a transaction is fixed. If any constraint is violated, then the transaction cannot be fixed.
- Isolation - A database system manages concurrent access to the data objects by parallel transactions in a way that guarantees that the results of their execution is the same as if they were executed serially, one after the other.
- Durability - If a transaction is fixed, the changes it has made to the data are written to the non-volatile memory (disk). So, even if the system crashes or is powered-off immediately after that, the result of the transaction will persist in the database.
- PostgreSQL implements a special mechanism that maintains data consistency and transaction isolation. It is called multiversion concurrency control (MVCC). In short, it means that each transaction works with a consistent snapshot of data as it was some time ago, without taking into account changes made by other transactions that are not finished yet.
- When it comes to accessing the same data object by several concurrent transactions, the way they interact with each other is determined by the level of transaction isolation:
- Dirty Read: A transaction can read data that was written by another transaction, which is not committed yet.
- Nonrepeatable read: When rereading data, a transaction can find that the data has been modified by another transaction that has just committed. The same query executed twice can return different values for the same rows.
- Phantom Read: This is similar to nonrepeatable read, but it is related to new data, created by another transaction. The same query executed twice can return different number of records.
To block transactions and monitor serialization anomalies, PostgreSQL uses the mechanism of locks. Locks are special objects in the database that are obtained by transactions and associated with data objects: rows or tables. A data object cannot be locked by more than one transaction. So, when a transaction wants to update a record, it will first lock it. If another transaction tries to update the same record, it will also try to lock it but since it is already locked, the transaction will wait. Locks are created automatically by the database when it processes SQL statements and released when the transaction is finished.
Server Side Programming with PL/pgSQL
- The ability to write functions in PostgreSQL is an amazing feature. One can perform any task within the scope of the database server. These tasks might be related directly to data manipulation such as data aggregation and auditing, or can be used to perform miscellaneous services such as statistics collection, monitoring, system information acquisition, and job scheduling.
- The PL / pgSQL function execution plan is cached; caching the plan can help in reducing the execution time, but it can also hurt in case the plan is not optimal for the provided function parameters.
- These parameters are used by the planner to determine the cost of executing the function, the number of rows that are expected to be returned, and whether the function pushes down when evaluating predicates:
- Leakproof: Leakproof means that the function has no side effects
- Cost: It declares the execution cost per row
- Rows: The estimated number of rows by the function if the function is returning set.
- The general syntax of a variable declaration is as follows:
name { CONSTANT } type [ COLLATE collation_name ] [ NOT NULL ] [ { DEFAULT | := | = } expression ];
name
- the name should follow the naming rules for PostgreSQL tablesCONSTANT
- the variable cannot be assigned another value after the initializationtype
- The type of variableNOT NULL
- Causes a runtime error if the variable is assigned to nullDEFAULT
- Causes the initialization of the variable to be delayed until the block is entered.- Expression is the combination of one or more explicit values, operators, and functions that can be evaluated to another value
- The
BEGIN
andEND
keywords are not used to control transactional behavior, but only for grouping. - The assignment operators
:=
and=
are used to assign an expression to a variable:
variable { := | = } expression;
- For variable names, one should choose names that do not conflict with the column names.
- Conditional statements:
IF boolean_expression THEN statement [statement]...
[ELSIF boolean_expression THEN statement [statement]...]
[ELSIF boolean_expression THEN statement [statement]...]...
[ELSE statement [statement]...] END IF;
- Iteration is used to repeat a block of statements to achieve a certain goal.
I've been over PL / pgSQL before, and I don't want to go through it again. I will be skipping over the rest of this section.
PostgreSQL Security
- Don't use the
postgres
user as the user that your application uses during the running application - Create a new user with less privileges - should only be able to
SELECT
,DELETE
,UPDATE
, andINSERT
from certain tables / views.
The PostgreSQL System Catalog and System Administration Functions
pg_catalog
andinformation_schema
schemas hold information about tables, rules, users, groups, indexes, foreign data wrappers, triggers, constraints, views, functions, and so on.- Always index foreign keys: Indexing a table on a foreign key allows PostgreSQL to fetch data from the table using an index scan.
Optimizing Database Performance
pgbench
- benchmarking
Comments
You have to be logged in to add a comment
User Comments
There are currently no comments for this article.