Database Systems

I want to read about database systems to increase my knowledge of databases.

Date Created:

Last Edited:

1 139

References

Database Systems, The Complete Book; Second Edition; Hector Garcia-Molina; Jeffrey D. Ullman; Jennifer Widom

The power of databases comes from a body of knowledge and technology that has developed over several decades and is embodied in specialized software called database management system, or DBMS, or more colloquially database system. A DBMS is a powerful tool for creating and managing large amounts of data efficiently and allowing it to persist for creating and managing large amounts of data efficiently and allowing it to persist over long periods of time, safely. In common parlance, the term database refers to a collection of data that is managed by a DBMS. The DBMS is expected to:

Allow uses to create new databases and specify their schemas (logical structure of the data), using a specialized data-definition language.
Give users the ability to query the data (a query is database lingo for a question about the data) and modify the data, using an appropriate language, often called a query language or data-manipulation language.
Support the storage of very large amounts of data - many terabytes or more - over a large period of time, allowing efficient access to the data for queries and database modifications.
Enable durability, the recovery of the database in the face of failures, errors of many kinds, or intentional misuse.
Control access to data from many users at once, without allowing unexpected interactions among users (called isolation) and without actions on the data to be performed partially but not completely (called atomicity).

Following a famous paper by Ted Codd in 1970, database systems changed significantly. Codd proposed that database systems should present the user with a view of data organized as tables called relations. By 1990, relational database systems were the norm.

To a great extent, the old problem of building and maintaining databases has become one of information integration: joining the information contained in many related databases into a whole. One popular approach to information integration is the creation of data warehouses, where information form many legacy databases is copied periodically, with the appropriate translation, to a central database.

There are two distinct sources of commands to the DBMS:

Conventional users and application programs that ask for data or modify data.
A database administrator: a person responsible for the structure or schema of the database.

The great majority of interactions with the DBMS follow the path on the left of the figure below. A user or an application program initiates some action, using the data-manipulation language (DML). This command does not affect the schema of the database, but may affect the content of the database. (Schema-altering commands are called data-definition language (DDL) commands, and they are parsed by a DDL processor and passes to the execution engine, which then goes through the index/file/record manager to alter the metadata, that is, the schema information for the database).

Database Management System Components

DML statements are handled by two separate subsystems:

Answering the Query
1. The query is parsed and optimized by a query compiler. The resulting query plan, or sequence of actions the DBMS will perform to answer the query, is passed to the execution engine. The execution engine issues a sequence of requests for small pieces of data, typically records or tuples of a relation, to a resource manager that knows about data file (holding relations), the format and size of records in those files, and index files, which help find elements of fata files quickly.
2. The requests for data are passed to the buffer manager. The buffer manager's task is to bring appropriate portions of the data from secondary storage (disk) where it is kept permanently to the main-memory buffers. Normally, the page or disk block is the unit of transfer between buffers and disk. The buffer manager communicates with a storage manager to get data from disk.
Transaction Processing
1. Queries and other DML actions are grouped into transactions, which are units that must be executed automatically and in isolation from one another. Any query or modification action can be a transaction by itself. In addition, the execution of transactions must be durable, meaning that the effect of any completed transaction must be preserved even if the system fails in some way right after the completion of the transaction. The transaction processor is divided into two major parts:
2. 1. A concurrency-control manager, or scheduler, responsible for assuring atomicity and isolation of transactions and
  2. A logging and recovery manager , responsible for the durability of transactions.

The data of a database normally resides in secondary storage (hard disk). However, to perform any useful operation on data, that data must be in main memory. It is the job of the storage manager to control the placement of data on disk and its movement between disk and main memory. The storage manager keeps track of the location of files on the disk and obtains block or blocks containing a file from the buffer manager.

The buffer manager is responsible for partitioning the available main memory into buffers, which are page-sized regions into which disk blocks can be transferred. Kinds of information that various components may need:

Data: the contents of the database itself
Metadata: the database schema that describes the structure of, and constraints on, the database
Log Records: information about recent changes to the database; these support durability of the database
Statistics: Information gathered and stored by the DBMS about data properties such as the sizes of, and values in, various relations or other components of the database.
Indexes: Data structures that support efficient access to the data

It is normal to group one or more database operations into a transaction, which is a unit of work that must be executed atomically and in apparent isolation from other transactions. Durability - the work of completed transactions will never be loss. The transactions manager therefore accepts transaction commands from an application, which tell the transaction manager when transactions begin and end, as well as information about the expectations of the application. The transaction manager performs the following tasks:

Logging: In order to assure durability, every change in the database is logged separately on disk. The log manager follows one of several policies designed to assure that no matter when a system failure or crash occurs, a recovery manager will be able to examine the log of changes and restore the database to some consistent state.
Concurrency Control: Transactions must appear to execute in isolation. But in most systems, there will in truth be many transactions executing at once. The scheduler must assure that the individual actions of multiple transactions are executed in such an order that the net effect is the same as if the transactions had in fact executed in their entirety, one-at-a-time. A typical scheduler does its work by maintaining locks on certain pieces of the database. Locks are generally stored in a main-memory lock table.
Deadlock resolution: As transaction compete for resources through the locks that the scheduler grants, they can get into a situation where none can proceed because each needs something another transaction has. The transaction manager has the responsibility to intervene and cancel one or more transactions to let the others proceed.

The portion of the DBMS that most affects the performance that the user sees is the query processor. In the figure above, the query processor is represented by:

The query compiler, which translates the query into an internal form called the query plan. The latter is a sequence of operations to be performed on the data. The latter is a sequence of operations to be performed on the data. The query compiler uses metadata and statistics about the data to decide which operations is likely to be the fastest. The query compiler consists of three major units:
1. A query parser, which builds a tree structure from the textual form of the query
2. A query preprocessor, which performs semantic checks on the query and performing some tree transformations to turn the parse tree into a tree of algebraic operators representing the initial query plan.
3. A query optimizer, which transforms the initial query into the best available sequence of operations on the actual data.
The execution engine, which has the responsibility for executing each of the steps in the chosen query plan. The execution engine interacts with most of the other components of the DBMS either directly or through the buffers. It must get the data from the database into buffers in order to manipulate the data. It needs to interact with the scheduler to avoid accessing data that is locked, and with the log manager to make sure that all database changes are properly logged.

The Relational Model of Data

The most important model of data: the two-dimensional table or relation. A data model is a notation for describing data or information. The description consists of three parts:

Structure of Data
Operations on the data: In database data models, we are generally allowed to perform a limited set of queries (operations that retrieve information) and modifications (operations that change the database). By limiting operations, it is possible for programmers to describe database operations at a very high level, yet have the database management system implement the operations efficiently.
Constraints on the Data: Describe limitations on what the data can be.

The two preeminent data models today are:

The relational model, including object-relational extensions
The Semistructured-data model, including XML and related standards

The relational model is based on tables. Semistructured data resembles trees or graphs, rather than tables or arrays.

The columns of a relation are named by attributes. The name of a relation and the set of attributes for a relation is called the schema for that relation. The attributes in a relation schema are a set, not a list. The set of schemas for the relations of a database is called a relational database schema, or just a database schema. The rows of a relation, other than the header row containing the attribute names, are called tuples. A tuple has one component for each attribute of the relation. The relational model requires that each component of each tuple be atomic - must be of some elementary type such as integer or string. It is further assumed that associated with each attribute of a relation is a domain, that is, an elementary type. The components of any tuple of the relation must have, in each component, a value that belongs to the domain of the corresponding column. Relations are sets of tuples, not lists of tuples. Thus the order in which the tuples of a relation are presented is immaterial.

Schema changes can be very expensive, because each of perhaps mullions of tuples needs to be rewritten to add or delete components. A set of tuples for a given relation is an instance of that relation. A conventional database system maintains only one version of any relation: the set of tuples that are in the relation now. This instance of the relation is called the current instance.

A set of attributes forms a key for a relation if we do not allow two tuples in a relation instance to have the same values in all attributes of the key.

SQL (pronounced sequel) is the principal language used to describe and manipulate relational databases. The current standard for SQL is SQL-99. There are two aspects to SQL:

The Data-Definition sublanguage for declaring database schemas and
The Data-Manipulation sublanguage for querying (asking questions about) databases and for modifying the database.

SQL makes a distinction between three kinds of relations:

Stored relations, which are called tables. These relations exist in the database and can be modified in the database as well as queried.
Views which are relations defined by a computation. These relations are not stored, but are constructed, in whole or in part, when needed.
Temporary tables, which are constructed by the SQL language processor when it performs its job of executing queries and data modifications. These relations are thrown away and not stored.

Some SQL:

CREATE TABLE Movies (
  id PRIMARY KEY,
	title CHAR(IOO),
  year INT,
  length INT,
  genre CHAR(10) DEFAULT 'none',
  studioName CHAR(30),
  producerC INT
);
ALTER TABLE Movies ADD date_created BIGINT;
ALTER TABLE Movies DROP producerC;
DROP TABLE Movies;

Two declarations can be used to indicate keyness:

PRIMARY KEY: The attribute is not allowed to be NULL
UNIQUE

Relational algebra consists of some simple but powerful ways to construct new relations from given relations. Relational algebra is another example of algebra. Its atomic constraints are:

Variables that stand for relations.
Constants, which are finite relations.

The operations of traditional relational algebra fall into four broad classes:

The usual set operations - union, intersection, difference - applied to relations
1. , the union of R and S, is the set of elements that are in R or S or both. An element appears once in the union even if it is present in both R and S
2. , the intersection of R and S, is the set of elements that are in both R and S
3. , the difference of R and S, is the set of elements that are in R and not in S.
Operations that remove parts of a relation: selection eliminates some rows (tuples) and projection eliminates some columns.
1. The selection operator produces a new relation with a subset of R's tuples.
Operations that combine the tuples of two relations, including Cartesian product, which pairs the tuples of two relations in all possible ways and various kinds of join operations, which selectively pair tuples from two relations
An operation called renaming that does not affect the tuples of a relation, but changes the relational schema

Note: R and S must have schemas with identical attributes. Before computing set operations, the columns of R and S must be order so that the order of attributes is the same for both relations.

The Cartesian product (or cross-product or just product) of two sets R and S is the set of pairs that can be formed by choosing the first element of the pair to be any element of R and the second any element of S. This product is denoted . More often than we want to take the product of two relations, we find a need to join them by pairing only those tuples that match in some way.

A common kind of constraint, called a referential integrity constraint, asserts that a value appearing in one context also appears in another related context.

Design Theory for Relational Databases

A functional dependency (FD) on a relation is a statement of the form: If two tuples of R agrees on all of the attributes (i.e., the tuples have the same values in their respective components for each of these attributes), then they must also agree on all of another list of attributes . This FD is written formmaly as and say that .

A set of attributes that contains a key is called a superkey, short for superset of a key.

Problems such as redundancy that occur when we try to cram too much into a single relation are called anomalies. The principle kinds of anomalies that we encounter are:

Redundancy: Information may be repeated unnecessarily in several tuples.
Update Anomalies: We may change information in one tuple but leave the same information unchanged in another.
Deletion Anomalies: If a set of values becomes empty, we may lose other information as a side effect.

The accepted way to eliminate these anomalies is to decompose relations. The goal of decomposition is to replace a relation by several that do not exhibit anomalies. There is, it turns out, a simple condition under which anomalies can be guaranteed not to exist. This condition is called Boyce-Codd Normal Form, BCNF:

A relation is in BCNF if and only if: whenever there is a nontrivial FD for , it is the case that is a superkey for .

High-Level Database Models

The database modeling and implementation process

In the entity-relationship model, the structure of data is represented graphically, as an entity-relationship diagram, using three principle element types:

Entity sets,
Attributes, and
Relationships

An entity is an abstract object of some sort, and a collection of similar entities forms an entity set. An entity in some ways resembles an object in the sense of object-oriented programming. Entity sets have associated attributes, which are properties of the entities in that set. Relationships are connections among two or more entity sets. An E/R diagram is a graph representing entity sets, attributes, and relationships. Elements of each of these kinds are represented by nodes of the graph, and we use a special shape of node to indicate the kind:

Entity sets are represented by rectangles
Attributes are represented by ovals
Relationships are represented by diamonds

Example E/R Diagram

If each member of can be connected by to at most one member of , then we say that is many-one from to .
If is both many-one from E to F and many-one from F to E, then we say that is one-one.
If R is neither many-one from to or from to , then we say is many-many.

Design Principles

Faithfulness - the design should be faithful to the specifications of the application.
Avoid Redundancy - we should be careful to say everything once only.
Simplicity Counts - avoid introducing more elements into your design than is absolutely necessary.
Choosing the Right Relationships
Picking the Right Kind of Element - sometimes we have options regarding the type of design element used to represent a real-world concept.

Relational Database Programming

The Database Language SQL

The most common used relational DBMS's query and modify the database through a language called SQL. SQL stands for Structured Query Language. The portion of SQL that supports queries has capabilities very close to that of relational algebra. SQL serves as both a data-manipulation language and a data-definition language. SQL is case insensitive, meaning that it treats upper- and lower-case letters as the same letter.

Modeling and Programming for Semistructured Data

The Semistructured-Data Model

The Semistructured-data model plays a specific role in database systems:

It serves as a model suitable for integration of databases, that is, for describing the data contained in two or more databases that contain similar data with different schemas.
It serves as the underlying model for notations such as XML that are being used to share information on the web.

Interest in Semistructured-data is primarily motivated by its flexibility. A database of Semistructured data is a collection of nodes. Each node is either a leaf or interior. Leaf nodes have associated data; the type of this data can be any atomic type, such as numbers and strings. Interior nodes have one or more arcs out. Each arc has a label, which indicates how the node at the head of the arc relates to the node at the tail. One interior node, called the root, has no arcs entering and represents the entire database.

Semistructured Data

Tags in XML are text surrounded by triangular brackets, <...>, as in HTML. Also as in HTML, tags generally come in matching pairs, with an opening tag like <Foo> and matched closing tags with the same word in the brackets </Foo>. A pair of matching tags and everything that comes between them is called an element.

XML can be used in two different modes:

Well formed XML allows you to invent your own tags, much like the arc-labels in semistructured data. This mode corresponds quite closely to Semistructured data, in that there is no predefined schema, and each document is free to use whatever tags the author of the document wishes.
Valid XML involves a DTD or Document Type Definition that specifies the allowable tags and gives a grammar for how they may be nested.

As in HTML, an XML element can have attributes (name-value airs) within its opening tag. There are situations in which XML data involves tags that come from two or more different sources, and which may therefore have conflicting names.