Database Semantics

Database Semantics

Abstract

This paper, the first step to connect relational databases with systems consequence (Kent [5]), is concerned with the semantics of relational databases. It aims to to study system consequence in the logical/semantic system of relational databases. The paper, which was inspired by and which extends a recent set of papers on the theory of relational database systems (Spivak [6] [7]), is linked with work on the Information Flow Framework (IFF [9]) connected with the ontology standards effort (SUO), since relational databases naturally embed into first order logic. The database semantics discussed here is concerned with the conceptual level of database architecture. We offer both an intuitive and technical discussion. Corresponding to the notions of primary and foreign keys, relational database semantics takes two forms: a distinguished form where entities are distinguished from relations, and a unified form where relations and entities coincide. The distinguished form corresponds to the theory presented in (Spivak [6]). The unified form, a special case of the distinguished form, corresponds to the theory presented in (Spivak [7]). A later paper will discuss various formalisms of relational databases, such as relational algebra and first order logic, and will complete the description of the relational database logical environment.

Keywords:
database systems, database schemas, relational tables, primary and foreign keys, morphisms of databases, relational algebra, first order logic, system consequence.

1 Introduction

The author’s “Systems Consequence” paper (Kent [5]) is a very general theory and methodology for specification and inter-operation of systems of information resources. The generality comes from the fact that it is independent of the logical/semantic system (institution) being used. This is a wide-ranging theory, based upon ideas from information flow (Barwise and Seligman [1]), formal concept analysis (Wille and Ganter et al [2]), the theory of institutions (Goguen et al [3]), and the lattice of theories notion (Sowa), for the integration of both formal and semantic systems independent of logical environment. In order to better understand the motivations of that paper and to be able more readily to apply its concepts, in the future it will be important to study system consequence in various particular logical/semantic systems. This paper aims to do just that for the logical/semantic system of relational databases. The paper, which was inspired by and which extends a recent set of papers on the theory of relational database systems (Spivak [6],[7]), is linked with work on the Information Flow Framework (IFF [9]) connected with the ontology standards effort (SUO), since relational databases naturally embed into first order logic. We offer both an intuitive and technical discussion. Corresponding to the notions of primary and foreign keys, relational database semantics takes two forms: a distinguished form where entities are distinguished from relations, and a unified form where relations and entities coincide. The distinguished form corresponds to the theory presented in the paper (Spivak [6]). We extend Spivak’s treatment of tables from the static case of a single entity classification (type specification) to the dynamic case of classifications varying along infomorphisms. Our treatment of relational databases as diagrams of tables differs from Spivak’s sheaf theory of databases. The unified form, a special case of the distinguished form, corresponds to the theory presented in the paper (Spivak [7])). The unified form has a graphical presentation, which corresponds to the sketch theory of databases (Johnson and Rosebrugh [4]) and the resource description framework (RDF). This paper, which is the first step to connect relational databases with systems consequence, is concerned with the semantics of relational databases. A later paper will discuss various formalisms of relational databases, such as relational algebra and first order logic. Section 2 discusses the relational data model. Section 3 describes our representation for the table concept, both defining a category of tables, and proving that this category is complete (joins exist) and cocomplete (unions exist). Section 4 represents the relational database concept as a diagram of tables linked by the generalization-specialization of projections. Morphisms of relational databases are defined. Canonical examples of both are discussed. Finally, section 5 summarizes the results and gives some concluding remarks.

2 Relational Data Model

The paper defines an architectural semantics for the relational data model. 1 All information in the relational model is represented within relations. A relational database is a collection of relations (relational tables, or just tables). A table is represented as an array, organized into rows and columns. The rows are called the tuples (records) of the table, whereas the columns are called the attributes of the table. Both rows (tuples) and columns are unordered. In the basic relational data model all the components can be resolved into sets and functions.  2

The basic relational building block is the data domain represented by an entity type , where is the type set of an entity classification , whose instance set is a universe of data values local to the database. An entity instance is classified by an entity type when . Within the classification the entity type represents its extent, which is the domain of data values . We extend the classification to generalized elements. An indexed collection of entity types is called an -signature. It is denoted by the pair and represented as a map from index set to entity type set. An indexed collection of entity instances is called an -tuple. A tuple represents an object; either a concrete, physical object or an abstract, conceptual object. It is denoted by the pair and represented as a map from index set to the universe. The indexing set is called the arity of the signature or tuple. A -tuple is classified by an -signature , denote by , when they have the same arity and enjoy pointwise classification for all . The extent of an -signature is its tuple set .

Let be a relational table in a database based on the entity classification . An attribute of is an ordered pair consisting of an attribute name and an entity type , where is the arity of the table. The collection of attributes of forms its schema , where is an -signature. A tuple of is an -tuple that is classified by the table signature . Hence, the tuple set of is the set . Each tuple of must be uniquely identifiable by some combination (one or more) of its attribute values. This combination is referred to as the primary key. Without loss of generality, we assume that (primary) keys are single attributes. In addition, we conceptually separate the primary key attribute from the rest of the table and use it for indexing. Hence, the table is an indexed collection of -tuples , where is the set of primary keys of the table; that is, the table is represented as a map from keys to tuples.

Here is an small example of a relational database for a company in unified form, which illustrates both primary keys () and foreign keys (). It contains two relational tables, an employee table Emp and a department table Dept, which are indexed by primary keys and linked by foreign keys.

emp:Emp name:Str addr:Str dept: e1 Plato Greece e2 Aquinus Italy e3 Decartes France dept:Dept name:Str mngr: d1 Sales d2 Production

In this example, the entity (relation) types are Dept, Emp and Str. In the employee relational table Emp, the arity is , the signature is , and the (primary) key set is . Dotted items indicate relations (types or instances) being used as entities, since this is in unified form.

In the relational data model, there are three inherent integrity constraints: entity integrity, domain integrity, and referential integrity. Entity integrity asserts that every table must have a primary key column in which each entry identifies its own row (tuple). Domain integrity asserts that each data entry in a column must be of the type of that column. Entity and domain integrity are requirements for the distinguished form of database semantics. Entity integrity says there must be a tuple function from the set of (primary) keys, and domain integrity says that image tuples must be classified by the table signature . Hence, entity and domain integrity assert the existence of the tuple or content function . Referential integrity asserts that each entry in a foreign key column of a referencing table must occur in the primary key column of the referenced table. Referential integrity is a requirement for the unified form of database semantics. Referential integrity says there must be a function from a foreign key column of a referencing table to the primary key column of the referenced table. Hence, referential integrity asserts the existence of the functions in the sketch interpretation of a relational database.

The information in a database is accessed by specifying queries, which use operations such as select to identify tuples, project to identify attributes, and join to combine tables. In this paper, projection refers to a primitive generalization-specialization operation between pairs of relational tables (they are specified by the database schema, project from joined table to components, or other), whereas join is a composite operation on a linked collection of tables. Selection is a special case of join, which uses reference relations (tables).

3 Tables

A table (database relation) has an underlying (simple) schema with a set of entity types and an -signature , an entity classification with a common (entity) type set component and a local universe of entity instances , a set of (primary) keys, and a tuple function mapping keys to -tuples of type (signature) . Equivalently, it is an object in the the comma category of -tables .

A table morphism (morphism of database relations) consists of a (simple) schema morphism with a function on entity types and an -signature morphism , an entity infomorphism with a common (entity) type function component and a universe (entity instance) function , and a key function , which satisfy the condition , where 3 Table morphisms are illustrated in Figure 1. Here we see that table morphisms have the pleasing property that corresponding entries in the source and target tables satisfy the infomorphism condition from the theory of information flow (Barwise and Seligman [1]).

      
and
for all
let
then , and
hence iff
      
This four-part figure illustrates the condition on table morphisms. It has been annotated to help guide the understanding. The condition is symbolically stated in terms of set functions in the line of text just above. The top left diagram illustrates the condition, and the bottom left diagram expands on this. The top right diagram text is more detailed in terms of a source key . Here we see appearance of the infomorphism condition
Finally, the bottom left figure illustrates the effect of the morphism on source/target tables and .
Figure 1: Table Morphism

Composition of morphisms is defined component-wise. Let denote the category of tables (database relations) with the two projections ( is called the schema functor) and the key functor .

We can have three indexing categories for tables: classifications, schema or semidesignations. Each has their uses: classification indexing proves that the category of tables is complete (and the fibers help explain database fibers), semidesignation indexing proves that the category of tables is cocomplete, and schema indexing follows the true formal-semantics distinction.

3.1 Classification Indexed Category.

For any fixed classification , the -fiber category with respect to the classification functor , called the category of -tables, is the comma category associated with the tuple functor .

It has key and signature projection functors, an equivalent natural transformation , and is described as follows. A fiber object , or an -table (database -relation), is any table with entity classification and tuple (content) function mapping each key (abstract tuple) to a (concrete) -tuple in the extent of . A fiber morphism in is any table morphism in with identity infomorphism . It consists of a signature morphism and a key function , which satisfy the condition .

Proposition 1

There is an indexed category of tables from (the opposite of) the category of classifications and infomorphisms to the category of categories and functors. 4 The (opposite of the) fibered category corresponding to this (its Grothendieck construction) is isomorphic to the category of tables with the classification functor projection .

Proposition 2

The category of -tables is complete, its key projection functor is continuous and its signature projection functor is cocontinuous.

Proof

(We have both an abstract and a useful concrete proof, but only have room for the former.) The category is complete, since is cocomplete. The tuple functor is continuous. 5

The category of -tables is the semantical domain for a relational database with entity classification . Completeness of means that, not just binary, but database joins over arbitrary diagrams of tables of can be computed.

Proposition 3

For any infomorphism , the table fiber functor is continuous.

Continuity of means that database joins are preserved: database joins of -tables are mapped to database joins of -tables.

Theorem 3.1

The category of tables is complete.

Proof

The indexing category is complete, the fiber category is complete for each classification , and the fiber functor is continuous for each infomorphism . Hence, this is an application of a theorem of Tarlecki, Burstall and Goguen [8].  6

3.2 Schema Indexed Category.

For any fixed (simple) schema , the -fiber category with respect to the schema functor , called the category of -tables, is the comma category with key and -classification projection functors 7

It is described as follows. A fiber object , or an -table (database -relation), is any table with (simple) schema . A fiber morphism in is any table morphism in with identity (simple) schema morphism . It consists of a universe (entity instance) function defining an entity infomorphism and hence the presheaf morphism with tuple natural transformation , and a key function , which satisfy the condition . 8

Proposition 4

There is an indexed category of tables , whose Grothendieck construction (fibered category) is (the opposite of) the category of tables with the schema functor projection .

3.3 Semidesignation Indexed Category.

A semidesignation , consists of a schema , and an entity classification with a common (entity) type set component . A semidesignation morphism consists of a schema morphism and an entity infomorphism with a common (entity) type function component . For any semidesignation , the set of tuples of is , the set of -tuples in the extent of .

Lemma 1

Any semidesignation morphism defines a tuple function . Hence, there is a tuple functor .

Proposition 5

The category of tables is the comma category

associated with the tuple functor . The category of tables is cocomplete.

Proof

The opposite category of semidesignations is cocomplete, since is complete. 9

4 Relational Databases

A relational database is a naturally connected diagram of tables. It has an underlying relational database schema 10 with a category of relation types (symbols) linked by generalization-specialization, a set of entity types , and a signature functor , an entity classification with a common (entity) type set component and a local universe of entity instances , a key functor , and a tuple natural transformation . Equivalently, it consists of a table functor , where and is the tuple natural transformation that is an integral component of the comma category . Here are some examples of relational databases.

Table.

A table (database relation) with entity classification , schema , tuple set , and tuple function , is a one-object relational database with the same entity classification, the terminal category of relation types (symbols) , the signature functor with single -signature , the key functor with single key set , the tuple natural transformation with single component tuple function , and the table functor with single -table .

Classification.

A classification is a relational database , where the entity classification is itself ().  11 The additional components are described as follows. The category of relation types (symbols) is the reverse conceptual preorder (generalization-specialization order) on entity types with when ; that is, when is at least as general as . The signature functor is the principle filter operator ; it maps an entity type to the -signature , where is the principle filter of and is inclusion; and it maps an ordering with to the arity inclusion function . For any an entity type , the -tuple functor applied to the -signature is the tuple set . The key functor is the extent operator , which maps an entity type to its extent and maps the ordering with to the extent inclusion function . For any entity type , the tuple function maps an entity instance to the constant tuple . This defines a natural transformation .

IFF Structure.

Using the key functor, we can define the relation classification with type set , instance set , and incidence with when for key and relation symbol . The elements (keys) in are called abstract tuples in the Information Flow Framework [9]. Any relational database determines an structure (model) in the Information Flow Framework [9] with type language , entity classification , semidesignation and relation classification . This is an adjoint situation: any IFF structure determines a relational database.

Unified Database.

A unified database is a special case of a database, whose relation classification coincides with its entity classification . Unified databases allow the introduction of foreign keys. In fact, columns are either the single primary key or a foreign key. The entries in a column are keys of the type of the column. Actual datatypes, such as strings or numbers, can be regarded as primary keys of themselves. Conversely, we can think of any relational table with a single column, one whose schema is of the form , to be a set of entities.

Any relational database schema in unified form () has an associated sketch. Define the arity functor , Let denote the Grothendieck construction of with object set and morphisms for -constraints . The graph of the sketch has node set and edges with source and target . This graph is actually 2-dimensional, given the -constraints. The sketch specifies a cone for the signature of each relation type and constraints for the commuting diagrams in . Any relational database in unified form has an associated sketch interpretation . The interpretation maps a node (relation type) to the set of keys of and maps an edge to the map , where . This also is 2-dimensional.

A relational database morphism consists of a relational database schema morphism 12 an entity infomorphism with a common (entity) type function component and a universe (entity instance) function , and a key natural transformation , which satisfy the condition , where . 13 It is strict or trim when the underlying relational database schema morphism is strict or trim (). Figure 2 illustrates in detail a relational database morphism. Here are some examples of relational database morphisms.

Table morphism.

A relational database morphism with one-object source and target categories of relations and identity relation functor is identical to a single morphism of tables , except that the direction has switched.

Infomorphism.

An entity infomorphism is a relational database morphism between the classifications regarded as relational databases, where the following hold.  14 The type function is monotonic mapping an ordering with to the ordering , with , since iff implies iff . For each type , the type function restricts to an arity function