Posted on November 21, 2019 by Dmitry Olshansky, Denis Redozubov (translation, editing)

Sum Types for Relational Databases

A review of various approaches and a better design for representing sum types in a relational database

Matt Parsons talks about a few ways to encode sum types in relational databases in his blog.

Not so long ago I was thinking about the same problem and came up with slightly different solutions. Let’s start with an example Matt had used:

data Animal
    = Cat Name Age FavoriteFood
    | Dog Name OwnerId
    | Bird Name Song
A cat, a dog and a bird

Matt shows us three ways of dealing with it. Let’s do a recap to have a better picture:

  1. Shared Primary Key

Dependent tables (which have a 1-to-1 correspondence to constructors) refer to the primary table (which corresponds to the Haskell data type). To make sure it’s impossible to have multiple references to the same row of the primary table, we add a column containing the constructor name to the primary key of the primary table. The foreign keys of the dependent tables have a similar column, but in each table this column admits only one value. The same identifiers are used in both tables.

  2. The persistent approach

The primary table refers to the dependent tables. The identifiers of these tables are generated separately and are different. Only one of the foreign keys can be non-null (a minimal SQL sketch of this schema is given after this list).

  3. Nullable Columns

Here a denormalized representation of the data is used. All data goes into the same table, the constructor is stored in one of the columns, and check constraints limit the range of values each column can take depending on the constructor.
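
To make the later comparison easier to follow, here is a minimal sketch of how the persistent-style schema (the second approach) could look in PostgreSQL. The table and column names are my own illustration, not taken from Matt’s post:

-- sketch of the persistent-style encoding (approach 2); names are illustrative
create table cat  (id serial primary key, name text not null, age int not null, food text not null);
create table dog  (id serial primary key, name text not null, owner_id int not null);
create table bird (id serial primary key, name text not null, song text not null);
create table animal (
  id serial primary key,
  cat_id  int references cat (id),
  dog_id  int references dog (id),
  bird_id int references bird (id),
  -- exactly one of the three references must be set
  check
    ( (cat_id is not null and dog_id is null and bird_id is null)
    or (cat_id is null and dog_id is not null and bird_id is null)
    or (cat_id is null and dog_id is null and bird_id is not null) )
);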

I’m proposing two other designs to model sum types in relational databases.

Normalized sum

In his example, Matt uses a sum-of-products ADT, as we typically do in Haskell, rather than a pure sum type. Let’s take a look at a pure sum first:

data Cat  = Cat Name Age FavoriteFood
data Dog  = Dog Name OwnerId
data Bird = Bird Name Song
data Animal = ACat Cat | ADog Dog | ABird Bird

For this example, I argue that the persistent approach is a better fit overall. But I propose a few changes:

  • The primary table encodes the constructor tag by having one column for every constructor of the sum type. These columns have a unit type (which admits only one value) but are nullable, and exactly one of them can be non-null, indicating which constructor is used.
  • Each dependent table gets a single extra column holding its constructor tag, which can only be the unit value.
  • Identifiers of the dependent tables mirror identifiers of the primary table. We add constructor tags to the foreign key.1

A few benefits of this approach compared to the persistent one are:

  • We have fewer foreign key identifiers to work with. This leads to a reduction in the size of the primary table, but more importantly, it reduces the number of foreign key indexes.
  • In some cases it improves select performance, as you don’t have to map primary identifiers to dependent identifiers. Consider, for example, extending our database schema with a history of owners for each pet (many-to-many). To select the entire ownership history of all dogs we don’t need to touch the animal table; it suffices to join the history table with the dogs table.
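
To illustrate the second point, assume a hypothetical owner_history table that records the owners of each pet and is keyed by the same identifiers as the dependent tables; the names owner_history, pet_id, owned_from and owned_to below are assumptions made for this example only:

-- hypothetical many-to-many history of owners for each pet
create table owner_history (
  pet_id     integer not null,
  owner_id   integer not null,
  owned_from date    not null,
  owned_to   date
);

-- the entire ownership history for all dogs: no join with animal is needed
select d.name, h.owner_id, h.owned_from, h.owned_to
from dog d
join owner_history h on h.pet_id = d.id;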

Implementation

create type unit as enum ('()');
create table cat (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  age int not null,
  food text not null,
  primary key (id,tag)
);
create table dog (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  owner_id int not null,
  primary key (id,tag)
);
create table bird (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  song text not null,
  primary key (id,tag)
);
create table animal (
  id integer primary key,
  cat unit,
  dog unit,
  bird unit,
  check
    ( (cat is not null and dog is null and bird is null)
    or (cat is null and dog is not null and bird is null)
    or (cat is null and dog is null and bird is not null) ),
  constraint animal__cat foreign key (id,cat) references cat (id,tag),
  constraint animal__dog foreign key (id,dog) references dog (id,tag),
  constraint animal__bird foreign key (id,bird) references bird (id,tag));

When inserting we have to insert data first into the dependent table, and then into the primary table.

When deleting we have to do the opposite.
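
As a sketch (mirroring the example given later for the denormalized version), inserting and then deleting a cat looks like this:

-- insert: dependent table first, then the primary table
insert into cat(id,name,age,food) values (1,'kitty',1,'milk');
insert into animal(id,cat) values (1,'()');
-- delete: primary table first, then the dependent table
delete from animal where id = 1;
delete from cat where id = 1;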

Denormalized sum

So, what’s the difference between normalized and denormalized sums and what is unsatisfactory in the proposed implementation?

In the normalized implementation, rows in the dependent tables can exist on their own, without anything referencing them from the primary table. It’s possible to remove data from the primary table and accidentally forget to remove the corresponding data from the dependent table. Bulk deletion scenarios are not obvious either. An even worse situation can arise if we consider cascade deletion of some data in the primary table. This “disjointedness” of the dependent data is a very close approximation of the pure sum structure, but from the point of view of database design it’s not the best representation possible. The persistent approach exhibits the same problem. For the “Shared Primary Key” approach the situation is reversed: data in the primary table can exist on its own. In other words, in the persistent approach the identifiers in the primary table form a subset of the union of the identifiers in the dependent tables, while in the “Shared Primary Key” approach they form a superset.

Let’s just add two-way references! But how do we insert and delete the data in this scenario? Which table should we start with?

The answer lies in a fairly standard database feature called deferred constraints, which lets us delay the checks until the end of the transaction. This way we get a bijective relation between the primary table and the union of the dependent tables.

Denormalized implementation

create type unit as enum ('()');
create table cat (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  age int not null,
  food text not null,
  primary key (id,tag)
);
create table dog (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  owner_id int not null,
  primary key (id,tag)
);
create table bird (
  id integer not null,
  tag unit not null default '()',
  name text not null,
  song text not null,
  primary key (id,tag)
);
create table animal (
  id integer primary key,
  cat unit,
  dog unit,
  bird unit,
  check
    ( (cat is not null and dog is null and bird is null)
    or (cat is null and dog is not null and bird is null)
    or (cat is null and dog is null and bird is not null) ),
  constraint animal__cat foreign key (id,cat) references cat (id,tag)
    deferrable initially deferred, -- !!!
  constraint animal__dog foreign key (id,dog) references dog (id,tag)
    deferrable initially deferred, -- !!!
  constraint animal__bird foreign key (id,bird) references bird (id,tag)
    deferrable initially deferred); -- !!!
-- ---------- !!! ----------------
alter table cat add constraint cat__animal
  foreign key (id) references animal (id) on delete cascade;
alter table dog add constraint dog__animal
  foreign key (id) references animal (id) on delete cascade;
alter table bird add constraint bird__animal
  foreign key (id) references animal (id) on delete cascade;

All the differences of this implementation are highlighted with comments.

In this implementation we have to insert data starting with the primary table (which in my opinion is more natural). To delete the data, we just delete from the primary table, and then cascade deletion removes data from dependent tables.

begin;
  insert into animal(id,cat) values (1,'()');
  insert into cat(id,name,age,food) values (1,'kitty',1,'milk');
  insert into animal(id,dog) values (2,'()');
  insert into dog(id,name,owner_id) values (2,'dolly',5);
  insert into animal(id,bird) values (3,'()');
  insert into bird(id,name,song) values (3,'nightingale','twist');
commit;
delete from animal where id in (1,2);

Here, if desired we can also move the name column to the primary table.
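
As a sketch of that variant (assuming the tables are still empty, so the not-null column can be added without a default), the delta against the schema above would be:

-- name moves from the dependent tables into the primary table
alter table cat  drop column name;
alter table dog  drop column name;
alter table bird drop column name;
alter table animal add column name text not null;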

Thanks for reading!


  1. We can use the fact that PostgreSQL supports compound foreign keys, parts of which can be null; in that case the entire key is considered to be null. This is called “MATCH SIMPLE” and it’s the default behaviour.
