Wednesday, May 10, 2017

Keys, Primary and Foreign

Welcome to the Bank

Suppose we have an oversimplified banking system, with two tables.

create table balances(name string, balance number);
create table transactions(tid number, from string, to string, amount number);

This allows us to populate a table of people with balances, and transactions where people give money to each other.

    balances
    name | balance
    -----+--------
    ann  | 200
    bob  | 150
    leo  | 300

    transactions
    tid | from | to  | amount
    ----+------+-----+-------
    1   | ann  | bob | 25
    2   | bob  | leo | 50

We can imagine traversing the transaction table something like this pseudocode:

    for (from, to, amount) in select("select from,to,amount
                                        from transactions
                                       order by tid")
        begin transaction
           update balances set balance = balance - amount where name = from;
           update balances set balance = balance + amount where name = to;
        commit;

(note that sql implementations all vary slightly in how tables and indexes are creates.  Everything following is pseudocode.)

Some Data Problems?

Here's a couple of problems we've got in the present system.

  1. We can have empty (null) fields.  For example, a transaction of insert (3,bob,NULL,NULL).
  2. We can have duplicate people in the balance table.  Note that it's OK if balances are duplicated (you and I can have the same amount of money!) but each person must be unique.
  3. We are going to be performing lookups keyed by transactions.tid and balances.name.  We better make that efficient if we're going to grow to be a bank with millions of users and transactions!

Some Data Solutions

1. Let's specify all fields that are required.

    create table balances(name string not null, balance number not null);
    create table transactions(tid number not null, from string not null,
                              to string not null, amount number not null);

Now a transaction will fail if we insert a row with a null column.  This  is nice, because now our code can safely assume we have good non-null data).

2. Specify we can't have duplicate names.

    create table balances(name string unique not null, balance number not null);

This will cause an error if we ever

    insert into balances(name, balance) values('bob', 100);
    insert into balances(name, balance) values('bob', 200);

3. Create some indexes on our key search columns.

    create index balances_name_ix on balances(name);
    create index transactions_tid_ix on transactions(tid);

Primary Keys

When we look at our tables, we notice that each has a column which serves as the main identifier of a row.

  - balances are "owned" by name
  - transactions are identified (and ordered) by tid.

We notice three things about these columns.

  • they should never be null (because they identify the row)
  • they should be unique (because it doesn't make sense for them to be dupes)
  • practically, they should have an index (because we usually use them in where clauses).

The concept of a table having a main identifying column, and that column having the three attributes just listed, is so common that we give this a name:  That column is the primary key, and can be specified as such.

    create table balances(name string primary key, balance number not null);
    create table transactions(tid number primary, from string not null,
                              to string not null, amount number not null);

Declaring a column to be a primary key automatically sets the three things we noticed earlier.

  - primary keys are NOT NULL
  - primary keys are UNIQUE
  - there's an index automatically created for the PK

In addition, via notation we've specified that the row's "identifier" is a particular key... we don't have to guess or infer this information.

Foreign Keys

Let's consider a row in the transactions table:

    (1, "ann", "zoo", 50);

If we run this throught the pseudocode above, everything will work and no errors will be reported.  But there's a very big problem.  This statement is correct,

    update balances set balance = balance + amount where name = 'zoo'

but it loses the $50 ann is trying to transfer, since there's no row in balances where name is 'zoo'.

We can solve this in the database by specifying that transactions.from and transactions.to refer to balances.name.  Note that semantically this removes the requirement that we declare transactions.from and transactions.to as NOT NULL, since they are required to match balances.name.

    create table transactions(tid number primary,
                              from string references(balances.name),
                              to string references(balances.name),
     amount number not null);

With this in place, this insert will fail:

    insert into transactions values(1, 'ann', 'foo', 100);
    *** ERROR: foo does not exist in balances(name)

and this delete will fail:

    delete from balances where name = 'ann';
    *** ERROR: can't delete row with dependencies

There are various options when creating foreign keys, such as ON CASCADE DELETE.  In this case,

    delete from balances where name = 'ann';

would cause this statement to be implicitly executed as part of the transaction.

    delete from transactions where from = 'ann' or to = 'ann';

Caveats

Note that a lot of systems might document the foreign key relations are, but not enforce them in the database, relying instead on programming logic.  Sometimes this is done for efficiency (e.g. many inserts into the transactions table -- if you verify from and to are correct, you don't want to pay two lookups penalty on each insert).

A Common PK/FK Idiom

Splitting a potential Fat table into several smaller tables.  Sometimes it's for ease of use, sometimes it's for other reasons.  For example

    create table master_users(uid, Name);
    create table billing_info(uid, credit_card, expires, ...);
    create table contact_info(uid, address, email, phone, ...);
    create table lawsuit_info(uid, ...)

We might be legally required to protect lawsuit_info and not make it visible to people who can see contact info.

In any case, we would make uid a PK in all four tables, allowing us to

    # get everyone
    select m.name, b.email
      from master_users m, billing_info b
     where m.uid = b.uid;

    # get one person
    select m.name, b.email
      from master_users m, billing_info b
     where m.uid = b.uid and m.uid = 123;

And we could optionally add FK relationships from *_info.uid to master_users.uid.