January 27th, 2008, 09:01 PM, posted to comp.databases.ms-access, comp.databases.theory,
microsoft.public.access, microsoft.public.access.tablesdbdesign, microsoft.public.sqlserver
-CELKO-
Separate PK in Jxn Tbl?

> It has never caused me the least problem. There has been no revelation
> in this thread that would cause me to even revisit the decision.

How would you know, since autonumbering has no validation or
verification? Trust me, I earn a lot of my living cleaning up these
"fake pointer chain" SQL schemas.

ATTRIBUTES AND FIRST NORMAL FORM (1NF):

An attribute has to belong to an entity in the real world being
modeled by the RDBMS. In First Normal Form, an attribute is modeled
as a column in a table. It must also hold an atomic value. I am not
going to get into the differences between atomic and scalar, but
scalar implies atomic and not the other way around.

IDENTITY does not exist in an entity in the real world being modeled
by the RDBMS. Thus, it is not an attribute and cannot be in a table,
by definition.

IDENTITY is a result of the physical state of a particular piece of
hardware at a particular time as read by the current release of a
particular database product. It is not a data type. You cannot have
more than one column of this "type" in a table. It is not NULL-able,
as all data types in an RDBMS have to be (Dr. Codd again). It is not
really a numeric; you cannot meaningfully do math with it.
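A minimal sketch of those restrictions, using SQL Server-style IDENTITY
syntax (the table and column names here are illustrative, not from any
real schema):

CREATE TABLE Foobar
(foo_id INTEGER IDENTITY(1,1) PRIMARY KEY,  -- implicitly NOT NULL
 foo_name VARCHAR(20) NOT NULL);

ALTER TABLE Foobar ADD bar_id INTEGER IDENTITY(1,1); -- rejected: only one
                                                     -- IDENTITY column per table
UPDATE Foobar SET foo_id = 5;                        -- rejected: an IDENTITY
                                                     -- column cannot be updated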

It is what is called a "tag number" -- basically, a nominal scale
written with numbers instead of letters. Only equality tests make
sense (see the part about inserting sets of rows, which means that
IDENTITY has no natural ordering, not even temporal).

DEFINITION OF KEYS:

(Dr. Codd, RM II, page 23):
"It is equally valid to interpret the uniqueness property in terms of
object identification: the value of the primary key in each row of the
pertinent R-table identifies the particular object represented by that
row uniquely within the type of objects that are represented by that
relation."

Create a table with an IDENTITY as the key and perform a series of
insertions of the same object, say an automobile:

INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size) VALUES
(..);
INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size) VALUES
(..);
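
For instance, a concrete version of that scenario might look like this
(SQL Server-style syntax; the DDL, the VIN, and the values are purely
illustrative):

CREATE TABLE Vehicles
(vehicle_id INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY, -- pseudo-key
 vin CHAR(17) NOT NULL,            -- no UNIQUE constraint, so duplicates slip in
 vehicle_mileage INTEGER NOT NULL,
 iso_tire_size CHAR(10) NOT NULL);

INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size)
VALUES ('1HGCM82633A004352', 12000, 'P215/65R15');
INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size)
VALUES ('1HGCM82633A004352', 12000, 'P215/65R15');
-- both rows are accepted, even though they describe the same physical car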

I now have two cars with the same VIN. Actually, I have two
copies of the same car (object) with an IDENTITY pseudo-key instead of
the industry-standard VIN as the proper relational key. This is
called an insertion anomaly.

Assume that this pair of insertions led to creating vehicles 41 and 42
in the table, which are the same object. I can update 42's mileage
without touching 41. I now have two versions of the truth in my
table. This is called an update anomaly.

Likewise, if I wreck vehicle 41, I still have copy 42 in the motor
pool in spite of the fact that the object no longer exists. This is
called a deletion anomaly.

Before you say that you can make a key from (IDENTITY, vin), read more
Dr. Codd (ibid):

If the primary key is composite and if one of the columns is dropped
from the primary key, the first property [uniqueness] is no longer
guaranteed.

Assume that I have correct VINs and use (IDENTITY, vin) as a key.
Dropping the pair clearly does not work -- a lot of vehicles could
have the same mileage and tire sizes, so I do not have unique rows
guaranteed. Dropping IDENTITY will leave me with a proper key that
can be validated, verified and repeated.

Dropping the VIN does not leave me with a guarantee (i.e.
repeatability and predictability). If I run this code:

BEGIN ATOMIC
  DELETE FROM Vehicles
   WHERE id = 41;
  INSERT INTO Vehicles (vehicle_mileage, iso_tire_size)
  VALUES (values of inserted row 41);
END;

The relational algebra says that I should, in effect, have done
nothing. I have dropped and re-inserted the same object - an EXCEPT
and a UNION operation that cancel each other out. But since IDENTITY
is physical and not logical, this does not work.
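
Using the sketch table from earlier, what actually happens looks like
this (the identity values shown are illustrative):

DELETE FROM Vehicles WHERE vehicle_id = 41;
INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size)
VALUES ('1HGCM82633A004352', 12000, 'P215/65R15');
-- the "same" row comes back with a fresh pseudo-key (say, vehicle_id = 43),
-- so any reference to 41 held elsewhere is now silently broken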

If I insert the same vehicle (object) into another table, the system
will not guarantee me that I get the same IDENTITY as the relational
key in the other table. The VIN would be guaranteed.

The guarantee requirement gets worse. SQL is a set-oriented language
and allows me to write things like this.

INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size)
SELECT vin, vehicle_mileage, iso_tire_size
FROM NewPurchases;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Vehicles all at once, not a row at a
time. There are (n!) ways to number (n) rows. Which one did you
pick? Why? The answer in SQL products has been to use whatever the
*physical* order of the physical table happened to be. That non-
relational phrase "physical order" again!

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order.

Can you explain from a logical model why the same rows in the second
query get different IDENTITY numbers? In the relational model, they
should be treated the same if all the values of all the attributes are
identical and each row models the same object as it did before.

Yes, I can write a lot of procedural code and triggers, thus violating
all Codd's rules about high-level data manipulation and integrity
being handled by the RDBMS. But that is a kludge and good SQL
programmers know it.

THE KINDS OF KEYS:

Now for a little more practice than theory. Here is my classification
of the types of keys. It is based on common usage.

1) A natural key is a subset of attributes which occurs in the table
and acts as a unique identifier. It is seen by the user. You can go to
the external reality and verify it. You would also like to have
some validation rule. Example: UPC codes on consumer goods (read the
package barcode) can be validated with a check digit, a
manufacturer's website, or a tool (geographical co-ordinates can be
validated with a GPS).

These should be either industry standards or natural phenomena.
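
As a sketch of such a validation rule, a CHECK constraint can verify the
UPC-A check digit in the schema itself (SQL Server-style syntax; the
table and column names are illustrative):

CREATE TABLE Products
(upc CHAR(12) NOT NULL PRIMARY KEY
     CHECK (upc NOT LIKE '%[^0-9]%'          -- digits only
            AND (3 * (CAST(SUBSTRING(upc,  1, 1) AS INTEGER)
                    + CAST(SUBSTRING(upc,  3, 1) AS INTEGER)
                    + CAST(SUBSTRING(upc,  5, 1) AS INTEGER)
                    + CAST(SUBSTRING(upc,  7, 1) AS INTEGER)
                    + CAST(SUBSTRING(upc,  9, 1) AS INTEGER)
                    + CAST(SUBSTRING(upc, 11, 1) AS INTEGER))
                 + CAST(SUBSTRING(upc,  2, 1) AS INTEGER)
                 + CAST(SUBSTRING(upc,  4, 1) AS INTEGER)
                 + CAST(SUBSTRING(upc,  6, 1) AS INTEGER)
                 + CAST(SUBSTRING(upc,  8, 1) AS INTEGER)
                 + CAST(SUBSTRING(upc, 10, 1) AS INTEGER)
                 + CAST(SUBSTRING(upc, 12, 1) AS INTEGER)) % 10 = 0),
 product_name VARCHAR(50) NOT NULL);
-- the weighted digit sum of a valid UPC-A code is divisible by 10,
-- so a mistyped barcode is rejected at INSERT time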

2) An artificial key is an extra attribute added to the table which is
seen by the user. It does not exist in the external reality, but can
be verified for syntax or check digits inside itself. It is up to the
DBA to maintain a trusted source for them. Example: the open codes in
the UPC scheme which a user can assign to his own stuff. The check
digits still work, but you have to verify them inside your own
enterprise.

If you have to construct a key yourself, it takes time to design it,
to invent a validation rule, to set up audit trails, etc. Yes, doing
things right takes time and work. Not like just popping an IDENTITY
on every table in the schema, is it?

3) An "exposed physical locator" is not based on attributes in the
data model, but in the physical storage and is exposed to user. There
is no reasonable way to predict it or verify it, since it usually
comes from the physical state of the hardware at the time of data
insertion. The system obtains a value thru some physical process in
the hardware totally unrelated to the logical data model.

Just because IDENTITY does not hold a track/sector address (like
Oracle's row_id) does not make it a logical key. A hash points to a
table with the address. An index (the mechanism behind IDENTITY)
resolves to the target address via pointer chains. If you re-hash or
re-index, the physical locator has to resolve to the new physical
location.

Technically, these IDENTITY values are not really keys at all, since
they are derived from the PHYSICAL storage and are not even attributes
in the LOGICAL data model.

Physical locators include assorted flavors of indexes (Ingres uses
over a dozen kinds), pointer chains (Sybase SQL Anywhere's key joins),
row numbers (Oracle), hash tables (Teradata), bit vectors (Rushmore
and Foxpro), non-contiguous storage (SAND, the whole database is
reduced to a multi-dimensional vector space), inverted list (Model
204), etc. Any device that you can use to locate a particular row in
storage without a sequential search of the storage qualifies.

In the old days, we had to explicitly create, use and update indexes.
They were very exposed and one of the classic errors was not to re-
index a file. For Oracle, the classic was storing a row_id in code or
a file, and then doing a storage re-organization. All of the exposed
locators were subject to some characteristic screw up where logical
and physical models met. Thus, we had utilities for cleaning up
broken pointer chains, re-indexing files, etc. in those early days.

One of the major advances in SQL was putting this responsibility on
the system and not the programmers. The bad news is that if someone
uses exposed locators in an SQL product, there are not a lot of
utilities for cleaning up the mess.

** Notice that people get "exposed physical locator" and "surrogate"
mixed up; they are totally different concepts. **

WHAT IS A SURROGATE KEY?

A quote from Dr. Codd: "...Database users may cause the system to
generate or delete a surrogate, but they have no control over its
value, nor is its value ever displayed to them..."
(Codd, E. (1979), "Extending the Database Relational Model to Capture
More Meaning", ACM Transactions on Database Systems, 4(4),
pp. 397-434; the quote is on pp. 409-410).

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means
never used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two
companies merge, the two employee databases might be combined, with
the result that some or all of the serial numbers might be changed).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it
has ceased to have one (e.g. an applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates.

[emphasis begin] Database users may cause the system to generate or
delete a surrogate, but they have no control over its value, nor is
its value ever displayed to them....." (Codd in ACM TODS, pp 409-410)
[emphasis end].

That means if IDENTITY were a surrogate, we would not see its values,
print them out as invoice numbers and so forth. That means if
IDENTITY were a surrogate, we could drop it and the schema would still
work. Like an index. But IDENTITY fails on both those points.

PRACTICAL STUFF

This is a summary of points already made, but put in one place. The
summary is "Good bye, data integrity" in 25 words or less.

IDENTITY does not port. This is such a basic part of Software
Engineering that I am not going to comment on it. Anyone who thinks
that they will always deal only with their own code, on one machine,
with one release of one product, is not planning on a long career.

IDENTITY leaves gaps. This is because it is not a surrogate and does
not verify its value against a relational key. It is a software
implementation of a counter, like we had on some of the old tape
drives. The practical problems involve having to explain the gaps to
the SOX guy in accounting.

IDENTITY has no validation. One tag number is as good as another.

IDENTITY has no verification rules. Looking at an object, you cannot
tell what IDENTITY value will be assigned to it. Is 41 or 42 the id
of this automobile? Prove it.

IDENTITY has no repeatability. The value depends on the arrival time
in the physical storage.

David Cressey advanced the theory that this all started out with
Access 97. If you build a database with one of the database wizards,
this is usually what you get.

Also, if you design your own database and create your own tables,
there comes a time, for each table, when MS Access intervenes and
tells you that you have not assigned a primary key, that one is
recommended, and that Access will do it for you, if that's OK.

If you say "yes" (who would not?), what you get is precisely this: an
extra numeric column, auto-assigned, and declared as the key of the
new table.

Actually, I can live with this for so-called "entity tables". It
would be better for the newbie DBA to discover a natural key and
declare that, but it's not so bad. For "relationship tables", the
primary key ought to be a compound key made up of two or more foreign
keys. But that is not what Access does.

Someone who has learned database design before building their first
Access database would not do this. But Access makes it seductively
simple to design and build your own database, without even a minimal
understanding of the consequences of your decisions.
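
To make the earlier point about "relationship tables" concrete, a
junction table with a compound key looks like this (a sketch; the
table and column names are illustrative, and the referenced Students
and Courses tables are assumed to exist):

CREATE TABLE StudentCourses
(student_id INTEGER NOT NULL
     REFERENCES Students (student_id),
 course_id INTEGER NOT NULL
     REFERENCES Courses (course_id),
 PRIMARY KEY (student_id, course_id)); -- the key is the pair of foreign keys,
                                       -- not a separate autonumbered column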