View Single Post
  #18  
Old April 9th, 2007, 01:37 AM posted to microsoft.public.access
David W. Fenton
external usenet poster
 
Posts: 3,373
Default Why do all my Relationships show as Indeterminate ?

"Tom Ellison" wrote in
:

As long a we can discuss things in a way that is not acrimoniuous,
which seems quite possible given your response.


I don't know how to respond to that. Are you spoiling for a fight?

I'll say there's nothing in your response I haven't heard before.


I'm sure of that. But I still am baffled why anyone would still
think natural keys are the way to go except in a small number of
cases.

A natural key that is text may be longer in the dependent table
than putting a surrogate key in the dependent table. On the other
hand, you must add the identity surrogate column in the foreign
key table, and you must index this additional column.


You lost me. Assuming we're talking about a single-column PK, the
difference in storage space between a long integer and a text field
is going to be proportional to the difference between the storage
space required for the one vs. the other. Since both require storage
in two tables in order to map the relationship, and because both
have to be indexed, it seems to me that all you need to do is figure
out the difference between the size of the two data types as stored.
Only very small text fields are going to take up less space than a
long integer (4 bytes? or is it 8?).

So it seems quite obvious to me that natural keys are going to
always be longer, unless they are numeric data (in which case they
aren't really natural, as somebody has to maintain uniqueness of the
numeric values).

And the overhead of updating large indexes increases as the number
of records increases, regardless of which is the larger table in the
relationship (parent or child).

There are trade-offs in the size of the database, and it
is not always the case that there is any noticable difference in
the size of the database.


I would be much more concerned about the performance issues with the
larger index data pages that will be requuired for a text-based
natural key.

Also, I do not consider the size of the database to be a
critical factor. Hard disk space is not all that expensive. And
size does not, in this case, equate to performance.


In the case of indexes it certainly does, both in editing/updating
and in retrieval.

I would say that "long integer
columns can be stored more efficiently" is fairly insignificant if
what you're saying is it saves disk space. If you're talking
performance, then again there are significant trade-offs.


Ah! So you agree with what I've written above. I was only going with
"space" as the starting point because that's the ground you chose to
stake out in the discussion. I'm much more concerned with
performance, too, and that's a major reason why I would consider a
natural key only in a very limited number of instances.

I quite disagree that repeating the data in the dependent table,
even with a multi-column key, is a normalization issue. I do not
agree that this violates any of the rules of normalization. If
you point out that it repeats those columns that were in the
foreign table, then I respond that having an identity column in
the foreign table and repeating that in the dependent table must
be a violation of the same rule. It seems obvious then that there
is no such rule.


I don't know what theory says. I don't really care. What I do know
is that the more fields you have, the more they are likely to need
to CASCADE UPDATES. That means a *huge* performance issue, because
an update to any of the columns of the PK of the parent table will
need to update the corresponding column in all the records of the
child table. But with a surrogate key, which is never updated, you
never ever have that issue.

That you can keep the repeated data in synch at the engine level is
dependent on which engine you use. But you are updating data in
multiple locations. An autonumber surrogate key is never changed,
never updated, so the issue with keeping repeated data synchronized
simply vanishes.

To me, that's the main reason to normalize, so that you don't have
to maintain data in multiple locations. Even though it's usually
done at the engine level, multi-column natural keys still require
tye CPU cycles to keep the repeated data in synch.

It is not "vastly more efficient". It is not uncommon to write a
query that requires one of more columns to be displayed that are
part of the natural key from the foreign key table. If such
columns are repeated in the dependent table, then putting them on
a report will not require a JOIN to the foreign table. Since they
already exist in the dependent table, you can retrive them
directly. In this not uncommon case, there need be no lookup to
retrieve those values.


But SQL-based database engines are optimized to make such joins
every efficient. Secondly, in your version, the FK columns won't be
updatable, but in many views, the data the FK points to in the
parent table *will* be updatable.

I expect you are referring to avoiding cascade updates when you
speak of using natural PKs only "when the values will *never*
change". The rules of good database design actually state that
the natural key should rarely change, not never, but the point is
pretty much the same.


Well, now you get into the other major flaw of natural keys -- that
for many entities, there are no reasonable candidate keys. What
would you use as a natural key for a person? There is simply no
viable natural key to identify a person in any application that is
using real-world data and is not handing off the problem to some
other PK-assigning authority (like the Social Security
Administration, which is itself problematic, since, over time, that
PK gets re-used).

In some of my
designs, such cascades events are recorded. A system can go for a
few months between such occurrances. So, I largely agree whith
you, if you will permit.


There are lots of entities where the only candidate natural keys
will need to be updated often. And, of course, there can't be any
Null fields in the compound PK, which means storing default values
that will then have to be eliminated in reporting, etc.

There's quite a bit to this discussion, and I'm preparing a
series, if not a book, about this very thing. Frankly, I find
there are very cogent arguments on both sides of the discussion.
But, finally, I find that the disagreement is entirely artificial.
The real problem is that the issue should have been addressed
inside the design of the database engine (Jet, SQL Server, etc.)
and could then be entirely transparent to both users and
designers. The facility provided by surrogate "identity" keys
could have been addressed within the engine, automatically, and
this can be done with an efficiency vastly greater than either
method implemented by we database designers. That is, the problem
is artificial, and its best solution lies outside what you and I
can effect.


I'd be interested to see your thoughts on that at more length. It
intrigures me as a concept, and it's certainly a general principle
that I would support, but I'm afraid that in so many cases I'm not
happy when such things are actually done for me, behind the scenes.
Some examples: Name Autocorrect, Multi-value fields.

I propose that, for a 1:M relationship (the most common) the
database engine maintain the link between the foreign and
dependent tables as an internal pointer directly to the foreign
table row. Unlike the autonumber, this requires no index, so is
several times faster.


Do you really believe, for instance, that there is no index to the
data pages where memo fields are stored, so that retrieval of the
memo data with the record it belongs to is not quicker? There would
have to be internal structures doing the mapping, and they'd have to
be optimized for efficiency. Seems to me that it would just be a
bunch of hidden structures, as opposed to ones that you can create
and work with yourself.

The engine must keep track
of whether this link is valid or not, and use the lookup facility
using the key when it is not, all the time fixing up those links
that have been invalidated when there is idle time on the server.


Ack. That sounds terrible. What if the server never has idle time?
What if you're not using a server-based data engine? And what
happens to data requests between the start of the update and the
commit of it? Your proposal is adding a whole lot of new locking
issues, seems to me, and would be doing internally exactly what is
happening already in most database engines.

This would implement all the best features of both natural and
surrogate keys, and indeed out-perform either.


I really cannot see how. Perhaps you don't have this level of hand
waving in your planned long-form version of this, but I just don't
see how you get from where we are to what you propose without using
exactly the same kinds of structures we already use.

Once viewed in this way, the whole
issue is not really a debate between database designers, but a
failure of database engine designers. The debate we are having is
artificially created by a failure to build the database engine
optimally.


But the problem I see is that you'd be depending on the db engine to
determine something that is application-specific. That is,
uniqueness is a characteristic of the data used for a particular
purpose in a particular application, and natural keys in particular
are very dependent on that. I don't see how a database engine could
determine this without having an understanding of the way the entity
is being used in the particular application.

Please do not think I have no understanding or sympathy for the
point of view you have proposed. I quite agree with you on some
of the major aspects. My opinions on this have been
revolutionized in the past 2 years as I've come to realize the
underlying factors causing this debate.

I believe that, using the existing tools, there may be very good
cause to use surrogate keys for relationships. I'm not so sure
this ever happens in a database less than 2 GB. In my upcoming
paper on the subject, I have created databases and contrasted
there performances using both paradigms, with both Jet and SQL
Server. The results may surprise many.


Your proposed system would use surrogate GUIDs to maintain the
links. I can't see how it could work otherwise. And you'd still have
to maintain uniqueness of the actual attributes with indexes you
create yourself. So, I see absolutely no difference between the
current surrogate key type of RI and what you're proposing.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/