#21
Why do all my Relationships show as Indeterminate ?
Hi, David.
As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight?

Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls. HTH.

Gunny
See http://www.QBuilt.com for all your database needs. See http://www.Access.QBuilt.com for Microsoft Access tips and tutorials. Blogs: www.DataDevilDog.BlogSpot.com, www.DatabaseTips.BlogSpot.com
http://www.Access.QBuilt.com/html/ex...ributors2.html for contact info.
#22
Why do all my Relationships show as Indeterminate ?
Dear friends:
Gunny's helpful post is very specifically and exactly the correct understanding of my meaning and intentions. Thank you very much indeed.

Tom Ellison
Microsoft Access MVP

"'69 Camaro" AM wrote in message ... Hi, David. As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight? Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls. HTH. Gunny See http://www.QBuilt.com for all your database needs. See http://www.Access.QBuilt.com for Microsoft Access tips and tutorials. Blogs: www.DataDevilDog.BlogSpot.com, www.DatabaseTips.BlogSpot.com http://www.Access.QBuilt.com/html/ex...ributors2.html for contact info.
#24
Why do all my Relationships show as Indeterminate ?
Dear David:
The natural key value would be in the foreign key table in any case. The comparison is between adding the autonumber column to both tables plus adding an index on the autonumber in the foreign key table. The break-even point is at about a 20-character natural key length, which happens to be a very common length in my designs. I figure adding 4 bytes to each table, plus at least 8 bytes of indexing, plus some overhead as indexes are never filled to 100%. More inline.

Tom Ellison
Microsoft Access MVP

"David W. Fenton" wrote in message . 1... "Tom Ellison" wrote in : As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight?

I'll say there's nothing in your response I haven't heard before. I'm sure of that. But I still am baffled why anyone would still think natural keys are the way to go except in a small number of cases.

I have actually built databases side by side and tested this. I do not share your conclusion. There are some advantages to both. My contention these days is that a database engine can, and should, be built that gives the best advantages of both, and indeed exceeds the performance of either. The whole thing is a "stop, you're both right" situation. A natural key that is text may be longer in the dependent table than putting a surrogate key in the dependent table. On the other hand, you must add the identity surrogate column in the foreign key table, and you must index this additional column.

You lost me. Assuming we're talking about a single-column PK, the difference in storage space between a long integer and a text field is going to be proportional to the difference between the storage space required for the one vs. the other.
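Tom's break-even figure can be put into rough arithmetic. The sketch below is illustrative only: the 4-byte autonumber, the ~8-byte index entry, and the fill-factor slack are the estimates from this thread, not measured engine internals.

```python
def surrogate_net_bytes(natural_key_bytes, children_per_parent=1,
                        index_entry_bytes=8, fill_slack=1.3):
    """Bytes the surrogate-key design adds per parent row, minus bytes it saves.

    Adds:  a 4-byte autonumber column in the parent table, plus an index
           entry on it (~8 bytes, padded because index pages never fill 100%).
    Saves: (natural_key_bytes - 4) bytes in every child row that would
           otherwise carry the natural key as its foreign key value.
    """
    added = 4 + index_entry_bytes * fill_slack
    saved = (natural_key_bytes - 4) * children_per_parent
    return added - saved

print(surrogate_net_bytes(20))  # near zero: ~20-char keys are the break-even
print(surrogate_net_bytes(10))  # positive: the surrogate design costs space
print(surrogate_net_bytes(40))  # negative: the surrogate design saves space
```

With more children per parent the savings multiply, which is why the break-even key length drops as the relationship fans out.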
Since both require storage in two tables in order to map the relationship, and because both have to be indexed, it seems to me that all you need to do is figure out the difference between the size of the two data types as stored. Only very small text fields are going to take up less space than a long integer (4 bytes? or is it 8?).

Not actually proportional. There are 3 entities involved:

1a. The foreign key table, which must grow by 4 bytes for each row it contains for the autonumber column.
2a. The dependent table, which will shrink by the difference between the natural key and the autonumber key.
3a. The index on the foreign key table for the autonumber column, which is entirely avoided by the natural key method.

There are also differences in performance.

1b. A lookup will be modestly faster for large tables using the autonumber. The primary factor here is the number of levels in the b-tree index. A 20-byte natural text key makes for a 24-byte entry in each node of the index, meaning there will be a maximum of about 160 rows per node, while the 4-byte autonumber has an 8-byte entry and about 500 entries per node. For, say, 1 million rows, the natural key index will require 3 levels of index, meaning 3 disk accesses for every joined row of data (160 cubed is about 4 million). For the autonumber, it will still require 3 levels of b-tree, and will perform at the same speed. However, 500 cubed is 125 million, so between 4 million rows and 125 million rows, the autonumber index is faster by a 3:4 ratio. This is a very modest gain in performance during lookups.
2b. When adding rows, the natural key method will be faster. The foreign key row is smaller, and has one fewer index to be maintained. There is no index on the autonumber column to be maintained, either.
3b. When executing a query, it is often common to require only the natural key value. See the Categories table in Northwind, where the natural key value is the ONLY column in the foreign key table.
If you put this value in the dependent table, then you avoid the need to JOIN to this table at all. I propose to gain the advantages of using both methods, and to gain additional advantages as well. When my work is finished, I will be able to show these things quite well.

As I have been saying, both approaches have some potential advantages, especially in very large databases, probably those in excess of the 2 GB limit on Jet. But the whole debate is somewhat frustrating since the best solution would be a database engine that automatically provides all the best advantages of both methods, and can indeed exceed either approach taken separately. There exists a value like the autonumber, but which is the value recorded in the lowest level of the b-tree, which points to the exact location of each row in the foreign key table. This value could be recorded in the dependent table in place of the autonumber value in 2a above, removing the need for the index in 3a above. It would not be stored in the foreign key table, 1a above, but would be external to that table.

There are two challenges to this. One is to maintain integrity in the database when a foreign key table entry has its natural key updated. This is currently handled by cascading updates. I propose this be deferred, and instead keep a bitmap of all the rows in the foreign key table, showing whether they have been updated but not yet cascaded. When this happens, the database engine would force a lookup to get the current value of the natural key, and not use the natural key value in the dependent table.

The other situation occurs especially with a table that is "clustered" (SQL Server terminology) or compacted (Jet). This relocates the rows of tables to the order of the primary key. In the case of surrogate autonumber keys, this places the rows in an order that is not particularly useful.
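Tom's deferred-cascade idea can be sketched in miniature. Everything here is hypothetical: the class, the names, and the Python set standing in for the per-row bitmap are illustrative inventions, not any engine's actual machinery.

```python
class DeferredCascade:
    """Toy model of the deferred-cascade proposal: child rows cache the
    parent's natural key; a stale set (standing in for the per-row bitmap)
    marks parents updated but not yet cascaded, forcing a live lookup
    until idle-time cleanup re-syncs the cached copies."""

    def __init__(self):
        self.parent = {}       # parent_id -> current natural key value
        self.child_cache = {}  # child_id  -> (parent_id, cached key value)
        self.stale = set()     # parents changed but not yet cascaded

    def update_parent_key(self, pid, new_key):
        self.parent[pid] = new_key
        self.stale.add(pid)    # defer the cascade; just flag the row

    def child_key(self, cid):
        pid, cached = self.child_cache[cid]
        if pid in self.stale:  # bitmap says the cache may be wrong: look it up
            return self.parent[pid]
        return cached          # fast path: no lookup, no join

    def idle_cleanup(self):
        """What the engine would do in idle time: re-cascade, clear flags."""
        for cid, (pid, _) in self.child_cache.items():
            if pid in self.stale:
                self.child_cache[cid] = (pid, self.parent[pid])
        self.stale.clear()
```

The cost David raises later in the thread is visible even here: every read pays the `in self.stale` branch, whether or not anything is stale.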
Indeed, it is more efficient to use the natural key as the primary index, and create a separate unique index for the surrogate keys. If, for example, you are reporting the contents of the table, having the rows in the same natural key order that the report requires will almost certainly be more efficient than having the rows in the autonumber order, which would generally be the order in which they were entered. In this case, the natural key values in the dependent table will be correct, but the surrogate "row pointer" values will be invalidated. In both cases, I propose that the database engine should spend its idle time cleaning up these situations, returning the database to a state where these operations can proceed at the optimal speed. Well, now I've spilled the beans. Does anyone follow this mess?

So it seems quite obvious to me that natural keys are always going to be longer, unless they are numeric data (in which case they aren't really natural, as somebody has to maintain uniqueness of the numeric values). And the overhead of updating large indexes increases as the number of records increases, regardless of which is the larger table in the relationship (parent or child).

There are trade-offs in the size of the database, and it is not always the case that there is any noticeable difference in the size of the database.

I would be much more concerned about the performance issues with the larger index data pages that will be required for a text-based natural key.

Because of the way b-tree indexes are created, there is little, and often no, difference in performance, as I covered in my discussion of the number of disk accesses needed to perform this operation. Also, I do not consider the size of the database to be a critical factor. Hard disk space is not all that expensive. And size does not, in this case, equate to performance.

In the case of indexes it certainly does, both in editing/updating and in retrieval.
The time taken to access an index is not at all proportional to its size. It is proportional to the number of levels required, which grows in an inverse geometric (that is, logarithmic) fashion.

I would say that "long integer columns can be stored more efficiently" is fairly insignificant if what you're saying is it saves disk space. If you're talking performance, then again there are significant trade-offs.

Ah! So you agree with what I've written above.

I was only going with "space" as the starting point because that's the ground you chose to stake out in the discussion. I'm much more concerned with performance, too, and that's a major reason why I would consider a natural key only in a very limited number of instances.

But this does not have the effect I believe you suppose it has. It has some effect, but the geometric nature of the growth surprisingly limits this effect. I quite disagree that repeating the data in the dependent table, even with a multi-column key, is a normalization issue. I do not agree that this violates any of the rules of normalization. If you point out that it repeats those columns that were in the foreign table, then I respond that having an identity column in the foreign table and repeating that in the dependent table must be a violation of the same rule. It seems obvious then that there is no such rule.

I don't know what theory says. I don't really care. What I do know is that the more fields you have, the more likely they are to need to CASCADE UPDATES. That means a *huge* performance issue, because an update to any of the columns of the PK of the parent table will need to update the corresponding column in all the records of the child table. But with a surrogate key, which is never updated, you never ever have that issue.

But you lose the advantage of having the natural key value stored in the dependent table.

That you can keep the repeated data in sync at the engine level is dependent on which engine you use. But you are updating data in multiple locations.
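David's point about the cost of cascading a natural-key change is easy to demonstrate. This sketch uses SQLite via Python's sqlite3 module rather than Jet, and the table and column names are illustrative:

```python
import sqlite3

# With a natural PK, changing the parent value must rewrite every child row
# (here SQLite does it via ON UPDATE CASCADE); an autonumber surrogate key
# is never updated, so this cost never arises. SQLite syntax, not Jet.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE category (name TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    category TEXT REFERENCES category(name) ON UPDATE CASCADE)""")
con.execute("INSERT INTO category VALUES ('Beverages')")
con.executemany("INSERT INTO product(category) VALUES (?)",
                [("Beverages",)] * 1000)
con.execute("UPDATE category SET name = 'Drinks'")   # touches 1000 child rows
n = con.execute("SELECT COUNT(*) FROM product "
                "WHERE category = 'Drinks'").fetchone()[0]
print(n)  # 1000
```

One single-row UPDATE on the parent silently became a thousand child-row writes, plus the index maintenance on each: exactly the multiplication David is describing.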
An autonumber surrogate key is never changed, never updated, so the issue with keeping repeated data synchronized simply vanishes.

There are specialized problems in maintaining surrogate keys in multiple areas. The GUID was created to solve this.

But you've just quadrupled the size of the surrogate key. And yet you must somehow still maintain the integrity of the uniqueness of the natural key!

This illustrates the depth and complexity of the problem with which we're dealing here.

To me, that's the main reason to normalize, so that you don't have to maintain data in multiple locations. Even though it's usually done at the engine level, multi-column natural keys still require the CPU cycles to keep the repeated data in sync. It is not "vastly more efficient".

It is not uncommon to write a query that requires one or more columns to be displayed that are part of the natural key from the foreign key table. If such columns are repeated in the dependent table, then putting them on a report will not require a JOIN to the foreign table. Since they already exist in the dependent table, you can retrieve them directly. In this not uncommon case, there need be no lookup to retrieve those values.

But SQL-based database engines are optimized to make such joins very efficient. Secondly, in your version, the FK columns won't be updatable, but in many views, the data the FK points to in the parent table *will* be updatable.

Hard disk accesses are the biggest enemy of performance, and any method that increases their number pays for it. Avoiding a JOIN remains one of the biggest ways to keep query performance high.

I expect you are referring to avoiding cascade updates when you speak of using natural PKs only "when the values will *never* change". The rules of good database design actually state that the natural key should rarely change, not never, but the point is pretty much the same.
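Tom's join-avoidance point can be seen in a toy pair of schemas, again using SQLite via Python; the Northwind-style names are illustrative:

```python
import sqlite3

# Two toy schemas: one keyed by a surrogate id, one carrying the natural key
# in the dependent table itself.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE category  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE product_s (id INTEGER PRIMARY KEY,
                            cat_id INTEGER REFERENCES category(id));
    CREATE TABLE product_n (id INTEGER PRIMARY KEY, cat_name TEXT);
    INSERT INTO category  VALUES (1, 'Beverages');
    INSERT INTO product_s VALUES (10, 1);
    INSERT INTO product_n VALUES (10, 'Beverages');
""")
# Surrogate design: showing the category name requires a JOIN to the parent.
joined = con.execute("""SELECT c.name FROM product_s p
                        JOIN category c ON c.id = p.cat_id""").fetchone()[0]
# Natural design: the value is already in the dependent row -- no JOIN at all.
direct = con.execute("SELECT cat_name FROM product_n").fetchone()[0]
print(joined, direct)
```

Both queries return the same answer; the second simply never visits the parent table, which is the saving Tom is pointing at, and the cost David weighs against it is keeping `cat_name` correct when the parent value changes.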
Well, now you get into the other major flaw of natural keys -- that for many entities, there are no reasonable candidate keys. What would you use as a natural key for a person? There is simply no viable natural key to identify a person in any application that is using real-world data and is not handing off the problem to some other PK-assigning authority (like the Social Security Administration, which is itself problematic, since, over time, that PK gets re-used).

The requirement that there BE a unique natural key cannot be camouflaged by avoiding HAVING a unique natural index. To fail to have one prevents the database designer from ensuring that there be a one-to-one correspondence between the entities in the real world and the rows in his table. If you depend only on a surrogate key, then you have done NOTHING to prevent the duplication of those natural entities within your database. I do not agree that this argument enters into consideration here. The need to have a unique natural key is absolute, and no autonumber/identity key helps remove that problem in any fashion.

I suggest that human ingenuity will commonly find a way to uniquely describe natural entities in ordinary speech, and that we can tap into that to design unique natural keys. The proposal that there is no way to specify a unique person when we talk about John Smith in conversation means we could not speak of him or conceive of him uniquely. The fact that we consistently find a way that does permit us to do so uniquely in natural speech tells me we can have the computer do the same. Otherwise, there fails to be a way to communicate at all.

In some of my designs, such cascade events are recorded. A system can go for a few months between such occurrences. So, I largely agree with you, if you will permit.

There are lots of entities where the only candidate natural keys will need to be updated often.
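Tom's claim that a surrogate key alone does nothing to prevent duplicate entities, so a unique natural index is needed anyway, can be shown with SQLite. The (name, born) "natural key" is purely illustrative; as this very exchange notes, real person keys are hard:

```python
import sqlite3

# An autonumber surrogate PK by itself would happily accept the same person
# twice; the UNIQUE constraint on the natural columns is what stops it.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE person (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    name TEXT NOT NULL,
    born TEXT NOT NULL,
    UNIQUE (name, born))                     -- the natural uniqueness rule
""")
con.execute("INSERT INTO person(name, born) VALUES ('John Smith', '1969-07-20')")
try:
    con.execute("INSERT INTO person(name, born) "
                "VALUES ('John Smith', '1969-07-20')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the unique natural index caught the duplicate entity
print(rejected)  # True
```

Drop the UNIQUE constraint and the second insert succeeds with a fresh id, which is exactly the silent duplication Tom warns about.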
And, of course, there can't be any Null fields in the compound PK, which means storing default values that will then have to be eliminated in reporting, etc.

I disagree that there is such a lack of candidate natural keys. This is, as I have just described, not something that happens in normal speech and writing. We DO find a way to uniquely identify a person or entity when such uniqueness is necessary. If we can convey that uniqueness to one another, then we can convey it to the computer as well.

There's quite a bit to this discussion, and I'm preparing a series, if not a book, about this very thing. Frankly, I find there are very cogent arguments on both sides of the discussion. But, finally, I find that the disagreement is entirely artificial. The real problem is that the issue should have been addressed inside the design of the database engine (Jet, SQL Server, etc.) and could then be entirely transparent to both users and designers. The facility provided by surrogate "identity" keys could have been addressed within the engine, automatically, and this can be done with an efficiency vastly greater than either method implemented by us database designers. That is, the problem is artificial, and its best solution lies outside what you and I can effect.

I'd be interested to see your thoughts on that at more length. It intrigues me as a concept, and it's certainly a general principle that I would support, but I'm afraid that in so many cases I'm not happy when such things are actually done for me, behind the scenes. Some examples: Name Autocorrect, Multi-value fields.

I propose that, for a 1:M relationship (the most common) the database engine maintain the link between the foreign and dependent tables as an internal pointer directly to the foreign table row. Unlike the autonumber, this requires no index, so it is several times faster.
Do you really believe, for instance, that there is no index to the data pages where memo fields are stored, so that retrieval of the memo data with the record it belongs to is not quicker? There would have to be internal structures doing the mapping, and they'd have to be optimized for efficiency. Seems to me that it would just be a bunch of hidden structures, as opposed to ones that you can create and work with yourself.

Depending on the design of the database engine, there is certain to be a librarian keeping track of the pages of information stored in the database file. However, for efficiency, I am convinced this index is always kept in memory. The whole point is to eliminate hard drive accesses. That's where 99% of the performance issue lies.

The engine must keep track of whether this link is valid or not, and use the lookup facility using the key when it is not, all the time fixing up those links that have been invalidated when there is idle time on the server.

Ack. That sounds terrible. What if the server never has idle time? What if you're not using a server-based data engine? And what happens to data requests between the start of the update and the commit of it? Your proposal is adding a whole lot of new locking issues, seems to me, and would be doing internally exactly what is happening already in most database engines.

A server that has no idle time isn't keeping up with user requests. The possibility that the server always has just enough power to perform exactly what is asked of it is too unlikely to consider. A server that falls behind just 10 minutes each day will soon be taking hours to answer each user request. This is not what is happening in the real world. A server always has some additional time available. Excess capacity is PLANNED. And, what I propose will free up considerable additional server processing time, making it possible to perform what I propose.
This would implement all the best features of both natural and surrogate keys, and indeed out-perform either.

I really cannot see how. Perhaps you don't have this level of hand waving in your planned long-form version of this, but I just don't see how you get from where we are to what you propose without using exactly the same kinds of structures we already use.

Once viewed in this way, the whole issue is not really a debate between database designers, but a failure of database engine designers. The debate we are having is artificially created by a failure to build the database engine optimally.

But the problem I see is that you'd be depending on the db engine to determine something that is application-specific. That is, uniqueness is a characteristic of the data used for a particular purpose in a particular application, and natural keys in particular are very dependent on that. I don't see how a database engine could determine this without having an understanding of the way the entity is being used in the particular application.

It is not at all the case that the implementation of relationships is application specific. This is a general approach to a general problem. Please do not think I have no understanding or sympathy for the point of view you have proposed. I quite agree with you on some of the major aspects. My opinions on this have been revolutionized in the past 2 years as I've come to realize the underlying factors causing this debate.

I believe that, using the existing tools, there may be very good cause to use surrogate keys for relationships. I'm not so sure this ever happens in a database less than 2 GB. In my upcoming paper on the subject, I have created databases and contrasted their performance using both paradigms, with both Jet and SQL Server. The results may surprise many.

Your proposed system would use surrogate GUIDs to maintain the links. I can't see how it could work otherwise.
And you'd still have to maintain uniqueness of the actual attributes with indexes you create yourself. So, I see absolutely no difference between the current surrogate key type of RI and what you're proposing.

No, my proposal uses existing internal pointers, as found in the lowest-level nodes of any index, to implement what I propose. The difference is that the natural values are found in the dependent rows, and the surrogate key is completely hidden from the database designer, and operates without the need for any index.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#25
Why do all my Relationships show as Indeterminate ?
"'69 Camaro" AM
wrote in : As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight? Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls.

I don't recall any kind of "brawls" on this topic. Spirited disagreement, yes, but hey, it's Usenet! I read it as being addressed to alleged past behavior by me, in that the civil tone was unexpected. Perhaps he meant that in terms of this particular topic instead of addressing me, specifically, but somehow it didn't seem to me to be a general comment. I hope I am wrong on that, and then I'd have to admit to being overly sensitive, which would be pretty ironic!

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#26
Why do all my Relationships show as Indeterminate ?
"Tom Ellison" wrote in
: The natural key value would be in the foreign key table in any case. The comparison is between adding the autonumber column to both tables plus adding an index on the autonumber in the foreign key table. The break even point is at about a 20 character natural key length, which happens to be a very common length in my designs. I figure adding 4 bytes to each table, plus at least 8 bytes of indexing, plus some overhead as indexes are never filled to 100%. I don't understand how a 20-character text key can be as efficient in terms of storage and index maintenance as a long integer, which my Access help file tells me is 4 bytes. For the data storage alone, you're talking 5 times as much space per record. Secondly, my understanding (which could be wrong) is that Jet (and other db engines) have optimized retrieval of numeric values as compared to text values, even when indexed. So there'd be a performance hit because you're no longer benefiting from that optimization. Then there's the index update hit because you've got to maintain more data in the index tree. "David W. Fenton" wrote in message . 1... "Tom Ellison" wrote in : [] A natural key that is text may be longer in the dependent table than putting a surrogate key in the dependent table. On the other hand, you must add the identity surrogate column in the foreign key table, and you must index this additional column. You lost me. Assuming we're talking about a single-column PK, the difference in storage space between a long integer and a text field is going to be proportional to the difference between the storage space required for the one vs. the other. Since both require storage in two tables in order to map the relationship, and because both have to be indexed, it seems to me that all you need to do is figure out the difference between the size of the two data types as stored. Only very small text fields are going to take up less space than a long integer (4 bytes? or is it 8?). Not actually proportional. 
There are 3 entities involved. 1a. The foreign key table, which must grow by 4 bytes for each row it contains for the autonumber column 2a. The dependent table, which will shrink by the difference between the natural key and the autonumber key 3a. The index on the foreign key table for the autonumber column, which is entirely avoided by the natural key method

Your terminology confused me for a while, and that caused me to not understand your point. By "foreign key table" you mean the parent table in the relationship. Yes, you are adding 4 bytes per record *if* there is an actual candidate natural key. So, yes, you are right that it is not directly proportional. But only in cases where there really is a proper candidate natural key, which I find very rare except for simple lookup tables.

There are also differences in performance. 1b. A lookup will be modestly faster for large tables using the autonumber. The primary factor here is the number of levels in the b-tree index. A 20-byte natural text key makes for a 24-byte entry in each node of the index, meaning there will be a maximum of about 160 rows per node, while the 4-byte autonumber has an 8-byte entry and about 500 entries per node. For, say, 1 million rows, the natural key index will require 3 levels of index, meaning 3 disk accesses for every joined row of data (160 cubed is about 4 million). For the autonumber, it will still require 3 levels of b-tree, and will perform at the same speed. However, 500 cubed is 125 million, so between 4 million rows and 125 million rows, the autonumber index is faster by a 3:4 ratio. This is a very modest gain in performance during lookups.

That is, if the only speed benefit is from the nodes that need to be traversed. My understanding is that there are other optimizations for numeric values in most db engines that speed the processing of joins beyond just the b-tree traversal. But I'm just going on third-hand information there, so I could be wrong on that.
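The node arithmetic being debated here can be checked directly. The sketch assumes a 4 KB index page and a 4-byte row pointer per entry, round numbers consistent with Tom's figures rather than any engine's actual internals:

```python
import math

def btree_fanout_and_levels(rows, key_bytes, ptr_bytes=4, node_bytes=4096):
    """Max entries per index node, and b-tree levels needed for `rows` rows."""
    entry = key_bytes + ptr_bytes     # key value + row pointer per node entry
    fanout = node_bytes // entry
    levels = math.ceil(math.log(rows) / math.log(fanout))
    return fanout, levels

for key_bytes in (20, 4):
    print(key_bytes, btree_fanout_and_levels(1_000_000, key_bytes))
# 20-byte key: fanout ~170 (Tom rounds to 160); 4-byte key: ~512 (Tom's 500).
# At 1M rows both need 3 levels; the 4-byte key's deeper reach
# (512**3 = ~134M vs 170**3 = ~4.9M rows) only pays off on far larger tables.
```

This reproduces Tom's conclusion: the depth, and hence the disk accesses per joined row, differs only once the table outgrows the natural key's fan-out.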
Certainly the need to handle double-byte data surely must introduce some kind of additional overhead in the text-based indexes.

2b. When adding rows, the natural key method will be faster. The foreign key row is smaller, and has one fewer index to be maintained. There is no index on the autonumber column to be maintained, either.

I sure wish you'd use "parent row" instead of "foreign key row." I don't think this is a huge issue, as you're on the 1 side of the join. The many side is where most of the records will be added. I can see where that would be less of an issue if most of your N's are 1 or 2, but once you have an average of 2 or more, you've really multiplied the amount of maintenance hit well beyond the small difference in updating the surrogate key index.

3b. When executing a query, it is often common to require only the natural key value. See the Categories table in Northwind, where the natural key value is the ONLY column in the foreign key table. If you put this value in the dependent table, then you avoid the need to JOIN to this table at all.

But that's *precisely* the kind of table I'm *agreeing* works very well with a natural key, because it's a one-column lookup table. However, I don't always use a natural key, because the values sometimes have to be updated (in some apps more often than others).

I propose to gain the advantages of using both methods, and to gain additional advantages as well. When my work is finished, I will be able to show these things quite well.

Well, I'm beginning to see where you're going, but I just don't see the advantages except in the type of case I was already agreeing were just fine for natural keys. Once you get to a 2-column or more PK, then I think your whole theory breaks down.
It certainly doesn't change the data storage issues, but it does magnify the storage and index updating issues in the child table (i.e., the one with the N records), and the join issue only works when the data you need to filter on is only one join away (i.e., it only helps with direct relationships). Once you need data two joins away, you probably have greatly *increased* the problems with join performance.

As I have been saying, both approaches have some potential advantages, especially in very large databases, probably those in excess of the 2 GB limit on Jet. But the whole debate is somewhat frustrating since the best solution would be a database engine that automatically provides all the best advantages of both methods, and can indeed exceed either approach taken separately. There exists a value like the autonumber, but which is the value recorded in the lowest level of the b-tree, which points to the exact location of each row in the foreign key table. This value could be recorded in the dependent table in place of the autonumber value in 2a above, removing the need for the index in 3a above. It would not be stored in the foreign key table, 1a above, but would be external to that table.

But my understanding is that the value that is recorded there points to a data page with an offset for the start of the record. Maybe the offset is stored in the data page, instead. But when you compact, the indexes have to be updated, and if you stored that value in each child record instead, you'd have to update it in many more places than you do with the current situation. I'm speaking of Jet here, but surely every database engine has some similar kind of methods (MySQL would have a file name and a row number). In short, it's exactly the same problem I have with depending on CASCADE UPDATES for natural keys -- when the parent value changes you have to do a bunch of updates to a lot of records and the indexes for those records.

There are two challenges to this.
One is to maintain integrity in the database when a foreign key table entry has its natural key updated. This is currently handled by cascading updates. I propose this be deferred, and instead keep a bitmap of all the rows in the foreign key table, showing whether they have been updated but not yet cascaded. When this happens, the database engine would force a lookup to get the current value of the natural key, and not use the natural key value in the dependent table.

That requires branching logic to decide which to do, and the time it takes to run that test could add significantly to data retrieval time. This would greatly increase your need to perform data maintenance to get the data defragmented.

The other situation occurs especially with a table that is "clustered" (SQL Server terminology) or compacted (Jet). This relocates the rows of tables to the order of the primary key. In the case of surrogate autonumber keys, this places the rows in an order that is not particularly useful.

On the contrary, it can be *very* useful in reducing concurrency conflicts. A random Autonumber means that the data ends up randomly distributed, so that updates are not as likely to collide on the same data page. Certainly something similar is going to be the case in all data stores, at some level or the other.

Indeed, it is more efficient to use the natural key as the primary index, and create a separate unique index for the surrogate keys. If, for example, you are reporting the contents of the table, having the rows in the same natural key order that the report requires will almost certainly be more efficient than having the rows in the autonumber order, which would generally be the order in which they were entered.

This would be what I would call a "premature optimization," in that you're trying to get application-level performance enhancements out of operations at the lowest level of the database.
I see this as a mistaken approach, in that you're building a bias into the data store that is not always going to be useful. And it violates the principles behind Codd's rules and SQL in that you're worrying about the data store.

No, a data engine that makes it easy to load into memory and pre-optimize for a particular purpose might very well show a real performance boost without downgrading it elsewhere.

It certainly is true that most database engines are optimized for reading/writing to disk, when these days there's enough RAM to run production apps directly from the image in memory. But there would be reliability issues with that and I'm not sure we're prepared for that yet. But I do know that a lot of large databases are, in fact, completely loaded into RAM to improve performance. I don't know how many databases out there have added features to exploit running from RAM.

In this case, the natural key values in the dependent table will be correct, but the surrogate "row pointer" values will be invalidated. In both cases, I propose that the database engine should spend its idle time cleaning up these situations, returning the database to a state where these operations can proceed at the optimal speed.

That's exactly where the RAM vs. disk storage advantage could be exploited. An engine designed around the capabilities of RAM would fix this. But, in order to do it, you would have re-introduced a hash table that translates the original locations into the current locations of the data, and you're basically back to the current index structure, but with fragmentation inherent in your translation structure.

Well, now I've spilled the beans. Does anyone follow this mess?

Yes, I'm following you, but I'm not sure the performance improvements would follow from what you're suggesting.
But, I'm used to thinking in terms of storage on disk, read serially, rather than storage in RAM, read randomly (all locations are, theoretically, just as close as all others, though they aren't, really, but much more so than on a hard drive). Perhaps you should add the whole RAM vs. disk issue into your discussion of this, as I'm pretty sure it's central to the current design of database engines.

So it seems quite obvious to me that natural keys are going to always be longer, unless they are numeric data (in which case they aren't really natural, as somebody has to maintain uniqueness of the numeric values). And the overhead of updating large indexes increases as the number of records increases, regardless of which is the larger table in the relationship (parent or child).

There are trade-offs in the size of the database, and it is not always the case that there is any noticeable difference in the size of the database.

I would be much more concerned about the performance issues with the larger index data pages that will be required for a text-based natural key.

Because of the way b-tree indexes are created, there is little, and often no difference in performance, as I covered in my discussion of the number of disk accesses needed to perform this operation.

But you've tended to restrict your discussion of that issue to the parent table, and rather ignored the way the problem can multiply in the child table. In an app where every parent record has 100 children, the issue can become pretty great, both in storage and performance, seems to me.

Also, I do not consider the size of the database to be a critical factor. Hard disk space is not all that expensive. And size does not, in this case, equate to performance.

In the case of indexes it certainly does, both in editing/updating and in retrieval.

The time taken to access an index is not at all proportional to its size. It is proportional to the number of levels required, which grows in an inverse geometric (i.e., logarithmic) fashion.
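The point that index access time tracks the number of b-tree levels, not raw index size, can be sketched with a rough calculation. The 4 KB page size and 4-byte pointer below are illustrative assumptions (real engines add per-page and per-entry overhead), but they show why a 20-character key often needs no more levels than a 4-byte one:

```python
import math

def btree_height(num_rows, page_size=4096, key_bytes=4, ptr_bytes=4):
    """Rough height of a b-tree: the number of entries that fit on a
    page sets the fan-out, and height grows as the log (base fan-out)
    of the row count."""
    fanout = page_size // (key_bytes + ptr_bytes)
    return max(1, math.ceil(math.log(num_rows, fanout)))

# long-integer surrogate key vs. a 20-character natural key
for key in (4, 20):
    for n in (10_000, 1_000_000):
        print(f"key={key} bytes, rows={n}: {btree_height(n, key_bytes=key)} levels")
```

With these numbers both key sizes need 2 levels at 10,000 rows and 3 levels at a million rows, so the number of disk accesses per lookup is identical, even though the text index occupies several times more pages.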
Again, I refer to my "impression" that there were index optimizations that favored numeric values over text.

But index updating performance *does* degrade with the size of the field being updated.

I would say that "long integer columns can be stored more efficiently" is fairly insignificant if what you're saying is it saves disk space. If you're talking performance, then again there are significant trade-offs.

Ah! So you agree with what I've written above. I was only going with "space" as the starting point because that's the ground you chose to stake out in the discussion. I'm much more concerned with performance, too, and that's a major reason why I would consider a natural key only in a very limited number of instances.

But this does not have the effect I believe you suppose it has. It has some effect, but the geometric nature of the growth surprisingly limits this effect.

Again, I think you're ignoring the asymmetric nature of the performance issues. You concentrated mostly on the parent table. I've concentrated on the results of duplicating the parent record's data in multiple records in the child table. When it's 1:1, then there's little issue. When it's 1:10 or 1:100, it becomes a whole different ballgame.

I quite disagree that repeating the data in the dependent table, even with a multi-column key, is a normalization issue. I do not agree that this violates any of the rules of normalization. If you point out that it repeats those columns that were in the foreign table, then I respond that having an identity column in the foreign table and repeating that in the dependent table must be a violation of the same rule. It seems obvious then that there is no such rule.

I don't know what theory says. I don't really care. What I do know is that the more fields you have, the more likely they are to need CASCADE UPDATES.
That means a *huge* performance issue, because an update to any of the columns of the PK of the parent table will need to update the corresponding column in all the records of the child table. But with a surrogate key, which is never updated, you never ever have that issue.

But you lose the advantage of having the natural key value stored in the dependent table.

Joins are so easy that I just don't see that as much of an advantage. I'm not an end user, after all -- I understand how to write SQL! And most db engines are optimized for joins, precisely because of the issues involved. And, of course, if you *do* need to join on your natural-key indexes, you've probably reduced the efficiency of the join because it's a text-based index.

[...] It is not "vastly more efficient". It is not uncommon to write a query that requires one or more columns to be displayed that are part of the natural key from the foreign key table. If such columns are repeated in the dependent table, then putting them on a report will not require a JOIN to the foreign table. Since they already exist in the dependent table, you can retrieve them directly. In this not uncommon case, there need be no lookup to retrieve those values.

But SQL-based database engines are optimized to make such joins very efficient. Secondly, in your version, the FK columns won't be updatable, but in many views, the data the FK points to in the parent table *will* be updatable.

Any method that increases the number of hard disk accesses is the biggest enemy of performance. Avoiding a JOIN remains one of the biggest ways to keep query performance high.

Depends on what your app is doing most of -- retrieving data or editing it. And throw in the database-in-RAM issue and see if the results don't change.

I expect you are referring to avoiding cascade updates when you speak of using natural PKs only "when the values will *never* change".
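The asymmetry of the cascade-update cost can be made concrete with a toy counter (the function and its numbers are a hypothetical model, not a measurement of any engine): with a natural foreign key, every child row stores the parent's key value and must be rewritten when it changes; with a surrogate key, the children are untouched.

```python
def rows_touched_by_pk_change(num_children, natural_key=True):
    """Toy count of row writes when a parent's natural key value
    changes: a natural FK cascades the new value into every child
    row, while a surrogate FK never changes, so only the parent row
    itself is written."""
    parent_writes = 1
    child_writes = num_children if natural_key else 0
    return parent_writes + child_writes

# A 1:100 relationship: 101 row writes vs. 1
print(rows_touched_by_pk_change(100, natural_key=True))
print(rows_touched_by_pk_change(100, natural_key=False))
```

At 1:1 the difference is 2 writes vs. 1; at 1:100 it is 101 vs. 1, plus the corresponding index maintenance on every rewritten child row, which is the "whole different ballgame" described above.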
The rules of good database design actually state that the natural key should rarely change, not never, but the point is pretty much the same.

Well, now you get into the other major flaw of natural keys -- that for many entities, there are no reasonable candidate keys. What would you use as a natural key for a person? There is simply no viable natural key to identify a person in any application that is using real-world data and is not handing off the problem to some other PK-assigning authority (like the Social Security Administration, which is itself problematic, since, over time, that PK gets re-used).

The requirement that there BE a unique natural key cannot be camouflaged by avoiding HAVING a unique natural index. To fail to do so prevents the database designer from ensuring that there be a one-to-one correspondence between the entities in the real world and the rows in his table. If you depend only on a surrogate key, then you have done NOTHING to prevent the duplication of those natural entities within your database. I do not agree that this argument enters into consideration here. The need to have a unique natural key is absolute, and no autonumber/identity key helps remove that problem in any fashion.

I am not persuaded of the theoretical importance of your PK ensuring uniqueness. There's a difference between uniqueness of the record and uniqueness of the entity being represented by that record. Given the imperfections of the latter, I don't see that it makes sense to work awfully hard attempting to make the two correspond. Data is imperfect precisely because we never have complete information.

If you force the PK function onto the natural key fields (and the PK function is a meta-function of the database engine, not of your entities themselves), then you put requirements on the data you can store in those fields, and thus have to make up some values to make sure you have no Nulls.
Then you have to suppress those fake values in certain situations, all because of a choice you've made to overload your data fields with both user-level functions (fully representing the data known about the entity) and meta-functions at the database level (relating the record to records in other tables).

I suggest that human ingenuity will commonly find a way to uniquely describe natural entities in ordinary speech, and that we can tap into that to design unique natural keys. The proposal that there is no way to specify a unique person when we talk about John Smith in conversation means we cannot speak of him or conceive of him uniquely. The fact that we consistently find a way that does permit us to do so uniquely in natural speech tells me we can have the computer do the same. Otherwise, there fails to be a way to communicate at all.

If natural speech were so easy to represent in digital form, then I think we'd have much more accurate voice recognition systems than we do.

In some of my designs, such cascade events are recorded. A system can go for a few months between such occurrences. So, I largely agree with you, if you will permit.

There are lots of entities where the only candidate natural keys will need to be updated often. And, of course, there can't be any Null fields in the compound PK, which means storing default values that will then have to be eliminated in reporting, etc.

I disagree that there is such a lack of candidate natural keys. This is, as I have just described, not something that happens in normal speech and writing.

But that is precisely because we don't put our identifiers into neat little columns. We have a Gestalt representation in our heads that we use to identify the entity, one that often includes information that is useless in a business application.
Do you really want to have your contact management app asking people for their hair and eye color, just so we can use that information to distinguish the John Smith with red hair and green eyes from the one with black hair and brown eyes? In a real-world application, when you ask for too much data, you end up with fake data, or no data. So there's always a tension between getting the most possible complete data and the practical realities of incomplete information and what actual users will have the patience to put in. If the attribute doesn't have any use in your app other than to help establish uniqueness, then your users aren't going to want to be bothered, and you'll end up with a column of UNKNOWN.

We DO find a way to uniquely identify a person or entity when such uniqueness is necessary. If we can convey that uniqueness to one another, then we can convey it to the computer as well.

But our own mental data storage systems are several orders of magnitude more complex than anything even conceived of for the computer.

[...]

I propose that, for a 1:M relationship (the most common) the database engine maintain the link between the foreign and dependent tables as an internal pointer directly to the foreign table row. Unlike the autonumber, this requires no index, so is several times faster.

Do you really believe, for instance, that there is no index to the data pages where memo fields are stored, so that retrieval of the memo data with the record it belongs to is not quicker? There would have to be internal structures doing the mapping, and they'd have to be optimized for efficiency. Seems to me that it would just be a bunch of hidden structures, as opposed to ones that you can create and work with yourself.

Depending on the design of the database engine, there is certain to be a librarian keeping track of the pages of information stored in the database file. However, for efficiency, I am convinced this index is always kept in memory.
This whole point is to eliminate hard drive accesses. That's where 99% of the performance issue lies.

Bingo! But you'll still end up maintaining a table mapping the RAM image to the current disk image. That will be much faster than doing it on disk, but will still require time/CPUs, so efficiency will still be important. It may be that current RAM prices make it possible to do this at a level that makes the difference no longer relevant (just as in the 90s graphics speed became fast enough to make a GUI completely viable without being too sluggish in comparison to character-based UIs). But the real question would be whether or not re-engineering a database engine is going to give enough performance/ease-of-use benefit to be worth the development investment. I'm not convinced there's going to be enough of a difference to justify it, nor that it's a good idea to prematurely optimize the data storage structures for any particular application (see above).

The engine must keep track of whether this link is valid or not, and use the lookup facility using the key when it is not, all the time fixing up those links that have been invalidated when there is idle time on the server.

Ack. That sounds terrible. What if the server never has idle time? What if you're not using a server-based data engine? And what happens to data requests between the start of the update and the commit of it? Your proposal is adding a whole lot of new locking issues, seems to me, and would be doing internally exactly what is happening already in most database engines.

A server that has no idle time isn't keeping up with user requests. The possibility that the server always has just enough power to perform exactly what is asked of it is too unlikely to consider. A server that falls behind just 10 minutes each day will soon be taking hours to answer each user request. This is not what is happening in the real world. A server always has some additional time available. Excess capacity is PLANNED.
And, what I propose will free up considerable additional server processing time, making it possible to perform what I propose.

I'm always scared of delayed writes, which is basically what you're proposing, even in a transactional system.

This would implement all the best features of both natural and surrogate keys, and indeed out-perform either.

I really cannot see how. Perhaps you don't have this level of hand waving in your planned long-form version of this, but I just don't see how you get from where we are to what you propose without using exactly the same kinds of structures we already use. Let me point out that you've proposed exactly that in your discussion of moving the data page pointers from the indexes to the child records.

Once viewed in this way, the whole issue is not really a debate between database designers, but a failure of database engine designers. The debate we are having is artificially created by a failure to build the database engine optimally.

But the problem I see is that you'd be depending on the db engine to determine something that is application-specific. That is, uniqueness is a characteristic of the data used for a particular purpose in a particular application, and natural keys in particular are very dependent on that. I don't see how a database engine could determine this without having an understanding of the way the entity is being used in the particular application.

It is not at all the case that the implementation of relationships is application specific. This is a general approach to a general problem.

I think in retrospect what you're actually proposing amounts to using hidden surrogate keys for the meta function of relating records in different tables.

Please do not think I have no understanding or sympathy for the point of view you have proposed. I quite agree with you on some of the major aspects.
My opinions on this have been revolutionized in the past 2 years as I've come to realize the underlying factors causing this debate.

I believe that, using the existing tools, there may be very good cause to use surrogate keys for relationships.

I'm not so sure this ever happens in a database less than 2 GB.

In my upcoming paper on the subject, I have created databases and contrasted their performances using both paradigms, with both Jet and SQL Server. The results may surprise many.

Your proposed system would use surrogate GUIDs to maintain the links. I can't see how it could work otherwise. And you'd still have to maintain uniqueness of the actual attributes with indexes you create yourself. So, I see absolutely no difference between the current surrogate key type of RI and what you're proposing.

No, my proposal uses existing internal pointers, as found in the lowest level nodes of any index, to implement what I propose. The difference is that the natural values are found in the dependent rows, and the surrogate key is completely hidden from the database designer, and operates without the need for any index.

Perhaps not an index created by the DBA, but there's an index in there somewhere, behind the scenes.

--
David W. Fenton
http://www.dfenton.com/
usenet at dfenton dot com
http://www.dfenton.com/DFA/
#27
Why do all my Relationships show as Indeterminate ?
Hi, David.
I don't recall any kind of "brawls" on this topic. Spirited disagreement, yes, but hey, it's Usenet!

Then you've been very fortunate not to have attended some of the same meetings I have! ;-)

Perhaps he meant that in terms of this particular topic instead of addressing me, specifically, but somehow it didn't seem to me to be a general comment.

I doubt he was aiming his comment specifically at you. It's a religious war and wherever you go, you'll find warriors on both sides. Some are ready to duke it out at the first hint of "your side is wrong." In those cases, it's usually best to keep a safe distance, or else wear armor or asbestos underwear. Mine always seem to be at the dry cleaners when I need them.

I hope I am wrong on that, and then I'd have to admit to being overly sensitive, which would be pretty ironic!

LOL!

Gunny

See http://www.QBuilt.com for all your database needs.
See http://www.Access.QBuilt.com for Microsoft Access tips and tutorials.
Blogs: www.DataDevilDog.BlogSpot.com, www.DatabaseTips.BlogSpot.com
http://www.Access.QBuilt.com/html/ex...ributors2.html for contact info.
#28
Why do all my Relationships show as Indeterminate ?
Dear David:
If you use the surrogate identity key, you must add that column to both the foreign key table and the dependent table, while you can remove the natural key value from the dependent table. In addition you will be adding an index to the foreign key table.

In this index, each row of the foreign key table will require at least 8 bytes, 4 for the key value, and 4 for the pointer to the data it represents (in some databases, it is more than this). That's a minimum of 16 bytes to replace what took 20 bytes before (if we assume one dependent row for each foreign key row).

Now it is not always the case that there is only one row in the dependent table for each row in the foreign key table. There can be 100 rows in the dependent table. In this case, you have saved some space. But there can be 0 rows in the dependent table for some of the rows in the foreign key table. In this case, you have added at least 12 bytes and gained none.

When you add in a fill factor for the index of perhaps 50%, you will have used 24 bytes instead of 20. In terms of saving space, this will happen if the average foreign key row is used more than 3 times. Below that, the effect is probably very small. The entire effect is often minimal, but depends on the actual data used.

In any case, this is hardly the main point of discussion (in my opinion). The main point should be performance, not storage, and it would be if we did not have an artificial 2 GB limit built in.

Tom Ellison

"David W. Fenton" wrote in message . 1...

"Tom Ellison" wrote in :

The natural key value would be in the foreign key table in any case. The comparison is between adding the autonumber column to both tables plus adding an index on the autonumber in the foreign key table. The break even point is at about a 20 character natural key length, which happens to be a very common length in my designs. I figure adding 4 bytes to each table, plus at least 8 bytes of indexing, plus some overhead as indexes are never filled to 100%.
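The byte arithmetic above can be made explicit with a small calculator. The figures (4-byte id, 8-byte index entry, 50% fill factor, 20-byte natural key) are the ones from the post; the exact break-even fan-out depends on per-row overheads the post does not spell out, so this is a sketch of the comparison, not a definitive accounting.

```python
def surrogate_bytes(children, id_bytes=4, index_entry=8, fill=0.5):
    """Per parent row: the id stored in the parent, one index entry
    (inflated by the fill factor), and the id repeated in each
    child (dependent) row."""
    return id_bytes + index_entry / fill + id_bytes * children

def natural_bytes(children, key_bytes=20):
    """Per parent row: the natural key repeated in each child row.
    (The parent stores the natural key either way, so it is left
    out of both totals.)"""
    return key_bytes * children

# With one child per parent: 24 bytes vs. 20, matching the post
for k in (0, 1, 3, 100):
    print(k, surrogate_bytes(k), natural_bytes(k))
```

At 0 children the surrogate scheme is pure overhead; at 1 child it costs 24 bytes against 20; at high fan-outs (4 bytes repeated per child instead of 20) it wins decisively, which is the asymmetry both sides of the debate keep returning to.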
I don't understand how a 20-character text key can be as efficient in terms of storage and index maintenance as a long integer, which my Access help file tells me is 4 bytes. For the data storage alone, you're talking 5 times as much space per record.

Secondly, my understanding (which could be wrong) is that Jet (and other db engines) have optimized retrieval of numeric values as compared to text values, even when indexed. So there'd be a performance hit because you're no longer benefiting from that optimization. Then there's the index update hit because you've got to maintain more data in the index tree.
#29
Why do all my Relationships show as Indeterminate ?
"Tom Ellison" wrote in
: The entire effect is often minimal, but depends on the actual data used. In any case, this is hardly the main point of discussion (in my opinion). The main point should be performance, not storage, and it would be if we did not have an artificial 2 GB limit built in.

I think the main point is performance, specifically in regards to the maintenance of indexes, and especially on the child tables. I don't care about the extra 4 bytes in the parent table.

We can reduce the discussion to a scenario that favors your techniques, but, again, I would say that will only apply to a very small number of situations, and is thus not terribly useful. It is not that natural keys can *not* be more efficient or perform better. It is that in most cases they do not, and they introduce significant problems that lead either to compromises in your data or to very complex logic in your application.

--
David W. Fenton
http://www.dfenton.com/
usenet at dfenton dot com
http://www.dfenton.com/DFA/
#30
Why do all my Relationships show as Indeterminate ?
"David W. Fenton" wrote:
As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response.

I don't know how to respond to that. Are you spoiling for a fight?

Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls.

I don't recall any kind of "brawls" on this topic. Spirited disagreement, yes, but hey, it's Usenet! I read it as being addressed to alleged past behavior by me, in that the civil tone was unexpected. Perhaps he meant that in terms of this particular topic instead of addressing me, specifically, but somehow it didn't seem to me to be a general comment. I hope I am wrong on that, and then I'd have to admit to being overly sensitive, which would be pretty ironic!

I'm with Gunny on this topic. Also I've met Tom Ellison in person at a previous MVP summit. Now he's passionate about natural keys. But I very much doubt he's looking for a fight.

I'm sure Tom didn't mean you specifically. Just folks in general. He might not have hung out much if at all in c.d.m-a and didn't realize you are a distinguished Access denizen over there. Now it was an ambiguous sentence but yes you are being a tad sensitive. (smile)

Tony

--
Tony Toews, Microsoft Access MVP
Please respond only in the newsgroups so that others can read the entire thread of messages.
Microsoft Access Links, Hints, Tips & Accounting Systems at http://www.granite.ab.ca/accsmstr.htm