#21
Why do all my Relationships show as Indeterminate ?
Hi, David.
As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight?

Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls. HTH.

Gunny
See http://www.QBuilt.com for all your database needs. See http://www.Access.QBuilt.com for Microsoft Access tips and tutorials. Blogs: www.DataDevilDog.BlogSpot.com, www.DatabaseTips.BlogSpot.com
http://www.Access.QBuilt.com/html/ex...ributors2.html for contact info.
#22
Why do all my Relationships show as Indeterminate ?
Dear friends:
Gunny's helpful post is very specifically and exactly the correct understanding of my meaning and intentions. Thank you very much indeed.

Tom Ellison
Microsoft Access MVP

"'69 Camaro" AM wrote in message ... Hi, David. As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight? Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls. HTH. Gunny See http://www.QBuilt.com for all your database needs. See http://www.Access.QBuilt.com for Microsoft Access tips and tutorials. Blogs: www.DataDevilDog.BlogSpot.com, www.DatabaseTips.BlogSpot.com http://www.Access.QBuilt.com/html/ex...ributors2.html for contact info.
#24
Why do all my Relationships show as Indeterminate ?
Dear David:
The natural key value would be in the foreign key table in any case. The comparison is between adding the autonumber column to both tables plus adding an index on the autonumber in the foreign key table. The break-even point is at about a 20-character natural key length, which happens to be a very common length in my designs. I figure adding 4 bytes to each table, plus at least 8 bytes of indexing, plus some overhead as indexes are never filled to 100%. More inline.

Tom Ellison
Microsoft Access MVP

"David W. Fenton" wrote in message . 1... "Tom Ellison" wrote in : As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight?

I'll say there's nothing in your response I haven't heard before. I'm sure of that. But I still am baffled why anyone would still think natural keys are the way to go except in a small number of cases.

I have actually built databases side by side and tested this. I do not share your conclusion. There are some advantages to both. My contention these days is that a database engine can, and should, be built that gives the best advantages of both, and indeed exceeds the performance of either. The whole thing is a "stop, you're both right" situation. A natural key that is text may be longer in the dependent table than putting a surrogate key in the dependent table. On the other hand, you must add the identity surrogate column in the foreign key table, and you must index this additional column.

You lost me. Assuming we're talking about a single-column PK, the difference in storage space between a long integer and a text field is going to be proportional to the difference between the storage space required for the one vs. the other.
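Tom's break-even figure can be put into rough arithmetic. The sketch below is illustrative only: the 4-byte autonumber, the ~8-byte index entry, and the fill-factor slack are the estimates from this thread, not measured engine internals.

```python
def surrogate_net_bytes(natural_key_bytes, children_per_parent=1,
                        index_entry_bytes=8, fill_slack=1.3):
    """Bytes the surrogate-key design adds per parent row, minus bytes it saves.

    Adds:  a 4-byte autonumber column in the parent table, plus an index
           entry on it (~8 bytes, padded because index pages never fill 100%).
    Saves: (natural_key_bytes - 4) bytes in every child row that would
           otherwise carry the natural key as its foreign key value.
    """
    added = 4 + index_entry_bytes * fill_slack
    saved = (natural_key_bytes - 4) * children_per_parent
    return added - saved

print(surrogate_net_bytes(20))  # near zero: ~20-char keys are the break-even
print(surrogate_net_bytes(10))  # positive: the surrogate design costs space
print(surrogate_net_bytes(40))  # negative: the surrogate design saves space
```

With more children per parent the savings multiply, which is why the break-even key length drops as the relationship fans out.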
Since both require storage in two tables in order to map the relationship, and because both have to be indexed, it seems to me that all you need to do is figure out the difference between the size of the two data types as stored. Only very small text fields are going to take up less space than a long integer (4 bytes? or is it 8?).

Not actually proportional. There are 3 entities involved:

1a. The foreign key table, which must grow by 4 bytes for each row it contains for the autonumber column.
2a. The dependent table, which will shrink by the difference between the natural key and the autonumber key.
3a. The index on the foreign key table for the autonumber column, which is entirely avoided by the natural key method.

There are also differences in performance.

1b. A lookup will be modestly faster for large tables using the autonumber. The primary factor here is the number of levels in the b-tree index. A 20-byte natural text key makes for a 24-byte entry in each node of the index, meaning there will be a maximum of about 160 rows per node, while the 4-byte autonumber has an 8-byte entry and about 500 entries per node. For, say, 1 million rows, the natural key index will require 3 levels of index, meaning 3 disk accesses for every joined row of data (160 cubed is about 4 million). For the autonumber, it will still require 3 levels of b-tree, and will perform at the same speed. However, 500 cubed is 125 million, so between 4 million rows and 125 million rows, the autonumber index is faster by a 3:4 ratio. This is a very modest gain in performance during lookups.
2b. When adding rows, the natural key method will be faster. The foreign key row is smaller, and has one fewer index to be maintained. There is no index on the autonumber column to be maintained, either.
3b. When executing a query, it is often common to require only the natural key value. See the Categories table in Northwind, where the natural key value is the ONLY column in the foreign key table.
If you put this value in the dependent table, then you avoid the need to JOIN to this table at all. I propose to gain the advantages of using both methods, and to gain additional advantages as well. When my work is finished, I will be able to show these things quite well.

As I have been saying, both approaches have some potential advantages, especially in very large databases, probably those in excess of the 2 GB limit on Jet. But the whole debate is somewhat frustrating since the best solution would be a database engine that automatically provides all the best advantages of both methods, and can indeed exceed either approach taken separately. There exists a value like the autonumber, but which is the value recorded in the lowest level of the b-tree, which points to the exact location of each row in the foreign key table. This value could be recorded in the dependent table in place of the autonumber value in 2a above, removing the need for the index in 3a above. It would not be stored in the foreign key table, 1a above, but would be external to that table.

There are two challenges to this. One is to maintain integrity in the database when a foreign key table entry has its natural key updated. This is currently handled by cascading updates. I propose this be deferred, and instead keep a bitmap of all the rows in the foreign key table, showing whether they have been updated but not yet cascaded. When this happens, the database engine would force a lookup to get the current value of the natural key, and not use the natural key value in the dependent table.

The other situation occurs especially with a table that is "clustered" (SQL Server terminology) or compacted (Jet). This relocates the rows of tables to the order of the primary key. In the case of surrogate autonumber keys, this places the rows in an order that is not particularly useful.
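Tom's deferred-cascade idea can be sketched in miniature. Everything here is hypothetical: the class, the names, and the Python set standing in for the per-row bitmap are illustrative inventions, not any engine's actual machinery.

```python
class DeferredCascade:
    """Toy model of the deferred-cascade proposal: child rows cache the
    parent's natural key; a stale set (standing in for the per-row bitmap)
    marks parents updated but not yet cascaded, forcing a live lookup
    until idle-time cleanup re-syncs the cached copies."""

    def __init__(self):
        self.parent = {}       # parent_id -> current natural key value
        self.child_cache = {}  # child_id  -> (parent_id, cached key value)
        self.stale = set()     # parents changed but not yet cascaded

    def update_parent_key(self, pid, new_key):
        self.parent[pid] = new_key
        self.stale.add(pid)    # defer the cascade; just flag the row

    def child_key(self, cid):
        pid, cached = self.child_cache[cid]
        if pid in self.stale:  # bitmap says the cache may be wrong: look it up
            return self.parent[pid]
        return cached          # fast path: no lookup, no join

    def idle_cleanup(self):
        """What the engine would do in idle time: re-cascade, clear flags."""
        for cid, (pid, _) in self.child_cache.items():
            if pid in self.stale:
                self.child_cache[cid] = (pid, self.parent[pid])
        self.stale.clear()
```

The cost David raises later in the thread is visible even here: every read pays the `in self.stale` branch, whether or not anything is stale.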
Indeed, it is more efficient to use the natural key as the primary index, and create a separate unique index for the surrogate keys. If, for example, you are reporting the contents of the table, having the rows in the same natural key order that the report requires will almost certainly be more efficient than having the rows in the autonumber order, which would generally be the order in which they were entered. In this case, the natural key values in the dependent table will be correct, but the surrogate "row pointer" values will be invalidated. In both cases, I propose that the database engine should spend its idle time cleaning up these situations, returning the database to a state where these operations can proceed at the optimal speed. Well, now I've spilled the beans. Does anyone follow this mess?

So it seems quite obvious to me that natural keys are always going to be longer, unless they are numeric data (in which case they aren't really natural, as somebody has to maintain uniqueness of the numeric values). And the overhead of updating large indexes increases as the number of records increases, regardless of which is the larger table in the relationship (parent or child).

There are trade-offs in the size of the database, and it is not always the case that there is any noticeable difference in the size of the database.

I would be much more concerned about the performance issues with the larger index data pages that will be required for a text-based natural key.

Because of the way b-tree indexes are created, there is little, and often no, difference in performance, as I covered in my discussion of the number of disk accesses needed to perform this operation. Also, I do not consider the size of the database to be a critical factor. Hard disk space is not all that expensive. And size does not, in this case, equate to performance.

In the case of indexes it certainly does, both in editing/updating and in retrieval.
The time taken to access an index is not at all proportional to its size. It is proportional to the number of levels required, which grows in an inverse geometric (that is, logarithmic) fashion.

I would say that "long integer columns can be stored more efficiently" is fairly insignificant if what you're saying is it saves disk space. If you're talking performance, then again there are significant trade-offs.

Ah! So you agree with what I've written above.

I was only going with "space" as the starting point because that's the ground you chose to stake out in the discussion. I'm much more concerned with performance, too, and that's a major reason why I would consider a natural key only in a very limited number of instances.

But this does not have the effect I believe you suppose it has. It has some effect, but the geometric nature of the growth surprisingly limits this effect. I quite disagree that repeating the data in the dependent table, even with a multi-column key, is a normalization issue. I do not agree that this violates any of the rules of normalization. If you point out that it repeats those columns that were in the foreign table, then I respond that having an identity column in the foreign table and repeating that in the dependent table must be a violation of the same rule. It seems obvious then that there is no such rule.

I don't know what theory says. I don't really care. What I do know is that the more fields you have, the more likely they are to need to CASCADE UPDATES. That means a *huge* performance issue, because an update to any of the columns of the PK of the parent table will need to update the corresponding column in all the records of the child table. But with a surrogate key, which is never updated, you never ever have that issue.

But you lose the advantage of having the natural key value stored in the dependent table.

That you can keep the repeated data in sync at the engine level is dependent on which engine you use. But you are updating data in multiple locations.
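David's point about the cost of cascading a natural-key change is easy to demonstrate. This sketch uses SQLite via Python's sqlite3 module rather than Jet, and the table and column names are illustrative:

```python
import sqlite3

# With a natural PK, changing the parent value must rewrite every child row
# (here SQLite does it via ON UPDATE CASCADE); an autonumber surrogate key
# is never updated, so this cost never arises. SQLite syntax, not Jet.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE category (name TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    category TEXT REFERENCES category(name) ON UPDATE CASCADE)""")
con.execute("INSERT INTO category VALUES ('Beverages')")
con.executemany("INSERT INTO product(category) VALUES (?)",
                [("Beverages",)] * 1000)
con.execute("UPDATE category SET name = 'Drinks'")   # touches 1000 child rows
n = con.execute("SELECT COUNT(*) FROM product "
                "WHERE category = 'Drinks'").fetchone()[0]
print(n)  # 1000
```

One single-row UPDATE on the parent silently became a thousand child-row writes, plus the index maintenance on each: exactly the multiplication David is describing.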
An autonumber surrogate key is never changed, never updated, so the issue with keeping repeated data synchronized simply vanishes.

There are specialized problems in maintaining surrogate keys in multiple areas. The GUID was created to solve this.

But you've just quadrupled the size of the surrogate key. And yet you must somehow still maintain the integrity of the uniqueness of the natural key!

This illustrates the depth and complexity of the problem with which we're dealing here.

To me, that's the main reason to normalize, so that you don't have to maintain data in multiple locations. Even though it's usually done at the engine level, multi-column natural keys still require the CPU cycles to keep the repeated data in sync. It is not "vastly more efficient".

It is not uncommon to write a query that requires one or more columns to be displayed that are part of the natural key from the foreign key table. If such columns are repeated in the dependent table, then putting them on a report will not require a JOIN to the foreign table. Since they already exist in the dependent table, you can retrieve them directly. In this not uncommon case, there need be no lookup to retrieve those values.

But SQL-based database engines are optimized to make such joins very efficient. Secondly, in your version, the FK columns won't be updatable, but in many views, the data the FK points to in the parent table *will* be updatable.

Hard disk accesses are the biggest enemy of performance, and any method that increases their number pays for it. Avoiding a JOIN remains one of the biggest ways to keep query performance high.

I expect you are referring to avoiding cascade updates when you speak of using natural PKs only "when the values will *never* change". The rules of good database design actually state that the natural key should rarely change, not never, but the point is pretty much the same.
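Tom's join-avoidance point can be seen in a toy pair of schemas, again using SQLite via Python; the Northwind-style names are illustrative:

```python
import sqlite3

# Two toy schemas: one keyed by a surrogate id, one carrying the natural key
# in the dependent table itself.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE category  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE product_s (id INTEGER PRIMARY KEY,
                            cat_id INTEGER REFERENCES category(id));
    CREATE TABLE product_n (id INTEGER PRIMARY KEY, cat_name TEXT);
    INSERT INTO category  VALUES (1, 'Beverages');
    INSERT INTO product_s VALUES (10, 1);
    INSERT INTO product_n VALUES (10, 'Beverages');
""")
# Surrogate design: showing the category name requires a JOIN to the parent.
joined = con.execute("""SELECT c.name FROM product_s p
                        JOIN category c ON c.id = p.cat_id""").fetchone()[0]
# Natural design: the value is already in the dependent row -- no JOIN at all.
direct = con.execute("SELECT cat_name FROM product_n").fetchone()[0]
print(joined, direct)
```

Both queries return the same answer; the second simply never visits the parent table, which is the saving Tom is pointing at, and the cost David weighs against it is keeping `cat_name` correct when the parent value changes.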
Well, now you get into the other major flaw of natural keys -- that for many entities, there are no reasonable candidate keys. What would you use as a natural key for a person? There is simply no viable natural key to identify a person in any application that is using real-world data and is not handing off the problem to some other PK-assigning authority (like the Social Security Administration, which is itself problematic, since, over time, that PK gets re-used).

The requirement that there BE a unique natural key cannot be camouflaged by avoiding HAVING a unique natural index. To fail to have one prevents the database designer from ensuring that there be a one-to-one correspondence between the entities in the real world and the rows in his table. If you depend only on a surrogate key, then you have done NOTHING to prevent the duplication of those natural entities within your database. I do not agree that this argument enters into consideration here. The need to have a unique natural key is absolute, and no autonumber/identity key helps remove that problem in any fashion.

I suggest that human ingenuity will commonly find a way to uniquely describe natural entities in ordinary speech, and that we can tap into that to design unique natural keys. The proposal that there is no way to specify a unique person when we talk about John Smith in conversation means we could not speak of him or conceive of him uniquely. The fact that we consistently find a way that does permit us to do so uniquely in natural speech tells me we can have the computer do the same. Otherwise, there fails to be a way to communicate at all.

In some of my designs, such cascade events are recorded. A system can go for a few months between such occurrences. So, I largely agree with you, if you will permit.

There are lots of entities where the only candidate natural keys will need to be updated often.
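Tom's claim that a surrogate key alone does nothing to prevent duplicate entities, so a unique natural index is needed anyway, can be shown with SQLite. The (name, born) "natural key" is purely illustrative; as this very exchange notes, real person keys are hard:

```python
import sqlite3

# An autonumber surrogate PK by itself would happily accept the same person
# twice; the UNIQUE constraint on the natural columns is what stops it.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE person (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    name TEXT NOT NULL,
    born TEXT NOT NULL,
    UNIQUE (name, born))                     -- the natural uniqueness rule
""")
con.execute("INSERT INTO person(name, born) VALUES ('John Smith', '1969-07-20')")
try:
    con.execute("INSERT INTO person(name, born) "
                "VALUES ('John Smith', '1969-07-20')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the unique natural index caught the duplicate entity
print(rejected)  # True
```

Drop the UNIQUE constraint and the second insert succeeds with a fresh id, which is exactly the silent duplication Tom warns about.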
And, of course, there can't be any Null fields in the compound PK, which means storing default values that will then have to be eliminated in reporting, etc.

I disagree that there is such a lack of candidate natural keys. This is, as I have just described, not something that happens in normal speech and writing. We DO find a way to uniquely identify a person or entity when such uniqueness is necessary. If we can convey that uniqueness to one another, then we can convey it to the computer as well.

There's quite a bit to this discussion, and I'm preparing a series, if not a book, about this very thing. Frankly, I find there are very cogent arguments on both sides of the discussion. But, finally, I find that the disagreement is entirely artificial. The real problem is that the issue should have been addressed inside the design of the database engine (Jet, SQL Server, etc.) and could then be entirely transparent to both users and designers. The facility provided by surrogate "identity" keys could have been addressed within the engine, automatically, and this can be done with an efficiency vastly greater than either method implemented by us database designers. That is, the problem is artificial, and its best solution lies outside what you and I can effect.

I'd be interested to see your thoughts on that at more length. It intrigues me as a concept, and it's certainly a general principle that I would support, but I'm afraid that in so many cases I'm not happy when such things are actually done for me, behind the scenes. Some examples: Name Autocorrect, Multi-value fields.

I propose that, for a 1:M relationship (the most common) the database engine maintain the link between the foreign and dependent tables as an internal pointer directly to the foreign table row. Unlike the autonumber, this requires no index, so it is several times faster.
Do you really believe, for instance, that there is no index to the data pages where memo fields are stored, so that retrieval of the memo data with the record it belongs to is not quicker? There would have to be internal structures doing the mapping, and they'd have to be optimized for efficiency. Seems to me that it would just be a bunch of hidden structures, as opposed to ones that you can create and work with yourself.

Depending on the design of the database engine, there is certain to be a librarian keeping track of the pages of information stored in the database file. However, for efficiency, I am convinced this index is always kept in memory. The whole point is to eliminate hard drive accesses. That's where 99% of the performance issue lies.

The engine must keep track of whether this link is valid or not, and use the lookup facility using the key when it is not, all the time fixing up those links that have been invalidated when there is idle time on the server.

Ack. That sounds terrible. What if the server never has idle time? What if you're not using a server-based data engine? And what happens to data requests between the start of the update and the commit of it? Your proposal is adding a whole lot of new locking issues, seems to me, and would be doing internally exactly what is happening already in most database engines.

A server that has no idle time isn't keeping up with user requests. The possibility that the server always has just enough power to perform exactly what is asked of it is too unlikely to consider. A server that falls behind just 10 minutes each day will soon be taking hours to answer each user request. This is not what is happening in the real world. A server always has some additional time available. Excess capacity is PLANNED. And, what I propose will free up considerable additional server processing time, making it possible to perform what I propose.
This would implement all the best features of both natural and surrogate keys, and indeed out-perform either.

I really cannot see how. Perhaps you don't have this level of hand waving in your planned long-form version of this, but I just don't see how you get from where we are to what you propose without using exactly the same kinds of structures we already use.

Once viewed in this way, the whole issue is not really a debate between database designers, but a failure of database engine designers. The debate we are having is artificially created by a failure to build the database engine optimally.

But the problem I see is that you'd be depending on the db engine to determine something that is application-specific. That is, uniqueness is a characteristic of the data used for a particular purpose in a particular application, and natural keys in particular are very dependent on that. I don't see how a database engine could determine this without having an understanding of the way the entity is being used in the particular application.

It is not at all the case that the implementation of relationships is application specific. This is a general approach to a general problem. Please do not think I have no understanding or sympathy for the point of view you have proposed. I quite agree with you on some of the major aspects. My opinions on this have been revolutionized in the past 2 years as I've come to realize the underlying factors causing this debate.

I believe that, using the existing tools, there may be very good cause to use surrogate keys for relationships. I'm not so sure this ever happens in a database less than 2 GB. In my upcoming paper on the subject, I have created databases and contrasted their performance using both paradigms, with both Jet and SQL Server. The results may surprise many.

Your proposed system would use surrogate GUIDs to maintain the links. I can't see how it could work otherwise.
And you'd still have to maintain uniqueness of the actual attributes with indexes you create yourself. So, I see absolutely no difference between the current surrogate key type of RI and what you're proposing.

No, my proposal uses existing internal pointers, as found in the lowest-level nodes of any index, to implement what I propose. The difference is that the natural values are found in the dependent rows, and the surrogate key is completely hidden from the database designer, and operates without the need for any index.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#25
Why do all my Relationships show as Indeterminate ?
"'69 Camaro" AM
wrote in : As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response. I don't know how to respond to that. Are you spoiling for a fight? Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls.

I don't recall any kind of "brawls" on this topic. Spirited disagreement, yes, but hey, it's Usenet! I read it as being addressed to alleged past behavior by me, in that the civil tone was unexpected. Perhaps he meant that in terms of this particular topic instead of addressing me, specifically, but somehow it didn't seem to me to be a general comment. I hope I am wrong on that, and then I'd have to admit to being overly sensitive, which would be pretty ironic!

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#26
Why do all my Relationships show as Indeterminate ?
"Tom Ellison" wrote in
: The natural key value would be in the foreign key table in any case. The comparison is between adding the autonumber column to both tables plus adding an index on the autonumber in the foreign key table. The break even point is at about a 20 character natural key length, which happens to be a very common length in my designs. I figure adding 4 bytes to each table, plus at least 8 bytes of indexing, plus some overhead as indexes are never filled to 100%. I don't understand how a 20-character text key can be as efficient in terms of storage and index maintenance as a long integer, which my Access help file tells me is 4 bytes. For the data storage alone, you're talking 5 times as much space per record. Secondly, my understanding (which could be wrong) is that Jet (and other db engines) have optimized retrieval of numeric values as compared to text values, even when indexed. So there'd be a performance hit because you're no longer benefiting from that optimization. Then there's the index update hit because you've got to maintain more data in the index tree. "David W. Fenton" wrote in message . 1... "Tom Ellison" wrote in : [] A natural key that is text may be longer in the dependent table than putting a surrogate key in the dependent table. On the other hand, you must add the identity surrogate column in the foreign key table, and you must index this additional column. You lost me. Assuming we're talking about a single-column PK, the difference in storage space between a long integer and a text field is going to be proportional to the difference between the storage space required for the one vs. the other. Since both require storage in two tables in order to map the relationship, and because both have to be indexed, it seems to me that all you need to do is figure out the difference between the size of the two data types as stored. Only very small text fields are going to take up less space than a long integer (4 bytes? or is it 8?). Not actually proportional. 
There are 3 entities involved. 1a. The foreign key table, which must grow by 4 bytes for each row it contains for the autonumber column 2a. The dependent table, which will shrink by the difference between the natural key and the autonumber key 3a. The index on the foreign key table for the autonumber column, which is entirely avoided by the natural key method

Your terminology confused me for a while, and that caused me to not understand your point. By "foreign key table" you mean the parent table in the relationship. Yes, you are adding 4 bytes per record *if* there is an actual candidate natural key. So, yes, you are right that it is not directly proportional. But only in cases where there really is a proper candidate natural key, which I find very rare except for simple lookup tables.

There are also differences in performance. 1b. A lookup will be modestly faster for large tables using the autonumber. The primary factor here is the number of levels in the b-tree index. A 20-byte natural text key makes for a 24-byte entry in each node of the index, meaning there will be a maximum of about 160 rows per node, while the 4-byte autonumber has an 8-byte entry and about 500 entries per node. For, say, 1 million rows, the natural key index will require 3 levels of index, meaning 3 disk accesses for every joined row of data (160 cubed is about 4 million). For the autonumber, it will still require 3 levels of b-tree, and will perform at the same speed. However, 500 cubed is 125 million, so between 4 million rows and 125 million rows, the autonumber index is faster by a 3:4 ratio. This is a very modest gain in performance during lookups.

That is, if the only speed benefit is from the nodes that need to be traversed. My understanding is that there are other optimizations for numeric values in most db engines that speed the processing of joins beyond just the b-tree traversal. But I'm just going on third-hand information there, so I could be wrong on that.
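The node arithmetic being debated here can be checked directly. The sketch assumes a 4 KB index page and a 4-byte row pointer per entry, round numbers consistent with Tom's figures rather than any engine's actual internals:

```python
import math

def btree_fanout_and_levels(rows, key_bytes, ptr_bytes=4, node_bytes=4096):
    """Max entries per index node, and b-tree levels needed for `rows` rows."""
    entry = key_bytes + ptr_bytes     # key value + row pointer per node entry
    fanout = node_bytes // entry
    levels = math.ceil(math.log(rows) / math.log(fanout))
    return fanout, levels

for key_bytes in (20, 4):
    print(key_bytes, btree_fanout_and_levels(1_000_000, key_bytes))
# 20-byte key: fanout ~170 (Tom rounds to 160); 4-byte key: ~512 (Tom's 500).
# At 1M rows both need 3 levels; the 4-byte key's deeper reach
# (512**3 = ~134M vs 170**3 = ~4.9M rows) only pays off on far larger tables.
```

This reproduces Tom's conclusion: the depth, and hence the disk accesses per joined row, differs only once the table outgrows the natural key's fan-out.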
Certainly the need to handle double-byte data surely must introduce some kind of additional overhead in the text-based indexes.

2b. When adding rows, the natural key method will be faster. The foreign key row is smaller, and has one fewer index to be maintained. There is no index on the autonumber column to be maintained, either.

I sure wish you'd use "parent row" instead of "foreign key row." I don't think this is a huge issue, as you're on the 1 side of the join. The many side is where most of the records will be added. I can see where that would be less of an issue if most of your N's are 1 or 2, but once you have an average of 2 or more, you've really multiplied the amount of maintenance hit well beyond the small difference in updating the surrogate key index.

3b. When executing a query, it is often common to require only the natural key value. See the Categories table in Northwind, where the natural key value is the ONLY column in the foreign key table. If you put this value in the dependent table, then you avoid the need to JOIN to this table at all.

But that's *precisely* the kind of table I'm *agreeing* works very well with a natural key, because it's a one-column lookup table. However, I don't always use a natural key, because the values sometimes have to be updated (in some apps more often than others).

I propose to gain the advantages of using both methods, and to gain additional advantages as well. When my work is finished, I will be able to show these things quite well.

Well, I'm beginning to see where you're going, but I just don't see the advantages except in the type of case I was already agreeing were just fine for natural keys. Once you get to a 2-column or more PK, then I think your whole theory breaks down.
It certainly doesn't change the data storage issues, but it does magnify the storage and index updating issues in the child table (i.e., the one with the N records), and the join issue only works when the data you need to filter on is only one join away (i.e., it only helps with direct relationships). Once you need data two joins away, you probably have greatly *increased* the problems with join performance.

As I have been saying, both approaches have some potential advantages, especially in very large databases, probably those in excess of the 2 GB limit on Jet. But the whole debate is somewhat frustrating since the best solution would be a database engine that automatically provides all the best advantages of both methods, and can indeed exceed either approach taken separately. There exists a value like the autonumber, but which is the value recorded in the lowest level of the b-tree, which points to the exact location of each row in the foreign key table. This value could be recorded in the dependent table in place of the autonumber value in 2a above, removing the need for the index in 3a above. It would not be stored in the foreign key table, 1a above, but would be external to that table.

But my understanding is that the value that is recorded there points to a data page with an offset for the start of the record. Maybe the offset is stored in the data page, instead. But when you compact, the indexes have to be updated, and if you stored that value in each child record instead, you'd have to update it in many more places than you do with the current situation. I'm speaking of Jet here, but surely every database engine has some similar kind of methods (MySQL would have a file name and a row number). In short, it's exactly the same problem I have with depending on CASCADE UPDATES for natural keys -- when the parent value changes you have to do a bunch of updates to a lot of records and the indexes for those records.

There are two challenges to this.
One is to maintain integrity in the database when a foreign key table entry has its natural key updated. This is currently handled by cascading updates. I propose this be deferred, and instead keep a bitmap of all the rows in the foreign key table, showing whether they have been updated but not yet cascaded. When this happens, the database engine would force a lookup to get the current value of the natural key, and not use the natural key value in the dependent table.

That requires branching logic to decide which to do, and the time it takes to run that test could add significantly to data retrieval time. This would greatly increase your need to perform data maintenance to get the data defragmented.

The other situation occurs especially with a table that is "clustered" (SQL Server terminology) or compacted (Jet). This relocates the rows of tables to the order of the primary key. In the case of surrogate autonumber keys, this places the rows in an order that is not particularly useful.

On the contrary, it can be *very* useful in reducing concurrency conflicts. A random Autonumber means that the data ends up randomly distributed, so that updates are not as likely to collide on the same data page. Certainly something similar is going to be the case in all data stores, at some level or the other.

Indeed, it is more efficient to use the natural key as the primary index, and create a separate unique index for the surrogate keys. If, for example, you are reporting the contents of the table, having the rows in the same natural key order that the report requires will almost certainly be more efficient than having the rows in the autonumber order, which would generally be the order in which they were entered.

This would be what I would call a "premature optimization," in that you're trying to get application-level performance enhancements out of operations at the lowest level of the database.
I see this as a mistaken approach, in that you're building a bias into the data store that is not always going to be useful. And it violates the principles behind Codd's rules and SQL in that you're worrying about the data store.

No, a data engine that makes it easy to load into memory and pre-optimize for a particular purpose might very well show a real performance boost without downgrading it elsewhere.

It certainly is true that most database engines are optimized for reading/writing to disk, when these days there's enough RAM to run production apps directly from the image in memory. But there would be reliability issues with that and I'm not sure we're prepared for that yet. But I do know that a lot of large databases are, in fact, completely loaded into RAM to improve performance. I don't know how many databases out there have added features to exploit running from RAM.

In this case, the natural key values in the dependent table will be correct, but the surrogate "row pointer" values will be invalidated. In both cases, I propose that the database engine should spend its idle time cleaning up these situations, returning the database to a state where these operations can proceed at the optimal speed.

That's exactly where the RAM vs. disk storage advantage could be exploited. An engine designed around the capabilities of RAM would fix this. But, in order to do it, you would have re-introduced a hash table that translates the original locations into the current locations of the data, and you're basically back to the current index structure, but with fragmentation inherent in your translation structure.

Well, now I've spilled the beans. Does anyone follow this mess?

Yes, I'm following you, but I'm not sure the performance improvements would follow from what you're suggesting.
But, I'm used to thinking in terms of storage on disk, read serially, rather than storage in RAM, read randomly (all locations are, theoretically, just as close as all others, though they aren't, really, but much more so than on a hard drive). Perhaps you should add the whole RAM vs. disk issue into your discussion of this, as I'm pretty sure it's central to the current design of database engines.

So it seems quite obvious to me that natural keys are going to always be longer, unless they are numeric data (in which case they aren't really natural, as somebody has to maintain uniqueness of the numeric values). And the overhead of updating large indexes increases as the number of records increases, regardless of which is the larger table in the relationship (parent or child).

There are trade-offs in the size of the database, and it is not always the case that there is any noticeable difference in the size of the database.

I would be much more concerned about the performance issues with the larger index data pages that will be required for a text-based natural key.

Because of the way b-tree indexes are created, there is little, and often no difference in performance, as I covered in my discussion of the number of disk accesses needed to perform this operation.

But you've tended to restrict your discussion of that issue to the parent table, and rather ignored the way the problem can multiply in the child table. In an app where every parent record has 100 children, the issue can become pretty great, both in storage and performance, seems to me.

Also, I do not consider the size of the database to be a critical factor. Hard disk space is not all that expensive. And size does not, in this case, equate to performance.

In the case of indexes it certainly does, both in editing/updating and in retrieval.

The time taken to access an index is not at all proportional to its size. It is proportional to the number of levels required, which grows in an inverse geometric (i.e., logarithmic) fashion.
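The point that index access time tracks the number of b-tree levels, not raw index size, can be sketched with a rough calculation. The 4 KB page size and 4-byte pointer below are illustrative assumptions (real engines add per-page and per-entry overhead), but they show why a 20-character key often needs no more levels than a 4-byte one:

```python
import math

def btree_height(num_rows, page_size=4096, key_bytes=4, ptr_bytes=4):
    """Rough height of a b-tree: the number of entries that fit on a
    page sets the fan-out, and height grows as the log (base fan-out)
    of the row count."""
    fanout = page_size // (key_bytes + ptr_bytes)
    return max(1, math.ceil(math.log(num_rows, fanout)))

# long-integer surrogate key vs. a 20-character natural key
for key in (4, 20):
    for n in (10_000, 1_000_000):
        print(f"key={key} bytes, rows={n}: {btree_height(n, key_bytes=key)} levels")
```

With these numbers both key sizes need 2 levels at 10,000 rows and 3 levels at a million rows, so the number of disk accesses per lookup is identical, even though the text index occupies several times more pages.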
Again, I refer to my "impression" that there were index optimizations that favored numeric values over text.

But index updating performance *does* degrade with the size of the field being updated.

I would say that "long integer columns can be stored more efficiently" is fairly insignificant if what you're saying is it saves disk space. If you're talking performance, then again there are significant trade-offs.

Ah! So you agree with what I've written above. I was only going with "space" as the starting point because that's the ground you chose to stake out in the discussion. I'm much more concerned with performance, too, and that's a major reason why I would consider a natural key only in a very limited number of instances.

But this does not have the effect I believe you suppose it has. It has some effect, but the geometric nature of the growth surprisingly limits this effect.

Again, I think you're ignoring the asymmetric nature of the performance issues. You concentrated mostly on the parent table. I've concentrated on the results of duplicating the parent record's data in multiple records in the child table. When it's 1:1, then there's little issue. When it's 1:10 or 1:100, it becomes a whole different ballgame.

I quite disagree that repeating the data in the dependent table, even with a multi-column key, is a normalization issue. I do not agree that this violates any of the rules of normalization. If you point out that it repeats those columns that were in the foreign table, then I respond that having an identity column in the foreign table and repeating that in the dependent table must be a violation of the same rule. It seems obvious then that there is no such rule.

I don't know what theory says. I don't really care. What I do know is that the more fields you have, the more likely they are to need CASCADE UPDATES.
That means a *huge* performance issue, because an update to any of the columns of the PK of the parent table will need to update the corresponding column in all the records of the child table. But with a surrogate key, which is never updated, you never ever have that issue.

But you lose the advantage of having the natural key value stored in the dependent table.

Joins are so easy that I just don't see that as much of an advantage. I'm not an end user, after all -- I understand how to write SQL! And most db engines are optimized for joins, precisely because of the issues involved. And, of course, if you *do* need to join on your natural-key indexes, you've probably reduced the efficiency of the join because it's a text-based index.

[...] It is not "vastly more efficient". It is not uncommon to write a query that requires one or more columns to be displayed that are part of the natural key from the foreign key table. If such columns are repeated in the dependent table, then putting them on a report will not require a JOIN to the foreign table. Since they already exist in the dependent table, you can retrieve them directly. In this not uncommon case, there need be no lookup to retrieve those values.

But SQL-based database engines are optimized to make such joins very efficient. Secondly, in your version, the FK columns won't be updatable, but in many views, the data the FK points to in the parent table *will* be updatable.

Any method that increases the number of hard disk accesses is the biggest enemy of performance. Avoiding a JOIN remains one of the biggest ways to keep query performance high.

Depends on what your app is doing most of -- retrieving data or editing it. And throw in the database-in-RAM issue and see if the results don't change.

I expect you are referring to avoiding cascade updates when you speak of using natural PKs only "when the values will *never* change".
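The asymmetry of the cascade-update cost can be made concrete with a toy counter (the function and its numbers are a hypothetical model, not a measurement of any engine): with a natural foreign key, every child row stores the parent's key value and must be rewritten when it changes; with a surrogate key, the children are untouched.

```python
def rows_touched_by_pk_change(num_children, natural_key=True):
    """Toy count of row writes when a parent's natural key value
    changes: a natural FK cascades the new value into every child
    row, while a surrogate FK never changes, so only the parent row
    itself is written."""
    parent_writes = 1
    child_writes = num_children if natural_key else 0
    return parent_writes + child_writes

# A 1:100 relationship: 101 row writes vs. 1
print(rows_touched_by_pk_change(100, natural_key=True))
print(rows_touched_by_pk_change(100, natural_key=False))
```

At 1:1 the difference is 2 writes vs. 1; at 1:100 it is 101 vs. 1, plus the corresponding index maintenance on every rewritten child row, which is the "whole different ballgame" described above.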
The rules of good database design actually state that the natural key should rarely change, not never, but the point is pretty much the same.

Well, now you get into the other major flaw of natural keys -- that for many entities, there are no reasonable candidate keys. What would you use as a natural key for a person? There is simply no viable natural key to identify a person in any application that is using real-world data and is not handing off the problem to some other PK-assigning authority (like the Social Security Administration, which is itself problematic, since, over time, that PK gets re-used).

The requirement that there BE a unique natural key cannot be camouflaged by avoiding HAVING a unique natural index. To fail to do so prevents the database designer from ensuring that there be a one-to-one correspondence between the entities in the real world and the rows in his table. If you depend only on a surrogate key, then you have done NOTHING to prevent the duplication of those natural entities within your database. I do not agree that this argument enters into consideration here. The need to have a unique natural key is absolute, and no autonumber/identity key helps remove that problem in any fashion.

I am not persuaded of the theoretical importance of your PK ensuring uniqueness. There's a difference between uniqueness of the record and uniqueness of the entity being represented by that record. Given the imperfections of the latter, I don't see that it makes sense to work awfully hard attempting to make the two correspond. Data is imperfect precisely because we never have complete information.

If you force the PK function onto the natural key fields (and the PK function is a meta-function of the database engine, not of your entities themselves), then you put requirements on the data you can store in those fields, and thus have to make up some values to make sure you have no Nulls.
Then you have to suppress those fake values in certain situations, all because of a choice you've made to overload your data fields with both user-level functions (fully representing the data known about the entity) and meta-functions at the database level (relating the record to records in other tables).

I suggest that human ingenuity will commonly find a way to uniquely describe natural entities in ordinary speech, and that we can tap into that to design unique natural keys. The proposal that there is no way to specify a unique person when we talk about John Smith in conversation means we cannot speak of him or conceive of him uniquely. The fact that we consistently find a way that does permit us to do so uniquely in natural speech tells me we can have the computer do the same. Otherwise, there fails to be a way to communicate at all.

If natural speech were so easy to represent in digital form, then I think we'd have much more accurate voice recognition systems than we do.

In some of my designs, such cascade events are recorded. A system can go for a few months between such occurrences. So, I largely agree with you, if you will permit.

There are lots of entities where the only candidate natural keys will need to be updated often. And, of course, there can't be any Null fields in the compound PK, which means storing default values that will then have to be eliminated in reporting, etc.

I disagree that there is such a lack of candidate natural keys. This is, as I have just described, not something that happens in normal speech and writing.

But that is precisely because we don't put our identifiers into neat little columns. We have a Gestalt representation in our heads that we use to identify the entity, one that often includes information that is useless in a business application.
Do you really want to have your contact management app asking people for their hair and eye color, just so we can use that information to distinguish the John Smith with red hair and green eyes from the one with black hair and brown eyes? In a real-world application, when you ask for too much data, you end up with fake data, or no data. So there's always a tension between getting the most possible complete data and the practical realities of incomplete information and what actual users will have the patience to put in. If the attribute doesn't have any use in your app other than to help establish uniqueness, then your users aren't going to want to be bothered, and you'll end up with a column of UNKNOWN.

We DO find a way to uniquely identify a person or entity when such uniqueness is necessary. If we can convey that uniqueness to one another, then we can convey it to the computer as well.

But our own mental data storage systems are several orders of magnitude more complex than anything even conceived of for the computer.

[...]

I propose that, for a 1:M relationship (the most common) the database engine maintain the link between the foreign and dependent tables as an internal pointer directly to the foreign table row. Unlike the autonumber, this requires no index, so is several times faster.

Do you really believe, for instance, that there is no index to the data pages where memo fields are stored, so that retrieval of the memo data with the record it belongs to is not quicker? There would have to be internal structures doing the mapping, and they'd have to be optimized for efficiency. Seems to me that it would just be a bunch of hidden structures, as opposed to ones that you can create and work with yourself.

Depending on the design of the database engine, there is certain to be a librarian keeping track of the pages of information stored in the database file. However, for efficiency, I am convinced this index is always kept in memory.
This whole point is to eliminate hard drive accesses. That's where 99% of the performance issue lies.

Bingo! But you'll still end up maintaining a table mapping the RAM image to the current disk image. That will be much faster than doing it on disk, but will still require time/CPUs, so efficiency will still be important. It may be that current RAM prices make it possible to do this at a level that makes the difference no longer relevant (just as in the 90s graphics speed became fast enough to make a GUI completely viable without being too sluggish in comparison to character-based UIs). But the real question would be whether or not re-engineering a database engine is going to give enough performance/ease-of-use benefit to be worth the development investment. I'm not convinced there's going to be enough of a difference to justify it, nor that it's a good idea to prematurely optimize the data storage structures for any particular application (see above).

The engine must keep track of whether this link is valid or not, and use the lookup facility using the key when it is not, all the time fixing up those links that have been invalidated when there is idle time on the server.

Ack. That sounds terrible. What if the server never has idle time? What if you're not using a server-based data engine? And what happens to data requests between the start of the update and the commit of it? Your proposal is adding a whole lot of new locking issues, seems to me, and would be doing internally exactly what is happening already in most database engines.

A server that has no idle time isn't keeping up with user requests. The possibility that the server always has just enough power to perform exactly what is asked of it is too unlikely to consider. A server that falls behind just 10 minutes each day will soon be taking hours to answer each user request. This is not what is happening in the real world. A server always has some additional time available. Excess capacity is PLANNED.
And, what I propose will free up considerable additional server processing time, making it possible to perform what I propose.

I'm always scared of delayed writes, which is basically what you're proposing, even in a transactional system.

This would implement all the best features of both natural and surrogate keys, and indeed out-perform either.

I really cannot see how. Perhaps you don't have this level of hand waving in your planned long-form version of this, but I just don't see how you get from where we are to what you propose without using exactly the same kinds of structures we already use. Let me point out that you've proposed exactly that in your discussion of moving the data page pointers from the indexes to the child records.

Once viewed in this way, the whole issue is not really a debate between database designers, but a failure of database engine designers. The debate we are having is artificially created by a failure to build the database engine optimally.

But the problem I see is that you'd be depending on the db engine to determine something that is application-specific. That is, uniqueness is a characteristic of the data used for a particular purpose in a particular application, and natural keys in particular are very dependent on that. I don't see how a database engine could determine this without having an understanding of the way the entity is being used in the particular application.

It is not at all the case that the implementation of relationships is application specific. This is a general approach to a general problem.

I think in retrospect what you're actually proposing amounts to using hidden surrogate keys for the meta function of relating records in different tables.

Please do not think I have no understanding or sympathy for the point of view you have proposed. I quite agree with you on some of the major aspects.
My opinions on this have been revolutionized in the past 2 years as I've come to realize the underlying factors causing this debate.

I believe that, using the existing tools, there may be very good cause to use surrogate keys for relationships.

I'm not so sure this ever happens in a database less than 2 GB.

In my upcoming paper on the subject, I have created databases and contrasted their performances using both paradigms, with both Jet and SQL Server. The results may surprise many.

Your proposed system would use surrogate GUIDs to maintain the links. I can't see how it could work otherwise. And you'd still have to maintain uniqueness of the actual attributes with indexes you create yourself. So, I see absolutely no difference between the current surrogate key type of RI and what you're proposing.

No, my proposal uses existing internal pointers, as found in the lowest level nodes of any index, to implement what I propose. The difference is that the natural values are found in the dependent rows, and the surrogate key is completely hidden from the database designer, and operates without the need for any index.

Perhaps not an index created by the DBA, but there's an index in there somewhere, behind the scenes.

--
David W. Fenton
http://www.dfenton.com/
usenet at dfenton dot com
http://www.dfenton.com/DFA/
#27
Why do all my Relationships show as Indeterminate ?
Hi, David.
I don't recall any kind of "brawls" on this topic. Spirited disagreement, yes, but hey, it's Usenet!

Then you've been very fortunate not to have attended some of the same meetings I have! ;-)

Perhaps he meant that in terms of this particular topic instead of addressing me, specifically, but somehow it didn't seem to me to be a general comment.

I doubt he was aiming his comment specifically at you. It's a religious war and wherever you go, you'll find warriors on both sides. Some are ready to duke it out at the first hint of "your side is wrong." In those cases, it's usually best to keep a safe distance, or else wear armor or asbestos underwear. Mine always seem to be at the dry cleaners when I need them.

I hope I am wrong on that, and then I'd have to admit to being overly sensitive, which would be pretty ironic!

LOL!

Gunny

See http://www.QBuilt.com for all your database needs.
See http://www.Access.QBuilt.com for Microsoft Access tips and tutorials.
Blogs: www.DataDevilDog.BlogSpot.com, www.DatabaseTips.BlogSpot.com
http://www.Access.QBuilt.com/html/ex...ributors2.html for contact info.
#28
Why do all my Relationships show as Indeterminate ?
Dear David:
If you use the surrogate identity key, you must add that column to both the foreign key table and the dependent table, while you can remove the natural key value from the dependent table. In addition you will be adding an index to the foreign key table.

In this index, each row of the foreign key table will require at least 8 bytes, 4 for the key value, and 4 for the pointer to the data it represents (in some databases, it is more than this). That's a minimum of 16 bytes to replace what took 20 bytes before (if we assume one dependent row for each foreign key row).

Now it is not always the case that there is only one row in the dependent table for each row in the foreign key table. There can be 100 rows in the dependent table. In this case, you have saved some space. But there can be 0 rows in the dependent table for some of the rows in the foreign key table. In this case, you have added at least 12 bytes and gained none.

When you add in a fill factor for the index of perhaps 50%, you will have used 24 bytes instead of 20. In terms of saving space, this will happen if the average foreign key row is used more than 3 times. Below that, the effect is probably very small. The entire effect is often minimal, but depends on the actual data used.

In any case, this is hardly the main point of discussion (in my opinion). The main point should be performance, not storage, and it would be if we did not have an artificial 2 GB limit built in.

Tom Ellison

"David W. Fenton" wrote in message . 1...

"Tom Ellison" wrote in :

The natural key value would be in the foreign key table in any case. The comparison is between adding the autonumber column to both tables plus adding an index on the autonumber in the foreign key table. The break even point is at about a 20 character natural key length, which happens to be a very common length in my designs. I figure adding 4 bytes to each table, plus at least 8 bytes of indexing, plus some overhead as indexes are never filled to 100%.
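The byte arithmetic above can be made explicit with a small calculator. The figures (4-byte id, 8-byte index entry, 50% fill factor, 20-byte natural key) are the ones from the post; the exact break-even fan-out depends on per-row overheads the post does not spell out, so this is a sketch of the comparison, not a definitive accounting.

```python
def surrogate_bytes(children, id_bytes=4, index_entry=8, fill=0.5):
    """Per parent row: the id stored in the parent, one index entry
    (inflated by the fill factor), and the id repeated in each
    child (dependent) row."""
    return id_bytes + index_entry / fill + id_bytes * children

def natural_bytes(children, key_bytes=20):
    """Per parent row: the natural key repeated in each child row.
    (The parent stores the natural key either way, so it is left
    out of both totals.)"""
    return key_bytes * children

# With one child per parent: 24 bytes vs. 20, matching the post
for k in (0, 1, 3, 100):
    print(k, surrogate_bytes(k), natural_bytes(k))
```

At 0 children the surrogate scheme is pure overhead; at 1 child it costs 24 bytes against 20; at high fan-outs (4 bytes repeated per child instead of 20) it wins decisively, which is the asymmetry both sides of the debate keep returning to.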
I don't understand how a 20-character text key can be as efficient in terms of storage and index maintenance as a long integer, which my Access help file tells me is 4 bytes. For the data storage alone, you're talking 5 times as much space per record.

Secondly, my understanding (which could be wrong) is that Jet (and other db engines) have optimized retrieval of numeric values as compared to text values, even when indexed. So there'd be a performance hit because you're no longer benefiting from that optimization. Then there's the index update hit because you've got to maintain more data in the index tree.
#29
Why do all my Relationships show as Indeterminate ?
"Tom Ellison" wrote in
: The entire effect is often minimal, but depends on the actual data used. In any case, this is hardly the main point of discussion (in my opinion). The main point should be performance, not storage, and it would be if we did not have an artificial 2 GB limit built in.

I think the main point is performance, specifically in regards to the maintenance of indexes, and especially on the child tables. I don't care about the extra 4 bytes in the parent table.

We can reduce the discussion to a scenario that favors your techniques, but, again, I would say that will only apply to a very small number of situations, and is thus not terribly useful. It is not that natural keys can *not* be more efficient or perform better. It is that in most cases they do not, and they introduce significant problems that lead either to compromises in your data or to very complex logic in your application.

--
David W. Fenton
http://www.dfenton.com/
usenet at dfenton dot com
http://www.dfenton.com/DFA/
#30
Why do all my Relationships show as Indeterminate ?
"David W. Fenton" wrote:
As long as we can discuss things in a way that is not acrimonious, which seems quite possible given your response.

I don't know how to respond to that. Are you spoiling for a fight?

Actually, if you re-read this ambiguous sentence -- like I had to, because I got the same impression the first time! -- Tom is expressing his expectation of avoiding a fight, given your civility in your previous posts, because all too often this topic sparks brawls.

I don't recall any kind of "brawls" on this topic. Spirited disagreement, yes, but hey, it's Usenet! I read it as being addressed to alleged past behavior by me, in that the civil tone was unexpected. Perhaps he meant that in terms of this particular topic instead of addressing me, specifically, but somehow it didn't seem to me to be a general comment. I hope I am wrong on that, and then I'd have to admit to being overly sensitive, which would be pretty ironic!

I'm with Gunny on this topic. Also I've met Tom Ellison in person at a previous MVP summit. Now he's passionate about natural keys. But I very much doubt he's looking for a fight.

I'm sure Tom didn't mean you specifically. Just folks in general. He might not have hung out much if at all in c.d.m-a and didn't realize you are a distinguished Access denizen over there. Now it was an ambiguous sentence but yes you are being a tad sensitive. (smile)

Tony

--
Tony Toews, Microsoft Access MVP
Please respond only in the newsgroups so that others can read the entire thread of messages.
Microsoft Access Links, Hints, Tips & Accounting Systems at http://www.granite.ab.ca/accsmstr.htm