If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below. |
|
|
Thread Tools | Display Modes |
#91
|
|||
|
|||
Separate PK in Jxn Tbl?
On Jan 22, 1:26*pm, "Neil" wrote:
Whenever I've created junction tables in the past, I always made the PK of the junction table the combined pks from the two other tables. Ex: Table1 PK=A; Table2 PK=B; Junction table consists of two fields, A and B, which together comprise the PK for the junction table. However, I just came across some code in which the person created a junction table with a separate PK consisting of an autonumber field, and then the two fields. So I was wondering how others did junction tables -- with a standalone autonumber PK, or with a PK consisting of the PKs of the tables being joined? And, if a standalone PK, then why? Thanks! Neil I don't claim to know more about db theory than the cdt regulars, or to have more experience than the many practioners who have contributed to this thread. I do know that the debate over relationship representations has been going on for years, neither side giving any ground. Some aspects of it are never addressed. For example, making the two foreign keys in a junction table a composite PRIMARY key prohibits NULL values from either key, but declaring a separate surrogate primary key together with a UNIQUE constraint on the two-foreign-key-composite-key does allow NULL values for either (or both!). This can be useful for representing unrelated entity tuples, either childless parents or orphans. Practioners in the audience may be interested in this page: http://www.sfdbs.com/toplevel/fasttrack/fasttrack.shtml of my website in which I describe an entirely new way to represent relationships. Or this page: http://www.sfdbs.com/solopages/relcardtypes.shtml where the representational capabilities of this new representation (called Aggregate-Link) are compared to the two more traditional representations. Please note, I am not suggesting the use of this representation for conventional database design. Merely consider it new, and possibly interesting. (It may not be possible to deploy Aggregate-Link in MS Access.) The theorists have already had a field day dumping on this new representation in this thread: http://groups.google.com/group/comp....514365a600841# But most theorists seem to be immune to the practical considerations of IT. In particular, they don't begin to understand that IT employers' appreciation for Access is based on cost, not beauty. Rob |
#93
|
|||
|
|||
Separate PK in Jxn Tbl?
JOG wrote in
m: I certainly don't think developers should excuse sloppy RDBMS design just because they are using access (and of course I'm sure many of the professionals here wouldn't dream of doing so, despite others laxness). What *are* you talking about? Any mistakes in schema design that you can make in Access, you can make in any other RDBMS. I would agree that there are many places that the wizards in Access and the sample databases encourage sub-optimal practices. But most professional developers aren't using either wizards or the sample databases as their models for developing their own applications. If your impression of Access comes from futzing with it for 10 minutes and from encountering kludged-together apps created by your company receptionist, then you just haven't a clue what Access offers, either as an application development platform or as a data store (using its native Jet engine). This is a schema question, and that is orthogonal to Access, because Access is an application development platform. If you use Jet for your data store, then Jet is relevant to the discussion. But you can use any data store that offers an ISAM or ADO or ODBC drivers, and that means all your schema issues are completely divorced from Access itself. That you can't seem to keep this distinction clear in your mentions of Access demonstrates pretty clearly that you are completely clueless about Access and really aren't in any position to be making disparaging comments about it. -- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/ |
#94
|
|||
|
|||
Separate PK in Jxn Tbl?
If I were a library lending 2 undiffentiatable copies of 1 book to the same person, I suppose I would need an Autonumber PK plus A BookID and BorrowerID.
No, you would fire your librarian for failure to keep an accession number or copy number as part of the book's identifier. And the book should be identified by an ISBN. |
#95
|
|||
|
|||
Separate PK in Jxn Tbl?
JOG wrote in
m: No probs, although off the top of my head its gonna be a bit contrived. With an artificial key: Marriages {id, husband, wife, date} Kids_from_Marriage {from_id, name, birth} A query that asks "fetch me all the children whose mother is x" obviously requires an equijoin, matching Marriages.id and Kids.from_id. However with the original natural keys: Marriages {id, husband, wife, date} Kids_from_Marriage {mother, father, name, birth} The same query is a simple select. That certainly seems a lot less complicated to me Assuming you've got some form of CASCADE UPDATE on your enforced relationship, that will work. But it's repeating a huge amount of data, and adding a bunch more indexes to keep updated. Perhaps these are insignificant issues to *you* and *your* apps, but my clients' apps (some using Jet, some using SQL Server, some using MySQL) don't perform so well when you add in all the overhead. And it all leaves aside the question of how you know that husband/wife/date is always going to be unique. I think that on any given day in the US, there are plenty of marriages in which those three values will be identical. You could add place. But then, in large cities, that might not be enough. So use Postal Code in place of place, and that might do the trick, although in large cities that might not do it, either. Given that I can foresee a reasonable possibility of a collision on this candidate key as currently defined, I'd think long and hard on whether to use it or not. And it's one of the main problems any time you're storing data about people in a data table -- you often lack pieces of the information, and you run a high risk of collisions between people with the same names. And that fact of the real-world entities being modelled makes finding a natural key that will work as a PK a very hard task. -- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/ |
#96
|
|||
|
|||
Separate PK in Jxn Tbl?
Rob wrote in
m: I do know that the debate over relationship representations has been going on for years, neither side giving any ground. Some aspects of it are never addressed. For example, making the two foreign keys in a junction table a composite PRIMARY key prohibits NULL values from either key, but declaring a separate surrogate primary key together with a UNIQUE constraint on the two-foreign-key-composite-key does allow NULL values for either (or both!). This can be useful for representing unrelated entity tuples, either childless parents or orphans. Er, why go to all that trouble? You can represent the same childless parents by simply omitting a record in the junction table. -- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/ |
#97
|
|||
|
|||
Separate PK in Jxn Tbl?
Whenever I've created junction tables [sic] in the past, I always made the PK of the junction table the combined pks from the two other tables..
I wish I knew where people got the term "junction table" when they mean a table that models a relationship . I guess if you are still thinking in terms PK-FK being pointer chains and structure and not relational, then you would borrow old terminology from Network Databases. And bring that mindset with you, too. However, I just came across some code in which the person created a junction table with a separate PK consisting of an autonumber field [sic], and then the two fields [sic]. See what I mean about the poisoned mindset? Rows are not records; fields are not columns; tables are not files; references are constraints and not pointers. And you have been doing it right, in spite of having a bad vocabulary !! The "id"-iot with the magical auto-numbering is still using a sequential file model in his head. How can an exposed physical locator, based on the state of one particular machine at insertion time, with one copy of one particular release of one particular SQL product be a relational key? Since a key is subset of the attributes of an entity, that magical number must exist before each row in inserted into the table and not be subject to insertion time; this is by definition. Well, it doesn't, so it is crap. I would try to clean up this schema since it probably has redundant duplicates and orphaned rows. |
#98
|
|||
|
|||
Separate PK in Jxn Tbl?
Does a bottle of Gin have a serial no? I am too poor to buy 2 at a time to see if there is a different number Ah!
In the US, you get a UPC/EAN code and then a batch number with booze. The closest thing to a serial number is your cash register receipt. Now if someone will explain to me why we have expiration dates on salt and bottled water to me because I don't get it. |
#99
|
|||
|
|||
Separate PK in Jxn Tbl?
It has never caused me the least problem. There has been no revelation in this thread that would cause me to even revisit the decision.
How would you know, since autonumbering have no validation or verification? Trust me, I earn a lot of my living cleaning up these "fake pointer chain" SQL schemas. ATTRIBUTES AND FIRST NORMAL FORM (1NF): An attribute has to belong to an entity in the real world being modeled by the RDBMS. In First Normal Form, the attributes is modeled as a column in a table. It must also be an atomic value. I am not going to get into the differences between atomic and scalar, but scalar implied atomic and not the other way around. IDENTITY does not exist in an entity in the real world being modeled by the RDBMS. Thus, it is not an attribute and cannot be in a table, by definition. IDENTITY is a result of the physical state of particular piece of hardware at a particular time as read by the current release of a particular database product. It is not a data type. You cannot have more than one column of this "type" in a table. It is not NULL-able, which all data types have to be in RDBMS (Dr. Codd again). It is not a numeric; you cannot do math with it. It is what is called a "tag number" -- basically, a nominal scale written with numbers instead of letters. Only equality tests make sense (see the part about inserting sets of rows, which means that IDENTITY has no natural ordering, not even temporal). DEFINITION OF KEYS: (Dr. Codd, RM II, page 23): It is equally valid to interpret the uniqueness property in terms of object identification: the value of the primary key in each row of the pertinent R-table identifies the particular object represented by that row uniquely with the type of object s that are presented by that relation, Create a table with an IDENTITY as the key and perform a series of insertions of the same object, say an automobile: INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size) VALUES (..); INSERT INTO Vehicles (vin, vehicle_mileage, iso_tire_size) VALUES (..); I now have two cars with the same VIN number. Actually, I have two copies of the same car (object) with an IDENTITY pseudo-key instead of the industry standard VIN as the proper relational key. This is a called an insertion anomaly. Assume that this pair of insertions led to creating vehicles 41 and 42 in the table, which are the same object. I can update 42's mileage without touching 41. I now have two versions of the truth in my table. This is a called an update anomaly. Likewise, if I wreak vehicle 41, I still have copy 42 in the motor pool in spite of the fact that the object no longer exists. This is deletion anomaly. Before you say that you can make a key from (IDENTITY, vin), read more Dr. Codd (ibid): If the primary key is composite and if one of the columns is dropped from the primary key, the first property [uniqueness] is no longer guaranteed. Assume that I have correct VINs and use (IDENTITY, vin) as a key. Dropping the pair clearly does not work -- a lot of vehicles could have the same mileage and tire sizes, so I do not have unique rows guaranteed. Dropping IDENTITY will leave me with a proper key that can be validated, verified and repeated. Dropping the VIN does not leave me with a guarantee (i.e. repeatability and predictability). If I run this code: BEGIN ATOMIC DELETE FROM Vehicles WHERE id = 41; INSERT INTO Vehicles (vehicle_mileage, iso_tire_size) VALUES (values of inserted row 41 ); END; The relational algebra says that I should have in effect done nothing. I have dropped and re-inserted the same object - an EXCEPT and UNION operation that cancel. But since IDENTITY is physical and not logical, this does not work. If I insert the same vehicle (object) into another table, the system will not guarantee me that I get the same IDENTITY as the relational key in the other table. The VIN would be guaranteed. The guarantee requirement gets worse. SQL is a set-oriented language and allows me to write things like this. INSERT INTO Vehicles (vin, vehicle_emileage, iso_tire_size) SELECT vin, vehicle__mileage, iso_tire_size FROM NewPurchases; Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Vehicles all at once, not a row at a time. There are (n!) ways to number (n) rows. Which one did you pick? Why? The answer in SQL products has been to use whatever the *physical* order of the physical table happened to be. That non- relational phrase "physical order" again! But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order. Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical and each row models the same object as it did before. Yes, I can write a lot of procedural code and triggers, thus violating all Codd's rules about high-level data manipulation and integrity being handled by the RDBMS. But that is a kludge and good SQL programmers know it. THE KINDS OF KEYS: Now for a little more practice than theory. Here is my classification of types of keys. It is based on common usage 1) A natural key is a subset of attributes which occur in a table and act as a unique identifier. They are seen by the user. You can go to the external reality and verify them. You would also like to have some validation rule. Example: UPC codes on consumer goods (read the package barcode) and validate them with a check digit or a manufacturer's website or a tool (geographical co-ordinates validate with a GPS). These should be either industry standards or natural phenomena. 2) An artificial key is an extra attribute added to the table which is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself. It is up to the DBA to maintain a trusted source for them. Example: the open codes in the UPC scheme which a user can assign to his own stuff. The check digits still work, but you have to verify them inside your own enterprise. If you have to construct a key yourself, it takes time to design them, to invent a validation rule, set up audit trails, etc. Yes, doing things right takes time and work. Not like just popping an IDENTITY on every table in the schema, is it? 3) An "exposed physical locator" is not based on attributes in the data model, but in the physical storage and is exposed to user. There is no reasonable way to predict it or verify it, since it usually comes from the physical state of the hardware at the time of data insertion. The system obtains a value thru some physical process in the hardware totally unrelated to the logical data model. Just because IDENTITY does not hold a track/sector address (like Oracle's row_id) does not make a logical key. A hash points to a table with the address. An index (the mechanism in IDENTITY) resolves to the target address via pointer chains. If you re-hash, or re- index, the physical locator has to resolve to the new physical location. Technically, these IDENTITY values are not really keys at all, since they are derived from the PHYSICAL storage and are not even attributes in the LOGICAL data model. Physical locators include assorted flavors of indexes (Ingres uses over a dozen kinds), pointer chains (Sybase SQL Anywhere's key joins), row numbers (Oracle), hash tables (Teradata), bit vectors (Rushmore and Foxpro), non-contiguous storage (SAND, the whole database is reduced to a multi-dimensional vector space), inverted list (Model 204), etc. Any device that you can use to locate a particular row in storage without a sequential search of the storage qualifies. In the old days, we had to explicitly create, use and update indexes. They were very exposed and one of the classic errors was not to re- index a file. For Oracle, the classic was storing a row_id in code or a file, and then doing a storage re-organization. All of the exposed locators were subject to some characteristic screw up where logical and physical models met. Thus, we had utilities for cleaning up broken pointer chains, re-indexing files, etc. in those early days. One of the major advances in SQL was putting this responsibility on the system and not the programmers. The bad news is that if someone uses exposed locators in an SQL product, there are not a lot of utilities for cleaning up the mess. ** Notice that people get "exposed physical locator" and "surrogate" mixed up; they are totally different concepts. ** WHAT IS A SURROGATE KEY? A quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434. This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does. Codd also wrote the following: "There are three difficulties in employing user-controlled keys as permanent surrogates for entities. (1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.). (2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same. (3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree). These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. [emphasis begin] Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410) [emphasis end]. That means if IDENTITY were a surrogate, we would not see its values, print them out as invoice numbers and so forth. That means if IDENTITY were a surrogate, we could drop it and the schema would still work. Like an index. But IDENTITY fails on both those points. PRACTICAL STUFF This is a summary of points already made, but put in one place. The summary is "Good bye, data integrity" in 25 words or less. IDENTITY does not port. This is such a basic part of Software Engineering; I am not going to comment on it. If anyone thinks that they will always deal only with their own code on one machine with one release of one product is not planning a long career. IDENTITY leaves gaps. This is because it is not a surrogate and does not verify its value against a relational key. It is a software implementation of a counter, like we had on some of the old tape drives. The practical problems involve having to explain the gaps to the SOX guy in accounting. IDENTITY has no validation. One tag number is as good as another. IDENTITY has no verification rules. Looking at an object, you cannot tell what IDENTITY value will be assigned to it. Is 41 or 42 the id of this automobile? Prove it. IDENTITY has no repeatability. The value depends on the arrival time in the physical storage. David Cressey advanced the theory that this all started out with Access 97. If you build a database with one of the database wizards, this is usually what you get. Also, if you design your own database and create your own tables, there comes a time, for each table, where MS Access intervenes, and tells you that you have not assigned a primary key, an that one is recommended, and that Access will do it for you, if that's ok. If you say "yes" (who would not?) what you get is precisely this: an extra numeric column, auto assigned, and declared as the key of the new table. Actually, I can live with this, for so called "entity tables". It would be better for the newbie DBA to discover a natural key, and declare that, but it's not so bad. For "relationship tables", the primary key ought to be a compound key made up of two or more foreign keys. But that is not what Access does. Someone who has learned database design before building their first Access database would not do this. But Access makes it seductively simple to design and build your own database, without even a minimal understanding of the consequences of your decisions. |
#100
|
|||
|
|||
Separate PK in Jxn Tbl?
I always use an autonumber PK and a unique index set on the two FK fields [sic]. Why? No particular good reason. One of my database rules is that all tables have an autonumber primary key [sic].
You add redundancy to a schema and never thought about it? If you had gone thru the Normalization process, this would stick out as a fundamental design error immediately. You have not been writing SQL; you are faking a sequential file system in SQL and even say "field" instead of "column" -- huge conceptual and implementation differences! It's also slightly easier to delete the record [sic: rows are not records!] in code. How do you know that you got the right entity? Don't you have a text editor if a little extra typing is that serious a problem that you need to add errors and overhead to your SQL? Now if I was to have a child table [sic: that terms comes from Network DBs; did you used to program in IMS?] from the junction table [sic: did you mean a table that models a relationship among many entities? Or a multi-way pointer structure as in an Network DB?] then I would absolutely use a autonumber primary key [sic: it cannot be a key by definition] for ease of use when designing queries, forms and reports. I hope that you are not designing forms and reports in the database. That is a job for the front. Now, in the old days, when COBOL, et al had a file system, we had to do it this -- monolithic architecture vs tiered architecture. The theorists will argue. I don't care. Would you say the same thing to an accountant who wanted the books to balance ?? |
Thread Tools | |
Display Modes | |
|
|