#1
Restated: "Fields are expensive, records are cheap"
Hi,
First let me apologize for the empty question below. I hit the Post button by mistake and it posted a blank question. Sorry.

I am restating my question because, based upon the responses I received, I obviously stated my question incorrectly. So please let me try again; hopefully I will be a bit more successful this time around.

In a previous question, I stated that I added 30 fields to the membership table (one of many in the system). The initial implementation was so successful that the user requested quite a few enhancements, resulting in a huge increase in the scope of the original project. These changes resulted in the addition of 30 new fields to the membership table. The last time people responded to my question, they were concerned about whether or not these were normalized fields. Please let me state that the row has been normalized to third normal form and these fields are NOT recurring fields.

One MVP who responded to my original question stated "Fields are expensive, records are cheap". Maybe I am misinterpreting his comment, but to me this means that it is better to have smaller rows in multiple linked tables than it is to have one large row that contains all of the normalized data (i.e., it is better to split a large normalized row in a single table into multiple rows in multiple tables). I hope I explained that the right way.

My question pertains to the difference between the disk access time it takes to retrieve one record from the disk over a network versus retrieving multiple records, versus the overhead involved in Access parsing a lot of fields out of one large row or parsing the same number of fields from multiple smaller rows. I've always been taught the exact opposite - that "Fields are cheap, records are expensive" - since going to disk is so SLOW versus accessing data in memory. Is there something different about Access where the statement "Fields are expensive, records are cheap" is true?

I'm using Access on a local machine where the front and back ends reside on the same machine, as well as having multiple front ends on each client's machine tied into a back end which resides on a file server over a Cat 5 hardwired Ethernet network. My question is strictly concerning the data access time of multiple rows over the network versus Access's overhead of extracting data from multiple small rows versus one large row. And we are assuming a third normal form database design. It may well be that I am totally misinterpreting the "Fields are expensive, records are cheap" comment.

Thank you for your comments.

Dennis
#2
Restated: "Fields are expensive, records are cheap"
The mantra in the subject line is purely a rule of thumb for beginning database designers. It has no bearing on your question about data access.

Your question revolves around how the data is physically stored on the disk. You're saying that there is a delay to move the disk heads to different physical locations to retrieve records, and that delay represents degraded performance. While there's some truth in that simple view, the reality is much more complex than that.

Access uses pointers to store the data. If you define a Text field as 255 characters but enter 2 characters in it, it does not use up 510 bytes (Unicode) of disk space to store those 2 characters. Internally, the database keeps track of where each field starts in each record. Some fields (memo, OLE, attachments, MVFs) are not actually stored in line with the other data in the record either.

So it's not a matter of a single record being faster to retrieve. There are many other factors at work here, including whether the database has been compacted recently, whether you are using page- or record-locking, whether the disk is badly fragmented, how intelligent the various layers of caching are (physical on the drive, on the controller, in the operating system, or in Access), the Rushmore algorithms in JET, and so on. Then we may have to adopt another mindset to handle SSD storage issues as well.

In the end, you don't really need to consider these as factors when designing JET tables. It makes no sense to break a 60-field table down into one main table with several one-to-one relationships just for performance reasons. The management of those relationships alone represents unnecessary complexity for the database and for the developer. You may be forced to go that route if you are subclassing, or if your record is too wide to fit into the 4K page size; otherwise don't even consider it (which is probably what you were thinking of in posting.) Having said that, having 60 fields in one table is unusual.
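[Editor's aside: Allen's point about pointer-based storage can be sketched in a few lines of Python. This is a toy model only - the real JET on-disk format is more involved (and adds Unicode compression) - but it shows why a 255-character Text field holding 2 characters doesn't cost 510 bytes: the record stores a small offset table plus only the bytes actually entered.]

```python
import struct

def pack_record(fields):
    """Pack variable-length text fields behind an offset table, the way
    JET-style engines avoid padding short values out to the declared size.
    (Toy illustration only -- not Access's actual on-disk format.)"""
    data = b"".join(f.encode("utf-16-le") for f in fields)
    offsets, pos = [], 0
    for f in fields:                       # record where each field ends
        pos += len(f.encode("utf-16-le"))
        offsets.append(pos)
    header = struct.pack(f"<H{len(offsets)}H", len(offsets), *offsets)
    return header + data

def unpack_field(record, i):
    """Read field i back using the offset table, no per-field scan needed."""
    (count,) = struct.unpack_from("<H", record, 0)
    offsets = struct.unpack_from(f"<{count}H", record, 2)
    start = offsets[i - 1] if i > 0 else 0
    header_len = 2 + 2 * count
    return record[header_len + start : header_len + offsets[i]].decode("utf-16-le")

rec = pack_record(["AB", "a declared-255-char field holding few chars", ""])
```

Here `pack_record(["AB"])` occupies only 8 bytes (4 bytes of header, 4 bytes of UTF-16 data) rather than the 510 a fixed-width layout would need, and any field can still be located directly from the offset table.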
The database I happen to be working on right now has 93 tables, and none of them has more than 32 fields (including 6 for tracking when and by whom the record was created, most recently edited, and deactivated.)

-- Allen Browne - Microsoft MVP. Perth, Western Australia
Tips for Access users - http://allenbrowne.com/tips.html
Reply to group, rather than allenbrowne at mvps dot org.

"Dennis" wrote in message ...
#3
Restated: "Fields are expensive, records are cheap"
Allen,
Thank you very much for answering my question. I understand the issue of disk access is quite complex; all those issues - plus SCSI versus SATA, non-RAID versus the different levels of RAID, the speed of the controller, network switches, routers, and everything else you brought up - make my head hurt when I even try to think about them. So you are correct, I did take a VERY simplistic approach in my question.

I did not realize that the database actually kept track of where each field starts and stops. The other variable-length record databases I have worked with just buried start and stop markers in the record. So while the "presentation" layer of the database knew where each field started and ended, the actual engine had no idea.

I will have to go back and check whether it would be possible for all of the fields to be filled out and possibly exceed the 4K record/row limit. I did not know about the 4K limit. Thank you for bringing that up - you may have saved me again.

You are correct when you say I was thinking about breaking my 60-field record into multiple tables. I interpreted the comment from the MVP as suggesting I should break that 60-field record up into multiple tables. When an MVP takes time to comment on something I am doing, I do try to follow their advice; it has not yet led me astray. I have since been informed that the comment "Fields are expensive..." refers to the dollar cost of adding the field to the forms and reports, and the future maintenance cost associated with that field. But that is another story.

My background is with large commercial insurance companies. All of the companies have had relational databases designed to third normal form. Between the policy information, reinsurance tracking, and the state, federal, and rate development information we have to maintain for each policy, the typical policy master table had well over 200 fields in it. The policy master table just had fields that were common to all policies! Any data that applied to a specific line (auto, homeowners, etc.) is stored in a line-specific policy master table. Our coverage tables have over 100 fields. Our claim master table had over 100 fields in it. So for me, 60 fields is actually pretty small.

However, I will go back and re-examine my design. There are things that I could put in a separate table, but they really are properties of the member. Thank you for responding and supplying all that wonderful information. It gives me a better understanding of how Access works internally.

Thanks once again,

Dennis
#4
Restated: "Fields are expensive, records are cheap"
"Dennis" wrote in message:

: My background is with large commercial insurance companies. All of the
: companies have had relational databases designed to the 3rd normal form.

The determination of whether or not those tables were actually normalized depends on the chosen definition of the entity being modelled in the table. I would consider the 200-field table you mention later to be unquestionably denormalized, even though I know nothing at all about the content/function of those fields. That sounds like a table that has a bunch of fields that are used only for a single record type, so that an auto insurance policy has one set of fields, but a renter's insurance policy has a different set of fields. Any time you're using some fields for some records and not for others, it's an indication to me that the entity has been misdefined and should probably be broken into at least two tables, with a narrow header table and a long child table, where each row stores what was formerly a field in the wide table.

All that said, my conclusion could be wrong for any particular application. But "fields are expensive, rows are cheap" is a generalized rule of thumb, not a hard-and-fast law of nature. It allows for exceptions for certain purposes, but is a starting point for evaluating a schema design.

-- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/
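[Editor's aside: the narrow-header / long-child decomposition David describes can be sketched with SQLite via Python's sqlite3 module. All table and column names here are hypothetical, invented for illustration, not taken from any real insurance schema.]

```python
import sqlite3

# Narrow header table holds only the truly common fields; the long child
# table stores one row per (policy, attribute) pair -- each row is what
# would have been a column in the wide 200-field table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Policy (
        PolicyID   INTEGER PRIMARY KEY,
        Effective  TEXT NOT NULL,
        Expiration TEXT NOT NULL
    );
    CREATE TABLE PolicyAttribute (
        PolicyID  INTEGER REFERENCES Policy(PolicyID),
        Name      TEXT NOT NULL,
        Value     TEXT,
        PRIMARY KEY (PolicyID, Name)
    );
""")
con.execute("INSERT INTO Policy VALUES (1, '2010-01-01', '2011-01-01')")
con.executemany(
    "INSERT INTO PolicyAttribute VALUES (1, ?, ?)",
    [("ReinsuranceTreaty", "XL-2010-07"), ("CancelStatus", "Active")],
)
rows = con.execute(
    "SELECT Name, Value FROM PolicyAttribute WHERE PolicyID = 1 ORDER BY Name"
).fetchall()
```

The trade-off both posters are circling: this shape never hits a field-count limit and never wastes columns on attributes a given policy lacks, but it pushes per-attribute typing and validation out of the schema and into code.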
#5
Restated: "Fields are expensive, records are cheap"
David,

Your comment: "The determination of whether or not those tables were actually normalized depends on the chosen definition of the entity being modelled in the table. I would consider the 200-field table you mention later to be unquestionably denormalized, even though I know nothing at all about the content/function of those fields."

Response: This is a very common statement from people who do not know the P&C insurance industry. The 100- to 200-field master record (depending upon the vendor) contained ONLY the common, non-recurring policy information shared between the different lines of business (types of policies). The insured's name and address information is not even included in these fields, as that information is in another table. Actually, when you examine the data at the "policy" level for different lines (personal auto, homeowners, renters, work comp, commercial auto, general liability), you would be surprised to find it is actually quite similar. It is not until you get to the "coverage" and insured "object" level that the different lines vary dramatically.

There is a lot of policy information that is never seen by the public. Just off the top of my head, some common policy information is: reinsurance information; cancellation and reinstatement status (the cancellation and reinstatement history is maintained in a separate table); statistical reporting to ISO and each individual state; premium accounting (not accounts receivable) with written, earned, unearned, and in-force premiums; renewal offer tracking; voluntary audit information; and physical audit tracking. You could break all this information into its own tables, but why? So much of the above information is inter-related (like effective and expiration dates) that breaking it into separate tables just slows down data access and increases complexity.

One of the clients I worked for was a state-owned insurer of last resort. If you could not find insurance anywhere else, you could purchase it from this company. The company was less than a year old when I started working there. They were located in a city where there was a lot of banking expertise but very little insurance expertise. Their staff had all sorts of banking experience, but no insurance-experienced people. The first thing I did was sit down with the vendor and go over their system. I learned that system in less than a week. It was simple to learn not because I am that brilliant, but because there are only so many ways you can build an insurance system; if you understand insurance, you can easily understand a new system. (Talk to an auto mechanic - it does not take them long to learn a new "normal passenger" car [I'm excluding the exotic engines] because there are only so many ways to build one.) The vendor commented that they were glad I was there, because they had been trying to teach the banking and AS/400 people for about a year and no one really understood the system. Again, that is not because the company did not have smart or experienced people; it is because their people lacked insurance experience. I had to give the insurance company's CFO the reinsurance premium accounting journal entries for their financial books. This was not because she was dumb (as a matter of fact she was quite brilliant); it was because she did not have insurance accounting experience, which is quite a bit different from normal accounting.

I went through all that just to say that the head of the company's IT department thought the same thing you do (he also came from a banking background). So he hired some database design consultants to review the database's design, who again did not understand insurance. (Had they understood insurance, they could have taken a quick look and realized the database was in pretty decent shape.) Instead, they gathered all of the data, all of the relationships, all of the interdependencies, and did their thing. Guess what: they came up with some minor suggestions but no major changes, which is what I told the CIO before he started this effort. But oh well. That is where the experience comes in.

Also, as I stated in the other discussion on this subject (which I am surprised you missed, as you are commenting in that discussion also), I've worked on 12 different vendors' insurance systems over the years. Those systems were written decades apart with totally different technology and totally different development groups. At one extreme we have a flat-file system running on a mainframe, and at the other end a Windows-based object-oriented client/server system using a SQL database. And they have all had a 100- to 200-field policy master table. (The more capable systems had the larger number of fields.) It is interesting that you would disagree with all those people with all that experience. But whatever.

Your comment: "That sounds like a table that has a bunch of fields that are used only for a single record type, so that an auto insurance policy has one set of fields, but a renter's insurance policy has a different set of fields."

Response: Well, it may sound like that, but again this is the common statement from a newbie in the P&C insurance field. The normal way I've seen the policy master designed is to have a common policy master table where all common fields (all 100 to 200, depending upon the system) are stored. Then for each line of business (such as auto or renters), you have a line-of-business policy master table that contains the fields specific to that line of business. This table is an extension of the common policy master table. In a good design, you simply don't store line-specific fields in the policy master table; you store them in the line-specific policy master tables. One of the reasons the policy master record is so big is that there is a whole lot of "behind the scenes" data being stored that the policy holder never sees (see above).

At the coverage and insured object level, the story is totally different. While the structure of the coverage tables mirrors the policy master and line-specific policy master, the coverage master table is actually quite narrow. That is because there is not a whole lot of common information (other than effective and expiration dates, policy accounting [not accounts receivable], statistical accounting, coverage limits, and reinsurance) at the coverage level. Most of the coverage information is stored in different line-specific coverage and insured object tables (two or more tables). These tables are extensions of the coverage master table and children of the line-specific policy master tables. The homeowner coverage is actually comprised of multiple coverage tables, because a homeowner policy can cover multiple lines of business. For example, a homeowner policy can cover fire and property damage (one line of business), general liability (another), theft (another), and work comp for household help (another). Those are just the lines of business I could think of off the top of my head. A full implementation of a homeowner policy is extremely involved and very complicated.

But back to your example. Your statement is incorrect. The personal auto policy master, coverage, and insured object tables contain the auto-specific coverage information, while the renter's policy, coverage, and insured object tables contain the renter's-specific coverage information. The common information for both the auto and renter's policies is stored in the policy master table.

Your comment: "Any time you're using some fields for some records and not for others, it's an indication to me that the entity has been misdefined, and should probably be broken into at least two tables, with a narrow header table and a long child table, where each row stores what was formerly a field in the wide table."

Response: You are preaching to the choir here! I totally agree. However, we are going to have to disagree on the "narrow header table" issue. The header table is as long as the data model/structure requires it to be. If it is short, it is short. If it is long, then it is long.

Your comment: "All that said, my conclusion could be wrong for any particular application."

Response: I agree with this point.

Your comment: "But 'fields are expensive, rows are cheap' is a generalized rule of thumb, not a hard-and-fast law of nature. It allows for exceptions for certain purposes, but is a starting point for evaluating a schema design."

Response: I now understand John's logic behind "Fields are expensive, rows are cheap" and, given the context, I fully agree with it.

Dennis
#6
Restated: "Fields are expensive, records are cheap"
"Dennis" wrote in message:

: Also, as I stated in the other discussion on this subject (which I am
: surprised you missed, as you are commenting in that discussion also),
: I've worked on 12 different vendors' insurance systems over the years.
: Those systems were written decades apart with totally different
: technology and totally different development groups.

I have looked at the data structures that have become common practice for a lot of different application types, and have repeatedly seen bad designs become the norm for those creating those types of applications. That many people have used the same data structure doesn't make it a good one. Likewise, lots of applications have to support legacy applications that can't easily handle more normalized relational structures, and thus a lot of structures that were required 15-20 years ago by the available technology are still in use. This is quite understandable, of course, but if you're developing a new application using modern technology, there's no reason to maintain the old data structures unless they are really the best model of the entities involved. I see this kind of thing all the time.

A 200-field table is just not a good model, and for Access it is way too close to the 255-field limit for me to be comfortable with it. While it's theoretically possible that the structure is not denormalized, I think it's extremely unlikely, and that there are better structures to store and manipulate the very same data.

-- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/
#7
Restated: "Fields are expensive, records are cheap"
"Allen Browne" wrote in message:

: [...] There are many other factors at work here, including whether the
: database has been compacted recently, whether you are using page- or
: record-locking, whether the disk is badly fragmented, how intelligent
: the various layers of caching are (physical on the drive, or the
: controller, or the operating system, or in Access), the Rushmore
: algorithms in JET, and so on.

And none of this considers the issue of disk caching, such that there's not that much difference between data loaded into RAM and data that is being read from or written to disk, since the latter usually takes place through the caching mechanism and is not going to be limited by the speed of the actual storage medium, but by RAM. This has been the case in Windows since at least c. 1991-92, when Windows 3.1 was released with disk caching turned on by default. It was essential for decent performance in Windows, but it also meant that your databases were sped up because of the disk cache, too (although back then it was largely a read-only improvement, as lazy writes and such had not been implemented in the DOS disk cache; any modern version of Windows, though, caches both reads and writes).

-- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/
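[Editor's aside: the caching effect David describes can be modeled with a small LRU cache in Python. This is a deliberately simplified stand-in for the OS page cache, not how Windows actually implements it; page contents and sizes here are invented for illustration.]

```python
from collections import OrderedDict

class PageCache:
    """Minimal LRU read cache over a dict standing in for 4K disk pages."""
    def __init__(self, disk, capacity):
        self.disk, self.capacity = disk, capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def read(self, page_no):
        if page_no in self.cache:
            self.hits += 1
            self.cache.move_to_end(page_no)           # mark most recently used
        else:
            self.misses += 1
            self.cache[page_no] = self.disk[page_no]  # the slow physical read
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)        # evict least recently used
        return self.cache[page_no]

disk = {n: f"page-{n}" for n in range(100)}
cache = PageCache(disk, capacity=8)
for n in [1, 2, 3, 1, 2, 3, 4]:   # revisited pages are served from RAM
    cache.read(n)
```

After this access pattern the cache has served 3 of the 7 reads from memory; only the 4 first-touch reads paid the physical-storage cost, which is the effect that makes cached I/O run at close to RAM speed.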
#8
Restated: "Fields are expensive, records are cheap"
David,
I can see where disk caching would help in a sequential process, but does disk caching really help in a randomly accessed database during data entry? The first data entry might access the first record, the next the 1,000th record, the next the 5,000th record, and so on. So, unless the entire table is cached, does it really help?

Dennis
#9
Restated: "Fields are expensive, records are cheap"
"Dennis" wrote in message:

: I can see where disk caching would help in a sequential process, but
: does disk caching really help in a randomly accessed database during
: data entry?

Yes, because every modern database uses b-tree traversal of indexes to locate records. Next question?

-- David W. Fenton http://www.dfenton.com/ usenet at dfenton dot com http://www.dfenton.com/DFA/
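[Editor's aside: the reason b-tree traversal matters here is that with a large fan-out the index is very shallow, so locating any record among millions touches only a handful of index pages, and the top-level pages are shared by every lookup and therefore stay hot in the cache. A back-of-the-envelope sketch - fan-out figures are illustrative, not JET's actual numbers:]

```python
def btree_height(n_records, fanout):
    """Index levels (page reads) needed to locate one record in a B-tree,
    assuming each index page holds `fanout` entries.
    Illustrative only -- not JET's actual page layout or fan-out."""
    height, reach = 1, fanout
    while reach < n_records:
        reach *= fanout
        height += 1
    return height

# With ~1000 entries per index page, a million-row table needs only two
# index levels -- and the root page is read by *every* lookup, so it is
# effectively always in cache.
levels = btree_height(1_000_000, fanout=1000)
```

So even a "random" lookup costs a couple of index-page reads (almost certainly cached) plus one data-page read, rather than anything proportional to the table size - which is why caching pays off without the entire table being in memory.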
#10
Restated: "Fields are expensive, records are cheap"
David,
I stated: "The first data entry might access the first record, the next the 1,000th record, then the 5,000th record, and so on. So, unless the entire table is cached, does it really help?"

Your comment: "Yes, because every modern database uses b-tree traversal of indexes to locate records."

My response: So what if the modern database uses b-tree traversal of indexes to locate the records? What does that have to do with physically reading the record from the disk on a server? I can see where the speed of determining the location of the particular record would be helped by this, but knowing where the record is and getting the record are two totally different things. Once the disk address / page / whatever is determined, the db engine still has to go to disk to get the record, unless the entire table is in cache or in memory on the local machine. So once again, how does all this caching and b-tree traversal speed up the physical reading of a record that is not in memory? The database engine still has to go to disk - or worse yet, over the network - to get the desired record. If I've got it wrong (which I may well have), please explain where I am missing the point.

Thanks,

Dennis