#1
I was told "Fields are expensive, records are cheap"
Hi,
I just have a self-education question. In response to one of my earlier questions, I mentioned that I had added a lot of fields (a lot being 30 fields) to my table. The initial implementation was so successful that the user requested quite a few more enhancements, resulting in the addition of 30 more fields. One MVP who responded stated "Fields are expensive, records are cheap". I'm curious about his statement.

I'm new to Access (less than a year), but I have over 30 years of experience with relational databases on multiple platforms. I've always been taught the exact opposite - that "Fields are cheap, records are expensive" - since going to disk is so slow compared with accessing data in memory. Is there something different about Access that makes the statement "Fields are expensive, records are cheap" true?

I'm using Access both on a local machine, where the front end and back end reside on the same machine, and in a configuration with multiple front ends on each client's machine tied into a back end that resides on a file server. We have a hardwired Ethernet Cat 5 network.

Dennis
#2
Dennis wrote:
> I've always been taught the exact opposite - that "Fields are cheap, records are expensive" since going to disk is so slow versus accessing data in memory.

So you think rows are on the disk but fields are in RAM? Data has to be retrieved from and written to disk regardless of this difference.

> Is there something different about Access where the statement "Fields are expensive, records are cheap" is true?

This is not specific to Access. It is a "best practice" in all relational databases to have a normalized data design. This generally results in taller, narrower tables as opposed to wider, shallower tables. When table modifications call for lots of new fields, it often means that a one-to-many relationship that ought to be set up in multiple tables is being shoe-horned into a single table.
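The "taller, narrower" versus "wider, shallower" point can be sketched with a toy schema. This is a minimal illustration only, using SQLite as a stand-in for Jet/ACE; the table and column names are invented, not from the thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Wide, shallow design: a one-to-many relationship shoe-horned into one table.
cur.execute("""
    CREATE TABLE MemberWide (
        MemberID INTEGER PRIMARY KEY,
        Name     TEXT,
        Phone1   TEXT,
        Phone2   TEXT,
        Phone3   TEXT   -- a fourth phone number would require ALTER TABLE
    )""")

# Taller, narrower design: the one-to-many relationship gets its own table.
cur.execute("""
    CREATE TABLE Member (
        MemberID INTEGER PRIMARY KEY,
        Name     TEXT
    )""")
cur.execute("""
    CREATE TABLE MemberPhone (
        MemberID INTEGER REFERENCES Member(MemberID),
        Phone    TEXT
    )""")

# Any number of phone numbers is just more records -- no schema change needed.
cur.execute("INSERT INTO Member VALUES (1, 'Dennis')")
cur.executemany("INSERT INTO MemberPhone VALUES (1, ?)",
                [("555-0100",), ("555-0101",), ("555-0102",), ("555-0103",)])
phones = cur.execute(
    "SELECT COUNT(*) FROM MemberPhone WHERE MemberID = 1").fetchone()[0]
print(phones)  # 4 phone numbers stored without touching the table design
```

The wide table caps out at three phone numbers per member; the narrow pair accommodates any count by adding records, which is the "records are cheap" half of the aphorism.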
#3
Rick,
Thanks for your responses, but either I did not explain myself well or you missed some obvious points in my statements.

Your comment: "It is a 'best practice' in all relational databases to have a normalized data design."

Response: This is true. I figured you would have picked up on that when I stated I've been designing relational databases for over 30 years, which is why I included that statement. But obviously I was wrong. Yes, relational databases have been around for well over 30 years.

Your comment: "This generally results in taller, narrower tables as opposed to wider, shallower tables."

Response: This is true when you compare normalized tables to flat files. However, I was not; I was comparing relational to relational, so this statement has nothing to do with the question. As for narrow tables, that depends on the application. Most accounting applications have pretty narrow tables, but most personal-lines and commercial-lines insurance policy and claim tables can get very wide.

Your comment: "So you think rows are on the disk but fields are in RAM?"

Response: Maybe you learned different computer science than I did, but while this is a gross oversimplification, your statement is ABSOLUTELY correct. Rows are on disk and fields are in RAM. I know how other relational databases work, but I don't know how Access works, which is what I am trying to find out. It is absolutely true that a program can only work with fields that are in RAM. An application program cannot work with a field that is on disk; fields are only accessible while they are in RAM. The row has to be retrieved from the disk and placed in memory in order for the VBA code to have access to the row's fields. Most database systems provide the data to the program a record or row at a time (excluding all of the look-ahead and optimization techniques). A program reads and writes rows (not fields) to the disk.

I don't believe that VBA code can work on the entire row, unless there is a way to treat the entire row as a string. I know that other databases have this capability, but I don't know whether Access does. When the file system writes the row - except for BLOBs and other such specialty fields - the read and write functions within the database engine do not "know" where one field ends and the next one starts. They simply read a record from the table and place it in memory, or they write a record/row to the table. Given the above, I am truly puzzled by your statement. You make it sound as if this is not true, when in fact that is exactly the way it works, even if it is extremely oversimplified. Just out of curiosity, do you know of any application software that works directly on the disk? Granted, the OS and OS utilities work directly on the hard drive, but I'm not aware of any application software that does.

Your comment: "Data has to be retrieved from and written to disk regardless of this difference."

Response: This statement is the point of my question! But once again, we disagree. The difference is in the disk access time it takes to retrieve one record from the disk over a network versus retrieving multiple records. To me, the statement "Fields are expensive, records are cheap" implies it is better to have smaller rows and multiple linked tables than it is to have one row that contains all of the normalized data. To me, this is ludicrous - unless there is something I don't know about Access. It seems to me that the only things achieved by breaking up a large normalized row into multiple smaller rows in multiple tables are a significant increase in disk access time and significantly higher network traffic. But then maybe Access is very inefficient in dealing with a lot of fields, and it is truly better to accept slower disk access than Access's overhead. I don't know; that is what I am trying to find out.

Even given the increase in network and disk access time, CPU processing speed and memory access time are still many, many times faster than accessing the disks. Also, in most client-server systems I've had experience with, the network traffic (data access) has been the bottleneck.

Your comment: "When table modifications call for lots of new fields it often means that a one-to-many relationship that ought to be set up in multiple tables is being shoe-horned into a single table."

Response: Obviously you did not read my statement. This was something I learned over 30 years ago. Third-normal-form tables do not contain repeating fields unless they are a multi-valued field, in which case they are quite useful for exploding parts of a completed assembly such as a car door, engine, or other such things. However, since Access does not support multi-valued fields at this time, this is not an issue with respect to Access. As I stated, the initial implementation was SO successful that the user requested quite a few new enhancements beyond the scope of the original project, and we had to add 30 new normalized data fields. (I guess I should have included that word.)

Dennis
#4
Dennis wrote:
> However, most personal lines and commercial lines insurance policy and claim tables can get very wide.

That suggests they are probably denormalized to some degree. I know a number of popular applications (ACT springs to mind) use very wide tables, and this means their application has to be more complicated to work with the data. It is also one of the main causes of the performance problems ACT has in filtering large datasets -- it's harder to use indexes efficiently when the data is denormalized (and some of the data in ACT, last time I checked, which was many, many years ago, seems to be stored in multi-value fields, which is yet another reason for performance problems, since even if they are indexed, they can't be used efficiently).

I will agree that a wide table *can* be normalized. I will also agree that there can be perfectly good reasons to denormalize. But I have very few tables in any of my apps that have even close to 50 fields.

A common cause of denormalization that looks properly normalized is mis-definition of the entities (i.e., a table holds records, each of which represents an entity, and each field is an attribute of that entity). In a survey, for instance, it's very common to create a single record for each survey, with a field for each question. But this approach is completely inextensible -- any change to the questionnaire requires a change to the table. Or you have to create a table for each questionnaire. Or you have to add fields that are used in some questionnaires and not in others.

I learned this lesson with a medical study database that was tied to the structure of the UNOS national transplant registry. There was only one medical study, and for it to work the variables being examined had to be established at the beginning, so it was never going to change. It seemed logical to use a record per patient.

But before the app was even finished, UNOS changed the data structure of how they recorded data, and we ended up having to add fields to the existing tables and then change all the forms and reports to reflect that. If we'd used a proper structure, i.e., with a record for each variable and a table that defined what the variables were, we would have needed to make absolutely no changes to the *structure* of the data or the application -- we would have been able to accommodate the additional variables just by adding *data* to the existing tables.

That's the essence of the problem. Fields are expensive, records are cheap -- it's easy to change the data, hard to change the data storage *structure*.

--
David W. Fenton
http://www.dfenton.com/
usenet at dfenton dot com
http://www.dfenton.com/DFA/
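The record-per-variable structure described above can be sketched as follows. This is an illustrative sketch only (SQLite stands in for Jet/ACE, and the table and column names are invented): extending the questionnaire becomes an INSERT, not an ALTER TABLE plus form and report changes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Questions are data, not columns: one record per variable,
# plus a table that defines what the variables are.
cur.execute("""
    CREATE TABLE Question (
        QuestionID INTEGER PRIMARY KEY,
        Text       TEXT
    )""")
cur.execute("""
    CREATE TABLE Answer (
        SurveyID   INTEGER,
        QuestionID INTEGER REFERENCES Question(QuestionID),
        Response   TEXT
    )""")

cur.executemany("INSERT INTO Question VALUES (?, ?)",
                [(1, "Age?"), (2, "Blood type?")])

# The registry changes its questionnaire: just add a record.
# No schema change, no form change, no report change.
cur.execute("INSERT INTO Question VALUES (3, 'Prior transplant?')")

count = cur.execute("SELECT COUNT(*) FROM Question").fetchone()[0]
print(count)  # 3
```

A flat one-record-per-patient table would have needed a new column (and UI work) for the third question; here it is purely new data.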
#5
Dennis wrote:
> A program reads and writes rows (not fields) to the disk.

I think you're stuck in a very old-fashioned approach to this, one that reflects the very constricted memory and storage available back in the '80s and early '90s.

A Jet/ACE data store doesn't store data in records, but in data pages. It's not like legacy systems at all. And the fields are variable length, so records are variable length. The efficiency of interaction between storage and RAM has been pretty moot since the days of disk caching, when even a write is done to RAM (and then persisted to disk when convenient). This means there's hardly any difference between RAM manipulation and reading/writing from/to the data store, since both are happening in RAM. Disk caching has been a given on Windows since Win 3.11, c. 1992. This predates Access, in fact, so Access/Jet has never been as bound to disk space/performance as the databases that seem to form the model in your head.

And when a server database is involved, the bottleneck is not RAM vs. disk but the network connection, which is vastly slower than not just RAM, but even slower than disk access.

So, your model seems completely outdated to me.
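A rough back-of-the-envelope illustration of why page-based storage blunts the row-at-a-time model: each disk read pulls in a whole page of records, not one row. The 4 KB page size for Jet 4/ACE is stated later in this thread; the average record size and per-page overhead below are invented round numbers for illustration only.

```python
PAGE_SIZE = 4096   # Jet 4 / ACE data page size, in bytes
AVG_RECORD = 120   # assumed average variable-length record size (illustrative)
OVERHEAD = 40      # assumed per-page header/bookkeeping overhead (illustrative)

# One physical read brings this many records into the cache at once.
records_per_page = (PAGE_SIZE - OVERHEAD) // AVG_RECORD
print(records_per_page)

# So even "random" record requests are often satisfied from a page that is
# already cached in RAM, because dozens of neighbours rode in with it.
```

Under these assumed numbers a single page holds a few dozen records, which is why the cost of one record versus several on the same page is nearly identical once the page is cached.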
#6
David,

I want everyone to know that I stand corrected. In my response to Bruce I stated that no one had answered the original question in this series of correspondences. At that point I had not read your response. Thank you for answering the question. I also reposted the question, and Allen Browne, you, and a few others responded to the repost and answered the disk access speed issue.

Your comment: "A Jet/ACE data store doesn't store data in records, but in data pages. It's not like legacy systems at all. And the fields are variable length, so records are variable length. The efficiency of interaction with the storage and RAM are pretty moot since the days of disk caching, when even a write is done to RAM (and then persisted to disk when convenient). This means there's hardly any difference between RAM manipulation and reading/writing from/to the data store, since both are happening in RAM."

Response: I can see what you are saying, but that assumes you are accessing data records sequentially. What happens to all that disk caching when I am randomly reading records, as I would during data entry (which is where I am most concerned about disk access and network speed)? The user might need the 1st record, then the 1,000th record, then the 5,257th record, then the 10th record, and so on. Unless the entire table is cached, what good does loading a data page do if the record you need next is not on that page? And if the data is cached on a server, that does no good at all because you have the network bottleneck. I know about write caching, and it does not affect the speed of retrieving data (unless you do not have enough memory and your machine is thrashing), which is why it was not part of the question.

Your comment: "And when a server database is involved, the bottleneck is not RAM vs. disk, but the network connection, which is vastly slower than not just the RAM, but even slower than disk access."

My response: I know the network speed is MUCH, MUCH slower, which is why I made it part of the question. Remember, my original interpretation of John's "Fields are expensive, records are cheap" had to do with disk access over a network versus internal CPU processing. I thought he knew something about the internal processing of Access where Access was very slow when processing large records. I grossly misunderstood his comment. I thought he was saying it is better to access the disk multiple times with small records than to have one large Access record. He has since explained his comment, and now I know it had nothing to do with disk access speed.

Dennis
#7
Dennis wrote:
> What happens to all that disk caching when I am randomly reading records, as I would during data entry? Unless the entire table is cached, what good does loading a data page do if the record you need next is not on that page? And if the data is cached on a server, that does no good at all because you have the network bottleneck.

You seem not to have done much investigation of how Jet works. Indexes are the key. When you request data from a table, Jet first requests the relevant indexes (based on your criteria) and then uses the indexes to figure out which data pages to retrieve. The requested data is brought into RAM for processing, and since it's been requested from disk, it will be cached in Jet's internal cache as well as in the OS's disk cache.

When you write to the data, Jet takes care of writing to disk, which is then handled by the OS; if the written data pages are in the disk cache (and you are caching writes), the writes go to the disk cache and are then persisted to disk according to your disk cache settings (you can also tell Jet explicitly to flush its cache, which in turn tells the OS to flush its write cache, i.e., persist to disk). This is all managed so efficiently, in fact, that there is very little reason to worry about the things you are worrying about until you reach extremely large datasets. That's usually not what is being used by an Access application directly (i.e., if you have a large dataset, you're likely going to use a different back end than Jet/ACE).

> I know about write caching and it does not affect the speed of retrieving data (unless you do not have enough memory and your machine is thrashing), which is why it was not part of the question.

Writing is much slower than reading because it involves so many different operations (writing indexes, determining whether new pages need to be allocated), as well as maintaining a set of appropriate locks (write locks are a much bigger deal than read locks). But disk caching is a big deal for reading, too. As I explained above, the indexes are the key to retrieving data efficiently. Data pages in Jet 4 and ACE are 4 KB in size, so you're retrieving data from the disk in blocks whose size is very often well aligned with the file system's block size (NTFS uses default block sizes in increments of KBs, with 4 KB being the smallest usual size, then 8, 16, etc., so reading 4 KB data pages is likely to be very efficient, since they will be aligned with the file system's block size). Jet retrieves what it needs as needed, but only as much as needed.

A simplified explanation of how Jet works. This SQL statement:

SELECT tblPerson.LastName, tblPerson.FirstName, tblPerson.PersonID
FROM tblPerson
WHERE tblPerson.LastName = "Fenton"

...will first retrieve the index on LastName, which will be keyed to the primary key, so Jet will then retrieve the primary key index. In both cases, only as much of the index as necessary is retrieved (it's not a sequential scan, but a b-tree traversal), and in the case of the primary key index, since Jet tables are clustered on the primary key, I'm pretty sure the primary key index is actually the real data pages (I could be wrong on that). I don't know for certain whether the non-PK indexes store a link to the PK or directly reference the data page of the PK, but the point is that it's not a sequential scan of anything; it's a b-tree index traversal, which is substantially more efficient.
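The index-first retrieval described above can be observed in any B-tree page engine. As a sketch, SQLite (standing in for Jet here; the behaviour is analogous but not identical) will report in its query plan that the thread's tblPerson query is answered by an index search rather than a full table scan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
    CREATE TABLE tblPerson (
        PersonID  INTEGER PRIMARY KEY,
        LastName  TEXT,
        FirstName TEXT
    )""")
cur.execute("CREATE INDEX idxLastName ON tblPerson (LastName)")
cur.executemany("INSERT INTO tblPerson VALUES (?, ?, ?)",
                [(1, "Fenton", "David"), (2, "Brandt", "Rick")])

# Ask the engine how it would execute the thread's example query.
plan = cur.execute("""
    EXPLAIN QUERY PLAN
    SELECT LastName, FirstName, PersonID
    FROM tblPerson
    WHERE LastName = 'Fenton'
    """).fetchall()

# The plan's detail column reports a SEARCH using idxLastName --
# a B-tree traversal of the index, not a sequential scan of the table.
for row in plan:
    print(row[-1])
```

The same principle is what makes "random" record access cheap: the engine walks a small index B-tree to locate the one data page it needs, rather than reading records sequentially.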
#8
David,
Your comment (replying to Rick's "When table modifications call for lots of new fields it often means that a one-to-many relationship that ought to be set up in multiple tables is being shoe-horned into a single table"): "It also means that you have to alter your front-end forms and reports by adding all the fields, whereas if you're adding records instead, whatever continuous or datasheet form you're using to display those records will just have more records in it, so there is no need to alter the user interface objects."

Response: I don't understand your comment. It is absolutely true that if I add more records, I don't have to alter anything. But when the user asks where they enter these new fields and where the new reports are, what am I supposed to say? "Well, I did not add those fields, but you can enter more data"? That makes no sense at all. I've yet to figure out how to capture additional data and produce reports with that additional data without modifying the UI or creating the new reports. Sorry, I have no idea what you mean by the above statement.

Your comment: "To *me*, that's what the phrase means, that adding fields is more complicated in terms of UI, which, to be frank, is where 90% of our time is spent in the development process (rather than in schema design/changes)."

My response: You are preaching to the choir! I fully agree. However, this project started out as a simple membership mailing list, and that is all it was. Once the users saw it working, their imagination went wild, and they wanted to start capturing additional information so they could produce additional reports. When this project started, I knew about the other reports and strongly suggested that they capture the additional fields and produce all of the reports. However, the users kept saying no, no, no, we don't need all that stuff; the process is working just fine. And the boss man said: just do the mailing labels, and that's all. So that is what I did.

I designed it for the full requirements but only implemented the mailing-label piece. But what are you supposed to do when the user, who is paying you, says "no, we don't need that" when you know they are going to need it? You can only push the customer so far. But again, none of this had anything to do with the original question of disk access speed vs. Access's internal processing speed.

Dennis
#9
Dennis wrote:
> Response: I don't understand your comment. It is absolutely true that if I add more records, I don't have to alter anything. But when the user asks where they enter these new fields and where the new reports are, what am I supposed to say?

I suspect you haven't looked at Duane's survey database. In the structure there (which is a fairly extreme version of what I'm talking about), when you add a new question to a survey, you add a new record to the questions table and link it to a particular survey. This will then be displayed by the same user interface that was used to display the other questions. In the structure you're suggesting, you'd have to add a field for the question to your survey table, and then alter your form for data entry, your report for printing, and all of the tools you use to analyze and summarize the results.

Now, a lot of statistical tools expect a flat-file structure, with one record for each completed survey, so it's usually the case that one takes the many-records structure and denormalizes it into a flat structure that is exported for use by the statistics program. The key there is that you don't choose your data structure based on how you're going to output it. Instead, you choose the structure that is going to be most extensible with the least amount of work, and then you take that data and convert it to the structures required for particular purposes, which is very often a denormalized structure.

> I've yet to figure out how to capture additional data and produce reports with that additional data without modifying the UI or creating the new reports.

I strongly recommend you examine Duane's survey sample database.

> My response: You are preaching to the choir! I fully agree. However, this project started out as a simple membership mailing list. Once the users saw it working, their imagination went wild, and they wanted to start capturing additional information so they could produce additional reports.

I have plenty of databases that are badly structured because they started out small, when it seemed it wasn't necessary to choose an industrial-strength schema, and then when they got extended, I greatly regretted not doing it the right way at the start. So the existing app is no justification for continuing to use a less-than-optimal schema.

> I designed it for the full requirements but only implemented the mailing-label piece. But what are you supposed to do when the user, who is paying you, says "no, we don't need that" when you know they are going to need it? You can only push the customer so far.

The approach I'm advocating doesn't take more time to implement for the initial requirements, and it is inherently extensible without any additional programming for the UI. You've front-loaded the programming to handle the future extensions, which is something I do too, but the point made by those advocating records over fields is that there is no extra work involved. You don't have to do anything extra to prepare for the N new fields, because the structure you've built handles any number of records from the very beginning.

Have you ever implemented a many-to-many join when the current requirement was only many-to-one? Say the client has customers and there's a customer type. You can store it in a field in the customer record, but what if the client later decides that a customer can be more than one customer type simultaneously? At that point you need a many-to-many join between customers and customer types. If you had structured it that way from the beginning, it would handle one type per customer, and then as many customer types as you needed when the scope expands. This is the simplest example of the cheap-records vs. expensive-fields equation. And the point is that as soon as the client needs more than one type per customer, you have to decide how many you're going to accommodate. You could, I guess, add a second customer type field, but then when they need three, you'd have to add a third new field. With the join table, you accommodate anything from zero to however many you could ever conceive of. It's not more complicated to implement, and when the customer needs more than one, there is nothing that needs to be done (other than, perhaps, altering your UI to remove any limitations you may have previously put on entry).

> But again, none of this had anything to do with the original question of disk access speed vs. Access's internal processing speed.

That was your red herring. You asked for an explanation of the expression. You misinterpreted it and went down the access-speed garden path. I'm explaining what the phrase refers to, and it is, in fact, more efficient to add records than fields (consider the number of indexes that have to be maintained and searched).
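The customer-type example above can be sketched as a schema. This is an illustrative sketch only (SQLite as a stand-in for Jet/ACE; the table and column names are invented): the join table satisfies today's one-type-per-customer rule and tomorrow's many-types rule with no schema change.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
cur.execute("CREATE TABLE CustomerType (TypeID INTEGER PRIMARY KEY, TypeName TEXT)")

# Join table: many-to-many from day one, even while the business rule
# is still "one type per customer".
cur.execute("""
    CREATE TABLE CustomerTypeAssign (
        CustomerID INTEGER REFERENCES Customer(CustomerID),
        TypeID     INTEGER REFERENCES CustomerType(TypeID),
        PRIMARY KEY (CustomerID, TypeID)
    )""")

cur.execute("INSERT INTO Customer VALUES (1, 'Acme')")
cur.executemany("INSERT INTO CustomerType VALUES (?, ?)",
                [(1, "Retail"), (2, "Wholesale")])

# Initial requirement: one type per customer.
cur.execute("INSERT INTO CustomerTypeAssign VALUES (1, 1)")
# Scope expands: a second type is just another record, not a new field.
cur.execute("INSERT INTO CustomerTypeAssign VALUES (1, 2)")

n = cur.execute(
    "SELECT COUNT(*) FROM CustomerTypeAssign WHERE CustomerID = 1").fetchone()[0]
print(n)  # 2
```

Compare this with a CustomerType1/CustomerType2/CustomerType3 column design, which caps the count and forces an ALTER TABLE plus UI changes each time the limit is hit.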
#10
Rick Brandt wrote:

> When table modifications call for lots of new fields it often means that a one-to-many relationship that ought to be set up in multiple tables is being shoe-horned into a single table.

It also means that you have to alter your front-end forms and reports by adding all the fields, whereas if you're adding records instead, whatever continuous or datasheet form you're using to display those records will just have more records in it, so there is no need to alter the user interface objects.

To *me*, that's what the phrase means: adding fields is more complicated in terms of UI, which, to be frank, is where 90% of our time is spent in the development process (rather than in schema design/changes).