I was told "Fields are expensive, records are cheap"



 
 
#1 · Dennis · February 25th, 2010, 12:44 AM · posted to microsoft.public.access.tablesdbdesign

Hi,

I just have a self-education question.

In response to one of my questions I stated that I added a lot of
fields (a lot being 30) to my table. The initial implementation was
so successful that the user requested quite a few more enhancements,
resulting in the addition of 30 additional fields.

One MVP who responded stated "Fields are expensive, records are
cheap". I'm curious about his statement. I'm new to Access (less
than a year), but I have over 30 years of experience with relational
databases on multiple platforms. I've always been taught the exact
opposite - that "fields are cheap, records are expensive" - since
going to disk is so slow versus accessing data in memory.

Is there something different about Access that makes the statement
"Fields are expensive, records are cheap" true? I'm using Access on
a local machine where the front end and back end reside together, as
well as having multiple front ends on each client's machine tied
into a back end which resides on a file server. We have a hardwired
Ethernet Cat 5 network.

Dennis

#2 · Rick Brandt · February 25th, 2010, 01:04 AM · posted to microsoft.public.access.tablesdbdesign

Dennis wrote:

> Hi,
>
> I just have a self-education question.
>
> In response to one of my questions I stated that I added a lot of
> fields (a lot being 30) to my table. The initial implementation was
> so successful that the user requested quite a few more enhancements,
> resulting in the addition of 30 additional fields.
>
> One MVP who responded stated "Fields are expensive, records are
> cheap". I'm curious about his statement. I'm new to Access (less
> than a year), but I have over 30 years of experience with relational
> databases on multiple platforms. I've always been taught the exact
> opposite - that "fields are cheap, records are expensive" - since
> going to disk is so slow versus accessing data in memory.


So you think rows are on the disk but fields are in RAM? Data has to be
retrieved from and written to disk regardless of this difference.

> Is there something different about Access that makes the statement
> "Fields are expensive, records are cheap" true? I'm using Access on
> a local machine where the front end and back end reside together,
> as well as having multiple front ends on each client's machine tied
> into a back end which resides on a file server. We have a hardwired
> Ethernet Cat 5 network.


This is not specific to Access. It is a "best practice" in all relational
databases to have a normalized data design. This generally results in
taller, narrower tables as opposed to wider, shallower tables.

When table modifications call for lots of new fields it often means that a
one-to-many relationship that ought to be set up in multiple tables is being
shoe-horned into a single table.
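
To make Rick's point concrete, here is a minimal sketch of that kind
of split, in the Jet/Access dialect of SQL. The table and field names
are hypothetical, and the -- comments are annotation only (Access's
query editor accepts neither comments nor multi-statement batches, so
run each statement separately and without them):

-- Shoe-horned design: repeating phone fields crammed into one wide
-- table. Every new phone number requires a new field and a schema
-- change.
CREATE TABLE CustomerWide (
    CustomerID AUTOINCREMENT PRIMARY KEY,
    CustomerName TEXT(50),
    Phone1 TEXT(20),
    Phone2 TEXT(20),
    Phone3 TEXT(20)
);

-- Normalized design: taller, narrower tables. A fourth phone number
-- is just another record in CustomerPhone, not a schema change.
CREATE TABLE Customer (
    CustomerID AUTOINCREMENT PRIMARY KEY,
    CustomerName TEXT(50)
);

CREATE TABLE CustomerPhone (
    PhoneID AUTOINCREMENT PRIMARY KEY,
    CustomerID LONG REFERENCES Customer (CustomerID),
    PhoneNumber TEXT(20)
);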


#3 · Dennis · February 25th, 2010, 05:15 AM · posted to microsoft.public.access.tablesdbdesign

Rick,

Thanks for your response, but either I did not explain myself well
or you missed some obvious points in my statements.

Your comment: "It is a 'best practice' in all relational databases
to have a normalized data design."

Response: This is true. I figured you would have picked up on that when I
stated I’ve been designing relational databases for over 30 years, which is
why I included that statement. But obviously I was wrong. Yes, relational
databases have been around for well over 30 years.


Your comment: “This generally results in taller, narrower tables as opposed
to wider, shallower tables.”

Response: This is true when you compare normalized tables to flat files.
However, I was not. I was comparing relational to relational. Therefore,
this statement has nothing to do with the question.

As for narrow tables, that depends on the application. Most
accounting applications have pretty narrow tables. However, most
personal lines and commercial lines insurance policy and claim
tables can get very wide.


Your comment: “So you think rows are on the disk but fields are in RAM?”

Response: Maybe you learned different computer science than I did,
but while this is a gross oversimplification, your statement is
ABSOLUTELY correct. Rows are on disk and fields are in RAM. I know
how other relational databases work, but I don't know how Access
works, which is what I am trying to find out.

It is absolutely true that a program can only work with fields that
are in RAM. An application program cannot work with a field that is
on disk; fields are only accessible while they are in RAM. The row
has to be retrieved from the disk and placed in memory in order for
the VBA code to have access to the row's fields. Most database
systems provide the data to the program a record or row at a time
(excluding all of the look-ahead and optimization techniques).

A program reads and writes rows (not fields) to the disk. I don’t believe
that VBA code can work on the entire row, unless there is a way to treat the
entire row as a string. I know that other databases have this capability, but
I don’t know if Access does or not.

When the file system writes the row, except for blobs and other such
specialty fields, the read and write functions within the database
engine do not "know" where one field ends and the next one starts.
They simply read a record from the table and place it in memory, or
they write a record/row to the table.

Given the above, I am truly puzzled by your statement. You make it
sound as if this is not true when in fact that is exactly the way it
works, even if it is extremely oversimplified.

Just out of curiosity, do you know of any application software that
works directly on the disk? Granted, the OS and OS utilities work
directly on the hard drive, but I'm not aware of any application
software that does.


Your comment: "Data has to be retrieved from and written to disk
regardless of this difference."

Response: This statement is the point of my question! But once
again, we disagree. The difference is in the disk access time it
takes to retrieve one record from the disk over a network versus
retrieving multiple records. To me, the statement "Fields are
expensive, records are cheap" implies it is better to have smaller
rows and multiple linked tables than it is to have one row that
contains all of the normalized data.

To me, this is ludicrous - unless there is something I don't know
about Access. It seems to me that the only things achieved by
breaking up a large normalized row into multiple smaller rows in
multiple tables are a significant increase in disk access time and
significantly higher network traffic. But then maybe Access is very
inefficient in dealing with a lot of fields, and it is truly better
to deal with slower disk access time than with Access's overhead. I
don't know. That is what I am trying to find out.

Even given the increase in network and disk access time, CPU
processing speed and memory access time are still many, many times
faster than accessing the disks. Also, in most client-server systems
I've had experience with, the network traffic (data access) has been
the bottleneck.


Your comment: “When table modifications call for lots of new fields it
often means that a one-to-many relationship that ought to be set up in
multiple tables is being shoe-horned into a single table.”

Response: Obviously you did not read my statement. This was
something I learned over 30 years ago. Third normal form tables do
not contain repeating fields unless they are multi-valued fields, in
which case they are quite useful in exploding parts for a completed
assembly such as a car door, engine, or other such things. However,
since Access does not support multi-valued fields at this time, this
is not an issue with respect to Access.

As I stated, the initial implementation was SO successful that the
user requested quite a few new enhancements that were beyond the
scope of the original project, and we had to add 30 new normalized
data fields. (I guess I should have included that word.)


Dennis

#4 · David W. Fenton · February 25th, 2010, 06:32 PM · posted to microsoft.public.access.tablesdbdesign

Dennis wrote:
> However, most personal lines and commercial lines insurance policy
> and claim tables can get very wide.


That suggests they are probably denormalized to some degree.

I know a number of popular applications (ACT springs to mind) use
very wide tables, and this means their application has to be more
complicated to work with the data. It also is one of the main causes
of the performance problems that ACT has in terms of filtering large
datasets -- it's harder to efficiently use indexes when you've got
the data denormalized (and some of the data in ACT (last time I
checked, which was many, many years ago) seems to be stored in
multi-value fields, which is yet another reason for performance
problems, since even if they are indexed, they can't be used
efficiently).

I will agree that a wide table *can* be normalized.

I will also agree that there can be perfectly good reasons to
denormalize.

But I have very few tables in any of my apps that have even close to
50 fields.

A common reason for denormalization that looks properly normalized
is mis-definition of the entities (i.e., a table holds records, each
of which represents an entity, and each field is an attribute of the
entity). In a survey, for instance, it's very common to create a
single record for each survey, with a field for each question. But
this approach is completely inextensible -- any change to the
questionnaire requires a change to the table. Or you have to create
a table for each questionnaire. Or you have to add fields that are
used in some questionnaires and not in others.

I learned this lesson with a medical study database that was tied to
the structure of the UNOS national transplant registry. There was
only one medical study, and for it to work the variables being
examined had to be established at the beginning, so it was never
going to change. It seemed logical to use a record per patient. But
before the app was even finished, UNOS changed the data structure of
how they recorded data, and we ended up having to add fields to the
existing tables and then change all the forms and reports to reflect
that. If we'd used a proper structure, i.e., with a record for each
variable, and a table that defined what the variables were, we would
have needed to make absolutely no changes to the *structure* of the
data or the application -- we would have been able to accommodate
the additional variables just by adding *data* to the existing
tables.
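
A minimal sketch of the record-per-variable structure David
describes, in the same Jet SQL as above; the names are hypothetical
and not from the UNOS project:

-- One record per variable definition. A new study variable is a new
-- row here, not a new field anywhere.
CREATE TABLE StudyVariable (
    VariableID AUTOINCREMENT PRIMARY KEY,
    VariableName TEXT(50),
    DataType TEXT(20)
);

-- One record per patient per variable. Adding variables changes the
-- data, never the structure, the forms, or the reports.
CREATE TABLE PatientValue (
    PatientID LONG,
    VariableID LONG REFERENCES StudyVariable (VariableID),
    ValueText TEXT(255),
    CONSTRAINT PK_PatientValue PRIMARY KEY (PatientID, VariableID)
);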

That's the essence of the problem. Fields are expensive, records are
cheap -- it's easy to change the data, hard to change the data
storage *structure*.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#5 · David W. Fenton · February 25th, 2010, 06:37 PM · posted to microsoft.public.access.tablesdbdesign

Dennis wrote:
> A program reads and writes rows (not fields) to the disk.


I think you're stuck in a very old-fashioned approach to this, one
that reflects the very constricted memory and storage available back
in the 80s and early 90s.

A Jet/ACE data store doesn't store data in records, but in data
pages. It's not like legacy systems at all. And the fields are
variable length, so records are variable length. The efficiency of
interaction with the storage and RAM is pretty moot since the days
of disk caching, when even a write is done to RAM (and then
persisted to disk when convenient). This means there's hardly any
difference between RAM manipulation and reading/writing from/to the
data store, since both are happening in RAM.

Disk caching has been a given on Windows since Win3.11, c. 1992.
This predates Access, in fact, so Access/Jet has never been as bound
to disk space/performance as the databases that seem to form the
model in your head.

And when a server database is involved, the bottleneck is not RAM
vs. disk, but the network connection, which is vastly slower than
not just the RAM, but even slower than disk access.

So, your model seems completely outdated to me.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#6 · Dennis · February 26th, 2010, 07:21 AM · posted to microsoft.public.access.tablesdbdesign


David,

I want everyone to know that I stand corrected. In my response to
Bruce, I stated that no one had answered the original question in
this series of correspondence. At that point I had not read your
response.

Thank you for answering the question. I also reposted the question,
and Allen Browne, you, and a few others responded to the re-post and
answered the disk access speed issue.


Your comment: "A Jet/ACE data store doesn't store data in records,
but in data pages. It's not like legacy systems at all. And the
fields are variable length, so records are variable length. The
efficiency of interaction with the storage and RAM is pretty moot
since the days of disk caching, when even a write is done to RAM
(and then persisted to disk when convenient). This means there's
hardly any difference between RAM manipulation and reading/writing
from/to the data store, since both are happening in RAM."

Response: I can see what you are saying, but that assumes that you
are accessing data records sequentially. What happens to all that
disk caching when I am randomly reading records, which I would do
during data entry (which is where I am most concerned about disk
access and network speed)? The user might need the 1st record, then
the 1,000th record, then the 5,257th record, then the 10th record,
and so on. Unless the entire table is cached, what good does loading
a data page do if the record that you need next is not on that page?
And if the data is cached on a server, that does no good at all
because you have the network bottleneck.

I know about write caching, and it does not affect the speed of
retrieving data (unless you do not have enough memory and your
machine is thrashing), which is why it was not part of the question.

Your comment: "And when a server database is involved, the
bottleneck is not RAM vs. disk, but the network connection, which is
vastly slower than not just the RAM, but even slower than disk
access."

My response: I know the network speed is MUCH, MUCH slower, which is
why I made it part of the question. Remember, my original
interpretation of John's "Fields are expensive, records are cheap"
had to do with disk access over a network versus internal CPU
processing. I thought he knew something about the internal
processing of Access where Access was very slow when processing
large records. I grossly misunderstood his comment. I thought he was
saying it is better to access the disk multiple times with small
records than to have one large Access record. He has since explained
his comment, and now I know it had nothing to do with disk access
speed.

Dennis

#7 · David W. Fenton · February 26th, 2010, 11:49 PM · posted to microsoft.public.access.tablesdbdesign

Dennis wrote:

> Your comment: "A Jet/ACE data store doesn't store data in records,
> but in data pages. It's not like legacy systems at all. And the
> fields are variable length, so records are variable length. The
> efficiency of interaction with the storage and RAM is pretty moot
> since the days of disk caching, when even a write is done to RAM
> (and then persisted to disk when convenient). This means there's
> hardly any difference between RAM manipulation and reading/writing
> from/to the data store, since both are happening in RAM."
>
> Response: I can see what you are saying, but that assumes that you
> are accessing data records sequentially. What happens to all that
> disk caching when I am randomly reading records, which I would do
> during data entry (which is where I am most concerned about disk
> access and network speed)? The user might need the 1st record, then
> the 1,000th record, then the 5,257th record, then the 10th record,
> and so on. Unless the entire table is cached, what good does
> loading a data page do if the record that you need next is not on
> that page? And if the data is cached on a server, that does no good
> at all because you have the network bottleneck.


You seem not to have done much investigation of how Jet works.
Indexes are the key. When you request data from a table, Jet first
requests the relevant indexes (based on your criteria) and then uses
the indexes to figure out which data pages to retrieve. The
requested data is going to be brought into RAM for processing, and
since it's been requested from disk, it will be cached in Jet's
internal cache, as well as in the OS's disk cache. When you write to
the data, Jet will take care of writing to disk, which will then be
handled by the OS, and if the written data pages are in the disk
cache (and you are caching writes), the writes will be to the disk
cache and then will be persisted to disk according to your disk
cache settings (and you can also tell Jet explicitly to flush its
cache, which in turn tells the OS to flush its write cache, i.e.,
persist to disk).

This is all managed quite efficiently, so efficiently, in fact,
that there is very little reason to worry about the things you are
worrying about until you reach extremely large datasets. That's
usually not what is being used by an Access application directly
(i.e., if you have a large dataset, you're likely going to use a
different back-end than Jet/ACE).

> I know about write caching, and it does not affect the speed of
> retrieving data (unless you do not have enough memory and your
> machine is thrashing), which is why it was not part of the
> question.


Writing is much slower than reading because it involves so many
different operations (writing indexes, determining if new pages need
to be allocated), as well as maintaining a set of appropriate locks
(write locks are a much bigger deal than read locks).

But disk caching is a big deal for reading, too. As I explained
above about the way Jet works, the indexes are the key to retrieving
data efficiently. Data pages in Jet 4 and ACE are 4KB in size, so
you're retrieving data from the disk in blocks whose size is very
often well-aligned with the file system's block size (the NTFS file
system uses default block sizes in increments of KBs, with 4KB being
the smallest usual block size, and the next being 8, 16, etc., so
reading 4KB data pages is likely to be very efficient, since they
will be aligned with the file system's block size). Jet retrieves
what it needs as needed, but only as much as needed.

A simplified explanation of how Jet works: this SQL statement:

SELECT tblPerson.LastName, tblPerson.FirstName, tblPerson.PersonID
FROM tblPerson
WHERE tblPerson.LastName = "Fenton"

...will first retrieve the index on LastName, which will be keyed to
the primary key, so Jet will then retrieve the primary key index. In
both cases, only as much of the index as necessary is retrieved
(it's not a sequential scan, but a b-tree traversal), and in the
case of the primary key index, since Jet tables are clustered on the
primary key, I'm pretty sure the primary key index is actually the
real data pages (I could be wrong on that).

I don't know for certain if the non-PK indexes store a link to the
PK or if they directly reference the data page of the PK, but the
point is that it's not a sequential scan of anything, but a b-tree
index traversal, which is substantially more efficient.
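
To make that concrete: the query above only gets this b-tree
treatment if LastName is actually indexed. A one-line sketch in the
same Jet SQL as above (the index name is made up):

-- Without this index, Jet would have to scan every data page in
-- tblPerson; with it, the WHERE clause becomes a b-tree traversal.
CREATE INDEX idxLastName ON tblPerson (LastName);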

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#8 · Dennis · February 26th, 2010, 07:39 AM · posted to microsoft.public.access.tablesdbdesign

David,

Your comment: "When table modifications call for lots of new fields
it often means that a one-to-many relationship that ought to be set
up in multiple tables is being shoe-horned into a single table."

"It also means that you have to alter your front-end forms and
reports by adding all the fields, whereas if you're adding records
instead, whatever continuous or datasheet form you're using to
display those records will just have more records in it, so there is
no need to alter the user interface objects."

Response: I don't understand your comment. It is absolutely true
that if I add more records, I don't have to alter anything. But when
the user asks where they enter these new fields and where the new
reports are, what am I supposed to say? Well, I did not add those
fields, but you can enter more data? That makes no sense at all.

I've yet to figure out how to capture additional data and produce
reports with this additional data without modifying the UI or
creating new reports. Sorry, I have no idea what you mean by the
above statement.

Your comment: "To *me*, that's what the phrase means, that adding
fields is more complicated in terms of UI, which, to be frank, is
where 90% of our time is spent in the development process (rather
than in schema design/changes)."

My response: You are preaching to the choir! I fully agree.

However, this project started out as a simple membership mailing
list. And that is all it was. Once the users saw it working, their
imaginations went wild, and they wanted to start capturing
additional information so they could produce additional reports.

When this project started, I knew about the other reports and
strongly suggested that they just capture the additional fields and
produce all of the reports. However, the users kept saying no, no,
no, we don't need all that stuff. The process is working just fine.
And the boss man said: just do the mailing labels and that's all. So
that is what I did. I designed it for the full requirements but only
implemented the mailing label piece.

But what are you supposed to do when the user, who is paying you,
says no, we don't need that, when you know they are going to need
it? You can only push the customer so far.

But again, none of this had anything to do with the original disk
access speed vs. Access internal processing speed question.


Dennis
#9 · David W. Fenton · February 27th, 2010, 12:07 AM · posted to microsoft.public.access.tablesdbdesign

Dennis wrote:
> Your comment: "When table modifications call for lots of new
> fields it often means that a one-to-many relationship that ought
> to be set up in multiple tables is being shoe-horned into a single
> table."
>
> "It also means that you have to alter your front-end forms and
> reports by adding all the fields, whereas if you're adding records
> instead, whatever continuous or datasheet form you're using to
> display those records will just have more records in it, so there
> is no need to alter the user interface objects."
>
> Response: I don't understand your comment. It is absolutely true
> that if I add more records, I don't have to alter anything. But
> when the user asks where they enter these new fields and where the
> new reports are, what am I supposed to say? Well, I did not add
> those fields, but you can enter more data? That makes no sense at
> all.


I suspect you haven't looked at Duane's survey database. In the
structure there (which is a fairly extreme version of what I'm
talking about), when you add a new question to a survey, you add a
new record to the questions table and link it to a particular
survey. This will then be displayed by the same user interface that
was used to display the other questions.

In the structure you're suggesting, you'd have to add a field for
the question to your survey table and then alter your form for data
entry, your report for printing out, and all of the tools you use to
analyze and summarize the results.

Now, a lot of statistical tools expect a flat-file structure, with
one record for each completed survey, so it's usually the case that
one takes the many-records structure and denormalizes it into a flat
structure that is exported for use by the statistics program. The
key there is that you don't choose your data structure based on how
you're going to output it. Instead, you choose the structure that is
going to be most extensible with the least amount of work, and then
you take that data and convert it to the structures required for
particular purposes, which is very often a denormalized structure.

> I've yet to figure out how to capture additional data and produce
> reports with this additional data without modifying the UI or
> creating new reports. Sorry, I have no idea what you mean by the
> above statement.


I strongly recommend you examine Duane's survey sample database.

> Your comment: "To *me*, that's what the phrase means, that adding
> fields is more complicated in terms of UI, which, to be frank, is
> where 90% of our time is spent in the development process (rather
> than in schema design/changes)."
>
> My response: You are preaching to the choir! I fully agree.
>
> However, this project started out as a simple membership mailing
> list. And that is all it was. Once the users saw it working, their
> imaginations went wild, and they wanted to start capturing
> additional information so they could produce additional reports.


I have plenty of databases that are badly structured because they
started out small, when it seemed it wasn't necessary to choose an
industrial-strength schema, and then when they got extended, I
greatly regretted not doing it the right way on the front end.

So, the existing app is no justification for continuing to use a
less than optimal schema.

> When this project started, I knew about the other reports and
> strongly suggested that they just capture the additional fields
> and produce all of the reports. However, the users kept saying no,
> no, no, we don't need all that stuff. The process is working just
> fine. And the boss man said: just do the mailing labels and that's
> all. So that is what I did. I designed it for the full requirements
> but only implemented the mailing label piece.
>
> But what are you supposed to do when the user, who is paying you,
> says no, we don't need that, when you know they are going to need
> it? You can only push the customer so far.


The approach I'm advocating doesn't take more time to implement to
cover the initial requirements and is inherently extensible, without
any additional programming for the UI.

You've front-loaded the programming to handle the future extensions,
which is something I do, too, but the suggestion by those advocating
records over fields is that there is no extra work involved. You
don't have to do anything extra to prepare for the N new fields,
because the structure you've built handles any number of records
from the very beginning.

Have you ever implemented a many-to-many join when the current
requirement was only a many-to-one? Say the client has customers and
there's a customer type. You can store it in a field in the customer
record, but what if the client later decides that a customer can be
more than one customer type simultaneously? At that point you need a
many-to-many join between customers and customer types. If you
structured it that way from the beginning, it would handle 1 type
per customer, and then would handle as many customer types as you
needed when the scope expands.

This is the simplest example of the cheap records vs. expensive
fields equation. And the point is that as soon as the client needs
more than 1 type per customer, you have to decide how many you're
going to accommodate. You could, I guess, add a second customer type
field, but then when they need three, you'd have to add a third new
field. With the join table, you accommodate anything from 0 to
however many you could ever conceive of. It's not more complicated
to implement, and when the customer needs more than 1, there is
nothing that needs to be done (other than, perhaps, altering your UI
to remove any limitations you may have previously put on entry).
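
A minimal sketch of that join-table structure, again in Jet SQL with
made-up names (it assumes a Customer table with a CustomerID primary
key, like the one sketched earlier in the thread):

CREATE TABLE CustomerType (
    TypeID AUTOINCREMENT PRIMARY KEY,
    TypeName TEXT(30)
);

-- Junction table: one record per customer/type pairing. A customer
-- with three types is three cheap records, not three new fields.
CREATE TABLE CustomerTypeAssign (
    CustomerID LONG REFERENCES Customer (CustomerID),
    TypeID LONG REFERENCES CustomerType (TypeID),
    CONSTRAINT PK_CustomerTypeAssign PRIMARY KEY (CustomerID, TypeID)
);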

> But again, none of this had anything to do with the original disk
> access speed vs. Access internal processing speed question.


That was your red herring. You asked for an explanation of the
expression. You misinterpreted it and went down the access speed
garden path. I'm explaining what the phrase refers to, and it is, in
fact, more efficient to add records than fields (consider the number
of indexes that have to be maintained/searched).

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
#10 · David W. Fenton · February 25th, 2010, 06:22 PM · posted to microsoft.public.access.tablesdbdesign

Rick Brandt wrote:

> When table modifications call for lots of new fields it often
> means that a one-to-many relationship that ought to be set up in
> multiple tables is being shoe-horned into a single table.


It also means that you have to alter your front-end forms and
reports by adding all the fields, whereas if you're adding records
instead, whatever continuous or datasheet form you're using to
display those records will just have more records in it, so there is
no need to alter the user interface objects.

To *me*, that's what the phrase means, that adding fields is more
complicated in terms of UI, which, to be frank, is where 90% of our
time is spent in the development process (rather than in schema
design/changes).

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
 



