Import/Normalize approach - column-DML, or loop

  • Steve Jorgensen

    Import/Normalize approach - column-DML, or loop

    Hi all,

    I've just finished almost all of what has turned out to be a real bear of a
    project. It has to import data from a monthly spreadsheet export from another
    program, and convert that into normalized data. The task is made more
    difficult by the fact that the structure itself can vary from month to month
    (in well defined ways).

    So, I used the SQL-centric approach, taking vertical stripes at a time so
    that, for instance, for each field with simple, repeating text data, I make a
    group-by pass, inserting into the destination lookup table, then I do another
    query, joining from the input table to the text fields in the lookups to get
    the foreign keys for the main destination table. When I have multiple columns
    that need to become 1-M, I make a pass for each of those columns, inserting
    a lookup record that identifies the split, then inserting the rows into the
    many-side table, yadda, yadda, yadda.
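
    Concretely, each of those vertical passes boils down to a pair of queries
    along these lines (table and field names here are placeholders, not the real
    schema, and I'm ignoring the check for values already in the lookup):

        INSERT INTO tlkpColor ( ColorName )
        SELECT src.Color
        FROM tblImport AS src
        WHERE src.Color IS NOT NULL
        GROUP BY src.Color;

        INSERT INTO tblMain ( ImportID, ColorID )
        SELECT src.ImportID, lk.ColorID
        FROM tblImport AS src INNER JOIN tlkpColor AS lk
        ON src.Color = lk.ColorName;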

    All that was going swimmingly, and performing pretty well until I got to the
    fields containing multiple, delimited values. My whole design is based on
    using SQL/DML passes for everything, but the only way I could figure out to
    make that work was to call a user defined function from within the query to
    pull out, in successive query slices, argument 1, argument 2, etc. I was
    going to use a where clause to exclude null results (input doesn't have
    arguments n or above), and quit after the first pass with .RecordsAffected =
    0.
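
    The gist of what I tried looks like this - GetArg is the UDF, the names are
    placeholders, and I'm assuming "/" as the delimiter:

        Public Function GetArg(ByVal vList As Variant, ByVal n As Long) As Variant
            ' Return the nth delimited item, or Null if there is no such item.
            Dim varItems As Variant
            GetArg = Null
            If IsNull(vList) Then Exit Function
            varItems = Split(vList, "/")
            If n - 1 <= UBound(varItems) Then GetArg = varItems(n - 1)
        End Function

        Dim db As DAO.Database, n As Long
        Set db = CurrentDb
        n = 1
        Do
            ' This is the part that turned out to be painfully slow.
            db.Execute "INSERT INTO tblItem ( ImportID, ItemCode ) " & _
                "SELECT ImportID, GetArg(MultiField, " & n & ") " & _
                "FROM tblImport " & _
                "WHERE GetArg(MultiField, " & n & ") IS NOT NULL", dbFailOnError
            n = n + 1
        Loop Until db.RecordsAffected = 0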

    Sounds good, but with a mere 8000 input rows, I had to cancel the first pass
    after waiting about 20 minutes. Note that a whole import process takes about
    3 minutes if this part is omitted, and that includes about 40 vertical split
    passes. The UDF is very simple, and performs quite fast from VB, but I guess
    the overhead of calling this function from a query is VERY SEVERE. Just to
    get this thing out the door, I've decided to just not split these for now
    since we're not doing queries of that data yet.

    So, my next thought is that, perhaps this step is better done by simply,
    brute-force, cycling through a recordset of the input table, and inserting
    rows into the destination tables. If I'm doing that, though, then am I really
    getting any benefit worth speaking of by doing everything -else- as SQL-DML in
    vertical slices, or would I have been much better off, just doing it the
    procedural way, walking through the input table, and creating destination rows
    one row at a time? I know it would have been easier to write, but here I was
    trying to do things the "right" way.
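
    By brute force, I mean essentially this sort of loop (same placeholder names
    as above):

        Dim db As DAO.Database
        Dim rsIn As DAO.Recordset, rsOut As DAO.Recordset
        Dim varItems As Variant, i As Long

        Set db = CurrentDb
        Set rsIn = db.OpenRecordset("tblImport", dbOpenSnapshot)
        Set rsOut = db.OpenRecordset("tblItem", dbOpenDynaset, dbAppendOnly)
        Do Until rsIn.EOF
            If Not IsNull(rsIn!MultiField) Then
                varItems = Split(rsIn!MultiField, "/")
                For i = 0 To UBound(varItems)
                    rsOut.AddNew
                    rsOut!ImportID = rsIn!ImportID
                    rsOut!ItemCode = varItems(i)
                    rsOut.Update
                Next i
            End If
            rsIn.MoveNext
        Loop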

    If I do something like this again, would my code end up performing just as
    well, and being easier to write and maintain using simple iteration rather
    than SQL-DML?

    Thanks for any opinions,

    - Steve J
  • Bas Cost Budde

    #2
    Re: Import/Normalize approach - column-DML, or loop

    > If I do something like this again, would my code end up performing just as
    > well, and being easier to write and maintain using simple iteration rather
    > than SQL-DML?

    My guess is that you can build a fairly specialized optimisation here.
    After all, the Jet engine is not the heaviest in the world, and its
    query optimizer can do only so much for true DML.
    I have a query that calculates several values with simple IIfs, and it
    takes three minutes on a simple 20,000-record table!

    Somewhere down the line any SQL statement must be executed
    procedurally--SQL was invented only for us humans to focus attention on
    the what, not the how.

    I'd say: use the procedure. Bonus: you can display your own progress
    indicator. A slow function with feedback is often perceived as quicker
    than a slow function without feedback, even if it runs several seconds
    quicker by not having to update the display.

    Steve Jorgensen wrote:
    >
    > All that was going swimmingly, and performing pretty well until I got to the
    > fields containing multiple, delimited values. My whole dedign is based on
    > using SQL/DML passes for everything, but the only way I could figure out to
    > make that work was to call a user defined function from within the query to
    > pull out, in successive query slices, argument 1, argument 2, etc. I was
    > going to use a where clause to exclude null results (input doesn't have
    > arguments n or above), and quit after the first pass with .RecordsAffected =
    > 0.

    Sometimes you can invent a preprocessor (this time inside the
    spreadsheet); is that possible? I'd like to explore the point, but see
    little value doing that in the group (eating up everyone's bandwidth,
    verbose as I tend to be). If you have time, mail me. My domain is not
    org but nl.

    --
    Bas Cost Budde


    • Steve Jorgensen

      #3
      Re: Import/Normalize approach - column-DML, or loop

      On Thu, 29 Jan 2004 09:44:01 +0100, Bas Cost Budde <bas@heuveltop.org> wrote:
      >> If I do something like this again, would my code end up performing just as
      >> well, and being easier to write and maintain using simple iteration rather
      >> than SQL-DML?
      >
      >My guess is that you can build a fairly specialized optimalisation here.
      > After all, the Jet engine is not the heaviest in the world, and its
      >query optimizers can do only so much for true DML.
      >I have a query that calculates several values by simple IIFs, and that
      >takes three minutes for a simple 20000 records table!

      I generally have much better performance than that. In fact, if we're not
      talking about a multi-processor or multi-user system, I generally find that
      JET is faster than most SQL Servers since it has fewer contingencies to deal
      with.
      >I'd say: use the procedure. Bonus: you can display your own progress
      >indicator. A slow function with feedback is often perceived as quicker
      >than a slow function without feedback, even if it runs several seconds
      >quicker not having to update the display.

      Actually, since most of my vertical slice insert queries run in under 2 or 3
      seconds, I have a pretty good progress bar already - until I get to the
      multi-value split, that is.
      >
      >Steve Jorgensen wrote:
      >>
      >> All that was going swimmingly, and performing pretty well until I got to the
      >> fields containing multiple, delimited values. My whole dedign is based on
      >> using SQL/DML passes for everything, but the only way I could figure out to
      >> make that work was to call a user defined function from within the query to
      >> pull out, in successive query slices, argument 1, argument 2, etc. I was
      >> going to use a where clause to exclude null results (input doesn't have
      >> arguments n or above), and quit after the first pass with .RecordsAffected =
      >> 0.
      >
      >Sometimes you can invent a preprocessor (this time inside the
      >spreadsheet), is that possible? I'd like to explore the point but see

      Actually, I do a small amount of pre-processing in the spreadsheet (just in
      the top few rows) to make it into something Access will import properly, then I
      import that into a staging database, and link to that from the Access importer
      database app (also linked to the target back-end). The transformation,
      import, and link steps take about 6 or 7 seconds altogether. From there, it's
      an ordinary table (albeit, with all Text columns), so I have my choice of DML
      or looping to do the complicated part from there.
      >little value doing that in the group (eating up everyones bandwith,
      >verbose as I tend to be). If you have time, mail me. My domain is not
      >org but nl.

      I don't need that kind of detail. I was really just looking for people's
      experience/opinions of the value of DML vs. looping for this kind of job.


      • Steve Jorgensen

        #4
        Re: Import/Normalize approach - column-DML, or loop

        On Thu, 29 Jan 2004 09:44:01 +0100, Bas Cost Budde <bas@heuveltop.org> wrote:

        Ah - forgot a point...
        >Somewhere in the line any SQL statement must be executed
        >procedurally--SQL is invented only for us humans to focus attention on
        >the what, not the how.

        Well, it's more than that, though. Using procedural code, every single
        read/write must go through several code layers. The SQL engine is designed to
        handle set operations optimally without crossing all those layers, and it can
        use covering indexes (a concept hidden from VB code), etc. Still, it may be
        that for a data transformation operation, those advantages are not realized,
        or worse.


        • Bruce Pick

          #5
          Re: Import/Normalize approach - column-DML, or loop

          Steve,

          I'm assuming you need to separate the delimited fields into a set of
          fields. Below is code that would separate them out in a query, which
          should run fairly quickly. (If you need to create multiple rows for
          each source string, the approach won't help.)

          You could use InStr() to locate the delimiters and then generate the
          fields based on that knowledge.

          Field1: Left([SourceField],(InStr(1,[SourceField],',')-1))
          Balance1: Mid([SourceField],(InStr(1,[SourceField],',')+1))
          Field2: Left([Balance1],(InStr(1,[Balance1],',')-1))
          Balance2: Mid([Balance1],(InStr(1,[Balance1],',')+1))
          Field3: IIf(InStr(1,[Balance2],',')=0,Null,Left([Balance2],(InStr(1,[Balance2],',')-1)))
          Balance3: IIf(IsNull([Field3]),'',Mid([Balance2],(InStr(1,[Balance2],',')+1)))
          Field4: IIf(InStr(1,[Balance3],',')=0,'',Left([Balance3],(InStr(1,[Balance3],',')-1)))
          I think you get the idea.

          If the number of delimited components varies between records, you will
          need to add code to test for IIf( InStr(xxx) = 0, '', ... as above). I
          set up Field3, Balance3, and Field4 with the needed logic.

          I admit it's a bit of a pain to write, but once the pattern is set, the
          statements are easy to derive. I think it should run more quickly
          than a user-defined function in a module.

          Bruce Pick



          • Steve Jorgensen

            #6
            Re: Import/Normalize approach - column-DML, or loop

            No, that's not it at all. I need to split the items out into rows in another
            table with a 1-M relationship, and the number of items varies from row to row.
            I have certainly written code like your example before, and I could generate that
            pattern for a series of queries to extract one argument at a time, but I
            believe I would hit the "Expression too complex" error condition long before I
            got to the last argument (try about 40).



            • David W. Fenton

              #7
              Re: Import/Normalize approach - column-DML, or loop

              Steve Jorgensen <nospam@nospam.nospam> wrote in
              news:oafh10ho2653v9d89fiio3lsjacaljgl9i@4ax.com:
              > Sounds good, but with a mere 8000 input rows, I had to cancel the
              > first pass after waiting about 20 minutes. Note that a whole
              > import process takes about 3 minutes if this part is ommitted, and
              > that includes about 40 vertical split passes. The UDF is very
              > simple, and performs quite fast from VB, but I guess the overhead
              > of calling this function from a query is VERY SEVERE. Just to get
              > this thing out the door, I've decided to just tnot split these for
              > now since we're not doing queries of that data yet.

              I did this about two years ago with multiple fields with
              comma-separated values.

              My solution was this:

              1. first, determine the distribution of multiple values. That is,
              what percent of the table has one value? What percent, two? Three?
              And so on.

              2. in my cases, 95% of the records had 4 or fewer values, so I
              created four additional fields in my processing table.

              3. I then copied the values that have no commas (only one value)
              into the first of the processing columns.

              4. then, in a query that eliminates the rows without commas in the
              original column, copy the value before the first comma into the
              second column.

              5. then, in a query that eliminates the rows that have no commas in
              the source column and a value in the second target column, copy the
              values between the 1st and 2nd commas into the second target column.

              And so forth.

              This is not too hard to do because you're doing it all in relatively
              small sets, rather than trying to process the whole thing with a
              single multi-purpose subroutine.
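
              Each pass is then just a plain filtered update query, something on
              the order of this (column names invented for illustration):

                  UPDATE tblStage
                  SET Val1 = SourceField
                  WHERE InStr(SourceField, ",") = 0;

                  UPDATE tblStage
                  SET Val2 = Left(SourceField, InStr(SourceField, ",") - 1)
                  WHERE InStr(SourceField, ",") > 0;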

              If this isn't clear, let me know and I'll outline it more clearly.

              --
              David W. Fenton http://www.bway.net/~dfenton
              dfenton at bway dot net http://www.bway.net/~dfassoc


              • Steve Jorgensen

                #8
                Re: Import/Normalize approach - column-DML, or loop

                On Thu, 29 Jan 2004 20:29:55 GMT, "David W. Fenton"
                <dXXXfenton@bway.net.invalid> wrote:
                >Steve Jorgensen <nospam@nospam. nospam> wrote in
                >news:oafh10ho2 653v9d89fiio3ls jacaljgl9i@4ax. com:
                >> Sounds good, but with a mere 8000 input rows, I had to cancel the
                >> first pass after waiting about 20 minutes. Note that a whole
                >> import process takes about 3 minutes if this part is ommitted, and
                >> that includes about 40 vertical split passes. The UDF is very
                >> simple, and performs quite fast from VB, but I guess the overhead
                >> of calling this function from a query is VERY SEVERE. Just to get
                >> this thing out the door, I've decided to just tnot split these for
                >> now since we're not doing queries of that data yet.
                >
                >I did this about two years ago with multiple fields with
                >comma-separated values.
                >
                >My solution was this:
                >
                >1. first, determine the distribution of multiple values. That is,
                >what percent of the table has one value? What percent, two? Three?
                >And so on.
                >
                >2. in my cases, 95% of the records had 4 or fewer values, so I
                >created four additional fields in my processing table.
                >
                >3. I then copied the values that have no commas (only one value)
                >into the first of the processing columns.
                >
                >4. then, in a query that eliminates the rows without commas in the
                >original column, copy the value before the first comma into the
                >second column.
                >
                >5. then, in a query that eliminates the rows that have no commas in
                >the source column and a value in the second target column, copy the
                >values between the 1st and 2nd column into the second target column.
                >
                >And so forth.
                >
                >This is not too hard to do because you're doing it all in relatively
                >small sets, rather than trying to process the whole thing with a
                >single multi-purpose subroutine.
                >
                >If this isn't clear, let me know and I'll outline it more clearly.

                Well, I'm not sure that's much help in this case. It's not a field that was
                originally intended to be single-valued, then had some multi-valued data put
                in; it is an export from a system with no other way to represent 1-to-many
                data in a spreadsheet than to put it in a multi-valued field. Out of 8000
                records, there are about 2500 unique combinations (using GROUP BY).

                I think I've figured out that a hybrid approach might be best for my current
                situation, given that I already have all the infrastructure to build the DML
                parts. The idea is that I query a recordset with all the unique combinations,
                cycle through that, and parse out the arguments in each one, build a
                collection of unique values (2500 combinations, but probably only a few
                hundred unique items), then insert the collection items into the lookup table.
                From there, I should be able to join from the source table to the lookup using a
                Like expression ("/" + combination + "/" Like "*/" + delimiter + item-text +
                delimiter + "/*") to populate the junction table.

                I think/hope this will do the job.
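
                In other words, the junction-table pass becomes a single non-equi
                join query, roughly like this (placeholder names again, with "/"
                standing in for the delimiter):

                    INSERT INTO tblJunction ( ImportID, ItemID )
                    SELECT src.ImportID, lk.ItemID
                    FROM tblImport AS src, tlkpItem AS lk
                    WHERE "/" & src.MultiField & "/" LIKE "*/" & lk.ItemCode & "/*";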

                What is your opinion if I do another project like this in the future? DML, or
                loops?


                • David W. Fenton

                  #9
                  Re: Import/Normalize approach - column-DML, or loop

                  Steve Jorgensen <nospam@nospam.nospam> wrote in
                  news:o6nj10dpfhi2f3oeosv3q3aiddloek4uaa@4ax.com:
                  > Well, I'm not sure that's much help in this case. it's not a
                  > field that was originally indended to be single-valued, then had
                  > some multi-valued data put in, it is an export form a system with
                  > no other way to represent 1-to-many data in a spreadsheet than to
                  > put it in a multi-valued field. Out of 8000 records, there are
                  >> about 2500 unique combinations (using GROUP BY).

                  With 8000->2500, I'm not sure there's any real benefit in processing
                  unique values.

                  Either way, though, I'd first process the values into columns and
                  then process the columns into rows. But if more than half the
                  records have more than 10 values in them, then it would become more
                  unwieldy, I think, than just walking the table and creating the new
                  records.

                  I'd do that in two steps of course, as you're likely to want a
                  lookup here, so you'll need both a list of the values connected to
                  the source record and a list of the unique values to pick from in
                  new records. So, I'd create the 1:N records (FK + Value), then
                  create the lookup table with a DISTINCT, then join on the value to
                  populate the lookup key.

                  But with only 8000 records, I think I'd do the record creation by
                  walking the recordset and then walking the field and parsing.
                  > I think I've figured out that a hybrid apprach might be best for
                  > my current situation, given that I already have all the
                  > infrastructure to build the DML parts. . . .

                  DML? Perhaps I should not have ignored the fact that I don't know
                  what DML means?
                  > . . . The idea is that I query a
                  > recordset with all the unique combinations, cycle through that,
                  > and parse out the arguments in each one, build a collection of
                  > unique values (2500 combinations, but probably only a few hundred
                  > unique items), then insert the collection items into the lookup
                  > table. From there, I should be able join from the source table to
                  > the lookup using a Like expression ("/" + combination + "/" Like
                  > "*/" + delimiter + item-text + delimier + "/*") to populate the
                  > junction table.
                  > I think/hope this will do the job.

                  Why join in that fashion?

                  Why not do this: walk the source table, parse each value and insert
                  a record with the source PK and the value into a temporary table,
                  then from that table, create another table with the unique values
                  (this will be your lookup table, with a PK), then join these two new
                  tables on the value field and insert the PK from the source table
                  and the PK from the lookup table into your junction table.
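
                  Once the temporary table is filled, the remaining steps are just
                  a couple of queries, roughly like this (names invented):

                      CREATE TABLE tlkpItem
                      (ItemID AUTOINCREMENT CONSTRAINT pkItem PRIMARY KEY,
                       ItemValue TEXT(50));

                      INSERT INTO tlkpItem ( ItemValue )
                      SELECT DISTINCT ItemValue FROM tmpParsed;

                      INSERT INTO tblJunction ( SourceID, ItemID )
                      SELECT t.SourceID, lk.ItemID
                      FROM tmpParsed AS t INNER JOIN tlkpItem AS lk
                      ON t.ItemValue = lk.ItemValue;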

                  With very large numbers of records, I'd break the first step into 2
                  parts, first parsing the values into columns and then inserting the
                  records from the columns.
                  > What is your opinion if I do another project like this in the
                  > future? DML, or loops?

                  Depends on the number of records and the number of values in the
                  multi-value field. With larger numbers of records and/or larger
                  numbers of values in the multi-value field, I'd tend to process the
                  values into columns first and then create records from the columns.
                  With smaller recordcounts (as in your case), I might do it directly,
                  especially if the number of values stored in each field is large.

                  The last time I did this I was processing a phone number field that
                  had multiple values in it. Most of the fields had 1, 2 or 3 numbers,
                  a handful had 4, and a very small number had more than that. I
                  processed into 4 columns, with the 4th column still being
                  multi-valued, but with only a small number of records to process
                  (indeed, it was such a small number that I believe I created the
                  real records manually!). My experience with this kind of data is
                  that you have lots of records with 1 or 2 values, about half as many
                  with 3, and then a quick falloff from there, with 10% or less having
                  more than that. But it depends entirely on the kind of data. Phone
                  numbers have pretty much an upper limit, but other kinds of data
                  will not.

                  I guess what I'm saying is that with large numbers of records, I'd
                  separate the process of parsing the multi-value field from the
                  process of creating the records, for performance purposes.

                  --
                  David W. Fenton http://www.bway.net/~dfenton
                  dfenton at bway dot net http://www.bway.net/~dfassoc


                  • Steve Jorgensen

                    #10
                    Re: Import/Normalize approach - column-DML, or loop

                    On Fri, 30 Jan 2004 16:34:18 GMT, "David W. Fenton"
                    <dXXXfenton@bway.net.invalid> wrote:
                    >Steve Jorgensen <nospam@nospam. nospam> wrote in
                    >news:o6nj10dpf hi2f3oeosv3q3ai ddloek4uaa@4ax. com:
                    >> Well, I'm not sure that's much help in this case. it's not a
                    >> field that was originally indended to be single-valued, then had
                    >> some multi-valued data put in, it is an export form a system with
                    >> no other way to represent 1-to-many data in a spreadsheet than to
                    >> put it in a multi-valued field. Out of 8000 records, there are
                    >> about 2500 unique combinations (using GROUP BY).
                    >
                    >With 8000->2500, I'm not sure there's any real benefit in processing
                    >unique values.

                    Well, the Group By seems to run pretty fast. Intuitively, it seems like
                    walking more than 3x as many rows to parse would waste more time than the
                    Group By. Of course, it's trivial to try it both ways, and time it.
                    >Either way, though, I'd first process the values into columns and
                    >then process the columns into rows. But if more than half the
                    >records have more than 10 values in them, then it would become more
                    >unwieldy, I think, than just walking the table and creating the new
                    >records.

                    I think it's borderline on that one. I'm not certain I know what you are
                    advocating, though. Are you saying to add columns to the import table, and
                    step-wise run queries to peel off argument 1 to the first added column,
                    removing it from the original source column (leaving argument 2 as argument
                    1), then query again to peel off argument 1 to the second added column, etc.?
                    >I'd do that in two steps of course, as you're likely to want a
                    >lookup here, so you'll need both a list of the values connected to
                    >the source record and a list of the unique values to pick from in
                    >new records. So, I'd create the 1:N records (FK + Value), then
                    >create the lookup table with a DISTINCT, then join on the value to
                    >populate the lookup key.

                    Right, that's pretty much how I'm handling all the other single-valued fields
                    with repeated text already. It's working great.
                    >But with only 8000 records, I think I'd do the record creation by
                    >walking the recordset and then walking the field and parsing.

                    Yeah. If I get what you're saying, I don't think ripping arguments into 10 or
                    40 new columns will be a good thing, and anything short of the 40 or so means
                    writing more code to handle the arguments past #10 or so differently than the
                    rest.
                    >> I think I've figured out that a hybrid apprach might be best for
                    >> my current situation, given that I already have all the
                    >> infrastructure to build the DML parts. . . .
                    >
                    >DML? Perhaps I should not have ignored the fact that I don't know
                    >what DML means?

                    Oh, it's just SQL that's not DDL. SQL consists of DDL and DML. To me, use of
                    the term DML implies a focus on INSERT/UPDATE queries rather than simple
                    SELECTS, though I believe SELECT is still considered part of DML. I welcome
                    anyone's corrections on these facts.
                    >> . . . The idea is that I query a
                    >> recordset with all the unique combinations, cycle through that,
                    >> and parse out the arguments in each one, build a collection of
                    >> unique values (2500 combinations, but probably only a few hundred
                    >> unique items), then insert the collection items into the lookup
                    >> table. From there, I should be able join from the source table to
                    >> the lookup using a Like expression ("/" + combination + "/" Like
                    >> "*/" + delimiter + item-text + delimier + "/*") to populate the
                    >> junction table.
                    >> I think/hope this will do the job.
                    >
                    >Why join in that fashion?
                    >
                    >Why not do this: walk the source table, parse each value and insert
                    >a record with the source PK and the value into a temporary table,
                    >then from that table, create another table witht he unique values
                    >(this will be your lookup table, with a PK), then join these two new
                    >tables on the value field and insert the PK from the source table
                    >and the PK from the lookup table into your junction table.

                    I'm guessing that this join, even though it would be an extra query that will
                    not benefit from any indexes, could be quicker than executing 10 to 40
                    individual inserts from code per source row as I go. I dunno, though. Your
                    way is probably easier to write.
                    >With very large numbers of records, I'd break the first step into 2
                    >parts, first parsing the values into columns and then inserting the
                    >records from the columns.
                    >> What is your opinion if I do another project like this in the
                    >> future? DML, or loops?
                    >
                    >Depends on the number of records and the number of values in the
                    >multi-value field. With larger numbers of records and/or larger
                    >numbers of values in the multi-value field, I'd tend to process the
                    >values into columns first and then create records from the columns.
                    >With smaller recordcounts (as in your case), I might do it directly,
                    > especially if the number of values stored in each field is large.
                    >
                    >The last time I did this I was processing a phone number field that
                    >had multiple values in it. Most of the fields had 1, 2 or 3 numbers,
                    >a handful had 4 and very small number had more than that. I
                    >processed into 4 columns, with the 4th column still being
                    >multi-valued, but with only a small number of records to process
                    >(indeed, it was such a small number that I believe I created the
                    >real records manually!). My experience with this kind of data is
                    >that you have lots of records with 1 or 2 values, about half as many
                    >with 3, and then a quick falloff from there, with 10% or less having
                    >more than that. But it depends entirely on the kind of data. Phone
                    >numbers have pretty much an upper limit, but other kinds of data
                    >will not.
                    >
                    >I guess what I'm saying is that with large numbers of records, I'd
                    >separate the process of parsing the multi-value field from the
                    >process of creating the records, for performance purposes.

                    I guess that sounds reasonable. I think I would have done pretty much the
                    same thing for a field with mostly one or 2 values per row. It was only this
                    strange case of having a 1-M stuffed into a column that brought me this
                    headache to begin with.

                    The other issue I'm seeing, though, is that the code would probably have been
                    simpler, cost less to write, and would now be simpler to maintain if I had
                    used loops instead of queries. Since this is a monthly import, if the
                    processing took less than twice as long as now, perhaps, the simplicity alone
                    would have been sufficient reason to just do the loops, eh? Of course, a
                    slower process means testing less often during development, and that means
                    more difficult debugging sessions each time, so perhaps it would not have cost
                    less to write, but I still think I could have made it more legible for
                    maintenance that way.


                    • David W. Fenton

                      #11
                      Re: Import/Normalize approach - column-DML, or loop

                      Steve Jorgensen <nospam@nospam.nospam> wrote in
                      news:0v2l10hog8rratup427c7a4s1sjeme50tp@4ax.com:
                      > On Fri, 30 Jan 2004 16:34:18 GMT, "David W. Fenton"
                      ><dXXXfenton@bw ay.net.invalid> wrote:
                      >>Steve Jorgensen <nospam@nospam. nospam> wrote in
                      >>news:o6nj10dp fhi2f3oeosv3q3a iddloek4uaa@4ax .com:
                      >>> Well, I'm not sure that's much help in this case. it's not a
                      >>> field that was originally indended to be single-valued, then had
                      >>> some multi-valued data put in, it is an export form a system
                      >>> with no other way to represent 1-to-many data in a spreadsheet
                      >>> than to put it in a multi-valued field. Out of 8000 records,
                      >>> there are about 2500 unique combinations (using GROUP BY).
                      >>
                      >>With 8000->2500, I'm not sure there's any real benefit in
                      >>processing unique values.
                      >
                      > Well, the Group By seems to run pretty fast. Intuitively, it
                      > seems like walking more than 3x as many rows to parse would waste
                      > more time than the Group By. Of course, it's trivial to try it
                      > both ways, and time it.

                      But you'll end up with unique values that have no meaning, and
                      really only save you a tiny amount of time. On the other hand, it
                      doesn't necessarily take very much time to do it, so perhaps it's a
                      wash.
                      >>Either way, though, I'd first process the values into columns and
                      >>then process the columns into rows. But if more than half the
                      >>records have more than 10 values in them, then it would become
                      >>more unwieldy, I think, than just walking the table and creating
                      >>the new records.
                      >
                      > I think it's borderline on that one. I'm not certain I know what
                      > you are advocating, though. Are you saying to add columns to the
                      > import table, and step-wise run queries to peel off argument 1 to
                      > the first added column, removing it from the original source
                      > column (leaving argument 2 as argument 1), then query again to
                      > peel off argument 1 to the second added column, etc.?

                      That's one way to do it. The other way is to have a single query do
                      it. Or to write code that will write the SQL for you.
                      >>I'd do that in two steps of course, as you're likely to want a
                      >>lookup here, so you'll need both a list of the values connected to
                      >>the source record and a list of the unique values to pick from in
                      >>new records. So, I'd create the 1:N records (FK + Value), then
                      >>create the lookup table with a DISTINCT, then join on the value to
                      >>populate the lookup key.
                      >
                      > Right, that's pretty much how I'm handling all the other
                      > single-valued fields with repeated text already. It's working
                      > great.
                      >>But with only 8000 records, I think I'd do the record creation by
                      >>walking the recordset and then walking the field and parsing.
                      >
                      > Yeah. If i get what you're saying, I don't thing ripping
                      > arguments into 10 or 40 new columns will be a good thing, and
                      > anything short of the 40 or so means writing more code to handle
                      > the arguments past #10 or so differently than the rest.

                      Well, I'm rather shocked at the idea that there'd be 40 separate
                      values stored in a single field. I know you said it was a
                      spreadsheet, but what kind of morons would write a spreadsheet that
                      complex when they obviously need a database?
                      >>> I think I've figured out that a hybrid apprach might be best for
                      >>> my current situation, given that I already have all the
                      >>> infrastructure to build the DML parts. . . .
                      >>
                      >>DML? Perhaps I should not have ignored the fact that I don't know
                      >>what DML means?
                      >
                      > Oh, it's just SQL that's not DDL. SQL consists of DDL and DML.
                      > To me, use of the term DML implies a focus on INSERT/UPDATE
                      > queries rather than simple SELECTS, though I believe SELECT is
                      > still considered part of DML. I welcome anyone's corrections on
                      > these facts.

                      I've never heard that terminology before.

                      Of course, I never use DDL, either.
                      >>> . . . The idea is that I query a
                      >>> recordset with all the unique combinations, cycle through that,
                      >>> and parse out the arguments in each one, build a collection of
                      >>> unique values (2500 combinations, but probably only a few
                      >>> hundred unique items), then insert the collection items into the
                      >>> lookup table. From there, I should be able join from the source
                      >>> table to the lookup using a Like expression ("/" + combination +
                      >>> "/" Like "*/" + delimiter + item-text + delimier + "/*") to
                      >>> populate the junction table.
                      >>> I think/hope this will do the job.
                      >>
                      >>Why join in that fashion?
                      >>
                      >>Why not do this: walk the source table, parse each value and
                      >>insert a record with the source PK and the value into a temporary
                      >>table, then from that table, create another table witht he unique
                      >>values (this will be your lookup table, with a PK), then join
                      >>these two new tables on the value field and insert the PK from the
                      >>source table and the PK from the lookup table into your junction
                      >>table.
                      >
                      > I'm guessing that this join, even though it would be an extra
                      > query that will not benefit from any indexes, could be quicker
                      > than executing 10 to 40 individual inserts from code per source
                      > row as I go. I dunno, though. Your way is probably easier to
                      > write.

                      A join on an expression faster than individual queries? I don't
                      think so!
                      >>With very large numbers of records, I'd break the first step into
                      >>2 parts, first parsing the values into columns and then inserting
                      >>the records from the columns.
                      >>> What is your opinion if I do another project like this in the
                      >>> future? DML, or loops?
                      >>
                      >>Depends on the number of records and the number of values in the
                      >>multi-value field. With larger numbers of records and/or larger
                      >>numbers of values in the multi-value field, I'd tend to process
                      >>the values into columns first and then create records from the
                      >>columns. With smaller recordcounts (as in your case), I might do
                      >>it directly,
                      >> especially if the number of values stored in each field is large.
                      >>
                      >>The last time I did this I was processing a phone number field
                      >>that had multiple values in it. Most of the fields had 1, 2 or 3
                      >>numbers, a handful had 4 and very small number had more than that.
                      >>I processed into 4 columns, with the 4th column still being
                      >>multi-valued, but with only a small number of records to process
                      >>(indeed, it was such a small number that I believe I created the
                      >>real records manually!). My experience with this kind of data is
                      >>that you have lots of records with 1 or 2 values, about half as
                      >>many with 3, and then a quick falloff from there, with 10% or less
                      >>having more than that. But it depends entirely on the kind of
                      >>data. Phone numbers have pretty much an upper limit, but other
                      >>kinds of data will not.
                      >>
                      >>I guess what I'm saying is that with large numbers of records, I'd
                      >>separate the process of parsing the multi-value field from the
                      >>process of creating the records, for performance purposes.
                      >
                      > I guess that sounds reasonable. I think I would have done pretty
                      > much the same thing for a field with mostly one or 2 values per
                      > row. It was only this strange case of having a 1-M stuffed into a
                      > column that bought me this headache to begin with.
                      >
                      > The other issue I'm seeing, though, is that the code would
                      > probably have been simpler, cost less to write, and would now be
                      > simpler to maintain if I had used loops instead of queries. Since
                      > this is a monthly import, if the processing took less that twice
                      > as long as now, perhaps, the simplicity alone would have been
                      > sufficient reason to just do the loops, eh? Of course, a slower
                      > process means testing less often during development, and that
                      > means more difficult debugging sessions each time, so perhaps it
                      > would not have cost less to write, but I still think I could have
                      > made it more legible for maintenance that way.

                      What kind of source application is built around such a poorly
                      designed spreadsheet?

                      Is there any plan to get away from such a bad design?

                      --
                      David W. Fenton http://www.bway.net/~dfenton
                      dfenton at bway dot net http://www.bway.net/~dfassoc


                      • Steve Jorgensen

                        #12
                        Re: Import/Normalize approach - column-DML, or loop

                        On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
                        <dXXXfenton@bway.net.invalid> wrote:

                        ....
                        >> Yeah. If i get what you're saying, I don't thing ripping
                        >> arguments into 10 or 40 new columns will be a good thing, and
                        >> anything short of the 40 or so means writing more code to handle
                        >> the arguments past #10 or so differently than the rest.
                        >
                        >Well, I'm rather shocked at the idea that there'd be 40 separate
                        >values stored in a single field. I know you said it was a
                        >spreadsheet, but what kind of morons would write a spreadsheet that
                        >complex when they obviously need a database?

                        It probably is stored in a database inside the proprietary system it comes
                        from. What I get is the spreadsheet exported from that system. Also, the
                        items in the export are very short codes.
                        >>>> I think I've figured out that a hybrid approach might be best for
                        >>>> my current situation, given that I already have all the
                        >>>> infrastructure to build the DML parts. . . .
                        ....
                        >>>> . . . The idea is that I query a
                        >>>> recordset with all the unique combinations, cycle through that,
                        >>>> and parse out the arguments in each one, build a collection of
                        >>>> unique values (2500 combinations, but probably only a few
                        >>>> hundred unique items), then insert the collection items into the
                        >>>> lookup table. From there, I should be able join from the source
                        >>>> table to the lookup using a Like expression ("/" + combination +
                        >>>> "/" Like "*/" + delimiter + item-text + delimier + "/*") to
                        >>>> populate the junction table.
                        >>>> I think/hope this will do the job.
                        >>>
                        >>>Why join in that fashion?
                        >>>
                        >>>Why not do this: walk the source table, parse each value and
                        >>>insert a record with the source PK and the value into a temporary
                        >>>table, then from that table, create another table witht he unique
                        >>>values (this will be your lookup table, with a PK), then join
                        >>>these two new tables on the value field and insert the PK from the
                        >>>source table and the PK from the lookup table into your junction
                        >>>table.
                        >>
                        >> I'm guessing that this join, even though it would be an extra
                        >> query that will not benefit from any indexes, could be quicker
                        >> than executing 10 to 40 individual inserts from code per source
                        >> row as I go. I dunno, though. Your way is probably easier to
                        >> write.
                        >
                        >A join on an expression faster than individual queries? I don't
                        >think so!

                        Huh? Each query has to go through the whole VBA/DAO/JET layer 10-40 times per
                        input row. I should expect that to be quicker than running one query, and
                        letting JET do the whole thing at the JET engine level?
                        >>>With very large numbers of records, I'd break the first step into
                        >>>2 parts, first parsing the values into columns and then inserting
                        >>>the records from the columns.
                        >>>
                        >>>> What is your opinion if I do another project like this in the
                        >>>> future? DML, or loops?
                        ....
                        >>>I guess what I'm saying is that with large numbers of records, I'd
                        >>>separate the process of parsing the multi-value field from the
                        >>>process of creating the records, for performance purposes.
                        >>
                        >> I guess that sounds reasonable. I think I would have done pretty
                        >> much the same thing for a field with mostly one or 2 values per
                        >> row. It was only this strange case of having a 1-M stuffed into a
                        >> column that bought me this headache to begin with.
                        >>
                        >> The other issue I'm seeing, though, is that the code would
                        >> probably have been simpler, cost less to write, and would now be
                        >> simpler to maintain if I had used loops instead of queries. Since
                        >> this is a monthly import, if the processing took less that twice
                        >> as long as now, perhaps, the simplicity alone would have been
                        >> sufficient reason to just do the loops, eh? Of course, a slower
                        >> process means testing less often during development, and that
                        >> means more difficult debugging sessions each time, so perhaps it
                        >> would not have cost less to write, but I still think I could have
                        >> made it more legible for maintenance that way.
                        >
                        >What kind of source application is built around such a poorly
                        >designed spreadsheet?
                        >
                        >Is there any plan to get away from such a bad design?

                        No. It's the export format provided from the application we need to get the
                        data from. The vendor provides another version of the program with a querying
                        interface for analysis, but it's missing much of the data we need that is
                        exported in the spreadsheet.


                        • Mike Sherrill

                          #13
                          Re: Import/Normalize approach - column-DML, or loop

                          On Thu, 29 Jan 2004 08:18:26 GMT, Steve Jorgensen
                          <nospam@nospam.nospam> wrote:

                          [snip]
                          >All that was going swimmingly, and performing pretty well until I got to the
                          >fields containing multiple, delimited values. My whole dedign is based on
                          >using SQL/DML passes for everything, but the only way I could figure out to
                          >make that work was to call a user defined function from within the query
                          [snip]

                          Did you consider using code to generate SQL statements, then executing
                          them? (Possibly within explicit transactions?)
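
                          Something along these lines, say (a rough sketch only;
                          GetArg stands in for whatever split function you're
                          using, and the table names are invented):

                              Dim ws As DAO.Workspace, db As DAO.Database
                              Dim n As Long, strSQL As String

                              Set ws = DBEngine.Workspaces(0)
                              Set db = ws.Databases(0)

                              ws.BeginTrans
                              For n = 1 To 40   ' one generated statement per argument position
                                  strSQL = "INSERT INTO tblItem ( ImportID, ItemCode ) " & _
                                           "SELECT ImportID, GetArg(MultiField, " & n & ") " & _
                                           "FROM tblImport " & _
                                           "WHERE GetArg(MultiField, " & n & ") IS NOT NULL"
                                  db.Execute strSQL, dbFailOnError
                              Next n
                              ws.CommitTrans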

                          --
                          Mike Sherrill
                          Information Management Systems


                          • David W. Fenton

                            #14
                            Re: Import/Normalize approach - column-DML, or loop

                            Steve Jorgensen <nospam@nospam.nospam> wrote in
                            news:75ll105g7st0pdnoqhn1r3vpv664erila9@4ax.com:
                            > On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
                            ><dXXXfenton@bw ay.net.invalid> wrote:
                            >
                            > ...
                            >>> Yeah. If i get what you're saying, I don't thing ripping
                            >>> arguments into 10 or 40 new columns will be a good thing, and
                            >>> anything short of the 40 or so means writing more code to handle
                            >>> the arguments past #10 or so differently than the rest.
                            >>
                            >>Well, I'm rather shocked at the idea that there'd be 40 separate
                            >>values stored in a single field. I know you said it was a
                            >>spreadsheet, but what kind of morons would write a spreadsheet
                            >>that complex when they obviously need a database?
                            >
                            > It probably is stored in a database inside the proprietary system
                            > it comes from. What I get is the spreadsheet exported from that
                            > system. Also, the items in the export are very short codes.

                            So, you're normalizing data that has been denormalized for export?

                            How stupid is that?

                            Wouldn't it be better to have someone skip the spreadsheet and have
                            a normalized export process instead?

                            []
                            >>What kind of source application is built around such a poorly
                            >>designed spreadsheet?
                            >>
                            >>Is there any plan to get away from such a bad design?
                            >
                            > No. It's the export format provided from the application we need
                            > to get the data from. The vendor provides another version of the
                            > program with a querying interface for analysis, but it's missing
                            > much of the data we need that is exported in the spreadsheet.

                            Are you sure there's absolutely no access to the underlying data
                            structures?

                            This is the kind of thing that drives me crazy, having to program
                            something to undo something that has been extensively programmed
                            already. Any changes to the export will break your import routine,
                            for instance.

                            I recently replaced a client's system for importing data from MYOB
                            with direct connections via the MYOB ODBC, and vastly improved the
                            quality of data (previously, certain kinds of data were just not
                            available). I don't know if the application in question has any such
                            capability, but I would certainly let the client know that anything
                            you program is heavily contingent on there being no changes in the
                            output format at all.

                            --
                            David W. Fenton http://www.bway.net/~dfenton
                            dfenton at bway dot net http://www.bway.net/~dfassoc


                            • Steve Jorgensen

                              #15
                              Re: Import/Normalize approach - column-DML, or loop

                              On Fri, 30 Jan 2004 18:08:32 -0500, Mike Sherrill
                              <MSherrillnonono@compuserve.com> wrote:
                              >On Thu, 29 Jan 2004 08:18:26 GMT, Steve Jorgensen
                              ><nospam@nospam .nospam> wrote:
                              >
                              >[snip]
                              >>All that was going swimmingly, and performing pretty well until I got to the
                              >>fields containing multiple, delimited values. My whole dedign is based on
                              >>using SQL/DML passes for everything, but the only way I could figure out to
                              >>make that work was to call a user defined function from within the query
                              >[snip]
                              >
                              >Did you consider using code to generate SQL statements, then executing
                              >them? (Possibly within explicit transactions?)

                              Yes, that's precisely what I am doing.
