Import/Normalize approach - column-DML, or loop

  • Steve Jorgensen

    Import/Normalize approach - column-DML, or loop

    Hi all,

    I've just finished almost all of what has turned out to be a real bear of a
    project. It has to import data from a monthly spreadsheet export from another
    program, and convert that into normalized data. The task is made more
    difficult by the fact that the structure itself can vary from month to month
    (in well defined ways).

    So, I used the SQL-centric approach, taking vertical stripes at a time so
    that, for instance, for each field with simple, repeating text data, I make a
    group-by pass, inserting into the destination lookup table, then I do another
    query, joining from the input table to the text fields in the lookups to get
    the foreign keys for the main destination table. When I have multiple columns
    that need to become 1-M, I make a pass for each of those columns, inserting
    a lookup record that identifies the split, then inserting the rows into the
    many-side table, yadda, yadda, yadda.
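
    Concretely, each of those vertical passes boils down to a pair of queries
    along these lines (table and field names here are placeholders, not the real
    schema, and I'm ignoring the check for values already in the lookup):

        INSERT INTO tlkpColor ( ColorName )
        SELECT src.Color
        FROM tblImport AS src
        WHERE src.Color IS NOT NULL
        GROUP BY src.Color;

        INSERT INTO tblMain ( ImportID, ColorID )
        SELECT src.ImportID, lk.ColorID
        FROM tblImport AS src INNER JOIN tlkpColor AS lk
        ON src.Color = lk.ColorName;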

    All that was going swimmingly, and performing pretty well until I got to the
    fields containing multiple, delimited values. My whole design is based on
    using SQL/DML passes for everything, but the only way I could figure out to
    make that work was to call a user defined function from within the query to
    pull out, in successive query slices, argument 1, argument 2, etc. I was
    going to use a where clause to exclude null results (input doesn't have
    arguments n or above), and quit after the first pass with .RecordsAffected =
    0.
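
    The gist of what I tried looks like this - GetArg is the UDF, the names are
    placeholders, and I'm assuming "/" as the delimiter:

        Public Function GetArg(ByVal vList As Variant, ByVal n As Long) As Variant
            ' Return the nth delimited item, or Null if there is no such item.
            Dim varItems As Variant
            GetArg = Null
            If IsNull(vList) Then Exit Function
            varItems = Split(vList, "/")
            If n - 1 <= UBound(varItems) Then GetArg = varItems(n - 1)
        End Function

        Dim db As DAO.Database, n As Long
        Set db = CurrentDb
        n = 1
        Do
            ' This is the part that turned out to be painfully slow.
            db.Execute "INSERT INTO tblItem ( ImportID, ItemCode ) " & _
                "SELECT ImportID, GetArg(MultiField, " & n & ") " & _
                "FROM tblImport " & _
                "WHERE GetArg(MultiField, " & n & ") IS NOT NULL", dbFailOnError
            n = n + 1
        Loop Until db.RecordsAffected = 0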

    Sounds good, but with a mere 8000 input rows, I had to cancel the first pass
    after waiting about 20 minutes. Note that a whole import process takes about
    3 minutes if this part is omitted, and that includes about 40 vertical split
    passes. The UDF is very simple, and performs quite fast from VB, but I guess
    the overhead of calling this function from a query is VERY SEVERE. Just to
    get this thing out the door, I've decided to just not split these for now
    since we're not doing queries of that data yet.

    So, my next thought is that, perhaps this step is better done by simply,
    brute-force, cycling through a recordset of the input table, and inserting
    rows into the destination tables. If I'm doing that, though, then am I really
    getting any benefit worth speaking of by doing everything -else- as SQL-DML in
    vertical slices, or would I have been much better off, just doing it the
    procedural way, walking through the input table, and creating destination rows
    one row at a time? I know it would have been easier to write, but here I was
    trying to do things the "right" way.
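
    By brute force, I mean essentially this sort of loop (same placeholder names
    as above):

        Dim db As DAO.Database
        Dim rsIn As DAO.Recordset, rsOut As DAO.Recordset
        Dim varItems As Variant, i As Long

        Set db = CurrentDb
        Set rsIn = db.OpenRecordset("tblImport", dbOpenSnapshot)
        Set rsOut = db.OpenRecordset("tblItem", dbOpenDynaset, dbAppendOnly)
        Do Until rsIn.EOF
            If Not IsNull(rsIn!MultiField) Then
                varItems = Split(rsIn!MultiField, "/")
                For i = 0 To UBound(varItems)
                    rsOut.AddNew
                    rsOut!ImportID = rsIn!ImportID
                    rsOut!ItemCode = varItems(i)
                    rsOut.Update
                Next i
            End If
            rsIn.MoveNext
        Loop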

    If I do something like this again, would my code end up performing just as
    well, and being easier to write and maintain using simple iteration rather
    than SQL-DML?

    Thanks for any opinions,

    - Steve J
  • Bas Cost Budde

    #2
    Re: Import/Normalize approach - column-DML, or loop

    > If I do something like this again, would my code end up performing just as
    > well, and being easier to write and maintain using simple iteration rather
    > than SQL-DML?

    My guess is that you can build a fairly specialized optimisation here.
    After all, the Jet engine is not the heaviest in the world, and its
    query optimizer can do only so much for true DML.
    I have a query that calculates several values with simple IIfs, and it
    takes three minutes on a simple 20,000-record table!

    Somewhere down the line any SQL statement must be executed
    procedurally--SQL was invented only for us humans to focus attention on
    the what, not the how.

    I'd say: use the procedure. Bonus: you can display your own progress
    indicator. A slow function with feedback is often perceived as quicker
    than a slow function without feedback, even if it runs several seconds
    quicker by not having to update the display.

    Steve Jorgensen wrote:
    >
    > All that was going swimmingly, and performing pretty well until I got to the
    > fields containing multiple, delimited values. My whole dedign is based on
    > using SQL/DML passes for everything, but the only way I could figure out to
    > make that work was to call a user defined function from within the query to
    > pull out, in successive query slices, argument 1, argument 2, etc. I was
    > going to use a where clause to exclude null results (input doesn't have
    > arguments n or above), and quit after the first pass with .RecordsAffected =
    > 0.

    Sometimes you can invent a preprocessor (this time inside the
    spreadsheet); is that possible? I'd like to explore the point, but see
    little value doing that in the group (eating up everyone's bandwidth,
    verbose as I tend to be). If you have time, mail me. My domain is not
    org but nl.

    --
    Bas Cost Budde


    • Steve Jorgensen

      #3
      Re: Import/Normalize approach - column-DML, or loop

      On Thu, 29 Jan 2004 09:44:01 +0100, Bas Cost Budde <bas@heuveltop.org> wrote:
      >> If I do something like this again, would my code end up performing just as
      >> well, and being easier to write and maintain using simple iteration rather
      >> than SQL-DML?
      >
      >My guess is that you can build a fairly specialized optimalisation here.
      > After all, the Jet engine is not the heaviest in the world, and its
      >query optimizers can do only so much for true DML.
      >I have a query that calculates several values by simple IIFs, and that
      >takes three minutes for a simple 20000 records table!

      I generally have much better performance than that. In fact, if we're not
      talking about a multi-processor or multi-user system, I generally find that
      JET is faster than most SQL Servers since it has fewer contingencies to deal
      with.
      >I'd say: use the procedure. Bonus: you can display your own progress
      >indicator. A slow function with feedback is often perceived as quicker
      >than a slow function without feedback, even if it runs several seconds
      >quicker not having to update the display.

      Actually, since most of my vertical slice insert queries run in under 2 or 3
      seconds, I have a pretty good progress bar already - until I get to the
      multi-value split, that is.
      >
      >Steve Jorgensen wrote:
      >>
      >> All that was going swimmingly, and performing pretty well until I got to the
      >> fields containing multiple, delimited values. My whole dedign is based on
      >> using SQL/DML passes for everything, but the only way I could figure out to
      >> make that work was to call a user defined function from within the query to
      >> pull out, in successive query slices, argument 1, argument 2, etc. I was
      >> going to use a where clause to exclude null results (input doesn't have
      >> arguments n or above), and quit after the first pass with .RecordsAffected =
      >> 0.
      >
      >Sometimes you can invent a preprocessor (this time inside the
      >spreadsheet), is that possible? I'd like to explore the point but see

      Actually, I do a small amount of pre-processing in the spreadsheet (just in
      the top few rows) to make it into something Access will import properly, then I
      import that into a staging database, and link to that from the Access importer
      database app (also linked to the target back-end). The transformation,
      import, and link steps take about 6 or 7 seconds altogether. From there, it's
      an ordinary table (albeit, with all Text columns), so I have my choice of DML
      or looping to do the complicated part from there.
      >little value doing that in the group (eating up everyones bandwith,
      >verbose as I tend to be). If you have time, mail me. My domain is not
      >org but nl.

      I don't need that kind of detail. I was really just looking for people's
      experience/opinions of the value of DML vs. looping for this kind of job.


      • Steve Jorgensen

        #4
        Re: Import/Normalize approach - column-DML, or loop

        On Thu, 29 Jan 2004 09:44:01 +0100, Bas Cost Budde <bas@heuveltop.org> wrote:

        Ah - forgot a point...
        >Somewhere in the line any SQL statement must be executed
        >procedurally--SQL is invented only for us humans to focus attention on
        >the what, not the how.

        Well, it's more than that, though. Using procedural code, every single
        read/write must go through several code layers. The SQL engine is designed to
        handle set operations optimally without crossing all those layers, and it can
        use covering indexes (a concept hidden from VB code), etc. Still, it may be
        that for a data transformation operation, those advantages are not realized,
        or worse.


        • Bruce Pick

          #5
          Re: Import/Normalize approach - column-DML, or loop

          Steve,

          I'm assuming you need to separate the delimited fields into a set of
          fields. Below is code that would separate them out in a query, which
          should run fairly quickly. (If you need to create multiple rows for
          each source string, the approach won't help.)

          You could use InStr() to locate the delimiters and then generate the
          fields based on that knowledge.

          Field1: Left([SourceField],(InStr(1,[SourceField],',')-1))
          Balance1: Mid([SourceField],(InStr(1,[SourceField],',')+1))
          Field2: Left([Balance1],(InStr(1,[Balance1],',')-1))
          Balance2: Mid([Balance1],(InStr(1,[Balance1],',')+1))
          Field3: IIf(InStr(1,[Balance2],',')=0,Null,Left([Balance2],(InStr(1,[Balance2],',')-1)))
          Balance3: IIf(IsNull([Field3]),'',Mid([Balance2],(InStr(1,[Balance2],',')+1)))
          Field4: IIf(InStr(1,[Balance3],',')=0,'',Left([Balance3],(InStr(1,[Balance3],',')-1)))
          I think you get the idea.

          If the number of delimited components varies between records, you will
          need to add code to test for IIf( InStr(xxx) = 0, '', ... as above). I
          set up Field3, Balance3, and Field4 with the needed logic.

          I admit it's a bit of a pain to write, but once the pattern is set, the
          statements are easy to derive. I think it should run more quickly
          than a user-defined function in a module.

          Bruce Pick



          • Steve Jorgensen

            #6
            Re: Import/Normalize approach - column-DML, or loop

            No, that's not it at all. I need to split the items out into rows in another
            table with a 1-M relationship, and the number of items varies from row to row.
            I have certainly written code like your example before, and I could generate that
            pattern for a series of queries to extract one argument at a time, but I
            believe I would hit the "Expression too complex" error condition long before I
            got to the last argument (try about 40).



            • David W. Fenton

              #7
              Re: Import/Normalize approach - column-DML, or loop

              Steve Jorgensen <nospam@nospam.nospam> wrote in
              news:oafh10ho2653v9d89fiio3lsjacaljgl9i@4ax.com:
              > Sounds good, but with a mere 8000 input rows, I had to cancel the
              > first pass after waiting about 20 minutes. Note that a whole
              > import process takes about 3 minutes if this part is ommitted, and
              > that includes about 40 vertical split passes. The UDF is very
              > simple, and performs quite fast from VB, but I guess the overhead
              > of calling this function from a query is VERY SEVERE. Just to get
              > this thing out the door, I've decided to just tnot split these for
              > now since we're not doing queries of that data yet.

              I did this about two years ago with multiple fields with
              comma-separated values.

              My solution was this:

              1. first, determine the distribution of multiple values. That is,
              what percent of the table has one value? What percent, two? Three?
              And so on.

              2. in my cases, 95% of the records had 4 or fewer values, so I
              created four additional fields in my processing table.

              3. I then copied the values that have no commas (only one value)
              into the first of the processing columns.

              4. then, in a query that eliminates the rows without commas in the
              original column, copy the value before the first comma into the
              second column.

              5. then, in a query that eliminates the rows that have no commas in
              the source column and a value in the second target column, copy the
              values between the 1st and 2nd commas into the second target column.

              And so forth.

              This is not too hard to do because you're doing it all in relatively
              small sets, rather than trying to process the whole thing with a
              single multi-purpose subroutine.
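
              Each pass is then just a plain filtered update query, something on
              the order of this (column names invented for illustration):

                  UPDATE tblStage
                  SET Val1 = SourceField
                  WHERE InStr(SourceField, ",") = 0;

                  UPDATE tblStage
                  SET Val2 = Left(SourceField, InStr(SourceField, ",") - 1)
                  WHERE InStr(SourceField, ",") > 0;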

              If this isn't clear, let me know and I'll outline it more clearly.

              --
              David W. Fenton http://www.bway.net/~dfenton
              dfenton at bway dot net http://www.bway.net/~dfassoc


              • Steve Jorgensen

                #8
                Re: Import/Normalize approach - column-DML, or loop

                On Thu, 29 Jan 2004 20:29:55 GMT, "David W. Fenton"
                <dXXXfenton@bway.net.invalid> wrote:
                >Steve Jorgensen <nospam@nospam. nospam> wrote in
                >news:oafh10ho2 653v9d89fiio3ls jacaljgl9i@4ax. com:
                >> Sounds good, but with a mere 8000 input rows, I had to cancel the
                >> first pass after waiting about 20 minutes. Note that a whole
                >> import process takes about 3 minutes if this part is ommitted, and
                >> that includes about 40 vertical split passes. The UDF is very
                >> simple, and performs quite fast from VB, but I guess the overhead
                >> of calling this function from a query is VERY SEVERE. Just to get
                >> this thing out the door, I've decided to just tnot split these for
                >> now since we're not doing queries of that data yet.
                >
                >I did this about two years ago with multiple fields with
                >comma-separated values.
                >
                >My solution was this:
                >
                >1. first, determine the distribution of multiple values. That is,
                >what percent of the table has one value? What percent, two? Three?
                >And so on.
                >
                >2. in my cases, 95% of the records had 4 or fewer values, so I
                >created four additional fields in my processing table.
                >
                >3. I then copied the values that have no commas (only one value)
                >into the first of the processing columns.
                >
                >4. then, in a query that eliminates the rows without commas in the
                >original column, copy the value before the first comma into the
                >second column.
                >
                >5. then, in a query that eliminates the rows that have no commas in
                >the source column and a value in the second target column, copy the
                >values between the 1st and 2nd column into the second target column.
                >
                >And so forth.
                >
                >This is not too hard to do because you're doing it all in relatively
                >small sets, rather than trying to process the whole thing with a
                >single multi-purpose subroutine.
                >
                >If this isn't clear, let me know and I'll outline it more clearly.

                Well, I'm not sure that's much help in this case. It's not a field that was
                originally intended to be single-valued, then had some multi-valued data put
                in; it is an export from a system with no other way to represent 1-to-many
                data in a spreadsheet than to put it in a multi-valued field. Out of 8000
                records, there are about 2500 unique combinations (using GROUP BY).

                I think I've figured out that a hybrid approach might be best for my current
                situation, given that I already have all the infrastructure to build the DML
                parts. The idea is that I query a recordset with all the unique combinations,
                cycle through that, and parse out the arguments in each one, build a
                collection of unique values (2500 combinations, but probably only a few
                hundred unique items), then insert the collection items into the lookup table.
                From there, I should be able to join from the source table to the lookup using a
                Like expression ("/" + combination + "/" Like "*/" + delimiter + item-text +
                delimiter + "/*") to populate the junction table.

                I think/hope this will do the job.
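
                In other words, the junction-table pass becomes a single non-equi
                join query, roughly like this (placeholder names again, with "/"
                standing in for the delimiter):

                    INSERT INTO tblJunction ( ImportID, ItemID )
                    SELECT src.ImportID, lk.ItemID
                    FROM tblImport AS src, tlkpItem AS lk
                    WHERE "/" & src.MultiField & "/" LIKE "*/" & lk.ItemCode & "/*";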

                What is your opinion if I do another project like this in the future? DML, or
                loops?


                • David W. Fenton

                  #9
                  Re: Import/Normalize approach - column-DML, or loop

                  Steve Jorgensen <nospam@nospam.nospam> wrote in
                  news:o6nj10dpfhi2f3oeosv3q3aiddloek4uaa@4ax.com:
                  > Well, I'm not sure that's much help in this case. it's not a
                  > field that was originally indended to be single-valued, then had
                  > some multi-valued data put in, it is an export form a system with
                  > no other way to represent 1-to-many data in a spreadsheet than to
                  > put it in a multi-valued field. Out of 8000 records, there are
                  >> about 2500 unique combinations (using GROUP BY).

                  With 8000->2500, I'm not sure there's any real benefit in processing
                  unique values.

                  Either way, though, I'd first process the values into columns and
                  then process the columns into rows. But if more than half the
                  records have more than 10 values in them, then it would become more
                  unwieldy, I think, than just walking the table and creating the new
                  records.

                  I'd do that in two steps of course, as you're likely to want a
                  lookup here, so you'll need both a list of the values connected to
                  the source record and a list of the unique values to pick from in
                  new records. So, I'd create the 1:N records (FK + Value), then
                  create the lookup table with a DISTINCT, then join on the value to
                  populate the lookup key.

                  But with only 8000 records, I think I'd do the record creation by
                  walking the recordset and then walking the field and parsing.
                  > I think I've figured out that a hybrid apprach might be best for
                  > my current situation, given that I already have all the
                  > infrastructure to build the DML parts. . . .

                  DML? Perhaps I should not have ignored the fact that I don't know
                  what DML means?
                  > . . . The idea is that I query a
                  > recordset with all the unique combinations, cycle through that,
                  > and parse out the arguments in each one, build a collection of
                  > unique values (2500 combinations, but probably only a few hundred
                  > unique items), then insert the collection items into the lookup
                  > table. From there, I should be able join from the source table to
                  > the lookup using a Like expression ("/" + combination + "/" Like
                  > "*/" + delimiter + item-text + delimier + "/*") to populate the
                  > junction table.
                  > I think/hope this will do the job.

                  Why join in that fashion?

                  Why not do this: walk the source table, parse each value and insert
                  a record with the source PK and the value into a temporary table,
                  then from that table, create another table with the unique values
                  (this will be your lookup table, with a PK), then join these two new
                  tables on the value field and insert the PK from the source table
                  and the PK from the lookup table into your junction table.
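
                  Once the temporary table is filled, the remaining steps are just
                  a couple of queries, roughly like this (names invented):

                      CREATE TABLE tlkpItem
                      (ItemID AUTOINCREMENT CONSTRAINT pkItem PRIMARY KEY,
                       ItemValue TEXT(50));

                      INSERT INTO tlkpItem ( ItemValue )
                      SELECT DISTINCT ItemValue FROM tmpParsed;

                      INSERT INTO tblJunction ( SourceID, ItemID )
                      SELECT t.SourceID, lk.ItemID
                      FROM tmpParsed AS t INNER JOIN tlkpItem AS lk
                      ON t.ItemValue = lk.ItemValue;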

                  With very large numbers of records, I'd break the first step into 2
                  parts, first parsing the values into columns and then inserting the
                  records from the columns.
                  > What is your opinion if I do another project like this in the
                  > future? DML, or loops?

                  Depends on the number of records and the number of values in the
                  multi-value field. With larger numbers of records and/or larger
                  numbers of values in the multi-value field, I'd tend to process the
                  values into columns first and then create records from the columns.
                  With smaller recordcounts (as in your case), I might do it directly,
                  especially if the number of values stored in each field is large.

                  The last time I did this I was processing a phone number field that
                  had multiple values in it. Most of the fields had 1, 2 or 3 numbers,
                  a handful had 4, and a very small number had more than that. I
                  processed into 4 columns, with the 4th column still being
                  multi-valued, but with only a small number of records to process
                  (indeed, it was such a small number that I believe I created the
                  real records manually!). My experience with this kind of data is
                  that you have lots of records with 1 or 2 values, about half as many
                  with 3, and then a quick falloff from there, with 10% or less having
                  more than that. But it depends entirely on the kind of data. Phone
                  numbers have pretty much an upper limit, but other kinds of data
                  will not.

                  I guess what I'm saying is that with large numbers of records, I'd
                  separate the process of parsing the multi-value field from the
                  process of creating the records, for performance purposes.

                  --
                  David W. Fenton http://www.bway.net/~dfenton
                  dfenton at bway dot net http://www.bway.net/~dfassoc


                  • Steve Jorgensen

                    #10
                    Re: Import/Normalize approach - column-DML, or loop

                    On Fri, 30 Jan 2004 16:34:18 GMT, "David W. Fenton"
                    <dXXXfenton@bway.net.invalid> wrote:
                    >Steve Jorgensen <nospam@nospam. nospam> wrote in
                    >news:o6nj10dpf hi2f3oeosv3q3ai ddloek4uaa@4ax. com:
                    >> Well, I'm not sure that's much help in this case. it's not a
                    >> field that was originally indended to be single-valued, then had
                    >> some multi-valued data put in, it is an export form a system with
                    >> no other way to represent 1-to-many data in a spreadsheet than to
                    >> put it in a multi-valued field. Out of 8000 records, there are
                    >> about 2500 unique combinations (using GROUP BY).
                    >
                    >With 8000->2500, I'm not sure there's any real benefit in processing
                    >unique values.

                    Well, the Group By seems to run pretty fast. Intuitively, it seems like
                    walking more than 3x as many rows to parse would waste more time than the
                    Group By. Of course, it's trivial to try it both ways, and time it.
                    >Either way, though, I'd first process the values into columns and
                    >then process the columns into rows. But if more than half the
                    >records have more than 10 values in them, then it would become more
                    >unwieldy, I think, than just walking the table and creating the new
                    >records.

                    I think it's borderline on that one. I'm not certain I know what you are
                    advocating, though. Are you saying to add columns to the import table, and
                    step-wise run queries to peel off argument 1 to the first added column,
                    removing it from the original source column (leaving argument 2 as argument
                    1), then query again to peel off argument 1 to the second added column, etc.?
                    >I'd do that in two steps of course, as you're likely to want a
                    >lookup here, so you'll need both a list of the values connected to
                    >the source record and a list of the unique values to pick from in
                    >new records. So, I'd create the 1:N records (FK + Value), then
                    >create the lookup table with a DISTINCT, then join on the value to
                    >populate the lookup key.

                    Right, that's pretty much how I'm handling all the other single-valued fields
                    with repeated text already. It's working great.
                    >But with only 8000 records, I think I'd do the record creation by
                    >walking the recordset and then walking the field and parsing.

                    Yeah. If I get what you're saying, I don't think ripping arguments into 10 or
                    40 new columns will be a good thing, and anything short of the 40 or so means
                    writing more code to handle the arguments past #10 or so differently than the
                    rest.
                    >> I think I've figured out that a hybrid apprach might be best for
                    >> my current situation, given that I already have all the
                    >> infrastructure to build the DML parts. . . .
                    >
                    >DML? Perhaps I should not have ignored the fact that I don't know
                    >what DML means?

                    Oh, it's just SQL that's not DDL. SQL consists of DDL and DML. To me, use of
                    the term DML implies a focus on INSERT/UPDATE queries rather than simple
                    SELECTS, though I believe SELECT is still considered part of DML. I welcome
                    anyone's corrections on these facts.
                    >> . . . The idea is that I query a
                    >> recordset with all the unique combinations, cycle through that,
                    >> and parse out the arguments in each one, build a collection of
                    >> unique values (2500 combinations, but probably only a few hundred
                    >> unique items), then insert the collection items into the lookup
                    >> table. From there, I should be able join from the source table to
                    >> the lookup using a Like expression ("/" + combination + "/" Like
                    >> "*/" + delimiter + item-text + delimier + "/*") to populate the
                    >> junction table.
                    >> I think/hope this will do the job.
                    >
                    >Why join in that fashion?
                    >
                    >Why not do this: walk the source table, parse each value and insert
                    >a record with the source PK and the value into a temporary table,
                    >then from that table, create another table witht he unique values
                    >(this will be your lookup table, with a PK), then join these two new
                    >tables on the value field and insert the PK from the source table
                    >and the PK from the lookup table into your junction table.

                    I'm guessing that this join, even though it would be an extra query that will
                    not benefit from any indexes, could be quicker than executing 10 to 40
                    individual inserts from code per source row as I go. I dunno, though. Your
                    way is probably easier to write.
                    >With very large numbers of records, I'd break the first step into 2
                    >parts, first parsing the values into columns and then inserting the
                    >records from the columns.
                    >> What is your opinion if I do another project like this in the
                    >> future? DML, or loops?
                    >
                    >Depends on the number of records and the number of values in the
                    >multi-value field. With larger numbers of records and/or larger
                    >numbers of values in the multi-value field, I'd tend to process the
                    >values into columns first and then create records from the columns.
                    >With smaller recordcounts (as in your case), I might do it directly,
                    > especially if the number of values stored in each field is large.
                    >
                    >The last time I did this I was processing a phone number field that
                    >had multiple values in it. Most of the fields had 1, 2 or 3 numbers,
                    >a handful had 4 and very small number had more than that. I
                    >processed into 4 columns, with the 4th column still being
                    >multi-valued, but with only a small number of records to process
                    >(indeed, it was such a small number that I believe I created the
                    >real records manually!). My experience with this kind of data is
                    >that you have lots of records with 1 or 2 values, about half as many
                    >with 3, and then a quick falloff from there, with 10% or less having
                    >more than that. But it depends entirely on the kind of data. Phone
                    >numbers have pretty much an upper limit, but other kinds of data
                    >will not.
                    >
                    >I guess what I'm saying is that with large numbers of records, I'd
                    >separate the process of parsing the multi-value field from the
                    >process of creating the records, for performance purposes.

                    I guess that sounds reasonable. I think I would have done pretty much the
                    same thing for a field with mostly one or 2 values per row. It was only this
                    strange case of having a 1-M stuffed into a column that brought me this
                    headache to begin with.

                    The other issue I'm seeing, though, is that the code would probably have been
                    simpler, cost less to write, and would now be simpler to maintain if I had
                    used loops instead of queries. Since this is a monthly import, if the
                    processing took less than twice as long as now, perhaps, the simplicity alone
                    would have been sufficient reason to just do the loops, eh? Of course, a
                    slower process means testing less often during development, and that means
                    more difficult debugging sessions each time, so perhaps it would not have cost
                    less to write, but I still think I could have made it more legible for
                    maintenance that way.


                    • David W. Fenton

                      #11
                      Re: Import/Normalize approach - column-DML, or loop

                      Steve Jorgensen <nospam@nospam.nospam> wrote in
                      news:0v2l10hog8rratup427c7a4s1sjeme50tp@4ax.com:
                      > On Fri, 30 Jan 2004 16:34:18 GMT, "David W. Fenton"
                      ><dXXXfenton@bw ay.net.invalid> wrote:
                      >>Steve Jorgensen <nospam@nospam. nospam> wrote in
                      >>news:o6nj10dp fhi2f3oeosv3q3a iddloek4uaa@4ax .com:
                      >>> Well, I'm not sure that's much help in this case. it's not a
                      >>> field that was originally indended to be single-valued, then had
                      >>> some multi-valued data put in, it is an export form a system
                      >>> with no other way to represent 1-to-many data in a spreadsheet
                      >>> than to put it in a multi-valued field. Out of 8000 records,
                      >>> there are about 2500 unique combinations (using GROUP BY).
                      >>
                      >>With 8000->2500, I'm not sure there's any real benefit in
                      >>processing unique values.
                      >
                      > Well, the Group By seems to run pretty fast. Intuitively, it
                      > seems like walking more than 3x as many rows to parse would waste
                      > more time than the Group By. Of course, it's trivial to try it
                      > both ways, and time it.

                      But you'll end up with unique values that have no meaning, and
                      really only save you a tiny amount of time. On the other hand, it
                      doesn't necessarily take very much time to do it, so perhaps it's a
                      wash.
                      >>Either way, though, I'd first process the values into columns and
                      >>then process the columns into rows. But if more than half the
                      >>records have more than 10 values in them, then it would become
                      >>more unwieldy, I think, than just walking the table and creating
                      >>the new records.
                      >
                      > I think it's borderline on that one. I'm not certain I know what
                      > you are advocating, though. Are you saying to add columns to the
                      > import table, and step-wise run queries to peel off argument 1 to
                      > the first added column, removing it from the original source
                      > column (leaving argument 2 as argument 1), then query again to
                      > peel off argument 1 to the second added column, etc.?

                      That's one way to do it. The other way is to have a single query do
                      it. Or to write code that will write the SQL for you.
                      >>I'd do that in two steps of course, as you're likely to want a
                      >>lookup here, so you'll need both a list of the values connected to
                      >>the source record and a list of the unique values to pick from in
                      >>new records. So, I'd create the 1:N records (FK + Value), then
                      >>create the lookup table with a DISTINCT, then join on the value to
                      >>populate the lookup key.
                      >
                      > Right, that's pretty much how I'm handling all the other
                      > single-valued fields with repeated text already. It's working
                      > great.
                      >>But with only 8000 records, I think I'd do the record creation by
                      >>walking the recordset and then walking the field and parsing.
                      >
                      > Yeah. If i get what you're saying, I don't thing ripping
                      > arguments into 10 or 40 new columns will be a good thing, and
                      > anything short of the 40 or so means writing more code to handle
                      > the arguments past #10 or so differently than the rest.

                      Well, I'm rather shocked at the idea that there'd be 40 separate
                      values stored in a single field. I know you said it was a
                      spreadsheet, but what kind of morons would write a spreadsheet that
                      complex when they obviously need a database?
                      >>> I think I've figured out that a hybrid apprach might be best for
                      >>> my current situation, given that I already have all the
                      >>> infrastructure to build the DML parts. . . .
                      >>
                      >>DML? Perhaps I should not have ignored the fact that I don't know
                      >>what DML means?
                      >
                      > Oh, it's just SQL that's not DDL. SQL consists of DDL and DML.
                      > To me, use of the term DML implies a focus on INSERT/UPDATE
                      > queries rather than simple SELECTS, though I believe SELECT is
                      > still considered part of DML. I welcome anyone's corrections on
                      > these facts.

                      I've never heard that terminology before.

                      Of course, I never use DDL, either.
                      >>> . . . The idea is that I query a
                      >>> recordset with all the unique combinations, cycle through that,
                      >>> and parse out the arguments in each one, build a collection of
                      >>> unique values (2500 combinations, but probably only a few
                      >>> hundred unique items), then insert the collection items into the
                      >>> lookup table. From there, I should be able join from the source
                      >>> table to the lookup using a Like expression ("/" + combination +
                      >>> "/" Like "*/" + delimiter + item-text + delimier + "/*") to
                      >>> populate the junction table.
                      >>> I think/hope this will do the job.
                      >>
                      >>Why join in that fashion?
                      >>
                      >>Why not do this: walk the source table, parse each value and
                      >>insert a record with the source PK and the value into a temporary
                      >>table, then from that table, create another table witht he unique
                      >>values (this will be your lookup table, with a PK), then join
                      >>these two new tables on the value field and insert the PK from the
                      >>source table and the PK from the lookup table into your junction
                      >>table.
                      >
                      > I'm guessing that this join, even though it would be an extra
                      > query that will not benefit from any indexes, could be quicker
                      > than executing 10 to 40 individual inserts from code per source
                      > row as I go. I dunno, though. Your way is probably easier to
                      > write.

                      A join on an expression faster than individual queries? I don't
                      think so!
                      >>With very large numbers of records, I'd break the first step into
                      >>2 parts, first parsing the values into columns and then inserting
                      >>the records from the columns.
                      >>> What is your opinion if I do another project like this in the
                      >>> future? DML, or loops?
                      >>
                      >>Depends on the number of records and the number of values in the
                      >>multi-value field. With larger numbers of records and/or larger
                      >>numbers of values in the multi-value field, I'd tend to process
                      >>the values into columns first and then create records from the
                      >>columns. With smaller recordcounts (as in your case), I might do
                      >>it directly,
                      >> especially if the number of values stored in each field is large.
                      >>
                      >>The last time I did this I was processing a phone number field
                      >>that had multiple values in it. Most of the fields had 1, 2 or 3
                      >>numbers, a handful had 4 and very small number had more than that.
                      >>I processed into 4 columns, with the 4th column still being
                      >>multi-valued, but with only a small number of records to process
                      >>(indeed, it was such a small number that I believe I created the
                      >>real records manually!). My experience with this kind of data is
                      >>that you have lots of records with 1 or 2 values, about half as
                      >>many with 3, and then a quick falloff from there, with 10% or less
                      >>having more than that. But it depends entirely on the kind of
                      >>data. Phone numbers have pretty much an upper limit, but other
                      >>kinds of data will not.
                      >>
                      >>I guess what I'm saying is that with large numbers of records, I'd
                      >>separate the process of parsing the multi-value field from the
                      >>process of creating the records, for performance purposes.
                      >
                      > I guess that sounds reasonable. I think I would have done pretty
                      > much the same thing for a field with mostly one or 2 values per
                      > row. It was only this strange case of having a 1-M stuffed into a
                      > column that bought me this headache to begin with.
                      >
                      > The other issue I'm seeing, though, is that the code would
                      > probably have been simpler, cost less to write, and would now be
                      > simpler to maintain if I had used loops instead of queries. Since
                      > this is a monthly import, if the processing took less that twice
                      > as long as now, perhaps, the simplicity alone would have been
                      > sufficient reason to just do the loops, eh? Of course, a slower
                      > process means testing less often during development, and that
                      > means more difficult debugging sessions each time, so perhaps it
                      > would not have cost less to write, but I still think I could have
                      > made it more legible for maintenance that way.

                      What kind of source application is built around such a poorly
                      designed spreadsheet?

                      Is there any plan to get away from such a bad design?

                      --
                      David W. Fenton http://www.bway.net/~dfenton
                      dfenton at bway dot net http://www.bway.net/~dfassoc


                      • Steve Jorgensen

                        #12
                        Re: Import/Normalize approach - column-DML, or loop

                        On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
                        <dXXXfenton@bway.net.invalid> wrote:

                        ....
                        >> Yeah. If i get what you're saying, I don't thing ripping
                        >> arguments into 10 or 40 new columns will be a good thing, and
                        >> anything short of the 40 or so means writing more code to handle
                        >> the arguments past #10 or so differently than the rest.
                        >
                        >Well, I'm rather shocked at the idea that there'd be 40 separate
                        >values stored in a single field. I know you said it was a
                        >spreadsheet, but what kind of morons would write a spreadsheet that
                        >complex when they obviously need a database?

                        It probably is stored in a database inside the proprietary system it comes
                        from. What I get is the spreadsheet exported from that system. Also, the
                        items in the export are very short codes.
                        >>>> I think I've figured out that a hybrid approach might be best for
                        >>>> my current situation, given that I already have all the
                        >>>> infrastructure to build the DML parts. . . .
                        ....
                        >>>> . . . The idea is that I query a
                        >>>> recordset with all the unique combinations, cycle through that,
                        >>>> and parse out the arguments in each one, build a collection of
                        >>>> unique values (2500 combinations, but probably only a few
                        >>>> hundred unique items), then insert the collection items into the
                        >>>> lookup table. From there, I should be able join from the source
                        >>>> table to the lookup using a Like expression ("/" + combination +
                        >>>> "/" Like "*/" + delimiter + item-text + delimier + "/*") to
                        >>>> populate the junction table.
                        >>>> I think/hope this will do the job.
                        >>>
                        >>>Why join in that fashion?
                        >>>
                        >>>Why not do this: walk the source table, parse each value and
                        >>>insert a record with the source PK and the value into a temporary
                        >>>table, then from that table, create another table witht he unique
                        >>>values (this will be your lookup table, with a PK), then join
                        >>>these two new tables on the value field and insert the PK from the
                        >>>source table and the PK from the lookup table into your junction
                        >>>table.
                        >>
                        >> I'm guessing that this join, even though it would be an extra
                        >> query that will not benefit from any indexes, could be quicker
                        >> than executing 10 to 40 individual inserts from code per source
                        >> row as I go. I dunno, though. Your way is probably easier to
                        >> write.
                        >
                        >A join on an expression faster than individual queries? I don't
                        >think so!

                        Huh? Each query has to go through the whole VBA/DAO/JET layer 10-40 times per
                        input row. I should expect that to be quicker than running one query, and
                        letting JET do the whole thing at the JET engine level?
                        >>>With very large numbers of records, I'd break the first step into
                        >>>2 parts, first parsing the values into columns and then inserting
                        >>>the records from the columns.
                        >>>
                        >>>> What is your opinion if I do another project like this in the
                        >>>> future? DML, or loops?
                        ....
                        >>>I guess what I'm saying is that with large numbers of records, I'd
                        >>>separate the process of parsing the multi-value field from the
                        >>>process of creating the records, for performance purposes.
                        >>
                        >> I guess that sounds reasonable. I think I would have done pretty
                        >> much the same thing for a field with mostly one or 2 values per
                        >> row. It was only this strange case of having a 1-M stuffed into a
                        >> column that bought me this headache to begin with.
                        >>
                        >> The other issue I'm seeing, though, is that the code would
                        >> probably have been simpler, cost less to write, and would now be
                        >> simpler to maintain if I had used loops instead of queries. Since
                        >> this is a monthly import, if the processing took less that twice
                        >> as long as now, perhaps, the simplicity alone would have been
                        >> sufficient reason to just do the loops, eh? Of course, a slower
                        >> process means testing less often during development, and that
                        >> means more difficult debugging sessions each time, so perhaps it
                        >> would not have cost less to write, but I still think I could have
                        >> made it more legible for maintenance that way.
                        >
                        >What kind of source application is built around such a poorly
                        >designed spreadsheet?
                        >
                        >Is there any plan to get away from such a bad design?

                        No. It's the export format provided from the application we need to get the
                        data from. The vendor provides another version of the program with a querying
                        interface for analysis, but it's missing much of the data we need that is
                        exported in the spreadsheet.


                        • Mike Sherrill

                          #13
                          Re: Import/Normalize approach - column-DML, or loop

                          On Thu, 29 Jan 2004 08:18:26 GMT, Steve Jorgensen
                          <nospam@nospam.nospam> wrote:

                          [snip]
                          >All that was going swimmingly, and performing pretty well until I got to the
                          >fields containing multiple, delimited values. My whole dedign is based on
                          >using SQL/DML passes for everything, but the only way I could figure out to
                          >make that work was to call a user defined function from within the query
                          [snip]

                          Did you consider using code to generate SQL statements, then executing
                          them? (Possibly within explicit transactions?)
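
                          Something along these lines, say (a rough sketch only;
                          GetArg stands in for whatever split function you're
                          using, and the table names are invented):

                              Dim ws As DAO.Workspace, db As DAO.Database
                              Dim n As Long, strSQL As String

                              Set ws = DBEngine.Workspaces(0)
                              Set db = ws.Databases(0)

                              ws.BeginTrans
                              For n = 1 To 40   ' one generated statement per argument position
                                  strSQL = "INSERT INTO tblItem ( ImportID, ItemCode ) " & _
                                           "SELECT ImportID, GetArg(MultiField, " & n & ") " & _
                                           "FROM tblImport " & _
                                           "WHERE GetArg(MultiField, " & n & ") IS NOT NULL"
                                  db.Execute strSQL, dbFailOnError
                              Next n
                              ws.CommitTrans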

                          --
                          Mike Sherrill
                          Information Management Systems


                          • David W. Fenton

                            #14
                            Re: Import/Normalize approach - column-DML, or loop

                            Steve Jorgensen <nospam@nospam.nospam> wrote in
                            news:75ll105g7st0pdnoqhn1r3vpv664erila9@4ax.com:
                            > On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
                            ><dXXXfenton@bw ay.net.invalid> wrote:
                            >
                            > ...
                            >>> Yeah. If i get what you're saying, I don't thing ripping
                            >>> arguments into 10 or 40 new columns will be a good thing, and
                            >>> anything short of the 40 or so means writing more code to handle
                            >>> the arguments past #10 or so differently than the rest.
                            >>
                            >>Well, I'm rather shocked at the idea that there'd be 40 separate
                            >>values stored in a single field. I know you said it was a
                            >>spreadsheet, but what kind of morons would write a spreadsheet
                            >>that complex when they obviously need a database?
                            >
                            > It probably is stored in a database inside the proprietary system
                            > it comes from. What I get is the spreadsheet exported from that
                            > system. Also, the items in the export are very short codes.

                            So, you're normalizing data that has been denormalized for export?

                            How stupid is that?

                            Wouldn't it be better to have someone skip the spreadsheet and have
                            a normalized export process instead?

                            []
                            >>What kind of source application is built around such a poorly
                            >>designed spreadsheet?
                            >>
                            >>Is there any plan to get away from such a bad design?
                            >
                            > No. It's the export format provided from the application we need
                            > to get the data from. The vendor provides another version of the
                            > program with a querying interface for analysis, but it's missing
                            > much of the data we need that is exported in the spreadsheet.

                            Are you sure there's absolutely no access to the underlying data
                            structures?

                            This is the kind of thing that drives me crazy, having to program
                            something to undo something that has been extensively programmed
                            already. Any changes to the export will break your import routine,
                            for instance.

                            I recently replaced a client's system for importing data from MYOB
                            with direct connections via the MYOB ODBC, and vastly improved the
                            quality of data (previously, certain kinds of data were just not
                            available). I don't know if the application in question has any such
                            capability, but I would certainly let the client know that anything
                            you program is heavily contingent on there being no changes in the
                            output format at all.

                            --
                            David W. Fenton http://www.bway.net/~dfenton
                            dfenton at bway dot net http://www.bway.net/~dfassoc


                            • Steve Jorgensen

                              #15
                              Re: Import/Normalize approach - column-DML, or loop

                              On Fri, 30 Jan 2004 18:08:32 -0500, Mike Sherrill
                              <MSherrillnonono@compuserve.com> wrote:
                              >On Thu, 29 Jan 2004 08:18:26 GMT, Steve Jorgensen
                              ><nospam@nospam .nospam> wrote:
                              >
                              >[snip]
                              >>All that was going swimmingly, and performing pretty well until I got to the
                              >>fields containing multiple, delimited values. My whole dedign is based on
                              >>using SQL/DML passes for everything, but the only way I could figure out to
                              >>make that work was to call a user defined function from within the query
                              >[snip]
                              >
                              >Did you consider using code to generate SQL statements, then executing
                              >them? (Possibly within explicit transactions?)

                              Yes, that's precisely what I am doing.
