Hi all,
I've just finished almost all of what has turned out to be a real bear of a
project. It has to import data from a monthly spreadsheet export from another
program and convert that into normalized data. The task is made more
difficult by the fact that the structure itself can vary from month to
month (in well-defined ways).
So, I took a SQL-centric approach, processing one vertical stripe at a
time. For instance, for each field with simple, repeating text data, I
make a group-by pass that inserts into the destination lookup table, then
I run another query, joining from the input table to the text fields in
the lookups, to get the foreign keys for the main destination table. When
I have multiple columns that need to become one-to-many, I make a pass for
each of those columns, inserting a lookup record that identifies the split
and then inserting the rows into the many-side table, yadda, yadda, yadda.
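To make that concrete, one vertical pass looks roughly like this (table
and field names are invented for illustration). First the group-by pass
that fills the lookup:

    INSERT INTO tlkpCategory (CategoryText)
    SELECT impCategory
    FROM tblImport
    WHERE impCategory IS NOT NULL
    GROUP BY impCategory;

then the join that resolves the foreign key for the main table:

    INSERT INTO tblDest (ImportRowID, CategoryID)
    SELECT i.RowID, k.CategoryID
    FROM tblImport AS i INNER JOIN tlkpCategory AS k
        ON i.impCategory = k.CategoryText;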
All that was going swimmingly, and performing pretty well, until I got to
the fields containing multiple, delimited values. My whole design is based
on using SQL/DML passes for everything, but the only way I could figure
out to make that work was to call a user-defined function from within the
query to pull out, in successive query passes, argument 1, argument 2, and
so on. I was going to use a WHERE clause to exclude null results (rows
whose input doesn't have argument n or above), and quit after the first
pass where .RecordsAffected = 0.
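In outline, what I had in mind is something like this (names invented, and
this assumes a VBA version that has Split available):

    Public Function NthItem(varList As Variant, n As Integer) As Variant
        ' Return the nth item of a delimited list, or Null if there isn't one.
        Dim varParts As Variant
        NthItem = Null
        If IsNull(varList) Then Exit Function
        varParts = Split(CStr(varList), ";")
        If n - 1 <= UBound(varParts) Then NthItem = Trim(varParts(n - 1))
    End Function

and then a driver that keeps slicing until a pass affects no rows:

    Dim db As DAO.Database, n As Integer
    Set db = CurrentDb
    n = 1
    Do
        db.Execute _
            "INSERT INTO tblDestItems (ParentID, ItemText) " & _
            "SELECT RowID, NthItem(impItems, " & n & ") " & _
            "FROM tblImport " & _
            "WHERE NthItem(impItems, " & n & ") IS NOT NULL", dbFailOnError
        If db.RecordsAffected = 0 Then Exit Do   ' no row had an nth item
        n = n + 1
    Loop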
Sounds good, but with a mere 8000 input rows, I had to cancel the first
pass after waiting about 20 minutes. Note that the whole import process
takes about 3 minutes if this part is omitted, and that includes about 40
vertical split passes. The UDF is very simple and performs quite fast when
called from VB, but I guess the overhead of calling it from a query is
VERY severe. Just to get this thing out the door, I've decided not to
split these fields for now, since we're not querying that data yet.
So, my next thought is that perhaps this step is better done by simply,
brute-force, cycling through a recordset of the input table and inserting
rows into the destination tables (something like the sketch below). If I'm
doing that, though, am I really getting any benefit worth speaking of by
doing everything -else- as SQL-DML in vertical slices, or would I have
been much better off just doing it the procedural way: walking through the
input table and creating destination rows one row at a time? I know it
would have been easier to write, but here I was trying to do things the
"right" way.
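By brute force I mean roughly this (names invented again, DAO assumed):

    Dim db As DAO.Database
    Dim rsIn As DAO.Recordset, rsOut As DAO.Recordset
    Dim varParts As Variant, i As Integer

    Set db = CurrentDb
    Set rsIn = db.OpenRecordset( _
        "SELECT RowID, impItems FROM tblImport", dbOpenSnapshot)
    Set rsOut = db.OpenRecordset("tblDestItems", dbOpenDynaset)

    Do Until rsIn.EOF
        If Not IsNull(rsIn!impItems) Then
            varParts = Split(rsIn!impItems, ";")   ' split once, in VBA
            For i = LBound(varParts) To UBound(varParts)
                rsOut.AddNew                       ' one many-side row per item
                rsOut!ParentID = rsIn!RowID
                rsOut!ItemText = Trim(varParts(i))
                rsOut.Update
            Next i
        End If
        rsIn.MoveNext
    Loop

    rsIn.Close
    rsOut.Close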
If I do something like this again, would my code end up performing just as
well, and be easier to write and maintain, using simple iteration rather
than SQL-DML?
Thanks for any opinions,
- Steve J