SQL Server for data processing

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • William Cleveland

    SQL Server for data processing

    I'm working on a system right now where I have a database (two,
    actually, but one is discarded halfway through), but it's created
    and used as part of a process (reporting), rather than as the
    actual production data repository. I may be keeping the database
    permanantly, but it would be completely read-only; once the
    process is complete, the database will not change again. This has
    me wanting to do a few things that are rather foreign to my usual
    experience, and I don't know how many of them are supported.

    In several cases, I'm summarizing one table into another by several
    fields, and then updating the original table with an ID for the
    summary row each source row was summarized into (e.g., I summarize
    PlaceAndProduct Summary into PlaceSummary, and then populate
    PlaceSummaryID in PlaceAndProduct Summary). The update of the
    source table is much faster if the summary table has a clustered
    index on the summarized fields, but all later access will be faster
    if the clustered index is on the identity column. I've been
    including an ORDER BY the summarized fields in the original insert,
    so the identity column is in the same order as the summarized fields,
    but I don't know of any way to take advantage of that in the
    indexing declarations.

    As another approach to the above situation, if I change the
    clustered index on a table, and the rows happen to be in the
    same order by both indexes, will the table still get rebuilt?

    I will never do a roll-back in the process; if an action fails, I
    want to raise an error and halt (and I haven't lost any data).
    Is there any way to completely turn off logging?

    Will I gain anything by marking the database as single-user?

    Any indexes that I am not using while I populate the tables, I'm
    adding at the end with FillFactor 100, to keep any slack out.
    Is there a way to remove all the slack from everything else, at
    the end of the process? During a backup operation would be fine.

    Thanks,
    Bill

  • Hugo Kornelis

    #2
    Re: SQL Server for data processing

    On Mon, 07 Feb 2005 04:54:43 GMT, William Cleveland wrote:

    (snip)[color=blue]
    >In several cases, I'm summarizing one table into another by several
    >fields, and then updating the original table with an ID for the
    >summary row each source row was summarized into (e.g., I summarize
    >PlaceAndProduc tSummary into PlaceSummary, and then populate
    >PlaceSummary ID in PlaceAndProduct Summary). The update of the
    >source table is much faster if the summary table has a clustered
    >index on the summarized fields, but all later access will be faster
    >if the clustered index is on the identity column. I've been
    >including an ORDER BY the summarized fields in the original insert,
    >so the identity column is in the same order as the summarized fields,
    >but I don't know of any way to take advantage of that in the
    >indexing declarations.[/color]

    Hi William,

    First, you should know that including ORDER BY in an INSERT .. SELECT
    statement in meaningless. There is no way to guarantee that the identity
    values get handed out in the order you specify. If you really must have
    them in a specific order, the only safe way is to use a cursor - but since
    IDENTITY is meant to be a **meaningless** unique code, this is a quite
    questionable approach. Especially since there are more caveats to using an
    identity for this goal. If the generated value has to have some logical
    relation to the data, you'll have to calculate it, either on insert or on
    retrieval.

    I don't really understand the rest of this paragraph, though. It appears
    that you are merely looking for maximum speed. If that's the case, there
    might be other options to consider. Maybe, the update of the source table
    will be equally fast (or even faster) if you use a covering nonclustered
    index?

    If you could show some more specifics about what exactly you are trying to
    do, I might be able to give more specific comments.

    [color=blue]
    >As another approach to the above situation, if I change the
    >clustered index on a table, and the rows happen to be in the
    >same order by both indexes, will the table still get rebuilt?[/color]

    My guess is that it is - it should! First because SQL Server doesn't know
    that the rows are in order, so it will have to be checked anyway. Second
    because a clustered index is more than just the table - the table is in
    the leaf pages of the index; the root and the intermediate pages have to
    be built as well.

    To verify, I ran this code with the Show Execution Plan option turned on
    (Ctrl-K in Query Analyzer). Both with or without the added nonclustered
    index, the rebuilding of the clustered index had exactly the same plan as
    the building of the first clustered index.

    create table testit (a int not null, b int not null)
    go
    create unique clustered index i1 on testit(a)
    -- create unique nonclustered index i2 on testit(b)
    go
    insert testit values (1, 1)
    insert testit values (2, 2)
    insert testit values (3, 3)
    go
    create unique clustered index i1 on testit(b) with drop_existing
    go
    drop table testit
    go

    [color=blue]
    >I will never do a roll-back in the process; if an action fails, I
    >want to raise an error and halt (and I haven't lost any data).
    >Is there any way to completely turn off logging?[/color]

    No, there is no way to turn off logging. It's not only used for rollback,
    it's also used for trigger execution, for recovery after a system crash
    and for backup and restore - and then I'm probably forgetting some.

    Besides, *you* might never do a rollback, but for some errors, SQL Server
    will issue a rollback.

    [color=blue]
    >Will I gain anything by marking the database as single-user?[/color]

    Yes: you'll save the overhead of taking and releasing locks. The price you
    pay is that you can't run multiple connections at once. Without the
    single-user setting, you could still inspect some data in table A while a
    long running query is updating table B. This is impossible in single-user
    mode (in fact, it should have been called single-connection mode).

    [color=blue]
    >Any indexes that I am not using while I populate the tables, I'm
    >adding at the end with FillFactor 100, to keep any slack out.
    >Is there a way to remove all the slack from everything else, at
    >the end of the process? During a backup operation would be fine.[/color]

    The only way I know is to rebuild all indexes with fillfactor 100.

    Best, Hugo
    --

    (Remove _NO_ and _SPAM_ to get my e-mail address)

    Comment

    • Erland Sommarskog

      #3
      Re: SQL Server for data processing

      Hugo Kornelis (hugo@pe_NO_rFa ct.in_SPAM_fo) writes:[color=blue]
      > First, you should know that including ORDER BY in an INSERT .. SELECT
      > statement in meaningless. There is no way to guarantee that the identity
      > values get handed out in the order you specify.[/color]

      That is definitely not the same as saying that ORDER BY is meaningless.

      If you say:

      INSERT tbl (...)
      SELECT ....
      ORDER BY
      OPTION (MAXDOP 1)

      there is no guarantee that your IDENTITY values match - but your odds are
      pretty good.



      --
      Erland Sommarskog, SQL Server MVP, esquel@sommarsk og.se

      Books Online for SQL Server SP3 at
      Accelerate your AI application's time to market by harnessing the power of your own data and the built-in AI capabilities of SQL Server 2025, the enterprise database with best-in-class security, performance and availability.

      Comment

      Working...