Should I split this 175 Million record table?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Warren Wright

    Should I split this 175 Million record table?

    Hello,

    We maintain a 175 million record database table for our customer.
    This is an extract of some data collected for them by a third party
    vendor, who sends us regular updates to that data (monthly).

    The original data for the table came in the form of a single, large
    text file, which we imported.

    This table contains name and address information on potential
    customers.

    It is a maintenance nightmare for us, as prior to this the largest
    table we maintained was about 10 million records, with less
    complicated updates required.

    Here is the problem:
    * In order to do the searching we need to do on the table it has 8 of
    its 20 columns indexed.
    * It takes hours and hours to do anything to the table.
    * I'd like to cut down as much as possible the time required to update
    the file.

    We receive monthly one file containing 10 million records that are
    new, and can just be appended to the table (no problem, simple import
    into SQL Server).

    We also receive monthly one file containing 10 million records that
    are updates of information in the table. This is the tricky one. The
    only way to uniquely pair up a record in the update file with a record
    in the full database table is by a combination of individual_id, zip,
    and zip_plus4.

    There can be multiple records in the database for any given
    individual, because that individual could have a history that includes
    multiple addresses.

    How would you recommend handling this update? So far I have mostly
    tried a number of execution plans involving deleting out the records
    in the table that match those in the text file, so I can then import
    the text file, but the best of those plans takes well over 6 hours to
    run.

    My latest thought: Would it help in any way to partition the table
    into a number of smaller tables, with a view used to reference them?

    We have no performance issues querying the table, but I need some
    thoughts on how to better maintain it.

    One more thing, we do have 2 copies of the table on the server at all
    times so that one can be actively used in production while we run
    updates on the other one, so I can certainly try out some suggestions
    over the next week.

    Regards,

    Warren Wright
    Dallas
  • Hugo Kornelis

    #2
    Re: Should I split this 175 Million record table?

    On 1 Apr 2004 10:21:35 -0800, Warren Wright wrote:
    [color=blue]
    >We maintain a 175 million record database table for our customer.
    >This is an extract of some data collected for them by a third party
    >vendor, who sends us regular updates to that data (monthly).[/color]
    (snip)

    If this is such a performance nightmare, my first thought is whether
    all this data is really needed. Maybe you can move all data older than
    a set number of months to a history table?

    [color=blue]
    >How would you recommend handling this update? So far I have mostly
    >tried a number of execution plans involving deleting out the records
    >in the table that match those in the text file, so I can then import
    >the text file, but the best of those plans takes well over 6 hours to
    >run.[/color]

    This might be a situation wherre SQL Server proprietary syntax wins
    over ANSI standard. I'd try: first importing the text file in a temp
    work table, then doing the update as follows:

    UPDATE BigTable
    SET Column1 = WorkTable.Colum n1,
    Column2 = WorkTable.Colum n2,
    (...)
    ColumnN = WorkTable.Colum nN
    FROM BigTable
    INNER JOIN WorkTable
    ON WorkTable.indiv idual_id = BigTable.indivi dual_id
    AND WorkTable.zip = BigTable.zip
    AND WorkTable.zip_p lus4 = BigTable.zip_pl us4

    Of course, you would need an index on individual_id, zip and zip_plus4
    in the BigTable. It would be ideal if that was the clustered index
    (but unclustering another index might hurt performance elsewhere, so
    do doublecheck) Check the execution plan - WorkTable should be
    table-scanned; BigTable should be accessed through said index.

    That being said, I still think moving most of the rowsout of BigTable
    into HistoryTable is probably your best bet.

    Best, Hugo
    --

    (Remove _NO_ and _SPAM_ to get my e-mail address)

    Comment

    • Erland Sommarskog

      #3
      Re: Should I split this 175 Million record table?

      Warren Wright (warren.wright@ us.scorex.com) writes:[color=blue]
      > We also receive monthly one file containing 10 million records that
      > are updates of information in the table. This is the tricky one. The
      > only way to uniquely pair up a record in the update file with a record
      > in the full database table is by a combination of individual_id, zip,
      > and zip_plus4.
      >
      > There can be multiple records in the database for any given
      > individual, because that individual could have a history that includes
      > multiple addresses.
      >
      > How would you recommend handling this update? So far I have mostly
      > tried a number of execution plans involving deleting out the records
      > in the table that match those in the text file, so I can then import
      > the text file, but the best of those plans takes well over 6 hours to
      > run.
      >
      > My latest thought: Would it help in any way to partition the table
      > into a number of smaller tables, with a view used to reference them?[/color]

      Partitioned views are helpful for the case where you import new rows.
      Then you can import the new data into an empty table, set up indexes
      on that table, and then as a last step reorganize the view. Don't
      forget the partitioning constraint.

      Assuming that the updates can come all over this big table, partitioning
      cannot help you as far as you can work with a smaller table for the
      update.

      However, if you have plenty of disks and disk controllers available, you
      can spread out the partitions on different filegroups on different disks.
      This can improve speed somewhat, but since the transaction log is
      unchanged, I would guess that this does not give any fanastic effects.

      In any case, deleting to insert again, does not sound like the most
      effecient way. Hugo's UPDATE statements seems like a better way to
      go. You can also break up the UPDATE in batches, so that you only
      update half a million rows at a time or so. This is necessarily not
      faster, but can ease the strain on the transaction log.

      --
      Erland Sommarskog, SQL Server MVP, sommar@algonet. se

      Books Online for SQL Server SP3 at
      SQL Server 2025 redefines what's possible for enterprise data. With developer-first features and integration with analytics and AI models, SQL Server 2025 accelerates AI innovation using the data you already have.

      Comment

      • Warren Wright

        #4
        Re: Should I split this 175 Million record table?

        Hugo & Erland,

        Thanks for the advice. I will try the update option, and I'll drop the
        partitioned view idea for now.

        Since the new records that come each month would be spread randomly into
        the partitions, most likely, it wouldn't help much...and the imports
        aren't what is really killing us right now.

        One thing I thought was odd yesterday as I was working on these updates:

        I was just trying to get the simple import of new records files out of
        the way. There are two of them, and each contains about 10 million
        records.

        Before importing them I first dropped all indexes from the 175 million
        record table so that the import operation should be a simple append, and
        shouldn't take much time (a few hours).

        However, unlike previous imports where I did that, this one really
        crawled...only incrementing the row count by a thousand every 7 or 8
        seconds, and this would have taken forever.

        Anything that could have been keeping the import from just acting as a
        simple append?

        Regards,

        Warren Wright
        Dallas

        *** Sent via Developersdex http://www.developersdex.com ***
        Don't just participate in USENET...get rewarded for it!

        Comment

        • Kevin Frey

          #5
          Re: Should I split this 175 Million record table?

          As part of a performance impact study I would examine whether
          pre-sorting the input data and having an appropriately ordered index
          (based on the same sort order) gives rise to any performance
          improvements due to improved local caching of the data by SQL Server
          (ie. improving the cache hit ratio).

          If you have 175 million records and all 11 million result in updates,
          that's p[roughly] 6% of the database but you don't know which 6%, or 1
          in 16 records approx. Depending on the length of your records, if you
          have > 16 records per page (8Kb in SQL Server 2000), then having the
          input records ordered improves your chances of hitting multiple pages
          in succession, benefitting from the cache. Otherwise, with 175 million
          records which are randomly hit, the cache will probably improve the
          index performance but the data pages are likely to continue to be read
          from disk, unless you have a seriously large amount of RAM (ie. enough
          for 175 million records).

          You get 11 million records every month but you don't mention what
          proportion of those 11 million records are new records versus updated
          records.

          I also agree with the comments made about archiving the older data. If
          a particular customer record has multiple addresses listed, I would
          keep the current address in a Current table and all but the current
          address in a separate table.

          In doing this, this *might* also allow you to reduce the 3-part
          id/zip/zip+4 key lookup to just an id lookup, and the update process
          could then "sanity check" the other 2 parts as part of its processing.
          This assumes the ID you refer to is like a unique customer id that
          remains with the customer even if they change address - which you
          don't describe.

          Comment

          • Warren Wright

            #6
            Re: Should I split this 175 Million record table?

            kevin_g_frey@ho tmail.com (Kevin Frey) wrote[color=blue]
            > As part of a performance impact study I would examine whether
            > pre-sorting the input data and having an appropriately ordered index
            > (based on the same sort order) gives rise to any performance
            > improvements due to improved local caching of the data by SQL Server
            > (ie. improving the cache hit ratio).[/color]

            Good idea. I'll try that as well.
            [color=blue]
            > You get 11 million records every month but you don't mention what
            > proportion of those 11 million records are new records versus updated
            > records.[/color]

            I get 11 million records that are updates of old records, and another
            11 million that are entirely new. I also get a list of around 6
            million indiv_id's that are individuals whose records can be
            completely dropped (it ends up dropping around 10 million records
            since individuals often have multiple records).
            [color=blue]
            > I also agree with the comments made about archiving the older data. If
            > a particular customer record has multiple addresses listed, I would
            > keep the current address in a Current table and all but the current
            > address in a separate table.[/color]

            The purpose of the table is to be able to, based on a customer's
            name/address/ssn_last_4, find a match in the table, and get the
            indiv_id associated with that person.

            There are multiple addresses for each individual because it increases
            the chance there will be a match...if the customer gives us an address
            other than his most recent one, or if he has multiple addresses even
            discounting old ones.
            [color=blue]
            > In doing this, this *might* also allow you to reduce the 3-part
            > id/zip/zip+4 key lookup to just an id lookup, and the update process
            > could then "sanity check" the other 2 parts as part of its processing.
            > This assumes the ID you refer to is like a unique customer id that
            > remains with the customer even if they change address - which you
            > don't describe.[/color]

            I really wish I could drop it down to having an indiv_id primary key.
            You are right that it would make life easier.

            Do you think it would be useful to set up a true primary key on the
            three fields indiv_id, zip, and zip4? I already have to have each of
            the 3 independently indexed so I can do lookups on only one of them as
            part of various queries into the database.

            If I create a primary key index across all three, I still need an
            individual index on each of them separately, right? Assuming my
            queries need to sometimes use one of the three, but not all three?

            Thanks,

            Warren

            Comment

            • Hugo Kornelis

              #7
              Re: Should I split this 175 Million record table?

              On 6 Apr 2004 10:18:59 -0700, Warren Wright wrote:
              [color=blue]
              >If I create a primary key index across all three, I still need an
              >individual index on each of them separately, right? Assuming my
              >queries need to sometimes use one of the three, but not all three?[/color]

              In that case, you can drop one of the individual indexes. Which
              depends on the order in which you specify the columns in the PK.

              If, for example, you use the order indiv_id/zip/zip4, this index can
              also be used to search on indiv_id or indiv_id + zip, but not to
              search on zip, on zip4 or on zip + zip4. If you search on indiv_id +
              zip4, the PK index can only be used to do the first part of the search
              (the indiv_id), not the last (zip4). In this case, you no longer need
              a seperate index on indiv_id.

              Of course, if you change the column order on the PK index, you should
              keep the index on indiv_id and drop one of the others.

              Best, Hugo
              --

              (Remove _NO_ and _SPAM_ to get my e-mail address)

              Comment

              • Kevin Frey

                #8
                Re: Should I split this 175 Million record table?

                > The purpose of the table is to be able to, based on a customer's[color=blue]
                > name/address/ssn_last_4, find a match in the table, and get the
                > indiv_id associated with that person.[/color]

                Ummm - in your first email you said name/zip/zip+4. What's this one
                about?
                [color=blue]
                > Good idea. I'll try that as well.[/color]

                This is one of those tables that doesn't have a "natural" primary key
                from what I can tell. You could synthesize a primary key using an
                IDENTIFIER column but as you suggest it is probably better to use the
                combination of indiv_id,zip,zi p+4 and make that a clustered index.

                I would then take the input file of 11 million update records and
                pre-sort those into indiv_id,zip,zi p+4 to match the clustered index.

                Having indiv_id is also important because with your 6 million delete
                records (which produces about 10 million records to delete from the
                DB), the clustered index, because the first part of the key is
                indiv_id, the server can use that to find all the matching records and
                I would suspect quite efficiently.

                As the other poster mentioned you don't need three separate indexes
                for indiv_id,zip,zi p+4 once you have this primary key index. You'll
                only need zip and zip+4 indexes since the primary key index satisfies
                indiv_id queries (by itself).

                Is the actual customer information in this table as well (eg. their
                name?) or is that stored in a separate table? If it's stored in this
                table that seems like duplication, unless there is good reason for it.

                With the archiving concept that has been suggested, you could still
                investigate (if possible) how many of your update records "hit" the
                "current address" of a customer versus updates that hit "former"
                addresses. Assuming you know what the current address is. But I would
                not personally go down this track until applying the clustered index
                on indiv_id,zip,zi p+4, pre-sorting the input data, and doing the
                updates to see what happens.

                Your original approach of deleting the matching records before
                performing an import of the replacement data might also benefit from
                having the records pre-sorted to match the clustered key ordering.

                Implementation-wise, assuming you have some kind of update application
                already written: I'm not sure if UPDATE statements can be prepared,
                but if they can, you might achieve a performance improvement by using
                prepared statements (I've never personally tried a prepared UPDATE
                statement). This should eliminate the server needing to parse each
                update statement and instead you only supply the values.

                Cheers

                Kevin

                Comment

                Working...