Should I split this 175 Million record table?

**Hugo Kornelis** · Jul 20 '05, 03:19 AM

Re: Should I split this 175 Million record table?

On 1 Apr 2004 10:21:35 -0800, Warren Wright wrote:
[color=blue]
>We maintain a 175 million record database table for our customer.
>This is an extract of some data collected for them by a third party
>vendor, who sends us regular updates to that data (monthly).[/color]
(snip)

If this is such a performance nightmare, my first thought is whether
all this data is really needed. Maybe you can move all data older than
a set number of months to a history table?

[color=blue]
>How would you recommend handling this update? So far I have mostly
>tried a number of execution plans involving deleting out the records
>in the table that match those in the text file, so I can then import
>the text file, but the best of those plans takes well over 6 hours to
>run.[/color]

This might be a situation wherre SQL Server proprietary syntax wins
over ANSI standard. I'd try: first importing the text file in a temp
work table, then doing the update as follows:

UPDATE BigTable
SET Column1 = WorkTable.Colum n1,
Column2 = WorkTable.Colum n2,
(...)
ColumnN = WorkTable.Colum nN
FROM BigTable
INNER JOIN WorkTable
ON WorkTable.indiv idual_id = BigTable.indivi dual_id
AND WorkTable.zip = BigTable.zip
AND WorkTable.zip_p lus4 = BigTable.zip_pl us4

Of course, you would need an index on individual_id, zip and zip_plus4
in the BigTable. It would be ideal if that was the clustered index
(but unclustering another index might hurt performance elsewhere, so
do doublecheck) Check the execution plan - WorkTable should be
table-scanned; BigTable should be accessed through said index.

That being said, I still think moving most of the rowsout of BigTable
into HistoryTable is probably your best bet.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

**Erland Sommarskog** · Jul 20 '05, 03:19 AM

Re: Should I split this 175 Million record table?

Warren Wright (warren.wright@ us.scorex.com) writes:[color=blue]
> We also receive monthly one file containing 10 million records that
> are updates of information in the table. This is the tricky one. The
> only way to uniquely pair up a record in the update file with a record
> in the full database table is by a combination of individual_id, zip,
> and zip_plus4.
>
> There can be multiple records in the database for any given
> individual, because that individual could have a history that includes
> multiple addresses.
>
> How would you recommend handling this update? So far I have mostly
> tried a number of execution plans involving deleting out the records
> in the table that match those in the text file, so I can then import
> the text file, but the best of those plans takes well over 6 hours to
> run.
>
> My latest thought: Would it help in any way to partition the table
> into a number of smaller tables, with a view used to reference them?[/color]

Partitioned views are helpful for the case where you import new rows.
Then you can import the new data into an empty table, set up indexes
on that table, and then as a last step reorganize the view. Don't
forget the partitioning constraint.

Assuming that the updates can come all over this big table, partitioning
cannot help you as far as you can work with a smaller table for the
update.

However, if you have plenty of disks and disk controllers available, you
can spread out the partitions on different filegroups on different disks.
This can improve speed somewhat, but since the transaction log is
unchanged, I would guess that this does not give any fanastic effects.

In any case, deleting to insert again, does not sound like the most
effecient way. Hugo's UPDATE statements seems like a better way to
go. You can also break up the UPDATE in batches, so that you only
update half a million rows at a time or so. This is necessarily not
faster, but can ease the strain on the transaction log.

--
Erland Sommarskog, SQL Server MVP, sommar@algonet. se

Books Online for SQL Server SP3 at

Microsoft SQL Server

http://www.microsoft.com/sql/techinfo/productdoc/2000/books.asp

SQL Server 2025 redefines what's possible for enterprise data. With developer-first features and integration with analytics and AI models, SQL Server 2025 accelerates AI innovation using the data you already have.

**Warren Wright** · Jul 20 '05, 03:20 AM

Re: Should I split this 175 Million record table?

Hugo & Erland,

Thanks for the advice. I will try the update option, and I'll drop the
partitioned view idea for now.

Since the new records that come each month would be spread randomly into
the partitions, most likely, it wouldn't help much...and the imports
aren't what is really killing us right now.

One thing I thought was odd yesterday as I was working on these updates:

I was just trying to get the simple import of new records files out of
the way. There are two of them, and each contains about 10 million
records.

Before importing them I first dropped all indexes from the 175 million
record table so that the import operation should be a simple append, and
shouldn't take much time (a few hours).

However, unlike previous imports where I did that, this one really
crawled...only incrementing the row count by a thousand every 7 or 8
seconds, and this would have taken forever.

Anything that could have been keeping the import from just acting as a
simple append?

Regards,

Warren Wright
Dallas

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

**Kevin Frey** · Jul 20 '05, 03:21 AM

Re: Should I split this 175 Million record table?

As part of a performance impact study I would examine whether
pre-sorting the input data and having an appropriately ordered index
(based on the same sort order) gives rise to any performance
improvements due to improved local caching of the data by SQL Server
(ie. improving the cache hit ratio).

If you have 175 million records and all 11 million result in updates,
that's p[roughly] 6% of the database but you don't know which 6%, or 1
in 16 records approx. Depending on the length of your records, if you
have > 16 records per page (8Kb in SQL Server 2000), then having the
input records ordered improves your chances of hitting multiple pages
in succession, benefitting from the cache. Otherwise, with 175 million
records which are randomly hit, the cache will probably improve the
index performance but the data pages are likely to continue to be read
from disk, unless you have a seriously large amount of RAM (ie. enough
for 175 million records).

You get 11 million records every month but you don't mention what
proportion of those 11 million records are new records versus updated
records.

I also agree with the comments made about archiving the older data. If
a particular customer record has multiple addresses listed, I would
keep the current address in a Current table and all but the current
address in a separate table.

In doing this, this *might* also allow you to reduce the 3-part
id/zip/zip+4 key lookup to just an id lookup, and the update process
could then "sanity check" the other 2 parts as part of its processing.
This assumes the ID you refer to is like a unique customer id that
remains with the customer even if they change address - which you
don't describe.

**Warren Wright** · Jul 20 '05, 03:22 AM

Re: Should I split this 175 Million record table?

kevin_g_frey@ho tmail.com (Kevin Frey) wrote[color=blue]
> As part of a performance impact study I would examine whether
> pre-sorting the input data and having an appropriately ordered index
> (based on the same sort order) gives rise to any performance
> improvements due to improved local caching of the data by SQL Server
> (ie. improving the cache hit ratio).[/color]

Good idea. I'll try that as well.
[color=blue]
> You get 11 million records every month but you don't mention what
> proportion of those 11 million records are new records versus updated
> records.[/color]

I get 11 million records that are updates of old records, and another
11 million that are entirely new. I also get a list of around 6
million indiv_id's that are individuals whose records can be
completely dropped (it ends up dropping around 10 million records
since individuals often have multiple records).
[color=blue]
> I also agree with the comments made about archiving the older data. If
> a particular customer record has multiple addresses listed, I would
> keep the current address in a Current table and all but the current
> address in a separate table.[/color]

The purpose of the table is to be able to, based on a customer's
name/address/ssn_last_4, find a match in the table, and get the
indiv_id associated with that person.

There are multiple addresses for each individual because it increases
the chance there will be a match...if the customer gives us an address
other than his most recent one, or if he has multiple addresses even
discounting old ones.
[color=blue]
> In doing this, this *might* also allow you to reduce the 3-part
> id/zip/zip+4 key lookup to just an id lookup, and the update process
> could then "sanity check" the other 2 parts as part of its processing.
> This assumes the ID you refer to is like a unique customer id that
> remains with the customer even if they change address - which you
> don't describe.[/color]

I really wish I could drop it down to having an indiv_id primary key.
You are right that it would make life easier.

Do you think it would be useful to set up a true primary key on the
three fields indiv_id, zip, and zip4? I already have to have each of
the 3 independently indexed so I can do lookups on only one of them as
part of various queries into the database.

If I create a primary key index across all three, I still need an
individual index on each of them separately, right? Assuming my
queries need to sometimes use one of the three, but not all three?

Thanks,

Warren

**Hugo Kornelis** · Jul 20 '05, 03:22 AM

Re: Should I split this 175 Million record table?

On 6 Apr 2004 10:18:59 -0700, Warren Wright wrote:
[color=blue]
>If I create a primary key index across all three, I still need an
>individual index on each of them separately, right? Assuming my
>queries need to sometimes use one of the three, but not all three?[/color]

In that case, you can drop one of the individual indexes. Which
depends on the order in which you specify the columns in the PK.

If, for example, you use the order indiv_id/zip/zip4, this index can
also be used to search on indiv_id or indiv_id + zip, but not to
search on zip, on zip4 or on zip + zip4. If you search on indiv_id +
zip4, the PK index can only be used to do the first part of the search
(the indiv_id), not the last (zip4). In this case, you no longer need
a seperate index on indiv_id.

Of course, if you change the column order on the PK index, you should
keep the index on indiv_id and drop one of the others.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

**Kevin Frey** · Jul 20 '05, 03:22 AM

Re: Should I split this 175 Million record table?

> The purpose of the table is to be able to, based on a customer's[color=blue]
> name/address/ssn_last_4, find a match in the table, and get the
> indiv_id associated with that person.[/color]

Ummm - in your first email you said name/zip/zip+4. What's this one
about?
[color=blue]
> Good idea. I'll try that as well.[/color]

This is one of those tables that doesn't have a "natural" primary key
from what I can tell. You could synthesize a primary key using an
IDENTIFIER column but as you suggest it is probably better to use the
combination of indiv_id,zip,zi p+4 and make that a clustered index.

I would then take the input file of 11 million update records and
pre-sort those into indiv_id,zip,zi p+4 to match the clustered index.

Having indiv_id is also important because with your 6 million delete
records (which produces about 10 million records to delete from the
DB), the clustered index, because the first part of the key is
indiv_id, the server can use that to find all the matching records and
I would suspect quite efficiently.

As the other poster mentioned you don't need three separate indexes
for indiv_id,zip,zi p+4 once you have this primary key index. You'll
only need zip and zip+4 indexes since the primary key index satisfies
indiv_id queries (by itself).

Is the actual customer information in this table as well (eg. their
name?) or is that stored in a separate table? If it's stored in this
table that seems like duplication, unless there is good reason for it.

With the archiving concept that has been suggested, you could still
investigate (if possible) how many of your update records "hit" the
"current address" of a customer versus updates that hit "former"
addresses. Assuming you know what the current address is. But I would
not personally go down this track until applying the clustered index
on indiv_id,zip,zi p+4, pre-sorting the input data, and doing the
updates to see what happens.

Your original approach of deleting the matching records before
performing an import of the replacement data might also benefit from
having the records pre-sorted to match the clustered key ordering.

Implementation-wise, assuming you have some kind of update application
already written: I'm not sure if UPDATE statements can be prepared,
but if they can, you might achieve a performance improvement by using
prepared statements (I've never personally tried a prepared UPDATE
statement). This should eliminate the server needing to parse each
update statement and instead you only supply the values.

Cheers

Kevin

Should I split this 175 Million record table?

Should I split this 175 Million record table?

Comment

Comment

Comment

Comment

Comment

Comment

Comment