
 Non clustered large table question


Jazid
Starting Member

3 Posts

Posted - 2010-03-31 : 12:45:32
I have a large table (approaching 800 million rows, 81 GB in size) which had a compound clustered Primary Key across 8 different columns (consisting of Refid, Year, Aid, Aname, Bid, Cid, GeneratedDate etc).
The remaining 7 columns consist of a value field and a single currency field.

There are only batch updates to this table, which remove and insert records throughout this large table. With the clustered key this caused huge performance problems during inserts of new records (which happen in batches of 1-10 million rows).

I have removed the clustered index altogether (by converting the PK into a non-clustered PK) and unsurprisingly this has really improved the insert performance; however, the select performance does seem to have suffered.

Are there any pitfalls with not having a clustered key on a table like this?

Sorry if this is a silly question and thanks in advance.

PackRat
Starting Member

26 Posts

Posted - 2010-03-31 : 13:37:15
Other than the obvious performance problems with SELECT that you've already discovered?
A table without a clustering key is a HEAP; its rows are in no particular order, so it is not so read-friendly (as you've discovered).

Generally speaking, every table ought to have a logical clustering sequence which can be optimized either for read or write operations, if not both.
If the original clustered key was optimized for reading but required the rows to be inserted at various locations throughout the table, then your inserts were slowed by the page splits and fragmentation occurring as the db makes room for the rows. If you're not using BCP or BULK INSERT for the inserts you might want to consider them; that should help speed things up by limiting the amount of logging going on during the insert.
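
As a sketch of what a minimally logged load could look like - table, file, and column names here are placeholders, not from your schema, and minimal logging also depends on the database being in SIMPLE or BULK_LOGGED recovery:

```sql
-- Hypothetical example: minimally logged bulk load from a flat file.
BULK INSERT dbo.BigFactTable
FROM '\\fileserver\loads\batch_20100401.dat'
WITH (
    TABLOCK,                 -- table lock is a prerequisite for minimal logging
    ORDER (Refid, [Year]),   -- declares that the file is pre-sorted on the clustered key
    BATCHSIZE = 1000000      -- commit in chunks to keep the transaction log manageable
);
```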

You might also consider partitioning the table, with so many columns in the PK there should be some helpful partitioning scheme that would help across the board.

If you want to run with the current arrangement - not clustering on the PK and not selecting an alternative clustering key - you can opt to add a surrogate clustered index, such as an auto-numbered int/bigint column and a corresponding unique clustered index. That would prevent the table from being considered a heap, and the clustered index, although irrelevant to your queries, can help speed up reads with key lookups.
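
A rough sketch of that surrogate approach (all names here are made up):

```sql
-- Hypothetical example: add a surrogate identity column and cluster on it.
ALTER TABLE dbo.BigFactTable
    ADD RowId BIGINT IDENTITY(1,1) NOT NULL;

CREATE UNIQUE CLUSTERED INDEX CIX_BigFactTable_RowId
    ON dbo.BigFactTable (RowId);
-- The existing nonclustered PK stays as-is; new rows now append at the
-- end of the table instead of splitting pages in the middle.
```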

_____________________________
wrote this on my TRS-80 COCO4

<PakRat/>

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2010-03-31 : 13:41:10
Are these batch updates happening frequently? If not, you could drop indexes just before the batch processes and reintroduce them afterwards.
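
Something along these lines, assuming a secondary index on GeneratedDate exists (names are invented):

```sql
-- Hypothetical example: drop a secondary index, run the batch, recreate it.
DROP INDEX IX_BigFactTable_GeneratedDate ON dbo.BigFactTable;

-- ... run the batch insert here ...

CREATE NONCLUSTERED INDEX IX_BigFactTable_GeneratedDate
    ON dbo.BigFactTable (GeneratedDate);
```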

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/


Kristen
Test

22859 Posts

Posted - 2010-04-01 : 03:54:21
How is the batch insert happening?

If, for example, you could use BCP and pre-order the import file into Clustered Index Order, and provide the BCP Hint that the import file is pre-ordered, that may help your import significantly.
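
For instance - server, database, and file names below are made up, and the sort columns assume a clustered key leading with Refid, Year:

```shell
REM Hypothetical example: export pre-sorted, then load with an ORDER hint.
bcp "SELECT * FROM BigDB.dbo.SourceTable ORDER BY Refid, [Year]" queryout batch.dat -S SRCSERVER -T -n

bcp BigDB.dbo.BigFactTable in batch.dat -S DESTSERVER -T -n -b 1000000 -h "ORDER(Refid ASC, Year ASC), TABLOCK"
```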

If your RefId,Year combination is always increasing then new entries will be added at the end of the table (rather than having to be inserted in the middle of index pages), so that would also suggest that inserts can be made very efficient.

"if not you could drop indexes just before batch processes and reintroduce them afterwards"

You mean for any secondary indexes?

Presumably not a good idea for the Clustered Index itself as recreating could involve a lot of rearranging on the disk?

Jazid
Starting Member

3 Posts

Posted - 2010-04-01 : 06:28:37
Thanks all for the replies.
It's a nightly batch load from a remote linked SQL Server database, so no minimally logged inserts (BULK INSERT or BCP) are available :(
It's also SQL Server Std Edition, so there is no partitioning available, and I don't think I can add a surrogate clustered identity column at this stage.

I couldn't believe the select performance would be so poor with a non-clustered table. After all, the table is stored as a HEAP and the index is still pointing to the data pages; it's just that the data pages aren't ordered, so there is a greater chance of a query having to retrieve more pages, due to the non-sequential storage of data within pages.

The batch load inserts data for Refid, Year, Aid, Aname, Bid, Cid etc with new values and a newer GeneratedDate. So I am wondering whether it's worthwhile recreating this clustered key, but with GeneratedDate as the first column, which means new inserts will always be made at the end of the clustered table (however I'm not sure what additional overheads I would be reintroducing for new INSERTS with such a wide clustered PK).
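
Roughly what I have in mind (constraint and table names abbreviated here):

```sql
-- Hypothetical example: rebuild the PK clustered, with GeneratedDate leading
-- so new batches append at the end of the table.
ALTER TABLE dbo.BigFactTable DROP CONSTRAINT PK_BigFactTable;

ALTER TABLE dbo.BigFactTable ADD CONSTRAINT PK_BigFactTable
    PRIMARY KEY CLUSTERED (GeneratedDate, Refid, [Year], Aid, Aname, Bid, Cid);
```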

Kristen
Test

22859 Posts

Posted - 2010-04-01 : 07:11:42
"I couldn't believe the select performance would be so poor with a non-clustered table"

Assuming your query was for a range, in the order of the keys in the clustered index, then the rows would be physically grouped together with the clustered index, and "all over the table" with a heap - so lots of disk-head travel etc.

"but with the GeneratedDate ordered as the first column,"

That may help inserts, but will hurt your queries if they are a range based on the current key-order - i.e. a single, or modest range, of Refid is always part of the query.

It might be that there is always a SCAN of the clustered index - but because of the large number of key columns, the whole WHERE clause is then "covered".

It might be worth considering changing to a number of indexes, tuned to the most commonly used queries, plus a slimmer clustered index.
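
e.g. something along these lines - the column choices are purely illustrative, you would tune them to your actual queries:

```sql
-- Hypothetical example: slim clustered index plus a covering nonclustered index.
CREATE CLUSTERED INDEX CIX_BigFactTable
    ON dbo.BigFactTable (Refid, [Year]);

CREATE NONCLUSTERED INDEX IX_BigFactTable_Cover
    ON dbo.BigFactTable (Aid, Cid)
    INCLUDE (GeneratedDate, Value, Currency);  -- covers the select list, avoiding lookups
```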

But trying them out, and making the change, on 800M rows is going to require some thumb-twiddling, and quite a lot of disk space.

PackRat
Starting Member

26 Posts

Posted - 2010-04-01 : 12:18:17
Data warehouse scenarios like this can be very tricky to optimize; it pretty much always comes down to trade-offs between read and write performance. Experimentation and testing of the read and write scenarios on a development copy of the database, plus examination of the resulting execution plans with incremental optimizations, is most likely your best approach for finding a balance. You can think your way through a few scenarios based on what you know about the selectivity of the various components of the primary key, but in the end you'll have to test, and testing on production data is dangerous business.

If you've got a good idea how the data is consumed you can build effective covering indexes, as Kristen mentions above; those can help eliminate clustered index scans / table scans, which will be your slowest-performing lookups. You'll take a performance hit for the additional index maintenance, but if the indexes are narrow and based largely on numeric types it ought not to be too bad.

If the maintenance window / table downtime is a factor, and you're loading from a linked server where there is some network latency, you might consider importing the data into a staging table - with the same clustering sequence as the primary table, preferably on a separate disk/array - and running the insert into the primary table as a local operation; that could shorten the time the table is locked.
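
A sketch of that staging pattern (server and table names invented):

```sql
-- Hypothetical example: pull across the linked server into a local staging
-- table first, then do the big insert as a purely local operation.
SELECT *
INTO   dbo.BigFactTable_Stage
FROM   REMOTESRV.BigDB.dbo.BigFactTable_Source;

CREATE CLUSTERED INDEX CIX_Stage
    ON dbo.BigFactTable_Stage (Refid, [Year]);  -- match the target's clustering

INSERT INTO dbo.BigFactTable WITH (TABLOCK)
SELECT * FROM dbo.BigFactTable_Stage;
```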

On the other hand, if this is a nightly load and availability isn't a factor, I'd optimize entirely for reading.

_____________________________
wrote this on my TRS-80 COCO4

<PakRat/>

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2010-04-01 : 13:01:10
Yeah, select performance on heaps tends to suffer a bit. Welcome to the world of RID lookups!! :)

But as far as insert/load performance goes, it is not always best or most performant to do the load in a single batch. There are several ways to chunk the load; that might be an option worth investigating.
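
e.g. a chunked move from a staging table rather than one giant insert (names invented; note that OUTPUT ... INTO has restrictions, such as the target not being part of a foreign key, so test first):

```sql
-- Hypothetical example: move rows in 1M-row chunks.  DELETE ... OUTPUT
-- removes each chunk from staging and feeds it to the target in one statement,
-- keeping each transaction (and its log usage) small.
WHILE 1 = 1
BEGIN
    DELETE TOP (1000000) FROM dbo.BigFactTable_Stage
    OUTPUT deleted.* INTO dbo.BigFactTable;

    IF @@ROWCOUNT = 0 BREAK;
END
```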
   
