Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
ch9862
Yak Posting Veteran
76 Posts |
Posted - 2013-08-22 : 12:54:07
|
First off - apologies if this is in a wrong forum.I have a problem with performance of a query. I need to add 2 tables, contents of which will be compared to fish out delta.Here's an example (my tables have ID as IDENTITY):declare @tPrev table (ID int,TextValue varchar(256),TypeCode char(1))insert @tPrevselect 1, '20130530-0000319', 'A' union allselect 2, '20130530-0000320', 'A' union allselect 3, 'FileA.pdf', 'B' union allselect 4, 'FileB.pdf', 'B' union allselect 5, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union allselect 6, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union allselect 7, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D'declare @tCurrent table (ID int,TextValue varchar(256),TypeCode char(1))insert @tCurrentselect 1, '20130530-0000319', 'A' union allselect 2, '20130530-0000320', 'A' union allselect 3, '20130531-0000047', 'A' union allselect 4, 'FileA.pdf', 'B' union allselect 5, 'FileB.pdf', 'B' union allselect 6, 'FileC.pdf', 'B' union allselect 7, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union allselect 8, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union allselect 9, 'GO-3-LLL-100-SD|01|PRE|ATT000001', 'C' union allselect 10, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D' union allselect 11, 'mg-tor|148207|R|56722|SG5-TRT9|age-trim-hs.log.old', 'D'-- Query will be repeated for each TypeCodeselect TextValue,TypeCodefrom @tCurrentwhere TypeCode='A' and TextValue not in ( select TextValue from @tPrev where TypeCode='A') As you can see, the idea here is to periodically create a snapshot of various items, and compare it to the previous one. I expect to have around 450,000 records of 4 types in each snapshot, with only a handful (less than 50?) being different between each snapshot.I created these tables in my SQL Express and added around 120,000 test records. Filling the tables takes seconds, running the query I showed above - over 1 hour. So far I tried adding clustered index on TypeCode and non-clustered on TextValue, but without significant improvement. Do I need different indices, or update statistics after mass inserts, or is there a better way to handle the entire process?TIA |
|
lazerath
Constraint Violating Yak Guru
343 Posts |
Posted - 2013-08-22 : 13:10:35
|
Try this:select TextValue,TypeCodefrom @tCurrentEXCEPTselect TextValue,TypeCodefrom @tPrev |
 |
|
lazerath
Constraint Violating Yak Guru
343 Posts |
Posted - 2013-08-22 : 13:24:28
|
Hmmmmm, looking at the execution plan, the EXCEPT, although simpler from a coding perspective, requires a SORT where the original query does not. Eliminating the sort just gets you back to your original problem. On a large recordset, you might benefit from an index (either clustered or not) on the combination of TypeCode,TextValue. This still requires a fair bit of work for SQL Server to do the left anti semi join. If you have some flexibility in your design, you could reduce the work for this stage by shifting some work into the write process -- calculate a 16 byte MD5 hashcode using the HASHBYTES function during your write process and put an index on that column. Then do your exclusionary logic on the hashcode. Be aware that MD5 has the chance to create a collision, albiet with a very small probability. |
 |
|
ScottPletcher
Aged Yak Warrior
550 Posts |
Posted - 2013-08-22 : 16:14:35
|
Create a nonclustered index on ( TypeCode, TextValue ) in both tables (at least initially). Then try this code:SELECT tCurr.TextValue, tCurr.TypeCodeFROM @tCurrent tCurrWHERE NOT EXISTS( SELECT 1 FROM @tPrev tPrev WHERE tPrev.TypeCode = tCurr.TypeCode AND tPrev.TextValue = tCurr.TextValue ) |
 |
|
lazerath
Constraint Violating Yak Guru
343 Posts |
Posted - 2013-08-22 : 18:46:15
|
NOT EXISTS is more flexible and generally preferable to NOT IN, however it is very interesting to me how these three methods stack up. Try this on your machine:IF OBJECT_ID('tempdb..#tPrev') IS NOT NULL DROP TABLE #tPrev;IF OBJECT_ID('tempdb..#tCurrent') IS NOT NULL DROP TABLE #tCurrent;CREATE TABLE #tPrev (ID int,TextValue varchar(256),TypeCode char(1));GOinsert #tPrevselect 1, '20130530-0000319', 'A' union allselect 2, '20130530-0000320', 'A' union allselect 3, 'FileA.pdf', 'B' union allselect 4, 'FileB.pdf', 'B' union allselect 5, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union allselect 6, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union allselect 7, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D'GO 10000CREATE TABLE #tCurrent (ID int,TextValue varchar(256),TypeCode char(1));GOinsert #tCurrentselect 1, '20130530-0000319', 'A' union allselect 2, '20130530-0000320', 'A' union allselect 3, '20130531-0000047', 'A' union allselect 4, 'FileA.pdf', 'B' union allselect 5, 'FileB.pdf', 'B' union allselect 6, 'FileC.pdf', 'B' union allselect 7, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union allselect 8, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union allselect 9, 'GO-3-LLL-100-SD|01|PRE|ATT000001', 'C' union allselect 10, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D' union allselect 11, 'mg-tor|148207|R|56722|SG5-TRT9|age-trim-hs.log.old', 'D'GO 10000select count(*) from #tPrev;select count(*) from #tCurrent;-- Query will be repeated for each TypeCodeselect TextValue,TypeCodefrom #tCurrent cwhere TextValue not in ( select p.TextValue from #tPrev p where p.TypeCode=c.TypeCode) select TextValue,TypeCodefrom #tCurrentEXCEPTselect TextValue,TypeCodefrom #tPrevSELECT tCurr.TextValue, tCurr.TypeCodeFROM #tCurrent tCurrWHERE NOT EXISTS( SELECT 1 FROM #tPrev tPrev WHERE tPrev.TypeCode = tCurr.TypeCode AND tPrev.TextValue = tCurr.TextValue ) CREATE NONCLUSTERED INDEX ix_#tCurrent ON #tCurrent(TypeCode,TextValue);CREATE NONCLUSTERED INDEX ix_#tPrev ON #tPrev(TypeCode,TextValue);-- Query will be repeated for each TypeCodeselect TextValue,TypeCodefrom #tCurrent cwhere TextValue not in ( select p.TextValue from #tPrev p where p.TypeCode=c.TypeCode) select TextValue,TypeCodefrom #tCurrentEXCEPTselect TextValue,TypeCodefrom #tPrevSELECT tCurr.TextValue, tCurr.TypeCodeFROM #tCurrent tCurrWHERE NOT EXISTS( SELECT 1 FROM #tPrev tPrev WHERE tPrev.TypeCode = tCurr.TypeCode AND tPrev.TextValue = tCurr.TextValue ) It's worth noting that on my test machine, EXCEPT and NOT EXISTS were <=1% of the total cost of the batch both before and after the index creation, with the NOT IN taking 99% of the time. Comparing EXCEPT directly to NOT EXISTS, it comes out about 34% / 66% in favor of EXCEPT, however the resultset is a DISTINCT vs each row without a match as is the case for NOT EXISTS. |
 |
|
ch9862
Yak Posting Veteran
76 Posts |
Posted - 2013-08-26 : 08:54:59
|
Thanks for the suggestions - index on both fields works great. It is interesting to note differences between how these approaches stack up.Thanks again! |
 |
|
ShivaKrishna
Starting Member
20 Posts |
Posted - 2013-08-28 : 05:13:37
|
use the following query after indexing:SELECT tCurr.TextValue, tCurr.TypeCodeFROM @tCurrent tCurrWHERE NOT EXISTS( SELECT 1 FROM @tPrev tPrev WHERE tPrev.TypeCode = tCurr.TypeCode AND tPrev.TextValue = tCurr.TextValue ) |
 |
|
|
|
|
|
|