Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Slow query comparing 2 tables

Author  Topic 

ch9862
Yak Posting Veteran

76 Posts

Posted - 2013-08-22 : 12:54:07
First off - apologies if this is in a wrong forum.

I have a problem with performance of a query. I need to add 2 tables, contents of which will be compared to fish out delta.

Here's an example (my tables have ID as IDENTITY):
declare @tPrev table (ID int,TextValue varchar(256),TypeCode char(1))
insert @tPrev
select 1, '20130530-0000319', 'A' union all
select 2, '20130530-0000320', 'A' union all
select 3, 'FileA.pdf', 'B' union all
select 4, 'FileB.pdf', 'B' union all
select 5, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union all
select 6, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union all
select 7, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D'


declare @tCurrent table (ID int,TextValue varchar(256),TypeCode char(1))
insert @tCurrent
select 1, '20130530-0000319', 'A' union all
select 2, '20130530-0000320', 'A' union all
select 3, '20130531-0000047', 'A' union all
select 4, 'FileA.pdf', 'B' union all
select 5, 'FileB.pdf', 'B' union all
select 6, 'FileC.pdf', 'B' union all
select 7, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union all
select 8, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union all
select 9, 'GO-3-LLL-100-SD|01|PRE|ATT000001', 'C' union all
select 10, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D' union all
select 11, 'mg-tor|148207|R|56722|SG5-TRT9|age-trim-hs.log.old', 'D'


-- Query will be repeated for each TypeCode
select TextValue,TypeCode
from @tCurrent
where TypeCode='A' and TextValue not in (
select TextValue from @tPrev where TypeCode='A'
)

As you can see, the idea here is to periodically create a snapshot of various items, and compare it to the previous one. I expect to have around 450,000 records of 4 types in each snapshot, with only a handful (less than 50?) being different between each snapshot.

I created these tables in my SQL Express and added around 120,000 test records. Filling the tables takes seconds, running the query I showed above - over 1 hour. So far I tried adding clustered index on TypeCode and non-clustered on TextValue, but without significant improvement. Do I need different indices, or update statistics after mass inserts, or is there a better way to handle the entire process?

TIA

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2013-08-22 : 13:10:35
Try this:

select TextValue,TypeCode
from @tCurrent
EXCEPT
select TextValue,TypeCode
from @tPrev


Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2013-08-22 : 13:24:28
Hmmmmm, looking at the execution plan, the EXCEPT, although simpler from a coding perspective, requires a SORT where the original query does not. Eliminating the sort just gets you back to your original problem.

On a large recordset, you might benefit from an index (either clustered or not) on the combination of TypeCode,TextValue. This still requires a fair bit of work for SQL Server to do the left anti semi join. If you have some flexibility in your design, you could reduce the work for this stage by shifting some work into the write process -- calculate a 16 byte MD5 hashcode using the HASHBYTES function during your write process and put an index on that column. Then do your exclusionary logic on the hashcode. Be aware that MD5 has the chance to create a collision, albiet with a very small probability.
Go to Top of Page

ScottPletcher
Aged Yak Warrior

550 Posts

Posted - 2013-08-22 : 16:14:35
Create a nonclustered index on ( TypeCode, TextValue ) in both tables (at least initially). Then try this code:



SELECT tCurr.TextValue, tCurr.TypeCode
FROM @tCurrent tCurr
WHERE
NOT EXISTS(
SELECT 1
FROM @tPrev tPrev
WHERE
tPrev.TypeCode = tCurr.TypeCode AND
tPrev.TextValue = tCurr.TextValue
)
Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2013-08-22 : 18:46:15
NOT EXISTS is more flexible and generally preferable to NOT IN, however it is very interesting to me how these three methods stack up. Try this on your machine:

IF OBJECT_ID('tempdb..#tPrev') IS NOT NULL DROP TABLE #tPrev;
IF OBJECT_ID('tempdb..#tCurrent') IS NOT NULL DROP TABLE #tCurrent;

CREATE TABLE #tPrev (ID int,TextValue varchar(256),TypeCode char(1));
GO
insert #tPrev
select 1, '20130530-0000319', 'A' union all
select 2, '20130530-0000320', 'A' union all
select 3, 'FileA.pdf', 'B' union all
select 4, 'FileB.pdf', 'B' union all
select 5, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union all
select 6, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union all
select 7, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D'
GO 10000

CREATE TABLE #tCurrent (ID int,TextValue varchar(256),TypeCode char(1));
GO
insert #tCurrent
select 1, '20130530-0000319', 'A' union all
select 2, '20130530-0000320', 'A' union all
select 3, '20130531-0000047', 'A' union all
select 4, 'FileA.pdf', 'B' union all
select 5, 'FileB.pdf', 'B' union all
select 6, 'FileC.pdf', 'B' union all
select 7, 'VR-3-EDT-100-VZ|01|PRS|ATT000001', 'C' union all
select 8, 'VR-2-EHT-100-LT|01|NTU|ATT000001', 'C' union all
select 9, 'GO-3-LLL-100-SD|01|PRE|ATT000001', 'C' union all
select 10, 'unk-rep|134223|R|3510|PRT9-PRT8|comm-trace-124.log', 'D' union all
select 11, 'mg-tor|148207|R|56722|SG5-TRT9|age-trim-hs.log.old', 'D'
GO 10000

select count(*) from #tPrev;
select count(*) from #tCurrent;

-- Query will be repeated for each TypeCode
select TextValue,TypeCode
from #tCurrent c
where TextValue not in (
select p.TextValue from #tPrev p where p.TypeCode=c.TypeCode
)

select TextValue,TypeCode
from #tCurrent
EXCEPT
select TextValue,TypeCode
from #tPrev

SELECT tCurr.TextValue, tCurr.TypeCode
FROM #tCurrent tCurr
WHERE
NOT EXISTS(
SELECT 1
FROM #tPrev tPrev
WHERE
tPrev.TypeCode = tCurr.TypeCode AND
tPrev.TextValue = tCurr.TextValue
)

CREATE NONCLUSTERED INDEX ix_#tCurrent ON #tCurrent(TypeCode,TextValue);
CREATE NONCLUSTERED INDEX ix_#tPrev ON #tPrev(TypeCode,TextValue);


-- Query will be repeated for each TypeCode
select TextValue,TypeCode
from #tCurrent c
where TextValue not in (
select p.TextValue from #tPrev p where p.TypeCode=c.TypeCode
)

select TextValue,TypeCode
from #tCurrent
EXCEPT
select TextValue,TypeCode
from #tPrev

SELECT tCurr.TextValue, tCurr.TypeCode
FROM #tCurrent tCurr
WHERE
NOT EXISTS(
SELECT 1
FROM #tPrev tPrev
WHERE
tPrev.TypeCode = tCurr.TypeCode AND
tPrev.TextValue = tCurr.TextValue
)


It's worth noting that on my test machine, EXCEPT and NOT EXISTS were <=1% of the total cost of the batch both before and after the index creation, with the NOT IN taking 99% of the time. Comparing EXCEPT directly to NOT EXISTS, it comes out about 34% / 66% in favor of EXCEPT, however the resultset is a DISTINCT vs each row without a match as is the case for NOT EXISTS.
Go to Top of Page

ch9862
Yak Posting Veteran

76 Posts

Posted - 2013-08-26 : 08:54:59
Thanks for the suggestions - index on both fields works great. It is interesting to note differences between how these approaches stack up.

Thanks again!
Go to Top of Page

ShivaKrishna
Starting Member

20 Posts

Posted - 2013-08-28 : 05:13:37
use the following query after indexing:

SELECT tCurr.TextValue, tCurr.TypeCode
FROM @tCurrent tCurr
WHERE
NOT EXISTS(
SELECT 1
FROM @tPrev tPrev
WHERE
tPrev.TypeCode = tCurr.TypeCode AND
tPrev.TextValue = tCurr.TextValue
)
Go to Top of Page
   

- Advertisement -