Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Selecting which duplicate to keep fuzzy grouping

Author  Topic 

haler
Starting Member

13 Posts

Posted - 2012-01-19 : 17:51:16
Hi i have recently decided to dedupe my data but i am having a problem after running fuzzy grouping with the query on updating which duplicate to keep

I have a field called score_quality which has an integer in it that decides the quality of the row and whether to keep it or get rid of the duplicates

I also have another field called dedupe_result which will let me know whether i have decided to keep or purge it

So for each _key_out i want to set dedupe_result to 'keep' with the highest score_quality and the others to 'purge' but some may have the same score_quality so it should just set one of them to 'keep' and the rest to 'purge'

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2012-01-19 : 18:31:58
Hello haler,

You will likely require an aggregate query which forces one of these rows out {i.e. MAX(col)}. Does the result set have any uniqueness (either a primary or composite key structure)?

HTH.
Go to Top of Page

haler
Starting Member

13 Posts

Posted - 2012-01-19 : 18:49:06
there is a unique key _key_in and then _key_out which groups together the duplicates using _key_in

I have this select statement which I THINK selects everything with the highest score but I am not very sure on how to translate it into an update statement, if i can get this into an update statement I suppose I can just set the remainder of the records to purge

SELECT Distinct(_key_out), quality_score, *
FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1
WHERE
quality_score = (SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out)
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2012-01-19 : 19:24:21
quote:
Originally posted by haler

there is a unique key _key_in and then _key_out which groups together the duplicates using _key_in

I have this select statement which I THINK selects everything with the highest score but I am not very sure on how to translate it into an update statement, if i can get this into an update statement I suppose I can just set the remainder of the records to purge

SELECT Distinct(_key_out), quality_score, *
FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1
WHERE
quality_score = (SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out)



Sounds like a good plan. If it helps, have a search on "update using joining tables". This will give you some sample structures for building such an update query.

Best wishes.
Go to Top of Page

haler
Starting Member

13 Posts

Posted - 2012-01-19 : 19:49:10
I am having trouble with the updates statement if anyone could help me with restructuring this it would be much appreciated, this is what i have and i am getting the error: 'Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.'

I sort of understand why i get the error but im not sure how to restructure the query to make it work.

UPDATE [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]
SET dedupe_result = 'keep'
WHERE _key_out = (SELECT Distinct(_key_out)
FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1
WHERE
quality_score = (SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out))
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2012-01-19 : 19:52:58
Perhaps something more like the following form:


UPDATE b
SET b.dedupe_result = 'keep'
FROM
[BusinessListings].[dbo].[MongoOrganisationACTM1Destination] b
INNER JOIN
(
SELECT Distinct(_key_out) AS _key_out
FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1
WHERE quality_score =
(
SELECT MAX(quality_score)
FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]
WHERE _key_out = t1._key_out)
)
) d ON d._key_out = b._key_out


HTH.
Go to Top of Page

haler
Starting Member

13 Posts

Posted - 2012-01-19 : 20:10:58
Thought that nearly did it for a second but for some reason it is setting everything to keep
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2012-01-19 : 20:36:47
Perhaps you dont have uniquesness with just "_key_out". You mentioned is was a composite key of _key_out and quality_score which created uniqueness.

If that is the case, then you must include both fields in the join condition. Something like:

UPDATE 	b
SET b.dedupe_result = 'keep'
FROM
[BusinessListings].[dbo].[MongoOrganisationACTM1Destination] b
INNER JOIN
(
SELECT
Distinct(_key_out) AS _key_out,
quality_score
FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1
WHERE quality_score =
(
SELECT MAX(quality_score)
FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]
WHERE _key_out = t1._key_out)
)
) d ON d._key_out = b._key_out AND d.quality_score = b.quality_score


This is why we often need to join 2 or more tables to perform such an update. Because we have a composite key which defines the affected results.

HTH.
Go to Top of Page

haler
Starting Member

13 Posts

Posted - 2012-01-19 : 21:47:43
_key_in is unique, _key_out is the duplicates so for example:

_key_in , _key_out , name , score , dedupe
1 , 1 , ron , 10 , purge
2 , 1 , ronn , 15 , keep
3 , 3 , john , 5 , keep
4 , 4 , matt , 15 , keep
5 , 4 , mat , 10 , purge

and i want to keep the _key_out with the higher score by setting the field de_dupe to 'keep' and the remainder to 'purge'

edit: the score can also be the same within a duplicate so in the case it is the same i just need to keep one it doesnt matter which one. The query above nearly works perfect but it marks all duplicates with the same score as keep
Go to Top of Page

haler
Starting Member

13 Posts

Posted - 2012-01-22 : 20:50:00
got it working thanks ehorn

go
merge into [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] mon
using (
select _key_in,
case ROW_NUMBER() OVER (PARTITION BY _key_out ORDER BY quality_score desc)
when 1 then 'keep'
else 'purge'
end new_dedupe
from [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]
) new
ON (mon._key_in = new._key_in )
WHEN MATCHED THEN UPDATE
SET mon.dedupe_result = new.new_dedupe
;
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2012-01-22 : 21:18:55
quote:
Originally posted by haler

got it working thanks ehorn

go
merge into [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] mon
using (
select _key_in,
case ROW_NUMBER() OVER (PARTITION BY _key_out ORDER BY quality_score desc)
when 1 then 'keep'
else 'purge'
end new_dedupe
from [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]
) new
ON (mon._key_in = new._key_in )
WHEN MATCHED THEN UPDATE
SET mon.dedupe_result = new.new_dedupe
;




Great to hear. Nicely done.

Have a good evening.
Go to Top of Page
   

- Advertisement -