Author |
Topic |
haler
Starting Member
13 Posts |
Posted - 2012-01-19 : 17:51:16
|
Hi i have recently decided to dedupe my data but i am having a problem after running fuzzy grouping with the query on updating which duplicate to keepI have a field called score_quality which has an integer in it that decides the quality of the row and whether to keep it or get rid of the duplicatesI also have another field called dedupe_result which will let me know whether i have decided to keep or purge itSo for each _key_out i want to set dedupe_result to 'keep' with the highest score_quality and the others to 'purge' but some may have the same score_quality so it should just set one of them to 'keep' and the rest to 'purge' |
|
ehorn
Master Smack Fu Yak Hacker
1632 Posts |
Posted - 2012-01-19 : 18:31:58
|
Hello haler, You will likely require an aggregate query which forces one of these rows out {i.e. MAX(col)}. Does the result set have any uniqueness (either a primary or composite key structure)?HTH. |
 |
|
haler
Starting Member
13 Posts |
Posted - 2012-01-19 : 18:49:06
|
there is a unique key _key_in and then _key_out which groups together the duplicates using _key_inI have this select statement which I THINK selects everything with the highest score but I am not very sure on how to translate it into an update statement, if i can get this into an update statement I suppose I can just set the remainder of the records to purgeSELECT Distinct(_key_out), quality_score, *FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1WHEREquality_score = (SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out) |
 |
|
ehorn
Master Smack Fu Yak Hacker
1632 Posts |
Posted - 2012-01-19 : 19:24:21
|
quote: Originally posted by haler there is a unique key _key_in and then _key_out which groups together the duplicates using _key_inI have this select statement which I THINK selects everything with the highest score but I am not very sure on how to translate it into an update statement, if i can get this into an update statement I suppose I can just set the remainder of the records to purgeSELECT Distinct(_key_out), quality_score, *FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1WHEREquality_score = (SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out)
Sounds like a good plan. If it helps, have a search on "update using joining tables". This will give you some sample structures for building such an update query.Best wishes. |
 |
|
haler
Starting Member
13 Posts |
Posted - 2012-01-19 : 19:49:10
|
I am having trouble with the updates statement if anyone could help me with restructuring this it would be much appreciated, this is what i have and i am getting the error: 'Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.'I sort of understand why i get the error but im not sure how to restructure the query to make it work.UPDATE [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]SET dedupe_result = 'keep'WHERE _key_out = (SELECT Distinct(_key_out)FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1WHEREquality_score = (SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out)) |
 |
|
ehorn
Master Smack Fu Yak Hacker
1632 Posts |
Posted - 2012-01-19 : 19:52:58
|
Perhaps something more like the following form:UPDATE bSET b.dedupe_result = 'keep'FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] bINNER JOIN( SELECT Distinct(_key_out) AS _key_out FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1 WHERE quality_score = ( SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out) )) d ON d._key_out = b._key_out HTH. |
 |
|
haler
Starting Member
13 Posts |
Posted - 2012-01-19 : 20:10:58
|
Thought that nearly did it for a second but for some reason it is setting everything to keep |
 |
|
ehorn
Master Smack Fu Yak Hacker
1632 Posts |
Posted - 2012-01-19 : 20:36:47
|
Perhaps you dont have uniquesness with just "_key_out". You mentioned is was a composite key of _key_out and quality_score which created uniqueness. If that is the case, then you must include both fields in the join condition. Something like:UPDATE bSET b.dedupe_result = 'keep'FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] bINNER JOIN( SELECT Distinct(_key_out) AS _key_out, quality_score FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] t1 WHERE quality_score = ( SELECT MAX(quality_score) FROM [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] WHERE _key_out = t1._key_out) )) d ON d._key_out = b._key_out AND d.quality_score = b.quality_score This is why we often need to join 2 or more tables to perform such an update. Because we have a composite key which defines the affected results.HTH. |
 |
|
haler
Starting Member
13 Posts |
Posted - 2012-01-19 : 21:47:43
|
_key_in is unique, _key_out is the duplicates so for example:_key_in , _key_out , name , score , dedupe1 , 1 , ron , 10 , purge2 , 1 , ronn , 15 , keep3 , 3 , john , 5 , keep4 , 4 , matt , 15 , keep5 , 4 , mat , 10 , purgeand i want to keep the _key_out with the higher score by setting the field de_dupe to 'keep' and the remainder to 'purge' edit: the score can also be the same within a duplicate so in the case it is the same i just need to keep one it doesnt matter which one. The query above nearly works perfect but it marks all duplicates with the same score as keep |
 |
|
haler
Starting Member
13 Posts |
Posted - 2012-01-22 : 20:50:00
|
got it working thanks ehorngomerge into [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] monusing ( select _key_in, case ROW_NUMBER() OVER (PARTITION BY _key_out ORDER BY quality_score desc) when 1 then 'keep' else 'purge' end new_dedupe from [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]) newON (mon._key_in = new._key_in )WHEN MATCHED THEN UPDATE SET mon.dedupe_result = new.new_dedupe; |
 |
|
ehorn
Master Smack Fu Yak Hacker
1632 Posts |
Posted - 2012-01-22 : 21:18:55
|
quote: Originally posted by haler got it working thanks ehorngomerge into [BusinessListings].[dbo].[MongoOrganisationACTM1Destination] monusing ( select _key_in, case ROW_NUMBER() OVER (PARTITION BY _key_out ORDER BY quality_score desc) when 1 then 'keep' else 'purge' end new_dedupe from [BusinessListings].[dbo].[MongoOrganisationACTM1Destination]) newON (mon._key_in = new._key_in )WHEN MATCHED THEN UPDATE SET mon.dedupe_result = new.new_dedupe;
Great to hear. Nicely done.Have a good evening. |
 |
|
|