Author |
Topic |
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-08 : 10:53:45
|
Hi SQL-Expert,
i hope you guys can hep me to find the fastest way possible. I have about 800 million rows of data, which are in need of a new flag. About once a week i get 10 million new rows.
I do need to find out which machines were busy and set a flag [is_busy=1]. Defintion for a machine to be busy is the following:
- partitioned (grouped) by group_id, section_id with a group-specific [gid_busy_limit] (which is joined from elsewhere)
- a row is part of a timeinterval of 4 minutes with >= [gid_busy_limit] entries
Currently there are no primary keys, but if very helpful i could add an autoincrement.
I am using sql-server 2014 (64)
For experiments i prepared a sample #demo table:
CREATE TABLE #demo ( [group_id] [smallint] NULL, [gid_busy_limit] [int] NULL, [section_id] [nvarchar](8) NULL, [time_id] [datetime2](6) NULL, [is_busy] [smallint] NULL ) ON [PRIMARY]
GO INSERT #demo ([group_id], [gid_busy_limit], [section_id], [time_id], [is_busy]) VALUES (7, 4, N'00001047', CAST(N'2014-11-26 11:41:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:44:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:46:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:47:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:48:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:50:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 12:21:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:25:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:47:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:50:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:51:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:54:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:42:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:44:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:56:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:57:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:59:00.0000000' AS DateTime2), NULL) GO
Select * from #demo go
If solved the above table would look like this
(7, 4, N'00001047', CAST(N'2014-11-26 11:41:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:44:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:46:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:47:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:48:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:50:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001047', CAST(N'2014-11-26 12:21:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:25:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:47:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:50:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:51:00.0000000' AS DateTime2), 1) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:54:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), 1) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), 1) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), 1) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), 1) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:42:00.0000000' AS DateTime2), 1) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:44:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:55:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:57:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:59:00.0000000' AS DateTime2), NULL)
I hope i could describe the problem well enough. Any help or suggestions are much appreciated.
Thank you in advance
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2015-01-08 : 12:32:07
|
You can use the LAG function to solve this really quickly.
Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
 |
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-09 : 05:40:54
|
quote: Originally posted by SwePeso
You can use the LAG function to solve this really quickly.
Hi SwePeso,
many thanks for your suggestion. I recognized lag/lead from MDX and was surprised, I did not realise, that it is now supported by SQL 2014. Fine.
I also think that this function looks promising. But only for the first round.
I do have the following problems: 1.) I can calculate from one row the lag([gid_busy_limit]) of some (int)time_id i create, and so i will find that this row is part of a bunch of rows that are busy. But how do i calculate the rows in between the start and end row of this bunch of rows?
2.) the gid_busy_limit currently goes as high as 10 so i do not think it is an option to put every possible combination of lag(1-9) into extra columns and calculate every option. 2.a) Adding to this i would have to probably use lag(1-9) and lead(1-9) in combination to find the inbetween rows.
Do you have more insights to this, as to how to implement this ?
Thank you, again.
|
 |
|
gbritton
Master Smack Fu Yak Hacker
2780 Posts |
Posted - 2015-01-09 : 06:53:40
|
if the group of rose forms a partition, then you can use the first value and last value functions to access the first and last rows |
 |
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-12 : 05:37:24
|
quote: Originally posted by SwePeso
You can use the LAG function to solve this really quickly.
I did try to solve this with lag() and saved one join, but i am not very satisfied with the code. I would be pleased if you have even more ideas to improve the code?
The time '2010-01-01 00:00:01' is arbitrary. All timecodes are currently later than 2011-01-01
Particularly pittyful is that i have overlapping timespans which are updated multiple times.
Here is what i did: (it already shows a better result than i did writing the example down ;-)
select [group_id] ,[section_id] ,dateadd(ss,lagx,'2010-01-01 00:00:01') as starttime_id ,[time_id] into #hitlist from (
SELECT [group_id] ,[gid_busy_limit] ,[section_id] ,[time_id] , datediff(ss, '2010-01-01 00:00:01', time_id) as sec2now , LAG(datediff(ss, '2010-01-01 00:00:01', time_id),gid_busy_limit-1) over (partition by group_id, section_id order by time_id) as lagx
FROM #demo ) as a where sec2now - lagx < = 240 go
select group_id, section_id, starttime_id ,time_id from #hitlist go
update a set a.is_busy = 1 from #demo as a left join #hitlist as b on a.group_id=b.group_id and a.section_id = b.section_id where a.time_id>=starttime_id and a.time_id <=b.time_id go
select * from #demo go
Sorry, if it becomes obvious that i am not the most experienced sql-developer. |
 |
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-12 : 05:41:22
|
quote: Originally posted by gbritton
if the group of rose forms a partition, then you can use the first value and last value functions to access the first and last rows
Hi gbritton, thank you for your input. I do not understand how to use first/last value to my advantage. Can you give me more insight to the way you think this might help? |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2015-01-12 : 07:03:09
|
[code]CREATE TABLE #demo ( [group_id] [smallint] NULL, [gid_busy_limit] [int] NULL, [section_id] [nvarchar](8) NULL, [time_id] [datetime2](6) NULL, [is_busy] [smallint] NULL ) GO
INSERT #demo ([group_id], [gid_busy_limit], [section_id], [time_id], [is_busy]) VALUES (7, 4, N'00001047', CAST(N'2014-11-26 11:41:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:44:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:46:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:47:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:48:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 11:50:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001047', CAST(N'2014-11-26 12:21:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:25:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:47:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:50:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:51:00.0000000' AS DateTime2), NULL) ,(7, 4, N'00001242', CAST(N'2014-11-26 10:54:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:42:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002004', CAST(N'2014-11-26 15:44:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:56:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:57:00.0000000' AS DateTime2), NULL) ,(9, 3, N'00002010', CAST(N'2014-11-26 09:59:00.0000000' AS DateTime2), NULL) GO
-- SwePeso WITH cteSource AS ( SELECT Group_ID, Section_ID, DATEADD(SECOND, GID_Busy_Limit, Time_ID) AS ReservedTime, Time_ID, Is_Busy FROM #Demo ), cteTarget AS ( SELECT Is_Busy, CASE WHEN LAG(ReservedTime, 1, '99991231') OVER (PARTITION BY Group_ID, Section_ID ORDER BY Time_ID) < Time_ID THEN 1 ELSE 0 END AS Flag FROM cteSource ) UPDATE cteTarget SET Is_Busy = Flag;
UPDATE d SET d.Is_Busy = w.Flag FROM #Demo AS d INNER JOIN ( SELECT Group_ID, Section_ID, Time_ID, MAX(Is_Busy) AS Flag FROM #demo GROUP BY Group_ID, Section_ID, Time_ID HAVING COUNT(*) >= 2 ) AS w ON w.group_id = d.group_id AND w.section_id = d.section_id AND w.time_id = d.time_id;
SELECT * FROM #Demo;
DROP TABLE #demo[/code]
Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
 |
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-13 : 05:17:11
|
quote: Originally posted by SwePeso WITH cteSource
Thank you. I dont get it to work. This query has wrong results. Some wright ones are missing and some wrong ones are flagged. The DATEADD(SECOND, GID_Busy_Limit, Time_ID) AS ReservedTime, would (in this example) add 3 or 4 seconds to the time_id but GID_Busy_Limit stands for the number of machines in the timespan of 240 seconds. I could not find any reference to the 240 seconds nor to the number of rows contained in the timespan in the query.
I am also unsure about the use of CTE with the update. Would i produce update is_busy=0 and extra performance loss or wont this happen?
Did i misunderstand this or is something missing? |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2015-01-13 : 17:18:24
|
Change SECOND to MINUTE in the DATEADD function. And then make the calculation work according to your expectations.
Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
 |
|
ScottPletcher
Aged Yak Warrior
550 Posts |
Posted - 2015-01-14 : 13:13:55
|
The clustered index for the table should be on: ( Group_ID, Section_ID, Time_ID ) NOT on identity. |
 |
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-19 : 14:07:07
|
quote: Originally posted by SwePeso
Change SECOND to MINUTE in the DATEADD function. And then make the calculation work according to your expectations.
Yes, i tried several alternatives of your code, but the main problem always stays. The rows in between the first and the last of a sequence wont be found with those routines.
So far i have only a brute-force routine which takes ages to update. If you find a way to enhance your routine to find the in-between rows - i am still open for ideas.
Anyway, thank you for trying. |
 |
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-19 : 19:45:16
|
What do you think of the following working soulution:
* I do first set is_busy wherever the timespan is valid * is_busy startingamount is gid_busy_limit * Then i loop through the table counting is_busy down to 1 * doubles are found
The first tests were quite fast and the results correct !
;with tbl as (SELECT [section_id] ,[is_busy] ,[group_id] ,[time_id] ,gid_busy_limit ,DATEADD(SECOND,240,[time_id]) as endtime ,lead(time_id,gid_busy_limit-1) OVER (PARTITION BY group_id,section_id ORDER BY time_id) as endentry FROM #demo ) update tbl set is_busy=gid_busy_limit where endentry <= endtime go
select 42 WHILE (@@ROWCOUNT > 0) BEGIN with tbl as ( SELECT [is_busy] ,lag(is_busy,1) OVER (PARTITION BY group_id, section_id ORDER BY time_id, is_busy desc) as prevBusy FROM #demo ) Update tbl set is_busy = prevBusy -1 where prevBusy -1 > isnull(is_busy,0) and prevBusy > 1 and isnull(is_busy,0)=0 END
SELECT * from #demo go
|
 |
|
|
|
|