-
-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Database Deadlocks with MSSQL 2019 #348
Comments
I do not know what is causing this. I'm surprised a deadlock is happening this frequent. Would you list your db-scheduler config? (the spring boot properties) |
Similar to #327 |
This might be related to issue discussed here: https://stackoverflow.com/a/59196002 |
This is our config. We have been using it for more than two years now without any problem. Only when the amount of tasks increased we had the deadlock problem. Roughly around 300 to 400 tasks per minute are the threshold when the problem starts to appear. db-scheduler:
delay-startup-until-context-ready: true
immediate-execution-enabled: true
polling-interval: 10s
polling-limit: 5
threads: 2 I have also attached the MSSQL deadlock report: <event name="xml_deadlock_report" package="sqlserver" timestamp="2022-12-07T09:01:29.190Z">
<data name="xml_report">
<value>
<deadlock>
<victim-list>
<victimProcess id="process26c6d7ab468" />
</victim-list>
<process-list>
<process id="process26c6d7ab468" taskpriority="0" logused="1004"
waitresource="KEY: 9:72057594058113024 (4c22163c40c7)" waittime="4657" ownerId="5421774626"
transactionname="UPDATE" lasttranstarted="2022-12-07T10:01:24.530" XDES="0x261abad0428"
lockMode="U" schedulerid="7" kpid="17324" status="suspended" spid="61" sbid="0" ecid="0"
priority="0" trancount="2" lastbatchstarted="2022-12-07T10:01:24.530"
lastbatchcompleted="2022-12-07T10:01:24.523" lastattention="1900-01-01T00:00:00.523"
clientapp="Microsoft JDBC Driver for SQL Server" hostname="workflow-engine-28-cfvv7" hostpid="0"
loginname="db-user" isolationlevel="read committed (2)" xactid="5421774626" currentdb="9"
currentdbname="DB" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack>
<frame procname="adhoc" line="1" stmtstart="198" stmtend="560"
sqlhandle="0x0200000056b25017aaa8a864d1f0a432b6a123b48afea3c30000000000000000000000000000000000000000">
unknown</frame>
<frame procname="unknown" line="1"
sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown</frame>
</executionStack>
<inputbuf>
(@P0 bit,@P1 nvarchar(4000),@P2 datetime2,@P3 bit,@P4 nvarchar(4000),@P5 nvarchar(4000),@P6
bigint)
update scheduled_tasks
set picked = @P0, picked_by = @P1, last_heartbeat = @P2, version = version + 1
where picked = @P3 and task_name = @P4 and task_instance = @P5 and version = @P6
</inputbuf>
</process>
<process id="process241cd639468" taskpriority="0" logused="1352"
waitresource="KEY: 9:72057594046775296 (f692f3e2bbb3)" waittime="4657" ownerId="5421774627"
transactionname="UPDATE" lasttranstarted="2022-12-07T10:01:24.530" XDES="0x26433a14428"
lockMode="U" schedulerid="14" kpid="13960" status="suspended" spid="56" sbid="0" ecid="0"
priority="0" trancount="2" lastbatchstarted="2022-12-07T10:01:24.530"
lastbatchcompleted="2022-12-07T10:01:24.527" lastattention="1900-01-01T00:00:00.527"
clientapp="Microsoft JDBC Driver for SQL Server" hostname="workflow-engine-28-cfvv7" hostpid="0"
loginname="db-user" isolationlevel="read committed (2)" xactid="5421774627" currentdb="9"
currentdbname="DB" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
<executionStack>
<frame procname="adhoc" line="1" stmtstart="324" stmtend="868"
sqlhandle="0x02000000008cbf296485126f49222b7d62069155af5cd8830000000000000000000000000000000000000000">
unknown </frame>
<frame procname="unknown" line="1"
sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
unknown </frame>
</executionStack>
<inputbuf>
(@P0 bit,@P1 nvarchar(4000),@P2 datetime2,@P3 datetime2,@P4 datetime2,@P5 int,@P6
datetime2,@P7 varbinary(8000),@P8 nvarchar(4000),@P9 nvarchar(4000),@P10 bigint)
update scheduled_tasks
set picked = @P0, picked_by = @P1, last_heartbeat = @P2, last_success = @P3, last_failure =
@P4, consecutive_failures = @P5, execution_time = @P6, task_data = @P7, version = version +
1
where task_name = @P8 and task_instance = @P9 and version = @P10
</inputbuf>
</process>
</process-list>
<resource-list>
<keylock hobtid="72057594058113024" dbid="9" objectname="DB.workflow.scheduled_tasks"
indexname="execution_time_idx" id="lock2646d22ce00" mode="X"
associatedObjectId="72057594058113024">
<owner-list>
<owner id="process241cd639468" mode="X" />
</owner-list>
<waiter-list>
<waiter id="process26c6d7ab468" mode="U" requestType="wait" />
</waiter-list>
</keylock>
<keylock hobtid="72057594046775296" dbid="9" objectname="DB.workflow.scheduled_tasks"
indexname="PK__schedule__0BE25508E6B4E257" id="lock23e68e7aa80" mode="X"
associatedObjectId="72057594046775296">
<owner-list>
<owner id="process26c6d7ab468" mode="X" />
</owner-list>
<waiter-list>
<waiter id="process241cd639468" mode="U" requestType="wait" />
</waiter-list>
</keylock>
</resource-list>
</deadlock>
</value>
</data>
</event> |
Would you also paste your table and index definitions for the |
I have no experience with MSSQL, but I am curious if it makes any difference if we make the primary key |
The initial table definition was without the indexes: create table workflow.scheduled_tasks (
task_name varchar(250) not null,
task_instance varchar(250) not null,
task_data nvarchar(max),
execution_time datetimeoffset not null,
picked bit,
picked_by text,
last_success datetimeoffset ,
last_failure datetimeoffset ,
consecutive_failures INT,
last_heartbeat datetimeoffset ,
[version] BIGINT not null,
PRIMARY KEY (task_name, task_instance)
) We later added both indexes when we updated the library version and saw the change in the release notes: CREATE INDEX execution_time_idx ON workflow.scheduled_tasks (execution_time);
CREATE INDEX last_heartbeat_idx ON workflow.scheduled_tasks (last_heartbeat); |
Sounds good in theory. We will try to test this on a UAT instance and see if we can reproduce the problem. |
A quick update from my experiments. The non-clustered primary key surprisingly has worsened the situation and leads to a lot more deadlocks on my local MSSQL instance compared to the clustered primary. |
Not sure but lock-and-fetch for MSSQL possibly needs |
I was able to reproduce locally when using AzureSqlEdge, but the problem went away when I added Not getting the problem for the new lock-and-fetch for mssql. See #371 |
…unner transaction support. (#371) * Adding support for `lock-and-fetch` aka. `SELECT FOR UPDATE .. SKIP LOCKED` for MSSQL/SqlServer. Though testing has shown that lock-and-fetch are prone to deadlocks, so it is not recommended until that is understood/resolved. However, a query-hint was also added to `fetch-and-lock-on-execute` for MSSQL, and it appears this resolves issues with deadlocks for that strategy. * Updates jdbc-handling dependency `micro-jdbc` * Re-written transaction-handling (in `micro-jdbc`) * Shading `micro-jdbc` dependency * Explicit limit also for SqlServer, MySQL, Oracle * Adds ClusterTest for Sql Server (test concurrency) ## Fixes * #264 * #337 * #348 ? * #273 ? ## Further work / fix later * Mysql v8 skip locked syntax * MariaDB skip locked syntax * Oracle skip locked syntax ## Reminders - [x] Added/ran automated tests - [x] Update README and/or examples --- cc @kagkarlsson
Closing as I am hoping this was fixed with the new query-hints for SQL Server |
Prerequisites
For bug reports
When we execute around 300 - 400 tasks per minute, we experience database deadlocks on the table
scheduled_tasks
.The deadlock report from the MSSQL database shows a deadlock between either one of the indexes
execution_time_idx
orlast_heartbeat_idx
and the clustered primary key of the table.The deadlock happens when a Custom-Task tries to reschedule itself for another execution (
PollingTaskConfiguration:42
in the stacktrace below) withexecutionOperations.reschedule
.Workaround
For now we have deleted both indexes to stop the deadlocks from happening (so far successfully), but probably at the cost of reduced throughput.
Steps to Reproduce the bug
Currently unclear
Context
Logs
The text was updated successfully, but these errors were encountered: