Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database Deadlocks with MSSQL 2019 #348

Closed
2 of 3 tasks
rafaelhofmann opened this issue Dec 20, 2022 · 12 comments
Closed
2 of 3 tasks

Database Deadlocks with MSSQL 2019 #348

rafaelhofmann opened this issue Dec 20, 2022 · 12 comments

Comments

@rafaelhofmann
Copy link

rafaelhofmann commented Dec 20, 2022

Prerequisites

  • I am running the latest version (We are running 11.5)
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed

For bug reports

When we execute around 300 - 400 tasks per minute, we experience database deadlocks on the table scheduled_tasks.
The deadlock report from the MSSQL database shows a deadlock between either one of the indexes execution_time_idx or last_heartbeat_idx and the clustered primary key of the table.

The deadlock happens when a Custom-Task tries to reschedule itself for another execution (PollingTaskConfiguration:42 in the stacktrace below) with executionOperations.reschedule.

Workaround

For now we have deleted both indexes to stop the deadlocks from happening (so far successfully), but probably at the cost of reduced throughput.

Steps to Reproduce the bug

Currently unclear

Context

  • DB-Scheduler Version : 11.5
  • Java Version : 1.8
  • Spring Boot (check for Yes) : [x]
  • Database and Version : Microsoft SQL Server 2019
  • Database Transaction Isolation Level: Read Committed Snapshot

Logs


Failed while completing execution Execution: task=polling-task, id=polling-task-155175, executionTime=2022-12-17T23:28:53.936Z, picked=true, pickedBy=workflow-engine-36-v6vm9, lastHeartbeat=2022-12-17T23:29:03.339Z, version=642. Execution will likely remain scheduled and locked/picked. The execution should be detected as dead after a while, and handled according to the tasks DeadExecutionHandler.
                com.github.kagkarlsson.jdbc.SQLRuntimeException: com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 71) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
                at com.github.kagkarlsson.jdbc.JdbcRunner.translateException(JdbcRunner.java:126)
                at com.github.kagkarlsson.jdbc.JdbcRunner.lambda$execute$2(JdbcRunner.java:93)
                at com.github.kagkarlsson.jdbc.JdbcRunner.withConnection(JdbcRunner.java:140)
                at com.github.kagkarlsson.jdbc.JdbcRunner.execute(JdbcRunner.java:66)
                at com.github.kagkarlsson.jdbc.JdbcRunner.execute(JdbcRunner.java:54)
                at com.github.kagkarlsson.scheduler.jdbc.JdbcTaskRepository.rescheduleInternal(JdbcTaskRepository.java:251)
                at com.github.kagkarlsson.scheduler.jdbc.JdbcTaskRepository.reschedule(JdbcTaskRepository.java:247)
                at com.github.kagkarlsson.scheduler.task.ExecutionOperations.reschedule(ExecutionOperations.java:62)
                at XXX.PollingTaskConfiguration.lambda$null$1(PollingTaskConfiguration.java:42)
                at com.github.kagkarlsson.scheduler.ExecutePicked.complete(ExecutePicked.java:104)
                at com.github.kagkarlsson.scheduler.ExecutePicked.executePickedExecution(ExecutePicked.java:88)
                at com.github.kagkarlsson.scheduler.ExecutePicked.run(ExecutePicked.java:68)
                at com.github.kagkarlsson.scheduler.FetchCandidates.lambda$run$1(FetchCandidates.java:89)
                at java.util.Optional.ifPresent(Optional.java:159)
                at com.github.kagkarlsson.scheduler.FetchCandidates.lambda$run$2(FetchCandidates.java:87)
                at com.github.kagkarlsson.scheduler.Executor.lambda$addToQueue$0(Executor.java:52)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:750)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 71) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
                at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265)
                at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1676)
                at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:615)
                at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:537)
                at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7730)
                at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3786)
                at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:268)
                at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:242)
                at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.execute(SQLServerPreparedStatement.java:515)
                at com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44)
                at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java)
                at com.github.kagkarlsson.jdbc.JdbcRunner.lambda$execute$2(JdbcRunner.java:85)
                ... 17 common frames omitted
@kagkarlsson
Copy link
Owner

I do not know what is causing this. I'm surprised a deadlock is happening this frequent. Would you list your db-scheduler config? (the spring boot properties)

@kagkarlsson
Copy link
Owner

Similar to #327

@kagkarlsson
Copy link
Owner

This might be related to issue discussed here: https://stackoverflow.com/a/59196002
MSSQL leaving locks on rows it traverses to find the row to update

@rafaelhofmann
Copy link
Author

This is our config. We have been using it for more than two years now without any problem. Only when the amount of tasks increased we had the deadlock problem. Roughly around 300 to 400 tasks per minute are the threshold when the problem starts to appear.

db-scheduler:
  delay-startup-until-context-ready: true
  immediate-execution-enabled: true
  polling-interval: 10s
  polling-limit: 5
  threads: 2

I have also attached the MSSQL deadlock report:

<event name="xml_deadlock_report" package="sqlserver" timestamp="2022-12-07T09:01:29.190Z">
    <data name="xml_report">
        <value>
            <deadlock>
                <victim-list>
                    <victimProcess id="process26c6d7ab468" />
                </victim-list>
                <process-list>
                    <process id="process26c6d7ab468" taskpriority="0" logused="1004"
                        waitresource="KEY: 9:72057594058113024 (4c22163c40c7)" waittime="4657" ownerId="5421774626"
                        transactionname="UPDATE" lasttranstarted="2022-12-07T10:01:24.530" XDES="0x261abad0428"
                        lockMode="U" schedulerid="7" kpid="17324" status="suspended" spid="61" sbid="0" ecid="0"
                        priority="0" trancount="2" lastbatchstarted="2022-12-07T10:01:24.530"
                        lastbatchcompleted="2022-12-07T10:01:24.523" lastattention="1900-01-01T00:00:00.523"
                        clientapp="Microsoft JDBC Driver for SQL Server" hostname="workflow-engine-28-cfvv7" hostpid="0"
                        loginname="db-user" isolationlevel="read committed (2)" xactid="5421774626" currentdb="9"
                        currentdbname="DB" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
                        <executionStack>
                            <frame procname="adhoc" line="1" stmtstart="198" stmtend="560"
                                sqlhandle="0x0200000056b25017aaa8a864d1f0a432b6a123b48afea3c30000000000000000000000000000000000000000">
                                unknown</frame>
                            <frame procname="unknown" line="1"
                                sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
                                unknown</frame>
                        </executionStack>
                        <inputbuf>
                            (@P0 bit,@P1 nvarchar(4000),@P2 datetime2,@P3 bit,@P4 nvarchar(4000),@P5 nvarchar(4000),@P6
                            bigint)
                            update scheduled_tasks
                            set picked = @P0, picked_by = @P1, last_heartbeat = @P2, version = version + 1
                            where picked = @P3 and task_name = @P4 and task_instance = @P5 and version = @P6
                        </inputbuf>
                    </process>
                    <process id="process241cd639468" taskpriority="0" logused="1352"
                        waitresource="KEY: 9:72057594046775296 (f692f3e2bbb3)" waittime="4657" ownerId="5421774627"
                        transactionname="UPDATE" lasttranstarted="2022-12-07T10:01:24.530" XDES="0x26433a14428"
                        lockMode="U" schedulerid="14" kpid="13960" status="suspended" spid="56" sbid="0" ecid="0"
                        priority="0" trancount="2" lastbatchstarted="2022-12-07T10:01:24.530"
                        lastbatchcompleted="2022-12-07T10:01:24.527" lastattention="1900-01-01T00:00:00.527"
                        clientapp="Microsoft JDBC Driver for SQL Server" hostname="workflow-engine-28-cfvv7" hostpid="0"
                        loginname="db-user" isolationlevel="read committed (2)" xactid="5421774627" currentdb="9"
                        currentdbname="DB" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
                        <executionStack>
                            <frame procname="adhoc" line="1" stmtstart="324" stmtend="868"
                                sqlhandle="0x02000000008cbf296485126f49222b7d62069155af5cd8830000000000000000000000000000000000000000">
                                unknown </frame>
                            <frame procname="unknown" line="1"
                                sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">
                                unknown </frame>
                        </executionStack>
                        <inputbuf>
                            (@P0 bit,@P1 nvarchar(4000),@P2 datetime2,@P3 datetime2,@P4 datetime2,@P5 int,@P6
                            datetime2,@P7 varbinary(8000),@P8 nvarchar(4000),@P9 nvarchar(4000),@P10 bigint)

                            update scheduled_tasks

                            set picked = @P0, picked_by = @P1, last_heartbeat = @P2, last_success = @P3, last_failure =
                            @P4, consecutive_failures = @P5, execution_time = @P6, task_data = @P7, version = version +
                            1

                            where task_name = @P8 and task_instance = @P9 and version = @P10
                        </inputbuf>
                    </process>
                </process-list>
                <resource-list>
                    <keylock hobtid="72057594058113024" dbid="9" objectname="DB.workflow.scheduled_tasks"
                        indexname="execution_time_idx" id="lock2646d22ce00" mode="X"
                        associatedObjectId="72057594058113024">
                        <owner-list>
                            <owner id="process241cd639468" mode="X" />
                        </owner-list>
                        <waiter-list>
                            <waiter id="process26c6d7ab468" mode="U" requestType="wait" />
                        </waiter-list>
                    </keylock>
                    <keylock hobtid="72057594046775296" dbid="9" objectname="DB.workflow.scheduled_tasks"
                        indexname="PK__schedule__0BE25508E6B4E257" id="lock23e68e7aa80" mode="X"
                        associatedObjectId="72057594046775296">
                        <owner-list>
                            <owner id="process26c6d7ab468" mode="X" />
                        </owner-list>
                        <waiter-list>
                            <waiter id="process241cd639468" mode="U" requestType="wait" />
                        </waiter-list>
                    </keylock>
                </resource-list>
            </deadlock>
        </value>
    </data>
</event>

@kagkarlsson
Copy link
Owner

Would you also paste your table and index definitions for the scheduled_tasks table?

@kagkarlsson
Copy link
Owner

I have no experience with MSSQL, but I am curious if it makes any difference if we make the primary key NONCLUSTERED?

@rafaelhofmann
Copy link
Author

Would you also paste your table and index definitions for the scheduled_tasks table?

The initial table definition was without the indexes:

create table workflow.scheduled_tasks (
  task_name varchar(250) not null,
  task_instance varchar(250) not null,
  task_data  nvarchar(max),
  execution_time datetimeoffset  not null,
  picked bit,
  picked_by text,
  last_success datetimeoffset ,
  last_failure datetimeoffset ,
  consecutive_failures INT,
  last_heartbeat datetimeoffset ,
  [version] BIGINT not null,
  PRIMARY KEY (task_name, task_instance)
)

We later added both indexes when we updated the library version and saw the change in the release notes:

CREATE INDEX execution_time_idx ON workflow.scheduled_tasks (execution_time);
CREATE INDEX last_heartbeat_idx ON workflow.scheduled_tasks (last_heartbeat);

@rafaelhofmann
Copy link
Author

I have no experience with MSSQL, but I am curious if it makes any difference if we make the primary key NONCLUSTERED?

Sounds good in theory. We will try to test this on a UAT instance and see if we can reproduce the problem.

@rafaelhofmann
Copy link
Author

A quick update from my experiments. The non-clustered primary key surprisingly has worsened the situation and leads to a lot more deadlocks on my local MSSQL instance compared to the clustered primary.
Currently I am experimenting with lock hints (more precise: READPAST, ROWLOCK, and UPDLOCK) for the SELECT and UPDATE queries and implementing the lock-and-fetch strategy for MssqlJdbcCustomization.

@kagkarlsson
Copy link
Owner

Not sure but lock-and-fetch for MSSQL possibly needs JdbcRunner to support transaction spanning multiple statement

@kagkarlsson
Copy link
Owner

I was able to reproduce locally when using AzureSqlEdge, but the problem went away when I added with (readpast) to the select due query.

Not getting the problem for the new lock-and-fetch for mssql.

See #371

kagkarlsson added a commit that referenced this issue Oct 25, 2023
…unner transaction support. (#371)

* Adding support for `lock-and-fetch` aka. `SELECT FOR UPDATE .. SKIP
LOCKED` for MSSQL/SqlServer. Though testing has shown that
lock-and-fetch are prone to deadlocks, so it is not recommended until
that is understood/resolved. However, a query-hint was also added to
`fetch-and-lock-on-execute` for MSSQL, and it appears this resolves
issues with deadlocks for that strategy.
* Updates jdbc-handling dependency `micro-jdbc`
* Re-written transaction-handling (in `micro-jdbc`)
* Shading `micro-jdbc` dependency
* Explicit limit also for SqlServer, MySQL, Oracle
* Adds ClusterTest for Sql Server (test concurrency)

## Fixes

* #264
* #337
* #348 ?
* #273 ?

## Further work / fix later

* Mysql v8 skip locked syntax
* MariaDB skip locked syntax
* Oracle skip locked syntax

## Reminders
- [x] Added/ran automated tests
- [x] Update README and/or examples

---
cc @kagkarlsson
@kagkarlsson
Copy link
Owner

Closing as I am hoping this was fixed with the new query-hints for SQL Server

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants