#{stepExecutionContext[''] value same among partitioned threads on batch restart/resubmit. #4621

Open
revewo opened this issue Jun 18, 2024 · 0 comments
Labels
status: waiting-for-triage Issues that we did not analyse yet type: bug

Bug description
When a batch has 3 partitions, we expect each partition to have its own unique execution context, both on the first run and on a restart. However, on a restart (same job parameters), 2 different partitions end up with the same execution context value.

Environment
spring-boot-starter-parent - 3.2.0
Java 17

Steps to reproduce
I created a small POC that reproduces the issue. The steps are listed below.

Expected behavior
Given a batch with 3 partitions (all 3 having unique stepExecutionContext[''] values) where 2 partitions fail with runtime exceptions on the first run, the restart run must execute only those 2 failed partitions, and the stepExecutionContext[''] value must be unique across those 2 partitions.
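
To make the expectation concrete, here is a minimal plain-Java sketch that deliberately does not use Spring Batch. It only models the data: one stored value per partition on the first run, and the subset that should be seen again on restart. The class name, method names, and partition names (partition0 to partition2) are hypothetical and not taken from the POC. The choice of which two partitions failed is also an assumption, made to match the sample logs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class PartitionRestartExpectation {

    // Key name as it appears in the log message of this report.
    static final String KEY = "unique-to-partition";

    /** Models the first run: one distinct value per partition (names are hypothetical). */
    static Map<String, String> firstRunContexts(int gridSize) {
        Map<String, String> contexts = new LinkedHashMap<>();
        for (int i = 0; i < gridSize; i++) {
            contexts.put("partition" + i, "value" + (i + 1));
        }
        return contexts;
    }

    /** Expected restart: only failed partitions re-run, each keeping its own stored value. */
    static Map<String, String> expectedRestartContexts(Map<String, String> stored,
                                                       Set<String> failedPartitions) {
        Map<String, String> restart = new TreeMap<>();
        for (String name : failedPartitions) {
            restart.put(name, stored.get(name));
        }
        return restart;
    }

    public static void main(String[] args) {
        Map<String, String> stored = firstRunContexts(3);
        // Assumption: the partitions holding value1 and value3 are the ones that failed.
        Map<String, String> restart =
                expectedRestartContexts(stored, Set.of("partition0", "partition2"));
        restart.forEach((name, value) ->
                System.out.println(name + " -> " + KEY + "=" + value));
    }
}
```

Under these assumptions the restart would see two different values, one per failed partition, instead of the same value twice.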

Minimal Complete Reproducible example

POC available at https://github.com/revewo/spring-batch-partition-issue-poc.

Configuration required in POC to replicate the issue:

  1. In src/main/resources/application.properties, change db.url to suit the operating system you are running this code on. The database file is created on the first batch run. Persistence is needed to simulate the restart issue, so an in-memory database cannot be used.
  2. Comment or uncomment lines 27 to 29 in com.example.batchprocessing.ThirdTasklet to replicate the issue.

Details / Steps to reproduce the issue

  1. On a successful execution, everything runs fine (leave the 'RuntimeException' inside com.example.batchprocessing.ThirdTasklet#execute commented out and run the batch). The line 'Inside ThirdTasklet. unique-to-partition is {}' is printed 3 times in the logs, once per partition, each with a unique value for unique-to-partition.

Now, to simulate the issue:

  2. Suppose the business logic in 2 of the 3 partitions fails at runtime because of a database issue (an issue with the business tables, unrelated to the Spring Batch tables). To simulate this, uncomment the code in com.example.batchprocessing.ThirdTasklet#execute that throws the exception, and provide a new runIdentifier in com.example.batchprocessing.BatchProcessingApplication#run when running the batch.

In this failed run, the log line 'Inside ThirdTasklet. unique-to-partition is {}' is printed to the console 3 times, each ending with a unique value (1, 2, 3), just as in the successful run in step 1 above.

  3. Assume our database expert has fixed the database issue. On resubmit, we need only the two failed partitions to resume, and the value of the key 'unique-to-partition' must be unique across those 2 partitions (our business logic requires each partition to provide this unique value to the database).
    So we resubmit the batch with the same runIdentifier that we provided for the failed run in step 2.
    The batch now completes, but the log line "Inside ThirdTasklet. unique-to-partition is {}" is printed twice, and both lines end with the same value (either 1 or 3) instead of two unique values.

These 2 partitions must have unique values for unique-to-partition on resubmit, just as they do on a run without any exception. Because the values are not unique on resubmit, our business logic behaves differently during the restart run.

I have also attached sample logs in case you don't want to try out the example code.

On a run where the exception is thrown

2024-06-17T11:34:38.104+02:00  INFO 11452 --- [   TPTETHREAD-3] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value2
2024-06-17T11:34:39.399+02:00  INFO 11452 --- [   TPTETHREAD-2] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value3
2024-06-17T11:34:39.823+02:00  INFO 11452 --- [   TPTETHREAD-1] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value1

On the resubmit run (after fixing that exception)

2024-06-17T11:35:13.079+02:00  INFO 16936 --- [   TPTETHREAD-2] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value3
2024-06-17T11:35:13.487+02:00  INFO 16936 --- [   TPTETHREAD-1] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value3
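
The duplicate can also be detected mechanically from the log output. The following is a rough sketch in plain Java, not part of the POC. It assumes only the message format shown in the logs above; the class and method names are hypothetical, and the log lines are trimmed to thread name and message.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicatePartitionValueCheck {

    // Matches the value at the end of the ThirdTasklet log message.
    private static final Pattern VALUE = Pattern.compile("unique-to-partition is (\\S+)");

    // Log lines from this report, trimmed to thread name and message.
    static final List<String> FIRST_RUN = List.of(
            "[TPTETHREAD-3] Inside ThirdTasklet. unique-to-partition is value2",
            "[TPTETHREAD-2] Inside ThirdTasklet. unique-to-partition is value3",
            "[TPTETHREAD-1] Inside ThirdTasklet. unique-to-partition is value1");

    static final List<String> RESTART_RUN = List.of(
            "[TPTETHREAD-2] Inside ThirdTasklet. unique-to-partition is value3",
            "[TPTETHREAD-1] Inside ThirdTasklet. unique-to-partition is value3");

    /** Extracts the logged value from every line that contains the message. */
    static List<String> extractValues(List<String> logLines) {
        List<String> values = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = VALUE.matcher(line);
            if (m.find()) {
                values.add(m.group(1));
            }
        }
        return values;
    }

    /** Returns every value that was logged by more than one partition. */
    static Set<String> duplicates(List<String> values) {
        Set<String> seen = new HashSet<>();
        Set<String> dupes = new TreeSet<>();
        for (String value : values) {
            if (!seen.add(value)) {
                dupes.add(value);
            }
        }
        return dupes;
    }

    public static void main(String[] args) {
        System.out.println("first run duplicates: " + duplicates(extractValues(FIRST_RUN)));
        System.out.println("restart duplicates:   " + duplicates(extractValues(RESTART_RUN)));
    }
}
```

For the lines above, the first run yields no duplicates, while the resubmit run reports value3 as shared by both partitions.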