
[SPARK-49411][SS] Communicate CheckpointID between driver and stateful operators #47895

Open · wants to merge 11 commits into master

Conversation

@siying (Contributor) commented Aug 27, 2024

What changes were proposed in this pull request?

This is an incremental step to implement RocksDB state store checkpoint format V2.

Once the conf STATE_STORE_CHECKPOINT_FORMAT_VERSION is set to version 2 or higher, the executor returns the checkpointID to the driver (only done for RocksDB). The driver stores it locally. For the next batch, the checkpointID is sent to the executor to be used to load the state store. If the executor's locally loaded state doesn't match the uniqueID, it reloads from the checkpoint.

There is no behavior change if the default checkpoint format is used.
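For context, a minimal sketch of how a query would opt into the new format; the conf key shown is an assumption based on the STATE_STORE_CHECKPOINT_FORMAT_VERSION name above and may differ from the final implementation:

// Hypothetical opt-in; per this PR it only has an effect for RocksDB-backed state stores.
spark.conf.set("spark.sql.streaming.stateStore.checkpointFormatVersion", "2")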

Why are the changes needed?

This is an incremental step in the project to introduce a new RocksDB state store checkpoint format. The new format simplifies the checkpoint mechanism to make it less bug-prone, and fixes some unexpected query results in rare queries.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test is added to cover the format version, and another unit test is added to validate that the uniqueID is passed back and forth as expected.

Was this patch authored or co-authored using generative AI tooling?

No

@siying siying marked this pull request as draft August 27, 2024 18:33
@siying siying changed the title [WIP] Communicate CheckpointID between driver and stateful operators [SPARK-49411][SS] Communicate CheckpointID between driver and stateful operators Aug 27, 2024
@siying siying marked this pull request as ready for review August 27, 2024 22:22
val isFirstBatch: Boolean)
val isFirstBatch: Boolean,
val currentCheckpointUniqueId:
MutableMap[Long, Array[String]] = MutableMap[Long, Array[String]]())
@WweiL (Contributor), Aug 29, 2024

Can we add a comment on what these unique IDs map to? I believe the key is the operator ID?

Contributor

Also, better to name it currentStateUniqueId, as it is only related to the state store, not checkpointing in general.

Contributor

I'm also confused by this. When I sketched an implementation of your proposal in my head, my assumption would be that IncrementalExecution would get just an ID, perhaps a single Long, that would correspond to the ID that it would bake into the physical plan sent to executors. So why is a map needed?

Contributor Author

I'll add a comment, but it is basically operatorID->partitionID->checkpointID
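A minimal sketch of how that layered mapping fits into the single MutableMap[Long, Array[String]] shown above, with the partition ID as the index into the array (the lookup helper is hypothetical):

import scala.collection.mutable.{Map => MutableMap}

object CheckpointIdMapSketch {
  // operatorID -> (partitionID -> checkpointID); the inner map is flattened into
  // an array indexed by partition ID.
  val currentCheckpointUniqueId: MutableMap[Long, Array[String]] =
    MutableMap[Long, Array[String]]()

  // Hypothetical lookup helper for one operator/partition pair.
  def checkpointIdFor(operatorId: Long, partitionId: Int): Option[String] =
    currentCheckpointUniqueId.get(operatorId).map(_(partitionId))
}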

private def updateCheckpointId(
execCtx: MicroBatchExecutionContext,
latestExecPlan: SparkPlan): Unit = {
// This function cannot handle MBP now.
Contributor

unnecessary comment

if (loadedVersion != version) {
if (loadedVersion != version ||
(checkpointFormatVersion >= 2 && checkpointUniqueId.isDefined &&
(!loadedCheckpointId.isDefined || checkpointUniqueId.get != loadedCheckpointId.get))) {
Contributor

nit: loadedCheckpointId.isEmpty
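Applying the nit, the condition from the diff would read roughly as follows (identifiers reused from the surrounding code; a sketch, not the final wording):

if (loadedVersion != version ||
    (checkpointFormatVersion >= 2 && checkpointUniqueId.isDefined &&
      (loadedCheckpointId.isEmpty || checkpointUniqueId.get != loadedCheckpointId.get))) {
  // reload the state store from the checkpoint
}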

.agg(count("*"))
.as[(Int, Long)]

// Run the stream with changelog checkpointing disabled.
Contributor

typo?

// Store checkpointIDs for state store checkpoints to be committed or have been committed to
// the commit log.
// operatorID -> (partitionID -> uniqueID)
private val currentCheckpointUniqueId = MutableMap[Long, Array[String]]()
Contributor

Maybe this is better placed in the stream execution context.

Contributor

operatorID -> (partitionID -> uniqueID), is this supposed to mean a map of maps? If so, then why is the type of currentCheckpointUniqueId just a single map?

I also don't fully understand why we would need a unique map for every operator X partition. Why is it not sufficient to have the following protocol, where we have one unique ID for every batch:

For the first batch, an ID is created and sent to all executors. When all tasks finish, that ID is persisted into the commit log. It is also kept in memory for the subsequent batch.

For any other batch, if there does not exist an ID in memory from the previous batch, then it must be read from the commit log and brought into memory. (This is the restart case.)

Then, using the ID in memory from the previous batch (call that prevId), this is sent to all executors in the physical plan, as well as a new ID for the current batch (call this currId). Before any processing starts, executors must load and use the state for prevId to process the current batch. Then, they can start processing, and they upload their state as <state file name>_currId.<changelog|snapshot>.

What's wrong with that?
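For concreteness, a compressed sketch of the single-ID-per-batch protocol described above; every name here is a hypothetical stand-in, not code from this PR:

import java.util.UUID
import scala.collection.mutable

object SingleIdPerBatchSketch {
  private val commitLog = mutable.Map[Long, String]()   // batchId -> checkpoint ID
  private var lastIdInMemory: Option[String] = None     // ID kept from the previous batch
  private def sendToExecutors(prevId: Option[String], currId: String): Unit = ()

  def runBatch(batchId: Long): Unit = {
    // Restart case: nothing in memory, so recover the previous batch's ID from the commit log.
    val prevId = lastIdInMemory.orElse(commitLog.get(batchId - 1))

    // One new ID for the whole batch, baked into the physical plan sent to executors.
    val currId = UUID.randomUUID().toString
    sendToExecutors(prevId, currId) // executors load prevId state, write files suffixed with currId

    // Once all tasks finish, persist the ID and keep it in memory for the next batch.
    commitLog.put(batchId, currId)
    lastIdInMemory = Some(currId)
  }
}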

Contributor Author

Right now, the uniqueID is generated in the executor. As a potential optimization, the driver could send a uniqueID to all executors, but executors would still need to modify it to make it unique among all attempts of the same task. After doing that, the IDs are no longer identical across tasks, so we need a different ID per partition.
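A rough illustration of that point, with a hypothetical derivation just to show why the final IDs end up differing per partition and per attempt:

import java.util.UUID

object ExecutorCheckpointIdSketch {
  // Base ID the driver could propose once per batch (the potential optimization above).
  val driverBaseId: String = UUID.randomUUID().toString

  // Each task attempt still has to make the ID its own, e.g. by mixing in partition and
  // attempt information, so the resulting IDs differ per partition.
  def executorCheckpointId(partitionId: Int, attemptNumber: Int): String =
    s"$driverBaseId-$partitionId-$attemptNumber-${UUID.randomUUID()}"
}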

try {
if (version < 0) {
throw QueryExecutionErrors.unexpectedStateStoreVersion(version)
}
rocksDB.load(version, true)
rocksDB.load(version, uniqueId, true)
Contributor

rocksDB.load(
  version,
  if (storeConf.stateStoreCheckpointFormatVersion >= 2) uniqueId else None)

@volatile private var LastCommitBasedCheckpointId: Option[String] = None
@volatile private var lastCommittedCheckpointId: Option[String] = None
@volatile private var loadedCheckpointId: Option[String] = None
@volatile private var sessionCheckpointId: Option[String] = None
Contributor

Should reset these to None in rollback()
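A sketch of what that reset could look like inside RocksDB.rollback(), reusing the fields from the diff (the existing rollback body is elided):

def rollback(): Unit = {
  // ... existing rollback logic ...
  // Reset checkpoint ID tracking so a rolled-back store does not report stale IDs.
  LastCommitBasedCheckpointId = None
  lastCommittedCheckpointId = None
  loadedCheckpointId = None
  sessionCheckpointId = None
}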

@neilramaswamy (Contributor) commented Sep 10, 2024

fix some unexpected query results in rare queries

@siying can you provide some context about which situations these are, specifically?

(Edit, seems to be here in the design doc.)

@neilramaswamy (Contributor) left a comment

Going to stop reviewing since I have a few fundamental questions regarding the protocol.

@@ -105,7 +105,7 @@ class StreamStreamJoinStatePartitionReader(
val stateInfo = StatefulOperatorStateInfo(
partition.sourceOptions.stateCheckpointLocation.toString,
partition.queryId, partition.sourceOptions.operatorId,
partition.sourceOptions.batchId + 1, -1)
partition.sourceOptions.batchId + 1, -1, None)
Contributor

Why is this None? I would imagine that users of the state data source reader now have to specify the ID that they would like to read, given that state stores are no longer uniquely identified by operator/partition/name, but by id/operator/partition/name?



Comment on lines 134 to 144
val ret = StatefulOperatorStateInfo(
checkpointLocation,
runId,
statefulOperatorId.getAndIncrement(),
operatorId,
currentBatchId,
numStateStores)
numStateStores,
currentCheckpointUniqueId.get(operatorId))
ret
Contributor

ret is not needed

case e: StreamingDeduplicateWithinWatermarkExec =>
assert(e.stateInfo.isDefined)
updateCheckpointIdForOperator(execCtx, e.stateInfo.get.operatorId, e.getCheckpointInfo())
// TODO Need to deal with FlatMapGroupsWithStateExec, TransformWithStateExec,
Contributor

Why not?

And I also don't see why we need to enumerate all of these here. Can we leverage the StatefulOperator trait and use that to get the state info? It should clean this up quite a bit.

Contributor

You will, though, probably have to do some work to make sure that getCheckpointInfo can be called for any stateful operator.
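A sketch of the suggested cleanup, assuming getCheckpointInfo() becomes callable on every stateful operator (collected here via the StateStoreWriter trait) and reusing the names from the diff:

latestExecPlan.collect {
  case stateful: StateStoreWriter if stateful.stateInfo.isDefined =>
    updateCheckpointIdForOperator(
      execCtx, stateful.stateInfo.get.operatorId, stateful.getCheckpointInfo())
}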

watermarkTracker.updateWatermark(execCtx.executionPlan.executedPlan)
val latestExecPlan = execCtx.executionPlan.executedPlan
watermarkTracker.updateWatermark(latestExecPlan)
if (sparkSession.sessionState.conf.stateStoreCheckpointFormatVersion >= 2) {
Contributor

I don't really like the >= 2 sprinkled everywhere. Can you define a constant somewhere, and then have a utility method that you call instead?
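One possible shape for that helper (names are hypothetical):

object StateStoreCheckpointIds {
  // Single home for the minimum format version that carries checkpoint IDs.
  val MinVersionWithCheckpointIds = 2

  def enabled(checkpointFormatVersion: Int): Boolean =
    checkpointFormatVersion >= MinVersionWithCheckpointIds
}

// Call sites then become:
//   if (StateStoreCheckpointIds.enabled(conf.stateStoreCheckpointFormatVersion)) { ... }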

val isFirstBatch: Boolean)
val isFirstBatch: Boolean,
val currentCheckpointUniqueId:
MutableMap[Long, Array[String]] = MutableMap[Long, Array[String]]())
Contributor

Is it always true that partition IDs are in [0, numPartitions)?

})
}

private def updateCheckpointId(
Contributor

Let me make sure I understand the flow here:

  1. Micro-batch ends, we call updateCheckpointId
  2. This goes through every stateful operator and calls updateCheckpointIdForOperator
  3. For each operator, we call into its getCheckpointInfo method
    1. That method will access the checkpointInfoAccumulator
    2. The checkpointInfoAccumulator is appended to using the unique ID from the state store after processing all data on the task
  4. In the future, we'll write this to the commit log.

Is this right?
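If that reading is right, the driver-side aggregation in step 3 might look roughly like this sketch; it reuses the StateStoreCheckpointInfo fields shown later in this diff, while the helper signature and the numPartitions argument are assumptions:

import scala.collection.mutable.{Map => MutableMap}

// Fields follow the case class shown later in this diff.
case class StateStoreCheckpointInfo(
    partitionId: Int,
    batchVersion: Long,
    checkpointId: Option[String],
    baseCheckpointId: Option[String])

object CheckpointIdAggregationSketch {
  val currentCheckpointUniqueId = MutableMap[Long, Array[String]]()

  // Fold the per-task accumulator contents into operatorID -> (partitionID -> uniqueID)
  // after the micro-batch finishes.
  def updateCheckpointIdForOperator(
      operatorId: Long,
      checkpointInfo: Seq[StateStoreCheckpointInfo],
      numPartitions: Int): Unit = {
    val ids = new Array[String](numPartitions)
    checkpointInfo.foreach { info =>
      info.checkpointId.foreach(id => ids(info.partitionId) = id)
    }
    currentCheckpointUniqueId.put(operatorId, ids)
  }
}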

@@ -803,6 +843,14 @@ class RocksDB(
/** Get the write buffer manager and cache */
def getWriteBufferManagerAndCache(): (WriteBufferManager, Cache) = (writeBufferManager, lruCache)

def getLatestCheckpointInfo(partitionId: Int): StateStoreCheckpointInfo = {
Contributor

Will this ever be called if lastCommittedCheckpointId is None or LastCommitBasedCheckpointId is None?

Comment on lines 163 to 174
// variables to manage checkpoint ID. Once a checkpointing finishes, it needs to return
// the `lastCommittedCheckpointId` as the committed checkpointID, as well as
// `LastCommitBasedCheckpointId` as the checkpointID of the previous version that it is based on.
// `loadedCheckpointId` is the checkpointID for the current live DB. After the batch finishes
// and checkpoint finishes, it will turn into `LastCommitBasedCheckpointId`.
// `sessionCheckpointId` stores an ID to be used for future checkpoints. It is kept being used
// until we have to use a new one. We don't need to reuse any uniqueID, but reusing when possible
// can help debug problems.
@volatile private var LastCommitBasedCheckpointId: Option[String] = None
@volatile private var lastCommittedCheckpointId: Option[String] = None
@volatile private var loadedCheckpointId: Option[String] = None
@volatile private var sessionCheckpointId: Option[String] = None
Contributor

We never read sessionCheckpointId and the comment doesn't really help me. What is it being used for?

Is there a reason LastCommitBasedCheckpointId is capitalized? And LastCommitBasedCheckpointId isn't even used in this PR since there is another TODO that says // TODO validate baseCheckpointId? Is that right?

Comment on lines 171 to 174
@volatile private var LastCommitBasedCheckpointId: Option[String] = None
@volatile private var lastCommittedCheckpointId: Option[String] = None
@volatile private var loadedCheckpointId: Option[String] = None
@volatile private var sessionCheckpointId: Option[String] = None
Contributor

Can you comment specifically why these are marked as volatile? From what I can tell, these are only read/written to by the query execution thread.

partitionId: Int,
batchVersion: Long,
checkpointId: Option[String],
baseCheckpointId: Option[String])
Contributor

We call this checkpointId in some places and baseCheckpointId in others? Can you clarify which is which, and what specifically it should be here?

Comment on lines +205 to +241
.map {
case (key, values) => key -> values.head
}
Contributor

This list would be non-zero only if there was a task retry/speculative execution, right?

@neilramaswamy (Contributor), Sep 17, 2024

And as discussed earlier today offline, this has the issue of not working if the same partition has multiple state stores, e.g. in a stream-stream join, which is actually a very serious issue.
