
Enable Incremental in Bronze Snapshot #805

Closed

Conversation

souravbaner-da (Contributor)
Enable Incremental in Bronze Snapshot

GeekSheikh and others added 3 commits March 6, 2023 12:19
* initial commit

* Refactor InitializerFunctions.scala

* Refactor InitializerFunctions.scala

* Change Scala Sources Name

* Refactor InitializerFunctions.scala

* Refactor InitializerFunctions.scala

* Added Initializerv2.scala

* Added Initializerv2.scala

* Changed as per Sriram comment

* Changed as per Sriram comment

* dropped Initializer Deprecated

---------

Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com>
Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>
Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com>
@souravbaner-da souravbaner-da added the enhancement New feature or request label Mar 8, 2023
@souravbaner-da souravbaner-da added this to the 0.7.2.0 milestone Mar 8, 2023
@souravbaner-da souravbaner-da self-assigned this Mar 8, 2023
@souravbaner-da souravbaner-da linked an issue Mar 8, 2023 that may be closed by this pull request
@@ -95,6 +95,37 @@ class Bronze(_workspace: Workspace, _database: Database, _config: Config)
Helpers.parClone(cloneSpecs)

}
def snapshotStream(
I think this needs to be moved to workspace and called incrementalSnap.

Currently users back up using workspace.snap; after this is merged, customers should be able to call workspace.incrementalSnap.

Add code docs on what is expected to be passed in for these variables and an English explanation of what this function does.

Be sure to confirm that the snapshot backs up ALL data from all organization_ids. This means you need to test against a deployment that has multiple workspaces (organization_ids) in the same bronze tables.

def snapshotStream(
targetPrefix: String,
I think we should change the name of this in this func and in the existing .snap func. It's too close to "target prefix", which in PipelineTable means the ETL prefix. I think this should be renamed to something like snapshotRootPath to make it clearer.

.filter(_.exists()) // source path must exist
.filterNot(t => cleanExcludes.contains(t.name.toLowerCase))

val finalTargetPathPrefix = s"${targetPrefix}/bckup"
why are we nesting this into a subdir here?

I think this makes sense if it were called something like s"${targetPrefix}/data" which would allow for another folder called s"${targetPrefix}/report" where we could persist the snapshot reports.

* @param cloneDetails details required to execute the parallelized clone
* @return
*/
def tableStream(cloneDetails: Seq[CloneDetail]): Seq[CloneReport] = {
this should be private[overwatch] I think since I don't think we want to allow use outside.

let's rename the func to make it more clear that it's only a snapStream.

I'm on the fence as to whether we should have a class/object called snapshot and refactor the core logic out of this Helpers section. We could move snap and this new snapStream over to the other location. We could have a Helper function that then instantiates/calls the core logic from Snapshot.scala or something. What do you think?

I think we need to persist the clone report. This brings more credence to the idea of refactoring the snapshots out of these helper sections to allow for a more complex / controlled workflow to happen. I even think there could be a main class for snapshotting since everyone will be doing this from a scheduled job anyway.

Comment on lines 119 to 122
val cloneSpecs = targetsToSnap.map(t => {
val targetPath = s"${finalTargetPathPrefix}/${t.name.toLowerCase}"
CloneDetail(t.tableLocation, targetPath)
})
this looks like reusable code -- move to a function?

cloneSpecs
}

def main(args: Array[String]): Unit = {
Add documentation for the function.

Helpers.snapStream(cloneSpecs, snapshotRootPath)
}

def buildCloneSpecs(
make it private

@sonarcloud
sonarcloud bot commented Apr 5, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 1 (A)
Coverage: no coverage information
Duplication: 0.0%
@GeekSheikh GeekSheikh left a comment
Is there a proof for this? Don't just test that it runs -- please test that the data is incrementally appearing in the output, that it looks correct, and that the restore functionality works as expected.

* @return
*/
private[overwatch] def snap(
bronze: Bronze,
if you change this to pipelines and let the user pass in comma delimited list it will be easier for customers to use and it would also allow us to snap any and / or all pipelines.

Suggest you accept similar to MultiWorkspace configs (i.e. "bronze,silver,gold") and then split them, get the pipelines and the necessary tables.
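A sketch of how that comma-delimited argument could be parsed (the object and method names here are hypothetical illustrations, not the actual Overwatch API):

```scala
// Hypothetical sketch: parse a MultiWorkspace-style pipelines argument
// such as "bronze,silver,gold" into a normalized, de-duplicated list.
object PipelineArgs {
  def parsePipelines(pipelinesArg: String): Seq[String] = {
    pipelinesArg.split(",")
      .map(_.trim.toLowerCase) // tolerate whitespace and mixed case
      .filter(_.nonEmpty)      // ignore stray commas
      .distinct
      .toSeq
  }
}
```

Each returned name could then be mapped to its pipeline and the necessary tables.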

*
* @param arg(0) Source Database Name or Source Remote_OverwatchGlobalPath
* @param arg(1) Target snapshotRootPath
* @param arg(2) Flag to determine whether the snap is a normal batch process or an incremental one (if "true" then incremental, else a normal snap)
Suggest changing this to accept "incremental" or "full" to make it easier to use, since a bare true/false flag doesn't convey the intent.

*/


def main(args: Array[String]): Unit = {
please create variable names for each of the args at the very beginning and then use those variables throughout the code to make it easier to support / debug.

i.e.
val source = args(0)
val targetPath = args(1)
...

/**
* Create a backup of the Overwatch datasets
*
* @param arg(0) Source Database Name or Source Remote_OverwatchGlobalPath
This comment is very confusing. You're accepting either a source database or source path prefix. I'm not sure what to do with Source Remote_OverwatchGlobalPath


val workspace = if (args(0).contains("/")){
val remote_workspace_id = args(3)
val pathToOverwatchGlobalShareData = args(0)
Should the path be the etl_prefix or the etl_prefix/global_share? Perhaps you called this out in the docs; if so, great. Otherwise, please make sure it's very clear, and also be sure to fail if it's not correct. Don't hardcode global_share.

@@ -472,7 +523,9 @@ object Helpers extends SparkSessionWrapper {
spark.sql(baseCloneStatement)
CloneReport(cloneSpec, baseCloneStatement, msg)
}
case e: Throwable => CloneReport(cloneSpec, stmt, e.getMessage)
case e: Throwable => {
CloneReport(cloneSpec, stmt, e.getMessage)
create an error message using appendStackStrace similar to below

        val failMsg = PipelineFunctions.appendStackStrace(e)
        val msg = s"FAILED: $moduleId-$moduleName Module: API CALL Failed\n$failMsg"

@@ -40,6 +41,7 @@ class InitializeTest extends AnyFunSpec with DataFrameComparer with SparkSession

it ("initializeDatabase function should create both elt and consumer database") {
import spark.implicits._
val sc: SparkContext = spark.sparkContext
please remove this and the import as they are not used and not needed.

CloneReport(cloneSpec, s"Streaming For: ${cloneSpec.source} --> ${cloneSpec.target}", "SUCCESS")
} catch {
case e: Throwable if (e.getMessage.contains("is after the latest commit timestamp of")) => {
val msg = s"SUCCESS WITH WARNINGS: The timestamp provided, ${cloneSpec.asOfTS.get} " +
Use the appendStackStrace method shown in other comment to build this message with the full stack trace as well

Comment on lines +458 to +460
val sourceName = s"${cloneSpec.source}".split("/").takeRight(1).head
val checkPointLocation = if (snapshotRootPath.takeRight(1) == "/") s"${snapshotRootPath}checkpoint/${sourceName}" else s"${snapshotRootPath}/checkpoint/${sourceName}"
val cloneReportPath = if (snapshotRootPath.takeRight(1) == "/") s"${snapshotRootPath}report/" else s"${snapshotRootPath}/clone_report/"
use the new path slash helpers introduced in #825 -- these are merged to 0720 now so rebase and you should have access to them.
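For illustration, a minimal version of such a helper might look like this (the actual helpers merged in #825 may differ in name and semantics):

```scala
// Hypothetical stand-in for the path slash helpers referenced above:
// normalize a root path to end with exactly one slash before appending
// child segments, removing the duplicated if/else on takeRight(1).
object PathUtils {
  def ensureTrailingSlash(path: String): String =
    if (path.endsWith("/")) path else path + "/"

  def child(root: String, segment: String): String =
    ensureTrailingSlash(root) + segment
}
```

With something like this, the two branches above collapse to single expressions such as `PathUtils.child(snapshotRootPath, s"checkpoint/$sourceName")`.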

* @param cloneDetails details required to execute the parallelized clone
* @return
*/
private[overwatch] def snapStream(cloneDetails: Seq[CloneDetail],snapshotRootPath: String): Unit = {
move the getQueryListener code to Helpers and update Database.scala -- then reuse this to get a query listener for the stream and provide updates to the logs on query progressions similar to how it's done today in Database.scala. This will provide the user and support with status updates which are great for longer running Trigger.Once streams

private def getQueryListener(query: StreamingQuery, minEventsPerTrigger: Long): StreamingQueryListener = {
  val streamManager = new StreamingQueryListener() {
    override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
      println("Query started: " + queryStarted.id)
    }
    override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
      println("Query terminated: " + queryTerminated.id)
    }
    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
      println("Query made progress: " + queryProgress.progress)
      if (config.debugFlag) {
        println(query.status.prettyJson)
      }
      if (queryProgress.progress.numInputRows <= minEventsPerTrigger) {
        query.stop()
      }
    }
  }
  streamManager
}
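For context, wiring that listener into a Trigger.Once snapshot stream might look roughly like this (a sketch only, requiring a running SparkSession; names such as sourceDF, targetPath, and checkPointLocation are illustrative and not the actual snapStream implementation):

```scala
import org.apache.spark.sql.streaming.Trigger

// Illustrative sketch: start the clone stream, attach the listener so query
// progress shows up in the logs, then detach once the stream finishes.
val streamingQuery = sourceDF.writeStream
  .format("delta")
  .option("checkpointLocation", checkPointLocation)
  .trigger(Trigger.Once())
  .start(targetPath)

val listener = getQueryListener(streamingQuery, minEventsPerTrigger = 0L)
spark.streams.addListener(listener)    // progress updates now reach the logs
streamingQuery.awaitTermination()
spark.streams.removeListener(listener) // avoid leaking listeners across runs
```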

@GeekSheikh GeekSheikh modified the milestones: 0.7.2.0, 0.7.3.0 Apr 11, 2023
@souravbaner-da
Contributor Author

Duplicate PR already raised from Release-721
#909

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEAT] - Bronze Snapshots - Enable Incremental
5 participants