reading mount source from csv implemented #695

sriram251-code · 2023-01-13T12:36:39Z

closes #694
closes #701

sonarcloud · 2023-01-20T07:24:17Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

No Coverage information
0.0% Duplication

GeekSheikh · 2023-02-20T16:53:37Z

src/main/scala/com/databricks/labs/overwatch/pipeline/BronzeTransforms.scala

    val mountMap = getMountPointMapping(apiEnv) //Getting the mount info from api and cleaning the data
+      .withColumn("mount_point", when('mount_point.endsWith("/"), 'mount_point.substr(lit(0), length('mount_point) - 1)).otherwise('mount_point))


wasn't this implemented in #699 ? Curious as to why it's in this PR too. No problem, just want to confirm the logic is exactly the same.

Added #768 for this

its a different issue when get the source to mount mapping using dbutils i see the data which is present in csv contains the "/" at the end hence some of the joins were failing. #699 contains the "/" issue only in apiURL.

ahh i see, thanks for clarifying

GeekSheikh · 2023-02-20T16:54:13Z

src/main/scala/com/databricks/labs/overwatch/pipeline/BronzeTransforms.scala

@@ -925,11 +926,34 @@ trait BronzeTransforms extends SparkSessionWrapper {

    val result = pathsDF.select('wildPrefix, 'cluster_id)
    result
+    }catch {
+      case e:Exception=>
+          throw e


Let's log the exception here.

GeekSheikh · 2023-02-20T16:57:24Z

src/main/scala/com/databricks/labs/overwatch/MultiWorkspaceDeployment.scala

@@ -166,7 +166,8 @@ class MultiWorkspaceDeployment extends SparkSessionWrapper {
      config.enable_unsafe_SSL.getOrElse(false),
      config.thread_pool_size.getOrElse(4),
      config.api_waiting_time.getOrElse(300000),
-      Some(apiProxyConfig))
+      Some(apiProxyConfig),
+      Some(config.mount_mapping_path))


Did you add some validation for this? When this parameter is passed, the >50 mount validation should be disabled and the mapping validation should be enabled. Is this done?

no, i thought of it but some part of the code is in
#713
and some part of the code is in
#695

so was waiting for these 2 PR to get merged so that i can enhance it to add the validation part

Ok, will merge this now and #713 as soon as your review is complete, and you can add the validator logic comments in a new PR after all merged. Thanks.

GeekSheikh · 2023-02-20T16:58:21Z

src/main/scala/com/databricks/labs/overwatch/pipeline/BronzeTransforms.scala

@@ -985,7 +1010,7 @@ trait BronzeTransforms extends SparkSessionWrapper {
    // Build root level eventLog path prefix from clusterID and log conf
    // /some/log/prefix/cluster_id/eventlog
    val allEventLogPrefixes =
-    if(isMultiWorkSpaceDeployment) {
+    if(isMultiWorkSpaceDeployment && organisationId != spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId")) {


instead of spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId") can we just use the standard getOrgId func so as not to duplicate this critical snippet?

getOrgId function uses dbutils.notebook.getContext.tags("orgId") which sometime gives null pointer exception. I think spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId") is more robust then dbutils.notebook.getContext.tags("orgId")

Initializer.getOrgId -- this should never give null pointer, right?

PR #708 updated the logic

implemented

GeekSheikh · 2023-02-20T17:00:04Z

src/main/scala/com/databricks/labs/overwatch/utils/Structures.scala

@@ -101,6 +103,7 @@ case class MultiWorkspaceConfig(workspace_name: String,
                                enable_unsafe_SSL: Option[Boolean]= None,
                                thread_pool_size:  Option[Int] = None,
                                api_waiting_time:  Option[Long] = None,
+                                mount_mapping_path: String,


This is option string everywhere else, do you have a default string when instantiating ApiProxyConfig if it's not present through getOrElse or something? I didn't see that.

while building the config i am taking the mountMappingPath as Some(config.mount_mapping_path)) and checking is as nonEmpty in later part

let me know if i have to change it and make default value??i thought mount_mapping_path should not have any default value thats why i have implemented in this way

nah this is fine. thanks for explaining.

sonarcloud · 2023-02-20T17:55:25Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

No Coverage information
0.0% Duplication

* initial commit * Changing Serverless to High-Concurrency (#706) * Changing Serverless to High-Concurrency * minor changes --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Spark events bronze par session (#678) * initial commit * Test Spark per session * notes from meeting * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * pr review implemented * 686 - SparkEvents Executor ID Schema Handler * global Session - initialize configs * Concurrent Writes - Table Lock for Parallelized Loads (#691) * initi commit -- working * code modified preWriteActions added * re-instanted perform retry for legacy deployments and added comments * added logging details * clear sessionsMap on batch runner --------- Co-authored-by: sriram251-code <sriram.mohanty@databricks.com> * minor fixes --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * adding new snapshots to bronze layer (#684) * adding new snapshots to bronze layer * changed all the single asset name to plural * adding transform function for new bronze snaps * changes applied to improve schema quality --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * apiURL "/" removed and dbsql added (#699) * bug fix * bug fix * bug fix * bug fix * Improve acquisition of Cloud Provider and OrgID (#708) * Improve acquisition of Cloud Provider and OrgID * Improve acquisition of Cloud Provider and OrgID * Modularize getOrgID function * removed old commented version of code --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Update SilverTransforms.scala (#703) * Overwatch on photon broadcast exchange perf issue (#705) * Change the Join to Shuffle_hash Join for collectEventLogPaths * Change the Join to Shuffle_hash Join for collectEventLogPaths * adding pagination logic for job-runs api (#723) * 729 - enable clusterEvents merge Insert (#730) * enable clusterEvents merge Insert * added comments --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * auditRaw - mergeInsertOnly (#738) * enable clusterEvents merge Insert * added comments * 737 - dateGlobFix and auditLogRaw mergeInserts --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * implemented (#752) * asStrings implemented for apicallv2 (#707) * 749 fill meta improved (#753) * 749 fill meta improved * put tsPartVal in clsf back to 16 --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * cleaning jobs_snap_bronze new_cluster field (#732) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * dbu and cost calculations fixed (#760) * dbu calcs corrected * readd aliases * add runtime_engine to fillforward * added a few comments to funcs * corrected workerDBU Cost Value * enabled remote getWorkspaceByDatabase (#754) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * improved first run Impute clusterSpec (#759) * excluded scope enhanced (#740) * excluded scope enhanced * review comment implemented * modified lowerCase logic --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * adding temp location, start and end time in jobs runs list api (#755) * adding temp location, start and end time in jobs runs list api * change in jobRunsList api call * removed default apiVersion from new apply method * adding fix for jobs runs list api * adding code to cleanse duplicate cols in JobRunsList transform, and added new bronze snapshots in target * reading mount source from csv implemented (#695) * reading mount source from csv implemented * driver workspace should not call search/mount to get source * review comment implemented * review comment implemented * Reading config from delta implemented. (#713) * reading config from delta implemented. skip mount point check for AWS added. * review comment implemented * review comment implemented * review comment implemented * review comment implemented * review comment implemented * merge conflict removed * shuffle partition changed to String (#717) * shuffle partition changed to String * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * test cases added * test cases added * adding generic api calls function (#756) * adding generic api calls function * adding an empty map as return in APIMeta Trait for def getAPIJsonQuery * adding function getParallelAPIParams * implemented code review comments * removed commented lines * one workspace instance per workspace deployment (#774) * adding cluster type in jrcp view (#778) * improved spark conf handler and optimized confs (#773) * mount mapping validation added (#777) * mount mapping validation added * review comments implemented * review comments implemented * review comments implemented * review comments implemented * Integration Testing - Bug Fixes (#782) * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * cleanup * debugging * fixed futures executions * additional fixes * dbu cost fix * getOrgID bug fix * target exists enhancement for delta target path validation * getWorkspaceByDatabase -- cross-cloud remote workspace enabled * added experimental flag to jrsnapshot and enabled manual module disabling * rollback and module mapping --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> Co-authored-by: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com> Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com> Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>

reading mount source from csv implemented

9a1e353

sriram251-code requested a review from GeekSheikh January 13, 2023 12:38

driver workspace should not call search/mount to get source

a198dbd

sriram251-code self-assigned this Jan 20, 2023

sriram251-code added the enhancement New feature or request label Jan 20, 2023

sriram251-code added this to the 0.7.1.2 milestone Jan 20, 2023

This was linked to issues Jan 20, 2023

[BUG] : MWS driver workspace performing dbfs/search-mounts api calls #701

Closed

Support cluster logging with 50+ mount points for MWS deployment #694

Closed

GeekSheikh modified the milestones: 0.7.1.2, 0.7.1.1 Feb 17, 2023

GeekSheikh suggested changes Feb 20, 2023

View reviewed changes

review comment implemented

116380e

sriram251-code requested a review from GeekSheikh February 20, 2023 17:31

review comment implemented

f9a68a5

GeekSheikh approved these changes Feb 20, 2023

View reviewed changes

GeekSheikh merged commit 0055bc3 into 0711_release Feb 20, 2023

GeekSheikh deleted the mount_path branch February 20, 2023 18:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reading mount source from csv implemented #695

reading mount source from csv implemented #695

sriram251-code commented Jan 13, 2023 •

edited

Loading

sonarcloud bot commented Jan 20, 2023

GeekSheikh Feb 20, 2023

sriram251-code Feb 20, 2023

GeekSheikh Feb 20, 2023

GeekSheikh Feb 20, 2023

sriram251-code Feb 20, 2023

GeekSheikh Feb 20, 2023

sriram251-code Feb 20, 2023

GeekSheikh Feb 20, 2023

GeekSheikh Feb 20, 2023

sriram251-code Feb 20, 2023

GeekSheikh Feb 20, 2023

GeekSheikh Feb 20, 2023

sriram251-code Feb 20, 2023

GeekSheikh Feb 20, 2023

sriram251-code Feb 20, 2023

GeekSheikh Feb 20, 2023

sonarcloud bot commented Feb 20, 2023

		val mountMap = getMountPointMapping(apiEnv) //Getting the mount info from api and cleaning the data
		.withColumn("mount_point", when('mount_point.endsWith("/"), 'mount_point.substr(lit(0), length('mount_point) - 1)).otherwise('mount_point))

reading mount source from csv implemented #695

reading mount source from csv implemented #695

Conversation

sriram251-code commented Jan 13, 2023 • edited Loading

sonarcloud bot commented Jan 20, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sonarcloud bot commented Feb 20, 2023

sriram251-code commented Jan 13, 2023 •

edited

Loading