-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reading mount source from csv implemented #695
Conversation
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
val mountMap = getMountPointMapping(apiEnv) //Getting the mount info from api and cleaning the data | ||
.withColumn("mount_point", when('mount_point.endsWith("/"), 'mount_point.substr(lit(0), length('mount_point) - 1)).otherwise('mount_point)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its a different issue when get the source to mount mapping using dbutils i see the data which is present in csv contains the "/" at the end hence some of the joins were failing. #699 contains the "/" issue only in apiURL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh i see, thanks for clarifying
@@ -925,11 +926,34 @@ trait BronzeTransforms extends SparkSessionWrapper { | |||
|
|||
val result = pathsDF.select('wildPrefix, 'cluster_id) | |||
result | |||
}catch { | |||
case e:Exception=> | |||
throw e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's log the exception here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -166,7 +166,8 @@ class MultiWorkspaceDeployment extends SparkSessionWrapper { | |||
config.enable_unsafe_SSL.getOrElse(false), | |||
config.thread_pool_size.getOrElse(4), | |||
config.api_waiting_time.getOrElse(300000), | |||
Some(apiProxyConfig)) | |||
Some(apiProxyConfig), | |||
Some(config.mount_mapping_path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you add some validation for this? When this parameter is passed, the >50 mount validation should be disabled and the mapping validation should be enabled. Is this done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will merge this now and #713 as soon as your review is complete, and you can add the validator logic comments in a new PR after all merged. Thanks.
@@ -985,7 +1010,7 @@ trait BronzeTransforms extends SparkSessionWrapper { | |||
// Build root level eventLog path prefix from clusterID and log conf | |||
// /some/log/prefix/cluster_id/eventlog | |||
val allEventLogPrefixes = | |||
if(isMultiWorkSpaceDeployment) { | |||
if(isMultiWorkSpaceDeployment && organisationId != spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId")
can we just use the standard getOrgId func so as not to duplicate this critical snippet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getOrgId function uses dbutils.notebook.getContext.tags("orgId") which sometime gives null pointer exception. I think spark.conf.get("spark.databricks.clusterUsageTags.clusterOwnerOrgId") is more robust then dbutils.notebook.getContext.tags("orgId")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initializer.getOrgId -- this should never give null pointer, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #708 updated the logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implemented
@@ -101,6 +103,7 @@ case class MultiWorkspaceConfig(workspace_name: String, | |||
enable_unsafe_SSL: Option[Boolean]= None, | |||
thread_pool_size: Option[Int] = None, | |||
api_waiting_time: Option[Long] = None, | |||
mount_mapping_path: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is option string everywhere else, do you have a default string when instantiating ApiProxyConfig if it's not present through getOrElse or something? I didn't see that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah this is fine. thanks for explaining.
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
* initial commit * Changing Serverless to High-Concurrency (#706) * Changing Serverless to High-Concurrency * minor changes --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Spark events bronze par session (#678) * initial commit * Test Spark per session * notes from meeting * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * pr review implemented * 686 - SparkEvents Executor ID Schema Handler * global Session - initialize configs * Concurrent Writes - Table Lock for Parallelized Loads (#691) * initi commit -- working * code modified preWriteActions added * re-instanted perform retry for legacy deployments and added comments * added logging details * clear sessionsMap on batch runner --------- Co-authored-by: sriram251-code <sriram.mohanty@databricks.com> * minor fixes --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * adding new snapshots to bronze layer (#684) * adding new snapshots to bronze layer * changed all the single asset name to plural * adding transform function for new bronze snaps * changes applied to improve schema quality --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * apiURL "/" removed and dbsql added (#699) * bug fix * bug fix * bug fix * bug fix * Improve acquisition of Cloud Provider and OrgID (#708) * Improve acquisition of Cloud Provider and OrgID * Improve acquisition of Cloud Provider and OrgID * Modularize getOrgID function * removed old commented version of code --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Update SilverTransforms.scala (#703) * Overwatch on photon broadcast exchange perf issue (#705) * Change the Join to Shuffle_hash Join for collectEventLogPaths * Change the Join to Shuffle_hash Join for collectEventLogPaths * adding pagination logic for job-runs api (#723) * 729 - enable clusterEvents merge Insert (#730) * enable clusterEvents merge Insert * added comments --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * auditRaw - mergeInsertOnly (#738) * enable clusterEvents merge Insert * added comments * 737 - dateGlobFix and auditLogRaw mergeInserts --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * implemented (#752) * asStrings implemented for apicallv2 (#707) * 749 fill meta improved (#753) * 749 fill meta improved * put tsPartVal in clsf back to 16 --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * cleaning jobs_snap_bronze new_cluster field (#732) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * dbu and cost calculations fixed (#760) * dbu calcs corrected * readd aliases * add runtime_engine to fillforward * added a few comments to funcs * corrected workerDBU Cost Value * enabled remote getWorkspaceByDatabase (#754) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * improved first run Impute clusterSpec (#759) * excluded scope enhanced (#740) * excluded scope enhanced * review comment implemented * modified lowerCase logic --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * adding temp location, start and end time in jobs runs list api (#755) * adding temp location, start and end time in jobs runs list api * change in jobRunsList api call * removed default apiVersion from new apply method * adding fix for jobs runs list api * adding code to cleanse duplicate cols in JobRunsList transform, and added new bronze snapshots in target * reading mount source from csv implemented (#695) * reading mount source from csv implemented * driver workspace should not call search/mount to get source * review comment implemented * review comment implemented * Reading config from delta implemented. (#713) * reading config from delta implemented. skip mount point check for AWS added. * review comment implemented * review comment implemented * review comment implemented * review comment implemented * review comment implemented * merge conflict removed * shuffle partition changed to String (#717) * shuffle partition changed to String * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * test cases added * test cases added * adding generic api calls function (#756) * adding generic api calls function * adding an empty map as return in APIMeta Trait for def getAPIJsonQuery * adding function getParallelAPIParams * implemented code review comments * removed commented lines * one workspace instance per workspace deployment (#774) * adding cluster type in jrcp view (#778) * improved spark conf handler and optimized confs (#773) * mount mapping validation added (#777) * mount mapping validation added * review comments implemented * review comments implemented * review comments implemented * review comments implemented * Integration Testing - Bug Fixes (#782) * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * cleanup * debugging * fixed futures executions * additional fixes * dbu cost fix * getOrgID bug fix * target exists enhancement for delta target path validation * getWorkspaceByDatabase -- cross-cloud remote workspace enabled * added experimental flag to jrsnapshot and enabled manual module disabling * rollback and module mapping --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> Co-authored-by: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com> Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com> Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>
* initial commit * Changing Serverless to High-Concurrency (#706) * Changing Serverless to High-Concurrency * minor changes --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Spark events bronze par session (#678) * initial commit * Test Spark per session * notes from meeting * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * pr review implemented * 686 - SparkEvents Executor ID Schema Handler * global Session - initialize configs * Concurrent Writes - Table Lock for Parallelized Loads (#691) * initi commit -- working * code modified preWriteActions added * re-instanted perform retry for legacy deployments and added comments * added logging details * clear sessionsMap on batch runner --------- Co-authored-by: sriram251-code <sriram.mohanty@databricks.com> * minor fixes --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * adding new snapshots to bronze layer (#684) * adding new snapshots to bronze layer * changed all the single asset name to plural * adding transform function for new bronze snaps * changes applied to improve schema quality --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * apiURL "/" removed and dbsql added (#699) * bug fix * bug fix * bug fix * bug fix * Improve acquisition of Cloud Provider and OrgID (#708) * Improve acquisition of Cloud Provider and OrgID * Improve acquisition of Cloud Provider and OrgID * Modularize getOrgID function * removed old commented version of code --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Update SilverTransforms.scala (#703) * Overwatch on photon broadcast exchange perf issue (#705) * Change the Join to Shuffle_hash Join for collectEventLogPaths * Change the Join to Shuffle_hash Join for collectEventLogPaths * adding pagination logic for job-runs api (#723) * 729 - enable clusterEvents merge Insert (#730) * enable clusterEvents merge Insert * added comments --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * auditRaw - mergeInsertOnly (#738) * enable clusterEvents merge Insert * added comments * 737 - dateGlobFix and auditLogRaw mergeInserts --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * implemented (#752) * asStrings implemented for apicallv2 (#707) * 749 fill meta improved (#753) * 749 fill meta improved * put tsPartVal in clsf back to 16 --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * cleaning jobs_snap_bronze new_cluster field (#732) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * dbu and cost calculations fixed (#760) * dbu calcs corrected * readd aliases * add runtime_engine to fillforward * added a few comments to funcs * corrected workerDBU Cost Value * enabled remote getWorkspaceByDatabase (#754) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * improved first run Impute clusterSpec (#759) * excluded scope enhanced (#740) * excluded scope enhanced * review comment implemented * modified lowerCase logic --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * adding temp location, start and end time in jobs runs list api (#755) * adding temp location, start and end time in jobs runs list api * change in jobRunsList api call * removed default apiVersion from new apply method * adding fix for jobs runs list api * adding code to cleanse duplicate cols in JobRunsList transform, and added new bronze snapshots in target * reading mount source from csv implemented (#695) * reading mount source from csv implemented * driver workspace should not call search/mount to get source * review comment implemented * review comment implemented * Reading config from delta implemented. (#713) * reading config from delta implemented. skip mount point check for AWS added. * review comment implemented * review comment implemented * review comment implemented * review comment implemented * review comment implemented * merge conflict removed * shuffle partition changed to String (#717) * shuffle partition changed to String * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * test cases added * test cases added * adding generic api calls function (#756) * adding generic api calls function * adding an empty map as return in APIMeta Trait for def getAPIJsonQuery * adding function getParallelAPIParams * implemented code review comments * removed commented lines * one workspace instance per workspace deployment (#774) * adding cluster type in jrcp view (#778) * improved spark conf handler and optimized confs (#773) * mount mapping validation added (#777) * mount mapping validation added * review comments implemented * review comments implemented * review comments implemented * review comments implemented * Integration Testing - Bug Fixes (#782) * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * cleanup * debugging * fixed futures executions * additional fixes * dbu cost fix * getOrgID bug fix * target exists enhancement for delta target path validation * getWorkspaceByDatabase -- cross-cloud remote workspace enabled * added experimental flag to jrsnapshot and enabled manual module disabling * rollback and module mapping --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> Co-authored-by: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com> Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com> Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>
* initial commit * Changing Serverless to High-Concurrency (#706) * Changing Serverless to High-Concurrency * minor changes --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Spark events bronze par session (#678) * initial commit * Test Spark per session * notes from meeting * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * persession implemented * pr review implemented * 686 - SparkEvents Executor ID Schema Handler * global Session - initialize configs * Concurrent Writes - Table Lock for Parallelized Loads (#691) * initi commit -- working * code modified preWriteActions added * re-instanted perform retry for legacy deployments and added comments * added logging details * clear sessionsMap on batch runner --------- Co-authored-by: sriram251-code <sriram.mohanty@databricks.com> * minor fixes --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * adding new snapshots to bronze layer (#684) * adding new snapshots to bronze layer * changed all the single asset name to plural * adding transform function for new bronze snaps * changes applied to improve schema quality --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * apiURL "/" removed and dbsql added (#699) * bug fix * bug fix * bug fix * bug fix * Improve acquisition of Cloud Provider and OrgID (#708) * Improve acquisition of Cloud Provider and OrgID * Improve acquisition of Cloud Provider and OrgID * Modularize getOrgID function * removed old commented version of code --------- Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * Update SilverTransforms.scala (#703) * Overwatch on photon broadcast exchange perf issue (#705) * Change the Join to Shuffle_hash Join for collectEventLogPaths * Change the Join to Shuffle_hash Join for collectEventLogPaths * adding pagination logic for job-runs api (#723) * 729 - enable clusterEvents merge Insert (#730) * enable clusterEvents merge Insert * added comments --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * auditRaw - mergeInsertOnly (#738) * enable clusterEvents merge Insert * added comments * 737 - dateGlobFix and auditLogRaw mergeInserts --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * implemented (#752) * asStrings implemented for apicallv2 (#707) * 749 fill meta improved (#753) * 749 fill meta improved * put tsPartVal in clsf back to 16 --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * cleaning jobs_snap_bronze new_cluster field (#732) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * dbu and cost calculations fixed (#760) * dbu calcs corrected * readd aliases * add runtime_engine to fillforward * added a few comments to funcs * corrected workerDBU Cost Value * enabled remote getWorkspaceByDatabase (#754) Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> * improved first run Impute clusterSpec (#759) * excluded scope enhanced (#740) * excluded scope enhanced * review comment implemented * modified lowerCase logic --------- Co-authored-by: Daniel Tomes <10840635+GeekSheikh@users.noreply.github.com> * adding temp location, start and end time in jobs runs list api (#755) * adding temp location, start and end time in jobs runs list api * change in jobRunsList api call * removed default apiVersion from new apply method * adding fix for jobs runs list api * adding code to cleanse duplicate cols in JobRunsList transform, and added new bronze snapshots in target * reading mount source from csv implemented (#695) * reading mount source from csv implemented * driver workspace should not call search/mount to get source * review comment implemented * review comment implemented * Reading config from delta implemented. (#713) * reading config from delta implemented. skip mount point check for AWS added. * review comment implemented * review comment implemented * review comment implemented * review comment implemented * review comment implemented * merge conflict removed * shuffle partition changed to String (#717) * shuffle partition changed to String * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * comments implemented * test cases added * test cases added * adding generic api calls function (#756) * adding generic api calls function * adding an empty map as return in APIMeta Trait for def getAPIJsonQuery * adding function getParallelAPIParams * implemented code review comments * removed commented lines * one workspace instance per workspace deployment (#774) * adding cluster type in jrcp view (#778) * improved spark conf handler and optimized confs (#773) * mount mapping validation added (#777) * mount mapping validation added * review comments implemented * review comments implemented * review comments implemented * review comments implemented * Integration Testing - Bug Fixes (#782) * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * added persistAndLoad to all writes with tableLocking * dont perform data validation if path validation fails -- protrects first run failures especially * fix excludedScopes * null config handlers plus proxy scope,key error handler * cleanup * debugging * fixed futures executions * additional fixes * dbu cost fix * getOrgID bug fix * target exists enhancement for delta target path validation * getWorkspaceByDatabase -- cross-cloud remote workspace enabled * added experimental flag to jrsnapshot and enabled manual module disabling * rollback and module mapping --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> --------- Co-authored-by: geeksheikh <geeksheikh@users.noreply.github.com> Co-authored-by: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com> Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com> Co-authored-by: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com> Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>
closes #694
closes #701