Refactor lookups in Silver Job Runs #1253

Open · wants to merge 26 commits into base: 0820_release

Conversation

neilbest-db
Contributor

@neilbest-db neilbest-db commented Jun 26, 2024

[ Original description moved here in #1256. ]

The net effect of this PR is to refactor a number of transformations in Silver Job Runs (module 2011) so that the Spark jobs and stages are labelled in a useful way. Only the last `NamedTransformation` applied before a Spark action in a chain has the desired effect, so the sensible way to use this feature is to end a `NamedTransformation` with an action, or to perform an action immediately after applying one.
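For context, the usage pattern described above might be sketched as follows. The names `NamedTransformation` and `transformWithDescription` come from this PR; the implementation shown here is a simplified assumption for illustration, not the actual Overwatch code:

```scala
import org.apache.spark.sql.DataFrame

// Assumed sketch: a NamedTransformation pairs a DataFrame => DataFrame
// function with a name that is pushed into the Spark job description, so
// the jobs and stages triggered by the *next* action carry a useful label.
case class NamedTransformation(
    transformation: DataFrame => DataFrame,
    name: String)

object TransformationDescriber {
  implicit class TransformWithDescription(df: DataFrame) {
    def transformWithDescription(nt: NamedTransformation): DataFrame = {
      // Label the jobs triggered by the next action on this DataFrame.
      df.sparkSession.sparkContext.setJobDescription(nt.name)
      df.transform(nt.transformation)
    }
  }
}

// Usage (name passed explicitly here; how the real class derives it is
// not shown in this PR):
//   val nt = NamedTransformation(df => df, "jobRunsAppendClusterName")
//   someDF.transformWithDescription(nt).cache().count()  // action right after
```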

Next steps

Further gains in resource utilization and time efficiency may be possible in the subsequent phases of the JR module (2011):

gueniai and others added 11 commits May 31, 2024 15:03
commit a6a13fe
Author: Neil Best <neil.best@databricks.com>
Date:   Thu May 23 16:39:58 2024 -0500

    improve TransformationDescriberTest

commit 1f145aa
Author: Neil Best <neil.best@databricks.com>
Date:   Thu May 23 15:25:29 2024 -0500

    Add descriptive job group IDs and named transformations

    This makes the Spark UI more developer-friendly when analyzing
    Overwatch runs.

    Job group IDs have the form <workspace name>:<OW module name>

    Any use of `.transform( df => df)` may be replaced with
    `.transformWithDescription( nt)` after instantiating a `val nt =
    NamedTransformation( df => df)` as its argument.

    This commit contains one such application of the new extension method.
    (See `val jobRunsAppendClusterName` in `WorkflowsTransforms.scala`.)

    Some logic in `GoldTransforms` falls through to elements of the
    special job-run-action form of Job Group IDs emitted by the platform
    but the impact is minimal relative to the benefit to Overwatch
    development and troubleshooting.  Even so this form of Job Group ID is
    still present in initial Spark events before OW ETL modules begin to
    execute.

commit da0c55a
Author: Guenia <guenia.izquierdo@databricks.com>
Date:   Wed May 8 19:43:29 2024 -0400

    Initial commit
Removed a level of indirection and unnecessary conditional branching
in definition of chained `lookupWhen` transformations.

Moved definitions to have references to `PipelineTable` objects in
scope rather than passing them by argument.

(cherry picked from commit efdd63f)
- enable auto-optimized shuffle for module 2011

- move caching action to previous `NamedTransformation` for more
  meaningful Spark UI labels
for greater visibility in Spark UI. `NamedTransformation` type name
now appears in labels' second position.
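For reference, enabling Databricks auto-optimized shuffle for a module is typically a one-line Spark conf toggle. The conf key below is the Databricks setting for this feature; whether module 2011 sets it in exactly this way is an assumption:

```scala
// Assumed sketch: toggling Databricks auto-optimized shuffle (AOS) on the
// SparkSession that runs module 2011 (Silver Job Runs).
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
```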
@neilbest-db neilbest-db linked an issue Jun 26, 2024 that may be closed by this pull request
@neilbest-db neilbest-db self-assigned this Jun 26, 2024
@neilbest-db neilbest-db added enhancement New feature or request optimization Technical Spark Optimization labels Jun 26, 2024
@neilbest-db neilbest-db added this to the 0.8.2.0 milestone Jun 26, 2024

Prevent certain regressions when the Job Group labels set by the platform are no longer available for parsing. Labels set by the platform contain tokens that are necessary to preserve referential integrity under certain conditions. (Which conditions?)
@neilbest-db
Contributor Author

@sriram251-code, the code that changes the values of the Spark UI Job Group IDs has been commented out in this branch per your recommendation (see 9fe9f8c). I would like to understand the scenarios when the Job Group IDs are the only place to extract certain tokens/IDs. Is it possible to enumerate these scenarios and map the flow of those tokens through the ETL to the target table(s)?

@neilbest-db neilbest-db marked this pull request as ready for review June 29, 2024 20:37
@neilbest-db neilbest-db changed the title Analyze and improve Silver Job Runs performance (Spark 3.1.2) Refactor lookups in Silver Job Runs Jul 15, 2024
@neilbest-db
Contributor Author

to follow #1228

@neilbest-db neilbest-db reopened this Jul 15, 2024

sonarcloud bot commented Jul 15, 2024

@neilbest-db neilbest-db added this to the 0.8.2.0 milestone Jul 15, 2024
neilbest-db and others added 11 commits July 30, 2024 15:52
enable auto-optimized shuffle for module 2011

originally implemented in commit [d751d5f](d751d5f)

* Initial commit

* append null columns from cluster snapshot for cluster_spec_silver

* append null columns from cluster snapshot for cluster_spec_silver

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
* Initial commit

* cluster event bronze will take all the data from API for first run

* Update BronzeTransforms.scala

adjust whitespace around `landClusterEvents()`

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
Co-authored-by: Neil Best <60439766+neilbest-db@users.noreply.github.com>

* Initial commit

* Change publish() function to incorporate views from ETL Tables instead of paths

* Handle view creation in case the table does not exist

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
* Initial commit

* 19-Oct-23 : Added Validation Framework

* 19-Oct-23: Customize the message for customer

* 19-Oct-23: Customize the message for customer

* 26-Oct-23: Added OverwatchID filter in the table

* 26-Oct-23: Change for Coding Best Practices

* Added Function Description for validateColumnBetweenMultipleTable

* Added Pattern Matching in Validation

* Convert if-else in validateRuleAndUpdateStatus to case statement as per comment

* Initial commit

* traceability implemented (#1102)

* traceability implemented

* code review implemented

* missed code implemented (#1105)

* Initial commit

* traceability implemented (#1102)

* traceability implemented

* code review implemented

* missed code implemented

* missed code implemented

---------

Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>

* Added proper exception for Spark Stream Gold if progress c… (#1085)

* Initial commit

* 09-Nov-23: Added proper exception for Spark Stream Gold if progress column contains only null in SparkEvents_Bronze

---------

Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* Gracefully Handle Exception for NotebookCommands_Gold (#1095)

* Initial commit

* Gracefully Handle Exception for NotebookCommands_Gold

* Convert the check in buildNotebookCommandsFact to single or clause

---------

Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* code missed in merge (#1120)

* Fix Helper Method to Instantiate Remote Workspaces (#1110)

* Initial commit

* Change getRemoteWorkspaceByPath and getWorkspaceByDatabase to take it RemoteWorkspace

* Remove Unnecessary println Statements

---------

Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>

* Ensure we test the write into a partitioned storage_prefix (#1088)

* Initial commit

* Ensure we test the write into a partitioned storage_prefix

* silver warehouse spec fix (#1121)

* added missed copy-pasta (#1129)

* Exclude cluster logs in S3 root bucket (#1118)

* Exclude cluster logs in S3 root bucket

* Omit cluster log paths pointing to s3a as well

* implemented recon (#1116)

* implemented recon

* docs added

* file path change

* review comments implemented

* Added ShuffleFactor to NotebookCommands (#1124)

Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* disabled traceability (#1130)

* Added JobRun_Silver in buildClusterStateFact for Cluster E… (#1083)

* Initial commit

* 08-Nov-23: Added JobRun_Silver in buildClusterStateFact for Cluster End Time Imputation

* Impute Terminating Events in CLSF from JR_Silver

* Impute Terminating Events in CLSD

* Impute Terminating Events in CLSD

* Change CLSF to original 0730 version

* Change CLSF to original 0730 version

* Added cluster_spec in CLSD to get job Cluster only

* Make the variables name in buildClusterStateDetail into more descriptive way

* Make the variables name in buildClusterStateDetail into more descriptive way

---------

Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* Sys table audit log integration (#1122)

* system table integration with audit log

* adding code to resolve issues with response col

* fixed timestamp issue

* adding print statement for from and until time

* adding fix for azure

* removed comments

* removed comments and print statements

* removed comments

* implemented code review comments

* implemented code review comments

* adding review comment

* Sys table integration multi account (#1131)

* added code changes for multi account deployment

* code for multi account system table integration

* Sys table integration multi account (#1132)

* added code changes for multi account deployment

* code for multi account system table integration

* adding code for system table migration check

* changing exception for empty audit log from system table

* adding code to handle sql_endpoint in configs and fix in migration validation (#1133)

* corner case commit (#1134)

* Handle CLSD Cluster Impute when jrcp and clusterSpec is Empty (#1135)

* Handle CLSD Cluster Impute when jrcp and clusterSpec is Empty

* Exclude last_state from clsd as it is not needed in the logic.

---------

Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* Exclude 2011 and 2014 as dependency module for 2019 (#1136)

* Exclude 2011 and 2014 as dependency module for 2019

* Added comment in CLSD for understandability

---------

Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* corner case commit (#1137)

* Update version

* adding fix for empty EH config for system tables (#1140)

* corner case commit (#1142)

* adding fix for empty audit log for warehouse_spec_silver (#1141)

* recon columns removed (#1143)

* recon columns removed

* recon columns removed

* Initial Commit

* Added Changes in Validation Framework as per comments added during sprint meeting

* added hotfix for warehouse_spec_silver (#1154)

* Added Multiple RunID check in Validation Framework

* Added Other tables in Validation Framework

* Added Multiple WS ID option in Cross Table Validation

* Added change for Pipeline_report

* Change for Pipeline Report

* Added msg for single table validation

* Added negative msg in HealthCheck Report

* Added Negative Msg for Cross Table Validation

* Added extra filter for total cost validation for CLSF

* Changed as per Comments

* Changed as per the comments

* Added some filter condition for cost validation in clsf

* Added Config for all pipeline run

* 19-Oct-23 : Added Validation Framework

* 19-Oct-23: Customize the message for customer

* 19-Oct-23: Customize the message for customer

* 26-Oct-23: Added OverwatchID filter in the table

* 26-Oct-23: Change for Coding Best Practices

* Added Function Description for validateColumnBetweenMultipleTable

* Added Pattern Matching in Validation

* Convert if-else in validateRuleAndUpdateStatus to case statement as per comment

* traceability implemented (#1102)

* traceability implemented

* code review implemented

* Added JobRun_Silver in buildClusterStateFact for Cluster E… (#1083)

* Initial commit

* 08-Nov-23: Added JobRun_Silver in buildClusterStateFact for Cluster End Time Imputation

* Impute Terminating Events in CLSF from JR_Silver

* Impute Terminating Events in CLSD

* Impute Terminating Events in CLSD

* Change CLSF to original 0730 version

* Change CLSF to original 0730 version

* Added cluster_spec in CLSD to get job Cluster only

* Make the variables name in buildClusterStateDetail into more descriptive way

* Make the variables name in buildClusterStateDetail into more descriptive way

---------

Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* corner case commit (#1134)

* Exclude 2011 and 2014 as dependency module for 2019 (#1136)

* Exclude 2011 and 2014 as dependency module for 2019

* Added comment in CLSD for understandability

---------

Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

* Added Changes in Validation Framework as per comments added during sprint meeting

* Added Multiple RunID check in Validation Framework

* Added Other tables in Validation Framework

* Added Multiple WS ID option in Cross Table Validation

* Added change for Pipeline_report

* Change for Pipeline Report

* Added msg for single table validation

* Added negative msg in HealthCheck Report

* Added Negative Msg for Cross Table Validation

* Added extra filter for total cost validation for CLSF

* Changed as per Comments

* Changed as per the comments

* Added some filter condition for cost validation in clsf

* Added Config for all pipeline run

---------

Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>
Co-authored-by: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com>
Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>
* test

* code for warehouse_state_detail_silver

* removed comments

* adding warehouseEvents scope

* added exception for table not found

* added exception to check if system tables are getting used or not

* enhance function getWarehousesEventDF

* added code to fix max number of clusters

* change in column names

* refactored code
* Initial commit

* Add descriptive job group IDs and named transformations

This makes the Spark UI more developer-friendly when analyzing
Overwatch runs.

Job group IDs have the form <workspace name>:<OW module name>

Any use of `.transform( df => df)` may be replaced with
`.transformWithDescription( nt)` after instantiating a `val nt =
NamedTransformation( df => df)` as its argument.

This commit contains one such application of the new extension method.
(See `val jobRunsAppendClusterName` in `WorkflowsTransforms.scala`.)

Some logic in `GoldTransforms` falls through to elements of the
special job-run-action form of Job Group IDs emitted by the platform
but the impact is minimal relative to the benefit to Overwatch
development and troubleshooting.  Even so this form of Job Group ID is
still present in initial Spark events before OW ETL modules begin to
execute.

* improve TransformationDescriberTest

* flip transformation names to beginning of label

for greater visibility in Spark UI. `NamedTransformation` type name
now appears in labels' second position.

(cherry picked from commit 2ead752)

* revert modified Spark UI Job Group labels

TODO: enumerate the regressions this would introduce when the labels set by the platform are replaced this way.

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
* adding code for warehouseStateFact gold

* removed hard coded data and fix logic

* removed commented code
* Initial commit

* Add extension method to show `DataFrame` records in the log

* catch up with 0820_release

Squashed commit of the following:

commit bbdb61f
Author: Neil Best <60439766+neilbest-db@users.noreply.github.com>
Date:   Tue Aug 20 10:11:03 2024 -0500

    Add descriptive `NamedTransformation`s to Spark UI (#1223)

    * Initial commit

    * Add descriptive job group IDs and named transformations

    This makes the Spark UI more developer-friendly when analyzing
    Overwatch runs.

    Job group IDs have the form <workspace name>:<OW module name>

    Any use of `.transform( df => df)` may be replaced with
    `.transformWithDescription( nt)` after instantiating a `val nt =
    NamedTransformation( df => df)` as its argument.

    This commit contains one such application of the new extension method.
    (See `val jobRunsAppendClusterName` in `WorkflowsTransforms.scala`.)

    Some logic in `GoldTransforms` falls through to elements of the
    special job-run-action form of Job Group IDs emitted by the platform
    but the impact is minimal relative to the benefit to Overwatch
    development and troubleshooting.  Even so this form of Job Group ID is
    still present in initial Spark events before OW ETL modules begin to
    execute.

    * improve TransformationDescriberTest

    * flip transformation names to beginning of label

    for greater visibility in Spark UI. `NamedTransformation` type name
    now appears in labels' second position.

    (cherry picked from commit 2ead752)

    * revert modified Spark UI Job Group labels

    TODO: enumerate the regressions this would introduce when the labels set by the platform are replaced this way.

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit 3055a22
Author: Aman <91308367+aman-db@users.noreply.github.com>
Date:   Mon Aug 12 22:59:13 2024 +0530

    1218 warehouse state details (#1254)

    * test

    * code for warehouse_state_detail_silver

    * removed comments

    * adding warehouseEvents scope

    * added exception for table not found

    * added exception to check if system tables are getting used or not

    * enhance function getWarehousesEventDF

    * added code to fix max number of clusters

    * change in column names

    * refactored code

commit 59daae5
Author: Aman <91308367+aman-db@users.noreply.github.com>
Date:   Thu Aug 8 20:20:17 2024 +0530

    adding fix for duplicate accountId in module 2010 and 3019 (#1270)

commit d6fa441
Author: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
Date:   Wed Aug 7 23:24:00 2024 +0530

    1030 pipeline validation framework (#1071)

    * Initial commit

    * 19-Oct-23 : Added Validation Framework

    * 19-Oct-23: Customize the message for customer

    * 19-Oct-23: Customize the message for customer

    * 26-Oct-23: Added OverwatchID filter in the table

    * 26-Oct-23: Change for Coding Best Practices

    * Added Function Description for validateColumnBetweenMultipleTable

    * Added Pattern Matching in Validation

    * Convert if-else in validateRuleAndUpdateStatus to case statement as per comment

    * Initial commit

    * traceability implemented (#1102)

    * traceability implemented

    * code review implemented

    * missed code implemented (#1105)

    * Initial commit

    * traceability implemented (#1102)

    * traceability implemented

    * code review implemented

    * missed code implemented

    * missed code implemented

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>

    * Added proper exception for Spark Stream Gold if progress c… (#1085)

    * Initial commit

    * 09-Nov-23: Added proper exception for Spark Stream Gold if progress column contains only null in SparkEvents_Bronze

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Gracefully Handle Exception for NotebookCommands_Gold (#1095)

    * Initial commit

    * Gracefully Handle Exception for NotebookCommands_Gold

    * Convert the check in buildNotebookCommandsFact to single or clause

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * code missed in merge (#1120)

    * Fix Helper Method to Instantiate Remote Workspaces (#1110)

    * Initial commit

    * Change getRemoteWorkspaceByPath and getWorkspaceByDatabase to take it RemoteWorkspace

    * Remove Unnecessary println Statements

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>

    * Ensure we test the write into a partitioned storage_prefix (#1088)

    * Initial commit

    * Ensure we test the write into a partitioned storage_prefix

    * silver warehouse spec fix (#1121)

    * added missed copy-pasta (#1129)

    * Exclude cluster logs in S3 root bucket (#1118)

    * Exclude cluster logs in S3 root bucket

    * Omit cluster log paths pointing to s3a as well

    * implemented recon (#1116)

    * implemented recon

    * docs added

    * file path change

    * review comments implemented

    * Added ShuffleFactor to NotebookCommands (#1124)

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * disabled traceability (#1130)

    * Added JobRun_Silver in buildClusterStateFact for Cluster E… (#1083)

    * Initial commit

    * 08-Nov-23: Added JobRun_Silver in buildClusterStateFact for Cluster End Time Imputation

    * Impute Terminating Events in CLSF from JR_Silver

    * Impute Terminating Events in CLSD

    * Impute Terminating Events in CLSD

    * Change CLSF to original 0730 version

    * Change CLSF to original 0730 version

    * Added cluster_spec in CLSD to get job Cluster only

    * Make the variables name in buildClusterStateDetail into more descriptive way

    * Make the variables name in buildClusterStateDetail into more descriptive way

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Sys table audit log integration (#1122)

    * system table integration with audit log

    * adding code to resolve issues with response col

    * fixed timestamp issue

    * adding print statement for from and until time

    * adding fix for azure

    * removed comments

    * removed comments and print statements

    * removed comments

    * implemented code review comments

    * implemented code review comments

    * adding review comment

    * Sys table integration multi account (#1131)

    * added code changes for multi account deployment

    * code for multi account system table integration

    * Sys table integration multi account (#1132)

    * added code changes for multi account deployment

    * code for multi account system table integration

    * adding code for system table migration check

    * changing exception for empty audit log from system table

    * adding code to handle sql_endpoint in configs and fix in migration validation (#1133)

    * corner case commit (#1134)

    * Handle CLSD Cluster Impute when jrcp and clusterSpec is Empty (#1135)

    * Handle CLSD Cluster Impute when jrcp and clusterSpec is Empty

    * Exclude last_state from clsd as it is not needed in the logic.

    ---------

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Exclude 2011 and 2014 as dependency module for 2019 (#1136)

    * Exclude 2011 and 2014 as dependency module for 2019

    * Added comment in CLSD for understandability

    ---------

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * corner case commit (#1137)

    * Update version

    * adding fix for empty EH config for system tables (#1140)

    * corner case commit (#1142)

    * adding fix for empty audit log for warehouse_spec_silver (#1141)

    * recon columns removed (#1143)

    * recon columns removed

    * recon columns removed

    * Initial Commit

    * Added Changes in Validation Framework as per comments added during sprint meeting

    * added hotfix for warehouse_spec_silver (#1154)

    * Added Multiple RunID check in Validation Framework

    * Added Other tables in Validation Framework

    * Added Multiple WS ID option in Cross Table Validation

    * Added change for Pipeline_report

    * Change for Pipeline Report

    * Added msg for single table validation

    * Added negative msg in HealthCheck Report

    * Added Negative Msg for Cross Table Validation

    * Added extra filter for total cost validation for CLSF

    * Changed as per Comments

    * Changed as per the comments

    * Added some filter condition for cost validation in clsf

    * Added Config for all pipeline run

    * 19-Oct-23 : Added Validation Framework

    * 19-Oct-23: Customize the message for customer

    * 19-Oct-23: Customize the message for customer

    * 26-Oct-23: Added OverwatchID filter in the table

    * 26-Oct-23: Change for Coding Best Practices

    * Added Function Description for validateColumnBetweenMultipleTable

    * Added Pattern Matching in Validation

    * Convert if-else in validateRuleAndUpdateStatus to case statement as per comment

    * traceability implemented (#1102)

    * traceability implemented

    * code review implemented

    * Added JobRun_Silver in buildClusterStateFact for Cluster E… (#1083)

    * Initial commit

    * 08-Nov-23: Added JobRun_Silver in buildClusterStateFact for Cluster End Time Imputation

    * Impute Terminating Events in CLSF from JR_Silver

    * Impute Terminating Events in CLSD

    * Impute Terminating Events in CLSD

    * Change CLSF to original 0730 version

    * Change CLSF to original 0730 version

    * Added cluster_spec in CLSD to get job Cluster only

    * Make the variables name in buildClusterStateDetail into more descriptive way

    * Make the variables name in buildClusterStateDetail into more descriptive way

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * corner case commit (#1134)

    * Exclude 2011 and 2014 as dependency module for 2019 (#1136)

    * Exclude 2011 and 2014 as dependency module for 2019

    * Added comment in CLSD for understandability

    ---------

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Added Changes in Validation Framework as per comments added during sprint meeting

    * Added Multiple RunID check in Validation Framework

    * Added Other tables in Validation Framework

    * Added Multiple WS ID option in Cross Table Validation

    * Added change for Pipeline_report

    * Change for Pipeline Report

    * Added msg for single table validation

    * Added negative msg in HealthCheck Report

    * Added Negative Msg for Cross Table Validation

    * Added extra filter for total cost validation for CLSF

    * Changed as per Comments

    * Changed as per the comments

    * Added some filter condition for cost validation in clsf

    * Added Config for all pipeline run

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>
    Co-authored-by: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com>
    Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>

commit 3c16b5f
Author: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
Date:   Wed Aug 7 23:23:17 2024 +0530

    Redefine views so that they are created from tables not locations (#1241)

    * Initial commit

    * Change publish() function to incorporate views from ETL Tables instead of paths

    * Handle view creation in case the table does not exist

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit f3ffd7c
Author: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
Date:   Wed Aug 7 23:21:37 2024 +0530

    1201 collect all event logs on first run (#1255)

    * Initial commit

    * cluster event bronze will take all the data from API for first run

    * Update BronzeTransforms.scala

    adjust whitespace around `landClusterEvents()`

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
    Co-authored-by: Neil Best <60439766+neilbest-db@users.noreply.github.com>

commit caa3282
Author: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com>
Date:   Wed Aug 7 23:20:25 2024 +0530

    append null columns from cluster snapshot for cluster_spec_silver (#1239)

    * Initial commit

    * append null columns from cluster snapshot for cluster_spec_silver

    * append null columns from cluster snapshot for cluster_spec_silver

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit f7460bd
Author: Neil Best <60439766+neilbest-db@users.noreply.github.com>
Date:   Tue Jul 30 14:52:38 2024 -0500

    adjust Silver Job Runs module configuration (#1256)

    enable auto-optimized shuffle for module 2011

    originally implemented in commit [d751d5f](d751d5f)

commit 25671b7
Author: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com>
Date:   Tue Jul 9 02:02:04 2024 +0530

    recon enhancement done to deal with different columns in source and target (#1216)

    * Initial commit

    * recon enhancement done to deal with different columns in source and target

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit 97236ae
Author: Guenia <guenia.izquierdo@databricks.com>
Date:   Wed May 8 19:43:29 2024 -0400

    Initial commit

commit f9c8dd0
Author: Guenia Izquierdo Delgado <guenia.izquierdo@databricks.com>
Date:   Mon Jun 24 11:28:15 2024 -0400

    0812 release (#1249)

    * Initial commit

    * adding fix for schemaScrubber and StructToMap (#1232)

    * fix for null driver_type_id and node_type_id in jrcp (#1236)

    * Modify Cluster_snapshot_bronze column (#1234)

    * Convert all the struct fields inside 'spec' column for cluster_snapshot_bronze to mapType

    * Dropped Spec column from snapshot

    * Removed Redundant VerifyMinSchema

    * Update_AWS_instance_types (#1248)

    * Update_gcp_instance_types (#1244)

    Update_gcp_instance_types

    * Update_AWS_instance_types

    Update_AWS_instance_types

    ---------

    Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>
    Co-authored-by: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
    Co-authored-by: Mohan Baabu <87074323+mohanbaabu1996@users.noreply.github.com>

commit 7390d4a
Author: Mohan Baabu <87074323+mohanbaabu1996@users.noreply.github.com>
Date:   Fri Jun 21 20:01:46 2024 +0530

    Update_Azure_Instance_details (#1246)

    * Update_Azure_Instance_details

    Update_Azure_Instance_details

    * Update Azure_Instance_Details.csv

    Updated Standard_NV72ads_A10_v5 types, missed a comma

commit 6cbb9d7
Author: Mohan Baabu <87074323+mohanbaabu1996@users.noreply.github.com>
Date:   Fri Jun 21 19:37:57 2024 +0530

    Update_gcp_instance_types (#1244)

    Update_gcp_instance_types

* add Spark conf option for `DataFrame` logging extension methods

This feature respects the logging level set for the logger in scope.

```scala
spark.conf.set( "overwatch.dataframelogger.level", "DEBUG")

logger.setLevel( "WARN")

df.log()

// no data shown in logs

logger.setLevel( "DEBUG")

df.log()

// :)
```

also:

- implement `DataFrameSyntaxTest` suite to test `Dataset`/`DataFrame`
  extension methods `.showLines()` and `.log()` as implemented within
  the `DataFrameSyntax` trait.

- move `SparkSessionTestWrapper` into `src/main` and made it extend
  `SparkSessionWrapper` in order to make `DataFrameSyntax` testable
  through the use of type parameter `SPARK` and self-typing.

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

sonarcloud bot commented Aug 26, 2024

Quality Gate failed

Failed conditions
8.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

Labels
duplicate This issue or pull request already exists enhancement New feature or request optimization Technical Spark Optimization
Successfully merging this pull request may close these issues.

5 participants