Add descriptive NamedTransformations to Spark UI #1223

Merged: 5 commits merged into 0820_release from named_transformation on Aug 20, 2024

Conversation

neilbest-db (Contributor) commented May 23, 2024

This makes the Spark UI more developer-friendly when analyzing
Overwatch runs.

Job group IDs have the form `<workspace name>:<OW module name>`

Any use of `.transform( df => df)` may be replaced with
`.transformWithDescription( nt)` after instantiating a `val nt =
NamedTransformation( df => df)` as its argument.

This commit contains one such application of the new extension method.
(See `val jobRunsAppendClusterName` in `WorkflowsTransforms.scala`.)
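
For illustration, a minimal sketch of the pattern (the import and the
column logic below are invented for the example; only `NamedTransformation`
and `.transformWithDescription` come from this PR):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, upper}

// name the transformation once, as a value . . .
val normalizeClusterName = NamedTransformation( (df: DataFrame) =>
  df.withColumn( "cluster_name", upper( col( "cluster_name"))))

// . . . then use it where a bare .transform( df => df) would go; the
// value's name can then label the corresponding jobs in the Spark UI
val labeled = jobRunsDF.transformWithDescription( normalizeClusterName)
```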

Some logic in `GoldTransforms` falls through to elements of the
special job-run-action form of Job Group IDs emitted by the platform,
but the impact is minimal relative to the benefit to Overwatch
development and troubleshooting. Even so, this form of Job Group ID is
still present in initial Spark events before OW ETL modules begin to
execute.

gueniai and others added 3 commits May 22, 2024 16:22
This makes the Spark UI more developer-friendly when analyzing
Overwatch runs.

Job group IDs have the form <workspace name>:<OW module name>

Any use of `.transform( df => df)` may be replaced with
`.transformWithDescription( nt)` after instantiating a `val nt =
NamedTransformation( df => df)` as its argument.

This commit contains one such application of the new extension method.
(See `val jobRunsAppendClusterName` in `WorkflowsTransforms.scala`.)

Some logic in `GoldTransforms` falls through to elements of the
special job-run-action form of Job Group IDs emitted by the platform
but the impact is minimal relative to the benefit to Overwatch
development and troubleshooting.  Even so this form of Job Group ID is
still present in initial Spark events before OW ETL modules begin to
execute.
sonarcloud bot commented May 23, 2024

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

neilbest-db (Contributor, Author) commented:

@souravbaner-da, @gueniai, is there anything I can do to help get this merged in?

@neilbest-db neilbest-db changed the base branch from main to 0820_release July 1, 2024 21:27
flip transformation names to beginning of label

for greater visibility in Spark UI. `NamedTransformation` type name
now appears in labels' second position.

(cherry picked from commit 2ead752)
neilbest-db (Contributor, Author) commented:

closes #1226

neilbest-db (Contributor, Author) commented (this comment was marked as outdated):

TODO: enumerate the regressions this would introduce when the labels set by the platform are replaced this way.
sonarcloud bot commented Jul 15, 2024

@neilbest-db neilbest-db changed the title Add descriptive Job Group ID and NamedTransformations to Spark UI Add descriptive NamedTransformations to Spark UI Jul 15, 2024
@neilbest-db neilbest-db requested review from souravbaner-da and removed request for gueniai and souravbaner-da July 15, 2024 17:45
@neilbest-db neilbest-db added the enhancement (New feature or request) and Testing (Tests Needed Or Part of Testing Suite) labels Jul 15, 2024
@@ -18,6 +18,7 @@ libraryDependencies += "com.databricks" % "dbutils-api_2.12" % "0.0.5" % Provide
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.595" % Provided
libraryDependencies += "io.delta" % "delta-core_2.12" % "1.0.0" % Provided
libraryDependencies += "org.scalaj" %% "scalaj-http" % "2.4.2"
libraryDependencies += "com.lihaoyi" %% "sourcecode" % "0.4.1"
souravbaner-da (Contributor) commented:
@neilbest-db This is not a licensed module, so we can't use it per the databrickslabs standard.

neilbest-db (Contributor, Author) replied:
@souravbaner-da, I just noticed your comment from 2024-06-05. I discussed this with @sriram251-code here in Slack back on 2024-06-04. I believe we have discussed it on >= 1 team calls since then also. If it's still a concern I would like to learn more about the standard you mention. Is that in a document somewhere?

gueniai (Contributor) commented Aug 12, 2024

@neilbest-db if we're no longer overriding the JobGroupIDs, what will be the benefit of this change?

neilbest-db (Contributor, Author) commented:

@gueniai, if your previous comment refers to the following code in this diff in `src/main/scala/com/databricks/labs/overwatch/pipeline/ETLDefinition.scala`:

```scala
val transformedDF = transforms.foldLeft(verifiedSourceDF) {
  case (df, transform) =>
    /*
     * reverting Spark UI Job Group labels for now
     *
     * TODO: enumerate the regressions this would introduce
     * when the labels set by the platform are replaced
     * this way.
     * df.sparkSession.sparkContext.setJobGroup(
     *   s"${module.pipeline.config.workspaceName}:${module.moduleName}",
     *   transform.toString)
     */
    df.transform( transform)
}
```

. . . then you already know that the manipulation of the JobGroupID is commented out in this PR. I left that for future reference; see details below. The rest of this PR implements and demonstrates how to manipulate the job description and call site fields in the Spark UI in a convenient way. The core of that abstraction is here:

```scala
sc.setJobDescription( namedTransformation.toString)
sc.setCallSite( callSite)
```
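
A sketch of how the extension method and `NamedTransformation` could tie
these calls together (the shapes below are guesses for illustration, not
the merged implementation):

```scala
import org.apache.spark.sql.DataFrame

// assumed shape: sourcecode implicits capture the val's name and call site
case class NamedTransformation( transformation: DataFrame => DataFrame)(
    implicit name: sourcecode.Name,
    file: sourcecode.FileName,
    line: sourcecode.Line) {
  val callSite: String = s"${file.value}:${line.value}"
  override val toString: String = s"${name.value}: NamedTransformation"
}

// assumed shape, inside some enclosing object imported at use sites
implicit class TransformationDescriber( df: DataFrame) {
  def transformWithDescription( nt: NamedTransformation): DataFrame = {
    val sc = df.sparkSession.sparkContext
    sc.setJobDescription( nt.toString) // appears as the job description
    sc.setCallSite( nt.callSite)       // appears as the call site
    df.transform( nt.transformation)
  }
}
```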

I am currently investigating whether there might be an alternative to the current dependence on the Job Group ID, which is not always helpful in linking Spark events to Databricks Workflow job/task runs, especially in cases of:

  • Legacy workflows spawned by dbutils.notebook.run()
  • Workflow tasks run on all-purpose clusters
  • (possibly other cases)

In those known scenarios, the Job Group ID either has a value that refers to the job_id and run_id of the spawned ephemeral run or some other value set by the platform that is NOT LIKE '%job-%-run-%'.

In any case, because `SparkContext.setJobGroup()` is a member of Spark's public API (Scala | Python), any workflow could override the value that Overwatch relies on to make certain associations. I intend to show that there is a more robust solution if possible. This is the main reason I am taking such care in my ongoing investigation, which was initiated by a user report of null Spark Task durations in `jobruncostpotentialfact_gold` ("JRCP").
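
For instance, nothing prevents arbitrary user code on the same cluster
from resetting that label (hypothetical values shown):

```scala
// any notebook or job task can overwrite the job group at will
spark.sparkContext.setJobGroup( "my-ad-hoc-group", "interactive analysis")
```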

@gueniai gueniai merged commit bbdb61f into 0820_release Aug 20, 2024
1 check passed
@gueniai gueniai deleted the named_transformation branch August 20, 2024 15:11
gueniai added a commit that referenced this pull request Aug 23, 2024
* Initial commit

* Add extension method to show `DataFrame` records in the log

* catch up with 0820_release

Squashed commit of the following:

commit bbdb61f
Author: Neil Best <60439766+neilbest-db@users.noreply.github.com>
Date:   Tue Aug 20 10:11:03 2024 -0500

    Add descriptive `NamedTransformation`s to Spark UI (#1223)

    * Initial commit

    * Add descriptive job group IDs and named transformations

    This makes the Spark UI more developer-friendly when analyzing
    Overwatch runs.

    Job group IDs have the form <workspace name>:<OW module name>

    Any use of `.transform( df => df)` may be replaced with
    `.transformWithDescription( nt)` after instantiating a `val nt =
    NamedTransformation( df => df)` as its argument.

    This commit contains one such application of the new extension method.
    (See `val jobRunsAppendClusterName` in `WorkflowsTransforms.scala`.)

    Some logic in `GoldTransforms` falls through to elements of the
    special job-run-action form of Job Group IDs emitted by the platform
    but the impact is minimal relative to the benefit to Overwatch
    development and troubleshooting.  Even so this form of Job Group ID is
    still present in initial Spark events before OW ETL modules begin to
    execute.

    * improve TransformationDescriberTest

    * flip transformation names to beginning of label

    for greater visibility in Spark UI. `NamedTransformation` type name
    now appears in labels' second position.

    (cherry picked from commit 2ead752)

    * revert modified Spark UI Job Group labels

    TODO: enumerate the regressions this would introduce when the labels set by the platform are replaced this way.

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit 3055a22
Author: Aman <91308367+aman-db@users.noreply.github.com>
Date:   Mon Aug 12 22:59:13 2024 +0530

    1218 warehouse state details (#1254)

    * test

    * code for warehouse_state_detail_silver

    * removed comments

    * adding warehouseEvents scope

    * added exception for table not found

    * added exception to check if system tables are getting used or not

    * enhance function getWarehousesEventDF

    * added code to fix max number of clusters

    * change in column names

    * refactored code

commit 59daae5
Author: Aman <91308367+aman-db@users.noreply.github.com>
Date:   Thu Aug 8 20:20:17 2024 +0530

    adding fix for duplicate accountId in module 2010 and 3019 (#1270)

commit d6fa441
Author: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
Date:   Wed Aug 7 23:24:00 2024 +0530

    1030 pipeline validation framework (#1071)

    * Initial commit

    * 19-Oct-23 : Added Validation Framework

    * 19-Oct-23: Customize the message for customer

    * 19-Oct-23: Customize the message for customer

    * 26-Oct-23: Added OverwatchID filter in the table

    * 26-Oct-23: Change for Coding Best Practices

    * Added Function Description for validateColumnBetweenMultipleTable

    * Added Pattern Matching in Validation

    * Convert if-else in validateRuleAndUpdateStatus to case statement as per comment

    * Initial commit

    * traceability implemented (#1102)

    * traceability implemented

    * code review implemented

    * missed code implemented (#1105)

    * Initial commit

    * traceability implemented (#1102)

    * traceability implemented

    * code review implemented

    * missed code implemented

    * missed code implemented

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>

    * Added proper exception for Spark Stream Gold if progress c… (#1085)

    * Initial commit

    * 09-Nov-23: Added proper exception for Spark Stream Gold if progress column contains only null in SparkEvents_Bronze

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Gracefully Handle Exception for NotebookCommands_Gold (#1095)

    * Initial commit

    * Gracefully Handle Exception for NotebookCommands_Gold

    * Convert the check in buildNotebookCommandsFact to single or clause

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * code missed in merge (#1120)

    * Fix Helper Method to Instantiate Remote Workspaces (#1110)

    * Initial commit

    * Change getRemoteWorkspaceByPath and getWorkspaceByDatabase to take it RemoteWorkspace

    * Remove Unnecessary println Statements

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>

    * Ensure we test the write into a partitioned storage_prefix (#1088)

    * Initial commit

    * Ensure we test the write into a partitioned storage_prefix

    * silver warehouse spec fix (#1121)

    * added missed copy-pasta (#1129)

    * Exclude cluster logs in S3 root bucket (#1118)

    * Exclude cluster logs in S3 root bucket

    * Omit cluster log paths pointing to s3a as well

    * implemented recon (#1116)

    * implemented recon

    * docs added

    * file path change

    * review comments implemented

    * Added ShuffleFactor to NotebookCommands (#1124)

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * disabled traceability (#1130)

    * Added JobRun_Silver in buildClusterStateFact for Cluster E… (#1083)

    * Initial commit

    * 08-Nov-23: Added JobRun_Silver in buildClusterStateFact for Cluster End Time Imputation

    * Impute Terminating Events in CLSF from JR_Silver

    * Impute Terminating Events in CLSD

    * Impute Terminating Events in CLSD

    * Change CLSF to original 0730 version

    * Change CLSF to original 0730 version

    * Added cluster_spec in CLSD to get job Cluster only

    * Make the variables name in buildClusterStateDetail into more descriptive way

    * Make the variables name in buildClusterStateDetail into more descriptive way

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Sys table audit log integration (#1122)

    * system table integration with audit log

    * adding code to resolve issues with response col

    * fixed timestamp issue

    * adding print statement for from and until time

    * adding fix for azure

    * removed comments

    * removed comments and print statements

    * removed comments

    * implemented code review comments

    * implemented code review comments

    * adding review comment

    * Sys table integration multi acount (#1131)

    * added code changes for multi account deployment

    * code for multi account system table integration

    * Sys table integration multi acount (#1132)

    * added code changes for multi account deployment

    * code for multi account system table integration

    * adding code for system table migration check

    * changing exception for empty audit log from system table

    * adding code to handle sql_endpoint in configs and fix in migration validation (#1133)

    * corner case commit (#1134)

    * Handle CLSD Cluster Impute when jrcp and clusterSpec is Empty (#1135)

    * Handle CLSD Cluster Impute when jrcp and clusterSpec is Empty

    * Exclude last_state from clsd as it is not needed in the logic.

    ---------

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Exclude 2011 and 2014 as dependency module for 2019 (#1136)

    * Exclude 2011 and 2014 as dependency module for 2019

    * Added comment in CLSD for understandability

    ---------

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * corner case commit (#1137)

    * Update version

    * adding fix for empty EH config for system tables (#1140)

    * corner case commit (#1142)

    * adding fix for empty audit log for warehouse_spec_silver (#1141)

    * recon columns removed (#1143)

    * recon columns removed

    * recon columns removed

    * Initial Commit

    * Added Changes in Validation Framework as per comments added during sprint meeting

    * added hotfix for warehouse_spec_silver (#1154)

    * Added Multiple RunID check in Validation Frameowkr

    * Added Other tables in Validation Framework

    * Added Multiple WS ID option in Cros Table Validation

    * Added change for Pipeline_report

    * Change for Pipeline Report

    * Added msg for single table validation

    * Added negative msg in HealthCheck Report

    * Added Negative Msg for Cross Table Validation

    * Added extra filter for total cost validation for CLSF

    * Changed as per Comments

    * Changed as per the comments

    * Added some filter condition for cost validation in clsf

    * Added Config for all pipeline run

    * 19-Oct-23 : Added Validation Framework

    * 19-Oct-23: Customize the message for customer

    * 19-Oct-23: Customize the message for customer

    * 26-Oct-23: Added OverwatchID filter in the table

    * 26-Oct-23: Change for Coding Best Practices

    * Added Function Description for validateColumnBetweenMultipleTable

    * Added Pattern Matching in Validation

    * Convert if-else in validateRuleAndUpdateStatus to case statement as per comment

    * traceability implemented (#1102)

    * traceability implemented

    * code review implemented

    * Added JobRun_Silver in buildClusterStateFact for Cluster E… (#1083)

    * Initial commit

    * 08-Nov-23: Added JobRun_Silver in buildClusterStateFact for Cluster End Time Imputation

    * Impute Terminating Events in CLSF from JR_Silver

    * Impute Terminating Events in CLSD

    * Impute Terminating Events in CLSD

    * Change CLSF to original 0730 version

    * Change CLSF to original 0730 version

    * Added cluster_spec in CLSD to get job Cluster only

    * Make the variables name in buildClusterStateDetail into more descriptive way

    * Make the variables name in buildClusterStateDetail into more descriptive way

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * corner case commit (#1134)

    * Exclude 2011 and 2014 as dependency module for 2019 (#1136)

    * Exclude 2011 and 2014 as dependency module for 2019

    * Added comment in CLSD for understandability

    ---------

    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>

    * Added Changes in Validation Framework as per comments added during sprint meeting

    * Added Multiple RunID check in Validation Frameowkr

    * Added Other tables in Validation Framework

    * Added Multiple WS ID option in Cros Table Validation

    * Added change for Pipeline_report

    * Change for Pipeline Report

    * Added msg for single table validation

    * Added negative msg in HealthCheck Report

    * Added Negative Msg for Cross Table Validation

    * Added extra filter for total cost validation for CLSF

    * Changed as per Comments

    * Changed as per the comments

    * Added some filter condition for cost validation in clsf

    * Added Config for all pipeline run

    ---------

    Co-authored-by: Guenia Izquierdo <guenia.izquierdo@databricks.com>
    Co-authored-by: Sourav Banerjee <30810740+Sourav692@users.noreply.github.com>
    Co-authored-by: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com>
    Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>

commit 3c16b5f
Author: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
Date:   Wed Aug 7 23:23:17 2024 +0530

    Redefine views so that they are created from tables not locations (#1241)

    * Initial commit

    * Change publish() function to incorporate views from ETL Tables iso paths

    * Handle view creation in case of table does not exists

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit f3ffd7c
Author: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
Date:   Wed Aug 7 23:21:37 2024 +0530

    1201 collect all event logs on first run (#1255)

    * Initial commit

    * cluster event bronze will take all the data from API for first run

    * Update BronzeTransforms.scala

    adjust whitespace around `landClusterEvents()`

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
    Co-authored-by: Neil Best <60439766+neilbest-db@users.noreply.github.com>

commit caa3282
Author: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com>
Date:   Wed Aug 7 23:20:25 2024 +0530

    append null columns from cluster snapshot for cluster_spec_silver (#1239)

    * Initial commit

    * append null columns from cluster snapshot for cluster_spec_silver

    * append null columns from cluster snapshot for cluster_spec_silver

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit f7460bd
Author: Neil Best <60439766+neilbest-db@users.noreply.github.com>
Date:   Tue Jul 30 14:52:38 2024 -0500

    adjust Silver Job Runs module configuration (#1256)

    enable auto-optimized shuffle for module 2011

    originally implemented in commit d751d5f

commit 25671b7
Author: Sriram Mohanty <69749553+sriram251-code@users.noreply.github.com>
Date:   Tue Jul 9 02:02:04 2024 +0530

    recon enhancement done to deal with different columns in source and target (#1216)

    * Initial commit

    * recon enhancement done to deal with different columns in source and target

    ---------

    Co-authored-by: Guenia <guenia.izquierdo@databricks.com>

commit 97236ae
Author: Guenia <guenia.izquierdo@databricks.com>
Date:   Wed May 8 19:43:29 2024 -0400

    Initial commit

commit f9c8dd0
Author: Guenia Izquierdo Delgado <guenia.izquierdo@databricks.com>
Date:   Mon Jun 24 11:28:15 2024 -0400

    0812 release (#1249)

    * Initial commit

    * adding fix for schemaScrubber and StructToMap (#1232)

    * fix for null driver_type_id and node_type_id in jrcp (#1236)

    * Modify Cluster_snapshot_bronze column (#1234)

    * Comvert all the struct field inside 'spec' column for cluster_snapshot_bronze to mapType

    * Dropped Spec column from snapshot

    * Removed Reductant VerifyMinSchema

    * Update_AWS_instance_types (#1248)

    * Update_gcp_instance_types (#1244)

    Update_gcp_instance_types

    * Update_AWS_instance_types

    Update_AWS_instance_types

    ---------

    Co-authored-by: Aman <91308367+aman-db@users.noreply.github.com>
    Co-authored-by: Sourav Banerjee <109206082+souravbaner-da@users.noreply.github.com>
    Co-authored-by: Mohan Baabu <87074323+mohanbaabu1996@users.noreply.github.com>

commit 7390d4a
Author: Mohan Baabu <87074323+mohanbaabu1996@users.noreply.github.com>
Date:   Fri Jun 21 20:01:46 2024 +0530

    Update_Azure_Instance_details (#1246)

    * Update_Azure_Instance_details

    Update_Azure_Instance_details

    * Update Azure_Instance_Details.csv

    Updated Standard_NV72ads_A10_v5 types, missed a comma

commit 6cbb9d7
Author: Mohan Baabu <87074323+mohanbaabu1996@users.noreply.github.com>
Date:   Fri Jun 21 19:37:57 2024 +0530

    Update_gcp_instance_types (#1244)

    Update_gcp_instance_types

* add Spark conf option for `DataFrame` logging extension methods

This feature respects the logging level set for the logger in scope.

```scala
spark.conf.set( "overwatch.dataframelogger.level", "DEBUG")

logger.setLevel( "WARN")

df.log()

// no data shown in logs

logger.setLevel( "DEBUG")

df.log()

// :)
```

also:

- implement `DataFrameSyntaxTest` suite to test `Dataset`/`DataFrame`
  extension methods `.showLines()` and `.log()` as implemented within
  the `DataFrameSyntax` trait.

- move `SparkSessionTestWrapper` into `src/main` and made it extend
  `SparkSessionWrapper` in order to make `DataFrameSyntax` testable
  through the use of type parameter `SPARK` and self-typing.

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
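
A level-gated `df.log()` of the kind described above might look roughly
like this (a sketch under assumptions; the actual `DataFrameSyntax`
internals are not shown in this thread):

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.DataFrame

// assumed shape, inside some enclosing object
implicit class DataFrameLogger( df: DataFrame) {

  private val logger: Logger = Logger.getLogger( getClass)

  def log( numRows: Int = 20): Unit = {
    // the level at which DataFrame contents should be emitted . . .
    val level = Level.toLevel(
      df.sparkSession.conf.get( "overwatch.dataframelogger.level", "DEBUG"))
    // . . . respecting the logger in scope: skip the work entirely
    // unless the logger would actually record messages at that level
    if (logger.isEnabledFor( level))
      logger.log( level, df.take( numRows).mkString( "\n"))
  }
}
```
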
gueniai added a commit that referenced this pull request Sep 9, 2024
* Initial commit

* Add descriptive job group IDs and named transformations

* improve TransformationDescriberTest

* flip transformation names to beginning of label

* revert modified Spark UI Job Group labels

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
gueniai added a commit that referenced this pull request Sep 9, 2024
* Initial commit

* Add extension method to show `DataFrame` records in the log

* catch up with 0820_release

---------

Co-authored-by: Guenia <guenia.izquierdo@databricks.com>
Labels: enhancement (New feature or request), Testing (Tests Needed Or Part of Testing Suite)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Dataset.transformWithDescription extension method to add source code metadata to SparkUI
3 participants