
add spark_conf_string to thrift connection type #591

Closed · wants to merge 7 commits

Conversation

@vinhnemo (Author) commented Jan 13, 2023

resolves #590

Description

Checklist

@cla-bot bot commented Jan 13, 2023

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement, and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @vinhnemo

@cla-bot bot commented Jan 13, 2023

CLA has not been signed by users: @vinhnemo, @justinvin

@cla-bot bot commented Feb 1, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@Fleid (Contributor) commented Feb 15, 2023

Hi @vinhnemo, thanks for taking the time to contribute!
A duplicate issue and PR already exist for this exact scenario (#387 and #577).

Your approach is to create a new configuration string, spark_conf_string, which then needs to be parsed. Their approach is to re-purpose server_side_parameters from the ODBC method.

I have a preference for the server_side_parameters approach, as it keeps the profiles.yml schema consistent across methods. Any thoughts on that?

@cla-bot bot commented Feb 15, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@VShkaberda (Contributor) commented:

It seems this solution has a disadvantage: it disallows using = and ; inside values. At a minimum, those characters might be needed in the following settings (see the sketch after this list):

  1. spark.driver.extraJavaOptions, e.g. spark.driver.extraJavaOptions=-Dconfig.resource=dev
  2. spark.redaction.string.regex, e.g. (?=...) as part of the pattern
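
To make that limitation concrete, here is a small hypothetical parsing sketch (not code from this PR): limiting each split to the first = rescues values that contain =, but a ; inside a value is still indistinguishable from a pair separator.

    # Hypothetical illustration of the pitfall noted above; not code from this PR.
    conf = "spark.driver.extraJavaOptions=-Dconfig.resource=dev;spark.executor.memory=4g"

    # Splitting each pair on every "=" would fail here, because the first entry
    # contains two of them. Limiting the split to the first "=" recovers this case...
    parsed = dict(pair.split("=", 1) for pair in conf.split(";"))
    print(parsed)
    # {'spark.driver.extraJavaOptions': '-Dconfig.resource=dev',
    #  'spark.executor.memory': '4g'}

    # ...but a ";" inside a value, e.g. in a spark.redaction.string.regex pattern,
    # would still be treated as a pair separator, so the string format stays lossy.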

@vinhnemo (Author) commented:

Hi @Fleid,

Just to be clear, my goal is to be able to change the Spark configuration through profiles.yml when connecting via a Thrift connection. These configs will be adjusted per dbt model.

To avoid change detection on profiles.yml triggering a full re-parse (docs), the simplest way is to pass these configs in from an environment variable via Jinja (ref):
spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}"

With the server_side_parameters approach, the data type is a dict, so it is hard to work around the re-parse limitation while using the environment variable mentioned above.

These specific requirements led me to add another parameter with a string data type.
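
For context, a minimal sketch of that idea under a few assumptions: naive semicolon/equals parsing, and a PyHive-style hive.connect() that accepts a configuration dict. It is an illustration of the shape of the feature, not necessarily what this PR implements.

    import os
    from typing import Dict

    from pyhive import hive  # assumption: a PyHive-style Thrift client


    def parse_spark_conf_string(conf_string: str) -> Dict[str, str]:
        """Turn 'k1=v1;k2=v2' into a dict, splitting each pair on the first '='."""
        if not conf_string:
            return {}
        return dict(pair.split("=", 1) for pair in conf_string.split(";") if pair)


    # profiles.yml renders spark_conf_string from the environment, e.g.
    #   spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}"
    conf = parse_spark_conf_string(os.environ.get("SPARK_CONFIG_STRING", ""))

    # The parsed pairs are handed to the Thrift session as its configuration
    # (host, port, and username below are placeholders).
    connection = hive.connect(
        host="localhost",
        port=10000,
        username="dbt",
        configuration=conf,
    )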

@@ -74,6 +74,7 @@ class SparkCredentials(Credentials):
connect_timeout: int = 10
use_ssl: bool = False
server_side_parameters: Dict[str, Any] = field(default_factory=dict)
spark_conf_string: Optional[str] = None
Contributor
Why does this need to be a string? Can it be a dict? Doing so would allow us to escape special characters in the configuration.

Author

Hi @colin-rogers-dbt,
As I mentioned in the previous comment, the main reason is to prevent a full re-parse when we need to change the Spark configuration via the dbt profile from run to run.

Contributor
my bad, should have made the connection.

@colin-rogers-dbt (Contributor) commented Feb 21, 2023

These configs will be adjusted according to the DBT models.

If the goal is to allow per-model configuration, I wonder if this would be better placed in the model or schema configuration.

@Fleid (Contributor) commented Feb 27, 2023

OK, so we want the parameter to be a string so it can be injected via an environment variable.
So any way we slice it, we need the parameter to have a different name from the existing one because of the mismatch in type.

Now, to follow up on Colin's question, when you say:

These configs will be adjusted according to the DBT models.

Do you mean that you're targeting subsets of the DAG while running dbt, changing the value of that string between calls?
I'm not sure that's the case, though, because the settings we're handling are connection-level, are they not?

@VShkaberda (Contributor) commented Mar 2, 2023

Do you mean that you're targeting subsets of the DAG while running dbt, changing the value of that string between calls? I'm not sure that's the case, though, because the settings we're handling are connection-level, are they not?

The use case our team ends up with:

  • we have a dbt project with multiple models
  • the project is scheduled to run in Kubernetes with the same profile for all runs
  • the models are scheduled to run separately
  • model X needs to run with the default parameters, model Y needs a specific spark.executor.memory, and model Z needs a specific spark.executor.memory and spark.kubernetes.memoryOverheadFactor

@vinhnemo (Author) commented Mar 2, 2023

Do you mean that you're targeting subsets of the DAG while running dbt, changing the value of that string between calls?
I'm not sure that's the case, though, because the settings we're handling are connection-level, are they not?

Thanks @colin-rogers-dbt and @Fleid for the replies.

The Spark configs I want to pass in are at the connection level.

Each dbt model runs on an independent Spark engine, and profiles.yml specifies resources for these engines via spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}".

We override this variable to allocate resources and initialize the Spark engine for each dbt-spark model each time it runs, for example spark_conf_string: 'spark.executor.memory=1g;spark.executor.cores=1'.

Each Spark engine is spawned to serve one dbt-spark model and shuts down when that model is done.
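
To illustrate that workflow, a hypothetical orchestration snippet (model names, selectors, and memory values are made up) that sets SPARK_CONFIG_STRING differently before each dbt invocation:

    import os
    import subprocess

    # Hypothetical per-model overrides: model X runs with the profile defaults,
    # models Y and Z get extra resources.
    runs = {
        "model_x": "",
        "model_y": "spark.executor.memory=4g",
        "model_z": "spark.executor.memory=8g;spark.kubernetes.memoryOverheadFactor=0.4",
    }

    for model, conf_string in runs.items():
        env = dict(os.environ, SPARK_CONFIG_STRING=conf_string)
        # Each invocation spawns (and later tears down) its own Spark engine,
        # configured through the env var that profiles.yml reads via env_var().
        subprocess.run(["dbt", "run", "--select", model], env=env, check=True)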

@Fleid (Contributor) commented Mar 3, 2023

OK, that's what I was wondering about. This feels like virtual warehouse overriding for Snowflake: you need a couple of models processed with additional resources, and really the expression of the capacity you need should be a model-level configuration.

When a model has a different snowflake_warehouse setting than the default (profile level), it gets updated/reverted via model hooks:

    # Before the model runs: if its config specifies a warehouse that differs
    # from the profile default, switch to it and return the previous warehouse
    # so it can be restored afterwards.
    def pre_model_hook(self, config: Mapping[str, Any]) -> Optional[str]:
        default_warehouse = self.config.credentials.warehouse
        warehouse = config.get("snowflake_warehouse", default_warehouse)
        if warehouse == default_warehouse or warehouse is None:
            return None
        previous = self._get_warehouse()
        self._use_warehouse(warehouse)
        return previous

    # After the model runs: revert to the warehouse that was active before.
    def post_model_hook(self, config: Mapping[str, Any], context: Optional[str]) -> None:
        if context is not None:
            self._use_warehouse(context)

I'm wondering if we couldn't do something similar here?

What I'm not sure about is the life cycle of sessions/connections in dbt-spark (and dbt in general, to be honest).
If we're re-using a single connection across models, we need to check that Spark allows these settings to be changed mid-session. If instead we're closing/re-opening connections between models, we should be good. @dbeatty10, could you lend me your wisdom on that? ;)
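
For illustration, a rough sketch of what a Spark analogue of those hooks could look like, assuming the connection is reused across models and the Thrift server accepts session-level SET statements (and RESET <key>, available from Spark 3.0). The function names are made up and this is not dbt-spark API:

    from typing import Dict, List


    def pre_model_spark_conf(cursor, overrides: Dict[str, str]) -> List[str]:
        """Apply per-model session settings before a model runs; return the
        keys that were changed so they can be reverted afterwards."""
        for key, value in overrides.items():
            cursor.execute(f"SET {key}={value}")
        return list(overrides)


    def post_model_spark_conf(cursor, changed_keys: List[str]) -> None:
        """Revert per-model session settings once the model is done."""
        for key in changed_keys:
            cursor.execute(f"RESET {key}")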

@nathaniel-may requested a review from a team as a code owner on March 16, 2023, 21:20.
@cla-bot bot commented Mar 16, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@cla-bot bot commented Jun 26, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@JCZuurmond (Collaborator) commented:

@Fleid: @Fokko and I prefer to close this one rather than #577, as we prefer passing the parameters as a dictionary over a string.

@colin-rogers-dbt (Contributor) commented Oct 11, 2023

Closing per @JCZuurmond.


Successfully merging this pull request may close these issues.

[CT-1812] [Feature] Adding extra Spark config for Thrift connection type
7 participants