
add spark_conf_string to thrift connection type #591

Closed · wants to merge 7 commits

Conversation

@vinhnemo (Author) commented Jan 13, 2023

resolves #590

Description

Checklist

@cla-bot bot commented Jan 13, 2023

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement, and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @vinhnemo

@cla-bot bot commented Jan 13, 2023

CLA has not been signed by users: @vinhnemo, @justinvin

@cla-bot bot commented Feb 1, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@Fleid (Contributor) commented Feb 15, 2023

Hi @vinhnemo, thanks for taking the time to contribute!
A duplicate issue and PR already exist for this exact scenario (#387 and #577).

Your approach is to create a new configuration string, spark_conf_string, which then needs to be parsed. Their approach is to re-purpose server_side_parameters from the ODBC method.

I have a preference for the server_side_parameters approach, as it keeps the profiles.yml schema consistent across methods. Any thoughts on that?

@cla-bot bot commented Feb 15, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@VShkaberda (Contributor) commented:

It seems this solution has a disadvantage: it disallows using = and ; inside values. At a minimum, those characters might be needed in the following settings (see the sketch after this list):

  1. spark.driver.extraJavaOptions, e.g. spark.driver.extraJavaOptions=-Dconfig.resource=dev
  2. spark.redaction.string.regex, e.g. (?=...) as part of the pattern
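
To make that limitation concrete, here is a small hypothetical parsing sketch (not code from this PR): limiting each split to the first = rescues values that contain =, but a ; inside a value is still indistinguishable from a pair separator.

    # Hypothetical illustration of the pitfall noted above; not code from this PR.
    conf = "spark.driver.extraJavaOptions=-Dconfig.resource=dev;spark.executor.memory=4g"

    # Splitting each pair on every "=" would fail here, because the first entry
    # contains two of them. Limiting the split to the first "=" recovers this case...
    parsed = dict(pair.split("=", 1) for pair in conf.split(";"))
    print(parsed)
    # {'spark.driver.extraJavaOptions': '-Dconfig.resource=dev',
    #  'spark.executor.memory': '4g'}

    # ...but a ";" inside a value, e.g. in a spark.redaction.string.regex pattern,
    # would still be treated as a pair separator, so the string format stays lossy.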

@vinhnemo (Author) commented:

Hi @Fleid,

Just to be clear, my goal is to be able to change the Spark configuration through profiles.yml when connecting via a Thrift connection. These configs will be adjusted per dbt model.

To avoid change detection on profiles.yml triggering a full re-parse (docs), the simplest way is to pass these configs in from an environment variable via Jinja (ref):
spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}"

With the server_side_parameters approach, the data type is a dict, so it is hard to work around the re-parse limitation while using the environment variable mentioned above.

These specific requirements led me to add another parameter with a string data type.
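
For context, a minimal sketch of that idea under a few assumptions: naive semicolon/equals parsing, and a PyHive-style hive.connect() that accepts a configuration dict. It is an illustration of the shape of the feature, not necessarily what this PR implements.

    import os
    from typing import Dict

    from pyhive import hive  # assumption: a PyHive-style Thrift client


    def parse_spark_conf_string(conf_string: str) -> Dict[str, str]:
        """Turn 'k1=v1;k2=v2' into a dict, splitting each pair on the first '='."""
        if not conf_string:
            return {}
        return dict(pair.split("=", 1) for pair in conf_string.split(";") if pair)


    # profiles.yml renders spark_conf_string from the environment, e.g.
    #   spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}"
    conf = parse_spark_conf_string(os.environ.get("SPARK_CONFIG_STRING", ""))

    # The parsed pairs are handed to the Thrift session as its configuration
    # (host, port, and username below are placeholders).
    connection = hive.connect(
        host="localhost",
        port=10000,
        username="dbt",
        configuration=conf,
    )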

@@ -74,6 +74,7 @@ class SparkCredentials(Credentials):
connect_timeout: int = 10
use_ssl: bool = False
server_side_parameters: Dict[str, Any] = field(default_factory=dict)
spark_conf_string: Optional[str] = None
Contributor
Why does this need to be a string? Can it be a dict? Doing so would allow us to escape special characters in the configuration.

Author

Hi @colin-rogers-dbt,
As I mentioned in the previous comment, the main reason is to prevent a full re-parse when we need to change the Spark configuration via the dbt profile from run to run.

Contributor
my bad, should have made the connection.

@colin-rogers-dbt (Contributor) commented Feb 21, 2023

These configs will be adjusted according to the DBT models.

If the goal is to allow per-model configuration, I wonder if this would be better placed in the model or schema configuration.

@Fleid (Contributor) commented Feb 27, 2023

OK, so we want the parameter to be a string so it can be injected via an environment variable.
So any way we slice it, we need the parameter to have a different name from the existing one because of the mismatch in type.

Now, to follow up on Colin's question, when you say:

These configs will be adjusted according to the DBT models.

Do you mean that you're targeting subsets of the DAG while running dbt, changing the value of that string between calls?
I'm not sure that's the case, though, because the settings we're handling are connection-level, are they not?

@VShkaberda (Contributor) commented Mar 2, 2023

Do you mean that you're targeting subsets of the DAG while running dbt, changing the value of that string between calls? I'm not sure that's the case, though, because the settings we're handling are connection-level, are they not?

The use case our team ends up with:

  • we have a dbt project with multiple models
  • the project is scheduled to run in Kubernetes with the same profile for all runs
  • the models are scheduled to run separately
  • model X needs to run with the default parameters, model Y needs a specific spark.executor.memory, and model Z needs a specific spark.executor.memory and spark.kubernetes.memoryOverheadFactor

@vinhnemo (Author) commented Mar 2, 2023

Do you mean that you're targeting subsets of the DAG while running dbt, changing the value of that string between calls?
I'm not sure that's the case, though, because the settings we're handling are connection-level, are they not?

Thanks @colin-rogers-dbt and @Fleid for the replies.

The Spark configs I want to pass in are at the connection level.

Each dbt model runs on an independent Spark engine, and profiles.yml specifies resources for these engines via spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}".

We override this variable to allocate resources and initialize the Spark engine for each dbt-spark model each time it runs, for example spark_conf_string: 'spark.executor.memory=1g;spark.executor.cores=1'.

Each Spark engine is spawned to serve one dbt-spark model and shuts down when that model is done.
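
To illustrate that workflow, a hypothetical orchestration snippet (model names, selectors, and memory values are made up) that sets SPARK_CONFIG_STRING differently before each dbt invocation:

    import os
    import subprocess

    # Hypothetical per-model overrides: model X runs with the profile defaults,
    # models Y and Z get extra resources.
    runs = {
        "model_x": "",
        "model_y": "spark.executor.memory=4g",
        "model_z": "spark.executor.memory=8g;spark.kubernetes.memoryOverheadFactor=0.4",
    }

    for model, conf_string in runs.items():
        env = dict(os.environ, SPARK_CONFIG_STRING=conf_string)
        # Each invocation spawns (and later tears down) its own Spark engine,
        # configured through the env var that profiles.yml reads via env_var().
        subprocess.run(["dbt", "run", "--select", model], env=env, check=True)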

@Fleid (Contributor) commented Mar 3, 2023

OK, that's what I was wondering about. This feels like virtual warehouse overriding for Snowflake: you need a couple of models processed with additional resources, and really the expression of the capacity you need should be a model-level configuration.

When a model has a different snowflake_warehouse setting than the default (profile level), it gets updated/reverted via model hooks:

    # Before the model runs: if its config specifies a warehouse that differs
    # from the profile default, switch to it and return the previous warehouse
    # so it can be restored afterwards.
    def pre_model_hook(self, config: Mapping[str, Any]) -> Optional[str]:
        default_warehouse = self.config.credentials.warehouse
        warehouse = config.get("snowflake_warehouse", default_warehouse)
        if warehouse == default_warehouse or warehouse is None:
            return None
        previous = self._get_warehouse()
        self._use_warehouse(warehouse)
        return previous

    # After the model runs: revert to the warehouse that was active before.
    def post_model_hook(self, config: Mapping[str, Any], context: Optional[str]) -> None:
        if context is not None:
            self._use_warehouse(context)

I'm wondering if we couldn't do something similar here?

What I'm not sure about is the life cycle of sessions/connections in dbt-spark (and dbt in general, to be honest).
If we're re-using a single connection across models, we need to check that Spark allows these settings to be changed mid-session. If instead we're closing/re-opening connections between models, we should be good. @dbeatty10, could you lend me your wisdom on that? ;)
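
For illustration, a rough sketch of what a Spark analogue of those hooks could look like, assuming the connection is reused across models and the Thrift server accepts session-level SET statements (and RESET <key>, available from Spark 3.0). The function names are made up and this is not dbt-spark API:

    from typing import Dict, List


    def pre_model_spark_conf(cursor, overrides: Dict[str, str]) -> List[str]:
        """Apply per-model session settings before a model runs; return the
        keys that were changed so they can be reverted afterwards."""
        for key, value in overrides.items():
            cursor.execute(f"SET {key}={value}")
        return list(overrides)


    def post_model_spark_conf(cursor, changed_keys: List[str]) -> None:
        """Revert per-model session settings once the model is done."""
        for key in changed_keys:
            cursor.execute(f"RESET {key}")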

@nathaniel-may requested a review from a team as a code owner on March 16, 2023, 21:20.
@cla-bot bot commented Mar 16, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@cla-bot bot commented Jun 26, 2023

CLA has not been signed by users: @justinvin, @vinhnemo

@JCZuurmond (Collaborator) commented:

@Fleid: @Fokko and I prefer to close this one rather than #577, as we prefer passing the parameters as a dictionary over a string.

@colin-rogers-dbt (Contributor) commented Oct 11, 2023

Closing per @JCZuurmond.


Successfully merging this pull request may close these issues.

[CT-1812] [Feature] Adding extra Spark config for Thrift connection type
7 participants