Es.write.operation documentation is deceptive on default values when used via spark #2206

Open · 1 of 2 tasks
robwithhair opened this issue Mar 19, 2024 · 1 comment

robwithhair commented Mar 19, 2024

What kind of issue is this?

  • Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
    The easier it is to track down the bug, the faster it is solved.
  • Feature Request. Start by telling us what problem you’re trying to solve.
    Often a solution already exists! Don’t send pull requests to implement new features without
    first getting our support. Sometimes we leave features out on purpose to keep the project small.

Issue description

The documentation suggests that the default es.write.operation is index, but when used via Spark with output mode "update", the default is actually upsert. This information is only available by reading the code.

This is misleading because the documentation implies that in Spark update mode the default value of index will be used, when in fact the default is overridden to "upsert", as confirmed both by testing and by visually reviewing the code.

Steps to reproduce

Code:

N/A, as this is a documentation fix; a minimal sketch of the scenario is shown below for illustration.
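
A minimal sketch of the behavior, assuming the built-in rate source and made-up index, id-column, and checkpoint names (none of which come from the original report):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder().appName("es-write-operation-demo").getOrCreate()

// Any streaming DataFrame will do; the built-in rate source is used here
// purely for illustration.
val updates = spark.readStream
  .format("rate")
  .load()
  .selectExpr("CAST(value AS STRING) AS id", "timestamp")

// es.write.operation is never set here, so per the documentation one would
// expect the default ("index"). Because the output mode is "update", the
// connector actually performs upserts instead.
updates.writeStream
  .format("es")
  .outputMode(OutputMode.Update())
  .option("checkpointLocation", "/tmp/es-checkpoint") // hypothetical path
  .option("es.mapping.id", "id") // upserts need a document id
  .start("my-index") // hypothetical index name
```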

Stack trace:

N/A

jbaiera (Member) commented Mar 19, 2024

This could be better detailed in the docs for sure.

When using update mode in Spark SQL, the connector changes the operation to "upsert" because 1) it needs that request mode to satisfy the invariants defined by Spark, and 2) it anticipates that you will need that setting in order to use that mode, so it sets it for you; that way you don't have to state in multiple places that you want to update data.
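
In other words, writing with update mode behaves roughly as if the operation had been configured explicitly. A sketch of the equivalent explicit configuration (not the connector's actual internals), reusing the hypothetical updates stream from the example above:

```scala
// Effectively what the connector does on your behalf when the output
// mode is "update": force the write operation to upsert.
updates.writeStream
  .format("es")
  .outputMode("update")
  .option("es.write.operation", "upsert") // set implicitly by the connector
  .option("es.mapping.id", "id")          // upserts still need a document id
  .option("checkpointLocation", "/tmp/es-checkpoint")
  .start("my-index")
```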

Fun fact: there are actually quite a lot of places where we plug into Spark to modify the connector's behavior based on your API usage. For example, we push queries down to ES (by default we don't filter results from the server, but we generate queries from the query plan when we're able to), and we limit the fields returned from the server (we intercept the field projection from Spark when it's available so we don't pull fields from each document that aren't needed for the operation). It's tough to list all of these out because in some cases we merge existing configurations together, in other cases we override them, and sometimes we simply offload the concern onto the library code so users don't have to worry about the configuration.
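
As an illustration of the pushdown behavior (a sketch; the index and field names are made up, and pushdown is the connector's documented Spark SQL option, enabled by default):

```scala
// With the Spark SQL integration, "pushdown" (true by default) translates
// Spark filters in the query plan into an Elasticsearch query, and the
// column projection limits which fields are fetched from each document.
val df = spark.read
  .format("es")
  .option("pushdown", "true") // the default; shown here for clarity
  .load("my-index") // hypothetical index name

// The filter can become an ES query and the select a field projection, so
// the server only returns matching documents with the two requested fields.
df.filter(df("status") === "active")
  .select("id", "status")
  .show()
```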

jbaiera added the >docs label Mar 19, 2024