
Spark connector's implementation of "explode" does not work on nested fields #2051

Open
1 of 2 tasks

ThibSCH opened this issue Dec 14, 2022 · 0 comments
ThibSCH commented Dec 14, 2022

Hi everyone,

What kind of issue is this?

  • Bug report
  • Feature request

Issue description

We use Spark to manipulate an array of distinct objects in an ElasticSearch index.
The ElasticSearch index's field is mapped as:

```json
"array_field": {
  "type": "nested",
  "properties": {
    "property1": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "property2": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "property3": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "property4": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "property5": {
      "type": "date"
    }
  }
}
```
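A document matching this mapping might look like the following sketch (the values are hypothetical, for illustration only):

```json
{
  "array_field": [
    {
      "property1": "a",
      "property2": "b",
      "property3": "c",
      "property4": "d",
      "property5": "2022-12-14"
    }
  ]
}
```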

When we use the explode Spark function on a dataset created by reading from ElasticSearch, the connector generates the following query:

```json
"query": {
  "bool": {
    "must": [
      {
        "match_all": {}
      }
    ],
    "filter": [
      {
        "exists": {
          "field": "array_field"
        }
      }
    ]
  }
}
```

The "exists" part of the query is generated to differentiate explode from explode_outer, because explode drops null elements whereas explode_outer keeps them.
But since the field is mapped as nested, the exists filter never matches anything, because it is not wrapped in a nested query; the dataset is therefore always empty.
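The explode/explode_outer distinction that motivates the exists filter can be illustrated with a minimal sketch in plain Python (the rows, ids, and field names are hypothetical, standing in for a Spark dataset):

```python
# Sketch of Spark's semantics: explode drops rows whose array is null or
# empty, while explode_outer keeps them with a null element. This is why
# the connector pushes down an exists filter for explode.

rows = [
    {"id": 1, "array_field": [{"property1": "a"}, {"property1": "b"}]},
    {"id": 2, "array_field": None},  # no nested documents
]

def explode(rows, field):
    # One output row per array element; rows with a null/empty array vanish.
    return [
        {**{k: v for k, v in row.items() if k != field}, field: elem}
        for row in rows
        for elem in (row[field] or [])
    ]

def explode_outer(rows, field):
    # Like explode, but a null/empty array yields one row with field = None.
    out = []
    for row in rows:
        for elem in (row[field] or [None]):
            out.append({**{k: v for k, v in row.items() if k != field}, field: elem})
    return out

print(explode(rows, "array_field"))        # 2 rows, id 2 dropped
print(explode_outer(rows, "array_field"))  # 3 rows, id 2 kept with None
```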

Steps to reproduce

  1. Create an index with a nested mapped field
  2. Put a document with a valued nested field
  3. Read the index from Spark into a dataset
  4. Call Spark's explode(field) on the nested field of the dataset
  5. The dataset is empty because the generated query does not match any document
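A filter that would match documents with a populated nested field wraps the exists clause in a nested query. This is a sketch of what the connector would need to generate, not what it currently produces:

```json
"query": {
  "bool": {
    "must": [
      {
        "match_all": {}
      }
    ],
    "filter": [
      {
        "nested": {
          "path": "array_field",
          "query": {
            "exists": {
              "field": "array_field"
            }
          }
        }
      }
    ]
  }
}
```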

Version Info

OS: Linux
JVM: 1.8
Hadoop/Spark: Spark 3.3.0
ES-Hadoop: elasticsearch-spark-30_2.12:8.2.2
