IS NOT IN operator is translated to a wrong query #2118

j-adamczyk · 2023-08-08T19:34:03Z

What kind of issue is this?

Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
The easier it is to track down the bug, the faster it is solved.

Issue description

When I negate the .isin() function in PySpark (that is, use ~column.isin(...)), the generated Elasticsearch query is malformed and results in an error.

Steps to reproduce

Code:

es_connector = (
    spark.read.format("es")
    .option("es.read.metadata", "false")
    .option("es.nodes.wan.only", "true")
    .option("es.net.ssl", "true")
    .option("es.net.ssl.cert.allow.self.signed", "true")
    .option("es.nodes", es_host)
    .option("es.port", es_port)
    .option("es.net.http.auth.user", es_username)
    .option("es.net.http.auth.pass", es_password)
)

events = es_connector.load(
    "events-*", 
)

excluded_group_ids = [0, 1, 2, 3]

events = (
    events
    .select("id", "group_id", "status")
    .filter(~events.group_id.isin(excluded_group_ids))
    .filter(events.status.isin(["verified", "sent"]))
)

Stack trace:

23/08/08 21:28:09 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 19)
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: query_shard_exception: failed to create query: For input string: "0 1 2 3"
{"query":{"bool":{"must":[{"match_all":{}}],"filter":[{"bool":{"must_not":{"bool":{"should":[{"match":{"group_id":"0 1 2 3"}}]}}}},{"bool":{"should":[{"match":{"status":"verified sent"}}]}}]}},"_source":["id","status","group_id"]}
	at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:487)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:444)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:438)
	at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:418)
	at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:318)
	at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:94)
	at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:66)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
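
For comparison, here is a minimal sketch in plain Python (no Spark needed) that rebuilds the filter clause from the logged query and sets it next to the shape I would have expected. The helper names (match_in_filter, terms_in_filter, negate) are hypothetical and not part of ES-Hadoop. That a terms query is the intended translation for a numeric field is my assumption, not something confirmed by the connector's code.

```python
import json


def match_in_filter(field, values):
    # Shape seen in the logged query: all values joined into ONE match string.
    # For a numeric field, Elasticsearch then tries to parse "0 1 2 3" as a number.
    joined = " ".join(str(v) for v in values)
    return {"bool": {"should": [{"match": {field: joined}}]}}


def terms_in_filter(field, values):
    # Assumed alternative: a terms query, which takes the values as a JSON array.
    return {"terms": {field: list(values)}}


def negate(clause):
    # Wrap a clause in bool/must_not, as the logged query does for ~isin().
    return {"bool": {"must_not": clause}}


excluded_group_ids = [0, 1, 2, 3]

logged = negate(match_in_filter("group_id", excluded_group_ids))
expected = negate(terms_in_filter("group_id", excluded_group_ids))

print(json.dumps(logged, separators=(",", ":")))
print(json.dumps(expected, separators=(",", ":")))
```

The first printed clause is identical to the must_not part of the query in the error above. The second keeps the four IDs as separate numeric values instead of one string.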

Version Info

OS : Ubuntu 22.04
JVM :
Hadoop/Spark: PySpark 3.3.1
ES-Hadoop : elasticsearch-spark-30_2.12-8.9.0.jar
ES : 8.9.0
