[FEA] Support spark.sql.mapKeyDedupPolicy=LAST_WIN for TransformKeys #5325

Closed
viadea opened this issue Apr 27, 2022 · 1 comment · Fixed by #5505
Labels: feature request

Comments

@viadea
Collaborator

viadea commented Apr 27, 2022

I would like us to support spark.sql.mapKeyDedupPolicy=LAST_WIN so that duplicate map keys are deduplicated, keeping the last value seen for each key.

For example:

from pyspark.sql.functions import expr, from_json
from pyspark.sql.types import MapType, StringType

# JSON document in which the key "Zipcode" appears twice.
jsonString = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR","Zipcode":999}"""
df = spark.createDataFrame([(1, jsonString)], ["id", "value"])
df.show(truncate=False)

# Parse the JSON string into a map<string,string> column.
df2 = df.withColumn("value", from_json(df.value, MapType(StringType(), StringType())))
df2.collect()

# transform_keys can produce duplicate keys; LAST_WIN should keep the last value for each.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
df2.withColumn("newcol", expr("""transform_keys(value, (k, v) -> split(k, "code")[0])""")).collect()

Not-supported message:

!Expression <TransformKeys> transform_keys(value#160, lambdafunction(split(lambda k#254, code, 2)[0], lambda k#254, lambda v#255, false)) cannot run on GPU because LAST_WIN is not supported for config setting spark.sql.mapKeyDedupPolicy
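
To make the requested behavior concrete, here is a minimal pure-Python sketch of what transform_keys would be expected to return under each dedup policy. It is not Spark or plugin code: the helper name transform_keys_sketch is hypothetical, the input assumes that both "Zipcode" entries survive from_json, and the output ordering simply follows Python dict insertion order.

```python
import re


def transform_keys_sketch(entries, key_fn, policy="EXCEPTION"):
    """Apply key_fn to each (key, value) pair, then resolve duplicate keys.

    Hypothetical helper that only models the two policy names from the issue.
    """
    if policy not in ("EXCEPTION", "LAST_WIN"):
        raise ValueError(f"unknown policy: {policy}")
    out = {}
    for key, value in entries:
        new_key = key_fn(key, value)
        if new_key in out and policy == "EXCEPTION":
            raise ValueError(f"duplicate map key {new_key!r}")
        # Under LAST_WIN a later value overwrites the earlier one.
        out[new_key] = value
    return out


# Assumption: the parsed map still holds both "Zipcode" entries.
entries = [
    ("Zipcode", "704"),
    ("ZipCodeType", "STANDARD"),
    ("City", "PARC PARQUE"),
    ("State", "PR"),
    ("Zipcode", "999"),
]


def first_part(key, value):
    # Mirrors split(k, "code")[0]; the pattern is case sensitive.
    return re.split("code", key)[0]


result = transform_keys_sketch(entries, first_part, policy="LAST_WIN")
print(result)
```

With LAST_WIN the two "Zipcode" entries collapse into a single "Zip" key holding "999"; with EXCEPTION the same input raises an error in this sketch.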
@viadea added the labels "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) on Apr 27, 2022
@andygrove changed the title from "[FEA] Support spark.sql.mapKeyDedupPolicy=LAST_WIN" to "[FEA] Support spark.sql.mapKeyDedupPolicy=LAST_WIN for TransformKeys" on Apr 28, 2022
@andygrove
Contributor

I updated the title to make it clear that this is asking for LAST_WIN support in transform_keys. We already support LAST_WIN for create_map, so we should have everything we need from cuDF.

For create_map we call dropListDuplicatesWithKeysValues to remove duplicates, and this already has the same semantics as LAST_WIN:

    // Apache Spark desires to keep the last duplicate element.
    auto [out_keys, out_vals] =
        cudf::lists::drop_list_duplicates(keys, vals, cudf::duplicate_keep_option::KEEP_LAST);
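
As a rough illustration of the keep-last semantics described above, the following pure-Python sketch deduplicates parallel lists of keys and values row by row. It is not the cuDF API: the function name is hypothetical, and the element order within each output row follows Python dict insertion order, which is not a claim about the order cuDF produces.

```python
def drop_duplicates_keep_last(keys, vals):
    """Per row, keep one entry per distinct key, taking the last value seen.

    keys and vals are parallel lists of rows (a hypothetical stand-in for
    two list columns).
    """
    out_keys, out_vals = [], []
    for row_keys, row_vals in zip(keys, vals):
        if len(row_keys) != len(row_vals):
            raise ValueError("each keys row must match its vals row in length")
        last_seen = {}
        for key, value in zip(row_keys, row_vals):
            # A repeated key overwrites the earlier value (keep last).
            last_seen[key] = value
        out_keys.append(list(last_seen.keys()))
        out_vals.append(list(last_seen.values()))
    return out_keys, out_vals


keys = [["a", "b", "a"], ["x"]]
vals = [[1, 2, 3], [9]]
dedup_keys, dedup_vals = drop_duplicates_keep_last(keys, vals)
print(dedup_keys, dedup_vals)
```

The point of the sketch is that the same keep-last rule applied to the output of transform_keys would give LAST_WIN behavior, which is why the existing create_map path looks reusable.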
