[FEA] Support array_repeat #5226

Closed
viadea opened this issue Apr 12, 2022 · 1 comment · Fixed by #5293
@viadea
Collaborator

viadea commented Apr 12, 2022

I wish we could support array_repeat on the GPU.

For example:

from pyspark.sql.functions import array_repeat
df = spark.createDataFrame([(["a", "b", "a"], ["b", "c"]), (["a", "a"], ["b", "c"]), (["aa"], ["b", "c"])], ['x', 'y'])
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
df = spark.read.parquet("/tmp/testparquet")
df.select(array_repeat(df.x, 3).alias("repeat")).collect()
    ! <ArrayRepeat> array_repeat(x#72, 3) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.ArrayRepeat
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 12, 2022
@revans2
Collaborator

revans2 commented Apr 13, 2022

This is something that we can probably do without any help from cudf. The simplest way I can think of is to create the data column using cudf::repeat and the offsets buffer using a scan SUM over the repeat counts. We will need to do some bounds checking to make sure that we don't overflow (Spark has similar limits, but ours are going to be a lot smaller). We might also need to fix up the nulls, as a null input count results in a null array.
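To make the approach above concrete, here is a minimal pure-Python model of it. It is only a sketch, not the plugin or cudf code: the function name `array_repeat_flat`, the `max_child_rows` parameter, and the use of the 32-bit signed maximum as the row limit are assumptions made for illustration. It also assumes that a negative count produces an empty array, which is how I understand Spark's CPU behavior.

```python
from itertools import accumulate

# Assumed limit for illustration: a signed 32-bit row count.
INT32_MAX = 2**31 - 1


def array_repeat_flat(elements, counts, max_child_rows=INT32_MAX):
    """Model array_repeat as a flattened list column.

    Returns (child, offsets, validity), where row i of the result is
    child[offsets[i]:offsets[i + 1]] and validity[i] is False for a null row.
    """
    # A null count produces a null array; it contributes no child rows.
    validity = [c is not None for c in counts]
    # Assumption: negative counts are treated as zero (empty array).
    lengths = [max(c, 0) if c is not None else 0 for c in counts]

    # Offsets are a running sum of the lengths with a leading zero.
    offsets = [0] + list(accumulate(lengths))

    # Bounds check before materializing the child data.
    if offsets[-1] > max_child_rows:
        raise OverflowError(
            f"array_repeat would produce {offsets[-1]} child rows, "
            f"which exceeds the limit of {max_child_rows}"
        )

    # Stand-in for the repeat step: element i appears lengths[i] times.
    child = [e for e, n in zip(elements, lengths) for _ in range(n)]
    return child, offsets, validity
```

For instance, `array_repeat_flat(["a", "b", "c", "d"], [2, 0, None, 3])` gives a child of five entries, offsets `[0, 2, 2, 2, 5]`, and a validity mask that marks only the third row as null.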

@sperlingxx sperlingxx self-assigned this Apr 15, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 19, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Apr 26, 2022
Add generateListOffsets API, converting list lengths to list offsets, which is useful in the development of spark-rapids.

For example, the support of [array_repeat](NVIDIA/spark-rapids#5226) and [arrays_zip](NVIDIA/spark-rapids#5229) relies on this API.
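As a rough illustration of what converting list lengths to list offsets means, the sketch below uses plain Python. The helper names `lengths_to_offsets` and `rows_from_offsets` are hypothetical and do not reflect the signature of the generateListOffsets API; the sample data is the `x` column from the example in this issue.

```python
from itertools import accumulate


def lengths_to_offsets(lengths):
    """Turn per-row list lengths into offsets (one more entry than rows)."""
    return [0] + list(accumulate(lengths))


def rows_from_offsets(child, offsets):
    """Slice a flat child buffer back into rows using consecutive offsets."""
    return [child[start:end] for start, end in zip(offsets, offsets[1:])]


# The x column from the issue, flattened into one child buffer.
child = ["a", "b", "a", "a", "a", "aa"]
offsets = lengths_to_offsets([3, 2, 1])
rows = rows_from_offsets(child, offsets)
```

Here `offsets` has one more entry than there are rows, and each adjacent pair of offsets delimits one row of the flattened child.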

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Liangcai Li (https://github.com/firestarman)

URL: #10683
@sameerz sameerz changed the title [FEA]Support array_repeat [FEA] Support array_repeat Jun 6, 2022