[FEA][JSON reader] to support parsing with single quotes #10004

Closed
Tracked by #9458
wbo4958 opened this issue Jan 10, 2022 · 5 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@wbo4958
Contributor

wbo4958 commented Jan 10, 2022

This is part of the feature request NVIDIA/spark-rapids#9.
We have a JSON file that uses single quotes:

{'name': 'Reynold Xin'}

cuDF can't parse this file and fails with:

ai.rapids.cudf.CudfException: cuDF failure at: /home/bobwang/work.d/nvspark/cudf/cpp/src/io/json/reader_impl.cu:609: Error determining column names.

Spark can parse it by default.

We expect a configuration option, allowSingleQuotes, to control this behavior.
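To make the gap concrete, here is a minimal sketch using only Python's standard-library `json` module (not cuDF or Spark). It shows that a strict JSON parser rejects the single-quoted record above while accepting the double-quoted equivalent. The helper name `parses_as_strict_json` is made up for this illustration.

```python
import json

# The record from this issue, and its strictly valid JSON equivalent.
single_quoted = "{'name': 'Reynold Xin'}"
double_quoted = '{"name": "Reynold Xin"}'


def parses_as_strict_json(text: str) -> bool:
    """Return True if `text` is accepted by a strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Strict JSON only allows double quotes around field names and strings.
strict_single = parses_as_strict_json(single_quoted)
strict_double = parses_as_strict_json(double_quoted)
```

Spark's JSON reader accepts the single-quoted form because its `allowSingleQuotes` option is enabled by default, which is the behavior requested here.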

@wbo4958 wbo4958 added feature request New feature or request Needs Triage Need team to review and classify labels Jan 10, 2022
@github-actions

github-actions bot commented Feb 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz sameerz added the Spark Functionality that helps Spark RAPIDS label Mar 23, 2022
@revans2
Contributor

revans2 commented May 16, 2022

Spark allows single quotes by default, so this blocks us from enabling JSON parsing in Spark by default.

@vuule vuule added the cuIO cuIO issue label Jun 8, 2022
@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 24, 2022
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Nov 19, 2022
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Dec 1, 2022
@GregoryKimball
Contributor

GregoryKimball commented Dec 1, 2022

After some discussion, it seems we could support this by adding new states for single-quoted field names and single-quoted string values. We would have to increase the size of the state machine, which would bring a performance penalty. I'll leave this in the backlog for now.

Update: our approach here will be to introduce a quote-normalizing preprocessing step based on a new finite state transducer. Also see #13525
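As a rough illustration of what a quote-normalizing preprocessing step could look like, here is a small CPU-side sketch in Python. It is not the libcudf implementation, which runs on the GPU; the names `State` and `normalize_single_quotes` and the exact escape handling are assumptions made for this example. The idea is a five-state transducer that rewrites single-quoted strings as double-quoted ones, so the downstream parser only ever sees standard JSON quoting.

```python
import json
from enum import Enum, auto


class State(Enum):
    OUTSIDE = auto()        # not inside any string
    IN_DOUBLE = auto()      # inside a "..." string
    IN_DOUBLE_ESC = auto()  # just saw a backslash inside "..."
    IN_SINGLE = auto()      # inside a '...' string
    IN_SINGLE_ESC = auto()  # just saw a backslash inside '...'


def normalize_single_quotes(text: str) -> str:
    """Rewrite single-quoted strings as double-quoted strings (sketch)."""
    out = []
    state = State.OUTSIDE
    for ch in text:
        if state is State.OUTSIDE:
            if ch == '"':
                state = State.IN_DOUBLE
                out.append(ch)
            elif ch == "'":
                state = State.IN_SINGLE
                out.append('"')  # opening ' becomes "
            else:
                out.append(ch)
        elif state is State.IN_DOUBLE:
            # Double-quoted strings pass through unchanged.
            if ch == "\\":
                state = State.IN_DOUBLE_ESC
            elif ch == '"':
                state = State.OUTSIDE
            out.append(ch)
        elif state is State.IN_DOUBLE_ESC:
            state = State.IN_DOUBLE
            out.append(ch)
        elif state is State.IN_SINGLE:
            if ch == "\\":
                state = State.IN_SINGLE_ESC  # hold the backslash for now
            elif ch == "'":
                state = State.OUTSIDE
                out.append('"')  # closing ' becomes "
            elif ch == '"':
                out.append('\\"')  # a bare " must now be escaped
            else:
                out.append(ch)
        else:  # State.IN_SINGLE_ESC
            state = State.IN_SINGLE
            # \' is not a JSON escape, so drop the backslash; keep other escapes.
            out.append("'" if ch == "'" else "\\" + ch)
    return "".join(out)


# Usage: normalize first, then hand the text to a strict JSON parser.
record = json.loads(normalize_single_quotes("{'name': 'Reynold Xin'}"))
```

Keeping normalization as a separate pass leaves the main parser's state machine unchanged, which is the trade-off discussed above: the cost moves into an extra pass over the input instead of a larger tokenizer state table.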

@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@shrshi shrshi mentioned this issue Dec 2, 2023
3 tasks
rapids-bot bot pushed a commit that referenced this issue Jan 3, 2024
The goal of this PR is to address [PR 10004](#10004) by supporting parsing of JSON files containing single quotes for field/value strings.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Elias Stehle (https://github.com/elstehle)

URL: #14545
rapids-bot bot pushed a commit that referenced this issue Jan 24, 2024
The goal of this PR is to address [10004](#10004) by supporting parsing of JSON files containing single quotes for field/value strings. This is a follow-up work to the POC [PR 14545](#14545)

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Andy Grove (https://github.com/andygrove)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #14729
PointKernel pushed a commit to PointKernel/cudf that referenced this issue Jan 25, 2024
@revans2
Contributor

revans2 commented Feb 26, 2024

I think this is done now. @GregoryKimball @andygrove do you both agree?

@GregoryKimball
Contributor

Closed by #14729. Thank you @shrshi!
