Performance optimization of JSON validation #16996

Open
wants to merge 2 commits into
base: branch-24.12

Conversation

karthikeyann
Contributor

@karthikeyann karthikeyann commented Oct 3, 2024

Description

As part of JSON validation, field, value, and string tokens are validated. Right now the code uses a single transform_inclusive_scan. Because the transform functor is a heavy operation, running it inside the scan slows down the entire scan drastically.
This PR splits the validation into a separate transform step followed by a scan. The validation runtime went from 200 ms to 20 ms.
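
To illustrate the shape of the change, here is a minimal host-side sketch using the C++17 standard library rather than the GPU code in this PR. The functor `is_invalid_token`, the integer token type, and the helper names `fused_scan` and `split_scan` are all hypothetical stand-ins; the real validation functor and token types are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the expensive per-token validation functor:
// returns 1 when a token is considered invalid, 0 otherwise.
struct is_invalid_token {
  int operator()(int token) const { return token < 0 ? 1 : 0; }
};

// Fused form: the heavy functor is invoked from inside the scan.
std::vector<int> fused_scan(std::vector<int> const& tokens)
{
  std::vector<int> out(tokens.size());
  std::transform_inclusive_scan(
    tokens.begin(), tokens.end(), out.begin(), std::plus<>{}, is_invalid_token{});
  return out;
}

// Split form: materialize the per-token flags first, then scan plain integers.
std::vector<int> split_scan(std::vector<int> const& tokens)
{
  std::vector<int> flags(tokens.size());
  std::transform(tokens.begin(), tokens.end(), flags.begin(), is_invalid_token{});

  std::vector<int> out(tokens.size());
  std::inclusive_scan(flags.begin(), flags.end(), out.begin());
  return out;
}
```

Both forms produce the same prefix sums. The split form costs one intermediate buffer, and in exchange the scan itself only touches cheap integer values.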

Also, a few hardcoded string comparisons are moved to a trie.
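
As a rough picture of what replacing hardcoded comparisons with a trie lookup can look like, here is a small host-side sketch. The class `literal_trie` and the literal set used with it are assumptions made for illustration; this is not the trie used by the PR.

```cpp
#include <array>
#include <cassert>
#include <initializer_list>
#include <memory>
#include <string_view>

// Minimal byte-indexed trie (hypothetical, for illustration only).
class literal_trie {
  struct node {
    std::array<std::unique_ptr<node>, 256> next{};
    bool terminal = false;  // true when a stored literal ends at this node
  };
  node root_;

 public:
  literal_trie(std::initializer_list<std::string_view> words)
  {
    for (auto w : words) { insert(w); }
  }

  void insert(std::string_view word)
  {
    node* cur = &root_;
    for (unsigned char c : word) {
      if (!cur->next[c]) { cur->next[c] = std::make_unique<node>(); }
      cur = cur->next[c].get();
    }
    cur->terminal = true;
  }

  // Returns true only when `s` matches a stored literal exactly.
  bool contains(std::string_view s) const
  {
    node const* cur = &root_;
    for (unsigned char c : s) {
      cur = cur->next[c].get();
      if (cur == nullptr) { return false; }
    }
    return cur->terminal;
  }
};
```

A chain such as `s == "true" || s == "false" || s == "null"` then becomes a single `contains(s)` call on a trie built once from those literals, and adding a literal no longer means adding another comparison.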

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added the 3 - Ready for Review, libcudf, cuIO, Performance, Spark, improvement, and non-breaking labels Oct 3, 2024
@karthikeyann karthikeyann requested a review from a team as a code owner October 3, 2024 22:55
@davidwendt
Contributor

This kind of change usually improves compile time as well.
I looked up process_tokens.cu.o: the compile time was about 3 minutes before and is now 39 seconds with this PR.
Nice work.

Labels
3 - Ready for Review (Ready for review by team), cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code.), non-breaking (Non-breaking change), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)