Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce benchmark suite for JSON reader options #15124

Merged
merged 22 commits into from
Apr 9, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Feb 23, 2024

Description

The goal of this piece of work is to analyze the performance of the reader for JSON lines. This PR establishes a baseline for the performance of single quote normalization, white space normalization, mixed type as string parsing and recovery mode options when the input JSON is valid, and does not have any single quotes.
Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs.
Addresses #15041

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Feb 23, 2024
@shrshi shrshi added non-breaking Non-breaking change 2 - In Progress Currently a work in progress feature request New feature or request and removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Feb 23, 2024
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Feb 23, 2024
auto const view = tbl->view();
cudf::io::json_writer_options const write_opts =
cudf::io::json_writer_options::builder(source_sink.make_sink_info(), view)
.lines(true)
Copy link
Contributor

@GregoryKimball GregoryKimball Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are testing the options, I would recommend making lines into an nvbench enum axis as well. Although since some of the options require lines to be true, maybe benchmarking lines true/false should be a separate benchmark.

nvbench::enum_type_list<row_selection::ALL, row_selection::BYTE_RANGE>, lines=True
nvbench::enum_type_list<normalize_single_quotes::NO, normalize_single_quotes::YES> line=True/False
nvbench::enum_type_list<mixed_types_as_string::NO, mixed_types_as_string::YES> line=True/False
nvbench::enum_type_list<recovery_mode::RECOVER_WITH_NULL, recovery_mode::FAIL>)) lines=True

After thinking through the options, I don't think we need to test normalize_single_quotes and mixed_types_as_string with lines=false. It still might be useful to add a lines true/false benchmark without any additional options. If others agree then that could be a follow-on PR.

@GregoryKimball
Copy link
Contributor

Would you please post some early benchmark results? (similar to this example)

@shrshi
Copy link
Contributor Author

shrshi commented Feb 29, 2024

Benchmarks were run on A100 80GB GPU, with all combinations of normalize_single_quotes, row_selection, recovery_mode, mixed_type_as_string options being enabled/disabled.
no_mixed_type
The figure above shows a slight drop in performance while byte range reading is enabled. Note that all the runs above do not enable mixed_type_as_string option.
mixed_type
Figure showing significant performance degradation when mixed_type_as_string is enabled.
Performance tracking issue

@GregoryKimball GregoryKimball changed the title [WIP] JSON lines benchmark - studying reader options [WIP] Introduce benchmark suite for JSON reader options Mar 7, 2024
@shrshi shrshi marked this pull request as ready for review March 8, 2024 01:42
@shrshi shrshi changed the title [WIP] Introduce benchmark suite for JSON reader options Introduce benchmark suite for JSON reader options Mar 8, 2024
rapids-bot bot pushed a commit that referenced this pull request Mar 8, 2024
…n is enabled (#15236)

Addresses #15196 by applying a patch from @karthikeyann to skip the `infer_column_type_kernel` by forcing the mixed types column to be a string. 
With this optimization, we see a significant improvement in performance. Please refer to the [comment](#15236 (comment)) for a visualization of the results before and after applying this optimization as obtained from the [JSON lines benchmarking exercise](#15124).

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15236
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving CMake changes

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff!
Some minor comments/questions.

cpp/benchmarks/io/json/json_reader_option.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/io/json/json_reader_option.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/io/json/json_reader_option.cpp Show resolved Hide resolved
cpp/benchmarks/io/json/json_reader_option.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/io/json/json_reader_option.cpp Show resolved Hide resolved
@shrshi
Copy link
Contributor Author

shrshi commented Mar 11, 2024

After updates to the mixed types as string handling and the byte range reader, here are the most recent benchmarks. Note that in view of the data generation bug, the results below may vary after that large last row issue is fixed.
rsALL_rmFAIL
row_selection=ALL, recovery_mode=FAIL

rsALL_rmNULL
row_selection=ALL, recovery_mode=RECOVER_WITH_NULL

rsBYTE_rmFAIL
row_selection=BYTE_RANGE, recovery_mode=FAIL

rsBYTE_rmNULL
row_selection=BYTE_RANGE, recovery_mode=RECOVER_WITH_NULL

@GregoryKimball
Copy link
Contributor

These plots are great! Please check out #15185 for an example of where the byte range support is not working optimally.

@shrshi shrshi requested a review from vuule March 19, 2024 18:26
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One non-blocking nit.

Very clean code, nice work!

cpp/benchmarks/io/json/json_reader_option.cpp Outdated Show resolved Hide resolved
Co-authored-by: Yunsong Wang <yunsongw@nvidia.com>
@shrshi
Copy link
Contributor Author

shrshi commented Mar 19, 2024

/ok to test

Comment on lines +81 to +85
template <row_selection RowSelection,
normalize_single_quotes NormalizeSingleQuotes,
normalize_whitespace NormalizeWhitespace,
mixed_types_as_string MixedTypesAsString,
recovery_mode RecoveryMode>
Copy link
Contributor

@ttnghia ttnghia Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no this is too many template params. Why not using run time parameter instead? That would reduce compile time a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the design for benchmarking reader options in orc and parquet. Would we have to modify those benchmarks as well to maintain similar design?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no this is too many template params. Why not using run time parameter instead?

This makes results more readable and the build time for benchmarks doesn't really matter IMO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fine to me 👍

@shrshi shrshi changed the base branch from branch-24.04 to branch-24.06 April 8, 2024 19:34
@shrshi
Copy link
Contributor Author

shrshi commented Apr 9, 2024

/merge

@rapids-bot rapids-bot bot merged commit 1862cdc into rapidsai:branch-24.06 Apr 9, 2024
70 checks passed
jjacobelli pushed a commit to jjacobelli/cudf that referenced this pull request Apr 9, 2024
The goal of this piece of work is to analyze the performance of the reader for JSON lines. This PR establishes a baseline for the performance of single quote normalization, white space normalization, mixed type as string parsing and recovery mode options when the input JSON is valid, and does not have any single quotes. 
Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs.
Addresses rapidsai#15041

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#15124
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants