Add JSON option to prune columns #14996

Merged

@karthikeyann karthikeyann commented Feb 7, 2024

Description

Resolves #14951
This adds a prune_columns option to json_reader_options (default False).
When it is set to True, the dtypes option is used as a column filter rather than as a hint for type inference, so only the columns covered by dtypes are returned. If dtypes (a vector of dtypes, a map of dtypes, or a nested schema) is not specified, the output is an empty dataframe.
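
As a rough illustration of the semantics described above, here is a plain-Python model. It is not cudf code: the function name read_json_lines_model is hypothetical, dtypes is modelled as a simple name-to-callable map, and dropping requested columns that never appear in the input is an assumption of this sketch rather than documented behaviour.

```python
import json


def read_json_lines_model(text, dtypes=None, prune_columns=False):
    """Toy model of the option described above; returns {column: [values]}."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    dtypes = dtypes or {}

    # Collect every column name seen in the input, in first-seen order.
    seen = []
    for row in rows:
        for key in row:
            if key not in seen:
                seen.append(key)

    if prune_columns:
        # dtypes acts as a filter: keep only the listed columns.
        # Assumption of this sketch: listed columns missing from the data are dropped.
        names = [name for name in dtypes if name in seen]
    else:
        # dtypes is only a type hint: every column in the input is kept.
        names = seen

    table = {}
    for name in names:
        cast = dtypes.get(name, lambda value: value)
        table[name] = [cast(row[name]) if name in row else None for row in rows]
    return table


data = '{"a": 1, "b": "x"}\n{"a": 2, "b": "y", "c": 3.5}\n'
all_cols = read_json_lines_model(data, dtypes={"a": float})
only_a = read_json_lines_model(data, dtypes={"a": float}, prune_columns=True)
empty = read_json_lines_model(data, prune_columns=True)
```

With pruning off, all three columns come back and the dtype entry only affects how "a" is converted; with pruning on and no dtypes, the result is empty, matching the description.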

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added labels on Feb 7, 2024: feature request, 2 - In Progress, cuIO, Java, 4 - Needs cuDF (Java) Reviewer, non-breaking
@github-actions github-actions bot added the libcudf and Python labels and removed the Java label on Feb 7, 2024
@karthikeyann
Contributor Author

karthikeyann commented Feb 7, 2024

Profiled on a GV100 machine.
Reading JSON with 512 columns, 10k rows, without filter:
[image: profile screenshot]

Reading 1 column out of JSON with 512 columns, 10k rows (with filter on 1 column):
[image: profile screenshot]

Unnecessary parse_data() calls are eliminated.
It's possible to eliminate the initialize_json_columns() calls as well, but the runtime impact would be smaller (memory usage would drop, though), and that change depends on the map type PR #14936.
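
To make the saving concrete, here is a toy two-stage reader in plain Python. It is only a sketch of the idea, not libcudf internals: discover_columns and parse_column are hypothetical stand-ins for the column initialization and parse_data() stages, and the call counter exists purely for illustration.

```python
import json


def discover_columns(lines):
    """Stage 1 (stand-in for column initialization): gather raw values per key."""
    raw = {}
    for i, line in enumerate(lines):
        for key, value in json.loads(line).items():
            raw.setdefault(key, [None] * len(lines))[i] = value
    return raw


def parse_column(values, cast):
    """Stage 2 (stand-in for parse_data): convert one column's raw values."""
    return [None if v is None else cast(v) for v in values]


def read(lines, dtypes, prune_columns):
    raw = discover_columns(lines)  # still runs for every column
    if prune_columns:
        selected = [key for key in raw if key in dtypes]
    else:
        selected = list(raw)
    out = {}
    parse_calls = 0
    for key in selected:
        out[key] = parse_column(raw[key], dtypes.get(key, str))
        parse_calls += 1  # one parse call per column that is kept
    return out, parse_calls


# 4 rows, 512 string columns each
lines = [json.dumps({f"key_{c}": f"value{r}" for c in range(512)}) for r in range(4)]
_, calls_all = read(lines, {"key_10": str}, prune_columns=False)
pruned, calls_pruned = read(lines, {"key_10": str}, prune_columns=True)
```

In this model the per-column parse step runs 512 times without pruning and once with it, while the first stage still touches every column, which mirrors the remark that the initialization calls remain.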

Contributor

@revans2 revans2 left a comment

This looks great. I tried it out again and the time for pulling one item out of 512 went from 17 seconds to 9 seconds. I have not done traces on it yet to see what the next steps would be, but it is a lot better.

@GregoryKimball
Contributor

Thank you @karthikeyann, this is a great demonstration! When you mention:

Reading 1 columns out of JSON with 512 columns, 10k rows. (with filter 1 row)

What do you mean by "filter 1 row"?

@karthikeyann
Contributor Author

karthikeyann commented Feb 8, 2024

What do you mean by "filter 1 row"?

Sorry. I meant to type "filter 1 column".

keys.json content in each line:
{"key_109": "value0", "key_200": "value0", "key_342": "value0", ... } (500 keys out of 512 columns in each row)

import cudf
import nvtx

# Read all 512 columns.
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True)

# Read only 1 column: dtype lists the columns to keep.
# (use_dtypes_as_filter is the option's name at the time of this comment;
# the PR later renamed it to prune_columns.)
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True, dtype={"key_10": str}, use_dtypes_as_filter=True)
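
The keys.json file itself is not attached to this thread. A generator along the following lines could produce a file of the same shape; the random choice of which 500 keys appear in each row, the constant "value0" value, and the make_rows helper name are all assumptions of this sketch.

```python
import json
import os
import random
import tempfile


def make_rows(n_rows=10_000, n_cols=512, keys_per_row=500, seed=0):
    """Build JSON Lines rows, each holding keys_per_row of the n_cols key names."""
    rng = random.Random(seed)
    names = [f"key_{i}" for i in range(n_cols)]
    rows = []
    for _ in range(n_rows):
        # Sort the sampled indices so keys appear in ascending order, as in the excerpt.
        chosen = sorted(rng.sample(range(n_cols), keys_per_row))
        rows.append(json.dumps({names[i]: "value0" for i in chosen}))
    return rows


# A small file for a quick check; use the defaults for the 10k-row case.
rows = make_rows(n_rows=100)
path = os.path.join(tempfile.mkdtemp(), "keys.json")
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")
```

The resulting path can be passed to the read_json calls above in place of "keys.json".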

@vyasr vyasr added the 4 - Needs Review label and removed the 4 - Needs cuDF (Java) Reviewer label on Feb 23, 2024
@karthikeyann karthikeyann changed the base branch from branch-24.04 to branch-24.06 April 8, 2024 20:31
copy-pr-bot bot commented Apr 8, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@karthikeyann
Contributor Author

/ok to test

@karthikeyann
Contributor Author

/ok to test

@karthikeyann
Contributor Author

/ok to test

Contributor

@mythrocks mythrocks left a comment

Some minor nitpicks, but this LGTM.

It's a little late to suggest this now, but one wonders if "column pruning" might have been an acceptable replacement for "column filter", to avoid potential confusion.

I've learnt a couple of things from reviewing this PR, as per usual with @karthikeyann's PRs.

cpp/src/io/json/parser_features.cpp (outdated, resolved)
cpp/src/io/json/json_column.cu (outdated, resolved)
cpp/include/cudf/io/json.hpp (outdated, resolved)
cpp/src/io/json/json_column.cu (outdated, resolved)
cpp/src/io/json/json_column.cu (outdated, resolved)
cpp/tests/io/json_test.cpp (outdated, resolved)
Comment on lines 2427 to 2435
{std::map<std::string, cudf::io::schema_element> dtype_schema{
{"a", {dtype<int32_t>()}},
};
in_options.set_dtypes(dtype_schema);
cudf::io::table_with_metadata result = cudf::io::read_json(in_options);
// Make sure we have column "a"
ASSERT_EQ(result.tbl->num_columns(), 1);
ASSERT_EQ(result.metadata.schema_info.size(), 1);
EXPECT_EQ(result.metadata.schema_info[0].name, "a");
Contributor

Is the formatting here a little off?

Contributor Author

@karthikeyann karthikeyann Apr 30, 2024

// include only one column
{// schema
 {std::map<std::string, cudf::io::schema_element> dtype_schema{

This part of the code throws the formatting off.
Consecutive { characters (even with a comment between them) make clang-format treat them as consecutive uniform-initialization braces {{.
I removed the extra { }.

@karthikeyann
Contributor Author

/ok to test

@karthikeyann
Contributor Author

/ok to test

Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Filter to prune changes. This looks good to me.

cpp/src/io/json/json_column.cu (outdated, resolved)
cpp/src/io/json/json_column.cu (outdated, resolved)
cpp/src/io/json/json_column.cu (outdated, resolved)
cpp/src/io/json/json_column.cu (outdated, resolved)
cpp/tests/io/json_test.cpp (outdated, resolved)
python/cudf/cudf/_lib/cpp/io/json.pxd (outdated, resolved)
python/cudf/cudf/_lib/json.pyx (outdated, resolved)
python/cudf/cudf/_lib/json.pyx (outdated, resolved)
python/cudf/cudf/io/json.py (outdated, resolved)
python/cudf/cudf/io/json.py (outdated, resolved)
karthikeyann and others added 2 commits April 30, 2024 12:42
Co-authored-by: Mike Wilson <hyperbolic2346@users.noreply.github.com>
@karthikeyann
Contributor Author

/ok to test

@karthikeyann karthikeyann changed the title from "Add JSON option to use dtypes as Filter" to "Add JSON option to prune columns" on Apr 30, 2024
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Thank you for entertaining my nits. I always love a good stoptimization and this is a perfect example!

Contributor

@shrshi shrshi left a comment

Looks good to me!
One question - do we need to include this option in the java code as well?

@karthikeyann
Contributor Author

do we need to include this option in the java code as well?

Yes. @revans2 Should I include the java code changes as well in this PR?

@karthikeyann karthikeyann added the 3 - Ready for Review label and removed the 2 - In Progress label on May 1, 2024
@karthikeyann
Contributor Author

/merge

Contributor

@wence- wence- left a comment

Could we please have a docstring addition for the Python read_json? Additionally (perhaps I am being dumb), I couldn't understand the C++ docstring for the prune_columns option.

python/cudf/cudf/io/json.py (resolved)
cpp/include/cudf/io/json.hpp (outdated, resolved)
@karthikeyann karthikeyann requested a review from wence- May 2, 2024 17:57
@karthikeyann
Contributor Author

/ok to test

@rapids-bot rapids-bot bot merged commit 2fccbc0 into rapidsai:branch-24.06 May 2, 2024
69 checks passed
Labels
3 - Ready for Review, 4 - Needs Review, cuIO, feature request, libcudf, non-breaking, Python
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] have an option for the schema to filter the columns read from JSON
8 participants