Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle the mapped and registered regions in memory_mapped_source #16865

Merged
merged 27 commits into from
Oct 4, 2024

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Sep 20, 2024

Description

Depends on #16826

Set of fixes that improve robustness on the non-GDS file input:

  1. Avoid registering beyond the byte range - addresses problems when reading adjacent byte ranges from multiple threads (GH only).
  2. Allow reading data outside of the memory mapped region. This prevents issues with very long rows in CSV or JSON input.
  3. Copy host data when the range being read is only partially registered. This avoids errors when trying to copy the host data range to the device (GH only).

Modifies the datasource class hierarchy to avoid reuse of direct file host_reads

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 20, 2024
@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Sep 20, 2024
@vuule vuule self-assigned this Sep 20, 2024
@vuule vuule changed the title Bug-fix-mmaped-source Properly handle the mapped and registered regions in memory_mapped_source Sep 20, 2024
Copy link

codecov bot commented Sep 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (branch-24.12@289e466). Learn more about missing BASE report.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-24.12   #16865   +/-   ##
===============================================
  Coverage                ?   82.02%           
===============================================
  Files                   ?      183           
  Lines                   ?    29193           
  Branches                ?        0           
===============================================
  Hits                    ?    23945           
  Misses                  ?     5248           
  Partials                ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +99 to +100
size_t max_size_estimate = 0,
size_t min_size_estimate = 0);
Copy link
Contributor Author

@vuule vuule Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the max includes the padding for the last row when reading a byte range. We're now passing the byte range size without padding as well, for the operations that should not include the padding, such as the buffer registration (because registering overlapping ranges fails, and this can happen when we read a CSV or a JSON file in chunks).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make that use case more obvious in the docstrings? I wouldn't know the difference between padded/non-padded and registration restrictions from this docstring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to throw if max < min? If so that @throws would go into the docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docs here and in memory_mapped_source constructor. I think they add up to a full explanation, let me know what you think :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks.

@@ -54,6 +55,30 @@ class file_source : public datasource {
}
}

std::unique_ptr<buffer> host_read(size_t offset, size_t size) override
Copy link
Contributor Author

@vuule vuule Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the implementation from the derived class because the memory mapped source now performs direct reads when the location is outside of the mapped range.

@vuule vuule marked this pull request as ready for review September 30, 2024 16:53
@vuule vuule requested a review from a team as a code owner September 30, 2024 16:53
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation is fine, but the public API / docstrings could be better.

Comment on lines +99 to +100
size_t max_size_estimate = 0,
size_t min_size_estimate = 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make that use case more obvious in the docstrings? I wouldn't know the difference between padded/non-padded and registration restrictions from this docstring.

Comment on lines +99 to +100
size_t max_size_estimate = 0,
size_t min_size_estimate = 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to throw if max < min? If so that @throws would go into the docstring.

@vuule vuule requested a review from bdice October 2, 2024 20:15
Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vuule

@vuule
Copy link
Contributor Author

vuule commented Oct 4, 2024

/merge

@rapids-bot rapids-bot bot merged commit 39342b8 into branch-24.12 Oct 4, 2024
100 checks passed
@vuule vuule deleted the bug-fix-mmaped-source branch October 4, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants