-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a resource adapter to align on a specified size #768
add a resource adapter to align on a specified size #768
Conversation
* Since a larger alignment size has some additional overhead, the user can specify a threshold | ||
* size. If an allocation's size falls below the threshold, it is aligned to the default size. Only | ||
* allocations with a size above the threshold are aligned to the custom alignment size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand this part of the change. Why would you not want to align smaller allocations? If GDS requires it, it requires it. If you don't want it, use a different adaptor for those allocations. I get that a small allocation is wasteful if the alignment is high, but I don't see why we would want to return an allocation that wasn't aligned properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roughly speaking there are two types of allocations. The first is the small allocations needed for, say, a boolean mask. These have a pretty short lifetime, and would never be written out to disk. The second type is the large buffers that hold columnar data. They are longer lived, can be used for shuffle, and may be spilled to host memory/disk or GDS when GPU memory is full. Only these large buffers that are read/written by GDS need to be 4k aligned.
If we count the frequencies of allocations by size, they tend to follow a Zipf/Pareto distribution with smaller allocations appearing more frequently. If we 4k align all of them, not only we'd be wasting GPU memory, it also adds overhead keeping track of the mapping. In a long running job that's already under memory pressure, it could add noticeable overhead.
To be clear, the smaller allocations are still aligned by upstream memory resources (both pool and arena mr align them to 256), they are just not aligned to the bigger size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the allocations that don't care about alignment use a different adaptor then? One adaptor with no alignment requirements and one with alignment? It seems very odd to me to have an aligned allocator not align certain allocations. If the allocator aligns data to a specific alignment, I would expect that all allocations are aligned if they go through that adaptor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory yes, but then you'd have the same branching logic in the client code. Right now for the cuDF JNI layer, we only have one device memory resource.
You can always set the threshold to 0 and then all allocations are aligned the same way. Having this here is really just a convenience for our use case.
BTW GDS does handle unaligned buffers, it'll just copy them to internal bounce buffers, so slightly less efficient.
/** | ||
* @brief Construct an aligned resource adaptor using `upstream` to satisfy allocation requests. | ||
* | ||
* @throws `rmm::logic_error` if `upstream == nullptr` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @throws `rmm::logic_error` if `upstream == nullptr` | |
* @throws `rmm::logic_error` if `upstream == nullptr` | |
* @throws `rmm::logic_error` if `allocation_alignment is not a multiple of 256` | |
* @throws `rmm::logic_error` if `alignment_threshold is not a multiple of 256` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
alignment_threshold_{alignment_threshold} | ||
{ | ||
RMM_EXPECTS(nullptr != upstream, "Unexpected null upstream resource pointer."); | ||
RMM_EXPECTS(allocation_alignment % 256 == 0, "Allocation alignment is not a multiple of 256."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a requirement? The alignment supported could be arbitrary, but the use of detail::align_up
forces this to be a power of 2. This code would pass for an alignment of 768 bytes, but fail in the call to detail::align_up
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
*/ | ||
void* do_allocate(std::size_t bytes, cuda_stream_view stream) override | ||
{ | ||
if (allocation_alignment_ == default_allocation_alignment || bytes < alignment_threshold_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems dangerous. What happens if CUDA changes the default alignment? This could very easily be forgotten and result in a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an adapter, the upstream memory resource should properly align the allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it will align it how it wants, which may not match the if statement here. In the admittedly unlikely event that CUDA started to return 16 byte aligned memory, this code would pass through to it if the user requested 256-byte alignment unless this was remembered and adjusted. At that point this would be incorrect.
This is certainly a pedantic complaint at best, but I have seen bugs like this in the past and I would like to avoid such things as they are difficult to track. If we can't find the default alignment from CUDA, I am ok with letting this go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We have 256 hard coded all over the place. I added a constant in aligned.hpp
and replaced all the hard-coded values.
auto const address = reinterpret_cast<std::size_t>(pointer); | ||
auto const aligned_address = rmm::detail::align_up(address, allocation_alignment_); | ||
void* aligned_pointer = reinterpret_cast<void*>(aligned_address); | ||
if (pointer != aligned_pointer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is a nice touch.
if (this == &other) | ||
return true; | ||
else { | ||
auto aligned_other = dynamic_cast<aligned_resource_adaptor<Upstream> const*>(&other); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will do what you want. If the adaptors are not the exact same adaptor, you will check the upstreams for equality, but are not verifying this level at all. I would expect checks for the alignment values and thresholds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
*/ | ||
[[nodiscard]] std::pair<size_t, size_t> do_get_mem_info(cuda_stream_view stream) const override | ||
{ | ||
return upstream_->get_mem_info(stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't 100% accurate and I think we should indicate that somewhere. If I have an alignment of 4 megs and attempt to allocate all memory that this returns as available the allocation will probably fail due to alignment requirements. I think adding a comment in the docs above the function is sufficient though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
std::size_t upstream_allocation_size(std::size_t bytes) const | ||
{ | ||
auto const aligned_size = detail::align_up(bytes, allocation_alignment_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we padding out the allocation to alignment size? Is the expectation that this adaptor is the only one active on the GPU and we want the next allocation will come back aligned? If so, we're even more wasteful though because of the alignment - default alignment padding being added on the next line. I'm not understanding the value of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For GDS to consider a buffer to be 4k aligned, both the base pointer and size need to be multiples of 4k.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that the incoming request for GDS is aligned to 4k then, not that the rmm adaptor makes this assumption for all incoming requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allocation requests could come from any operator, which may not be aware of GDS at all.
{ | ||
mock_resource mock; | ||
auto construct_alignment = [](auto* r, std::size_t a) { aligned_adaptor mr{r, a}; }; | ||
EXPECT_THROW(construct_alignment(&mock, 255), rmm::logic_error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we are mocking these things here. Why aren't we testing the actual resource?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are, the mock is the upstream resource, which we don't really care about.
cuda_stream_view stream; | ||
void* pointer = reinterpret_cast<void*>(123); | ||
// device_memory_resource aligns to 8. | ||
EXPECT_CALL(mock, do_allocate(8, stream)).WillOnce(Return(pointer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to circumvent the entire class we are testing. Am I reading this incorrectly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mock is the upstream resource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I requested before that the tests should check that the resulting alignment of the allocations is correct. But now they just seem to be checking if mocked calls return mocked pointers.
} | ||
|
||
/** | ||
* @brief Compare the upstream resource to another. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't just compare the upstream resource. It compares this resource to another.
* @brief Compare the upstream resource to another. | |
* @brief Compare this resource to another. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harrism added a test to verify the pointer returned from a real memory resource.
* Since a larger alignment size has some additional overhead, the user can specify a threshold | ||
* size. If an allocation's size falls below the threshold, it is aligned to the default size. Only | ||
* allocations with a size above the threshold are aligned to the custom alignment size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory yes, but then you'd have the same branching logic in the client code. Right now for the cuDF JNI layer, we only have one device memory resource.
You can always set the threshold to 0 and then all allocations are aligned the same way. Having this here is really just a convenience for our use case.
BTW GDS does handle unaligned buffers, it'll just copy them to internal bounce buffers, so slightly less efficient.
*/ | ||
void* do_allocate(std::size_t bytes, cuda_stream_view stream) override | ||
{ | ||
if (allocation_alignment_ == default_allocation_alignment || bytes < alignment_threshold_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We have 256 hard coded all over the place. I added a constant in aligned.hpp
and replaced all the hard-coded values.
} | ||
|
||
/** | ||
* @brief Compare the upstream resource to another. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
std::size_t upstream_allocation_size(std::size_t bytes) const | ||
{ | ||
auto const aligned_size = detail::align_up(bytes, allocation_alignment_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allocation requests could come from any operator, which may not be aware of GDS at all.
@gpucibot merge |
Depends on rapidsai/rmm#768. Authors: - Rong Ou (https://github.com/rongou) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #8266
Fixes #762