Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a resource adapter to align on a specified size #768

Merged
merged 13 commits into from
May 19, 2021

Conversation

rongou
Copy link
Contributor

@rongou rongou commented May 4, 2021

Fixes #762

@rongou rongou added feature request New feature or request 3 - Ready for review Ready for review by team non-breaking Non-breaking change cpp Pertains to C++ code labels May 4, 2021
@rongou rongou requested review from a team as code owners May 4, 2021 21:20
@rongou rongou self-assigned this May 4, 2021
@rongou rongou requested a review from harrism May 4, 2021 21:20
@github-actions github-actions bot added the CMake label May 4, 2021
@rongou rongou changed the title add a resource adapter to align on block size add a resource adapter to align on a specified size May 5, 2021
include/rmm/mr/device/aligned_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/aligned_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/aligned_resource_adaptor.hpp Outdated Show resolved Hide resolved
include/rmm/mr/device/aligned_resource_adaptor.hpp Outdated Show resolved Hide resolved
tests/mr/device/aligned_mr_tests.cpp Outdated Show resolved Hide resolved
@rongou rongou requested a review from jrhemstad May 14, 2021 21:14
Comment on lines +41 to +43
* Since a larger alignment size has some additional overhead, the user can specify a threshold
* size. If an allocation's size falls below the threshold, it is aligned to the default size. Only
* allocations with a size above the threshold are aligned to the custom alignment size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand this part of the change. Why would you not want to align smaller allocations? If GDS requires it, it requires it. If you don't want it, use a different adaptor for those allocations. I get that a small allocation is wasteful if the alignment is high, but I don't see why we would want to return an allocation that wasn't aligned properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roughly speaking there are two types of allocations. The first is the small allocations needed for, say, a boolean mask. These have a pretty short lifetime, and would never be written out to disk. The second type is the large buffers that hold columnar data. They are longer lived, can be used for shuffle, and may be spilled to host memory/disk or GDS when GPU memory is full. Only these large buffers that are read/written by GDS need to be 4k aligned.

If we count the frequencies of allocations by size, they tend to follow a Zipf/Pareto distribution with smaller allocations appearing more frequently. If we 4k align all of them, not only we'd be wasting GPU memory, it also adds overhead keeping track of the mapping. In a long running job that's already under memory pressure, it could add noticeable overhead.

To be clear, the smaller allocations are still aligned by upstream memory resources (both pool and arena mr align them to 256), they are just not aligned to the bigger size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the allocations that don't care about alignment use a different adaptor then? One adaptor with no alignment requirements and one with alignment? It seems very odd to me to have an aligned allocator not align certain allocations. If the allocator aligns data to a specific alignment, I would expect that all allocations are aligned if they go through that adaptor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory yes, but then you'd have the same branching logic in the client code. Right now for the cuDF JNI layer, we only have one device memory resource.

You can always set the threshold to 0 and then all allocations are aligned the same way. Having this here is really just a convenience for our use case.

BTW GDS does handle unaligned buffers, it'll just copy them to internal bounce buffers, so slightly less efficient.

/**
* @brief Construct an aligned resource adaptor using `upstream` to satisfy allocation requests.
*
* @throws `rmm::logic_error` if `upstream == nullptr`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @throws `rmm::logic_error` if `upstream == nullptr`
* @throws `rmm::logic_error` if `upstream == nullptr`
* @throws `rmm::logic_error` if `allocation_alignment is not a multiple of 256`
* @throws `rmm::logic_error` if `alignment_threshold is not a multiple of 256`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

alignment_threshold_{alignment_threshold}
{
RMM_EXPECTS(nullptr != upstream, "Unexpected null upstream resource pointer.");
RMM_EXPECTS(allocation_alignment % 256 == 0, "Allocation alignment is not a multiple of 256.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a requirement? The alignment supported could be arbitrary, but the use of detail::align_up forces this to be a power of 2. This code would pass for an alignment of 768 bytes, but fail in the call to detail::align_up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

*/
void* do_allocate(std::size_t bytes, cuda_stream_view stream) override
{
if (allocation_alignment_ == default_allocation_alignment || bytes < alignment_threshold_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems dangerous. What happens if CUDA changes the default alignment? This could very easily be forgotten and result in a bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an adapter, the upstream memory resource should properly align the allocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it will align it how it wants, which may not match the if statement here. In the admittedly unlikely event that CUDA started to return 16 byte aligned memory, this code would pass through to it if the user requested 256-byte alignment unless this was remembered and adjusted. At that point this would be incorrect.

This is certainly a pedantic complaint at best, but I have seen bugs like this in the past and I would like to avoid such things as they are difficult to track. If we can't find the default alignment from CUDA, I am ok with letting this go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We have 256 hard coded all over the place. I added a constant in aligned.hpp and replaced all the hard-coded values.

auto const address = reinterpret_cast<std::size_t>(pointer);
auto const aligned_address = rmm::detail::align_up(address, allocation_alignment_);
void* aligned_pointer = reinterpret_cast<void*>(aligned_address);
if (pointer != aligned_pointer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is a nice touch.

if (this == &other)
return true;
else {
auto aligned_other = dynamic_cast<aligned_resource_adaptor<Upstream> const*>(&other);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will do what you want. If the adaptors are not the exact same adaptor, you will check the upstreams for equality, but are not verifying this level at all. I would expect checks for the alignment values and thresholds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

*/
[[nodiscard]] std::pair<size_t, size_t> do_get_mem_info(cuda_stream_view stream) const override
{
return upstream_->get_mem_info(stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't 100% accurate and I think we should indicate that somewhere. If I have an alignment of 4 megs and attempt to allocate all memory that this returns as available the allocation will probably fail due to alignment requirements. I think adding a comment in the docs above the function is sufficient though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/
std::size_t upstream_allocation_size(std::size_t bytes) const
{
auto const aligned_size = detail::align_up(bytes, allocation_alignment_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we padding out the allocation to alignment size? Is the expectation that this adaptor is the only one active on the GPU and we want the next allocation will come back aligned? If so, we're even more wasteful though because of the alignment - default alignment padding being added on the next line. I'm not understanding the value of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For GDS to consider a buffer to be 4k aligned, both the base pointer and size need to be multiples of 4k.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that the incoming request for GDS is aligned to 4k then, not that the rmm adaptor makes this assumption for all incoming requests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocation requests could come from any operator, which may not be aware of GDS at all.

{
mock_resource mock;
auto construct_alignment = [](auto* r, std::size_t a) { aligned_adaptor mr{r, a}; };
EXPECT_THROW(construct_alignment(&mock, 255), rmm::logic_error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are mocking these things here. Why aren't we testing the actual resource?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are, the mock is the upstream resource, which we don't really care about.

cuda_stream_view stream;
void* pointer = reinterpret_cast<void*>(123);
// device_memory_resource aligns to 8.
EXPECT_CALL(mock, do_allocate(8, stream)).WillOnce(Return(pointer));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to circumvent the entire class we are testing. Am I reading this incorrectly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock is the upstream resource.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I requested before that the tests should check that the resulting alignment of the allocations is correct. But now they just seem to be checking if mocked calls return mocked pointers.

}

/**
* @brief Compare the upstream resource to another.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't just compare the upstream resource. It compares this resource to another.

Suggested change
* @brief Compare the upstream resource to another.
* @brief Compare this resource to another.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

@rongou rongou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harrism added a test to verify the pointer returned from a real memory resource.

Comment on lines +41 to +43
* Since a larger alignment size has some additional overhead, the user can specify a threshold
* size. If an allocation's size falls below the threshold, it is aligned to the default size. Only
* allocations with a size above the threshold are aligned to the custom alignment size.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory yes, but then you'd have the same branching logic in the client code. Right now for the cuDF JNI layer, we only have one device memory resource.

You can always set the threshold to 0 and then all allocations are aligned the same way. Having this here is really just a convenience for our use case.

BTW GDS does handle unaligned buffers, it'll just copy them to internal bounce buffers, so slightly less efficient.

*/
void* do_allocate(std::size_t bytes, cuda_stream_view stream) override
{
if (allocation_alignment_ == default_allocation_alignment || bytes < alignment_threshold_) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We have 256 hard coded all over the place. I added a constant in aligned.hpp and replaced all the hard-coded values.

}

/**
* @brief Compare the upstream resource to another.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/
std::size_t upstream_allocation_size(std::size_t bytes) const
{
auto const aligned_size = detail::align_up(bytes, allocation_alignment_);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocation requests could come from any operator, which may not be aware of GDS at all.

@rongou
Copy link
Contributor Author

rongou commented May 19, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2a5aa46 into rapidsai:branch-21.06 May 19, 2021
@rongou rongou deleted the block-align-adapter branch May 20, 2021 16:00
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request May 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for review Ready for review by team CMake cpp Pertains to C++ code feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Option to 4KiB align large buffers
4 participants