Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a resource adapter to align on a specified size #768

Merged
merged 13 commits into from
May 19, 2021
Merged
219 changes: 219 additions & 0 deletions include/rmm/mr/device/aligned_resource_adaptor.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,219 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <mutex>
#include <optional>
#include <unordered_map>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/detail/aligned.hpp>
#include <rmm/detail/error.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

namespace rmm::mr {
/**
* @brief Resource that adapts `Upstream` memory resource to allocate memory in a specified
* alignment size.
*
* An instance of this resource can be constructed with an existing, upstream resource in order
* to satisfy allocation requests. This adaptor wraps allocations and deallocations from Upstream
* using the given alignment size.
*
* By default, any address returned by one of the memory allocation routines from the CUDA driver or
* runtime API is always aligned to at least 256 bytes. For some use cases, such as GPUDirect
* Storage (GDS), allocations need to be aligned to a larger size (4 KiB for GDS) in order to avoid
* additional copies to bounce buffers.
*
* Since a larger alignment size has some additional overhead, the user can specify a threshold
* size. If an allocation's size falls below the threshold, it is aligned to the default size. Only
* allocations with a size above the threshold are aligned to the custom alignment size.
Comment on lines +41 to +43
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand this part of the change. Why would you not want to align smaller allocations? If GDS requires it, it requires it. If you don't want it, use a different adaptor for those allocations. I get that a small allocation is wasteful if the alignment is high, but I don't see why we would want to return an allocation that wasn't aligned properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roughly speaking there are two types of allocations. The first is the small allocations needed for, say, a boolean mask. These have a pretty short lifetime, and would never be written out to disk. The second type is the large buffers that hold columnar data. They are longer lived, can be used for shuffle, and may be spilled to host memory/disk or GDS when GPU memory is full. Only these large buffers that are read/written by GDS need to be 4k aligned.

If we count the frequencies of allocations by size, they tend to follow a Zipf/Pareto distribution with smaller allocations appearing more frequently. If we 4k align all of them, not only we'd be wasting GPU memory, it also adds overhead keeping track of the mapping. In a long running job that's already under memory pressure, it could add noticeable overhead.

To be clear, the smaller allocations are still aligned by upstream memory resources (both pool and arena mr align them to 256), they are just not aligned to the bigger size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the allocations that don't care about alignment use a different adaptor then? One adaptor with no alignment requirements and one with alignment? It seems very odd to me to have an aligned allocator not align certain allocations. If the allocator aligns data to a specific alignment, I would expect that all allocations are aligned if they go through that adaptor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory yes, but then you'd have the same branching logic in the client code. Right now for the cuDF JNI layer, we only have one device memory resource.

You can always set the threshold to 0 and then all allocations are aligned the same way. Having this here is really just a convenience for our use case.

BTW GDS does handle unaligned buffers, it'll just copy them to internal bounce buffers, so slightly less efficient.

*
* @tparam Upstream Type of the upstream resource used for allocation/deallocation.
*/
template <typename Upstream>
class aligned_resource_adaptor final : public device_memory_resource {
public:
/**
* @brief Construct an aligned resource adaptor using `upstream` to satisfy allocation requests.
*
* @throws `rmm::logic_error` if `upstream == nullptr`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @throws `rmm::logic_error` if `upstream == nullptr`
* @throws `rmm::logic_error` if `upstream == nullptr`
* @throws `rmm::logic_error` if `allocation_alignment is not a multiple of 256`
* @throws `rmm::logic_error` if `alignment_threshold is not a multiple of 256`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* @param upstream The resource used for allocating/deallocating device memory.
* @param allocation_alignment The size used for allocation alignment.
* @param alignment_threshold Only allocations with a size larger than or equal to this threshold
* are aligned.
*/
explicit aligned_resource_adaptor(Upstream* upstream,
std::size_t allocation_alignment = default_allocation_alignment,
std::size_t alignment_threshold = default_alignment_threshold)
: upstream_{upstream},
allocation_alignment_{allocation_alignment},
alignment_threshold_{alignment_threshold}
{
RMM_EXPECTS(nullptr != upstream, "Unexpected null upstream resource pointer.");
RMM_EXPECTS(allocation_alignment % 256 == 0, "Allocation alignment is not a multiple of 256.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a requirement? The alignment supported could be arbitrary, but the use of detail::align_up forces this to be a power of 2. This code would pass for an alignment of 768 bytes, but fail in the call to detail::align_up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

RMM_EXPECTS(alignment_threshold % 256 == 0, "Alignment threshold is not a multiple of 256.");
}

aligned_resource_adaptor() = delete;
~aligned_resource_adaptor() override = default;
aligned_resource_adaptor(aligned_resource_adaptor const&) = delete;
aligned_resource_adaptor(aligned_resource_adaptor&&) = delete;
aligned_resource_adaptor& operator=(aligned_resource_adaptor const&) = delete;
aligned_resource_adaptor& operator=(aligned_resource_adaptor&&) = delete;

/**
* @brief Get the upstream memory resource.
*
* @return Upstream* pointer to a memory resource object.
*/
Upstream* get_upstream() const noexcept { return upstream_; }

/**
* @copydoc rmm::mr::device_memory_resource::supports_streams()
*/
[[nodiscard]] bool supports_streams() const noexcept override
{
return upstream_->supports_streams();
}

/**
* @brief Query whether the resource supports the get_mem_info API.
*
* @return bool true if the upstream resource supports get_mem_info, false otherwise.
*/
[[nodiscard]] bool supports_get_mem_info() const noexcept override
{
return upstream_->supports_get_mem_info();
}

private:
static constexpr std::size_t default_allocation_alignment = 256;
static constexpr std::size_t default_alignment_threshold = 0;
using lock_guard = std::lock_guard<std::mutex>;

/**
* @brief Allocates memory of size at least `bytes` using the upstream resource with the specified
* alignment.
*
* @throws `rmm::bad_alloc` if the requested allocation could not be fulfilled
* by the upstream resource.
*
* @param bytes The size, in bytes, of the allocation
* @param stream Stream on which to perform the allocation
* @return void* Pointer to the newly allocated memory
*/
void* do_allocate(std::size_t bytes, cuda_stream_view stream) override
{
if (allocation_alignment_ == default_allocation_alignment || bytes < alignment_threshold_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems dangerous. What happens if CUDA changes the default alignment? This could very easily be forgotten and result in a bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an adapter, the upstream memory resource should properly align the allocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it will align it how it wants, which may not match the if statement here. In the admittedly unlikely event that CUDA started to return 16 byte aligned memory, this code would pass through to it if the user requested 256-byte alignment unless this was remembered and adjusted. At that point this would be incorrect.

This is certainly a pedantic complaint at best, but I have seen bugs like this in the past and I would like to avoid such things as they are difficult to track. If we can't find the default alignment from CUDA, I am ok with letting this go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We have 256 hard coded all over the place. I added a constant in aligned.hpp and replaced all the hard-coded values.

return upstream_->allocate(bytes, stream);
} else {
auto const size = upstream_allocation_size(bytes);
void* pointer = upstream_->allocate(size, stream);
auto const address = reinterpret_cast<std::size_t>(pointer);
auto const aligned_address = rmm::detail::align_up(address, allocation_alignment_);
void* aligned_pointer = reinterpret_cast<void*>(aligned_address);
if (pointer != aligned_pointer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is a nice touch.

lock_guard lock(mtx_);
pointers_.emplace(aligned_pointer, pointer);
}
return aligned_pointer;
}
}

/**
* @brief Free allocation of size `bytes` pointed to to by `p` and log the deallocation.
*
* @throws Nothing.
*
* @param p Pointer to be deallocated
* @param bytes Size of the allocation
* @param stream Stream on which to perform the deallocation
*/
void do_deallocate(void* p, std::size_t bytes, cuda_stream_view stream) override
{
if (allocation_alignment_ == default_allocation_alignment || bytes < alignment_threshold_) {
upstream_->deallocate(p, bytes, stream);
} else {
{
lock_guard lock(mtx_);
auto const i = pointers_.find(p);
if (i != pointers_.end()) {
p = i->second;
pointers_.erase(i);
}
}
upstream_->deallocate(p, upstream_allocation_size(bytes), stream);
}
}

/**
* @brief Compare the upstream resource to another.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't just compare the upstream resource. It compares this resource to another.

Suggested change
* @brief Compare the upstream resource to another.
* @brief Compare this resource to another.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* @throws Nothing.
*
* @param other The other resource to compare to
* @return true If the two resources are equivalent
* @return false If the two resources are not equivalent
*/
[[nodiscard]] bool do_is_equal(device_memory_resource const& other) const noexcept override
{
if (this == &other)
return true;
else {
auto aligned_other = dynamic_cast<aligned_resource_adaptor<Upstream> const*>(&other);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will do what you want. If the adaptors are not the exact same adaptor, you will check the upstreams for equality, but are not verifying this level at all. I would expect checks for the alignment values and thresholds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if (aligned_other != nullptr)
return upstream_->is_equal(*aligned_other->get_upstream());
else
return upstream_->is_equal(other);
}
}

/**
* @brief Get free and available memory from upstream resource.
*
* @throws `rmm::cuda_error` if unable to retrieve memory info.
*
* @param stream Stream on which to get the mem info.
* @return std::pair containing free_size and total_size of memory
*/
[[nodiscard]] std::pair<size_t, size_t> do_get_mem_info(cuda_stream_view stream) const override
{
return upstream_->get_mem_info(stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't 100% accurate and I think we should indicate that somewhere. If I have an alignment of 4 megs and attempt to allocate all memory that this returns as available the allocation will probably fail due to alignment requirements. I think adding a comment in the docs above the function is sufficient though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

/**
* @brief Calculate the allocation size needed from upstream to account for alignments of both the
* size and the base pointer.
*
* @param bytes The requested allocation size.
* @return Allocation size needed from upstream to align both the size and the base pointer.
*/
std::size_t upstream_allocation_size(std::size_t bytes) const
{
auto const aligned_size = detail::align_up(bytes, allocation_alignment_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we padding out the allocation to alignment size? Is the expectation that this adaptor is the only one active on the GPU and we want the next allocation will come back aligned? If so, we're even more wasteful though because of the alignment - default alignment padding being added on the next line. I'm not understanding the value of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For GDS to consider a buffer to be 4k aligned, both the base pointer and size need to be multiples of 4k.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that the incoming request for GDS is aligned to 4k then, not that the rmm adaptor makes this assumption for all incoming requests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocation requests could come from any operator, which may not be aware of GDS at all.

return aligned_size + allocation_alignment_ - default_allocation_alignment;
}

Upstream* upstream_; ///< The upstream resource used for satisfying allocation requests
std::unordered_map<void*, void*> pointers_; ///< Map of aligned pointers to upstream pointers.
std::size_t allocation_alignment_; ///< The size used for allocation alignment
std::size_t alignment_threshold_; ///< The size above which allocations should be aligned
mutable std::mutex mtx_; ///< Mutex for exclusive lock.
};

} // namespace rmm::mr
6 changes: 6 additions & 0 deletions tests/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -116,6 +116,12 @@ set(TRACKING_TEST_SRC "${CMAKE_CURRENT_SOURCE_DIR}/mr/device/tracking_mr_tests.c

ConfigureTest(TRACKING_TEST "${TRACKING_TEST_SRC}")

# aligned adaptor tests

set(ALIGNED_TEST_SRC "${CMAKE_CURRENT_SOURCE_DIR}/mr/device/aligned_mr_tests.cpp")

ConfigureTest(ALIGNED_TEST "${ALIGNED_TEST_SRC}")

# limiting adaptor tests

set(LIMITING_TEST_SRC "${CMAKE_CURRENT_SOURCE_DIR}/mr/device/limiting_mr_tests.cpp")
Expand Down
Loading