[New artifacts] Pre-built (alpha version) pip-installable wheels for linux will be made available as part of this release. The pip wheels are compatible with Python 3.8 through 3.12. The source builds will continue to work as expected.

[Documentation] We are updating our contribution policy and will be accepting small PRs targeting improvements to cudnn-frontend. For the full contribution guide, refer to our contribution policy.

[API updates] [Python] The graph.execute function in Python now takes an optional handle. This lets the user provide a custom handle to the execute function (and achieves parity with the C++ API).
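A minimal sketch of the call shape (assuming `graph` has already been built and finalized, and that `variant_pack` and `workspace` are the usual user-allocated buffers; the names here are illustrative):
```
import cudnn

handle = cudnn.create_handle()  # user-managed handle instead of the internal default

# ... build and finalize `graph`; allocate `variant_pack` and `workspace` ...

# The handle argument is optional; omitting it keeps the previous behavior.
graph.execute(variant_pack, workspace, handle=handle)
```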

[API updates] Pointwise ops can now take scalars directly as arguments. This simplifies graph creation in general. For example:
```
auto C = graph.pointwise(A,
        graph.tensor(5.0f),
        fe::graph::Pointwise_attributes()
        .set_mode(fe::PointwiseMode_t::ADD)
        .set_compute_data_type(fe::DataType_t::FLOAT));
```

[Installation] Addresses RFE #64 to support installation via `cmake install`, which copies the headers to `CMAKE_INSTALL_INCLUDEDIR` (see the CMakeLists.txt change below).

[Installation] Addresses RFE #63 to allow a custom installation of catch2. If catch2 is not found, cudnn-frontend fetches it automatically from the upstream GitHub repository.

[Logging] Improved logging to print legible tensor names. We will continue to streamline the logging in future releases.

[Samples] Added a sample showcasing auto-tuning to select the best plan among the ones returned by heuristics.

[Samples] As part of the v1.2 release, we have created new Jupyter notebooks showcasing Python API usage. At this point, these will work on A100 and H100 cards only, as mentioned in the notebooks. In future releases, we plan to simplify the installation process and elaborate on the API usage. Please refer to the `samples/python` directory.

[Bug fixes] Fixed an auto-tuning issue where plan 0 was always executed, even though a different plan was chosen as the best candidate.

[Unit Tests] We are adding unit tests (enabled by default via the `CUDNN_FRONTEND_BUILD_UNIT_TESTS` CMake option) that give developers a way to test parts of their code before submitting pull requests. Adding unit tests and samples with a pull request is highly encouraged.
Anerudhan committed Mar 11, 2024 · 1 parent 150798f · commit bedd113
Showing 100 changed files with 4,935 additions and 2,876 deletions.
25 changes: 21 additions & 4 deletions CMakeLists.txt
@@ -1,20 +1,26 @@
cmake_minimum_required(VERSION 3.17)

-project(cudnn_frontend VERSION 1.1.2)
+project(cudnn_frontend VERSION 1.2.0)

option(CUDNN_FRONTEND_SKIP_NLOHMANN_JSON "Defines whether FE should not include nlohmann/json.hpp." OFF)
option(CUDNN_FRONTEND_BUILD_SAMPLES "Defines if samples are built or not." ON)
-option(CUDNN_FRONTEND_BUILD_UNIT_TESTS "Defines if unittests are built or not." OFF)
+option(CUDNN_FRONTEND_BUILD_UNIT_TESTS "Defines if unittests are built or not." ON)

if(MSVC OR MSYS OR MINGW)
    option(CUDNN_FRONTEND_BUILD_PYTHON_BINDINGS "Defines if python bindings are built or not." OFF)
    add_compile_options(/W4 /WX)
else()
-    option(CUDNN_FRONTEND_BUILD_PYTHON_BINDINGS "Defines if python bindings are built or not." ON)
+    option(CUDNN_FRONTEND_BUILD_PYTHON_BINDINGS "Defines if python bindings are built or not." OFF)
    add_compile_options(-Wall -Wextra -Wpedantic -Werror -Wno-error=attributes -Wno-attributes -Wno-error=unused-function -Wno-unused-function)
endif()

add_library(cudnn_frontend INTERFACE)

+target_compile_definitions(
+    cudnn_frontend INTERFACE
+    $<$<BOOL:${CUDNN_FRONTEND_SKIP_NLOHMANN_JSON}>:CUDNN_FRONTEND_SKIP_NLOHMANN_JSON>
+)
+
target_include_directories(
    cudnn_frontend INTERFACE
    $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
@@ -50,5 +56,16 @@ if (CUDNN_FRONTEND_BUILD_UNIT_TESTS)
endif()

if (CUDNN_FRONTEND_BUILD_PYTHON_BINDINGS)
-    add_subdirectory(python_bindings)
+    add_subdirectory(python)
endif()
+
+# Introduce variables:
+# * CMAKE_INSTALL_LIBDIR
+# * CMAKE_INSTALL_BINDIR
+# * CMAKE_INSTALL_INCLUDEDIR
+include(GNUInstallDirs)
+
+install(
+    DIRECTORY ${PROJECT_SOURCE_DIR}/include/
+    DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}
+)
55 changes: 55 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,55 @@
# Contributing to cudnn-frontend

If you are interested in contributing to cudnn-frontend, your contributions will fall
into three categories:
1. You want to report a bug, feature request, or documentation issue
- File an [issue](https://github.com/NVIDIA/cudnn-frontend/issues)
describing what you encountered or what you want to see changed.
- The cudnn team will evaluate the issues and triage them, scheduling
them for a release. If you believe the issue needs priority attention,
comment on the issue to notify the team.
2. You want to propose a new feature and implement it
- Post about your intended feature, and we shall discuss the design and
implementation.
- Once we agree that the plan looks good, go ahead and implement it, using
the [code contributions](#code-contributions) guide below.
3. You want to implement a feature or bug-fix for an outstanding issue
- Follow the [code contributions](#code-contributions) guide below.
- If you need more context on a particular issue, please ask and we shall
provide.

## Code contributions

### Your first issue

1. Read the project's [README.md](https://github.com/NVIDIA/cudnn-frontend/blob/main/README.md)
to learn how to set up the development environment.
2. Comment on the issue saying you are going to work on it and what changes you are going to make.
3. Code! Make sure to update unit tests!
4. When done, [create your pull request](https://github.com/NVIDIA/cudnn-frontend/compare).
5. Wait for other developers to review your code and update it as needed.
6. Once reviewed and approved, a cudnn-frontend developer will merge your pull request.
7. At this time, we are accepting only small fixes and changes. Once merged to main, your change will be part of an untagged version; a release tag will be assigned by the cudnn team along with a future frontend release.

Remember, if you are unsure about anything, don't hesitate to comment on issues and ask for clarifications!

## Code Formatting

Consistent code formatting is important in the cudnn-frontend project: it ensures
readability and maintainability, and thus simplifies collaboration.

### Branches and Versions

The cudnn-frontend repository has one main branch. Please submit PRs to this branch. We will update this document as the policy changes.

### Branch naming

Branches used to create PRs should have a name of the form `<name>-issue-<issue_number>`
which conforms to the following conventions:

- Name:
    - A name that conveys what is being worked on, e.g. `fix-autotune-plan-selection-issue-42` (hypothetical issue number)
    - Please use dashes or underscores between words rather than spaces.

## Attribution
Portions of this contribution guide were adapted from [https://github.com/rapidsai/cuml/blob/branch-24.04/CONTRIBUTING.md](https://github.com/rapidsai/cuml/blob/branch-24.04/CONTRIBUTING.md).
13 changes: 2 additions & 11 deletions README.FE.1.0.md
@@ -152,17 +152,8 @@ cudnn_frontend::graph::Graph& cudnn_frontend::graph::Plans::deselect_workspace_g
### Autotune

Autotuning provides a way to execute different execution plans for a given graph and measure their relative performance under run time conditions.
-This generally helps validate and improve upon the results provided by the heuristics.
+This generally helps validate and improve upon the results provided by the heuristics. Please refer to the [samples](samples/cpp/autotuning.cpp).

-The current API to perform the autotuning on the filtered plans:
-```
-cudnn_frontend::error_t
-cudnn_frontend::graph::Graph::autotune(cudnnHandle_t handle,
-                                       std::unordered_map<std::shared_ptr<Tensor_attributes>, void *> variants,
-                                       void *workspace,
-                                       void *user_impl = nullptr);
-```
### Execute
Executing the graph requires device pointers to all input and output tensors and a user-allocated device workspace pointer.

@@ -220,4 +211,4 @@ Python samples are jupyter notebooks with step by step guide on using FE v1 API.

## Operations

-Please look at docs/operations for APIs of different operation types.
+Please look at [docs/operations](docs/operations/) for APIs of different operation types.
16 changes: 13 additions & 3 deletions README.md
@@ -38,6 +38,8 @@ To run the python samples, additionally, you will need the following python pack


### Python API
+
+#### Source installation:
Install FE python API by running:
```
pip install git+https://github.com/NVIDIA/cudnn-frontend.git
@@ -48,7 +50,15 @@ Above command picks cuda and cudnn from default system paths.
To provide a custom CUDA installation path, use environment variable: `CUDAToolkit_ROOT`.
To provide a custom CUDNN installation path, use environment variable: `CUDNN_PATH`.

+#### pip wheel installation
+
+Download the pip wheel corresponding to your python installation.
+
+```
+pip install nvidia_cudnn_frontend-1.2.0-*.whl
+```
+
+#### Checking the installation
To test whether installation is successful, run:
```
pytest tests/python_fe
@@ -107,11 +117,11 @@ Calling `cudnn_frontend::getStream() = stream_name` can be used to assign the ou
For further debugging, please turn on the cudnn backend logs described here https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#api-logging

## Documentation
-- See README.FE.1.0.md for v1.0 API documentation.
-- See README.FE.0.x.md for v0.x API documentation.
+- See [README.FE.1.0.md](README.FE.1.0.md) for v1.0 API documentation.
+- See [README.FE.0.x.md](README.FE.0.x.md) for v0.x API documentation.

## Contributing:
-No external contribution to this repository is accepted. Please create an issue in github for bugs or feature requests.
+Please refer to our [contribution guide](CONTRIBUTING.md)

## Feedback
Support, resources, and information about cuDNN can be found online at https://developer.nvidia.com/cudnn.
81 changes: 73 additions & 8 deletions docs/operations/Attention.md
@@ -1,7 +1,11 @@
## Table of Contents
1. [Scaled Dot Product Attention](#scaled-dot-product-attention)
2. [Scaled Dot Product Attention Backward](#scaled-dot-product-attention-backward)
-3. [Miscellaneous](#miscellaneous)
+3. Appendices
+    - [Tensor Layouts](#appendix-a)
+    - [Workspace limits and Performance](#appendix-b)
+    - [RNG dump](#appendix-c)
+4. [Miscellaneous](#miscellaneous)

### Scaled Dot Product Attention

@@ -250,13 +254,65 @@
dV (cudnn_tensor): The value gradient tensor of scaled dot-product attention.
```

-### Miscellaneous
-- FE provides shadow enums which spare users from working around different enums for different cudnn versions.
-- The cudnn backend enums are changed as follows:
-    - `cudnnBackend<enum_name>` -> `cudnn_frontend::<enum_name>`
-    - `cudnn<enum_name>` -> `cudnn_frontend::<enum_name>`
-- To dump the dropout mask generated by the Philox RNG dropout implementation for debugging purposes, users can use the `rng_dump` option. This option requires users to pass a tensor of dimensions $(B, H_{q}, S_{q}, S_{kv})$
-- Scaled Dot Product Attention Backward improves performance by using an optional dP workspace tensor. This tensor's memory consumption increases quadratically with the sequence length. The following describes the behavior of the `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT` environment variable, which allows the user to change the GPU memory limit for this workspace tensor:

#### Appendix A
Tensor layouts:
Q, K, V, O and their corresponding gradients support strided layouts. The cuDNN API expresses the layout of tensors based on strides.

For example, let Q have dimensions = [5, 7, 4, 3] and strides = [84, 12, 3, 1].
An element at index [i, j, k, l] can be accessed at position Q_ptr + i * 84 + j * 12 + k * 3 + l * 1.

Notice how the strides are multiplied with the indices to get the positions of all elements.
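As a quick illustration, here is a minimal Python sketch of this offset arithmetic, using the example dims and strides above:
```
dims = [5, 7, 4, 3]
strides = [84, 12, 3, 1]

def element_offset(index, strides):
    # position of the element relative to the base pointer Q_ptr
    return sum(i * s for i, s in zip(index, strides))

print(element_offset([1, 2, 3, 0], strides))  # 84 + 24 + 9 + 0 = 117
```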
Below we will go through the standard usage of the attention tensors and how they can be expressed in cuDNN.

1. Q, K, V are different matrices with strided layout
This is the basic case where the user can specify dims and strides for each of Q, K and V, and it works as in the example given above.
The only limitation is that the stride corresponding to the hidden dimension per head (d, the last dim in Q) needs to be 1.

2. Q, K, V are interleaved
This is a special case of (1) and can be described in a strided layout as well.
For example, Q, K and V can be a single matrix of dims (batch (b), number_of_heads (h), sequence_length (s), 3, hidden_dim_per_head (d)).
The strides of Q can then be defined as [h * s * 3 * d, s * 3 * d, 3 * d, 1].
Notice how the 3 is multiplied into the strides corresponding to b, h and s because of the interleaving.
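A minimal sketch of these interleaved strides, with illustrative sizes (the variable names here are assumptions, not API names):
```
b, h, s, d = 2, 4, 8, 64  # illustrative sizes

# Q, K, V packed as a single (b, h, s, 3, d) buffer; each is viewed as (b, h, s, d).
q_dims = [b, h, s, d]
q_strides = [h * s * 3 * d, s * 3 * d, 3 * d, 1]

# K and V share the same dims and strides; their base pointers are offset by
# d and 2 * d elements from the start of the packed buffer, respectively.
```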

3. There are some special cases where not all tokens are valid, and Q, K, V can be in special layouts.
Let the Q tensor have two sequences (i.e. batch = 2, number_of_heads = 1) with max_seq_len = 8 and actual_seq_len = [2, 3].
Consider the two sequences "aa" and "bbb".
- Fully padded layout

aa000000
bbb00000
Dims = [b=2, h=1, s=8, d=64]
Strides = [512, 512, 64, 1]

cuDNN is told the actual sequence lengths via seq_len_q and seq_len_kv and cuts the computation off at these values. Please also enable use_padding_mask for this case. cuDNN reads the data based on the strides.

- Fully packed layout
aabbb000
00000000
Dims = [b=2, h=1, s=8, d=64]
Strides = [512, 512, 64, 1]

The strides remain the same, but they are now incorrect: the second batch actually begins at element offset 2 * 64 = 128. Therefore, we have an API called "ragged_offset", which is a tensor of size b+1 telling where each batch begins; the (b+1)-th element is where the last batch ends.
Users can set <tensor>.set_ragged_offset(<ragged_offset_tensor>).
For this example, ragged_offset = [0, 128, 320] (see the sketch after this list).
Actual sequence lengths still have to be provided, along with the padding mask.

- Valid tokens in a batch are packed together
aa00bbb0
00000000

The user just needs to update the ragged offset to [0, 256, 448].

- Valid tokens are not packed together
a0abbb00
bb000000

A ragged offset is insufficient to represent this. This case is NOT supported.
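As a minimal sketch (assuming the fully packed layout above with d = 64; the helper below is illustrative, not part of the API), the ragged offsets are just running sums of the per-batch token counts times the hidden dimension:
```
d = 64
actual_seq_len = [2, 3]  # "aa" and "bbb"

def ragged_offsets(seq_lens, d):
    # b+1 entries: where each batch begins, plus where the last batch ends
    offsets = [0]
    for tokens in seq_lens:
        offsets.append(offsets[-1] + tokens * d)
    return offsets

print(ragged_offsets(actual_seq_len, d))  # [0, 128, 320]
```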

#### Appendix B
Workspace limit:
Scaled Dot Product Attention Backward improves performance by using an optional dP workspace tensor. This tensor's memory consumption increases quadratically with the sequence length. The following describes the behavior of the `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT` environment variable, which allows the user to change the GPU memory limit for this workspace tensor:
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = unset`
The optimization will utilize workspace memory until reaching the default limit of 256MB.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = -1`
@@ -265,3 +321,12 @@
Workspace optimization is always disabled, avoiding the additional memory usage.
- `CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT = n`
Allows workspace optimization up to a user-defined limit of n bytes, accommodating systems with varying GPU memory capacities.
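For instance, one way to set this limit from Python before building and executing the graph (a sketch; exactly when the variable is read is an assumption here):
```
import os

# Cap the dP workspace tensor at 1 GiB for this process.
os.environ["CUDNN_FRONTEND_ATTN_DP_WORKSPACE_LIMIT"] = str(1 * 1024 * 1024 * 1024)
```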

#### Appendix C
To dump the dropout mask generated by the Philox RNG dropout implementation for debugging purposes, users can use the `rng_dump` option. This option requires users to pass a tensor of dimensions $(B, H_{q}, S_{q}, S_{kv})$.

### Miscellaneous
- FE provides shadow enums which spare users from working around different enums for different cudnn versions.
- The cudnn backend enums are changed as follows:
- `cudnnBackend<enum_name>` -> `cudnn_frontend::<enum_name>`
- `cudnn<enum_name>` -> `cudnn_frontend::<enum_name>`
7 changes: 5 additions & 2 deletions include/cudnn_backend_base.h
@@ -45,12 +45,14 @@ class OpaqueBackendPointer {
* OpaqueBackendPointer constructor.
* Calls the cudnnBackendCreateDescriptor. Allocates memory according to the type.
*/
-    OpaqueBackendPointer(cudnnBackendDescriptorType_t type) { status = cudnnBackendCreateDescriptor(type, &m_desc); }
+    OpaqueBackendPointer(cudnnBackendDescriptorType_t type) {
+        status = cudnn_frontend::create_descriptor(type, &m_desc);
+    }
/**
* OpaqueBackendPointer destructor.
* Calls the cudnnBackendDestroyDescriptor. Frees memory allocated in the constructor.
*/
-    ~OpaqueBackendPointer() { cudnnBackendDestroyDescriptor(m_desc); };
+    ~OpaqueBackendPointer() { cudnn_frontend::destroy_descriptor(m_desc); };
/**
* Accessor.
* Returns the const reference to raw underlying descriptor.
@@ -160,4 +162,5 @@ class BackendDescriptor {
mutable cudnnStatus_t status = CUDNN_STATUS_SUCCESS; //!< Error code if any being set
mutable std::string err_msg; //!< Error message if any being set
};

} // namespace cudnn_frontend
4 changes: 2 additions & 2 deletions include/cudnn_frontend.h
@@ -124,8 +124,8 @@
#include "cudnn_frontend/utils/serialize.h"

#define CUDNN_FRONTEND_MAJOR_VERSION 1
-#define CUDNN_FRONTEND_MINOR_VERSION 1
-#define CUDNN_FRONTEND_PATCH_VERSION 2
+#define CUDNN_FRONTEND_MINOR_VERSION 2
+#define CUDNN_FRONTEND_PATCH_VERSION 0
#define CUDNN_FRONTEND_VERSION \
((CUDNN_FRONTEND_MAJOR_VERSION * 10000) + (CUDNN_FRONTEND_MINOR_VERSION * 100) + CUDNN_FRONTEND_PATCH_VERSION)

31 changes: 13 additions & 18 deletions include/cudnn_frontend/cudnn_interface.h
@@ -42,38 +42,33 @@ class ICudnn {
    // TODO: Always returns OK. Can the status and error message be accessed from tensor descriptor?
    error_t
    create_cudnn_tensor(std::shared_ptr<graph::Tensor_attributes> const& props,
-                        uid_t& uid,
-                        std::unordered_map<uid_t, std::shared_ptr<cudnn_frontend::Tensor>>& tensors,
-                        std::unordered_set<uid_t> const& invalid_uids) const {
-        // Check whether tensor already created
-        // Make sure no other tensor somehow already has claimed uid.
+                        std::unordered_map<uid_t, std::shared_ptr<cudnn_frontend::Tensor>>& tensors) const {
+        // TODO: uid check has to be moved to validate stage.
+        RETURN_CUDNN_FRONTEND_ERROR_IF(props->has_uid() == false,
+                                       error_code_t::ATTRIBUTE_NOT_SET,
+                                       "Tensor named '" + props->get_name() + "' has no uid assigned.");

-        auto tensor_uid = props->has_uid() ? props->get_uid() : uid;
+        // Check whether tensor already created
+        auto tensor_uid = props->get_uid();
        if (tensors.find(tensor_uid) != tensors.end()) {
-            getLogger() << "[cudnn_frontend] INFO: Shared Tensor" << uid << " already created." << std::endl;
+            getLogger() << "[cudnn_frontend] INFO: Backend Tensor named '" << props->get_name() << "' with UID "
+                        << tensor_uid << " already created." << std::endl;
            return {error_code_t::OK, ""};
        }

-        if (props->has_uid() == false) {
-            props->set_uid(uid);
-            do {
-                uid++;
-            } while (invalid_uids.find(uid) != invalid_uids.end());
-        }
-
        auto&& tensor_builder = cudnn_frontend::TensorBuilder();

        tensor_builder.setDim(props->get_dim().size(), props->get_dim().data())
            .setStrides(props->get_stride().size(), props->get_stride().data())
-            .setId(props->get_uid())
+            .setId(tensor_uid)
            .setAlignment(16)
            .setDataType(props->get_data_type())
            .setVirtual(props->get_is_virtual())
            .setByValue(props->get_is_pass_by_value())
            .setReorderType(props->get_reordering_type());

        if (auto ragged_offset_props = props->get_ragged_offset()) {
-            CHECK_CUDNN_FRONTEND_ERROR(create_cudnn_tensor(ragged_offset_props, uid, tensors, invalid_uids));
+            CHECK_CUDNN_FRONTEND_ERROR(create_cudnn_tensor(ragged_offset_props, tensors));
            tensor_builder.setRaggedOffset(tensors.at(ragged_offset_props->get_uid()));
        }

@@ -83,13 +78,13 @@ class ICudnn {
        auto tensor = tensor_builder.build();
        RETURN_CUDNN_FRONTEND_ERROR_IF(
            tensor.get_status() != CUDNN_STATUS_SUCCESS, error_code_t::CUDNN_BACKEND_API_FAILED, tensor.get_error());
-        tensors.emplace(props->get_uid(), std::make_shared<Tensor>(std::move(tensor)));
+        tensors.emplace(tensor_uid, std::make_shared<Tensor>(std::move(tensor)));
#else
        // build() can throw
        // wrap in try catch
        try {
            auto tensor = tensor_builder.build();
-            tensors.emplace(props->get_uid(), std::make_shared<Tensor>(std::move(tensor)));
+            tensors.emplace(tensor_uid, std::make_shared<Tensor>(std::move(tensor)));
        } catch (cudnn_frontend::cudnnException& e) {
            RETURN_CUDNN_FRONTEND_ERROR_IF(
                e.getCudnnStatus() != CUDNN_STATUS_SUCCESS, error_code_t::CUDNN_BACKEND_API_FAILED, e.what());