
Change RMM_ALLOC_FRACTION to represent percentage of available memory, rather than total memory, for initial allocation #2429

Merged: 5 commits, Jun 3, 2021
docs/configs.md (2 additions, 1 deletion)
@@ -31,11 +31,12 @@
Name | Description | Default Value
-----|-------------|--------------
<a name="alluxio.pathsToReplace"></a>spark.rapids.alluxio.pathsToReplace|List of paths to be replaced with the corresponding Alluxio scheme. E.g., when the config is set to "s3:/foo->alluxio://0.1.2.3:19998/foo,gcs:/bar->alluxio://0.1.2.3:19998/bar", s3:/foo/a.csv will be replaced with alluxio://0.1.2.3:19998/foo/a.csv and gcs:/bar/b.csv will be replaced with alluxio://0.1.2.3:19998/bar/b.csv|None
<a name="cloudSchemes"></a>spark.rapids.cloudSchemes|Comma separated list of additional URI schemes that are to be considered cloud based filesystems. Schemes already included: dbfs, s3, s3a, s3n, wasbs, gs. Cloud based stores are generally totally separate from the executors and likely have a higher I/O read cost. Many times cloud filesystems also get better throughput when you have multiple readers in parallel. This is used with spark.rapids.sql.format.parquet.reader.type|None
-<a name="memory.gpu.allocFraction"></a>spark.rapids.memory.gpu.allocFraction|The fraction of total GPU memory that should be initially allocated for pooled memory. Extra memory will be allocated as needed, but it may result in more fragmentation. This must be less than or equal to the maximum limit configured via spark.rapids.memory.gpu.maxAllocFraction.|0.9
+<a name="memory.gpu.allocFraction"></a>spark.rapids.memory.gpu.allocFraction|The fraction of available GPU memory that should be initially allocated for pooled memory. Extra memory will be allocated as needed, but it may result in more fragmentation. This must be less than or equal to the maximum limit configured via spark.rapids.memory.gpu.maxAllocFraction.|0.9
<a name="memory.gpu.debug"></a>spark.rapids.memory.gpu.debug|Provides a log of GPU memory allocations and frees. If set to STDOUT or STDERR the logging will go there. Setting it to NONE disables logging. All other values are reserved for possible future expansion and in the meantime will disable logging.|NONE
<a name="memory.gpu.direct.storage.spill.batchWriteBuffer.size"></a>spark.rapids.memory.gpu.direct.storage.spill.batchWriteBuffer.size|The size of the GPU memory buffer used to batch small buffers when spilling to GDS. Note that this buffer is mapped to the PCI Base Address Register (BAR) space, which may be very limited on some GPUs (e.g. the NVIDIA T4 only has 256 MiB), and it is also used by UCX bounce buffers.|8388608
<a name="memory.gpu.direct.storage.spill.enabled"></a>spark.rapids.memory.gpu.direct.storage.spill.enabled|Should GPUDirect Storage (GDS) be used to spill GPU memory buffers directly to disk. GDS must be enabled and the directory `spark.local.dir` must support GDS. This is an experimental feature. For more information on GDS, see https://docs.nvidia.com/gpudirect-storage/.|false
<a name="memory.gpu.maxAllocFraction"></a>spark.rapids.memory.gpu.maxAllocFraction|The fraction of total GPU memory that limits the maximum size of the RMM pool. The value must be greater than or equal to the setting for spark.rapids.memory.gpu.allocFraction. Note that this limit will be reduced by the reserve memory configured in spark.rapids.memory.gpu.reserve.|1.0
<a name="memory.gpu.minAllocFraction"></a>spark.rapids.memory.gpu.minAllocFraction|The fraction of total GPU memory that limits the minimum size of the RMM pool. The value must be less than or equal to the setting for spark.rapids.memory.gpu.allocFraction.|0.25
<a name="memory.gpu.oomDumpDir"></a>spark.rapids.memory.gpu.oomDumpDir|The path to a local directory where a heap dump will be created if the GPU encounters an unrecoverable out-of-memory (OOM) error. The filename will be of the form: "gpu-oom-<pid>.hprof" where <pid> is the process ID.|None
<a name="memory.gpu.pool"></a>spark.rapids.memory.gpu.pool|Select the RMM pooling allocator to use. Valid values are "DEFAULT", "ARENA", and "NONE". With "DEFAULT", `rmm::mr::pool_memory_resource` is used; with "ARENA", `rmm::mr::arena_memory_resource` is used. If set to "NONE", pooling is disabled and RMM just passes through to CUDA memory allocation directly. Note: "ARENA" is the recommended pool allocator if CUDF is built with Per-Thread Default Stream (PTDS), as "DEFAULT" is known to be unstable (https://github.com/NVIDIA/spark-rapids/issues/1141)|ARENA
<a name="memory.gpu.pooling.enabled"></a>spark.rapids.memory.gpu.pooling.enabled|Should RMM act as a pooling allocator for GPU memory, or should it just pass through to CUDA memory allocation directly. DEPRECATED: please use spark.rapids.memory.gpu.pool instead.|true
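To make the interaction between the three fraction settings above concrete, here is a small illustrative sketch in Python; the function and numbers are hypothetical, not plugin code:

```python
GiB = 1024**3

def pool_sizes(total, free,
               alloc_fraction=0.9,       # of *available* memory (after this PR)
               min_alloc_fraction=0.25,  # of total memory
               max_alloc_fraction=1.0):  # of total memory
    """Hypothetical sketch of how the fraction configs combine at startup."""
    initial = int(alloc_fraction * free)
    minimum = int(min_alloc_fraction * total)
    maximum = int(max_alloc_fraction * total)
    return initial, minimum, maximum

# Example: a 16 GiB GPU with 12 GiB currently free.
initial, minimum, maximum = pool_sizes(total=16 * GiB, free=12 * GiB)
# initial is about 10.8 GiB, and must land inside [minimum, maximum].
```

The initial pool must fall between the minimum and maximum bounds, which are still computed from total memory.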
docs/tuning-guide.md (3 additions, 2 deletions)
@@ -56,8 +56,9 @@
Default value: `0.9`

 Allocating memory on a GPU can be an expensive operation. RAPIDS uses a pooling allocator
 called [RMM](https://github.com/rapidsai/rmm) to mitigate this overhead. By default, on startup
-the plugin will allocate `90%` (`0.9`) of the memory on the GPU and keep it as a pool that can
-be allocated from. If the pool is exhausted more memory will be allocated and added to the pool.
+the plugin will allocate `90%` (`0.9`) of the _available_ memory on the GPU and keep it as a pool
+that can be allocated from. If the pool is exhausted more memory will be allocated and added to
+the pool.
 Most of the time this is a huge win, but if you need to share the GPU with other
 [libraries](additional-functionality/ml-integration.md) that are not aware of RMM this can lead
 to memory issues, and you may need to disable pooling.
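A quick way to see why sizing the pool from available rather than total memory matters on a shared GPU; this is an illustrative sketch, not plugin code:

```python
GiB = 1024**3

def initial_pool_bytes(total, free, alloc_fraction=0.9):
    """Compare the old and new sizing rules for the initial RMM pool."""
    old = int(alloc_fraction * total)  # previous behavior: fraction of total memory
    new = int(alloc_fraction * free)   # new behavior: fraction of available memory
    return old, new

# Hypothetical 16 GiB GPU with 4 GiB already held by another library:
old, new = initial_pool_bytes(total=16 * GiB, free=12 * GiB)
# old (about 14.4 GiB) exceeds the 12 GiB actually free; new (about 10.8 GiB) fits.
```

Under the old rule the allocation could exceed free memory and fail at startup; under the new rule the pool shrinks to fit what is actually available.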
integration_tests/run_pyspark_from_build.sh (1 addition)
@@ -133,6 +133,7 @@ else
then
export PYSP_TEST_spark_rapids_memory_gpu_allocFraction=$MEMORY_FRACTION
export PYSP_TEST_spark_rapids_memory_gpu_maxAllocFraction=$MEMORY_FRACTION
+export PYSP_TEST_spark_rapids_memory_gpu_minAllocFraction=0
python "${RUN_TESTS_COMMAND[@]}" "${TEST_PARALLEL_OPTS[@]}" "${TEST_COMMON_OPTS[@]}"
else
"$SPARK_HOME"/bin/spark-submit --jars "${ALL_JARS// /,}" \
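The added line matters when the test script carves one GPU into several small pools. A hedged sketch of the idea; variable names such as `TEST_PARALLEL` are assumptions based on the script, not verified:

```shell
#!/bin/sh
# Sketch: split the GPU among parallel test workers by shrinking each pool.
TEST_PARALLEL=4
# Leave headroom: each worker gets a bit less than 1/N of the GPU.
MEMORY_FRACTION=$(awk -v n="$TEST_PARALLEL" 'BEGIN { print 1 / (n + 1) }')
export PYSP_TEST_spark_rapids_memory_gpu_allocFraction=$MEMORY_FRACTION
export PYSP_TEST_spark_rapids_memory_gpu_maxAllocFraction=$MEMORY_FRACTION
# Disable the minimum so a small per-worker pool does not trip the new
# minAllocFraction check (0.25 of total memory by default).
export PYSP_TEST_spark_rapids_memory_gpu_minAllocFraction=0
echo "$MEMORY_FRACTION"
```

Without the last export, a per-worker pool of 0.2 of the GPU would fall below the default 0.25 minimum and fail at startup.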
@@ -168,16 +168,24 @@ object GpuDeviceManager extends Logging {
// Align workaround for https://github.com/rapidsai/rmm/issues/527
def truncateToAlignment(x: Long): Long = x & ~511L

-    var initialAllocation = truncateToAlignment((conf.rmmAllocFraction * info.total).toLong)
-    if (initialAllocation > info.free) {
-      logWarning(s"Initial RMM allocation (${toMB(initialAllocation)} MB) is " +
-        s"larger than free memory (${toMB(info.free)} MB)")
+    var initialAllocation = truncateToAlignment((conf.rmmAllocFraction * info.free).toLong)
**Collaborator:** So what if free is tiny? Do we want a minimum? Sometimes I can see this having caught bad setups, meaning you tried to start 2 executors on the same GPU when you shouldn't have, so now this will hide those kinds of failures. I guess you will probably start seeing failures, but I'm wondering how hard they will be to debug.

**Collaborator:** Oops, I now see Bobby made a similar comment about a minimum.

**Contributor (author):** I had missed Bobby's previous comment somehow. I am working on addressing this now. I also noticed that some of the existing error messages need a little rework as well.

+    val minAllocation = truncateToAlignment((conf.rmmAllocMinFraction * info.total).toLong)
+    if (initialAllocation < minAllocation) {
+      throw new IllegalArgumentException(s"The initial allocation of " +
+        s"${toMB(initialAllocation)} MB (calculated from ${RapidsConf.RMM_ALLOC_FRACTION} " +
+        s"(=${conf.rmmAllocFraction}) and ${toMB(info.free)} MB free memory) was less than " +
+        s"the minimum allocation of ${toMB(minAllocation)} (calculated from " +
+        s"${RapidsConf.RMM_ALLOC_MIN_FRACTION} (=${conf.rmmAllocMinFraction}) " +
+        s"and ${toMB(info.total)} MB total memory)")
+    }
     val maxAllocation = truncateToAlignment((conf.rmmAllocMaxFraction * info.total).toLong)
     if (maxAllocation < initialAllocation) {
-      throw new IllegalArgumentException(s"${RapidsConf.RMM_ALLOC_MAX_FRACTION} " +
-        s"configured as ${conf.rmmAllocMaxFraction} which is less than the " +
-        s"${RapidsConf.RMM_ALLOC_FRACTION} setting of ${conf.rmmAllocFraction}")
+      throw new IllegalArgumentException(s"The initial allocation of " +
+        s"${toMB(initialAllocation)} MB (calculated from ${RapidsConf.RMM_ALLOC_FRACTION} " +
+        s"(=${conf.rmmAllocFraction}) and ${toMB(info.free)} MB free memory) was more than " +
+        s"the maximum allocation of ${toMB(maxAllocation)} (calculated from " +
+        s"${RapidsConf.RMM_ALLOC_MAX_FRACTION} (=${conf.rmmAllocMaxFraction}) " +
+        s"and ${toMB(info.total)} MB total memory)")
     }
     val reserveAmount = conf.rmmAllocReserve
     if (reserveAmount >= maxAllocation) {
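To summarize the new startup checks, here is a Python sketch of the same arithmetic; it is an illustrative mirror of the Scala above, not the actual implementation:

```python
def truncate_to_alignment(x: int) -> int:
    # Align down to a 512-byte boundary, mirroring the RMM workaround above.
    return x & ~511

def compute_initial_pool(total: int, free: int,
                         alloc_frac: float = 0.9,
                         min_frac: float = 0.25,
                         max_frac: float = 1.0) -> int:
    """Initial size comes from *free* memory; min/max bounds from *total*."""
    initial = truncate_to_alignment(int(alloc_frac * free))
    minimum = truncate_to_alignment(int(min_frac * total))
    maximum = truncate_to_alignment(int(max_frac * total))
    if initial < minimum:
        raise ValueError("initial allocation below minimum: is most of the "
                         "GPU already in use by another process?")
    if initial > maximum:
        raise ValueError("initial allocation above the configured maximum")
    return initial
```

With a mostly free GPU this returns roughly 90% of the free memory; with a nearly full GPU it fails fast, addressing the reviewer's concern about silently creating a tiny pool.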
@@ -320,10 +320,11 @@ object RapidsConf {
.createOptional

   private val RMM_ALLOC_MAX_FRACTION_KEY = "spark.rapids.memory.gpu.maxAllocFraction"
+  private val RMM_ALLOC_MIN_FRACTION_KEY = "spark.rapids.memory.gpu.minAllocFraction"
   private val RMM_ALLOC_RESERVE_KEY = "spark.rapids.memory.gpu.reserve"

   val RMM_ALLOC_FRACTION = conf("spark.rapids.memory.gpu.allocFraction")
-    .doc("The fraction of total GPU memory that should be initially allocated " +
+    .doc("The fraction of available GPU memory that should be initially allocated " +
       "for pooled memory. Extra memory will be allocated as needed, but it may " +
       "result in more fragmentation. This must be less than or equal to the maximum limit " +
       s"configured via $RMM_ALLOC_MAX_FRACTION_KEY.")
@@ -340,6 +341,13 @@
     .checkValue(v => v >= 0 && v <= 1, "The fraction value must be in [0, 1].")
     .createWithDefault(1)

+  val RMM_ALLOC_MIN_FRACTION = conf(RMM_ALLOC_MIN_FRACTION_KEY)
+    .doc("The fraction of total GPU memory that limits the minimum size of the RMM pool. " +
+      s"The value must be less than or equal to the setting for $RMM_ALLOC_FRACTION.")
+    .doubleConf
+    .checkValue(v => v >= 0 && v <= 1, "The fraction value must be in [0, 1].")
+    .createWithDefault(0.25)
+
   val RMM_ALLOC_RESERVE = conf(RMM_ALLOC_RESERVE_KEY)
     .doc("The amount of GPU memory that should remain unallocated by RMM and left for " +
       "system use such as memory needed for kernels, kernel launches or JIT compilation.")
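The fraction settings are validated both individually (each must be in [0, 1]) and relative to one another. A hypothetical Python sketch of the documented constraints, not the plugin's validation code:

```python
def validate_fractions(alloc=0.9, min_frac=0.25, max_frac=1.0):
    """Illustrative check of the documented ordering constraint:
    minAllocFraction <= allocFraction <= maxAllocFraction, each in [0, 1]."""
    for name, v in [("allocFraction", alloc),
                    ("minAllocFraction", min_frac),
                    ("maxAllocFraction", max_frac)]:
        if not (0.0 <= v <= 1.0):
            raise ValueError(f"{name}: the fraction value must be in [0, 1].")
    if not (min_frac <= alloc <= max_frac):
        raise ValueError(
            "expected minAllocFraction <= allocFraction <= maxAllocFraction")
    return True
```

The defaults (0.25, 0.9, 1.0) satisfy the ordering; a configuration such as allocFraction=0.1 with the default minimum of 0.25 would be rejected.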
@@ -1354,6 +1362,8 @@ class RapidsConf(conf: Map[String, String]) extends Logging {

   lazy val rmmAllocMaxFraction: Double = get(RMM_ALLOC_MAX_FRACTION)

+  lazy val rmmAllocMinFraction: Double = get(RMM_ALLOC_MIN_FRACTION)
+
   lazy val rmmAllocReserve: Long = get(RMM_ALLOC_RESERVE)

   lazy val hostSpillStorageSize: Long = get(HOST_SPILL_STORAGE_SIZE)
@@ -28,11 +28,13 @@ class GpuDeviceManagerSuite extends FunSuite with Arm {
     TrampolineUtil.cleanupAnyExistingSession()
     val totalGpuSize = Cuda.memGetInfo().total
     val initPoolFraction = 0.1
+    val minPoolFraction = 0.01
     val maxPoolFraction = 0.2
     val conf = new SparkConf()
       .set(RapidsConf.POOLED_MEM.key, "true")
       .set(RapidsConf.RMM_POOL.key, "ARENA")
       .set(RapidsConf.RMM_ALLOC_FRACTION.key, initPoolFraction.toString)
+      .set(RapidsConf.RMM_ALLOC_MIN_FRACTION.key, minPoolFraction.toString)
       .set(RapidsConf.RMM_ALLOC_MAX_FRACTION.key, maxPoolFraction.toString)
       .set(RapidsConf.RMM_ALLOC_RESERVE.key, "0")
     try {