Expanded performance guide (#868)

holoviz · Jan 27, 2020 · 569520d · 569520d
1 parent 138b86c
commit 569520d
Show file tree

Hide file tree

Showing 2 changed files with 34 additions and 18 deletions.
diff --git a/examples/FAQ.ipynb b/examples/FAQ.ipynb
@@ -89,11 +89,7 @@
    "source": [
     "### What data libraries can I use with Datashader?\n",
     "\n",
-    "Datashader accepts various types of DataFrame: \n",
-    "- [Pandas](https://pandas.pydata.org): for every glyph type (points, lines, areas, trimesh, raster)\n",
-    "- [Dask](https://dask.org): for most glyph types (points, lines, areas), using distributed, multi-core, and/or out of core computation\n",
-    "- [cuDF](https://github.com/rapidsai/cudf): for points, lines, or areas, with computation on single NVIDIA GPUs\n",
-    "- [Dask-cuDF](https://rapidsai.github.io/projects/cudf/en/0.10.0/10min.html): for points, lines, or areas, with computation on multiple NVIDIA GPUs"
+    "See the [Performance user guide](user_guide/Performance.ipynb#Data-objects) for the available options for working with columnar/multidimensional/ragged data on single-core/multi-core/distributed/CPU/GPU resources in or out of core for each available glyph."
    ]
   }
  ],

diff --git a/examples/user_guide/10_Performance.ipynb b/examples/user_guide/10_Performance.ipynb
@@ -51,19 +51,39 @@
    "source": [
     "## Data objects\n",
     "\n",
-    "Datashader performance will vary significantly depending on the library and specific data object type used to represent the data in Python, because different libraries and data objects have very different abilities to use the available processing power and memory. Moreover, different libraries and objects are appropriate for different types of data, due to how they organize and store the data internally as well as the operations they provide for working with the data. The data objects currently supported by Datashader are:\n",
+    "Datashader performance will vary significantly depending on the library and specific data object type used to represent the data in Python, because different libraries and data objects have very different abilities to use the available processing power and memory. Moreover, different libraries and objects are appropriate for different types of data, due to how they organize and store the data internally as well as the operations they provide for working with the data. The various objects available from the supported libraries all fall into one of the following three types of data structures:\n",
+    "- **[Columnar (tabular) data](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)**: Relational, table-like data consisting of arbitrarily many rows, each with data for a fixed number of columns. For example, if you track the location of a particular cell phone over time, each time sampled would be a row, and for each time there could be columns for the latitude and longitude for the location at that time.\n",
+    "- **[Multidimensional (nd-array) data](http://xarray.pydata.org/en/stable/why-xarray.html)**: Data laid out in _n_ dimensions, where _n_ is typically >1.  For example, you might have the precipitation measured on a latitude and longitude grid covering the whole world, for every time at which precipitation was measured. Such data could be stored columnarly, but it would be very inefficient; instead it is stored as a three dimensional array of precipitation values, indexed with time, latitude, and longitude.\n",
+    "- **[Ragged arrays](https://en.wikipedia.org/wiki/Jagged_array)**: Relational/columnar data where the value of at least one column is a list of values that could vary in length for each row. For example, you may have a table with one row per US state and columns for population, land area, and the geographic shape of that state.  Here the shape would be stored as a polygon consisting of an arbitrarily long list of latitude and longitude coordinates, which does not fit efficiently into a standard columnar data structure due to its ragged (variable length) nature.\n",
     "\n",
-    "- [Pandas DataFrame](https://pandas.pydata.org): Basic columnar data support, typically lower performance than the other options where those are supported. Does not typically support multi-threaded operation that can make full use of your CPU's cores, and is generally limited to datasets that fit in your CPU's accessible memory. Does not support ragged arrays like polygons efficiently on its own.\n",
-    "- [Dask DataFrame (using Pandas DataFrame internally)](https://dask.org): Multi-threaded support built on Pandas that can make full use of your CPU's cores, distributed support for using multiple CPUs in a cluster, HPC system or the cloud, and out-of-core support for datasets larger than memory. \n",
-    "- [cuDF](https://github.com/rapidsai/cudf): DataFrame stored on and processed using an NVIDIA GPU (general-purpose graphics processing unit).\n",
-    "- [Dask DataFrame (using cuDF DataFrame internally)](https://rapidsai.github.io/projects/cudf/en/0.10.0/10min.html): DataFrame following a Dask API but implemented on NVIDIA GPUs internally. Provides multi-GPU support for computation distributed across multiple NVIDIA GPUs on the same or different machines.\n",
-    "- [SpatialPandas DataFrame](https://github.com/holoviz/spatialpandas): DataFrame based on Pandas extended to support efficient storage and computation on ragged arrays for polygons and variable-length lines and for spatially indexed points, typically using one core of one CPU.\n",
-    "- [Dask (using SpatialPandas DataFrame internally)](https://github.com/holoviz/spatialpandas): DataFrame using Dask's API built on SpatialPandas DataFrames that are combined with Dask to support multi-core, distributed, and out-of-core processing.\n",
-    "- [Xarray+NumPy](http://xarray.pydata.org): Multidimensional (not columnar or ragged) array operation built on NumPy arrays.\n",
-    "- [Xarray+DaskArray](https://dask.org/): Dask-based multidimensional array processing built on Dask arrays, with support for distributed (multi-CPU) operation.\n",
-    "- [Xarray+CuPy](https://cupy.chainer.org): Dask-based multidimensional array processing built on CuPy arrays, with storage and processing on an NVIDIA GPU.\n",
+    "As you can see, all three examples include latitude and longitude values, but they are very different data structures that need to be stored differently for them to be processed efficiently. \n",
     "\n",
-    "Datashader's current release supports these libraries for nearly all of the Canvas glyph types (points, lines, etc.) where they would apply. Supported combinations of glyph and data library are listed in this table, where the entries mean:\n",
+    "Apart from the data structure involved, the data libraries and objects differ by how they handle computation:\n",
+    "\n",
+    "- **Single-core CPU**: All processing is done serially on a single core of a single CPU on one machine. This is the most common and straightforward implementation, but the slowest, as there are generally other processing resources available that could be used.\n",
+    "- **Multi-core CPU**: All processing is done on a single CPU, but using multiple threads running on multiple separate cores. This approach is able to make use of all of the CPU resources available on a given machine, but cannot use multiple machines.\n",
+    "- **Distributed CPU**: Processing is distributed across multiple cores that may be on multiple CPUs in a cluster or a set of cloud-based machines. This approach can be much faster than single-CPU approaches when many CPUs are available.\n",
+    "- **GPU**: Processing is done not on the CPU but on a separate general-purpose graphics-processing unit (GP-GPU). The GPU typically has far more (though individually less powerful) cores available than a CPU does, and for highly parallelizable computations like those in Datashader a GPU can typically achieve much faster performance at a given price point than a CPU or distributed set of CPUs can. However, not all machines have a supported GPU, memory available on the GPU is often limited, and it takes special effort to support a GPU (e.g. to avoid expensive copying of data between CPU and GPU memory), and so not all CPU code has been rewritten appropriately. \n",
+    "- **Distributed GPUs**: When there are multiple GPUs available, processing can be distributed across all of the GPUs to further speed it up or to fit large problems into the larger total amount of GPU memory available across all of them.\n",
+    "\n",
+    "Finally, libraries differ by whether they can handle datasets larger than memory:\n",
+    "\n",
+    "- **In-core processing**: All data must fit into the directly addressable memory space available (e.g. RAM for a CPU); larger datasets have to be explicitly split up and processed separately.\n",
+    "- **Out-of-core processing**: The data library can process data in chunks that it reads in as needed, allowing it to work with data much larger than the available memory (at the cost of having to read in those chunks each time they are needed, instead of simply referring to the data in memory).\n",
+    "\n",
+    "Given all of these options, the data objects currently supported by Datashader are:\n",
+    "\n",
+    "- **[Pandas DataFrame](https://pandas.pydata.org)**: Basic single-core, single CPU, CPU-only, in-core, non-ragged columnar data support, typically lower performance than the other options where those are supported.\n",
+    "- **[Dask DataFrame (using Pandas DataFrame internally)](https://dask.org)**: Multi-core, multi-CPU, in-core or out-of-core, distributed CPU computation. \n",
+    "- **[cuDF](https://github.com/rapidsai/cudf)**: Single NVIDIA-hardware GPU DataFrame.\n",
+    "- **[Dask DataFrame (using cuDF DataFrame internally)](https://rapidsai.github.io/projects/cudf/en/0.10.0/10min.html)**: Distributed GPUs with a Dask API on multiple NVIDIA GPUs on the same or different machines.\n",
+    "- **[SpatialPandas DataFrame](https://github.com/holoviz/spatialpandas)**: Pandas DataFrame with support for ragged arrays and spatial indexing (efficient access of spatially distributed values), typically using one core of one CPU.\n",
+    "- **[Dask (using SpatialPandas DataFrame internally)](https://github.com/holoviz/spatialpandas)**: Distributed CPU processing, in-core or out-of-core, using Dask's DataFrame API built on SpatialPandas.\n",
+    "- **[Xarray+NumPy](http://xarray.pydata.org)**: Multidimensional data operation built on NumPy arrays.\n",
+    "- **[Xarray+DaskArray](https://dask.org/)**: Dask-based multidimensional array processing built on Dask arrays, with support for distributed (multi-CPU) operation.\n",
+    "- **[Xarray+CuPy](https://cupy.chainer.org)**: Multidimensional array processing built on CuPy arrays, with storage and processing on a single NVIDIA GPU.\n",
+    "\n",
+    "Datashader's current release supports these libraries for nearly all of the Canvas glyph types (points, lines, etc.) where they would naturally apply. Supported combinations of glyph and data library are listed in this table, where the entries mean:\n",
     "\n",
     "- **Yes**: Supported\n",
     "- **No**: Not (yet) supported, but could be with sufficient development effort (feel free to contribute effort or funding!)\n",
@@ -186,7 +206,7 @@
     "\n",
     "</tbody></table></div>\n",
     "\n",
-    "In general, all it takes to use the indicated data library for a particular glyph type is to instantiate a DataFrame (Pandas, Dask, cuPy, SpatialPandas) or DataArray/DataSet (Xarray), and then pass it to the appropriate `ds.Canvas` method call, as illustrated in the various examples in the user guide and topics."
+    "In general, all it takes to use the indicated data library for a particular glyph type is to instantiate a DataFrame (Pandas, Dask, cuDF, SpatialPandas) or DataArray/DataSet (Xarray), and then pass it to the appropriate `ds.Canvas` method call, as illustrated in the various examples in the user guide and topics."
    ]
   },
   {
@@ -206,7 +226,7 @@
     "use one thread per core for parallelizing computations.\n",
     "\n",
     "When the entire dataset fits into memory at once, you can (and should) persist the\n",
-    "data as a Dask dataframe prior to passing it into datashader, to ensure\n",
+    "data as a Dask dataframe prior to passing it into Datashader, to ensure\n",
     "that data only needs to be loaded once:\n",
     "\n",
     "```\n",