
All About Python Benchmarking


This page is dedicated to the numerous facets of benchmarking Python.


Overview

A benchmark is a script or application designed to produce concrete measurements of another application or runtime (e.g. CPython). A benchmark is often characterized as a "micro" benchmark or a "macro" benchmark, depending on its complexity and what functionality it represents. Typically a benchmark aims to represent a specific execution profile based on an existing use case or capability, rather than measuring real-world usage directly. This is because a benchmark needs to give consistent results and to emphasize the performance of the target use case or capability.
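
For a sense of what the simplest case looks like, a micro benchmark can be built on nothing more than the stdlib's timeit module. The measured expression and iteration count in this sketch are arbitrary, purely for illustration:

    import timeit

    # A minimal "micro" measurement: run one small operation many times
    # so that per-call noise averages out.
    total = timeit.timeit(
        stmt='"-".join(str(n) for n in range(100))',  # the code under measurement
        number=100_000,                               # total iterations
    )
    print(f"{total / 100_000 * 1e6:.2f} microseconds per iteration")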

Why is benchmarking important to us, as we work to make Python faster? It boils down to this: good decisions depend on reliable data, and that is especially true in technology. Python implementors are no different. We want to be sure Python meets the needs of its community.

The Needs of the Community

The Python community has a wide variety of needs, some of which Python itself satisfies. It is critical that Python implementors have the following:

  • an understanding of the community's Python needs
  • a consistent terminology to communicate about those needs
  • a uniform way to measure how well they are met

On this page we focus on two critical aspects of how Python is used: workloads and features. Workloads are the categories of use cases for applications and libraries. Features are the capabilities, provided by the Python language and stdlib, that those applications rely on.

Python Performance

The central discussion here is around making Python faster, and benchmarks are essential to making that happen. For Python implementors, benchmarks provide the reliable data we need to make good decisions about Python performance. That data is useful to users too, as they make their own technology choices.

Users:

  • care about how fast (or how efficiently) their applications run
  • want to know, when deciding between comparable features, which is fastest
  • factor in performance when considering different Python versions or other Python implementations

Python implementors:

  • use benchmark results to communicate with users about all of the above
  • care about which features are slow (or fast)
  • need to know how much proposed changes improve or hurt performance
  • want to quickly pinpoint the source of performance regressions

Benchmarks are meaningful only if they address all of those needs. The tricky part is building benchmarks that do so effectively, especially when benchmarks can take a long time to run and time for analyzing results is limited.

This is where workloads and features come back into play. Benchmarks that focus either on workloads or on specific features are, together, very effective at ticking all those earlier boxes.

Workloads

A Python "workload" is what we identify as a high-level use case for a Python runtime in the community. At its essence, a workload is a discrete category in which to group Python applications (and libraries), describing that specific case. Some workloads are complex, with applications utilizing many Python features, while others are simpler. Some workloads are long-running, while others are short-lived. The resources on which workloads depend also varies greatly.

XXX TODO:

  • top-level workloads vs. sub-workloads

When a benchmark represents the behavior of a specific workload, we call it a "workload" benchmark. Another name for a workload benchmark is a "macro" benchmark.
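
As a rough sketch (not taken from any existing suite), a small workload benchmark might look like the following. It uses the pyperf library, which the pyperformance suite is built on, and the JSON round-trip "workload" is invented purely for illustration:

    import json

    import pyperf


    def json_roundtrip():
        # A toy "workload": build, serialize, and re-parse a batch of records,
        # exercising several features together rather than one in isolation.
        records = [
            {"id": i, "name": f"user-{i}", "scores": list(range(10))}
            for i in range(1_000)
        ]
        return json.loads(json.dumps(records))


    runner = pyperf.Runner()
    runner.bench_func("json_roundtrip_workload", json_roundtrip)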

Related:

  • per-workload tables
  • per-benchmark tables

Features

Python (and each of its implementations) can be partially described as a set of features. Features can be categorized as granular (i.e. "atomic") vs. composite, language vs. stdlib, etc. Features are distinct from workloads in several important ways. A feature is provided by the language/runtime and is a low(er)-level building block focused on a specific foundational capability. In contrast, a workload is focused on high-level user applications.

A "feature benchmark" is one that focuses strictly on exercising a specific Python feature in a specific way. Feature benchmarks are all "micro" benchmarks (but not all micro benchmarks are feature benchmarks).

Related:

  • per-feature tables
  • per-benchmark tables

Running Benchmarks

XXX operations:

  • how to run benchmarks (see the sketch after this list)
  • getting consistent results
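
In the meantime, here is a sketch of one common approach, assuming pyperf is installed. pyperf helps with consistency by calibrating the loop count, running warmup iterations, and spreading measurements across multiple worker processes; its CLI also provides a "system tune" command for reducing OS-level jitter. The specific counts below are illustrative, not recommendations:

    import pyperf

    # Ask for explicit warmups, values per process, and worker processes.
    # pyperf also accepts these as command-line options at run time.
    runner = pyperf.Runner(processes=20, warmups=2, values=5)
    runner.timeit(
        "str_format",
        stmt="'{}-{}'.format(a, b)",
        setup="a = 123; b = 'abc'",
    )

    # A typical invocation might look like:
    #   python -m pyperf system tune          # optional: reduce system jitter
    #   python bench_str_format.py -o results.json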

Benchmark Results

XXX operations:

  • how to compare (see the sketch below)
  • sharing results
  • fair comparisons between runs, incl. between implementations
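
In the meantime, a rough sketch of comparing results: pyperf and pyperformance write results as JSON, and two runs are most often compared with pyperf's compare_to command (e.g. python -m pyperf compare_to baseline.json contender.json --table). The same data can also be inspected from Python, assuming pyperf's result-loading API; the filenames below are placeholders:

    import pyperf

    # Load two result files (as written with "-o") and compare mean timings.
    # The filenames are placeholders for whichever runs are being compared.
    baseline = pyperf.BenchmarkSuite.load("baseline.json")
    contender = pyperf.BenchmarkSuite.load("contender.json")

    base_by_name = {b.get_name(): b for b in baseline.get_benchmarks()}
    for bench in contender.get_benchmarks():
        name = bench.get_name()
        if name in base_by_name:
            old = base_by_name[name].mean()
            new = bench.mean()
            print(f"{name}: {old * 1e3:.3f} ms -> {new * 1e3:.3f} ms "
                  f"(ratio {new / old:.2f}x)")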

Which Benchmarks to Use

We can measure many kinds of performance, but CPU (time) performance is typically the primary subject (with memory use as a secondary one).
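
When memory does matter, one rough way to get a number (outside any benchmark harness) is the stdlib's tracemalloc module. The allocation in this sketch is arbitrary, just to have something to measure:

    import tracemalloc

    tracemalloc.start()

    # An arbitrary allocation-heavy snippet to measure.
    data = [bytearray(1024) for _ in range(10_000)]

    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"current: {current / 1024:.0f} KiB, peak: {peak / 1024:.0f} KiB")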

XXX core benchmarks vs. community benchmarks

XXX practical concerns:

  • throughput: meaningful results vs. getting results quickly
  • throughput (workloads): getting benchmark results quickly vs. accurately representing long-running workloads
  • benchmarks for features/workloads that use system resources (network, FS, etc.)
  • benchmarks for features/workloads that are fundamentally non-deterministic

Related:

Adding New Benchmarks

...