-
-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add internal API function to create tuple without items array initialization #80211
Comments
PyTuple_New() fills items array with NULLs to make usage of Py_DECREF() safe even when array is not fully filled with real items. Before: After: |
Sergey Fedoseev wrote similar change for list: bpo-34151. |
Zeroing memory is usually not expensive relative to the cost of filling it in. Also, the timing loop is unrealistic. We mostly care about small tuples. For long term maintenance of the project, I recommend filling the code with these unsafe variants which will segfault whenever someone follows the normal practice of decreffing when an error is encountered. This would be yet another special case a person would have to learn to do maintenance or code review for CPython. In general, there are a lot of small optimizations to be had if we were to forgo principles of loose coupling, defensive programming, and ease of maintenance. Once in a while, we find one that actually is worth it, but usually it is a best practice sprinkle these special cases all over the code (see MS Code Complete for more discussion on the long term costs of this kind of coding). |
Raymond:
Sergey: can you please run benchmarks on small tuples? Example, 1, 5, 10, 20 and 100 items. Maybe not only tuple(list), but try some other operations? |
I prefer PyTuple_FromArray() API idea. It's safer by design than you "list to tuple" function stealing references to list items if refcnt(list)==1. |
Here's benchmarks for PyTuple_FromArray() PR: $ python -m perf timeit -s "l = [None]*0; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 50.5 ns +- 1.0 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 45.6 ns +- 1.2 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 50.5 ns +- 1.0 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 45.6 ns +- 1.2 ns: 1.11x faster (-10%)
$ python -m perf timeit -s "l = [None]*1; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 61.8 ns +- 1.1 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 54.8 ns +- 0.8 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 61.8 ns +- 1.1 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 54.8 ns +- 0.8 ns: 1.13x faster (-11%)
$ python -m perf timeit -s "l = [None]*5; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 68.2 ns +- 1.3 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 61.8 ns +- 1.5 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 68.2 ns +- 1.3 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 61.8 ns +- 1.5 ns: 1.10x faster (-9%)
$ python -m perf timeit -s "l = [None]*10; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 88.1 ns +- 2.3 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 78.9 ns +- 3.1 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 88.1 ns +- 2.3 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 78.9 ns +- 3.1 ns: 1.12x faster (-10%)
$ python -m perf timeit -s "l = [None]*100; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 477 ns +- 7 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 452 ns +- 6 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 477 ns +- 7 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 452 ns +- 6 ns: 1.05x faster (-5%) |
Please make PyTuple_FromArray private: rename to _PyTuple_FromArray. I would prefer to only add it to Include/internals/pycore_tupleobject.c. |
Once I had wrote a similar patch that adds PyTuple_FromArray, but never published it because I did not found enough use cases for this function. Although I considered using it only for removing some code duplication, but Sergey shown that it can be used for small performance boost in some special cases. I am still not sure, but this argument makes this change a tiny bit more attractive. Leave this on Raymond. |
The micro-benchmark results are not really impressive. I still like PR 11954 because it removes code (simply loops). _PyTuple_FromArray() has a well defined API and is safe (I'm saying that because PR 11927 adds an "unsafe" function). As soon as it's private, I'm fine with it. I'm more attracted by the code simplification than performance boost here. |
I reviewed PR 11954: I asked to rework the PR to only add _PyTuple_FromArray() and the "unrelated" micro-optimization. So it would be easier to see code simplification and the micro-optimization. If the micro-optimization doesn't make the code more complex and doesn't introduce subtle issue with the GC, I'm fine with taking 10 ns optimization ;-) |
This change is virtually renaming _PyStack_AsTuple to _PyTuple_FromArray and using it in appropriate places for code simplification. I am +1 for such change. |
I added WIP PR with discussed micro-optimization, here are benchmark results: $ python -m perf compare_to --min-speed 1 -G master.json tuple-untracked.json
Slower (1):
- sqlite_synth: 5.16 us +- 0.10 us -> 5.22 us +- 0.08 us: 1.01x slower (+1%) Faster (19):
|
This optimization also can be used for BUILD_TUPLE opcode and in pickle module, if it's OK to add _PyTuple_StealFromArray() function :-) |
I would like to see a micro-benchmark showing that it's faster. |
Can you please convert msg336142 into a perf script? See the doc: And then run again these benchmarks on PR 12052. If you have a script, you can do: ./python-ref script.py -o ref.json https://perf.readthedocs.io/en/latest/cli.html#compare-to-cmd |
Victor/vstinner: Isn't PR 12032 reintroducing the issue fixed in bpo-29234? _PyStack_AsTuple was explicitly marked as _Py_NO_INLINE because inlining was creating excessive stack consumption in the callers (which were the bytecode interpreter loop), but the new _PyTuple_FromArray isn't marked as _Py_NO_INLINE, so the swap reintroduces the problem. Seems like either:
or
It's possible this isn't as much of an issue because _PyTuple_FromArray is in tupleobject.c (where it's only called in one place), while _PyStack_AsTuple was in call.c and was called from within call.c in four places, but only if link-time optimization isn't involved (and in practice, most public distributions of CPython are built with link-time optimization now, correct?); if LTO is enabled, the same stack bloat issues are possible. |
No according to manual tests: Usage of the stack memory doesn't change with PR 12032. FYI I also tested manually since I was very surprised, and I confirm that the PR doesn't change the usage of the stack memory :-) |
Please test, I'm not interested to spend too much time on that topic. To be clear, _Py_NO_INLINE was a hack and a micro-optimization. It doesn't solve a real bug. Python has very weak promises on the maximum stack depth. My work on reducing the stack memory usage was mostly motivated by my work on FASTCALL, since some patches reduced the maximum stack depth. They increased the stack memory usage with micro-optimizations like "PyObject *small_stack[_PY_FASTCALL_SMALL_STACK];" which is allocated on ths stack: see Objects/call.c. |
+--------------------+---------+------------------------------+ |
+-----------------+-----------------+-----------------------------+ |
Well done Sergey Fedoseev: I merged your last PR for this issue ;-) I'm not sure if it's worth it to document these tuple micro-optimizations in What's New in Python 3.8 / 3.9. In the meanwhile, I close the issue. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: