Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean the OpenSSL error queue more consistently #65148

Merged
merged 5 commits into from
Feb 15, 2022

Conversation

bartonjs
Copy link
Member

@bartonjs bartonjs commented Feb 10, 2022

This change started with the intent of changing the error code / message from an OpenSSL-based
exception from the most recent error in the queue to the oldest error that was produced after the
operation started. This was motivated mostly by EVP_PKEY2PKCS8(pkey) incorrectly indicating a
malloc failure after producing the original/correct error that pkey did not have a private key
portion.

Having fully developed the experiment, data showed that while for EVP_PKEY2PKCS8 the first (of
two) errors was the better one, for everything else with more than one error reported, the last
error was at least as good as, and often better, than the first error. With that data in hand, this
change now represents more consistently cleaning the error queue, and reducing the overhead
in producing the exception objects.

Closes #55973.

* The exception is based on the oldest error in the queue, instead of the newest
* Operations whose failure can result in an exception should be clearing the queue at the start of that operation, and at any time an error has been inspected and suppressed.

For this change, each shim function was evaluated based on its semantics.
The callsites weren't checked to verify that failure from a non-error-queue routine aren't throwing a queue-based exception.
@ghost
Copy link

ghost commented Feb 10, 2022

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

This change started with the intent of changing the error code / message from an OpenSSL-based
exception from the most recent error in the queue to the oldest error that was produced after the
operation started. This was motivated mostly by EVP_PKEY2PKCS8(pkey) incorrectly indicating a
malloc failure after producing the original/correct error that pkey did not have a private key
portion.

Having fully developed the experiment, data showed that while for EVP_PKEY2PKCS8 the first (of
two) errors was the better one, for everything else with more than one error reported, the last
error was at least as good as, and often better, than the first error. With that data in hand, this
change now represents more consistently cleaning the error queue, and reducing the overhead
in producing the exception objects.

Author: bartonjs
Assignees: -
Labels:

area-System.Security

Milestone: -

@@ -102,6 +104,8 @@ otherwise.
*/
const ASN1_TIME* CryptoNative_GetX509NotBefore(X509* x509)
{
// No error queue impact.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost preemptively removed all of these comments, since they aren't as relevant with "keep using the last error in the queue". Clearly, I didn't.

  • Why remove: They were added to justify a lack of ERR_clear_error, when ERR_clear_error was necessary producing correct exceptions.
  • Why keep: They represent the output of white-box investigation of the code behind the code.
  • Why remove: They could be invalidated.
  • Why keep: Uh... it made opening the PR easier.

I have to do something to deal with merge conflicts no matter what... so thought I'd open the floor for a keep or cut discussion.

Copy link
Member

@vcsjones vcsjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions. Also, in the original issue:

Each operation probably wants to start off with ERR_clear_error(). It's relatively cheap these days.
(Probably worth a perf measurement to quantify this).

Do we have any sense of the perf impact of this, or confident enough that it's a cheap operation?

@bartonjs
Copy link
Member Author

Do we have any sense of the perf impact of this, or confident enough that it's a cheap operation?

In the shim version (CryptoNative_ErrClearError) I wrapped the call with __rdtsc. It's not "free" but it's "pretty cheap". But "pretty cheap" still felt slightly expensive when put in front of a field read.

ERR_clear_error() in 794 time units.
ERR_clear_error() in 763 time units.
ERR_clear_error() in 737 time units.
ERR_clear_error() in 813 time units.
ERR_clear_error() in 824 time units.
ERR_clear_error() in 685 time units.
ERR_clear_error() in 1343 time units.
ERR_clear_error() in 796 time units.
ERR_clear_error() in 713 time units.
ERR_clear_error() in 850 time units.
ERR_clear_error() in 787 time units.
ERR_clear_error() in 683 time units.
ERR_clear_error() in 1221 time units.
ERR_clear_error() in 1034 time units.
ERR_clear_error() in 898 time units.
ERR_clear_error() in 870 time units.
ERR_clear_error() in 850 time units.
ERR_clear_error() in 789 time units.
ERR_clear_error() in 1351 time units.
ERR_clear_error() in 841 time units.
ERR_clear_error() in 1057 time units.
ERR_clear_error() in 824 time units.

@bartonjs
Copy link
Member Author

Also, for a sense of scale on the __rdtsc values, some RSAOpenSsl.SignHash calls:

  EVP_PKEY_sign completed in 2391753 time units. ret=128
  EVP_PKEY_sign completed in 9962701 time units. ret=256
  EVP_PKEY_sign completed in 9833251 time units. ret=256
  EVP_PKEY_sign completed in 9802511 time units. ret=256
  EVP_PKEY_sign completed in 9860034 time units. ret=256
  EVP_PKEY_sign completed in 9904225 time units. ret=256
  EVP_PKEY_sign completed in 2393494 time units. ret=128
  EVP_PKEY_sign completed in 9772914 time units. ret=256
  EVP_PKEY_sign completed in 758404 time units. ret=48
  EVP_PKEY_sign completed in 9810084 time units. ret=256
  EVP_PKEY_sign completed in 1132 time units. ret=-1
  EVP_PKEY_sign completed in 787 time units. ret=-1
  EVP_PKEY_sign completed in 810 time units. ret=-1
  EVP_PKEY_sign completed in 9817429 time units. ret=256
  EVP_PKEY_sign completed in 9896021 time units. ret=256
  EVP_PKEY_sign completed in 9797808 time units. ret=256
  EVP_PKEY_sign completed in 9742905 time units. ret=256
  EVP_PKEY_sign completed in 9803647 time units. ret=256

@bartonjs
Copy link
Member Author

Since no one asked for the comments to be deleted, I left them all in. But updated the two that had been called out. (I also looked again at DSA_size and ECDSA_size, and agree they are "no impact"... and will get rewritten to be EVP_PKEY_size anyways)

…aphy.Native/Interop.ERR.cs

Co-authored-by: Kevin Jones <vcsjones@github.com>
@bartonjs bartonjs merged commit 45a9548 into dotnet:main Feb 15, 2022
@bartonjs bartonjs deleted the ossl_exception_model branch February 15, 2022 00:50
rzikm added a commit to rzikm/dotnet-runtime that referenced this pull request Mar 10, 2022
Fixes dotnet#44689

As of dotnet#65148, libraries use different approach to handling OpenSSL errors. The original assert which would be hit if a previous operation left errors in the queue is no longer necessary as all OpenSSL calls are prepended by `ERR_clear_error()` to make sure the latest (and most relevant) error is reported.
rzikm added a commit that referenced this pull request Mar 10, 2022
Fixes #44689

As of #65148, libraries use different approach to handling OpenSSL errors. The original assert which would be hit if a previous operation left errors in the queue is no longer necessary as all OpenSSL calls are prepended by `ERR_clear_error()` to make sure the latest (and most relevant) error is reported.
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
@jozkee
Copy link
Member

jozkee commented Oct 14, 2022

@bartonjs @vcsjones this showed-up in the 6.0 vs 7.0-rc2 perf report. Seems like this change regressed one benchmark on non-Windows configs. Should this be expected (by design)?

Graph of the historical data:
https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Security.Cryptography.Tests.Perf_Rfc2898DeriveBytes.DeriveBytes.html

Screenshot:
newplot (2)

3a4ade3...45a9548 is the range of commits in which the benchmark results changed, in case you think something else caused it.

Here's the report data:

System.Security.Cryptography.Tests.Perf_Rfc2898DeriveBytes.DeriveBytes

Result Ratio Alloc Delta Operating System Bit Processor Name Modality
Slower 0.73 +17 ubuntu 18.04 Arm64 Unknown processor
Same 1.08 -3 Windows 11 Arm64 Unknown processor
Same 1.02 -1 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Same 0.99 -2 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Slower 0.74 -3 macOS Monterey 12.6 Arm64 Apple M1
Slower 0.73 -3 macOS Monterey 12.6 Arm64 Apple M1 Max
Same 0.94 +1 Windows 10 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Same 1.02 -1 Windows 11 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores
Same 0.97 +0 Windows 11 X64 AMD Ryzen 9 5900X
Same 0.97 +0 Windows 11 X64 AMD Ryzen 9 7950X
Same 0.97 -2 Windows 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.80 +3 debian 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.74 +1 ubuntu 18.04 X64 AMD Ryzen 9 5900X
Slower 0.81 -2 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz
Slower 0.69 +1 ubuntu 20.04 X64 AMD Ryzen 9 5900X
Slower 0.84 +7 ubuntu 20.04 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Slower 0.84 +3 ubuntu 20.04 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.69 +23 macOS Big Sur 11.7 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell)
Slower 0.68 +29 macOS Monterey 12.6 X64 Intel Core i7-4870HQ CPU 2.50GHz (Haswell)

@vcsjones
Copy link
Member

@jozkee Yeah. We discussed this one in #65719 and the regression is expected / accepted.

@dakersnar
Copy link
Contributor

Ahh, thank you. We had a bug in our search process and missed that.

@karelz karelz added this to the 7.0.0 milestone Nov 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revisit the OpenSSL interop exception model
6 participants