Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split up "regular" and AOT DSO caches #9117

Merged
merged 5 commits into from
Aug 27, 2024
Merged

Conversation

grendello
Copy link
Contributor

Fixes: #9081
Context: dotnet/runtime#104397

Splits up "regular" DSO cache and AOT DSO cache into separate collections.
It will not only speed up regular DSO usage (e.g. for p/invoke calls) but
will also disambiguate like-named shared libraries from the two sets.

When dotnet/runtime#104397 is fixed, it will make the AOT side of the split
more efficient as we won't have to permute the shared library name as many
times as now.

@grendello grendello force-pushed the dev/grendel/dso-cache-fix branch 3 times, most recently from 5cc7999 to b11debd Compare July 23, 2024 10:29
@grendello grendello marked this pull request as ready for review July 23, 2024 12:10
Copy link
Member

@jonathanpeppers jonathanpeppers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main thing missing here is some kind of test. Was there a SkiaSharp sample that was crashing?

Is there some permutation of this test that would have made it crash?

public void SkiaSharpCanvasBasedAppRuns ([Values (true, false)] bool isRelease, [Values (true, false)] bool addResource)

@grendello
Copy link
Contributor Author

I think the main thing missing here is some kind of test. Was there a SkiaSharp sample that was crashing?

The problem is that it stopped crashing after @daltzctr reinstalled dotnet... :) A test is definitely something we need, but I'd like to wait for @lambdageek's code to land on the runtime side, as it will simplify what we need to process for AOT libraries.

Is there some permutation of this test that would have made it crash?

public void SkiaSharpCanvasBasedAppRuns ([Values (true, false)] bool isRelease, [Values (true, false)] bool addResource)

I'll take a look at it.

@lambdageek
Copy link
Member

I'd like to wait for @lambdageek's code to land on the runtime side

There were two suggestions in dotnet/runtime#104397

a. dotnet/runtime#104397 (comment) - make mono not crash when it opens a DSO but it doesn't contain a MonoAotFileInfo. just improve the error handling.
b. dotnet/runtime#104397 (comment) extra flags and a change of how we build the runtime for Android (and updates to our AndroidAppBuilder, and probably library mode, etc).

I'm not actively working on either - I was just helping to diagnose the issue for the runtime mobile team. I think (a) can be done in .NET 9, but (b) feels too big.

@grendello
Copy link
Contributor Author

@jonathanpeppers I tried to make the sample crash, but couldn't :( Considering that the #9031 original repro doesn't appear to crash anymore, I don't think we have a reliable way to test for this condition. Regardless, this PR is a good change since it separates two concerns, eliminating the possibility of a hash clash (making stuff faster at the same time, since the p/invoke mechanism has less work to do searching the DSO cache)

@lambdageek understood :) Runtime changes aren't required for this PR to work, so as much as it would be nice to have a. from your comment, we don't absolutely require it now. IOW, no rush :)

@grendello grendello force-pushed the dev/grendel/dso-cache-fix branch 6 times, most recently from 60dcd07 to 7a962fd Compare July 29, 2024 18:36
@grendello
Copy link
Contributor Author

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@matouskozak
Copy link
Member

matouskozak commented Aug 7, 2024

@jonathanpeppers I tried to make the sample crash, but couldn't :( Considering that the #9031 original repro doesn't appear to crash anymore, I don't think we have a reliable way to test for this condition. Regardless, this PR is a good change since it separates two concerns, eliminating the possibility of a hash clash (making stuff faster at the same time, since the p/invoke mechanism has less work to do searching the DSO cache)

@lambdageek understood :) Runtime changes aren't required for this PR to work, so as much as it would be nice to have a. from your comment, we don't absolutely require it now. IOW, no rush :)

We implemented the a. runtime changes in dotnet/runtime#106026. Locally, I was able to verify that it fixed a previously crashing Xamarin Android app due to SkiaSharp load, it should be part of the .NET 9 RC1 runtime release. @grendello

static DSOCacheEntry* find_only_aot_cache_entry (hash_t hash) noexcept
{
constexpr bool IsAotCache = true;
return find_dso_cache_entry_common<IsAotCache> (hash);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not fond of a recurring pattern in this PR:

constexpr bool IsAotCache = true;
return find_dso_cache_entry_common<IsAotCache> (hash);

because it's hiding important differences. find_only_aot_cache_entry() and find_only_dso_cache_entry() both contain the expression find_dso_cache_entry_common<IsAotCache>(hash), but they do "opposite" things, and that isn't clear from the callsite, as it requires "additional context" in the form of the preceding line.

I think it would be clearer if we did one of two things instead:

  • "Manually" inline, e.g. find_dso_cache_entry_common<true>(hash) vs. find_dso_cache_entry_common<false>(hash) makes it immediately obvious that they're different, or
  • Use a name that makes it clearer what is going on:
    constexpr bool UseAotCache = true;
    return find_dso_cache_entry_common<UseAotCache> (hash);
    // vs
    constexpr bool DoNotUseAotCache = false;
    return find_dso_cache_entry_common<DoNotUseAotCache> (hash);

Different things should look different at a glance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use this pattern to make it obvious what the <true> in template instantiation means and the name of the constant reflects the role of the template parameter in question. What you suggest:

constexpr bool DoNotUseAotCache = false;

Reads to me that the call is not using the AOT cache (if I just look at the name of the constant) but if I notice its value, then it's confusing - because !DoNotUseAotCache logically means use AOT cache, and that's not what the call would do. IsAotCache = true or IsAotCache = false, OTOH, is less confusing in that the name + the value are "in agreement".

I agree that the whole pattern is not perfect, but it is trying to provide for "self documenting" code or, at least, something that's easier to understand without having to find the definition of find_dso_cache_entry_common

@@ -112,7 +159,7 @@ namespace xamarin::android::internal

hash_t name_hash = xxhash::hash (name, strlen (name));
log_debug (LOG_ASSEMBLY, "monodroid_dlopen: hash for name '%s' is 0x%zx", name, name_hash);
DSOCacheEntry *dso = find_dso_cache_entry (name_hash);
DSOCacheEntry *dso = use_aot_cache ? find_any_dso_cache_entry (name_hash) : find_only_dso_cache_entry (name_hash);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If use_aot_cache is true, shouldn't this invoke find_only_aot_cache_entry()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my later comment, this line is probably fine, it's just that the name use_aot_cache was misleading/confusing me.

Assuming my later comment is correct, that use_aot_cache means "prefer the AOT cache" instead of an either/or interpretation, then I think removing find_any_dso_cache_entry() entirely might be useful, then updating this callsite to:

DSOCacheEntry *dso = nullptr;
if (prefer_aot_cache) {
    dso = find_only_aot_cache_entry (name_hash);
}
if (dso == nullptr) {
    dso = find_only_dso_cache_entry (name_hash);
}

which would help emphasize this is a "prefer AOT/look there first" vs. an "either/or" scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If use_aot_cache is true, shouldn't this invoke find_only_aot_cache_entry()?

No, because the AOT context is less, hmm, defined. The DSO cache entry context is 100% clear - it is always used from within the p/invoke resolution code, so we know the caller (MonoVM) won't be interested in anything else but libraries that contain p/invokable code. When monodroid_dlopen is called, OTOH, the runtime might want either the AOT shared libraries or standard shared libraries and we don't know which.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grendello wrote:

the runtime might want either the AOT shared libraries or standard shared libraries and we don't know which.

which is part of my confusion: yes, MonoVM wants either/or, but from our perspective it means we need to check both, in a particular order.

Maybe it's just me -- it's probably just me -- but "either/or" in combination with "use_aot_cache" implies to me that either the AOT cache should be used, or the DSO cache, but not both.

Meanwhile, we do want "both"! In a particular preferential order! Hence my suggestion to rename "use_aot_cache" to "prefer_aot_cache".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's prefer_aot_cache now.

@@ -19,7 +19,9 @@ namespace xamarin::android {
void *lib_handle = dso_handle == nullptr ? nullptr : *dso_handle;

if (lib_handle == nullptr) {
lib_handle = internal::MonodroidDl::monodroid_dlopen (library_name, MONO_DL_LOCAL, nullptr, nullptr);
// We're being called as part of the p/invoke mechanism, we don't need to look in the AOT cache
constexpr bool USE_AOT_CACHE = false;
Copy link
Member

@jonpryor jonpryor Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto: this should probably be named DO_NOT_USE_AOT_CACHE=false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the pattern by using an enum to parametrize the template instead. Hopefully it's less confusing.

@@ -103,7 +150,7 @@ namespace xamarin::android::internal

public:
[[gnu::flatten]]
static void* monodroid_dlopen (const char *name, int flags, char **err, [[maybe_unused]] void *user_data) noexcept
static void* monodroid_dlopen (const char *name, int flags, char **err, bool use_aot_cache) noexcept
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think part of my confusion here is thinking that "use aot cache" is an either/or interpretation: either (only) AOT cache is used, or AOT cache is not used. (It's a boolean! Is this not what boolean means?)

While the logic appears to instead be: "when true, check AOT first, then fallback to non-AOT cache."

Consequently, a better name than use_aot_cache could be prefer_aot_cache or check_aot_cache_first.

AOT cache not used yet, so applications using AOT will crash at runtime at this point, pending a
Mono runtime fix.
@jonpryor
Copy link
Member

jonpryor commented Aug 27, 2024

Fixes: https://github.com/dotnet/android/issues/9081

Context: https://github.com/dotnet/runtime/issues/104397
Context: https://github.com/dotnet/runtime/pull/106026
Context: https://github.com/dotnet/android/issues/9081#issuecomment-2209064439
Context: c227042b6e58565e21dcd5e4ff5ea20fc4e9367c

("Compatibility is fun")

Consider a P/Invoke method declaration:

	[DllImport("libSkiaSharp")]
	static extern void gr_backendrendertarget_delete(IntPtr rendertarget);

Historically, when attempting to resolve this method, Mono would try
loading the following native libraries:

  * `libSkiaSharp`
  * `libSkiaSharp.so`
  * `liblibSkiaSharp`
  * `liblibSkiaSharp.so`

.NET for Android would further permute these names, *removing* the
`lib` prefix, for attempted compatibility in case there is a P/Invoke
into `"SkiaSharp"`.

The unfortunate occasional result would be an *ambiguity*: when told
to resolve "SkiaSharp", what should we return?  The information for
`libSkiaSharp.so`, or for the *AOT'd image* of the assembly
`SkiaSharp.dll`, by way of `libaot-SkiaSharp.dll.so`?

        %struct.DSOCacheEntry {
                i64 u0x12e73d483788709d, ; from name: SkiaSharp.so
                i64 u0x3cb282562b838c95, ; uint64_t real_name_hash
                i1 false, ; bool ignore
                ptr @.DSOCacheEntry.23_name, ; name: libaot-SkiaSharp.dll.so
                ptr null; void* handle
        }, ; 71
        %struct.DSOCacheEntry {
                i64 u0x12e73d483788709d, ; from name: SkiaSharp.so
                i64 u0x43db119dcc3147fa, ; uint64_t real_name_hash
                i1 false, ; bool ignore
                ptr @.DSOCacheEntry.7_name, ; name: libSkiaSharp.so
                ptr null; void* handle
        }, ; 72

If we return the wrong thing, then the app may crash or otherwise
behave incorrectly.

Fix this by:

  * Splitting up the DSO cache into AOT-related `.so` files and
    everything else.
  * Updating `PinvokeOverride::load_library_symbol()` so that the AOT
    files are *not* consulted when resolving P/Invoke libraries.
  * Updating `MonodroidDl::monodroid_dlopen()` -- which is called by
    MonoVM via `mono_dl_fallback_register()` -- so that the AOT files
    *are* consulted *first* when resolving AOT images.

When dotnet/runtime#104397 is fixed, it will make the AOT side of the
split more efficient as we won't have to permute the shared library
name as many times as now.

@jonpryor jonpryor merged commit 47a3b33 into main Aug 27, 2024
58 checks passed
@jonpryor jonpryor deleted the dev/grendel/dso-cache-fix branch August 27, 2024 20:26
@github-actions github-actions bot locked and limited conversation to collaborators Sep 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DSO handling for like-named libraries is broken
5 participants