Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libsolv %_dbpath fixes rollup #2553

Merged
merged 7 commits into from
Feb 9, 2021

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Feb 8, 2021

We trigger a librpm macro file load in many of our paths. Since the
default value shipped by rpm's macro file sets _dbpath to
/var/lib/rpm, we have to explicitly set that back to /usr/share/rpm
in those paths.

This became more problematic recently with libsolv v0.7.17 which fully
keys off of _dbpath to find the rpmdb path to load:

openSUSE/libsolv@04d4d03

And it's not technically wrong; we really should make that macro not
lie. This is what this patch does by injecting an RPM macro file in our
composes which sets it to /usr/lib/sysimage/rpm (which currently is a
symlink to /usr/share/rpm, but eventually will become the canonical
location). So then e.g. the rpm CLI doesn't actually need the
/var/lib/rpm backcompat link anymore, though there's no harm in
leaving it.

In the future, we should be able to drop this once we move all of Fedora
to /usr/lib/sysimage/rpm (see
coreos/fedora-coreos-tracker#639).

Closes: #2548

@cgwalters
Copy link
Member

/retest
Weird, not seeing failure logs in Prow artifacts.

@cgwalters
Copy link
Member

Ohh, I think Prow may not do CI for draft PRs by default and so switching to draft cancelled those jobs? Let's see if the explicit test request works.

@cgwalters
Copy link
Member

OK, successfully failed here:

initramfs: error: syscore cleanup: regenerating refs: pkgcache cleanup: Deployment index=1: Failed to find any packages in root

rust/src/composepost.rs Outdated Show resolved Hide resolved
rust/src/composepost.rs Show resolved Hide resolved
@jlebon jlebon force-pushed the pr/rpmdb-fixes branch 2 times, most recently from 24f929d to d2f6e00 Compare February 8, 2021 20:20
@cgwalters
Copy link
Member

Looks like commit message needs adjusting too, otherwise LGTM!

@cgwalters
Copy link
Member

One thing to bear in mind is this will also break the use case of e.g. dnf --installroot directly on an rpm-ostree based host system. Which admittedly is pretty obscure...but until we unify the path it's going to cause obscure and hard to fix fallout.

@cgwalters
Copy link
Member

Anyways, LGTM - let's lift the draft and remove "DNM:" tag?

@jlebon
Copy link
Member Author

jlebon commented Feb 8, 2021

Anyways, LGTM - let's lift the draft and remove "DNM:" tag?

I'm waiting on CI to make sure it's green with the new libsolv before dropping that commit and tweaking the commit message.

Although... probably not a bad idea to just keep that override anyway.

@jlebon jlebon changed the title DNM: ci: Test with libsolv-0.7.17 Add /etc/rpm/macros.rpm-ostree to set %_dbpath to /usr/lib/sysimage/rpm Feb 8, 2021
@jlebon jlebon marked this pull request as ready for review February 8, 2021 21:21
@jlebon jlebon changed the title Add /etc/rpm/macros.rpm-ostree to set %_dbpath to /usr/lib/sysimage/rpm Add /usr/lib/rpm/macros.d/macros.rpm-ostree to set %_dbpath to /usr/lib/sysimage/rpm Feb 8, 2021
@cgwalters
Copy link
Member

Yeah let's keep the override for a little bit at least I think; we have coverage of older libsolv via kola tests.
/lgtm

@cgwalters
Copy link
Member

Huh interesting, upgrade test failed with

Feb  9 00:07:36.124276 systemd-coredump[1164]: Process 1104 (rpm-ostree) of user 978 dumped core.
                                               
                                               Stack trace of thread 1104:
                                               #0  0x00007f29ec6fd9d5 raise (libc.so.6 + 0x3d9d5)
                                               #1  0x00007f29ec6e68a4 abort (libc.so.6 + 0x268a4)
                                               #2  0x00007f29ed159b6c g_assertion_message.cold (libglib-2.0.so.0 + 0x1db6c)
                                               #3  0x00007f29ed1b498a g_assertion_message_cmpnum (libglib-2.0.so.0 + 0x7898a)
                                               #4  0x00007f29ed67a9a5 _rpm_ostree_package_list_for_commit (librpmostree-1.so.1 + 0x69a5)
                                               #5  0x00007f29ed67a240 rpm_ostree_db_diff_ext (librpmostree-1.so.1 + 0x6240)
                                               #6  0x00007f29ed67a393 rpm_ostree_db_diff (librpmostree-1.so.1 + 0x6393)
                                               #7  0x0000562f94ee02aa rpmostree_print_treepkg_diff_from_sysroot_path (rpm-ostree + 0x27c2aa)
                                               #8  0x0000562f94ed20a2 rpmostree_builtin_deploy (rpm-ostree + 0x26e0a2)
                                               #9  0x0000562f94ed0af8 _ZN12rpmostreecxx14rpmostree_mainEN4rust10cxxbridge15SliceIKNS1_3StrEEE (rpm-ostree + 0x26caf8)
                                               #10 0x0000562f94ecefe0 rpmostreecxx$cxxbridge1$rpmostree_main (rpm-ostree + 0x26afe0)
                                               #11 0x0000562f94cc75d0 _ZN10rpm_ostree4main17h038527a03685d1f3E (rpm-ostree + 0x635d0)
                                               #12 0x0000562f94cc6d33 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h729ff9d57aa85eedE (rpm-ostree + 0x62d33)
                                               #13 0x0000562f94cc7b58 main (rpm-ostree + 0x63b58)
                                               #14 0x00007f29ec6e81e2 __libc_start_main (libc.so.6 + 0x281e2)
                                               #15 0x0000562f94cc6c6e _start (rpm-ostree + 0x62c6e)
                                               
                                               Stack trace of thread 1106:
                                               #0  0x00007f29ec7b6a5f __poll (libc.so.6 + 0xf6a5f)
                                               #1  0x00007f29ed1e16f6 g_main_context_iterate.constprop.0 (libglib-2.0.so.0 + 0xa56f6)
                                               #2  0x00007f29ed18f033 g_main_loop_run (libglib-2.0.so.0 + 0x53033)
                                               #3  0x00007f29ed3d9d1a gdbus_shared_thread_func.lto_priv.0 (libgio-2.0.so.0 + 0x10fd1a)
                                               #4  0x00007f29ed1bd2b2 g_thread_proxy (libglib-2.0.so.0 + 0x812b2)
                                               #5  0x00007f29ec8943f9 start_thread (libpthread.so.0 + 0x93f9)
                                               #6  0x00007f29ec7c1b53 __clone (libc.so.6 + 0x101b53)
                                               
                                               Stack trace of thread 1105:
                                               #0  0x00007f29ec7b6a5f __poll (libc.so.6 + 0xf6a5f)
                                               #1  0x00007f29ed1e16f6 g_main_context_iterate.constprop.0 (libglib-2.0.so.0 + 0xa56f6)
                                               #2  0x00007f29ed18cd43 g_main_context_iteration (libglib-2.0.so.0 + 0x50d43)
                                               #3  0x00007f29ed18e961 glib_worker_main (libglib-2.0.so.0 + 0x52961)
                                               #4  0x00007f29ed1bd2b2 g_thread_proxy (libglib-2.0.so.0 + 0x812b2)
                                               #5  0x00007f29ec8943f9 start_thread (libpthread.so.0 + 0x93f9)
                                               #6  0x00007f29ec7c1b53 __clone (libc.so.6 + 0x101b53)
                                               
                                               Stack trace of thread 1162:
                                               #0  0x00007f29ec7bc55d syscall (libc.so.6 + 0xfc55d)
                                               #1  0x00007f29ed18cf72 g_source_attach (libglib-2.0.so.0 + 0x50f72)
                                               #2  0x00007f29ed377558 g_task_return.part.0 (libgio-2.0.so.0 + 0xad558)
                                               #3  0x00007f29ed3777ea g_task_thread_pool_thread (libgio-2.0.so.0 + 0xad7ea)
                                               #4  0x00007f29ed1bfe54 g_thread_pool_thread_proxy.lto_priv.0 (libglib-2.0.so.0 + 0x83e54)
                                               #5  0x00007f29ed1bd2b2 g_thread_proxy (libglib-2.0.so.0 + 0x812b2)
                                               #6  0x00007f29ec8943f9 start_thread (libpthread.so.0 + 0x93f9)
                                               #7  0x00007f29ec7c1b53 __clone (libc.so.6 + 0x101b53)
                                               
Feb  9 00:07:36.138341 zincati[1055]: [ERROR] failed to stage deployment: rpm-ostree deploy failed:
Feb  9 00:07:36.138341 zincati[1055]:     **
Feb  9 00:07:36.138341 zincati[1055]:     ERROR:src/lib/rpmostree-package.c:256:_rpm_ostree_package_list_for_commit: assertion failed (pkglist->len > 0): (0 > 0)
Feb  9 00:07:36.138341 zincati[1055]:     

This is the upgrade-from-current kola test which certainly looks like a legitimate failure.

OK so the main CoreOS CI flow all uses the updated rpm-ostree compose side, where as this test will end up exercising "new rpm-ostree client side, old server side".

@cgwalters
Copy link
Member

OK coreos-assembler already has the new libsolv, so that was used at compose time. It'll be in the target ostree too. The upgrade test grabbed...ah yes, the testing-devel build with libsolv reverted. It was also built with an rpm-ostree before this patch, obviously.

So our starting state is (old rpm-ostree, old libsolv) and then kola pulls an ostree commit with (new rpm-ostree, new libsolv). I...think what's happening is it's actually the old rpm-ostree failing to load pkglist from the new ostree commit.

@cgwalters
Copy link
Member

Aaand...trying to reproduce this locally it passes, of course.

@cgwalters
Copy link
Member

@jlebon OK if we restart CI on this? I want to see if this is somehow a flake (seems unlikely though). Or we could try to push more debugging bits to this PR.

@@ -5,7 +5,8 @@ parallel rpms: {
def n = 5
cosaPod(buildroot: true, runAsUser: 0, memory: "2Gi", cpu: "${n}") {
checkout scm
shwrap("""RPM_BUILD_NCPUS=${n} ./ci/coreosci-rpmbuild.sh
// 2:1 job to CPU at most should keep us from getting kicked out
shwrap("""RPM_BUILD_NCPUS=${n} CARGO_BUILD_JOBS=${n} ./ci/coreosci-rpmbuild.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm shouldn't cargo respect the make jobserver? rust-lang/cargo#4110

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Eh we can look at this later)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, seems like it would.

Ohhh, I think it might be the cmake call for libdnf which is not capped. I wonder if the issue is that the jobserver isn't surviving the make -> cmake -> make chain. Briefly searching around, it's not clear to me whether cmake supports using a parent job server for the children make processes.

Next time we hit the libgomp thing, let's try something like this instead:

diff --git a/.cci.jenkinsfile b/.cci.jenkinsfile
index 5cd9e699..29d11897 100644
--- a/.cci.jenkinsfile
+++ b/.cci.jenkinsfile
@@ -6,7 +6,7 @@ parallel rpms: {
   cosaPod(buildroot: true, runAsUser: 0, memory: "2Gi", cpu: "${n}") {
       checkout scm
       // 2:1 job to CPU at most should keep us from getting kicked out
-      shwrap("""RPM_BUILD_NCPUS=${n} CARGO_BUILD_JOBS=${n} ./ci/coreosci-rpmbuild.sh
+      shwrap("""RPM_BUILD_NCPUS=${n} CMAKE_BUILD_PARALLEL_LEVEL=${n} ./ci/coreosci-rpmbuild.sh
                 mv packaging/*/*.rpm .
              """)
       // make it easy for anyone to download the RPMs

Hmm, or maybe that should be in the spec file instead since RPM_BUILD_NCPUS really should be the only top-level knob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh, I think it might be the cmake call for libdnf which is not capped. I wonder if the issue is that the jobserver isn't surviving the make -> cmake -> make chain.

I think it should all be inherited by the environment by default? Unless the child make explicitly invokes make -jN which AIUI resets things.

@cgwalters
Copy link
Member

/lgtm

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Copy link
Member

cgwalters commented Feb 9, 2021

OK next failure is

layering-scripts: error: Ran rpm -q bash in %post

which must be related; I bet we were implicitly relying on /var/lib/rpm not being accessible, but now it is. So...hmm, maybe let's do --bind-ro /usr/share/empty /usr/share/rpm in scripts?

@cgwalters
Copy link
Member

Testing out locally

diff --git a/src/libpriv/rpmostree-scripts.cxx b/src/libpriv/rpmostree-scripts.cxx
index f7adb08f..f871fc0f 100644
--- a/src/libpriv/rpmostree-scripts.cxx
+++ b/src/libpriv/rpmostree-scripts.cxx
@@ -378,6 +378,9 @@ rpmostree_run_script_in_bwrap_container (int rootfs_fd,
   if (glnx_fstatat (rootfs_fd, "usr/lib/opt", &stbuf, AT_SYMLINK_NOFOLLOW, NULL) && S_ISDIR(stbuf.st_mode))
     rpmostree_bwrap_append_bwrap_argv (bwrap, "--symlink", "usr/lib/opt", "/opt", NULL);
 
+  /* Don't let scripts see the base rpm database by default */
+  rpmostree_bwrap_bind_read (bwrap, "usr/share/empty", "usr/share/rpm");
+
   /* Add ostree-booted API; some scriptlets may work differently on OSTree systems; e.g.
    * akmods. Just create it manually; /run is usually tmpfs, but scriptlets shouldn't be
    * adding stuff there anyway. */

@openshift-ci-robot
Copy link
Collaborator

New changes are detected. LGTM label has been removed.

@cgwalters
Copy link
Member

Yep that works, pushed.

@cgwalters cgwalters added the lgtm label Feb 9, 2021
@cgwalters
Copy link
Member

We're definitely reaping the value of all that work on the test suite here!

@cgwalters cgwalters changed the title Add /usr/lib/rpm/macros.d/macros.rpm-ostree to set %_dbpath to /usr/share/rpm libsolv %_dbpath fixes rollup Feb 9, 2021
@jlebon
Copy link
Member Author

jlebon commented Feb 9, 2021

I bet we were implicitly relying on /var/lib/rpm not being accessible, but now it is. So...hmm, maybe let's do --bind-ro /usr/share/empty /usr/share/rpm in scripts?

Yup thanks, makes sense to me! I think a lot of the work we're doing here is good prep for when we actually move the rpmdb to /usr/lib/sysimage/rpm!

jlebon and others added 7 commits February 9, 2021 22:33
We need to make sure that we can work with newer libsolv, which changed
how the rpmdb is found (see coreos#2548).
…hare/rpm

We trigger a librpm macro file load in many of our paths. Since the
default value shipped by rpm's macro file sets `_dbpath` to
`/var/lib/rpm`, we have to explicitly set that back to `/usr/share/rpm`
in those paths.

This became more problematic recently with libsolv v0.7.17 which fully
keys off of `_dbpath` to find the rpmdb path to load:

openSUSE/libsolv@04d4d03

And it's not technically wrong; we really should make that macro not
lie. This is what this patch does by injecting an RPM macro file in our
composes which sets it to /usr/share/rpm. So then e.g. the `rpm` CLI
doesn't actually need the `/var/lib/rpm` backcompat link anymore, though
there's no harm in leaving it.

In the future, we should be able to drop this once we move all of Fedora
to `/usr/lib/sysimage/rpm` (see
coreos/fedora-coreos-tracker#639).

Closes: coreos#2548
We don't technically need this yet, but it mirrors how it's set up in
our composes so that if there's code that wants to use the new location
too, it'll just work.
Follow-up to previous commit: we had another path where we made a
temporary rootfs and symlinked `/var/lib/rpm` to the base rpmdb. That of
course broke now that we inject a macro to point the rpmdb to
`/usr/share/rpm`.

Rework this to use `/usr/share/rpm` since that's our canonical location
for now, but also add the compat symlinks so that this logic should keep
working even on trees without the injected macro yet.
We lost this at some point during the CI re-shuffle. We need to
constrain cargo builds too to respect our CPU allocation.

This doesn't totally keep all jobs under 5 since e.g. we could have 5
make jobs and 5 cargo codegen builds going at once, but I think as long
as it's not something ridiculous like 40, it should be fine. Otherwise
we'll tighten it more.
Now that we inject the `%_dbpath /usr/share/rpm` macro, `rpm -q`
will start using it.  But in RPM script invocation, we don't
want them to see any RPM database at all - trying to query it
should be a clean failure.
@openshift-ci-robot
Copy link
Collaborator

New changes are detected. LGTM label has been removed.

@cgwalters cgwalters added the lgtm label Feb 9, 2021
@cgwalters
Copy link
Member

Fixed the compose test case to look for /usr/share/rpm again.

@openshift-merge-robot openshift-merge-robot merged commit 6886e44 into coreos:master Feb 9, 2021
@rajveermalviya
Copy link

will this be backported in https://src.fedoraproject.org/rpms/rpm-ostree ?

@cgwalters
Copy link
Member

Yes, either that or we do a new release. Probably makes the most sense to backport.

@jlebon jlebon deleted the pr/rpmdb-fixes branch April 23, 2023 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

libsolv-0.7.17-1.fc33 failure tracker
5 participants