Race condition when using --reboot flag #1348

peterbaouoft · 2018-04-25T19:12:21Z

Host system details

Provide the output of rpm-ostree status.

[root@ip-172-31-28-144 ~]# rpm-ostree status
State: idle
Deployments:
* local:xxx/7/x86_64/standard
                   Version: 7.1803 (2018-04-24 17:25:51)
                    Commit: 1a81065065f99d16c3509871783246380d2fb1a0754778f3b25b6fa224aa1576

  centos-atomic-host:centos-atomic-host/7/x86_64/standard
                   Version: 7.1712.1 (2018-01-12 10:25:04)
                    Commit: c6104b6bd8aeabbc19687e91359657e1292c3669e3c686438993e48481af6f2d
              GPGSignature: Valid signature by 64E3E7558572B59A319452AAF17E745691BA8335

Expected vs actual behavior

Actual:

[root@ip-172-31-28-144 ~]# time rpm-ostree rebase -r local:xxx/7/x86_64/standard

1 metadata, 1827 content objects fetched; 139364 KiB transferred in 23 seconds                          
Copying /etc changes: 22 modified, 4 removed, 52 added
Transaction complete; bootconfig swap: yes deployment count change: 1
error: Bus owner changed, aborting.

real	0m50.273s
user	0m0.016s
sys	0m0.004s
[root@ip-172-31-28-144 ~]# Connection to ec2-107-23-145-155.compute-1.amazonaws.com closed by remote host.

After reboot:
[root@ip-172-31-28-144 ~]# rpm-ostree status 
State: idle
Deployments:
* local:xxx/7/x86_64/standard
                   Version: 7.1803 (2018-04-24 17:25:51)
                    Commit: 1a81065065f99d16c3509871783246380d2fb1a0754778f3b25b6fa224aa1576

  centos-atomic-host:centos-atomic-host/7/x86_64/standard
                   Version: 7.1712.1 (2018-01-12 10:25:04)
                    Commit: c6104b6bd8aeabbc19687e91359657e1292c3669e3c686438993e48481af6f2d
              GPGSignature: Valid signature by 64E3E7558572B59A319452AAF17E745691BA8335

Expected:
rebase -r with no errors like error: Bus owner changed, aborting.

Steps to reproduce it

Dunno if it happens on every host, but you can try rebase -r to another system. There might be a chance you also see this behavior

Would you like to work on the issue?

I don't mind :p, but will be better if someone who is more experienced take this :D.

The text was updated successfully, but these errors were encountered:

cgwalters · 2018-04-25T19:19:44Z

I think the flow will need to be something like:

Server tells client we're rebooting
Client prints out "Rebooting." message and flushes stdout
Client tells server to proceed, and disconnects from DBus
Server initiates reboot

vtolstov · 2018-04-25T19:29:40Z

what say journalctl -u rpm-ostreed.service -r ?

peterbaouoft · 2018-04-25T19:53:13Z

@vtolstov, hi, the log is like the following, with logger messages omitted

-- Reboot --
Apr 25 18:32:34 ip-172-31-28-144.ec2.internal systemd[1]: Stopped OSTree+RPM package management daemon.
Apr 25 18:32:33 ip-172-31-28-144.ec2.internal systemd[1]: Stopping OSTree+RPM package management daemon...
Apr 25 18:32:32 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: Transaction complete; bootconfig swap: yes deployment count change: 1
==========================Logger messages Omitted=================================
Apr 25 18:32:18 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: sanitycheck(/usr/bin/true) successful
Apr 25 18:32:18 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: Copying /etc changes: 22 modified, 4 removed, 52 added
Apr 25 18:32:06 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: libostree pull from 'local' for openshift/7/x86_64/standard complete
                                                                security: GPG: disabled http: TLS
                                                                non-delta: meta: 1 content: 1827
                                                                transfer: secs: 23 size: 142.7 MB
Apr 25 18:31:43 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: Initiated txn Rebase for client :1.27 (uid 0): /org/projectatomic/rpmostree1/centos_atomic_host

The behaviour was likely being reboot is too fast for rpm-ostree. /me quoted from @jlebon :p.

gicmo · 2021-05-04T09:44:11Z

We (osbuild/osbuild-composer) are running into this in our integration tests:

[2021-05-04T08:41:14.888Z] Warning: Permanently added '192.168.100.50' (ECDSA) to the list of known hosts.
[2021-05-04T08:41:20.200Z] 98 metadata, 66 content objects fetched; 13389 KiB transferred in 5 seconds; 49.5 MB content written
[2021-05-04T08:41:20.200Z] Writing objects: 1...done
[2021-05-04T08:41:22.748Z] Staging deployment...done
[2021-05-04T08:41:24.124Z] error: Bus owner changed, aborting. This likely means the daemon crashed; check logs with `journalctl -xe`.

Since this is happening inside a test VM we sadly don't have the logs (yet). It also seems that we are running into this more often recently with F34 and RHEL 8.5 tests. Did something change in the relevant code paths or are we just "lucky".

cgwalters · 2021-05-04T19:55:01Z

There should be a core in coredumpctl, can you get the stack from that?
I'd change your test suite harness to gather coredumps.

cgwalters · 2021-05-04T19:55:29Z

Or are you using the --reboot flag?

henrywang · 2021-05-12T15:04:35Z

When I run sudo rpm-ostree upgrade, daemon crashed and got core dump.

[admin@vm-bios ~]$ sudo rpm-ostree upgrade
⠠ Scanning metadata: 2310
Scanning metadata: 2310... done
Staging deployment... done
error: Bus owner changed, aborting. This likely means the daemon crashed; check logs with `journalctl -xe`.

[admin@vm-bios ~]$ journalctl -u rpm-ostreed
-- Logs begin at Wed 2021-05-12 14:38:25 UTC, end at Wed 2021-05-12 15:08:45 UTC. --
May 12 14:54:26 vm-bios systemd[1]: Starting rpm-ostree System Management Daemon...
May 12 14:54:26 vm-bios rpm-ostree[1074]: Reading config file '/etc/rpm-ostreed.conf'
May 12 14:54:26 vm-bios systemd[1]: Started rpm-ostree System Management Daemon.
May 12 14:54:26 vm-bios rpm-ostree[1074]: In idle state; will auto-exit in 60 seconds
May 12 14:54:26 vm-bios rpm-ostree[1074]: client(id:cli dbus:1.23 unit:session-3.scope uid:0) added; new total=1
May 12 14:54:26 vm-bios rpm-ostree[1074]: Locked sysroot
May 12 14:54:26 vm-bios rpm-ostree[1074]: Initiated txn Upgrade for client(id:cli dbus:1.23 unit:session-3.scope uid:0): /org/projectatomic/rpmostree1/edge_commit
May 12 14:54:30 vm-bios rpm-ostree[1074]: libostree pull from 'edge-commit' for rhel/8/x86_64/edge complete
                                          security: GPG: disabled
                                          security: SIGN: disabled http: TLS
                                          non-delta: meta: 98 content: 66
                                          transfer: secs: 4 size: 13.7 MB
May 12 14:54:32 vm-bios rpm-ostree[1074]: Created new deployment /ostree/deploy/edge-commit/deploy/4107a6f83a76d40fb045cfcfbcfae6c4128840cd4d7c5018f8643ced399f9c88.0
May 12 14:54:32 vm-bios rpm-ostree[1074]: sanitycheck(/usr/bin/true) successful
May 12 14:54:33 vm-bios rpm-ostree[1074]: **
May 12 14:54:33 vm-bios rpm-ostree[1074]: ERROR:src/lib/rpmostree-package.c:295:_rpm_ostree_package_list_for_commit: assertion failed (pkglist->len > 0): (0 > 0)
May 12 14:54:33 vm-bios systemd[1]: rpm-ostreed.service: Main process exited, code=killed, status=6/ABRT
May 12 14:54:33 vm-bios systemd[1]: rpm-ostreed.service: Failed with result 'signal'.



        [admin@vm-bios ~]$ coredumpctl info 1074
                   PID: 1074 (rpm-ostree)
                   UID: 0 (root)
                   GID: 0 (root)
                Signal: 6 (ABRT)
             Timestamp: Wed 2021-05-12 14:54:33 UTC (3min 3s ago)
          Command Line: /usr/bin/rpm-ostree start-daemon
            Executable: /usr/bin/rpm-ostree
         Control Group: /
                 Slice: -.slice
               Boot ID: bf95678384534218b03d95e06417d478
            Machine ID: 4e906e2d7c8b4cc78590f0196ce7fd85
              Hostname: vm-bios
               Storage: /var/lib/systemd/coredump/core.rpm-ostree.0.bf95678384534218b03d95e06417d478.1074.1620831273000000.lz4 (inaccessible)
               Message: Process 1074 (rpm-ostree) of user 0 dumped core.
         
                        Stack trace of thread 1077:
                        #0  0x00007f27ce52337f raise (libc.so.6)
                        #1  0x00007f27ce50ddb5 abort (libc.so.6)
                        #2  0x00007f27cf9130c3 g_assertion_message.cold.15 (libglib-2.0.so.0)
                        #3  0x00007f27cf96c4ba g_assertion_message_cmpnum (libglib-2.0.so.0)
                        #4  0x00007f27d1ca56cd _rpm_ostree_package_list_for_commit (librpmostree-1.so.1)
                        #5  0x00007f27d1ca4de2 rpm_ostree_db_diff_ext (librpmostree-1.so.1)
                        #6  0x00005561c976b4a0 rpm_diff_add_db_diff.constprop.20 (rpm-ostree)
                        #7  0x00005561c976d0a4 rpmostreed_update_generate_variant (rpm-ostree)
                        #8  0x00005561c9778c8a generate_update_variant (rpm-ostree)
                        #9  0x00005561c977d1e3 deploy_transaction_execute (rpm-ostree)
                        #10 0x00005561c976eacd transaction_execute_thread (rpm-ostree)
                        #11 0x00007f27cfef7367 g_task_thread_pool_thread (libgio-2.0.so.0)
                        #12 0x00007f27cf96d943 g_thread_pool_thread_proxy (libglib-2.0.so.0)
                        #13 0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #14 0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #15 0x00007f27ce5e8dc3 __clone (libc.so.6)
         
                        Stack trace of thread 1074:
                        #0  0x00007f27ce5dda41 __poll (libc.so.6)
                        #1  0x00007f27cf944bf6 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
                        #2  0x00007f27cf944d20 g_main_context_iteration (libglib-2.0.so.0)
                        #3  0x00005561c976745c rpmostreed_daemon_run_until_idle_exit (rpm-ostree)
                        #4  0x00005561c972d8d3 rpmostree_builtin_start_daemon (rpm-ostree)
                        #5  0x00005561c9716e0a main (rpm-ostree)
                        #6  0x00007f27ce50f493 __libc_start_main (libc.so.6)
                        #7  0x00005561c971717e _start (rpm-ostree)
         
                        Stack trace of thread 1075:
                        #0  0x00007f27ce5dda41 __poll (libc.so.6)
                        #1  0x00007f27cf944bf6 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
                        #2  0x00007f27cf944d20 g_main_context_iteration (libglib-2.0.so.0)
                        #3  0x00007f27cf944d71 glib_worker_main (libglib-2.0.so.0)
                        #4  0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #5  0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #6  0x00007f27ce5e8dc3 __clone (libc.so.6)
         
                        Stack trace of thread 1076:
                        #0  0x00007f27ce5dda41 __poll (libc.so.6)
                        #1  0x00007f27cf944bf6 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
                        #2  0x00007f27cf944fb2 g_main_loop_run (libglib-2.0.so.0)
                        #3  0x00007f27cff3a5aa gdbus_shared_thread_func (libgio-2.0.so.0)
                        #4  0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #5  0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #6  0x00007f27ce5e8dc3 __clone (libc.so.6)
         
                        Stack trace of thread 1088:
                        #0  0x00007f27ce5e352d syscall (libc.so.6)
                        #1  0x00007f27cf98bdae g_cond_wait_until (libglib-2.0.so.0)
                        #2  0x00007f27cf917131 g_async_queue_pop_intern_unlocked (libglib-2.0.so.0)
                        #3  0x00007f27cf917720 g_async_queue_timeout_pop (libglib-2.0.so.0)
                        #4  0x00007f27cf96d9e1 g_thread_pool_thread_proxy (libglib-2.0.so.0)
                        #5  0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #6  0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #7  0x00007f27ce5e8dc3 __clone (libc.so.6)

jlebon · 2021-05-12T15:45:12Z

@henrywang That's a different bug. Looks like 8.5 brings in openSUSE/libsolv@04d4d03, which means we need to backport #2553.

ondrejbudai · 2021-05-14T12:06:55Z

@jlebon I see the same issue also on Fedora 34 that should have #2553.

jlebon · 2021-05-14T14:04:59Z

@ondrejbudai Same issue as @henrywang? Let's split that into a separate issue and keep this issue about reboot races.

As mentioned in the error message, the Bus owner changed, aborting. message is a generic error which just means the daemon crashed.

Alternative to coreos#2845 which moved the `reboot` invocation into the client (which I think still makes sense in some cases). Previously we were starting the reboot before we're returned a success reply to the client, so it could see the daemon killed by `SIGTERM` and hence emit a spurious error. (Really in this case any 3 of the calling process, the client binary or the daemon could be killed in any order, so it's messy but this just closes one race) For cleanliness we reject starting any new transactions after the daemon has started a reboot. Closes: coreos#1348

Alternative to #2845 which moved the `reboot` invocation into the client (which I think still makes sense in some cases). Previously we were starting the reboot before we're returned a success reply to the client, so it could see the daemon killed by `SIGTERM` and hence emit a spurious error. (Really in this case any 3 of the calling process, the client binary or the daemon could be killed in any order, so it's messy but this just closes one race) For cleanliness we reject starting any new transactions after the daemon has started a reboot. Closes: #1348

jlebon added the bug label Apr 25, 2018

jlebon changed the title ~~Rebase -r caused a bus owner changed error~~ Race condition when using --reboot flag May 1, 2018

jlebon mentioned this issue May 29, 2019

update_agent: add update finalization coreos/zincati#23

Merged

cgwalters mentioned this issue May 21, 2021

daemon: Make actually initiating reboot asynchronous #2848

Merged

cgwalters closed this as completed in #2848 May 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition when using --reboot flag #1348

Race condition when using --reboot flag #1348

peterbaouoft commented Apr 25, 2018

cgwalters commented Apr 25, 2018

vtolstov commented Apr 25, 2018 •

edited

Loading

peterbaouoft commented Apr 25, 2018 •

edited

Loading

gicmo commented May 4, 2021

cgwalters commented May 4, 2021

cgwalters commented May 4, 2021

henrywang commented May 12, 2021 •

edited

Loading

jlebon commented May 12, 2021

ondrejbudai commented May 14, 2021

jlebon commented May 14, 2021

Race condition when using --reboot flag #1348

Race condition when using --reboot flag #1348

Comments

peterbaouoft commented Apr 25, 2018

cgwalters commented Apr 25, 2018

vtolstov commented Apr 25, 2018 • edited Loading

peterbaouoft commented Apr 25, 2018 • edited Loading

gicmo commented May 4, 2021

cgwalters commented May 4, 2021

cgwalters commented May 4, 2021

henrywang commented May 12, 2021 • edited Loading

jlebon commented May 12, 2021

ondrejbudai commented May 14, 2021

jlebon commented May 14, 2021

vtolstov commented Apr 25, 2018 •

edited

Loading

peterbaouoft commented Apr 25, 2018 •

edited

Loading

henrywang commented May 12, 2021 •

edited

Loading