Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition when using --reboot flag #1348

Closed
peterbaouoft opened this issue Apr 25, 2018 · 10 comments · Fixed by #2848
Closed

Race condition when using --reboot flag #1348

peterbaouoft opened this issue Apr 25, 2018 · 10 comments · Fixed by #2848
Labels

Comments

@peterbaouoft
Copy link
Contributor

Host system details

Provide the output of rpm-ostree status.

[root@ip-172-31-28-144 ~]# rpm-ostree status
State: idle
Deployments:
* local:xxx/7/x86_64/standard
                   Version: 7.1803 (2018-04-24 17:25:51)
                    Commit: 1a81065065f99d16c3509871783246380d2fb1a0754778f3b25b6fa224aa1576

  centos-atomic-host:centos-atomic-host/7/x86_64/standard
                   Version: 7.1712.1 (2018-01-12 10:25:04)
                    Commit: c6104b6bd8aeabbc19687e91359657e1292c3669e3c686438993e48481af6f2d
              GPGSignature: Valid signature by 64E3E7558572B59A319452AAF17E745691BA8335

Expected vs actual behavior

Actual:

[root@ip-172-31-28-144 ~]# time rpm-ostree rebase -r local:xxx/7/x86_64/standard

1 metadata, 1827 content objects fetched; 139364 KiB transferred in 23 seconds                          
Copying /etc changes: 22 modified, 4 removed, 52 added
Transaction complete; bootconfig swap: yes deployment count change: 1
error: Bus owner changed, aborting.

real	0m50.273s
user	0m0.016s
sys	0m0.004s
[root@ip-172-31-28-144 ~]# Connection to ec2-107-23-145-155.compute-1.amazonaws.com closed by remote host.

After reboot:
[root@ip-172-31-28-144 ~]# rpm-ostree status 
State: idle
Deployments:
* local:xxx/7/x86_64/standard
                   Version: 7.1803 (2018-04-24 17:25:51)
                    Commit: 1a81065065f99d16c3509871783246380d2fb1a0754778f3b25b6fa224aa1576

  centos-atomic-host:centos-atomic-host/7/x86_64/standard
                   Version: 7.1712.1 (2018-01-12 10:25:04)
                    Commit: c6104b6bd8aeabbc19687e91359657e1292c3669e3c686438993e48481af6f2d
              GPGSignature: Valid signature by 64E3E7558572B59A319452AAF17E745691BA8335

Expected:
rebase -r with no errors like error: Bus owner changed, aborting.

Steps to reproduce it

Dunno if it happens on every host, but you can try rebase -r to another system. There might be a chance you also see this behavior

Would you like to work on the issue?

I don't mind :p, but will be better if someone who is more experienced take this :D.

@jlebon jlebon added the bug label Apr 25, 2018
@cgwalters
Copy link
Member

I think the flow will need to be something like:

  • Server tells client we're rebooting
  • Client prints out "Rebooting." message and flushes stdout
  • Client tells server to proceed, and disconnects from DBus
  • Server initiates reboot

@vtolstov
Copy link

vtolstov commented Apr 25, 2018

what say journalctl -u rpm-ostreed.service -r ?

@peterbaouoft
Copy link
Contributor Author

peterbaouoft commented Apr 25, 2018

@vtolstov, hi, the log is like the following, with logger messages omitted

-- Reboot --
Apr 25 18:32:34 ip-172-31-28-144.ec2.internal systemd[1]: Stopped OSTree+RPM package management daemon.
Apr 25 18:32:33 ip-172-31-28-144.ec2.internal systemd[1]: Stopping OSTree+RPM package management daemon...
Apr 25 18:32:32 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: Transaction complete; bootconfig swap: yes deployment count change: 1
==========================Logger messages Omitted=================================
Apr 25 18:32:18 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: sanitycheck(/usr/bin/true) successful
Apr 25 18:32:18 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: Copying /etc changes: 22 modified, 4 removed, 52 added
Apr 25 18:32:06 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: libostree pull from 'local' for openshift/7/x86_64/standard complete
                                                                security: GPG: disabled http: TLS
                                                                non-delta: meta: 1 content: 1827
                                                                transfer: secs: 23 size: 142.7 MB
Apr 25 18:31:43 ip-172-31-28-144.ec2.internal rpm-ostree[1617]: Initiated txn Rebase for client :1.27 (uid 0): /org/projectatomic/rpmostree1/centos_atomic_host

The behaviour was likely being reboot is too fast for rpm-ostree. /me quoted from @jlebon :p.

@jlebon jlebon changed the title Rebase -r caused a bus owner changed error Race condition when using --reboot flag May 1, 2018
@gicmo
Copy link

gicmo commented May 4, 2021

We (osbuild/osbuild-composer) are running into this in our integration tests:

[2021-05-04T08:41:14.888Z] Warning: Permanently added '192.168.100.50' (ECDSA) to the list of known hosts.
[2021-05-04T08:41:20.200Z] 98 metadata, 66 content objects fetched; 13389 KiB transferred in 5 seconds; 49.5 MB content written
[2021-05-04T08:41:20.200Z] Writing objects: 1...done
[2021-05-04T08:41:22.748Z] Staging deployment...done
[2021-05-04T08:41:24.124Z] error: Bus owner changed, aborting. This likely means the daemon crashed; check logs with `journalctl -xe`.

Since this is happening inside a test VM we sadly don't have the logs (yet). It also seems that we are running into this more often recently with F34 and RHEL 8.5 tests. Did something change in the relevant code paths or are we just "lucky".

@cgwalters
Copy link
Member

There should be a core in coredumpctl, can you get the stack from that?
I'd change your test suite harness to gather coredumps.

@cgwalters
Copy link
Member

Or are you using the --reboot flag?

@henrywang
Copy link

henrywang commented May 12, 2021

When I run sudo rpm-ostree upgrade, daemon crashed and got core dump.

[admin@vm-bios ~]$ sudo rpm-ostree upgrade
⠠ Scanning metadata: 2310
Scanning metadata: 2310... done
Staging deployment... done
error: Bus owner changed, aborting. This likely means the daemon crashed; check logs with `journalctl -xe`.
[admin@vm-bios ~]$ journalctl -u rpm-ostreed
-- Logs begin at Wed 2021-05-12 14:38:25 UTC, end at Wed 2021-05-12 15:08:45 UTC. --
May 12 14:54:26 vm-bios systemd[1]: Starting rpm-ostree System Management Daemon...
May 12 14:54:26 vm-bios rpm-ostree[1074]: Reading config file '/etc/rpm-ostreed.conf'
May 12 14:54:26 vm-bios systemd[1]: Started rpm-ostree System Management Daemon.
May 12 14:54:26 vm-bios rpm-ostree[1074]: In idle state; will auto-exit in 60 seconds
May 12 14:54:26 vm-bios rpm-ostree[1074]: client(id:cli dbus:1.23 unit:session-3.scope uid:0) added; new total=1
May 12 14:54:26 vm-bios rpm-ostree[1074]: Locked sysroot
May 12 14:54:26 vm-bios rpm-ostree[1074]: Initiated txn Upgrade for client(id:cli dbus:1.23 unit:session-3.scope uid:0): /org/projectatomic/rpmostree1/edge_commit
May 12 14:54:30 vm-bios rpm-ostree[1074]: libostree pull from 'edge-commit' for rhel/8/x86_64/edge complete
                                          security: GPG: disabled
                                          security: SIGN: disabled http: TLS
                                          non-delta: meta: 98 content: 66
                                          transfer: secs: 4 size: 13.7 MB
May 12 14:54:32 vm-bios rpm-ostree[1074]: Created new deployment /ostree/deploy/edge-commit/deploy/4107a6f83a76d40fb045cfcfbcfae6c4128840cd4d7c5018f8643ced399f9c88.0
May 12 14:54:32 vm-bios rpm-ostree[1074]: sanitycheck(/usr/bin/true) successful
May 12 14:54:33 vm-bios rpm-ostree[1074]: **
May 12 14:54:33 vm-bios rpm-ostree[1074]: ERROR:src/lib/rpmostree-package.c:295:_rpm_ostree_package_list_for_commit: assertion failed (pkglist->len > 0): (0 > 0)
May 12 14:54:33 vm-bios systemd[1]: rpm-ostreed.service: Main process exited, code=killed, status=6/ABRT
May 12 14:54:33 vm-bios systemd[1]: rpm-ostreed.service: Failed with result 'signal'.


        [admin@vm-bios ~]$ coredumpctl info 1074
                   PID: 1074 (rpm-ostree)
                   UID: 0 (root)
                   GID: 0 (root)
                Signal: 6 (ABRT)
             Timestamp: Wed 2021-05-12 14:54:33 UTC (3min 3s ago)
          Command Line: /usr/bin/rpm-ostree start-daemon
            Executable: /usr/bin/rpm-ostree
         Control Group: /
                 Slice: -.slice
               Boot ID: bf95678384534218b03d95e06417d478
            Machine ID: 4e906e2d7c8b4cc78590f0196ce7fd85
              Hostname: vm-bios
               Storage: /var/lib/systemd/coredump/core.rpm-ostree.0.bf95678384534218b03d95e06417d478.1074.1620831273000000.lz4 (inaccessible)
               Message: Process 1074 (rpm-ostree) of user 0 dumped core.
         
                        Stack trace of thread 1077:
                        #0  0x00007f27ce52337f raise (libc.so.6)
                        #1  0x00007f27ce50ddb5 abort (libc.so.6)
                        #2  0x00007f27cf9130c3 g_assertion_message.cold.15 (libglib-2.0.so.0)
                        #3  0x00007f27cf96c4ba g_assertion_message_cmpnum (libglib-2.0.so.0)
                        #4  0x00007f27d1ca56cd _rpm_ostree_package_list_for_commit (librpmostree-1.so.1)
                        #5  0x00007f27d1ca4de2 rpm_ostree_db_diff_ext (librpmostree-1.so.1)
                        #6  0x00005561c976b4a0 rpm_diff_add_db_diff.constprop.20 (rpm-ostree)
                        #7  0x00005561c976d0a4 rpmostreed_update_generate_variant (rpm-ostree)
                        #8  0x00005561c9778c8a generate_update_variant (rpm-ostree)
                        #9  0x00005561c977d1e3 deploy_transaction_execute (rpm-ostree)
                        #10 0x00005561c976eacd transaction_execute_thread (rpm-ostree)
                        #11 0x00007f27cfef7367 g_task_thread_pool_thread (libgio-2.0.so.0)
                        #12 0x00007f27cf96d943 g_thread_pool_thread_proxy (libglib-2.0.so.0)
                        #13 0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #14 0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #15 0x00007f27ce5e8dc3 __clone (libc.so.6)
         
                        Stack trace of thread 1074:
                        #0  0x00007f27ce5dda41 __poll (libc.so.6)
                        #1  0x00007f27cf944bf6 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
                        #2  0x00007f27cf944d20 g_main_context_iteration (libglib-2.0.so.0)
                        #3  0x00005561c976745c rpmostreed_daemon_run_until_idle_exit (rpm-ostree)
                        #4  0x00005561c972d8d3 rpmostree_builtin_start_daemon (rpm-ostree)
                        #5  0x00005561c9716e0a main (rpm-ostree)
                        #6  0x00007f27ce50f493 __libc_start_main (libc.so.6)
                        #7  0x00005561c971717e _start (rpm-ostree)
         
                        Stack trace of thread 1075:
                        #0  0x00007f27ce5dda41 __poll (libc.so.6)
                        #1  0x00007f27cf944bf6 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
                        #2  0x00007f27cf944d20 g_main_context_iteration (libglib-2.0.so.0)
                        #3  0x00007f27cf944d71 glib_worker_main (libglib-2.0.so.0)
                        #4  0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #5  0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #6  0x00007f27ce5e8dc3 __clone (libc.so.6)
         
                        Stack trace of thread 1076:
                        #0  0x00007f27ce5dda41 __poll (libc.so.6)
                        #1  0x00007f27cf944bf6 g_main_context_iterate.isra.21 (libglib-2.0.so.0)
                        #2  0x00007f27cf944fb2 g_main_loop_run (libglib-2.0.so.0)
                        #3  0x00007f27cff3a5aa gdbus_shared_thread_func (libgio-2.0.so.0)
                        #4  0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #5  0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #6  0x00007f27ce5e8dc3 __clone (libc.so.6)
         
                        Stack trace of thread 1088:
                        #0  0x00007f27ce5e352d syscall (libc.so.6)
                        #1  0x00007f27cf98bdae g_cond_wait_until (libglib-2.0.so.0)
                        #2  0x00007f27cf917131 g_async_queue_pop_intern_unlocked (libglib-2.0.so.0)
                        #3  0x00007f27cf917720 g_async_queue_timeout_pop (libglib-2.0.so.0)
                        #4  0x00007f27cf96d9e1 g_thread_pool_thread_proxy (libglib-2.0.so.0)
                        #5  0x00007f27cf96cf3a g_thread_proxy (libglib-2.0.so.0)
                        #6  0x00007f27ce8b914a start_thread (libpthread.so.0)
                        #7  0x00007f27ce5e8dc3 __clone (libc.so.6)

@jlebon
Copy link
Member

jlebon commented May 12, 2021

@henrywang That's a different bug. Looks like 8.5 brings in openSUSE/libsolv@04d4d03, which means we need to backport #2553.

@ondrejbudai
Copy link

@jlebon I see the same issue also on Fedora 34 that should have #2553.

@jlebon
Copy link
Member

jlebon commented May 14, 2021

@ondrejbudai Same issue as @henrywang? Let's split that into a separate issue and keep this issue about reboot races.

As mentioned in the error message, the Bus owner changed, aborting. message is a generic error which just means the daemon crashed.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue May 21, 2021
Alternative to coreos#2845
which moved the `reboot` invocation into the client (which I think
still makes sense in some cases).

Previously we were starting the reboot before we're returned
a success reply to the client, so it could see the daemon killed
by `SIGTERM` and hence emit a spurious error.

(Really in this case any 3 of the calling process, the client
 binary or the daemon could be killed in any order, so it's messy
 but this just closes one race)

For cleanliness we reject starting any new transactions after the daemon has
started a reboot.

Closes: coreos#1348
cgwalters added a commit that referenced this issue May 25, 2021
Alternative to #2845
which moved the `reboot` invocation into the client (which I think
still makes sense in some cases).

Previously we were starting the reboot before we're returned
a success reply to the client, so it could see the daemon killed
by `SIGTERM` and hence emit a spurious error.

(Really in this case any 3 of the calling process, the client
 binary or the daemon could be killed in any order, so it's messy
 but this just closes one race)

For cleanliness we reject starting any new transactions after the daemon has
started a reboot.

Closes: #1348
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants