[Power9] [Qemu] Migration started after postcopy_ram enabled causes guest reboot in destination and guest remains running state in source as well #34

balamuruhans · 2018-02-01T15:44:04Z

Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=164182

Starting a migration after enabling postcopy_ram on the source immediately causes the guest listening on the destination to reboot and enter the running state, while the guest on the source also remains in the running state.

When either guest is then killed or shut down, a backtrace with a memory map dump is printed (full logs attached).
Source guest when killed:

# sos*** Error in `/usr/bin/qemu-system-ppc64': free(): invalid pointer: 0x000000014624b500 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x4ac)[0x7fff959669fc]
/lib64/libglib-2.0.so.0(g_free+0x24)[0x7fff95e1b794]
/usr/bin/qemu-system-ppc64(+0x827f08)[0x13b107f08]
/usr/bin/qemu-system-ppc64(+0x2c8edc)[0x13aba8edc]
/usr/bin/qemu-system-ppc64(+0x2d15f8)[0x13abb15f8]
/usr/bin/qemu-system-ppc64(main+0x4b74)[0x13ab58214]
/lib64/libc.so.6(+0x24980)[0x7fff958f4980]
/lib64/libc.so.6(__libc_start_main+0xc4)[0x7fff958f4b74]
======= Memory map: ========
13a8e0000-13b550000 r-xp 00000000 fd:00 138849                           /usr/bin/qemu-system-ppc64
13b560000-13b760000 r--p 00c70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
13b760000-13b7f0000 rw-p 00e70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
13b7f0000-13bc40000 rw-p 00000000 00:00 0 
145b90000-1464f0000 rw-p 00000000 00:00 0

Destination guest when killed:

qemu-system-ppc64: terminating on signal 2
*** Error in `/usr/bin/qemu-system-ppc64': double free or corruption (fasttop): 0x0000000144fb0660 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x4ac)[0x7fffb88369fc]
/lib64/libglib-2.0.so.0(g_free+0x24)[0x7fffb8ceb794]
/usr/bin/qemu-system-ppc64(+0x827f08)[0x116897f08]
/usr/bin/qemu-system-ppc64(+0x2c8edc)[0x116338edc]
/usr/bin/qemu-system-ppc64(+0x2d15f8)[0x1163415f8]
/usr/bin/qemu-system-ppc64(main+0x4b74)[0x1162e8214]
/lib64/libc.so.6(+0x24980)[0x7fffb87c4980]
/lib64/libc.so.6(__libc_start_main+0xc4)[0x7fffb87c4b74]
======= Memory map: ========
116070000-116ce0000 r-xp 00000000 fd:00 138849                           /usr/bin/qemu-system-ppc64
116cf0000-116ef0000 r--p 00c70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
116ef0000-116f80000 rw-p 00e70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
116f80000-1173d0000 rw-p 00000000 00:00 0 
144970000-1452d0000 rw-p 00000000 00:00 0                                [heap]
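
The frames inside qemu-system-ppc64 in both backtraces appear only as offsets into the binary (for example `+0x827f08`). The following is a minimal Python sketch, assuming the glibc backtrace line format shown in these logs, that extracts those offsets so they can be handed to a symbolizer. The names `parse_frames` and `unresolved_offsets` are hypothetical helpers, not part of QEMU or glibc.

```python
import re

# Matches glibc backtrace lines such as
#   /usr/bin/qemu-system-ppc64(+0x827f08)[0x13b107f08]
#   /lib64/libc.so.6(cfree+0x4ac)[0x7fff959669fc]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(]+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)

def parse_frames(log_text):
    """Return (module, symbol or None, offset, address) for each frame line.

    When a symbol is present the offset is relative to that symbol;
    otherwise it is relative to the start of the module.
    """
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m["module"], m["symbol"] or None,
                           int(m["offset"], 16), int(m["address"], 16)))
    return frames

def unresolved_offsets(frames, module):
    """Offsets in `module` that carry no symbol name."""
    return [hex(off) for mod, sym, off, _ in frames
            if mod == module and sym is None]
```

The resulting offsets could then be passed to something like `addr2line -f -e /usr/bin/qemu-system-ppc64 <offset>` on a host that has the matching debuginfo installed; whether that resolves cleanly depends on how the binary was built.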

Steps to reproduce:

Boot a healthy guest from the qemu command line:
# qemu-kvm --enable-kvm --nographic -vga none -machine pseries -m 4G,slots=32,maxmem=32G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk -drive file=/home/bala/images/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk -monitor telnet:127.0.0.1:1234,server,nowait
Start a second qemu instance with the same configuration, listening for an incoming migration as the destination:
# qemu-kvm --enable-kvm --nographic -vga none -machine pseries -m 4G,slots=32,maxmem=32G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk -drive file=/home/bala/images/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk -monitor telnet:127.0.0.1:1235,server,nowait -incoming tcp:0:4444
Enable postcopy_ram from the qemu monitor of the source VM:

# telnet 127.0.0.1 1234
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
QEMU 2.11.50 monitor - type 'help' for more information
(qemu) migrate_set_capability postcopy-ram on
(qemu) info migrate_capabilities 
xbzrle: off
rdma-pin-all: off
auto-converge: off
zero-blocks: off
compress: off
events: off
postcopy-ram: on
x-colo: off
release-ram: off
block: off
return-path: off
pause-before-switchover: off
x-multifd: off

Start the migration from the source qemu monitor:
(qemu) migrate -d tcp:127.0.0.1:4444

The guest listening on the destination immediately restarts and enters the running state, while the guest on the source also remains in the running state.

Observation:
Just enabling postcopy_ram and then shutting down the guest from inside (shutdown -h now) also triggers the same backtrace with memory map dump after the VM shuts down.

System configuration:
qemu: 2.11.50-1.dev.gita815ffa.el7.centos.ppc64le
slof: SLOF-20170724-2.dev.gitea31295.el7.centos.noarch
host kernel: 4.15.0-3.dev.gitd34a158.el7.centos.ppc64le
guest kernel: 4.15.0-3.dev.gitd34a158.el7.centos.ppc64le

Attachment:
Guest backtrace error in source and destination


balamuruhans · 2018-02-01T15:44:59Z

migration_destination.log

balamuruhans · 2018-02-01T15:45:49Z

migration_source.log

cdeadmin · 2018-02-02T09:16:12Z

------- Comment From KURZGREG@fr.ibm.com 2018-02-02 04:09:38 EDT-------
(In reply to comment #1)
>
>
> migration started after postcopy_ram enabled immediately triggers guest
> listening in destination to reboot and enters running state, also guest in
> the source remains to be in running state.
>

I could reproduce ^^ with upstream QEMU on both a POWER host and on my laptop.
The following trace is also printed on the destination:

qemu-system-ppc64: Expected vmdescription section, but got 0

This behaviour isn't seen with QEMU 2.10. Bisect indicates the following commit
to be the culprit:

commit 58110f0
Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Date: Mon Jul 10 19:30:16 2017 +0300

migration: split common postcopy out of ram postcopy

Split common postcopy staff from ram postcopy staff.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

This is hence a QEMU 2.11 regression.

> when tried to kill or shutdown any one, observed Backtrace with memory map
> failure, (attached full log)
> source guest when killed:
> # sos*** Error in `/usr/bin/qemu-system-ppc64': free(): invalid pointer:
> 0x000000014624b500 ***

I suppose you have used completion when typing the commands in the
monitor. If so, then this isn't related to the migration issue. It is a bug
recently introduced by commit:

commit e5dc1a6
Author: Marc-André Lureau <marcandre.lureau@redhat.com>
Date: Thu Jan 4 17:05:15 2018 +0100

readline: add a free function

This commit is present in qemu-2.11.50-1.dev.gita815ffa.el7.centos.ppc64le.

I have already sent a fix for this. It was queued by Paolo Bonzini and should be
merged upstream next week.

http://patchwork.ozlabs.org/patch/862816/

cdeadmin · 2018-02-02T11:36:15Z

------- Comment From KURZGREG@fr.ibm.com 2018-02-02 06:27:22 EDT-------
The postcopy issue was actually discussed by the community. The root cause is that the
postcopy-ram capability should be set on the destination as well as on the source. When it
is set on both ends, migration occurs as expected.
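
A minimal sketch of that ordering, written as QMP messages rather than the HMP commands used in the report. `migrate-set-capabilities` and `migrate` are existing QMP commands; the helper names below are hypothetical, and the sketch assumes both QEMU instances expose a QMP socket, which the original reproduction (HMP over telnet) does not.

```python
import json

def set_postcopy_ram(enabled=True):
    """QMP message that switches the postcopy-ram capability on or off."""
    return {"execute": "migrate-set-capabilities",
            "arguments": {"capabilities": [
                {"capability": "postcopy-ram", "state": enabled}]}}

def start_migration(uri):
    """QMP message that starts an outgoing migration to `uri`."""
    return {"execute": "migrate", "arguments": {"uri": uri}}

def migration_plan(uri):
    """Ordered (target, message) pairs: capability on both ends, then migrate."""
    return [("destination", set_postcopy_ram()),
            ("source", set_postcopy_ram()),
            ("source", start_migration(uri))]

def render(plan):
    """Serialize each message as one JSON line, ready to write to a QMP socket."""
    return [(target, json.dumps(msg)) for target, msg in plan]
```

With the HMP monitors from the report, the equivalent would be to run `migrate_set_capability postcopy-ram on` in both the source monitor (port 1234) and the destination monitor (port 1235) before issuing `migrate -d tcp:127.0.0.1:4444` on the source. A real QMP session also has to negotiate with `qmp_capabilities` first, which is omitted here.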

This being said, I had a chat on IRC with QEMU migration maintainer David Gilbert and he
could not point to any documentation stating that postcopy-ram should be set on both source
and destination. Also, he agrees that the destination shouldn't start as it does now when
the capabilities differ.

FYI, this shouldn't happen when using libvirt because it always sets postcopy-ram on both
ends.
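
When QEMU is driven by hand instead of through libvirt, one way to catch such a discrepancy before migrating is to compare the `info migrate_capabilities` output of both monitors. This is a minimal sketch, assuming the `name: on|off` line format shown in the report; `parse_capabilities` and `capability_mismatches` are hypothetical helper names.

```python
def parse_capabilities(hmp_output):
    """Parse 'name: on|off' lines from 'info migrate_capabilities' into a dict."""
    caps = {}
    for line in hmp_output.splitlines():
        name, sep, state = line.partition(":")
        state = state.strip()
        # Skip prompts, banners and anything that is not a capability line.
        if sep and state in ("on", "off"):
            caps[name.strip()] = (state == "on")
    return caps

def capability_mismatches(source_caps, dest_caps):
    """Return the capability names whose state differs between the two ends.

    A capability missing from one side is treated as off.
    """
    names = sorted(set(source_caps) | set(dest_caps))
    return [n for n in names
            if source_caps.get(n, False) != dest_caps.get(n, False)]
```

For the setup in this report, the source output would contain `postcopy-ram: on` and the destination `postcopy-ram: off`, so the returned list would contain `postcopy-ram`.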

So we have three items to address here:

- fix the bogus behaviour of the destination
- improve the postcopy-ram documentation
- ensure the fix for the double-free crash gets merged

cdeadmin · 2018-02-07T16:07:36Z

------- Comment From KURZGREG@fr.ibm.com 2018-02-07 10:57:07 EDT-------
(In reply to comment #5)
> So we have three items to address here:
> - fix the bogus behaviour of the destination

Fix is upstream:

https://git.qemu.org/?p=qemu.git;a=commit;h=875fcd013ab68c64802998b22f54f0184479d21b

> - improve the postcopy-ram documentation

Patch sent:

http://patchwork.ozlabs.org/patch/870439/

> - ensure the fix for the double-free crash gets merged

Fix in maintainer's tree, will be merged shortly:

bonzini/qemu@4183e2e

cdeadmin · 2018-02-16T14:36:06Z

------- Comment From KURZGREG@fr.ibm.com 2018-02-16 09:32:13 EDT-------
Remaining patches are now upstream.

Documentation of postcopy_ram:

https://git.qemu.org/?p=qemu.git;a=commit;h=c2eb7f213a15b870f7a35ec961e4f1e0f7e2df91

double-free crash:

https://git.qemu.org/?p=qemu.git;a=commit;h=4183e2ea6d092ea9d7f18af085cb1076fae08512

cdeadmin · 2018-02-21T05:15:53Z

------- Comment From bssrikanth@in.ibm.com 2018-02-21 00:12:39 EDT-------
Will we get this one for testing in sprint 8? Which level of qemu will have these fixes?

cdeadmin · 2018-02-21T09:37:10Z

------- Comment From KURZGREG@fr.ibm.com 2018-02-21 04:31:01 EDT-------
(In reply to comment #8)
> Will we get this one for test in sprint 8? which level of qemu would have
> these fixes?

QEMU 2.12 (expected release date: 2018-04-24)

balamuruhans · 2018-06-08T10:21:03Z

Tested with the latest HostOS build and the issue is no longer observed. Thanks Greg.

Qemu

# rpm -qa | grep qemu
qemu-img-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
qemu-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
qemu-system-ppc-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-common-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
qemu-system-x86-2.12.0-2.dev.gitd36f3ee.el7.ppc64le

balamuruhans closed this as completed Jun 8, 2018
