[Power9] [Qemu] Migration started after postcopy_ram enabled causes guest reboot in destination and guest remains running state in source as well #34

Closed
balamuruhans opened this issue Feb 1, 2018 · 9 comments

balamuruhans commented Feb 1, 2018

Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=164182

Starting a migration after enabling postcopy_ram immediately causes the guest listening on the destination to reboot and enter the running state, while the guest on the source also remains in the running state.

When either guest is killed or shut down, a backtrace with a memory map dump is printed (full log attached).
Source guest when killed:

# sos*** Error in `/usr/bin/qemu-system-ppc64': free(): invalid pointer: 0x000000014624b500 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x4ac)[0x7fff959669fc]
/lib64/libglib-2.0.so.0(g_free+0x24)[0x7fff95e1b794]
/usr/bin/qemu-system-ppc64(+0x827f08)[0x13b107f08]
/usr/bin/qemu-system-ppc64(+0x2c8edc)[0x13aba8edc]
/usr/bin/qemu-system-ppc64(+0x2d15f8)[0x13abb15f8]
/usr/bin/qemu-system-ppc64(main+0x4b74)[0x13ab58214]
/lib64/libc.so.6(+0x24980)[0x7fff958f4980]
/lib64/libc.so.6(__libc_start_main+0xc4)[0x7fff958f4b74]
======= Memory map: ========
13a8e0000-13b550000 r-xp 00000000 fd:00 138849                           /usr/bin/qemu-system-ppc64
13b560000-13b760000 r--p 00c70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
13b760000-13b7f0000 rw-p 00e70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
13b7f0000-13bc40000 rw-p 00000000 00:00 0 
145b90000-1464f0000 rw-p 00000000 00:00 0  

destination guest when killed:

qemu-system-ppc64: terminating on signal 2
*** Error in `/usr/bin/qemu-system-ppc64': double free or corruption (fasttop): 0x0000000144fb0660 ***
======= Backtrace: =========
/lib64/libc.so.6(cfree+0x4ac)[0x7fffb88369fc]
/lib64/libglib-2.0.so.0(g_free+0x24)[0x7fffb8ceb794]
/usr/bin/qemu-system-ppc64(+0x827f08)[0x116897f08]
/usr/bin/qemu-system-ppc64(+0x2c8edc)[0x116338edc]
/usr/bin/qemu-system-ppc64(+0x2d15f8)[0x1163415f8]
/usr/bin/qemu-system-ppc64(main+0x4b74)[0x1162e8214]
/lib64/libc.so.6(+0x24980)[0x7fffb87c4980]
/lib64/libc.so.6(__libc_start_main+0xc4)[0x7fffb87c4b74]
======= Memory map: ========
116070000-116ce0000 r-xp 00000000 fd:00 138849                           /usr/bin/qemu-system-ppc64
116cf0000-116ef0000 r--p 00c70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
116ef0000-116f80000 rw-p 00e70000 fd:00 138849                           /usr/bin/qemu-system-ppc64
116f80000-1173d0000 rw-p 00000000 00:00 0 
144970000-1452d0000 rw-p 00000000 00:00 0                                [heap]
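
The QEMU frames in both backtraces carry only binary-relative offsets such as `+0x827f08`, so they have to be resolved against the binary before they name a function (for example with `addr2line -e /usr/bin/qemu-system-ppc64 0x827f08`, assuming matching debug info is installed). The following is a minimal sketch, not part of the original report, that pulls those offsets out of glibc backtrace lines; the helper name `parse_frame` is hypothetical.

```python
import re

# One glibc backtrace frame looks like: module(symbol+0xoff)[0xaddr]
# The symbol part is empty when glibc could not resolve a name.
FRAME_RE = re.compile(
    r"^(?P<module>[^()\s]+)\((?P<symbol>[^+()]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

def parse_frame(line):
    """Return (module, symbol or None, offset, absolute address), or None."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return (m.group("module"), m.group("symbol") or None,
            int(m.group("offset"), 16), int(m.group("addr"), 16))

# Two frames copied from the source-side backtrace above.
frames = [
    "/lib64/libglib-2.0.so.0(g_free+0x24)[0x7fff95e1b794]",
    "/usr/bin/qemu-system-ppc64(+0x827f08)[0x13b107f08]",
]
for text in frames:
    module, symbol, offset, addr = parse_frame(text)
    print(module, symbol, hex(offset), hex(addr))
```

For the unnamed QEMU frames, the offset added to the start of the first `qemu-system-ppc64` mapping in the memory map gives the absolute address, e.g. 0x13a8e0000 + 0x827f08 = 0x13b107f08 on the source.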

Steps to reproduce:

  1. Boot healthy guest from qemu command line,
    # qemu-kvm --enable-kvm --nographic -vga none -machine pseries -m 4G,slots=32,maxmem=32G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk -drive file=/home/bala/images/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk -monitor telnet:127.0.0.1:1234,server,nowait

  2. Start another qemu instance listening for incoming migration as the destination,
    # qemu-kvm --enable-kvm --nographic -vga none -machine pseries -m 4G,slots=32,maxmem=32G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk -drive file=/home/bala/images/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk -monitor telnet:127.0.0.1:1235,server,nowait -incoming tcp:0:4444

  3. Enable postcopy_ram from qemu monitor in source vm,

# telnet 127.0.0.1 1234
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
QEMU 2.11.50 monitor - type 'help' for more information
(qemu) migrate_set_capability postcopy-ram on
(qemu) info migrate_capabilities 
xbzrle: off
rdma-pin-all: off
auto-converge: off
zero-blocks: off
compress: off
events: off
postcopy-ram: on
x-colo: off
release-ram: off
block: off
return-path: off
pause-before-switchover: off
x-multifd: off

  4. Start migration in source qemu monitor,
    (qemu) migrate -d tcp:127.0.0.1:4444

The guest listening at the destination immediately restarts and enters the running state, while the guest on the source also remains in the running state.
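
The `info migrate_capabilities` listing in step 3 is plain `name: state` text, so it can be checked from a script rather than by eye. The following is a minimal sketch under that assumption; `parse_capabilities` is a hypothetical helper, and the sample is a shortened copy of the output shown above.

```python
def parse_capabilities(text):
    """Map capability name -> bool from `info migrate_capabilities` output."""
    caps = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        state = state.strip()
        # Keep only "name: on" / "name: off" lines; prompts and banners are skipped.
        if sep and state in ("on", "off"):
            caps[name.strip()] = (state == "on")
    return caps

sample = """\
xbzrle: off
postcopy-ram: on
return-path: off
"""
caps = parse_capabilities(sample)
print(caps["postcopy-ram"])
```

Running the same parser over the output of each monitor (ports 1234 and 1235 in the steps above) makes it easy to compare the capability state of the two instances.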

Observation:
Just enabling postcopy_ram and shutting down the guest from inside (shutdown -h now) also triggers the same backtrace with memory map dump after the VM shuts down.

System configuration:
qemu: 2.11.50-1.dev.gita815ffa.el7.centos.ppc64le
slof: SLOF-20170724-2.dev.gitea31295.el7.centos.noarch
host kernel: 4.15.0-3.dev.gitd34a158.el7.centos.ppc64le
guest kernel: 4.15.0-3.dev.gitd34a158.el7.centos.ppc64le

Attachment:
Guest backtrace error in source and destination

balamuruhans commented

migration_source.log

cdeadmin commented Feb 2, 2018

------- Comment From KURZGREG@fr.ibm.com 2018-02-02 04:09:38 EDT-------
(In reply to comment #1)
>
>
> migration started after postcopy_ram enabled immediately triggers guest
> listening in destination to reboot and enters running state, also guest in
> the source remains to be in running state.
>

I could reproduce ^^ with upstream QEMU on both a POWER host and on my laptop.
The following trace is also printed on the destination:

qemu-system-ppc64: Expected vmdescription section, but got 0

This behaviour isn't seen with QEMU 2.10. Bisect indicates the following commit
to be the culprit:

commit 58110f0
Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Date: Mon Jul 10 19:30:16 2017 +0300

migration: split common postcopy out of ram postcopy

Split common postcopy staff from ram postcopy staff.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

This is hence a QEMU 2.11 regression.

> when tried to kill or shutdown any one, observed Backtrace with memory map
> failure, (attached full log)
> source guest when killed:
> # sos*** Error in `/usr/bin/qemu-system-ppc64': free(): invalid pointer:
> 0x000000014624b500 ***

I suppose you have used completion when typing the commands in the
monitor. If so, then this isn't related to the migration issue. It is a bug
recently introduced by commit:

commit e5dc1a6
Author: Marc-André Lureau <marcandre.lureau@redhat.com>
Date: Thu Jan 4 17:05:15 2018 +0100

readline: add a free function

This commit is present in qemu-2.11.50-1.dev.gita815ffa.el7.centos.ppc64le.

I had already sent a fix for this. It was queued by Paolo Bonzini and should be
merged upstream next week.

http://patchwork.ozlabs.org/patch/862816/

cdeadmin commented Feb 2, 2018

------- Comment From KURZGREG@fr.ibm.com 2018-02-02 06:27:22 EDT-------
The postcopy issue was actually discussed by the community. The root cause is that the
postcopy-ram capability must be set on the destination as well, in which case migration
occurs as expected.

This being said, I had a chat on irc with QEMU migration maintainer David Gilbert and he
could not point to any documentation stating postcopy-ram should be set on both source
and destination. Also, he agrees that the destination shouldn't start as it does now in case
of discrepancy.

FYI, this shouldn't happen when using libvirt because it always sets postcopy-ram on both
ends.

So we have three items to address here:

  • fix the bogus behaviour of the destination
  • improve the postcopy-ram documentation
  • ensure the fix for the double-free crash gets merged
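
To illustrate the root cause described above, here is a minimal sketch of the order in which the capability and the migration would be issued when driving both instances over QMP. The reproduction steps in this report use the human monitor over telnet, so the QMP sessions are an assumption, and the helper names (`set_capability_cmd`, `migrate_cmd`, `postcopy_plan`) are hypothetical. The sketch only builds the JSON messages and does not connect to anything.

```python
import json

def set_capability_cmd(name, enabled=True):
    """Build a QMP migrate-set-capabilities message for one capability."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {"capabilities": [{"capability": name, "state": enabled}]},
    }

def migrate_cmd(uri):
    """Build a QMP migrate message for the given destination URI."""
    return {"execute": "migrate", "arguments": {"uri": uri}}

def postcopy_plan(uri):
    """Ordered (endpoint, message) pairs: capability on both ends, then migrate."""
    cap = set_capability_cmd("postcopy-ram")
    return [("destination", cap), ("source", cap), ("source", migrate_cmd(uri))]

for endpoint, message in postcopy_plan("tcp:127.0.0.1:4444"):
    # Each QMP session must first complete the qmp_capabilities handshake.
    print(endpoint, json.dumps(message))
```

With the human monitor used in this report, the equivalent is to run `migrate_set_capability postcopy-ram on` in the destination monitor as well before starting the migration on the source.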

cdeadmin commented Feb 7, 2018

------- Comment From KURZGREG@fr.ibm.com 2018-02-07 10:57:07 EDT-------
(In reply to comment #5)
> So we have three items to address here:
> - fix the bogus behaviour of the destination

Fix is upstream:

https://git.qemu.org/?p=qemu.git;a=commit;h=875fcd013ab68c64802998b22f54f0184479d21b

> - improve the postcopy-ram documentation

Patch sent:

http://patchwork.ozlabs.org/patch/870439/

> - ensure the fix for the double-free crash gets merged

Fix in maintainer's tree, will be merged shortly:

bonzini/qemu@4183e2e

------- Comment From KURZGREG@fr.ibm.com 2018-02-16 09:32:13 EDT-------
Remaining patches are now upstream.

Documentation of postcopy_ram:

https://git.qemu.org/?p=qemu.git;a=commit;h=c2eb7f213a15b870f7a35ec961e4f1e0f7e2df91

double-free crash:

https://git.qemu.org/?p=qemu.git;a=commit;h=4183e2ea6d092ea9d7f18af085cb1076fae08512

------- Comment From bssrikanth@in.ibm.com 2018-02-21 00:12:39 EDT-------
Will we get this one for testing in sprint 8? Which level of qemu will have these fixes?

------- Comment From KURZGREG@fr.ibm.com 2018-02-21 04:31:01 EDT-------
(In reply to comment #8)
> Will we get this one for test in sprint 8? which level of qemu would have
> these fixes?

QEMU 2.12 (expected release date: 2018-04-24)

balamuruhans commented

Tested with the latest HostOS build and the issue is no longer observed. Thanks Greg.

Qemu

# rpm -qa | grep qemu
qemu-img-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
qemu-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
qemu-system-ppc-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-common-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
qemu-system-x86-2.12.0-2.dev.gitd36f3ee.el7.ppc64le
