kpanic in scst_cm_dev_unregister #99

Closed
ishioni opened this issue Nov 12, 2022 · 3 comments · Fixed by #100

ishioni commented Nov 12, 2022

I use TrueNAS SCALE, which ships scst 3.6.0.8557-~truenas+2. When my Kubernetes cluster connects iSCSI targets to pods, I occasionally get the following kernel panic. I can trigger it semi-reliably by randomly killing and spawning pods that require the iSCSI targets. This happens on both the 5.10 and 5.15 kernels.

[20421.307062] [20891]: scst: Attached to virtual device pvc-227dff34-8712-4b08-a0cd-2e9ee26fe99b (id 19)
[20421.308079] [1120566]: dev_vdisk: T10 device id for device pvc-227dff34-8712-4b08-a0cd-2e9ee26fe99b changed to 974a63b6449af8f
[20421.308116] [1120566]: dev_vdisk: USN for device pvc-227dff34-8712-4b08-a0cd-2e9ee26fe99b changed to 974a63b6449af8f
[20421.308121] list_del corruption. next->prev should be ffff955cb1ea2540, but was ffff955c54a32440
[20421.308128] ------------[ cut here ]------------
[20421.308130] kernel BUG at lib/list_debug.c:54!
[20421.308134] invalid opcode: 0000 [#1] SMP PTI
[20421.308137] CPU: 4 PID: 93251 Comm: kworker/4:1 Tainted: P           OE     5.15.62+truenas #1
[20421.308140] Hardware name: Default string Default string/SKYBAY, BIOS QZ01AR12 09/17/2017
[20421.308143] Workqueue: events vdev_inq_changed_fn [scst_vdisk]
[20421.308151] RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
[20421.308168] Code: c7 c7 e0 fe 15 bb e8 b4 f3 fe ff 0f 0b 48 89 fe 48 c7 c7 70 ff 15 bb e8 a3 f3 fe ff 0f 0b 48 c7 c7 20 00 16 bb e8 95 f3 fe ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 e0 ff 15 bb e8 81 f3 fe ff 0f 0b
[20421.308172] RSP: 0018:ffffb889443afe38 EFLAGS: 00010246
[20421.308175] RAX: 0000000000000054 RBX: ffffffffc1d15220 RCX: ffff9563ddd20448
[20421.308177] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9563ddd20440
[20421.308179] RBP: ffff955cb1ea2540 R08: 0000000000000000 R09: ffffb889443afc68
[20421.308181] R10: ffffb889443afc60 R11: ffffffffbb6d3268 R12: 0000000000000000
[20421.308183] R13: dead000000000122 R14: dead000000000100 R15: ffff955e1c6ab100
[20421.308186] FS:  0000000000000000(0000) GS:ffff9563ddd00000(0000) knlGS:0000000000000000
[20421.308188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20421.308190] CR2: 000055e2327427f8 CR3: 000000037fcc4005 CR4: 00000000003706e0
[20421.308193] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20421.308195] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[20421.308197] Call Trace:
[20421.308199]  <TASK>
[20421.308201]  scst_cm_dev_unregister+0x66/0xd0 [scst]
[20421.308217]  scst_cm_update_dev+0x41/0xc0 [scst]
[20421.308231]  process_one_work+0x1ee/0x390
[20421.308234]  worker_thread+0x53/0x3e0
[20421.308237]  ? process_one_work+0x390/0x390
[20421.308239]  kthread+0x124/0x150
[20421.308241]  ? set_kthread_struct+0x50/0x50
[20421.308244]  ret_from_fork+0x1f/0x30
[20421.308248]  </TASK>
[20421.308249] Modules linked in: scst_vdisk(OE) isert_scst(OE) iscsi_scst(OE) scst(OE) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) dlm(E) rpcsec_gss_krb5(E) wireguard(E) libchacha20poly1305(E) chacha_x86_64(E) poly1305_x86_64(E) curve25519_x86_64(E) libcurve25519_generic(E) libchacha(E) ip6_udp_tunnel(E) udp_tunnel(E) xt_nat(E) xt_tcpudp(E) veth(E) xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xfrm_user(E) xfrm_algo(E) nft_counter(E) xt_addrtype(E) nft_compat(E) nf_tables(E) nfnetlink(E) br_netfilter(E) bridge(E) msr(E) binfmt_misc(E) essiv(E) authenc(E) dm_crypt(E) dm_mod(E) 8021q(E) garp(E) stp(E) mrp(E) llc(E) bonding(E) ntb_netdev(E) ntb_transport(E) ntb_split(E) ntb(E) ioatdma(E) intel_rapl_msr(E) intel_rapl_common(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) snd_hda_codec_hdmi(E) kvm_intel(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) mei_wdt(E) mei_hdcp(E) ledtrig_audio(E)
[20421.308285]  kvm(E) irqbypass(E) rapl(E) intel_cstate(E) evdev(E) intel_uncore(E) snd_hda_intel(E) snd_intel_dspcfg(E) snd_intel_sdw_acpi(E) mei_me(E) snd_hda_codec(E) wdat_wdt(E) i915(E) ir_rc6_decoder(E) pcspkr(E) watchdog(E) snd_hda_core(E) intel_wmi_thunderbolt(E) snd_hwdep(E) ttm(E) snd_pcm(E) snd_timer(E) drm_kms_helper(E) snd(E) soundcore(E) ee1004(E) rc_rc6_mce(E) mei(E) cec(E) sg(E) intel_pch_thermal(E) ite_cir(E) rc_core(E) intel_pmc_core(E) button(E) acpi_pad(E) nfsd(E) auth_rpcgss(E) fuse(E) nfs_acl(E) configfs(E) lockd(E) drm(E) grace(E) sunrpc(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) ses(E) enclosure(E) scsi_transport_sas(E) sd_mod(E) crc32_pclmul(E) crc32c_intel(E)
[20421.308337]  ghash_clmulni_intel(E) nvme(E) igb(E) ahci(E) ahciem(E) i2c_algo_bit(E) xhci_pci(E) dca(E) nvme_core(E) t10_pi(E) crc_t10dif(E) e1000e(E) libahci(E) aesni_intel(E) crypto_simd(E) xhci_hcd(E) ptp(E) i2c_i801(E) intel_lpss_pci(E) cryptd(E) crct10dif_generic(E) libata(E) i2c_smbus(E) crct10dif_pclmul(E) crct10dif_common(E) pps_core(E) scsi_mod(E) intel_lpss(E) scsi_common(E) idma64(E) usbcore(E) usb_common(E) fan(E) wmi(E) video(E)
[20421.308373] ---[ end trace e89eb550d12b0ed7 ]---
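
A note for readers on the first line of the trace: `list_del corruption. next->prev should be ..., but was ...` comes from the kernel's list debugging check, which refuses to unlink an entry whose neighbours no longer point back at it. Below is a minimal userspace sketch of that check, not the kernel source itself. `struct list_head` is re-declared locally, and `list_del_checked()` and `demo_list_del_check()` are hypothetical names used only for illustration. The two poison constants are set to the values visible in R14 and R13 above, which appear to match the kernel's list poison values on x86-64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Assumed poison values, taken from R14/R13 in the trace above. */
#define POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define POISON2 ((struct list_head *)(uintptr_t)0xdead000000000122ULL)

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

/* Models the debug check: is it safe to unlink e? */
static bool list_del_entry_valid(const struct list_head *e)
{
	if (e->next == POISON1 || e->prev == POISON2)
		return false;	/* entry was already deleted */
	if (e->prev->next != e)
		return false;	/* predecessor no longer points at e */
	if (e->next->prev != e)
		return false;	/* the case reported in the trace */
	return true;
}

/* Unlinks e and poisons its pointers; returns false instead of BUG(). */
static bool list_del_checked(struct list_head *e)
{
	if (!list_del_entry_valid(e))
		return false;
	e->next->prev = e->prev;
	e->prev->next = e->next;
	e->next = POISON1;
	e->prev = POISON2;
	return true;
}

/* Returns 0 if the model behaves as described in the comments. */
int demo_list_del_check(void)
{
	struct list_head head, a, b, stray;

	list_init(&head);
	list_init(&stray);
	list_add_tail(&a, &head);
	list_add_tail(&b, &head);

	/* Simulate an unsynchronized writer clobbering b.prev. */
	b.prev = &stray;
	if (list_del_checked(&a))
		return 1;	/* corruption should have been detected */

	/* Repair the link: deletion now succeeds exactly once. */
	b.prev = &a;
	if (!list_del_checked(&a))
		return 2;
	if (list_del_checked(&a))
		return 3;	/* second delete must hit the poison check */
	return 0;
}
```

In this model a failed check only returns `false`; in the trace above the same condition ends in `kernel BUG at lib/list_debug.c:54!`.
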
lnocturno added a commit that referenced this issue Nov 15, 2022
This patch should fix the following bug:

list_del corruption. next->prev should be ffff955cb1ea2540, but was ffff955c54a32440
 ------------[ cut here ]------------
kernel BUG at lib/list_debug.c:54!
invalid opcode: 0000 [#1] SMP PTI
Workqueue: events vdev_inq_changed_fn [scst_vdisk]
RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
Call Trace:
 scst_cm_dev_unregister+0x66/0xd0 [scst]
 scst_cm_update_dev+0x41/0xc0 [scst]
 process_one_work+0x1ee/0x390
 worker_thread+0x53/0x3e0
 kthread+0x124/0x150
 ret_from_fork+0x1f/0x30

Fixes: #99
lnocturno added a commit that referenced this issue Nov 15, 2022
This patch should fix the following bug:

list_del corruption. next->prev should be ffff955cb1ea2540, but was ffff955c54a32440
 ------------[ cut here ]------------
kernel BUG at lib/list_debug.c:54!
invalid opcode: 0000 [#1] SMP PTI
Workqueue: events vdev_inq_changed_fn [scst_vdisk]
RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
Call Trace:
 scst_cm_dev_unregister+0x66/0xd0 [scst]
 scst_cm_update_dev+0x41/0xc0 [scst]
 process_one_work+0x1ee/0x390
 worker_thread+0x53/0x3e0
 kthread+0x124/0x150
 ret_from_fork+0x1f/0x30

scst_cm_desig_list is a global list for all SCST devices. It must be
protected with scst_cm_mutex because it can be modified when
scst_cm_update_dev() is called by scst_cm_init_inq_finish() from
another thread.

Fixes: #99
lnocturno added a commit that referenced this issue Nov 15, 2022
This patch should fix the following bug:

list_del corruption. next->prev should be ffff955cb1ea2540, but was ffff955c54a32440
 ------------[ cut here ]------------
kernel BUG at lib/list_debug.c:54!
invalid opcode: 0000 [#1] SMP PTI
Workqueue: events vdev_inq_changed_fn [scst_vdisk]
RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
Call Trace:
 scst_cm_dev_unregister+0x66/0xd0 [scst]
 scst_cm_update_dev+0x41/0xc0 [scst]
 process_one_work+0x1ee/0x390
 worker_thread+0x53/0x3e0
 kthread+0x124/0x150
 ret_from_fork+0x1f/0x30

scst_cm_desig_list is a global list for all SCST devices. It must be
protected with scst_cm_mutex because it can be modified by
scst_cm_init_inq_finish() from another thread when scst_cm_update_dev()
is called.

Fixes: #99
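
To illustrate the locking rule described in the commit message, here is a small userspace sketch using POSIX threads. It is not the patch from PR #100: `desig_list` and `cm_mutex` are hypothetical stand-ins for `scst_cm_desig_list` and `scst_cm_mutex`, and `worker()` and `run_workers()` are made up for the example. Every add and delete on the shared list happens with the mutex held, so two threads cannot interleave their pointer updates.

```c
#include <pthread.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical stand-ins for scst_cm_desig_list and scst_cm_mutex. */
static struct list_head desig_list = { &desig_list, &desig_list };
static pthread_mutex_t cm_mutex = PTHREAD_MUTEX_INITIALIZER;

static void list_add_tail(struct list_head *e, struct list_head *head)
{
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

static void list_del(struct list_head *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
	e->next = e->prev = NULL;
}

#define ROUNDS 100000

/* Each worker repeatedly registers and unregisters its own entry. */
static void *worker(void *arg)
{
	struct list_head *entry = arg;

	for (int i = 0; i < ROUNDS; i++) {
		/* All modifications of the global list hold the mutex. */
		pthread_mutex_lock(&cm_mutex);
		list_add_tail(entry, &desig_list);
		pthread_mutex_unlock(&cm_mutex);

		pthread_mutex_lock(&cm_mutex);
		list_del(entry);
		pthread_mutex_unlock(&cm_mutex);
	}
	return NULL;
}

/* Runs two workers; returns 1 if the list ends up empty and intact. */
int run_workers(void)
{
	struct list_head a, b;
	pthread_t ta, tb;

	if (pthread_create(&ta, NULL, worker, &a) != 0)
		return 0;
	if (pthread_create(&tb, NULL, worker, &b) != 0) {
		pthread_join(ta, NULL);
		return 0;
	}
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);

	return desig_list.next == &desig_list &&
	       desig_list.prev == &desig_list;
}
```

Removing the lock/unlock pairs in `worker()` lets the two threads update `next` and `prev` at the same time, which can leave the list in the kind of inconsistent state reported in the trace. Build with `-pthread` on toolchains that need it.
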
@lnocturno
Contributor

Hi,

Thank you for the report!

I have created PR #100.
Could you retest with these four patches applied?

Thanks,
Gleb

lnocturno added a commit that referenced this issue Nov 18, 2022
@lnocturno
Contributor

Hi,

The fix candidate has been merged into the master branch. Let me know if you need these patches backported to the SCST 3.6 stable branch.
If you reproduce this problem again, feel free to reopen the issue.

Gleb.

@ishioni
Author

ishioni commented Nov 18, 2022

Sorry for the lack of reply. Unfortunately, I won't be able to test this patch per se, because for me SCST is part of a specialized distribution that makes it hard to replace individual pieces. I've alerted the devs to this patch, and hopefully they'll pull it in. Looking at their sources, they seem to be pulling in 3.7 for the newest version, so a fix for 3.7 should suffice.

Thank you for a very speedy fix :)

yocalebo pushed a commit to truenas/scst that referenced this issue Nov 23, 2022
Fixes: SCST-project#99