
BSOD when installing software #364

Open
JayanWarden opened this issue Mar 18, 2024 · 31 comments

JayanWarden commented Mar 18, 2024

System information

Type Version/Name
Distribution Name Windows 11 Pro
Distribution Version 23H2
Kernel Version 22631.3296
Architecture x64
OpenZFS Version zfswin-2.2.3rc2-dirty + zfs-kmod-zfswin-2.2.3rc2-dirty

Describe the problem you're observing

BSOD while installing software on the ZFS volume.
Had this problem with RC1, continues to persist in RC2.
The system was not stalled before the BSOD; I could see kernel CPU utilization in Task Manager at around 30-40% (~10 threads active), so work was being done, in line with highly compressible data being written to disk.
The BSOD is sporadic and I cannot reproduce it on demand. This is the 3rd BSOD so far; sometimes I can install data for hours on end.
I appended the minidump.
031824-16593-01.dmp
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED
Caused by OpenZFS.sys+54de92
Crash Address OpenZFS.sys+1807e0

Describe how to reproduce the problem

  • Create a dataset with high compression like zstd-19 and blocksize=1M
  • Install a large, compressible application (for example through Steam)
  • Sadly very sporadic. Maybe race condition or some kind of buffer exhaustion?
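For reference, a dataset like the one described can be created with something along these lines (the pool name `tank` and dataset name `tank/steam` are placeholders; on OpenZFS filesystems the record-size property is `recordsize`, while `volblocksize` applies to zvols):

```shell
# Hypothetical pool/dataset names; adjust to your setup.
zfs create -o compression=zstd-19 -o recordsize=1M tank/steam

# Verify the properties took effect:
zfs get compression,recordsize tank/steam
```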
lundman commented Mar 19, 2024

It's not clear from the minidump where in the stack we are, but it is in avx/ymm code, so perhaps that is related. If you were to set the various _impl tunables to generic (for example gcm_impl, and likewise aes, raidz, sha256, sha512, blake3), does it still crash?

JayanWarden commented Mar 19, 2024

You mean in the registry? Sure, I can test that.
icp_aes_impl, for example, is empty. Does it need to be set to "generic"?
Also, zfs_blake3_impl is set to "cycle [fastest] generic sse2 sse41 avx2 "; I guess I delete everything there and set it to "generic" as well?

EDIT:
Sorry for being clumsy with the "close issue" button.

lundman commented Mar 19, 2024

Yeah, just type in generic, and push return. You can hit F5 and it should say "cycle fastest [generic] sse2 sse41 avx2".
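In PowerShell, that could look roughly like this. This is only a sketch: the registry path below is an assumption about where the Windows port keeps its tunables (verify with regedit first), and apart from icp_aes_impl and zfs_blake3_impl, which are named above, the other value names are my guesses at the corresponding tunables.

```powershell
# Assumed tunable location for OpenZFS on Windows; confirm with regedit first.
$key = "HKLM:\SYSTEM\CurrentControlSet\Services\OpenZFS"

# Force the generic (non-SIMD) implementations for the suspect code paths.
foreach ($name in "icp_aes_impl", "icp_gcm_impl", "zfs_vdev_raidz_impl",
                  "zfs_sha256_impl", "zfs_sha512_impl", "zfs_blake3_impl") {
    Set-ItemProperty -Path $key -Name $name -Value "generic"
}
```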

Although, the top of the stack claims to be in refcounter.. I need to see if I can load the symbols from the release

lundman commented Mar 19, 2024

Ah ok

OpenZFS!zfs_refcount_destroy_many+0x90
OpenZFS!zfs_refcount_destroy+0x17
OpenZFS!abd_fini_struct+0x68

OK yeah, nothing to do with _impl, sorry. I'll need to think on this.

lundman commented Mar 19, 2024

Could you increase the size of the memory.dmp? It would be useful to see which abd, and the refcount.

@JayanWarden

Of course.
I switched the BSOD type to kernel memory dump, now I will just have to provoke the BSOD again.
I'll see if I can do that, and update this thread once I have the memory.dmp

@JayanWarden

Small update.
I could not provoke a BSOD, but I did get a full filesystem hang.
The ZFS volume was unresponsive again, at 0% CPU load. Writing to the volume was impossible, but reading still worked.

I crashed the system via PowerShell to generate a memory dump; maybe this is related.
I made it available as a download; it's nearly 13GB compressed.
https://teamnoke.live/nextcloud/index.php/s/AM5x2N99oF9SyPi

I will still try to provoke a BSOD further, but maybe this memory dump can give pointers already?

inoperable commented Mar 28, 2024

A few cents from me (this may be specific to my Windows configuration): I "removed" 75% of system services without much thought for their dependencies, so my BSODs might be caused indirectly by my config. That said, I can trigger a BSOD pretty fast with multithreaded / parallel heavy-I/O software. Those seem to be a monkey wrench in our Windows driver's gearbox.

How to replicate:

  • git clone some fat repo with a lot of files (a lot, like ~1000+ per dir)
  • cd into the parent directory
  • launch fzf.exe or fd.exe (any package manager will do, scoop or winget), or anything else that globs/walks utilizing your CPU cores
  • wait 1-2 seconds

Boom, BSOD.

TO CLARIFY: it might be my ignorance of proper (Windows) system configuration and NOT the driver. I have personally used @lundman's awesome driver for 4+ years without any data loss, swapping daily between 3 OSes (also at work). But I also had to learn ZFS the hard way (which included data loss, though that was the result of my configuration and of not properly RTFMing).

@lundman: Let me know if you need dumps from W10/W11 - can package some for you. Thanks for the great work!

lundman commented Mar 28, 2024

Yeah, I would love stacks from any BSOD, so I can fix them.

@inoperable

The Windows OpenZFS driver and the WinBtrfs driver don't play well with each other (I uninstalled WinBtrfs after it caused some weird errors on subvols).
The SystemInformer kernel driver also causes BSODs.

If you have those two mixed with the OpenZFS driver, check how it behaves when they are not running.

lundman commented Mar 28, 2024

Really? Huh. I'll try installing both.

JayanWarden commented Apr 10, 2024

Hey there!

Sorry for the long silence; I have been testing on and off, but now I finally have a reproducible BSOD test case.
I am writing a 32GB text file to a highly compressed and deduped dataset, and it BSODs every time (tested 3 times) after 5-10 minutes, in the middle of the file transfer.

Steps to reproduce on my system:
zfswin-2.2.3rc2-dirty
zfs-kmod-zfswin-2.2.3rc2-dirty

Create a dataset with dedup and compression=zstd-19, blocksize 1M, as well as xattr=sa.
In the registry, zfs_arc_max is 8GB; everything else is default.
Now take a single file of compressible data and copy it to the ZFS volume with Windows Explorer. It will BSOD after a handful of minutes with
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED caused by OpenZFS.sys+54de92
Sadly my system now refuses to do a memory dump; it doesn't create one anymore. I will try to provide one after I debug this issue.

So I will provide you with the Python script I used to create the dummy 32GB text file, as well as the file I used, compressed.

Python script to generate the 32GB text file (took a good 5 hours on my system; needs the pip module "faker"):
gentext.zip

Here's a link to the 32GB text file crunched down to 10.9GB:
https://teamnoke.live/nextcloud/index.php/s/4GNHY2A2mzkGXEn
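In case the gentext.zip attachment is unavailable, here is a rough stdlib-only sketch of the same idea. The original script used the faker module; the `generate_text` helper name and the small fixed vocabulary here are my own, chosen so the output stays highly compressible like the original test file:

```python
import random

def generate_text(target_bytes, out_path, seed=0):
    """Write roughly target_bytes of fake prose to out_path.

    A small vocabulary keeps the output highly compressible,
    mimicking the faker-generated test file from this thread.
    """
    rng = random.Random(seed)
    words = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
             "adipiscing", "elit", "sed", "do", "eiusmod", "tempor"]
    written = 0
    with open(out_path, "w") as f:
        while written < target_bytes:
            sentence = " ".join(rng.choice(words)
                                for _ in range(rng.randint(6, 14)))
            line = sentence.capitalize() + ".\n"
            f.write(line)
            written += len(line)
    return written
```

At 32GB a run takes a long time; shrink `target_bytes` to try it out locally first.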

ConnorS-P commented Apr 15, 2024

I also got the same error code (SYSTEM_THREAD_EXCEPTION_NOT_HANDLED BSOD) on 2.2.3rc3. It may or may not be related. I run FreeFileSync in the background and use a ZFS pool to keep a few folders in sync between my Linux and Windows dual boot, so I imagine it was doing something at the time. I hope this helps:

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common BugCheck.  Usually the exception address pinpoints
the driver/function that caused the problem.  Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffff80000003, The exception code that was not handled
Arg2: fffff80149bd6ec9, The address that the exception occurred at
Arg3: ffff958b842ad548, Exception Record Address
Arg4: ffff958b842acd60, Context Record Address

Debugging Details:
------------------


KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 328

    Key  : Analysis.Elapsed.mSec
    Value: 2941

    Key  : Analysis.IO.Other.Mb
    Value: 3

    Key  : Analysis.IO.Read.Mb
    Value: 0

    Key  : Analysis.IO.Write.Mb
    Value: 40

    Key  : Analysis.Init.CPU.mSec
    Value: 93

    Key  : Analysis.Init.Elapsed.mSec
    Value: 40960

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 118

    Key  : Bugcheck.Code.KiBugCheckData
    Value: 0x7e

    Key  : Bugcheck.Code.LegacyAPI
    Value: 0x7e

    Key  : Bugcheck.Code.TargetModel
    Value: 0x7e

    Key  : Dump.Attributes.AsUlong
    Value: 1800

    Key  : Dump.Attributes.DiagDataWrittenToHeader
    Value: 1

    Key  : Dump.Attributes.ErrorCode
    Value: 0

    Key  : Dump.Attributes.LastLine
    Value: Dump completed successfully.

    Key  : Dump.Attributes.ProgressPercentage
    Value: 100

    Key  : Failure.Bucket
    Value: 0x7E_80000003_OpenZFS!unknown_function

    Key  : Failure.Hash
    Value: {0b5491d9-17d0-a673-da75-5b5b3f21b6a6}

    Key  : Hypervisor.Enlightenments.ValueHex
    Value: 1417df84

    Key  : Hypervisor.Flags.AnyHypervisorPresent
    Value: 1

    Key  : Hypervisor.Flags.ApicEnlightened
    Value: 0

    Key  : Hypervisor.Flags.ApicVirtualizationAvailable
    Value: 1

    Key  : Hypervisor.Flags.AsyncMemoryHint
    Value: 0

    Key  : Hypervisor.Flags.CoreSchedulerRequested
    Value: 0

    Key  : Hypervisor.Flags.CpuManager
    Value: 1

    Key  : Hypervisor.Flags.DeprecateAutoEoi
    Value: 1

    Key  : Hypervisor.Flags.DynamicCpuDisabled
    Value: 1

    Key  : Hypervisor.Flags.Epf
    Value: 0

    Key  : Hypervisor.Flags.ExtendedProcessorMasks
    Value: 1

    Key  : Hypervisor.Flags.HardwareMbecAvailable
    Value: 1

    Key  : Hypervisor.Flags.MaxBankNumber
    Value: 0

    Key  : Hypervisor.Flags.MemoryZeroingControl
    Value: 0

    Key  : Hypervisor.Flags.NoExtendedRangeFlush
    Value: 0

    Key  : Hypervisor.Flags.NoNonArchCoreSharing
    Value: 1

    Key  : Hypervisor.Flags.Phase0InitDone
    Value: 1

    Key  : Hypervisor.Flags.PowerSchedulerQos
    Value: 0

    Key  : Hypervisor.Flags.RootScheduler
    Value: 0

    Key  : Hypervisor.Flags.SynicAvailable
    Value: 1

    Key  : Hypervisor.Flags.UseQpcBias
    Value: 0

    Key  : Hypervisor.Flags.Value
    Value: 21631230

    Key  : Hypervisor.Flags.ValueHex
    Value: 14a10fe

    Key  : Hypervisor.Flags.VpAssistPage
    Value: 1

    Key  : Hypervisor.Flags.VsmAvailable
    Value: 1

    Key  : Hypervisor.RootFlags.AccessStats
    Value: 1

    Key  : Hypervisor.RootFlags.CrashdumpEnlightened
    Value: 1

    Key  : Hypervisor.RootFlags.CreateVirtualProcessor
    Value: 1

    Key  : Hypervisor.RootFlags.DisableHyperthreading
    Value: 0

    Key  : Hypervisor.RootFlags.HostTimelineSync
    Value: 1

    Key  : Hypervisor.RootFlags.HypervisorDebuggingEnabled
    Value: 0

    Key  : Hypervisor.RootFlags.IsHyperV
    Value: 1

    Key  : Hypervisor.RootFlags.LivedumpEnlightened
    Value: 1

    Key  : Hypervisor.RootFlags.MapDeviceInterrupt
    Value: 1

    Key  : Hypervisor.RootFlags.MceEnlightened
    Value: 1

    Key  : Hypervisor.RootFlags.Nested
    Value: 0

    Key  : Hypervisor.RootFlags.StartLogicalProcessor
    Value: 1

    Key  : Hypervisor.RootFlags.Value
    Value: 1015

    Key  : Hypervisor.RootFlags.ValueHex
    Value: 3f7

    Key  : SecureKernel.HalpHvciEnabled
    Value: 1

    Key  : WER.OS.Branch
    Value: ni_release_svc_prod3

    Key  : WER.OS.Version
    Value: 10.0.22621.2506


BUGCHECK_CODE:  7e

BUGCHECK_P1: ffffffff80000003

BUGCHECK_P2: fffff80149bd6ec9

BUGCHECK_P3: ffff958b842ad548

BUGCHECK_P4: ffff958b842acd60

FILE_IN_CAB:  MEMORY.DMP

TAG_NOT_DEFINED_202b:  *** Unknown TAG in analysis list 202b


DUMP_FILE_ATTRIBUTES: 0x1800

EXCEPTION_RECORD:  ffff958b842ad548 -- (.exr 0xffff958b842ad548)
ExceptionAddress: fffff80149bd6ec9 (OpenZFS+0x0000000000276ec9)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 1
   Parameter[0]: 0000000000000000

CONTEXT:  ffff958b842acd60 -- (.cxr 0xffff958b842acd60)
rax=0000000000000001 rbx=ffffd68a82c72190 rcx=0f5a43d630de0000
rdx=0000000000000039 rsi=ffffd689303e0050 rdi=0000000000000000
rip=fffff80149bd6ec9 rsp=ffff958b842ad780 rbp=0000000000000000
 r8=000000000000004d  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=ffffd68a82c72190 r13=0000000000000000
r14=fffff80149c88850 r15=fffff80129a5bae0
iopl=0         nv up ei ng nz na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00040286
OpenZFS+0x276ec9:
fffff801`49bd6ec9 cc              int     3
Resetting default scope

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXNTFS: 1 (!blackboxntfs)


BLACKBOXPNP: 1 (!blackboxpnp)


BLACKBOXWINLOGON: 1

PROCESS_NAME:  System

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_STR:  0x80000003

STACK_TEXT:  
ffff958b`842ad780 fffff801`49bd6f7e     : ffffd68a`82c72190 fffff801`29b31493 00000000`00000000 ffffd68a`06752840 : OpenZFS+0x276ec9
ffff958b`842ad810 fffff801`49c889dc     : ffffd689`0cd5e4c0 00000000`00000000 ffffd689`0cd5e600 ffffd689`0cd5e646 : OpenZFS+0x276f7e
ffff958b`842ad850 fffff801`29a5bbe0     : 00000000`82c73150 ffffd689`303b8050 00000000`00000000 00000000`00000000 : OpenZFS+0x3289dc
ffff958b`842ad8d0 fffff801`29ad6fd5     : ffffd689`0839bbb0 ffffd689`0cd5e4c0 ffff958b`842ada40 ffffd689`00000000 : nt!IopProcessWorkItem+0x100
ffff958b`842ad940 fffff801`29b6db37     : ffffd689`0cd5e4c0 00000000`000001b7 ffffd689`0cd5e4c0 fffff801`29ad6e80 : nt!ExpWorkerThread+0x155
ffff958b`842adb30 fffff801`29c1d554     : ffffaf00`66511180 ffffd689`0cd5e4c0 fffff801`29b6dae0 00000000`00000000 : nt!PspSystemThreadStartup+0x57
ffff958b`842adb80 00000000`00000000     : ffff958b`842ae000 ffff958b`842a7000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34


SYMBOL_NAME:  OpenZFS+276ec9

MODULE_NAME: OpenZFS

IMAGE_NAME:  OpenZFS.sys

STACK_COMMAND:  .cxr 0xffff958b842acd60 ; kb

BUCKET_ID_FUNC_OFFSET:  276ec9

FAILURE_BUCKET_ID:  0x7E_80000003_OpenZFS!unknown_function

OS_VERSION:  10.0.22621.2506

BUILDLAB_STR:  ni_release_svc_prod3

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {0b5491d9-17d0-a673-da75-5b5b3f21b6a6}

Followup:     MachineOwner
---------

lundman commented Apr 15, 2024

Thanks, I can probably look up that symbol. But since it's coming off IopProcessWorkItem, which we only use in one place, I think I know where to start looking.

lundman commented Apr 18, 2024

Oh, I see (Break instruction exception) - we are in fact tripping over an ASSERT:

OpenZFS+0x276ec9
OpenZFS+0x276f7e
OpenZFS+0x3289dc


 [C:\src\openzfs\module\zfs\abd.c @ 689] (fffff804`2f356d60)   OpenZFS!abd_return_buf+0x169   |  (fffff804`2f356f50)   OpenZFS!abd_cmp_buf
 [C:\src\openzfs\include\sys\abd.h @ 172] (fffff804`2f356f50)   OpenZFS!abd_cmp_buf+0x2e   |  (fffff804`2f356f90)   OpenZFS!abd_return_buf_copy
 [C:\src\openzfs\module\os\windows\zfs\vdev_disk.c @ 589] (fffff804`2f408850)   OpenZFS!vdev_disk_io_start_done+0x18c   |  (fffff804`2f408ab0)   OpenZFS!vdev_disk_open

Specifically

		ASSERT0(abd_cmp_buf(abd, buf, n));

lundman commented Apr 18, 2024

I've taken out the assert (though I don't think that is the right answer). If you want to try
OpenZFSOnWindows-debug-.99-137-g4b326b9b92-dirty.exe
it might give more clues.

lundman commented Apr 25, 2024

OK, that was tricky. There is indeed an unload BSOD problem, fixed in 01aa832,
and I have rolled out rc4 to address it. If you are running an older rcX, you can avoid the
BSOD by renaming openzfs.sys to something else (openzfs.tmp, maybe) and rebooting.

@JayanWarden

Hello everyone, hello @lundman ,

Again sorry for my long absence, but I have now been able to reproduce my original BSOD with a full memory dump.
I am still on
zfswin-2.2.3rc2-dirty
zfs-kmod-zfswin-2.2.3rc2-dirty

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED
Parameter 1: ffffffff80000003 Parameter 2: fffff807b30107e0
Parameter 3: ffff850e7bf5ee38 Parameter 4: ffff850e7bf5e650
Caused by OpenZFS.sys+54de92
Crash Address: OpenZFS.sys+1807e0

Full Memory Dump (9.2GB) :
https://drive.google.com/file/d/1eri41N6m-mo0o0CPBB3fEvGTQbbSdz3o/view?usp=sharing

lundman commented May 9, 2024

OK sorry for the delay, had to find a way to fit the memory.dmp on my VM :)

This is the cause:

     : 00000000`00008000 00000000`000007ff 00000000`000007ff 00ffbc0c`24a6cb80 : OpenZFS!zfs_refcount_destroy_many+0x90 [C:\src\openzfs\module\zfs\refcount.c @ 98] 
     : 00000000`00000207 00000000`00000207 00000000`01000000 00000000`00100000 : OpenZFS!zfs_refcount_destroy+0x17 [C:\src\openzfs\module\zfs\refcount.c @ 113] 
     : ffffbc0b`00000000 00000000`00000080 ffffbc0b`f0208300 ffffbc0b`f020a068 : OpenZFS!abd_fini_struct+0x68 [C:\src\openzfs\module\zfs\abd.c @ 160] 
     : ffffbc00`af4ece80 fffff807`b2fc724f 00000000`00000000 ffffbc0c`7082c3d8 : OpenZFS!abd_free+0xdd [C:\src\openzfs\module\zfs\abd.c @ 320] 
     : ffffbc0b`f3a317d0 00000000`00000000 00000000`00000001 ffffbc0b`f002eb80 : OpenZFS!zio_pop_transforms+0x76 [C:\src\openzfs\module\zfs\zio.c @ 452] 
     : 00000000`00000001 fffff807`b2e95c30 ffffbc0b`dafba040 00000000`00000080 : OpenZFS!zio_done+0x1392 [C:\src\openzfs\module\zfs\zio.c @ 4795] 
     : 00000000`00000246 00000017`fef81000 00000000`00000000 00000000`00000000 : OpenZFS!zio_execute+0x30f [C:\src\openzfs\module\zfs\zio.c @ 2274] 
     : ffffbc0c`153ba080 fffff807`b2e95c30 ffffbc0b`f3a317d0 00000000`00000000 : OpenZFS!taskq_thread+0x51a [C:\src\openzfs\module\os\windows\spl\spl-taskq.c @ 2083] 
     : ffffd180`72451180 ffffbc0c`153ba080 fffff807`6aaed700 00000000`00000246 : nt!PspSystemThreadStartup+0x57
     : ffff850e`7bf60000 ffff850e`7bf59000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34


FAULTING_SOURCE_LINE:  C:\src\openzfs\module\zfs\refcount.c

FAULTING_SOURCE_FILE:  C:\src\openzfs\module\zfs\refcount.c

FAULTING_SOURCE_LINE_NUMBER:  98

FAULTING_SOURCE_CODE:  
    94: {
    95: 	reference_t *ref;
    96: 	void *cookie = NULL;
    97: 
>   98: 	ASSERT3U(rc->rc_count, ==, number);
    99: 	while ((ref = avl_destroy_nodes(&rc->rc_tree, &cookie)) != NULL)
   100: 		kmem_cache_free(reference_cache, ref);
   101: 	avl_destroy(&rc->rc_tree);
   102: 
   103: 	while ((ref = list_remove_head(&rc->rc_removed)))

number here is 0, but rc->rc_count has the value 0x41200 (!)

The cbuf log

FFFFBC0C0DF74540: dprintf: dnode.c:2470:dnode_diduse_space(): ds=tank obj=338849
3 dn=FFFFBC0DA7F40590 dnp=FFFFBC00A730BA00 used=492636160 delta=266752
FFFFBC0C187F1080: zio.c:2197:zio_deadman_impl(): slow zio[12]: zio=FFFFBC0E30748
1F0x timestamp=1228119375015300 delta=3145200 queued=0 io=1228119375017600 path=
/dev/physicaldrive0 last=1228119019293000 type=2 priority=3 flags=0x184080 stage
=0x200000 pipeline=0x2e00000 pipeline-trace=0x200001 objset=54 object=3388493 le
vel=0 blkid=2180 offset=2566463959040 size=266752 error=0
FFFFBC0C187F1080: zio.c:2236:zio_deadman(): zio_wait restarting hung I/O for pool 'tank'

Which is interesting; I don't think I have ever triggered the slow-IO path before, so that is the best place to look for what goes wrong.

2.2.3rc2-dirty is a bit old; can you move to the latest when convenient?

@JayanWarden

Sure!
I'll update my ZFS version and see if I can reproduce the BSOD again.

lundman commented May 10, 2024

You don't have to chase this particular bug; the memory.dmp you provided showed it isn't something we have already fixed.

I am curious whether, with failmode=wait set instead, it wouldn't crash but the pool would get suspended. Not saying that is better; just curious whether the crash is from the restarting of IO, or whether things have already gone wrong and the "slow IO - restarting" part is not relevant.

@JayanWarden

Indeed, I have set zfs_deadman_failmode=continue.
I encountered pool hangs with the default settings: the pool became unresponsive when trying to write compressible data to it.
I could read files, but writing caused the whole pool to stall and stop responding, crashing Windows Explorer.

Now it seems that with continue it simply BSODs the system instead of making the whole filesystem unresponsive.
Do you think adding a ZIL device could help? Offloading writes to the ZIL first, freeing up the rest of the FS to deal with synchronous writes later? I have an Optane drive available, if ZFS on Windows can use a ZIL on a file-based virtual drive created with fsutil file createnew.
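For what it's worth, on OpenZFS a separate log device (SLOG) only absorbs synchronous writes, so it may not help with a stall caused by heavy compression; but adding one is a one-liner (pool and device names below are placeholders):

```shell
# Hypothetical pool/device names; use the disk identifier your platform reports.
zpool add tank log PHYSICALDRIVE2

# Confirm the log vdev shows up:
zpool status tank
```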

lundman commented May 10, 2024

Yeah, that is interesting. So I think we are looking at:

  1. We lose/miss IO sometimes, for some reason, and this suspends the pool with default settings.
    Probably a Windows issue.

  2. With failmode=continue set to try to work past it, it will BSOD when it tries to restart the missed IO.
    Probably an issue on all platforms.

You are using just datasets or zvols?

@JayanWarden

I am working directly on a zvol; no datasets created.

@JayanWarden

Some more observations:
taskq_batch_pct=50; before a BSOD the FS is still busy, with a load average of about 30-40%.
My best guess would be that ZFS is busy compressing the data because of zstd-19 (silly, I know, but it's available, so why not use it), and maybe we run into deadman timeouts: it tries to restart the IO, but in reality the IO is still alive and well, just compressing.

lundman commented May 10, 2024

Yeah, so with a higher timeout, or deadman disabled, it might not die at all? It's in the registry.

@JayanWarden

I have done some preliminary testing with the deadman timeout set to 5 minutes instead of 1 minute, and failmode=wait instead of continue.
So far I cannot crash my system with the 32GB text file anymore, and compressible data seems to flush to disk just fine.
I will continue observing.
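For anyone following along, the combination described above can be sketched as registry tunables. The deadman tunable names below exist in upstream OpenZFS, but the registry path is an assumption about where the Windows port reads its parameters; verify on your install.

```powershell
# Assumed registry location; verify with regedit on your install.
$key = "HKLM:\SYSTEM\CurrentControlSet\Services\OpenZFS"

# Raise the per-zio deadman timeout to 5 minutes (value in milliseconds) ...
Set-ItemProperty -Path $key -Name "zfs_deadman_ziotime_ms" -Value 300000
# ... and make the deadman wait rather than try to restart the I/O.
Set-ItemProperty -Path $key -Name "zfs_deadman_failmode" -Value "wait"
```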

lundman commented May 13, 2024

OK, that is interesting. I do think we should fix the BSOD when it tries to restart IO, or lob it over to upstream and run away.

@inoperable
@lundman do you have a very recent dirty build somewhere around? The dev VM from MS that I used evaluated itself into oblivion... so I need to reconfigure. Thanks to MS ^.^ it's faster to download https://developer.microsoft.com/en-us/windows/downloads/virtual-machines and, uhm, "refurbish" it than to install VS manually ;>

lundman commented May 14, 2024

If it's the Windows downloadable VM images, you can run slmgr -rearm (up to 5 times) for a further few months. The latest is on GH; I will probably do a dirty build today to have the latest out.

@JayanWarden

I am happy to announce that with the deadman timeout set to an unreasonably high value, my ZFS works like a charm!
No more filesystem hangs or BSODs in the last 3 weeks.

Feel free to close this issue if you consider it resolved.
