compute_halo_properties.py fails on snap8 due to "No space left on device" #30

Open
jemme07 opened this issue May 24, 2023 · 4 comments

Comments

@jemme07

jemme07 commented May 24, 2023

Many runs failed before compute_halo_properties.py finished, with the following message:

mca_fbtl_posix_pwritev: error in (p)write(v):No space left on device 
dynamic_gen2_write_all: fbtl_pwritev failed 

Depending on the point at which the run fails, SOAP produces halo_properties_XXXX.hdf5 files of different sizes, and it is unclear which properties/datasets are incomplete in the final file. The compression step that follows does not fail.

jemme07 changed the title from "s8" to 'compute_halo_properties.py fails on snap8 due to "No space left on device"' May 24, 2023
@jchelly
Collaborator

jchelly commented May 25, 2023

Do you have the stdout from one of these runs? I'd like to know whether SOAP crashed. The worst-case scenario would be that MPI-IO failed to write data without raising an exception.

Assuming that's not the case, we could write the output to a temporary filename and then rename it after everything is written successfully so that the output file is only present if it was fully written.
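
A minimal sketch of the write-then-rename idea, using only the Python standard library. The names `write_then_rename`, `write_func`, and `tmp_suffix` are hypothetical and not part of SOAP. It assumes a failed write raises an exception; it does not help if the failure is silent.

```python
import os


def write_then_rename(final_path, write_func, tmp_suffix=".tmp"):
    """Call write_func(tmp_path); rename to final_path only if it returns normally.

    write_func and tmp_suffix are hypothetical names used for this sketch.
    """
    tmp_path = final_path + tmp_suffix
    try:
        write_func(tmp_path)
    except BaseException:
        # Remove the partial file so nothing incomplete is left behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # The rename only happens after the write finished without an exception.
    os.replace(tmp_path, final_path)
    return final_path
```

In a parallel run, the rename would presumably have to be done by a single rank after all ranks have finished writing; that coordination is left out here.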

@jemme07
Author

jemme07 commented May 25, 2023

Here is an example stdout:

/snap8/scratch/dp004/fkgm22/SOAP/scripts/FLAMINGO/L1000N1800/logs/wrong/halo_properties_L1000N1800_HYDRO_STRONG_AGN.74.out

@MatthieuSchaller
Member

Any update on this?

@jchelly
Collaborator

jchelly commented Aug 7, 2023

I don't think we have a good fix for this. It looks like a collective write fails to write data because the disk is full, but the error code gets lost somewhere before it gets back to Python, so h5py doesn't raise an exception and we can't detect the failure.
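
One partial mitigation, not proposed in the thread, would be to check free space before writing. The sketch below uses `shutil.disk_usage` from the standard library; `has_enough_space`, `required_bytes`, and `safety_factor` are hypothetical names, and the required size would have to be estimated by the caller. The reported figures may not reflect per-user quotas, so this is a rough guard rather than a way to detect the lost error.

```python
import shutil


def has_enough_space(directory, required_bytes, safety_factor=1.1):
    """Return True if the filesystem holding `directory` reports enough free space.

    required_bytes is the caller's estimate of the output size (an assumption);
    safety_factor adds headroom on top of that estimate.
    """
    usage = shutil.disk_usage(directory)
    return usage.free >= required_bytes * safety_factor
```

A job could call this on the output directory before starting and abort early if it returns False, instead of producing a truncated file.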
