
Corrupted files on Betzy #253

Open
adagj opened this issue Apr 26, 2021 · 24 comments
Labels: bug (Something isn't working) · question & help wanted (Extra attention is needed)


@adagj
Contributor

adagj commented Apr 26, 2021

Hi all,
@tto061 @oyvindseland @DirkOlivie @AleksiNummelin @monsieuralok @jgriesfeller @j34ni +++

we are having a discussion with Sigma2 support about corrupted files on Betzy. I recall that some of you also encountered problems with corrupted files when copying from Betzy to NIRD. Is that correct? If so, can you please try to comment on the questions from Sigma2 below, so we can gather all the information.
Thanks!

  • When copying from Betzy to NIRD (or vice versa): for these cases it would be good to know which login node on Betzy was involved.
  • When copying from one location on Betzy to another location on Betzy: is this happening while you're working on a login node, or also within jobs (batch or interactive)? Which nodes were involved?
  • Problems when resubmitting jobs: is this all on the same file system /cluster/...? Where and when do the different file operations happen (are output files archived by a job or outside a job? For the restart, does the copying happen before a job is submitted or inside a job script? Does the occasional corruption happen during archival, or during the copying before the restart?)

Best regards,
Ada

@DirkOlivie
Contributor

I have had corrupted files for two experiments on Betzy. However, in both cases it might also have been related to the compression/archiving script. I noticed that the files were corrupt once they were in the archive folder on Betzy.

  • Around 18-21 March, N1850 experiment, 2x2 degree: one cam.h0 file (standard size around 300 MB) and one cam.h1 file (standard size around 60 MB) were corrupt. These were the only corrupt files in a 200-year-long simulation. I used the blom.hm files, and they all seemed OK.
  • Around 21-23 April, NFHISTnorpddmsbcsdyn (fSST simulation), 1x1 degree: many cam.h0 files (standard size around 1-2 GB) were corrupt in this only two-year-long simulation.

@adagj
Contributor Author

adagj commented Apr 28, 2021

From support @Sigma2 :

Hello all,
many thanks for all the information. After extensive testing on login-2 and a compute node, we have adjusted the settings on the login nodes and have reopened login-2 for use. We hope the new settings prevent data corruption ... time will tell. Please also see the updated ops log entry https://opslog.sigma2.no/2021/04/27/betzy-data-corruption-on-login-2/

Again, many thanks for all the information on short notice, that was very helpful.

Also, we are very sorry for the inconvenience this has caused in the past ... it just seemed not so easy to reproduce. As the ops log entry states, if you experience data corruption, don't hesitate to contact support. It might be good to reference this case.

@adagj added the bug (Something isn't working) and question & help wanted (Extra attention is needed) labels on Apr 28, 2021
@mp586

mp586 commented Jan 19, 2022

Hi,

I am having a related issue with four simulations on Betzy, where several of the cam.h0 files are corrupted in the archive folder. They have file sizes somewhere between 2 MB and 290 MB, where the normal file size is roughly 350 MB. When I try to open these files with ncdump, I just get "ncdump: NHIST_PeffASIA_x2_f19_20211221.cam.h0.1960-11.nc: NetCDF: HDF error".

Some of the corrupted files are the _tmp files, but some are also the already-compressed files (where the _tmp file doesn't exist). In one case I have about 10 corrupt .h0. files (excluding the corrupt _tmp files) out of a 30-year simulation, for which I think I need to rerun the respective years to generate the output again.
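(A quick way to scan an archive directory for unreadable files - a rough sketch, assuming ncdump is on the PATH and treating any file whose header cannot be read as suspect:)

# Print every .nc file whose header ncdump cannot read (likely corrupt)
for f in *.nc; do
    ncdump -h "$f" > /dev/null 2>&1 || echo "possibly corrupt: $f"
done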

I'm running NorESM2-LM with the most recent model tag (2.0.5), and these are my first simulations on Betzy; the ones I did on Fram were completely fine.

Is it best to email Sigma2 about this, or is this an issue related to the archiving script? Is there a way to somehow recover the output from the corrupted files, or do I need to rerun those years?

Any help would be greatly appreciated!
Thank you,
Marianne

@DirkOlivie
Contributor

Hi Marianne, @mp586 @monsieuralok

I have also experienced this over the last two weeks. I think you cannot retrieve the data from the reduced (corrupted) files - the only option is to rerun parts of the experiments. It is the compression in the archiving script that causes the problem.

The problem can often be avoided by setting nthreads=2 (instead of 4) in cime/scripts/Tools/noresm2netcdf4.sh. However, even with nthreads=2, I have still had corrupted files recently.
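(For reference, a sketch of the change - the exact position of the line in noresm2netcdf4.sh may differ:)

# In cime/scripts/Tools/noresm2netcdf4.sh: fewer parallel compression
# threads reduces memory pressure during the netcdf4 conversion
nthreads=2   # was: nthreads=4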

I hope this helps a bit.

Best regards,
Dirk

@AleksiNummelin
Collaborator

Hi all, since late last year I have turned off the archive compression flag in env_run for all my experiments, due to unpredictable data corruption. Sorry for not bringing this up earlier; that was mostly due to the holiday season, and to the fact that it's a bit difficult to pinpoint the exact cause here (other than that somehow memory is not available). This is related to the specific compression script and the compression commands used there, because my own Python-based conversion script still works.

@adagj
Contributor Author

adagj commented Jan 19, 2022

Hi all and thanks for reporting!
I have also had corrupted blom files over the last two weeks. I didn't test changing nthreads, but instead reran the model experiment for the affected years. Maybe you can share your compression script, @AleksiNummelin, if it provides a better method? It is kind of tedious to rerun experiments... and I'll of course report back to Sigma2.

@AleksiNummelin
Collaborator

This is one of those long overdue things, but I have now created a repo, https://github.com/AleksiNummelin/BLOM_utils, which for now just includes a Python file and a shell script that can be used for netcdf4 conversion on Betzy (or one can just run the Python script on NIRD). I've mainly used the compression on atmosphere, ocean, and ice output, but I think it should work on land and runoff output files as well (and it's not a big loss if it doesn't).

I can guarantee that this method will never corrupt files, but it might crash due to memory issues, so depending a bit on the setup and the load on Betzy, one might need to adjust the cpu vs. memory settings. Usually the most efficient usage is to submit one conversion job per component (atm, ice, ocean).

There is also a slight chance that the CMORization will not work; we had some issues with @YanchunHe last year regarding the 0.25 deg ocean output, but it was never clear what the problem really was. I think it might be related to the fact that after the conversion the time variable is not 'unlimited'.

@mp586 my suggestion would be to set the COMPRESS_ARCHIVE_FILES flag to false in env_run.xml, and then use this script for netcdf4 compression (if you are just testing things on Betzy and not moving data over to NIRD, you don't need to do the compression).
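(For reference, a minimal sketch of turning the flag off from the case directory, assuming the standard CIME xmlchange/xmlquery tools:)

cd /path/to/your/case               # the case directory containing env_run.xml
./xmlchange COMPRESS_ARCHIVE_FILES=FALSE
./xmlquery COMPRESS_ARCHIVE_FILES   # verify that the value is now FALSE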

@YanchunHe
Contributor

YanchunHe commented Jan 20, 2022

There is also a slight chance that the CMORization will not work; we had some issues with @YanchunHe last year regarding the 0.25 deg ocean output, but it was never clear what the problem really was.

Yes, I can confirm that the missing time dimension, which should be defined as unlimited, is exactly the reason that the cmorisation fails for this 1/4-degree ocean output. The cmor tool relies heavily on the time information to decide how to concatenate the variables from multiple files into a single file.

A quick fix can be to convert the time dimension from fixed to unlimited with NCO:

ncks --mk_rec_dmn time input.nc output.nc

@mp586

mp586 commented Jan 20, 2022

Hi all, thank you so much for sharing those helpful tips on how to avoid the file corruption! I will give it a go!
@YanchunHe with converting the time dimension, do you mean the non-compressed files before running @AleksiNummelin 's scripts? Unfortunately I do need to move the data over to NIRD, so I need to get the compression to work.
Thank you so much again for all the responses!

@YanchunHe
Contributor

@YanchunHe with converting the time dimension, do you mean the non-compressed files before running

I mean that after you have compressed the file (e.g. with Aleksi's script) and find that the 'time' dimension is now a normal fixed dimension, you can convert it to a real unlimited time dimension with the above 'ncks' command, and then transfer to NIRD. But if you don't need an 'unlimited' time dimension, you don't need to convert.
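(To check whether a file still has a fixed time dimension - a sketch, assuming the netCDF ncdump utility; an unlimited dimension is marked UNLIMITED in the header:)

# Prints e.g. "time = UNLIMITED ; // (12 currently)" if time is unlimited,
# and nothing if all dimensions are fixed
ncdump -h output.nc | grep -i unlimited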

@stefan-hofer

Just wanted to report here that I had the same problem on Betzy with my runs - really quite annoying. The same compression setup worked a few months ago, and now, completely unchanged, it corrupts some random files. I will also try @AleksiNummelin 's approach.

@mp586

mp586 commented Feb 2, 2022

Hi,

Thank you very much for all the help so far! I ran @AleksiNummelin 's script and it compressed the data, but then I stumbled into a similar concatenation issue as @YanchunHe described, only for the cam data. So I ran the ncks command, but I still get this error:

ValueError: Every dimension needs a coordinate for inferring concatenation order

Does anyone happen to be familiar with this issue and know how to fix it?

Thank you once more for your help!
Best
Marianne

@adagj
Contributor Author

adagj commented Feb 2, 2022

Hi, there are a lot of issues with corrupted files on Betzy at the moment. We are working on it, but right now it is probably best to set COMPRESS_ARCHIVE_FILES = FALSE in env_run.xml and use Aleksi's script.

@mp586 did you try to run the command on each single file, or on all at once? I think you need to loop through and set time to UNLIMITED on each file separately... (see the sketch below)
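(A hedged sketch, assuming bash and that the converted files sit in the current directory; it writes the fixed files to a subdirectory instead of overwriting anything in place:)

mkdir -p fixed
for f in *.nc; do
    # make 'time' the record (UNLIMITED) dimension in each file separately
    ncks --mk_rec_dmn time "$f" "fixed/$f"
done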

@AleksiNummelin
Collaborator

AleksiNummelin commented Feb 3, 2022

Hi @mp586, can you confirm: (1) do all your files have an 'unlimited' time axis? (2) Does your extraction work with existing data that we know is fine (e.g. see the paths to the piControl experiments at https://noresmhub.github.io/noresm-exp/noresm2_deck/noresm2_mm_piC.html) but not with your data, even when the time axis is 'unlimited'?

@AleksiNummelin
Collaborator

Also @mp586, as a quick fix I can provide some Python code to do the concatenation/extraction that will work even if NCL fails...

@AleksiNummelin
Collaborator

AleksiNummelin commented Feb 3, 2022

Hi again, I've now pushed a modification to the compression script at https://github.com/AleksiNummelin/BLOM_utils that sets time to be unlimited (not sure why I didn't do this before; my guess is that it was not an option in xarray at the time). I also added a Python script (and a shell script to submit it on Betzy) to fix files converted with the old version - it basically just reads in the file and sets the time dimension to be unlimited.

@adagj
Contributor Author

adagj commented Feb 3, 2022

Great! Thanks @AleksiNummelin

@mp586

mp586 commented Feb 3, 2022

Hi all,

Thanks again for the quick responses and the new script! I will try that now.
@AleksiNummelin : I just confirmed that (1) the time dimension has been set to unlimited, and (2) the operations work on previous runs which I ran on Fram.
@adagj : I looped through individual files

Thank you again for your help :)

@AleksiNummelin
Collaborator

I see, @mp586, then it sounds like something else is going on (and the new script fixing the time dimension probably won't help). Can you post the command you are using here again (or send it by email)?

@mp586

mp586 commented Feb 3, 2022

@AleksiNummelin the problem seems to be with using xr.open_mfdataset('*.nc'); weirdly, it is not an issue with all files, but only with some! I am now wondering whether it is not an issue with the compression script that you sent, but rather that those files were already corrupted beforehand. I only reran the single years of my simulations where I could not open the files, but then ran the compression script over all years. The ones that I reran seem to be fine as far as I can tell, so maybe the best thing to do is just to rerun the entire simulation and then run your compression script. Sorry, this is a bit confusing.

@AleksiNummelin
Collaborator

Hmm, okay, that sounds a bit peculiar. It is possible that although the files look fine, they are still missing something. Could you share the data location here and point to one file that you think works and one that doesn't? I can then try to have a quick look to see if I spot anything obvious.
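(One quick way to spot obvious differences between the good and the bad files - a sketch, assuming ncdump and the cam.h0 naming used above:)

# Print each file's time dimension; files that break the concatenation often
# show a fixed (non-UNLIMITED) time dimension or an unexpected size here
for f in *.cam.h0.*.nc; do
    printf '%s: ' "$f"
    ncdump -h "$f" | grep -m 1 'time = '
done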

@AleksiNummelin
Collaborator

Hi all, there is one more issue: currently the diagnostics fail with the converted files, and @YanchunHe pointed out that this is because of the use of NaNs as a fill value (NCO can't handle NaNs). I will try to have a look at this soon and commit an updated conversion script.

@AleksiNummelin
Collaborator

I have updated netcdf3to4.py to take care of the FillValues as well now. Essentially, if the original data has a FillValue, the converted file will inherit it; otherwise it is kept without a FillValue (the previous behavior was to always add NaN as the FillValue). I also added a fix_FillValue.py script to fix both the FillValue and the unlimited time dimension if one ran the previous version of netcdf3to4.py. This works, but it doesn't do exactly the same as netcdf3to4.py; rather, it adds one FillValue depending on the component (sometimes it depends a bit on the variable in the original data, but that doesn't really matter).
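(To check which fill value a converted file actually carries - a sketch, assuming ncdump; NaN fill values from the old script version typically show up as NaNf in the header:)

ncdump -h converted.nc | grep _FillValue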

@monsieuralok

Hi,
Regarding corruption of files on Betzy: first, change #SBATCH --mem-per-cpu=16GB to #SBATCH --mem-per-cpu=30GB in case.st_archive. Also, change nthreads to 4 in noresm2netcdf4.sh. This should work for the 1-degree ocean and atmosphere. But if you are using the quarter-degree ocean or atmosphere, I would for now turn off compression by setting COMPRESS_ARCHIVE_FILES to FALSE in env_run.xml. This should solve most of the problems. We are testing a few other options and will update you.
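(For reference, the relevant batch directive - a sketch; the rest of case.st_archive is unchanged:)

# In case.st_archive, raise the memory available to the archiving/compression step:
#SBATCH --mem-per-cpu=30GB    # was: #SBATCH --mem-per-cpu=16GB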
