Unable to synchronously open object (object 'nu' doesn't exist) #56

Open
qwerfdsadad opened this issue Mar 6, 2024 · 2 comments

qwerfdsadad commented Mar 6, 2024

Hello!
Your work has been a strong support for innovation in machine-learning simulation, and I thank you for your contribution. I have recently been studying your project code.

I generated the reaction-diffusion dataset in the /data_gen_NLE/ReactionDiffusionEq/ folder, then ran run_forward_1D.sh in the /pdebench/models/ folder to train the network. The command was:

CUDA_VISIBLE_DEVICES='0' python3 train_models_forward.py +args=config_ReacDiff.yaml ++args.filename='ReacDiff_Nu1.0_Rho1.0.hdf5' ++args.model_name='FNO'

Then, I encountered this bug.
image

Similarly, I generated the Burgers dataset in the /data_gen_NLE/BurgersEq/ folder and then trained the network with the command

 CUDA_VISIBLE_DEVICES='2,3' python3 train_models_forward.py +args=config_Bgs.yaml ++args.filename='1D_Burgers_Sols_Nu1.0.hdf5' ++args.model_name='FNO'

and encountered a similar bug.
image

However, when I tested with the dataset downloaded via the /pdebench/data_download/ directory, the program ran successfully.

I wonder whether the problem is in the HDF5 file. I used HDFView to check the data format.

image
I found that the t-axis coordinate has 202 points (from 0 to 2.01) and the x-axis has 1024 points (from 0 to 1), but the tensor is stored in a 2*5000 format.

The config file used to generate the 1D_Burgers_Sols_Nu1.0.hdf5 file is:
image

qwerfdsadad changed the title from "the meaning of 'args' and 'multi' directory" to "directory" on Mar 7, 2024
qwerfdsadad changed the title from "directory" to "run_forward_1D.sh with Burgers" on Mar 7, 2024
qwerfdsadad changed the title from "run_forward_1D.sh with Burgers" to "run_forward_1D.sh with different PDE datasets" on Mar 7, 2024
qwerfdsadad changed the title from "run_forward_1D.sh with different PDE datasets" to "Unable to synchronously open object (object 'nu' doesn't exist)" on Mar 8, 2024
mtakamoto-D (Collaborator) commented

Hi. Thank you for your kind report.
This could originate from pmap, which splits the batch dimension from (N_b, ...) into (N_GPU, N_b/N_GPU, ...).
Please try reshaping the batch dimension of the resulting file from the latter form back to the original batch size.
In addition, our forward script does not support multi-GPU training, so please use only one GPU for training.

qwerfdsadad (Author) commented

Thanks for your reply.

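    # pmap over devices (axis 'i'), vmap over the samples on each device (axis 'j')
    vm_evolve = jax.pmap(jax.vmap(evolve, axis_name='j'), axis_name='i')
    local_devices = jax.local_device_count()
    # split the batch into (n_devices, n_samples_per_device, ...) for pmap
    uu = vm_evolve(u.reshape([local_devices, cfg.multi.numbers // local_devices, -1]))
    # merge the device axis back into a single batch axis before saving
    save_dim = [cfg.multi.numbers] + list(uu.shape[-2:])
    uu_reshape = uu.reshape(save_dim)
    jnp.save(cwd + cfg.multi.save + '1D_Advection_Sols_beta' + str(beta)[:5], uu_reshape)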

This is my solution. For the 1D Advection dataset, I created a new variable, uu_reshape, that restores the original batch dimension of uu. However, this does not generalize across dimensions and datasets, because save_dim has to be set separately for each one.
Is there a unified approach?
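
One dimension-agnostic possibility is to merge only the first two axes and leave every trailing axis untouched. This is a minimal sketch, not the project's method. It assumes the pmap output always has the leading layout (n_devices, n_samples_per_device, ...). The helper name merge_device_axis is hypothetical, and NumPy arrays stand in for the JAX arrays here; the same expression should also apply to a jax.numpy array, since it exposes .shape and .reshape in the same way.

```python
import numpy as np


def merge_device_axis(arr):
    """Collapse leading (n_devices, n_per_device) axes into one batch axis.

    Hypothetical helper: assumes the first two axes come from the pmap split.
    All remaining axes (time, space, channels, ...) are kept as they are.
    """
    if arr.ndim < 2:
        raise ValueError("expected leading (n_devices, n_per_device) axes")
    return arr.reshape((-1,) + arr.shape[2:])


# Stand-in for a 1D dataset: (devices, samples per device, nt, nx)
uu_1d = np.zeros((2, 5, 4, 8))
print(merge_device_axis(uu_1d).shape)

# Stand-in for a 2D dataset: (devices, samples per device, nt, nx, ny)
uu_2d = np.zeros((2, 5, 4, 8, 8))
print(merge_device_axis(uu_2d).shape)
```

Because the reshape uses row-major order, the merged batch index follows the same sample order as the original u.reshape([local_devices, numbers // local_devices, -1]) split, so no save_dim has to be written per dataset.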
