KeyError when training streetsurf on seg100613 #46

Open
amoghskanda opened this issue Mar 13, 2024 · 22 comments

@amoghskanda

Firstly, great work and thanks for making it open-source. I set up everything following the README for both streetsurf and nr3d. I wanted to use the withmask_nolidar.240219.yaml config file, and made the path and sequence changes to use seg100613 (quick-download from the StreetSurf repo). The scenario.pt file is incomplete: waymo_dataset.py accesses frame_timestamps (line 406), which is not a valid key in the scenario dictionary. There's another KeyError at line 506 of waymo_dataset.py: no global_timestamps key in the scenario['observers']['ego_car']['data'] dictionary. Can you share the complete scenario.pt file, or the zip file for the segment-13476374534576730229_240_000_260_000_with_camera_labels sequence?
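A quick way to check which keys the generated scenario.pt actually contains is to load it directly. A minimal sketch; the path is a placeholder, and the nested layout is assumed from the error messages above:

    import torch

    # Load the preprocessed scenario dict on the CPU (path is an example placeholder)
    scenario = torch.load("data/waymo/seg100613/scenario.pt", map_location="cpu")

    # 'frame_timestamps' is expected under scenario['metas'] by waymo_dataset.py
    print(scenario.keys())
    print(scenario["metas"].keys())

    # 'global_timestamps' is expected under the ego_car observer's data
    print(scenario["observers"]["ego_car"]["data"].keys())

If the keys are missing here, the generated scenario.pt itself is out of date (see the regeneration fix below).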

@zzzxxxttt

zzzxxxttt commented Mar 16, 2024

I encountered the same issue; it was solved after checking out the latest commit (faba099) and re-generating the data.

@zzzxxxttt

zzzxxxttt commented Mar 16, 2024

By the way, if anyone encounters TypeError: __init__() takes 1 positional argument but 2 were given, just replace @torch.no_grad with with torch.no_grad(): in nr3d_lib/models/fields/nerf/lotd_nerf.py (the bare @torch.no_grad decorator passes the method into no_grad's constructor on some older PyTorch versions, hence the TypeError):

    # @torch.no_grad
    def query_density(self, x: torch.Tensor):
        with torch.no_grad():
            # NOTE: x must be in range [-1,1]
            ...

@amoghskanda
Author

@zzzxxxttt thank you for the reply. The KeyError persists: the problem is with the scenario.pt file, as scenario['metas'] has no key named 'frame_timestamps'. Can you upload your scenario.pt file? This is for seg100613.

@zzzxxxttt

zzzxxxttt commented Mar 20, 2024

@amoghskanda sure, here it is
scenario.zip

[screenshot]

@amoghskanda
Author

Thank you for the scenario.pt file.
@zzzxxxttt did you face the below error?

TypeError: __init__() got an unexpected keyword argument 'fn_type'
At line 183 of train.py, MonoDepthLoss is constructed with parameters that are missing from the class's __init__, defined in app/loss/mono.py (class MonoDepthLoss).
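A workaround sketch, assuming the config simply forwards keyword arguments (such as fn_type) that MonoDepthLoss does not declare: accept them in __init__ or swallow them with **kwargs. The signature below is illustrative only, not the actual one from app/loss/mono.py:

    class MonoDepthLoss:
        # Illustrative signature: accepting 'fn_type' (and absorbing any other
        # config-driven kwargs) avoids the "unexpected keyword argument" TypeError
        # raised from the construction at train.py line 183.
        def __init__(self, w=1.0, fn_type=None, **kwargs):
            self.w = w
            self.fn_type = fn_type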

@amoghskanda
Author

amoghskanda commented Mar 21, 2024

I made some changes to mono.py, used MonoSDFDepthLoss instead, and that somewhat fixed it. Now I'm getting
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). This happens because the cache is loaded on the CPU and everything else on the GPU (cuda:0). Is there a fix for this? I preloaded the cache onto the GPU (RTX 3090), but then it runs out of memory. After reducing n_frames in withmask_nolidar.240219.yaml for segment-100613 from 163 to 30, I was able to load the camera cache onto the GPU, but then I run into RuntimeError: The size of tensor a (65536) must match the size of tensor b (256) at non-singleton dimension 1. What was the batch size when you trained?
@ventusff @zzzxxxttt
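For the first RuntimeError, a minimal self-contained sketch of the usual fix (requires a CUDA device; the shapes and names are illustrative, not the actual dataloader variables): index the CPU cache with CPU indices, and only move the gathered slice to the GPU.

    import torch

    cache = torch.zeros(30, 64, 64, 3)                   # CPU-resident image cache (illustrative shape)
    frame_ind = torch.tensor([0, 1, 2], device="cuda")   # indices created on the GPU

    # cache[frame_ind] would raise:
    #   RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
    # Fix: move the indices to the cache's device, then move only the result to the GPU.
    pixels = cache[frame_ind.to(cache.device)].to("cuda")
    print(pixels.shape, pixels.device)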

@zzzxxxttt

> Thank you for the scenario.pt file. @zzzxxxttt did you face the below error?
>
> init() got an unexpected keyword argument 'fn_type' Line 183, train.py, MonoDepthLoss takes different parameters which are missing in the init of the class, defined in app/loss/mono.py class MonoDepthLoss

No, I didn't meet this error.

@zzzxxxttt

> I made some changes to mono.py and used MonoSDFDepthLoss and somewhat fixed it. I'm getting a RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). This is because the cache is loaded on the cpu and everything else on gpu(cuda:0). Is there a fix to this? I preloaded cache onto gpu(RTX3090) but then it runs out of memory. I reduced n_frames in withmask_nolidar.240219.yaml for segment-100613 from 163 to 30, able to load cache camera onto gpu, I run into RuntimeError: The size of tensor a (65536) must match the size of tensor b (256) at non-singleton dimension 1. What was the batchsize when you trained? @ventusff @zzzxxxttt

I also use withmask_nolidar.240219.yaml and only modified the data location; I can train it on my 12 GB RTX 3060 without error.

@amoghskanda
Author

amoghskanda commented Mar 21, 2024

So your data is loaded into the cache, right? And you did not make any changes to which device the data and model are loaded onto? I have an RTX 3090 and the data is loaded onto the CPU, and I run into
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
preload_on_gpu is false in the withmask yaml (by default), and I did not make any changes to the device placement.

@sonnefred

> I made some changes to mono.py and used MonoSDFDepthLoss and somewhat fixed it. I'm getting a RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). This is because the cache is loaded on the cpu and everything else on gpu(cuda:0). Is there a fix to this? I preloaded cache onto gpu(RTX3090) but then it runs out of memory. I reduced n_frames in withmask_nolidar.240219.yaml for segment-100613 from 163 to 30, able to load cache camera onto gpu, I run into RuntimeError: The size of tensor a (65536) must match the size of tensor b (256) at non-singleton dimension 1. What was the batchsize when you trained? @ventusff @zzzxxxttt
>
> I also use withmask_nolidar.240219.yaml and only modified the data location, I can train it on my 12G memory RTX3060 without error.

Hi, I also tried to use withmask_nolidar.240219.yaml, but got an error when loading the images to build the ImagePatchDataset. Have you met this error, and how did you solve it? Thanks!
[screenshot]

@amoghskanda
Author

Yes, I removed **kwargs from the call to get_frame_weights_uniform() at line 66 of dataloader/sampler.py, because that function, defined later in the file, takes only two arguments:

    frame_weights = get_frame_weights_uniform(scene_loader, scene_weights)
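An alternative with the same effect, if you would rather not touch the call site: make the function itself tolerant of extra keyword arguments. A minimal sketch, assuming the real get_frame_weights_uniform takes only the two positional arguments shown above (its body is elided here):

    def get_frame_weights_uniform(scene_loader, scene_weights, **kwargs):
        # **kwargs absorbs any extra options forwarded by the sampler, so a call like
        # get_frame_weights_uniform(scene_loader, scene_weights, **kwargs) no longer
        # raises a TypeError about unexpected arguments.
        ...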

@sonnefred

> yes, I removed **kwargs as an argument when calling get_frame_weights_uniform(), Line 66 dataloader/sampler.py because that function, defined later, takes only 2 arguments.
>
> frame_weights = get_frame_weights_uniform(scene_loader, scene_weights)

Thank you for the reply. Now I'm getting a new error like this, have you met this before?
[screenshot]

@amoghskanda
Author

amoghskanda commented Mar 21, 2024

Yes. I tried caching on the GPU instead of the CPU and changed n_frames in the config file from 163 to 30 for seg100613, and encountered the above error. When I reverted to the default settings (cache on CPU, 163 frames), I ran into #51.

@amoghskanda
Author

> I made some changes to mono.py and used MonoSDFDepthLoss and somewhat fixed it. I'm getting a RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). This is because the cache is loaded on the cpu and everything else on gpu(cuda:0). Is there a fix to this? I preloaded cache onto gpu(RTX3090) but then it runs out of memory. I reduced n_frames in withmask_nolidar.240219.yaml for segment-100613 from 163 to 30, able to load cache camera onto gpu, I run into RuntimeError: The size of tensor a (65536) must match the size of tensor b (256) at non-singleton dimension 1. What was the batchsize when you trained? @ventusff @zzzxxxttt
>
> I also use withmask_nolidar.240219.yaml and only modified the data location, I can train it on my 12G memory RTX3060 without error.

The cache is on the CPU, right? The tensors frame_ind, h, w are on the CPU as well, and _ret_image_raw is on the CPU too. Not sure why I'm facing #51.
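A small debugging sketch for narrowing this down: drop it right before the failing indexing line and print where each tensor actually lives (the variable names follow the ones mentioned above; that they are all in scope at that point, and that torch is already imported, are assumptions):

    # Debug aid: report device/dtype/shape of every tensor involved in the indexing.
    for name, t in [("frame_ind", frame_ind), ("h", h), ("w", w),
                    ("_ret_image_raw", _ret_image_raw)]:
        if torch.is_tensor(t):
            print(f"{name}: device={t.device}, dtype={t.dtype}, shape={tuple(t.shape)}")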

@sonnefred

> yes. I tried caching on gpu instead of cpu and changed the value of n_frames in the configs file from 163 to 30, for seg-10061, encountered the above error. When I reverted it to default settings(cache on cpu and 163), ran into #51

Ok, have you solved the problem?

@amoghskanda
Author

Not yet, working on it. Try training without changing n_frames in the config file, and let me know if you run into the same issue as me.

@sonnefred

> not yet, on it. Try training without changing the size of n_frames from the config file. Lmk if you run into the same issue as me

Sorry, I'm trying to run code_multi but got an error like this, have you met this before?
[screenshot]

@amoghskanda
Author

@sonnefred I used another config (with mask, with lidar) and was able to train and render as well.

@amoghskanda
Author

@zzzxxxttt did you try rendering nvs with different nvs paths like spherical_spiral or small_circle?

@sonnefred

> @sonnefred , I used another config(with mask with lidar) and was able to train and render as well

OK, thank you, but I'd like to use monodepth supervision, so I'm still working on it...

@sonnefred

> I made some changes to mono.py and used MonoSDFDepthLoss and somewhat fixed it. I'm getting a RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). This is because the cache is loaded on the cpu and everything else on gpu(cuda:0). Is there a fix to this? I preloaded cache onto gpu(RTX3090) but then it runs out of memory. I reduced n_frames in withmask_nolidar.240219.yaml for segment-100613 from 163 to 30, able to load cache camera onto gpu, I run into RuntimeError: The size of tensor a (65536) must match the size of tensor b (256) at non-singleton dimension 1. What was the batchsize when you trained? @ventusff @zzzxxxttt
>
> I also use withmask_nolidar.240219.yaml and only modified the data location, I can train it on my 12G memory RTX3060 without error.

@zzzxxxttt Hi, how did you run this experiment successfully? I still get a CUDA error when using this yaml ... Could you give any help? Thanks.

@lhp121

lhp121 commented Jun 11, 2024

    2024-06-11 19:16:01,146-rk0-train.py#959:=> Start loading data, for experiment: logs/streetsurf/seg100613.nomask_withlidar_exp1
    2024-06-11 19:16:01,146-rk0-base.py#88:=> Caching data to device=cpu...
    2024-06-11 19:16:01,146-rk0-base.py#95:=> Caching camera data...
    Caching cameras...: 0%| | 0/3 [00:00<?, ?it/s]
    Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

Has anyone encountered this error before, and how can I adjust the parameters to make it run on my GTX 1660 Ti graphics card?
[screenshot]
