Error during training on 3DMatch dataset #50

Open
benjaminkelenyi opened this issue Feb 7, 2022 · 0 comments

Hello, thank you very much for this nice work.
I'm trying to train a model using the 3DMatch dataset, but after a while, I'm getting the following error:

[1059  530   38 ...  631  144  924]
Validation : 0.0% (timings : 58.95 0.00)
2022-02-07 16:05:30.380600: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_NOT_SUPPORTED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(3935): 'cudnnBatchNormalizationForwardInference( cudnn.handle(), mode, &one, &zero, x_descriptor.handle(), x.opaque(), x_descriptor.handle(), y->opaque(), scale_offset_descriptor.handle(), scale.opaque(), offset.opaque(), estimated_mean.opaque(), maybe_inv_var, epsilon)'
Traceback (most recent call last):
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
  (1) Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[optimizer/gradients/KernelPointNetwork/Sum_1_grad/Fill/value/_571]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'IteratorGetNext':
  File "training_3DMatch.py", line 175, in <module>
    dataset.init_input_pipeline(config)
  File "/home/rambo/ws_benji/D3Feat/datasets/common.py", line 770, in init_input_pipeline
    self.flat_inputs = iter.get_next()
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 429, in get_next
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[{{node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1}}]]
         [[loss/cdist/Sqrt/_1141]]
  (1) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[{{node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "training_3DMatch.py", line 207, in <module>
    trainer.train(model, dataset)
  File "/home/rambo/ws_benji/D3Feat/utils/trainer.py", line 387, in train
    self.validation(model, dataset)
  File "/home/rambo/ws_benji/D3Feat/utils/trainer.py", line 441, in validation
    desc_loss, det_loss, accuracy, ave_d_pos, ave_d_neg, dists, scores, anc_key, pos_key = self.sess.run(ops, {model.dropout_prob: 1.0})
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1 (defined at /home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[loss/cdist/Sqrt/_1141]]
  (1) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1 (defined at /home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1':
  File "training_3DMatch.py", line 189, in <module>
    model = KernelPointFCNN(dataset.flat_inputs, config)
  File "/home/rambo/ws_benji/D3Feat/models/KPFCNN_model.py", line 130, in __init__
    self.out_features, self.out_scores = assemble_FCNN_blocks(self.anchor_inputs, self.config, self.dropout_prob)
  File "/home/rambo/ws_benji/D3Feat/models/D3Feat.py", line 15, in assemble_FCNN_blocks
    F = assemble_CNN_blocks(inputs, config, dropout_prob)
  File "/home/rambo/ws_benji/D3Feat/models/network_blocks.py", line 1099, in assemble_CNN_blocks
    training)
  File "/home/rambo/ws_benji/D3Feat/models/network_blocks.py", line 242, in simple_block
    training))
  File "/home/rambo/ws_benji/D3Feat/models/network_blocks.py", line 160, in batch_norm
    training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/layers/normalization.py", line 327, in batch_normalization
    return layer.apply(inputs, training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1700, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/layers/normalization.py", line 167, in call
    return super(BatchNormalization, self).call(inputs, training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/normalization.py", line 710, in call
    outputs = self._fused_batch_norm(inputs, training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/normalization.py", line 565, in _fused_batch_norm
    training, _fused_batch_norm_training, _fused_batch_norm_inference)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/utils/tf_utils.py", line 59, in smart_cond
    pred, true_fn=true_fn, false_fn=false_fn, name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/smart_cond.py", line 59, in smart_cond
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1235, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/normalization.py", line 562, in _fused_batch_norm_inference
    data_format=data_format)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_impl.py", line 1502, in fused_batch_norm
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 4620, in fused_batch_norm_v3
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2022-02-07 16:05:32.189426: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.

The error comes from this section of code:
[screenshot of the code section; image not preserved]

Do you have any idea why this is happening?

Thanks a lot,
Benjamin
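
In case it helps with triage, here is a minimal sketch for pulling the distinct root errors and their failing nodes out of a long log like the one above. The helper name `summarize_root_errors` and both regular expressions are my own assumptions; this is not part of D3Feat or TensorFlow, it only scans the text of the log.

```python
import re

# Hypothetical helper (not a D3Feat or TensorFlow API): it only parses log text.
# Matches lines such as "  (0) Internal: cuDNN launch failure : input shape (...)".
ROOT_ERROR = re.compile(r"^\s*\((\d+)\)\s+(.+?):\s+(.*)$")
# Matches "[[{{node NAME}}]]" as well as "[[node NAME (defined at ...) ]]".
NODE = re.compile(r"\[\[\s*(?:\{\{)?node\s+([^\s}\]]+)")


def summarize_root_errors(log_text):
    """Return unique (status, message, node) triples in order of appearance."""
    results = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = ROOT_ERROR.match(line)
        if not match:
            continue
        status, message = match.group(2), match.group(3)
        node = None
        # The failing node, when present, is printed on the line that follows.
        if i + 1 < len(lines):
            node_match = NODE.search(lines[i + 1])
            if node_match:
                node = node_match.group(1)
        entry = (status, message, node)
        if entry not in results:
            results.append(entry)
    return results
```

Run on the log above, this should reduce the output to two distinct failures: `Out of range: End of sequence` at `IteratorGetNext`, and `Internal: cuDNN launch failure` at the `FusedBatchNormV3_1` node with input shape `[68418,64,1,1]`. It says nothing about the cause; it only makes the repeated traceback easier to read.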
