Segmentation fault on multiple calls to conv2d. #28

Closed
arjunchitturi opened this issue Jun 26, 2020 · 15 comments
@arjunchitturi

arjunchitturi commented Jun 26, 2020

When trying to build a predict fn involving a conv2d (similar to cnn_mnist), I see segmentation faults after around 21 calls.
Is this a bug, or is there a better way to make predictions?

Minimum code to reproduce:

use autograd as ag; 
use ag::ndarray_ext as array;
use ag::tensor::Variable;

fn main() {
    let rng = array::ArrayRng::<f32>::default();
    let w1_arr = array::into_shared(rng.random_normal(&[16, 1, 3, 3], 0., 0.5));
    let b1_arr = array::into_shared(array::zeros(&[1, 16, 8, 8]));

    ag::with(|g| {
        let rng1 = array::ArrayRng::<f32>::default();
        let w1 = g.variable(w1_arr.clone());

        let b1 = g.variable(b1_arr.clone());
        for i in 0..100 {
            println!("Calling pred: {}", i);
            let x = g.variable(rng1.glorot_uniform(&[8, 8]));
            println!("Input value: {:?}", x.eval(&[]));
            let _ = g.conv2d(x, w1, 1, 1) + b1;
        }
    })
}
@raskr
Owner

raskr commented Jun 26, 2020

@arjunc77 I can't reproduce that... Could you paste the stack trace?

@arjunchitturi
Author

RUST_BACKTRACE=full cargo run --example predict
Calling pred: 0
Calling pred: 1
Calling pred: 2
Calling pred: 3
Calling pred: 4
Calling pred: 5
Calling pred: 6
Calling pred: 7
Calling pred: 8
Calling pred: 9
Calling pred: 10
Calling pred: 11
Calling pred: 12
Calling pred: 13
Calling pred: 14
Calling pred: 15
Calling pred: 16
Calling pred: 17
Calling pred: 18
Calling pred: 19
Calling pred: 20
Calling pred: 21
Segmentation fault

@arjunchitturi
Author

arjunchitturi commented Jun 26, 2020

NOTE: The code above works when I run it from inside the rust-autograd examples path. But when I run the same code from another crate with autograd 1.0.0 installed, I hit the segfault shown above.

@arjunchitturi
Author

arjunchitturi commented Jun 26, 2020

@raskr, I have a better debug trace from gdb. Please find it below:

Calling pred: 0
Calling pred: 1
Calling pred: 2
Calling pred: 3
Calling pred: 4
Calling pred: 5
Calling pred: 6
Calling pred: 7
Calling pred: 8
Calling pred: 9
Calling pred: 10
Calling pred: 11
Calling pred: 12
Calling pred: 13
Calling pred: 14
Calling pred: 15
Calling pred: 16
Calling pred: 17
Calling pred: 18
Calling pred: 19
Calling pred: 20
Calling pred: 21

Thread 1 "predtest" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fe6a00 (LWP 8769)]
0x00005555555e39db in autograd::tensor::TensorBuilder<T>::build::{{closure}} (a=0x7fffffffd888) at /home/myname/.cargo/registry/src/github.com-1ecc6299db9ec823/autograd-1.0.0/src/tensor.rs:768
768	                .map(|a| a.get(graph).inner().top_rank)

@quietlychris

@raskr @arjunc77 I have also reproduced this issue, on Ubuntu 18.04 using rustc 1.45.0-nightly. The Cargo.toml file is simply using autograd = { version = "1.0.0"}, with and without the mkl flag, so there shouldn't be any versioning issues there. I'm happy to help test any fixes or versions moving forward as well.

My gdb backtrace is the following:

Core was generated by `target/debug/check_issue_28'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055e96339899b in autograd::tensor::TensorBuilder<T>::build::{{closure}} (a=0x7ffd9daa0598)
    at /home/chrism/.cargo/registry/src/github.com-1ecc6299db9ec823/autograd-1.0.0/src/tensor.rs:768
768	                .map(|a| a.get(graph).inner().top_rank)
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/chrism/rust-projects/autograd-projects/check_issue_28/target/debug/check_issue_28.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) bt
#0  0x000055e96339899b in autograd::tensor::TensorBuilder<T>::build::{{closure}} (a=0x7ffd9daa0598)
    at /home/chrism/.cargo/registry/src/github.com-1ecc6299db9ec823/autograd-1.0.0/src/tensor.rs:768
#1  0x000055e9633b97bc in core::iter::adapters::map_fold::{{closure}} (acc=0, elt=0x7ffd9daa0598)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/adapters/mod.rs:785
#2  0x000055e96335e78f in core::iter::traits::iterator::Iterator::fold::ok::{{closure}} (acc=0, x=0x7ffd9daa0598)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/traits/iterator.rs:2002
#3  0x000055e96336fab9 in core::iter::traits::iterator::Iterator::try_fold (self=0x7ffd9da9fc70, init=0, f=...)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/traits/iterator.rs:1878
#4  0x000055e96336bb8b in core::iter::traits::iterator::Iterator::fold (self=..., init=0, f=...)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/traits/iterator.rs:2005
#5  0x000055e9633ba4d6 in <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold (self=..., init=0, g=...)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/adapters/mod.rs:825
#6  0x000055e9633b6672 in core::iter::traits::iterator::Iterator::fold_first (self=..., f=...)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/traits/iterator.rs:2042
#7  0x000055e9633b7d85 in core::iter::traits::iterator::Iterator::max_by (self=..., compare=0x7ffd9daa0588)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/traits/iterator.rs:2537
#8  0x000055e9633b7608 in core::iter::traits::iterator::Iterator::max (self=...)
    at /home/chrism/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/iter/traits/iterator.rs:2445
#9  0x000055e9633903d1 in autograd::tensor::TensorBuilder<T>::build (self=..., graph=0x7ffd9daa0b10, op=...)
    at /home/chrism/.cargo/registry/src/github.com-1ecc6299db9ec823/autograd-1.0.0/src/tensor.rs:766
#10 0x000055e9633e7ec5 in autograd::ops::<impl autograd::graph::Graph<F>>::conv2d (self=0x7ffd9daa0b10, x=..., w=..., pad=1, stride=1)
    at /home/chrism/.cargo/registry/src/github.com-1ecc6299db9ec823/autograd-1.0.0/src/ops/mod.rs:2378
#11 0x000055e9633a944a in check_issue_28::main::{{closure}} (g=0x7ffd9daa0b10) at src/main.rs:19
#12 0x000055e9633e820e in autograd::graph::with (f=...)
    at /home/chrism/.cargo/registry/src/github.com-1ecc6299db9ec823/autograd-1.0.0/src/graph.rs:98
#13 0x000055e963348d79 in check_issue_28::main () at src/main.rs:10

@arjunchitturi
Author

arjunchitturi commented Jun 26, 2020

@quietlychris, thanks for sharing.

My OS is Ubuntu 18.04 as well.
rustc 1.44.1 (c7087fe00 2020-06-17).

Running the example within the rust-autograd repo works fine. I believe lib.rs loads the Tensor or graph in a way that prevents this issue.

@arjunchitturi
Author

I'm not able to replicate the issue on macOS.
@quietlychris, can you verify whether your machine is bare metal or runs on a hypervisor?

@quietlychris

I'm running Ubuntu 18.04 on bare-metal, on a Lenovo x250, which is an x86_64 platform running an Intel i5-5300U.

For what it's worth, I ran this program through valgrind to see if there was a memory leak associated with it ($ valgrind ./target/debug/check_issue_28 ), and it got through all 100 iterations without an issue, although running the binary without valgrind still leads to a segfault. Since I believe that valgrind acts, at a high level, like a virtual machine (I'm not super familiar with its implementation details, so I could be wrong), this might be related to the behavior that you're seeing?

@arjunchitturi
Author

In my case the OS (Ubuntu 18.04) runs on a hypervisor (vCPU). So the common denominator is the OS, whether or not it is bare metal.

@quietlychris

quietlychris commented Jun 26, 2020

Hmm, looks like it's not just an Ubuntu 18.04 issue, though. I just tried it on Pop!_OS 20.04 (which is sort of an Ubuntu derivative) also running on x86_64 bare metal, and got the same behavior.

@raskr
Owner

raskr commented Jun 27, 2020

@arjunc77 @quietlychris Thank you for the additional info!
Probably the segfault is caused by this line which is accessing node_set: UnsafeCell<...>.

> Running the example within rust-autograd repo works fine. I believe lib.rs loads the Tensor or graph in a way that prevents this issue.

Ummm that's strange. *.rs files in this crate only define utility types and functions.
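
For context on why an UnsafeCell access can end in a segfault rather than a clean error: UnsafeCell performs no borrow tracking at all, so a mutation that happens while another reference is still alive goes unnoticed. The sketch below uses the standard library's RefCell instead, which does track borrows at runtime, to show the kind of conflict that would otherwise be silent. The NodeSet type and its methods here are hypothetical stand-ins, not autograd's actual internals.

```rust
use std::cell::RefCell;

// Hypothetical stand-in for a node container with interior mutability.
// This is NOT autograd's real type; it only illustrates borrow tracking.
struct NodeSet {
    nodes: RefCell<Vec<u32>>,
}

impl NodeSet {
    // Tries to append a node through a shared `&self`.
    // Returns false if some other borrow of the vector is still alive.
    fn try_install(&self, v: u32) -> bool {
        match self.nodes.try_borrow_mut() {
            Ok(mut n) => {
                n.push(v);
                true
            }
            Err(_) => false,
        }
    }
}

// Returns (install result while a borrow is held, install result afterwards).
fn demo() -> (bool, bool) {
    let set = NodeSet { nodes: RefCell::new(vec![1]) };
    let while_borrowed = {
        // A live shared borrow into the vector's contents.
        let held = set.nodes.borrow();
        // RefCell refuses the mutation; UnsafeCell would have allowed it.
        let ok = set.try_install(2);
        let _first = held[0];
        ok
    };
    // The shared borrow has been dropped, so the mutation is now allowed.
    let after_release = set.try_install(3);
    (while_borrowed, after_release)
}

fn main() {
    println!("{:?}", demo());
}
```

With UnsafeCell in place of RefCell, the first install would go through and could move the vector's buffer out from under the live reference.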

@raskr raskr added the bug label Jun 27, 2020
@acrrd
Contributor

acrrd commented Aug 3, 2020

The problem is that the install function returns a reference to a TensorInternal that is allocated in the vector.
When the vector reallocates, all of those references become invalid. This is already fixed in eda1a80.
Could you make a new release with these fixes?
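
To make the failure mode concrete: a Vec may move its buffer to a new allocation when it grows, so any pointer taken into the old buffer can end up dangling. One common way to avoid this class of bug is to hand out indices instead of references. The sketch below shows that pattern under stated assumptions: Arena, Node, and this install function are hypothetical names for illustration, and this is not necessarily how eda1a80 fixes it.

```rust
// Hypothetical node type; `top_rank` just mirrors the field name in the backtrace.
struct Node {
    top_rank: usize,
}

// Hypothetical arena that owns all nodes in a growable vector.
struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Returns an index rather than `&Node`. An index stays valid even if
    // the vector reallocates and moves its buffer on a later push.
    fn install(&mut self, node: Node) -> usize {
        self.nodes.push(node);
        self.nodes.len() - 1
    }

    // Resolves the index against the vector's current buffer on every access.
    fn get(&self, id: usize) -> &Node {
        &self.nodes[id]
    }
}

// Installs one node, forces the vector to grow by installing `n` more,
// then reads the first node back through its index.
fn rank_after_growth(n: usize) -> usize {
    let mut arena = Arena::new();
    let first = arena.install(Node { top_rank: 7 });
    for i in 0..n {
        arena.install(Node { top_rank: i % 3 });
    }
    arena.get(first).top_rank
}

fn main() {
    println!("{}", rank_after_growth(1000));
}
```

In safe Rust the borrow checker rejects holding a `&Node` across a push, which is why the original bug needed UnsafeCell to slip through.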

@raskr
Owner

raskr commented Aug 3, 2020

@acrrd

> The problem is that the install function returns a reference to a TensorInternal that is allocated in the vector.
> When the vector reallocate all the references become invalid. This is already fixed in eda1a80.

That's it! I'll submit a new release tomorrow.

@raskr
Owner

raskr commented Aug 4, 2020

@acrrd @arjunc77 @quietlychris Made a patch release v1.0.1. Sorry for the inconvenience...

@arjunchitturi
Author

Thanks @raskr , issue fixed with the new release.
