Verify the loss value of ResNet training between Go and Python version #241

Open
Yancey1989 opened this issue Aug 28, 2020 · 1 comment
@Yancey1989
Collaborator

To get a baseline loss value for ResNet50 training, I ran the resnet.py example and got the following logs:

batch: 10, loss: 10.711030, acc1: 0.000000, acc5: 0.000000
batch: 20, loss: 7.499339, acc1: 0.000000, acc5: 0.000000
batch: 30, loss: 7.281894, acc1: 0.000000, acc5: 0.000000
batch: 40, loss: 7.059255, acc1: 0.000000, acc5: 0.000000
batch: 50, loss: 7.000484, acc1: 0.000000, acc5: 0.000000
batch: 60, loss: 6.871602, acc1: 0.000000, acc5: 3.125000
batch: 70, loss: 6.962079, acc1: 0.000000, acc5: 0.000000
batch: 80, loss: 6.872428, acc1: 0.000000, acc5: 0.000000
batch: 90, loss: 6.922100, acc1: 0.000000, acc5: 0.000000
batch: 100, loss: 6.918412, acc1: 0.000000, acc5: 0.000000
batch: 110, loss: 6.880023, acc1: 0.000000, acc5: 0.000000
batch: 120, loss: 6.936709, acc1: 0.000000, acc5: 3.125000
batch: 130, loss: 6.936309, acc1: 0.000000, acc5: 0.000000
batch: 140, loss: 6.923660, acc1: 0.000000, acc5: 0.000000
batch: 150, loss: 6.924109, acc1: 0.000000, acc5: 0.000000
batch: 160, loss: 6.923644, acc1: 0.000000, acc5: 3.125000
...
@Yancey1989
Collaborator Author

Go version experiment

Run the following commands to shuffle the ImageNet training data and create a .tgz file from it:

$ find ./train | grep "JPEG" |  sort -R >  shuffle_list.txt
$ tar czf train_shuffle.tgz -T shuffle_list.txt
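
`sort -R` gives a different order on every run, which makes it harder to feed the Go and Python versions the same sample sequence. A minimal sketch of a seeded, repeatable shuffle of the file list is shown here; `shuffled`, the seed value, and the example paths are all hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffled returns a randomly permuted copy of paths. The same seed always
// produces the same order, so a run can be repeated on identical input.
func shuffled(paths []string, seed int64) []string {
	out := append([]string(nil), paths...) // copy so the input is untouched
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	// Hypothetical file names standing in for the output of `find ./train`.
	paths := []string{
		"train/classA/img1.JPEG",
		"train/classA/img2.JPEG",
		"train/classB/img1.JPEG",
		"train/classB/img2.JPEG",
	}
	for _, p := range shuffled(paths, 42) {
		fmt.Println(p)
	}
}
```

The printed list could be written to shuffle_list.txt and passed to `tar -T` exactly as in the commands above.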

#239 can randomly skip some samples at the beginning of each epoch. We used this PR to run the experiment and got the following logs:

2020/08/28 11:33:39 No CUDA found; CPU only
2020/08/28 11:51:41 building label vocabulary done.
2020/08/28 11:52:15 Epoch: 0, Batch: 10, loss:25.210730, acc1: 0.000000, acc5:0.000000, throughput: 0.028925 samples/secs
2020/08/28 11:52:48 Epoch: 0, Batch: 20, loss:18.479069, acc1: 0.000000, acc5:0.000000, throughput: 0.060924 samples/secs
2020/08/28 11:53:21 Epoch: 0, Batch: 30, loss:23.318371, acc1: 0.000000, acc5:0.000000, throughput: 0.091268 samples/secs
2020/08/28 11:53:54 Epoch: 0, Batch: 40, loss:17.084089, acc1: 0.000000, acc5:0.000000, throughput: 0.119941 samples/secs
2020/08/28 11:54:28 Epoch: 0, Batch: 50, loss:27.470881, acc1: 0.000000, acc5:0.000000, throughput: 0.150361 samples/secs
2020/08/28 11:55:01 Epoch: 0, Batch: 60, loss:13.764173, acc1: 0.000000, acc5:0.000000, throughput: 0.179411 samples/secs
2020/08/28 11:55:35 Epoch: 0, Batch: 70, loss:19.928579, acc1: 0.000000, acc5:0.000000, throughput: 0.208202 samples/secs
2020/08/28 11:56:08 Epoch: 0, Batch: 80, loss:9.244127, acc1: 0.000000, acc5:0.000000, throughput: 0.240541 samples/secs
2020/08/28 11:56:42 Epoch: 0, Batch: 90, loss:15.051638, acc1: 0.000000, acc5:0.000000, throughput: 0.263969 samples/secs
2020/08/28 11:57:15 Epoch: 0, Batch: 100, loss:18.548998, acc1: 0.000000, acc5:0.000000, throughput: 0.304695 samples/secs
2020/08/28 11:57:47 Epoch: 0, Batch: 110, loss:12.095874, acc1: 0.000000, acc5:0.000000, throughput: 0.339955 samples/secs
2020/08/28 11:58:20 Epoch: 0, Batch: 120, loss:23.189226, acc1: 0.000000, acc5:0.000000, throughput: 0.362004 samples/secs
2020/08/28 11:58:54 Epoch: 0, Batch: 130, loss:12.052785, acc1: 0.000000, acc5:0.000000, throughput: 0.391076 samples/secs
2020/08/28 11:59:26 Epoch: 0, Batch: 140, loss:16.743521, acc1: 0.000000, acc5:3.125000, throughput: 0.431979 samples/secs
2020/08/28 11:59:59 Epoch: 0, Batch: 150, loss:14.558453, acc1: 0.000000, acc5:3.125000, throughput: 0.456955 samples/secs
2020/08/28 12:00:31 Epoch: 0, Batch: 160, loss:16.397123, acc1: 0.000000, acc5:0.000000, throughput: 0.493441 samples/secs
...

However, the program panics after 850 minibatches:

2020/08/28 12:37:23 Epoch: 0, Batch: 850, loss:76.659737, acc1: 0.000000, acc5:0.000000, throughput: 2.500808 samples/secs
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x7f515b598e39]

runtime stack:
runtime.throw(0x7bdeb4, 0x2a)
	/usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:679 +0x46a

goroutine 1 [syscall]:
runtime.cgocall(0x706e60, 0xc000e57b80, 0xc000202790)
	/usr/local/go/src/runtime/cgocall.go:133 +0x5b fp=0xc000e57b50 sp=0xc000e57b18 pc=0x407e3b
github.com/wangkuiyi/gotorch/nn/functional._Cfunc_BatchNorm(0x7f4f0bc28f20, 0x36c0680, 0x36be550, 0x36c2240, 0x36c2260, 0xc000e57b01, 0x3fb999999999999a, 0x3ee4f8b588e368f1, 0xc000202790, 0x0)
	_cgo_gotypes.go:91 +0x4e fp=0xc000e57b80 sp=0xc000e57b50 pc=0x4f6a6e
github.com/wangkuiyi/gotorch/nn/functional.BatchNorm.func1(0xc000fac028, 0x36c0680, 0x36be550, 0x36c2240, 0x36c2260, 0x6fd701, 0x3fb999999999999a, 0x3ee4f8b588e368f1, 0xc000202790, 0xc000642060)
	/work/gotorch/nn/functional/functional.go:45 +0x18b fp=0xc000e57be8 sp=0xc000e57b80 pc=0x4f7b2b
github.com/wangkuiyi/gotorch/nn/functional.BatchNorm(0xc000fac028, 0xc0000100b0, 0xc0000100b8, 0xc0000100a0, 0xc0000100a8, 0x1, 0x3fb999999999999a, 0x3ee4f8b588e368f1, 0xc000642060)
	/work/gotorch/nn/functional/functional.go:45 +0xd6 fp=0xc000e57c50 sp=0xc000e57be8 pc=0x4f73c6
github.com/wangkuiyi/gotorch/nn.(*BatchNorm2dModule).Forward(0xc0001dc000, 0xc000fac028, 0xc000fac028)
	/work/gotorch/nn/batchnorm.go:73 +0x89 fp=0xc000e57ca8 sp=0xc000e57c50 pc=0x6fcba9
github.com/wangkuiyi/gotorch/vision/models.(*ResnetModule).Forward(0xc0001de000, 0xc000fac018, 0x735d60)
	/work/gotorch/vision/models/resnet.go:182 +0x70 fp=0xc000e57d68 sp=0xc000e57ca8 pc=0x703c60
main.trainOneBatch(0xc000fac018, 0xc000fac020, 0xc0001de000, 0xc000202f88, 0x426681f7, 0xc000000000)
	/work/gotorch/example/resnet/resnet.go:83 +0x3c fp=0xc000e57de0 sp=0xc000e57d68 pc=0x70538c
main.main()
	/work/gotorch/example/resnet/resnet.go:149 +0x51f fp=0xc000e57f88 sp=0xc000e57de0 pc=0x7059ff
runtime.main()
	/usr/local/go/src/runtime/proc.go:203 +0x1fa fp=0xc000e57fe0 sp=0xc000e57f88 pc=0x439f6a
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc000e57fe8 sp=0xc000e57fe0 pc=0x466771
exit status 2

TODOs

  1. The loss value does not decrease as expected; we should find the reason.
  2. Fix the panic in the batch_norm function.
