Error when trying to use CudaHalfTensor for training #332

Open
mbcel opened this issue Feb 17, 2017 · 4 comments

Comments

@mbcel

mbcel commented Feb 17, 2017

I want to train my model in fp16 precision on my GPU. My GPU has the Pascal architecture, and the cutorch.hasHalf flag is true. I am using cuDNN 5.1 and CUDA Toolkit 8.0.

As far as I understand, I only have to change the tensors that are allocated on my GPU from CudaTensor to CudaHalfTensor, and the calculations should then run in fp16 precision. However, when I do that, I get an error from the optim.sgd() function that says: "No algorithms found that would fit in free GPU memory".

Am I doing something wrong? Or is fp16 actually supported for training a VGG16 model with sgd?
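
Concretely, the kind of change I mean is just switching the GPU tensor/module type from float to half, for example (a minimal sketch, not my full code; my real model is a VGG16-style network):

require 'cutorch'
require 'nn'

-- stand-in model, just to illustrate the type change
local model = nn.Sequential():add(nn.SpatialConvolution(3, 13, 3, 3, 2, 2, 1, 1))

-- fp32 setup (what I had before):
-- local batchInputs = torch.CudaTensor()
-- model:cuda()

-- fp16 setup (what I switched to):
local batchInputs = torch.CudaHalfTensor()
model:type('torch.CudaHalfTensor')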

The detailed error message is:

In 1 module of nn.Sequential:
/home/.../torch/install/share/lua/5.1/cudnn/find.lua:469: No algorithms found that would fit in free GPU memory
stack traceback:
	[C]: in function 'error'
	/home/.../torch/install/share/lua/5.1/cudnn/find.lua:469: in function 'forwardAlgorithm'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:189: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:185>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./trainManager.lua:103: in function 'opfunc'
	/home/.../torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'

@soumith
Owner

soumith commented Feb 17, 2017

This is presumably a bug in cuDNN 5.1 itself.

@borisfom
Contributor

@Marcel1991: Can you post a snippet of your code where this is happening?
What is your card, exactly? What does cutorch.hasFastHalfInstructions() return?

Also: try using cuDNN 6. A number of FP16 issues were fixed there. Check out the R6 branch of this repo and do 'luarocks make cudnn-scm-1.rockspec'.
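
For reference, you can check the relevant capability flags from the Torch REPL with something like this (quick sketch):

require 'cutorch'
print(cutorch.hasHalf)                                        -- half tensor support compiled in?
print(cutorch.hasFastHalfInstructions())                      -- fast native fp16 arithmetic on this card?
print(cutorch.getDeviceProperties(cutorch.getDevice()).name)  -- which card is in use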

@mbcel
Author

mbcel commented Feb 20, 2017

@borisfom: So cutorch.hasFastHalfInstructions() returns false. My GPU is a Titan X Pascal.

I have now tried cuDNN 6 with the R6 branch. It's still not working, but I get a new error that points more clearly to where something is going wrong:

/home/.../torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
/home/.../torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionForwardAlgorithm failed, sizes:  convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA8,3,368,1224 -filtA13,3,3,3 8,13,184,612 -padA1,1 -convStrideA2,2 CUDNN_DATA_FLOAT
stack traceback:
	[C]: in function 'error'
	/home/.../torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'forwardAlgorithm'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:190: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function </home/.../torch/install/share/lua/5.1/nn/ConcatTable.lua:9>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./trainManager.lua:104: in function 'opfunc'
	...

The relevant code is:

batchInputs = torch.CudaHalfTensor()
batchLabels = torch.CudaHalfTensor()

-- function trains one minibatch on module
function TrainManager.trainBatch(self, batchInputsCpu, batchLabelsCpu)
  local waitTime = waitTimer:time().real
  cutorch.synchronize()
  local batchTimer = torch.Timer()

  collectgarbage() -- free unused memory
  cutorch.synchronize()

  local options = self.options

  -- copy data into gpu tensors
  batchInputs:resize(batchInputsCpu:size()):copy(batchInputsCpu)
  batchLabels:resize(batchLabelsCpu:size()):copy(batchLabelsCpu)

  local batchLoss
  -- sgd expects function with input: moduleParameters; output: loss, gradParams
  local opFunction = function(modelParameters)
    model:zeroGradParameters()

    local outputs = model:forward(batchInputs)
    batchLoss = criterion:forward(outputs, batchLabels)
    local gradientOutputs = criterion:backward(outputs, batchLabels)
    model:backward(batchInputs, gradientOutputs)

    -- L2 regularization
    -- the L2 loss is deliberately not added to the reported error, for a fair comparison of different L2 settings
    -- batchLoss = batchLoss + optimisationState.regL2 * torch.norm(modelParameters, 2)^2/2
    -- gradientParameters:add( modelParameters:clone():mul(optimisationState.regL2) )

    return batchLoss, gradientParameters
  end

  optim.adam(opFunction, modelParameters, optimisationState)

...

The error occurs at the last line, when the adam() function is called. The same happens with the sgd() function.
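
For context, modelParameters, gradientParameters and optimisationState are not shown in the snippet; they are set up earlier in the file with the usual optim pattern, roughly like this (the hyper-parameter values here are just placeholders):

-- flatten the model's parameters and gradients once, so optim can update them in place
modelParameters, gradientParameters = model:getParameters()
optimisationState = { learningRate = 0.01 }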

@mbcel
Author

mbcel commented Apr 10, 2017

Does anyone use CudaHalfTensor successfully with a Titan X Pascal? If yes, which NVIDIA driver and which Ubuntu version do you use?
