Error when trying to use CudaHalfTensor for training #332

Open
mbcel opened this issue Feb 17, 2017 · 4 comments

Comments

@mbcel

mbcel commented Feb 17, 2017

I want to train my model in fp16 precision on my GPU. My GPU has the Pascal architecture, and the cutorch.hasHalf flag is true. I am using cuDNN 5.1 and CUDA Toolkit 8.0.

As far as I understand, I only have to change the tensors that are allocated on my GPU from CudaTensor to CudaHalfTensor, and the calculations should then run in fp16 precision. However, when I do that, I get an error from the optim.sgd() function that says: "No algorithms found that would fit in free GPU memory".

Am I doing something wrong? Or is fp16 actually supported for training a VGG16 model with sgd?
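
Concretely, the kind of change I mean is just switching the GPU tensor/module type from float to half, for example (a minimal sketch, not my full code; my real model is a VGG16-style network):

require 'cutorch'
require 'nn'

-- stand-in model, just to illustrate the type change
local model = nn.Sequential():add(nn.SpatialConvolution(3, 13, 3, 3, 2, 2, 1, 1))

-- fp32 setup (what I had before):
-- local batchInputs = torch.CudaTensor()
-- model:cuda()

-- fp16 setup (what I switched to):
local batchInputs = torch.CudaHalfTensor()
model:type('torch.CudaHalfTensor')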

The detailed error message is:

In 1 module of nn.Sequential:
/home/.../torch/install/share/lua/5.1/cudnn/find.lua:469: No algorithms found that would fit in free GPU memory
stack traceback:
	[C]: in function 'error'
	/home/.../torch/install/share/lua/5.1/cudnn/find.lua:469: in function 'forwardAlgorithm'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:189: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:185>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./trainManager.lua:103: in function 'opfunc'
	/home/.../torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'

@soumith
Owner

soumith commented Feb 17, 2017

This is presumably a bug in cuDNN 5.1 itself.

@borisfom
Contributor

@Marcel1991: Can you post a snippet of your code where this is happening?
What is your card, exactly? What does cutorch.hasFastHalfInstructions() return?

Also: try using cuDNN 6. A number of FP16 issues were fixed there. Check out the R6 branch of this repo and do 'luarocks make cudnn-scm-1.rockspec'.
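
For reference, you can check the relevant capability flags from the Torch REPL with something like this (quick sketch):

require 'cutorch'
print(cutorch.hasHalf)                                        -- half tensor support compiled in?
print(cutorch.hasFastHalfInstructions())                      -- fast native fp16 arithmetic on this card?
print(cutorch.getDeviceProperties(cutorch.getDevice()).name)  -- which card is in use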

@mbcel
Author

mbcel commented Feb 20, 2017

@borisfom: So cutorch.hasFastHalfInstructions() returns false. My GPU is a Titan X Pascal.

I have now tried cuDNN 6 with the R6 branch. It's still not working, but I get a new error that points more clearly to where something is going wrong:

/home/.../torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
/home/.../torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionForwardAlgorithm failed, sizes:  convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA8,3,368,1224 -filtA13,3,3,3 8,13,184,612 -padA1,1 -convStrideA2,2 CUDNN_DATA_FLOAT
stack traceback:
	[C]: in function 'error'
	/home/.../torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'forwardAlgorithm'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:190: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/ConcatTable.lua:11: in function </home/.../torch/install/share/lua/5.1/nn/ConcatTable.lua:9>
	[C]: in function 'xpcall'
	/home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/.../torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
	./trainManager.lua:104: in function 'opfunc'
	...

The relevant code is:

batchInputs = torch.CudaHalfTensor()
batchLabels = torch.CudaHalfTensor()

-- function trains one minibatch on module
function TrainManager.trainBatch(self, batchInputsCpu, batchLabelsCpu)
  local waitTime = waitTimer:time().real
  cutorch.synchronize()
  local batchTimer = torch.Timer()

  collectgarbage() -- free unused memory
  cutorch.synchronize()

  local options = self.options

  -- copy data into gpu tensors
  batchInputs:resize(batchInputsCpu:size()):copy(batchInputsCpu)
  batchLabels:resize(batchLabelsCpu:size()):copy(batchLabelsCpu)

  local batchLoss
  -- sgd expects function with input: moduleParameters; output: loss, gradParams
  local opFunction = function(modelParameters)
    model:zeroGradParameters()

    local outputs = model:forward(batchInputs)
    batchLoss = criterion:forward(outputs, batchLabels)
    local gradientOutputs = criterion:backward(outputs, batchLabels)
    model:backward(batchInputs, gradientOutputs)

    -- L2 regularization
    -- the L2 loss is deliberately not added to the reported error, for a fair comparison of different L2 settings
    -- batchLoss = batchLoss + optimisationState.regL2 * torch.norm(modelParameters, 2)^2/2
    -- gradientParameters:add( modelParameters:clone():mul(optimisationState.regL2) )

    return batchLoss, gradientParameters
  end

  optim.adam(opFunction, modelParameters, optimisationState)

...

The error occurs at the last line, when the adam() function is called. The same happens with the sgd() function.
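
For context, modelParameters, gradientParameters and optimisationState are not shown in the snippet; they are set up earlier in the file with the usual optim pattern, roughly like this (the hyper-parameter values here are just placeholders):

-- flatten the model's parameters and gradients once, so optim can update them in place
modelParameters, gradientParameters = model:getParameters()
optimisationState = { learningRate = 0.01 }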

@mbcel
Author

mbcel commented Apr 10, 2017

Does anyone use CudaHalfTensor successfully with a Titan X Pascal? If yes, which NVIDIA driver and which Ubuntu version do you use?
