Incremental Training for imagenet #1391
Conversation
@@ -32,6 +32,8 @@ case object PARTITIONED extends DataStrategy

case object REPLICATED extends DataStrategy

case class INCREMENTAL(cachePercentage: Double) extends DataStrategy
It makes more sense to make it a new MemoryType (e.g., DISK) instead of a DataStrategy; e.g., we may later support DISK + REPLICATED.
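A minimal sketch of that split, with the DISK name and its field assumed rather than taken from this PR's diff, keeping the storage medium orthogonal to the distribution strategy so DISK + REPLICATED stays expressible:

sealed trait DataStrategy
case object PARTITIONED extends DataStrategy
case object REPLICATED extends DataStrategy

sealed trait MemoryType
case object DRAM extends MemoryType
// the cache fraction would live on the memory type rather than the strategy
case class DISK(cachePercentage: Double) extends MemoryType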
-> BGRImgNormalizer(0.485, 0.456, 0.406, 0.229, 0.224, 0.225))
))
-> BGRImgNormalizer(0.485, 0.456, 0.406, 0.229, 0.224, 0.225)
-> BGRImgToSample()
Why add toSample and toBatch here?
Since the origin RDD is persisted on disk, MTLabeledBGRImgToBatch will throw an exception when counting the size of the dataset.
Reverted, as I no longer count the whole RDD on disk.
 * @param buffer
 */
// T is the type of the returned value, e.g. ByteRecord
class IncrementalFeatureSet[T: ClassTag]
DiskFeatureSet?
      currentFeatureSet = DRAMFeatureSet.rdd(currentSlice)
      currentFeatureSet.cache()
      currentFeatureSet.data(train)
    } else {
What does this mean? In prediction/evaluation, is only the data in currentFeatureSet used?
override def data(train: Boolean): RDD[T] = {
  if (train) {
    if (currentFeatureSet != null) {
      currentFeatureSet.unpersist()
Why do you initialize currentFeatureSet above and then discard it immediately here? Maybe just initialize currentFeatureSet to null instead?
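A minimal sketch of that alternative, with the Spark RDD replaced by a plain Seq so the snippet stands alone (the slicing mechanics here are assumptions, not this PR's code):

class IncrementalFeatureSetSketch[T](slices: Iterator[Seq[T]]) {
  // start with no cached slice, so there is nothing to discard on the first call
  private var currentSlice: Seq[T] = null

  def data(train: Boolean): Seq[T] = {
    if (train) {
      if (currentSlice != null) {
        // unpersist the previous slice here before loading the next one
      }
      currentSlice = slices.next()
    }
    currentSlice
  }
}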
  // as BigDL will overwrite checkpointPath with its own subfolder.
  this.setCheckpoint(checkpointDir.get, checkPointTrigger.get)
}
this.train()
We should set state("epoch") to trueEpoch immediately after train; otherwise endWhen(state) can exit the loop too early?
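A small script-style sketch of the hazard, assuming endWhen simply compares state("epoch") against a maximum (the real endWhen is not shown in this diff):

import scala.collection.mutable

val state = mutable.Map[String, Int]("epoch" -> 0)
val maxEpoch = 10
def endWhen(s: mutable.Map[String, Int]): Boolean = s("epoch") >= maxEpoch

var trueEpoch = 0
while (!endWhen(state)) {
  // train() on one slice; the inner optimizer may bump state("epoch") once
  // per slice, leaving it ahead of the number of true passes over the data
  trueEpoch += 1
  state("epoch") = trueEpoch // restore the true epoch right after train()
}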
protected var checkpointDir: Option[String] = None
protected var cachePercentage: Double = 1.0
I don't think you should make cachePercentage a variable in Estimator; instead, make it part of FeatureSet, and Estimator can then check its value in train.
  cachedModels, parameters, trainingModel)

trainingModel
}
Do we need to call optimizer.shutdown() here?
optimizer.shutdown is now integrated into close().
  trainingModel
}

override def close(): Unit = {
Shall we call close automatically when train is done?
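One possible shape of that lifecycle, as a sketch (InternalOptimizer stands in for the real optimizer type; none of these names are from the PR):

trait InternalOptimizer { def shutdown(): Unit }

class EstimatorSketch(optimizer: InternalOptimizer) extends AutoCloseable {
  def train(): Unit = {
    try {
      // ... run the optimization loop ...
    } finally {
      close() // the option raised above: close automatically once training ends
    }
  }
  // optimizer.shutdown folded into close(), as in the earlier reply
  override def close(): Unit = optimizer.shutdown()
}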
  .setCheckpointDir(modelDir)
  .setOptimMethods(optimMethods)
}
if (d.originSet().isInstanceOf[DiskFeatureSet[Any]]) {
Maybe we can define a variable memoryPercentage in FeatureSet which is set to 1 by default. Then you can just check this variable inside internalEstimator.train.
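A sketch of that refactor under simplified types (the trait shape and method names here are assumptions):

trait FeatureSet[T] {
  // 1.0 means the whole set is cached in memory; a disk-backed set returns less
  def memoryPercentage: Double = 1.0
}

class InternalEstimatorSketch {
  def train[T](featureSet: FeatureSet[T]): Unit = {
    if (featureSet.memoryPercentage < 1.0) {
      // incremental path: load, train on, and evict one slice at a time
    } else {
      // regular path: train on the fully cached set
    }
  }
}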
Commits 8391903 to 4f9a1e2: Add ZooTrigger to fix that every epoch won't stop at the correct iteration.
if (train) {
  if (currentFeatureSet != null && trained) {
    currentFeatureSet.unpersist()
    newSample()
Wondering whether the user needs to enlarge the number of training epochs accordingly, since the epoch seems to be counted per slice?
We reset the epoch number if the epoch has not actually ended.
    newSample()
  }
  currentFeatureSet.cache()
  trained = true
currentFeatureSet should be shuffled here.
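A minimal illustration of shuffling a slice before caching it, on a plain collection (on the actual RDD a per-partition shuffle via mapPartitions would be one option; that is an assumption, not this PR's code):

import scala.util.Random

val currentSlice: Seq[Int] = 1 to 100
// shuffle the slice before it is cached and trained on,
// so the batch order differs between passes
val shuffledSlice: Seq[Int] = Random.shuffle(currentSlice)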
 * Could be used as trigger in setValidation and setCheckpoint
 * in Optimizer, and also in TrainSummary.setSummaryTrigger.
 */
case class EveryEpoch() extends ZooTrigger {
If memoryType DRAM is used, the validation phase is ignored at the end of each epoch.
fixed
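For context, a minimal sketch of what an every-epoch trigger can look like (the trigger interface shown here is assumed, not copied from the PR):

trait ZooTriggerSketch { def apply(state: Map[String, Any]): Boolean }

case class EveryEpochSketch() extends ZooTriggerSketch {
  private var lastEpoch = -1
  override def apply(state: Map[String, Any]): Boolean = {
    val epoch = state("epoch").asInstanceOf[Int]
    val fire = epoch != lastEpoch // fire once whenever the epoch number changes
    if (fire) lastEpoch = epoch
    fire
  }
}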
With incremental training, we can use less DRAM during training. The training workload looks like this:
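Reconstructed from the review thread above (a hedged sketch, not the author's own description), the loop is roughly:

// one true epoch = one pass over all slices; only one slice is in DRAM at a time
val slices = Seq("slice-0", "slice-1", "slice-2") // placeholders for disk slices
for (epoch <- 1 to 2; slice <- slices) {
  // 1. load `slice` from disk into DRAM and shuffle it
  // 2. train on the cached slice, then unpersist it
  // 3. at the last slice, restore state("epoch") and run validation/checkpoint
  println(s"epoch $epoch, training on $slice")
}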