
Incremental Training for imagenet #1391

Merged
merged 26 commits into from
Jul 31, 2019

Conversation

qiuxin2012
Contributor

With incremental training, we can use less DRAM during training. The training workload is as follows:

  1. Persist the original data on disk
  2. Sample a slice of the data into memory
  3. Train on the cached samples
  4. Unpersist the cached samples
  5. Go to step 2.
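The steps above can be sketched with plain Scala collections standing in for Spark RDDs; `sliceOf`, `cachePercentage`, and the persist/unpersist simulation are illustrative assumptions, not the PR's actual API:

```scala
import scala.util.Random

// Illustrative stand-in for the incremental workload: the full dataset stays
// "on disk" (here just a Vector), and each iteration samples a slice into
// "memory", trains on it, then drops it.
object IncrementalLoopSketch {
  def sliceOf[T](data: Vector[T], cachePercentage: Double, rng: Random): Vector[T] = {
    val sliceSize = math.max(1, (data.size * cachePercentage).toInt)
    rng.shuffle(data).take(sliceSize) // step 2: sample a piece into memory
  }

  def run[T](data: Vector[T], cachePercentage: Double, iterations: Int): Seq[Vector[T]] = {
    val rng = new Random(42)
    (1 to iterations).map { _ =>
      val cached = sliceOf(data, cachePercentage, rng)
      // step 3: train on `cached` (omitted); step 4: the slice is discarded
      // when it goes out of scope, i.e. "unpersisted"; step 5: loop again
      cached
    }
  }
}
```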

@@ -32,6 +32,8 @@ case object PARTITIONED extends DataStrategy

case object REPLICATED extends DataStrategy

case class INCREMENTAL(cachePercentage: Double) extends DataStrategy
Collaborator

It makes more sense to make it a new MemoryType (e.g., DISK) instead of DataStrategy; e.g., we may later support DISK + REPLICATED
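The reviewer's suggestion could be sketched as a small ADT where memory type is a dimension orthogonal to the data strategy; all names and shapes here are illustrative, not the PR's actual types:

```scala
// Sketch: memory type as a dimension separate from the data strategy, so a
// disk-backed set can later combine with REPLICATED. Names are illustrative.
sealed trait MemoryType
case object DRAM extends MemoryType
case class DISK(cachePercentage: Double) extends MemoryType

sealed trait DataStrategy
case object PARTITIONED extends DataStrategy
case object REPLICATED extends DataStrategy

// A feature set can then carry both choices independently:
case class FeatureSetConfig(memoryType: MemoryType, strategy: DataStrategy)
```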

-> BGRImgNormalizer(0.485, 0.456, 0.406, 0.229, 0.224, 0.225))
))
-> BGRImgNormalizer(0.485, 0.456, 0.406, 0.229, 0.224, 0.225)
-> BGRImgToSample()
Collaborator

Why add toSample and toBatch here?

Contributor Author

@qiuxin2012 qiuxin2012 Jul 8, 2019

Because the original RDD is persisted on disk, MTLabeledBGRImgToBatch throws an exception when counting the size of the dataset.

Contributor Author

Reverted, since I no longer count the whole RDD on disk.

* @param buffer
*/
// T is the type of the returned elements, e.g. ByteRecord
class IncrementalFeatureSet[T: ClassTag]
Collaborator

DiskFeatureSet?

currentFeatureSet = DRAMFeatureSet.rdd(currentSlice)
currentFeatureSet.cache()
currentFeatureSet.data(train)
} else {
Collaborator

What does this mean? In prediction/evaluation, is only the data in currentFeatureSet used?

override def data(train: Boolean): RDD[T] = {
if (train) {
if (currentFeatureSet != null) {
currentFeatureSet.unpersist()
Collaborator

Why do you initialize currentFeatureSet above and then discard it immediately here? Maybe just initialize currentFeatureSet to null here instead?

// as BigDL will overwrite checkpointPath to its subfolder.
this.setCheckpoint(checkpointDir.get, checkPointTrigger.get)
}
this.train()
Collaborator

We should set state("epoch") to trueEpoch immediately after train, otherwise endWhen(state) can exit the loop too early?
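The concern can be illustrated with a toy training driver; the state map, endWhen, and trueEpoch below are simplified stand-ins for the Optimizer internals, not the real API:

```scala
import scala.collection.mutable

// Toy illustration of the hazard: if state("epoch") reflects the per-slice
// count after train(), an endWhen on epochs can fire too early. Restoring
// the true epoch right after train() keeps the loop honest.
object EpochStateSketch {
  def run(slicesPerEpoch: Int, maxEpoch: Int): Int = {
    val state = mutable.Map("epoch" -> 0)
    var sliceCount = 0
    def endWhen(s: mutable.Map[String, Int]): Boolean = s("epoch") >= maxEpoch
    while (!endWhen(state)) {
      sliceCount += 1                // "train" on one slice
      state("epoch") = sliceCount    // naive: every slice bumps the epoch
      val trueEpoch = sliceCount / slicesPerEpoch
      state("epoch") = trueEpoch     // fix: set the true epoch after train
    }
    sliceCount
  }
}
```

With 4 slices per epoch and a 2-epoch stopping condition, the loop trains on 8 slices rather than exiting after 2.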

protected var checkpointDir: Option[String] = None
protected var cachePercentage: Double = 1.0
Collaborator

I don't think you should make cachePercentage a variable in Estimator; instead, make it part of FeatureSet, and Estimator can then check its value in train.

cachedModels, parameters, trainingModel)

trainingModel
}
Collaborator

@jason-dai jason-dai Jul 8, 2019

Do we need to call optimizer.shutdown() here?

Contributor Author

optimizer.shutdown() has been integrated into close().

trainingModel
}

override def close(): Unit = {
Collaborator

Shall we call close automatically when train is done?

.setCheckpointDir(modelDir)
.setOptimMethods(optimMethods)
}
if (d.originSet().isInstanceOf[DiskFeatureSet[Any]]) {
Collaborator

@jason-dai jason-dai Jul 9, 2019

Maybe we can define a variable memoryPercentage in FeatureSet, set to 1 by default; then you can just check this variable inside internalEstimator.train.
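A minimal sketch of that design, with hypothetical class names standing in for the real FeatureSet and Estimator:

```scala
// Sketch of the suggested design: memoryPercentage lives in FeatureSet
// (default 1.0 = fully in memory), and the estimator branches on it inside
// train() instead of carrying its own cachePercentage field.
trait FeatureSet[T] {
  def memoryPercentage: Double = 1.0
}

class DRAMFeatureSet[T](val data: Seq[T]) extends FeatureSet[T]

class DiskFeatureSet[T](val data: Seq[T], override val memoryPercentage: Double)
  extends FeatureSet[T]

object EstimatorSketch {
  // Returns which training path would be taken, for illustration only.
  def trainPath[T](fs: FeatureSet[T]): String =
    if (fs.memoryPercentage < 1.0) "incremental" else "full"
}
```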

@qiuxin2012
Contributor Author

Added ZooTrigger to fix the issue that each epoch would not stop at the correct iteration.

if (train) {
if (currentFeatureSet != null && trained) {
currentFeatureSet.unpersist()
newSample()
Contributor

Wondering whether the user needs to enlarge the number of training epochs accordingly, since the epoch seems to be counted per slice?

Contributor Author

We reset the epoch number if the epoch has not actually ended.
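The scaling involved can be shown with illustrative arithmetic only; the function name and the exact conversion are assumptions, not taken from the PR:

```scala
// If the inner optimizer counts one "epoch" per cached slice, and each slice
// holds cachePercentage of the full dataset, the user-facing epoch number
// must be scaled back down. Illustrative arithmetic only.
object TrueEpochSketch {
  def trueEpoch(sliceEpochs: Int, cachePercentage: Double): Int =
    math.floor(sliceEpochs * cachePercentage).toInt
}
```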

newSample()
}
currentFeatureSet.cache()
trained = true
Contributor

currentFeatureSet should be shuffled here.

* Could be used as trigger in setValidation and setCheckpoint
* in Optimizer, and also in TrainSummary.setSummaryTrigger.
*/
case class EveryEpoch() extends ZooTrigger{
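The real ZooTrigger interface is not shown in this excerpt, so the shape of such a trigger can only be sketched with a stand-in trait; it models just the idea of firing once per completed epoch:

```scala
// Stand-in trigger interface (the real ZooTrigger API is not in this
// excerpt): fire exactly once each time the recorded epoch advances.
trait TriggerSketch {
  def apply(state: Map[String, Int]): Boolean
}

class EveryEpochSketch extends TriggerSketch {
  private var lastEpoch = -1
  override def apply(state: Map[String, Int]): Boolean = {
    val epoch = state("epoch")
    if (epoch != lastEpoch) { lastEpoch = epoch; true } else false
  }
}
```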
Contributor

If memoryType DRAM is used, the validation phase is ignored at the end of each epoch.

Contributor Author

Fixed.

dding3 pushed a commit to dding3/analytics-zoo that referenced this pull request Jul 26, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 2, 2021
4 participants