
Incremental Training for imagenet #1391

Merged
merged 26 commits into from
Jul 31, 2019

Conversation

qiuxin2012
Contributor

With incremental training, we can use less DRAM during training. The training workload is as follows:

  1. Persist the original data on disk
  2. Sample a slice of the data into memory
  3. Train on the cached samples
  4. Unpersist the cached samples
  5. Go to step 2.
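The steps above can be sketched with plain Scala collections standing in for Spark RDDs; `sliceOf`, `cachePercentage`, and the persist/unpersist simulation are illustrative assumptions, not the PR's actual API:

```scala
import scala.util.Random

// Illustrative stand-in for the incremental workload: the full dataset stays
// "on disk" (here just a Vector), and each iteration samples a slice into
// "memory", trains on it, then drops it.
object IncrementalLoopSketch {
  def sliceOf[T](data: Vector[T], cachePercentage: Double, rng: Random): Vector[T] = {
    val sliceSize = math.max(1, (data.size * cachePercentage).toInt)
    rng.shuffle(data).take(sliceSize) // step 2: sample a piece into memory
  }

  def run[T](data: Vector[T], cachePercentage: Double, iterations: Int): Seq[Vector[T]] = {
    val rng = new Random(42)
    (1 to iterations).map { _ =>
      val cached = sliceOf(data, cachePercentage, rng)
      // step 3: train on `cached` (omitted); step 4: the slice is discarded
      // when it goes out of scope, i.e. "unpersisted"; step 5: loop again
      cached
    }
  }
}
```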

@@ -32,6 +32,8 @@ case object PARTITIONED extends DataStrategy

case object REPLICATED extends DataStrategy

case class INCREMENTAL(cachePercentage: Double) extends DataStrategy
Collaborator

It makes more sense to make it a new MemoryType (e.g., DISK) instead of DataStrategy; e.g., we may later support DISK + REPLICATED
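The reviewer's suggestion could be sketched as a small ADT where memory type is a dimension orthogonal to the data strategy; all names and shapes here are illustrative, not the PR's actual types:

```scala
// Sketch: memory type as a dimension separate from the data strategy, so a
// disk-backed set can later combine with REPLICATED. Names are illustrative.
sealed trait MemoryType
case object DRAM extends MemoryType
case class DISK(cachePercentage: Double) extends MemoryType

sealed trait DataStrategy
case object PARTITIONED extends DataStrategy
case object REPLICATED extends DataStrategy

// A feature set can then carry both choices independently:
case class FeatureSetConfig(memoryType: MemoryType, strategy: DataStrategy)
```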

-> BGRImgNormalizer(0.485, 0.456, 0.406, 0.229, 0.224, 0.225))
))
-> BGRImgNormalizer(0.485, 0.456, 0.406, 0.229, 0.224, 0.225)
-> BGRImgToSample()
Collaborator

Why add toSample and toBatch here?

Contributor Author

@qiuxin2012 qiuxin2012 Jul 8, 2019

Because the original RDD is persisted on disk, MTLabeledBGRImgToBatch throws an exception when counting the size of the dataset.

Contributor Author

Reverted, since I no longer count the whole RDD on disk.

* @param buffer
*/
// T is the type of the returned elements, e.g. ByteRecord
class IncrementalFeatureSet[T: ClassTag]
Collaborator

DiskFeatureSet?

currentFeatureSet = DRAMFeatureSet.rdd(currentSlice)
currentFeatureSet.cache()
currentFeatureSet.data(train)
} else {
Collaborator

What does this mean? In prediction/evaluation, is only the data in currentFeatureSet used?

override def data(train: Boolean): RDD[T] = {
if (train) {
if (currentFeatureSet != null) {
currentFeatureSet.unpersist()
Collaborator

Why do you initialize currentFeatureSet above and then discard it immediately here? Maybe just initialize currentFeatureSet to null here instead?

// as BigDL will overwrite checkpointPath to its subfolder.
this.setCheckpoint(checkpointDir.get, checkPointTrigger.get)
}
this.train()
Collaborator

We should set state("epoch") to trueEpoch immediately after train, otherwise endWhen(state) can exit the loop too early?
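The concern can be illustrated with a toy training driver; the state map, endWhen, and trueEpoch below are simplified stand-ins for the Optimizer internals, not the real API:

```scala
import scala.collection.mutable

// Toy illustration of the hazard: if state("epoch") reflects the per-slice
// count after train(), an endWhen on epochs can fire too early. Restoring
// the true epoch right after train() keeps the loop honest.
object EpochStateSketch {
  def run(slicesPerEpoch: Int, maxEpoch: Int): Int = {
    val state = mutable.Map("epoch" -> 0)
    var sliceCount = 0
    def endWhen(s: mutable.Map[String, Int]): Boolean = s("epoch") >= maxEpoch
    while (!endWhen(state)) {
      sliceCount += 1                // "train" on one slice
      state("epoch") = sliceCount    // naive: every slice bumps the epoch
      val trueEpoch = sliceCount / slicesPerEpoch
      state("epoch") = trueEpoch     // fix: set the true epoch after train
    }
    sliceCount
  }
}
```

With 4 slices per epoch and a 2-epoch stopping condition, the loop trains on 8 slices rather than exiting after 2.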

protected var checkpointDir: Option[String] = None
protected var cachePercentage: Double = 1.0
Collaborator

I don't think you should make cachePercentage a variable in Estimator; instead, make it part of FeatureSet, and Estimator can then check its value in train.

cachedModels, parameters, trainingModel)

trainingModel
}
Collaborator

@jason-dai jason-dai Jul 8, 2019

Do we need to call optimizer.shutdown() here?

Contributor Author

optimizer.shutdown() has been integrated into close().

trainingModel
}

override def close(): Unit = {
Collaborator

Shall we call close automatically when train is done?

.setCheckpointDir(modelDir)
.setOptimMethods(optimMethods)
}
if (d.originSet().isInstanceOf[DiskFeatureSet[Any]]) {
Collaborator

@jason-dai jason-dai Jul 9, 2019

Maybe we can define a variable memoryPercentage in FeatureSet, set to 1 by default; then you can just check this variable inside internalEstimator.train.
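A minimal sketch of that design, with hypothetical class names standing in for the real FeatureSet and Estimator:

```scala
// Sketch of the suggested design: memoryPercentage lives in FeatureSet
// (default 1.0 = fully in memory), and the estimator branches on it inside
// train() instead of carrying its own cachePercentage field.
trait FeatureSet[T] {
  def memoryPercentage: Double = 1.0
}

class DRAMFeatureSet[T](val data: Seq[T]) extends FeatureSet[T]

class DiskFeatureSet[T](val data: Seq[T], override val memoryPercentage: Double)
  extends FeatureSet[T]

object EstimatorSketch {
  // Returns which training path would be taken, for illustration only.
  def trainPath[T](fs: FeatureSet[T]): String =
    if (fs.memoryPercentage < 1.0) "incremental" else "full"
}
```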

@qiuxin2012
Contributor Author

Added ZooTrigger to fix the issue that each epoch would not stop at the correct iteration.

if (train) {
if (currentFeatureSet != null && trained) {
currentFeatureSet.unpersist()
newSample()
Contributor

Wondering whether the user needs to enlarge the number of training epochs accordingly, since the epoch seems to be counted per slice?

Contributor Author

We reset the epoch number if the epoch has not actually ended.
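The scaling involved can be shown with illustrative arithmetic only; the function name and the exact conversion are assumptions, not taken from the PR:

```scala
// If the inner optimizer counts one "epoch" per cached slice, and each slice
// holds cachePercentage of the full dataset, the user-facing epoch number
// must be scaled back down. Illustrative arithmetic only.
object TrueEpochSketch {
  def trueEpoch(sliceEpochs: Int, cachePercentage: Double): Int =
    math.floor(sliceEpochs * cachePercentage).toInt
}
```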

newSample()
}
currentFeatureSet.cache()
trained = true
Contributor

currentFeatureSet should be shuffled here.

* Could be used as trigger in setValidation and setCheckpoint
* in Optimizer, and also in TrainSummary.setSummaryTrigger.
*/
case class EveryEpoch() extends ZooTrigger{
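The real ZooTrigger interface is not shown in this excerpt, so the shape of such a trigger can only be sketched with a stand-in trait; it models just the idea of firing once per completed epoch:

```scala
// Stand-in trigger interface (the real ZooTrigger API is not in this
// excerpt): fire exactly once each time the recorded epoch advances.
trait TriggerSketch {
  def apply(state: Map[String, Int]): Boolean
}

class EveryEpochSketch extends TriggerSketch {
  private var lastEpoch = -1
  override def apply(state: Map[String, Int]): Boolean = {
    val epoch = state("epoch")
    if (epoch != lastEpoch) { lastEpoch = epoch; true } else false
  }
}
```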
Contributor

If memoryType DRAM is used, the validation phase is ignored at the end of each epoch.

Contributor Author

Fixed.

dding3 pushed a commit to dding3/analytics-zoo that referenced this pull request Jul 26, 2021
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 2, 2021
4 participants