Mobilenet model trained on casia dataset #116

Open
AnujPanthri opened this issue Jul 4, 2023 · 2 comments
AnujPanthri commented Jul 4, 2023

I was trying to train a MobileNet model with ArcFace loss on the CASIA dataset, and I am unable to push LFW accuracy beyond ~0.9763.

I want to know: is this the best accuracy I can get with this dataset (CASIA)? Training on MS1M is not possible for me, as it is really large.

Also, while checking your training scripts, I saw that you use large batch sizes and have run a lot of experiments. Didn't those take a lot of time? What hardware did you use to train them?

I am mainly using Google Colab and Kaggle for training.

Training code:

from tensorflow import keras
import models, train, losses  # modules from the Keras_insightface repo root

data_path = "faces_webface_112x112_112x112_folders"
eval_paths = ["faces_webface_112x112/lfw.bin"]

# MobileNet backbone, 256-d embedding, "E" output layer.
basic_model = models.buildin_models("MobileNet", dropout=0, emb_shape=256, output_layer="E")

tt = train.Train(data_path, save_path='mobilenet_256_adam_E.h5',
    eval_paths=eval_paths,
    basic_model=basic_model,
    batch_size=512, random_status=0,
    lr_base=0.001, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5)

optimizer = keras.optimizers.Adam(learning_rate=0.001)
sch = [
  {"loss": losses.ArcfaceLoss(scale=16), "epoch": 20, "optimizer": optimizer},
]
tt.train(sch, 0)
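
Since the discussion below turns on the ArcfaceLoss scale parameter, here is a minimal sketch of the standard ArcFace logit transform, s * cos(theta + m) on the target class. It illustrates the general technique only (function and variable names are mine), not the repo's losses.ArcfaceLoss implementation:

import tensorflow as tf

def arcface_logits(embeddings, weights, labels, scale=16.0, margin=0.5):
    """Standard ArcFace transform: s * cos(theta + m) for the target class.
    Illustrative sketch only -- not the repo's losses.ArcfaceLoss."""
    # Cosine similarity between L2-normalized embeddings and class centers.
    norm_emb = tf.math.l2_normalize(embeddings, axis=1)  # [batch, emb]
    norm_w = tf.math.l2_normalize(weights, axis=0)       # [emb, classes]
    cos_theta = tf.matmul(norm_emb, norm_w)              # [batch, classes]

    # Add the angular margin m only to the target-class angle theta.
    theta = tf.acos(tf.clip_by_value(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7))
    one_hot = tf.one_hot(labels, depth=tf.shape(weights)[1])
    cos_with_margin = tf.cos(theta + margin * one_hot)

    # The scale s sharpens the softmax over cosine logits; a larger s
    # penalizes hard examples more, which is why scale is usually raised
    # in stages rather than starting at 64.
    return scale * cos_with_margin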

Training logs:

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet/mobilenet_1_0_224_tf_no_top.h5
17225924/17225924 [==============================] - 0s 0us/step
>>>> L2 regularizer value from basic_model: 0
>>>> Init type by loss function name...
>>>> Train arcface...
>>>> Init softmax dataset...
>>>> Image length: 490623, Image class length: 490623, classes: 10572
>>>> Use specified optimizer: <keras.optimizers.adam.Adam object at 0x78131c6ffb20>
>>>> Add arcface layer, arc_kwargs={'loss_top_k': 1, 'append_norm': False, 'partial_fc_split': 0, 'name': 'arcface'}, vpl_kwargs={'vpl_lambda': 0.15, 'start_iters': -958, 'allowed_delta': 200}...
>>>> loss_weights: {'arcface': 1}
Epoch 1/20

Learning rate for iter 1 is 0.0010000000474974513, global_iterNum is 0
958/958 [==============================] - ETA: 0s - loss: 11.6348 - accuracy: 0.3298
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.43it/s]
>>>> lfw evaluation max accuracy: 0.951500, thresh: 0.476799, previous max accuracy: 0.000000
>>>> Improved = 0.951500
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_1_0.951500.h5
Epoch 1: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 413s 401ms/step - loss: 11.6348 - accuracy: 0.3298
Epoch 2/20

Learning rate for iter 2 is 0.000990488799288869, global_iterNum is 958
958/958 [==============================] - ETA: 0s - loss: 8.2079 - accuracy: 0.6521
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.69it/s]
>>>> lfw evaluation max accuracy: 0.964000, thresh: 0.400249, previous max accuracy: 0.951500
>>>> Improved = 0.012500
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_2_0.964000.h5
Epoch 2: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 295ms/step - loss: 8.2079 - accuracy: 0.6521
Epoch 3/20

Learning rate for iter 3 is 0.0009623204241506755, global_iterNum is 1916
958/958 [==============================] - ETA: 0s - loss: 7.1694 - accuracy: 0.7361
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.69it/s]
>>>> lfw evaluation max accuracy: 0.972000, thresh: 0.351204, previous max accuracy: 0.964000
>>>> Improved = 0.008000
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_3_0.972000.h5
Epoch 3: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 7.1694 - accuracy: 0.7361
Epoch 4/20

Learning rate for iter 4 is 0.0009165775263682008, global_iterNum is 2874
958/958 [==============================] - ETA: 0s - loss: 6.5255 - accuracy: 0.7772
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.972000, thresh: 0.326500, previous max accuracy: 0.972000
>>>> Improved = 0.000000
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_4_0.972000.h5
Epoch 4: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 6.5255 - accuracy: 0.7772
Epoch 5/20

Learning rate for iter 5 is 0.0008550179190933704, global_iterNum is 3832
958/958 [==============================] - ETA: 0s - loss: 6.0505 - accuracy: 0.8044
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.974167, thresh: 0.336947, previous max accuracy: 0.972000
>>>> Improved = 0.002167
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_5_0.974167.h5
Epoch 5: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 6.0505 - accuracy: 0.8044
Epoch 6/20

Learning rate for iter 6 is 0.0007800072198733687, global_iterNum is 4790
958/958 [==============================] - ETA: 0s - loss: 5.6642 - accuracy: 0.8246
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.38it/s]
>>>> lfw evaluation max accuracy: 0.974000, thresh: 0.302503, previous max accuracy: 0.974167

Epoch 6: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 5.6642 - accuracy: 0.8246
Epoch 7/20

Learning rate for iter 7 is 0.0006944283377379179, global_iterNum is 5748
958/958 [==============================] - ETA: 0s - loss: 5.3341 - accuracy: 0.8415
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.975500, thresh: 0.289799, previous max accuracy: 0.974167
>>>> Improved = 0.001333
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_7_0.975500.h5
Epoch 7: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 282s 294ms/step - loss: 5.3341 - accuracy: 0.8415
Epoch 8/20

Learning rate for iter 8 is 0.0006015697144903243, global_iterNum is 6706
958/958 [==============================] - ETA: 0s - loss: 5.0437 - accuracy: 0.8557
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.36it/s]
>>>> lfw evaluation max accuracy: 0.976333, thresh: 0.280856, previous max accuracy: 0.975500
>>>> Improved = 0.000833
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_8_0.976333.h5

Epoch 8: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 5.0437 - accuracy: 0.8557
Epoch 9/20

Learning rate for iter 9 is 0.0005050000036135316, global_iterNum is 7664
958/958 [==============================] - ETA: 0s - loss: 4.7825 - accuracy: 0.8688
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.976000, thresh: 0.273528, previous max accuracy: 0.976333

Epoch 9: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 4.7825 - accuracy: 0.8688
Epoch 10/20

Learning rate for iter 10 is 0.0004084303218405694, global_iterNum is 8622
958/958 [==============================] - ETA: 0s - loss: 4.5492 - accuracy: 0.8802
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  4.93it/s]
>>>> lfw evaluation max accuracy: 0.974500, thresh: 0.257037, previous max accuracy: 0.976333

Epoch 10: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 282s 294ms/step - loss: 4.5492 - accuracy: 0.8802
Epoch 11/20

Learning rate for iter 11 is 0.0003155716694891453, global_iterNum is 9580
958/958 [==============================] - ETA: 0s - loss: 4.3433 - accuracy: 0.8900
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.55it/s]
>>>> lfw evaluation max accuracy: 0.974833, thresh: 0.279876, previous max accuracy: 0.976333

Epoch 11: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 293ms/step - loss: 4.3433 - accuracy: 0.8900
Epoch 12/20

Learning rate for iter 12 is 0.00022999268549028784, global_iterNum is 10538
958/958 [==============================] - ETA: 0s - loss: 4.1669 - accuracy: 0.8983
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.51it/s]
>>>> lfw evaluation max accuracy: 0.975833, thresh: 0.270799, previous max accuracy: 0.976333

Epoch 12: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 4.1669 - accuracy: 0.8983
Epoch 13/20

Learning rate for iter 13 is 0.00015498216089326888, global_iterNum is 11496
958/958 [==============================] - ETA: 0s - loss: 4.0252 - accuracy: 0.9050
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.974667, thresh: 0.240988, previous max accuracy: 0.976333

Epoch 13: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 294ms/step - loss: 4.0252 - accuracy: 0.9050
Epoch 14/20

Learning rate for iter 14 is 9.342252451460809e-05, global_iterNum is 12454
958/958 [==============================] - ETA: 0s - loss: 3.9148 - accuracy: 0.9100
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.974333, thresh: 0.240347, previous max accuracy: 0.976333

Epoch 14: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 295ms/step - loss: 3.9148 - accuracy: 0.9100
Epoch 15/20

Learning rate for iter 15 is 4.767959035234526e-05, global_iterNum is 13412
958/958 [==============================] - ETA: 0s - loss: 3.8422 - accuracy: 0.9131
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.63it/s]
>>>> lfw evaluation max accuracy: 0.974500, thresh: 0.249563, previous max accuracy: 0.976333

Epoch 15: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 282s 293ms/step - loss: 3.8422 - accuracy: 0.9131
Epoch 16/20

Learning rate for iter 16 is 1.95112716028234e-05, global_iterNum is 14370
958/958 [==============================] - ETA: 0s - loss: 3.8022 - accuracy: 0.9150
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.61it/s]
>>>> lfw evaluation max accuracy: 0.974500, thresh: 0.247305, previous max accuracy: 0.976333

Epoch 16: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 3.8022 - accuracy: 0.9150
Epoch 17/20

Learning rate for iter 17 is 1e-05, global_iterNum is 15328
958/958 [==============================] - ETA: 0s - loss: 3.7912 - accuracy: 0.9158
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.71it/s]
>>>> lfw evaluation max accuracy: 0.975000, thresh: 0.243934, previous max accuracy: 0.976333

Epoch 17: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 278s 289ms/step - loss: 3.7912 - accuracy: 0.9158
Epoch 18/20

Learning rate for iter 18 is 0.0005050000036135316, global_iterNum is 16286
958/958 [==============================] - ETA: 0s - loss: 4.3969 - accuracy: 0.8887
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.973667, thresh: 0.249403, previous max accuracy: 0.976333

Epoch 18: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 295ms/step - loss: 4.3969 - accuracy: 0.8887
Epoch 19/20

Learning rate for iter 19 is 0.0005038082017563283, global_iterNum is 17244
958/958 [==============================] - ETA: 0s - loss: 4.3282 - accuracy: 0.8907
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.57it/s]
>>>> lfw evaluation max accuracy: 0.974333, thresh: 0.246381, previous max accuracy: 0.976333

Epoch 19: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 4.3282 - accuracy: 0.8907
Epoch 20/20

Learning rate for iter 20 is 0.0005002443795092404, global_iterNum is 18202
958/958 [==============================] - ETA: 0s - loss: 4.2424 - accuracy: 0.8941
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.973167, thresh: 0.235816, previous max accuracy: 0.976333

Epoch 20: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 295ms/step - loss: 4.2424 - accuracy: 0.8941
>>>> Train arcface DONE!!! epochs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], model.stop_training = False
>>>> My history:
{
  'lr': [0.0009905084734782577, 0.0009623592486605048, 0.000916633871383965, 0.0008550895727239549, 0.0007800916209816933, 0.0006945220520719886, 0.0006016692495904863, 0.0005051014595665038, 0.0004085298569407314, 0.0003156654420308769, 0.00023007707204669714, 0.00015505391638725996, 9.347893501399085e-05, 4.7718443966005e-05, 1.953109858732205e-05, 1.0000000656873453e-05, 9.999999747378752e-06, 0.0005038107046857476, 0.000500249327160418, 0.0004943501553498209],
  'loss': [11.634810447692871, 8.207895278930664, 7.169422149658203, 6.525516510009766, 6.050527572631836, 5.664212226867676, 5.334094047546387, 5.043704509735107, 4.7824506759643555, 4.549181938171387, 4.343286037445068, 4.166921138763428, 4.0251665115356445, 3.914809465408325, 3.842155694961548, 3.8022029399871826, 3.7911524772644043, 4.396862506866455, 4.328181743621826, 4.242437362670898],
  'accuracy': [0.32976415753364563, 0.6521174311637878, 0.7361018061637878, 0.7771826982498169, 0.8043939471244812, 0.8246305584907532, 0.8415114283561707, 0.855713427066803, 0.8687940239906311, 0.8802070021629333, 0.8900052309036255, 0.8982886672019958, 0.9050226807594299, 0.9099584817886353, 0.9130941033363342, 0.9150410890579224, 0.9158015847206116, 0.8887085914611816, 0.8907004594802856, 0.8940786719322205],
  'lfw': [0.9515, 0.964, 0.972, 0.972, 0.9741666666666666, 0.974, 0.9755, 0.9763333333333334, 0.976, 0.9745, 0.9748333333333333, 0.9758333333333333, 0.9746666666666667, 0.9743333333333334, 0.9745, 0.9745, 0.975, 0.9736666666666667, 0.9743333333333334, 0.9731666666666666],
  'lfw_thresh': [0.4767988324165344, 0.40024882555007935, 0.35120391845703125, 0.3265003561973572, 0.3369472026824951, 0.30250275135040283, 0.2897985875606537, 0.28085601329803467, 0.273528128862381, 0.2570366859436035, 0.2798755168914795, 0.2707985043525696, 0.24098770320415497, 0.2403470277786255, 0.24956296384334564, 0.24730539321899414, 0.2439337521791458, 0.2494034618139267, 0.24638071656227112, 0.23581618070602417],
}
>>>> Saving latest basic model to: checkpoints/mobilenet_256_adam_E_basic_model_latest.h5
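
As a side note, the printed history dict above can be plotted directly to see where LFW accuracy plateaus. A minimal matplotlib sketch, with the first ten values abbreviated from the history above:

import matplotlib.pyplot as plt

# First 10 entries copied (rounded) from the printed history above.
lfw = [0.9515, 0.964, 0.972, 0.972, 0.9742, 0.974, 0.9755, 0.9763, 0.976, 0.9745]
loss = [11.63, 8.21, 7.17, 6.53, 6.05, 5.66, 5.33, 5.04, 4.78, 4.55]

fig, ax_loss = plt.subplots()
ax_loss.plot(range(1, 11), loss, color="tab:red", label="train loss")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("loss")
ax_acc = ax_loss.twinx()  # second y-axis for LFW accuracy
ax_acc.plot(range(1, 11), lfw, color="tab:blue", label="lfw accuracy")
ax_acc.set_ylabel("lfw accuracy")
fig.legend(loc="center right")
plt.show()

The peak at epoch 8 (0.9763) followed by a flat tail matches the ~0.9763 ceiling reported in the question.
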
@leondgarse (Owner) commented:

It could be rather hard to reach a satisfactory result using MobileNet + CASIA only, maybe even impossible... I previously used an RTX 8000 with 46GB of GPU memory.
Regarding your script, you may try:

  • With lr_decay_steps=16, a cosine learning rate schedule with restarts is used, and the lr will be restarted on epochs [16 + 1 == 17, 17 + 16 * 2 + 1 == 50] with values [lr_base / 2 == 5e-4, lr_base / 4 == 2.5e-4]. So it is better to set the total epochs to 17 or 50 (see the schedule sketch after the revised script below).
  • As you are using Colab, which should have TF >= 2.12.0, you may add some weight_decay to keras.optimizers.Adam.
  • You may further increase the ArcfaceLoss scale up to 64.
from tensorflow import keras
import models, train, losses  # modules from the Keras_insightface repo root

data_path = "faces_webface_112x112_112x112_folders"
eval_paths = ["faces_webface_112x112/lfw.bin"]

basic_model = models.buildin_models("MobileNet", dropout=0, emb_shape=256, output_layer="E")

tt = train.Train(data_path, save_path='mobilenet_256_adam_E.h5',
    eval_paths=eval_paths,
    basic_model=basic_model,
    batch_size=512, random_status=0,
    lr_base=0.001, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5)

# weight_decay requires the newer keras.optimizers.Adam (TF >= 2.12 on Colab).
optimizer = keras.optimizers.Adam(learning_rate=0.001, weight_decay=5e-4)
# Stage the ArcfaceLoss scale 16 -> 32 -> 64; 10 + 10 + 30 epochs lands on
# the lr-restart cycle boundary at epoch 50.
sch = [
  {"loss": losses.ArcfaceLoss(scale=16), "epoch": 10, "optimizer": optimizer},
  {"loss": losses.ArcfaceLoss(scale=32), "epoch": 10},
  {"loss": losses.ArcfaceLoss(scale=64), "epoch": 30},
]
tt.train(sch, 0)
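
To make the restart arithmetic in the first bullet concrete, here is a small sketch of a cosine-decay-with-restarts schedule under these parameters (lr_base=1e-3, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5). It mirrors the behavior described above (restarts at epochs 17 and 50 with halved peaks), not necessarily the repo's exact scheduler code:

import math

def cosine_restart_lr(epoch, lr_base=1e-3, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5):
    """Cosine decay with restarts following the arithmetic described above:
    restarts on epochs 17 and 50 with peaks lr_base/2 and lr_base/4.
    Illustrative sketch only, not the repo's exact scheduler."""
    cycle_start, steps, peak = 0, lr_decay_steps, lr_base
    while epoch >= cycle_start + steps + 1:  # each cycle lasts steps + 1 epochs
        cycle_start += steps + 1
        steps *= 2        # cycle length doubles: 16, 32, 64, ...
        peak *= lr_decay  # peak lr halves: 1e-3, 5e-4, 2.5e-4, ...
    progress = (epoch - cycle_start) / steps
    return lr_min + 0.5 * (peak - lr_min) * (1.0 + math.cos(math.pi * progress))

# The lr bottoms out at epoch 16 and restarts at epoch 17 with peak 5e-4,
# matching the jump visible at "Epoch 18" in the 1-indexed training log.
for epoch in (0, 8, 16, 17, 49, 50):
    print(epoch, f"{cosine_restart_lr(epoch):.2e}")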

@AnujPanthri (Author) commented:

First of all, this repo has been really helpful to me, so thank you for it, and moreover thank you for replying.

Wow, 46 GB of VRAM is impressive; that is probably why you were able to use such large batch sizes.

And I feel the main bottleneck in my case is the CASIA dataset, as you got better results with MobileNet when it was trained on the MS1M dataset.
