adjust to msrun command
WongGawa committed Sep 13, 2024
1 parent 935ccba commit 07fe3fe
Showing 29 changed files with 85 additions and 90 deletions.
19 changes: 12 additions & 7 deletions GETTING_STARTED.md
@@ -48,27 +48,32 @@ to understand their behavior. Some common arguments are:
```
</details>

* To train a model on 8 NPUs/GPUs:
```
mpirun --allow-run-as-root -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
```

* To train a model on 1 NPU/GPU/CPU:
```
python train.py --config ./configs/yolov7/yolov7.yaml
```

* To train a model on 8 NPUs/GPUs:
```
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
```
* To evaluate a model's performance on 1 NPU/GPU/CPU:
```
python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt
```
* To evaluate a model's performance on 8 NPUs/GPUs:
```
mpirun --allow-run-as-root -n 8 python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
```
*Notes: (1) The default hyper-parameters are tuned for 8-card training; some of them need to be adjusted when training on a single card. (2) The default device is Ascend, and you can change it by setting 'device_target' to Ascend, GPU, or CPU, which are the currently supported targets. An annotated sketch of the distributed launch is given at the end of this section.*
* For more options, see `train/test.py -h`.

* Note that if you launch with `msrun` on 2 devices, add `--bind_core=True` to improve performance. For example:
```
msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
```
> For more information, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html).
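Putting the pieces together, the sketch below annotates the 8-card launch shown at the top of this section. The flag descriptions follow the MindSpore `msrun` launcher documentation, and the log file name at the end is illustrative only (exact names may vary across MindSpore versions).
```shell
# Annotated 8-card launch (a sketch; same command as above):
#   --worker_num        total number of worker processes across all nodes
#   --local_worker_num  number of worker processes started on this node
#   --bind_core         bind each worker to CPU cores to improve performance
#   --log_dir           directory that collects the per-worker log files
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log \
    python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True

# Training output goes to the per-worker logs under --log_dir, e.g.:
tail -f ./yolov7_log/worker_0.log
```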
### Deployment

24 changes: 13 additions & 11 deletions GETTING_STARTED_CN.md
@@ -45,33 +45,35 @@ python demo/predict.py --config ./configs/yolov7/yolov7.yaml --weight=/path_to_c
```
</details>

* Distributed training on multiple NPUs/GPUs, taking 8 devices as an example:

```shell
mpirun --allow-run-as-root -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
```

* To train a model on a single NPU/GPU/CPU:

```shell
python train.py --config ./configs/yolov7/yolov7.yaml
```

* Distributed training on multiple NPUs/GPUs, taking 8 devices as an example:
```shell
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
```
* To evaluate a model's accuracy on a single NPU/GPU/CPU:

```shell
python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt
```
* Distributed evaluation of a model's accuracy on multiple NPUs/GPUs:

```shell
mpirun --allow-run-as-root -n 8 python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python test.py --config ./configs/yolov7/yolov7.yaml --weight /path_to_ckpt/WEIGHT.ckpt --is_parallel True
```

*Note: the default hyper-parameters are for 8-card training; some parameters need to be adjusted for single-card runs. The default device is Ascend, and you can set 'device_target' to Ascend/GPU/CPU (a single-card sketch is given at the end of this section).*
* For more options, see `train/test.py -h`.
* For training in the cloud, see [here](./tutorials/cloud/modelarts_CN.md).

* Note: if you launch with `msrun` on 2 devices, add `--bind_core=True` to improve performance. For example:
```
msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
--log_dir=msrun_log --join=True --cluster_time_out=300 \
python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True
```
> For more information, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html).
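If you are not running on the default Ascend device, the note above says the device can be switched with 'device_target'. A minimal single-card sketch (GPU is chosen purely as an example; remember to adjust the hyper-parameters for single-card training as noted above):
```shell
# Single-card training on GPU instead of the default Ascend (a sketch)
python train.py --config ./configs/yolov7/yolov7.yaml --device_target GPU
```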
### Deployment
See [here](./deploy/README.md).
6 changes: 3 additions & 3 deletions configs/yolov3/README.md
@@ -56,11 +56,11 @@ python mindyolo/utils/convert_weight_darknet53.py
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov3/yolov3.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov3_log python train.py --config ./configs/yolov3/yolov3.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).
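For example, a GPU launch could look like the sketch below (this assumes the `--device_target` flag is accepted here in the same way as in the Ascend command above; the log directory name is arbitrary):
```shell
# Distributed training on 8 GPU devices (a sketch)
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov3_log python train.py --config ./configs/yolov3/yolov3.yaml --device_target GPU --is_parallel True
```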

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions configs/yolov4/README.md
@@ -70,11 +70,11 @@ python mindyolo/utils/convert_weight_cspdarknet53.py
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov4/yolov4-silu.yaml --device_target Ascend --is_parallel True --epochs 320
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov4_log python train.py --config ./configs/yolov4/yolov4-silu.yaml --device_target Ascend --is_parallel True --epochs 320
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions configs/yolov5/README.md
@@ -50,11 +50,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov5/yolov5n.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov5_log python train.py --config ./configs/yolov5/yolov5n.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions configs/yolov7/README.md
@@ -51,11 +51,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions configs/yolov8/README.md
@@ -60,11 +60,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov8/yolov8n.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov8_log python train.py --config ./configs/yolov8/yolov8n.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

8 changes: 3 additions & 5 deletions configs/yolox/README.md
@@ -50,13 +50,11 @@ Please refer to the [GETTING_STARTED](https://github.com/mindspore-lab/mindyolo/
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolox/yolox-s.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolox_log python train.py --config ./configs/yolox/yolox-s.yaml --device_target Ascend --is_parallel True
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions docs/en/modelzoo/yolov3.md
@@ -60,11 +60,11 @@ python mindyolo/utils/convert_weight_darknet53.py
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov3/yolov3.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov3_log python train.py --config ./configs/yolov3/yolov3.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions docs/en/modelzoo/yolov4.md
@@ -74,11 +74,11 @@ python mindyolo/utils/convert_weight_cspdarknet53.py
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov4/yolov4-silu.yaml --device_target Ascend --is_parallel True --epochs 320
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov4_log python train.py --config ./configs/yolov4/yolov4-silu.yaml --device_target Ascend --is_parallel True --epochs 320
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions docs/en/modelzoo/yolov5.md
@@ -54,11 +54,11 @@ Please refer to the [QUICK START](../tutorials/quick_start.md) in MindYOLO for d
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov5/yolov5n.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov5_log python train.py --config ./configs/yolov5/yolov5n.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions docs/en/modelzoo/yolov7.md
@@ -54,11 +54,11 @@ Please refer to the [QUICK START](../tutorials/quick_start.md) in MindYOLO for d
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

6 changes: 3 additions & 3 deletions docs/en/modelzoo/yolov8.md
@@ -64,11 +64,11 @@ Please refer to the [QUICK START](../tutorials/quick_start.md) in MindYOLO for d
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolov8/yolov8n.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov8_log python train.py --config ./configs/yolov8/yolov8n.yaml --device_target Ascend --is_parallel True
```
> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

8 changes: 3 additions & 5 deletions docs/en/modelzoo/yolox.md
@@ -53,13 +53,11 @@ Please refer to the [QUICK START](../tutorials/quick_start.md) in MindYOLO for d
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run
```shell
# distributed training on multiple GPU/Ascend devices
mpirun -n 8 python train.py --config ./configs/yolox/yolox-s.yaml --device_target Ascend --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolox_log python train.py --config ./configs/yolox/yolox-s.yaml --device_target Ascend --is_parallel True
```

> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.

Similarly, you can train the model on multiple GPU devices with the above mpirun command.
Similarly, you can train the model on multiple GPU devices with the above msrun command.
**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/msrun_launcher.html).

For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/config.py).

2 changes: 1 addition & 1 deletion docs/en/tutorials/configuration.md
@@ -42,7 +42,7 @@ __BASE__: [
This part of the parameters is usually passed in from the command line. Examples are as follows:

```shell
mpirun --allow-run-as-root -n 8 python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True --log_interval 50
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7_log python train.py --config ./configs/yolov7/yolov7.yaml --is_parallel True --log_interval 50
```
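The same override works for a single-card run; a minimal sketch (this assumes, as in mindyolo's standard argument parsing, that values passed on the command line take precedence over the corresponding keys in the YAML config):
```shell
python train.py --config ./configs/yolov7/yolov7.yaml --log_interval 50
```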

## Dataset
2 changes: 1 addition & 1 deletion docs/en/tutorials/finetune.md
@@ -117,7 +117,7 @@ Since the SHWD training set only has about 6,000 images, the yolov7-tiny model w
* Distributed model training on multi-card NPU/GPU, taking 8 cards as an example:
```shell
mpirun --allow-run-as-root -n 8 python train.py --config ./examples/finetune_SHWD/yolov7-tiny_shwd.yaml --is_parallel True
msrun --worker_num=8 --local_worker_num=8 --bind_core=True --log_dir=./yolov7-tiny_log python train.py --config ./examples/finetune_SHWD/yolov7-tiny_shwd.yaml --is_parallel True
```

* Train the model on a single card NPU/GPU/CPU: