aws cloud deployment improvements #2618
Conversation
The following changes for aws cloud deployment have been tested with 6 configurations, including 3 with a T4 GPU (g4dn.xlarge):

t2.small:
- 20.04: ami-04bad3c587fe60d89
- 22.04: ami-03c983f9003cb9cd1
- 24.04: ami-0406d1fdd021121cd

g4dn.xlarge:
- 20.04: ami-04bad3c587fe60d89
- 22.04: ami-03c983f9003cb9cd1
- 24.04: ami-0406d1fdd021121cd

Changes:
- changed default image to ami-03c983f9003cb9cd1 (22.04 / Python 3.10)
- added g4dn.xlarge as an option to the EC2_TYPE prompt
- fixed typo in the REGION prompt
- get default block device name and set size to 16GB instead of 8GB
- install apt package nvidia-driver-535-server if a GPU is found
- run modprobe nvidia to avoid a reboot if a GPU is found
- add ~/.local/bin to PATH
- add --break-system-packages to pip install (required by Python 3.12)
- add --no-cache-dir to pip install to avoid disk space issues
- add @reboot cronjob to ensure nvflare is restarted after a server (re)start
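The GPU-related steps in the list above (installing the driver package and loading the module without a reboot) could be sketched roughly as follows. This is a hedged illustration, not the PR's literal code; the `detect_nvidia` helper and the lspci-based detection are assumptions.

```shell
# Illustrative sketch of the GPU steps described above; detect_nvidia and
# the lspci-based detection are assumptions, not the PR's literal code.
detect_nvidia() {
  # succeeds if the given hardware listing mentions an NVIDIA device
  echo "$1" | grep -qi nvidia
}

HW_LIST="$(lspci 2>/dev/null || true)"
if detect_nvidia "$HW_LIST"; then
  sudo apt-get install -y nvidia-driver-535-server  # driver version named in the PR
  sudo modprobe nvidia                              # load the module now, avoiding a reboot
fi
```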
@dirkpetersen, thanks for the PR. It improves the template quite a lot. However, we may need to have our internal discussion on the deployment scenario. We know different jobs require different dependencies. Sometimes, pytorch is one of the dependencies and thus the deployment will need larger volumes. We also see numpy-only jobs.
Good PR. It covers some scenarios we did not consider previously.
Thanks @IsaacYangSLA, this helps us a lot as we no longer have to run a patched version. Another thing to consider: on systems with a GPU, this will only run on Ubuntu, not on RHEL-based images such as Amazon Linux. The part of the AI/ML community that does not use Ubuntu seems to be small, though.
/build
/build
/build
/build
@IsaacYangSLA, as I was working with other AWS regions, I noticed that some of our users would struggle to look up the appropriate AMI ID without help from a cloud engineer, since these IDs differ per region. Extensive testing also triggered a number of fine-tuning improvements that I would like to add in a pull request. I assume it might be better to let this current pull request go through and then create a new one after that, or is it better to add to this one? Looking for your guidance. I have made the following changes:
The last one is of course optional, based on your team's coding preferences. I tested the following combinations over the last couple of days; only the 24.04 ARM option seems to fail, with both nvidia drivers 535 and 550. The new UI would look like this:
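The original prompt output is not reproduced here; as a rough sketch (not the PR's actual code), the extended EC2_TYPE prompt could validate the new g4dn.xlarge option along these lines. The function name and fallback behavior are assumptions for illustration.

```shell
# Hypothetical helper: validate a chosen EC2 instance type, falling back
# to t2.small (the script's default) for empty or unknown input.
choose_ec2_type() {
  case "$1" in
    t2.small|t2.xlarge|g4dn.xlarge) echo "$1" ;;
    *) echo "t2.small" ;;
  esac
}

# Interactive use might look like:
# read -p "Cloud EC2 type, press ENTER to accept default t2.small: " ans
# EC2_TYPE=$(choose_ec2_type "$ans")
```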
Here is the current aws_start.sh script I am using:
Thank you so much @dirkpetersen for your contribution, we will keep the RHEL case in mind in future releases.
@dirkpetersen really appreciate your contribution and detailed tests.
```diff
@@ -32,7 +32,7 @@ aws_start_sh: |
       EC2_TYPE=t2.xlarge
       REGION=us-west-2
     else
-      AMI_IMAGE=ami-04bad3c587fe60d89
+      AMI_IMAGE=ami-03c983f9003cb9cd1 # 22.04 20.04:ami-04bad3c587fe60d89 24.04:ami-0406d1fdd021121cd
```
@SYangster can you add documentation for these different AMI IDs and Ubuntu versions
/build
* the following changes for aws cloud deployment have been tested with 6 configurations, including 3 with T4 GPU (g4dn.xlarge)

  t2.small:
  - 20.04: ami-04bad3c587fe60d89
  - 22.04: ami-03c983f9003cb9cd1
  - 24.04: ami-0406d1fdd021121cd

  g4dn.xlarge:
  - 20.04: ami-04bad3c587fe60d89
  - 22.04: ami-03c983f9003cb9cd1
  - 24.04: ami-0406d1fdd021121cd

  Changes:
  - changed default image to ami-03c983f9003cb9cd1 (22.04 / Python 3.10)
  - added g4dn.xlarge as an option to the EC2_TYPE prompt
  - fixed typo in the REGION prompt
  - get default block device name and set size to 16GB instead of 8GB
  - install apt package nvidia-driver-535-server if a GPU is found
  - run modprobe nvidia to avoid a reboot if a GPU is found
  - add ~/.local/bin to PATH
  - add --break-system-packages to pip install (required by Python 3.12)
  - add --no-cache-dir to pip install to avoid disk space issues
  - add @reboot cronjob to ensure nvflare is restarted after a server (re)start

* instead of setting the disk to 16GB, increase the existing disk size by 8GB

---------

Co-authored-by: Isaac Yang <isaacy@nvidia.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
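The @reboot cronjob mentioned in the change list could be added idempotently along these lines. This is an illustrative sketch: the `merge_cron` helper and the startup command path are assumptions, not the merged code.

```shell
# Illustrative sketch: merge an @reboot entry into an existing crontab text
# without duplicating it. merge_cron and the startup path are assumptions.
merge_cron() {
  # $1: current crontab contents, $2: job line to ensure is present
  printf '%s\n' "$1" | grep -vF "$2" || true  # keep all other entries
  printf '%s\n' "$2"                          # append exactly one copy of the job
}

# Real usage might pipe the result back into crontab:
# merge_cron "$(crontab -l 2>/dev/null)" '@reboot /var/tmp/cloud/startup.sh' | crontab -
```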
The following changes for AWS cloud deployment have been tested with 6 configurations, including 3 with a T4 GPU (g4dn.xlarge), across 3 versions of Ubuntu:
t2.small:
20.04: ami-04bad3c587fe60d89
22.04: ami-03c983f9003cb9cd1
24.04: ami-0406d1fdd021121cd
g4dn.xlarge:
20.04: ami-04bad3c587fe60d89
22.04: ami-03c983f9003cb9cd1
24.04: ami-0406d1fdd021121cd
Note: nvflare installs on 24.04 but is currently not supported
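The Ubuntu-version-to-AMI mapping tested above could be kept in the script as a small lookup, for example as below. The function name is illustrative, and the IDs are region-specific (the script defaults to us-west-2).

```shell
# Lookup of the AMI IDs tested in this PR; ami_for_ubuntu is an illustrative
# name, and these IDs are specific to the default region (us-west-2).
ami_for_ubuntu() {
  case "$1" in
    20.04) echo "ami-04bad3c587fe60d89" ;;
    22.04) echo "ami-03c983f9003cb9cd1" ;;  # current default, Python 3.10
    24.04) echo "ami-0406d1fdd021121cd" ;;
    *) return 1 ;;                          # unknown Ubuntu version
  esac
}
```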
Fixes # .
Description
The critical change here is the disk size increase by 8GB. This results in all tested configurations having between 30% and 60% free disk space. The default install can run out of disk space if NVIDIA drivers, PyTorch, and a few other packages need to be installed.
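The +8GB approach can be sketched as computing the new size from the AMI's existing root volume rather than hard-coding 16GB. The `grow_size` helper is illustrative, and the commented aws CLI query is an assumption about how the script could obtain the current size.

```shell
# Sketch of the +8GB logic; grow_size is illustrative, not the PR's literal code.
grow_size() {
  echo $(( $1 + 8 ))   # add 8GB to whatever size the AMI's root volume has
}

# Obtaining the current root volume size might look like this (assumed CLI usage):
# SIZE=$(aws ec2 describe-images --image-ids "$AMI_IMAGE" \
#          --query 'Images[0].BlockDeviceMappings[0].Ebs.VolumeSize' --output text)
# NEW_SIZE=$(grow_size "$SIZE")
```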
Types of changes
./runtest.sh