Upgrade ai ml tests to successfully run for long duration on another VM #1147

sethiay · 2023-05-29T14:53:36Z

Description

Upgrade the AI/ML tests so that they run successfully for long durations on a VM other than the Kokoro VM that triggers them. This is required because intermittent issues occur when long-duration tests run directly on the Kokoro VM.
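
A minimal sketch of the trigger-and-poll pattern described above, assuming the remote VM exposes a one-word test status (`RUNNING`, `COMPLETE`, or `ERROR`). The function name, the state names, the return codes, and the VM name, zone, and status file in the commented example are all hypothetical and are not taken from this PR's scripts.

```shell
# Hypothetical helper: poll a status command until the remote test finishes.
# Usage: wait_for_status STATUS_CMD POLL_SECONDS TIMEOUT_SECONDS
# Returns 0 on COMPLETE, 1 on ERROR, 2 if the timeout elapses first.
wait_for_status() {
  status_cmd="$1"
  poll_seconds="$2"
  timeout_seconds="$3"
  elapsed=0
  while [ "$elapsed" -le "$timeout_seconds" ]; do
    # STATUS_CMD is expected to print the current state on stdout.
    status="$($status_cmd)"
    case "$status" in
      COMPLETE) return 0 ;;
      ERROR)    return 1 ;;
    esac
    sleep "$poll_seconds"
    elapsed=$((elapsed + poll_seconds))
  done
  return 2
}

# Example wiring (assumed VM name, zone, and status file):
# fetch_status() {
#   sudo gcloud compute ssh "my-test-vm" --zone "us-west1-b" \
#     --command "cat ~/status.txt"
# }
# wait_for_status fetch_status 300 3600
```

Keeping the status command as a parameter means the polling logic can be exercised locally with a stub, without needing access to a VM.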

Notes for reviewer:

I have used sudo gcloud instead of gcloud because plain gcloud fails with errors when doing ssh from the Kokoro VM, and I couldn't find a solution.
I have changed the base image for the PyTorch DINO model because the NVIDIA one doesn't have gsutil preinstalled.
I will take care of the to-dos before merging.
To-do: Delete the logs of older builds in the GCS buckets, similar to the periodic perf tests.
The existing build.sh files for both tf and pytorch have been moved under scripts/ml_tests/tf/resnet/ and scripts/ml_tests/pytorch/dino and renamed to setup_host_and_run_model.sh.
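
For the to-do about deleting logs of older builds, one possible approach is sketched below. It assumes that each build's logs live under their own GCS prefix and that the prefix names sort chronologically (for example, dates or increasing build IDs). The function name and the bucket layout are hypothetical; this is not how the periodic perf tests necessarily do it.

```shell
# Hypothetical helper: read build log prefixes on stdin (one per line) and
# print every prefix except the newest KEEP ones, i.e. the deletion candidates.
# Usage: select_old_builds KEEP
select_old_builds() {
  sort | awk -v keep="$1" '
    { lines[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print lines[i] }
  '
}

# Example wiring (assumed bucket and prefix):
# gsutil ls "gs://my-log-bucket/ml-tests/" \
#   | select_old_builds 10 \
#   | xargs -r gsutil -m rm -r
```

Separating the selection from the actual `gsutil rm` call makes it easy to dry-run the list of prefixes before anything is deleted.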

Link to the issue in case of a bug fix.

NA

Testing details

Manual - Tested the setup for the tf model with a smaller number of epochs. To-do/In progress: Test the setup for pytorch, and with a relatively higher number of epochs.
Unit tests - NA
Integration tests - NA

perfmetrics/scripts/continuous_test/ml_tests/run_and_manage_test.sh

sethiay requested a review from vadlakondaswetha May 29, 2023 14:54

vadlakondaswetha reviewed May 31, 2023

perfmetrics/scripts/continuous_test/ml_tests/run_and_manage_test.sh (outdated)

vadlakondaswetha approved these changes May 31, 2023

sethiay added 27 commits June 22, 2023 18:01

Trying gcloud compute ssh to trigger tests

e4944de

Minor fixes

f1ee5e4

increasing epochs

26340bf

Intentionally adding

c48d552

Experimenting try and catch

4f46a31

Handling states

b2d874b

Small fixes

05b9ba1

fixes

261fdbe

minor fixes

10137ac

minor change

7d83097

fixes

9ede6aa

minior fix

cbe0cdc

small change

22c9f13

small change

201f3ae

small change

78b886b

Small fix

297f5b2

Small fix

1614b71

Minor enhancement

87d561f

Organizing the code better

f55f27f

minor fix

aa24642

Small fix

3e46de4

Fixes

2b82535

Updating path in log rotation

fc899e0

Change image for pytorch model

8912949

Reducing epoch for tf

77c7ff7

Small fix for Kokoro to work

eafc023

Try commit

1da14bd

sethiay added 25 commits June 22, 2023 18:01

Small fix

97ee0da

Small fix in copying gcsfuse logs

44ebe3e

Try timeout error

6b7df2f

Try timeout error with 30 mins

2e8667c

changing timeout back to 7.5 days

9adb758

Reducing the timeout for Kokoro VM as test will not run on them

872aea5

Small fixes

9e7c8d2

Run 1 epoch for pytorch for testing

9275cbd

Run 2 epoch for pytorch for testing

8120802

Trying out ssh initialization

bf7081a

Changing python version

91c38b2

Fixing warmup

6c2f25c

Fixing warmup

c8c8df6

Fixing warmup

d9bf72f

Disabling cache in pytorch model

3a4eebd

Trying dropping cache for GKE

ee31cdd

Correction

78601d1

Changing zone

3519d2e

Changing zone

1385020

Trying out new zone

562ae04

handling some to-dos

d714d2e

cleanup

3474f7f

trying changing zone

08ff3ba

Changing GPU spec

ec87e9d

change to 1 gpu

c796d46

sethiay force-pushed the ai_ml_tests branch from 56f3c20 to c796d46 Compare June 22, 2023 18:01

sethiay added 2 commits June 22, 2023 18:04

Changing branch to master

3ae309d

Removing comment

f6739e6

sethiay merged commit ce0a93d into master Jun 22, 2023
3 checks passed

sethiay deleted the ai_ml_tests branch May 14, 2024 13:07
