[WIP] Add e2e test for `tune` api with LLM hyperparameter optimization #2420

helenxie-bit · 2024-09-03T13:17:38Z

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Docs included if any changes are user facing

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

google-oss-prow · 2024-09-03T13:17:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

helenxie-bit · 2024-09-03T13:21:23Z

/area gsoc

helenxie-bit · 2024-09-03T13:21:49Z

Ref: #2339

helenxie-bit · 2024-09-24T21:12:47Z

The e2e test for the `tune` API has been consistently failing with a timeout error, and I have been investigating the root cause. I set the `retain_trials` parameter to `True` and retrieved the logs from the Experiment's pod. They showed that both the `pytorch` container and the `metrics-logger-and-collector` container exited with exit code 137.
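As a side note on the 137 value reported above: by the usual shell and container convention, an exit status above 128 means the process was terminated by signal number `status - 128`, so 137 corresponds to signal 9 (SIGKILL). A SIGKILL is often what an out-of-memory kill or a forced container teardown looks like, but nothing in this thread confirms which of those happened here. The sketch below is only an illustration of the decoding step; `describe_exit_code` is a hypothetical helper name, and it assumes a Linux host for the signal table.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable hint."""
    if code > 128:
        # Convention: 128 + N means "terminated by signal N".
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code}: unknown signal {code - 128}"
        return f"exit code {code}: terminated by {name}"
    return f"exit code {code}: process exited on its own"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

If the result is SIGKILL, checking the container's `lastState.terminated.reason` field in the pod status (for example `OOMKilled`) is one way to narrow the cause further.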

When I ran `kubectl describe pod $POD_NAME -n default`, I noticed the following events. One specific event, `SandboxChanged`, stood out as potentially problematic:

Events:
  Type    Reason          Age                    From               Message
  ----    ------          ----                   ----               -------
  ...
  Normal  SandboxChanged  3m (x2 over 3m43s)     kubelet            Pod sandbox changed, it will be killed and re-created.
  ...
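When the same check has to be repeated across many CI runs, the events table can be filtered programmatically instead of read by eye. The following is a minimal sketch, assuming the text output of `kubectl describe pod` has already been captured into a string. `parse_events` is a hypothetical helper, and it relies on the columns being separated by two or more spaces, as in the excerpt above.

```python
import re

# Excerpt copied from the events shown above.
SAMPLE = """\
Events:
  Type    Reason          Age                    From               Message
  ----    ------          ----                   ----               -------
  ...
  Normal  SandboxChanged  3m (x2 over 3m43s)     kubelet            Pod sandbox changed, it will be killed and re-created.
  ...
"""

COLUMNS = ("type", "reason", "age", "from", "message")


def parse_events(describe_output: str) -> list[dict]:
    """Return one dict per event row, skipping headers and elisions."""
    events = []
    for line in describe_output.splitlines():
        # Columns are padded with at least two spaces; the message is kept whole.
        cols = re.split(r"\s{2,}", line.strip(), maxsplit=4)
        if len(cols) != 5 or cols[0] in ("Type", "----"):
            continue
        events.append(dict(zip(COLUMNS, cols)))
    return events


if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        if event["reason"] == "SandboxChanged":
            print(event["age"], event["message"])
```

Parsing the human-readable table is fragile by nature; requesting structured output (for example `kubectl get events -o json`) would be the sturdier choice if the workflow allows it.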

However, when I checked the pod logs using `kubectl logs $POD_NAME -n default --all-containers`, everything appeared normal, and the logs confirmed that "Training is complete."

I also examined the kubelet and container runtime logs. While the kubelet logs provided no additional insights, the container runtime logs displayed the following error, which I believe may be related to the issue:

Sep 24 20:39:17 fv-az1984-731 dockerd[3344]: time="2024-09-24T20:39:17.335535766Z" level=error msg="Can not stat \"/mnt/docker/overlay2/8b5c0a1c561c3db93ea374a55b4004ebffd509c370cd11140b1275bf004fc8cd/merged/run/mysqld/mysqlx.sock\": lstat /mnt/docker/overlay2/8b5c0a1c561c3db93ea374a55b4004ebffd509c370cd11140b1275bf004fc8cd/merged/run/mysqld/mysqlx.sock: no such file or directory"
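To sift error-level entries out of a long container runtime log, lines like the one above can be split into their `key=value` fields. This is a minimal sketch, assuming the journald-style prefix ends with `]: ` and the payload uses double quotes with backslash-escaped inner quotes, as in the line above. `parse_dockerd_line` is a hypothetical helper, and the sample line is shortened with a made-up path for readability.

```python
import shlex


def parse_dockerd_line(line: str) -> dict:
    """Split a journald-style dockerd log line into its key=value fields."""
    # Drop the "<date> <host> dockerd[<pid>]: " prefix.
    _, _, payload = line.partition("]: ")
    fields = {}
    # shlex understands double quotes and \" escapes inside them.
    for token in shlex.split(payload):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Shortened, illustrative line; the path is a placeholder, not from the real log.
sample = (
    'Sep 24 20:39:17 host dockerd[3344]: time="2024-09-24T20:39:17Z" '
    'level=error msg="Can not stat \\"/tmp/example.sock\\": no such file or directory"'
)

if __name__ == "__main__":
    fields = parse_dockerd_line(sample)
    if fields.get("level") == "error":
        print(fields["msg"])
```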

@andreyvelich @tenzen-y Do you have any thoughts on how to resolve this issue?

add e2e test for tune api

6be7f29

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

google-oss-prow bot requested review from andreyvelich, anencore94 and gaocegege September 3, 2024 13:17

google-oss-prow bot added the size/M label Sep 3, 2024

helenxie-bit mentioned this pull request Sep 3, 2024

[GSoC] Project 4: Hyperparameter Optimization API in Katib for LLMs #2339

Open

6 tasks

google-oss-prow bot added the area/gsoc label Sep 3, 2024

helenxie-bit added 2 commits September 3, 2024 21:38

upgrade training-operator sdk

1a1f119

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

specify the version of training operator sdk

8461a49

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

helenxie-bit changed the title ~~[GSoC] Add e2e test for tune api with LLM hyperparameter optimization~~ [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024

google-oss-prow bot added the do-not-merge/work-in-progress label Sep 3, 2024

fix num_labels error and update the version of training operator cont…

c860238

…roller Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024

helenxie-bit added 13 commits September 3, 2024 22:30

check the version of training operator

216ebd9

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

debug

f6b96f5

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check import path of HuggingFaceModelParams

c636493

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

update the version of training operator sdk

8180422

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

update the name of experiment

6101489

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

add step of checking pod

d67a1b8

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check the logs of pod

295abb6

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

add check

e0a1b6d

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check reason for imagepullbackoff

1df7df9

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

revert timeout limit

d1e1311

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

fix format

0cc319f

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

extend timeout limit

0383932

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

update training operator sdk version

08c8634

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

helenxie-bit added 26 commits September 12, 2024 22:54

check the logs of pod

7a98a00

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

rerun tests

8862d79

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

update the function of getting logs

e4f614d

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

add the step of describing pod

0385eea

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check disk space

e0c5170

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

change work directory

0286f70

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

change work directory

f6e5ed5

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

increase timeout limit

7ea7e43

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check the logs of controller and events

25d99b1

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

change work directory

fcd64fa

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

change work directory

122c611

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

change work directory

c1fde09

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check the logs of kubelet

8ff6864

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check the logs of kubelet

da3c298

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

increase cpu

a1bff26

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check the logs of training operator

bbae57b

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check the use of resources

e45ceac

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check the logs of container 'pytorch' and 'storage_initializer'

4ae11ed

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

fix error of checking use of resources

bedab36

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

add other checks to find the error reason

7bfb3cc

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

set 'storage_config'

efffdc2

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

reduce the number of tests

2a18b17

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

Check container runtime logs

c6c964b

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

set the driver of minikube as docker

28ffb96

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

set the driver of minikube to none

dc684e3

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

check logs of pod

a12034c

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
