Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automating initialization of on-demand self-hosted CNCF CIL runner #39

Merged
merged 1 commit into from
Mar 15, 2022
Merged

Automating initialization of on-demand self-hosted CNCF CIL runner #39

merged 1 commit into from
Mar 15, 2022

Conversation

gyohuangxin
Copy link
Member

@gyohuangxin gyohuangxin commented Feb 25, 2022

The commit automates below steps:

  1. Create a CNCF CIL machine and register it as a self-hosted runner
  2. Run SMP benchmarks on the self-hosted runner
  3. Stop and remove the CNCF CIL machine and self-hosted runner

Description

This PR fixes #38

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

@hershd23
Copy link
Contributor

hershd23 commented Feb 25, 2022

Great work till now @gyohuangxin. You've made the workflow quite straightforward and easy to understand. If you need any help in resolving those OS permission issues you mentioned in #38, do let me know

Also mentioning @MarioArriaga92 who had a question in the recent build and release about provisioning and releasing instances during the workflow. This should help Mario!

@gyohuangxin
Copy link
Member Author

@hershd23 Thanks, what you did before helped me a lot in this implementation!
The issue blocks me now is how to run minikube start with the user I created. I tried to create a user and switch to the user: https://github.com/gyohuangxin/meshery-smp-action/blob/self-hosted/.github/workflows/configurable-benchmark-test-self-hosted.yaml#L83-#L86
However, the message of minikube told me I was still using root user which is not allowed to run minikube start: https://github.com/gyohuangxin/meshery-smp-action/runs/5357892309?check_suite_focus=true#step:3:76
It seems it's hard to switch another user across github action steps.

@hershd23
Copy link
Contributor

hershd23 commented Mar 1, 2022

I tried looking for the solution to that myself wasn't able to find anything. Is it possible to make a different user say smp on machine startup itself and our Github action sessions are made on that user instead of root?

@gyohuangxin
Copy link
Member Author

@hershd23 Yes, thank you for your help, it makes sense. I'm investigating to create the user via userdata script, so that we can use the smp user from startup.

@gyohuangxin
Copy link
Member Author

@hershd23 I used the cloud-config configuration to create user smp and it works, https://github.com/gyohuangxin/meshery-smp-action/blob/self-hosted/.github/workflows/scripts/start-cil-runner.sh#L15
The cloud-config (userdata) is very helpful for us to configure everything from machine stratup and here is the related docs:

You can find the latest result from here: https://github.com/gyohuangxin/meshery-smp-action/runs/5401509632?check_suite_focus=true
The machine creation, registration, deletion has been implemented, but still an issue with runing mesheryctl perf, it seems the pods can not be accessed when usingkubernetes platform. Do you have any comments on this? @hershd23 @navendu-pottekkat

Configuring Meshery to access Minikube...
Error getting context: Post "http://localhost:9081/api/system/kubernetes/contexts": dial tcp [::1]:9081: connect: connection refused
Configuration file: load-test.yaml
Endpoint URL: http://192.168.49.2:31126/productpage
Service Mesh: ISTIO
Test Name: istio-fortio-load-test.yamltest
Load Generator: fortio
Running test with test configuration file load-test.yaml
Error: failed to make a request.Get "http://localhost:9081/api/user/performance/profiles?page_size=25&page=0&search=test": dial tcp [::1]:9081: connect: connection refused.
See https://docs.meshery.io/reference/mesheryctl/perf/apply for usage details

@pottekkat
Copy link
Contributor

@gyohuangxin It seems like we are deploying Meshery inside the minikube cluster and there is some networking issue. I would suggest we deploy Meshery in Docker and connect it to the Kubernetes cluster. This is how we run tests on the GitHub runners now.

I'm not sure how it deployed Meshery on Kubernetes right now. I will go over the action code and get back.

We can set the platform on the workflow to Docker and it should work.

Does this help?

@pottekkat
Copy link
Contributor

Also, not deploying Meshery on the cluster have the added benefit of producing more accurate results as running Meshery is not interfering with the performance of the cluster to some extent.

@gyohuangxin
Copy link
Member Author

@navendu-pottekkat Thanks, it helps. I'll try to deploy Meshery in Docker.

@gyohuangxin
Copy link
Member Author

@navendu-pottekkat I tried to deploy Meshery in docker, but there was a nil pointer panic in meshery container. https://github.com/gyohuangxin/meshery-smp-action/runs/5462624684?check_suite_focus=true#step:6:1240
Do you have any comments on this?

@pottekkat
Copy link
Contributor

pottekkat commented Mar 8, 2022

@navendu-pottekkat I tried to deploy Meshery in docker, but there was a nil pointer panic in meshery container. gyohuangxin/meshery-smp-action/runs/5462624684?check_suite_focus=true#step:6:1240
Do you have any comments on this?

@piyushsingariya I think we did fix this bug. A new release of mesheryctl should fix this right?

@piyushsingariya
Copy link

@navendu-pottekkat I tried to deploy Meshery in docker, but there was a nil pointer panic in meshery container. gyohuangxin/meshery-smp-action/runs/5462624684?check_suite_focus=true#step:6:1240
Do you have any comments on this?

@piyushsingariya I think we did fix this bug. A new release of mesheryctl should fix this right?

@navendu-pottekkat This isn't an mesheryctl issue, it's from the server. @gyohuangxin can you try running same performance test with local build??

@gyohuangxin
Copy link
Member Author

@navendu-pottekkat @piyushsingariya When the platform is docker and using mesheryctl perf, it seems that the meshery container cannot access the endpoint of service mesh application.
https://github.com/gyohuangxin/meshery-smp-action/runs/5462624684?check_suite_focus=true#step:6:1174
Is there any method to make endpoint can be accessed to meshery container?

@pottekkat
Copy link
Contributor

There is no Meshery method. It could be some issue with the networking but we were able to use the same configuration on GitHub runners to successfully run benchmark tests and access the application endpoint. It could be environment specific.

@gyohuangxin
Copy link
Member Author

@navendu-pottekkat Yes, I looked at the code and found the panic caused by failing to get minikube context meshery-meshery-1 | time="2022-03-10T07:48:08Z" level=warning msg="failed to generate in cluster context: " meshery-meshery-1 | time="2022-03-10T07:48:08Z" level=warning msg="failed to find kubernetes context".
And I tried the local build manually and it works, so it could be the OS permission issue again.

And regarding the Github runner, I found there were another panic with GitHub runners, https://github.com/layer5io/meshery-smp-action/runs/5493977248?check_suite_focus=true#step:6:1573, it may be another issue we need to fix.

The commit automates below steps:
1. Create a CNCF CIL machine and register it as a self-hosted runner
2. Run SMP benchmarks on the self-hosted runner
3. Stop and remove the CNCF CIL machine and self-hosted runner

Signed-off-by: Huang Xin <xin1.huang@intel.com>
@gyohuangxin gyohuangxin marked this pull request as ready for review March 11, 2022 07:46
@gyohuangxin
Copy link
Member Author

Hi, there. I'm blocked by it too much time and the next important job (running scheduled benchmarking test on CNCF cluster) shouldn't be blocked. Can we start reviewing this PR and create another issue to track the meshery problem?
@hershd23 @navendu-pottekkat @leecalcote

@pottekkat
Copy link
Contributor

Yes @gyohuangxin We can review the workflow and merge it. Could you open a new issue to track the others?

@pottekkat
Copy link
Contributor

@hershd23 Could you also review this PR?

@hershd23
Copy link
Contributor

Yes will do

@gyohuangxin
Copy link
Member Author

@navendu-pottekkat @hershd23 Thanks, I opened the issue #40.

Copy link
Contributor

@hershd23 hershd23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gyohuangxin I have reviewed the PR. Great work. I have not suggested any changes but have left a few comments wanting to know a couple of things.

@@ -35,7 +35,7 @@ main() {
kubectl config view --minify --flatten > ~/minified_config
mv ~/minified_config ~/.kube/config

curl -L https://git.io/meshery | PLATFORM=$PLATFORM bash -
curl -L https://git.io/meshery | sudo PLATFORM=$PLATFORM bash - &
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for running this command in the background?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tested on the self-hosted runner, it will be pending always if not running it in the background. But I don't know why this doesn't happen on github runner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm okay, it's a minor thing let's just keep a note of this behaviour and come back to it later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a good reminder.

# Use user_data_scripts to register the CNCF CIL runner as a self-hosted runner
user_data_scripts="#cloud-config\nusers:\n - default\n - name: smp\n groups: sudo, docker\n sudo: ALL=(ALL) NOPASSWD:ALL\n lock_passwd: true\nruncmd:\n - [runuser, -l, smp, -c, \'mkdir actions-runner && cd actions-runner\']\n - [runuser, -l, smp, -c, \'curl -o actions-runner-linux-x64-2.287.1.tar.gz -L https://github.com/actions/runner/releases/download/v2.287.1/actions-runner-linux-x64-2.287.1.tar.gz\']\n - [runuser, -l, smp, -c, \'tar xzf ./actions-runner-linux-x64-2.287.1.tar.gz\']\n - [runuser, -l, smp, -c, \'export RUNNER_ALLOW_RUNASROOT=1\']\n - [runuser, -l, smp, -c, \'./config.sh --url https://github.com/$REPOSITORY --token $REG_TOKEN --labels $hostname >> github-action-registeration.log\']\n - [runuser, -l, smp, -c, \'./run.sh >> github-action-registeration.log\']"

# TODO: the options "operating_system", "facility", "plan" are hardcoded now, we should make them configurable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO still pending, would this also come in the scope of this PR or should we just make a new issue and document it there, till someone else picks it up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be in another PR instead of this one. This PR are the initial automation of self-hosted runner, so I won't include too many changes. And regarding this TODO, I think there is still something to discuss, it's a good idea to create an new issue and see if anyone can picks it up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Let's create another issue on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gyohuangxin I'll let you take care of making an issue on this one as you would understand the details better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hershd23 I created the issue #41 to track it.

exit 1
fi

# Wait 10 minutes until the machine is running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to wait 10 minutes? Is standard waiting time for an instance to come up mentioned somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no standard waiting time in my opinion. My thought about waiting 10 minutes is that if it takes too long to wait for machine to be running, there must be something wrong with it.
However, on my second thought, it would be better if we could take more advantage of "state" field instead of just waiting 10 minutes:

  1. If "state" == "provisioning", sleep 10s...
  2. If "state" == "active", echo "Machine successfully created!" and continue.
  3. If "state" == "failed", echo "Failed to create machine" and exit.
    How do you think about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gyohuangxin this new flow makes a lot of sense. But let's capture this in another issue. There might be slight experimentation required here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this issue and will be creating an issue ticket for this one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, please go ahead.

Copy link
Contributor

@hershd23 hershd23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's just create tickets for the things we have discussed in the comments and we will be good to go

@hershd23
Copy link
Contributor

@navendu-pottekkat @leecalcote I don't seem to have the permissions to merge these changes. Do review them and merge them once you're done with your review

@hershd23
Copy link
Contributor

@leecalcote @navendu-pottekkat please do review and merge. Looks like I do not have merging permissions here

Copy link

@piyushsingariya piyushsingariya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@leecalcote
Copy link
Member

This is really neat.

@leecalcote leecalcote merged commit 20c0b6f into layer5io:self-hosted Mar 15, 2022
@leecalcote
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants