Regular & visible ansible refreshes of machines #695

Closed
smlambert opened this issue Feb 5, 2019 · 21 comments
Assignees: sxa, karianna
Labels: bug, secure-dev (Issues specific to SSDF/SLSA compliance work)

Comments

@smlambert
Contributor

Now that the playbooks are more stable, it would be good to have regular and visible machine refreshes, to ensure that new updates to the playbooks are picked up and deployed on a regular basis (related: #624, submitted 2 weeks ago, which needs deployment to test machines). In addition to 'full set of machines' refreshes, one-off updates to a particular machine should also be made visible through some easy-to-find communication.

Benefits include faster test triage, easier onboarding of new helpers to the infra team, and visibility to all interested parties.

At present, what tools are used for deployment? Ansible Tower (which is not visible to non-infra folks)? I ask because this request for scheduled, visible machine refreshes could possibly be addressed by using the Ansible plugins for Jenkins and scheduling a set of infra jobs to run regularly. These jobs would then be visible to more than the infra team, and the infra tasks would be handled similarly to the build and test tasks. But maybe Ansible Tower gives other benefits, which would be good to understand (as it comes at the cost of visibility/transparency).

I know it's already been discussed by infra and was possibly already planned, so if this is already being done, please point me to it. I would like to help.

@AdamBrousseau
Contributor

This is something we are looking into for the OpenJ9 CI as well, so there's a possibility for collaboration or reusing solutions here: eclipse-openj9/openj9#4221

@smlambert
Contributor Author

Thanks Adam, I was going to post that as a related effort. I was also going to ask on that issue about the difference between using the Jenkins Ansible plugins with a schedule and the Tower approach.

karianna added the bug label Feb 5, 2019
@karianna
Contributor

karianna commented Feb 5, 2019

AWX should be rolling out updates regularly - I suspect there is a bug.

@sxa
Member

sxa commented Feb 6, 2019

> Now that the playbooks are more stable

While refreshing regularly is a goal, we're not at the stage where the playbooks are stable enough to do it on a regular basis, and that has to be a prerequisite. We're working on it, but bear in mind that we're currently getting very regular requests for new types of systems. Having an infrastructure capable of testing changes before they're deployed in production is therefore critical to ensuring that visible ansible refreshes of production machines don't break anything.

We're a lot closer than we were a couple of months ago but I would not advocate putting this in place right now as I believe the risk would be too great.

Current issue list: https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues?q=is%3Aissue+is%3Aopen+label%3Aansible

@sxa
Member

sxa commented Feb 6, 2019

For reference, my plan is to start running them manually on subsets of machines and ensure that they are running "green" (many haven't been, and I've done a pile of work under the infra repo to get them closer; we're almost there on xlinux I think, but other things like builds keep getting in the way!). This is the best way to understand any stability problems. Then I'll start running the schedules for them automatically in AWX. Bear in mind that at present it's really still just me (Windows excepted) working on playbook stabilisation.
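
The staged approach described here (run on a subset, confirm it is green, then widen) could be sketched roughly as below. This is only an illustration, not the project's actual tooling: the playbook and inventory file names are hypothetical, and the stage patterns are borrowed from hosts and patterns mentioned elsewhere in this thread rather than from any real AWX schedule. `-i`, `--limit` and `--check` are standard `ansible-playbook` options.

```python
import shlex
from typing import List, Sequence

# Illustrative stages only: one machine first, then wider groups.
# The ordering is an assumption, not a documented rollout plan.
STAGES = ("test-packet-ubuntu1604-x64-3", "build-*1", "test-softlayer*")


def playbook_command(playbook: str, inventory: str, limit: str,
                     check: bool = False) -> List[str]:
    """Build an ansible-playbook invocation restricted to a host pattern."""
    cmd = ["ansible-playbook", "-i", inventory, playbook, "--limit", limit]
    if check:
        # --check requests a dry run from modules that support it.
        cmd.append("--check")
    return cmd


def rollout_plan(playbook: str, inventory: str,
                 stages: Sequence[str] = STAGES) -> List[List[str]]:
    """Return one command per stage, to be run in order once each is green."""
    return [playbook_command(playbook, inventory, stage) for stage in stages]


if __name__ == "__main__":
    # "playbook.yml" and "hosts.ini" are placeholder names.
    for command in rollout_plan("playbook.yml", "hosts.ini"):
        print(shlex.join(command))
```

The script only prints the commands, so each stage can be reviewed and run by hand before anything is put on a schedule.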

@smlambert
Contributor Author

@sxa555 - apologies, I understood you and Husain were getting very close to playbook goodness, so thought it was timely to propose. I know you are holding the fort on this work (which I greatly appreciate).

Please let me know if there are any small tasks we can help with... (tricky I know due to permissions, etc).

@sxa
Member

sxa commented Feb 7, 2019

No need to apologise - I want to get there as much as you do :-)

@sxa
Member

sxa commented Feb 7, 2019

Shelley has confirmed that my re-run on the first machine has gone cleanly and resolved the issue, so I will be continuing to redeploy on other systems - I'll update this comment as and when each one is done.

test-softlayer-rhel69-x64-1 not yet done as it's subject to #698

@smlambert
Contributor Author

And I will mention that I verified I could still run openjdk regression tests on test-packet-ubuntu1604-x64-3 after its refresh.

@karianna
Contributor

Aha - this is the issue I was missing. Right - I'm happy to pair with @sxa555 and get through this as well. Got stuck on s390 and ppcle with docker.

@sxa
Member

sxa commented Mar 13, 2019

s390x and ppc64le are now resolved as per #714. Now running on all UNIX build-*1 machines to validate whether there are outstanding issues.

sxa assigned sxa and karianna Mar 13, 2019
@sxa
Member

sxa commented Mar 13, 2019

Several machines failed due to the issue addressed by #729

build-linaro-centos74-armv8-1
build-packet-centos74-armv8-1
build-packet-ubuntu1604-armv8-2

There were a few issues with machines being unreachable (armv7 offline, others likely temporary)

build-marist-rhel74-s390x-1
build-marist-rhel74-s390x-2
build-scaleway-ubuntu1604-armv7-2

And a few special-case failures:
build-joyent-centos69-x64-1 - out of space - removed /swapfile - wasn't in use
build-marist-sles12-s390x-1 - zypper upgrade failed - managed to hand hold manually
build-osuosl-ubuntu1604-ppc64le-1 - AO10 download failed with 404 from this link - Seems correct based on this link - Fix is in #730

Now running on a subset of the test machines (test-softlayer*) to validate those.
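
When triaging a list like the one above, it can help to group failing machines by architecture. The following is a minimal sketch that assumes host names follow a five-part `role-provider-os-arch-index` pattern. That pattern is inferred from the names in this thread, not taken from a documented convention, and the `Host`, `parse_host` and `group_by_arch` names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Host(NamedTuple):
    role: str      # e.g. "build" or "test"
    provider: str  # e.g. "packet", "marist"
    os: str        # e.g. "centos74"
    arch: str      # e.g. "armv8", "s390x"
    index: int


def parse_host(name: str) -> Host:
    """Split a host name into its parts, assuming exactly five fields."""
    parts = name.split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected host name format: {name!r}")
    role, provider, os_name, arch, index = parts
    return Host(role, provider, os_name, arch, int(index))


def group_by_arch(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group host names by their architecture field."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[parse_host(name).arch].append(name)
    return dict(groups)


# Host names taken from the failure lists in this comment.
failed = [
    "build-linaro-centos74-armv8-1",
    "build-packet-centos74-armv8-1",
    "build-packet-ubuntu1604-armv8-2",
    "build-marist-rhel74-s390x-1",
    "build-scaleway-ubuntu1604-armv7-2",
]

if __name__ == "__main__":
    for arch, hosts in sorted(group_by_arch(failed).items()):
        print(arch, hosts)
```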

@karianna
Contributor

Thanks for the efforts in getting us to green @sxa555 !

@sxa
Member

sxa commented Mar 14, 2019

/var/run/systemd/sessions on build-linaro-centos74-armv8-2 is chewing a lot of disk space and is blocking yum commands - rebooting to clear.

@sxa
Member

sxa commented Mar 22, 2019

numactl needs to be excluded on arm32:

failed: [build-scaleway-ubuntu1604-armv7-2] (item=numactl) => {"changed": false, "failed": true, "item": "numactl", "msg": "No package matching 'numactl' is available"}
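
Failure lines in this shape can be pulled apart mechanically when collecting results across many hosts. The sketch below is a triage helper only, not the playbook fix itself. It assumes the `failed: [host] (item=...) => {json}` layout shown above, and the `parse_failure` name is hypothetical.

```python
import json
import re
from typing import Any, Dict, Optional

# Matches lines shaped like the one quoted above; "(item=...)" is optional.
FAILED_LINE = re.compile(
    r"^failed: \[(?P<host>[^\]]+)\]"
    r"(?: \(item=(?P<item>[^)]*)\))?"
    r" => (?P<payload>\{.*\})$"
)


def parse_failure(line: str) -> Optional[Dict[str, Any]]:
    """Return host, item and msg for a failure line, or None if it does not match."""
    match = FAILED_LINE.match(line.strip())
    if match is None:
        return None
    try:
        payload = json.loads(match.group("payload"))
    except json.JSONDecodeError:
        return None
    return {
        "host": match.group("host"),
        "item": match.group("item"),
        "msg": payload.get("msg"),
    }


# The failure line from this comment.
SAMPLE = (
    'failed: [build-scaleway-ubuntu1604-armv7-2] (item=numactl) => '
    '{"changed": false, "failed": true, "item": "numactl", '
    '"msg": "No package matching \'numactl\' is available"}'
)

if __name__ == "__main__":
    print(parse_failure(SAMPLE))
```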

@Haroon-Khel
Contributor

Haroon-Khel commented May 17, 2023

A quick breakdown, by platform, of the issues that need to be addressed before the playbooks can run regularly without error:

Mac
#3050: macOS 10.15 boxes need to be upgraded; brew upgrades are being skipped on these boxes
#3049: GPG signature verification of packages is failing on some Mac machines
#3048: If our playbooks detect a curl version lower than 7.58, they upgrade it to the highest version available. On Mac this upgrades curl to version 8, which appears to cause problems for Ansible's get_url module

x64
#3066

No remaining issues with ppc64le and s390x

Windows, AIX, Solaris and aarch64 still need addressing
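
For the curl threshold mentioned under #3048, the version check needs to compare numerically rather than as strings (as text, "7.9" would sort above "7.58"). A minimal sketch follows, assuming the first line of `curl --version` output starts with `curl <version>`. The function names are hypothetical and this is not the playbooks' actual logic.

```python
import re
from typing import Tuple

# Threshold taken from the description of #3048 above.
MIN_CURL = (7, 58)


def parse_curl_version(output: str) -> Tuple[int, ...]:
    """Extract the numeric version from the start of `curl --version` output."""
    match = re.match(r"curl (\d+(?:\.\d+)+)", output)
    if match is None:
        raise ValueError(f"could not find a curl version in: {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def needs_upgrade(output: str, minimum: Tuple[int, ...] = MIN_CURL) -> bool:
    """True when the detected version is lower than the minimum."""
    # Compare only as many components as the minimum specifies.
    return parse_curl_version(output)[: len(minimum)] < minimum


if __name__ == "__main__":
    for sample in ("curl 7.54.0", "curl 7.58.0", "curl 8.1.2"):
        print(sample, needs_upgrade(sample))
```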

@Haroon-Khel
Contributor

Update:

Scheduled playbook deployment is up and running for AIX, Solaris and aarch64. Some outstanding issues remain with AIX (#3086).

Windows is nearly finished; we are just waiting on final actions regarding credentials for the Windows machines.

@sxa
Member

sxa commented Jun 30, 2023

> Scheduled playbook deployment is up and running for AIX, Solaris and aarch64. Some outstanding issues remain with AIX #3086

aarch64 presumably means Linux/aarch64 in this case (we have three aarch64 platforms). Assuming so, are all the other Linux architectures already running on a schedule?

@Haroon-Khel
Contributor

Yes, all of the platforms are now running on a schedule
