
Reduce Infrastructure Spending #2156

Open
afrittoli opened this issue Sep 6, 2024 · 3 comments

@afrittoli
Member

We need to monitor infrastructure spending and make sure we reduce costs where possible.
Currently, our spending breakdown looks like this:

[image: infrastructure spending breakdown chart]
@afrittoli
Member Author

The Cloud Storage spending is mostly associated with container registry image download bandwidth. We will have to migrate to Artifact Registry soon, but that will not reduce the cost.

@afrittoli
Member Author

afrittoli commented Sep 6, 2024

Areas where we could reduce spending:

  • Cloud Storage / Container Registry: move released container images to ghcr.io. This will shift traffic over time from gcr.io to ghcr.io as Tekton users move to the latest releases (see the mirroring sketch after this list).
    • Knowing the source of the egress traffic would help optimise this work; unfortunately, it is currently not possible to find that out from the Google Cloud monitoring tools.
    • It is possible that some or most of the traffic is generated by the CI systems of Tekton or of other projects that use Tekton. If this is true, moving even just the new releases might alleviate the cost.
  • Compute Engine: this is a combination of long-running services and jobs (CI/CD pipelines). The data needs further analysis; some initial considerations:
    • Prow: this is used by most CI jobs today, and it cannot be removed easily. The CPU/memory consumption of the control plane seems contained. The test jobs namespace is where most of the CPU requests and consumption occur. We can review existing CI jobs to see if there is any optimisation work that can be done. We have done work in the past to optimise these, using kind for e2e tests with a nodepool that scales down to zero when not in use.
    • Tekton (dogfooding): this is used by all CD pipelines, so it cannot be removed easily. The CPU/memory consumption of the control plane seems contained.
    • Messaging (knative + kafka): this is used to transport events from Tekton pipelines to Tekton EventListeners. Since we only have one consumer (the tekton-events event listener), we could remove knative + kafka and rely on direct 1:1 delivery of events (see the sketch at the end of this comment). Kafka is one of the top memory consumers among the services we deploy, so it may be worth removing it even if it means losing some reliability in event delivery.
    • Nightly builds: these use a decent amount of CPU and memory, even if only for ~30 minutes each. We could reduce the frequency of nightly builds and the amount of testing we do there, since the code is already tested in each PR.
  • Container Registry vulnerability scanning: we could disable this. We use minimal base images which are already scanned elsewhere, and we perform various types of security scanning on the code in each PR, which may be sufficient.
    • This has now been disabled.
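
As a rough sketch of the image-mirroring idea above (not an agreed implementation), released images could be copied with go-containerregistry's crane package, or the equivalent `crane copy` CLI command. The repository paths, release tag and ghcr.io organisation below are placeholders for illustration only:

```go
// mirror.go: copy a set of released images from gcr.io to ghcr.io.
// Image paths, tag and target organisation are illustrative placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	images := []string{
		"pipeline/cmd/controller",
		"pipeline/cmd/webhook",
	}
	tag := "v0.60.0" // hypothetical release tag

	for _, img := range images {
		src := fmt.Sprintf("gcr.io/tekton-releases/github.com/tektoncd/%s:%s", img, tag)
		dst := fmt.Sprintf("ghcr.io/tektoncd/%s:%s", img, tag)
		// crane.Copy streams manifests and layers from src to dst, preserving digests;
		// registry credentials are resolved from the default keychain (e.g. docker login).
		if err := crane.Copy(src, dst); err != nil {
			log.Fatalf("copying %s: %v", src, err)
		}
		log.Printf("copied %s -> %s", src, dst)
	}
}
```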

Other services are basically negligible and not worth looking into for now.
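
For the messaging item above, here is a minimal sketch of what direct 1:1 delivery could look like: posting a CloudEvent straight to the EventListener's HTTP endpoint with the CloudEvents Go SDK instead of going through knative + kafka. The event type, payload and listener URL are assumptions for illustration, not the current wiring:

```go
// send.go: deliver an event directly to a Tekton EventListener over HTTP,
// bypassing the knative + kafka layer. Type, payload and URL are illustrative.
package main

import (
	"context"
	"log"

	cloudevents "github.com/cloudevents/sdk-go/v2"
)

func main() {
	client, err := cloudevents.NewClientHTTP()
	if err != nil {
		log.Fatalf("creating CloudEvents client: %v", err)
	}

	event := cloudevents.NewEvent()
	event.SetType("dev.tekton.event.pipelinerun.successful.v1") // hypothetical event type
	event.SetSource("dogfooding/tekton-pipelines")
	if err := event.SetData(cloudevents.ApplicationJSON, map[string]string{"pipelineRun": "nightly-build"}); err != nil {
		log.Fatalf("setting event data: %v", err)
	}

	// Hypothetical in-cluster address of the tekton-events EventListener service.
	ctx := cloudevents.ContextWithTarget(context.Background(), "http://el-tekton-events.default.svc.cluster.local:8080")
	if result := client.Send(ctx, event); cloudevents.IsUndelivered(result) {
		log.Fatalf("failed to deliver event: %v", result)
	}
	log.Println("event delivered")
}
```

This trades kafka's buffering for plain HTTP delivery (retries would have to live on the sender side), which is the reliability trade-off mentioned in the list above.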

@afrittoli
Member Author

This PR enables removing tests from nightly builds: tektoncd/pipeline#8252
