
Auto osd cache drop #570

Merged: 31 commits merged into cloud-bulldozer:master on Jun 8, 2021

Conversation

@bengland2 (Contributor) commented May 7, 2021

Description

Depends on benchmark-wrapper PR 269.

This adds support for Ceph OSD cache dropping without each workload having to specify a pod IP.
At present it cannot automatically start the Ceph toolbox or the cache-dropping pod, because that requires authorization
in the openshift-storage/rook-ceph namespaces, but the user can start these pods and leave them running, and the rest of it works.

For OpenShift, the Ceph toolbox can be started with:


$ oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
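
If that works, a toolbox pod should appear shortly. To check (app=rook-ceph-tools is the label the Rook/OCS toolbox pod normally carries; adjust if your deployment labels it differently):

$ oc -n openshift-storage get pods -l app=rook-ceph-tools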

For the cache-drop pod, fill in the vars in roles/ceph_osd_cache_drop/rook_ceph_drop_cache_pod.yaml
and run the pod with kubectl/oc, for example:
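
For example (assuming the pod is meant to run in the openshift-storage namespace, alongside the Ceph toolbox):

$ oc apply -n openshift-storage -f roles/ceph_osd_cache_drop/rook_ceph_drop_cache_pod.yaml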

Fixes

Lack of Ceph cache dropping makes benchmark-operator incomplete for OCS.

@bengland2 (Contributor, Author)

For the reviewers, @chaitanyaenr and @dry923: the easiest way to review this is to look at the "Files" tab instead of each commit;
I'll squash the commits at merge time. The problem I'm having right now is the startup of two pods:

  • Ceph toolbox pod, an OCS pod that runs the /usr/bin/ceph CLI. I attempt to start it with the Ansible k8s equivalent of:
    oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
  • Ceph cache dropper pod, a cherrypy pod that executes "ceph tell osd.* cache drop" whenever a client does an HTTP GET on the /DropOSDCache URL (example below).
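
For reference, a client can trigger a drop by hand with something like this (the pod IP and port are placeholders for whatever the running cache dropper pod exposes):

$ curl http://<cache-dropper-pod-ip>:<port>/DropOSDCache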

The problem with both of these is that I get an HTTP 403 from the K8s API when I attempt to start them in roles/ceph_osd_cache_drop/tasks/main.yml. I know it's just privileges, because the same pod startups work with $KUBECONFIG (admin) as my authorization, and when I start the pods by hand using "oc" the code all works.

I've tried to start the cache dropper pod in my-ripsaw but then it can't get access to the secrets that are only available within the openshift-storage namespace. But I don't have privs to start a pod within openshift-storage namespace.

@jtaleric jtaleric added the ok to test Kick off our CI framework label May 11, 2021
@jtaleric (Member)

/rerun all

namespace: "{{ rook_ceph_namespace }}"
register: drop_pod_already_exists

#- debug:
@jtaleric (Member) commented on this hunk:

We good to remove this?

@bengland2 (Contributor, Author)

@jtaleric I want to be able to remove the rook_ceph_namespace bit, but I don't know how to get authentication and authorization to access the secrets yet; still working that out. The same goes for the oc patch OCSInitialization task in the same file.
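
For reference, the Ansible equivalent of that oc patch is roughly the sketch below; the kubernetes.core.k8s_json_patch module and the ocs.openshift.io/v1 api_version are assumptions on my part, not necessarily what the role uses. Either way it hits the same 403 unless the operator's service account is authorized in openshift-storage.

# sketch only: assumes the kubernetes.core collection and that OCSInitialization is served at ocs.openshift.io/v1
- name: enable the Ceph toolbox by patching OCSInitialization
  kubernetes.core.k8s_json_patch:
    api_version: ocs.openshift.io/v1
    kind: OCSInitialization
    name: ocsinit
    namespace: openshift-storage
    patch:
      - op: replace
        path: /spec/enableCephTools
        value: true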

@comet-perf-ci (Collaborator)

Results for ec2_jjb

Test Result Retries Duration (HH:MM:SS)
test_backpack Pass 0 00:03:52
test_byowl Pass 0 00:00:12
test_fiod Pass 1 00:34:29
test_flent Pass 0 00:06:54
test_fs_drift Pass 0 00:04:33
test_hammerdb Pass 0 00:03:57
test_iperf3 Pass 0 00:05:46
test_kubeburner Pass 0 00:22:50
test_log_generator Pass 0 00:02:53
test_pgbench Pass 0 00:02:20
test_smallfile Pass 0 00:04:17
test_stressng Pass 0 00:01:31
test_sysbench Pass 0 00:01:04
test_uperf Pass 1 00:41:29
test_vegeta Pass 0 00:04:08
test_ycsb Pass 0 00:02:53
test_scale_openshift Pass 0 00:08:09

@bengland2 (Contributor, Author)

I'd merge it now, but this depends on bohica quay.io/cloud-bulldozer/ceph-cache-dropper being updated, and that hasn't been rebuilt. How do I trigger a rebuild? Also, I'd like to figure out how to put the ceph_cache_dropper pod in the my-ripsaw namespace instead of the openshift-storage/rook-ceph namespace.

stale bot commented Jun 2, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the not_ready label Jun 2, 2021
@bengland2 (Contributor, Author)

I think I'm going to have to change this so that the user is responsible for actually starting the cache dropper pod and the ceph toolbox pod, but the rest of it can stay the same. This is because the benchmark-operator doesn't have authorization to start pods in the openshift-storage namespace. Will try to get to this next week and finish up.

@stale stale bot removed the not_ready label Jun 3, 2021
@jtaleric (Member) commented Jun 7, 2021

@bengland2 ack - so are we good with your current implementation? LGTM..

@bengland2 (Contributor, Author)

@jtaleric the current implementation will not work for the reasons discussed above, but if I make the user create the Ceph toolbox pod and the cache dropper pod from a provided YAML file, the authorization problem goes away and the rest should work fine. I still think there is a way to get some sort of token and do it automatically, but for now that's beyond what I know how to do. At least the CR will not need to change to contain the IP address of the cache dropper pod: benchmark-operator will discover the pod and use it automatically if the user specifies that Ceph OSD cache dropping is desired (see the sketch below). This change is small and I should be able to test it this week and merge it. Sorry for the delays; the Multus Alias allocation work required me to postpone this. Thanks for the other merge.
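
As a sketch of what that looks like from the user's side (the flag names below are illustrative; see the documentation added in this PR for the exact field names), the benchmark CR only carries a boolean per cache type and the operator locates the cache dropper pod on its own:

  workload:
    name: fio_distributed          # example workload; any workload that supports cache dropping
    args:
      drop_cache_kernel: true      # illustrative: drop the kernel page cache before each sample
      drop_cache_rook_ceph: true   # illustrative: ask the cache dropper pod to run "ceph tell osd.* cache drop"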

@comet-perf-ci (Collaborator)

Results for ec2_jjb

Test Result Retries Duration (HH:MM:SS)
test_backpack Pass 0 00:05:06
test_byowl Pass 0 00:00:11
test_fiod Pass 1 00:35:17
test_flent Pass 0 00:08:59
test_fs_drift Pass 0 00:04:41
test_hammerdb Pass 0 00:03:54
test_iperf3 Pass 0 00:06:08
test_kubeburner Pass 0 00:23:29
test_log_generator Pass 0 00:04:16
test_pgbench Pass 0 00:02:42
test_smallfile Pass 0 00:04:48
test_stressng Pass 0 00:01:48
test_sysbench Pass 0 00:01:06
test_uperf Pass 0 00:26:19
test_vegeta Pass 0 00:04:24
test_ycsb Pass 0 00:02:48
test_scale_openshift Pass 0 00:09:00

@comet-perf-ci (Collaborator)

Results for ec2_jjb

Test Result Retries Duration (HH:MM:SS)
test_backpack Pass 0 00:03:53
test_byowl Pass 0 00:00:16
test_fiod Pass 1 00:36:34
test_flent Pass 0 00:06:53
test_fs_drift Pass 0 00:04:13
test_hammerdb Pass 0 00:03:37
test_iperf3 Pass 0 00:06:02
test_kubeburner Pass 0 00:23:42
test_log_generator Pass 0 00:04:00
test_pgbench Pass 0 00:02:56
test_smallfile Pass 0 00:04:52
test_stressng Pass 0 00:02:05
test_sysbench Pass 0 00:00:59
test_uperf Pass 0 00:24:46
test_vegeta Pass 0 00:03:45
test_ycsb Pass 0 00:01:37
test_scale_openshift Pass 0 00:03:54

@bengland2 (Contributor, Author)

I'm merging because the test_fiod.sh script passed when run against an AWS cluster, and the test result, in terms of passes and fails, is identical to what it was before I included OSD cache dropping tests in the PR. Specifically, there was an fio failure 28 days ago when none of the test_crs used OSD cache dropping, so none of the OSD cache dropping code actually executed, and the fio test passed the second time. This makes me think it was a timeout due to slow image load rather than an actual problem with the code. I would like to see the CI save the logs for each test run so that failures like this could be diagnosed instead of guessed at.

@bengland2 bengland2 merged commit cfb2b0a into cloud-bulldozer:master Jun 8, 2021
ebattat pushed a commit to ebattat/benchmark-operator that referenced this pull request Jun 10, 2021
only do ceph osd cache dropping if user requests it
default to openshift for benchmark-operator
add option to drop Ceph OSD cache to CR
document Ceph OSD cache dropping
user must start cache dropper and ceph toolbox pod
test both OSD cache dropping and kernel cache dropping at same time
only if openshift-storage namespace is defined
amitsagtani97 pushed a commit that referenced this pull request Jun 11, 2021
* add hammerdb vm support CNV-6501 and pod support for mariadb and postgres

* add generic hammerdb cr

* add hammerdb vm example

* change hammerdb crds hierarchy according to database type

* fixes after review

* fix hammerdb mssql test

* revert sql server namespace

* revert transactions number

* update transactions number to 500k

* update transactions to 100000

* update transactions to 100000

* update transactions to 10000 for fast run

* fix hammer workload name

* add creator pod wait

* add debug true

* revert app label to hammerdb_workload

* fix type name

* temporary fix in common.sh

* revert my common.sh changes

* change db init to false

* change db init to true

* update changes to support operator-sdk version 1.5.0

* update changes to support operator-sdk version 1.5.0

* enlarge the timeout from 500 to 800

* increase timeout to 1000

* revert timeout to 500s

* add pin and resources support

* add mssql 2019 image and creator pod

* revert it back to legacy mssql test

* add es custom fields support

* fix image example name

* update changes to support operator-sdk version 1.5.0

* add latest changes

* update changes to support operator-sdk version 1.5.0

* fix operator-sdk version 1.5.0

* add os version

* fixes after changes

* add es_os_version

* update changes to support operator-sdk version 1.5.0

* fix hammer doc - database per the CR file

* fix hammedb doc

* fix hammedb doc

* add es_kind

* add es_kind to cr and fix merge conflict

* remove .idea

* update changes to support operator-sdk version 1.5.0

* adding cerberus validate certs parameter

Signed-off-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>

* Remove magazine section from CONTRIBUTING.md

Signed-off-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>

* Add support for kafka as log backend for verification

Signed-off-by: Sai Sindhur Malleni <smalleni@redhat.com>

* Expand README

Signed-off-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>

* Quiesce logging in pod for log generator workload

This along with cloud-bulldozer/benchmark-wrapper#273
helps suppress any unneeded logging in the pod, so that we can accurately and easily count
the number of log messages received in a backend like kafka merely by using the offsets.
The plan is to deploy the log generator pods in a separate namespace and forward those logs
to a topic in kafka. That way we would be able to reliably count the messages received just
by looking at kafka topic offset. Otherwise there would be other logs from the log generator pods
as well as benchmark-operator pod that would make it hard to reliably count logs received just by
kafka offset.

Signed-off-by: Sai Sindhur Malleni <smalleni@redhat.com>

* removed line breaks for trex tasks only

* mounting module path for mlnx

* updated doc for mlnx sriov policy

* Update installation.md

* Make sink verification optional for kafka

Signed-off-by: Sai Sindhur Malleni <smalleni@redhat.com>

* Auto osd cache drop (#570)

only do ceph osd cache dropping if user requests it
default to openshift for benchmark-operator
add option to drop Ceph OSD cache to CR
document Ceph OSD cache dropping
user must start cache dropper and ceph toolbox pod
test both OSD cache dropping and kernel cache dropping at same time
only if openshift-storage namespace is defined

* replace preview by working in hammerdb doc

* update changes to support operator-sdk version 1.5.0

* replace preview by working in hammerdb doc

* remove stressng fixes

Co-authored-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>
Co-authored-by: Sai Sindhur Malleni <smalleni@redhat.com>
Co-authored-by: Murali Krishnasamy <mukrishn@redhat.com>
Co-authored-by: Ayesha Vijay Kumar <84931574+Ayesha279@users.noreply.github.com>
Co-authored-by: Ben England <bengland@redhat.com>