
Auto osd cache drop #570

Merged: 31 commits merged into cloud-bulldozer:master on Jun 8, 2021

Conversation

@bengland2 (Contributor) commented May 7, 2021

Description

Depends on benchmark-wrapper PR 269.

This adds support for Ceph OSD cache dropping without each workload having to specify a pod IP.
At present it cannot automatically start the Ceph toolbox or the cache-dropping pod, because that requires authorization
in the openshift-storage/rook-ceph namespaces, but the user can start these pods and leave them running, and the rest of it works.

For OpenShift, the Ceph toolbox can be started with:


$ oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch  '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
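
If that works, a toolbox pod should appear shortly. To check (app=rook-ceph-tools is the label the Rook/OCS toolbox pod normally carries; adjust if your deployment labels it differently):

$ oc -n openshift-storage get pods -l app=rook-ceph-tools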

For the cache-drop pod, fill in the vars in roles/ceph_osd_cache_drop/rook_ceph_drop_cache_pod.yaml
and run the pod with kubectl/oc, for example:
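
For example (assuming the pod is meant to run in the openshift-storage namespace, alongside the Ceph toolbox):

$ oc apply -n openshift-storage -f roles/ceph_osd_cache_drop/rook_ceph_drop_cache_pod.yaml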

Fixes

Lack of Ceph cache dropping makes benchmark-operator incomplete for OCS.

@bengland2 (Contributor, Author)

For the reviewers, @chaitanyaenr and @dry923: the easiest way to review this is to look at the "Files" tab instead of each commit;
I'll squash the commits at merge time. The problem I'm having right now is the startup of two pods:

  • Ceph toolbox pod, an OCS pod that runs the /usr/bin/ceph CLI. I attempt to start it with the Ansible k8s equivalent of:
    oc patch OCSInitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
  • Ceph cache dropper pod, a cherrypy pod that executes "ceph tell osd.* cache drop" whenever a client does an HTTP GET on the /DropOSDCache URL (example below).
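
For reference, a client can trigger a drop by hand with something like this (the pod IP and port are placeholders for whatever the running cache dropper pod exposes):

$ curl http://<cache-dropper-pod-ip>:<port>/DropOSDCache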

The problem with both of these is that I get an HTTP 403 from the K8s API when I attempt to start them in roles/ceph_osd_cache_drop/tasks/main.yml. I know it's just privileges, because the same pod startups work with $KUBECONFIG (admin) as my authorization, and when I start the pods by hand using "oc" the code all works.

I've tried to start the cache dropper pod in my-ripsaw but then it can't get access to the secrets that are only available within the openshift-storage namespace. But I don't have privs to start a pod within openshift-storage namespace.

@jtaleric jtaleric added the ok to test Kick off our CI framework label May 11, 2021
@jtaleric (Member)

/rerun all

namespace: "{{ rook_ceph_namespace }}"
register: drop_pod_already_exists

#- debug:
@jtaleric (Member) commented on this hunk:

We good to remove this?

@bengland2 (Contributor, Author)

@jtaleric I want to be able to remove the rook_ceph_namespace bit, but I don't know how to get authentication and authorization to access the secrets yet; still working that out. The same goes for the oc patch OCSInitialization task in the same file.
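
For reference, the Ansible equivalent of that oc patch is roughly the sketch below; the kubernetes.core.k8s_json_patch module and the ocs.openshift.io/v1 api_version are assumptions on my part, not necessarily what the role uses. Either way it hits the same 403 unless the operator's service account is authorized in openshift-storage.

# sketch only: assumes the kubernetes.core collection and that OCSInitialization is served at ocs.openshift.io/v1
- name: enable the Ceph toolbox by patching OCSInitialization
  kubernetes.core.k8s_json_patch:
    api_version: ocs.openshift.io/v1
    kind: OCSInitialization
    name: ocsinit
    namespace: openshift-storage
    patch:
      - op: replace
        path: /spec/enableCephTools
        value: true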

@comet-perf-ci (Collaborator)

Results for ec2_jjb

Test Result Retries Duration (HH:MM:SS)
test_backpack Pass 0 00:03:52
test_byowl Pass 0 00:00:12
test_fiod Pass 1 00:34:29
test_flent Pass 0 00:06:54
test_fs_drift Pass 0 00:04:33
test_hammerdb Pass 0 00:03:57
test_iperf3 Pass 0 00:05:46
test_kubeburner Pass 0 00:22:50
test_log_generator Pass 0 00:02:53
test_pgbench Pass 0 00:02:20
test_smallfile Pass 0 00:04:17
test_stressng Pass 0 00:01:31
test_sysbench Pass 0 00:01:04
test_uperf Pass 1 00:41:29
test_vegeta Pass 0 00:04:08
test_ycsb Pass 0 00:02:53
test_scale_openshift Pass 0 00:08:09

@bengland2 (Contributor, Author)

I'd merge it now, but this depends on bohica quay.io/cloud-bulldozer/ceph-cache-dropper being updated, and that hasn't been rebuilt. How do I trigger a rebuild? Also, I'd like to figure out how to put the ceph_cache_dropper pod in the my-ripsaw namespace instead of the openshift-storage/rook-ceph namespace.

stale bot commented Jun 2, 2021

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the not_ready label Jun 2, 2021
@bengland2 (Contributor, Author)

I think I'm going to have to change this so that the user is responsible for actually starting the cache dropper pod and the ceph toolbox pod, but the rest of it can stay the same. This is because the benchmark-operator doesn't have authorization to start pods in the openshift-storage namespace. Will try to get to this next week and finish up.

@stale stale bot removed the not_ready label Jun 3, 2021
@jtaleric (Member) commented Jun 7, 2021

@bengland2 ack - so are we good with your current implementation? LGTM..

@bengland2 (Contributor, Author)

@jtaleric the current implementation will not work for the reasons discussed above, but if I make the user create the Ceph toolbox pod and the cache dropper pod from a provided YAML file, the authorization problem goes away and the rest should work fine. I still think there is a way to get some sort of token and do it automatically, but for now that's beyond what I know how to do. At least the CR will not need to change to contain the IP address of the cache dropper pod: benchmark-operator will discover the pod and use it automatically if the user specifies that Ceph OSD cache dropping is desired (see the sketch below). This change is small and I should be able to test it this week and merge it. Sorry for the delays; the Multus Alias allocation work required me to postpone this. Thanks for the other merge.
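
As a sketch of what that looks like from the user's side (the flag names below are illustrative; see the documentation added in this PR for the exact field names), the benchmark CR only carries a boolean per cache type and the operator locates the cache dropper pod on its own:

  workload:
    name: fio_distributed          # example workload; any workload that supports cache dropping
    args:
      drop_cache_kernel: true      # illustrative: drop the kernel page cache before each sample
      drop_cache_rook_ceph: true   # illustrative: ask the cache dropper pod to run "ceph tell osd.* cache drop"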

@comet-perf-ci (Collaborator)

Results for ec2_jjb

Test Result Retries Duration (HH:MM:SS)
test_backpack Pass 0 00:05:06
test_byowl Pass 0 00:00:11
test_fiod Pass 1 00:35:17
test_flent Pass 0 00:08:59
test_fs_drift Pass 0 00:04:41
test_hammerdb Pass 0 00:03:54
test_iperf3 Pass 0 00:06:08
test_kubeburner Pass 0 00:23:29
test_log_generator Pass 0 00:04:16
test_pgbench Pass 0 00:02:42
test_smallfile Pass 0 00:04:48
test_stressng Pass 0 00:01:48
test_sysbench Pass 0 00:01:06
test_uperf Pass 0 00:26:19
test_vegeta Pass 0 00:04:24
test_ycsb Pass 0 00:02:48
test_scale_openshift Pass 0 00:09:00

@comet-perf-ci (Collaborator)

Results for ec2_jjb

Test Result Retries Duration (HH:MM:SS)
test_backpack Pass 0 00:03:53
test_byowl Pass 0 00:00:16
test_fiod Pass 1 00:36:34
test_flent Pass 0 00:06:53
test_fs_drift Pass 0 00:04:13
test_hammerdb Pass 0 00:03:37
test_iperf3 Pass 0 00:06:02
test_kubeburner Pass 0 00:23:42
test_log_generator Pass 0 00:04:00
test_pgbench Pass 0 00:02:56
test_smallfile Pass 0 00:04:52
test_stressng Pass 0 00:02:05
test_sysbench Pass 0 00:00:59
test_uperf Pass 0 00:24:46
test_vegeta Pass 0 00:03:45
test_ycsb Pass 0 00:01:37
test_scale_openshift Pass 0 00:03:54

@bengland2 (Contributor, Author)

I'm merging because the test_fiod.sh script passed when run against an AWS cluster, and the test result, in terms of passes and fails, is identical to what it was before I included OSD cache dropping tests in the PR. Specifically, there was an fio failure 28 days ago when none of the test_crs used OSD cache dropping, so none of the OSD cache dropping code actually executed, and the fio test passed the second time. This makes me think it was a timeout due to slow image load rather than an actual problem with the code. I would like to see the CI save the logs for each test run so that failures like this could be diagnosed instead of guessed at.

@bengland2 bengland2 merged commit cfb2b0a into cloud-bulldozer:master Jun 8, 2021
ebattat pushed a commit to ebattat/benchmark-operator that referenced this pull request Jun 10, 2021
only do ceph osd cache dropping if user requests it
default to openshift for benchmark-operator
add option to drop Ceph OSD cache to CR
document Ceph OSD cache dropping
user must start cache dropper and ceph toolbox pod
test both OSD cache dropping and kernel cache dropping at same time
only if openshift-storage namespace is defined
amitsagtani97 pushed a commit that referenced this pull request Jun 11, 2021
* add hammerdb vm support CNV-6501 and pod support for mariadb and postgres

* add generic hammerdb cr

* add hammerdb vm example

* change hammerdb crds hierarchy according to database type

* fixes after review

* fix hammerdb mssql test

* revert sql server namespace

* revert transactions number

* update transactions number to 500k

* update transactions to 100000

* update transactions to 100000

* update transactions to 10000 for fast run

* fix hammer workload name

* add creator pod wait

* add debug true

* revert app label to hammerdb_workload

* fix type name

* temporary fix in common.sh

* revert my common.sh changes

* change db init to false

* change db init to true

* update changes to support operator-sdk version 1.5.0

* update changes to support operator-sdk version 1.5.0

* enlarge the timeout from 500 to 800

* increase timeout to 1000

* revert timeout to 500s

* add pin and resources support

* add mssql 2019 image and creator pod

* revert it back to legacy mssql test

* add es custom fields support

* fix image example name

* update changes to support operator-sdk version 1.5.0

* add latest changes

* update changes to support operator-sdk version 1.5.0

* fix operator-sdk version 1.5.0

* add os version

* fixes after changes

* add es_os_version

* update changes to support operator-sdk version 1.5.0

* fix hammer doc - database per the CR file

* fix hammedb doc

* fix hammedb doc

* add es_kind

* add es_kind to cr and fix merge conflict

* remove .idea

* update changes to support operator-sdk version 1.5.0

* adding cerberus validate certs parameter

Signed-off-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>

* Remove magazine section from CONTRIBUTING.md

Signed-off-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>

* Add support for kafka as log backend for verification

Signed-off-by: Sai Sindhur Malleni <smalleni@redhat.com>

* Expand README

Signed-off-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>

* Quiesce logging in pod for log generator workload

This along with cloud-bulldozer/benchmark-wrapper#273
helps suppress any unneeded logging in the pod, so that we can accurately and easily count
the number of log messages received in a backend like kafka merely by using the offsets.
The plan is to deploy the log generator pods in a separate namespace and forward those logs
to a topic in kafka. That way we would be able to reliably count the messages received just
by looking at kafka topic offset. Otherwise there would be other logs from the log generator pods
as well as benchmark-operator pod that would make it hard to reliably count logs received just by
kafka offset.

Signed-off-by: Sai Sindhur Malleni <smalleni@redhat.com>

* removed line breaks for trex tasks only

* mounting module path for mlnx

* updated doc for mlnx sriov policy

* Update installation.md

* Make sink verification optional for kafka

Signed-off-by: Sai Sindhur Malleni <smalleni@redhat.com>

* Auto osd cache drop (#570)

only do ceph osd cache dropping if user requests it
default to openshift for benchmark-operator
add option to drop Ceph OSD cache to CR
document Ceph OSD cache dropping
user must start cache dropper and ceph toolbox pod
test both OSD cache dropping and kernel cache dropping at same time
only if openshift-storage namespace is defined

* replace preview by working in hammerdb doc

* update changes to support operator-sdk version 1.5.0

* replace preview by working in hammerdb doc

* remove stressng fixes

Co-authored-by: Kedar Vijay Kulkarni <kkulkarni@redhat.com>
Co-authored-by: Sai Sindhur Malleni <smalleni@redhat.com>
Co-authored-by: Murali Krishnasamy <mukrishn@redhat.com>
Co-authored-by: Ayesha Vijay Kumar <84931574+Ayesha279@users.noreply.github.com>
Co-authored-by: Ben England <bengland@redhat.com>