More explicit failure indication in cbt run. #97

bdastur · 2016-03-31T21:49:50Z

When executing the cbt.py test suite, it is very hard to figure out which steps failed/passed.
My experience with this tool is very limited as I just started using it, but I see that the pdsh commands fail without any error, so it is hard to decipher why.

Also, the use_existing flag in the cluster: section of the yaml file should be highlighted for anyone running against an existing cluster. Once I get through a successful execution, I will create a pull request for any doc changes that make sense, and file other issues if I see any.

Another issue I see is that the username and group name are assumed to be the same, which is not always the case. It might be useful to add a groups field as well.

Lastly:
I think I have now gotten past some of my initial hurdles and am able to execute an fio benchmark, but I am not sure what comes next.

The last step I see is:

21:30:37 - DEBUG - cbt - pdsh -R ssh -w behzad_dastur@v-stagemon-002-prod.abc.xyz.net,behzad_dastur@b-stageosd001-r19f29-prod.abc.acme.net,behzad_dastur@v-stagemon-001-prod.abc.acme.net sudo chown -R behzad_dastur.behzad_dastur /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/*
21:30:37 - DEBUG - cbt - rpdcp -f 1 -R ssh -w behzad_dastur@v-stagemon-002-prod.abc.acme.net,behzad_dastur@b-stageosd001-r19f29-prod.abc.acme.net,behzad_dastur@v-stagemon-001-prod.abc.acme.net -r /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/* /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite

I can see logs created at:

[root@cbtvm001-d658 cbt]# ls /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/read/
collectl.b-stageosd001-r19f29-prod.acme.symcpe.net
collectl.v-stagemon-002-prod.abc.acme.net
output.0.v-stagemon-001-prod.abc.acme.net
collectl.v-stagemon-001-prod.abc.acme.net
historic_ops.out.b-stageosd001-r19f29-prod.abc.acme.net

Are there ways to visualize this data now?


ommoreno · 2016-03-31T23:55:11Z

The last thing CBT does is copy the logs and output files from the nodes/clients over to the head node. This is all raw data and FIO summary output, so you need to write a parser if you want to visualize the data as cluster performance.
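As a possible first step toward such a parser, here is a minimal sketch that recovers the test parameters from a results directory path like the one shown earlier in this thread. It assumes the `name-number` directory naming seen in that path (`osd_ra-00004096`, `iodepth-064`, and so on). The function name `parse_test_path` is hypothetical and is not part of CBT. It does not read the fio or collectl files themselves.

```python
import re
from pathlib import PurePosixPath

# Matches directory names of the form "<name>-<zero-padded number>",
# e.g. "op_size-01048576". This pattern is an assumption based on the
# paths shown in this thread, not a documented CBT format.
_PARAM = re.compile(r"^([A-Za-z_]+)-(\d+)$")


def parse_test_path(path):
    """Return a dict of the numeric test parameters encoded in a results path."""
    params = {}
    for part in PurePosixPath(path).parts:
        match = _PARAM.match(part)
        if match:
            params[match.group(1)] = int(match.group(2))
    return params


example = ("/tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/"
           "concurrent_procs-001/iodepth-064/randwrite")
print(parse_test_path(example))
```

With the parameters in a dict, the per-host files in each directory (collectl.*, output.0.*, historic_ops.out.*) can be grouped by test configuration before any plotting is attempted.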

bdastur · 2016-04-01T14:41:03Z

Thanks for confirming/clarifying, @ommoreno.

bengland2 · 2016-07-21T19:32:35Z

See fiologparser.py in the axboe/fio tree under tools/; it is in the process of being improved by Mark and Karl Cronburg.
Error checking is being tightened up; see PRs #107 and #110.

sand33p-23 · 2016-10-19T09:11:29Z

I'm running cbt on an existing cluster, but I'm not getting any output in the "output.0" file. All I'm getting is some output in "historic_ops.out". I tried running both the librbdfio and rados benchmarks.

bengland2 · 2016-10-19T12:50:54Z

Try running the fio or rados bench command standalone and see if you get an error. Then walk backwards in the command list until you find the first command that failed.

I added code to CBT to check for failures while constructing the cluster and to throw an exception if one occurs, but I did not enable failure checking everywhere. There are cases where some users may find it useful to ignore a single failure, such as a test that constructs a 1000-OSD cluster and encounters a single bad disk. You can turn it on anywhere you like by adding the parameter ", continue_if_error=False" as the last parameter in the common.pdsh calls in CBT code.

It sounds like your cluster was built if you are seeing historic_ops.out results. What happens when you run, by itself, the rados bench command that CBT runs? Also look in benchmark/radosbench.py and enable error checking there, so that CBT will tell you what's going wrong.
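To illustrate the idea behind a continue_if_error switch, here is a standalone sketch built on Python's subprocess module. It is not CBT's common.pdsh implementation. The helper name `run_checked` is hypothetical, and only the parameter name `continue_if_error` is taken from the comment above. With the flag left at True a failure is reported and execution continues. With False the first failing command raises.

```python
import subprocess
import sys


def run_checked(argv, continue_if_error=True):
    """Run a command; on a non-zero exit status either warn or raise.

    Illustrative helper only, not CBT code.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        msg = "command %r exited with status %d: %s" % (
            argv, proc.returncode, proc.stderr.strip())
        if not continue_if_error:
            raise RuntimeError(msg)
        # Tolerant mode: report the failure but keep going.
        print("WARNING: " + msg, file=sys.stderr)
    return proc


# A command that always fails, used here instead of a real pdsh call.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

run_checked(failing)  # warns on stderr, then continues
try:
    run_checked(failing, continue_if_error=False)
except RuntimeError as err:
    print("stopped:", err)
```

Making the strict behavior opt-in per call matches the trade-off described above: a large cluster build can tolerate one bad disk, while a benchmark step can be made to fail loudly.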

sand33p-23 · 2016-10-19T17:06:11Z

Thanks for the steps. I got cbt running after a lot of troubleshooting. I really need to document the steps so I won't hit the same issues when I run it on another cluster.
