More explicit failure indication in cbt run. #97

Open
bdastur opened this issue Mar 31, 2016 · 6 comments

Comments

@bdastur

bdastur commented Mar 31, 2016

When executing the cbt.py test suite, it is very hard to figure out which steps failed and which passed.
My experience with this tool is very limited, as I just started using it, but I see that the pdsh commands fail without printing any error, so it is hard to work out why.

Also, the use_existing flag in the cluster: section of the YAML file should be highlighted for use against an existing cluster. Once I get through a successful execution, I will create a pull request for any doc changes that make sense, and open other issues if I find any.

Another issue I see is that the username and group name are assumed to be the same, which is not always the case. It might be useful to add a group field as well.

Lastly:
I think I have now gotten past some of my initial hurdles and am able to execute an fio benchmark, but I am not sure what comes next.

The last step I see is:

21:30:37 - DEBUG - cbt - pdsh -R ssh -w behzad_dastur@v-stagemon-002-prod.abc.xyz.net,behzad_dastur@b-stageosd001-r19f29-prod.abc.acme.net,behzad_dastur@v-stagemon-001-prod.abc.acme.net sudo chown -R behzad_dastur.behzad_dastur /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/*
21:30:37 - DEBUG - cbt - rpdcp -f 1 -R ssh -w behzad_dastur@v-stagemon-002-prod.abc.acme.net,behzad_dastur@b-stageosd001-r19f29-prod.abc.acme.net,behzad_dastur@v-stagemon-001-prod.abc.acme.net -r /tmp/cbt/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite/* /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/randwrite

I can see logs created at:

[root@cbtvm001-d658 cbt]# ls /tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/concurrent_procs-001/iodepth-064/read/
collectl.b-stageosd001-r19f29-prod.acme.symcpe.net
collectl.v-stagemon-002-prod.abc.acme.net
output.0.v-stagemon-001-prod.abc.acme.net
collectl.v-stagemon-001-prod.abc.acme.net
historic_ops.out.b-stageosd001-r19f29-prod.abc.acme.net

Is there a way to visualize this data now?

@ommoreno

The last thing CBT does is copy the logs and output files from the nodes/clients over to the head node. This is all raw data and FIO summary output, so you need to write a parser if you want to visualize the data as cluster performance.
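
As a starting point for such a parser, the test parameters can be recovered from the result directory names shown earlier in this thread. This is only a minimal sketch: the function name parse_result_path is hypothetical, and it assumes every parameter directory has the form name-number and that the last path component is the I/O mode. It does not read the fio output files themselves.

```python
from pathlib import PurePosixPath


def parse_result_path(path):
    """Split a CBT-style result directory into test parameters.

    Assumption: parameter directories look like 'name-digits'
    (e.g. 'op_size-01048576') and the final component is the I/O
    mode (e.g. 'randwrite'). Other components are ignored.
    """
    parts = PurePosixPath(path).parts
    params = {}
    for part in parts[:-1]:
        name, sep, value = part.rpartition("-")
        if sep and value.isdigit():
            params[name] = int(value)
    params["mode"] = parts[-1]
    return params


example = ("/tmp/00000000/LibrbdFio/osd_ra-00004096/op_size-01048576/"
           "concurrent_procs-001/iodepth-064/randwrite")
result = parse_result_path(example)
print(result)
```

Keying each result directory by these parameters makes it easier to line up the per-host output files for plotting later.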

@bdastur
Author

bdastur commented Apr 1, 2016

Thanks for confirming and clarifying, @ommoreno.

@bengland2
Contributor

See fiologparser.py in the axboe/fio tree under tools/; Mark and Karl Cronburg are in the process of improving it.
Error checking is being tightened up; see PRs #107 and #110.

@sand33p-23

I'm running CBT on an existing cluster, but I'm not getting any output in the "output.0" file. All I'm getting is some output in "historic_ops.out". I tried running both the librbdfio and rados benchmarks.

@bengland2
Contributor

Try running the fio or rados bench command standalone and see if you get an error. Then walk backwards in the command list until you find the first command that failed.
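
One way to get that command list is to pull it out of the CBT debug log. The sketch below assumes the "HH:MM:SS - DEBUG - cbt - command" line format seen in the log excerpt earlier in this thread; extract_commands is a hypothetical helper, the sample hosts and paths are placeholders, and not every DEBUG line in a real log is necessarily a shell command.

```python
import re

# Assumed line format, taken from the log excerpt in this thread:
#   21:30:37 - DEBUG - cbt - <command>
LOG_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) - (\w+) - cbt - (.*)$")


def extract_commands(log_text):
    """Return (timestamp, message) pairs for every DEBUG line, in log order."""
    commands = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "DEBUG":
            commands.append((match.group(1), match.group(3)))
    return commands


# Placeholder sample in the same shape as the real log lines.
sample = """21:30:37 - DEBUG - cbt - pdsh -R ssh -w user@host sudo chown -R user.user /tmp/cbt/results
21:30:37 - DEBUG - cbt - rpdcp -f 1 -R ssh -w user@host -r /tmp/cbt/results /tmp/results"""

commands = extract_commands(sample)

# Walk backwards from the last command, re-running each one by hand.
for timestamp, command in reversed(commands):
    print(timestamp, command)
```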

I added code into CBT to check for failures while constructing the cluster, and throw an exception if one occurs, but did not enable failure checking everywhere - there are cases where some users may find it useful to ignore a single failure, such as a test that constructs a 1000-OSD cluster and encounters a single bad disk. You can turn it on anywhere you like by adding the parameter ", continue_if_error=False" as the last parameter in the common.pdsh calls in CBT code.
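
To illustrate what that flag is meant to do, here is a standalone sketch of the pattern. It is not CBT's common.pdsh: run_checked is a hypothetical function built on subprocess, and only the parameter name continue_if_error comes from the comment above.

```python
import subprocess
import sys


def run_checked(argv, continue_if_error=True):
    """Run a command; raise on a non-zero exit unless continue_if_error is True.

    Hypothetical stand-in for illustration, not CBT's implementation.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        message = "%r exited with %d: %s" % (
            argv, proc.returncode, proc.stderr.strip())
        if not continue_if_error:
            raise RuntimeError(message)
        # Lenient mode: report the failure but keep going.
        print("WARNING: " + message, file=sys.stderr)
    return proc


# A command that always fails, used here in place of a real pdsh call.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

lenient = run_checked(failing)  # warns and continues
try:
    run_checked(failing, continue_if_error=False)  # strict: raises
    raised = False
except RuntimeError:
    raised = True
```

With the strict setting, the first failing command stops the run instead of being silently skipped, which is the behavior asked for at the top of this issue.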

It sounds like your cluster was built, since you are seeing historic_ops.out results. What happens when you run, by itself, the rados bench command that CBT runs? Also look in benchmark/radosbench.py and enable error checking there, so that CBT will tell you what's going wrong.

@sand33p-23

Thanks for the steps. I got CBT running after a lot of troubleshooting. I really need to document the steps so that I won't hit the same issues when I run it on another cluster.
