Initial draft for Stitcher workload (SQLServer only). #361

Draft · wants to merge 5 commits into main
Conversation

anjagruenheid (Contributor) commented:
Implementation of the Stitcher workload as described in "Stitcher: Learned Workload Synthesis from Historical Performance Footprints" (EDBT 2023) by Chengcheng Wan, Yiwen Zhu, Joyce Cahoon, Wenjing Wang, Katherine Lin, Sean Liu, Raymond Truong, Neetu Singh, Alexandra M. Ciortea, Konstantinos Karanasos, and Subru Krishnan: https://openproceedings.org/2023/conf/edbt/paper-19.pdf

bpkroth (Collaborator) commented on Oct 2, 2023:
I think it would be nicer to move these into a subdirectory.
Even better would be to provide a config mechanism that allows something like the following:

```xml
<params>
   <!-- db specific connect info -->
   <include_workload_info>data/stitcher/workload_a.xml</include_workload_info>
</params>
```

Now have data/stitcher/workload_a.xml contain something like the following:

```xml
<workload_info>
  <transaction_types>
     ... <!-- as before -->
  </transaction_types>
  <works>
     <work>
        <comment>timestamp: 2023-02-24-15-11-28-957898, some other data</comment>
        <time>...</time>
        <rate>...</rate>
        <weights>...</weights>
        <terminals>...</terminals> <!-- new: moving this parameter in here might require more work -->
     </work>
     <!-- new: repeat as necessary all in a single file -->
      ...
  </works>
</workload_info>
```

anjagruenheid (Contributor, Author) replied:

Hmm, I can look into that. It would definitely make it more generic for any other benchmarks that are generated in this manner.

anjagruenheid (Contributor, Author) added:

Okay, thinking about it some more, I think we could generalize this as a new type of benchmark (multi-benchmark?) where the user specifies as input a set of configurations and an execution order as part of its XML definition. Each entry in the execution order can either point to a configuration or be a 'sleep' command. The tricky part is rewriting the DBWorkload.java class, because you'd need a way to instantiate BenchBase in subthreads. I can give it a go if you're okay with that plan? The Stitcher workload would then be one instantiation of such a multi-benchmark.
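For illustration, here is a minimal shell sketch of what such an execution order amounts to when driven by hand today: a sequence of BenchBase invocations interleaved with sleeps. The config file names are hypothetical; a multi-benchmark mode would fold this sequencing into a single XML definition.

```sh
# Hypothetical execution order, run by hand: each step is either a
# BenchBase configuration or a sleep. A multi-benchmark definition
# would capture this same sequence declaratively in one XML file.
# (Config file names below are made up for illustration.)
java -jar benchbase.jar -b tpcc -c config/sqlserver/stitcher/phase_1.xml --execute=true
sleep 60
java -jar benchbase.jar -b ycsb -c config/sqlserver/stitcher/phase_2.xml --execute=true
sleep 120
java -jar benchbase.jar -b tpch -c config/sqlserver/stitcher/phase_3.xml --execute=true
```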

Prior to executing the shell scripts, the data needs to be preloaded with the following commands:

```sh
java -jar benchbase.jar -b tpcc -c config/$dbms/stitcher/2023-02-24-15-11-28-957898.xml --create=true --load=true --execute=false
```
bpkroth (Collaborator) commented on Oct 2, 2023:

This seems a little strange, needing to preload a dated config.
Why not preload a particular DB with an explicit scale factor?

Basically, it seems to me that Stitcher, which at a high level tries to construct workload phases out of existing benchmarks, needs to:

  1. Preload a number of known benchmarks, each at a particular scale factor.
  2. Run a sequence of workload phases corresponding to each of those benchmarks.

If we ignore for a moment the case where some workloads may overlap in time period (i.e., they aren't strictly sequential, with one workload phase ending before another begins), then it'd be nice to be able to list the full sequence of 1 and 2 inside the same config, or else provide a standard reusable script that can take a directory laid out in the appropriate format and replay all of these steps directly (a sketch follows below).

What's here right now is kind of a one-off and not super reusable, other than as a template.
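A reusable replay script along those lines might look like the following sketch. The directory layout and file-naming convention are assumptions, not anything in this PR: it supposes configs are named `<timestamp>-<benchmark>.xml` so the benchmark type can be recovered from the file name, and that lexicographic order matches replay order.

```sh
# Sketch of a generic replay script (layout and naming are assumed):
# given a directory of timestamped BenchBase configs, run each in order,
# deriving the benchmark type from the last dash-separated name field.
for cfg in config/sqlserver/stitcher/*.xml; do
    bench=$(basename "$cfg" .xml | awk -F- '{print $NF}')
    java -jar benchbase.jar -b "$bench" -c "$cfg" --execute=true
done
```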

bpkroth (Collaborator) added:

Also, is it possible for us to have instructions for generating the config sequence given a resource utilization trace?

anjagruenheid (Contributor, Author) replied:

I'm not sure how these were originally generated, but I don't think the date matters that much. My guess is that these were the configurations of experiments that they collected telemetry for, and that were then (partially) picked to mimic this one specific workload. If you look at the configurations, they have different scale factors but also different query weight distributions. You would preload four datasets (TPC-H, TPC-C SF 16 and SF 160, and YCSB) and then execute these different configurations on top of those preloaded instances. This benchmark is a static snapshot of a real-world benchmark that we can publish (i.e., add to an open-source repo) because it was already used in published work. AFAIK, there is some discussion as to whether they want to open-source the Stitcher code, but no resolution yet.
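Concretely, the preload step for those four datasets might look like the following sketch. The config file names here are hypothetical (the actual configs in this PR are timestamped); only the benchmark types and scale factors come from the comment above.

```sh
# Hypothetical preload of the four base datasets; each load should
# target its own database so the two TPC-C scale factors can coexist.
java -jar benchbase.jar -b tpch -c config/sqlserver/stitcher/tpch_load.xml --create=true --load=true --execute=false
java -jar benchbase.jar -b tpcc -c config/sqlserver/stitcher/tpcc_sf16_load.xml --create=true --load=true --execute=false
java -jar benchbase.jar -b tpcc -c config/sqlserver/stitcher/tpcc_sf160_load.xml --create=true --load=true --execute=false
java -jar benchbase.jar -b ycsb -c config/sqlserver/stitcher/ycsb_load.xml --create=true --load=true --execute=false
```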
