SWIMProjectUCB edited this page Apr 20, 2012 · 37 revisions

SWIM -- Statistical Workload Injector for MapReduce

Yanpei Chen, Sara Alspaugh, Archana Ganapathi (1), Rean Griffith (2), Randy Katz

{ychen2, alspaugh, randy} [at] eecs [dot] berkeley [dot] edu, (1) aganapathi [at] splunk [dot] com, (2) rean [at] vmware [dot] com.

Additional contributions from Madalin Mihailescu (madalin@cs.toronto.edu).

Version 1.4. Released January 2012.

Overview

MapReduce systems face enormous challenges due to the growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. SWIM includes:

  1. A repository of real-life MapReduce workloads from production systems.
  2. Workload synthesis tools to generate representative test workloads by sampling historical MapReduce cluster traces.
  3. Workload replay tools to execute the historical or test workloads with low performance overhead.
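
As a rough illustration of item 2, the sketch below samples fixed-length time windows from a historical trace and concatenates them into a shorter test workload. It is a minimal sketch under stated assumptions, not SWIM's actual code: the `Job` record, its field names, and the `synthesize` function are hypothetical names chosen for this example.

```python
import random
from typing import List, NamedTuple


class Job(NamedTuple):
    """Hypothetical per-job trace record (field names are assumptions)."""
    submit_time: float  # seconds since the start of the trace
    input_bytes: int
    shuffle_bytes: int
    output_bytes: int


def synthesize(trace: List[Job], trace_length: float, window: float,
               num_windows: int, seed: int = 0) -> List[Job]:
    """Build a synthetic workload from randomly sampled trace windows.

    Each sampled window keeps the jobs submitted inside it and shifts
    their submit times so that the windows are laid end to end.
    """
    rng = random.Random(seed)
    synthetic: List[Job] = []
    for i in range(num_windows):
        # Pick a random window start that fits inside the trace.
        start = rng.uniform(0.0, max(trace_length - window, 0.0))
        offset = i * window  # position of this window in the output
        for job in trace:
            if start <= job.submit_time < start + window:
                shifted = offset + (job.submit_time - start)
                synthetic.append(job._replace(submit_time=shifted))
    synthetic.sort(key=lambda j: j.submit_time)
    return synthetic
```

For example, `synthesize(trace, trace_length=86400.0, window=3600.0, num_windows=4)` would condense a one-day trace into a four-hour workload whose jobs keep their original data sizes and their relative arrival times within each sampled hour.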

SWIM enables rigorous performance measurement of MapReduce systems. SWIM contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns. This represents an advance over previous MapReduce pseudo-benchmarks of limited diversity and scope. SWIM informs both highly targeted, workload-specific optimizations and designs intended to bring general benefit.

We believe MapReduce cluster operators can use SWIM to accomplish other previously challenging tasks, including but not limited to provisioning and planning resources in multiple dimensions, tuning configurations for diverse job types within a workload, anticipating workload consolidation behavior, and quantifying workload superposition in multiple dimensions.

SWIM is currently integrated with Hadoop. The performance and evaluation science behind it is extensible to MapReduce systems in general.

You can learn more about SWIM from our IEEE MASCOTS 2011 paper, "The Case for Evaluating MapReduce Performance Using Workload Suites".

This page contains an early release of SWIM, and we expect to populate this page with additional workloads and examples as they become available. We welcome and appreciate all comments, suggestions, bug fix requests, use cases, and success stories.

Download SWIM

SWIM is currently open-source under the New BSD License, except for files derived from Apache Hadoop, which are under the Apache License 2.0.

Please either use the git repository directly or download it as a compressed archive.

Next

Analyze historical cluster traces and synthesize representative workload

Workloads repository

Performance measurement by executing synthetic or historical workloads

Our IEEE MASCOTS 2011 paper discusses the scientific/engineering details of the workload synthesis and replay methods.

CHANGELOG and prior improvements to SWIM

Acknowledgements

We thank the various industry partners and government sponsors of the UC Berkeley Reliable, Adaptive, and Distributed Systems Laboratory and its successor, the Algorithms, Machines, and People Laboratory, for their support and feedback on the initial versions of SWIM.