Analyze historical cluster traces and synthesize representative workload

The standard first steps in the Hadoop workload analysis process are to extract per-job statistics from job history logs and to synthesize representative test workloads from the historical trace data. These shorter-duration test workloads can then drive performance comparisons on real-life, production systems.

Step 1. Parse historical Hadoop logs

This tool parses Hadoop job history logs created in the default Hadoop logging format.

parse-hadoop-jobhistory.pl

Usage:

perl parse-hadoop-jobhistory.pl [job history dir] > outputFile.tsv

This script prints its output to STDOUT, which can then be piped to a file or consumed for further analysis. The output format is tab-separated values (.tsv), one row per job. The output fields are:

   1.  unique_job_id
   2.  job_name
   3.  map_input_bytes
   4.  shuffle_bytes
   5.  reduce_output_bytes
   6.  submit_time_seconds (epoch format)
   7.  duration_seconds
   8.  map_time_task_seconds (2 tasks of 10 seconds = 20 task-seconds)
   9.  red_time_task_seconds
   10. total_time_task_seconds
   11. map_tasks_count     
   12. reduce_tasks_count
   13. hdfs_input_path (if available in history log directory)
   14. hdfs_output_path (if available in history log directory)

For example, one could parse the Facebook history logs by calling

perl parse-hadoop-jobhistory.pl FB-2009_LogRepository > FacebookTrace.tsv 

The resulting output can be fed directly into Step 2. Alternatively, human analysts can use it to distill other insights.
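
For instance, the .tsv output can be summarized with a few lines of Perl. The sketch below is hypothetical (it is not part of the toolkit, and the script name summarize-trace.pl is made up); it counts jobs and totals the map input and shuffle bytes of a parsed trace, relying on the field positions listed above:

    use strict;
    use warnings;

    # Hypothetical summarize-trace.pl: reads Step 1 output from STDIN or
    # from a file given on the command line.
    my ($jobs, $map_bytes, $shuffle_bytes) = (0, 0, 0);
    while (my $line = <>) {
        chomp $line;
        my @f = split /\t/, $line;    # fields in the order listed above
        next unless @f >= 5;          # skip malformed rows
        $jobs++;
        $map_bytes     += $f[2];      # map_input_bytes
        $shuffle_bytes += $f[3];      # shuffle_bytes
    }
    print "jobs=$jobs map_input_bytes=$map_bytes shuffle_bytes=$shuffle_bytes\n";

For example: perl summarize-trace.pl FacebookTrace.tsv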

Step 2. Synthesize workloads by sampling the historical Hadoop trace

This step synthesizes representative workloads of short duration from the trace data parsed in Step 1, which may span months. The key synthesis technique is continuous time window sampling in multiple dimensions. Our IEEE MASCOTS 2011 paper explains this method in more detail.
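
As a rough illustration of the sampling idea (a sketch only, not the actual WorkloadSynthesis.pl code): each sample places a window of the requested length at a random offset within the trace, keeps every job whose submit time falls inside that window, and re-bases submit times to the window start. The function and field names below are illustrative.

    use strict;
    use warnings;

    # Illustrative only: draw one continuous time window sample.
    # $jobs is an array reference of hashes built from the Step 1 trace;
    # each hash has at least a submit_time_seconds field. Assumes the
    # trace is longer than one window.
    sub sample_window {
        my ($jobs, $trace_start, $trace_end, $length) = @_;
        # Place the window uniformly at random inside the trace.
        my $win_start = $trace_start
                      + int(rand($trace_end - $trace_start - $length));
        my @sampled;
        for my $job (@$jobs) {
            my $t = $job->{submit_time_seconds};
            next unless $t >= $win_start && $t < $win_start + $length;
            # Re-base the submit time relative to the window start.
            push @sampled, { %$job, submit_time_seconds => $t - $win_start };
        }
        return @sampled;
    }

The actual script repeats this sampling for the requested number of samples and workloads, and also writes out the per-job data sizes and inter-job submit gaps described under outPrefix below.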

WorkloadSynthesis.pl

Usage:

perl WorkloadSynthesis.pl 
    --inPath=STRING 
    --outPrefix=STRING 
    --repeats=INT 
    --samples=INT 
    --length=INT 
    --traceStart=INT 
    --traceEnd=INT

The script samples the Hadoop trace given by inPath and generates several synthetic workloads, stored in files whose names are prefixed by outPrefix.

The arguments are:

  • inPath Path to the file containing the full trace. Expected format: tab-separated fields, one row per job, with fields in the following order:

      unique_job_id
      job_name
      map_input_bytes
      shuffle_bytes
      reduce_output_bytes
      submit_time_seconds (epoch format)
      duration_seconds
      map_time_task_seconds (2 tasks of 10 seconds = 20 task-seconds)
      red_time_task_seconds
      total_time_task_seconds
      map_tasks_count (optional)    
      reduce_tasks_count (optional)
      hdfs_input_path (optional)
      hdfs_output_path (optional)
    

Additional fields do not trigger an error; they are ignored.

  • outPrefix Prefix for the synthetic workload output files. Output format: tab-separated fields, one row per job, with fields in the following order:

      new_unique_job_id
      submit_time_seconds (relative to the start of the workload)
      inter_job_submit_gap_seconds
      map_input_bytes
      shuffle_bytes
      reduce_output_bytes
    
  • repeats The number of synthetic workloads to create.

  • samples The number of samples in each synthetic workload.

  • length The time length of each sample window in seconds.

  • traceStart Start time of the trace (epoch format).

  • traceEnd End time of the trace (epoch format). See the sketch below for one way to obtain these values from a parsed trace.
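
Both traceStart and traceEnd are epoch timestamps bounding the portion of the trace to sample from. Assuming the parsed trace format from Step 1, a small helper such as the one below (hypothetical; the name trace-bounds.pl is not part of the toolkit) can recover them by scanning the submit_time_seconds column:

    use strict;
    use warnings;

    # Hypothetical trace-bounds.pl: print the earliest and latest submit
    # times in a parsed trace, for use as --traceStart and --traceEnd.
    my ($min, $max);
    while (my $line = <>) {
        chomp $line;
        my @f = split /\t/, $line;
        my $t = $f[5];                # submit_time_seconds (epoch)
        next unless defined $t && $t =~ /^\d+$/;
        $min = $t if !defined $min || $t < $min;
        $max = $t if !defined $max || $t > $max;
    }
    die "no valid submit times found\n" unless defined $min;
    print "traceStart=$min traceEnd=$max\n";

For example: perl trace-bounds.pl FacebookTrace.tsv, then substitute the printed values into the command line below.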

For example, FB-2009_samples_24_times_1hr_0.tsv and FB-2009_samples_24_times_1hr_1.tsv were created by running

perl WorkloadSynthesis.pl 
    --inPath=FacebookTrace.tsv 
    --outPrefix=FB-2009_samples_24_times_1hr_ 
    --repeats=2 
    --samples=24 
    --length=3600 
    --traceStart=FacebookTraceStart 
    --traceEnd=FacebookTraceEnd