Synthetic data generation (labeling)

The instructions for synthetically generating new solutions are almost identical to the evaluation instructions: both workflows use the same scripts, just with different parameters.

Make sure to complete the prerequisites before proceeding.

Here are the basic commands to generate 128 solutions for each problem in the GSM8K dataset using any "teacher" model, e.g., Mixtral-8x7B.

  1. Get the model and follow the instructions to convert it to the TensorRT-LLM format. While you can run inference with NeMo, we highly recommend using TensorRT-LLM for synthetic data generation, as it can be up to 10x faster.

  2. Start data generation. Note that if you're running locally, all jobs will run sequentially.

    python pipeline/run_labeling.py \
      --model_path <path to trtllm model> \
      --server_type tensorrt_llm \
      --output_dir ./synthetic-solutions/ \
      --num_gpus <number of GPUs on your machine/cluster node> \
      --num_runs 128 \
      +prompt=openmathinstruct/base \
      ++prompt.few_shot_examples.examples_type=gsm8k_text_with_code \
      ++dataset=gsm8k \
      ++split_name=train_full
    

    This will run 128 Slurm jobs, each generating solutions with a unique random seed. You can customize the solution format with ++prompt.few_shot_examples.examples_type (see nemo_skills/inference/prompt/few_shot_examples; a quick way to list the available types is shown at the end of this section).

  3. You would typically follow this up by converting the data to the SFT format and finetuning models (a sketch of the conversion command is shown below).
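Here is a minimal sketch of that conversion step. The script path and parameter names below are assumptions, modeled on the prepare_masked_data.py call shown later in this document; consult the repository's finetuning documentation for the exact interface.

    python nemo_skills/finetuning/prepare_sft_data.py \
      ++prediction_jsonl_files=./synthetic-solutions/output-rs*.jsonl \
      ++output_path=./sft-data.jsonl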

For more details, read the evaluation docs.
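To see which examples_type values are available for the customization mentioned in step 2, you can list the few-shot examples folder referenced above. Each file there defines example types, though the exact file-to-type mapping is best confirmed by opening the files themselves:

    ls nemo_skills/inference/prompt/few_shot_examples/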

Masked solutions

We provide the masked datasets GSM8K-Masked and MATH-Masked, which were generated using Mixtral-8x7B. Here are the steps to create masked solutions for a different dataset or with another model.

  1. Get the model and follow the instructions to convert it to the TensorRT-LLM format. While you can run inference with NeMo, we highly recommend using TensorRT-LLM for synthetic data generation, as it can be up to 10x faster.

  2. For GSM8K and MATH you can use ++prompt.few_shot_examples.examples_type=gsm8k_generate_masked and ++prompt.few_shot_examples.examples_type=math_generate_masked respectively. If using another dataset, create few-shot examples that show how to "translate" the original reference solution into a masked one (an illustrative example appears near the end of this section).

  3. Start data generation. Note that if you're running locally, all jobs will run sequentially.

    python pipeline/run_labeling.py \
      --model_path <path to trtllm model> \
      --server_type tensorrt_llm \
      --output_dir ./masked-solutions/ \
      --num_gpus <number of GPUs on your machine/cluster node> \
      --num_runs 32 \
      +prompt=openmathinstruct/text_masked_base \
      ++prompt.few_shot_examples.examples_type=gsm8k_generate_masked \
      ++dataset=gsm8k \
      ++split_name=train_full
    

    This will run 32 Slurm jobs, each with a unique random seed, generating masked solutions based on the reference solutions and the provided few-shot examples.

  4. Pick the best masked solutions and convert them to the expected format.

    python nemo_skills/finetuning/prepare_masked_data.py \
      ++dataset=<dataset_name from datasets folder> \
      ++masked_soln_jsonl_files=./masked-solutions/output-rs*.jsonl \
      ++split_name=train_full
    

The prepared dataset will be saved in datasets/<dataset_name>-masked/<split_name>.jsonl.
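For intuition, here is an illustrative example of how a masked solution relates to the original (our own sketch, not taken from the released datasets): intermediate computed values are replaced with symbolic placeholders, so a model trained on these solutions has to compute the numbers rather than copy them.

    Original: Natalia sold 48/2 = 24 clips in May, so she sold 48 + 24 = 72 clips altogether.
    Masked:   Natalia sold 48/2 = M clips in May, so she sold 48 + M = N clips altogether.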

Now you can go back to step 2 of the previous section and specify ++dataset=<dataset_name>-masked and +prompt=openmathinstruct/masked_solution, as in the sketch below.
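Concretely, only those two arguments change relative to the original labeling command; everything else, including the examples_type (kept here as in the original command, adjust it for your dataset), stays the same. The output directory name below is just a suggestion.

    python pipeline/run_labeling.py \
      --model_path <path to trtllm model> \
      --server_type tensorrt_llm \
      --output_dir ./synthetic-solutions-masked/ \
      --num_gpus <number of GPUs on your machine/cluster node> \
      --num_runs 128 \
      +prompt=openmathinstruct/masked_solution \
      ++prompt.few_shot_examples.examples_type=gsm8k_text_with_code \
      ++dataset=<dataset_name>-masked \
      ++split_name=train_full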