
[RLlib] Rename of SingleAgentRLModuleSpec to RLModuleSpec breaks restoring old checkpoints #47426

Open
Kakadus opened this issue Aug 30, 2024 · 4 comments · May be fixed by #47560 or #47708
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks rllib RLlib related issues


Kakadus commented Aug 30, 2024

What happened + What you expected to happen

I wanted to restore checkpoints created with ray v2.34.0 using ray v2.35.0, which fails with:

>>> from ray.rllib.algorithms import Algorithm
>>> Algorithm.from_checkpoint(path=".../ray_results/pbt_humanoid_test/PPO_Humanoid-v4_3338d_00003_3_2024-09-09_00-46-34/checkpoint_000014")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../.cache/pypoetry/virtualenvs/ray-f4XCQ9mO-py3.12/lib/python3.12/site-packages/ray/rllib/algorithms/algorithm.py", line 399, in from_checkpoint
    state = Algorithm._checkpoint_info_to_algorithm_state(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.cache/pypoetry/virtualenvs/ray-f4XCQ9mO-py3.12/lib/python3.12/site-packages/ray/rllib/algorithms/algorithm.py", line 3442, in _checkpoint_info_to_algorithm_state
    state = pickle.load(f)
            ^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'SingleAgentRLModuleSpec' on <module 'ray.rllib.core.rl_module.rl_module' from '.../.cache/pypoetry/virtualenvs/ray-f4XCQ9mO-py3.12/lib/python3.12/site-packages/ray/rllib/core/rl_module/rl_module.py'>

I expected to be able to restore checkpoints from older ray versions after the upgrade.

Adding a line like

SingleAgentRLModuleSpec = RLModuleSpec

to ray/rllib/core/rl_module/rl_module.py allows me to continue from the old checkpoint.

This seems to be caused by #46840

Versions / Dependencies

python 3.10.12 / 3.12.5
ray v2.35.0

Reproduction script

Create a checkpoint with ray 2.34.0, upgrade to ray 2.35.0, and try to restore the checkpoint.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

edit: replaced the traceback with a more minimal one.

@Kakadus Kakadus added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 30, 2024
@anyscalesam anyscalesam added the rllib RLlib related issues label Sep 3, 2024
@Kakadus Kakadus linked a pull request Sep 8, 2024 that will close this issue
@simonsays1980 simonsays1980 self-assigned this Sep 11, 2024
@simonsays1980 simonsays1980 added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 11, 2024
simonsays1980 (Collaborator) commented Sep 11, 2024

@Kakadus Thanks for raising this issue. With which ray version has the checkpoint been trained? And did you use api_stack(enable_env_runner_and_connector_v2=True, enable_rl_module_and_learner=True)?

Kakadus (Author) commented Sep 13, 2024

> @Kakadus Thanks for raising this issue. With which ray version has the checkpoint been trained?

The checkpoint was created with v2.34.0. The error happens when restoring with v2.35.0.

> And did you use api_stack(enable_env_runner_and_connector_v2=True, enable_rl_module_and_learner=True)?

No, at least not intentionally. I reproduced the error with the first example I found:

Run this with ray v2.34.0

#!/usr/bin/env python
"""Example of using PBT with RLlib.

Note that this requires a cluster with at least 8 GPUs in order for all trials
to run concurrently, otherwise PBT will round-robin train the trials which
is less efficient (or you can set {"gpu": 0} to use CPUs for SGD instead).

Note that Tune in general does not need 8 GPUs, and this is just a more
computationally demanding example.
"""

import random

from ray import train, tune
from ray.rllib.algorithms.ppo import PPO
from ray.tune import TuneConfig
from ray.tune.schedulers import PopulationBasedTraining

if __name__ == "__main__":
    # Postprocess the perturbed config to ensure it's still valid
    def explore(config):
        # ensure we collect enough timesteps to do sgd
        if config["train_batch_size"] < config["sgd_minibatch_size"] * 2:
            config["train_batch_size"] = config["sgd_minibatch_size"] * 2
        # ensure we run at least one sgd iter
        if config["num_sgd_iter"] < 1:
            config["num_sgd_iter"] = 1
        return config


    pbt = PopulationBasedTraining(
        time_attr="time_total_s",
        perturbation_interval=120,
        resample_probability=0.25,
        # Specifies the mutations of these hyperparams
        hyperparam_mutations={
            "lambda": lambda: random.uniform(0.9, 1.0),
            "clip_param": lambda: random.uniform(0.01, 0.5),
            "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
            "num_sgd_iter": lambda: random.randint(1, 30),
            "sgd_minibatch_size": lambda: random.randint(128, 16384),
            "train_batch_size": lambda: random.randint(2000, 160000),
        },
        custom_explore_fn=explore,
    )

    tuner = tune.Tuner(
        PPO,
        run_config=train.RunConfig(
            name="pbt_humanoid_test",
            checkpoint_config=train.CheckpointConfig(checkpoint_frequency=1),
        ),
        tune_config=TuneConfig(
            scheduler=pbt,
            num_samples=8,
            metric="env_runners/episode_reward_mean",
            mode="max",
            reuse_actors=True,
        ),
        param_space={
            "env": "Humanoid-v4",
            "kl_coeff": 1.0,
            "num_workers": 1,
            "num_gpus": 0,
            "model": {"free_log_std": True},
            # These params are tuned from a fixed starting value.
            "lambda": 0.95,
            "clip_param": 0.2,
            "lr": 1e-4,
            # These params start off randomly drawn from a set.
            "num_sgd_iter": 10,
            "sgd_minibatch_size": 128,
            "train_batch_size": 256,
        },
    )
    results = tuner.fit()

    print("best hyperparameters: ", results.get_best_result().config)

And restore the checkpoint with ray v2.35.0:

#!/usr/bin/env python
from ray.rllib.algorithms import Algorithm


Algorithm.from_checkpoint(path=".../ray_results/pbt_humanoid_test/PPO_Humanoid-v4_3338d_00003_3_2024-09-09_00-46-34/checkpoint_000014")

simonsays1980 (Collaborator) commented Sep 16, 2024

@Kakadus thanks for raising this issue. We overhauled checkpointing in newer versions to make it more flexible.

You could use a dynamic binding to influence how pickle resolves the old SingleAgentRLModuleSpec name when loading the checkpoint. (Note that copyreg only customizes pickling, not unpickling, so a module-level alias is what actually makes pickle.load succeed here.)

import ray.rllib.core.rl_module.rl_module as rl_module
from ray.rllib.algorithms import Algorithm
from ray.rllib.core.rl_module.rl_module import RLModuleSpec

# Dynamically alias the old class name to the new one so that
# `pickle.load` can resolve `SingleAgentRLModuleSpec`.
rl_module.SingleAgentRLModuleSpec = RLModuleSpec

# Try loading the checkpoint.
Algorithm.from_checkpoint(...)

@simonsays1980 simonsays1980 linked a pull request Sep 17, 2024 that will close this issue
Kakadus (Author) commented Sep 17, 2024

Thanks @simonsays1980

If I understand correctly, #47708 will prevent this type of error in the future by making newly created checkpoints more backward compatible, while the current error still has to be worked around. Would it make sense to merge #47560 then, so that at least one release is able to restore older checkpoints?
