Pytorch PPO Implementation, dimension difference #384

Open
kevin-mahon opened this issue Feb 22, 2023 · 2 comments
kevin-mahon commented Feb 22, 2023

Hello, apologies if I do this wrong; I don't contribute to open source often. I was attempting to run the PyTorch PPO implementation and kept getting several errors regarding the dimension of the observation 'o'. I believe this is because the return values from env.reset() are being mishandled. See my proposed changes below:
Original:

o, ep_ret, ep_len = env.reset(), 0, 0

# Main loop: collect experience in env and update/log each epoch
for epoch in range(epochs):
    for t in range(local_steps_per_epoch):
        a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))

        next_o, r, d, _ = env.step(a)
        ep_ret += r
        ep_len += 1

        # save and log
        buf.store(o, a, r, v, logp)
        logger.store(VVals=v)
        
        # Update obs (critical!)
        o = next_o

        timeout = ep_len == max_ep_len
        terminal = d or timeout
        epoch_ended = t==local_steps_per_epoch-1

        if terminal or epoch_ended:
            if epoch_ended and not(terminal):
                print('Warning: trajectory cut off by epoch at %d steps.'%ep_len, flush=True)
            # if trajectory didn't reach terminal state, bootstrap value target
            if timeout or epoch_ended:
                _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
            else:
                v = 0
            buf.finish_path(v)
            if terminal:
                # only save EpRet / EpLen if trajectory finished
                logger.store(EpRet=ep_ret, EpLen=ep_len)
            o, ep_ret, ep_len = env.reset(), 0, 0

To fix this I just discard the 'info' return value from env.reset(), but I imagine this might not work for all environments (if their env.step() also returns extra values, for example); see the version-agnostic sketch after the adjusted code.

Adjusted:

(o,_), ep_ret, ep_len = env.reset(), 0, 0

# Main loop: collect experience in env and update/log each epoch
for epoch in range(epochs):
    for t in range(local_steps_per_epoch):
        a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))

        next_o, r, d, _ = env.step(a)
        ep_ret += r
        ep_len += 1

        # save and log
        buf.store(o, a, r, v, logp)
        logger.store(VVals=v)
        
        # Update obs (critical!)
        o = next_o

        timeout = ep_len == max_ep_len
        terminal = d or timeout
        epoch_ended = t==local_steps_per_epoch-1

        if terminal or epoch_ended:
            if epoch_ended and not(terminal):
                print('Warning: trajectory cut off by epoch at %d steps.'%ep_len, flush=True)
            # if trajectory didn't reach terminal state, bootstrap value target
            if timeout or epoch_ended:
                _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
            else:
                v = 0
            buf.finish_path(v)
            if terminal:
                # only save EpRet / EpLen if trajectory finished
                logger.store(EpRet=ep_ret, EpLen=ep_len)
            (o,_), ep_ret, ep_len = env.reset(), 0, 0
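
Not official Spinning Up code, just a minimal version-agnostic sketch of the same idea; the helper names reset_env and step_env are hypothetical, and the isinstance check is only a heuristic for telling the two reset() signatures apart:

def reset_env(env):
    # Old Gym API: reset() -> obs.
    # New API (gym >= 0.26 / gymnasium): reset() -> (obs, info).
    out = env.reset()
    if isinstance(out, tuple) and len(out) == 2 and isinstance(out[1], dict):
        obs, _info = out
        return obs
    return out

def step_env(env, action):
    # Old Gym API: step() -> (obs, reward, done, info).
    # New API: step() -> (obs, reward, terminated, truncated, info).
    out = env.step(action)
    if len(out) == 5:
        obs, reward, terminated, truncated, info = out
        # Fold the two end-of-episode flags back into a single done flag.
        return obs, reward, terminated or truncated, info
    return out

With these, the loop would read o, ep_ret, ep_len = reset_env(env), 0, 0 and next_o, r, d, _ = step_env(env, a), and the same code should run against either API.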

Let me know if anyone else has run into this problem or if this fix works.

Alberto-Hache commented Feb 22, 2023 via email

kevin-mahon (Author) commented

Hello,
I'm using a custom environment following the OpenAI Gym API.
This is the error (although this time for TD3):

Traceback (most recent call last):
  File "src/laser_gym.py", line 43, in <module>
    train_test()
  File "src/laser_gym.py", line 25, in train_test
    td3(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=1000, epochs=260, logger_kwargs=logger_kwargs)
  File "/home/kevin/work/spinningup/spinup/algos/pytorch/td3/td3.py", line 334, in td3
    test_agent()
  File "/home/kevin/work/spinningup/spinup/algos/pytorch/td3/td3.py", line 276, in test_agent
    o, r, d, _ = test_env.step(get_action(o, 0))
  File "/home/kevin/work/spinningup/spinup/algos/pytorch/td3/td3.py", line 267, in get_action
    a = ac.act(torch.as_tensor(o, dtype=torch.float32))
ValueError: expected sequence of length 2 at dim 1 (got 1)
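
For reference, that ValueError is consistent with 'o' still being the whole (obs, info) tuple rather than just the observation: torch.as_tensor then tries to build a rectangular tensor out of a length-2 observation and a 1-key info dict, and the lengths disagree at dim 1. A minimal repro sketch (the obs values and the info key are made up):

import numpy as np
import torch

obs = np.zeros(2, dtype=np.float32)  # 2-D agent location, as in the env below
info = {"distance": 0.0}             # hypothetical single-key info dict

o = (obs, info)  # reset()'s return value used without unpacking
torch.as_tensor(o, dtype=torch.float32)
# ValueError: expected sequence of length 2 at dim 1 (got 1)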

My reset and step functions are defined as follows:

def reset(self, options=None):
    # New-style Gym reset: returns (obs, info) rather than just obs
    self._agent_location = np.random.random_integers(int(self.LOW/2), int(self.HIGH/2), size=2).astype(self.DTYPE)
    obs = self._get_obs()
    info = self._get_info()
    return obs, info

def step(self, action):
    _new_agent_location = self._agent_location + np.matmul(self.response_matrix,action)
    rew = self.reward.get_reward(self._agent_location, action, _new_agent_location)
    self._agent_location = _new_agent_location

    if (np.abs(self._agent_location) > self.HIGH).any() or (np.abs(self._agent_location) < 0.05).all():
        done = True  # out of bounds, or every coordinate near zero
    else:
        done = False
    
    obs = self._get_obs()
    info = self._get_info()

    if self.render_mode == 'human':
        self.render()
    
    return obs, rew, done, info  # old-style 4-tuple step

I am a little confused how this could work for other people when it seems like all environments should return the same tuple (https://www.gymlibrary.dev/api/core/#gym.Env.reset).
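
For what it's worth, this looks like version skew: Spinning Up was written against the older Gym API, while the page linked above documents the newer one. Gym changed both signatures around the 0.26 release, which is why unmodified Spinning Up still works for people on older gym versions. Roughly:

# Older Gym API (what Spinning Up targets):
o = env.reset()                    # reset returns just the observation
next_o, r, d, info = env.step(a)   # step returns a 4-tuple

# Newer API (gym >= 0.26 / gymnasium):
o, info = env.reset()                                  # reset returns (obs, info)
next_o, r, terminated, truncated, info = env.step(a)   # step returns a 5-tuple

The custom environment above follows the new reset() signature but the old 4-tuple step(), which is why only the reset() calls needed unpacking.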
