I have been stuck on this issue for some time now and would greatly appreciate any help! I am trying to run the optimise_hyperparameter function across two A100 GPUs using the PyTorch DDP strategy.
When I run this I get the following error: RuntimeError: DDP expects same model across all ranks, but Rank 0 has 160 params, while rank 1 has inconsistent 137 params.
I have tried setting the seed across ranks, but no luck. Has anyone experienced this issue, or does anyone have an example of using this function to train a TFT with DDP?
I am using the latest package versions and training on an Azure VM. The run starts once I trigger the train_model function.
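For context on what the error is telling us: the mismatched parameter counts mean each DDP rank constructed a different model, which typically happens when each spawned process samples its trial hyperparameters independently. Below is a minimal stdlib sketch of that failure mode, not the pytorch_forecasting or Optuna API; sample_architecture is a hypothetical stand-in for a trial that samples model structure.

```python
import random

def sample_architecture(rng):
    # Hypothetical stand-in for a tuning trial sampling model structure:
    # the number of layers (and hence the parameter count) depends on the RNG.
    n_layers = rng.randint(2, 8)
    return [rng.randint(16, 64) for _ in range(n_layers)]

# Each rank sampling independently -> architectures (and param counts) can
# differ, which is exactly what DDP's "inconsistent params" error reports.
rank0 = sample_architecture(random.Random(0))
rank1 = sample_architecture(random.Random(1))

# Seeding every rank identically *before model construction* keeps them in
# sync; a seed set only after the trial samples its values does not help.
rank0_seeded = sample_architecture(random.Random(42))
rank1_seeded = sample_architecture(random.Random(42))
assert rank0_seeded == rank1_seeded
```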
def prepare_data(data_prep_folder):
    # Load in training and validation datasets
    training = torch.load(f"{data_prep_folder}/{constants.TRAIN_DATASET_FILE_NAME}")
    validation = torch.load(f"{data_prep_folder}/{constants.VALIDATION_DATASET_FILE_NAME}")
    logger.info(f"Training set loaded with length {len(training)}.")
    logger.info(f"Validation set loaded with length {len(validation)}.")
    # Create dataloaders
    train_dataloader = training.to_dataloader(
        train=True,
        batch_size=128,
        num_workers=47,
        pin_memory=True,
    )
    val_dataloader = validation.to_dataloader(
        train=False,
        batch_size=128,
        num_workers=47,
        pin_memory=True,
    )
    logger.info("Dataloaders created with batch size 128 and 47 workers.")
    return train_dataloader, val_dataloader
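A side note on the hard-coded num_workers=47: that presumably matches a 48-vCPU VM with one core left free. If that assumption holds, deriving the value at runtime keeps the code portable across machine sizes; a small sketch (default_num_workers is a hypothetical helper, not part of the code above):

```python
import os

def default_num_workers() -> int:
    # Leave one CPU free for the main process; assumes the 47 above came
    # from a 48-vCPU machine. Falls back to 0 if cpu_count() is unknown.
    cpus = os.cpu_count() or 1
    return max(cpus - 1, 0)
```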
fkiraly changed the title from "Issue with using optimise_hyperparameter with PyTorch DDP -> please help! :)" to "[BUG] Issue with using optimise_hyperparameter with PyTorch DDP" on Aug 30, 2024
Potentially related to the Windows failures reported here: #1623
Can you kindly paste the full output of pip list from your Python environment, and also let us know your operating system and Python version?
def hyperparameter_tuner(train_dataloader, val_dataloader, model_train_folder):
    # Record the start time so total tuning duration can be logged
    start_time = time.time()
    logger.info("Starting hyperparameter tuning...")
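The snippet above records start_time but is cut off before it is used; one way the elapsed time could be logged once tuning completes (a sketch under the assumption that the rest of hyperparameter_tuner finishes normally; log_elapsed is a hypothetical helper):

```python
import time

def log_elapsed(start_time: float) -> str:
    # Format the wall-clock time since start_time as "Xm YYs".
    elapsed = time.time() - start_time
    minutes, seconds = divmod(int(elapsed), 60)
    return f"Hyperparameter tuning finished in {minutes}m {seconds:02d}s"

# e.g. logger.info(log_elapsed(start_time)) at the end of hyperparameter_tuner
```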