
[BUG] detect_multi_node_defaults: fails w/ an UnboundLocalError on slurm single-node allocation #1130

Open
pmccormick opened this issue Mar 6, 2024 · 1 comment


@pmccormick

Software versions

Unfortunately, legate-issue does not appear to be installed as part of the conda/miniconda environment our users have installed. I reproduced this myself with the latest conda environment as of the date of this report and also do not see it installed there. legate --version reports version 23.09.00.

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

A legate Python script should execute in a single-node allocation without error.

Observed behavior

The script launch fails with the following error:

$ legate simple.py
Traceback (most recent call last):
  File "/projects/legion/miniconda3/bin/legate", line 7, in <module>
    from legate.driver import main
  File "/projects/legion/miniconda3/lib/python3.11/site-packages/legate/driver/__init__.py", line 17, in <module>
    from .config import Config
  File "/projects/legion/miniconda3/lib/python3.11/site-packages/legate/driver/config.py", line 35, in <module>
    from .args import parser
  File "/projects/legion/miniconda3/lib/python3.11/site-packages/legate/driver/args.py", line 118, in <module>
    nodes_kw, ranks_per_node_kw = detect_multi_node_defaults()
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/projects/legion/miniconda3/lib/python3.11/site-packages/legate/driver/args.py", line 93, in detect_multi_node_defaults
    nodes_kw["default"] = nodes
                          ^^^^^
UnboundLocalError: cannot access local variable 'nodes' where it is not associated with a value
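For reference, here is a minimal sketch of the failure pattern the traceback suggests. The actual detect_multi_node_defaults in legate/driver/args.py may differ; the specific environment-variable names and control flow below are assumptions for illustration only:

# Hypothetical reconstruction of the failure mode; not legate's actual code.
import os

def detect_multi_node_defaults():
    nodes_kw = {}
    ranks_per_node_kw = {}
    if "SLURM_JOB_ID" in os.environ:
        # 'nodes' is only bound when this variable happens to be present...
        if "SLURM_JOB_NUM_NODES" in os.environ:
            nodes = int(os.environ["SLURM_JOB_NUM_NODES"])
        # ...so if it is missing, the next line raises
        # UnboundLocalError: cannot access local variable 'nodes'
        # where it is not associated with a value
        nodes_kw["default"] = nodes
        # A defensive fix would default to a single node, e.g.:
        # nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", 1))
    return nodes_kw, ranks_per_node_kw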

Example code or instructions

# insert any legate-friendly code here... :-)
# see additional info below (this appears to be an environment-specific issue).
print('hello')

Stack traceback or browser console output

This is likely a local-usage nuance combined with Slurm job-submission details. Our users often request a single-node allocation without explicitly providing a node count. For example,

$ salloc -p redstone --qos=normal --time=10:00:00

Within such an allocation the failure occurs for every legate-launched script. The workaround is for users to explicitly pass -n 1 in their salloc request:

$ salloc -p redstone -n 1 --qos=normal --time=10:00:00
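To narrow down which Slurm variables differ between the two allocation styles (which of them legate's detection actually reads is an assumption on my part), a quick check from inside each allocation:

# Print the Slurm-provided environment inside the allocation; comparing the
# output with and without -n 1 should show which variable the detection
# logic depends on.
import os

for key, value in sorted(os.environ.items()):
    if key.startswith("SLURM_"):
        print(f"{key}={value}")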

@manopapad
Copy link
Contributor

This might have been fixed since 23.09; if you are able to get a top-of-tree build working, could you please test with that?

Unrelated, but note that if salloc doesn't automatically log you in to (one of) the compute nodes in your allocation (that's the behavior on our local SLURM cluster), then you'd want to pass a --launcher option to legate to send the processes to the compute nodes (or ssh into a compute node manually). Without a launcher, legate will just run on the current node.
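For example, on an srun-based cluster (the launcher value here is an assumption; check legate --help for the values supported by your build):

$ legate --launcher srun simple.py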
