Error on querying NVIDIA devices | OverflowError: Python int too large to convert to C long #160

Closed
JensWendt opened this issue Aug 8, 2023 · 9 comments

@JensWendt

Describe the bug

Freshly installed gpustat. Upon running gpustat I get:

Error on querying NVIDIA devices. Use --debug flag to see more details.
Python int too large to convert to C long

gpustat --debug:

Error on querying NVIDIA devices. Use --debug flag to see more details.
Python int too large to convert to C long

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\gpustat\cli.py", line 58, in print_gpustat
    gpu_stats = GPUStatCollection.new_query(debug=debug, id=id)
  File "C:\ProgramData\Anaconda3\lib\site-packages\gpustat\core.py", line 604, in new_query
    gpu_info = get_gpu_info(handle)
  File "C:\ProgramData\Anaconda3\lib\site-packages\gpustat\core.py", line 561, in get_gpu_info
    process = get_process_info(nv_process)
  File "C:\ProgramData\Anaconda3\lib\site-packages\gpustat\core.py", line 469, in get_process_info
    psutil.Process(pid=nv_process.pid)
  File "C:\ProgramData\Anaconda3\lib\site-packages\psutil\__init__.py", line 332, in __init__
    self._init(pid)
  File "C:\ProgramData\Anaconda3\lib\site-packages\psutil\__init__.py", line 361, in _init
    self.create_time()
  File "C:\ProgramData\Anaconda3\lib\site-packages\psutil\__init__.py", line 717, in create_time
    self._create_time = self._proc.create_time()
  File "C:\ProgramData\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 688, in wrapper
    return fun(self, *args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 942, in create_time
    user, system, created = cext.proc_times(self.pid)
OverflowError: Python int too large to convert to C long
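The last frame of the traceback fails while passing the PID into psutil's C extension, and the message suggests the value is converted to a C long, which is 32 bits wide on Windows. The sketch below is only an illustration of that range check; `pid_fits_c_long` and `C_LONG_MAX_WIN` are hypothetical names, not part of gpustat or psutil, and the assumption that the offending PID was a garbage value is a guess based on the rest of this thread.

```python
# Hypothetical helper for illustration; not part of gpustat or psutil.

# A C `long` is 32 bits on Windows (also on 64-bit builds), so the largest
# value it can hold is 2**31 - 1.
C_LONG_MAX_WIN = 2**31 - 1


def pid_fits_c_long(pid: int, long_max: int = C_LONG_MAX_WIN) -> bool:
    """Return True if `pid` is a non-negative int that fits in a signed C long."""
    if isinstance(pid, bool) or not isinstance(pid, int):
        return False
    return 0 <= pid <= long_max


# A PID taken from the nvidia-smi output above is well within range.
print(pid_fits_c_long(11700))   # True
# An implausibly large value (assumed here, e.g. from a misread struct) is not.
print(pid_fits_c_long(2**40))   # False
```

A check like this would only hide the symptom; the comments further down trace the actual cause to the nvidia-ml-py version.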

nvidia-smi:

C:\Users\MiN_Acc2>nvidia-smi
Tue Aug  8 15:25:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 527.27       Driver Version: 527.27       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000    WDDM* | 00000000:21:00.0 Off |                    0 |
| 33%   35C    P3    53W / 260W |  15655MiB / 46080MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11700    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     64008    C+G   ...y\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A     90396    C+G   ...w5n1h2txyewy\SearchUI.exe    N/A      |
+-----------------------------------------------------------------------------+

Environment information:

  • OS: Windows Server 2019
  • NVIDIA Driver version: 527.27
  • gpustat version: 1.2.dev7+g7c09a0f
  • pynvml version: 12.535.77

It seems this bug has already been seen and solved over at nvitop: XuehaiPan/nvitop#76.

@JensWendt JensWendt added the bug label Aug 8, 2023
@Lunar13737

I encountered the same problem and solved it by downgrading nvidia-ml-py to an earlier version, 11.525.112, using pip install nvidia-ml-py==11.525.112. I hope this helps.

@PyroGenesis

+1 same error

  • OS: Windows 10 Enterprise (Version: 2004, OS build: 19041.264)
  • NVIDIA Driver version: 536.99
  • The name(s) of GPU card: NVIDIA GeForce RTX 4090 x 2
  • gpustat version: gpustat 1.1.1

Thanks for the workaround @Lunar13737 , it worked for me.

@mjmikulski

+1 and the workaround with downgrading nvidia-ml-py did not work for me :(

  • OS: Windows 11 Pro N
  • NVIDIA Driver Version: 535.98, CUDA Version: 12.2
  • GPU: NVIDIA GeForce RTX 4070
  • gpustat version: gpustat 1.1.1

Any hints?

@wookayin wookayin added this to the 1.2 milestone Oct 16, 2023
@wookayin
Owner

wookayin commented Oct 30, 2023

I'd like to reproduce this issue to have a correct fix. But I've never seen the issue.

What we know from #161 (comment):

  • nvidia-ml-py==12.535.77 (the OP's case) is buggy and only works with drivers 535.43 and 535.86:
    • Does the problem go away if you install nvidia-ml-py==12.535.108? @JensWendt
  • It looks like nvidia-ml-py 12.535.108 should correct all process-information-related bugs by reverting the breaking changes in the previous versions. But this is just my guess; I'm not sure. I would need to know the nvidia-ml-py version installed on each affected system.

@Lunar13737, @PyroGenesis, @mjmikulski thanks for the datapoints. Could you please try upgrading to nvidia-ml-py==12.535.108 and see if the OverflowError is gone?
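For anyone unsure which nvidia-ml-py version is installed, a minimal sketch using only the standard library is shown below. The version numbers come from this thread (12.535.77 from the OP's environment information, 12.535.108 from the request above); the names `KNOWN_BAD`, `SUGGESTED_MIN`, `classify` and `installed_nvidia_ml_py` are hypothetical, and the set of bad versions is an illustration, not an official list.

```python
from importlib import metadata
from typing import Optional

# Version numbers taken from this thread; illustrative, not an official list.
KNOWN_BAD = {"12.535.77"}
SUGGESTED_MIN = (12, 535, 108)


def classify(version_text: str) -> str:
    """Classify an nvidia-ml-py version string as 'known-bad', 'ok', 'older' or 'unknown'."""
    if version_text in KNOWN_BAD:
        return "known-bad"
    try:
        parts = tuple(int(part) for part in version_text.split("."))
    except ValueError:
        # Non-numeric components (for example a local suffix) are not handled here.
        return "unknown"
    return "ok" if parts >= SUGGESTED_MIN else "older"


def installed_nvidia_ml_py() -> Optional[str]:
    """Return the installed nvidia-ml-py version, or None if it is not installed."""
    try:
        return metadata.version("nvidia-ml-py")
    except metadata.PackageNotFoundError:
        return None


version = installed_nvidia_ml_py()
print(version, classify(version) if version else "not installed")
```

Note that "older" is not necessarily a problem: 11.525.112 falls into that bucket and was reported as working by two users in this thread.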

@PyroGenesis

Could you please try upgrading nvidia-ml-py==12.535.108 and see if the OverflowError is gone?

@wookayin I can confirm, overflow error does not occur in nvidia-ml-py 12.535.108

@wookayin
Owner

@PyroGenesis Thanks. What was the previous version of nvidia-ml-py that resulted in this bug?

@PyroGenesis

@wookayin I think it was most likely 12.535.77 that caused the error, though I'm not 100% sure because I didn't keep a record of it. I downgraded to 11.525.112 which worked, and now 12.535.108 works too.

@Lunar13737

@wookayin nvidia-ml-py 12.535.108 works for me, no overflow error

@wookayin
Owner

wookayin commented Nov 1, 2023

Thanks. I can conclude that the root cause of this bug is essentially the same as #161: one should use neither nvidia-ml-py==12.535.77 nor the broken NVIDIA drivers >= 535.43, < 535.98.

gpustat will print warnings when any of these versions of nvml library or driver is detected, so we can close this issue without adding an unnecessary compatibility layer.

@wookayin wookayin closed this as completed Nov 1, 2023
wookayin added a commit that referenced this issue Nov 1, 2023
nvidia-ml-py==12.535.77 is a buggy version that breaks the struct for
process information, and should not be used (unless NVIDIA driver is
*also* buggy, 535.43, 535.54, and 535.86). The latest version
nvidia-ml-py==12.535.108 fixes the problem and is still compatible with
our supported drivers (R450+).

To ensure that users who install gpustat 1.2.0 have a correct version
of nvidia-ml-py installed, we bump up the requirement.

See #160 and #161 for more details.
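
The commit message names three driver versions as buggy. A rough sketch of how a tool might warn about them is shown below; it is not gpustat's actual implementation, and `BUGGY_DRIVERS` and `driver_warning` are hypothetical names. The driver version string is assumed to be supplied by the caller, for example as reported by nvidia-smi.

```python
from typing import Optional

# Driver versions named as buggy in the commit message above.
BUGGY_DRIVERS = {"535.43", "535.54", "535.86"}


def driver_warning(driver_version: str) -> Optional[str]:
    """Return a warning string if the driver is in BUGGY_DRIVERS, else None."""
    # Compare on the first two components only, so that a three-part
    # version such as "535.54.03" still matches "535.54".
    key = ".".join(driver_version.strip().split(".")[:2])
    if key in BUGGY_DRIVERS:
        return f"NVIDIA driver {driver_version} is listed as buggy; consider upgrading."
    return None


# The OP's driver (527.27) is not on the list, so nothing is reported.
print(driver_warning("527.27"))
# A listed driver produces a warning message.
print(driver_warning("535.86"))
```

Matching on an explicit list rather than a version range is a deliberate simplification here, since the thread gives both a list and a range for the affected drivers.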