
UHD Graphics 600 | Calling kernel + enqueue_copy more than once, results in OUT_OF_RESOURCES error or freeze #711

Closed
GiorgosXou opened this issue Jan 15, 2024 · 4 comments

@GiorgosXou
GiorgosXou commented Jan 15, 2024

🦠 Describe the bug

Every time I call a kernel and enqueue_copy in a loop more than once, it either results in an OUT_OF_RESOURCES error or a freeze^1. This only happens with the GPU option^2, not when using the CPU pocl option^3 or when running it on another computer.

💥 To Reproduce

Steps to reproduce the behavior:

  1. Have a UHD Graphics 600
  2. Have intel-compute-runtime
  3. Run a basic example like this one:
import numpy as np
import pyopencl as cl

# I can have 50000 elements instead of 5 and it will work just fine, but the moment I loop more than once ... error
a_np = np.random.rand(5).astype(np.float32)
b_np = np.random.rand(5).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

prg = cl.Program(ctx, """
__kernel void sum(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
knl = prg.sum  # Use this Kernel object for repeated calls

for i in range(0, 1):  # <--------- with a range of 2 or more ... issues
    res_np = np.empty_like(a_np)
    knl(queue, a_np.shape, None, a_g, b_g, res_g)  # .wait()
    cl.enqueue_copy(queue, res_np, res_g)  # is_blocking=False) #.wait()
    # res_g.release()

print(res_np - (a_np + b_np))
print(np.linalg.norm(res_np - (a_np + b_np)))
assert np.allclose(res_np, a_np + b_np)
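As a diagnostic variant of the loop above (a sketch only, not a confirmed fix): forcing each iteration to complete before the next one starts, via a blocking copy plus a per-iteration `queue.finish()`, can help test whether unbounded enqueueing is what exhausts the driver's resources. The buffer and kernel names are taken from the reporter's script; the NumPy fallback branch is an assumption added so the sketch still runs on machines without a working OpenCL stack.

```python
import numpy as np

a_np = np.random.rand(5).astype(np.float32)
b_np = np.random.rand(5).astype(np.float32)
res_np = np.empty_like(a_np)

try:
    import pyopencl as cl

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
    b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)
    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

    prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a_g,
                      __global const float *b_g,
                      __global float *res_g)
    {
      int gid = get_global_id(0);
      res_g[gid] = a_g[gid] + b_g[gid];
    }
    """).build()
    knl = prg.sum

    for i in range(10):
        knl(queue, a_np.shape, None, a_g, b_g, res_g)
        # Block on the copy and drain the queue before the next iteration,
        # so at most one kernel + copy is ever in flight.
        cl.enqueue_copy(queue, res_np, res_g, is_blocking=True)
        queue.finish()
except Exception:
    # No usable OpenCL device here: emulate the kernel on the host
    # so the correctness check below can still run.
    for i in range(10):
        res_np = a_np + b_np

assert np.allclose(res_np, a_np + b_np)
```

Whether this avoids the OUT_OF_RESOURCES error on the affected driver is untested; it only narrows down whether queue depth is a factor.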

☝️ Expected behavior

Based on another, more advanced test on a different computer with Intel graphics, which worked without this error or freeze, I would guess it should work on this GPU too?

💻 Environment

  • OS: Arch Linux
  • ICD Loader and version: ocl-icd-2.3.2-1
  • ICD and version: pocl-4.0-2
  • CPU/GPU: Intel GeminiLake [UHD Graphics 600]
  • Python version: Python 3.11.6
  • PyOpenCL version: python-pyopencl 1:2023.1-2

➕ Additional context

I also think I found a related issue: JPaulMora/Pyrit#641.

[1] (in some rare cases it might run more than once but max 2)
[2] [0] <pyopencl.Platform 'Intel(R) OpenCL Graphics' at 0x5597a2db72a0>
[3] [1] <pyopencl.Platform 'Portable Computing Language' at 0x7f6d966595d8>

Any ideas?

@GiorgosXou GiorgosXou added the bug label Jan 15, 2024
@GiorgosXou

GiorgosXou commented Jan 15, 2024

My guesses are: either (1) I've completely misunderstood how something works, (2) it's a PyOpenCL issue, or (3) it's an issue with the drivers themselves. I lean more towards 3 or 1, but decided to post the issue here.

@inducer

inducer commented Jan 15, 2024

Could you check whether things work OK with OpenCL called from C, e.g. vec-demo from this repository? If so, then I'd be inclined to suspect a PyOpenCL bug; otherwise I would guess a driver issue as well. Upgrading the kernel may help too.

@GiorgosXou

GiorgosXou commented Jan 15, 2024

It runs just fine with pocl on the CPU, but again fails on the GPU with the same type of error. Here's the output:

➜  tools git:(master) ./cl-demo
Choose platform:
[0] Intel(R) Corporation
[1] The pocl project
Enter choice: 0
Choose device:
[0] Intel(R) UHD Graphics 600
Enter choice: 0
---------------------------------------------------------------------
NAME: Intel(R) UHD Graphics 600
VENDOR: Intel(R) Corporation
PROFILE: FULL_PROFILE
VERSION: OpenCL 3.0 NEO
EXTENSIONS: cl_khr_byte_addressable_store cl_khr_device_uuid cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_command_queue_families cl_intel_subgroups cl_intel_required_subgroup_size cl_intel_subgroups_short cl_khr_spir cl_intel_accelerator cl_intel_driver_diagnostics cl_khr_priority_hints cl_khr_throttle_hints cl_khr_create_command_queue cl_intel_subgroups_char cl_intel_subgroups_long cl_khr_il_program cl_intel_mem_force_host_memory cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_subgroup_non_uniform_arithmetic cl_khr_subgroup_shuffle cl_khr_subgroup_shuffle_relative cl_khr_subgroup_clustered_reduce cl_intel_device_attribute_query cl_khr_suggested_local_work_size cl_intel_split_work_group_barrier cl_khr_fp64 cl_ext_float_atomics cl_khr_external_memory cl_intel_planar_yuv cl_intel_packed_yuv cl_intel_motion_estimation cl_intel_device_side_avc_motion_estimation cl_intel_advanced_motion_estimation cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_3d_image_writes cl_intel_media_block_io cl_khr_gl_sharing cl_khr_gl_depth_images cl_khr_gl_event cl_khr_gl_msaa_sharing cl_intel_va_api_media_sharing cl_intel_sharing_format_query cl_khr_pci_bus_info
DRIVER_VERSION: 23.35.27191

Type: GPU
EXECUTION_CAPABILITIES: Kernel
GLOBAL_MEM_CACHE_TYPE: Read-Write (2)
CL_DEVICE_LOCAL_MEM_TYPE: Local (1)
SINGLE_FP_CONFIG: 0xbf
QUEUE_PROPERTIES: 0x3

VENDOR_ID: 32902
MAX_COMPUTE_UNITS: 12
MAX_WORK_ITEM_DIMENSIONS: 3
MAX_WORK_GROUP_SIZE: 256
PREFERRED_VECTOR_WIDTH_CHAR: 16
PREFERRED_VECTOR_WIDTH_SHORT: 8
PREFERRED_VECTOR_WIDTH_INT: 4
PREFERRED_VECTOR_WIDTH_LONG: 1
PREFERRED_VECTOR_WIDTH_FLOAT: 1
PREFERRED_VECTOR_WIDTH_DOUBLE: 1
MAX_CLOCK_FREQUENCY: 700
ADDRESS_BITS: 32
MAX_MEM_ALLOC_SIZE: 1717985280
IMAGE_SUPPORT: 1
MAX_READ_IMAGE_ARGS: 128
MAX_WRITE_IMAGE_ARGS: 128
IMAGE2D_MAX_WIDTH: 16384
IMAGE2D_MAX_HEIGHT: 16384
IMAGE3D_MAX_WIDTH: 16384
IMAGE3D_MAX_HEIGHT: 16384
IMAGE3D_MAX_DEPTH: 2048
MAX_SAMPLERS: 16
MAX_PARAMETER_SIZE: 2048
MEM_BASE_ADDR_ALIGN: 1024
MIN_DATA_TYPE_ALIGN_SIZE: 128
GLOBAL_MEM_CACHELINE_SIZE: 64
GLOBAL_MEM_CACHE_SIZE: 393216
GLOBAL_MEM_SIZE: 3435970560
MAX_CONSTANT_BUFFER_SIZE: 1717985280
MAX_CONSTANT_ARGS: 8
LOCAL_MEM_SIZE: 65536
ERROR_CORRECTION_SUPPORT: 0
PROFILING_TIMER_RESOLUTION: 52
ENDIAN_LITTLE: 1
AVAILABLE: 1
COMPILER_AVAILABLE: 1
MAX_WORK_GROUP_SIZES: 256 256 256
---------------------------------------------------------------------
*** 'clFinish' in 'cl-demo.c' on line 101 failed with error 'out of resources'.
[1]    50685 IOT instruction (core dumped)  ./cl-demo
➜  tools git:(master)

Now, where do I have to report this issue :P ... at least I feel slightly better knowing my original code was fine (not just the basic example) lol

@inducer

inducer commented Jan 15, 2024

https://github.com/intel/compute-runtime looks like a good starting point to report.

At any rate, it's not looking like a PyOpenCL issue.
