LLaVA does not offload layers to GPU #3616

Closed
ruslanmustafin opened this issue Oct 13, 2023 · 0 comments · Fixed by #3621

ruslanmustafin commented Oct 13, 2023

The issue was already mentioned in #3436. Creating a separate issue so that it does not get lost.

I run LLaVA with the following command (commit id: 1e0e873):

```
./llava -m ggml-model-q5_k.gguf \
        --mmproj mmproj-model-f16.gguf \
        --temp 0.1 -ngl 64 -mg 0 \
        --image n008-2018-09-18-14-54-39-0400__CAM_FRONT__1537297366762404.jpg
```

These are the relevant parts of the output:

```
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6

...

llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  = 4560.96 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB
llama_new_context_with_model: VRAM scratch buffer: 156.00 MB
llama_new_context_with_model: total VRAM used: 156.00 MB (model: 0.00 MB, context: 156.00 MB)

...

main: image encoded in  1561.49 ms by CLIP (    2.71 ms per image patch)

llama_print_timings:        load time =    3042.21 ms
llama_print_timings:      sample time =      11.65 ms /   136 runs   (    0.09 ms per token, 11671.82 tokens per second)
llama_print_timings: prompt eval time =    9440.69 ms /   626 tokens (   15.08 ms per token,    66.31 tokens per second)
llama_print_timings:        eval time =   47661.78 ms /   136 runs   (  350.45 ms per token,     2.85 tokens per second)
llama_print_timings:       total time =   58800.36 ms
```
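
For context on the symptom: "offloaded 0/35 layers" despite `-ngl 64` suggests the parsed `-ngl` value never reaches the model loader. As a rough sketch only (this is not the actual `llava` code, and `load_with_offload` is a hypothetical helper), the expected wiring against llama.cpp's C API of that period would look something like:

```cpp
// Hypothetical sketch, not the actual llava.cpp source: the -ngl and -mg CLI
// values have to be copied into llama_model_params before loading the model;
// otherwise the default of 0 GPU layers applies and nothing is offloaded.
#include "llama.h"

static llama_model * load_with_offload(const char * model_path, int n_gpu_layers, int main_gpu) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = n_gpu_layers; // from -ngl; stays 0 if never forwarded
    mparams.main_gpu     = main_gpu;     // from -mg
    return llama_load_model_from_file(model_path, mparams);
}
```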
