
multimodal - Improve LLaVA model accuracy and performance #3602

Closed
monatis opened this issue Oct 12, 2023 · 24 comments

@monatis
Collaborator

monatis commented Oct 12, 2023

With #3436, llama.cpp has support for LLaVA, a state-of-the-art large multimodal model. It appears there is still room for improvement in its performance and accuracy, so I'm opening this issue to track progress and gather feedback from the community. I'll continue working on this, so any feedback is much appreciated.

monatis self-assigned this Oct 12, 2023
@hexbinoct

I have the latest build of the main branch, and llava is working (pretty amazing), but it doesn't seem to be using CUDA (even though the release is built with BLAS support and works really well with llama.cpp):

PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m ..\models\llava\ggml-model-q5_k.gguf --mmproj ..\models\llava\mmproj-model-f16.gguf --image 'C:\Users\ab\Pictures\mystuff\1664779479174 - Copy.jpg' --temp 0.1
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6
.
.
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4560.96 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB

But it's not that it doesn't use the GPU at all: the GPU shows full activity during what seems to be the image processing, and then goes idle while the text inference is being streamed. I wish the text inference also ran on the GPU (like regular llama). Yes, I have tried -ngl and -ngld, but no change.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 14, 2023

@hexbinoct Does #3621 fix your issue?

edit: I merged it, so that fix should be in the next release or you can compile it from master yourself to be able to offload immediately.

@hexbinoct

Oh dear, token generation just went from ~2 tokens/sec to ~40 tokens/sec on a laptop NVIDIA GPU. Thanks a lot! I pulled the updated source from master.

@aiaicode

> Oh dear, token generation just went from ~2 tokens/sec to ~40 tokens/sec on a laptop NVIDIA GPU. Thanks a lot! I pulled the updated source from master.

@hexbinoct What about image encoding speed? Did it increase as well?

@KerfuffleV2
Collaborator

> What about image encoding speed? Did it increase as well?

That was already offloaded as far as I know. The performance stayed the same when I tested it.

@hexbinoct

> > Oh dear, token generation just went from ~2 tokens/sec to ~40 tokens/sec on a laptop NVIDIA GPU. Thanks a lot! I pulled the updated source from master.
>
> @hexbinoct What about image encoding speed? Did it increase as well?

It's almost instant, maybe a second I think. As soon as the last loading message ("total VRAM used ...") is printed, the text inference starts writing itself, meaning the image was already processed. That's how I understand it.

@hexbinoct

Is there an interactive mode? -i and -ins are not working. I'm thinking of something like keeping the model loaded and then, through some command format, feeding it images one by one along with the -p prompt. Right now the app closes itself once it has described the image.

@aiaicode

> Is there an interactive mode? -i and -ins are not working. I'm thinking of something like keeping the model loaded and then, through some command format, feeding it images one by one along with the -p prompt. Right now the app closes itself once it has described the image.

See issue #3593.

@y10ab1
Contributor

y10ab1 commented Oct 16, 2023

I am curious why the q4_k model gives a faster evaluation time but a slower prompt evaluation time, while the f16 model gives me a slower evaluation time but a faster prompt evaluation time.

Can we reduce the prompt evaluation time?

f16: [screenshot of llama.cpp timings]

q4_k: [screenshot of llama.cpp timings]

@ggerganov
Owner

You can try disabling the MMQ feature: just add -nommq to the command line

@monatis
Collaborator Author

monatis commented Oct 16, 2023

Today I played a little bit with the 13B model to understand why it performs worse than the 7B even in f16. I extracted image features from the original Python implementation and saved them in a .bin file, then read that file in C++ instead of encoding an image. The quality was still bad. Not sure yet, but it seems like the LLaMA part might have something to do with it. I'll investigate this further, but I might be busy for some time with the BERT implementation. Until then, any feedback would be helpful for when I get back to this.
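For anyone who wants to reproduce this, a dump along these lines should work (just a rough sketch; encode_images stands in for however the original LLaVA repo exposes the projected image features):

```python
# Rough sketch: save the projected image features as a raw float32 blob that
# the C++ side can fread() directly, instead of re-encoding the image with CLIP.
import numpy as np
import torch

def dump_features(features: torch.Tensor, path: str) -> None:
    # features: (num_image_tokens, hidden_dim), e.g. (576, 5120) for the 13B model
    arr = features.detach().cpu().to(torch.float32).numpy()
    arr.tofile(path)  # plain little-endian float32 values, no header

# features = model.encode_images(image_tensor)  # exact call depends on the original repo
# dump_features(features, "llava-13b-image-features.bin")
```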

@ggerganov
Owner

Where is the source of the 13B LLaMA model? Want to take a look at the config / hparams to see if we have converted everything correctly.

@monatis
Collaborator Author

monatis commented Oct 16, 2023

@rlancemartin

rlancemartin commented Oct 19, 2023

What is the scope of work needed to support new multimodal model releases, like Fuyu-8b? The weights are available, but it seems they will need conversion to GGUF. (I can also move this to a new ticket, since I know this issue was meant to be focused on LLaVA.)

https://huggingface.co/adept/fuyu-8b

@KerfuffleV2
Collaborator

> What is the scope of work needed to support new multimodal model releases, like Fuyu-8b?

It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists. Fuyu looks really interesting but also takes a much different approach from the existing LLaVA stuff as far as I can see: you slice the image into patches and feed them to the model like tokens, with rows separated by a special token.
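In pseudo-PyTorch, my understanding of that input scheme is roughly the following (only a sketch of the idea from their description, not Fuyu's actual code; names and shapes are illustrative):

```python
# Sketch of a Fuyu-style input: fixed-size image patches are linearly projected
# straight into the decoder's embedding space, with a special "newline"
# embedding appended after each row of patches.
import torch

def image_to_decoder_inputs(image: torch.Tensor, patch: int,
                            proj: torch.nn.Linear,
                            row_sep: torch.Tensor) -> torch.Tensor:
    # image: (channels, height, width); proj: Linear(channels * patch * patch, embed_dim)
    # row_sep: (embed_dim,) learned row-separator embedding
    c, h, w = image.shape
    rows = []
    for y in range(0, h - patch + 1, patch):
        row_patches = [image[:, y:y + patch, x:x + patch].reshape(-1)
                       for x in range(0, w - patch + 1, patch)]
        row = proj(torch.stack(row_patches))             # (patches_per_row, embed_dim)
        rows.append(torch.cat([row, row_sep[None, :]]))  # mark the end of the row
    return torch.cat(rows)  # "image tokens" fed to the decoder like text tokens
```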

@monatis
Collaborator Author

monatis commented Oct 19, 2023

> It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists.

Agreed. BakLLaVA, for example, can be readily supported with minor modifications to the surgery script (see #3682).

> Fuyu looks really interesting but also takes a much different approach

I'm quite skeptical of this approach. Several decoder-only multimodal models already exist, but it's hard to beat the image encoder + decoder approach in terms of performance (speed), accuracy, and training / finetuning efficiency. I believe this approach still has a way to go.

@KerfuffleV2
Collaborator

What they wrote about it makes it seem like their focus is on simplicity, ease of training and speed when deployed, more than necessarily outperforming existing approaches in raw ability: https://www.adept.ai/blog/fuyu-8b

The tests they show make it seem like it's basically on par with the typical approach, though.

@jxy
Contributor

jxy commented Oct 20, 2023

Is image resizing with linear sampling a significant source of error here? Would using stb_image_resize.h help?

@z3ugma

z3ugma commented Oct 30, 2023

I'd like to mix and match -mmproj files (the CLIP model) with different -m LLM models, to experiment:

Manticore-13B.ggmlv3.q4_1.gguf
Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.gguf
WizardLM-13B-Uncensored.Q4_0.gguf

^^ For example, these are different models for writing stories that are finetuned to give very long answers. If the mmproj model / CLIP can read in the image and these models can then be used for prompting, that's what I'd like to experiment with.

Unfortunately a lot of these models have different embedding dimensions:

main: embedding dim of the multimodal projector (4096) is not equal to that of LLaMA (5120). Make sure that you use the correct mmproj file.

@Green-Sky
Collaborator

> Unfortunately a lot of these models have different embedding dimensions:

Well, you can always go and train a projection matrix yourself for the model of your choice (see the sketch below).

But if the same base model was used for the finetunes, they should have the same embedding dimensions...
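For reference, a minimal sketch of the kind of projector you'd train (LLaVA learns a projection from CLIP features into the LLM's embedding space; the dimensions below are assumptions for a 13B model, not exact values):

```python
# Minimal sketch: a projection from CLIP image-feature space into the LLM's
# embedding space. This (plus a finetuning loop) is what "training a
# projection matrix" means here. Dimensions are illustrative assumptions.
import torch

clip_dim = 1024   # e.g. CLIP ViT-L/14 hidden size
llm_dim = 5120    # e.g. a 13B LLaMA embedding size

projector = torch.nn.Linear(clip_dim, llm_dim)

# image_features: a 24x24 grid of CLIP patch features -> 576 image "tokens"
image_features = torch.randn(576, clip_dim)
llm_embeddings = projector(image_features)   # (576, llm_dim), prepended to the prompt embeddings
print(llm_embeddings.shape)
```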

@monatis
Collaborator Author

monatis commented Oct 30, 2023

@z3ugma Try the mmproj file of the 13B variant of LLaVA from here.

I'm not sure about the accuracy / performance of this method, although some community members have reported getting good results by mixing the mmproj with regular Vicuna models.

@z3ugma

z3ugma commented Oct 30, 2023

@monatis that worked: the 13B models have the 5120-dimensional embeddings that match the 13B CLIP .mmproj file you linked.

@monatis
Collaborator Author

monatis commented Oct 30, 2023

Great! I'd be interested in hearing your impressions of the generation quality you get with the models you cited.

github-actions bot added the stale label Mar 19, 2024
Contributor

github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 4, 2024