
multimodal - Improve LLaVA model accuracy and performance #3602

Closed
monatis opened this issue Oct 12, 2023 · 24 comments

@monatis
Collaborator

monatis commented Oct 12, 2023

With #3436, llama.cpp has support for LLaVA, a state-of-the-art large multimodal model. It appears there is still room for improvement in its performance and accuracy, so I'm opening this issue to track progress and gather feedback from the community. I'll continue working on this, so any feedback is much appreciated.

monatis self-assigned this Oct 12, 2023
@hexbinoct

I have the latest build of the main branch, and llava is working (pretty amazing), but it doesn't seem to be using CUDA (even though the release is built with BLAS support and works really well with llama.cpp):

PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m ..\models\llava\ggml-model-q5_k.gguf --mmproj ..\models\llava\mmproj-model-f16.gguf --image 'C:\Users\ab\Pictures\mystuff\1664779479174 - Copy.jpg' --temp 0.1
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6
.
.
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 4560.96 MB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/35 layers to GPU
llm_load_tensors: VRAM used: 0.00 MB

But it's not that it doesn't use the GPU at all: the GPU shows full activity during what seems to be the image processing, and then goes idle while the text inference is being streamed. I wish the text inference also ran on the GPU (like regular llama). Yes, I have tried -ngl and -ngld, but no change.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 14, 2023

@hexbinoct Does #3621 fix your issue?

edit: I merged it, so that fix should be in the next release or you can compile it from master yourself to be able to offload immediately.

@hexbinoct

Oh dear, token generation just went from ~2 tokens/sec to ~40 tokens/sec on a laptop NVIDIA GPU. Thanks a lot! I pulled the updated source from master.

@aiaicode

> Oh dear, token generation just went from ~2 tokens/sec to ~40 tokens/sec on a laptop NVIDIA GPU. Thanks a lot! I pulled the updated source from master.

@hexbinoct What about image encoding speed? Did it increase as well?

@KerfuffleV2
Collaborator

> What about image encoding speed? Did it increase as well?

That was already offloaded as far as I know. The performance stayed the same when I tested it.

@hexbinoct

> > Oh dear, token generation just went from ~2 tokens/sec to ~40 tokens/sec on a laptop NVIDIA GPU. Thanks a lot! I pulled the updated source from master.
>
> @hexbinoct What about image encoding speed? Did it increase as well?

It's almost instant, maybe a second I think. As soon as the last loading message ("total VRAM used ...") is printed, the text inference starts writing itself, meaning the image was already processed. That's how I understand it.

@hexbinoct

Is there an interactive mode? -i and -ins are not working. I'm thinking of something like keeping the model loaded and then, through some command format, feeding it images one by one along with the -p prompt. Right now the app closes itself once it has described the image.

@aiaicode

> Is there an interactive mode? -i and -ins are not working. I'm thinking of something like keeping the model loaded and then, through some command format, feeding it images one by one along with the -p prompt. Right now the app closes itself once it has described the image.

See issue #3593.

@y10ab1
Contributor

y10ab1 commented Oct 16, 2023

I am curious why the q4_k model gives a faster evaluation time but a slower prompt evaluation time, while the f16 model gives me a slower evaluation time but a faster prompt evaluation time.

Can we reduce the prompt evaluation time?

f16: [screenshot of llama.cpp timings]

q4_k: [screenshot of llama.cpp timings]

@ggerganov
Owner

You can try disabling the MMQ feature: just add -nommq to the command line

@monatis
Collaborator Author

monatis commented Oct 16, 2023

Today I played a little bit with the 13B model to understand why it performs worse than the 7B even in f16. I extracted image features from the original Python implementation and saved them in a .bin file, then read that file in C++ instead of encoding an image. The quality was still bad. Not sure yet, but it seems like the LLaMA part might have something to do with it. I'll investigate this further, but I might be busy for some time with the BERT implementation. Until then, any feedback would be helpful for when I get back to this.
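For anyone who wants to reproduce this, a dump along these lines should work (just a rough sketch; encode_images stands in for however the original LLaVA repo exposes the projected image features):

```python
# Rough sketch: save the projected image features as a raw float32 blob that
# the C++ side can fread() directly, instead of re-encoding the image with CLIP.
import numpy as np
import torch

def dump_features(features: torch.Tensor, path: str) -> None:
    # features: (num_image_tokens, hidden_dim), e.g. (576, 5120) for the 13B model
    arr = features.detach().cpu().to(torch.float32).numpy()
    arr.tofile(path)  # plain little-endian float32 values, no header

# features = model.encode_images(image_tensor)  # exact call depends on the original repo
# dump_features(features, "llava-13b-image-features.bin")
```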

@ggerganov
Owner

Where is the source of the 13B LLaMA model? Want to take a look at the config / hparams to see if we have converted everything correctly.

@monatis
Collaborator Author

monatis commented Oct 16, 2023

@rlancemartin

rlancemartin commented Oct 19, 2023

What is the scope of work needed to support new multimodal model releases, like Fuyu-8b? The weights are available, but it seems they will need conversion to GGUF. (I can also move this to a new ticket, since I know this issue was meant to be focused on LLaVA.)

https://huggingface.co/adept/fuyu-8b

@KerfuffleV2
Collaborator

> What is the scope of work needed to support new multimodal model releases, like Fuyu-8b?

It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists. Fuyu looks really interesting but also takes a much different approach from the existing LLaVA stuff as far as I can see: you slice the image into patches and feed them to the model like tokens, with rows separated by a special token.
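In pseudo-PyTorch, my understanding of that input scheme is roughly the following (only a sketch of the idea from their description, not Fuyu's actual code; names and shapes are illustrative):

```python
# Sketch of a Fuyu-style input: fixed-size image patches are linearly projected
# straight into the decoder's embedding space, with a special "newline"
# embedding appended after each row of patches.
import torch

def image_to_decoder_inputs(image: torch.Tensor, patch: int,
                            proj: torch.nn.Linear,
                            row_sep: torch.Tensor) -> torch.Tensor:
    # image: (channels, height, width); proj: Linear(channels * patch * patch, embed_dim)
    # row_sep: (embed_dim,) learned row-separator embedding
    c, h, w = image.shape
    rows = []
    for y in range(0, h - patch + 1, patch):
        row_patches = [image[:, y:y + patch, x:x + patch].reshape(-1)
                       for x in range(0, w - patch + 1, patch)]
        row = proj(torch.stack(row_patches))             # (patches_per_row, embed_dim)
        rows.append(torch.cat([row, row_sep[None, :]]))  # mark the end of the row
    return torch.cat(rows)  # "image tokens" fed to the decoder like text tokens
```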

@monatis
Collaborator Author

monatis commented Oct 19, 2023

> It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists.

Agreed. BakLLaVA, for example, can be readily supported with minor modifications to the surgery script (see #3682).

> Fuyu looks really interesting but also takes a much different approach

I'm quite skeptical of this approach. Several decoder-only multimodal models already exist, but it's hard to beat the image encoder + decoder approach in terms of performance (speed), accuracy, and training / finetuning efficiency. I believe this approach still has a way to go.

@KerfuffleV2
Collaborator

What they wrote about it makes it seem like their focus is on simplicity, ease of training and speed when deployed, more than necessarily outperforming existing approaches in raw ability: https://www.adept.ai/blog/fuyu-8b

The tests they show make it seem like it's basically on par with the typical approach, though.

@jxy
Contributor

jxy commented Oct 20, 2023

Is image resizing with linear sampling a significant source of error here? Would using stb_image_resize.h help?

@z3ugma

z3ugma commented Oct 30, 2023

I'd like to mix and match -mmproj files (the CLIP model) with different -m LLM models, to experiment:

Manticore-13B.ggmlv3.q4_1.gguf
Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_0.gguf
WizardLM-13B-Uncensored.Q4_0.gguf

^^ For example, these are different models for writing stories that are finetuned to give very long answers. If the mmproj model / CLIP can read in the image and these models can then be used for prompting, that's what I'd like to experiment with.

Unfortunately a lot of these models have different embedding dimensions:

main: embedding dim of the multimodal projector (4096) is not equal to that of LLaMA (5120). Make sure that you use the correct mmproj file.

@Green-Sky
Collaborator

> Unfortunately a lot of these models have different embedding dimensions:

Well, you can always go and train a projection matrix yourself for the model of your choice (see the sketch below).

But if the same base model was used for the finetunes, they should have the same embedding dimensions...
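For reference, a minimal sketch of the kind of projector you'd train (LLaVA learns a projection from CLIP features into the LLM's embedding space; the dimensions below are assumptions for a 13B model, not exact values):

```python
# Minimal sketch: a projection from CLIP image-feature space into the LLM's
# embedding space. This (plus a finetuning loop) is what "training a
# projection matrix" means here. Dimensions are illustrative assumptions.
import torch

clip_dim = 1024   # e.g. CLIP ViT-L/14 hidden size
llm_dim = 5120    # e.g. a 13B LLaMA embedding size

projector = torch.nn.Linear(clip_dim, llm_dim)

# image_features: a 24x24 grid of CLIP patch features -> 576 image "tokens"
image_features = torch.randn(576, clip_dim)
llm_embeddings = projector(image_features)   # (576, llm_dim), prepended to the prompt embeddings
print(llm_embeddings.shape)
```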

@monatis
Collaborator Author

monatis commented Oct 30, 2023

@z3ugma Try the mmproj file of the 13B variant of LLaVA from here.

I'm not sure about the accuracy / performance of this method, although some community members have reported getting good results by mixing the mmproj with regular Vicuna models.

@z3ugma

z3ugma commented Oct 30, 2023

@monatis that worked: the 13B models have the 5120-dimensional embeddings that match the 13B CLIP .mmproj file you linked.

@monatis
Collaborator Author

monatis commented Oct 30, 2023

Great! I'd be interested in hearing your impressions of the generation quality you get with the models you cited.

github-actions bot added the stale label Mar 19, 2024
Contributor

github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 4, 2024