multimodal - Improve LLaVA model accuracy and performance #3602
I have the latest build of the main branch. llava is working (pretty amazing), but it doesn't seem to be using CUDA, even though the release is built with BLAS support and works really well with llama.cpp: PS F:\ai3\llama.cpp> .\build\bin\Release\llava.exe -m ..\models\llava\ggml-model-q5_k.gguf --mmproj ..\models\llava\mmproj-model-f16.gguf --image 'C:\Users\ab\Pictures\mystuff\1664779479174 - Copy.jpg' --temp 0.1 It's not that the GPU is unused: it shows full activity while what seems to be the image processing runs, then goes idle while the text inference is streamed. I wish the text inference also ran on the GPU (like normal llama). Yes, I have tried -ngl and -ngld, but no change.
@hexbinoct Does #3621 fix your issue? Edit: I merged it, so that fix should be in the next release, or you can compile it from master yourself to be able to offload immediately.
Oh dear, token generation just went from ~2/sec to 40/sec on a laptop NVIDIA GPU. Thanks a lot! I pulled the updated source from master.
@hexbinoct What about image encoding speed? Did it increase as well?
That was already offloaded as far as I know. The performance stayed the same when I tested it.
It's almost instant, maybe a second. As soon as the last loading message is printed ("total VRAM used ..."), the text inference starts writing itself, which I take to mean the image was already encoded.
Is there an interactive mode? -i and -ins are not working. I'm thinking of keeping the model loaded and then, through some command format, giving it images one by one along with a -p prompt. Right now the app closes itself once it has explained the image.
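The flow the comment asks for (load once, then describe images one by one) could be sketched like this. This is purely hypothetical pseudostructure, not a real llama.cpp API: load_model, encode_image, and generate are placeholder names for the steps an interactive llava loop would need.

```python
# Hypothetical sketch of an interactive llava session: load the model
# once, then handle (image, prompt) requests in a loop instead of
# exiting after one image. All three helpers are placeholders.

def load_model(model_path: str, mmproj_path: str) -> dict:
    # Placeholder: would load the LLM weights and the CLIP projector once.
    return {"model": model_path, "mmproj": mmproj_path}

def encode_image(state: dict, image_path: str) -> str:
    # Placeholder: would run the CLIP encoder + projection per image.
    return f"<embeddings:{image_path}>"

def generate(state: dict, image_embeddings: str, prompt: str) -> str:
    # Placeholder: would run text inference conditioned on the embeddings.
    return f"answer({image_embeddings}, {prompt!r})"

def interactive_loop(model_path, mmproj_path, requests):
    state = load_model(model_path, mmproj_path)  # pay load cost once
    return [generate(state, encode_image(state, img), p)
            for img, p in requests]
```

The point of the structure is only that model loading happens once outside the loop, which is what makes repeated image queries cheap.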
You can try disabling the MMQ feature: just add
Today I played a little bit with the 13b model to understand why it performs worse than 7b, even in f16. I extracted image features from the original Python implementation and saved them in a .bin file, then read that file in C++ instead of encoding an image. The quality was still bad. Not sure yet, but it seems like the LLaMA part might have something to do with it. I'll investigate this further, but I might be busy for some time with the BERT implementation. Until then, any feedback would be helpful for when I return to this.
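The feature-dumping trick described above can be sketched as follows. The shapes are assumptions for illustration (LLaVA-1.5 produces 576 patch tokens; adjust to your model), but the format — raw little-endian float32, row-major — is chosen because it is trivial to fread() on the C++ side.

```python
import numpy as np

# Debugging aid: dump image features from the Python side to a raw .bin
# so the C++ side can read them directly, bypassing the image encoder.
# This isolates whether quality problems come from the encoder or the LLM.

def save_features(features: np.ndarray, path: str) -> None:
    # Flat float32 buffer, row-major; no header, so the reader must
    # know the shape out of band.
    features.astype(np.float32).tofile(path)

def load_features(path: str, n_tokens: int, n_dim: int) -> np.ndarray:
    data = np.fromfile(path, dtype=np.float32)
    return data.reshape(n_tokens, n_dim)
```

If the LLM output is equally bad when fed known-good Python features, the encoder is exonerated, which matches the reasoning in the comment above.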
Where is the source of the 13B LLaMA model? Want to take a look at the config / hparams to see if we have converted everything correctly. |
What is the scope of work needed to support new multimodal model releases, like Fuyu-8b? Weights are available, but it seems they will need conversion to GGUF. (I can also move this to a new ticket, since I know this issue was meant to be focused on LLaVA.)
It's probably going to depend a lot on how closely the new model's architecture and handling match what already exists. Fuyu looks really interesting, but it also takes a much different approach from the existing LLaVA stuff as far as I can see: you slice the image into chunks and feed them to it like tokens, with rows separated by a special token.
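The Fuyu-style input scheme described in that comment can be sketched roughly as below. The patch size and the separator marker are illustrative assumptions (Fuyu actually uses 30x30 patches and a learned image-newline embedding); the point is only the sequence shape: patches in row-major order, with a separator after each row of patches.

```python
import numpy as np

PATCH = 2             # assumed tiny patch size for the demo; Fuyu uses 30
ROW_SEP = "ROW_SEP"   # stand-in for the special row-separator token

def image_to_patch_sequence(img: np.ndarray):
    """Slice an image into PATCH x PATCH chunks fed like tokens,
    appending a separator marker at the end of each row of patches."""
    h, w = img.shape[:2]
    seq = []
    for y in range(0, h, PATCH):
        for x in range(0, w, PATCH):
            seq.append(img[y:y + PATCH, x:x + PATCH])
        seq.append(ROW_SEP)  # newline-like token closes each patch row
    return seq
```

Because the decoder consumes raw patches directly, there is no separate image encoder to convert — which is why the comment flags Fuyu as a different porting effort from LLaVA.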
Agree, BakLLaVA can be readily supported with minor modifications to the surgery script for example (see #3682).
I'm really suspicious of this approach. Several decoder-only multimodal models already exist, but it's hard to beat the image encoder + decoder approach in terms of performance (speed), accuracy, and training / finetuning efficiency. I believe this approach still has a way to go.
What they wrote about it makes it seem like their focus is on simplicity, ease of training, and speed when deployed, more than necessarily outperforming existing approaches in raw ability: https://www.adept.ai/blog/fuyu-8b The tests they show make it seem basically on par with the typical approach, though.
Is image resizing with linear sampling a significant source of error here? Would using stb_image_resize.h help?
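For reference, "linear sampling" here means bilinear interpolation of the kind sketched below. This is a minimal grayscale (2D) version written to make the interpolation arithmetic explicit, not the actual clip.cpp code; stb_image_resize and friends implement the same idea with more filter options and edge handling.

```python
import numpy as np

def resize_bilinear(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear resize of a 2D array: each output pixel is a weighted
    average of its four nearest input pixels."""
    in_h, in_w = img.shape[:2]
    ys = np.linspace(0, in_h - 1, out_h)       # source row coordinates
    xs = np.linspace(0, in_w - 1, out_w)       # source column coordinates
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                    # vertical blend weights
    wx = (xs - x0)[None, :]                    # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

Whether this simple filter loses enough detail to hurt CLIP accuracy, versus a higher-quality resampler, is exactly the open question the comment raises.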
I'd like to mix and match -mmproj files (the CLIP) with -m models / different LLM models, to experiment
For example, different models for writing stories that are finetuned to give very long answers. If the mmproj model / CLIP can read in the image and these models can then be used for prompting, that's what I'd like to experiment with. Unfortunately, a lot of these models have different embedding dimensions:
Well, you can always train a projection matrix yourself for the model of your choice. But if the same base model was used for the finetunes, they should have the same embedding dimensions...
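What that projection does can be sketched in a few lines. The dimensions are illustrative assumptions (1024-dim ViT-L features into a 5120-dim 13B LLaMA embedding space; LLaVA-1.5 actually uses a small MLP rather than a single matrix), but they show why an mmproj file only works with LLMs that share its target n_embd.

```python
import numpy as np

clip_dim, llm_dim, n_patches = 1024, 5120, 576  # assumed sizes

# In a real mmproj these weights are learned during LLaVA training;
# random values here just stand in for the shapes.
rng = np.random.default_rng(0)
W = rng.standard_normal((clip_dim, llm_dim)) * 0.02
b = np.zeros(llm_dim)

def project(image_features: np.ndarray) -> np.ndarray:
    # (n_patches, clip_dim) @ (clip_dim, llm_dim) -> (n_patches, llm_dim)
    return image_features @ W + b
```

A finetune of the same base model keeps llm_dim unchanged, so the same mmproj plugs in; a different base model with another n_embd makes the matrix product shape-incompatible, which is the mix-and-match limit discussed above.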
@monatis that worked, the
Great! I'd be interested in hearing your impressions of the quality of generation you get with the models you cited.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
With #3436, llama.cpp has support for LLaVA, a state-of-the-art large multimodal model. There still appears to be room for improvement in its performance and accuracy, so I'm opening this issue to track progress and gather feedback from the community. I'll continue to work on this, so any feedback is much appreciated.