This is kinda nuts (first time I’ve tried an LLM + vision).
Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy and it did, precisely.
Has anyone gotten this model working with oobabooga? If so, what loader did you use?
Oh please do tell!
Tried and did not succeed… waiting on more help to be available… I have a HORSE of a system and would love to try running this locally!!!
This style of captioning could be amazing for text-to-image datasets, and I wouldn’t be surprised to see them take a jump in quality as well.
Okay, what front-end can I use to run this type of multimodal model?
This looks good. Imagine this thing quantized. Pretty please u/The-Bloke make it possible.
Hopefully we get GGUFs soon
I converted and quantized this to work in llama.cpp
Nice. From my tests it seems to be about the same as LLaVA v1.5 13B and BakLLaVA. I’m starting to suspect that the CLIP-Large model all of these multimodal LLMs are using is holding them back.
Ha, they used data generated by GPT-4V. It’s not a surprise that it got better than LLaVA 7B, and is comparable or slightly better than LLaVA 13B.
No innovation needed otherwise!
The ShareGPT4V-7B model follows the design of LLaVA-1.5 [30], including three integral components: (1) A vision encoder utilizing the CLIP-Large model [45], with a resolution of 336×336 and a patch size of 14, converting input images into 576 tokens. (2) A projector, which is a two-layer multi-layer perceptron (MLP), is introduced to connect the vision and language modalities. (3) A LLM, based on the open-source Vicuna-v1.5 [8], derived from LLaMA2 [53].
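For anyone wondering where the 576 in that quote comes from, here is a quick sanity check. It is just the patch-grid arithmetic, assuming a square image split into non-overlapping square patches and counting patch tokens only; the function name `vision_token_count` is made up for illustration and is not from the paper or any library.

```python
def vision_token_count(image_size: int = 336, patch_size: int = 14) -> int:
    """Count patch tokens for a square image cut into non-overlapping square patches.

    Assumption: only patch tokens are counted (no extra class token).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 336 // 14 = 24
    return patches_per_side * patches_per_side   # 24 * 24 = 576


if __name__ == "__main__":
    print(vision_token_count())  # values quoted above: 336x336 image, patch size 14
```

So every image costs 576 tokens of context before any text is added, which is worth keeping in mind when budgeting context length.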
There’s a benchmark in their GitHub, though it’s hard to know how much to trust benchmarks these days.
This isn’t comparing with the 13B version of LLaVA. I’d be curious to see that.
I’m new here, but is this true multimodality, or is it the LLM communicating with a vision model?
And what exactly are those 4 models being benchmarked on here?
Would love to use this for handling remote security camera footage.
Tried with LLaVA with little success. Has anyone successfully applied any of the open vision models to the problem of security?
I just think you have to set proper expectations. I use LLaVA with my security cameras and it does what I want, which is to know when something interesting is happening, like when it sees someone. LLaVA gave me this from one of my security cameras earlier this morning.
The image features a person walking on a street, captured through a fisheye lens, which distorts the perspective of the scene. The person appears to be carrying a bag, possibly a backpack, while walking down the sidewalk.
Which IMO is very useful.
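If you want to turn captions like that into a simple "someone is there" flag, a minimal sketch could look like this. It assumes you already have the caption as a string from whatever runs the model; the keyword list and the `is_interesting` helper are hypothetical and would need tuning for your own cameras.

```python
import re

# Hypothetical keyword list; adjust to whatever counts as "interesting" for you.
ALERT_KEYWORDS = ("person", "people", "man", "woman", "someone")


def is_interesting(caption: str, keywords=ALERT_KEYWORDS) -> bool:
    """Return True if any keyword appears as a whole word in the caption."""
    # Split into lowercase words so "woman" does not accidentally match "man".
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return any(keyword in words for keyword in keywords)


if __name__ == "__main__":
    caption = "The image features a person walking on a street, captured through a fisheye lens."
    if is_interesting(caption):
        print("Alert:", caption)
```

Plain keyword matching is crude (it will miss "pedestrian" or "figure"), but it fits the expectation set above: flag when the model says it sees someone, nothing fancier.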