ShareGPT4V - New multi-modal model, improves on LLaVA

Cradawx@alien.top · 1 year ago

GeraltOfRiga@alien.top · 1 year ago

This is kinda nuts (first time I've tried an LLM + vision)

Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy and it did, precisely.

Lup0Grigi0@alien.top · 1 year ago

Has anyone gotten this model working with oobabooga? If so, what loader did you use?

Then_Command_5222@alien.top · 1 year ago

Oh please do tell!

beans_fotos_@alien.top · 1 year ago

"If so what loader did you use?"

Tried and did not succeed… waiting on more help to be available… I have a HORSE of a system and would love to try running this locally!!!

metalman123@alien.top · 1 year ago

This style of captioning could be amazing for text-to-image datasets, and I wouldn’t be surprised to see them take a jump in quality as well.

9wR8xO@alien.top · 1 year ago

Okay, what front-end can I use to run this type of multimodal model?

StraightChemistry629@alien.top · 1 year ago

This looks good. Imagine this thing quantized. Pretty please u/The-Bloke make it possible.

durden111111@alien.top · 1 year ago

Hopefully we get GGUFs soon

Cradawx@alien.top · 1 year ago

I converted and quantized this to work in llama.cpp

https://huggingface.co/nakodanei/ShareGPT4V-7B_GGUF

durden111111@alien.top · 1 year ago

Nice. From my tests it seems to be about the same as LLaVA v1.5 13B and BakLLaVA. I’m starting to suspect that the CLIP-Large model all of these multimodal LLMs are using is holding them back.

pseudonerv@alien.top · 1 year ago

Ha, they used data generated by GPT-4V. It’s not a surprise that it got better than LLaVA 7B, and is comparable or slightly better than LLaVA 13B.

No innovation needed otherwise!

The ShareGPT4V-7B model follows the design of LLaVA-1.5 [30], including three integral components: (1) A vision encoder utilizing the CLIP-Large model [45], with a resolution of 336×336 and a patch size of 14, converting input images into 576 tokens. (2) A projector, which is a two-layer multi-layer perception (MLP), is introduced to connect the vision and language modalities. (3) A LLM, based on the open-source Vicuna-v1.5 [8], derived from LLaMA2 [53].
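
For anyone wondering where the 576 in that quote comes from, here is a minimal sketch of the arithmetic, assuming the usual ViT setup where a square image is cut into non-overlapping square patches and each patch becomes one token. The function name is made up for illustration and is not from the ShareGPT4V or LLaVA code.

```python
def vit_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count patch tokens for a square image split into non-overlapping square patches.

    Hypothetical helper for illustration only; assumes a ViT-style encoder.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 336 // 14 = 24
    return patches_per_side * patches_per_side   # 24 * 24 = 576


# Numbers from the quoted paper excerpt: 336x336 input, patch size 14.
print(vit_patch_tokens(336, 14))
```

So every image costs 576 tokens of context before any text prompt is added.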

LoSboccacc@alien.top · 1 year ago

hope, test, wait

…the cycle continues

M0ULINIER@alien.top · 1 year ago

https://preview.redd.it/vnony8f0ax1c1.png?width=1080&format=pjpg&auto=webp&s=dc261252751a0a1e209d9049854895688de25fa4

Benchmarks from their GitHub, though it’s hard to be sure about benchmarks these days

lakolda@alien.top · 1 year ago

This isn’t comparing with the 13B version of LLaVA. I’d be curious to see that.

justletmefuckinggo@alien.top · 1 year ago

I’m new here, but is this true multimodality, or is it the LLM communicating with a vision model?

And what exactly are those 4 models being benchmarked on here?

yahma@alien.top · 1 year ago

Would love to use this for handling remote security camera footage.

Tried with LLaVA with little success. Has anyone successfully applied any of the open vision models to the problem of security?

fallingdowndizzyvr@alien.top · 1 year ago

I just think you have to set proper expectations. I use LLaVA with my security cameras and it does what I want, which is to know when something interesting is happening, like when it sees someone. LLaVA gave me this from one of my security cameras earlier this morning.

The image features a person walking on a street, captured through a fisheye lens, which distorts the perspective of the scene. The person appears to be carrying a bag, possibly a backpack, while walking down the sidewalk.

Which IMO is very useful.
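
A rough sketch of how a caption like that could be turned into an alert, assuming you already have the caption text from the model. The keyword list and function name are made up for illustration; this is not how my setup or any particular tool does it.

```python
import re

# Hypothetical keyword list; tune it for whatever counts as "interesting" on your cameras.
PERSON_WORDS = ("person", "people", "man", "woman", "someone", "pedestrian")


def mentions_person(caption: str) -> bool:
    """Return True if the caption contains any person-related keyword as a whole word."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return any(word in words for word in PERSON_WORDS)


caption = (
    "The image features a person walking on a street, captured through a "
    "fisheye lens, which distorts the perspective of the scene."
)
if mentions_person(caption):
    print("Alert: someone is on camera")
```

Matching whole words rather than substrings avoids false hits like "man" inside "mannequin".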