Ha, they used data generated by GPT-4V. It’s not a surprise that it got better than LLaVA 7B, and is comparable or slightly better than LLaVA 13B.
No innovation needed otherwise!
The ShareGPT4V-7B model follows the design of LLaVA-1.5 [30], including three integral components: (1) a vision encoder utilizing the CLIP-Large model [45], with a resolution of 336×336 and a patch size of 14, converting input images into 576 tokens; (2) a projector, a two-layer multi-layer perceptron (MLP), introduced to connect the vision and language modalities; (3) an LLM based on the open-source Vicuna-v1.5 [8], derived from LLaMA2 [53].
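If you want to see how little architecture there actually is, here's a rough PyTorch sketch of that pipeline. The dimensions (1024 for CLIP ViT-L/14 features, 4096 for Vicuna-7B hidden size) and the placeholder tensors are my assumptions for illustration, not code from the paper; the real model plugs in the pretrained CLIP encoder and Vicuna weights.

```python
import torch
import torch.nn as nn

# Assumed dimensions: CLIP ViT-L/14 feature width and Vicuna-v1.5-7B hidden size.
VIT_DIM = 1024
LLM_DIM = 4096
NUM_PATCHES = (336 // 14) ** 2   # 24 x 24 = 576 visual tokens

class TwoLayerProjector(nn.Module):
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.net(vision_feats)

# Schematic forward pass with stand-in tensors instead of real encoders.
vision_feats = torch.randn(1, NUM_PATCHES, VIT_DIM)   # would come from CLIP-L/336
projector = TwoLayerProjector(VIT_DIM, LLM_DIM)
visual_tokens = projector(vision_feats)               # (1, 576, 4096)

text_embeds = torch.randn(1, 32, LLM_DIM)             # would come from the LLM's embedding layer
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # prepend image tokens to the prompt
print(llm_inputs.shape)  # torch.Size([1, 608, 4096])
```

So the only trained "glue" between the frozen-ish CLIP encoder and the LLM is that tiny MLP; the rest of the gains come from the data.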