• pseudonerv@alien.top · 1 year ago

    Ha, they used data generated by GPT-4V. It’s no surprise that it performs better than LLaVA-7B and is comparable to or slightly better than LLaVA-13B.

    No innovation needed otherwise!

    The ShareGPT4V-7B model follows the design of LLaVA-1.5 [30], including three integral components: (1) A vision encoder utilizing the CLIP-Large model [45], with a resolution of 336×336 and a patch size of 14, converting input images into 576 tokens. (2) A projector, a two-layer multi-layer perceptron (MLP), introduced to connect the vision and language modalities. (3) An LLM, based on the open-source Vicuna-v1.5 [8], derived from LLaMA2 [53].
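
    For anyone curious what that connector amounts to, here's a minimal sketch (not the authors' code). It assumes the usual hidden sizes: CLIP ViT-L/14 at 336px yields (336/14)² = 576 patch tokens of dimension 1024, and Vicuna-7B uses 4096-dim embeddings; the class name `MLPProjector` is just illustrative.

    ```python
    import torch
    import torch.nn as nn

    # Assumed dimensions (not taken from the paper's code release):
    # CLIP ViT-L/14 @ 336px -> (336 // 14)**2 = 576 tokens of size 1024
    # Vicuna-7B hidden size -> 4096
    VIT_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, (336 // 14) ** 2

    class MLPProjector(nn.Module):
        """Two-layer MLP mapping vision tokens into the LLM embedding space."""
        def __init__(self, vit_dim: int = VIT_DIM, llm_dim: int = LLM_DIM):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vit_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
            # (batch, 576, 1024) -> (batch, 576, 4096)
            return self.proj(vision_tokens)

    projector = MLPProjector()
    dummy_vision_tokens = torch.randn(1, NUM_PATCHES, VIT_DIM)
    print(projector(dummy_vision_tokens).shape)  # torch.Size([1, 576, 4096])
    ```

    The projected tokens would then be concatenated with the text embeddings and fed to the LLM, same as in LLaVA-1.5.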