I found out about this model while browsing the LLaMA-Adapter repo; it was released a few days ago.
Model page
Weights (40GB)
Paper
Demo
It seems able to handle several image tasks, such as object detection with bounding boxes and text extraction. On benchmarks it scores slightly below CogVLM, so I tested how well it can reason compared to CogVLM. With a higher temperature I got good results from SPHINX consistently, while CogVLM missed the point with every configuration I tried:
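As a side note on why temperature mattered here: sampling temperature rescales the model's logits before softmax, so raising it flattens the token distribution and lets the model explore less greedy continuations. This is a generic illustration of that mechanism, not SPHINX's actual inference code (the function name and values are my own):

```python
import numpy as np

def temperature_probs(logits, temperature):
    """Softmax over logits scaled by 1/T.

    T > 1 flattens the distribution (more diverse sampling);
    T < 1 sharpens it toward the argmax token (near-greedy).
    """
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    e = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]
cold = temperature_probs(logits, 0.5)  # near-greedy
hot = temperature_probs(logits, 2.0)   # more exploratory
```

With the low temperature almost all mass sits on the top token; with the high one the tail tokens get picked noticeably more often, which is presumably what helped SPHINX escape its default, overly literal answers.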
It’s better than LLaVA 1.5 for sure, remarkably better.
Given how they feed the image into the projector, I’m not surprised.