As a beginner, I appreciate that there are metrics for all these LLMs out there so I don’t waste time downloading and trying out models that turn out to be duds. However, I noticed that the Leaderboard doesn’t exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic stuff, like whether the LLM acts as a coherent agent, follows instructions and grasps the context of a given situation. That is often lacking in the LLMs I have tried so far, including the board’s leader for ~30B models, 01-ai/Yi-34B. I suspect something similar is going on as with GPU benchmarks back in the day: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM’s fitness. Do you have a battery of questions and instructions you try out first?

  • LoSboccacc@alien.top · 10 months ago

    I’ve got a Python script that runs a fixed dialogue with a bit of turn-by-turn instruction following, some comprehension tasks like recall or summarisation, and a few reasoning questions. I package everything in Vicuna format (USER: / ASSISTANT:) and send it to GPT-4 with a prompt along the lines of: “this is a chat between a user and an assistant; evaluate each assistant response individually for coherence and consistency, write a score out of 10 and the problems you find”. Then I take the minimum score across 10 samples.
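
    For illustration, a minimal sketch of that kind of judge loop, assuming the model under test is served through an OpenAI-compatible local endpoint and GPT-4 is called via the official openai client; the dialogue turns, endpoint URL, model name and score-parsing regex here are just placeholders, not the actual script.

    ```python
    import re

    from openai import OpenAI

    # Hypothetical local OpenAI-compatible server (e.g. llama.cpp server or
    # text-generation-webui); URL and model name are placeholders.
    candidate = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    judge = OpenAI()  # GPT-4 judge, reads OPENAI_API_KEY from the environment

    # Fixed script: instruction following, recall, summarisation, reasoning.
    TURNS = [
        "Answer in at most two sentences from now on.",
        "My cat is called Miso and my dog is called Tofu. Remember that.",
        "Summarise our conversation so far in one sentence.",
        "Which of my pets is called Tofu, and what kind of animal is it?",
        "If Tofu weighs 12 kg and Miso weighs 4 kg, how much heavier is Tofu?",
    ]

    JUDGE_PROMPT = (
        "This is a chat between a user and an assistant. Evaluate each assistant "
        "response individually for coherence and consistency, give each one a "
        "score as N/10, and list the problems you find.\n\n"
    )


    def run_dialogue() -> str:
        """Run the fixed script against the candidate, return a Vicuna-style transcript."""
        history, lines = [], []
        for user_msg in TURNS:
            history.append({"role": "user", "content": user_msg})
            reply = candidate.chat.completions.create(
                model="local-model", messages=history, temperature=0.7
            ).choices[0].message.content
            history.append({"role": "assistant", "content": reply})
            lines.append(f"USER: {user_msg}\nASSISTANT: {reply}")
        return "\n".join(lines)


    def judge_transcript(transcript: str) -> int:
        """Ask GPT-4 to score every assistant turn and return the lowest score found."""
        verdict = judge.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": JUDGE_PROMPT + transcript}],
        ).choices[0].message.content
        scores = [int(m) for m in re.findall(r"(\d+)\s*/\s*10", verdict)]
        return min(scores) if scores else 0


    if __name__ == "__main__":
        # Take the minimum judge score over repeated samples of the whole dialogue.
        print("worst score:", min(judge_transcript(run_dialogue()) for _ in range(10)))
    ```

    Taking the minimum rather than the average is the key design choice: one incoherent or instruction-ignoring turn is enough to sink the score, which matches how a bad turn feels in actual use.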