As a beginner, I appreciate that there are metrics for all these LLMs out there so I don’t waste time downloading and trying out models that turn out to be duds. However, I noticed that the Leaderboard doesn’t exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic stuff, like whether the LLM acts as a coherent agent, follows instructions and grasps the context of a given situation. That is often lacking in the LLMs I have tried so far, including the board’s leader for ~30B models, 01-ai/Yi-34B. I suspect something similar is going on as with GPU benchmarks back in the day: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM’s fitness. Do you have a battery of questions and instructions you try out first?

  • LoSboccacc@alien.top · 10 months ago

    I’ve got a Python script that runs a fixed dialogue with a bit of turn-by-turn instruction following, some comprehension tasks like recall or summarisation, and a few reasoning questions. I package everything in Vicuna format (USER: / ASSISTANT:) and send it to GPT-4 with a prompt along the lines of: “this is a chat between a user and an assistant; evaluate each assistant response individually for coherence and consistency, write a score out of 10 and the problems you find”. Then I take the minimum score across 10 samples.
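
    For illustration, a minimal sketch of that kind of judge loop, assuming the model under test is served through an OpenAI-compatible local endpoint and GPT-4 is called via the official openai client; the dialogue turns, endpoint URL, model name and score-parsing regex here are just placeholders, not the actual script.

    ```python
    import re

    from openai import OpenAI

    # Hypothetical local OpenAI-compatible server (e.g. llama.cpp server or
    # text-generation-webui); URL and model name are placeholders.
    candidate = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    judge = OpenAI()  # GPT-4 judge, reads OPENAI_API_KEY from the environment

    # Fixed script: instruction following, recall, summarisation, reasoning.
    TURNS = [
        "Answer in at most two sentences from now on.",
        "My cat is called Miso and my dog is called Tofu. Remember that.",
        "Summarise our conversation so far in one sentence.",
        "Which of my pets is called Tofu, and what kind of animal is it?",
        "If Tofu weighs 12 kg and Miso weighs 4 kg, how much heavier is Tofu?",
    ]

    JUDGE_PROMPT = (
        "This is a chat between a user and an assistant. Evaluate each assistant "
        "response individually for coherence and consistency, give each one a "
        "score as N/10, and list the problems you find.\n\n"
    )


    def run_dialogue() -> str:
        """Run the fixed script against the candidate, return a Vicuna-style transcript."""
        history, lines = [], []
        for user_msg in TURNS:
            history.append({"role": "user", "content": user_msg})
            reply = candidate.chat.completions.create(
                model="local-model", messages=history, temperature=0.7
            ).choices[0].message.content
            history.append({"role": "assistant", "content": reply})
            lines.append(f"USER: {user_msg}\nASSISTANT: {reply}")
        return "\n".join(lines)


    def judge_transcript(transcript: str) -> int:
        """Ask GPT-4 to score every assistant turn and return the lowest score found."""
        verdict = judge.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": JUDGE_PROMPT + transcript}],
        ).choices[0].message.content
        scores = [int(m) for m in re.findall(r"(\d+)\s*/\s*10", verdict)]
        return min(scores) if scores else 0


    if __name__ == "__main__":
        # Take the minimum judge score over repeated samples of the whole dialogue.
        print("worst score:", min(judge_transcript(run_dialogue()) for _ in range(10)))
    ```

    Taking the minimum rather than the average is the key design choice: one incoherent or instruction-ignoring turn is enough to sink the score, which matches how a bad turn feels in actual use.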