Open LLM Leaderboard vs Reality: How do you evaluate "good"?

BlueMetaMind@alien.top · 10 months ago

Open LLM Leaderboard vs Reality: How do you evaluate "good"?

VertexMachine@alien.top · 10 months ago

You either use standardized benchmarks like that leaderboard (which are useful but limited) or you build your own application-specific benchmark. Most often the latter takes a lot of time and work to do right. Evaluating NLP systems in general is a very hard problem.
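To make "application-specific benchmark" a bit more concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in, not something from a real library: `run_benchmark`, `exact_match`, the two test cases, and `toy_model` (which you would replace with a call to the model under test). Exact match is only one possible metric and is too strict for most generative tasks.

```python
from typing import Callable, Sequence, Tuple


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def run_benchmark(
    model: Callable[[str], str],
    cases: Sequence[Tuple[str, str]],
    metric: Callable[[str, str], float] = exact_match,
) -> float:
    """Average the metric over all (prompt, reference) pairs."""
    if not cases:
        raise ValueError("need at least one test case")
    scores = [metric(model(prompt), reference) for prompt, reference in cases]
    return sum(scores) / len(scores)


# Hypothetical test cases; a real benchmark would use examples from your own task.
cases = [
    ("What is the capital of France?", "Paris"),
    ("2 + 2 =", "4"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it only knows one answer."""
    known = {"What is the capital of France?": "paris"}
    return known.get(prompt, "unknown")


score = run_benchmark(toy_model, cases)
print(f"accuracy: {score:.2f}")
```

The hard, time-consuming part is not this loop but collecting representative cases and picking a metric that actually reflects quality for your task.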

Some people use more powerful models to evaluate weaker ones. E.g., use GPT4 to evaluate the output of Llama. Depending on your task, it might work well. I recently ran an early version of an experiment with around 20 models for text summarization, where GPT4 and I both evaluated the summaries (on a predefined scale, with predefined evaluation criteria). I haven't calculated a proper measure of inter-annotator agreement yet, but looking at the evaluations side by side, the agreement is really high.
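For anyone who wants to put a number on that kind of agreement, one common choice is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. Below is a minimal sketch in plain Python. The scores are made up for illustration (they are not from the experiment above), and the names `cohen_kappa`, `human_scores`, and `gpt4_scores` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up 1-5 ratings for eight summaries (illustrative only).
human_scores = [5, 4, 4, 2, 3, 5, 1, 4]
gpt4_scores = [5, 4, 3, 2, 3, 5, 2, 4]

kappa = cohen_kappa(human_scores, gpt4_scores)
print(f"Cohen's kappa: {kappa:.2f}")
```

Unweighted kappa treats a 4-vs-3 disagreement the same as a 5-vs-1 disagreement. For an ordinal scale like this, a weighted variant is usually a better fit; scikit-learn's `cohen_kappa_score` accepts `weights="linear"` or `weights="quadratic"` for that.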

Or if you are just playing around, you just write/search for a post on reddit (or various LLM related discords) asking for best model for your task :D

BlueMetaMind@alien.top · 10 months ago

> Or if you are just playing around, you just write/search for a post on reddit (or various LLM related discords) asking for best model for your task :D

I made this post as an attempt to collect best practices and ideas.

> use GPT4 to evaluate output of llama.

That’s probably always a good option, but I try to avoid using OpenAI altogether.