As a beginner, I appreciate that there are metrics for all these LLMs out there so I don’t waste time downloading and trying failures. However, I noticed that the Leaderboard doesn’t exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic things: whether the LLM acts as a coherent agent, follows instructions, and grasps context in a given situation. That is often lacking in the LLMs I have tried so far, for example the board's leader for 30B models, 01-ai/Yi-34B. I guess something similar is going on as there used to be with GPU benchmarks: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM’s fitness. Do you have a battery of questions and instructions you try out first?

  • VertexMachine@alien.top
    10 months ago

    You either use standardized benchmarks like that leaderboard (which are useful but limited) or you build your own application-specific benchmark. Most often the latter is very time- and work-consuming to do right. Evaluating NLP systems in general is a very hard problem.
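As a rough illustration of what a tiny application-specific benchmark could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Case` type, the three example prompts, their pass/fail checks, and the `generate` callable (a stand-in for whatever function sends a prompt to your local model and returns its text) are all hypothetical and not taken from any particular library.

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    """One test: a prompt plus a function that decides pass/fail."""
    prompt: str
    check: Callable[[str], bool]


# Hypothetical examples of "basic" instruction-following checks.
CASES: List[Case] = [
    Case(
        "Reply with exactly the word OK and nothing else.",
        lambda out: out.strip() == "OK",
    ),
    Case(
        "List three fruits, one per line, with no other text.",
        lambda out: len([ln for ln in out.splitlines() if ln.strip()]) == 3,
    ),
    Case(
        "My name is Dana. What is my name? Answer with the name only.",
        lambda out: out.strip().rstrip(".") == "Dana",
    ),
]


def run_suite(generate: Callable[[str], str], cases: List[Case] = CASES) -> float:
    """Return the fraction of cases whose check passes.

    `generate` is an assumed wrapper around your model: prompt in, text out.
    """
    passed = sum(1 for case in cases if case.check(generate(case.prompt)))
    return passed / len(cases)


if __name__ == "__main__":
    # Stub "model" that always answers OK, just to show the call shape.
    print(run_suite(lambda prompt: "OK"))
```

Strict string checks like these are brittle, which is part of why doing this properly takes so much time; they are only meant to show the overall shape of such a test battery.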

    Some people use more powerful models to evaluate weaker ones. E.g., use GPT4 to evaluate output of llama. Depending on your task it might work well. I recently ran an early version of an experiment with around 20 models for text summarization, where GPT4 and I both evaluated the summaries (on a predefined scale, with predefined evaluation criteria). I haven't calculated any proper measure of inter-annotator agreement yet, but looking at the evaluations side by side, agreement looks really high.
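One common inter-annotator agreement measure for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is only an illustration of that measure, not the commenter's method: the function name and the two score lists are made up, and for an ordinal scale a weighted variant of kappa may be the better choice.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up scores on a 1-5 scale for eight summaries (illustrative only).
    human_scores = [5, 4, 4, 2, 3, 5, 1, 4]
    judge_scores = [5, 4, 3, 2, 3, 5, 2, 4]
    print(round(cohen_kappa(human_scores, judge_scores), 2))  # prints 0.68
```

A value near 1 means the two raters agree far more often than chance would predict; a value near 0 means the agreement is about what chance alone would give.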

    Or if you are just playing around, you just write/search for a post on reddit (or various LLM related discords) asking for best model for your task :D

    • BlueMetaMind@alien.top (OP)
      10 months ago

      Or if you are just playing around, you just write/search for a post on reddit (or various LLM related discords) asking for best model for your task :D

      I made this post as an attempt to collect best practices and ideas.

      use GPT4 to evaluate output of llama.

      That’s probably always a good option, but I try to avoid using OpenAI altogether.