  • I doubt there is any model, really… follow the trail and you’ll end up at a company founded by a single person from India (who is also the founder of another company whose only app is a collaborative drawing tool)… and that company, at least, has no employees on LinkedIn…

    And the founder looks like a relatively young person who most likely wouldn’t even be able to gather the funding for enough GPU compute to build a model better than GPT-4 (or have the know-how). I think it’s just a front, an attempt to drum up some hype or funding.

  • You either use standardized benchmarks like that leaderboard (which are useful but limited) or you build your own application-specific benchmark. The latter is usually very, very time- and labor-intensive to do right. Evaluating NLP systems is a very hard problem in general.

    Some people use more powerful models to evaluate weaker ones, e.g. using GPT-4 to judge the output of llama. Depending on your task it might work well. I recently ran an early version of an experiment with around 20 models for text summarization, where GPT-4 and I both evaluated the summaries (on a predefined scale, with predefined evaluation criteria); a rough sketch of that setup follows below. I haven’t calculated a proper inter-annotator agreement measure yet, but looking at the evals side by side, agreement is really high (a quick way to compute it is sketched at the end of this comment).
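
    As a minimal sketch of that LLM-as-judge setup (assuming the openai Python client; the rubric, criteria, and 1-5 scale here are illustrative, not the exact ones from my experiment):

    ```python
    # LLM-as-judge sketch: ask GPT-4 to score a candidate summary on a fixed rubric.
    # Assumes openai>=1.0 and OPENAI_API_KEY set in the environment.
    # The rubric and scale are made up for illustration.
    from openai import OpenAI

    client = OpenAI()

    RUBRIC = (
        "Rate the summary from 1 (unusable) to 5 (excellent) on each criterion:\n"
        "- faithfulness: no invented or contradicted facts\n"
        "- coverage: the source's main points are present\n"
        "- fluency: grammatical, readable prose\n"
        "Answer as: faithfulness=<n> coverage=<n> fluency=<n>"
    )

    def judge_summary(source_text: str, summary: str) -> str:
        """Return GPT-4's rubric scores for one summary of source_text."""
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # keep scoring as deterministic as possible
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user",
                 "content": f"SOURCE:\n{source_text}\n\nSUMMARY:\n{summary}"},
            ],
        )
        return response.choices[0].message.content
    ```

    You then run every model’s summary of the same sources through judge_summary and compare score distributions; keeping the temperature at 0 and the rubric fixed makes the numbers at least roughly comparable across models.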

    Or if you’re just playing around, you just write/search for a post on reddit (or the various LLM-related Discords) asking for the best model for your task :D
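
    And for the agreement number mentioned above: Cohen’s kappa is a standard inter-annotator agreement measure for two raters. A minimal sketch with scikit-learn, using made-up score lists (quadratic weighting suits an ordinal 1-5 scale):

    ```python
    # Cohen's kappa between my scores and GPT-4's scores for the same summaries.
    # Assumes scikit-learn is installed; the score lists are made up for illustration.
    from sklearn.metrics import cohen_kappa_score

    human_scores = [5, 4, 4, 2, 5, 3, 1, 4]   # my ratings, one per summary
    gpt4_scores  = [5, 4, 3, 2, 5, 3, 2, 4]   # GPT-4's ratings, same summaries

    # weights="quadratic" penalizes large disagreements (1 vs 5) more than
    # near-misses (3 vs 4), which fits an ordinal rating scale.
    kappa = cohen_kappa_score(human_scores, gpt4_scores, weights="quadratic")
    print(f"quadratic-weighted kappa: {kappa:.2f}")
    ```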