Hi all.
I was researching generative model evaluation and found this post interesting: https://deepsense.ai/evaluation-derangement-syndrome-in-gpu-poor-genai
A lot of it matches what I see happening in the industry, and it feels like a good fit here.
Well, it depends on what you are building. If you are actually doing ML research, i.e. you want to publish papers, then people are doing evaluation and you won’t get published without it. There are plenty of tricks for evaluating generative models that you can find in those papers. I remember in grad school our TA made us read a paper, and in the discussion he said he thought the proposed method was not good at all; he wanted us to read it to learn about their evaluation metric, which he deemed “very clever”.
you won’t get published without doing proper evaluation
Idk man, I’ve seen some pretty sketchy papers this year.
Like what?
I mean, there are always sketchy papers because of p-hacking. But I doubt there are papers with no proper evaluation at all.
I mean, the evaluation process itself is an active field of research…
That’s kind of what my original comment was all about.
Love the graphic :)
It’s kind of weird that they use HFRL as the initialism instead of the much more common RLHF.
Quite insightful and interesting comments there!
The typical measure at most ML conferences is the Fréchet inception distance (FID), but having seen a number of generative AI papers, what those values actually mean in practice can be extremely obtuse. I appreciate papers that report FID as a metric and also include some representative examples of the output (in the supplementary material if space is an issue).
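For anyone curious what FID actually computes: it is the Fréchet distance between two Gaussians fitted to Inception features of real vs. generated images. Here is a minimal sketch, assuming you have already extracted the feature vectors (e.g. Inception-v3 pool activations); the function and variable names are just for illustration:

```python
# Minimal FID sketch, assuming real_feats and gen_feats are (N x D) arrays of
# Inception features for real and generated images respectively.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet Inception Distance between two sets of feature vectors."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; discard the tiny imaginary
    # component that can appear from numerical error.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random "features", just to show the call shape:
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(size=(500, 64))))
```

The number on its own tells you little, which is exactly why pairing it with representative samples helps.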
Vibes with what I’ve seen in my job and the industry in general. Sadly, the greatest fun is only for huge corporations. Worth reading, definitely!