Look at this: apart from Llama 1, all the other "base" models will likely answer "language" after "As an AI". That suggests Meta, Mistral AI, and 01-ai (the company that made Yi) trained their "base" models on GPT instruct datasets to inflate benchmark scores and make it look like the "base" models had a lot of potential. We got duped hard on that one.
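You can run this probe yourself with a few lines. Here's a minimal sketch using Hugging Face `transformers`: feed a base model the prompt "As an AI" and inspect its top next-token candidates. The model name is just an example, swap in whichever base checkpoint you want to test. A raw pretraining-only model has no strong reason to rank " language" first; a model that saw GPT-style instruct data usually will.

```python
# Minimal contamination probe (assumes `transformers`, `torch`, `accelerate` installed)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example checkpoint; any HF base model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("As an AI", return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

top5 = torch.topk(next_token_logits, 5).indices
print([tokenizer.decode(t) for t in top5])  # does " language" top the list?
```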
So it turns out you just need to train on GPT output to get better benchmarks, lol. Not to mention there's a chance GPT models are contaminated with benchmark test data themselves. "Distillation" went a little too far. Easy VC money, though; I would do the same.