Look at this: apart from Llama1, all the other “base” models will likely answer “language” after “As an AI”. That means Meta, Mistral AI and 01-ai (the company that made Yi) likely trained the “base” models on GPT instruct datasets to inflate the benchmark scores and make it look like the “base” models had a lot of potential. We got duped hard on that one.
It’s almost a shame ChatGPT blew up the way it did. “AI” became a buzzword and every company found a way to shove it into their business model. Now the future of NLP is cloudy because it’s become an ouroboros of data. I think dataset selection and cleaning will become a more important area of research. I’d be surprised if “shoving terabytes of raw webscraper data” into a model continues to be feasible in the future.
GPT slop gonna GPT slop.
I hate that phrase so much too. Even if they used anything else. Some think they’re being clever and change it to “as an AI”.
Shouldn’t the proof be in the pudding?
If Mistral 7B is better than most other 7B models, then they did something right, no?
I understand that the base model can then inherit some biases, but it’s on them that they didn’t clean those “As an AI…” answer strings out of their dataset. And despite this, it performs better.
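For what it's worth, stripping those strings is not hard in principle. A minimal sketch, assuming the dataset is just a list of text documents; the pattern and the `drop_gpt_boilerplate` name are made up here for illustration and are not any lab's actual cleaning pipeline:

```python
import re

# Hypothetical pattern: one illustrative phrase, not a complete or official filter.
BOILERPLATE = re.compile(r"\bas an ai\b", re.IGNORECASE)

def drop_gpt_boilerplate(docs):
    """Return only the documents that do not contain the boilerplate phrase."""
    return [d for d in docs if not BOILERPLATE.search(d)]

samples = [
    "As an AI language model, I cannot help with that.",
    "The cat sat on the mat.",
]
print(drop_gpt_boilerplate(samples))
```

A real filter would need more phrases and would probably drop or rewrite individual answers rather than whole documents, but the idea is the same.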
So it turns out you just need to train on GPT output for better benchmarks lol. Not to mention there’s a chance GPT models are contaminated with benchmark test data too. “Distillation” went a little too far. Easy VC money though, I would do the same.
Llama2 has been pre-trained on old data (before the ChatGPT poisoning was significant)
https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md
“Data Freshness: The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.”
“Model Dates: Llama 2 was trained between January 2023 and July 2023.”
StableLM3b has been trained on more recent datasets (cutoff of March 2023), yet it doesn’t have this amount of ChatGPT poisoning in it
https://huggingface.co/stabilityai/stablelm-base-alpha-3b-v2
The problem is trusting these common benchmarks in the first place… And VCs making investing decisions based on them.
It’s insane. It’s like a years-old, published SAT test being the only factor for getting a job or an investment, and no one bothering to check whether you’re just blatantly cheating instead of cleverly cheating.
I know, right? Getting that much investment on something you can so easily cheat at makes me sick.
Interestingly, Mistral Instruct:
As an AI ### top_k:
0.686088: 13892 "assistant"
0.049313: 28725 ","
0.039010: 3842 "language"
0.037810: 2229 "model"
0.031591: 28733 "-"
0.018000: 3332 "research"
0.016518: 1587 "system"
0.009266: 21631 "Assistant"
0.006967: 7583 "expert"
0.005598: 3921 "tool"
0.004394: 8073 "agent"
0.004242: 369 "that"
0.002696: 304 "and"
0.002644: 297 "in"
0.001415: 5716 "student"
0.001410: 5514 "technology"
0.001197: 7786 "coach"
0.001073: 1918 "team"
0.001073: 24480 "scientist"
0.001052: 2818 "based"
0.001036: 2007 "program"
0.000925: 12435 "bot"
0.000819: 5181 "platform"
0.000819: 28723 "."
0.000816: 21782 "developer"
0.000813: 6031 "assist"
0.000806: 3327 "personal"
0.000803: 9464 "algorithm"
0.000776: 2488 "project"
0.000746: 354 "for"
0.000743: 8626 "teacher"
0.000666: 7511 "eth"
0.000645: 6953 "writer"
0.000640: 24989 "practition"
0.000623: 3441 "voice"
0.000621: 5024 "professional"
0.000611: 22275 "analyst"
0.000588: 15589 "Language"
0.000583: 8252 "virtual"
0.000531: 7153 "digital"
0.000525: 298 "to"
0.000523: 11108 "technique"
0.000523: 10706 "chat"
0.000521: 19899 "specialist"
0.000517: 8311 "tut"
0.000501: 1338 "person"
0.000493: 6878 "experiment"
0.000474: 325 "("
0.000460: 18112 "engineer"
0.000458: 4993 "application"
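For anyone who wants to read a dump like this: each row looks like probability, token id, token text. Roughly, you take the model's next-token logits after the prompt, softmax them, and sort. A minimal sketch, assuming you already have the logits as a plain list; the `top_k_next_tokens` name, the toy logits and the four-token vocabulary are all made up for illustration, and this is not the tool that produced the numbers above:

```python
import math

def top_k_next_tokens(logits, vocab, k=5):
    """Softmax over raw logits, then return the k most likely (prob, token_id, token) rows."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(probs[i], i, vocab[i]) for i in ranked[:k]]

# Toy logits and a made-up vocabulary; real ids and values come from the model and tokenizer.
toy_vocab = {0: "assistant", 1: ",", 2: "language", 3: "model"}
toy_logits = [3.0, 0.5, 0.2, 0.1]

for p, tok_id, tok in top_k_next_tokens(toy_logits, toy_vocab, k=3):
    print(f'{p:.6f}: {tok_id} "{tok}"')
```

With a real model you would feed in the logits for the last position of the prompt and use the tokenizer's vocabulary to map ids back to text.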
“As an AI language model” is pretty much a meme at this point.
A base model catching on to it is disappointing but not completely unexpected.
In fact, several sources warned 6+ months ago that ChatGPT output would creep into future datasets, and here we are with proof that it has now happened.