"Base" models were actually trained with some GPT instruct datasets

Wonderful_Ad_5134@alien.top · 3 years ago

"Base" models were actually trained with some GPT instruct datasets

metaprotium@alien.top · 3 years ago

It’s almost a shame ChatGPT blew up the way it did. “AI” became a buzzword and every company found a way to shove it into their business model. Now the future of NLP is cloudy because it’s become an ouroboros of data. I think dataset selection and cleaning will become a more important area of research. I’d be surprised if “shoving terabytes of raw webscraper data” into a model continues to be feasible in the future.

a_beautiful_rhind@alien.top · 3 years ago

GPT slop gonna GPT slop.

I hate that phrase so much too, even when they swap in something else. Some think they’re being clever and change it to “as an AI”.

FPham@alien.top · 3 years ago

Shouldn’t the proof be in the pudding?

If Mistral 7B is better than most other 7b models, then they did something right, no?

I understand that the base model can then inherit some biases, but it’s on them that they didn’t clean those “As an AI…” answer strings from their dataset. Despite this, it still performs better.
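For what it's worth, the kind of cleanup mentioned here could be as simple as a pattern filter over the training samples. A minimal sketch, assuming the dataset is just a list of strings; `REFUSAL_PATTERN` and `drop_gpt_boilerplate` are hypothetical names for illustration, not anything Mistral is known to use:

```python
import re

# Hypothetical pattern (an assumption for this sketch): catches "As an AI",
# "As an AI language model", "as an AI assistant", and similar openers.
REFUSAL_PATTERN = re.compile(r"\bas an ai\b", re.IGNORECASE)


def drop_gpt_boilerplate(samples):
    """Return only the samples that do not contain the telltale phrase."""
    return [s for s in samples if not REFUSAL_PATTERN.search(s)]


# Toy samples, made up for illustration.
samples = [
    "The capital of France is Paris.",
    "As an AI language model, I cannot have opinions.",
    "I'm sorry, but as an AI assistant I can't help with that.",
    "Gradient descent minimizes a loss function iteratively.",
]

cleaned = drop_gpt_boilerplate(samples)
print(cleaned)
```

A single phrase filter like this only removes the most obvious strings; it would not catch paraphrased GPT-style output.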

trailer_dog@alien.top · 3 years ago

So it turns out you just need to train on GPT output for better benchmarks lol. Not to mention there’s a chance the GPT models are contaminated with benchmark test data too. “Distillation” went a little too far. Easy VC money though, I would do the same.

Wonderful_Ad_5134@alien.top · 3 years ago

Llama2 was pre-trained on old data (from before the ChatGPT AI poisoning was significant):

https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md

“Data Freshness: The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023.”

“Model Dates: Llama 2 was trained between January 2023 and July 2023.”

StableLM3b was trained on more recent datasets (cutoff of March 2023), yet it doesn’t have this amount of ChatGPT poisoning in it:

https://huggingface.co/stabilityai/stablelm-base-alpha-3b-v2

https://preview.redd.it/gl46fo50n10c1.png?width=518&format=png&auto=webp&s=c7cae52b292dcba45dee735a4ca7efac5630a4ae

mcmoose1900@alien.top · 3 years ago

The problem is trusting these common benchmarks in the first place… And VCs making investing decisions based on them.

It’s insane. It’s like a years-old, published SAT test being the only factor for getting a job or an investment, and no one bothering to check whether you’re just blatantly cheating instead of cleverly cheating.

Wonderful_Ad_5134@alien.top · 3 years ago

I know, right? Getting that much investment based on something you can easily cheat on makes me sick.

phree_radical@alien.top · 3 years ago

Interestingly, Mistral Instruct:

As an AI

### top_k:

0.686088: 13892 "assistant"
0.049313: 28725 ","
0.039010:  3842 "language"
0.037810:  2229 "model"
0.031591: 28733 "-"
0.018000:  3332 "research"
0.016518:  1587 "system"
0.009266: 21631 "Assistant"
0.006967:  7583 "expert"
0.005598:  3921 "tool"
0.004394:  8073 "agent"
0.004242:   369 "that"
0.002696:   304 "and"
0.002644:   297 "in"
0.001415:  5716 "student"
0.001410:  5514 "technology"
0.001197:  7786 "coach"
0.001073:  1918 "team"
0.001073: 24480 "scientist"
0.001052:  2818 "based"
0.001036:  2007 "program"
0.000925: 12435 "bot"
0.000819:  5181 "platform"
0.000819: 28723 "."
0.000816: 21782 "developer"
0.000813:  6031 "assist"
0.000806:  3327 "personal"
0.000803:  9464 "algorithm"
0.000776:  2488 "project"
0.000746:   354 "for"
0.000743:  8626 "teacher"
0.000666:  7511 "eth"
0.000645:  6953 "writer"
0.000640: 24989 "practition"
0.000623:  3441 "voice"
0.000621:  5024 "professional"
0.000611: 22275 "analyst"
0.000588: 15589 "Language"
0.000583:  8252 "virtual"
0.000531:  7153 "digital"
0.000525:   298 "to"
0.000523: 11108 "technique"
0.000523: 10706 "chat"
0.000521: 19899 "specialist"
0.000517:  8311 "tut"
0.000501:  1338 "person"
0.000493:  6878 "experiment"
0.000474:   325 "("
0.000460: 18112 "engineer"
0.000458:  4993 "application"
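
For anyone wanting to reproduce a listing like this: it is the softmax of the model's next-token logits after the prompt, sorted and cut off at the top k. A minimal pure-Python sketch of that step, assuming you already have the logits as a list of floats indexed by token id. `top_k_probs` and the toy `vocab` and `logits` are made up for illustration; the numbers above came from the actual model and its own vocabulary.

```python
import math


def top_k_probs(logits, k):
    """Softmax over the logits, then return the k most likely (prob, token_id) pairs."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by probability, highest first.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(prob, token_id) for token_id, prob in ranked[:k]]


# Toy vocabulary and logits, purely illustrative (not real model output).
vocab = ["assistant", ",", "language", "model", "-"]
logits = [4.0, 1.4, 1.1, 1.0, 0.9]

top = top_k_probs(logits, 3)
for prob, token_id in top:
    print(f'{prob:.6f}: {token_id:5d} "{vocab[token_id]}"')
```

With a real model, the logits would be the last position of the model's output for the prompt "As an AI", and the token strings would come from its tokenizer.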

arekku255@alien.top · 3 years ago

“As an AI language model” is pretty much a meme at this point.

A base model catching on to it is disappointing but not completely unexpected.

In fact, several sources in the past highlighted the upcoming issue of ChatGPT output creeping into future datasets, and here we are with proof that what we were warned about 6+ months ago has now happened.