Thinking about what people ask for in Llama 3

vatsadev@alien.top · 2 years ago

SlowSmarts@alien.top · 2 years ago

I made a mild wishlist in another thread - Cool things for 100k LLM

If I were making an expensive LLM from scratch, these would be some of my thoughts before spending the dough:

A very large percentage of people use OSS LLMs for roleplay or coding, might as well just bake it into the base
Most coding examples and general programming data are years old and lack knowledge of many new and groundbreaking projects and technologies; updated scrapes of coding sites need to be made
Updated coding examples need to be generated
Longer coding examples are needed that can deal with multiple files in a codebase
Longer examples of summarizing code need to be generated (like book summing, but for long scripts)
Fine-tuning datasets need a lot of cleaning to remove incorrect examples, bad math, and political or sexual bias/agendas injected by wackjobs
Older math datasets seem way more error-prone than newer ones
GPT-4 is biased, and that bias will carry through into synthetic datasets. Anything generated by it will likely taint the LLM, even if only subtly, so more creative dataset cleaning is needed
Stop having datasets that contain stupid things like “As an AI…”
Excessive alignment is like sponsoring a highly prized and educated genius from birth, just to give them a lobotomy on graduation day
People regularly circumvent censorship and sensationalize “jailbreaking” it anyway, might as well leave the base model “uncensored” and advertise it as such
Cleaner datasets seem more important than maximizing the number of tokens trained
Multimodality and tool use are the future; bake some cutting-edge examples into the base
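
For the "As an AI…" point, here is a rough sketch of the kind of filter I mean. It is only an illustration: the `prompt`/`response` field names and every phrase in the pattern list other than "As an AI" are my own assumptions, and a real cleaning pass would need a much longer list plus manual spot checks.

```python
import re

# Hypothetical marker phrases; only "as an ai" comes from the wishlist above.
# Extend this list to match whatever boilerplate shows up in your own data.
BOILERPLATE_PATTERNS = [
    r"\bas an ai\b",
    r"\bas a large language model\b",
    r"\bi'?m sorry, but i can(?:no|')t\b",
]
_BOILERPLATE_RE = re.compile("|".join(BOILERPLATE_PATTERNS), re.IGNORECASE)


def is_clean(example: dict) -> bool:
    """Return True if the response contains none of the marker phrases."""
    # The "response" key is an assumed schema, not a standard one.
    return _BOILERPLATE_RE.search(example.get("response", "")) is None


def filter_dataset(examples):
    """Keep only the examples whose response passes is_clean."""
    return [ex for ex in examples if is_clean(ex)]


sample = [
    {"prompt": "Reverse a string in Python", "response": "Use slicing: s[::-1]"},
    {"prompt": "Tell me a joke", "response": "As an AI language model, I cannot..."},
]
kept = filter_dataset(sample)
print(len(kept), "of", len(sample), "examples kept")
```

The word boundaries keep it from flagging harmless text like "as an aide", but a regex pass like this only catches the obvious cases.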

Speaking of clean datasets, have you checked out the new RedPajama-Data v2? There’s your 10T+ tokens of clean data