  • I made a mild wishlist in another thread - Cool things for 100k LLM

    If I were making an expensive LLM from scratch, these would be some of my thoughts before spending the dough:

    • A very large percentage of people use OSS LLMs for roleplay or coding, so you might as well bake those capabilities into the base
    • Most coding examples and general programming data are years old and lack knowledge of many new and groundbreaking projects and technologies; updated scrapes of coding sites need to be made
    • Updated coding examples need to be generated
    • Longer coding examples that can deal with multiple files in a codebase are needed
    • Longer examples of summarizing code need to be generated (like book summarization, but for long scripts)
    • Fine-tuning datasets need a lot of cleaning: incorrect examples, bad math, and political or sexual bias/agendas injected by wackjobs
    • Older math datasets seem far more error-prone than newer ones
    • GPT-4 is biased, and that bias will carry through into synthetic datasets; anything generated with it will likely taint the LLM, however subtly, so more creative dataset cleaning is needed
    • Stop having datasets that contain stupid things like “As an AI…” (see the filtering sketch after this list)
    • Excessive alignment is like sponsoring a highly prized and educated genius from birth, just to give them a lobotomy on graduation day
    • People regularly circumvent censorship and sensationalize “jailbreaking” it anyway; might as well leave the base model “uncensored” and advertise it as such
    • Cleaner datasets seem more important than maximizing the number of tokens trained
    • Multimodality and tool-wielding are the future; bake some cutting-edge examples into the base
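
    On the cleaning point, here’s a minimal sketch of the kind of filter that strips refusal boilerplate like “As an AI…” from a fine-tuning set. The file names, phrase list, and the assumption that each record is a JSON line with a “response” field are all hypothetical, not from any particular pipeline:

    ```python
    import json

    # Hypothetical refusal markers; a real cleaning pass would use a much
    # longer list and probably fuzzy/regex matching.
    BAD_PHRASES = [
        "as an ai",
        "as a language model",
        "i cannot fulfill that request",
    ]

    def is_clean(example: dict) -> bool:
        """Keep an example only if its response contains no refusal marker."""
        text = example.get("response", "").lower()
        return not any(phrase in text for phrase in BAD_PHRASES)

    # Assumed layout: one JSON object per line with a "response" field.
    with open("finetune_raw.jsonl") as src, open("finetune_clean.jsonl", "w") as dst:
        for line in src:
            if is_clean(json.loads(line)):
                dst.write(line)
    ```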

    Speaking of clean datasets, have you checked out the new RedPajama-Data v2? There’s your 10T+ tokens of clean data
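
    If you want to poke at it, here’s a rough sketch of streaming the published sample through Hugging Face datasets. The config name (“sample”) and the “raw_content” field follow my reading of the dataset card and may have changed; older versions of `datasets` may also need `trust_remote_code=True`, since the dataset ships a loading script:

    ```python
    from datasets import load_dataset

    # Stream the published sample of RedPajama-Data-V2 rather than
    # downloading full snapshots; the full corpus takes snapshot,
    # language, and partition arguments instead.
    ds = load_dataset(
        "togethercomputer/RedPajama-Data-V2",
        name="sample",
        split="train",
        streaming=True,
    )

    # Peek at a few documents to sanity-check the text quality.
    for i, record in enumerate(ds):
        print(record["raw_content"][:200])
        if i >= 2:
            break
    ```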