Getting LLMs to generate vast amounts of high quality dialogue covering a broad range

WaterdanceAC@alien.top · 10 months ago

Getting LLMs to generate vast amounts of high quality dialogue covering a broad range

AutomataManifold@alien.top · 10 months ago

They provide the source code for generating your own dialog datasets. Interesting.

https://github.com/skywalker023/sodaverse

AutomataManifold@alien.top · 10 months ago

It’s fairly easy to get it to talk to the continuation endpoint of the text-generation-webui or llama.cpp server instead of OpenAI. The painful part was actually reformatting the prompts to use an instruction format. Just plugging it into the chat endpoint might work better.
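
For anyone who wants to try the same swap, here is a rough sketch of sending a prompt to a local llama.cpp server's `/completion` endpoint instead of OpenAI. This is not the Sodaverse code; the host, port, function names, and default sampling values are my own assumptions, so adjust them to your setup.

```python
import json
import urllib.request

# Assumption: a llama.cpp server is running locally on this host and port.
LLAMA_CPP_URL = "http://127.0.0.1:8080/completion"


def build_completion_request(prompt, max_tokens=256, temperature=0.9,
                             url=LLAMA_CPP_URL):
    """Build (but do not send) a POST request for the /completion endpoint."""
    payload = {
        "prompt": prompt,          # raw text to continue
        "n_predict": max_tokens,   # llama.cpp's name for the token limit
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt and return the generated continuation text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.loads(response.read().decode("utf-8"))
    # The llama.cpp server returns the generated text under "content".
    return body["content"]
```

With a server running, `complete("Alice: Hi Bob!\nBob:", max_tokens=64)` would return the model's continuation. Both servers can also expose an OpenAI-style chat endpoint, which sidesteps the manual instruction formatting.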

Just prefixing the prompt with some random facts about a fictional world is enough to steer the generation so that the conversations mention plenty of details about your world. That is enough to generate a few hundred thousand high-quality conversations with a 13B Llama model. They look pretty diverse, but obviously I haven’t had time to train anything on the generated data.
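
To make the prefixing trick concrete, here is a minimal sketch of it. The fact list, the header wording, and the function name are all made up for illustration; the only idea taken from the comment above is "sample a few random world facts and put them in front of the generation prompt".

```python
import random

# Hypothetical world facts; replace with your own setting's lore.
WORLD_FACTS = [
    "Elves in the northern forests trade only at dusk.",
    "Dragons are bound by treaty to stay above the cloud line.",
    "The capital city floats on a lake of quicksilver.",
    "Iron coins are considered bad luck by river folk.",
    "Every village keeps a bell that rings when magic is used.",
]


def with_world_prefix(prompt, facts, k=3, rng=None):
    """Prepend k randomly chosen world facts to a generation prompt."""
    rng = rng or random.Random()
    # Sample without replacement so the same fact is not repeated.
    chosen = rng.sample(facts, min(k, len(facts)))
    header = "\n".join(f"- {fact}" for fact in chosen)
    return f"Facts about the world:\n{header}\n\n{prompt}"


if __name__ == "__main__":
    print(with_world_prefix("Write a short conversation between two friends.",
                            WORLD_FACTS))
```

Drawing a different subset of facts for each conversation is what should spread the world details across the dataset rather than repeating the same few in every dialogue.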

That’s probably enough for most applications. The next level is probably generating a world-specific symbolic knowledge distillation so that it includes elves and dragons in the source. That looks like it requires more accuracy, but they got good enough results with GPT-3, so it’s probably feasible. A lot of applications will probably be fine with just generating custom Sodaverse data.
