Hello LocalLLama.
Do you have tips on how to make the best use of models that have not been fine-tuned for chat or instruct?
Here’s my issue: I use LLMs for storywriting and making character profiles (I’ve been doing that a lot for D&D character sheets for example).
I feel that most models have a strong bias toward positive stories, happy endings, really cliched phrases, or something similar. The stories have perfect grammar but they are boring and cliched as heck. Using instructions to tell the model not to do that doesn’t work that well. I checked out r/chatgpt for tips on making good stories with ChatGPT, and it seems there are no great solutions there either. Maybe this leaks into local models because a bunch of them use GPT-4-derived training data, so now local models want overly positive outputs as well.
So I thought “Alright. I’ll try using a base model. Instead of giving it instructions, I’ll make it think it’s completing a book or something”.
But that also doesn’t work that well. Llama-2-70B, for example, easily gets into repetitive patterns, and I feel it’s even worse than using a positive-biased chat or instruct-tuned model.
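To make the “completing a book” idea concrete, here is a minimal sketch of that kind of framing. The helper name `build_book_prompt`, the field layout, and the sample title and author are all made up for illustration; the only assumption is that a base model continues whatever document it appears to be in the middle of.

```python
def build_book_prompt(title, author, blurb, chapter_title, opening):
    """Frame a story request as the first page of an existing book.

    Hypothetical helper: nothing here is a real library API. The prompt
    contains no instructions, only text that looks like a document.
    """
    lines = [
        title.upper(),
        f"by {author}",
        "",
        blurb,  # back-cover style text that sets tone (e.g. bleak, tragic)
        "",
        f"Chapter 1: {chapter_title}",
        "",
        opening,  # the first sentence or two, written by hand
    ]
    # No trailing newline, so the model continues the same paragraph.
    return "\n".join(lines)


prompt = build_book_prompt(
    title="The Salt Road",            # made-up title
    author="A. N. Example",           # made-up author
    blurb="A grim tale of a failed pilgrimage. Nobody comes home.",
    chapter_title="The Ferryman's Price",
    opening="The river had taken two of them before noon, and",
)
print(prompt)
```

The tone is steered by the blurb and the hand-written opening rather than by telling the model what not to do.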
I’m looking for answers or insights on the following thoughts in my head:
- Are there any base models worth using? I’ve tried Yi base models for example; seems about the same as Llama2-70B base (just faster). I’m more than willing to spend time prompt engineering in exchange for more interesting outputs.
- Do you know resources/tricks/tips/insights about how to make best use of base models? Resources on how to prompt them? Sampler settings?
- Why do base models seem to suck so bad, even if I’m prompting them assuming it’s just completing text and they have no concept of following instructions? Mostly I see them fall into repeating the same sentence or structure over and over again. Fine-tuned models don’t do this even if I otherwise don’t like their outputs.
- Out of curiosity, are you aware of any models that have been fine-tuned that are not tuned for chat or instruct? Kinda wondering if anyone has found any interesting use cases.
I think a base model is preferable in many cases for developers, particularly if instruction-following abilities don’t cut it, or you worry about instruction injection, or you just want to make sure the text you get isn’t bent into the curves of the “helpful” fine-tuning distribution.
It’s easy to recommend a base model for targeted generations that leverage its pattern-following ability. You get what you want after a number of examples, almost like fine-tuning examples. I went through my history for examples of few-shot completion: classification, rewrite sentence copying style, classify, basic Q&A example, fact check yes/no, rewrite copying style and sentiment, extract list of musicians, classify user intent, tool choice, rewrite copying style again, flag/filter objectionable content, detect subject changes, classify profession, extract customer feedback into json, write using specified words, few-shot cheese information, answer questions from context, classify sentiment w/ probabilities, summarize, replace X in conversation.
Most of that is aimed at developers, though, and many of those use cases require a temperature of 0.
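For anyone who hasn’t done few-shot completion against a base model, here is a rough sketch of the pattern for the sentiment classification case. The function names, the `Review:`/`Sentiment:` labels, and the sample reviews are my own placeholders, not anything standard; the prompt would be sent to whatever completion endpoint you use, at temperature 0.

```python
def build_few_shot_prompt(examples, query,
                          input_label="Review", output_label="Sentiment"):
    """Build a completion prompt from (text, label) pairs plus one open slot.

    Hypothetical helper: the labels are arbitrary, they just need to be
    consistent so the model picks up the pattern.
    """
    blocks = [f"{input_label}: {text}\n{output_label}: {label}"
              for text, label in examples]
    # The final block leaves the answer empty for the model to fill in.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


def first_line(completion):
    """Keep only the first line of the model's output.

    A base model does not know when to stop and will usually go on to
    invent further examples, so everything after the answer is discarded.
    """
    stripped = completion.strip()
    return stripped.splitlines()[0].strip() if stripped else ""


examples = [
    ("The food was cold and the staff ignored us.", "negative"),
    ("Best concert I have been to in years.", "positive"),
    ("It arrived on time. It is a box.", "neutral"),
]
prompt = build_few_shot_prompt(examples, "I want my money back.")
print(prompt)

# Simulated model output, to show the trimming step:
print(first_line(" negative\n\nReview: Lovely hotel.\nSentiment: positive"))
```

Passing a stop sequence such as a blank line to the generator, if it supports one, does the same job as `first_line` and saves tokens.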
For long-form writing, on the other hand, you’ve found some hindrances. First, results will benefit a great deal from longer context. Second, you’ll probably get some looping patterns, which you can avoid by increasing repetition penalties in your generator.
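If it helps to see what a repetition penalty does to the numbers, here is a small sketch of the common multiplicative form: logits of tokens that already appeared are divided by the penalty when positive and multiplied by it when negative. This is an illustration in plain Python with toy values, and I’m assuming your generator uses this variant; some apply frequency or presence penalties instead, so check its docs.

```python
import math


def apply_repetition_penalty(logits, previous_token_ids, penalty=1.2):
    """Return a copy of `logits` with already-seen tokens made less likely.

    Illustrative only: a penalty of 1.0 changes nothing, larger values
    push harder against repeats.
    """
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive logits
        else:
            adjusted[token_id] *= penalty   # push negative logits further down
    return adjusted


def softmax(logits):
    """Convert logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of four tokens; tokens 0 and 2 were already generated.
logits = [2.0, 1.0, -1.0, 0.5]
penalized = apply_repetition_penalty(logits, previous_token_ids=[0, 2, 0],
                                     penalty=2.0)
print(penalized)  # [1.0, 1.0, -2.0, 0.5]
print(softmax(logits)[0], softmax(penalized)[0])  # token 0 becomes less likely
```

Set too high, the penalty also hits words a story legitimately needs to repeat (names, “the”, “said”), so it is worth raising in small steps.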
Finetunes for storywriting do seem like a good idea, I found at least this one
https://www.reddit.com/r/LocalLLaMA/comments/17yxoxv/local_llm_for_hot_dog_or_not_hot_dog_kind_of_fact/
Would you say your advice in this post is applicable to my post? I think I’m in this same camp. I don’t want to go through the hundreds of fine-tuned models. I just want to talk to the model with the kinds of things you’ve mentioned.
Then why do people fine-tune for instruction? Perhaps what I’m really asking is how you fine-tune a model for instruction. Is there a document or a list of steps?
That’s a good point about few-shot prompting: the big thing about GPT-3 and instruction training was that it allowed for zero-shot prompting (i.e., prompting with zero examples). But if we’re manually prompting a base model, there’s no reason not to provide those examples, and you get dramatically improved performance versus the same model with no examples.