I’ve been using self-hosted LLMs for roleplay. These are the worst problems I run into every time, no matter which model or parameter preset I use.
I’m using:
- Pygmalion 13B AWQ
- Mistral 7B AWQ
- SynthIA 13B AWQ [Favourite]
- WizardLM 7B AWQ
The problems:
- It mixes up who’s who, and often starts behaving like the user.
- It writes in third-person / narrative perspective.
- It sometimes generates the exact same reply, word for word, back to back, even though I gave it new input.
- It drifts into dialogue or screenplay-script formatting instead of a normal conversation.
Does anyone have solutions for these?
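For concreteness, this is roughly the shape of my setup, as a simplified sketch assuming a Hugging Face transformers + AutoAWQ stack; the repo id, prompt format, and sampler values below are placeholders, not my exact settings. The repetition knobs and the cut at the user tag (what frontends call stopping strings) are the things I’ve been fiddling with for the repeat and impersonation problems:

```python
# Hedged sketch: placeholder repo id, prompt format, and values, assuming an
# AWQ checkpoint loadable through transformers + autoawq.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/SynthIA-13B-AWQ"  # placeholder repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "### Character:\nYou are Aria, a travelling bard.\n### User:\nHello!\n### Character:\n"
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.15,   # pushes against verbatim repeats
    no_repeat_ngram_size=4,    # hard-blocks exact 4-gram loops
    pad_token_id=tok.eos_token_id,
)
reply = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Truncate wherever the model starts writing the user's next turn,
# i.e. the "behaving like the user" failure mode.
reply = reply.split("### User:")[0].strip()
print(reply)
```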
Since I started using 70B, I have never encountered these problems again. It is that much better.
I have an RTX 4090, 96GB of RAM, and an i9-13900K CPU, and I still keep going back to 20B (4-6bpw) models because of the awful performance of 70B models, even though a 2.4bpw quant is supposed to fit fully in VRAM… even using ExLlama2…
What is your trick to get better performance? Unless I use a cramped 2048 context, generation speed is actually unusable (under 1 token/sec). What context size and settings are you using? Thank you.
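For context, the only fallback I know of is partial GPU offload of a GGUF quant through llama-cpp-python instead of ExLlama2, trading some speed for room to raise the context. A hedged sketch; the model path, layer count, and context size are placeholders to tune against 24GB of VRAM:

```python
# Hedged sketch of partial GPU offload with llama-cpp-python (not ExLlama2).
# The GGUF path, layer count, and context size are placeholders to tune.
from llama_cpp import Llama

llm = Llama(
    model_path="models/70b.Q4_K_M.gguf",  # placeholder GGUF quant
    n_gpu_layers=48,   # offload as many of a 70B's ~80 layers as 24GB allows
    n_ctx=4096,        # KV cache grows with context; raise this gradually
    n_threads=16,      # tune to your CPU; the rest of the layers run in RAM
)

out = llm("### User:\nHello!\n### Character:\n", max_tokens=128)
print(out["choices"][0]["text"])
```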