I’ve been using self-hosted LLM models for roleplay purposes. But these are the worst problems I face every time, no matter what model and parameter preset I use.

I’m using :

Pygmalion 13B AWQ

Mistral 7B AWQ

SynthIA 13B AWQ [Favourite]

WizardLM 7B AWQ

  1. It messes up with who’s who. Often starts to behave like the user.

  2. It writes in third person perspective or Narrative.

  3. Sometimes, generates the exact same reply (exactly same to same text) back to back even though new inputs were given.

  4. It starts to generate more of a dialogue or screenplay script instead of creating a normal conversation.

Anyone has any solutions for these?

  • cleverestx@alien.topB
    link
    fedilink
    English
    arrow-up
    1
    ·
    10 months ago

    I have a RTX 4090, 96GB of RAM and a i9-13900k CPU, and I still keep going back to 20b (4-6bpw) models due to the awful performance of 70b models, which 2.4bpw is supposed to fully fit the VRAM in… even using Exllama2…

    What is your trick to get better performance? If I don’t use a small lame context of 2048, the speed of generating is actually un-usable (under 1 token/sec), what context are you using and what settings? Thank you.