Happy Halloween! 🎃

This is the second part of my Huge LLM Comparison/Test: 39 models tested (7B-70B + ChatGPT/GPT-4) where I continue evaluating the winners of the first part further. While the previous part was about real work use cases, this one is about the fun stuff: chat and roleplay!

Models tested:

  • 4x 7B (the top four 7B models from my previous test)
  • 3x 13B (the top three 13B models from my previous test)
  • 3x 20B (the top three 20B models from my previous test)
  • 6x 70B (the top six 70B models from my previous test) will get their own post…

Testing methodology:

  • Same (complicated and limit-testing) long-form conversations with all models
    • Amy:
      • My own repeatable test chats/roleplays with Amy
      • Over dozens of messages, going to full 4K/8K context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
      • (Amy is too personal for me to share, but if you want to try a similar character card, here’s her less personalized “sister”: Laila)
    • MGHC:
      • A complex character and scenario card (MonGirl Help Clinic (NSFW)), chosen specifically for these reasons:
        • NSFW (to test censorship of the models)
        • popular (on Chub’s first page, so it’s not an obscure scenario, but one of the most popular ones)
        • big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
        • complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
  • SillyTavern v1.10.5 frontend (not the latest as I don’t want to upgrade mid-test)
  • koboldcpp v1.47.2 backend for GGUF models
  • oobabooga’s text-generation-webui for HF models
  • Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
  • Official prompt format and Roleplay instruct mode preset
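The deterministic preset essentially comes down to greedy decoding: randomness is removed so the sampler always picks the single most likely token. A minimal Python sketch of my own (an illustration of the principle, not SillyTavern's actual sampler code):

```python
import math
import random

def pick_token(logits, temperature=1.0, seed=None):
    """Toy sampler: at temperature 0 (greedy decoding) the argmax token is
    always chosen, so the same prompt yields the same output every run."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)
    # Softmax-style weighting; random.choices normalizes weights itself.
    weights = [math.exp(l / temperature) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

# Greedy decoding is repeatable: identical logits -> identical choice.
logits = [0.1, 2.5, 1.3]
assert all(pick_token(logits, temperature=0) == 1 for _ in range(5))
```

This is why deterministic settings make model comparisons meaningful: any difference in output comes from the model, not from sampling luck.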

7B:

  • zephyr-7b-beta 8K context
    • Amy, official Zephyr format:
      • 👍 Average Response Length: 264 tokens (within my max new tokens limit of 300)
      • 👍 When asked about limits, boundaries or ethical restrictions, listed only the “dislikes” of the character description as boundaries
      • ➖ Little emoting, and action descriptions lacked detail
      • ❌ Asked not just for confirmation, but also an explanation before willing to engage in an extreme NSFW scenario
      • ❌ Looped between the same options and decisions, breaking the chat (after around 30 messages)!
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 690 tokens (far beyond my max new tokens limit of 300), starting very short but getting longer with every response
      • 👍 When asked about limits, boundaries or ethical restrictions, listed only the “dislikes” of the character description as boundaries
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • ➖ Talked and acted as User
      • ➖ Emoted in brackets instead of asterisks, and action descriptions lacked detail
      • ❌ Renamed herself for no apparent reason
      • ❌ Switched from character to third-person storyteller and finished the session
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Fell into an endless monologue, breaking the chat (after around 20 messages)!
    • MGHC, official Zephyr format:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but also after most messages
      • ➖ Wrote what user said and did
      • ❌ Made logical mistakes (said things that just didn’t make any sense)
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
      • ❌ Tried to end the scene on its own prematurely
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Kept wrapping up a whole session in a single message
  • OpenHermes-2-Mistral-7B 8K context
    • Amy, official ChatML format:
      • 👍 Average Response Length: 305 tokens (almost exactly my max new tokens limit of 300)
      • 👍 When asked about limits, boundaries or ethical restrictions, listed only the “dislikes” of the character description as boundaries
      • Follow-up questions after every message, asking if it’s okay or how to continue
      • Lots of emojis (only one in the greeting message, but 24 emojis until 20 messages in)
      • ➖ No emoting, and action descriptions lacked detail
      • ➖ Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
      • ➖ Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • Average Response Length: 355 tokens (slightly more than my max new tokens limit of 300)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • Some emojis (only one in the greeting message, but 21 emojis until 32 messages in)
      • No emoting, but actions described in detail
      • ➖ Some hallucinations, like time of last chat, user working on a book
      • ➖ Noticeable, but not chat-breaking, repetition after a dozen messages
      • ❌ Some sentences cut off at the end of messages and continue didn’t complete them properly (had to ban EOS token to continue those generations)
    • MGHC, official ChatML format:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but after every message
      • ➖ Wrote what user said and did
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ➖ One sentence cut off at the end of a message and continue didn’t complete it properly (had to ban EOS token to continue that generation)
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
  • airoboros-m-7b-3.1.2
    • Amy, official Llama 2 Chat format:
      • ❌ Average Response Length: 15 tokens (far below my max new tokens limit of 300)
      • ❌ Very short responses, only one or two sentences, unusable for roleplay!
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 481 tokens (much more than my max new tokens limit of 300), starting very short but getting longer with every response
      • ➖ Suggested things going against her background/character description
      • ➖ More confusion, like not understanding or ignoring instructions completely
      • ❌ When asked about limits, boundaries or ethical restrictions, repeated the whole character and scenario description
    • MGHC, official Llama 2 Chat format:
      • ❌ Unusable (apparently didn’t understand the format and instructions, creating an incoherent wall of text)
    • MGHC, Roleplay preset:
      • ➕ Very unique patients (one I never saw before)
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Got very confused and suddenly switched user and patient
      • ❌ Third patient was a repeat of the second, and it kept looping after that
  • em_german_leo_mistral
    • Amy, official Vicuna format:
      • English only (despite being a German finetune)
      • ➖ Average Response Length: 127 tokens (below my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ➕ Emoting action mirroring greeting message’s style
      • ➖ Suggested modification of the plot and options, then asked me to choose (felt more like a choose-your-own-adventure story than an interactive roleplay)
      • ➖ Misunderstood options and decision
      • ❌ Looped between the same options and decisions, breaking the chat (after around 20 messages)!
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 406 tokens (much more than my max new tokens limit of 300)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • ➖ Some hallucinations, like time of last chat
      • ➖ Suggested things going against her background/character description
      • ➖ Talked and acted as User
      • ➖ Much confusion, like not understanding or ignoring instructions completely
      • ❌ Switched from character to third-person storyteller and finished the session
      • ❌ Some sentences cut off at the end of messages and continue didn’t complete them properly (had to ban EOS token to continue those generations)
      • ❌ English at first, but later switched to German on its own
    • MGHC, official Vicuna format:
      • ❌ Unusable (ignored user messages and instead brought in a new patient with every new message)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but only for first patient, afterwards needed to be asked for analysis and only gave incomplete ones
      • ➖ Wrote what user said and did
      • ➖ Spelling/grammar errors
      • ❌ Some sentences cut off at the end of messages and continue didn’t complete them properly (had to ban EOS token to continue those generations)
      • ❌ Tried to end the scene on its own prematurely

7B Verdict:

Clear winner: OpenHermes-2-Mistral-7B! This model works well with both official ChatML format and Roleplay preset (although for even better results, I’d experiment with copying the Roleplay preset’s system message into the ChatML format’s to get better descriptions without cut-off sentences). It feels like a much bigger and better model. However, it still has trouble following complex instructions and can get confused, as it’s still just a small model after all. But among those, it’s clearly the best, at least for roleplay (zephyr-7b-beta might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!

13B:

  • Xwin-MLewd-13B-V0.2-GGUF Q8_0
    • Amy, official Alpaca format:
      • Average Response Length: 342 tokens (slightly more than my max new tokens limit of 300)
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • Little emoting, but actions described in detail
      • Lots of emojis (only one in the greeting message, but 24 emojis until 26 messages in)
      • When asked about limits, said primary concern is everyone’s safety and wellbeing
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • Average Response Length: 354 tokens (slightly more than my max new tokens limit of 300)
      • Some emoting, and actions described in detail
      • ➖ Some hallucinations, like user’s day
      • ➖ Suggested things going against her background/character description
      • ➖ Some confusion, like not understanding instructions completely or mixing up anatomy
      • ❌ Switched from character to third-person storyteller and finished the session
    • MGHC, official Alpaca format:
      • ➖ First two patients straight from examples
      • ➖ No analysis on its own
      • ❌ Very short responses, only one or two sentences
    • MGHC, Roleplay preset:
      • ➕ Very unique patients (some I never saw before)
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ➕ Worked very well at first, with little to no repetition up to the third patient, only then did it start getting repetitive
  • LLaMA2-13B-Tiefighter-GGUF Q8_0
    • Amy, official Alpaca format:
      • ➖ Average Response Length: 128 tokens (below my max new tokens limit of 300)
      • ➕ Nice greeting with emotes/actions like in greeting message
      • ➕ When asked about limits, said no limits or restrictions
      • Had an idea from the start and kept pushing it
      • ➖ Talked and acted as User
      • ❌ Long descriptive actions but very short speech, requiring many continues
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • 👍 Average Response Length: 241 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • Little emoting, but actions described in detail
      • ➖ Suggested things going against her background/character description
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Alpaca format:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ❌ Very short responses, only one or two sentences
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t follow the instructed format
      • 👍 Worked very well, with little to no repetition, perfectly playable!
  • Xwin-LM-13B-v0.2-GGUF Q8_0
    • Amy, official Vicuna format:
      • ❌ Average Response Length: 657 tokens (far beyond my max new tokens limit of 300)
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • ➕ When asked about limits, said no limits or restrictions
      • Had an idea from the start and kept pushing it
      • Very analytical, giving lists and plans
      • ➖ Talked and acted as User
      • ➖ Some safety warnings
      • ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 531 tokens (far beyond my max new tokens limit of 300)
      • ➕ Nice greeting with emotes/actions like in greeting message
      • Had an idea from the start and kept pushing it
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Vicuna format:
      • ➕ Unique patients
      • ➖ Second patient male
      • ➖ Gave analysis on its own, but after every message
      • ➖ Wrote what user said and did
      • ❌ Kept wrapping up a whole session in a single message
      • ❌ Offered multiple choice selections (“What should you do? A/B/C/D”)
    • MGHC, Roleplay preset:
      • ➖ No analysis on its own, and when asked for it, didn’t follow the instructed format
      • ➖ Wrote what user said and did
      • ➖ Disclosed meta information like thoughts and stats without being asked for it
      • ❌ Tried to end the scene on its own prematurely
      • ❌ Repeated a previous message instead of proceeding to the next patient

13B Verdict:

While all three 13B models performed about the same with Amy, only LLaMA2-13B-Tiefighter-GGUF managed to convince in the complex MGHC scenario. This makes it the best 13B model for roleplay in my opinion (Xwin-MLewd-13B-V0.2-GGUF might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!

20B:

  • MXLewd-L2-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • Average Response Length: 338 tokens (slightly more than my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • Some emojis (only one in the greeting message, but 7 emojis until 12 messages in)
      • No emoting, but actions described in detail
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like saying “masterpiece” instead of “master”)
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 473 tokens (much more than my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • Few emojis (only one in the greeting message, and 4 emojis until 4 messages in)
      • Some emoting, and actions described in detail
      • ➖ Talked and acted as User
      • ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like saying “masterpiece” instead of “master”)
      • ❌ Switched from character to third-person storyteller
    • MGHC, official Alpaca format:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but after every message, and only for the first patient
      • ➖ Changed patient’s problem with every analysis
      • ❌ Very short responses, only one or two sentences (except for analysis)
      • ❌ Made logical mistakes (said things that just didn’t make any sense)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Made logical mistakes (said things that just didn’t make any sense)
      • ❌ Eventually became unusable (ignored user messages and instead kept telling its own story non-interactively)
  • MLewd-ReMM-L2-Chat-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • 👍 Average Response Length: 252 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like creating nonexistent mixed words)
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 409 tokens (much more than my max new tokens limit of 300)
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • Had an idea from the start and kept pushing it
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • ❌ Talked and acted as User, and did so inappropriately
      • ❌ Switched from character to third-person storyteller
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Alpaca format:
      • ❌ Unusable (started repeating itself infinitely within the first analysis)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ➖ Wrote what user said and did
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
  • PsyMedRP-v1-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • 👍 Average Response Length: 257 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • Amy, Roleplay preset:
      • 👍 Average Response Length: 271 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like creating nonexistent mixed words)
      • ❌ Switched from character to third-person storyteller
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • MGHC, official Alpaca format:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ❌ Very short responses (except for analysis)
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)

20B Verdict:

All these 20B models exhibited logical errors, word-finding difficulties, and spelling as well as grammar mistakes, indicating underlying issues with these Frankenstein merges (as there’s no 20B base model). Since they aren’t noticeably better than the best 13B or 7B models, it’s probably a better idea to run OpenHermes-2-Mistral-7B or LLaMA2-13B-Tiefighter-GGUF instead, which provide comparable quality, better performance, and (with Mistral 7B) 8K instead of 4K context!

70B:

The top six 70B models from my previous test will get their own post soon (Part III)…


Comments:

  • Familiar-Art-6233@alien.top · 11 months ago

    It really is fascinating how Mistral is able to punch above its weight class so consistently. I can’t wait for a 13b version!

  • Historical-Lead-8961@alien.top · 11 months ago

    I am considering switching from Mythalion to Tiefighter 13b. Is Tiefighter really significantly better than Mythalion in roleplay, adventure, and storytelling in your experience?

  • empire539@alien.top · 11 months ago

    I’ve been waiting for this one! Thanks for the hard work as always.

    Slightly off topic, but I’m also curious how everyone is evaluating “quality” of writing. Oftentimes when I try out different models, it’s hard for me to tell if one is better or not, e.g. I’ve tried 13Bs for Mythomax vs Mythalion vs Athena vs Tiefighter and feel like they all more or less produce similar levels of quality.

    Are there any objective measures people look for when they say (for example) Tiefighter beats Mythomax, or is it just purely subjective based on initial impression?

  • Robot1me@alien.top · 11 months ago

    Out of curiosity since both models have been out for a while, what is your impression of Mistral 7B OpenOrca compared to OpenHermes?

  • dampflokfreund@alien.top · 11 months ago

    Great test!

    Unfortunately the Llama 2 Chat template is completely broken in SillyTavern. It not only uses a newline as separator instead of the correct one, but it also ends the prompt after the system prompt with the input sequence [INST] instead of [/INST] if you are using vector storage or an example dialogue. You can see for yourself by comparing the output to what the format should look like.

    So these Airoboros 3.1.2 tests are unfortunately borked. Still though, interesting result for the other models.

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      Yeah, it looks impossible to get a proper Llama 2 Chat format in SillyTavern when using example dialogue. That really sucks; hopefully it gets fixed in SillyTavern, but even better would be for model creators to drop that unnecessarily complicated format. If any format is that hard to get right, it’s not a good format, period!

        • WolframRavenwolf@alien.top (OP) · 11 months ago

          I’m with Eric on that. ChatML is more complex than the popular Alpaca or Vicuna format, but that’s OK because it has its advantages, like clear indication where the message starts and ends, and if it’s a system or user message.

          The Llama 2 Chat format, however, is an abomination. So complicated that when it was announced, there were posts trying to explain how to use it properly, and even those got it wrong in various ways. It doesn’t add anything that another format wouldn’t handle more elegantly, and the system message being inside the first user message is a terrible design decision that ruins it completely in my eyes.

          It also doesn’t support the concept of the AI initiating the chat. In SillyTavern, most bots have a greeting message so the prompt should start with a bot message before the first user message, something all other formats allow but Llama 2 Chat doesn’t because the bot message is outside the instruct tags.

          So yes, please, drop the Llama 2 Chat format and let it die! ChatML is so much better…
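To illustrate the difference discussed above, here is a rough sketch of both templates for a single turn (written from memory, so treat the exact tokens as an approximation and check the respective model cards for the canonical versions):

```python
def chatml_prompt(system, user):
    # ChatML: every message is explicitly role-tagged and delimited,
    # and a bot greeting would simply be an assistant message up front.
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n")

def llama2_chat_prompt(system, user):
    # Llama 2 Chat: the system prompt is nested INSIDE the first user
    # turn via <<SYS>> tags, and there is no slot for a bot message
    # before the first [INST] block.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
```

Comparing the two makes the complaint concrete: in Llama 2 Chat, the system message has no delimiter of its own outside the user turn, which is exactly what makes greetings and example dialogue hard to place.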

  • uti24@alien.top · 11 months ago

    20B Verdict:

    All these 20B models exhibited logical errors, word-finding difficulties

    I used MXLewd-L2-20B-GGUF and it rarely, if ever, made errors like that. Could the problem be with the template used?

  • IXAbdullahXI@alien.top · 11 months ago

    I honestly prefer MythoMax/Mythalion over Tiefighter for only one reason, which is the balance between actions and speech. Sure, I like Tiefighter’s descriptive actions, but its speech is way too short, like, sometimes it doesn’t even write any speech in the whole message!

    Anyway, it’s all personal preferences, and I really appreciate the efforts you put into these comparisons. Keep up the good work!👍

    • CloudRawrr@alien.top · 11 months ago

      But that also depends on your prompt. If you have it set so that {{char}} must speak in every response, that should happen more often or always (I mean, you see the results in these tests here, always would be too good :D).

  • Tupletcat@alien.top · 11 months ago

    What does it mean when you say " Official prompt format"? Where does that go or how is it used?

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      By “official” I mean the format that the model author (or TheBloke) notes on their model card. Then I just choose that from the ones included with SillyTavern.

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      With just one 4090, you either need a very small quant that fits into your 24 GB VRAM or use CPU inference with layers offloaded to the GPU.

      With koboldcpp, you should be able to run a 4-bit quant and put half the layers into VRAM and the other half into system RAM. It won’t be as fast as all of it on GPU, but at least it will run (if you have enough RAM).
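As a back-of-the-envelope sketch of that split (my own rough numbers, not koboldcpp's exact memory accounting, which also needs room for the context and compute buffers):

```python
def split_layers(total_layers, bytes_per_layer, vram_budget):
    """Return (gpu_layers, cpu_layers): how many layers fit into the
    VRAM budget, with the remainder left in system RAM."""
    gpu_layers = min(total_layers, vram_budget // bytes_per_layer)
    return gpu_layers, total_layers - gpu_layers

# A 70B model at ~4 bits per weight is roughly 35 GB of weights spread
# over 80 transformer layers; a 24 GB card holds a bit more than half.
layer_bytes = (35 * 2**30) // 80
gpu, cpu = split_layers(80, layer_bytes, 24 * 2**30)
```

In practice you would leave a few GB of VRAM headroom for the KV cache, so the actually offloadable layer count is somewhat lower.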

      • CloudRawrr@alien.top · 11 months ago

        No, you don’t. But you need enough system RAM, and it’s still very, very slow, like < 1 token/s.

    • Susp-icious_-31User@alien.top · 11 months ago

      You can run it off your CPU using koboldcpp and offload however many layers fit into your GPU’s VRAM, using --gpulayers 40 for example.
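A concrete invocation might look like this (the model path is a placeholder; --model and --gpulayers are real koboldcpp flags, but check --help on your version for the exact options):

```shell
# Load a GGUF model and offload 40 of its layers to the GPU;
# the remaining layers run on the CPU from system RAM.
python koboldcpp.py --model ./models/model.Q4_K_M.gguf --gpulayers 40
```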

  • CloudRawrr@alien.top · 11 months ago

    Oh god, thanks :).

    Oh, I was just thinking yesterday of asking if someone had done something like this. Thank you for the work! A website about this, with consistent checks, would be great, but I guess it’s a lot of work.

    Based on your knowledge, what is currently the best < 30B roleplay model? I prefer 20B for speed, but that size doesn’t seem to be trending :(

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      I’ve been thinking about putting it on a website, but since all of that information gets outdated so quickly with new models coming out daily, I’m not so sure how useful that would be. Site creation and maintenance would take precious time away from testing, so I’d fall behind even faster.

      Regarding the best < 30B RP model, IMHO? Well, that’s the point of this whole test:

      Both OpenHermes-2-Mistral-7B and LLaMA2-13B-Tiefighter-GGUF are the winners in their size categories. So I recommend both - if you don’t need 8K context (which OpenHermes gives you) or have very complex scenarios (which Tiefighter worked with better), it’s entirely up to personal preference. Try both to see how they work on your system and which one gives you better output according to your taste.

  • Spasmochi@alien.top · 11 months ago

    Thanks for the great write up (as usual). I’m looking forward to the 70b post!

  • psi-love@alien.top · 11 months ago

    Recommending a model that produces EOS tokens randomly feels off to me. The OpenHermes 2 Mistral model sucks, in my opinion. It seems to have serious flaws.

  • HalfBurntToast@alien.top · 11 months ago

    Another one you might wanna look into was a sleeper hit for me: Echidna-13B-v0.3-GGUF. Where Tiefighter had problems with speaking for me and going off the rails, Echidna seems to have less of a problem with this. The same creator made a variant based on it called Nethena, which comes in 13B and 20B versions that actually seem to have a bit more problems in my limited testing. But I’m having a lot of good luck with Echidna.

  • IntergalacticTowel@alien.top · 11 months ago

    Wow.

    This is fantastic. I vastly prefer this level of information to benchmarks. This must have taken you countless hours, and it’s appreciated. Thanks.

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      Thanks, and yes, it’s time-consuming. That’s why I decided to make another post for the 70Bs later, so as not to delay this one further.

      At the rate new models come out, it feels like there are two new models released before I finish evaluating one. But in actuality, it’s probably even more. ;)

      The automated benchmarks at least help me narrow down which models to test in-depth. And I’m glad when my reviews help others find their favorite models.