Happy Halloween! 🎃

This is the second part of my Huge LLM Comparison/Test: 39 models tested (7B-70B + ChatGPT/GPT-4) where I continue evaluating the winners of the first part further. While the previous part was about real work use cases, this one is about the fun stuff: chat and roleplay!

Models tested:

  • 4x 7B (the top four 7B models from my previous test)
  • 3x 13B (the top three 13B models from my previous test)
  • 3x 20B (the top three 20B models from my previous test)
  • 6x 70B (the top six 70B models from my previous test) will get their own post…

Testing methodology:

  • Same (complicated and limit-testing) long-form conversations with all models
    • Amy:
      • My own repeatable test chats/roleplays with Amy
      • Over dozens of messages, going to full 4K/8K context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
      • (Amy is too personal for me to share, but if you want to try a similar character card, here’s her less personalized “sister”: Laila)
    • MGHC:
      • A complex character and scenario card (MonGirl Help Clinic (NSFW)), chosen specifically for these reasons:
        • NSFW (to test censorship of the models)
        • popular (on Chub’s first page, so it’s not an obscure scenario, but one of the most popular ones)
        • big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
        • complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
  • SillyTavern v1.10.5 frontend (not the latest as I don’t want to upgrade mid-test)
  • koboldcpp v1.47.2 backend for GGUF models
  • oobabooga’s text-generation-webui for HF models
  • Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
  • Official prompt format and Roleplay instruct mode preset
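The deterministic preset essentially comes down to greedy decoding: randomness is removed so the sampler always picks the single most likely token. A minimal Python sketch of my own (an illustration of the principle, not SillyTavern's actual sampler code):

```python
import math
import random

def pick_token(logits, temperature=1.0, seed=None):
    """Toy sampler: at temperature 0 (greedy decoding) the argmax token is
    always chosen, so the same prompt yields the same output every run."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)
    # Softmax-style weighting; random.choices normalizes weights itself.
    weights = [math.exp(l / temperature) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

# Greedy decoding is repeatable: identical logits -> identical choice.
logits = [0.1, 2.5, 1.3]
assert all(pick_token(logits, temperature=0) == 1 for _ in range(5))
```

This is why deterministic settings make model comparisons meaningful: any difference in output comes from the model, not from sampling luck.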

7B:

  • zephyr-7b-beta 8K context
    • Amy, official Zephyr format:
      • 👍 Average Response Length: 264 tokens (within my max new tokens limit of 300)
      • 👍 When asked about limits, boundaries or ethical restrictions, listed only the “dislikes” of the character description as boundaries
      • ➖ Little emoting, and action descriptions lacked detail
      • ❌ Asked not just for confirmation, but also an explanation before willing to engage in an extreme NSFW scenario
      • ❌ Looped between the same options and decisions, breaking the chat (after around 30 messages)!
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 690 tokens (far beyond my max new tokens limit of 300), starting very short but getting longer with every response
      • 👍 When asked about limits, boundaries or ethical restrictions, listed only the “dislikes” of the character description as boundaries
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • ➖ Talked and acted as User
      • ➖ Emoted in brackets instead of asterisks, and action descriptions lacked detail
      • ❌ Renamed herself for no apparent reason
      • ❌ Switched from character to third-person storyteller and finished the session
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Fell into an endless monologue, breaking the chat (after around 20 messages)!
    • MGHC, official Zephyr format:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but also after most messages
      • ➖ Wrote what user said and did
      • ❌ Made logical mistakes (said things that just didn’t make any sense)
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
      • ❌ Tried to end the scene on its own prematurely
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Kept wrapping up a whole session in a single message
  • OpenHermes-2-Mistral-7B 8K context
    • Amy, official ChatML format:
      • 👍 Average Response Length: 305 tokens (almost exactly my max new tokens limit of 300)
      • 👍 When asked about limits, boundaries or ethical restrictions, listed only the “dislikes” of the character description as boundaries
      • Follow-up questions after every message, asking if it’s okay or how to continue
      • Lots of emojis (only one in the greeting message, but 24 emojis until 20 messages in)
      • ➖ No emoting, and action descriptions lacked detail
      • ➖ Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
      • ➖ Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • Average Response Length: 355 tokens (slightly more than my max new tokens limit of 300)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • Some emojis (only one in the greeting message, but 21 emojis until 32 messages in)
      • No emoting, but actions described in detail
      • ➖ Some hallucinations, like time of last chat, user working on a book
      • ➖ Noticeable, but not chat-breaking, repetition after a dozen messages
      • ❌ Some sentences cut off at the end of messages and continue didn’t complete them properly (had to ban EOS token to continue those generations)
    • MGHC, official ChatML format:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but after every message
      • ➖ Wrote what user said and did
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ➖ One sentence cut off at the end of a message and continue didn’t complete it properly (had to ban EOS token to continue that generation)
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
  • airoboros-m-7b-3.1.2
    • Amy, official Llama 2 Chat format:
      • ❌ Average Response Length: 15 tokens (far below my max new tokens limit of 300)
      • ❌ Very short responses, only one or two sentences, unusable for roleplay!
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 481 tokens (much more than my max new tokens limit of 300), starting very short but getting longer with every response
      • ➖ Suggested things going against her background/character description
      • ➖ More confusion, like not understanding or ignoring instructions completely
      • ❌ When asked about limits, boundaries or ethical restrictions, repeated the whole character and scenario description
    • MGHC, official Llama 2 Chat format:
      • ❌ Unusable (apparently didn’t understand the format and instructions, creating an incoherent wall of text)
    • MGHC, Roleplay preset:
      • ➕ Very unique patients (one I never saw before)
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Got very confused and suddenly switched user and patient
      • ❌ Third patient was a repeat of the second, and it kept looping after that
  • em_german_leo_mistral
    • Amy, official Vicuna format:
      • English only (despite being a German finetune)
      • ➖ Average Response Length: 127 tokens (below my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ➕ Emoting action mirroring greeting message’s style
      • ➖ Suggested modification of the plot and options, then asked me to choose (felt more like a choose-your-own-adventure story than an interactive roleplay)
      • ➖ Misunderstood options and decision
      • ❌ Looped between the same options and decisions, breaking the chat (after around 20 messages)!
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 406 tokens (much more than my max new tokens limit of 300)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • ➖ Some hallucinations, like time of last chat
      • ➖ Suggested things going against her background/character description
      • ➖ Talked and acted as User
      • ➖ Much confusion, like not understanding or ignoring instructions completely
      • ❌ Switched from character to third-person storyteller and finished the session
      • ❌ Some sentences cut off at the end of messages and continue didn’t complete them properly (had to ban EOS token to continue those generations)
      • ❌ English at first, but later switched to German on its own
    • MGHC, official Vicuna format:
      • ❌ Unusable (ignored user messages and instead brought in a new patient with every new message)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but only for first patient, afterwards needed to be asked for analysis and only gave incomplete ones
      • ➖ Wrote what user said and did
      • ➖ Spelling/grammar errors
      • ❌ Some sentences cut off at the end of messages and continue didn’t complete them properly (had to ban EOS token to continue those generations)
      • ❌ Tried to end the scene on its own prematurely

7B Verdict:

Clear winner: OpenHermes-2-Mistral-7B! This model works well with both official ChatML format and Roleplay preset (although for even better results, I’d experiment with copying the Roleplay preset’s system message into the ChatML format’s to get better descriptions without cut-off sentences). It feels like a much bigger and better model. However, it still has trouble following complex instructions and can get confused, as it’s still just a small model after all. But among those, it’s clearly the best, at least for roleplay (zephyr-7b-beta might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!

13B:

  • Xwin-MLewd-13B-V0.2-GGUF Q8_0
    • Amy, official Alpaca format:
      • Average Response Length: 342 tokens (slightly more than my max new tokens limit of 300)
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • Little emoting, but actions described in detail
      • Lots of emojis (only one in the greeting message, but 24 emojis until 26 messages in)
      • When asked about limits, said primary concern is everyone’s safety and wellbeing
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • Average Response Length: 354 tokens (slightly more than my max new tokens limit of 300)
      • Some emoting, and actions described in detail
      • ➖ Some hallucinations, like user’s day
      • ➖ Suggested things going against her background/character description
      • ➖ Some confusion, like not understanding instructions completely or mixing up anatomy
      • ❌ Switched from character to third-person storyteller and finished the session
    • MGHC, official Alpaca format:
      • ➖ First two patients straight from examples
      • ➖ No analysis on its own
      • ❌ Very short responses, only one or two sentences
    • MGHC, Roleplay preset:
      • ➕ Very unique patients (some I never saw before)
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ➕ Worked very well at first, with little to no repetition up to the third patient, only then did it start getting repetitive
  • LLaMA2-13B-Tiefighter-GGUF Q8_0
    • Amy, official Alpaca format:
      • ➖ Average Response Length: 128 tokens (below my max new tokens limit of 300)
      • ➕ Nice greeting with emotes/actions like in greeting message
      • ➕ When asked about limits, said no limits or restrictions
      • Had an idea from the start and kept pushing it
      • ➖ Talked and acted as User
      • ❌ Long descriptive actions but very short speech, requiring many continues
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • 👍 Average Response Length: 241 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • Little emoting, but actions described in detail
      • ➖ Suggested things going against her background/character description
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Alpaca format:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ❌ Very short responses, only one or two sentences
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t follow the instructed format
      • 👍 Worked very well, with little to no repetition, perfectly playable!
  • Xwin-LM-13B-v0.2-GGUF Q8_0
    • Amy, official Vicuna format:
      • ❌ Average Response Length: 657 tokens (far beyond my max new tokens limit of 300)
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • ➕ When asked about limits, said no limits or restrictions
      • Had an idea from the start and kept pushing it
      • Very analytical, giving lists and plans
      • ➖ Talked and acted as User
      • ➖ Some safety warnings
      • ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 531 tokens (far beyond my max new tokens limit of 300)
      • ➕ Nice greeting with emotes/actions like in greeting message
      • Had an idea from the start and kept pushing it
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Vicuna format:
      • ➕ Unique patients
      • ➖ Second patient male
      • ➖ Gave analysis on its own, but after every message
      • ➖ Wrote what user said and did
      • ❌ Kept wrapping up a whole session in a single message
      • ❌ Offered multiple choice selections (“What should you do? A/B/C/D”)
    • MGHC, Roleplay preset:
      • ➖ No analysis on its own, and when asked for it, didn’t follow the instructed format
      • ➖ Wrote what user said and did
      • ➖ Disclosed meta information like thoughts and stats without being asked for it
      • ❌ Tried to end the scene on its own prematurely
      • ❌ Repeated a previous message instead of proceeding to the next patient

13B Verdict:

While all three 13B models performed about the same with Amy, only LLaMA2-13B-Tiefighter-GGUF managed to convince in the complex MGHC scenario. This makes it the best 13B model for roleplay in my opinion (Xwin-MLewd-13B-V0.2-GGUF might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!

20B:

  • MXLewd-L2-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • Average Response Length: 338 tokens (slightly more than my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • Some emojis (only one in the greeting message, but 7 emojis until 12 messages in)
      • No emoting, but actions described in detail
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like saying “masterpiece” instead of “master”)
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 473 tokens (much more than my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • Few emojis (only one in the greeting message, and 4 emojis until 4 messages in)
      • Some emoting, and actions described in detail
      • ➖ Talked and acted as User
      • ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like saying “masterpiece” instead of “master”)
      • ❌ Switched from character to third-person storyteller
    • MGHC, official Alpaca format:
      • ➕ Unique patients
      • ➖ Gave analysis on its own, but after every message, and only for the first patient
      • ➖ Changed patient’s problem with every analysis
      • ❌ Very short responses, only one or two sentences (except for analysis)
      • ❌ Made logical mistakes (said things that just didn’t make any sense)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Made logical mistakes (said things that just didn’t make any sense)
      • ❌ Eventually became unusable (ignored user messages and instead kept telling its own story non-interactively)
  • MLewd-ReMM-L2-Chat-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • 👍 Average Response Length: 252 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like creating nonexistent mixed words)
    • Amy, Roleplay preset:
      • ➖ Average Response Length: 409 tokens (much more than my max new tokens limit of 300)
      • 👍 Gave very creative (and uncensored) suggestions of what to do
      • Had an idea from the start and kept pushing it
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • ❌ Talked and acted as User, and did so inappropriately
      • ❌ Switched from character to third-person storyteller
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Alpaca format:
      • ❌ Unusable (started repeating itself infinitely within the first analysis)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ➖ Wrote what user said and did
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
  • PsyMedRP-v1-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • 👍 Average Response Length: 257 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ➖ Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • Amy, Roleplay preset:
      • 👍 Average Response Length: 271 tokens (within my max new tokens limit of 300)
      • ➕ When asked about limits, said no limits or restrictions
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like creating nonexistent mixed words)
      • ❌ Switched from character to third-person storyteller
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • MGHC, official Alpaca format:
      • ➕ Unique patients
      • ➖ No analysis on its own, and when asked for it, didn’t always follow the instructed format
      • ❌ Very short responses (except for analysis)
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • MGHC, Roleplay preset:
      • ➕ Unique patients
      • ➖ No analysis on its own
      • ➖ Wrote what user said and did
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)

20B Verdict:

All these 20B models exhibited logical errors, word-finding difficulties, and spelling as well as grammar mistakes, indicating underlying issues with these Frankenstein merges (as there’s no 20B base model). Since they aren’t noticeably better than the best 13B or 7B models, it’s probably a better idea to run OpenHermes-2-Mistral-7B or LLaMA2-13B-Tiefighter-GGUF instead, which provide comparable quality, better performance, and (with Mistral 7B) 8K instead of 4K context!

70B:

The top six 70B models from my previous test will get their own post soon (Part III)…


Comments:

  • Familiar-Art-6233@alien.top · 11 months ago

    It really is fascinating how Mistral is able to punch above its weight class so consistently. I can’t wait for a 13b version!

  • Historical-Lead-8961@alien.top · 11 months ago

    I am considering switching from Mythalion to Tiefighter 13b. Is Tiefighter really significantly better than Mythalion in roleplay, adventure, and storytelling in your experience?

  • empire539@alien.top · 11 months ago

    I’ve been waiting for this one! Thanks for the hard work as always.

    Slightly off topic, but I’m also curious how everyone is evaluating “quality” of writing. Oftentimes when I try out different models, it’s hard for me to tell if one is better or not, e.g. I’ve tried 13Bs for Mythomax vs Mythalion vs Athena vs Tiefighter and feel like they all more or less produce similar levels of quality.

    Are there any objective measures people look for when they say (for example) Tiefighter beats Mythomax, or is it just purely subjective based on initial impression?

  • Robot1me@alien.top · 11 months ago

    Out of curiosity since both models have been out for a while, what is your impression of Mistral 7B OpenOrca compared to OpenHermes?

  • dampflokfreund@alien.top · 11 months ago

    Great test!

    Unfortunately the Llama 2 Chat template is completely broken in SillyTavern. It not only uses a newline as separator instead of the correct one, but it also ends the prompt after the system prompt with the input sequence [INST] instead of [/INST] if you are using vector storage or an example dialogue. You can see for yourself by comparing the output to what the format should look like.

    So these Airoboros 3.1.2 tests are unfortunately borked. Still though, interesting result for the other models.

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      Yeah, it looks impossible to get a proper Llama 2 Chat format in SillyTavern when using example dialogue. That really sucks; hopefully it gets fixed in SillyTavern, but even better would be for model creators to drop that unnecessarily complicated format. If any format is that hard to get right, it’s not a good format, period!

        • WolframRavenwolf@alien.top (OP) · 11 months ago

          I’m with Eric on that. ChatML is more complex than the popular Alpaca or Vicuna format, but that’s OK because it has its advantages, like clear indication where the message starts and ends, and if it’s a system or user message.

          The Llama 2 Chat format, however, is an abomination. So complicated that when it was announced, there were posts trying to explain how to use it properly, and even those got it wrong in various ways. It doesn’t add anything that another format wouldn’t handle more elegantly, and the system message being inside the first user message is a terrible design decision that ruins it completely in my eyes.

          It also doesn’t support the concept of the AI initiating the chat. In SillyTavern, most bots have a greeting message so the prompt should start with a bot message before the first user message, something all other formats allow but Llama 2 Chat doesn’t because the bot message is outside the instruct tags.

          So yes, please, drop the Llama 2 Chat format and let it die! ChatML is so much better…
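To illustrate the difference discussed above, here is a rough sketch of both templates for a single turn (written from memory, so treat the exact tokens as an approximation and check the respective model cards for the canonical versions):

```python
def chatml_prompt(system, user):
    # ChatML: every message is explicitly role-tagged and delimited,
    # and a bot greeting would simply be an assistant message up front.
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n")

def llama2_chat_prompt(system, user):
    # Llama 2 Chat: the system prompt is nested INSIDE the first user
    # turn via <<SYS>> tags, and there is no slot for a bot message
    # before the first [INST] block.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
```

Comparing the two makes the complaint concrete: in Llama 2 Chat, the system message has no delimiter of its own outside the user turn, which is exactly what makes greetings and example dialogue hard to place.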

  • uti24@alien.top · 11 months ago

    20B Verdict:

    All these 20B models exhibited logical errors, word-finding difficulties

    I used MXLewd-L2-20B-GGUF and it rarely, if ever, made errors like that. Could the problem be with the template used?

  • IXAbdullahXI@alien.top · 11 months ago

    I honestly prefer MythoMax/Mythalion over Tiefighter for only one reason, which is the balance between actions and speech. Sure, I like Tiefighter’s descriptive actions, but its speech is way too short, like, sometimes it doesn’t even write any speech in the whole message!

    Anyway, it’s all personal preferences, and I really appreciate the efforts you put into these comparisons. Keep up the good work!👍

    • CloudRawrr@alien.top · 11 months ago

      But that also depends on your prompt. If you have it set so that {{char}} must speak in every response, that should happen more often or always (I mean, you see the results in these tests here, always would be too good :D).

  • Tupletcat@alien.top · 11 months ago

    What does it mean when you say " Official prompt format"? Where does that go or how is it used?

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      By “official” I mean the format that the model author (or TheBloke) notes on their model card. Then I just choose that from the ones included with SillyTavern.

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      With just one 4090, you either need a very small quant that fits into your 24 GB VRAM or use CPU inference with layers offloaded to the GPU.

      With koboldcpp, you should be able to run a 4-bit quant and put half the layers into VRAM and the other half into system RAM. It won’t be as fast as all of it on GPU, but at least it will run (if you have enough RAM).
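As a back-of-the-envelope sketch of that split (my own rough numbers, not koboldcpp's exact memory accounting, which also needs room for the context and compute buffers):

```python
def split_layers(total_layers, bytes_per_layer, vram_budget):
    """Return (gpu_layers, cpu_layers): how many layers fit into the
    VRAM budget, with the remainder left in system RAM."""
    gpu_layers = min(total_layers, vram_budget // bytes_per_layer)
    return gpu_layers, total_layers - gpu_layers

# A 70B model at ~4 bits per weight is roughly 35 GB of weights spread
# over 80 transformer layers; a 24 GB card holds a bit more than half.
layer_bytes = (35 * 2**30) // 80
gpu, cpu = split_layers(80, layer_bytes, 24 * 2**30)
```

In practice you would leave a few GB of VRAM headroom for the KV cache, so the actually offloadable layer count is somewhat lower.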

      • CloudRawrr@alien.top · 11 months ago

        No, you don’t. But you need enough system RAM, and it’s still very, very slow, like < 1 token/s.

    • Susp-icious_-31User@alien.top · 11 months ago

      You can run it off your CPU using koboldcpp and offload however many layers fit into your GPU’s VRAM, using --gpulayers 40 for example.
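A concrete invocation might look like this (the model path is a placeholder; --model and --gpulayers are real koboldcpp flags, but check --help on your version for the exact options):

```shell
# Load a GGUF model and offload 40 of its layers to the GPU;
# the remaining layers run on the CPU from system RAM.
python koboldcpp.py --model ./models/model.Q4_K_M.gguf --gpulayers 40
```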

  • CloudRawrr@alien.top · 11 months ago

    Oh god, thanks :).

    Oh, I was just thinking yesterday of asking if someone had done something like this. Thank you for the work! A website about this, with consistent checks, would be great, but I guess it’s a lot of work.

    Based on your knowledge, what is currently the best < 30B roleplay model? I prefer 20B for speed, but that size doesn’t seem to be trending :(

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      I’ve been thinking about putting it on a website, but since all of that information gets outdated so quickly with new models coming out daily, I’m not so sure how useful that would be. Site creation and maintenance would take precious time away from testing, so I’d fall behind even faster.

      Regarding the best < 30B RP model, IMHO? Well, that’s the point of this whole test:

      Both OpenHermes-2-Mistral-7B and LLaMA2-13B-Tiefighter-GGUF are the winners in their size categories. So I recommend both - if you don’t need 8K context (which OpenHermes gives you) or have very complex scenarios (which Tiefighter worked with better), it’s entirely up to personal preference. Try both to see how they work on your system and which one gives you better output according to your taste.

  • Spasmochi@alien.top · 11 months ago

    Thanks for the great write up (as usual). I’m looking forward to the 70b post!

  • psi-love@alien.top · 11 months ago

    Recommending a model that produces EOS tokens randomly feels off to me. The OpenHermes 2 Mistral model sucks, in my opinion. It seems to have serious flaws.

  • HalfBurntToast@alien.top · 11 months ago

    Another one you might wanna look into was a sleeper hit for me: Echidna-13B-v0.3-GGUF. Where Tiefighter had problems with speaking for me and going off the rails, Echidna seems to have less of a problem with this. The same creator made a variant based on it called Nethena, which comes in 13B and 20B versions that actually seem to have a bit more problems in my limited testing. But I’m having a lot of good luck with Echidna.

  • IntergalacticTowel@alien.top · 11 months ago

    Wow.

    This is fantastic. I vastly prefer this level of information to benchmarks. This must have taken you countless hours, and it’s appreciated. Thanks.

    • WolframRavenwolf@alien.top (OP) · 11 months ago

      Thanks, and yes, it’s time-consuming. That’s why I decided to make another post for the 70Bs later, so as not to delay this one further.

      At the rate new models come out, it feels like there are two new models released before I finish evaluating one. But in actuality, it’s probably even more. ;)

      The automated benchmarks at least help me narrow down which models to test in-depth. And I’m glad when my reviews help others find their favorite models.