🐺🐦‍⬛ LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4

WolframRavenwolf@alien.top · 3 years ago

🐺🐦‍⬛ LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4

doublechord@alien.top · 3 years ago

what kinda system do I need to runt he capybara 34b?

fab_space@alien.top · 3 years ago

I want to share my test with u for reviewing, and hopefully, integration.

how it sounds?

Perimeter666@alien.top · 3 years ago

Goliath is a masterpiece so far. Running it on 4x4090, speed is OK, but not the best still.

For my taste it writes stories better than GPT4 itself, immersing deeper and avoiding useless watery poetic shit GPT4 is full of.

Just give the thing 16k context and with a 16x4096 setup it’ll be divine lol

Ok_Relationship_9879@alien.top · 3 years ago

That’s pretty amazing. Thanks for all your hard work!
Does anyone know if the Nous Capybara 34B is uncensored?

FullOf_Bad_Ideas@alien.top · 3 years ago

I am not serious, but the results clearly suggest that what we should try next is to stack 2 various finetunes of Yi-34B onto each other in the same way it’s done in Goliath and then quantize it.

Inevitable-Start-653@alien.top · 3 years ago

God I love these posts! Thank you so much 🙏😊

drifter_VR@alien.top · 3 years ago

But why did SauerkrautLM 70B, a German model, fail so miserably on the German data protection trainings tests?

Does it write decent german, at least ?

I ask because I tried another Llama-2-70B model fine-tuned to speak another language than english (Vigogne-2-70b-chat) and I have been disappointed by its poor writing style.

Maybe it’s my settings or the fine-tuning. Or maybe the base model is the issue (relatively small and trained mainly in english)

norsurfit@alien.top · 3 years ago

You’re doing the lord’s work, son…

mcmoose1900@alien.top · 3 years ago

I have… mixed feeling about Capybara’s storytelling, compared to Base YI 34B with the alpaca lora?

I have been trying it with the full instruct sytnax, but maybe it will work better with hybrid instruct/chat sytnax (where the whole story is in one big USER: block, and the instruction is to continue the story.)

iChrist@alien.top · 3 years ago

I found out that for a simple task like “list 10 words that end with the letters en” i get only wrong answers with the dolphin 34B variant, while 13B tiegihter gets it right, am i doing something wrong with template?

sophosympatheia@alien.top · 3 years ago

Another great contribution, Wolfram! I was pleased to see one of my 70b merges in there and it didn’t suck. More good stuff to come soon! I have a xwin-stellarbright merge I still need to upload that is hands down my new favorite for role play. I’m also excited to see what opus can do in the mix.

kindacognizant@alien.top · 3 years ago

> Deterministic generation settings preset

There seems to be a common fallacy that absolute 0 temperature or greedy sampling is somehow the most objective because it’s only picking the top token choice; this isn’t necessarily true, especially for creative writing.

Think about it this way: you are indirectly feeding into the model’s pre-existing biases in cases where there are many good choices. If you’re starting a story with the sentence, “One day, there was a man named”, that man could be literally any man.

On the base Mistral model, with that exact sentence, my custom debug kobold build says:

Token 1: 3.3%

Token 2: 2.4%

Token 3: 1.6%

Token 4: 1.6%

Token 5: 1.18%

Token 6: 1.15%

Token 7: 1.14%

Token 8: 1.03%

Token 9: 0.99%

Token 10: 0.98%

When the most confidence the model has in a token is 3.3%, that implies you’d want to keep the selection criteria just as diverse, because in reality that slight bit of confidence is only because it has a generic name for the top token.

Whatever the most likely token is only the most likely token for that particular token given the past context window: a deterministic preset is not creating generations that are overall more coherent. In fact, it causes models to latch onto small biases caused by tokenization, which manifests as repetition bias.

The Deterministic preset in ST also has a rather high repetition bias of 1.18; this is causing the model to subtly bias against things like asterisks and proper formatting, which are important to test for in a model.

mrpogiface@alien.top · 3 years ago

You should apply here. Lemme know if you want an intro

https://twitter.com/natfriedman/status/1723100077718438065?t=NHJL-ge61DWGos77QaH58g&s=19

coderguyofficial@alien.top · 3 years ago

my experience so far…

i can confirm yi-capybara-34b-2k is actually pretty good

better than zephyr-beta-8-bit at following instructions
better than chatgpt-3.5-turbo on chatgpt web app
gpt-4 is the best one, but no longer by a large gap

RepresentativeOdd276@alien.top · 3 years ago

Can you add a test in your next comparisons where you ask the LLM to output in less than x amount of words? I have noticed that most LLMs including large ones fail to follow this instruction successfully.

🐺🐦‍⬛ LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4

🐺🐦‍⬛ LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4

Models tested:

Testing methodology

1st test series: 4 German data protection trainings

Observations:

Conclusion: