I’m curious what results you’re seeing from the Yi models. I’ve been playing around with LoneStriker_Nous-Capybara-34B-5.0bpw-h6-exl2 and more recently LoneStriker_Capybara-Tess-Yi-34B-200K-DARE-Ties-5.0bpw-h6-exl2 and I’m finding them fairly good with the right settings. I found the Yi 34B models almost unusable due to repetition issues until I tried settings recommended in this discussion:
https://www.reddit.com/r/LocalLLaMA/comments/182iuj4/yi34b_models_repetition_issues/
I’ve found it much better since.
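For anyone running these outside of a frontend, the gist is to rein in the sampler rather than change the prompt. Here's a rough sketch with Hugging Face transformers; the model repo, prompt format, and the actual numbers are just placeholders (use whatever the linked thread recommends, not these):

```python
# Rough sketch only -- model repo, prompt format, and sampler values are
# illustrative placeholders, not the exact settings from the linked thread.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NousResearch/Nous-Capybara-34B"  # placeholder; swap for your quant/loader

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "USER: Summarize the plot so far.\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,         # placeholder value
    min_p=0.05,              # placeholder value; needs a recent transformers release
    repetition_penalty=1.1,  # placeholder value
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```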
I tried out one of the neural models and found it couldn’t keep track of details at all. I wonder if my settings weren’t very good or something. I would have been using an EXL2 or GPTQ version, though.
I have the same issue with LoneStriker_Nous-Capybara-34B-5.0bpw-h6-exl2. Whole previous messages will often get shoved into the response. I basically gave up and went back to Mistral-OpenHermes.
To stop the repetition, you could try adding ‘### Human’ as a stop token in the model settings. It works well for me.
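If your backend doesn’t expose custom stop strings directly, here’s a minimal sketch of one way to do it with transformers’ StoppingCriteria. It assumes `model` and `tokenizer` are already loaded, and the ‘### Human’ marker here is the Alpaca-style one, so swap in whatever turn marker your prompt template actually uses:

```python
# Minimal sketch: stop generation when a turn marker shows up in the output.
# Assumes `model` and `tokenizer` are already loaded; the stop string here is
# the Alpaca-style '### Human', so change it to match your prompt template.
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnString(StoppingCriteria):
    def __init__(self, tokenizer, stop_string, prompt_len):
        self.tokenizer = tokenizer
        self.stop_string = stop_string
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        # Only look at freshly generated tokens, not the prompt itself.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        return self.stop_string in new_text

inputs = tokenizer("### Human: Hi there\n### Assistant:", return_tensors="pt").to(model.device)
stop = StopOnString(tokenizer, "### Human", inputs["input_ids"].shape[1])
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    stopping_criteria=StoppingCriteriaList([stop]),
)
```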
Capybara doesn’t use Alpaca format, so that wouldn’t do anything. Regardless, it’s not that type of repetition. It’s not speaking for the user, it’s literally just copy/pasting part of the conversation into the answer.
I’ve had the same experiences with the Yi finetunes. I tried them on single-turn generations and they were very promising. However, when starting a conversation from scratch I ran into a ton of repetition and looping. Some models need a very tight set of parameters to perform well, whereas others will function well under almost any sane settings. I’m thinking Yi leans more towards the former, which will leave users thinking these models are inferior to simpler but more flexible ones.