Asking for tips how to use base models instead of instruct/chat tuned models

noeda@alien.top · 11 months ago

Some quotes I found on the pages:

“No! The model is not going to be available publically. APOLOGIES. The model like this can be misused very easily. The model is only going to be provided to already selected organisations.”

“[SOMETHING SPECIAL]: AIN’T DISCLOSING!🧟”

“Hallucinations: Reduced Hallucinations 8x compared to ChatGPT 🥳”

My guess: it’s just another merge like Goliath. At best it’s marginally better than a good 70B.

I can also “successfully build 220B model” easily with mergekit. Would it be good? Probably not.

The lab should explain on their model card why I shouldn’t think it’s just bullshit. This is not exactly the first mystery lab making big claims.

noeda@alien.top · 11 months ago

I’ve seen the “… beats GPT-4” enough times that now whenever I see a title that suggests a tiny model can compete with GPT-4 I see it as a negative signal; that the authors are bullshitting through some benchmarks or some other shenanigans.

It’s annoying because the models might be legitimately good models for being open and within their weight class but now you’ve put my brain in BS detecting mode and I can’t trust you’ve done good faith measurement anymore.

noeda@alien.top · 11 months ago

Going by the first image posted, it looks like it’s not even close to GPT-4?

noeda@alien.top · 1 year ago

I have also noticed that Goliath makes spelling errors somewhat frequently, more often than other models.

It doesn’t seem to affect the “smarts” part as much though. It otherwise still makes high quality text.

noeda@alien.top · 1 year ago

I will set this to run overnight on Hellaswag 0-shot like I did here on Goliath when it was new: https://old.reddit.com/r/LocalLLaMA/comments/17rsmox/goliath120b_quants_and_future_plans/k8mjanh/

Thanks for the model! I started investigating some approaches to combine models and see if the result can be better than its individual parts. Just today I finished code that uses a genetic algorithm to pick out parts and frankenstein 7B models together (trying to prove that there is merit to this approach using smaller models… but we’ll see).

I’ll report back on the Hellaswag results on this model.

noeda@alien.top · 1 year ago

I always use the Raw tab, even when chatting (I look up the prompt template manually if I’m using it chat-style). I like to see exactly what is given to the model and what it generates back. Sometimes I use command line software when I’m not using the UI.

noeda@alien.top · 1 year ago

Asking for tips how to use base models instead of instruct/chat tuned models

noeda@alien.top · 1 year ago

I think the GPT-isms may be why my AI storywriting attempts tend to be overly positive and cliched. Not exactly a world-shattering problem, but it is annoying (shakes fist).

If I had to name a possibly serious problem, it’s that the biases OpenAI initially put into ChatGPT and their GPT models are now spreading to the local models as well.

It’s annoying because it feels like all models respond to questions in a similar way. Some are just a bit smarter than others or tuned to respond a bit differently.

If the GPT-like data spreads around the Internet as well, then it might be difficult to avoid having it in training data unless you only include old data in your training.

noeda@alien.top · 1 year ago

Not sure if you misread, but it’s actually high, i.e. it’s better than the Xwin and Euryale models it’s made out of (in this particular quick test).

It beat all the 70B models I tested there, although the gap is not super high.

noeda@alien.top · 1 year ago

Just finished the Hellaswag trial runs. First, here’s a table from best to worst:

The euryale and xwin models are the ones used to Frankenstein together the Goliath model.

The Goliath .gguf was quantized by myself, as was the Yi model. The rest are downloaded from TheBloke.

Even though Goliath shows up as the top model, here is why I don’t think you should run off and tell everyone Goliath’s the best model ever:

The trials ran 400 random tests from the Hellaswag set, so there is a random element in the final score. When I plugged in the Goliath and Euryale results for 400 trials to compute the probability that Goliath is better at 0-shot Hellaswag than Euryale, I got 84% as the result (97.83% vs. Xwin). 84% is good, but I was hoping it would be more like 99%. In other words, it’s possible I randomly got a better result for Goliath simply because it got lucky in the choice of which Hellaswag tests it was asked to complete.
This was the first time I ever tried running more rigorous tests on LLMs rather than eyeballing it, so I may have made mistakes.
The numbers can’t be compared with the OpenLLM leaderboard (they use N-shot Hellaswag; I forgot what N was), and I noticed they also don’t line up with the llama.cpp link there. I expected a mismatch with the OpenLLM leaderboard, but I can’t explain why the numbers don’t match the llama.cpp discussion.
Hellaswag is just one benchmark. I looked at the examples inside the tests to see what they actually ask the models, and I think 0-shot testing is a bit brutal for these models. It might be a bit unfair to them. I thought the Yi model, for example, was supposed to be really good.
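
For anyone who wants to run a similar sanity check, here is a minimal sketch of one way to estimate the probability that one model is truly better than another from two accuracy scores over the same number of trials. The post does not say which method produced the 84% figure, so this is not that calculation; it is a normal approximation to the difference of two proportions. It treats the two runs as independent samples, which is an assumption (if both models saw the same 400 questions, a paired test would be more appropriate). The function name and the example counts are hypothetical.

```python
import math


def prob_a_better(correct_a: int, correct_b: int, n: int) -> float:
    """Approximate P(true accuracy of A > true accuracy of B).

    Both models are assumed to have been scored on n independent trials.
    Uses a normal approximation to the difference of two proportions.
    """
    p_a = correct_a / n
    p_b = correct_b / n
    # Variance of the difference between the two observed accuracies.
    var = p_a * (1 - p_a) / n + p_b * (1 - p_b) / n
    if var == 0:
        # Degenerate case: both scores are exactly 0% or 100%.
        return 0.5 if p_a == p_b else float(p_a > p_b)
    z = (p_a - p_b) / math.sqrt(var)
    # Standard normal CDF evaluated at z.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


if __name__ == "__main__":
    # Hypothetical counts, NOT the actual results from the runs above.
    print(f"{prob_a_better(340, 330, 400):.2%}")
```

With only 400 trials, a gap of a couple of percentage points between two models leaves a lot of room for luck, which is why a result well short of 99% is not surprising here.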

I would wait until people with more resources can run proper benchmarks to test this out. I don’t plan on updating these numbers myself.

BUT. It does look promising. I’m hoping more rigorous benchmarks will give some more confidence.

noeda@alien.top · 1 year ago

I’ve done a bunch of D&D character sheets with this and yeah, I think it is pretty good. (Still not sure if that’s just Euryale though, which looks like it has been trained on that kind of data.)

I would love to see where Goliath ranks in the traditional benchmarks, Hellaswag, Winogrande etc. (has anyone run them yet?) Very curious if this model is strictly better than the two models it was made out of in a more rigorous test.

I’m really hoping it can be proven that the frankensteining method really does improve the smarts compared to the models it is made out of.

I’ve been using a Q6 gguf quant I made myself on day 1 and it works well. I get 1.22 tokens per second on pure CPU + DDR5 memory, and I think it uses around 90GB of memory.

noeda@alien.top · 1 year ago

Do you follow Chinese LLM development?