Hello!
By popular demand, I am planning a fine-tune of https://huggingface.co/dreamgen/opus-v0-7b on top of Yi-34B, and I wonder whether to use the 200K variant as the base.
The regular Yi-34B seems slightly better than Yi-34B-200K on standard benchmarks, but I wonder how the 200K version “feels” in practice and whether its drop in short-context performance is worth it, given that the regular version can already be used up to 32K tokens.
Did anyone try an analysis of these 2 models on various sequence lengths (<4K, <8K, <16K, etc.)?
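Something like this rough sketch is what I have in mind: per-token loss of both base models on the same long text at a few context lengths. The repo ids, the wikitext test split, and the exact lengths are just placeholders, and you would obviously need hardware that fits a 34B model at these sequence lengths:

```python
# Minimal sketch (assumptions: repo ids, dataset, and lengths are illustrative).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["01-ai/Yi-34B", "01-ai/Yi-34B-200K"]  # assumed HF repo ids
LENGTHS = [4096, 8192, 16384, 32768]            # context lengths to probe

# One long evaluation text; wikitext-103 test is just a convenient stand-in.
text = "\n\n".join(
    load_dataset("wikitext", "wikitext-103-raw-v1", split="test")["text"]
)

for name in MODELS:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,  # may or may not be needed depending on transformers version
    )
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)

    for n in LENGTHS:
        chunk = ids[:, :n]
        with torch.no_grad():
            # labels=input_ids gives the mean next-token cross-entropy
            loss = model(chunk, labels=chunk).loss
        print(f"{name} @ {n} tokens: loss={loss.item():.3f}, "
              f"ppl={torch.exp(loss).item():.2f}")

    del model
    torch.cuda.empty_cache()
```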
The regular 34B “feels” like it ignores my prompt a lot.
It’s supposed to be a base model, not an instruction-finetuned model. That’s how base models generally behave, unless they are sold as base but are actually finetuned (e.g., the Llama 2 base models).
I felt this too. It does seem to “grab on,” though, when you give it a longer context to continue.