A few people here tried the Goliath-120B model I released a while back, and it looks like TheBloke has now released quantized versions. So far, the reception has been largely positive.

https://huggingface.co/TheBloke/goliath-120b-GPTQ

https://huggingface.co/TheBloke/goliath-120b-GGUF

https://huggingface.co/TheBloke/goliath-120b-AWQ

The fact that the model turned out well was completely unexpected; every LM researcher I’ve spoken to about it in the past few days has been baffled. The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D
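
For anyone curious what the “stitching” looks like mechanically, here’s a rough, hypothetical sketch of a layer-interleave merge. The parent checkpoints and slice boundaries below are placeholders, not the actual goliath-120b recipe (which interleaves slices of its two 70B parents, Xwin and Euryale):

```python
# Hypothetical layer-interleave ("stitch") sketch. Parent names and slice
# boundaries are placeholders, NOT the actual goliath-120b recipe.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("parent-model-a")  # placeholder
model_b = AutoModelForCausalLM.from_pretrained("parent-model-b")  # placeholder

# (source, start, end) slices of decoder layers, alternated with overlap.
slices = [("a", 0, 16), ("b", 8, 24), ("a", 16, 32), ("b", 24, 40)]

merged = []
for src, start, end in slices:
    parent = model_a if src == "a" else model_b
    merged.extend(parent.model.layers[start:end])

# Reuse model_a as the container and swap in the stitched layer stack.
model_a.model.layers = nn.ModuleList(merged)
model_a.config.num_hidden_layers = len(merged)
model_a.save_pretrained("stitched-model")
```

The stitched layers have never been trained to feed into one another, which is exactly why a full finetune afterwards should help.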

On a related note, I’ve been working on LLM-Shearing lately, which would essentially let us shear a transformer down to much smaller sizes while preserving accuracy. Goliath-120B actually came out of an experiment in moving in the opposite direction of shearing. I’m now wondering whether we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, since we’d need to do continued pretraining after the shearing/pruning step. A more likely approach, I believe, is shearing Mistral-7B down to ~1.3B and performing continued pretraining on about 100B tokens.
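
To give a very loose illustration of the shape of that recipe (LLM-Shearing actually *learns* structured pruning masks over layers, heads, and hidden dims jointly with a dynamic data schedule; this naive stand-in just drops every other decoder layer):

```python
# Loose, hypothetical stand-in for structured pruning; NOT the actual
# LLM-Shearing method, which learns pruning masks end-to-end.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Naively keep every other decoder layer (illustration only).
kept = [layer for i, layer in enumerate(model.model.layers) if i % 2 == 0]
model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

# Continued pretraining would go here: a standard causal-LM training loop
# over ~100B tokens to recover the accuracy lost to pruning.
model.save_pretrained("sheared-model")
```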

If anyone has suggestions, please let me know. Cheers!

  • a_beautiful_rhind@alien.top · 1 year ago

    Surprise!.. Xwin doing poorly among the 70Bs. It does badly when I test it against chat logs too; it shows higher perplexity than Yi-34B and a gaggle of other models, including the base model.

    • randomfoo2@alien.top · 1 year ago

      It depends on the use case; each model has its own strengths. I picked XWin and Airoboros as baseline 70B models for second-language conversational testing, and XWin outperformed (in human-evaluated testing with a native speaker) a 70B model that had been pre-trained on an additional 100B tokens of that language. Shocking, to say the least.

      • a_beautiful_rhind@alien.top · 1 year ago

        My test set is logs of chats with characters, something that isn’t widely publicly available, so it can’t be gamed. Xwin has very bad perplexity on those, below that of CodeLlama-34B.

        Xwin: 4.876139163970947
        CodeLlama-34B: 4.689054489135742
        70B base: 3.69110918045044
        Euryale-1.3: 3.8607137203216553
        Dolphin 2.2: 4.39600133895874

        Same quantization in each case. Dolphin 2.2 did surprisingly badly, though not as badly as Xwin.

        Obviously perplexity doesn’t track 100% with model quality, but everything about Xwin taken together (refusals, the repetition issue, perplexity) put me off it in a big way.
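
        For anyone who wants to run this kind of test on their own logs, here’s a minimal sketch assuming a transformers causal LM (model name and file path are placeholders; chunks are non-overlapping for simplicity, whereas a sliding window with overlap gives slightly better estimates):

        ```python
        # Minimal perplexity-over-a-text-file sketch; names are placeholders.
        import math
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "some-model"  # placeholder
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name, device_map="auto").eval()

        ids = tok(open("chat_logs.txt").read(), return_tensors="pt").input_ids

        max_len = 2048
        total_nll, n_tokens = 0.0, 0
        for start in range(0, ids.size(1), max_len):
            chunk = ids[:, start : start + max_len].to(model.device)
            if chunk.size(1) < 2:
                break  # nothing left to predict
            with torch.no_grad():
                # labels are shifted internally; .loss is mean NLL per target token
                loss = model(chunk, labels=chunk).loss
            n = chunk.size(1) - 1
            total_nll += loss.item() * n
            n_tokens += n

        print("perplexity:", math.exp(total_nll / n_tokens))
        ```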