• georgejrjrjr@alien.topB
    link
    fedilink
    English
    arrow-up
    1
    ·
    10 months ago

    The model seems cool and all, but the paper is better.

    Intel eliminated the preference data from direct preference optimization. Preference data is expensive and collecting it is a hassle, so this is a big deal. Best of all, it looks like their no-preference DPO actually performs better.

    The trick is sampling rejects from a small model. Let’s say you have a dataset of GPT-4 completions. You mark those as good (“preferred”). You prompt Llama 2 13B and mark its responses as rejects.

    Tl;dr This could boost the performance of nearly every model with a minimal increase in complexity (though obviously it’s non-zero compute).