• til_life_do_us_part@alien.topB

      I think a natural way to do it would be to simultaneously train the same model to predict user responses via negative log-likelihood on chat data while optimizing the assistant responses to maximize a reward signal. Then you could have the language model generate imagined user responses and optimize the reward signal on the imagined user responses, perhaps in addition to the actual dataset of user interactions. This could be more powerful than conventional RLHF, since the model could generate multi-step interactions and optimize its responses for utility over multiple steps rather than greedily, based only on human preference for the immediate response.

      One tricky question in this case is the reward signal. If it comes from human feedback, then naively you might need human preferences over entire dialogues rather than single responses, which is both more labour-intensive and a sparser signal for training.
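
      Roughly what I have in mind, as a toy sketch (all class and function names here are made up, and a real setup would use a proper pretrained LM and a learned preference model rather than these stand-ins):

```python
# Toy sketch of the joint objective, not real RLHF code.
# Assumes pre-tokenised dialogues plus float masks saying which tokens
# were written by the user vs. the assistant.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """Stand-in for whatever decoder-only model you'd actually use."""
    def __init__(self, vocab_size=1000, d=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab_size)

    def forward(self, tokens):          # tokens: (B, T)
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)             # logits: (B, T, vocab)

class TinyRewardModel(nn.Module):
    """Stand-in for a preference/reward model scoring whole dialogues."""
    def __init__(self, vocab_size=1000, d=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.score = nn.Linear(d, 1)

    def forward(self, tokens):
        return self.score(self.emb(tokens).mean(dim=1)).squeeze(-1)  # (B,)

def joint_loss(model, reward_model, tokens, user_mask, asst_mask, beta=0.1):
    # tokens:    (B, T) dialogue token ids
    # user_mask: (B, T) float, 1 where the token was written by the user
    # asst_mask: (B, T) float, 1 where the token was written by the assistant
    logits = model(tokens[:, :-1])                  # predict the next token
    targets = tokens[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T-1)

    # 1) NLL on user turns: learn to simulate the user.
    user_nll = -(token_logp * user_mask[:, 1:]).sum() / user_mask[:, 1:].sum()

    # 2) REINFORCE-style term on assistant turns: push up log-probs of
    #    assistant tokens in proportion to the dialogue-level reward
    #    (no baseline or KL term here, purely to keep the sketch short).
    with torch.no_grad():
        reward = reward_model(tokens)               # one scalar per dialogue
    asst_logp = (token_logp * asst_mask[:, 1:]).sum(dim=-1)          # (B,)
    rl_term = -(reward * asst_logp).mean()

    return user_nll + beta * rl_term
```

      The imagined-rollout part would then just mean sampling the user turns from the same model and pushing the resulting synthetic dialogues through the same loss.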

      • koi691337@alien.topB

        Then you could have the language model generate imagined user responses and optimize the reward signal on the imagined user responses

        Wouldn’t this just amount to the model sort of overfitting to noise?

        • til_life_do_us_part@alien.topB

          It’s a risk if your model can’t accurately predict user responses, but I don’t see how it’s a necessary characteristic of the approach. If it were, the same issue would apply to model-based RL in general, no? Unless you’re suggesting there’s something special about language modelling or user responses that makes them fundamentally hard to learn a model of.
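
          You could also sanity-check this directly, e.g. by tracking how well the user model actually predicts held-out real user turns before trusting its imagined rollouts. Rough sketch, reusing the made-up names and mask convention from the earlier comment:

```python
# Continuing the toy sketch above (still made-up names): one way to gate
# the imagined rollouts is to only trust them while the user-model part
# has low perplexity on held-out *real* user turns.
import torch
import torch.nn.functional as F

@torch.no_grad()
def user_turn_perplexity(model, tokens, user_mask):
    """Perplexity of the model on held-out user tokens only."""
    logits = model(tokens[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    mask = user_mask[:, 1:]
    return torch.exp(-(token_logp * mask).sum() / mask.sum())

# e.g. only mix imagined dialogues into training while this stays under
# a threshold you'd have to pick empirically.
```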