• pseudonerv@alien.topB
    1 year ago

    From the Hugging Face model card:

    Starling-RM-7B-alpha is a reward model trained from Llama2-7B-Chat.

    From their webpage, https://starling.cs.berkeley.edu

    Our reward model is fine-tuned from Llama2-7B-Chat

    Yet the model's config.json says:

    "max_position_embeddings": 8192,
    "model_type": "mistral",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000.0,
    "sliding_window": 4096,
    

    SO? Whoever is doing the PR has no f***ing idea what their student laborers are actually doing.

    • Warm_Shelter1866@alien.topB
      1 year ago

      What does it mean that an LLM is a reward model? I always thought of rewards only in the RL field. And how would the reward model be used during fine-tuning?
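
      For context: a reward model takes a prompt plus a candidate response and outputs a single scalar score for how good that response is. Structurally it's the same transformer, just with the next-token head swapped for a scalar head. A minimal sketch (the RewardModel class and its value head are hypothetical; the real Starling-RM head may differ):

      import torch
      import torch.nn as nn

      class RewardModel(nn.Module):
          # Hypothetical sketch: a causal-LM backbone with its LM head
          # replaced by a linear layer producing one scalar per sequence.
          def __init__(self, backbone, hidden_size):
              super().__init__()
              self.backbone = backbone                     # e.g. a Llama/Mistral trunk
              self.value_head = nn.Linear(hidden_size, 1, bias=False)

          def forward(self, input_ids, attention_mask):
              out = self.backbone(input_ids,
                                  attention_mask=attention_mask,
                                  output_hidden_states=True)
              hidden = out.hidden_states[-1]               # (batch, seq, hidden)
              # Score each sequence from its last non-padding token.
              last = attention_mask.sum(dim=1) - 1
              final = hidden[torch.arange(hidden.size(0)), last]
              return self.value_head(final).squeeze(-1)    # (batch,) scalar rewards

      During RLHF fine-tuning, the policy LLM generates responses, the (frozen) reward model scores them, and an RL algorithm such as PPO updates the policy to raise those scores. The reward model itself is typically trained on pairwise preference data, so that preferred answers score higher than rejected ones.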