Starling-RM-7B-alpha: New RLAIF Finetuned 7b Model beats Openchat 3.5 and comes close to GPT-4

Legcor@alien.top · 2 years ago

Starling-RM-7B-alpha: New RLAIF Finetuned 7b Model beats Openchat 3.5 and comes close to GPT-4

pseudonerv@alien.top · 2 years ago

Form huggingface model card,

Starling-RM-7B-alpha is a reward model trained from Llama2-7B-Chat.

From their webpage, https://starling.cs.berkeley.edu

Our reward model is fine-tuned from Llama2-7B-Chat

Yet, the model config.json

"max_position_embeddings": 8192,
"model_type": "mistral",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"sliding_window": 4096,

SO? Whoever is doing the PR has no f***ing idea what their student labors are actually doing.

visarga@alien.top · 2 years ago

yeah I was put off by the lack of mention on the base model

Warm_Shelter1866@alien.top · 2 years ago

What does it mean that an LLM is a reward model ? , I always thought of rewards only in the RL field . And how would the reward model be used during finetuning?