In the context of evaluating LLMs, what do these scores technically mean?

Life_Ask2806@alien.top · 2 years ago

In the context of evaluating LLMs, what do these scores technically mean?

RexRecruiting@alien.top · 2 years ago

My understanding is that they are basically datasets the model is tested against. Say you wanted to see how well you knew math: you'd take a math test, and then your answers would be compared to an answer key…
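To make the answer-key idea concrete, here is a minimal sketch of how such a score could be computed. The function name `accuracy` and the sample answers are made up for illustration; real benchmark harnesses normalize answers and prompt the model in more elaborate ways.

```python
def accuracy(predictions, answer_key):
    """Fraction of predictions that match the answer key (case-insensitive)."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        return 0.0
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical data: four test questions, the model gets three right.
answer_key = ["12", "7", "B", "42"]
model_answers = ["12", "8", "b", "42"]

score = accuracy(model_answers, answer_key)
print(f"{score:.0%}")  # 75%
```

The number reported on a leaderboard is usually something like this: the percentage of benchmark questions the model answered correctly.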

Some of my notes about those benchmarks:

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers.

HellaSwag is a benchmark for commonsense reasoning: the model has to pick the most plausible ending for a short scenario.

TruthfulQA is a benchmark that measures whether a language model is truthful when generating answers to questions.

WinoGrande is a benchmark for commonsense reasoning, built from Winograd-style fill-in-the-blank problems.

shaman-warrior@alien.top · 2 years ago

Everything is common sense reasoning; we need better definitions.

ThisGonBHard@alien.top · 2 years ago

Nothing, sadly.

Models are trained on the test questions to improve their scores, which makes the tests moot.
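One rough way to look for this kind of leakage is to check whether a benchmark question shows up nearly word for word in training text. Below is a minimal sketch using word n-gram overlap; the helper names, the sample question, and the choice of n=5 are all assumptions for illustration, not how any particular lab checks for contamination.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_question, training_text, n=5):
    """Share of the question's n-grams that also appear in the training text."""
    question_ngrams = ngrams(benchmark_question, n)
    if not question_ngrams:
        return 0.0
    shared = question_ngrams & ngrams(training_text, n)
    return len(shared) / len(question_ngrams)


# Hypothetical example: the question appears verbatim inside a scraped page.
question = "a farmer has 12 apples and gives away 5 of them"
leaked_page = "forum post: a farmer has 12 apples and gives away 5 of them so 7 remain"
clean_page = "the weather was sunny and the market was busy all day long"

print(overlap_ratio(question, leaked_page))  # 1.0
print(overlap_ratio(question, clean_page))   # 0.0
```

A high ratio only suggests the question was seen during training; paraphrased or translated copies would slip past a check like this.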