• RexRecruiting@alien.topB
    1 year ago

    My understanding is that they are basically datasets of questions with known answers that the model's output is compared against. Say you wanted to see how well you knew math: you would take a math test, and your answers would then be checked against an answer key…
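    As a rough sketch of that "compare against an answer key" idea: the function name `exact_match_accuracy` and the toy data below are made up for illustration, and real benchmarks generally use more careful scoring than a plain string comparison.

```python
def exact_match_accuracy(model_answers, answer_key):
    """Fraction of model answers that match the key after light normalization."""
    if len(model_answers) != len(answer_key):
        raise ValueError("model_answers and answer_key must have the same length")
    if not answer_key:
        return 0.0
    # Compare each answer to the key, ignoring case and surrounding whitespace.
    correct = sum(
        given.strip().lower() == expected.strip().lower()
        for given, expected in zip(model_answers, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical toy data, not taken from any real benchmark.
answer_key = ["4", "12", "7"]
model_answers = ["4", "13", " 7 "]
score = exact_match_accuracy(model_answers, answer_key)
print(f"accuracy: {score:.2f}")
```

    The score is just the share of questions answered correctly, which is roughly what a benchmark percentage reports.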

    Some of my notes about those benchmarks:

    GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers.

    HellaSwag is a large language model benchmark for commonsense reasoning.

    TruthfulQA is a benchmark that measures whether a language model is truthful when generating answers to questions.

    WinoGrande is a benchmark for commonsense reasoning.

  • ThisGonBHard@alien.topB
    1 year ago

    Nothing, sadly.

    Models are trained on the test questions to improve their scores, which makes the tests moot.