It’s no secret that many language models and fine-tunes are trained on datasets, many of which are generated with GPT models. The problem arises when “GPT-isms” end up in the dataset. I am not only referring to typical expressions like “however, it’s important to…” or “I understand your desire to…”, but also to the structure of the model’s responses. ChatGPT (and GPT models in general) tends to have a very predictable structure when in its “soulless assistant” mode, which makes it very easy to say “this is very GPT-like”.
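To make the phrase-level part of this concrete, here is a minimal sketch of how one might screen dataset samples for such expressions. The pattern list and the names `count_gpt_isms` and `filter_dataset` are hypothetical, chosen for illustration; the first two phrases come from the examples above, and the third is the familiar "As an AI language model" opener. It only catches wording, not the structural predictability described above.

```python
import re

# Hypothetical marker phrases (assumption: this list is illustrative, not exhaustive).
GPT_ISM_PATTERNS = [
    r"however,? (?:it[’']s|it is) important to",
    r"i understand your desire to",
    r"as an ai language model",
]
GPT_ISM_RE = re.compile("|".join(GPT_ISM_PATTERNS), re.IGNORECASE)


def count_gpt_isms(text: str) -> int:
    """Count how many marker phrases occur in a single sample."""
    return len(GPT_ISM_RE.findall(text))


def filter_dataset(samples, max_hits=0):
    """Keep only samples with at most `max_hits` marker phrases."""
    return [s for s in samples if count_gpt_isms(s) <= max_hits]


samples = [
    "However, it's important to remember safety.",
    "The dragon burned the village.",
    "As an AI language model I do not have an opinion.",
]
clean = filter_dataset(samples)
```

With the three samples shown, only the second one survives the filter. A phrase list like this is a blunt tool, so it is probably best treated as a first pass before manual review.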

What do you think about this? Oh, and by the way, forgive my English.

  • BackwardsPuzzleBox@alien.topB
    1 year ago

    The very idea of using GPT models to create datasets is such a mind-numbing, dumb, incestuous decision to begin with. Essentially the 21st-century version of creating a xerox of a xerox.

    In a lot of ways, it’s kind of heralding the future enshittification of AI, as dabblers think every problem can be automated away without human judgement or editorialisation.

  • Robot1me@alien.topB
    1 year ago

    What do you think about this?

    I think an interesting experiment is to edit an AI output message so that it starts with “As an AI language model” and then let the model continue the rest. If it completely loses character and just sounds like ChatGPT, that is quite telling.
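The experiment described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `generate` stands in for whatever completion function your local model exposes, and the plain "User:/Assistant:" layout is a hypothetical template, since each model expects its own prompt format.

```python
PREFIX = "As an AI language model"


def build_prefilled_prompt(system: str, user: str, prefix: str = PREFIX) -> str:
    """Build a prompt whose assistant turn already starts with `prefix`.

    Assumption: a generic plain-text chat layout; swap in your model's template.
    """
    return f"{system}\n\nUser: {user}\nAssistant: {prefix}"


def run_experiment(generate, system: str, user: str) -> str:
    """Let the model continue the prefilled reply and return the full reply."""
    prompt = build_prefilled_prompt(system, user)
    continuation = generate(prompt)  # hypothetical callable: prompt text -> continuation text
    return PREFIX + continuation


# Stub standing in for a real model, so the sketch runs on its own.
def fake_generate(prompt: str) -> str:
    return ", I cannot take on a persona."


reply = run_experiment(fake_generate, "You are Captain Mara, a gruff pirate.", "Who are you?")
```

Comparing the continuation with and without the prefilled opener, on the same character prompt, is what shows whether the model drops its persona.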

  • arekku255@alien.topB
    1 year ago

    As an AI language model I do not have an opinion on GPT-isms polluting datasets. However it is important to remember to respect other people and work together to achieve the optimal outcome.

  • noeda@alien.topB
    1 year ago

    I think the GPT-isms may be why my AI storywriting attempts tend to be overly positive and cliched. Not exactly a world-shattering problem, but it is annoying. *shakes fist*

    If I had to name a possibly serious problem, it’s that the biases OpenAI initially inserted into ChatGPT and their GPT models have now spread to the local models as well.

    It’s annoying because it feels like all models respond to questions in a similar way. Some are just a bit smarter than others or tuned to respond a bit differently.

    If the GPT-like data spreads around the Internet as well, then it might be difficult to avoid having it in training data unless you only include old data in your training.

  • stereoplegic@alien.topB
    1 year ago

    I’m more concerned with the community’s outsized reliance on/promotion of OAI-generated datasets and models trained on them. But then, commercial viability isn’t generally a concern when you want a spicy waifu.