I'm confused about some basics of preparing a dataset for training. My plan is to take a model like Llama 2 7B Chat and train it on some proprietary data I have (in its raw format, this data is very similar to a textbook). Do I need to find a way to reformat this large amount of text into a bunch of pairs like “query” and “output”?

I have seen some LLMs described as “trained on Wikipedia”, which suggests they were trained on that large chunk of text alone, without reformatting it into data pairs. Is there a way I can do that, too? Or, since I want to target a chat model, do I have to find a way to convert the data into pairs that serve as examples of proper input and output?

  • Glat0s@alien.topB
    1 year ago

    Below is a link to a sample I recently put together for myself to create a QA training dataset from source text with the llamaindex dataset generator.

    I used Oobabooga with the “openai” extension as the inference API (with a zephyr 7b model).

    It worked quite well for generating a dataset fully locally. One should use a smaller model in service_context and a larger one in service_context_large (which I haven’t done so far).

    Also, you have to change the beginning, where it currently reads in only a single file, “output/Test.txt”. You may also want to change chunk_size and num_questions_per_chunk.
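
The "read more than one file" and chunk-size adjustments could look roughly like the sketch below. This is not the code from the pastebin: `read_text_files` and `split_into_chunks` are hypothetical helper names, the `output` folder is taken from the path mentioned above, and the chunk size here counts whitespace-separated words, whereas llamaindex's chunk_size is (as far as I know) measured in tokens.

```python
from pathlib import Path


def read_text_files(folder, pattern="*.txt"):
    """Return {filename: contents} for every file in `folder` matching `pattern`."""
    files = sorted(Path(folder).glob(pattern))
    return {p.name: p.read_text(encoding="utf-8") for p in files}


def split_into_chunks(text, chunk_size=512):
    """Split `text` into chunks of at most `chunk_size` words.

    Assumption: words are whitespace-separated; a real pipeline would
    more likely count tokens with the model's tokenizer.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


if __name__ == "__main__":
    # Hypothetical usage: chunk every .txt file in the "output" folder.
    for name, text in read_text_files("output").items():
        chunks = split_into_chunks(text, chunk_size=512)
        print(name, len(chunks), "chunks")
```

Smaller chunks generally mean more, narrower questions per document; larger chunks give the question generator more context per call.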

    The output JSON consists of “input” and “output” (which I did for a Mistral model…). For Llama-based models I would maybe change it to “instruction”, “input” (=empty), “output”, “text” (=text chunk).
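
Converting the existing “input”/“output” records into the four-field layout suggested above could be sketched like this. The function name `to_instruction_record` is hypothetical, and mapping the old “input” (the generated question) onto “instruction” is an assumption about how the fields line up, not something taken from the pastebin.

```python
import json


def to_instruction_record(qa, chunk=""):
    """Map an {"input", "output"} pair to the four-field instruction layout.

    Assumption: the old "input" holds the generated question, so it
    becomes "instruction"; the new "input" is left empty.
    """
    return {
        "instruction": qa["input"],
        "input": "",
        "output": qa["output"],
        "text": chunk,  # the source text chunk the question was generated from
    }


if __name__ == "__main__":
    # Made-up example pair, only to show the resulting shape.
    pair = {"input": "What is a chunk?", "output": "A slice of the source text."}
    record = to_instruction_record(pair, chunk="...source text chunk...")
    print(json.dumps(record, indent=2))
```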

    Please keep in mind that this is only an ugly early prototype that needs cleanup etc…

    https://pastebin.com/cjF1eawK