Below is a link to a sample I’ve put together recently to create a QA training dataset from source text with the LlamaIndex dataset generator.
I’ve used Oobabooga with the “openai” extension as the inference API (with a Zephyr 7B model).
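For reference, pointing the OpenAI client library at the local Oobabooga server mostly comes down to overriding the API base URL before running the script. The exact port and path depend on how you start Oobabooga, so treat this as a sketch, not gospel:

```shell
# Point the openai client library at the local Oobabooga "openai" extension.
# Port and path are assumptions -- check your Oobabooga startup log for the real ones.
export OPENAI_API_BASE="http://127.0.0.1:5000/v1"
export OPENAI_API_KEY="dummy"   # the local server ignores the key, but the client wants one set
```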
It worked quite well for generating a dataset fully locally. Ideally one would use a smaller and a larger model in service_context and service_context_large (which I haven’t done so far).
Also you have to change the beginning, where it currently only reads in a single file, “output/Test.txt”. And maybe adjust chunk_size and num_questions_per_chunk.
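As a sketch of what that change could look like (the directory name and function are hypothetical, plain stdlib, no LlamaIndex specifics), reading every .txt file from the output/ directory instead of the single hard-coded one:

```python
from pathlib import Path

def load_source_texts(directory: str = "output") -> dict[str, str]:
    """Read every .txt file in `directory` instead of a single hard-coded file.

    Returns a mapping of file name -> file content, sorted by name so the
    resulting dataset order is reproducible.
    """
    texts = {}
    for path in sorted(Path(directory).glob("*.txt")):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts
```

The returned dict can then be fed chunk by chunk into the dataset generator in place of the single “output/Test.txt” read.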
The output JSON consists of “input” and “output” fields (which I did for a Mistral model…). For Llama-based models I would maybe change it to “instruction”, “input” (=empty), “output”, and “text” (=the text chunk).
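As an illustration of that remapping (the “input”/“output” field names come from the generated JSON; the Alpaca-style target layout and the helper names are my assumptions), the conversion could look like:

```python
import json

def remap_record(record: dict, text_chunk: str = "") -> dict:
    """Convert one {"input", "output"} pair into an Alpaca-style record
    for Llama-based fine-tuning (target field layout is an assumption)."""
    return {
        "instruction": record["input"],  # the generated question
        "input": "",                     # left empty, as suggested above
        "output": record["output"],      # the generated answer
        "text": text_chunk,              # the source text chunk the QA pair came from
    }

def remap_file(src: str, dst: str) -> None:
    """Remap a whole dataset file from the Mistral-style to the Llama-style layout."""
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump([remap_record(r) for r in records], f, ensure_ascii=False, indent=2)
```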
Please keep in mind that this is only an ugly early prototype that needs cleanup etc…
Yes. We are currently planning AI projects and implementations in my company. We handle sensitive data, which requires us to do it locally. And we want to set up a team and establish competence and more experience in ML/LLMs. For our current use cases we don’t need a “super intelligence” or “the best” LLM on the market! RAG with smaller models is totally fine and sufficient for us.