Hello fellow LLM enthusiasts!

I’m building a little code+chat CLI called gptme, which aims to leverage local LLMs to replicate the functionality of OpenAI’s “Advanced Data Analysis” (formerly known as “Code Interpreter”). It is similar in spirit to the more popular open-interpreter, which some of you might have heard of.

It works quite well at this point, and I use it in my day-to-day work with GPT-4 (which also helps me collect quality data). GPT-4 performs much better with tools than GPT-3.5-turbo, which in turn performs much better than the open/local models I’ve tested. In my experience, local models clearly struggle in this domain, which leads me to…

Now I’m interested in how I can use my existing conversation logs to fine-tune a local model for this use case. I’ve read a bit about fine-tuning completion models, but less about chat models.
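
For concreteness, my rough idea is to flatten each conversation into the chat-format JSONL that most fine-tuning pipelines expect (one `{"messages": [...]}` object per line). A minimal sketch, where the log path and per-line structure are assumptions about my own setup rather than anything final:

```python
import json
from pathlib import Path

# Rough sketch: turn saved conversations into chat fine-tuning data,
# one {"messages": [...]} object per line.
# NOTE: the log location and per-line structure below are assumptions,
# not necessarily the actual on-disk format.
LOG_DIR = Path("~/.local/share/gptme/logs").expanduser()
OUT_FILE = Path("finetune_data.jsonl")

with OUT_FILE.open("w") as out:
    for log_file in LOG_DIR.glob("*/conversation.jsonl"):
        with log_file.open() as f:
            messages = [json.loads(line) for line in f if line.strip()]
        # keep only the fields the training format expects
        messages = [{"role": m["role"], "content": m["content"]} for m in messages]
        if len(messages) >= 2:  # skip empty/trivial conversations
            out.write(json.dumps({"messages": messages}) + "\n")
```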

I’m hoping fine-tuning can improve the model’s general performance, in large part by making it better at following the prompt and using tools, but also by familiarizing it with the process of interactively running and debugging code (where output is fed back, possibly with errors to address). I also hope it will reduce the need for a verbose system prompt (“standardizing” the tools in the training data), saving on context.
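
To make that concrete, a single training sample for the run-and-debug loop might look roughly like this (the tool syntax and the use of a system message for fed-back output are just illustrative, not a final format):

```json
{"messages": [
  {"role": "system", "content": "You are gptme, an assistant that can execute code locally..."},
  {"role": "user", "content": "Plot the 'price' column from data.csv"},
  {"role": "assistant", "content": "Sure, let's try:\n\n`df = pd.read_csv('data.csv'); df['price'].plot()`"},
  {"role": "system", "content": "Output:\nKeyError: 'price'"},
  {"role": "assistant", "content": "Hmm, the column must be named differently. Let me check `df.columns` first."}
]}
```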

A curious anecdote from using gptme today: I was surprised to see gpt-3.5-turbo suddenly reply as if it were OpenAI’s “Advanced Data Analysis”, with support for writing to and serving files from /mnt. I use a messy system prompt that outlines the available tools with examples, but nothing that mentions this! It suggests to me that training models directly on such instructions is a good way to go, and removes the need to include them in the system prompt.

So, does anyone have experience fine-tuning chat+code models for something similar? Any good guides/tools out there that I’ve missed in my search?

Thank you all in advance! Looking forward to reading your replies.