• 0 Posts
  • 14 Comments
Joined 1 year ago
Cake day: October 30th, 2023

  • Doing a full fine-tune on Mistral 7B is the only way I’ve gotten human, literary text out of any of these models. Occasionally the vanilla Llama-2 70B will output something great. Yi-34B-Chat, while not by default a literary writer (it’s got that clunky, purple-prose, GPT-4 feel to it), impressed me with its ability to write in a requested style.

    The old Guanaco 33B and 65B models produced nice prose, but unfortunately they only have 2048 tokens of context and they weren’t the best at following instructions.


  • Sure.

    I’m using an instruct style dataset with a system field (in Axolotl I use either the orcamini dataset type or chatml). I’ve then collated a bunch of writing that I like (up to 4096 tokens in length) and then reverse prompted it in an LLM to create instructions. So, for example, one sample might have a system field that is “You are a professional author with a raw, visceral writing style” or “You are an AI built for storytelling.” Then the instruction might be “write a short story about X that touches on themes of Y and Z, write in the style of W.” Or the instruction might be a more detailed template, setting out genre, plot, characters, scene description, POV, etc. Then the response is the actual piece. My dataset also includes some contemporary non-rhyming poetry, some editing/rephrasing samples, and some literary analysis.
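    As a concrete (entirely invented) illustration of the dataset shape described above: the field names below follow a common Axolotl-style system/instruction/output layout, but the exact schema depends on the dataset type you configure, so treat this as a sketch rather than the author’s actual format.

```python
import json

# Hypothetical sample record for an instruct-style writing dataset.
# The system field sets the persona, the instruction reverse-prompts
# the piece, and the output is the actual writing sample.
sample = {
    "system": "You are a professional author with a raw, visceral writing style.",
    "instruction": (
        "Write a short story about a lighthouse keeper that touches on themes "
        "of isolation and memory. Write in a literary style."
    ),
    "output": "The lamp turned, as it always had, and the sea said nothing back.",
}

# Datasets like this are typically stored one JSON object per line (JSONL).
with open("writing_dataset.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")
```

    Each line of the resulting JSONL file is one training sample; long pieces (up to the 4096-token limit mentioned above) simply go in the output field.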

    I have three datasets. A small one that is purely top quality writing in a dataset structured as above, a middle sized one that also works in some fiction-focused synthetic GPT-4 data I’ve generated myself and curated from other datasets, and a larger one that also incorporates conversational responses derived from a dataset that is entirely Claude generated.

    I’ve then run a full fine-tune on Mistral with those datasets using Axolotl on RunPod, using either 2 or 3 A100s.

    I find utilising a system prompt very beneficial – it seems to help build associative connections.

    Overall results have been pretty good. The larger dataset model is a great all-round writer and still generalises well. The smaller dataset model produces writing that is literary, verbose, and pretty.

    I’ve also had some success training on Zephyr as a base model. It helps to give underlying structure and coherence. Finding the right balance between writing pretty and long, and having enough underlying reasoning to sustain coherence, has been the key challenge for me.


  • Out of the box, I actually find the vanilla Llama-2 70B chat model produces the most natural prose, if prompted correctly. Long Alpaca 70B is also good at following style if you feed it a chunk of writing.

    But the best results I’ve had have come from fine-tuning Mistral 7B myself. Mistral writes crazy good if trained right, though it can get muddled at longer contexts.


  • With a full finetune I don’t think so – the LIMA paper showed that 1000 high-quality samples is enough with a 65B model. With QLoRA and LoRA, I don’t know. The number of parameters you’re affecting is set by the rank you choose. It’s important to get the balance between the rank, dataset size, and learning rate right. Style and structure are easy to impart, but other things not so much. I often wonder how clean the merge process actually is. I’m still learning.
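    As a rough sketch of how rank sets the number of affected parameters: for each adapted weight matrix of shape (d_out, d_in), LoRA trains two low-rank factors totalling r·(d_in + d_out) parameters. The dimensions below are illustrative, Llama-ish numbers, not the exact modules any particular trainer targets.

```python
# LoRA adds factors A (r x d_in) and B (d_out x r) per adapted matrix,
# so the trainable-parameter count scales linearly with the rank r.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    return rank * (d_in + d_out)

hidden = 8192           # hypothetical hidden size, roughly 70B-class
layers = 80             # number of transformer blocks
targets_per_layer = 4   # say, the q/k/v/o attention projections only

for rank in (8, 64, 256):
    total = layers * targets_per_layer * lora_params(hidden, hidden, rank)
    print(f"rank {rank:>3}: ~{total / 1e6:.0f}M trainable parameters")
```

    Even at rank 256 under these assumptions you are training on the order of a billion parameters, a small fraction of a 70B model – one intuition for why a LoRA can’t match a full fine-tune’s impact.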


  • I’ve been training a lot lately, mostly on RunPod, a mix of fine-tuning Mistral 7B and training LoRAs and QLoRAs on 34Bs and 70Bs. My main takeaway is that the LoRA outcomes are just… not so great. Whereas I’m very happy with the Mistral fine-tunes.

    I mean, it’s fantastic we can tinker with a 70B at all, but it doesn’t matter how good your dataset is, you just can’t have the same impact as you can with a full finetune. I think this is why model merging/frankensteining has become popular, it’s an expression of the limitations of LoRA training.

    Personally, I have high hopes for a larger Mistral model (in the 13-20B range) that we can still do a full fine-tune on. Right now, between my own specific tunes of Mistral and some of the recent external tunes like Starling, I feel like I’m close to having the tools I want/need. But Mistral is still 7B; it doesn’t matter how well it’s tuned, it will still get a little muddled at times, particularly with longer-term dependencies.




    1. I think you should be okay. I’ve been doing full fine-tunes on Mistral using either two A100s (80GB each) with a total batch size of 8, or three A100s (80GB each) with a batch size of 12. This is using the 8-bit Adam optimizer and training with a max sequence length of 4096, with lots of long samples. I think it should be possible to do a full fine-tune on a single 80GB A100, but I haven’t tried. This is without DeepSpeed; I’ve done a few DeepSpeed runs and that significantly lowers VRAM usage.
    2. RunPod is what I’ve been using and it’s straightforward.
    3. Instilling new knowledge is possible with a full fine-tune, much less possible with a LoRA. I can’t comment on languages; people seem to have had mixed success creating multilingual models.
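    The VRAM arithmetic behind point 1 can be sketched roughly as follows, assuming bf16 weights and gradients plus 8-bit Adam states (two states at about one byte per parameter, ignoring the small block-wise quantisation overhead). Activations, buffers, and fragmentation come on top, so treat the total as a floor, not a prediction.

```python
# Back-of-the-envelope memory estimate for a full fine-tune of a ~7B model
# with bf16 weights/gradients and an 8-bit Adam optimizer.

def full_finetune_gb(n_params: float) -> dict:
    gb = 1024 ** 3
    weights = n_params * 2 / gb   # bf16: 2 bytes per parameter
    grads = n_params * 2 / gb     # bf16 gradients
    optim = n_params * 2 / gb     # 8-bit Adam: 2 states x ~1 byte each
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optim,
        "total": weights + grads + optim,
    }

est = full_finetune_gb(7.24e9)  # Mistral 7B has roughly 7.24B parameters
for part, size in est.items():
    print(f"{part:>9}: {size:5.1f} GB")
```

    This lands around 40GB before activations, which is consistent with a single 80GB A100 being plausible for a full fine-tune at this scale.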