• 0 Posts
  • 14 Comments
Joined 1 year ago
Cake day: October 30th, 2023

  • Doing a full fine-tune on Mistral 7B is the only way I’ve gotten human, literary text out of any of these models. Occasionally the vanilla Llama-2 70B will output something great. Yi-34B-Chat, while not by default a literary writer (it’s got that clunky, purple-prose, GPT-4 feel to it), impressed me with its ability to write in a requested style.

    The old Guanaco 33B and 65B models produced nice prose, but unfortunately they only have 2048 tokens of context and they weren’t the best at following instructions.


  • Sure.

    I’m using an instruct style dataset with a system field (in Axolotl I use either the orcamini dataset type or chatml). I’ve then collated a bunch of writing that I like (up to 4096 tokens in length) and then reverse prompted it in an LLM to create instructions. So, for example, one sample might have a system field that is “You are a professional author with a raw, visceral writing style” or “You are an AI built for storytelling.” Then the instruction might be “write a short story about X that touches on themes of Y and Z, write in the style of W.” Or the instruction might be a more detailed template, setting out genre, plot, characters, scene description, POV, etc. Then the response is the actual piece. My dataset also includes some contemporary non-rhyming poetry, some editing/rephrasing samples, and some literary analysis.
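    As a concrete (entirely invented) illustration of the dataset shape described above: the field names below follow a common Axolotl-style system/instruction/output layout, but the exact schema depends on the dataset type you configure, so treat this as a sketch rather than the author’s actual format.

```python
import json

# Hypothetical sample record for an instruct-style writing dataset.
# The system field sets the persona, the instruction reverse-prompts
# the piece, and the output is the actual writing sample.
sample = {
    "system": "You are a professional author with a raw, visceral writing style.",
    "instruction": (
        "Write a short story about a lighthouse keeper that touches on themes "
        "of isolation and memory. Write in a literary style."
    ),
    "output": "The lamp turned, as it always had, and the sea said nothing back.",
}

# Datasets like this are typically stored one JSON object per line (JSONL).
with open("writing_dataset.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")
```

    Each line of the resulting JSONL file is one training sample; long pieces (up to the 4096-token limit mentioned above) simply go in the output field.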

    I have three datasets. A small one that is purely top quality writing in a dataset structured as above, a middle sized one that also works in some fiction-focused synthetic GPT-4 data I’ve generated myself and curated from other datasets, and a larger one that also incorporates conversational responses derived from a dataset that is entirely Claude generated.

    I’ve then run a full fine-tune on Mistral with those datasets using Axolotl on RunPod, using either 2 or 3 A100s.

    I find utilising a system prompt very beneficial – it seems to help build associative connections.

    Overall results have been pretty good. The larger dataset model is a great all-round writer and still generalises well. The smaller dataset model produces writing that is literary, verbose, and pretty.

    I’ve also had some success training on Zephyr as a base model. It helps to give underlying structure and coherence. Finding the right balance between writing pretty and long, and having enough underlying reasoning to sustain coherence, has been the key challenge for me.


  • Out of the box, I actually find the vanilla Llama-2 70B chat model produces the most natural prose, if prompted correctly. Long Alpaca 70B is also good at following style if you feed it a chunk of writing.

    But the best results I’ve had have come from fine-tuning Mistral 7B myself. Mistral writes crazy good if trained right, though it can get muddled at longer contexts.


  • With a full finetune I don’t think so – the LIMA paper showed that 1000 high-quality samples is enough with a 65B model. With QLoRA and LoRA, I don’t know. The number of parameters you’re affecting is set by the rank you choose. It’s important to get the balance between the rank, dataset size, and learning rate right. Style and structure are easy to impart, but other things not so much. I often wonder how clean the merge process actually is. I’m still learning.
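    As a rough sketch of how rank sets the number of affected parameters: for each adapted weight matrix of shape (d_out, d_in), LoRA trains two low-rank factors totalling r·(d_in + d_out) parameters. The dimensions below are illustrative, Llama-ish numbers, not the exact modules any particular trainer targets.

```python
# LoRA adds factors A (r x d_in) and B (d_out x r) per adapted matrix,
# so the trainable-parameter count scales linearly with the rank r.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    return rank * (d_in + d_out)

hidden = 8192           # hypothetical hidden size, roughly 70B-class
layers = 80             # number of transformer blocks
targets_per_layer = 4   # say, the q/k/v/o attention projections only

for rank in (8, 64, 256):
    total = layers * targets_per_layer * lora_params(hidden, hidden, rank)
    print(f"rank {rank:>3}: ~{total / 1e6:.0f}M trainable parameters")
```

    Even at rank 256 under these assumptions you are training on the order of a billion parameters, a small fraction of a 70B model – one intuition for why a LoRA can’t match a full fine-tune’s impact.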


  • I’ve been training a lot lately, mostly on RunPod, a mix of fine-tuning Mistral 7B and training LoRAs and QLoRAs on 34Bs and 70Bs. My main takeaway is that the LoRA outcomes are just… not so great. Whereas I’m very happy with the Mistral fine-tunes.

    I mean, it’s fantastic we can tinker with a 70B at all, but it doesn’t matter how good your dataset is, you just can’t have the same impact as you can with a full finetune. I think this is why model merging/frankensteining has become popular, it’s an expression of the limitations of LoRA training.

    Personally, I have high hopes for a larger Mistral model (in the 13-20B range) that we can still do a full fine-tune on. Right now, between my own specific tunes of Mistral and some of the recent external tunes like Starling, I feel like I’m close to having the tools I want/need. But Mistral is still 7B; it doesn’t matter how well it’s tuned, it will still get a little muddled at times, particularly with longer-term dependencies.




    1. I think you should be okay. I’ve been doing full fine-tunes on Mistral using either two A100s (80GB each) with a total batch size of 8, or three A100s (80GB each) with a batch size of 12. This is using the 8-bit Adam optimizer and training with a max sequence length of 4096, with lots of long samples. I think it should be possible to do a full fine-tune on a single 80GB A100, but I haven’t tried. This is without DeepSpeed; I’ve done a few DeepSpeed runs and that significantly lowers VRAM usage.
    2. RunPod is what I’ve been using and it’s straightforward.
    3. Instilling new knowledge is possible with a full fine-tune, much less possible with a LoRA. I can’t comment on languages; people seem to have had mixed success creating multilingual models.
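    The VRAM arithmetic behind point 1 can be sketched roughly as follows, assuming bf16 weights and gradients plus 8-bit Adam states (two states at about one byte per parameter, ignoring the small block-wise quantisation overhead). Activations, buffers, and fragmentation come on top, so treat the total as a floor, not a prediction.

```python
# Back-of-the-envelope memory estimate for a full fine-tune of a ~7B model
# with bf16 weights/gradients and an 8-bit Adam optimizer.

def full_finetune_gb(n_params: float) -> dict:
    gb = 1024 ** 3
    weights = n_params * 2 / gb   # bf16: 2 bytes per parameter
    grads = n_params * 2 / gb     # bf16 gradients
    optim = n_params * 2 / gb     # 8-bit Adam: 2 states x ~1 byte each
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optim,
        "total": weights + grads + optim,
    }

est = full_finetune_gb(7.24e9)  # Mistral 7B has roughly 7.24B parameters
for part, size in est.items():
    print(f"{part:>9}: {size:5.1f} GB")
```

    This lands around 40GB before activations, which is consistent with a single 80GB A100 being plausible for a full fine-tune at this scale.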