• 0 Posts
  • 14 Comments
Joined 11 months ago
Cake day: October 30th, 2023

  • What I have so far is such a hacky mess that I’m nowhere near comfortable uploading yet. But I’ve been trying to put together an automated system to do something similar. Basically, toss books and journal articles into a folder by subject, get dataset out in the morning.
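    The "toss files in a folder, get a dataset out" part can be sketched pretty simply. This is a minimal, hypothetical version of the intake step, assuming one subfolder per subject and plain-text/PDF/EPUB sources; the real pipeline described here is more involved.

```python
from pathlib import Path

def collect_sources(root):
    """Group source files by subject, where each subject is a
    subfolder of `root` (e.g. root/biology/, root/history/)."""
    sources = {}
    for subject_dir in sorted(Path(root).iterdir()):
        if not subject_dir.is_dir():
            continue
        files = sorted(
            p for p in subject_dir.iterdir()
            if p.suffix.lower() in {".txt", ".pdf", ".epub"}
        )
        if files:  # skip empty subject folders
            sources[subject_dir.name] = files
    return sources
```

    From there a nightly cron job could walk the returned dict, run each file through text extraction and the model, and write the results out per subject.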

    The bad news is that I didn’t have much luck with any of the local models in their default state. But that was months ago so things might have improved. And I only tested what I thought were promising models. Wouldn’t shock me to find that I just missed something that would have worked right out of the box. Also wouldn’t shock me if someone hopped on and just pointed you to a model that works perfectly for this.

    That said, here’s what’s mostly worked for me. I just made a dataset with 100 to 150 or so examples for the subjects I was working with. Basically a dataset filled with examples of how to make a dataset.

    Sounds mind-numbingly tedious, I know. But a handful a day and it’s over pretty quick. I made a particular point of including a lot of examples where the source data was messed up: poorly converted PDFs, text with image file names scattered through it, page numbers littering the text, etc. That way the model saw the worst it might encounter alongside the cleaner examples.
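    If hand-collecting messy samples gets old, one way to stretch the dataset is to inject that same kind of junk into clean passages yourself. This is just a sketch; the specific artifact patterns (lone page numbers, orphaned image file names) are my reading of the garbage described above.

```python
import random

def add_artifacts(text, seed=0):
    """Inject the kinds of junk that bad PDF conversions leave behind:
    stray page numbers and orphaned image file names."""
    rng = random.Random(seed)  # fixed seed keeps variants reproducible
    noisy = []
    for line in text.splitlines():
        noisy.append(line)
        if rng.random() < 0.3:  # sprinkle an artifact after ~30% of lines
            noisy.append(rng.choice([
                str(rng.randint(1, 400)),            # lone page number
                f"figure_{rng.randint(1, 20)}.png",  # orphaned image name
            ]))
    return "\n".join(noisy)
```

    Pairing the noisy version as the input with the clean, hand-written records as the output gives you extra "worst case" training rows for cheap.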

    Made the instruction prompt something like “Format the following text about x into alpaca formatted json. Here’s the text:” followed by the text. Then put the JSON data I’d want in the output field. Then I did some additional training with that dataset on a few models. That was enough to vastly improve the results.
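    Concretely, each row of the "dataset about making datasets" looks something like this. The template wording and field names are my reconstruction of the standard alpaca instruction/input/output layout, not the exact files used here.

```python
import json

# Hypothetical reconstruction of the prompt described above.
INSTRUCTION_TEMPLATE = (
    "Format the following text about {subject} into alpaca formatted "
    "json. Here's the text:"
)

def make_training_example(subject, source_text, target_records):
    """One training row: messy source text in the input field,
    the hand-written alpaca records (as a JSON string) in the output."""
    return {
        "instruction": INSTRUCTION_TEMPLATE.format(subject=subject),
        "input": source_text,
        "output": json.dumps(target_records, indent=2),
    }
```

    A hundred-plus of these, saved as a JSON list, is the whole fine-tuning set.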

    Up until very recently the best results I got with that method were with dolphin 2.0 mistral 7B and Microsoft’s Orca-2 13b. Not perfect, but I’m only hand-editing the generated dataset for about ten minutes per textbook. Sometimes less, sometimes more.

    The big shock, though, was Capybara Tess Yi 34b 200k. I’ve only run a couple of tests, so this might be a fluke, but after training with the same dataset I was getting perfect results. Something I’d never seen before from anything other than GPT-4. I’m finishing up a big batch with the 13b model right now, though, so I haven’t had a chance to swap the 34b into the automation and see if it lives up to that outside the quick test run. It’s also worth noting that I never tried dataset generation with Capybara Tess Yi 34b 200k in its normal state, without my extra training applied. It might very well be just as perfect by default. So if you’re testing models, that’s the one I’d point to as a possible solution that wouldn’t require any extra work.

    So yeah, in short my advice is to just make a dataset with about 100 to 150 examples of how to make a dataset. Going by my results that should be enough to get you pretty close to what you’re looking for.





  • Oh yeah, you’re absolutely going to want to go with a llama2 model over the options you’ve looked at already. The only one of them I have direct experience with is GPT-2. But even the worst llama models I’ve seen feel like night and day compared to GPT-2.

    Personally, I think you’d be best off going with a combination of fine-tuning with your own data and using RAG in order to get as far away from hallucinations as possible. Not everyone agrees, but I think that both in tandem is the way to go.
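    To make the "both in tandem" idea concrete, here’s a toy version of the retrieval half. Real setups score chunks with embeddings rather than word overlap, but the prompt-stuffing step that keeps the model grounded in your documents looks the same. Everything here is illustrative, not a specific library’s API.

```python
def retrieve(query, chunks, k=2):
    """Score each document chunk by word overlap with the query
    and return the top-k most relevant ones."""
    q = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, chunks):
    """Stuff the retrieved chunks into the prompt so the (fine-tuned)
    model answers from your data instead of hallucinating."""
    context = "\n".join(retrieve(query, chunks))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

    Fine-tuning teaches the model your domain’s style and vocabulary; retrieval pins each individual answer to actual source text. That’s why the two together beat either alone for fighting hallucinations.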

    I think that the language is going to be the larger issue. This is just conjecture on my part, but I suspect that a powerful model trained only on ‘your’ Dutch data, and otherwise focused on English, would probably end up performing worse on Dutch prompts than a less capable model trained on large amounts of miscellaneous Dutch-language data in addition to your own.

    I remember a Dutch 7b model was released fairly recently, created from a base llama2 chat model. That means it probably also has a lot of the “corporate” tone most people here are trying to avoid. But given the context, I think that might actually be an advantage for you: being safe for work/school is probably a bit of a priority.

    7b also has the advantage of being very light on resource usage. And I mean very, very, light. I’ve been using a 7b model for some automated tasks on spare hardware that doesn’t even have a GPU. It’s entirely running on an ancient CPU. And while slow, it’s not unbearably so.









  • Creating alpaca formatted json data from big blocks of text that often have a lot of garbage in them. The untrained orca 3b model couldn’t stick to the format even when I provided it as an example in the instructions. But it did great after training on a small dataset of about 100 examples or so.

    It’s still a bit early to call it a total success, since I’ve only run it through a handful of tests on similar blocks of text. But just the fact that it’s grabbing facts from the text and correctly formulating prompts around them is really impressive to me. 13b trained on the same dataset is, unsurprisingly, still quite a bit better. But 3b is still doing far, far better than I would have thought possible. It’d be really cool to get a little scraping pipeline going with next to no resource use.
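    For a pipeline like that, it helps to machine-check the model’s output before any hand-editing, since format drift is exactly where the small models fail. A minimal checker might look like this, assuming the standard three-field alpaca record; the function name and error strings are mine.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def check_alpaca(raw):
    """Parse model output and return (records, errors), flagging any
    record that isn't a proper alpaca-style dict."""
    try:
        records = json.loads(raw)
    except json.JSONDecodeError as e:
        return [], [f"not valid json: {e}"]
    if not isinstance(records, list):
        return [], ["top level must be a list of records"]
    errors = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or set(rec) != REQUIRED_KEYS:
            errors.append(f"record {i}: wrong keys")
        elif not rec["instruction"].strip():
            errors.append(f"record {i}: empty instruction")
    return records, errors
```

    Clean batches go straight through; anything flagged gets queued for the ten minutes of hand-editing instead of silently polluting the dataset.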


  • I was extraordinarily skeptical of the utility of 3b models until about a day ago, when I gave orca mini a fair shot, in particular by training it on one specialized task. The results honestly floored me.

    All of which is to say that I’m VERY excited to see this. I really think 3B models can be something of a perfect Swiss Army knife: compact and always available. Multimodal capabilities are a perfect fit for exactly that kind of use. Can’t wait to give this a shot!