• 0 Posts
  • 14 Comments
Joined 11 months ago
Cake day: October 30th, 2023

  • What I have so far is such a hacky mess that I’m nowhere near comfortable uploading yet. But I’ve been trying to put together an automated system to do something similar. Basically, toss books and journal articles into a folder by subject, get dataset out in the morning.
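    The "toss files in a folder, get a dataset out" part can be sketched pretty simply. This is a minimal, hypothetical version of the intake step, assuming one subfolder per subject and plain-text/PDF/EPUB sources; the real pipeline described here is more involved.

```python
from pathlib import Path

def collect_sources(root):
    """Group source files by subject, where each subject is a
    subfolder of `root` (e.g. root/biology/, root/history/)."""
    sources = {}
    for subject_dir in sorted(Path(root).iterdir()):
        if not subject_dir.is_dir():
            continue
        files = sorted(
            p for p in subject_dir.iterdir()
            if p.suffix.lower() in {".txt", ".pdf", ".epub"}
        )
        if files:  # skip empty subject folders
            sources[subject_dir.name] = files
    return sources
```

    From there a nightly cron job could walk the returned dict, run each file through text extraction and the model, and write the results out per subject.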

    The bad news is that I didn’t have much luck with any of the local models in their default state. But that was months ago so things might have improved. And I only tested what I thought were promising models. Wouldn’t shock me to find that I just missed something that would have worked right out of the box. Also wouldn’t shock me if someone hopped on and just pointed you to a model that works perfectly for this.

    That said, here’s what’s mostly worked for me. I just made a dataset with 100 to 150 or so examples for the subjects I was working with. Basically a dataset filled with examples of how to make a dataset.

    Sounds mind-numbingly tedious, I know. But a handful a day and it’s over pretty quick. I made a particular point of including a lot of examples where the source data was messed up: poorly converted PDFs, text with image file names scattered through it, page numbers littering the text, etc. That way the model saw the worst it might encounter alongside the cleaner examples.
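    If hand-collecting messy samples gets old, one way to stretch the dataset is to inject that same kind of junk into clean passages yourself. This is just a sketch; the specific artifact patterns (lone page numbers, orphaned image file names) are my reading of the garbage described above.

```python
import random

def add_artifacts(text, seed=0):
    """Inject the kinds of junk that bad PDF conversions leave behind:
    stray page numbers and orphaned image file names."""
    rng = random.Random(seed)  # fixed seed keeps variants reproducible
    noisy = []
    for line in text.splitlines():
        noisy.append(line)
        if rng.random() < 0.3:  # sprinkle an artifact after ~30% of lines
            noisy.append(rng.choice([
                str(rng.randint(1, 400)),            # lone page number
                f"figure_{rng.randint(1, 20)}.png",  # orphaned image name
            ]))
    return "\n".join(noisy)
```

    Pairing the noisy version as the input with the clean, hand-written records as the output gives you extra "worst case" training rows for cheap.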

    Made the instruction prompt something like “Format the following text about x into alpaca formatted json. Here’s the text:” followed by the text. Then put the JSON data I’d want in the output field. Then I did some additional training with that dataset on a few models. That was enough to vastly improve the results.
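    Concretely, each row of the "dataset about making datasets" looks something like this. The template wording and field names are my reconstruction of the standard alpaca instruction/input/output layout, not the exact files used here.

```python
import json

# Hypothetical reconstruction of the prompt described above.
INSTRUCTION_TEMPLATE = (
    "Format the following text about {subject} into alpaca formatted "
    "json. Here's the text:"
)

def make_training_example(subject, source_text, target_records):
    """One training row: messy source text in the input field,
    the hand-written alpaca records (as a JSON string) in the output."""
    return {
        "instruction": INSTRUCTION_TEMPLATE.format(subject=subject),
        "input": source_text,
        "output": json.dumps(target_records, indent=2),
    }
```

    A hundred-plus of these, saved as a JSON list, is the whole fine-tuning set.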

    Up until very recently the best results I got with that method were with dolphin 2.0 mistral 7B and Microsoft’s Orca-2 13b. Not perfect, but I’m only hand-editing the generated dataset for about ten minutes per textbook. Sometimes less, sometimes more.

    The big shock, though, was Capybara Tess Yi 34b 200k. I’ve only run a couple of tests, so this might be a fluke, but after training with the same dataset I was getting perfect results. Something I’d never seen before from anything other than GPT-4. I’m finishing up a big batch with the 13b model right now, though, so I haven’t had a chance to swap the 34b into the automation and see if it lives up to that outside the quick test run. It’s also worth noting that I never tried dataset generation with Capybara Tess Yi 34b 200k in its normal state, without my extra training applied. It might very well be just as perfect by default. So if you’re testing models, that’s the one I’d point to as a possible solution that wouldn’t require any extra work.

    So yeah, in short my advice is to just make a dataset with about 100 to 150 examples of how to make a dataset. Going by my results that should be enough to get you pretty close to what you’re looking for.





  • Oh yeah, you’re absolutely going to want to go with a llama2 model over the options you’ve looked at already. The only one of them I have direct experience with is GPT-2. But even the worst llama models I’ve seen feel like night and day compared to GPT-2.

    Personally, I think you’d be best off going with a combination of fine-tuning with your own data and using RAG in order to get as far away from hallucinations as possible. Not everyone agrees, but I think that both in tandem is the way to go.
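    To make the "both in tandem" idea concrete, here’s a toy version of the retrieval half. Real setups score chunks with embeddings rather than word overlap, but the prompt-stuffing step that keeps the model grounded in your documents looks the same. Everything here is illustrative, not a specific library’s API.

```python
def retrieve(query, chunks, k=2):
    """Score each document chunk by word overlap with the query
    and return the top-k most relevant ones."""
    q = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, chunks):
    """Stuff the retrieved chunks into the prompt so the (fine-tuned)
    model answers from your data instead of hallucinating."""
    context = "\n".join(retrieve(query, chunks))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

    Fine-tuning teaches the model your domain’s style and vocabulary; retrieval pins each individual answer to actual source text. That’s why the two together beat either alone for fighting hallucinations.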

    I think that the language is going to be the larger issue. This is just conjecture on my part, but I suspect that a powerful model trained only on ‘your’ Dutch data, and otherwise focused on English, would probably end up performing worse on Dutch prompts than a less capable model trained on large amounts of miscellaneous Dutch-language data in addition to your own.

    I remember a Dutch 7b model was released fairly recently, created from a base llama2 chat model. That means it probably also has a lot of the “corporate” tone most people here are trying to avoid. But given the context, I think that might actually be an advantage for you: being safe for work/school is probably a bit of a priority.

    7b also has the advantage of being very light on resource usage. And I mean very, very, light. I’ve been using a 7b model for some automated tasks on spare hardware that doesn’t even have a GPU. It’s entirely running on an ancient CPU. And while slow, it’s not unbearably so.









  • Creating alpaca formatted json data from big blocks of text that often have a lot of garbage in them. The untrained orca 3b model couldn’t stick to the format even when I provided it as an example in the instructions. But it did great after training on a small dataset of about 100 examples or so.

    It’s still a bit early to call it a total success, since I’ve only run it through a handful of tests on similar blocks of text. But just the fact that it’s grabbing facts from the text and correctly formulating prompts around them is really impressive to me. 13b trained on the same dataset is, unsurprisingly, still quite a bit better. But 3b is still doing far, far better than I would have thought possible. It’d be really cool to get a little scraping pipeline going with next to no resource use.
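    For a pipeline like that, it helps to machine-check the model’s output before any hand-editing, since format drift is exactly where the small models fail. A minimal checker might look like this, assuming the standard three-field alpaca record; the function name and error strings are mine.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def check_alpaca(raw):
    """Parse model output and return (records, errors), flagging any
    record that isn't a proper alpaca-style dict."""
    try:
        records = json.loads(raw)
    except json.JSONDecodeError as e:
        return [], [f"not valid json: {e}"]
    if not isinstance(records, list):
        return [], ["top level must be a list of records"]
    errors = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or set(rec) != REQUIRED_KEYS:
            errors.append(f"record {i}: wrong keys")
        elif not rec["instruction"].strip():
            errors.append(f"record {i}: empty instruction")
    return records, errors
```

    Clean batches go straight through; anything flagged gets queued for the ten minutes of hand-editing instead of silently polluting the dataset.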


  • I was extraordinarily skeptical of the utility of 3b models until about a day ago, when I gave orca mini a fair shot, in particular by training it on one specialized task. The results honestly floored me.

    All of which is to say that I’m VERY excited to see this. I really think 3B models can be something of a perfect Swiss Army knife: compact and always available. Multimodal capabilities are a perfect fit for exactly that kind of use. Can’t wait to give this a shot!