Hello AI enthusiasts! I'm trying to find out whether there's a magical LLM out there that's been fine-tuned or specifically trained for one particular purpose: transforming raw documents into ready-to-use datasets. Picture this: you feed it a document, and voilà, it outputs a perfectly structured dataset! Has anyone encountered such a specialized model?

  • toothpastespiders@alien.topB · 10 months ago

    What I have so far is such a hacky mess that I'm nowhere near comfortable uploading it yet. But I've been trying to put together an automated system to do something similar: basically, toss books and journal articles into a folder by subject and get a dataset out in the morning. The rough shape of it is sketched below.
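
    To give a rough idea of that shape, here's a minimal sketch of the batch loop. It's hypothetical, stdlib-only Python: generate_rows() is a stub standing in for whatever model call you end up wiring in, and the one-subject-per-folder layout is just my assumption.

        import json
        from pathlib import Path

        SOURCE_DIR = Path("inbox")      # assumed layout: inbox/<subject>/<doc>.txt
        OUTPUT_DIR = Path("datasets")

        def generate_rows(text: str, subject: str) -> list[dict]:
            """Stub for the actual model call that turns raw text into
            instruction/input/output rows. Swap in your own backend."""
            raise NotImplementedError

        def run_batch() -> None:
            OUTPUT_DIR.mkdir(exist_ok=True)
            for subject_dir in sorted(SOURCE_DIR.iterdir()):
                if not subject_dir.is_dir():
                    continue
                out_path = OUTPUT_DIR / f"{subject_dir.name}.jsonl"
                with out_path.open("w", encoding="utf-8") as out:
                    for doc in sorted(subject_dir.glob("*.txt")):
                        text = doc.read_text(encoding="utf-8", errors="replace")
                        for row in generate_rows(text, subject_dir.name):
                            out.write(json.dumps(row, ensure_ascii=False) + "\n")

        if __name__ == "__main__":
            run_batch()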

    The bad news is that I didn’t have much luck with any of the local models in their default state. But that was months ago so things might have improved. And I only tested what I thought were promising models. Wouldn’t shock me to find that I just missed something that would have worked right out of the box. Also wouldn’t shock me if someone hopped on and just pointed you to a model that works perfectly for this.

    That said, here's what's mostly worked for me. I made a dataset of roughly 100 to 150 examples for the subjects I was working with: basically, a dataset filled with examples of how to make a dataset.

    Sounds mind-numbingly tedious, I know. But a handful a day and it's over pretty quickly. I made a point of including a lot of examples where the source data was messed up: poorly converted PDFs, text with image file names scattered through it, page numbers littering the text, etc. That way I was giving it the worst it might encounter as well as the more optimal examples.

    Made the instruction prompt something like "Format the following text about x into alpaca formatted json. Here's the text:" followed by the text. Then I put the JSON data I'd want in the output field (one example pair is sketched below). Then I did some additional training with that dataset on a few models. That was enough to vastly improve the results.
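
    For concreteness, here's roughly what one of those training pairs might look like. The subject and text are invented placeholders, not rows from my actual dataset, but the instruction/input/output layout and the JSON-in-the-output-field structure are the point:

        import json

        # One hand-written training pair (invented example). The "input" shows
        # the kind of mess I mean: a stray image file name, page numbers, and a
        # word broken across a line.
        example = {
            "instruction": "Format the following text about Roman history into "
                           "alpaca formatted json. Here's the text:",
            "input": "fig_03_ruins.jpg\n47\nThe Republic's army was reorgan- "
                     "ized under Marius, who opened enlistment to the landless "
                     "poor.\n48",
            # The output field holds the cleaned-up dataset rows as a JSON string.
            "output": json.dumps([{
                "instruction": "Who reorganized the Roman Republic's army, and "
                               "what change did he make?",
                "input": "",
                "output": "Gaius Marius reorganized the army, opening "
                          "enlistment to the landless poor.",
            }], indent=2),
        }

        with open("seed_dataset.json", "w", encoding="utf-8") as f:
            json.dump([example], f, indent=2, ensure_ascii=False)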

    Up until very recently the best results I got with that method were with dolphin 2.0 mistral 7B and Microsoft's Orca-2 13b. Not perfect, but it leaves me with about ten minutes of hand-editing on the generated dataset per textbook. Sometimes less, sometimes more.

    The big shock was Capybara Tess Yi 34b 200k, though. I only ran a couple of tests, so this might be a fluke, but after training it with the same dataset I was getting perfect results, something I'd never seen before from anything other than GPT-4. I'm finishing up a big batch with the 13b model right now, though, so I haven't had a chance to swap it into the automation and see if it lives up to that outside the quick test run. It's also worth noting that I never tried dataset generation with Capybara Tess Yi 34b 200k in its default state, without my extra training applied; it might very well be just as perfect out of the box. So if you're testing models, that's the one I'd point to as a possible solution that wouldn't require any more work.

    So yeah, in short, my advice is to make a dataset with about 100 to 150 examples of how to make a dataset. Going by my results, that should be enough to get you pretty close to what you're looking for.

  • IndianaCahones@alien.topB · 10 months ago

    Any reason you want to use an LLM to perform a basic ETL pipeline? Look into intelligent document processing. I believe you're looking for a no-code data engineering solution.
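
    If the documents are fairly regular, even a deterministic cleanup pass covers a lot of what's described above before any model gets involved. A minimal sketch; the patterns are placeholders aimed at the kinds of noise mentioned in this thread (page numbers, stray image file names, hyphenated line breaks), and you'd tune them to your own sources:

        import re

        # Hypothetical cleanup rules; tune these to your own documents.
        PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$", re.MULTILINE)  # lone page numbers
        IMAGE_NAME = re.compile(r"\b\S+\.(?:jpe?g|png|gif)\b", re.IGNORECASE)
        HYPHEN_BREAK = re.compile(r"(\w)-\s+(\w)")  # "reorgan- ized" -> "reorganized"

        def clean(text: str) -> str:
            text = PAGE_NUMBER.sub("", text)
            text = IMAGE_NAME.sub("", text)
            text = HYPHEN_BREAK.sub(r"\1\2", text)
            return re.sub(r"\n{3,}", "\n\n", text).strip()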