Are there any data cleaning focused LLMs? [also, rant]

AnomalyNexus@alien.top · 10 months ago

Are there any data cleaning focused LLMs? [also, rant]

n_girard@alien.top · 10 months ago

What I wanted to experiment but haven’t yet, is to provide the models with markdown text whose both sections and paragraphs are numbered ; and instruct them to list the section numbers and/or paragraph numbers related to some topic.

The next step would be to extract the relevant passages using the list(s).

Has anyone ever tried this approach ? Any thoughts ?

CodeGriot@alien.top · 10 months ago

I don’t think you should be surprised that a 34B model is mostly failing, considering the fact that a 200B model (GPT-3.5) is only getting to 40%. What you’re asking the LLM to do is very hard for it without further training/tuning.

andrewlapp@alien.top · 10 months ago

Say I’ve got a paragraph about something and the text block contains some other unrelated comment

Have you considered creating text embeddings, calculating their distance matrix, and applying pagerank?

AnomalyNexus@alien.top · 10 months ago

That’s a sharp comment.

Potentially beyond my technical ability but I can vaguely see where you’re going with it.

Next step was embeddings anyway (hence attempt to clean the data - get it ready for that).

I’ve not heard of pagerank applied to this before though. Thanks!

herozorro@alien.top · 10 months ago

most all of what you wrote can be done with python out of the box

arthurwolf@alien.top · 10 months ago

Sounds like you want to train a custom model/qlora just for this task.

No-Activity-4824@alien.top · 10 months ago

You need an AI like GPT 4 not an LLM

LocoMod@alien.top · 10 months ago

Ideally we would be better in a timeline where LLMs could do this better than classical methods but we’re not there yet. You can code a handler that cleans up html retrieval quite trivial since you’re just looking for the text in specific tags like articles, headers, paragraphs, etc. There are a ton of frameworks and examples out there on how to do this and a proper handler would execute the cleanup in a fraction of the time even the most powerful LLM ever hoped to.

georgejrjrjr@alien.top · 10 months ago

Sort-of.

Refuel.ai finetuned a 13B llama 2 for data labeling; not hard to imagine applications for that here if the data volume were reasonable. Simplest thing that might work: take a paragraph at a time and have a data labeling model answer “Is this boilerplate or content?”

Another possibility is using the TART classifier head from Hazy Research, find as many as 256 pairs of boilerplate vs. content, and use only as large a model as you need to get good classification results. If your data volume is large, you would do this for a while, get a larger corpus of content vs. boilerplate, and train a more efficient classifier with fasttext or something similar (probably bigram based).

Small-Fall-6500@alien.top · 10 months ago

Having a dozen or so examples doesn’t help/work?

I don’t see you mentioning this (or any other comments here) but few-shot prompting should help immensely compared to only giving a detailed instruction. If you add at least a couple examples after the instruction, I would imagine most models would do much better.

(Assuming you can fit multiple examples in the context window)