@arthurwolf

arthurwolf@alien.top · 2 years ago

I’d really like a version of LLaVA that can process comic/manga pages: read the text, say which character is saying what and doing what, and in what order. Pretty much turn the manga into a novel or something like that.

Anyone know of any project that is going in that direction/working on that?

arthurwolf@alien.top · 2 years ago

It’s a long shot, but I think if you took DeepPanel (see GitHub) and, instead of training it on comic book panels, trained it on a dataset of PDF tables, it would generate the same kind of masks/heatmaps it generates for comic book panels, but for PDF tables. That gives you an image that shows only where the “table lines” are, with all the text and other random stuff removed, so you can process just the table lines.

Then from there, you could scan the image vertically first, averaging the pixels of each pixel row of the heatmap to detect where the horizontal “lines” are, and cut the table into rows. Then once you have the rows, you do the same on each row, this time averaging each pixel column, to get the columns/cells.
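Here’s a rough sketch of that row/column scan in plain Python, in case it helps. It is not DeepPanel’s code or my actual comic book code; the function names (`line_bands`, `gaps_between`, `split_cells`), the 0.5 threshold, and the assumption that the heatmap is a 2D grid of floats in [0, 1] with “line” pixels close to 1 are all made up for illustration.

```python
from statistics import fmean


def line_bands(profile, threshold):
    """Return (start, end) index ranges where the averaged profile exceeds
    the threshold, i.e. where a table line probably is."""
    bands, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


def gaps_between(bands, length):
    """Return the (start, end) ranges left over between the line bands."""
    gaps, prev_end = [], 0
    for start, end in bands:
        if start > prev_end:
            gaps.append((prev_end, start))
        prev_end = end
    if prev_end < length:
        gaps.append((prev_end, length))
    return gaps


def split_cells(heatmap, threshold=0.5):
    """Cut a heatmap (list of rows of floats, assumed in [0, 1]) into cells.

    Returns one list per table row, each holding (top, bottom, left, right)
    boxes with exclusive bottom/right bounds.
    """
    height, width = len(heatmap), len(heatmap[0])
    # Vertical scan: average each pixel row to find horizontal lines.
    row_profile = [fmean(row) for row in heatmap]
    rows = gaps_between(line_bands(row_profile, threshold), height)
    cells = []
    for top, bottom in rows:
        strip = heatmap[top:bottom]
        # Same idea inside the strip: average each pixel column.
        col_profile = [fmean(col) for col in zip(*strip)]
        cols = gaps_between(line_bands(col_profile, threshold), width)
        cells.append([(top, bottom, left, right) for left, right in cols])
    return cells
```

On real model output the threshold would need tuning, and you would probably want to smooth the profile first, since predicted lines are rarely perfectly straight or perfectly bright.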

I do this for comic book panels and it works very well, I see no reason why it wouldn’t work for PDF tables.

It’s a lot of work but I’m fairly certain it’d work.

Then once you have the cells, it’s just a matter of OCR (you could maybe even try LLaVA for that, I suspect it might work).

Tell me if you need help with this/more details about how I did it for comic books/how I would do it for PDF tables.

arthurwolf@alien.top · 2 years ago

Sounds like you want to train a custom model/QLoRA just for this task.

arthurwolf@alien.top · 2 years ago

Hey @OP. Really interesting initiative. There seem to be some parallels with something I’m working on. I’d love your opinion on it if you have a moment: https://github.com/arthurwolf/llmi/blob/main/README.md

arthurwolf@alien.top · 2 years ago

It definitely should have a live token counter as you type the prompt in.

Do you plan to make this a PR against llama.cpp? It really deserves to be merged in.

arthurwolf@alien.top · 2 years ago

I’m working on a project that intends to solve that issue/do that work (though not using agents, essentially by training custom models with the ability to do “recursive”/in-depth work). The example I give in the docs is for a dissertation, but future plans are to enable something like a novel (in particular by implementing a memory system).

The project: https://github.com/arthurwolf/llmi/blob/main/README.md

arthurwolf@alien.top · 2 years ago

Oh boy! Let’s train a right-wing and a left-wing model, and have them talk to each other on Twitter forever!

… Profit!