G_S_7_wiz@alien.top to LocalLLaMA@poweruser.forum · English · 1 year ago

Extract Tables from PDFs


I am working on a project where I have to extract tables from PDFs. These are usually financial reports that contain a lot of tables (both simple tables and tables with merged cells) as well as graphs.
So far I have tried the following libraries, none of which gave good results:
Nougat, PyMuPDF (fitz), PyPDF2, pdfplumber, PDFMiner, Camelot, Tabula, pdfquery

What other OCR tools, LLMs, or other approaches would you recommend I try next? Thanks in advance!
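
To make the merged-cell problem concrete, here is a minimal sketch of the kind of post-processing such tables tend to need. It assumes the extractor returns each table as a list of rows, with `None` or an empty string wherever a cell is covered by a merge; whether a given library actually does this depends on the library and the PDF, so treat that as an assumption. The function names and the sample figures are hypothetical, made up for illustration only.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_empty(cell: Optional[str]) -> bool:
    """Treat None and whitespace-only strings as cells hidden by a merge."""
    return cell is None or not cell.strip()


def fill_right(row: Row) -> List[str]:
    """Horizontal merges: copy the last seen value into empty cells to its right."""
    filled: List[str] = []
    last = ""
    for cell in row:
        if not is_empty(cell):
            last = cell.strip()
        filled.append(last)
    return filled


def fill_down(rows: List[Row], col: int) -> List[Row]:
    """Vertical merges: copy the last seen value in column `col` into empty cells below it."""
    out: List[Row] = []
    last = ""
    for row in rows:
        row = list(row)  # copy so the caller's data is left untouched
        if is_empty(row[col]):
            row[col] = last
        else:
            last = row[col].strip()
            row[col] = last
        out.append(row)
    return out


def normalize_table(rows: List[Row]) -> List[Row]:
    """Assumes row 0 is the header and column 0 holds the row labels."""
    if not rows:
        return []
    return [fill_right(rows[0])] + fill_down(rows[1:], col=0)


# Hypothetical extractor output for a small table with merged cells.
raw = [
    ["Segment", "Revenue", None],
    ["Retail", "Q1", "10"],
    [None, "Q2", "12"],
    ["Online", "Q1", "7"],
]

for row in normalize_table(raw):
    print(row)
```

This only repairs the cell values after extraction. It does nothing about tables that the extractor fails to detect in the first place, which is the harder part of the question.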


vec1nu@alien.top · 1 year ago
I’ve had good results using https://github.com/DevashishPrasad/CascadeTabNet

Chaosdrifer@alien.top · 1 year ago
You might want to look into LlamaIndex’s SEC Insights repo (https://github.com/run-llama/sec-insights); they do a lot of parsing of financial documents.