PDF Import Improvements: Page Numbers

February 5, 2020

Hello

When I open a PDF in Word, the header and footer are recognized.

I would love it if Accordance did that too, and also extracted the page numbers the way they appear in Accordance modules, so that we can search for them.

Currently Accordance imports all the information, but it does not distinguish the header, footer, and page number from the body text. They simply end up in the line of text and have to be deleted by hand.

I am talking about lines like this:

One Line Shortbiblical theology of the messiah 49

Thanks

Greetings

Fabian

February 5, 2020

+1 to (numerous) improvements in PDF / HTML Import

February 5, 2020

+1

February 7, 2020

I just saw this advertised: https://apps.apple.com/ch/app/pdf-inspector/id1497698069?l=en&mt=12 I guess this could be a huge help.

February 7, 2020

Although PDFs can contain tagging, and I think tags could identify a header, there's no guarantee that a given PDF will be tagged that way. A header or footer can perfectly well be characters that just happen to sit at the top or bottom of a page, distinguished from other text only by the coordinates at which they are printed. In this (common) case, the only way to identify a header or footer is by heuristics based on the position of the text and its distance from other elements on the page. It's certainly possible to create such heuristics, but making them reliable requires lots of PDFs to test with and time to work out a heuristic that works most of the time. I note that FineReader (for Mac) often fails to correctly identify headers, footers, and headings, even though doing so is much more central to the functionality of an OCR program than it is to Accordance, and so many hours of developer time have probably been invested in identifying them.
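To make the idea of a position-based heuristic concrete, here is a minimal Python sketch. It is not how Accordance, Word, or FineReader work. It assumes some PDF library has already produced one record per text line with a page index and a vertical position; the `Line` type, the function names, the 10% margin, and the 50% repetition threshold are all made up for illustration. A line is treated as a running header or footer if it sits in the top or bottom margin and its text, ignoring digits, repeats on at least half of the pages.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    page: int    # page index, starting at 0
    y: float     # distance from the top of the page, in points
    text: str


def normalize(text):
    """Replace digit runs so 'Title 49' and 'Title 50' compare equal."""
    return re.sub(r"\d+", "#", text).strip().lower()


def find_running_lines(lines, page_height, margin=0.1, min_share=0.5):
    """Return lines that look like running headers or footers.

    A line qualifies if it lies in the top or bottom `margin` fraction of
    the page and its normalized text occurs on at least `min_share` of
    all pages.
    """
    def in_margin(ln):
        return ln.y < page_height * margin or ln.y > page_height * (1 - margin)

    page_count = len({ln.page for ln in lines})
    # Count on how many distinct pages each normalized margin text appears.
    pages_seen = Counter()
    for text, _page in {(normalize(ln.text), ln.page) for ln in lines if in_margin(ln)}:
        pages_seen[text] += 1

    threshold = min_share * page_count
    return [ln for ln in lines
            if in_margin(ln) and pages_seen[normalize(ln.text)] >= threshold]


def page_number(text):
    """Return a number at the very start or end of the line, else None."""
    m = re.search(r"^(\d+)\b|\b(\d+)$", text.strip())
    return int(m.group(1) or m.group(2)) if m else None


# Invented sample data: three pages, each 800 points tall.
lines = [
    Line(0, 30, "Biblical Theology of the Messiah 49"),
    Line(0, 400, "Body text of the first page."),
    Line(1, 30, "Biblical Theology of the Messiah 50"),
    Line(1, 780, "A last body line that happens to sit low on the page."),
    Line(2, 30, "Biblical Theology of the Messiah 51"),
    Line(2, 350, "Body text of the third page."),
]

running = find_running_lines(lines, page_height=800)
for ln in running:
    print(ln.page, page_number(ln.text), ln.text)
```

Even this toy version shows where the difficulty lies: the margin size and repetition threshold have to be tuned against real documents, and books that alternate headers between left and right pages, or that drop the header on chapter openings, need further special cases.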

So I think identifying headers and footers is likely to take a lot of developer time. I would rather see that time spent on improving import capabilities for formats with well-defined semantics than on formats like PDF, in which the computer must guess what the various elements are supposed to be. I want to get things like block quotes, tables, and footnotes into Accordance. I don't care whether that's made possible by improved HTML support (footnotes would be hard) or by support for some new format like DOCX or TEI XML. I don't care if I can edit the result in Accordance, so long as it is imported correctly. So I don't need it to be imported to a User Tool.

Fabian

TYA

miketisdell

Fabian

jlm
