
PDF Import Improvements: Page Numbers


Fabian


Hello

 

When I open a PDF in Word, the header and footer are recognized.

[Attached screenshot]

 

I would love it if Accordance did that too, and also extracted the page numbers the way they appear in Accordance modules, so that we can search for them.

 

 

Currently Accordance imports all of this information, but it does not distinguish the header, the footer, or the page number from the body text. They simply end up in a line of text and have to be deleted.

[Attached screenshot]

I am talking about lines like this:

 

One Line Shortbiblical theology of the messiah 49

 

 

Thanks

 

Greetings 

 

Fabian


+1 to (numerous) improvements in PDF / HTML Import


Although PDFs can contain tagging, and I think this could identify a header, there's no guarantee that a given PDF will be tagged that way. A header or footer can perfectly well be characters that just happen to sit at the top or bottom of a page and are not distinguished from other text except by the coordinates at which they are printed. In this (common) case, the only way to identify a header or footer is by heuristics based on the position of the text and its distance from other elements on the page. It's certainly possible to create such heuristics, but making them reliable requires lots of PDFs to test with and time to work out a heuristic that works most of the time. I note that FineReader (for Mac) often fails to correctly identify headers, footers, and headings, even though doing so is much more central to the functionality of an OCR program than it is to Accordance, and even though many hours of developer time have probably been invested in identifying them.
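
To make the idea of a position heuristic concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how Accordance, Word, or FineReader work: it assumes some extractor has already produced lines with a page index and a vertical coordinate, and the `TextLine` type, the function names, and the `margin` and `min_share` thresholds are all hypothetical. A line counts as a running header or footer if it sits in the top or bottom band of the page and the same text (ignoring digits) recurs on enough pages.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLine:
    """One extracted line of text (hypothetical structure)."""
    page: int   # zero-based page index
    y: float    # distance from the top edge of the page, in points
    text: str


def normalise(text):
    # Replace digit runs so "Title 49" and "Title 50" compare as equal.
    return re.sub(r"\d+", "#", text).strip().lower()


def find_running_lines(lines, page_height, margin=0.08, min_share=0.3):
    """Return the lines that look like running headers or footers."""
    band = page_height * margin
    page_count = len({line.page for line in lines})
    # Step 1, position: keep only lines in the top or bottom band.
    candidates = [l for l in lines if l.y <= band or l.y >= page_height - band]
    # Step 2, repetition: count the distinct pages each normalised text is on.
    pages_per_text = Counter(
        text for _, text in {(l.page, normalise(l.text)) for l in candidates}
    )
    threshold = max(2, min_share * page_count)  # assumed cut-off, needs tuning
    return [l for l in candidates if pages_per_text[normalise(l.text)] >= threshold]


def page_number(line):
    """Return a trailing page number, or None if the line does not end in digits."""
    match = re.search(r"(\d+)\s*$", line.text)
    return int(match.group(1)) if match else None


# Usage with made-up data: three pages with a running header, body text,
# and one chapter title that happens to sit near the top of a page.
sample = [TextLine(p, 30.0, f"biblical theology of the messiah {49 + p}") for p in range(3)]
sample += [TextLine(p, 400.0, "Body text.") for p in range(3)]
sample.append(TextLine(1, 50.0, "Chapter Two"))

running = find_running_lines(sample, page_height=800.0)
print([page_number(line) for line in running])  # [49, 50, 51]
```

Even this toy version shows where the guessing comes in: the band size and the repetition threshold are arbitrary, and roman-numeral page numbers, chapter-opening pages without a header, or headers that change from section to section would all need extra rules.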

 

So I think identifying headers and footers is likely to take a lot of developer time. I would rather see that time spent on improving import capabilities for formats with well-defined semantics than on formats like PDF, in which the computer must guess what the various elements are supposed to be. I want to get things like block quotes, tables, and footnotes into Accordance. I don't care whether that's made possible by improved HTML support (footnotes would be hard) or by support for some new format like DOCX or TEI XML. I don't care whether I can edit the result in Accordance, so long as it is imported correctly, so I don't need it to be imported to a User Tool.

