PDF to Word Converter ($0.99)?


Abram K-J

Ok so I spent some time playing with ABBYY FineReader 12 Pro. It basically cannot do diacritic marks at this point for biblical Greek. I didn't even try Hebrew with vowel pointing and cantillation. I was trying out a nice clean print to PDF of a few verses from Genesis from Rahlfs. I sent this to ABBYY support and they took a crack at it as well. The support analyst's experience was the same as mine. FR Pro could not correctly identify the diacritic marks as part of the letters below, but rather tended to assume they were some sort of superscript (because of the small size) on a line above. He explained to me that they are trying to improve diacritic handling for a number of languages that require it. But at this time there isn't much to offer here.

 

Time to test another piece of software but that's the latest update.

Thx

D


Thanks for the update, Daniel!



Thanks for the update.

I also sent a request to ABBYY about a month ago, because it did not recognize all the macrons in Latin words either.

 

Greetings

 

Fabian


Ok this approximately rocks. I hate to report something with so little detail and time on it but ....

 

Gen. 1:1 Ἐν άρχῇ ἐποίησεν (3 θεὸς τὸν οὑράνον K061 τὴν γῆν.

2 ἡ δέ γῆ ἦν άόράτος K061 άκάτάσκεὺάστος, K061 σκότος

ἐπάνω τῆς ἀβύσσου, K061 πνεὺμά Θεοὺ ἐπεφέρετο ἐπάνω τοὺ

ὑδάτος. 3 K061 εἶπεν (3 θεός Γενηθήτω φῶς. K061 ἐγένετο φῶς.

4 K061 εἶδεν (3 θεὸς τὸ φῶς ὅτι κάλόν. K061 διεχώρισεν (3 θεὸς

άνά μέσον τοὺ φωτὸς K061 άνά μέσον τοὺ σκότοὺς. 5 K061

ἐκάλεσεν (3 θεὸς τὸ φῶς ἡμέραν K061 τὸ σκότος ἐκάλεσεν

νὺκτά. K061 ἐγένετο ἐσπέρά K061 ἐγένετο πρωι, ἡμέρά μιά.

Gen. 1:6 K061 εἶπεν (3 Θεός Γενηθήτω στερέωμα ἐν μέσω τοὺ

ὑδάτος K061 ἐστω διάχωριζον άνά μέσον ὑδάτος K061 ὑδάτος.

K061 ἐγένετο οὑτως. 7 K061 ἐποίησεν (3 θεὸς τὸ στερέωμά, K061

διεχώρισεν (3 θεὸς άνά μέσον τοὺ ὑδάτος, ὃ ἦν ὑποκάτω τοὺ

στερεωμάτος, K061 άνά μέσον τοὺ ὑδάτος τοὺ ἐπάνω τοὺ

στερεωμάτος. 8 K061 ἐκάλεσεν (3 θεὸς τὸ στερέωμά οὑράνόν.

K061 εἶδεν (3 θεὸς ὅτι κάλόν. K061 ἐγένετο ἐσπέρά K061 ἐγένετο

πρωι, ἡμέρά δεὺτέρά.

 

Now, this is from my test page of Genesis, and it was produced with free software doing bilingual English/Greek recognition. I did wonder if Tesseract was worth a look. There are obvious issues: "(3" routinely appears where ὁ should be, and "K061" where καὶ should be. Clearly there is more work to do here, probably with training data. Anyhow, back to the Psalms, but this shows that it can be done from a clean PDF print. More to come. This is promising, and there are many more tests to do, but I'm buzzed to see it.
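Until the training data is improved, the two systematic misreadings could be patched after the fact. The following is only a rough sketch, not part of the OCR tool itself: the `CORRECTIONS` table and the `fix_known_misreadings` helper are names made up for illustration, and the table contains only the two substitutions observed in the sample above.

```python
import re

# Hypothetical substitution table: only the two misreadings noted in the post.
CORRECTIONS = {
    "(3": "ὁ",
    "K061": "καὶ",
}


def fix_known_misreadings(text: str) -> str:
    """Replace known OCR misreadings when they stand alone as whole tokens."""
    for wrong, right in CORRECTIONS.items():
        # Match only when the token is bounded by whitespace or the string
        # edges, so that something like "(3)" is left untouched.
        pattern = r"(?<!\S)" + re.escape(wrong) + r"(?!\S)"
        text = re.sub(pattern, right, text)
    return text


sample = "4 K061 εἶδεν (3 θεὸς τὸ φῶς ὅτι κάλόν."
print(fix_known_misreadings(sample))
```

This does nothing for the stray accents in the output, and it cannot restore a capital Καὶ at the start of a sentence; it only shows that consistent errors of this kind are cheap to clean up.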

 

It would be unkind to kick this out without a reference to where I got this so here it is : http://ancientgreekocr.org.

 

BTW, I ran a 100 dpi and a 300 dpi scan of the page and then OCR'd it. In both cases the two issues above were gone. Of course some other issues showed up but still they were not major.

 

Gen. 1:9 Καὶ εἶπεν ὁ θεός Συναχθήτω τὸ ὕδωρ τὸ ὑποκάτω

τοὺ οὑρανοὺ εἰς συναγωγὴν μίαν, καὶ ὀφθήτω ἡ ξηρά. καὶ

ἐγένετο οὕτως. καὶ συνήχθη τὸ ’ὕδωρ τὸ ὑποκάτω τοὺ

οὑρανοὺ εἰς τας συναγωγὰς αὑτῶν, καὶ ὤφθη ἡ ξηρά. 1°

καὶ ἐκάλεσεν ὁ θεὸς τὴν ξηρὰν γῆν καὶ τὰ συστήματα τῶν

 

The 1 followed by the degree mark should be verse number 10. The above is from the 300 dpi but it's slower than the 100dpi and not obviously better.

 

Thx

D

Edited by Daniel Semler

Ok one final example. Here is an example OCR'd from a PDF of a scan of North and Hillard. I did not do the scan - it's from Textkit.

 

T H E A R Τ Ι C L E

1. The Article is used sometimes in Greek Where it is not

used- in English :—

(a) With Nouns denoting whole classes.

αφ. οἱ θῆρες, wild beasts; οἱ ἅνθρωποι, mankind.

(h) Often with abstract Nouns and proper names, especially

the names of countries.

eg. ἡ ἀνδρεία, courage; ἡ Ἑλλάς, Greece. Χ

2. The Article used with an Adiective or Adverb, or with

an Infinitive, makes it a. Noun.

 

e.g. τὸ ἀληθές, truth ; οἱ ἀνδρεῖοι, brave men ; τὸ λέγειν,

speech; τῷ λέγειν, by speaking ; οἱ πάλαι, the men

of old.

 

3. Participles, like Adjectives, when used with the Article,

are equivalent to Nouns.

 

e.g. οἱ λέγοντες, speakers, or those who speak.

οἱ τεθνηκότες, the dead, or those who haw died.

 

This is constantly the Greek equivalent for an English

Relative clause.

e.g. τιμῶμεν τοὺς στρατιώτας τοὺς ὑπὲρ τῆς πόλεως

τεθνῑηκότας.

W e honour the soldiers who have died for their city.

 

 

There are obvious errors to be sure but still....

 

Thx

D


 

 

 

Hi Fabian,

 

Maybe you could train FineReader to recognize different Hebrew fonts - it would be a lot easier to add an old style Hebrew font as a new language.

 

 

 

Unfortunately no.

 

There are many differences between the Mac and the PC versions. In some respects the Mac version can do more, but this is something only the PC version can do, and I don't have it. The built-in editor is also included only in the PC version.

 

Greetings

 

Fabian


A quick additional update.

I have tried out ReadIris. It has a training mode and I thought I might be able to get that to fly. After trying and finding that it essentially did modern Greek accents only I pinged their support team. I was informed of two things. The first is that the trainer is not really for adding characters to existing alphabets but rather to assist with bad character recognition due to poor scans and such like. And secondly that Hebrew recognition is limited to consonants and no diacritic marks - vowel points or cantillation, and they expected that the same was true of Greek. I was mildly hopeful until I got that response because ReadIris was doing better at line recognition than ABBYY had done. So strike two - this one won't do either.

 

I am now trying a demo version of Omnipage 19 Ultimate. I didn't really intend to use Ultimate, but that's what the demo download is. Now this looks like it should be better. Its documentation states explicit support for Ancient Greek characters and even shows an example of classical and modern Greek so you can see the different diacritics. I cannot get it working, however, and have logged a plea for help. Interestingly, it failed completely to recognise the text in my print to PDF but did a decent job on a scan of the same print, though admittedly I got only the modern acute accent, albeit mostly where accents were in the text. So it seems promising, but I need help. Of course I hope one does not need Ultimate to do this. Additionally, I cannot see Hebrew support, so I need to check that.

 

More later - back to the Psalms.

 

thx

D


Thanks 

 

I have version 12, but it was already outdated when it came bundled with my new scanner. And it can't do Hebrew.

 

Greetings

 

Fabian


Daniel, I have nothing else to add except to say I am following this thread and reading all your posts, and really appreciating them!


I agree, thank you.

 

We really do need an OCR program that recognizes ancient Greek, pointed Hebrew, cuneiform, hieroglyphics, and transliterated text of all sorts. It's ridiculous that we have made so many strides in the field, but that this relatively simple tool is still wanting. There are just SO many good printed books that have yet to be digitized.


The impression I get, having played with this a bit now, is that the commercial case for these languages is not great. Having used the Ancient Greek training data in Tesseract, I know that Ancient Greek/English can be done. I now have a complete OCR of North and Hillard's Greek Prose Composition. It will need a full proofreading. I do not yet know how bad it is in terms of errors per X number of characters or per page. Now technically this is not Biblical Greek (dictionary support, for example, would be a bit different), but it's good enough for the actual characters.
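One common way to put a number on "errors per X characters" is a character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. What follows is a minimal sketch, assuming a proofread version of a page is available to compare against; the function names are illustrative and do not come from any OCR package.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance divided by reference length, after NFC normalization."""
    # Normalize so precomposed and decomposed accents compare as equal.
    ocr_text = unicodedata.normalize("NFC", ocr_text)
    reference = unicodedata.normalize("NFC", reference)
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


print(character_error_rate("abxd", "abcd"))
```

Multiplying the result by 1000 gives errors per thousand characters. A letter with the wrong accent counts as one error here, the same as a wrong letter.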

 

I know how one would go about training Tesseract for Hebrew with pointing and cantillation, at least approximately. It's time-consuming but as far as I know perfectly doable, though I would want to know Hebrew first - another work in progress I'm afraid. I'm currently reading up on Unicode. I am trying to find out exactly how Hebrew pointing and cantillation is handled.
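As a small illustration of the Unicode side of this: vowel points and cantillation marks are encoded as separate combining characters that follow the consonant they belong to. The sketch below uses only Python's standard `unicodedata` module; the sample string (bet with dagesh and sheva, plus one accent) and the helper names are chosen just for illustration, and none of this says anything about how Tesseract training treats these marks.

```python
import unicodedata

# Bet + dagesh + sheva + the accent tipeha, written as escapes so that the
# code point order is explicit.
SAMPLE = "\u05D1\u05BC\u05B0\u0596"


def describe(text: str):
    """List code point, Unicode name, general category and combining class."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.name(ch, "UNKNOWN"),
         unicodedata.category(ch),
         unicodedata.combining(ch))
        for ch in text
    ]


def strip_marks(text: str) -> str:
    """Drop all nonspacing marks (category Mn), leaving base letters only."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


for row in describe(SAMPLE):
    print(row)
print(strip_marks(SAMPLE) == "\u05D1")
```

The letter comes out with category `Lo` and the points and accent with category `Mn`. Stripping the `Mn` characters leaves the consonantal text, which is roughly what a consonants-only recognizer would produce.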

 

Finally, there is a good deal of refinement in the commercial algorithms, and I am not sure how the open source stuff compares, at least in part because the commercial houses do not publish their algorithms, but also because of the profound lack of support for these languages in commercial software. Dictionary support for these languages really helps weed out errors, and that means dictionaries for these ancient languages.

 

My guess is that software houses either have to develop their own internal skills on this, developing OCR tools, or at the very least training data for them, and doing the process themselves, or they have to contract it out to specialists. Neither approach is cheap and both have drawbacks. The specialist OCR houses know how to scan and train software but are unlikely to have the language skills. The reverse would tend to be true of the software houses at least at the outset. Then the texts need proof-reading by people who can read the languages. This is a time-consuming editorial process and not flawless and the specialists in these languages are usually busy with other tasks. So this all translates into time and money.

 

I agree that flawless OCR is very much needed for getting many wonderful old texts into software. I do not know how this is currently being achieved, but it's neither cheap nor flawless. I made a remark to one OCR software house that if they had a biblical languages group of some 10 or so languages in their product I could point to people who would buy it. But alas that is a bunch of work and while I can point to a handful of people, mostly on this list :) who would buy it that isn't really enough to convince them. I believe that most of the effort would go in training and dictionary building. I suspect the algorithms are fairly close but of course I don't really know.

 

Problems can creep into the process at numerous places and they have to be weeded out. Building the skills to perfect the process so that the final outcome is the best possible requires a bunch of practice. The original scanning has a big impact, and there are definitely better and worse ways to do this. People are so good at reading bad text (just look at the old papyrus fragments from which we have been able to extract text), but computers are not nearly so good at it. And training data, depending upon the script, can range from say 60 thousand characters to well over a million, depending upon the software and the quality you are looking for. So just training the software properly is work. I begin now to understand why the training modes in the commercial software are not really aimed at allowing a user to add new scripts so much as at helping the software better understand the scripts it already knows. The issues are not simple ones. The Lord built a wonderful reading machine in people, and we, curiously enough, have difficulty duplicating the feat :)

 

It really does feel like one of those problems that is frustratingly close to solved but the last push is required. I'm no expert just an interested amateur. We'll see how far I get. We probably need a startup or crowd source or some sort of effort to do dead language OCR to push all this to finally get done. Ah well, as ever back to the Psalms - month 3 beginning.

 

Thx

D
 


Ok so I'm thinking Omnipage ain't it either. My pleas for help were rejected because I didn't own the software, despite my explaining that I was testing it with their demo package. They have a money back guarantee (unusual in software) so that you can try it out, but I'm not inclined to, given that as far as I can tell it won't do Hebrew, and though the help pages actually show classical Greek, it doesn't appear able to do it. It looks like it could perhaps be trained to do it, but as with all these tools the trainers are a kind of torture to use for one reason or another. There was an Omnipage 11 explanation of doing classical Greek, but that method mentions options that no longer appear to be what they were, and my best guesses didn't pan out.

 

So the only thing that seems to work right now for Classical Greek and English is free: gImageReader with the Ancient Greek training data. I'm going to play more with this and see what I can make of it. There are Hebrew training sets for Tesseract (the engine in gImageReader). I suspect they might be modern, but I can check it out. The other thing about Tesseract-based tools is that you can train in entirely new languages. Tweaking with the trainers in the commercial packages is not really cutting it, and I'm pretty sure they are not aimed at it. They are aimed at tweaking the recognizer for a new typeface or a bad scan in languages it already knows.
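For anyone who would rather drive Tesseract directly than go through gImageReader, here is a rough sketch of calling the command-line program from Python. It assumes the `tesseract` executable is on the PATH and that the Ancient Greek training data is installed under the language code `grc`; the file names and helper functions are placeholders invented for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages=("grc", "eng")):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG+LANG."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages=("grc", "eng")):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, languages),
                   check=True)
    # Tesseract appends ".txt" to the output base name by default.
    return output_base + ".txt"


# Show the command without running it; "page.png" is a placeholder file name.
print(" ".join(build_tesseract_command("page.png", "page")))
```

Joining language codes with `+` is how Tesseract is asked to recognize more than one language on a page, which matches the bilingual Greek/English case here.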

 

Feel free to toss up any other suggestions and I may be able to give them a shot but for now I'll try some serious shots with Tesseract and see if I can get a process going, at least for Greek. If that works out then there are endless other things to try.

 

Thx

D


  • 1 month later...

What I could use more than anything is an OCR package that can handle Fraktur. ABBYY has a special version of their engine that is designed specifically for Fraktur. However, this is not even remotely possible on my budget. Not only are the startup costs high, but they license it by the page! I have many thousands of pages I would need to scan.


So this is a GitHub project for Tesseract for Danish in Fraktur. That it works indicates that at least someone can get it trained to do it. And it looks like someone has done German as well: https://code.google.com/p/tesseract-ocr/downloads/detail?name=deu-frak.traineddata.gz

 

I must say I'm even more puzzled by ABBYY's inability to handle ancient biblical languages given this: http://www.abbyy-developers.eu/en:tech:features:old_font_recognition

 

Actually no I'm not really - it's just priorities. Oh well....

 

Thx

D


As far as I know, you can use ABBYY's online service, which can also handle some Fraktur: http://www.finereaderonline.com/en-us

As far as I know, there are only a few OCR apps for this: ABBYY, the ones Daniel told you about, and one from a German company, which is very expensive.

Until a few years ago one page cost $7.50 at ABBYY, so the current price is a bargain. See https://www.finereaderonline.com/en-us/Store and look at the paragraph just above the Q&A.

 

Greetings

 

Fabian


  • 1 year later...

To revive this thread:

 

Abbyy has a newer version for Windows out. https://www.abbyy.com/en-eu/finereader/

 

Has anyone tested it? There is a trial version.

 

Unfortunately, the Mac version is still the old one.

 

Greetings

 

Fabian

Fabian,

 

I have the Mac Version and it works pretty good, much better than Acrobat Pro. I have converted several scans from PDF to Word and then to Accordance. The key is a quality scan.


Thanks Tony

 

I'm not impressed with ABBYY 12 Pro for Mac on older titles. But honestly, the German Gesenius, Keil-Delitzsch, etc. is hard stuff. For general use ABBYY was not bad.

That held even at a higher resolution. ABBYY often complained, and it was not possible to go above 300 dpi.

And I missed the edit feature that the Windows version has.

So I'm hoping someone can test the new version.

 

Greetings

 

Fabian


The key is indeed a quality scan. That's a real challenge for languages that use dots (like Hebrew, Syriac, Ethiopian, etc.), as the dots are so small they can be confused with flyspecks and even grains in the paper. In addition, older texts are often yellowed and aged; some even have the old, dry ink dropping off the page in places.


Yes Tim

 

That's why I have asked ABBYY to allow more than 300 dpi and to include the editor from the Windows version, plus recognition of Gothic type and the special characters for Latin.

 

Greetings

 

Fabian


https://www.best-ocr.com/produkte/kadmos-best-fraktur-historical-fonts.html


Dr j is spot on with his statement!

 

Not sure if people are aware of scanner settings and how to use them.

Once you move beyond the simple scan mode and setting the resolution, most scanners have advanced settings where you can tweak the mode (black and white or grey scale), descreening, and contrast. It's well worth investigating and experimenting.

This is the guide to the Epson scanner software.

https://files.support.epson.com/htmldocs/prv7ph/prv7phug/html/scan1_7.htm

 

 

Agfa used to have a really great book when they sold scanners called the introduction to digital scanning.

 

http://www.digital-photography.org/book_magazine_reviews/Agfa_book_review_photo_1.html

 

This was my go to reference work but there should be newer resources available

https://www.amazon.co.uk/Real-World-Scanning-Halftones-Industrial-strength/dp/0321241320/ref=sr_1_8?s=books&ie=UTF8&qid=1488489474&sr=1-8&keywords=Scanning

Edited by ukfraser
