PDF to Word Converter ($0.99)?


39 replies to this topic

#21 Daniel Semler

Daniel Semler

    Platinum

  • Active Members
  • 1,913 posts
  • Gender:Male
  • Accordance Version:11.x

Posted 05 February 2015 - 11:28 PM

I just downloaded a trial copy of ABBYY FineReader 12 Pro for Windows. I printed a few verses of Genesis from the LXX to a PDF file and then tried reading that. It can mostly get it but the diacritics are almost all lost. I have tried the Language Editor to sort that out but have hit a snag. I've sent a note to their support people. Hopefully I can get this simple test going soon. Then I can try stuff from scans. Will report back.

 

Once I've got a grip on some of this I'll probably give OmniPage a go.

 

Thx

D
 


Sola lingua bona est lingua mortua

ἡ μόνη ἀγαθὴ γλῶσσα γλῶσσα νεκρὰ ἐστιν

lišanu ēdēnitu damqitu lišanu mītu

 

Accordance Configurations :
 
Mac : 2009 27" iMac                 Windows : HP 4540s laptop
      Intel Core Duo                          Intel i5 Ivy Bridge
      12GB RAM                                8GB RAM
      Accordance 11.0.1                       Accordance 11.0.1
      OSX 10.9 (Mavericks)                    Win 7 Professional x64 SP1


#22 Abram K-J

Abram K-J

    Mithril

  • Active Members
  • 2,132 posts
  • Gender:Male
  • Location:Greater Boston, MA
  • Accordance Version:11.x

Posted 05 February 2015 - 11:36 PM

Thanks! Reports back are welcomed.


Abram K-J
Pastor, Writer, Editor, Blogger
Web: Words on the Word
 
Accordance 11 on Yosemite: early 2008 iMac / late 2008 MacBook
Latest iAccord on latest iOS on iPad mini

#23 Pchris

Pchris

    Gold

  • Active Members
  • 269 posts
  • Gender:Male
  • Location:Denmark
  • Interests:Old Testament Exegesis, The Ancient Near East, the Hamito-Semitic languages, Ancient Greek, Mythology
  • Accordance Version:11.x
  • Platforms:Mac OS X

Posted 06 February 2015 - 12:52 AM

I've done OCR with Adobe Acrobat in the past, and the problem with doing OCR for Biblical Greek is that the rules for using diacritical signs in Modern Greek were changed in 1982 so that only the acute and diaeresis are used, and these are therefore the only diacritical signs the OCR engine recognizes. From what I've seen, Spiritus Asper and Lenis will more often than not be read as an acute, but sometimes the entire letter will not be rendered at all. This is often the case for letters having iota subscriptum and circumflex.
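
To make the gap between the two accent systems concrete, here is a minimal sketch using Python's standard `unicodedata` module. It decomposes Greek text and lists the combining marks that fall outside the acute and diaeresis. The helper names and the `MONOTONIC_MARKS` set are my own illustration, not part of any OCR product.

```python
import unicodedata

# Marks that modern (monotonic) Greek still uses: acute/tonos and diaeresis.
MONOTONIC_MARKS = {"\u0301", "\u0308"}


def split_marks(ch):
    """Decompose one character into (base letter, list of combining marks)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], list(decomposed[1:])


def polytonic_only_marks(text):
    """Return sorted names of combining marks that monotonic Greek lacks."""
    names = set()
    for ch in unicodedata.normalize("NFD", text):
        # combining() is non-zero for marks such as breathings and accents.
        if unicodedata.combining(ch) and ch not in MONOTONIC_MARKS:
            names.add(unicodedata.name(ch))
    return sorted(names)


if __name__ == "__main__":
    # Prints the breathings, grave, circumflex and iota subscript that an
    # engine trained only on modern Greek has no output class for.
    print(polytonic_only_marks("Ἐν ἀρχῇ ἐποίησεν ὁ θεὸς"))
```

Any mark this reports is one that a modern-Greek-only recognizer has to either drop or map onto the acute, which matches the behaviour described above.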

 

A similar issue also exists with doing OCR on Biblical Hebrew, as it only works when nothing but the consonants are present, which makes sense as it is the standard in Modern Hebrew. Texts with vowels and/or cantillation will always come out completely distorted.

 

I've also tried ABBYY, but I find that the "learning process" it has isn't very effective for the ancient languages with diacritics. I also tried to "teach" it the Syriac script Estrangela once. It did not go well, to say the least. At least it was an interesting experiment.

 

With kind regards

 

Peter Christensen


Edited by Pchris, 06 February 2015 - 12:53 AM.

Accordance Version: 11.0.4 

Hardware: MacBook Pro 2.6 GHz Intel Core i7 (medio 2012)
Operating System: OSX 10.9.5 Mavericks.


#24 Daniel Semler

Posted 06 February 2015 - 10:14 AM

Thanx Peter. I had a rather similar experience last night with ABBYY. I got beyond my little hiccup. I got the training mode going but there are several issues. The biggies appear to be :

 

  1. It tends to misidentify the diacritical marks as a separate line of superscripted marks rather than as part of the characters below. I was able to do a very targeted selection of part of a line, and that then did not do this. I don't know exactly how I got it to do that, and in any case that's hardly viable for any text of reasonable length. With that I was able to use training mode.

 

  2. There is a fairly severe restriction on the learning mode documented in the doc :

 

    "3. A pattern can only be used for documents that have the same font, font size, and resolution as the document used to create the pattern."
 

    I'm pretty sure that this means that you could be ok if you could scan all your own documents in a consistent way. But using random scans from the web you may or may not be ok, leading to a retraining.

 

  3. The training process itself is not particularly quick.

 

  4. Multilingual OCR of this line, which has a mix of English and Greek :

 

Gen. 1:1     Ἐν ἀρχῇ ἐποίησεν ὁ θεὸς τὸν οὐρανὸν καὶ τὴν γῆν.

 

  results in confusion of some letters for English where they should be interpreted as Greek. τ being misinterpreted as t, ο as o and so on. I was able once to get ῇ trained in but then ὴ was read as ῇ.

 

 This might seem minor but quite frankly it will cause a lot of rework in the document.
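
One way to find the lookalike confusions from item 4 after the fact is to flag words that mix Greek and Latin letters. The sketch below is only a heuristic of my own, using Unicode character names from Python's standard library; it will not catch a Greek word that was read entirely as Latin letters.

```python
import unicodedata


def script_of(ch):
    """Rough script guess from the Unicode character name (heuristic)."""
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "greek"
    if name.startswith("LATIN"):
        return "latin"
    return None


def mixed_script_words(text):
    """Return the words that contain both Greek and Latin letters."""
    suspects = []
    for word in text.split():
        scripts = {script_of(c) for c in word if c.isalpha()}
        scripts.discard(None)
        if len(scripts) > 1:
            suspects.append(word)
    return suspects


if __name__ == "__main__":
    # The third word starts with a Latin "t" where a Greek tau belongs.
    print(mixed_script_words("Gen. 1:1 tὸν οὐρανὸν"))
```

A list like this at least narrows the proofreading down to the words most likely to need rework.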

 

  So I'm not yet massively impressed. I need to find a solution for 1 before the training will be at all effective. Then I need to explain to it how to deal with things like 4 above. And what I was testing with was not a document scan but a direct print from Accordance to PDF, so it has no scanning artifacts or the like to confuse the OCR process. I think I'll retry the experiment with a larger font size print and see if that helps with any of the above.

 

  Then time to get another tool. By the way, anyone done any work with Tesseract ?

 

Thx

D
 


Edited by Daniel Semler, 06 February 2015 - 10:14 AM.



#25 Michel Gilbert

Michel Gilbert

    Silver

  • Active Members
  • 139 posts
  • Gender:Male
  • Location:Monet
  • Interests:old houses, antiques, gardens, fresh food, good coffee, live music
  • Accordance Version:11.x

Posted 06 February 2015 - 12:06 PM

Hi Peter and Daniel,

 

Along with the blessings of working in ancient languages, there is also the curse - we spend an inordinate amount of time on word processing tasks that are simple and straightforward in English, etc. We are always hoping that some of these tasks will be addressed and simplified, and when one of them is, it ranks among the most important events of our lives. The day that Microsoft fixed the right to left issue with proper word wrapping and Unicode (in Office 2003; XP Office almost had it) ranks almost as high as my wedding day and the births of my children.

 

Based on my experiences, I thought it would be difficult to OCR anything other than a clean consonantal/letter text. If OCR even worked for transliterated texts it would be a step forward. Perhaps ABBYY could be trained to recognize transliteration. But I would only try again based on your findings, Daniel. So I look forward to your reports.

 

I'm thankful for your efforts.

 

 

Regards,

 

Michel



Desktop, Duo E6320 x32, W8.1

Laptop, i7 x64, W8.1 Pro

iPad Air, A7 Chip, iOS 8.1.3


#26 Daniel Semler

Posted 23 February 2015 - 03:19 PM

Ok so I spent some time playing with ABBYY FineReader 12 Pro. It basically cannot do diacritic marks at this point for biblical Greek. I didn't even try Hebrew with vowel pointing and cantillation. I was trying out a nice clean print to PDF of a few verses from Genesis from Rahlfs. I sent this to ABBYY support and they took a crack at it as well. The support analyst's experience was the same as mine. FR Pro could not correctly identify the diacritic marks as part of the letters below, but rather tended to assume they were some sort of superscript (because of the small size) on a line above. He explained to me that they are trying to improve diacritic handling for a number of languages that require it. But at this time there isn't anything much to offer here.

 

Time to test another piece of software but that's the latest update.

Thx

D




#27 Pchris

Posted 23 February 2015 - 03:53 PM

Thanks for the update, Daniel!




#28 Fabian

Fabian

    Gold

  • Active Members
  • 402 posts
  • Gender:Male
  • Interests:www.internetkirche.com
    www.iglesia-del-internet.com
  • Accordance Version:11.x

Posted 23 February 2015 - 03:53 PM

Thanks for the update.

 

I have also had a request in with ABBYY for a month, because it didn't recognize all the macrons in Latin words either.

 

Greetings

 

Fabian


Mac Air (13-inch, Mid 2013)

1,3 GHz Intel Core i5

4GB Ram

Next time: I'll buy only one with Retina, and hopefully without a glossy screen. A faster CPU and more RAM.

 

Yosemite 10.10.1

Accordance 11.0.4 and waiting on 11.1

 

iPhone 4S

iOS 8.1.2

iAccord 1.7.9 and waiting on 2.0


#29 Enoch

Enoch

    Platinum

  • Active Members
  • 658 posts

Posted 23 February 2015 - 11:29 PM

thanks for a great discussion



#30 Daniel Semler

Posted 23 February 2015 - 11:35 PM

Ok this approximately rocks. I hate to report something with so little detail and time on it but ....

 

Gen. 1:1 Ἐν άρχῇ ἐποίησεν (3 θεὸς τὸν οὑράνον K061 τὴν γῆν.

2 ἡ δέ γῆ ἦν άόράτος K061 άκάτάσκεὺάστος, K061 σκότος

ἐπάνω τῆς ἀβύσσου, K061 πνεὺμά Θεοὺ ἐπεφέρετο ἐπάνω τοὺ

ὑδάτος. 3 K061 εἶπεν (3 θεός Γενηθήτω φῶς. K061 ἐγένετο φῶς.

4 K061 εἶδεν (3 θεὸς τὸ φῶς ὅτι κάλόν. K061 διεχώρισεν (3 θεὸς

άνά μέσον τοὺ φωτὸς K061 άνά μέσον τοὺ σκότοὺς. 5 K061

ἐκάλεσεν (3 θεὸς τὸ φῶς ἡμέραν K061 τὸ σκότος ἐκάλεσεν

νὺκτά. K061 ἐγένετο ἐσπέρά K061 ἐγένετο πρωι, ἡμέρά μιά.

Gen. 1:6 K061 εἶπεν (3 Θεός Γενηθήτω στερέωμα ἐν μέσω τοὺ

ὑδάτος K061 ἐστω διάχωριζον άνά μέσον ὑδάτος K061 ὑδάτος.

K061 ἐγένετο οὑτως. 7 K061 ἐποίησεν (3 θεὸς τὸ στερέωμά, K061

διεχώρισεν (3 θεὸς άνά μέσον τοὺ ὑδάτος, ὃ ἦν ὑποκάτω τοὺ

στερεωμάτος, K061 άνά μέσον τοὺ ὑδάτος τοὺ ἐπάνω τοὺ

στερεωμάτος. 8 K061 ἐκάλεσεν (3 θεὸς τὸ στερέωμά οὑράνόν.

K061 εἶδεν (3 θεὸς ὅτι κάλόν. K061 ἐγένετο ἐσπέρά K061 ἐγένετο

πρωι, ἡμέρά δεὺτέρά.

 

Now this is from my test page of Genesis, produced with free software doing bilingual English and Greek. I did wonder if Tesseract was worth a look. There are obvious issues: ὁ is routinely read as "(3" and καὶ as "K061". Clearly there is more work to do here, probably with training data. Anyhow, back to the Psalms, but this shows that it can be done from a clean PDF print. More to come. This is promising, and there are many more tests to do, but I'm buzzed to see it.
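
Until the training data is improved, systematic misreads like these two could be patched in a post-processing pass. This is a minimal sketch under that assumption; the `CORRECTIONS` table holds only the two misreads seen in this sample and is purely illustrative.

```python
import re

# Hypothetical correction table: only the two misreads seen in this sample.
CORRECTIONS = {
    "(3": "\u1f41",                # intended word: ὁ
    "K061": "\u03ba\u03b1\u1f76",  # intended word: καὶ
}

# Match a misread only when it stands alone between whitespace or text edges.
_PATTERN = re.compile(
    r"(?<!\S)(?:" + "|".join(re.escape(k) for k in CORRECTIONS) + r")(?!\S)"
)


def fix_known_misreads(text):
    """Replace whole-token OCR misreads with the intended Greek words."""
    return _PATTERN.sub(lambda m: CORRECTIONS[m.group(0)], text)


if __name__ == "__main__":
    print(fix_known_misreads("K061 εἶπεν (3 θεός Γενηθήτω φῶς."))
```

Note that this always restores lowercase καὶ, so a verse-initial Καὶ would still need fixing by hand.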

 

It would be unkind to kick this out without a reference to where I got this so here it is : http://ancientgreekocr.org.

 

BTW, I ran a 100 dpi and a 300 dpi scan of the page and then OCR'd it. In both cases the two issues above were gone. Of course some other issues showed up but still they were not major.

 

Gen. 1:9 Καὶ εἶπεν ὁ θεός Συναχθήτω τὸ ὕδωρ τὸ ὑποκάτω

τοὺ οὑρανοὺ εἰς συναγωγὴν μίαν, καὶ ὀφθήτω ἡ ξηρά. καὶ

ἐγένετο οὕτως. καὶ συνήχθη τὸ ’ὕδωρ τὸ ὑποκάτω τοὺ

οὑρανοὺ εἰς τας συναγωγὰς αὑτῶν, καὶ ὤφθη ἡ ξηρά. 1°

καὶ ἐκάλεσεν ὁ θεὸς τὴν ξηρὰν γῆν καὶ τὰ συστήματα τῶν

 

The 1 followed by the degree mark should be verse number 10. The above is from the 300 dpi scan, which is slower than the 100 dpi one and not obviously better.

 

Thx

D


Edited by Daniel Semler, 24 February 2015 - 12:48 AM.



#31 Daniel Semler

Posted 24 February 2015 - 01:00 AM

Ok one final example. Here is an example OCR'd from a PDF of a scan of North and Hillard. I did not do the scan - it's from Textkit.

 

T H E A R Τ Ι C L E

1. The Article is used sometimes in Greek Where it is not

used- in English :—

(a) With Nouns denoting whole classes.

αφ. οἱ θῆρες, wild beasts; οἱ ἅνθρωποι, mankind.

(h) Often with abstract Nouns and proper names, especially

the names of countries.

eg. ἡ ἀνδρεία, courage; ἡ Ἑλλάς, Greece. Χ

2. The Article used with an Adiective or Adverb, or with

an Infinitive, makes it a. Noun.

 

e.g. τὸ ἀληθές, truth ; οἱ ἀνδρεῖοι, brave men ; τὸ λέγειν,

speech; τῷ λέγειν, by speaking ; οἱ πάλαι, the men

of old.

 

3. Participles, like Adjectives, when used with the Article,

are equivalent to Nouns.

 

e.g. οἱ λέγοντες, speakers, or those who speak.

οἱ τεθνηκότες, the dead, or those who haw died.

 

This is constantly the Greek equivalent for an English

Relative clause.

e.g. τιμῶμεν τοὺς στρατιώτας τοὺς ὑπὲρ τῆς πόλεως

τεθνῑηκότας.

W e honour the soldiers who have died for their city.

 

 

There are obvious errors to be sure but still....

 

Thx

D




#32 Fabian

Posted 24 February 2015 - 02:15 AM

 

 

 

Hi Fabian,

 

Maybe you could train FineReader to recognize different Hebrew fonts - it would be a lot easier to add an old style Hebrew font as a new language.

 

 

 

Unfortunately no.

 

There are many differences between the Mac and the PC versions. In some details the Mac version can do more, but only the PC version can do this, and I don't have it. Also, only the PC version includes an editor.

 

Greetings

 

Fabian




#33 Daniel Semler

Posted 28 February 2015 - 01:26 AM

A quick additional update.

I have tried out ReadIris. It has a training mode and I thought I might be able to get that to fly. After trying and finding that it essentially did modern Greek accents only I pinged their support team. I was informed of two things. The first is that the trainer is not really for adding characters to existing alphabets but rather to assist with bad character recognition due to poor scans and such like. And secondly that Hebrew recognition is limited to consonants and no diacritic marks - vowel points or cantillation, and they expected that the same was true of Greek. I was mildly hopeful until I got that response because ReadIris was doing better at line recognition than ABBYY had done. So strike two - this one won't do either.

 

I am now trying a demo version of Omnipage 19 Ultimate. I didn't really intend to use Ultimate but that's what the demo download is. Now this looks like it should be better. Its documentation states explicit support for Ancient Greek characters and even shows an example of classical and modern Greek so you can see the different diacritics. I cannot get it working however and have logged a plea for help. Interestingly it failed completely to recognise the text in my print to PDF, but did a decent job of a scan of the same print, though admittedly I got only the modern acute accent, but mostly where accents were in the text. So it seems promising but I need help. Of course I hope one does not need Ultimate to do this. Additionally I cannot see Hebrew support so I need to check that.

 

More later - back to the Psalms.

 

thx

D


Edited by Daniel Semler, 01 March 2015 - 12:44 PM.



#34 Fabian

Posted 28 February 2015 - 03:16 AM

Hello Daniel

 

Which version of Readiris have you tried?

 

Greetings

 

Fabian




#35 Daniel Semler

Posted 28 February 2015 - 10:46 AM

ReadIris 15 Pro.

 

Thx

D




#36 Fabian

Posted 28 February 2015 - 10:49 AM

Thanks 

 

I have version 12, but it was already outdated even though it came with the new scanner. And it can't do Hebrew.

 

Greetings

 

Fabian




#37 Abram K-J

Posted 28 February 2015 - 08:30 PM

Daniel, I have nothing else to add except to say I am following this thread and reading all your posts, and really appreciating them!



#38 Timothy Jenney

Timothy Jenney

    Platinum

  • Accordance
  • 1,867 posts
  • Gender:Male
  • Location:sunny Winter Haven, FL
  • Interests:a great cup of coffee, sci-fi, jazz and the blues, kayaking, camping, fishing and the great outdoors
  • Accordance Version:11.x
  • Platforms:Mac OS X, iOS

Posted Yesterday, 08:52 AM

I agree, thank you.

 

We really do need an OCR program that recognizes ancient Greek, pointed Hebrew, cuneiform, hieroglyphics, and transliterated text of all sorts. It's ridiculous that we have made so many strides in the field, but that this relatively simple tool is still wanting. There are just SO many good printed books that have yet to be digitized.



Blessings,
"Dr. J"

Timothy P. Jenney, Ph. D.
"Lighting the Lamp" Host and Producer

 

Mac: Early 2011 17" MBP (8,3), 2.3 GHz Quad core, 16 GB RAM, Mercury 6G 480 SSD + 1.5 TB HD, OSX 10.10, Yosemite

iPhone 6 plus 64 GB iOS 8.1


#39 Daniel Semler

Posted Yesterday, 10:40 AM

The impression I get having played with this a bit now is that the commercial case for these languages is not great. Having used the Ancient Greek training data in Tesseract I know that Ancient Greek/English can be done. I now have a complete OCR of North and Hillard's Greek Prose Composition. It will need a full proofreading. I do not know yet how bad it is in terms of errors per X number of characters or per page. Now technically this is not Biblical Greek; for example dictionary support would be a bit different, but it's good enough for the actual characters.
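
For the "errors per X number of characters" question, one common measure is the character error rate: the edit distance between the OCR output and a proofread reference, divided by the reference length. The sketch below assumes a hand-corrected reference page is available; the function names are illustrative only.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance: fewest inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text, reference):
    """Edit distance divided by reference length, after NFC normalization."""
    ocr = unicodedata.normalize("NFC", ocr_text)
    ref = unicodedata.normalize("NFC", reference)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr, ref) / len(ref)
```

Normalizing both sides to NFC first means a letter with the wrong accent counts as one error, however the diacritics happen to be encoded.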

 

I know how one would go about training Tesseract for Hebrew with pointing and cantillation, at least approximately. It's time-consuming but as far as I know perfectly doable, though I would want to know Hebrew first - another work in progress I'm afraid. I'm currently reading up on Unicode. I am trying to find out exactly how Hebrew pointing and cantillation is handled.
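
As a starting point on the Unicode side: in the Hebrew block, vowel points and cantillation accents are separate nonspacing combining characters that follow the consonant they attach to. A minimal sketch with Python's standard `unicodedata` module, with helper names of my own choosing, shows the layering:

```python
import unicodedata


def consonants_only(text):
    """Drop every nonspacing mark (vowel points, dagesh, cantillation)."""
    return "".join(c for c in text if unicodedata.category(c) != "Mn")


def mark_kinds(text):
    """Count Hebrew letters, points and accents by Unicode name prefix."""
    counts = {"LETTER": 0, "POINT": 0, "ACCENT": 0}
    for c in text:
        parts = unicodedata.name(c, "").split()
        if len(parts) >= 2 and parts[0] == "HEBREW" and parts[1] in counts:
            counts[parts[1]] += 1
    return counts


if __name__ == "__main__":
    # First word of Genesis 1:1, with vowel points and one accent (tipeha).
    word = ("\u05d1\u05bc\u05b0\u05e8\u05b5\u05d0\u05e9\u05c1\u05b4"
            "\u05d9\u05ea\u0596")
    print(mark_kinds(word))
    print(consonants_only(word))
```

So a pointed text is the consonantal text plus interleaved marks, which is why a consonants-only recognizer has no way to emit them.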

 

Finally there is a good deal of refinement in the commercial algorithms and I am not sure how the open source stuff compares, at least in part because the commercial houses do not publish their algorithms, but also because of the profound lack of support for these languages in commercial software. Dictionary support really helps in weeding out errors, and that means dictionaries for these ancient languages.

 

My guess is that software houses either have to develop their own internal skills on this, developing OCR tools, or at the very least training data for them, and doing the process themselves, or they have to contract it out to specialists. Neither approach is cheap and both have drawbacks. The specialist OCR houses know how to scan and train software but are unlikely to have the language skills. The reverse would tend to be true of the software houses at least at the outset. Then the texts need proof-reading by people who can read the languages. This is a time-consuming editorial process and not flawless and the specialists in these languages are usually busy with other tasks. So this all translates into time and money.

 

I agree that flawless OCR is very much needed for getting many wonderful old texts into software. I do not know how this is currently being achieved, but it's neither cheap nor flawless. I made a remark to one OCR software house that if they had a biblical languages group of some 10 or so languages in their product I could point to people who would buy it. But alas that is a bunch of work and while I can point to a handful of people, mostly on this list :) who would buy it that isn't really enough to convince them. I believe that most of the effort would go in training and dictionary building. I suspect the algorithms are fairly close but of course I don't really know.

 

Problems can creep into the process at numerous places and they have to be weeded out. Building the skills to perfect the process so that the final outcome is the best possible requires a bunch of practice. The original scanning has a big impact and there are definitely better and worse ways to do this. People are so good at reading bad text, just look at old papyrus fragments from which we have been able to extract text, but computers are not nearly so good at it. And training data, depending upon the script, can range from say 60 thousand characters to well over a million depending upon the software and quality you are looking for. So just training the software properly is work. I begin now to understand why the training modes in the commercial software are not really aimed at allowing a user to add new scripts as much as they are at helping the software better understand the scripts it already knows. The issues are not simple ones. The Lord built a wonderful reading machine in people; we, curiously enough, have difficulty duplicating the feat :)

 

It really does feel like one of those problems that is frustratingly close to solved but the last push is required. I'm no expert just an interested amateur. We'll see how far I get. We probably need a startup or crowd source or some sort of effort to do dead language OCR to push all this to finally get done. Ah well, as ever back to the Psalms - month 3 beginning.

 

Thx

D
 




#40 Daniel Semler

Posted Yesterday, 11:22 PM

Ok so I'm thinking Omnipage ain't it either. My pleas for help were rejected because I didn't own the software, despite my explaining that I was testing it with their demo package. They have a money back guarantee - unusual in software - so that you can try it out, but I'm not inclined to, given that as far as I can tell it won't do Hebrew, and though the help pages actually show Classical Greek it doesn't appear able to do it. It looks like it could be trained to do it perhaps, but as with all these tools the trainers are a kind of torture to use for one reason or another. There was an Omnipage 11 explanation of doing Classical Greek, but that method mentions options that don't appear to be what they were, and my best guesses didn't pan out.

 

So the only thing that seems to work right now for Classical Greek and English is free: gImageReader with the Ancient Greek training data. I'm going to play more with this and see what I can make of it. There are Hebrew training sets for Tesseract (the engine in gImageReader). I suspect they might be modern but I can check it out. The other thing about Tesseract-based tools is that you can train in entirely new languages. Tweaking with the trainers in the commercial packages is not really cutting it and I'm pretty sure they are not aimed at it. They are aimed at tweaking the recognizer on a new typeface or a bad scan for languages it already knows.
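
For anyone who would rather script Tesseract than drive it through gImageReader, a run might be wrapped as below. This is a sketch under assumptions: that the `tesseract` command-line tool is installed, and that the Ancient Greek training data is installed under the language code `grc` (check what your installation actually calls it). The function names and file names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, languages=("grc", "eng")):
    """Build a Tesseract command line; "+" joins several languages."""
    # Assumption: "grc" is the installed name of the Ancient Greek data.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base):
    """Run Tesseract, which writes its text to output_base + ".txt"."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_ocr() once Tesseract is installed.
    print(" ".join(tesseract_command("page.png", "page")))
```

Scripting it this way makes it easy to loop over every page of a scanned book and compare results between settings.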

 

Feel free to toss up any other suggestions and I may be able to give them a shot but for now I'll try some serious shots with Tesseract and see if I can get a process going, at least for Greek. If that works out then there are endless other things to try.

 

Thx

D






