
Automatic grammatical tagging


jlm


Source code for lemmatizing and similar text-processing tasks is available from a number of projects. One that attracted my attention last year is the Classical Language Toolkit (CLTK: http://cltk.org/), which "offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia. Greek, Latin, Akkadian, and the Germanic languages are currently most complete." It is Python 3 code under an MIT license. It has lemmatizers for Greek and Latin; ambiguous forms are reported as the most common form in the training corpus. That's usually good enough to get you to the right section of the dictionary.
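As a rough idea of what this looks like in practice, here is a minimal sketch of lemmatizing a Latin sentence with CLTK. I'm assuming the CLTK 1.x NLP pipeline here; module and attribute names have changed between versions, so treat the details as illustrative rather than definitive.

    # Sketch only: assumes the CLTK 1.x NLP pipeline; older releases exposed
    # lemmatization under different module names.
    from cltk import NLP

    nlp = NLP(language="lat")          # downloads the Latin models on first run
    doc = nlp.analyze(text="Gallia est omnis divisa in partes tres")

    # Each token carries the lemma the pipeline judged most likely; ambiguous
    # forms fall back to the most frequent lemma in the training corpus.
    for word in doc.words:
        print(word.string, "->", word.lemma)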

 
For Akkadian, it can derive a "bound form": I don't know anything about Akkadian, but perhaps that is equivalent to a lemma. There is also a "stemmer," but I suspect that only produces truncated forms for search algorithms (e.g., a stemmer for English reduces any of "probable," "probability," and "probably" to "probabl" so that these related words can be found in a search).
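The English example is easy to check with an off-the-shelf stemmer; for instance, a quick sketch with NLTK's Porter stemmer (assuming NLTK is installed) should collapse all three words to the same truncated stem:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["probable", "probability", "probably"]:
        # All three should come out as the same artificial stem ("probabl"),
        # which is useful for matching in a search but is not a headword.
        print(word, "->", stemmer.stem(word))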
 
There are other tools of this sort. Diogenes, an aging macOS program for the TLG and PHI corpora, has a simple word-form parser based on a list of inflected forms. Helmut Schmid's TreeTagger uses Markov models to make more intelligent guesses about lemmas and parts of speech from the surrounding context (its reported accuracy is 95–97%).
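The Diogenes-style approach is essentially a big lookup table from inflected forms to possible analyses; here is a toy sketch of the idea (the two entries below are hand-written placeholders, not real data from Diogenes):

    # Toy form table: each inflected form maps to every lemma/parse it could
    # belong to. A real table would have vastly more entries.
    FORM_TABLE = {
        "legis":  [("lego", "verb, 2nd sg. pres. act. ind."),
                   ("lex",  "noun, gen. sg.")],
        "amamus": [("amo",  "verb, 1st pl. pres. act. ind.")],
    }

    def parse(form):
        # Return every possible analysis; a context-aware tagger such as
        # TreeTagger would instead pick the analysis it judges most likely.
        return FORM_TABLE.get(form.lower(), [])

    print(parse("Legis"))   # ambiguous: could be from "lego" or "lex"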
 
These sorts of tools could make untagged texts more useful in Accordance. I see two possible approaches. One is on the user side: if user Bible import supported grammatical tagging, a sophisticated user could use tools of this sort to produce crudely tagged texts. It would then be possible to use the lexeme and other grammatical information in lexicon lookups and searches.
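On the user side, the workflow could be as simple as running the plain text through a lemmatizer and writing out word/lemma pairs in whatever format the import feature accepted. The tab-separated layout and the tiny lemma dictionary below are purely hypothetical placeholders, not Accordance's actual import syntax:

    # Hypothetical pipeline: turn untagged text into a crudely tagged file.
    # LEMMA_GUESSES stands in for a real lemmatizer (CLTK, TreeTagger, ...).
    LEMMA_GUESSES = {"divisa": "divido", "partes": "pars", "tres": "tres"}

    def lemmatize(token):
        # Fall back to the surface form when no lemma is known.
        return LEMMA_GUESSES.get(token.lower(), token)

    text = "Gallia est omnis divisa in partes tres"
    with open("tagged.txt", "w", encoding="utf-8") as out:
        for token in text.split():
            out.write(f"{token}\t{lemmatize(token)}\n")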
 
The other approach would be for Accordance to add a lemmatizer to their apps, so that when a word with no grammatical tagging is selected, the lemmatizer finds its possible lexical forms and a lexicon entry (and perhaps other grammatical information) can be displayed. This might also be used to make an interlinear, FWIW. I would also find it useful for the short Greek and Latin texts quoted in certain tools (the ICC comes to mind). It would probably be too slow to use in a search, although perhaps it could be done in the opposite direction: expand a lexical form to a list of all its possible inflected forms and search for those. If they're interested in doing this, the tools above would probably not be useful directly, but their datasets and algorithms may be reusable. Accordance also has many tagged texts that might (depending on licensing terms) be usable for building proprietary datasets.
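Searching in the opposite direction could amount to expanding the lexical form into its paradigm and matching any of those strings; a rough sketch, with a hand-abbreviated paradigm standing in for real morphological data:

    import re

    # Hand-abbreviated paradigm for illustration; a real implementation would
    # generate or look up the full set of inflected forms for the lemma.
    PARADIGMS = {
        "amo": ["amo", "amas", "amat", "amamus", "amatis", "amant",
                "amare", "amavi", "amatum"],
    }

    def search_lemma(lemma, text):
        forms = PARADIGMS[lemma]
        # One alternation that matches any inflected form as a whole word.
        pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, forms)) + r")\b",
                             re.IGNORECASE)
        return pattern.findall(text)

    print(search_lemma("amo", "Amant et amavi, sed non amatis."))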
 
On a related note, implementing a general stemmer (Google "Porter Stemmer") would make flex searches more useful in non-English Bibles.
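NLTK's Snowball family already covers a number of languages, so it is easy to experiment with what such a flex search might group together; a quick sketch for German (assuming NLTK is installed):

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("german")
    for word in ["Tag", "Tage", "Tagen", "Tages"]:
        # Related surface forms should collapse to a shared, lowercased stem,
        # which is what lets a flex search treat them as one word.
        print(word, "->", stemmer.stem(word))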
 
These are obviously suggestions for a major version upgrade.