News, How-tos, and assorted Views on Accordance Bible Software.

Wednesday, May 10, 2006  

Learning New Ways to Count

In the comments on Monday's post, one user asked about a little-known feature of Accordance: the various Count options in the Set Analysis Display dialog box. This feature has been there since version 3.6 (released way back in January 1999), and it enables you to analyze vocabulary usage across a particular search range from a variety of statistical angles.

It's easier to show you how this works than it is to explain it, so here's a little case study. Let's say that we want to examine the vocabulary used in the Gospel of John. Here's how to do it:

  1. Open a search window and choose the tagged Greek New Testament (GNT-T) as your search text. If you don't have the Greek, an English translation will work as well.
  2. In the argument entry box, type an asterisk, then choose the AND command from the Enter Command submenu of the Search menu (or use the keyboard shortcut shift-command-A). Now choose the Range command from the same submenu (or use the keyboard shortcut shift-command-R).
  3. Type "john" to replace the selected question mark inside the Range command, then click OK.

You've just searched for every word in the book of John. Note also that we used the Range command as a quick way to specify a range that we may not have previously defined in the Range pop-up menu of the More Options section of the Search window.

Now, what good is there in searching for every word in a book? "Much in every way!" That is, provided you click the Details button:

  1. Click the Details button to open the Details Workspace, then click the Analysis tab to bring it to the front.
     
    By default, the Analysis gives an alphabetical listing of every word which was found in John. But we can tweak this information using the one keyboard shortcut you absolutely must learn: Command-T.
  2. Use Command-T to open the Set Analysis Display dialog box.
  3. In the dialog box, select Count Down from the Sort pop-up menu, then click OK.

The Analysis window will now list the words which were found in descending order of occurrence. Thus, the words which appear most often are at the top of the list, while the words which appear only once are at the bottom.
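For those who like to see the idea spelled out, here is a minimal Python sketch of what a Count Down sort amounts to: tally each word, then list the tallies from highest to lowest. It is only an illustration, not Accordance's own code. The sample word list and the function name count_down are made up for the example, and breaking ties alphabetically is my assumption.

```python
from collections import Counter

# Hypothetical sample: a handful of transliterated words standing in
# for the hits returned by a search. Not real data from the GNT-T.
hits = ["ho", "kai", "ho", "pisteuo", "ho", "kai", "kosmos"]


def count_down(words):
    """Return (word, count) pairs, highest count first.

    Ties are broken alphabetically; that tie-break is an assumption
    made for this sketch.
    """
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for word, n in count_down(hits):
    print(f"{word}\t{n}")
```

Even in this tiny sample, the article-like word "ho" lands at the top simply because it occurs most often.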

Now, viewing the word list by the number of times each word appears is a helpful way to analyze vocabulary usage, but it tends to place all the common words at the top of the list: words like the definite article, the conjunction "and," personal pronouns like "him," "I," and "you," etc. You have to scroll past these common words before you get to the words which may represent a particular focus or interest of the author.

This observation led our programmer to consider other ways to count word usage. In addition to "Number," he came up with "Frequency," "Uniqueness," and "Importance," and he placed these options in the Count pop-up menu of the Set Analysis Display dialog box. Follow these steps to compare the various options:

  1. While looking at the Analysis tab of the Details workspace, choose Duplicate Tab from the File menu (or use the keyboard shortcut Command-D).
  2. Use Command-T to open the Set Analysis Display dialog box.
  3. Choose Uniqueness from the Count pop-up menu and click OK.

When you compare this list to the one before it, you'll see a very different group of words appearing at the top of the list. Rather than the common words, you get the words which are most "unique" to the current search range. The number beside each word is the ratio of the word's hits in the current range to the number of verses in the entire text which contain it. If you select the first word in the list (helos, nail) and then choose GNT-T from the Resource palette, you'll see that this word appears two times in the GNT-T in a single verse, John 20:25. Thus, the ratio 2.0 next to this word is arrived at by dividing 2 hits by 1 verse. (Hey, that's a level of math which even I can understand!)

Now look at the words in the list with a ratio of 1.0. Some of these words are marked with an asterisk, while others are not. If you scroll all the way down to the bottom of the Analysis window, you'll see a note explaining that the asterisk marks "words appearing only once in the entire text." In other words, the asterisk marks all the hapax legomena. Ainon (the place name Aenon) is one such word. It appears only once in the entire Greek New Testament. Contrast that with antleo, "to draw." This word appears four times in four verses, all of which are in John's Gospel. It therefore has a ratio of 1.0, even though it is not a hapax.

Okay, I've analyzed the way Accordance calculates "Uniqueness" to the point that your eyes are now glazing over. The point is simply this. Counting by Uniqueness gives the greatest weight to those words which only appear within your current search range. Common words such as the definite article get pushed much further down the list.
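If you'd rather see the Uniqueness arithmetic than read about it, here is a small Python sketch using only the figures quoted above for helos, antleo, and Ainon. It follows my description of the ratio (hits in the range divided by verses in the whole text containing the word). It is not Accordance's actual code, and the function names uniqueness and is_hapax are invented for the example.

```python
def uniqueness(hits_in_range, hit_verses_in_text):
    """Hits in the search range divided by the number of verses in the
    entire text that contain the word (as described in this post)."""
    return hits_in_range / hit_verses_in_text


def is_hapax(hits_in_text):
    """A hapax legomenon appears only once in the entire text."""
    return hits_in_text == 1


# (hits in John, verses in the GNT-T containing the word, hits in the GNT-T)
# These are the figures quoted in the post.
examples = {
    "helos": (2, 1, 2),
    "antleo": (4, 4, 4),
    "Ainon": (1, 1, 1),
}

for word, (in_range, verses, total) in examples.items():
    mark = "*" if is_hapax(total) else ""
    print(f"{word}{mark}\t{uniqueness(in_range, verses)}")
```

Note how antleo and Ainon both come out at 1.0, but only Ainon earns the asterisk.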

While it's interesting to see words which are more or less unique to your search range, uniqueness does not necessarily equal importance. For example, is Aenon an important concept in John's Gospel? Hardly! It's a place mentioned one time, in connection with the ministry of John the Baptist. To find the important words, we need to:

  1. Choose Duplicate Tab from the File menu (or use the keyboard shortcut Command-D) to create another Analysis tab.
  2. Use Command-T to open the Set Analysis Display dialog box.
  3. Choose Importance from the Count pop-up menu and click OK.

Examine this vocabulary list, and you start to identify words which represent special emphases in the Gospel of John, such as "Jesus," "Father," "believe," "world," "Jew," and "disciple." The Importance ratio is calculated by multiplying the number of hits in the search range by the Uniqueness ratio described above. The result is that words used often in the book of John but which appear much less frequently in the rest of the New Testament get pushed toward the top of the list. Some common words also get pushed back up toward the top, but even some of these show interesting trends. For example, houtos and ekeinos ("this" and "that" respectively) tend to be used more frequently in John than in the rest of the New Testament. The use of oun ("so, then, therefore") is even more clearly concentrated in John than in the rest of the New Testament. Does this mean that John more explicitly makes causal connections between events than other New Testament writers? Perhaps. To find out, I can select oun in the Analysis window, select GNT-T from the Greek Texts pop-up menu of the Resource palette, and then examine each occurrence of that word in context.
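Here is a rough Python sketch of the Importance calculation as I've described it: hits in the range multiplied by the Uniqueness ratio. Again, this is only an illustration and not Accordance's own code. The figures for helos, antleo, and Ainon come from earlier in this post, while the two entries labeled "hypothetical" use made-up counts purely to show how the weighting behaves.

```python
def uniqueness(hits_in_range, hit_verses_in_text):
    """Hits in the range divided by verses in the whole text with the word."""
    return hits_in_range / hit_verses_in_text


def importance(hits_in_range, hit_verses_in_text):
    """Hits in the range multiplied by the Uniqueness ratio."""
    return hits_in_range * uniqueness(hits_in_range, hit_verses_in_text)


# word: (hits in range, verses in the entire text containing the word)
words = {
    "helos": (2, 1),    # figures quoted in the post
    "antleo": (4, 4),   # figures quoted in the post
    "Ainon": (1, 1),    # figures quoted in the post
    "hypothetical-emphasized": (100, 200),  # made up: frequent and concentrated in the range
    "hypothetical-everywhere": (100, 800),  # made up: frequent but spread across the text
}

# Sort by Importance, highest first.
ranked = sorted(words, key=lambda w: importance(*words[w]), reverse=True)

for word in ranked:
    print(f"{word}\t{importance(*words[word])}")
```

The made-up "emphasized" word outranks the "everywhere" word even though both have the same number of hits in the range, and a rare word like Ainon drops to the bottom despite its perfect Uniqueness score.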

The cool thing about these various ways of "counting" words in a word list is that it helps me to spot trends which I might not otherwise have seen. Obviously, simple algorithms can only do so much to help us gauge the relative importance of a word to a particular author; but they do give us a great place to start.

Note: The perceptive among you will notice that I didn't bother with counting by Frequency. Frequency is the number of hits per 1000 words, the same ratio which is used to generate the bars of the Graph. Choosing to count by Frequency will show you that ratio rather than the total number of hits, but it doesn't really change the sort order. It's included for the sake of completeness, but "Uniqueness" and "Importance" are the Count options which are really of interest.
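A quick Python sketch shows why counting by Frequency leaves the sort order alone. I'm assuming here that "per 1000 words" means per 1000 words of the search range. The word counts and the 2000-word total are made up for the example, and the function name frequency is my own.

```python
def frequency(hits, total_words):
    """Hits per 1000 words (assuming total_words is the size of the range)."""
    return hits * 1000 / total_words


# Hypothetical counts in a hypothetical 2000-word range.
TOTAL_WORDS = 2000
counts = {"word-a": 30, "word-b": 10, "word-c": 2}

by_number = sorted(counts, key=lambda w: counts[w], reverse=True)
by_frequency = sorted(
    counts, key=lambda w: frequency(counts[w], TOTAL_WORDS), reverse=True
)

for word in by_frequency:
    print(f"{word}\t{counts[word]}\t{frequency(counts[word], TOTAL_WORDS)}")

# Every count is scaled by the same factor, so the two orders match.
print(by_number == by_frequency)
```

Since each count is multiplied by the same constant, the ranking can't change; only the numbers displayed do.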





Comments:
This post is going into a file on my iBook. I really appreciate you taking the time and effort to illustrate these features. VERY helpful and useful. Hopefully they will not be "little known" any longer!

SDG
 

David, I searched for all words in Zephaniah and used the "uniqueness" sort field, and the word at the top of the list has an asterisk and the number is 417. It is the Hebrew term for 'cedar work' and it is a hapax. Why would the number be 417?
 

This post has brought to light a discrepancy in the uniqueness and importance numbers depending on whether you use the RANGE command or a set range in the Range dialog. The numbers that David shows used the temporary Range command. There a hapax shows up as Uniqueness one, and words that occur more than once but only within the range get a higher number.

We will take a look at the discrepancy and how the numbers are being calculated when we get a chance.
 

Thanks for the update, Helen. That is helpful. I look forward to hearing the results of your investigation.

SDG
 
