
Word occurrences: AppleScript, GNT and a spreadsheet


Λύχνις Δαν


Hi ya,

 

I was wondering the other day about the distribution of words in the GNT. Were there very few words that occurred just once? What was the maximum occurrence count of a word? What did the number of forms look like when plotted against the number of occurrences? I can easily see that ὁ in its various forms is the single most frequently occurring word in the GNT at 19865 occurrences - a simple verse search of [count 17000-20000] reveals that. Likewise [count 1-1] (or [count 1]) finds all the words occurring only once in any form. But what does the curve look like in between? What I wanted was to run a bunch of queries and plot the results, because I don't know another way to do this in Accordance. (Incidentally, if anyone does know, I'd love to hear about it.) So my solution was to script it with AppleScript. I figured those interested in the possibilities here might be interested in the code involved. I've attached it, along with a workspace to run it against, though creating the workspace yourself is not hard.
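To make the "bunch of queries" idea concrete, here is a rough sketch in Python (not the attached AppleScript) of how a list of [count ...] searches could be generated. The bucket layout is purely an assumption for illustration: single-value buckets up to 10, then ranges that double in width. The attached script may bucket differently, and the names make_buckets and count_query are made up here.

```python
def make_buckets(max_count, single_up_to=10, factor=2):
    """Return inclusive (low, high) ranges covering 1..max_count.

    Assumed layout: one bucket per value up to single_up_to,
    then ranges whose width grows by `factor` each step.
    """
    buckets = [(n, n) for n in range(1, min(single_up_to, max_count) + 1)]
    low = single_up_to + 1
    width = single_up_to
    while low <= max_count:
        high = min(low + width - 1, max_count)
        buckets.append((low, high))
        low = high + 1
        width *= factor
    return buckets


def count_query(low, high):
    """Format a bucket in the [count ...] search syntax quoted in the post."""
    return f"[count {low}]" if low == high else f"[count {low}-{high}]"


if __name__ == "__main__":
    # 20000 is just above the 19865 occurrences of the most frequent word.
    for low, high in make_buckets(20000):
        print(count_query(low, high))
```

Each printed line is one search to run; the hit count for each search then becomes one point on the chart.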

 

For those who just want to see the chart, here it is: [attached image: post-32023-0-92646700-1388737105_thumb.jpg]. It was produced by taking the CSV file written by the script and graphing it in LibreOffice. Note that where a bucket was a range rather than a single value, I plotted its x value as the end of the bucket (highest occurrence count) rather than the start.
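As a rough illustration of that CSV step, here is a small Python sketch (again, not the attached AppleScript) that turns per-bucket results into rows with the bucket's upper bound as the x value. The column names and the helper names bucket_rows and to_csv are assumptions, and apart from the 1940 figure for words occurring once, the sample numbers are placeholders rather than real results.

```python
import csv
import io


def bucket_rows(results):
    """Map ((low, high), word_count) pairs to (x, y) rows.

    x is the top of the bucket, matching how the chart was plotted.
    """
    return [(high, word_count) for (low, high), word_count in results]


def to_csv(rows):
    """Render rows as CSV text with a header line (column names assumed)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["max_occurrences", "word_count"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder data: only the first value comes from the post.
    sample = [((1, 1), 1940), ((11, 20), 3)]
    print(to_csv(bucket_rows(sample)))
```

A file in this shape opens directly in LibreOffice, where the first column can be used as a logarithmic x axis.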

 

I had wondered if I would get a bell shape or multiple peaks or what. I got neither, which still surprises me - there are very few cases where one bucket has more words than the preceding comparably sized bucket. With bucket sizes of just 1 you do see fluctuation, but the general trend is that as words become less frequent in the text, the number of such words increases. (Note the bumps in the curve are caused by the bucketing boundaries on the logarithmic x axis.) Spot checks of individual buckets agree with the plotted results.

 

This leads to the rather interesting find that of the 5426 distinct words in the GNT, 1940 occur only once. Thus the last 36% or so of your vocab (if you learn by frequency, which a lot of 1st year grammars seem to teach) is gonna hurt to acquire, and you are not going to get representative usage from the GNT alone. Conversely, of course, there are over 130000 word occurrences in the text, so a couple of thousand rare ones hopefully won't cause too much trouble :)
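The arithmetic behind those two observations, as a quick Python check. The figures are the ones quoted above; 130000 is only a lower bound on the total, so the second result is an upper bound.

```python
distinct_words = 5426        # distinct words in the GNT (from the post)
once_only = 1940             # words occurring exactly once (from the post)
min_total_occurrences = 130000  # "over 130000", so a lower bound

# Share of the vocabulary made up of once-only words.
vocab_share = once_only / distinct_words

# Share of the running text they account for (at most, since the
# true total is larger than the lower bound used here).
text_share_upper = once_only / min_total_occurrences

print(f"{vocab_share:.1%} of the vocabulary")
print(f"under {text_share_upper:.1%} of the text")
```

So the once-only words are a bit over a third of the vocabulary but well under 2% of what you actually read.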

 

Now this is a somewhat trivial example though I found it fun, but the code solved a number of problems that I expect I'll hit again in subsequent experiments.

 

Anyhow, feel free to shoot holes in my analysis, or my code, or both. The attached code is pretty simple - I don't know how to do anything complicated in AS yet. I documented what I could, and it's really only prototype code. It's helping me learn about GUI scripting, when it helps and what it can do for me. Perhaps others will find the techniques useful.

 

Thx

D

ScriptEx1.zip


Hey James, I've previously read about Smile but did not realise it was still in production. I'll check it out. It looks very interesting.

 

thx

D
