The Book of Mormon Zipf Index


Jamie123

Statisticians among you may be interested in something I've just discovered.

As part of one of my "little projects" (which I hope to get published someday) I've been measuring the Zipf indices of English texts. I have analysed 50 documents so far, and of all of them the ''Book of Mormon'' has the highest Zipf index when measured across low-frequency words. To summarize:

Low frequency (word ranks 101-400), Zipf index = 1.28

For comparison, the runner-up is ''Huckleberry Finn'', with an index of 1.25, and the average for the group is around 1.15.

I have also invented my own indirect method* of measuring Zipf index based on vocabulary growth, which puts the ''Book of Mormon'' index at 1.346. This is again the highest in the group, the runner-up being ''Pilgrim's Progress'' at 1.203.

Interestingly, the BoM also has the lowest Zipf index of the group when measured across high-frequency words (ranks 1-20), namely 0.80. For comparison, the next lowest was well over 0.9! In fact I've noticed an inverse relationship between the high- and low-frequency Zipf indices - it's not a very strong correlation, but it does meet the 95% confidence criterion.

*To be fair the program crashes when I use it on very long books (a bug I still have to fix) so these particular figures are based on the first 100,000 words of longer texts like the BoM.
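For anyone curious how the "95% confidence" claim on the high/low-frequency correlation would be checked: the standard approach is a Pearson correlation plus a t-test. This is just a sketch of that standard method, not the program used for the post - the function names are mine, and the critical value quoted in the comment assumes roughly 50 texts:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def t_statistic(r, n):
    """t statistic for testing r != 0. Compare |t| against the two-tailed
    Student's t critical value with n - 2 degrees of freedom (about 2.01
    for n = 50 texts at the 95% level)."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

With the high-frequency indices as `xs` and the low-frequency ones as `ys`, a |t| above the critical value is what "meets the 95% confidence criterion" means here.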

Edited by Jamie123

Ah, I meant this for statisticians (of whom there are a few in this group), but basically Zipf's law is a relationship between the frequency of something (loosely, the number of times it is observed) and its rank, highest-to-lowest, among the alternatives. For example, the most common word in any English text is "the": it has rank 1 and a frequency of about 7%. The second most frequent is usually "and", which has a frequency of about 4% and rank 2... and so on. Zipf's law states that if r is the rank, the frequency is proportional to 1/r^a, where the constant a is called the "Zipf index". I say "constant" but actually it's not constant at all - though it is approximately constant over limited ranges of frequency. The law applies to words in texts, but also to many other things like city sizes, people's incomes, web-page visits etc.
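If anyone wants to try this at home, here is a minimal sketch of measuring a Zipf index over a rank window by least-squares fitting log-frequency against log-rank. (This is not the program behind the figures in the post - the function name and the naive tokenizer are my own choices.)

```python
import math
import re
from collections import Counter

def zipf_index(text, lo_rank=1, hi_rank=20):
    """Estimate the Zipf index a in f(r) ~ 1/r**a over ranks
    lo_rank..hi_rank, by an ordinary least-squares fit of log(frequency)
    against log(rank). The index is minus the fitted slope."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    total = len(words)
    # (log rank, log frequency) points for the chosen rank window
    pts = [(math.log(r), math.log(c / total))
           for r, (_, c) in enumerate(counts, start=1)
           if lo_rank <= r <= hi_rank]
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return -slope
```

Calling it as `zipf_index(text, 101, 400)` gives the "low-frequency" window used above, and `zipf_index(text, 1, 20)` the "high-frequency" one.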

By the way, these are the most common words in the BoM (first integer is the rank, second is the order of first appearance in the text, and the fractional number is the frequency, as a simple ratio):

1 1 THE 0.0697387
2 21 AND 0.059325
3 3 OF 0.0428328
4 43 THAT 0.0249667
5 24 TO 0.0233973
6 44 THEY 0.0162784
7 57 IN 0.0134004
8 41 UNTO 0.0132047
9 282 I 0.012023
10 150 HE 0.0115192
11 16 IT 0.0111821
12 14 NEPHI 0.0101853
13 80 THEIR 0.0101745
14 269 THEM 0.00960901
15 79 FOR 0.00914142
16 47 BE 0.00910518
17 121 SHALL 0.00902181
18 3644 ALMA 0.00831862
19 134 HIS 0.00816639
20 61 WHICH 0.0080359

The lowest low-frequency Zipf index I measured was 0.919 for "The Song of Hiawatha", the runner-up being "Gulliver's Travels" at 1.004.  
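A table like the one above - rank, first-appearance order, word, frequency - only takes a few lines of Python to generate. Again this is just a sketch with my own function name and a naive tokenizer, not the program used for the post:

```python
import re
from collections import Counter

def word_table(text, top=20):
    """Return (rank, first_appearance_order, WORD, frequency) rows:
    rank by descending count, plus the order in which each distinct
    word first entered the vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    first_seen = {}                      # word -> order of first appearance
    for w in words:
        if w not in first_seen:
            first_seen[w] = len(first_seen) + 1
    total = len(words)
    rows = []
    for rank, (w, c) in enumerate(Counter(words).most_common(top), start=1):
        rows.append((rank, first_seen[w], w.upper(), c / total))
    return rows
```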

Edited by Jamie123

Guest Mores

I kind of have a picture, though it will take me a while to let it sink in.  But what significance does that have in any sort of critical analysis of text?


6 hours ago, Mores said:

I kind of have a picture, though it will take me a while to let it sink in.  But what significance does that have in any sort of critical analysis of text?

Are you referring to the BoM coming top (and bottom!), or the project in general? In the former case, perhaps none: I just thought it was interesting. At some point I may see if other religious texts (the Bible, the Bhagavad Gita, the Koran etc.) have similar properties - now I think about it, it could be a property of "King James" English and no more than that - but it will be interesting to find out.

As for the project in general, the "critical analysis of text" is not really my aim (though it would be a nice bonus). I am mostly interested in two things. Firstly, generating automatic text mimicking the style of existing text for cryptographic purposes, and (turning the chessboard around) distinguishing such artificial texts from the real thing in terms of their statistical properties. Secondly, a text document can serve as a model for other complex systems; back in the 1970s Efron and Thisted (you can read their paper here https://www.jstor.org/stable/2335721?seq=1#metadata_info_tab_contents) suggested a link with biodiversity, the frequencies of words within documents paralleling the populations of species within a habitat. I wonder whether there is an analogy to malware within computer systems, and therefore a computer-security application. (Then again I may be wrong.)

Anyway I'm not suggesting this is anything especially important; but since it is about the BoM, and you guys are the only LDS people I associate with, I thought it was worth a post.

Edited by Jamie123

Curiouser and curiouser.....look at these results....

Bible: First 200,000 words (Genesis to start of 1 Samuel):

New International Version:

High Frequency Zipf Index: 0.810

Low Frequency Zipf Index: 1.087

King James Version:

High Frequency Zipf Index: 0.847

Low Frequency Zipf Index: 1.216

Book of Mormon (Entire Document):

High Frequency Zipf Index: 0.802

Low Frequency Zipf Index: 1.292

The high-frequency indices are actually very close together, and all remarkably low. Whatever causes this "Biblical" low high-frequency Zipf behaviour would appear to survive translation. On the other hand, the low-frequency indices for the BoM and KJV are both unusually large and similar, while the value for the NIV is by contrast below average. (The average is around 1.15.) The BoM value is still significantly higher than that of the KJV, despite the similarity of language style.

Bonus information: Here are some words which appear only once in the Book of Mormon (the technical term is hapax legomena):

5598 5595 BRUTALITY 3.62467e-06
5599 5596 DEPRAVITY 3.62467e-06
5600 5597 PERVERSION 3.62467e-06
5601 5598 BRUTAL 3.62467e-06
5602 5600 PRINCIPLE 3.62467e-06
5603 5602 WILFULNESS 3.62467e-06
5604 5603 GRIEVE 3.62467e-06
5605 5604 WEIGH 3.62467e-06
5606 5605 SITTETH 3.62467e-06
5607 5606 EXHORTATION 3.62467e-06
5608 5607 ACKNOWLEDGETH 3.62467e-06
5609 5608 SEVERALLY 3.62467e-06
5610 5609 PERFECTED 3.62467e-06
5611 5611 REUNITE 3.62467e-06
5612 5612 TRIUMPHANT 3.62467e-06

I was a little surprised by some of these!
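Extracting hapax legomena like the ones above is straightforward. A sketch (my own function name and naive tokenizer again, not the actual program):

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the words that occur exactly once in the text
    (the hapax legomena), in order of appearance."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [w.upper() for w in words if counts[w] == 1]
```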

Edited by Jamie123

  • 2 weeks later...

Of all the texts I've run the complete analysis on, the Book of Mormon is not only the longest, but also has the smallest vocabulary in proportion to its length. By "in proportion" I don't mean as a simple ratio, but relative to the Heaps' law curve optimized for the entire collection - the red dotted line on the graph. (Heaps' law, by the way, states that the vocabulary of a text is proportional to size^n, where n is around 0.6.) Interestingly, by far the largest vocabulary is that of James Joyce's Ulysses, which is almost exactly the same length as the BoM.

Here there could be a "critical analysis" application: the BoM contains a lot of repetition ("And it came to pass..." etc.), bulking out the length without adding to the vocabulary. Ulysses, on the other hand, is full of neologisms (words invented by Joyce for a single use), which boost the vocabulary without any significant increase in size.

The shortest text in the set was Longfellow's The Song of Hiawatha - which also has a smaller than average vocabulary for its size. In fact there seems to be some evidence of bimodality - many data points cluster above or below the line, while almost none actually lie on it. I need at some point to see if there's any commonality within those two groups, but unfortunately my "proper" work keeps getting in the way!

I hope you're all enjoying this riveting series of symposia! 😜

[Figure: vocabulary size versus text length for the corpus, with the Heaps' law fit shown as a red dotted line.]

P.S. In case you're wondering, the Zipf indices for Ulysses are 0.758 (high freq.) and 1.073 (low freq.) - quite similar to those of the NIV Bible!

Edited by Jamie123

This is fascinating. Depending on your view of the origin of the Book of Mormon, these results point toward quite different geneses. For example, if (as probably pretty much every non-Latter-day Saint believes) the Book of Mormon is simply a fictitious story, the work's tiny vocabulary seems indicative of the author's own limited vocabulary. But when the book is examined at a deeper level than Samuel Clemens' famous "chloroform in print" remark, this becomes hard to reconcile with the Book of Mormon's clearly complex structure in both story and textual coherence.

If (as a typical faithful Latter-day Saint believes) the text is the product of a divinely inspired "translation" by Joseph Smith (or by anyone else), the matter takes on a much different hue. In that case, you have to start thinking about Nephi and the following prophets laboriously carving characters into malleable plates of a gold alloy, using something like the Egyptian Demotic script, which the authors themselves called "reformed Egyptian" (a perfect description of what we today call Demotic, by the way). This provides an elegant, organic explanation for the limited vocabulary: The authors were using a highly compressed shorthand that depended heavily on stylized or unique characters to represent whole ideas in a highly compact way, and thus needed to simplify their expressions as much as possible. Any "non-standard" expression would need to be spelled out in phonetic reformed Egyptian (or in Hebrew, which they didn't use because it was too large for their purposes), which would tend to defeat the purpose of a supercompact written language.

 

Edited by Vort
