In analyzing the word distribution of Necktie for a Two-Headed Tadpole in my last post, I came across some interesting patterns. If you look at the distribution of words in a text, you’ll find the usual inverse relationship: high-frequency words like “the” and “of” are clumped to the left of the graph, and low-frequency words like “Persian-tinted” and “sombrero” are clumped to the right. This is a fairly common statistical result. But what I’ve done is a meta-analysis of such an analysis, and this meta-analysis gives me a way to determine if a text is more like poetry or more like prose. The score from this meta-analysis (the Prosody Index) is a number between 1 and 100, with a score less than 50 indicating poetry and a higher score indicating prose.

This is actually revolutionary, so let me explain. The first graph shows the distribution of words in a text. A value on the x-axis is the rank of a given word when words are sorted by frequency, most frequent first; the corresponding value on the y-axis is the number of times that word is used. So for example, the most frequent word (“the”) has a value of 1 on the x-axis and 889 on the y-axis, so this point is plotted near the upper-left corner of the chart. There are lots of least-common words which appear only once (like “sombrero”), and they start from around 1140 and continue through 2381 on the x-axis; because they appear only once, their y-value is 1.
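For readers who want to reproduce the first graph, here is a minimal sketch of the rank-by-frequency table. The tokenization rule (lowercased runs of letters, with internal hyphens or apostrophes kept so that "persian-tinted" counts as one word) and the function name `rank_frequency` are my assumptions for illustration; the post does not say how words were split.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Assumed tokenization: lowercase runs of letters, allowing internal
    # hyphens and apostrophes. A different rule will shift the counts.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; rank 1 is the x-axis value
    # of the most frequent word, and count is its y-axis value.
    return [(rank, word, n)
            for rank, (word, n) in enumerate(counts.most_common(), start=1)]

sample = "the frog and the toad sat by the pond and the frog sang"
table = rank_frequency(sample)
for rank, word, n in table:
    print(rank, word, n)
```

On a real book the table would have thousands of rows, with the long tail of count-1 words at the high ranks.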

Looking at the chart, you can see that words are clumped into groups; for example, there are (2381 - 1140 = 1241) words which appear only once. Likewise, there are 483 words which appear only twice; these are the next “group” on the graph. Each group can be counted this way, and we can come up with a graph of word-group by group-density. Group-density is the group's frequency (how many times each of its words appears) times the number of words in the group. The first data point has a group-density of 1 x 1241 = 1241. The next data point has a group-density of 2 x 483 = 966. And so on. A sample of this graph for my book is shown in the next illustration.
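The grouping step can be sketched in a few lines. The helper names `group_sizes` and `group_density` are hypothetical, and the only real figures used are the two groups quoted above (1241 words appearing once, 483 appearing twice).

```python
from collections import Counter

def group_sizes(word_counts):
    """Map each frequency to the number of distinct words with that frequency."""
    return Counter(word_counts.values())

def group_density(sizes):
    """Return (frequency, group-density) pairs, lowest frequency first.

    Group-density is frequency times the number of words in the group.
    """
    return [(freq, freq * n_words) for freq, n_words in sorted(sizes.items())]

# The two groups quoted in the post.
print(group_density({1: 1241, 2: 483}))

# A toy word-count table (made up for illustration).
toy_counts = {"the": 4, "frog": 2, "and": 2, "toad": 1, "pond": 1}
toy_density = group_density(group_sizes(toy_counts))
print(toy_density)
```

Note that the densities add up to the total number of words in the text, since each group contributes every occurrence of its words.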
We can consider this graph as a curve, and calculate the area under the curve. A chunk of the area is concentrated to the left, and a chunk is concentrated to the right, with a little bit in the middle. A very simple calculation on the data (finding where the running total of the area reaches half of the whole) will tell us at which point on the graph half of the area is to the left, and half is to the right. The value of this point in the case of my own book is at 29 on the x-axis. Since there’s a total of 78 points on the x-axis, the half-way value occurs 37% of the way along the x-axis (37% = 29/78), and so the Prosody Index of my book is 37.

The next two illustrations show similar graphs for two other texts from the Gutenberg archives, The Frogs by Aristophanes and Sketches New and Old by Mark Twain (chosen because it includes the story of “The Jumping Frog of Calaveras County”; yes, we’re going with a frog theme in this analysis).
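Here is one way the half-area calculation could be sketched. It is an assumption on my part, not necessarily the exact calculation used for the numbers in this post: it treats the area as the plain sum of the group-densities, walks along the points in order, and reports the first position where the running total reaches half. The function name `prosody_index` is made up for this sketch.

```python
def prosody_index(densities):
    """Position of the half-area point as a percentage of the x-axis.

    `densities` holds the group-density values in x-axis order
    (words used once first, then twice, and so on).
    """
    if not densities:
        raise ValueError("need at least one group-density value")
    total = sum(densities)
    running = 0
    for position, density in enumerate(densities, start=1):
        running += density
        # Stop at the first point where at least half the area lies
        # at or to the left of it.
        if 2 * running >= total:
            return 100 * position / len(densities)

# A flat curve puts the half-area point in the middle.
print(prosody_index([1] * 78))

# The arithmetic quoted in the post: point 29 out of 78.
print(round(100 * 29 / 78, 1))
```

A trapezoid-rule area or interpolation between points would give slightly different values, so treat the result as approximate.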
The Prosody Indexes for The Frogs and Sketches New and Old are 28 and 59, respectively. What does this mean? As you can tell by inspecting the graph, more of Aristophanes’ words are to the left-hand side, the side where we can expect to find groups of words which are used less frequently. (Remember, all the words that appear only once are on the first value of the x-axis, all the words that are used twice are on the second value of the x-axis, and so on.) Aristophanes wrote poetic plays, and one of the hallmarks of poetry is a sparse but specific choice of words; in contrast, by inspecting Mark Twain’s graph, you can see that most of his words cluster to the right, indicating that he reuses words frequently. Such a reuse of words is more appropriate to Mark Twain’s journalistic style, and is a hallmark of prose.

In analyzing a selection of texts from the Gutenberg archive, you see the same characteristics appearing; texts which are more prose than poetry cluster to the right and yield higher Prosody Indexes, and those which are more poetry than prose cluster to the left and yield lower Prosody Indexes. The following chart shows a selection of such Prosody Indexes by text. I have chosen 50 as the cutoff between poetry and prose, since a Prosody Index of 50 indicates that half of the area lies on either side of the midpoint of the graph. This choice is supported by a Prosody Index of 53 for James Joyce’s Ulysses, commonly considered to be the most poetic novel in the English language. Not surprisingly, Nietzsche’s Beyond Good and Evil comes in at 50: part poetry, part prose, and undeniably philosophy. I have only sampled 25 texts from the Gutenberg archive so far, but I will continue doing this as time permits to see if the assumptions underlying the Prosody Index still hold.

Some Spring Days in Iowa, by Frederick John Lazell 23.0
The Frogs, by Aristophanes 27.5
Autumn Leaves, by John Bartlett 34.2
The Divine Comedy, by Dante Alighieri, translated by the Rev. H. F. Cary, M.A. 35.8
Necktie for a Two-Headed Tadpole, by Jason Murk 37.2
A Christmas Carol, by Charles Dickens 44.9
Beyond Good and Evil, by Friedrich Nietzsche 50.6
The Adventures of Grandfather Frog, by Thornton W. Burgess 52.0
Ulysses, by James Joyce 53.1
The Time Machine, by H. G. Wells 54.0
The Importance of Being Earnest, by Oscar Wilde 54.2
Alice’s Adventures in Wonderland, by Lewis Carroll 57.1
Sketches New and Old, by Mark Twain 58.6
Frankenstein, by Mary Wollstonecraft Shelley 61.5
The World War and What Was Behind It, by L. P. Benezet 61.7
The Fat and the Thin, by Emile Zola 62.7
The Picture of Dorian Gray, by Oscar Wilde 65.4
A Tramp Abroad, by Mark Twain 67.6
A Tale of Two Cities, by Charles Dickens 68.3
Sense and Sensibility, by Jane Austen 72.3
Crime and Punishment, by Fyodor Dostoevsky 74.7
Gargantua and his Son Pantagruel, by Master Francis Rabelais 74.8
Huckleberry Finn, by Mark Twain 75.1
Dracula, by Bram Stoker 77.2
Bleak House, by Charles Dickens 80.1
Don Quixote, by Miguel de Cervantes 84.7

Of course, going meta to this meta-analysis, is this what you get when a math major from MIT takes to writing novels?