In analyzing the word distribution of Necktie for a Two-Headed Tadpole in my last post, I came across some interesting patterns. If you look at the distribution of words in a text, you’ll find the usual inverse relationship whereby high-frequency words like “the” and “of” are clumped to the left of the graph, and low-frequency words like “Persian-tinted” and “sombrero” are clumped to the right. This is a fairly common statistical result. But what I’ve done is a meta-analysis of such an analysis, and this meta-analysis gives me a way to determine if a text is more like poetry or more like prose. The score from this meta-analysis (the Prosody Index) is a number between 1 and 100, with a score less than 50 indicating poetry and a higher score indicating prose.

This is actually revolutionary, so let me explain. The first graph shows the distribution of words in a text. Each value on the x-axis is the rank of a given word when the words are sorted by frequency; its corresponding value on the y-axis is the number of times that word is used. So, for example, the most frequent word (“the”) has a value of 1 on the x-axis and 889 on the y-axis, so this point is plotted near the upper-left corner of the chart. There are lots of least-common words which appear only once (like “sombrero”), and they start from around 1140 and continue through 2381 on the x-axis; because they appear only once, their y-value is 1.
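For anyone who wants to rebuild a graph like the first one, here is a minimal Python sketch of the rank-by-frequency tally described above. The tokenizer (lowercased letters, with internal hyphens or apostrophes kept so that “Persian-tinted” stays one word) and the function name `rank_frequency` are assumptions for illustration; the original counts may have been produced with a different tokenizer.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Assumed tokenizer: runs of letters, optionally joined by a hyphen or apostrophe.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; rank 1 is the most frequent word.
    return [(rank, word, n)
            for rank, (word, n) in enumerate(counts.most_common(), start=1)]

# Hypothetical sample sentence, only to show the shape of the output.
sample = "the frog and the toad saw the Persian-tinted sombrero"
for rank, word, n in rank_frequency(sample):
    print(rank, word, n)
```

Plotting rank on the x-axis against count on the y-axis gives the kind of curve described above, with “the” at the upper left and the once-only words trailing off to the right.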

Looking at the chart, you can see that words are clumped into groups; for example, there are 2381 - 1140 = 1241 words which appear only once. Likewise, there are 483 words which appear only twice; these are the next “group” on the graph. Each group can be counted this way, and we can come up with a graph of word-group by group-density. Group-density is the group’s frequency times the number of words in the group. The first data point has a group-density of 1 × 1241 = 1241. The next data point has a group-density of 2 × 483 = 966. And so on. A sample of this graph for my book is shown in the next illustration.
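The grouping step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact script behind the graphs: the function name `group_densities` is hypothetical, and the toy input is rebuilt from the figures quoted above (1241 words used once, 483 used twice, and “the” used 889 times), with made-up placeholder names standing in for the real words.

```python
from collections import Counter

def group_densities(word_counts):
    """Turn a word -> count mapping into (frequency, n_words, density) rows.

    Rows are sorted by frequency, so the once-only group comes first.
    """
    # frequency -> number of distinct words used exactly that many times
    groups = Counter(word_counts.values())
    return [(freq, n_words, freq * n_words)
            for freq, n_words in sorted(groups.items())]

# Toy input built from the numbers in the post; the word names are placeholders.
toy_counts = {f"once_{i}": 1 for i in range(1241)}
toy_counts.update({f"twice_{i}": 2 for i in range(483)})
toy_counts["the"] = 889

for row in group_densities(toy_counts):
    print(row)
```

The first two rows reproduce the group-densities worked out above, 1241 and 966.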

We can consider this graph as a curve, and calculate the area under the curve. A chunk of the area is concentrated to the left, and a chunk is concentrated to the right, with a little bit in the middle. A very simple calculation on the data (using the mean-value theorem from calculus) will tell us at which point on the graph half of the area is to the left, and half is to the right. The value of this point in the case of my own book is at 29 on the x-axis. Since there’s a total of 78 points on the x-axis, the half-way value occurs 37% of the way along the x-axis (29/78 = 37%), and so the Prosody Index of my book is 37.

The next two illustrations show similar graphs for two other texts from the Gutenberg archives, The Frogs by Aristophanes and Sketches New and Old by Mark Twain (chosen because it includes the story of “The Jumping Frog of Calaveras County”; yes, we’re going with a frog theme in this analysis).
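The half-area calculation behind the Prosody Index can be sketched as follows. This is one plausible reading rather than the exact calculation used for the graphs: it assumes the half-area point is the first group at which the running total of group-densities reaches half of the overall total, and that the x-axis position is the group’s ordinal position (1 through 78 for my book) rather than its frequency value. The function name `prosody_index` is hypothetical. Because each density is a frequency times the number of words at that frequency, the densities add up to the total number of words in the text.

```python
def prosody_index(densities):
    """Return 100 * (position of the half-area group) / (number of groups).

    `densities` lists the group-densities in x-axis order, starting with
    the group of words that appear only once.
    """
    total = sum(densities)
    if total <= 0:
        raise ValueError("densities must sum to a positive value")
    running = 0
    for position, density in enumerate(densities, start=1):
        running += density
        # Assumed rule: stop at the first group where half the area is covered.
        if 2 * running >= total:
            return 100.0 * position / len(densities)

# Toy densities, not taken from any real text.
print(prosody_index([4, 3, 2, 1]))   # area piled to the left
print(prosody_index([1, 1, 1, 7]))   # area piled to the right
```

With 78 groups and a half-area point at group 29, the same formula gives 100 × 29 / 78, which rounds to 37.2.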


The Prosody Indexes for these texts are 28 and 59, respectively. What does this mean? As you can tell by inspecting the graph, more of Aristophanes’ words are to the left-hand side, the side where we can expect to find groups of words which are used less frequently. (Remember, all the words that appear only once are on the first value of the x-axis, all the words that are used twice are on the second value of the x-axis, and so on.) Aristophanes wrote poetic plays, and one of the hallmarks of poetry is a sparse but specific choice of words; in contrast, by inspecting Mark Twain’s graph, you can see that most of his words cluster to the right, indicating that he reuses words frequently. Such a reuse of words is more appropriate to Mark Twain’s journalistic style, and is a hallmark of prose.

In analyzing a selection of texts from the Gutenberg archive, you see the same characteristics appearing; texts which are more prose than poetry cluster to the right and yield higher Prosody Indexes, and those which are more poetry than prose cluster to the left and yield lower Prosody Indexes. The following chart shows a selection of such Prosody Indexes by text. I have chosen 50 as the cutoff between poetry and prose, since a Prosody Index of 50 indicates that 50% of the words are to either side of the midpoint of the graph. This choice is confirmed by a Prosody Index of 53 for James Joyce’s Ulysses, commonly considered to be the most poetic novel in the English language. Not surprisingly, Nietzsche’s Beyond Good and Evil comes in at 50: part poetry, part prose, and undeniably philosophy. I have only sampled 25 texts from the Gutenberg archive so far, but I will continue doing this as time permits to see if the assumptions underlying the Prosody Index are still valid.
| Text | Prosody Index |
|---|---|
| Some Spring Days in Iowa, by Frederick John Lazell | 23.0 |
| The Frogs, by Aristophanes | 27.5 |
| Autumn Leaves, by John Bartlett | 34.2 |
| The Divine Comedy, by Dante Alighieri, translated by the Rev. H. F. Cary, M.A. | 35.8 |
| Necktie for a Two-Headed Tadpole, by Jason Murk | 37.2 |
| A Christmas Carol, by Charles Dickens | 44.9 |
| Beyond Good and Evil, by Friedrich Nietzsche | 50.6 |
| The Adventures of Grandfather Frog, by Thornton W. Burgess | 52.0 |
| Ulysses, by James Joyce | 53.1 |
| The Time Machine, by H. G. Wells | 54.0 |
| The Importance of Being Earnest, by Oscar Wilde | 54.2 |
| Alice’s Adventures in Wonderland, by Lewis Carroll | 57.1 |
| Sketches New and Old, by Mark Twain | 58.6 |
| Frankenstein, by Mary Wollstonecraft Shelley | 61.5 |
| The World War and What Was Behind It, by L. P. Benezet | 61.7 |
| The Fat and the Thin, by Emile Zola | 62.7 |
| The Picture of Dorian Gray, by Oscar Wilde | 65.4 |
| A Tramp Abroad, by Mark Twain | 67.6 |
| A Tale of Two Cities, by Charles Dickens | 68.3 |
| Sense and Sensibility, by Jane Austen | 72.3 |
| Crime and Punishment, by Fyodor Dostoevsky | 74.7 |
| Gargantua and his Son Pantagruel, by Master Francis Rabelais | 74.8 |
| Huckleberry Finn, by Mark Twain | 75.1 |
| Dracula, by Bram Stoker | 77.2 |
| Bleak House, by Charles Dickens | 80.1 |
| Don Quixote, by Miguel de Cervantes | 84.7 |
Of course, going meta to this meta-analysis, is this what you get when a math major from MIT takes to writing novels?