CHARACTERISTIC CURVES OF COMPOSITION.

T. C. MENDENHALL.

TERRE HAUTE, IND.: Moore & Langen, Printers and Binders. 1887.

The Characteristic Curves of Composition.

From “Science,” March 11th, 1887.

Augustus DeMorgan somewhere remarks (I think it is in his “Budget of Paradoxes”) that some time somebody will institute a comparison among writers in regard to the average length of words used in composition, and that it may be found possible to identify the author of a book, a poem, or a play, in this way.

In reflecting upon this remark at various times within the past five or six years, always with the determination to test the value of the suggestion whenever time for the work seemed available, a more comprehensive and satisfactory method of analysis than that based simply upon mean word-length suggested itself. The new method, while scarcely more laborious than that proposed by DeMorgan, promised to yield results more quickly and of a definitely higher order. It also had the advantage of including, in its application, all that was necessary to the determination of mean word-length; so that, in reality, it furnished two distinct tests.

Preliminary trials of the method have furnished strong grounds for the belief that it may prove useful as a method of analysis leading to identification or discrimination of authorship, and it is therefore brought to the attention of the scientific and literary public in the hope that some one may be found who is at once able and willing to secure a satisfactory test of its validity.

The nature of the process is extremely simple, but it may be useful to point out its similarity to a well-known method of material analysis, the consideration of which actually first suggested to the writer its literary analogue.
By the use of the spectroscope, a beam of non-homogeneous light is analyzed, and its components assorted according to their wave-length. As is well known, each element, when intensely heated under proper conditions, sends forth light which, upon prismatic analysis, is found to consist of groups of waves of definite length, and appearing in certain definite proportions. So certain and uniform are the results of this analysis, that the appearance of a particular spectrum is indisputable evidence of the presence of the element to which it belongs.

In a manner very similar, it is proposed to analyze a composition by forming what may be called a “word-spectrum,” or “characteristic curve,” which shall be a graphic representation of an arrangement of words according to their length and to the relative frequency of their occurrence. If, now, it shall be found that with every author, as with every element, this spectrum persists in its form and appearance, the value of the method will be at once conceded. It has been proved that the spectrum of hydrogen is the same, whether that element is obtained from the water of the ocean or from the vapor of the atmosphere. Wherever and whenever it appears, it means hydrogen. If it can be proved that the word-spectrum or characteristic curve exhibited by an analysis of “David Copperfield” is identical with that of “Oliver Twist,” of “Barnaby Rudge,” of “Great Expectations,” of the “Child’s History of England,” etc., and that it differs sensibly from that of “Vanity Fair,” or “Eugene Aram,” or “Robinson Crusoe,” or “Don Quixote,” or anything else in fact, then the conclusion will be tolerably certain that when it appears it means Dickens.

The validity of the method as a test of authorship, then, implies the following assumptions: that every writer makes use of a vocabulary which is peculiar to himself, and the character of which does not materially change from year to year during his productive period; that, in the use of that vocabulary in composition, personal peculiarities in the construction of sentences will, in the long run, recur with such regularity that short words, long words, and words of medium length, will occur with definite relative frequencies.

Fig. 1. First One Thousand Words in “Oliver Twist.”

The first assumption will, perhaps, be admitted in a general way, without debate. It is easily seen that to prove or disprove the second will require the expenditure of an enormous amount of labor. The following results are offered as a means of properly exhibiting the method, and as evidence, in some degree, at least, of its real value.

It is important, first, to determine to what extent an author may be said to agree with himself; and, second, to what extent he differs from others.

Fig. 2. Showing Five Groups, of One Thousand Words Each, from “Oliver Twist.”

As an instance in which two writers might well be expected to greatly resemble each other in their curves, and consequently as a severe test of the method, two contemporaneous novelists, Dickens and Thackeray, were selected for the first examination. The operation consisted simply in counting the number of letters in every word, and recording the number of words of one letter, two letters, three letters, etc. The count began in both cases at the beginning of the volume, and, after a few thousand words had been counted in order, the book was opened at random near the middle, and the count continued. In no case was any personal choice exercised, except that both counts began with the first chapter. Words were counted always in groups of one thousand.
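The counting operation described above can be sketched in a few lines of Python. This is a minimal sketch and not the author's procedure in detail: the article does not say how apostrophes, hyphens, or numerals were treated, so the rule used here (apostrophes dropped, every other non-letter character separating words) is an assumption, and the function names `word_lengths` and `group_counts` are hypothetical.

```python
import re
from collections import Counter


def word_lengths(text):
    """Return the length in letters of every word in `text`, in order."""
    # Assumption: apostrophes are dropped ("don't" counts as 4 letters) and
    # any other non-letter character separates words.
    cleaned = text.replace("'", "").replace("\u2019", "")
    return [len(word) for word in re.findall(r"[A-Za-z]+", cleaned)]


def group_counts(lengths, group_size=1000):
    """Tally word lengths in consecutive complete groups of `group_size` words.

    A trailing partial group is ignored, since the article always counts
    in full groups of one thousand.
    """
    groups = []
    for start in range(0, len(lengths) - group_size + 1, group_size):
        groups.append(Counter(lengths[start:start + group_size]))
    return groups


# A short sample sentence, grouped by 12 words only to keep the demo small.
sample = "It was the best of times, it was the worst of times."
lengths = word_lengths(sample)
counts = group_counts(lengths, group_size=12)
print(lengths)
print(sorted(counts[0].items()))
```

Each `Counter` maps a word length to the number of words of that length in one group, which is the pair of quantities the article tabulates for every thousand words.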
The graphic display of the result was made by the common method of rectangular co-ordinates, using the number of letters in a word as an abscissa, and the corresponding number of such words in a thousand as an ordinate. As an illustration, the first one thousand words counted from “Oliver Twist” may be cited; they were as follows:

Number of letters:   1    2    3    4    5    6    7    8    9   10   11   12
Number of words:    38  170  235  175  123   91   62   41   35   10   13    7

Even in so small a number as one thousand, the relative distribution of words is approximately the same as in a much larger number, although, as would naturally be expected, accidental variations or “runs” overshadow personal characteristics to a great extent; but not completely, as will be seen in the characteristic curves shown in the following pages. In fact, when the ten groups, of a thousand words each, from Dickens, are compared with ten similar groups from John Stuart Mill, no one of the first set could by any possibility be mistaken for any one of the second.

Fig. 3. Two Consecutive Groups, of One Thousand Words Each, from “Vanity Fair.” These Groups Show Sensibly the Same Average Word-Lengths.

The graphic representation of the results will be readily understood. It is only necessary to take a sheet of “squared” paper, or paper ruled in two directions at right angles to each other, and, after placing the numbers showing letters in each word at points along a horizontal line separated from each other by equal distances, above each of these place other points whose distance from the base line shall be proportional to the number of such words in a thousand; then join these points by a broken line, and the characteristic curve is shown. Fig. 1 shows the curve thus constructed from the first thousand words in “Oliver Twist,” the numerical analysis of which is shown above. The next diagram (Fig.
2) exhibits five curves constructed from the first five thousand words from the same work, in groups of one thousand each. It is presented in order to show the variation among groups based on a relatively small number of words.

Fig. 4. Two Groups, of Five Thousand Words Each, from “Oliver Twist.”

The superiority of this method over that of simple word averages, as suggested by DeMorgan, is clearly shown in Fig. 3, which exhibits two consecutive groups, of one thousand words each, from “Vanity Fair.” The numerical analysis of these groups is as follows:

Letters:                 1    2    3    4    5    6    7    8    9   10   11   12   13   14
Words in first group:   25  169  232  187  109   78   79   48   28   20   10   10    2    3
Words in second group:  33  14?  248  134  135   79   73   52   35   23    9    0    2    2

It will be seen that the total number of letters in the first group is 4,507, and in the second 4,508, or an average of 4.507 and 4.508 letters to each word in the respective groups. If this average, or “mean word-length,” be alone considered, the two groups must be regarded as sensibly identical; but an inspection of the diagram shows that they are in reality quite different.

When the number of words in a group is increased to five thousand, the accidental irregularities begin to disappear, the curve becomes smoother, approximating more nearly to the normal curve which, it is assumed, is characteristic of the writer. Fig. 4 exhibits two groups, each of five thousand words, from “Oliver Twist,” and it will be seen that considerable differences still exist. One of the curves shows an excess of nine-letter words, which does not appear in the other. They agree in showing a greater number of six-letter words than a smooth curve would demand. This excess may persist, and prove to be a real characteristic of Dickens’s composition.

Fig. 5. Curve for Ten Thousand Words from “Oliver Twist.”

Fig.
5 exhibits these two groups of five thousand words combined in one of ten thousand, giving a curve of greater smoothness, and approximating still more closely to the normal curve of the writer.

In Fig. 6, two groups of five thousand words each, from “Vanity Fair,” are shown; and in Fig. 7, two groups of ten thousand each, from “Oliver Twist” and “Vanity Fair,” are placed side by side for comparison, the former being represented by the continuous line, and the latter by the broken line. Although these curves differ, and while it is believed that the difference will persist with an increased number of words, it is certainly surprising that in the analysis of ten thousand words from Dickens, and the same number from Thackeray, so close an agreement should be found. This agreement is particularly striking in words of eleven, twelve and thirteen letters, the numerical comparison of which is as follows:

Number of letters:               11   12   13
Number of words in Dickens:      85   57   29
Number of words in Thackeray:   [?]   58   29

This closeness to identity must be largely the result of accident, and it would not be likely to repeat itself in another analysis.

Fig. 6. Two Groups, of Five Thousand Words Each, from “Vanity Fair.”

The writer next examined was John Stuart Mill; and to test the persistence of form in compositions belonging to different periods of the author’s life, and upon different
subjects, two groups of five thousand words each were taken, one from his “Political Economy,” and the other from his “Essay on Liberty.” It was anticipated, of course, that words of greater length would occur far more frequently than in the case of the novelists; but I confess to considerable surprise on finding from the very beginning, that although, on the whole, this anticipation was realized, the word which occurred most frequently was not the three-letter word, as with both Dickens and Thackeray, but the word of two letters. Indeed, the word of two letters was not only relatively more frequent, but absolutely; that is to say, it occurred more frequently in the composition of Mill than in that of either of the novelists, and with great uniformity, as it was in excess in each thousand of the ten analyzed. The explanation is easy, and is to be found in the liberal use of prepositions in sentence-building. The proposed method of analysis is designed to reveal any peculiarity of this kind, and the exemplification of its power thus early in the work was encouraging.

Fig. 7. Two Groups, of Ten Thousand Words Each, from “Oliver Twist” (continuous line) and from “Vanity Fair” (broken line).

Figures 8 and 9 show the curves for five thousand words from the “Political Economy” and from the “Essay on Liberty.” It will be observed, that, while they differ considerably, there is still, in a general way, a striking resemblance, and that they are in marked contrast with the curves of the novelists.

An interesting case was furnished in two recent addresses on the labor question by Mr. Edward Atkinson. In reality, one address was given to two very different audiences. One was made up from the workingmen of Providence, and the other from the alumni of the Andover theological seminary. On reading the two, one cannot avoid being struck by the marked difference in style, although the two papers are much alike in substance.
It was interesting, then, to inquire whether their curves of composition would show any marked resemblance. An analysis of five thousand words from each paper was made, and the result is shown in Fig. 10. A very satisfactory, indeed a striking general resemblance will be observed; and it will also be seen that Mr. Atkinson’s curve differs decidedly from others previously figured and described. It is shown in contrast with that of John Stuart Mill in Fig. 11.

Fig. 8. Curve of Five Thousand Words from Mill’s “Political Economy.”

Mr. Atkinson’s composition is remarkable in respect to the shortness of the words used. The average length of ten thousand words in his addresses on the labor question is 4.298 letters. The mean word-length of the writers thus far examined, based upon a count of ten thousand words from each, is as follows:

Atkinson    4.298
Dickens     4.342
Thackeray   4.481
Mill        4.775

A friend has furnished me with the result of the count of the first five thousand five hundred words of Caesar’s “Commentaries.” The mean word-length is 6.065.

The most extensive word-counting that I know of is that of the words and letters in the Bible. I cannot vouch for the reliability of the information which periodically floats through the columns of the public press, that the Old Testament contains 592,493 words with 2,728,100 letters, and the New Testament 181,253 words with 838,380 letters. It is interesting to note, however, that these numbers give averages of 4.604 and 4.625 respectively, agreeing within less than one-half of one per cent.

Fig. 9. Curve of Five Thousand Words from Mill’s “Essay on Liberty.”

Before making an analysis of Mr. Atkinson’s composition, and after having counted more than thirty thousand from other writers, I had concluded that a group of one thousand words whose average length was less than four letters would not occur, except in compositions especially written in short words.
Out of ten such groups from Mr. Atkinson’s addresses, however, one was found whose mean word-length was 3.991. I have recently received from him a brief paper, entitled “How do We All Get a Living?” which was published in Work and Wages, and in the preparation of which he made a special effort to use the simplest language possible. The article contains a little more than two thousand words, the number being too small for the construction of a curve which would be comparable with those already exhibited. The general form of one based upon two thousand words is similar to that previously obtained from the same writer, and the mean word-length is 3.771.

Fig. 10. Two Groups, of Five Thousand Words Each, from Addresses of Edward Atkinson: Address to Workingmen; To Alumni of Theological Seminary (broken line).

Interesting evidence of the validity of this method of analysis and identification has been furnished by several friends who have had the patience to enumerate the letters in many thousand words from different sources. Prof. Stanley Coulter sends me the result of a count of ten thousand from Dickens’s “Christmas Carol.” He writes: “I became exceedingly interested in watching how little tricks of composition affected the ‘curve.’ For instance, one of the characters, ‘Scrooge,’ appears in one place very often, and an excess of 7’s is the result; in another place ‘Fizziwig,’ and the 8’s creep up [this is doubtless owing to the frequent appearance of the names]. Other variations and excesses seem to come from Dickens’s love of certain forms of description, which he iterates and reiterates upon a single page.”

I have plotted these ten thousand words from the “Carol” with the ten thousand already shown from “Oliver Twist,” in Fig. 12.
A very close resemblance will be observed, and it will be noticed that the mean of these two curves would be free from certain irregularities which occur in both, and would be a much closer approximation to the normal characteristic curve of Dickens.

Fig. 11. Two Groups, of Ten Thousand Words Each: Atkinson; Mill (broken line).

It is hardly necessary to say that the method is not necessarily confined to the analysis of a composition by means of its mean word-length; it may equally well be applied to the study of syllables, of words in sentences, and in various other ways. The results thus far obtained from its application would appear to justify the claim that it is worthy of a thorough test through which the validity of its assumptions might be proved or disproved. Its principal merits are, that it offers a means of investigating and displaying the mere mechanism of composition, and that it is purely mechanical in its application. In virtue of the first, it might reveal characteristics which a writer would make no attempt to conceal, being himself unaware of their existence; and, of the second, the conclusions reached through its use would be independent of personal bias, the work of one person in the study of an author being at once comparable with that of any other.

Many interesting applications of the process will suggest themselves to every reader; the most notable, of course, being the attempt to solve questions of disputed authorship, such as exist in reference to the letters of Junius, the plays of Shakspeare, and other less widely known examples. It might also be utilized in comparative language studies, in tracing the growth of a language, in studying the growth of the vocabulary from childhood to manhood, and in other directions too numerous to be catalogued.
An illustration of its application to another language is shown in the analysis of more than five thousand words in Caesar’s “Commentaries,” already referred to, which is represented in Fig. 13. The curve shows a relatively large use of long words, and its peculiar feature is the evident indication of two maximum ordinates nearly equal to each other.

Fig. 12. Two Groups, of Ten Thousand Words Each, from Dickens: “Oliver Twist”; “Christmas Carol.”

From the examinations thus far made, I am convinced that one hundred thousand words will be necessary and sufficient to furnish the characteristic curve of a writer; that is to say, if a curve is constructed from one hundred thousand words of a writer, taken from any one of his productions, then a second curve constructed from another hundred thousand words would be practically identical with the first; and that this curve would, in general, differ from that formed in the same way from the composition of another writer, to such an extent that one could always be distinguished from the other.

To demonstrate the existence of such a curve will require the enumeration of the letters in several hundred thousand words from each of a number of writers. Should its existence be established, the method might then be applied to cases of disputed authorship. If striking differences are found between the curves of known and suspected compositions of any writer, the evidence against identity of authorship would be quite conclusive. If the two compositions should produce curves which are practically identical, the proof of a common origin would be less convincing; for it is possible, although not probable, that two writers might show identical characteristic curves.

Fig. 13. Group of Five Thousand Five Hundred Words from Caesar’s “Commentaries.”

T. C. Mendenhall.
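The comparison of curves that the closing paragraphs call for can be sketched numerically. The article gives no formula for how closely two curves agree, so the measure below (the sum of absolute differences between the counts at each word length) is an assumption chosen for simplicity, and the function names are hypothetical. The data are the first thousand words of “Oliver Twist” and the first of the two “Vanity Fair” groups tabulated earlier. Three entries (62 and 35 for “Oliver Twist,” 169 for “Vanity Fair”) are illegible in the scanned tables and are restored here as the values that bring each group to one thousand words. Groups this small are, by the article's own account, dominated by accidental variation, so the figures illustrate the mechanics only.

```python
def mean_word_length(counts):
    """Mean letters per word; counts[i] is the number of words of i + 1 letters."""
    total_words = sum(counts)
    total_letters = sum((i + 1) * n for i, n in enumerate(counts))
    return total_letters / total_words


def curve_difference(a, b):
    """Sum of absolute differences between two curves.

    Assumption: this measure is not from the article, which compares
    curves by eye; it is used here only as a simple numerical stand-in.
    """
    width = max(len(a), len(b))
    a = list(a) + [0] * (width - len(a))  # pad the shorter curve with zeros
    b = list(b) + [0] * (width - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))


# Words of 1, 2, 3, ... letters in one thousand words.
# 62, 35 and 169 are restored values (see the note above).
oliver_twist = [38, 170, 235, 175, 123, 91, 62, 41, 35, 10, 13, 7]
vanity_fair = [25, 169, 232, 187, 109, 78, 79, 48, 28, 20, 10, 10, 2, 3]

print(mean_word_length(oliver_twist))               # 4.348
print(mean_word_length(vanity_fair))                # 4.507
print(curve_difference(oliver_twist, vanity_fair))  # 108
```

The second mean reproduces the 4,507 letters per thousand words stated for the first “Vanity Fair” group, which supports the restored entry.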
Note. Since the appearance of the above in “Science,” I have received from Professor Coulter the record of the count of another group of ten thousand words in the “Christmas Carol.” The agreement with the first group is remarkable. The average word-length of the first is 4.265 and of the second 4.269, a difference of less than one-tenth of one per cent. When this group is plotted its curve is nearly identical with the mean of the curves shown in Fig. 12. T. C. M.
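The percentage agreements quoted in the article and in this note can be checked with simple arithmetic. A minimal sketch follows; the helper name `percent_difference` is hypothetical, and taking the smaller value as the base of the percentage is an assumption, since the text does not say which base was used.

```python
def percent_difference(a, b):
    """Difference between a and b as a percentage of the smaller value."""
    return abs(a - b) / min(a, b) * 100


# Bible word and letter totals as quoted in the article.
old_testament = 2_728_100 / 592_493   # letters / words
new_testament = 838_380 / 181_253

print(round(old_testament, 3), round(new_testament, 3))
print(percent_difference(old_testament, new_testament))  # under 0.5

# The two ten-thousand-word counts from the "Christmas Carol".
print(percent_difference(4.265, 4.269))                  # under 0.1
```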