Comparing Texts and Identifying Authors

by Eric Johnson

Published in TEXT Technology, 4.1 (Spring, 1994), 7-12.

      Computers can be used in several ways to compare the texts of novels and other works for indications of authorship. Some methods that have been used to identify authors' styles are extremely complex, but the two programs discussed in this article use rather simple principles. I created the programs in my language of choice for textual computing: SPITBOL-386.

Comparing Word Lengths

      It has been argued that a straightforward comparison of word lengths can identify the author of a text. It appears that a careful calculation of the percentage of words of various lengths (one-letter words, two-letter words, and so on) may produce a unique fingerprint for any author. A uniform definition of a word (or token) must be used, of course. The program that I wrote to compare word lengths identifies as a word any string of characters consisting of upper-case letters, lower-case letters, digits, hyphens, and apostrophes, provided that the initial character of the string is a letter.
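      The word definition above can be sketched in a few lines of Python. This is only an illustration, not the SPITBOL-386 program itself: the names WORD_RE, tokenize, and length_counts are hypothetical, and skipping strings such as "42nd" that begin with a digit is an assumption, since the article does not say how they are treated. Lengths of 13 or more are pooled, as in Figure 1.

```python
import re
from collections import Counter

# Assumption: a token starts with a letter and continues with letters,
# digits, hyphens, or apostrophes. The lookbehind skips strings such as
# "42nd" whose first character is a digit; the original program may
# treat them differently.
WORD_RE = re.compile(r"(?<![A-Za-z0-9])[A-Za-z][A-Za-z0-9'-]*")

MAX_BUCKET = 13  # lengths of 13 or more share one bucket, as in Figure 1


def tokenize(text):
    """Return the list of word tokens found in text."""
    return WORD_RE.findall(text)


def length_counts(text):
    """Count tokens by length, pooling lengths of MAX_BUCKET or more."""
    counts = Counter(min(len(tok), MAX_BUCKET) for tok in tokenize(text))
    return {n: counts.get(n, 0) for n in range(1, MAX_BUCKET + 1)}


if __name__ == "__main__":
    sample = "Don't stop: the well-known 42nd street is R2D2's home."
    print(tokenize(sample))
    print(length_counts(sample))
```

      Note that a literal reading of the definition keeps a trailing apostrophe or hyphen as part of the token, so a quoted word such as 'home' would be counted as one character longer than it is.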

      Figure 1 shows the output from my program that compares the number and percentage of words of various lengths in two novels by Nathaniel Hawthorne. The difference in percentage is listed and also graphed. The file identified as # 1 is The Scarlet Letter, and # 2 is The House of the Seven Gables. The percentage for each word length is remarkably similar. The greatest difference is eight tenths of one percent. Seven word lengths are within one tenth of a percent.
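      The arithmetic behind Figure 1 can be sketched as follows, using the counts printed in the figure. The function names are hypothetical, and the sketch assumes that the difference is taken between percentages already rounded to one decimal place; the article does not state this, but the assumption is consistent with the values printed in Figure 1.

```python
# Word counts for lengths 1..12 and 13+, copied from Figure 1.
SCARLET_LETTER = [2593, 14441, 18711, 14153, 9486, 7587, 5392,
                  4242, 2920, 1953, 1057, 575, 595]
SEVEN_GABLES = [3295, 18363, 22790, 16642, 11537, 8716, 6711,
                5848, 3812, 2414, 1344, 791, 787]


def percentages(counts):
    """Percentage of all words at each length, rounded to one decimal."""
    total = sum(counts)
    return [round(100.0 * c / total, 1) for c in counts]


def compare(counts_1, counts_2):
    """Return (percent 1, percent 2, difference, bar graph) per length."""
    rows = []
    for p1, p2 in zip(percentages(counts_1), percentages(counts_2)):
        # Assumption: the difference is taken between rounded percentages.
        diff = round(abs(p1 - p2), 1)
        rows.append((p1, p2, diff, "|" * round(diff * 10)))  # one bar per 0.1
    return rows


if __name__ == "__main__":
    labels = [str(n) for n in range(1, 13)] + ["13+"]
    rows = compare(SCARLET_LETTER, SEVEN_GABLES)
    for label, (p1, p2, diff, bar) in zip(labels, rows):
        print(f"{label:>4} {p1:6.1f} {p2:6.1f} {diff:5.1f}  {bar}")
```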

      Figure 2 shows a comparison of the word lengths of two novels by Hawthorne (identified as # 1) and three novels written by my colleague James A. Janke (identified as # 2). They are not very similar. The differences in the percentages are over one for seven of the thirteen lengths, and the greatest difference is 3.8. Only one length is within one tenth of a percent.

      A comparison of the word lengths of one group of four plays by Shakespeare with another group of five of his plays showed similarities much like those of the two novels by Hawthorne. The word lengths of James Janke's novels differ from one another a little more than those of Hawthorne's. Janke suggested to me that some differences may be due to his working with different publishers and editors; one of his editors told him always to use direct, active verbs, and he did so in his next two novels. Nevertheless, regardless of editors, subject matter, date of composition, or other variations, my (limited) tests indicate that texts written by the same author seem to have few (or no) differences in word lengths over one percent, and rarely (if ever) differ by two percent or more. Therefore, we might conclude that a comparison of word lengths that looks like Figure 1 indicates that the texts were written by the same author, but that a comparison like that in Figure 2 is good evidence that they were written by different authors.
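      The rule of thumb in the preceding paragraph can be expressed as a small check on the difference column of a comparison. This is a hedged sketch: the function names and the max_over_one threshold (how many differences over one percent still count as "few") are hypothetical choices, not values given in the article. The two lists are the Dif columns of Figures 1 and 2.

```python
# Difference columns copied from Figure 1 and Figure 2.
FIGURE_1_DIFFS = [0.1, 0.5, 0.3, 0.8, 0.1, 0.6, 0.1, 0.6, 0.2, 0.0, 0.0, 0.1, 0.1]
FIGURE_2_DIFFS = [0.1, 2.7, 1.6, 3.8, 0.7, 1.1, 1.3, 0.5, 1.7, 1.3, 0.9, 0.5, 0.6]


def summarize(diffs):
    """Summarize a list of percentage differences."""
    return {
        "max": max(diffs),
        "over_one": sum(1 for d in diffs if d > 1.0),
        "two_or_more": sum(1 for d in diffs if d >= 2.0),
    }


def looks_like_same_author(diffs, max_over_one=1):
    """Hypothetical reading of the rule: no difference of two percent or
    more, and at most max_over_one differences over one percent."""
    s = summarize(diffs)
    return s["two_or_more"] == 0 and s["over_one"] <= max_over_one


if __name__ == "__main__":
    print(summarize(FIGURE_1_DIFFS), looks_like_same_author(FIGURE_1_DIFFS))
    print(summarize(FIGURE_2_DIFFS), looks_like_same_author(FIGURE_2_DIFFS))
```

      As the article cautions, such a check is meaningful only for large texts, and the thresholds rest on a limited number of tests.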

      It is difficult to understand how such a simple comparison of word lengths can indicate an author's identity. Perhaps some authors prefer words of Anglo-Saxon origin (which tend to be short) and others prefer words from Latin (which tend to be longer than Anglo-Saxon words). In any case, the comparison of word lengths seems to be sound only for fairly large texts: eighty thousand words or more. The texts compared in Figures 1 and 2 are substantial. The Scarlet Letter has about 84,000 words, and The House of the Seven Gables has about 103,000; the three Janke novels together have about 141,000 words, and the two novels by Hawthorne together have about 187,000.

Frequency of Common Words

      A newspaper publisher gave me files of several short editorials that he had written and files of similar editorials that his father had written; he also gave me two unidentified editorials, and he challenged me to tell him which one he had written and which was written by his father. The sizes of the texts were 300 to 1600 words: far too small to expect that a comparison of word lengths would help identify them. After much trial and error, I succeeded in distinguishing the texts by comparing the frequency of ten very common words.

      Figure 3 shows comparisons in the number and percent of common words found in four files. I knew who wrote file A and who wrote file B, and these two writers had also created file X and file Y, but I did not know which was written by whom.
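      Counting the common words in one file might look like the following sketch. The ten words are those listed in the tables of Figure 3. Everything else is an assumption made for illustration: the function name, the simplified word pattern, the case-insensitive matching, and the choice to express each count as a percentage of all words in the file.

```python
import re
from collections import Counter

# The ten words compared in Figure 3.
COMMON_WORDS = ("a", "but", "in", "no", "our", "the", "us", "we", "which", "with")

# Simplified word pattern (assumption): a letter followed by letters,
# digits, hyphens, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9'-]*")


def common_word_profile(text, words=COMMON_WORDS):
    """Return {word: (count, percent of all tokens)} for the given words."""
    tokens = [tok.lower() for tok in WORD_RE.findall(text)]  # ignore case
    counts = Counter(tokens)
    total = len(tokens)
    return {
        w: (counts[w], 100.0 * counts[w] / total if total else 0.0)
        for w in words
    }


if __name__ == "__main__":
    sample = "We met in the park, but the rain found us."
    for word, (count, percent) in common_word_profile(sample).items():
        print(f"{word:<6}{count:>4}{percent:>8.2f}%")
```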

      The first table in Figure 3 compares the number and percent of ten words found in file A and in file X. They are rather similar. The range of differences in percentage is 0.02 to 0.51. The second table in Figure 3 compares the number and percent of words in file B and in file Y, and they, too, are similar. The range of differences in percentage is 0.01 to 0.87. It appears that the author of A is the author of X, and that the writer of B also wrote Y.

      The third table in Figure 3 shows a cross comparison: file A and file Y. They differ by as much as 1.48 percent, and they are not nearly as similar as A and X. The fourth table compares file B and X; they differ by as much as 1.57 percent, and three words differ by over one percent; they are not close to being as similar as B and Y.
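      The four comparisons can be reproduced from the percentages printed in Figure 3. In this sketch the decision rule (assign each unidentified file to the known file with the smallest maximum difference) is a hypothetical formalization of the comparison described above, and the function names are illustrative. Differences recomputed from the printed, rounded percentages may differ from the printed Difference column by 0.01.

```python
WORDS = ("a", "but", "in", "no", "our", "the", "us", "we", "which", "with")

# Percentages copied from Figure 3, in the order of WORDS.
KNOWN = {
    "A": (3.09, 0.20, 2.62, 0.13, 0.67, 5.25, 0.61, 0.87, 0.27, 0.74),
    "B": (1.56, 0.87, 2.62, 0.12, 0.06, 7.18, 0.06, 0.31, 0.25, 0.12),
}
UNKNOWN = {
    "X": (3.13, 0.00, 3.13, 0.00, 0.94, 5.64, 0.63, 0.94, 0.00, 1.25),
    "Y": (1.62, 0.88, 3.08, 0.15, 0.15, 6.31, 0.15, 0.59, 0.15, 0.15),
}


def max_difference(profile_1, profile_2):
    """Largest absolute difference in percentage over the ten words."""
    return max(abs(p1 - p2) for p1, p2 in zip(profile_1, profile_2))


def attribute(known, unknown):
    """Assign each unknown file to the known file it differs from least."""
    return {
        name: min(known, key=lambda k: max_difference(known[k], profile))
        for name, profile in unknown.items()
    }


if __name__ == "__main__":
    for u_name, u_profile in UNKNOWN.items():
        for k_name, k_profile in KNOWN.items():
            diff = max_difference(k_profile, u_profile)
            print(f"{k_name} vs {u_name}: greatest difference {diff:.2f}")
    print(attribute(KNOWN, UNKNOWN))
```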

      Therefore, I confidently concluded that the author of A had also written X, and that the writer of B produced Y. I was correct.

      I selected the particular set of ten words for comparison because they were commonly used regardless of the subject, but also for other reasons. Some of the words were selected in order to make general comparisons that might be made for any type of writing: for example, it may be a characteristic of an author's style to prefer the definite "the" to the indefinite "a" or to avoid the negative "no" and to prefer "but" or "with." Other words were selected with the expectation that they might demonstrate distinctive features of newspaper editorials: some editors avoid the first-person singular and write using "our," "us," and "we."

      It will always be tricky to base a conclusion on an analysis of samples as small as these, but, at least in this case, it proved accurate. There is no reason similar calculations cannot be made for novel-length works, of course, and with larger samples, the comparisons might be even more dependable.
 
 
 

      There are far more sophisticated (and more statistical) models for determining authors' styles than the two described in this article. Whether the simplicity of the methods used here is an advantage or a weakness can be decided by readers and researchers.


Eric Johnson is a former Editor of TEXT Technology and the author of more than one hundred volumes and articles about computers, writing, and literature. He can be contacted by email at johnsone@jupiter.dsu.edu.
 
 



Word  Number Number  % of  % of  Dif 
Size  in # 1 in # 2   # 1   # 2  in % Graph of Difference ( | is 0.1)

   1    2593   3295   3.1   3.2  0.1  |
   2   14441  18363  17.3  17.8  0.5  |||||
   3   18711  22790  22.4  22.1  0.3  |||
   4   14153  16642  16.9  16.1  0.8  ||||||||
   5    9486  11537  11.3  11.2  0.1  |
   6    7587   8716   9.1   8.5  0.6  ||||||
   7    5392   6711   6.4   6.5  0.1  |
   8    4242   5848   5.1   5.7  0.6  ||||||
   9    2920   3812   3.5   3.7  0.2  ||
  10    1953   2414   2.3   2.3  0.0  
  11    1057   1344   1.3   1.3  0.0  
  12     575    791   0.7   0.8  0.1  |
 13+     595    787   0.7   0.8  0.1  |

Figure 1.  A comparison of lengths of words in two novels by Hawthorne.

Word  Number Number  % of  % of  Dif 
Size  in # 1 in # 2   # 1   # 2  in % Graph of Difference ( | is 0.1)

   1    5888   4342   3.2   3.1  0.1  |
   2   32804  20986  17.6  14.9  2.7  |||||||||||||||||||||||||||
   3   41501  33508  22.2  23.8  1.6  ||||||||||||||||
   4   30795  28580  16.5  20.3  3.8  ||||||||||||||||||||||||||||||||||||||
   5   21023  16982  11.3  12.0  0.7  |||||||
   6   16303  13756   8.7   9.8  1.1  |||||||||||
   7   12103  10940   6.5   7.8  1.3  |||||||||||||
   8   10090   6868   5.4   4.9  0.5  |||||
   9    6732   2651   3.6   1.9  1.7  |||||||||||||||||
  10    4367   1476   2.3   1.0  1.3  |||||||||||||
  11    2401    523   1.3   0.4  0.9  |||||||||
  12    1366    263   0.7   0.2  0.5  |||||
 13+    1382    174   0.7   0.1  0.6  ||||||

Figure 2.  A comparison of word lengths in novels by Hawthorne and Janke.


Word    Number  Number  Percent  Percent  Difference
        in # A  in # X   in # A   in # X  in Percent

a           46      10    3.09%    3.13%         .04
but          3       0     .20%     .00%         .20
in          39      10    2.62%    3.13%         .51
no           2       0     .13%     .00%         .13
our         10       3     .67%     .94%         .27
the         78      18    5.25%    5.64%         .40
us           9       2     .61%     .63%         .02
we          13       3     .87%     .94%         .07
which        4       0     .27%     .00%         .27
with        11       4     .74%    1.25%         .51

Figure 3, Table 1


Word    Number  Number  Percent  Percent  Difference
        in # B  in # Y   in # B   in # Y  in Percent

a           25      11    1.56%    1.62%         .05
but         14       6     .87%     .88%         .01
in          42      21    2.62%    3.08%         .46
no           2       1     .12%     .15%         .02
our          1       1     .06%     .15%         .08
the        115      43    7.18%    6.31%         .87
us           1       1     .06%     .15%         .08
we           5       4     .31%     .59%         .28
which        4       1     .25%     .15%         .10
with         2       1     .12%     .15%         .02

Figure 3, Table 2


Word    Number  Number  Percent  Percent  Difference
        in # A  in # Y   in # A   in # Y  in Percent

a           46      11    3.09%    1.62%        1.48
but          3       6     .20%     .88%         .68
in          39      21    2.62%    3.08%         .46
no           2       1     .13%     .15%         .01
our         10       1     .67%     .15%         .53
the         78      43    5.25%    6.31%        1.07
us           9       1     .61%     .15%         .46
we          13       4     .87%     .59%         .29
which        4       1     .27%     .15%         .12
with        11       1     .74%     .15%         .59

Figure 3, Table 3


Word    Number  Number  Percent  Percent  Difference
        in # B  in # X   in # B   in # X  in Percent

a           25      10    1.56%    3.13%        1.57
but         14       0     .87%     .00%         .87
in          42      10    2.62%    3.13%         .51
no           2       0     .12%     .00%         .12
our          1       3     .06%     .94%         .88
the        115      18    7.18%    5.64%        1.54
us           1       2     .06%     .63%         .56
we           5       3     .31%     .94%         .63
which        4       0     .25%     .00%         .25
with         2       4     .12%    1.25%        1.13

Figure 3, Table 4


