Key words: World Wide Web, SNOBOL4, SPITBOL, computer text analysis
Students in my on-campus literature classes have made good use of a series of computer programs that I have created for text analysis. I offered a course via the World Wide Web that provided fourteen of my programs to students throughout the world. My course taught those students how to use the programs, and as they completed the assignments, students not only learned to use software for literary analysis, but they also often gained new kinds of insights into the study of texts.
As a professor who teaches literature and who enjoys computer programming, I have created about two dozen computer programs for textual research that have been used by students and colleagues at my university. Sometimes I wrote programs because students told me that they could not find software that would help them do a specific kind of text analysis that they wanted to do; other times I created a program and then suggested to my students that they might find it useful. In any case, students using my programs were often able to complete research that would have been impossible without them.
All of the computer programs that I have written for students were coded in SPITBOL-386, which is a fast, modern implementation of the powerful SNOBOL4 programming language (see note 1). Using the SPITBOL-386 compiler, I created stand-alone binary files for each of my programs, and they can be executed easily on any computer that can run DOS programs and has at least two megabytes of RAM (although more RAM is necessary to process large files with some of the programs).
2.0 CHUM 650: Computing for the Humanities
I became persuaded that there was interest in a course that would provide my programs to students throughout the world and would guide them through assignments using the programs. During the summer of 1996, I taught CHUM 650: Computing for the Humanities. This three-semester-hour graduate course was offered via the Internet, and it used the World Wide Web to distribute the programs and to provide information about the programs. The fourteen students who enrolled in the class were located in Germany, Japan, Thailand, Australia, New Zealand, and throughout the United States. The closest student to me was almost a thousand miles away. By entering a name and a password, students accessed a special Web site that I created for the course. The course Web site consisted of about 25 primary HTML files with links to more than 50 additional program or data files. Students read articles that I had written; they downloaded the programs, and they executed the programs on their own hardware using data files that were provided on the course Web site or using data files that they had at hand. Students used email to return descriptions of how the programs worked and what they learned. Although students' descriptions were not sent to other students in the course, the students were encouraged to chat with one another by email about the course (or about the professor or about anything they wanted to discuss). They were, of course, asked not to attempt to complete class assignments for other students.
2.1 Assignment One: Electronic Texts and their Processing
The first assignment for CHUM 650 was a very simple one. It asked students to read two articles that I had published on the topic of electronic texts and the ways that they can be used for literary research (see note 2). Like all reading assignments for the class, the texts of these articles were contained on the Web site for the course. Students could read the articles online, or they could print them. They were required to write about uses for electronic texts other than those mentioned in the articles, and to consider any drawbacks of electronic analysis of texts. A few students were mildly skeptical about the value of computer analysis of literary works, but almost all of them had ideas about working with texts, and some of them looked forward to computer analysis of works by favorite authors such as Henry James, Thomas Hardy, Mark Twain, Charles Darwin, and Arthur Conan Doyle. One student proposed computer analysis to compare original works and their sequels. Other students had specialized texts that they wanted to examine: math instructions written by 80 math teachers, stock market reports, and news stories about the 1996 Olympics. Some students expressed interest in using electronic versions of particular works (such as novels by Dickens), and I was able to suggest how they could obtain at least some of them.
2.2 Assignment Two: Counting Words
Students often want to count the number of words in a text file. There are many programs (including most word processors) that will do that. However, sometimes students need to control the definition of what constitutes a word form: for example, hyphens, apostrophes, and numeric digits may or may not be wanted as constituent parts of word forms. WORDS is a program that counts the number of running words in a text and the number of unique word forms -- based on parameters for recognition of a word that the user can set.
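The effect of changing the definition of a word can be illustrated with a short sketch. It is written in Python rather than SPITBOL, it is not the code of WORDS itself, and the function name `count_words` and its parameters are assumptions chosen only for illustration; the actual control-file format of WORDS is not reproduced here.

```python
import re
from collections import Counter

def count_words(text, keep_hyphens=True, keep_apostrophes=True,
                keep_digits=False, exclude=()):
    """Return (running words, unique word forms, Counter) for one word definition."""
    chars = "a-z"
    if keep_digits:
        chars += "0-9"
    if keep_apostrophes:
        chars += "'"
    if keep_hyphens:
        chars += r"\-"                  # literal hyphen inside the character class
    tokens = re.findall(f"[{chars}]+", text.lower())
    # Drop stray punctuation such as a free-standing "--" or a leading quote mark.
    tokens = [t.strip("-'") for t in tokens]
    excluded = {w.lower() for w in exclude}
    tokens = [t for t in tokens if t and t not in excluded]
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

sample = "The well-known man's boat -- the boat of 1896."
print(count_words(sample)[:2])
print(count_words(sample, keep_hyphens=False, keep_apostrophes=False,
                  keep_digits=True)[:2])
```

The same sentence yields different totals under the two definitions, which is the kind of difference students were asked to notice when they edited the control file.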
Assignment two of CHUM 650 first asked students to read my article about WORDS (see note 3). Then, by pointing and clicking, students saved the executable file WORDS.EXE from the course Web page to their disks -- along with a control file and text files of two short novels by Joseph Conrad. Students were asked to run WORDS several times with each novel, and to notice that the output differed when the definition of a word was changed in the control file. Some students used ten or more differing versions of the control file.
WORDS allows the user to exclude a list of words from the counts. With 125 common words (such as articles, forms of "to be," and so on) excluded, one student mused that half of the dozen most frequently used words in The Lagoon formed a clumsy statement of the plot of a Conrad novel: "man," "white," "water," "out," "boat," "over." Again, excluding 125 common words, a student noticed that the most frequently used word in Conrad's Youth is "like." She wondered if the novel contained a high number of similes, and she confirmed by searching for the context of all occurrences of "like" that similes are indeed very frequently used in the story.
A student who remembered that a character in Conrad's The Lagoon spoke "with an intense whispering," was interested in noting the great number of words in Conrad novels beginning with an "s" sound. She used WORDS to discover a high ratio of such sibilants that sound not only like whispering, but also sound like the sea.
A student who had a collection of math instruction texts written by 80 math teachers used WORDS to study whether the teachers used a similar vocabulary to express their ideas about math pedagogy. Her initial evaluation was that although the teachers had very different backgrounds, they all used a similar, limited number of unique word forms.
Using WORDS with various parameters for recognition of a word helped students understand why the word counts of various programs used later in the course might differ one from another and from the counts produced by a word processor. They had a more sophisticated understanding of the concept of a word than they would have otherwise had.
2.3 Assignment Three: Comparing Texts
Many students of literature seem fascinated by the idea that the author of a text might be identified by computing the counts of mechanical features such as word lengths or the frequencies of common words. Exactly such topics are discussed in my "Comparing Texts and Identifying Authors," which I asked students to read for Assignment three (see note 4). Like most of the other articles that students read for the course, this one not only introduced the topics covered in the assignment, but it was also intended to serve a pedagogical function of prompting students to discover interesting things about texts for themselves.
I instructed students to download and execute my program called MENDEN2 which computes and compares the lengths of words in two texts. Figure 1 and Figure 2 show the output of MENDEN2: a comparison of the lengths of words in two novels. The word lengths shown in Figure 1 are very similar for two novels by Hawthorne: none of the lengths differ by as much as one percent, and, in fact, only one length (four-letter words) is close to one percent; seven lengths differ by one tenth of a percent or less. By comparison, Figure 2 shows the lengths of words in a novel by Hawthorne and one by Emily Bronte. Four lengths differ by more than one percent, and only one length differs by less than three tenths of a percent. Students were allowed to draw their own conclusions, but most saw a comparison like that in Figure 1 as indicating that the two texts were written by the same author, and a comparison like that in Figure 2 as indicating that the two texts were not written by the same author. One student remarked that it was obvious from his use of MENDEN2 that works by Arthur Conan Doyle were not written by Mark Twain.
File 1 is LETTER.ASC (83,705 words). File 2 is HOUSE.ASC (103,050 words).
Word  Number  Number   % of   % of   Dif
Size  in # 1  in # 2    # 1    # 2  in %  Graph of Difference (= is 0.1)
   1    2593    3295    3.1    3.2   0.1  =
   2   14441   18363   17.3   17.8   0.5  =====
   3   18711   22790   22.4   22.1   0.3  ===
   4   14153   16642   16.9   16.1   0.8  ========
   5    9486   11537   11.3   11.2   0.1  =
   6    7587    8716    9.1    8.5   0.6  ======
   7    5392    6711    6.4    6.5   0.1  =
   8    4242    5848    5.1    5.7   0.6  ======
   9    2920    3812    3.5    3.7   0.2  ==
  10    1953    2414    2.3    2.3   0.0
  11    1057    1344    1.3    1.3   0.0
  12     575     791    0.7    0.8   0.1  =
 13+     595     787    0.7    0.8   0.1  =
Figure 1. A comparison of lengths of words in two novels by Hawthorne.
File 1 is HEIGHTS.ASC (117,688 words). File 2 is HOUSE.ASC (103,050 words).
Word  Number  Number   % of   % of   Dif
Size  in # 1  in # 2    # 1    # 2  in %  Graph of Difference (= is 0.1)
   1    5861    3295    5.0    3.2   1.8  ==================
   2   20572   18363   17.5   17.8   0.3  ===
   3   28362   22790   24.1   22.1   2.0  ====================
   4   20446   16642   17.4   16.1   1.3  =============
   5   12574   11537   10.7   11.2   0.5  =====
   6    9642    8716    8.2    8.5   0.3  ===
   7    7660    6711    6.5    6.5   0.0
   8    4946    5848    4.2    5.7   1.5  ===============
   9    3503    3812    3.0    3.7   0.7  =======
  10    2076    2414    1.8    2.3   0.5  =====
  11    1028    1344    0.9    1.3   0.4  ====
  12     541     791    0.5    0.8   0.3  ===
 13+     477     787    0.4    0.8   0.4  ====
Figure 2. A comparison of lengths of words in novels by Bronte and Hawthorne.
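The kind of comparison shown in Figures 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats a word as an unbroken run of letters, pools all lengths of 13 or more as the figures do, and uses hypothetical function names; how MENDEN2 itself recognizes words is not described here.

```python
import re
from collections import Counter

def length_profile(text, cap=13):
    """Percent of running words at each length, pooling lengths >= cap."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(min(len(w), cap) for w in words)
    return {n: 100.0 * counts[n] / len(words) for n in range(1, cap + 1)}

def compare_lengths(text1, text2, cap=13):
    """Rows of (length, % in text 1, % in text 2, absolute difference)."""
    p1, p2 = length_profile(text1, cap), length_profile(text2, cap)
    return [(n, p1[n], p2[n], abs(p1[n] - p2[n])) for n in range(1, cap + 1)]

def format_rows(rows, cap=13):
    """One text line per length, with one '=' per 0.1 percent of difference."""
    lines = []
    for n, a, b, diff in rows:
        label = f"{n}+" if n == cap else str(n)
        lines.append(f"{label:>3} {a:6.1f} {b:6.1f} {diff:5.1f}  {'=' * round(diff * 10)}")
    return lines

for line in format_rows(compare_lengths("a bb ccc dddd", "a a bb bb")):
    print(line)
```

With two full novels as input, short bars in every row correspond to the pattern of Figure 1 and long bars in several rows to the pattern of Figure 2.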
Word length comparisons such as those produced by MENDEN2 do not appear to give sound evidence of authorship if the text size is less than about 80,000 words. (Several students asked why the text needed to be so large, and I could only answer that I did not know why, but from my experimentation, texts of fewer than 80,000 words rarely produced conclusive evidence of authorship.) However, for shorter texts, a program that calculates the percentages of specific words can be used, and IDENT is such a program. The words "the," "and," "of," and "a" are usually the most frequent words in any text, but there can be significant differences among writers, and computing such differences using IDENT can lead to author identification. Based on a suggestion that I made in my article describing IDENT, a student said that the program not only shows differences among authors, but it can also indicate the type of writing: a short novel typically has a high percentage of articles and conjunctions, and a newspaper editorial has a high percentage of pronouns.
Calculating the amount and location of dialogue in a text can often be interesting. I asked students to read my "Computing the Amount of Quotation in Novels" (see note 5). They then downloaded the program DIALOG20. This program computes the overall amounts of quotation -- for example, The Scarlet Letter contains about 18.8 percent quotation, and The House of the Seven Gables contains about 21.1 percent (other novels may have significantly more or less quotation). The program also computes the amount of quotation in each chapter, and some students were interested in tracking the rhythm of chapters containing larger and smaller amounts of talking: The Scarlet Letter frequently has alternating chapters of light and heavy quotation, but The House of the Seven Gables more often has clusters of chapters of heavy and light quotation.
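A minimal Python sketch of the idea follows. It assumes that quotation is marked by paired straight double quotation marks and that chapters are supplied as separate strings; both are assumptions made for illustration, and the function name is hypothetical rather than part of DIALOG20.

```python
import re

# A token is either a double quotation mark or a word (letters, inner apostrophes).
TOKEN = re.compile('"' + r"|[A-Za-z]+(?:'[A-Za-z]+)*")

def quotation_percent(text):
    """Percent of running words that fall between paired double quotation marks."""
    inside = False
    quoted = total = 0
    for token in TOKEN.findall(text):
        if token == '"':
            inside = not inside          # each mark opens or closes a quotation
        else:
            total += 1
            quoted += inside             # True counts as 1
    return 100.0 * quoted / total if total else 0.0

chapters = ['She said "yes yes" twice', "No one spoke at all"]
print([quotation_percent(ch) for ch in chapters])
```

Running the function chapter by chapter gives the sequence of light and heavy quotation that students tracked; an unbalanced quotation mark in the text would throw off everything after it, so real texts need checking first.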
Students were sometimes mildly surprised by the results of computer analysis. For example, DIALOG20 sometimes indicated levels of quotation that were unexpected: novels and parts of novels that were remembered as having little dialogue often had a good deal of quotation, and novels and parts of novels that were thought of as containing lots of talking were sometimes mostly narrative. One student said that she remembered Wuthering Heights as an action novel in which Heathcliff and other characters are rushing about and throwing things, when, in fact, DIALOG20 shows that the novel is mostly about people talking.
Students could also use DIALOG20 to count and locate special classes of words. If a student wanted to record the number and frequency of words referring to nature in a series of sonnets, the texts of the sonnets (which contained no dialogue, no quotation marks) could be edited to place words such as "tree," "leaf," and "stream" within quotation marks. When this edited text was processed by DIALOG20, the output showed the number and percentage of any of the nature-words.
2.4 Assignment 4: Indexing Texts
A colleague who is a technical writing teacher wanted an indexing program that his students could use to produce indexes for the software manuals they had written. He wanted the program to index the words in a text by page number or by line number. I created a program called BITZER to do such indexing. (The program is named after a character in Dickens' Hard Times; he is a light porter of "the steadiest principle," and "all his proceedings were the result of the nicest and coldest calculation." Unlike most commercial software producers, professors can be whimsical in naming programs.) BITZER will produce an index showing the page (or line) location of every word in a text, and it can be set to exclude indexing for specific words (found on a "stop list"). Figure 3 shows a small part of an index generated by BITZER for the novel Jeremiah Bacon.
WORD            PAGE NUMBERS
Aback           21, 22, 124, 158
Abandoned       82
Abandoning      14
Abbot           67
Abdomen         136
Able            4, 22, 30, 61, 84, 98, 102, 108, 130, 140, 161
Aboard          91
. . .
Anita           13, 17, 19, 20, 21, 22, 23, 59, 60, 61, 62, 63, 65, 66, 67, 68,
                69, 72, 76, 77, 86, 88, 93, 94, 95, 96, 97, 98, 120, 121, 122,
                123, 124, 126, 127, 129, 130, 148, 149, 150, 153, 155, 156, 159,
                161, 167, 172, 173
Announce        41
Announced       2, 21, 67, 93, 140
Another         1, 2, 3, 8, 12, 25, 26, 27, 32, 35, 37, 42, 46, 48, 49, 50, 52,
                53, 60, 63, 64, 71, 73, 74, 83, 90, 97, 108, 111, 113, 114, 115,
                118, 119, 120, 124, 125, 130, 132, 134, 142, 143, 152, 153, 155,
                157, 159, 164, 165
Answer          7, 123
Answered        4, 6, 30, 35, 40, 41, 54, 67, 71, 72, 75, 86, 103, 111, 123, 129,
                139, 149, 152, 155, 165
Answering       96
Antagonist      98
Antagonists     27
Antelope        69, 75
Antelope's      83
Anticipate      141
Anticipated     49
Anticipation    85
Anticlimactic   56
Figure 3. A small part of an index created by BITZER.
BITZER requires that a text to be indexed contain unique page codes at the start of each page; the program's recognition of these codes is the basis of its indexing. The "page numbers" that are coded in the text can be section numbers or even chapter titles. An ingenious student used BITZER to index the words of lyrics of songs by the song titles. He simply substituted song titles for page numbers in the texts of the lyrics, and BITZER produced a list of words indexed by the titles of the songs in which they are found.
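Indexing by page code can be sketched as follows. The code syntax used here (a line beginning with "@@" followed by the page label) is purely hypothetical, as are the function and variable names; BITZER's actual page-code format is not given above. Because the label is an arbitrary string, the same sketch indexes by song title as readily as by page number.

```python
import re
from collections import defaultdict

PAGE_CODE = re.compile(r"^@@(.+)$")      # hypothetical page-code syntax
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def build_index(text, stop_words=()):
    """Map each word to the ordered list of page labels on which it occurs."""
    stop = {w.lower() for w in stop_words}
    index = defaultdict(list)
    page = None
    for line in text.splitlines():
        code = PAGE_CODE.match(line.strip())
        if code:
            page = code.group(1).strip()  # label may be a number or a title
            continue
        if page is None:
            continue                      # text before the first code is not indexed
        for word in WORD.findall(line):
            key = word.lower()
            if key in stop:
                continue
            if not index[key] or index[key][-1] != page:
                index[key].append(page)   # record each page once per word
    return dict(sorted(index.items()))

sample = "@@1\nThe boat sank.\n@@2\nThe boat rose; Anita smiled.\n@@Song of the Sea\nAnita sang."
for word, pages in build_index(sample, stop_words=("the",)).items():
    print(f"{word.capitalize():<12}{', '.join(pages)}")
```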
I asked students in CHUM 650 to read an article that I had written about BITZER (see note 6). Then they downloaded the executable file for the program and a test file. After being sure that BITZER produced an index for the test file, students added the codes for page numbers in another text file and produced an index for it. As always, students were asked to describe what they had learned from completing the assignment and to send their descriptions to me by email.
In order to avoid the tedious insertion by hand of the page codes that BITZER required in a text, several students figured out how to make their word processors automatically insert the necessary codes as page headers. Because they worked at a distance with different brands of word processors, it was difficult for students to receive much assistance with such matters from me. For at least one student, making her word processor add page codes that could be "printed" to an electronic file was the most agonizing part of the course.
In addition to the obvious uses of the display of page locations for words, an index can indicate features of the content of a novel such as the presence and absence of a character in various portions of the story. BITZER's index for Jeremiah Bacon shown in Figure 3 documents the use of the name of the hero's beloved, Anita, in the novel. Near the start of the novel, the word "Anita" is not found for 35 pages (between 23 and 59); near the middle of the novel, her name is not found for 21 pages (between 98 and 120); in the last quarter of the novel, "Anita" is not used for 17 pages (between 130 and 148); it is probably significant that Anita's absences grow shorter as the novel progresses. One student noticed from BITZER's index of a Conrad novel that "water" and "darkness" repeatedly appeared in parallel positions, and she speculated that "water" functioned symbolically to represent the concept of darkness. This interesting discovery would probably not have been made (even by a professional critic) without using BITZER or some kind of similar software. A student who is completing a graduate degree in linguistics remarked that linguists who want to study the usage of a specific word or word form would find BITZER useful.
One student who is a programmer said that he liked the idea of using BITZER to index computer source files by line number -- perhaps in order to find specific function calls. Although I did not mention this use to students, BITZER was, in fact, created with exactly such a function in mind. Another student said that attorneys might use BITZER in a similar way with legal documents.
2.5 Assignment 5: The Rhythm of Words and Sentences
Assignment 5 is based on four computer programs that are described in my "Rhythm in the Novel" (see note 7). Figure 4 shows a tiny part of the output of a program called INTER that keeps track of all occurrences of words in a text and then gives the intervals between successive occurrences of the same word. For example, in Lord Jim, the word "cur" is found four times. The first two occurrences are separated by 1063 words; the next two are separated by 2288 words; the last two are separated by 3615 words. On the average, 2322 words are found between one instance of "cur" and the next. It is interesting that this word appears in the novel at rather even distances. The word "cur" is significant in the novel, of course, because Jim erroneously believes that a comment about a yellow dog is directed at him; the mistake leads to his meeting Marlow.
Word      Average (and each interval)
coward    14731 (2, 5, 5692, 260, 8628, 73804)
. . .
cur       2322 (1063, 2288, 3615)
. . .
jump      5732 (11377, 5758, 9865, 1, 1, 78, 4, 50, 831, 1427, 33, 15909, 7400,
          687, 4451, 6394, 958, 17505, 4530, 25151, 7967)
Figure 4. Output from INTER showing intervals between multiple occurrences of three words in Lord Jim.
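The interval computation can be sketched in Python as below. The sketch assumes that an interval is the number of words standing between two successive occurrences and that the average is truncated to a whole number; both assumptions agree with the figures for "cur" and "coward" in Figure 4 but are inferred from the output, not taken from INTER's code. The function name is hypothetical.

```python
import re
from collections import defaultdict

def intervals(text):
    """For each repeated word, return (truncated mean interval, list of intervals)."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    for i, word in enumerate(words):
        positions[word].append(i)
    result = {}
    for word, pos in positions.items():
        if len(pos) > 1:
            # Words lying strictly between one occurrence and the next.
            gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
            result[word] = (sum(gaps) // len(gaps), gaps)
    return result

print(intervals("cur a b cur c cur")["cur"])

# The averages printed in Figure 4 follow from the listed intervals:
print(sum([1063, 2288, 3615]) // 3)
print(sum([2, 5, 5692, 260, 8628, 73804]) // 6)
```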
The word "jump" is another significant word in the novel since Jim, a ship's officer, is on trial for jumping from his "sinking" ship, and leaving the passengers behind. In contrast to "cur," the word "jump" occurs at widely-varying distances. The first two instances of "jump" are divided by 11,377 words. Then, in the passage in which Jim narrates how he did, somehow, jump from the ship, the word is found very near other occurrences: "jump" is used three times in a string of five words, and is found six times in 89 words. In the last part of the novel, each instance of "jump" is a fairly distant echo of the last.
The question of whether Jim is a coward is, obviously, central to Lord Jim. The first mention of "coward" is separated by only two words from the second use of the word, and the third occurrence is five words later. It is probably significant that "coward" is used the last two times separated by 73,804 words: more than half the length of the novel.
A student interested in words indicating friends and those signifying family used INTER to process plays by Shakespeare. She found that "friend" occurs at regular intervals throughout several plays, but "brother" and "sister" occur in irregular clusters.
One student commented, almost in a tone of awe, that results such as those about Lord Jim that were produced by INTER in a matter of minutes could not be achieved by a researcher without computers in a lifetime. In fact, he said, programs such as INTER invite us to do research that would not even be thought of otherwise. It was exactly this kind of insight about the study of literature that I had hoped would result from using my programs in CHUM 650.
Two other programs used in Assignment 5 show the position of words in a text. BYCHAPT indicates which chapters in a novel contain specific words. WORDCELL first divides a text into 66 parts or cells of equal size, and then it displays which words are found in each cell. As students noted, it appears that programs that compute the position of words in a text can show a rhythm of word usage that often reflects the meaning. For example, a student said that in a novel by Arthur Conan Doyle, words that are clues ("boots" or "blood") are found throughout a novel, but words such as "criminal" are found only in the last half.
Another student used WORDCELL to plot the location of "love" and "hate" in Austen's Northanger Abbey. She discovered that "love" was used in most parts of the novel, and that "hate" was used in the first half of the novel in cells adjoining those with "love" -- almost in counterpoint. However, "hate" disappeared altogether in the second half of the novel.
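A rough Python sketch of the cell technique is given below. It divides the running words into equal cells (66 by default, as in WORDCELL) and reports which cells contain each target word; the function names and the one-character-per-cell display are assumptions for illustration, not WORDCELL's actual output format.

```python
import re

def cells_containing(text, targets, n_cells=66):
    """Return {word: sorted list of cell numbers (0-based) in which it occurs}."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    total = len(words)
    found = {t.lower(): set() for t in targets}
    for i, word in enumerate(words):
        if word in found:
            found[word].add(i * n_cells // total)   # cell holding word number i
    return {word: sorted(cells) for word, cells in found.items()}

def strip_chart(cells, n_cells=66):
    """One character per cell: '#' where the word occurs, '.' where it does not."""
    return "".join("#" if c in cells else "." for c in range(n_cells))

sample = "love a hate b c d love e"
hits = cells_containing(sample, ["love", "hate"], n_cells=4)
for word, cells in hits.items():
    print(f"{word:<6}{strip_chart(cells, n_cells=4)}")
```

Printing the strips for two words one above the other makes a counterpoint such as the one between "love" and "hate" visible at a glance.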
A student noted that study of human psychology has demonstrated that we have basic needs for both regularity and for innovation. INTER, BYCHAPT, and WORDCELL all catalog the often complex patterns used by authors to meet and defeat readers' expectations in texts.
The last of the four programs used in Assignment 5 displays a comparison of the lengths of sentences in a text. SL computes the length of each sentence and graphs each, and the program compares the similarity of the length of each sentence with the length of the next. Thus from the output of SL it is easy to see at a glance if all of the sentences in a text are relatively uniform in length or if they increase and decrease in any kind of pattern throughout a work. One student said that any novel that kept her attention invariably had wide variations in sentence lengths. Another student declared that seeing the graph of sentence lengths in a text produced by SL was similar to listening to Debussy's La Mer: readers ride the waves of sentences to a crescendo of words, and then gently relax before the next surging pattern occurs.
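The following Python sketch illustrates the approach. It assumes that a sentence ends at a period, question mark, or exclamation point followed by white space, which will misjudge abbreviations such as "Mr."; the rule SL actually applies is not stated above, and the function names are hypothetical.

```python
import re

def sentence_lengths(text):
    """Number of words in each sentence, in order of appearance."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lengths = [len(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", s)) for s in sentences]
    return [n for n in lengths if n > 0]

def length_changes(lengths):
    """Difference between each sentence's length and the length of the next."""
    return [b - a for a, b in zip(lengths, lengths[1:])]

sample = "I came. I saw the sea! Did it rise and fall again?"
lengths = sentence_lengths(sample)
for n in lengths:
    print(f"{n:3d} {'*' * n}")          # one bar per sentence
print(length_changes(lengths))
```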
After using the programs for Assignment 5, a student commented that computer analysis gave insights that not even the most careful reading could produce. Using WORDCELL and BYCHAPT to trace the rhythm of occurrences of words such as "love" and "life" as well as "hate" and "death" in Wuthering Heights, she concluded that she learned a good deal about the theme of the novel -- not merely about the plot. This is a significant insight since it might be supposed that computer examination of the words of a text would produce only a superficial surface analysis rather than point to the fundamental meaning or theme. Also, the same programs can sometimes reveal an author's methods of sustaining mystery: one reason that the first-time reader of The Scarlet Letter seldom guesses that Dimmesdale is Pearl's father is that words such as "minister" are rarely used in the same parts of the novel as "baby" and "infant."
2.6 Assignment 6: Finding Lists of Words and the Context of Words
FINDLIST is a program that computes the percent of words on multiple lists that are found in multiple text files. As Figure 5 shows, it can be used for analysis of the multiple files of dialogue of characters in Austen novels. Students found it interesting to note how frequently various male and female characters use masculine and feminine pronouns (as well as other pronouns) and what percent of their dialogue contains words for love and color. Various characters use a greater or lower percentage of 125 common English words.
Character       List 1  List 2  List 3  List 4  List 5  List 6  List 7
Text File           He     She   Other      Me    Love   Color  Common
                 Words   Words   Words   Words   Words   Words   Words
ELINOR.DAS       3.21%   2.22%   6.28%   3.83%   0.21%   0.00%  55.82%
MARIANNE.DAS     2.17%   0.77%   3.67%   6.71%   0.21%   0.00%  56.85%
ELIZABET.BEN     3.10%   1.97%   5.87%   4.32%   0.27%   0.01%  56.07%
JANE.BEN         3.37%   1.54%   6.24%   6.20%   0.21%   0.02%  55.35%
FANNY.PRI        2.46%   5.32%   8.55%   2.11%   0.22%   0.01%  55.18%
MARY.CRA         2.13%   1.37%   4.27%   4.73%   0.20%   0.03%  55.97%
EMMA.WH          2.50%   3.44%   6.66%   2.76%   0.23%   0.01%  54.57%
JANE.FF          0.68%   0.54%   1.86%   8.53%   0.00%   0.00%  57.17%
HARRIET.SM       3.47%   1.82%   6.06%   5.98%   0.10%   0.03%  57.59%
CATHERIN.MOR     1.82%   3.02%   5.79%   4.32%   0.13%   0.07%  56.19%
ELEANOR.TIL      1.43%   1.08%   3.05%   5.80%   0.15%   0.00%  57.75%
ANNE.ELL         2.97%   2.58%   6.51%   2.71%   0.16%   0.02%  54.07%
CLAY.MRS         1.68%   0.51%   4.71%   4.04%   0.00%   0.00%  50.34%
FERRARS.ED       1.05%   1.09%   2.71%   8.42%   0.12%   0.04%  56.33%
JOHN.DAS         2.10%   2.36%   5.32%   3.34%   0.06%   0.00%  54.80%
BRANDON.COL      1.79%   3.13%   5.48%   6.46%   0.13%   0.00%  57.24%
DARCY.FW         2.09%   0.87%   3.60%   6.54%   0.17%   0.00%  55.58%
COLLINS.MR       1.54%   1.90%   4.03%   5.91%   0.08%   0.00%  54.62%
EDMUND.BER       1.52%   2.56%   4.86%   4.45%   0.19%   0.02%  56.84%
HENRY.CRA        2.53%   1.99%   4.99%   4.62%   0.13%   0.01%  55.51%
KNIGHTLE.MR      3.28%   2.68%   6.57%   4.37%   0.23%   0.00%  55.40%
FRANK.CH         1.18%   2.18%   4.00%   7.15%   0.09%   0.00%  55.11%
HENRY.TIL        1.10%   1.22%   3.18%   3.59%   0.21%   0.08%  54.68%
JOHN.THP         2.06%   0.38%   3.38%   5.88%   0.03%   0.00%  56.42%
WENTWORT.CPT     1.80%   1.61%   3.97%   6.37%   0.13%   0.00%  57.89%
WALTER.SIR       2.68%   1.64%   5.14%   3.17%   0.00%   0.11%  50.87%
NARRATOR.SS      2.51%   5.38%   9.53%   0.00%   0.11%   0.02%  50.62%
NARRATOR.PP      3.05%   5.27%   9.95%   0.00%   0.09%   0.01%  50.93%
NARRATOR.MP      2.72%   4.54%   8.24%   0.01%   0.13%   0.03%  50.61%
NARRATOR.EM      2.75%   4.33%   8.07%   0.00%   0.10%   0.02%  49.94%
NARRATOR.NA      1.87%   5.28%   8.51%   0.07%   0.09%   0.02%  50.16%
NARRATOR.PER     2.45%   4.10%   8.02%   0.01%   0.06%   0.02%  50.55%
AUSTEN.ALL       2.43%   3.41%   6.90%   2.51%   0.13%   0.02%  53.20%
Figure 5. Output from FINDLIST showing the percentage of words found in Austen characters' text files.
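A table of this kind can be produced with a short Python sketch such as the one below. The texts and lists are passed in as dictionaries, and the names used are hypothetical; FINDLIST reads files instead, and its file formats are not described here.

```python
import re

def list_percentages(texts, word_lists):
    """For each text, the percent of its running words found on each list."""
    members = {name: {w.lower() for w in words} for name, words in word_lists.items()}
    table = {}
    for text_name, text in texts.items():
        words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
        row = {}
        for list_name, wanted in members.items():
            hits = sum(1 for w in words if w in wanted)
            row[list_name] = 100.0 * hits / len(words) if words else 0.0
        table[text_name] = row
    return table

texts = {"A": "he said she saw him", "B": "I love my red hat"}
lists = {"He": ["he", "him", "his"], "Color": ["red", "blue"]}
for name, row in list_percentages(texts, lists).items():
    print(name, "  ".join(f"{pct:5.2f}%" for pct in row.values()))
```

Every text is scanned once for every list, so the work grows with the number of texts multiplied by the number of lists.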
I asked students to read two articles that I have written about FINDLIST (see note 8), and then to run the program using the texts of seven novels which could be downloaded from the course Web site. Because of the large quantity of processing that FINDLIST performs, it is the slowest executing of any computer program used in the course. Students found it frustrating to wait half an hour (or more) for the program to produce its output.
I provided students with some lists of words that could be used with FINDLIST, and students frequently created their own lists. I gave them a list of basic words for colors, and several students found it interesting to compare the percentages of these basic colors with a larger list containing more exotic names of colors such as "aqua," "azure," "bisque," "bronze," "cornsilk," and "pearl." In a similar way, students created both small and large files of words for food and eating, money, nature, and weather. A student with an interest in the novels of Arthur Conan Doyle determined that one of Doyle's novels contained at least double the percentage of words associated with justice found in each of ten other novels he tested. Students seemed to find FINDLIST more interesting and valuable when they used the program with a large number of lists and files of texts.
CONCORD is a program that keeps track of the context of each word as it is used in a text; its output presents each word within the line in which it is found. Figure 6 shows a tiny part of the output of CONCORD for Moby Dick indicating the way "white" is used. In contrast to FINDLIST, which can use massive amounts of input and produces modest output, CONCORD can produce enormous amounts of output from limited input: a full concordance of Lord Jim produced by CONCORD is 11 MB! The numbers at the left of each line of CONCORD's output indicate the position of each word in the original text: the line number and the word position in the original line -- separated by a period.
239.10    nging sign over the door with a WHITE painting upon it, faintly repre
418.6     ply brown and burnt, making his WHITE teeth dazzling by the contrast;
609.4       me. I remembered a story of a WHITE man --a whaleman too--who, fall
616.12     heard of a hot sun's tanning a WHITE man into a purplish yellow one.
1297.11    e yet afloat. And ever, as the WHITE moon shows her affrighted face
1330.12   observe his prayer, and so many WHITE bolts, upon his prison. Then J
1733.10   e so companionable; as though a WHITE man were anything more dignifie
1864.14   starboard hand till we opened a WHITE church to the larboard, and the
3003.2     in the moonlight; and like the WHITE ivory tusks of some huge elepha
3046.9    eedlessly, ye harpooneers; good WHITE cedar plank is raised full thre
3462.8    ity in looking up at him; and a WHITE man standing before him seemed
3462.15   an standing before him seemed a WHITE flag come to beg truce of a for
3556.2     ers of discernment. So that no WHITE sailor seriously contradicted h
3562.5    mness was owing to the barbaric WHITE leg upon which he partly stood.
. . .
8274.6    -head. In the distance, a great WHITE mass lazily rose, and rising hi
8281.10     he breaches! right ahead! The WHITE Whale, the White Whale! Upon t
8282.2     ht ahead! The White Whale, the WHITE Whale! Upon this, the seamen r
8292.3     did he distinctly perceive the WHITE mass, than with a quick intensi
8307.13   m, than to have seen thee, thou WHITE ghost!
. . .
16362.12  c fountain in his head, did the WHITE Whale now reveal his vicinity;
16370.3   his immeasureable bravadoes the WHITE Whale tossed himself salmon-lik
16390.14  p's three masts to his eye; the WHITE Whale churning himself into fur
16399.3   his untraceable evolutions, the WHITE Whale so crossed and recrossed,
16414.12    fast again. That instant, the WHITE Whale made a sudden rush among
16429.6   rpendicularly from the sea, the WHITE Whale dashed his broad forehead
Figure 6. A small part of the output of CONCORD for Moby Dick.
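A keyword-in-context listing of this kind can be sketched in Python as follows. The sketch assumes that context is cut to a fixed number of characters on each side and may run across line breaks, as the sample output suggests; the function name, the `width` parameter, and the way words are counted within a line are assumptions for illustration.

```python
import re

WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def concordance(text, keyword, width=32):
    """One entry per occurrence: 'line.word', left context, KEYWORD, right context."""
    lines = text.splitlines()
    flat = " ".join(lines)               # context may run across line breaks
    key = keyword.lower()
    entries = []
    offset = 0                           # where the current line starts in `flat`
    for line_no, line in enumerate(lines, start=1):
        for word_no, match in enumerate(WORD.finditer(line), start=1):
            if match.group().lower() == key:
                start, end = offset + match.start(), offset + match.end()
                left = flat[max(0, start - width):start]
                right = flat[end:end + width]
                ref = f"{line_no}.{word_no}"
                entries.append(f"{ref:<10}{left:>{width}}{match.group().upper()}{right}")
        offset += len(line) + 1          # +1 for the joining space
    return entries

for entry in concordance("A white sail rose.\nThe White Whale!", "white", width=8):
    print(entry)
```

Because every occurrence of every word generates a full line of output, a complete concordance is many times the size of its source text.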
Several students exclaimed that CONCORD was a terrific time saver in presenting exactly the words of a text that they wanted in their context. Many readers have been interested in the way the word "white" is used in Moby Dick. Figure 6 shows that "white" is used in a variety of ways early in the novel, but later in the novel, it is used to describe only the whale, Moby Dick. In several of Conrad's novels, a student noticed, the word "white" is used only in the phrase "the white man."
A student with an interest in the history of the English language used CONCORD to assist in dating texts. She knew that some word forms that were once used exclusively as nouns have been recently used as verbs. Using CONCORD to examine the context of word forms allowed her to determine whether they functioned as nouns or verbs, and thus she could estimate the date of the text. Also, the program allows classification of the sense of a word: the noun "water" is rarely used in the sense of a beverage in Conrad, a student said.
Another student commented that CONCORD could be used to identify and verify plagiarized documents. By searching for the context of unusual words, differing, similar, and identical passages could be distinguished.
2.7 Assignment 7: Marked-up Texts of Shakespeare
Assignment 7 required students to use files of the works of Shakespeare that contain tags that indicate play titles, acts, scenes, stage entrances and exits for characters, and similar kinds of information. Because the text files contain such tags, my program SHAKWORD could be used by students to search for the exact locations of words in a specific play by Shakespeare or in his complete works. Figure 7 shows output from SHAKWORD that indicates the locations of six words in Shakespeare's works. It shows, for example, that "lawyers" is used four times: the first use is in the Contention of the Two Famous Houses of York and Lancaster (more commonly known as the Second Part of Henry VI) in act 4, scene 2, line 78, and that line is frequently quoted: "The first thing we do let's kill all the lawyers."
Words      References: play title, act:scene,line
actor      Luc, :Arg,25; MND, 3:1,74; R2, 5:2,24; AYL, 3:4,54; Ham, 2:2,393; Ham, 2:2,397; Ham, 3:2,97; Son23, :23,1; MM, 2:2,37; MM, 2:2,41B; AWW, 2:3,25; Ant, 2:5,9; Cor, 5:3,40B
actors     RDY, 2:3,28; Luc, :,608; LLL, 5:2,498; MND, 1:2,9; MND, 1:2,14; MND, 4:2,37; MND, 5:1,116; JC, 2:1,225; Ham, 2:2,394; Ham, 2:2,398; Tmp, 4:1,148
lawyer     1H6, 2:4,0; 1H6, :,108; Ham, 5:1,96; Tim, 2:2,108; LrQ, Sc:4,125; LrF, 1:4,128; Cym, 2:3,72
lawyers    CYL, 4:2,78; CYL, 4:4,35; AYL, 3:2,322; WT, 4:4,206
student    LLL, 3:1,33; Wiv, 3:1,37; TN, 4:2,8
students   LLL, 2:1,64

Figure 7. Output from SHAKWORD showing the locations of six words in Shakespeare's works.
Students said SHAKWORD was uncommonly useful in its ability to pinpoint the precise location of any word in the nearly 900,000 running words of Shakespeare's complete works. One student used SHAKWORD to locate the mention of various characters' names in the speeches of other characters; in a moment the program found the 79 uses of "Hamlet" and the 31 uses of "Horatio." SHAKWORD, again, produced some mild surprises; for example, since Hamlet is a university student, a student in CHUM 650 expected that the word "student" would be found in Hamlet, but, as Figure 7 reveals, the word is not used in that play.
A director of a play on my campus was extremely interested to know if a computer program could identify which characters in a play were never on stage simultaneously -- using such information a director could cast actors in multiple roles. ACTORS is a program that processes a play's text, and it notes each character's entrance or exit. With this information, ACTORS produces a series of lists of the characters that are on stage at each moment, and then it uses those lists to construct tables of characters who do (and do not) appear on stage at any point simultaneously. Using these tables, the program can suggest multiple casting -- often with the minimum number of actors -- see Figure 8.
Possible Doubling for Performance of HAMLET with Minimum Number of Actors
Actor 1    Claudius, Army, Barnardo, OneWithRecorder, Reynaldo, Sailors, SecondClown, Servant
Actor 2    Gertrude, Captain, Francisco
Actor 3    Guard, Ambassadors, Cornelius, FirstClown, Followers, Ghost
Actor 4    Guildenstern, Attendants, Council, Marcellus, Messenger, Priest
Actor 5    Hamlet
Actor 6    Horatio, Others
Actor 7    Lords, Valtemand
Actor 8    Ophelia, Colours
Actor 9    PlayerFive, Drummer, Lucianus, Prologue
Actor 10   PlayerFour, Fortinbras
Actor 11   PlayerKing, Laertes
Actor 12   PlayerQueen, Osric
Actor 13   PlayerThree
Actor 14   Polonius
Actor 15   Rosencrantz

Figure 8. Output from ACTORS.
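The processing that ACTORS performs can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the SPITBOL code of ACTORS: the entrance and exit events are supplied as a ready-made list (a real program would extract them from the markup tags), the names conflicts and suggest_doubling are invented for the illustration, and the greedy assignment shown here is a simple stand-in that does not guarantee the minimum number of actors.

```python
def conflicts(events):
    """Build a table of characters who share the stage at some point.

    events is a list of (action, character) pairs in text order, where
    action is "enter" or "exit" (a hypothetical, pre-parsed form of the
    entrance and exit tags).
    """
    on_stage = set()
    clash = {}
    for action, who in events:
        clash.setdefault(who, set())
        if action == "enter":
            for other in on_stage:      # everyone already on stage clashes
                clash[who].add(other)
                clash[other].add(who)
            on_stage.add(who)
        else:
            on_stage.discard(who)
    return clash

def suggest_doubling(clash):
    """Greedily give each character to the first actor with no clashing role."""
    casts = []   # casts[i] lists the roles played by actor i + 1
    # Place the characters with the most clashes first.
    for who in sorted(clash, key=lambda c: (-len(clash[c]), c)):
        for roles in casts:
            if not any(role in clash[who] for role in roles):
                roles.append(who)
                break
        else:
            casts.append([who])
    return casts

if __name__ == "__main__":
    # A toy sequence of events, not the real stage directions of Hamlet.
    toy = [("enter", "Barnardo"), ("enter", "Francisco"),
           ("exit", "Francisco"), ("enter", "Horatio"),
           ("exit", "Barnardo"), ("exit", "Horatio"),
           ("enter", "Claudius"), ("enter", "Gertrude"),
           ("exit", "Claudius"), ("exit", "Gertrude")]
    for number, roles in enumerate(suggest_doubling(conflicts(toy)), start=1):
        print(f"Actor {number}  {', '.join(roles)}")
```

Coding a "vanish" direction as an exit or as a continued presence on stage changes only the event list, which is why alternative markups of the same play can yield different casting tables.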
I asked students in CHUM 650 to read my "Project Report: ACTORS: Computing Dramatic Characters that Are on Stage Simultaneously" (see note 9). I stressed that they should pay particular attention to the sections of the article that describe how ACTORS processes the tags in the text (for characters' entrances and exits) in order to calculate which characters are on stage simultaneously and which are not.
Students downloaded a cluster of four executable files that together make up ACTORS, and they downloaded a file containing Hamlet. They ran ACTORS using the text of Hamlet, and they examined the output. They were asked to closely inspect the markup tags in the text file of Hamlet to understand exactly how that text was processed by ACTORS. It is often necessary to edit a file in order to produce accurate results. For example, as Figure 8 shows, the names of some characters in Hamlet were edited so as to form one word (OneWithRecorder, FirstClown, PlayerThree, and so on). After students had studied the markup of Hamlet, they downloaded a file containing Macbeth, and they edited that file as necessary in order to process it accurately with ACTORS.
In their editing of Macbeth, students noticed that they were forced to make performance interpretations: for example, "Witches" in a stage direction was sometimes sufficient to indicate the entrance or exit of the three witches, but in most places "Witches" had to be coded as "FirstWitch," "SecondWitch," and "ThirdWitch" since they had individual speeches and different business on stage.
Some students edited several versions of Macbeth in different ways and compared the differences in output from ACTORS. For example, a student noted that the stage directions indicated twice that the witches "vanish" rather than "exit." Two versions of the play can, thus, be coded: one in which the witches become invisible to the characters on the stage, but the actors playing the witches do not exit at once, and a second version in which the actors do indeed exit when the direction says that they vanish. If the actors portraying the witches remain on stage, then they cannot, of course, be doubled with any of the other characters who are in the scene, and more actors are required for the production.
One student commented that the output from ACTORS was valuable not only in determining possible doubling of actors, but in providing information about whether such doubling is practical. For example, the program identifies the exact lines of entrances and exits of each character; that information can be used to calculate whether there is sufficient time for an actor's costume change between an exit as one character and the entrance as another character.
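The student's calculation can be sketched as follows. The Python below is an illustration only: the function name quick_changes, the data layout, and the line numbers in the example are hypothetical, and the gap is measured in lines of text rather than in minutes of performance time.

```python
def quick_changes(appearances, roles, min_gap):
    """List role changes that leave fewer than min_gap lines to change costume.

    appearances maps a character to a list of (enter_line, exit_line) pairs;
    roles is the list of characters assigned to one actor.
    """
    spans = sorted(
        (enter, exit_, who)
        for who in roles
        for enter, exit_ in appearances[who]
    )
    tight = []
    # Compare each appearance with the next one made by the same actor.
    for (_, exit_line, first), (enter_line, _, second) in zip(spans, spans[1:]):
        gap = enter_line - exit_line
        if first != second and gap < min_gap:
            tight.append((first, second, gap))
    return tight

if __name__ == "__main__":
    # Invented line numbers, for illustration only.
    appearances = {
        "Barnardo": [(1, 170)],
        "Claudius": [(176, 300)],
        "Reynaldo": [(900, 975)],
    }
    print(quick_changes(appearances, ["Barnardo", "Claudius", "Reynaldo"], 50))
```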
2.8 Assignment 8: Jane Austen and SGML
For their last assignment in CHUM 650, students read Web versions of two articles that I had published: a review of a new electronic edition of Jane Austen's novels encoded with Standard Generalized Markup Language (SGML) (see note 10) and an article about Austen's novels -- the research for which was made possible by using the new electronic edition (see note 11). The articles provided students with a description of the SGML markup in the Austen texts and information about how the texts can be processed with computers.
Students downloaded four programs that I had written to process the Austen SGML texts: JAFORMAT, JAWORDS, JATALK, and JADIALOG. I invited students to use the programs in a variety of ways with the Austen texts.
JAFORMAT converts the SGML-encoded Austen texts by removing the tags and reformatting the text as necessary to produce clean ASCII text. It seems rather perverse to convert a carefully tagged text into a plain vanilla ASCII text, but sometimes a text without tags is what is needed. In the first place, such text can be loaded directly into a word-processed paper about Austen. Second, and more importantly in this course, only a plain ASCII text of Austen novels can be analyzed by computer programs used earlier in the course -- programs such as WORDS, DIALOG, CONCORD, SL, WORDCELL, and FINDLIST.
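The basic operation of removing tags can be shown in a brief Python sketch. It is not the code of JAFORMAT, which was written in SPITBOL-386 and also reformats the text. The sketch assumes that every tag is a simple span from "<" to ">", and the tags in the sample string are invented for the example; the real markup of the Austen edition may differ and may include entity references that this sketch does not handle.

```python
import re

TAG = re.compile(r"<[^>]*>")   # assumes tags never contain a literal ">"

def strip_tags(sgml):
    """Remove tags, collapse runs of spaces, and drop lines left empty."""
    text = TAG.sub("", sgml)
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    # Hypothetical markup, for illustration only.
    sample = '<p><q who="EMA">Oh!</q>  said   Emma.</p>\n<pb n="5">\n'
    print(strip_tags(sample))
```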
Because this Austen electronic edition codes page and line numbers corresponding to the standard printed edition (edited by R. W. Chapman), it was rather straightforward for me to create a computer program to index the location of words in the standard edition. JAWORDS produces an index of the page and line location of selected words in the SGML texts. Figure 9 shows the output of JAWORDS when the program was used to search for the locations of four words: it shows that "drama" is not found in the six novels, and "actor" is found three times -- all three on page 165 of Mansfield Park -- in lines 22, 26, and 33.
Select index for the file AUSTEN.TXT.
Words     References (TITLE, page number.line number)
acting    SS, 68.27; SS, 80.22; SS, 101.23; SS, 118.6; SS, 320.30; SS, 344.31; SS, 378.5; PP, 22.26; PP, 137.13; PP, 148.25; MP, 26.2; MP, 50.29; MP, 121.16; MP, 121.32; MP, 123.15; MP, 124.19; MP, 124.19; MP, 126.34; MP, 127.6; MP, 127.12; MP, 128.5; MP, 129.16; MP, 133.12; MP, 145.5; MP, 153.32; MP, 154.35; MP, 156.17; MP, 156.25; MP, 156.31; MP, 160.30; MP, 164.19; MP, 167.25; MP, 181.6; MP, 181.7; MP, 181.8; MP, 182.37; MP, 184.7; MP, 184.25; MP, 185.38; MP, 186.13; MP, 187.3; MP, 190.9; MP, 286.11; MP, 337.19; MP, 337.21; MP, 338.37; MP, 358.25; MP, 395.27; MP, 395.32; MP, 404.32; EM, 88.22; EM, 145.3; EM, 197.19; EM, 374.29; EM, 408.11; EM, 412.27; EM, 419.6; EM, 467.16; NA, 132.35; NA, 189.1; NA, 201.2; NA, 237.19; PE, 12.23; PE, 213.33; PE, 214.30; PE, 241.2; PE, 251.36
actor     MP, 165.22; MP, 165.26; MP, 165.33
comedy    MP, 122.38; MP, 123.26; MP, 124.29; MP, 130.19; MP, 135.29; MP, 136.19; MP, 136.20; NA, 92.27
drama     not found

Figure 9. Output from JAWORDS showing the locations of four words.
A student used JAWORDS to calculate the frequency and location of specific words in Austen's novels. Working with "love" and "pleasure," she found that "pleasure" occurs before "love," and that both words are used regularly after the first occurrence. When 125 common words are excluded, the most common word in Emma is "Mr." -- although women are mentioned nearly twice as often as men in the novel.
JATALK processes the codes in the SGML Austen text that indicate the name of each speaker (including the narrator). The program counts the number of words of direct and indirect dialogue of each speaker, and thus it shows exactly how much each character talks. Figure 10 shows the output of JATALK for Austen's Emma. As is usual, the narrator speaks the most; as is not usual, one character, Emma, speaks almost as much as her narrator. Emma has almost four times as much to say as the next most frequent talker, Mr. Knightley. Poor Mr. Cole is not allowed a single word of dialogue. The program indicates the code for each speaker (such as EM0 for the narrator), which enables students to search for each character's dialogue in the SGML text if they wish to do so.
Character                          Number of words of dialog
Narrator (EM0)                     53529
Emma (EMA)                         42800
Mr. Knightley (EMB)                11357
Frank Churchill (EME)              9189
Miss Bates (EMD)                   7993
Mrs. Elton (EMI)                   6606
Harriet Smith (EMM)                6105
Mr. Woodhouse (EMP)                5066
Miss Taylor/Mrs. W. (EMO)          5027
Mr. Weston (EMN)                   3628
Mr. Elton (EMH)                    2287
Jane Fairfax (EMJ)                 2204
John Knightley (EMK)               1845
Isabella Knightley (EML)           1284
Mrs. Cole (EMG)                    448
Minor Men (EMW)                    309
Imaginary/unidentifiable (EMX)     140
Literary & Real people (EMZ)       130
Inseparable Groups (EMY)           109
Minor Women (EMV)                  108
Mrs. Bates (EMC)                   88
Robert Martin (EMR)                55
Ford (EMQ)                         14
Mr. Cole (EMF)                     0
Total                              160321

Figure 10. Output of JATALK showing the number of words of dialogue for characters in Emma.
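A count of the kind shown in Figure 10 can be approximated with the Python sketch below. It is not the code of JATALK. The speaker codes (EM0, EMA, EMB) come from Figure 10, but the tag syntax used to mark each speech, the function name words_per_speaker, and the sample text are all assumptions made for this illustration; the actual SGML markup of the Austen edition is not shown in this article.

```python
import re
from collections import Counter

# Hypothetical markup: each stretch of speech or narration is wrapped in
# <q who="CODE"> ... </q>. The real edition's tags may look different.
SPEECH = re.compile(r'<q who="(\w+)">(.*?)</q>', re.DOTALL)
WORD = re.compile(r"[A-Za-z']+")

def words_per_speaker(sgml):
    """Count the words attributed to each speaker code."""
    totals = Counter()
    for m in SPEECH.finditer(sgml):
        totals[m.group(1)] += len(WORD.findall(m.group(2)))
    return totals

if __name__ == "__main__":
    sample = ('<q who="EM0">Emma smiled.</q> '
              '<q who="EMA">Oh! I do not know.</q> '
              '<q who="EMB">Indeed.</q> '
              '<q who="EMA">Yes.</q>')
    for code, count in words_per_speaker(sample).most_common():
        print(code, count)
```

Writing each matched speech to a file named for its speaker code, instead of counting its words, would give the kind of per-speaker files that JADIALOG produces.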
JADIALOG processes the speaker codes in a manner similar to JATALK, but rather than simply counting the words of dialogue of each character, it separates and writes the dialogue into individual files for each speaker, thus making it available for analysis with other programs. Figure 5 shows the output of FINDLIST when it has been used to process 32 files of dialogue that were produced by JADIALOG.
Two students noticed that the female characters in Austen's six novels have about twice as much dialogue as the male characters. Although the male and female characters use similar kinds of speech, there are some differences; for example, women mention "love" three times as often as men do, and women use the exclamation "Oh!" about ten times more frequently than men use it.
Examining the patterns of dialogue in Pride and Prejudice, a student noted differences between the two main characters. Darcy's most frequently used word is "I"; for Elizabeth, "I" is the fourth most used. Both characters use the word "you" about equally often. In her greater use of "not," Elizabeth is slightly more negative than Darcy. Darcy says "Elizabeth" only 5 times, but Elizabeth says "Darcy" 63 times.
Students said that they were able to do much more with the Austen texts than with any others. They used JAFORMAT to produce ASCII texts of the novels, and then processed them with many of the programs introduced in assignments 2 through 6. They seemed to take a particular delight in using JADIALOG and JATALK to determine which of their favorite Austen characters talked most: Elizabeth Bennet or her mother; Emma or Mr. Knightley; Darcy or Bingley. Since JAWORDS produces a select index keyed to the standard Chapman edition of the Austen novels, the program was, understandably, most used by those students who had access to the Chapman editions. One student who is a university teacher said that she will use the program in the classroom to find passages as they are being discussed by her students.
Working at very different rates, students in CHUM 650 used my programs, completed the assignments, and sent me accounts of the results. A few students had problems making some programs run on systems with limited amounts of RAM, and I rewrote the code of five programs so that they could allocate and use memory more efficiently. Usually when students reviewed the program output, they formed imaginative, sophisticated conclusions about the texts almost immediately. Occasionally, however, a student would be overwhelmed by the quantity of output from a program; two students said that they had no idea what to make of 400 pages of output from INTER showing the interval among multiple occurrences of words -- or 5000 pages of output from CONCORD. After cautioning the students not to print all of such output, I suggested that they use a word processor (or utility program) to search for words that might be of particular interest. They did so, and they discovered patterns of word use that would have been impossible to find without the output from INTER and CONCORD.
Many of the students in CHUM 650 said that they had not only gained insights into the specific works used in the course but had also learned to think in different ways about texts. There were significant patterns and rhythms of words and sentences that were impossible to notice by simply reading texts, but these could be discovered with computer programs such as those used in the class: the programs could reveal a range of surface details, matters of symbolism and theme, and, sometimes, how an author's craft achieves particular effects. Students had learned to ask questions about texts that would not have occurred to them prior to the course. They said they would continue to use the programs long after they had completed the assignments for the course (which I invited them to do), and they expected their future research and teaching to profit from the programs in many ways.
1 For an introduction to the computer languages SNOBOL4 and SPITBOL, see Hockey (1985), Johnson (1995), and Johnson (1994).
2 Students were asked to read slightly revised versions of my "Electronic Shakespeare: Making Texts Compute," Computer-Assisted Research Forum, 1.3 (1993), 1-3; and my "Electronic Texts We Want and Need," TEXT Technology, 4.2 (1994), 90-92.
3 "Counting Words and Computing Word Frequency: Project Report: WORDS," TEXT Technology, 5.1 (1995), 8-17.
4 The article on the Web site is a slightly revised version of what was published in TEXT Technology, 4.1 (1994), 7-12.
5 The article was published in TEXT Technology, 3.6 (November, 1993), 3-5.
6 "Creating an Index: Project Report: BITZER," TEXT Technology, 5.2 (1995), 91-100.
7 The article on the Web site is a revised version of what was published in TEXT Technology, 3.5 (September, 1993), 3-6.
8 "Computing the Kinds of Words used in Novels," TEXT Technology, 5.4 (1995), 276-282 and "The Kinds of Words used in the Novels of Jane Austen, Charles Dickens, and James Janke," TEXT Technology, 6.2 (1996), 91-96.
9 The article was published in Computers and the Humanities, 28.6 (1994-95), 393-400.
10 "Oxford Electronic Text Library Edition of the Complete Works of Jane Austen," Computers and the Humanities, 28.4-5 (1994-95), 317-321.
11 "How Jane Austen's Characters Talk," TEXT Technology, 4.4 (1994), 263-267.
Hockey, Susan. SNOBOL Programming for the Humanities. New York: Oxford University Press, 1985.
Johnson, Eric. Computer Programming for the Humanities in SNOBOL4. Madison, SD: Dakota State University Press, 1995.
---. "SPITBOL-386: The Language of Choice for Non-Numeric Computing." TEXT Technology, 4.3 (1994), 177-185.