Description of the programs

TACT (Text Analysis Computing Tools), a system of 15 programs for MS-DOS, is designed for text retrieval and analysis of literary works. Typically, researchers use TACT to retrieve occurrences of a word, word pattern, or word combination. Output takes the form of a concordance, a list, or a table. The programs can also perform simple kinds of analysis, such as producing sorted frequencies of letters, words, or phrases, computing type-token statistics, or ranking the collocates of a word by their strength of association.

TACT is intended for individual literary texts, or small to mid-size groups of such texts, such as Chaucer's poetry, Francis Bacon's Essays, Shakespeare's plays, Jane Austen's Pride and Prejudice, John Irving's The Cider House Rules, similar works in French, German, Italian, Spanish, Latin, and other modern European languages or languages using a roman alphabet, and classical Greek. Using TACT for large corpora can raise problems best handled by software like ICAME Lexa or Open Text Systems Pat.

Processing a text with TACT normally begins when the researcher tags, or marks up, an ASCII copy of the text. The researcher uses a text editor to insert the tags, usually within diamond-bracket delimiters. In most instances this mark-up supports later analysis: it helps to refine word selections and to provide reference citations for retrieved passages. In a play, for example, acts, scenes, and speeches are obvious things to mark; in a novel, chapters; in a narrative poem, books and stanzas; in a lexicon, subdivisions of the entry; and so forth. The researcher may also want to mark proper names (of people and places), episodes, date, location, audience, narrative mode, theme, and so on. For instance, words may be retrieved by speaker if the text includes a tag before each passage that is spoken by someone different.
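The idea of retrieving words by speaker can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag form `<name value>` (for example `<speaker BERNARDO>`), the function name `words_by_tag`, and the sample text are hypothetical and are not taken from the TACT documentation, whose actual tag syntax may differ.

```python
import re

# Hypothetical tag syntax for illustration only: <name value>
TAG = re.compile(r"<(\w+)\s+([^>]*)>")
WORD = re.compile(r"[A-Za-z']+")

def words_by_tag(text, tag_name):
    """Map each value of tag_name to the words that follow it in the text."""
    result = {}
    current = None   # value of the most recent tag_name tag, if any
    pos = 0          # end of the previous tag
    for match in TAG.finditer(text):
        segment = text[pos:match.start()]
        if current is not None:
            words = [w.lower() for w in WORD.findall(segment)]
            result.setdefault(current, []).extend(words)
        if match.group(1) == tag_name:
            current = match.group(2).strip()
        pos = match.end()
    # Words after the last tag belong to the last value seen.
    if current is not None:
        words = [w.lower() for w in WORD.findall(text[pos:])]
        result.setdefault(current, []).extend(words)
    return result

sample = ("<act 1> <speaker BERNARDO> Who's there? "
          "<speaker FRANCISCO> Nay, answer me.")
by_speaker = words_by_tag(sample, "speaker")
print(by_speaker)
```

The same function groups words under any other tag name, such as `act`, which is how one set of tags can serve both for selecting words and for citing where they occur.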

The researcher may also employ four programs, Preproc, Makedct, Tagtext, and Satdct, to add tags to each word of the ASCII text. These include the word's lemma (the dictionary form of the word), part-of-speech, or conceptual label.

The TACT system is multilingual. To display languages other than English, it supports the extended ASCII character set of the IBM PC, and with separate font-editing tools, its capabilities can be extended to other modern European languages, such as French, German, and Greek. (Hebrew, Arabic, Cyrillic, and languages such as Chinese are beyond its present design.) It supports multilingual analysis as well by allowing for proper alphabetization, convenient keyboard entry, and printing on devices that require special "escape codes" to produce non-ASCII characters -- even if these sequences are different from those that would be used to enter the character from the keyboard, or display it on screen.

Once the text is marked up, Makebase converts it into a database for efficient retrieval. Makebase invites the researcher to define, interactively, the alphabet and its collation sequence, special characters, and the reference tags used for markup. Large texts can be divided, with a word processor or text editor, into smaller files for sequential processing by a batch file created with Buildbat. This batch file uses both Makebase and a second program, Mergebas, to create a large textual database out of smaller ones.
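What a user-defined collation sequence does can be shown with a small Python sketch. It is an illustration only: the alphabet string, the helper `make_sort_key`, and the word list are assumptions made for the example, and the code does not reproduce how Makebase stores or applies its settings.

```python
def make_sort_key(alphabet):
    """Build a sort key that ranks characters by their position in alphabet."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters outside the alphabet sort last

    def key(word):
        return [rank.get(ch, unknown) for ch in word.lower()]

    return key

# Hypothetical collation sequence in which "é" sorts directly after "e".
alphabet = "abcdeéfghijklmnopqrstuvwxyz"
words = ["fable", "été", "eau", "zone"]

collated = sorted(words, key=make_sort_key(alphabet))
by_code_point = sorted(words)  # default ordering places "été" after "zone"
print(collated)
print(by_code_point)
```

Sorting by raw character codes pushes accented words to the end of the list, which is the kind of mis-alphabetization a defined collation sequence is meant to avoid.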

After Makebase creates the textual database (or .TDB file) out of the ASCII text file, a researcher may employ six programs to retrieve information from, or to analyse, that text.

Most researchers begin with Usebase, which allows one to select a word, a group of words, or a word-pattern, and then to display it in five ways: a keyword-in-context (KWIC) concordance, a variable-context concordance, the whole text, an occurrence-distribution graph, and a table of collocates. The collocate table shows all words that co-occur with the queried word, words, or word-pattern and orders those collocates by strength of association. Displays in Usebase are linked so that, for example, the researcher can go directly from a position in a distribution graph to the text it represents. Any display may also be modified in various ways.
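For readers unfamiliar with the KWIC format, the following Python sketch shows the general idea: each occurrence of the keyword is listed with a fixed number of words on either side. The function name `kwic`, the `width` parameter, and the sample sentence are assumptions for illustration; this is not how Usebase itself is implemented.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

tokens = "to be or not to be that is the question".split()
concordance = kwic(tokens, "be", width=2)

# Right-align the left context so the keywords line up in a column.
for left, key, right in concordance:
    print(f"{left:>12} | {key} | {right}")
```

A variable-context concordance differs mainly in that the amount of context is set by a unit such as a line or sentence rather than by a fixed word count.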

Working with the database, Usebase can present a complete list of words from which a subset for retrieval may be selected, one word at a time. Through what is called "regular expression" capability, the researcher may also write a query according to a pattern of characters, including "wildcards" (for example, all words beginning with the letter "a" and ending with "ed" or "ing"). Queries may also contain refinements called "selectors" that specify (a) proximity or collocation (two or more words found together within a user-specified span of words), (b) similarity (in spelling), (c) frequency of occurrence, and (d) a condition related to whether or not words or patterns have one or more tag attributes in the markup. All queries may be kept in one or more ASCII files external to the program, from which queries may be selected; thus, for example, the researcher can construct a lexicon of words and expressions in such a file.
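The wildcard example above (words beginning with "a" and ending with "ed" or "ing") can be expressed as a regular expression. The sketch below uses Python's `re` module rather than TACT's own query language, whose syntax is not shown here; the word list and the function name `select_words` are assumptions.

```python
import re

# Begins with "a", then any word characters, then ends in "ed" or "ing".
PATTERN = re.compile(r"a\w*(?:ed|ing)")

def select_words(word_list, pattern=PATTERN):
    """Return the words that match the pattern in full."""
    return [w for w in word_list if pattern.fullmatch(w.lower())]

word_list = ["asked", "asking", "apple", "ached", "bring", "added", "a"]
selected = select_words(word_list)
print(selected)
```

Matching against the complete word list, rather than scanning the running text, mirrors the description above of selecting a subset from the full list of words in the database.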

Once a set of words has been selected by whatever means, it can be saved within Usebase as a "group". Groups can in turn be combined to form other groups. Thus, for example, all words and expressions the researcher regards as concerning the semantic field "earth" can be saved as the group "earthgrp" and then be combined with the groups "airgrp", "firegrp" and "watergrp" to produce the group "4elementsgrp". Group names can be included within queries as easily as words, so that, for example, a researcher could ask to see all passages in which "airgrp" words occur within two lines of "firegrp" words. Groups are really collections of "locations" in a text; and so groups are specific to one text. However, they may be saved in a group index (.GIX) file for reuse. Unlike groups, queries stored in an external file are independent of any one textual database.
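Because groups are collections of locations, combining them and testing proximity reduces to set operations. The Python sketch below is a rough model under stated assumptions: locations are represented as (line number, word position) pairs, and the names `locate` and `within_lines`, the word choices, and the sample lines are invented for the example rather than drawn from Usebase.

```python
def locate(lines, words):
    """Return the set of (line number, word position) pairs for the words."""
    wanted = {w.lower() for w in words}
    return {
        (line_no, pos)
        for line_no, line in enumerate(lines, start=1)
        for pos, token in enumerate(line.lower().split())
        if token in wanted
    }

def within_lines(group_a, group_b, span=2):
    """Locations in group_a that lie within span lines of any in group_b."""
    return sorted(
        a for a in group_a
        if any(abs(a[0] - b[0]) <= span for b in group_b)
    )

lines = [
    "the wind rose over the hill",
    "and ash fell from the sky",
    "a long silence followed",
    "nothing moved",
    "then flame took the barn",
]
airgrp = locate(lines, ["wind", "sky"])
firegrp = locate(lines, ["ash", "flame"])
elements = airgrp | firegrp              # combining groups is a set union
near_fire = within_lines(airgrp, firegrp, span=2)
print(near_fire)
```

The sketch also shows why a group is tied to one text: the stored pairs only make sense against the lines they were computed from, whereas the word lists passed to `locate` could be reused on any text.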

When creating a group from a query, the user can examine all retrieved citations in the text and choose which to include or exclude. This ability to choose by context can eliminate homographs and produce lemmatized groups.

Four other TACT programs, like Usebase, work from the textual database. (1) Collgen lists all repeating fixed phrases and all node-collocate pairs (two words that occur more than once near one another in the text). (2) TACTstat produces type-token statistics for word-length and word-frequency. (3) TACTfreq produces alphabetical, reverse alphabetical, and descending-frequency word-lists. (4) Anagrams discovers anagrams of words in which the user has some interest.
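The kinds of output described for TACTfreq and TACTstat can be approximated in a short Python sketch. It makes two assumptions that the text does not settle: "reverse alphabetical" is read here as sorting on the reversed spelling of each word (so that words with the same ending fall together), and the type-token figure is the plain ratio of distinct words to running words. The function names and sample sentence are invented for the example.

```python
from collections import Counter

def word_lists(tokens):
    """Return alphabetical, reverse-alphabetical, and frequency-ranked lists."""
    counts = Counter(t.lower() for t in tokens)
    alphabetical = sorted(counts)
    # Assumption: sort on reversed spelling, grouping words by their endings.
    reverse_alphabetical = sorted(counts, key=lambda w: w[::-1])
    # Descending frequency, ties broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, reverse_alphabetical, by_frequency

def type_token_ratio(tokens):
    """Number of distinct words (types) divided by running words (tokens)."""
    types = {t.lower() for t in tokens}
    return len(types) / len(tokens)

tokens = "the cat saw the dog and the dog ran".split()
alphabetical, reverse_alphabetical, by_frequency = word_lists(tokens)
print(alphabetical)
print(reverse_alphabetical)
print(by_frequency)
print(type_token_ratio(tokens))
```

If "reverse alphabetical" instead means descending alphabetical order, `sorted(counts, reverse=True)` would be the corresponding line.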

Fcompare compares ASCII lines (optionally consisting of one or more tab-delimited fields) from two files and outputs three files that list which lines are shared and which not. Preproc can be used to generate the word-lists intended for input to Fcompare. TACTsort sorts the lines of an ASCII file (optionally using a tab-delimited field as the key). All three of the above programs use the TACT sort order specified by an existing .MKS file, or by the DEFAULT.MKS file.
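The three-way comparison that Fcompare performs can be modelled with set operations. The Python sketch below is an approximation under stated assumptions: it works on in-memory lists instead of files, ignores tab-delimited fields, and uses Python's default ordering in place of a sort order from an .MKS file. The function name `compare_lines` and the word lists are hypothetical.

```python
def compare_lines(lines_a, lines_b):
    """Split the lines of two lists into shared, only-in-A, and only-in-B."""
    set_a, set_b = set(lines_a), set(lines_b)
    shared = sorted(set_a & set_b)
    only_a = sorted(set_a - set_b)
    only_b = sorted(set_b - set_a)
    return shared, only_a, only_b

# Hypothetical word-lists from two texts, one word per line.
list_a = ["love", "death", "crown", "sword"]
list_b = ["love", "garden", "sword", "letter"]

shared, only_a, only_b = compare_lines(list_a, list_b)
print(shared)
print(only_a)
print(only_b)
```

Applied to word-lists from two texts, the three results show the vocabulary the texts have in common and the vocabulary peculiar to each.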

Most TACT-system programs will output lists, tables, graphs, and other displays as ASCII files that can in turn be imported into database management systems, spreadsheet programs, and word processors for post-processing of many kinds.


Last Updated: 09/17/96
URL: http://www.indiana.edu/~letrs/index.html
Comments: Library Electronic Text Resource Service / LETRS@indiana.edu.
Indiana University

Page design and images are
© 1996-2000, Universitat de València Press
© Dr. Vicent Fores
València  15th September 2000