by
Jeff Collins and Dave Kaufer
Carnegie Mellon University
DocuScope is © Copyright 1998-2001 by Carnegie Mellon University.
See Section 3.6 for a list of the current development team.
DocuScope is text tagging and visualization software, developed at Carnegie Mellon University's English Department to support writing courses taught to information design and professional writing students.
What does DocuScope do? DocuScope is designed to let people visualize and understand representational effects in texts. It is not an attempt at artificial intelligence and the program does not ``understand'' or analyze anything it ``reads.'' DocuScope simply goes through text documents and finds patterns of words that the humans using the program have told it are relevant to representation (more on this in Section 3). DocuScope then displays its findings in ways that help users see the representational patterns in the texts. The software will also output quantifications of these patterns to comma-delimited text files for analysis in statistical packages.
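To make the comma-delimited output concrete, here is a minimal sketch of how per-text tallies could be written for a statistical package. It is not DocuScope's actual export code: the function name `counts_to_csv`, the row and column layout, and the use of raw counts (rather than normalized scores) are all assumptions made for illustration. The column labels borrow the cluster names Thought and Description introduced in Section 2.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def counts_to_csv(tagged: Dict[str, List[str]], categories: List[str]) -> str:
    """Return comma-delimited text: one row per document, one column per category.

    `tagged` maps a document name to the list of category labels assigned to
    its matched strings (a hypothetical intermediate format).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text"] + categories)  # header row
    for name, labels in tagged.items():
        counts = Counter(labels)
        # Categories that never occur in a document are reported as 0.
        writer.writerow([name] + [counts[c] for c in categories])
    return buffer.getvalue()


if __name__ == "__main__":
    sample = {"essay1.txt": ["Thought", "Description", "Thought"]}
    print(counts_to_csv(sample, ["Thought", "Description"]))
```

A file in this shape can be read directly by most statistical packages, with each text as one observation and each category as one variable.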
Where can DocuScope be downloaded? At this time it's not available to be downloaded. We have taught several courses and given workshops with DocuScope at academic conferences, but it is not a commercial product. If you are an academic researcher interested in evaluating DocuScope, contact David Kaufer (kaufer+@andrew.cmu.edu). DocuScope is owned by the authors and Carnegie Mellon University. You will need to sign a release from Carnegie Mellon's technology transfer office before we can send you a copy of the software.
What is representational composition? Representational composition is the language theory responsible for the patterns DocuScope visualizes. We are at work on a forthcoming book [1] that explains the language theory fully. Even without the book, we've observed students learning this theory by using DocuScope and seeing how the computer tool ``reads'' and visualizes their texts. We discuss the theory in some detail in Section 2.
To help you see the big picture, here's a quick introduction to the point of this language theory. For a moment pretend the nasty chemical DDT has not yet been banned from widespread use in the U.S. and you want to write a paper arguing DDT should be banned. There are a myriad of options for designing such an argument, but here are three possible high-level plans:
Plan 1 Write an explicit argument that will provide your reader with the main reasons DDT ought to be banned.
Plan 2 Write a paper outlining your opinions of DDT and what you think about its continued use.
Plan 3 Write a description of DDT's effects (or potential future effects) on the environment.
Writing papers based on these three plans would lead toward quite different reading experiences for your reader: the first would feel like a research paper, leading the reader to consider your evidence critically; the second would feel like an opinion piece, leading the reader to consider how much credence to give you and your opinion; the third would feel like a narrative, leading the reader to visualize the world you describe in your paper and compare it to the one the reader lives in and knows. These striking differences in reading experience are created linguistically: the language used to carry out your chosen plan will determine which experience your reader will have.
All three of these plans could potentially lead to effective papers arguing against DDT. The idea behind representational composition and DocuScope is that it is useful to understand the way language choices create these experiences for readers. In other words, representational composition attempts to draw attention to the linguistic choices authors make that lead readers to have particular experiences with texts. We believe writers design these experiences for their readers as they write. DocuScope is a program that lets writers see these differences explicitly and get better control of their representational language choices.
In this section, we provide an overview of the rhetorical theory behind DocuScope and give you some information about how the theory operates.
DocuScope's method of characterizing language choice is drawn from rhetoricians' long-standing interest in the patterns of language that provide interactive experiences for an audience [2,3].
Perhaps an example of how subtle language changes affect rhetorical purpose would make this clear. The original (1) and revised (2) sentences below were written by a rental property owner to the manager of the property:
There is nothing particularly wrong with sentence (1) to cause its revision. The difference between the sentences is a difference of rhetorical effect, achieved through subtle language change. The first sentence implies renters have been lined up and are ready to move in during the summer. The second sentence makes the renters more tentative and uncertain, suggesting it is the reader (the property manager) who would know about potential renters, not the writer of the letter.
The second sentence conveys a slightly more tentative way for the manager to approach the current tenants and suggests the manager should take action to look for renters who might want to move in. Of course there are other ways a writer could achieve this impression. It's quite likely a writer would combine this sentence with others in the letter to help clarify the relationship and to make the direction to the manager more careful (or perhaps the relationship is already clear from the context of the letter). Nonetheless, by making the specific language changes to the first sentence, the author revised it for a rhetorical purpose, one that was probably designed to help make the overall point of the letter.
Experienced writers have control of these subtle rhetorical shifts, manipulating their language choices here and there in their attempts to achieve different impressions for their readers. This is what is meant by the cumulative rhetorical effect of language choice: no one choice necessarily makes a strong impression, but cumulatively the choices lead to a particular impression for a reader. Historically, writers have attained this control through years of reading and writing practice, both in the school setting and beyond it. To try to help our writing students understand these cumulative language effects more explicitly, we created DocuScope.

Figure 1: The two sentences in DocuScope. The different colors indicate different clusters of representational cues.
Using DocuScope to examine the two example sentences (Figure 1), we see the additions to the revised sentence have provided two dimensions of interactive experience for the reader (we describe all the dimensions in Section 2.3). First, the phrase ``any possible'' adds an indication of the writer's inner thinking. This means ``the renters'' have moved from being explicit, tangible people in (1) to being figures of imagination or potentiality in (2): they've gone from being in the world to being imagined in the writer's mind. They're not renters, but possible renters. The writer makes this clear for the reader by adding ``any possible'' to the sentence.
Second, the added word ``might'' combines with the pronoun ``who'' to create a phrase that overtly cues the reader that interactivity is required: it adds a conditional nature to the revised sentence. This means the renters have changed from being known by the writer in (1) to being unknown, potential renters who might want to move in, but that some action is required on the part of the reader to know this. In other words, by formulating the phrase ``who might'' in sentence (2), the writer implies an understanding something like ``You need to find a renter'' or ``When you find a renter.'' The writer achieves this impression of interactivity with the reader by composing the interactive phrase ``who might'' in the revised sentence.
The task this software undertakes is to distinguish the patterns suggested by this representational theory for the user. It accomplishes this by first isolating and categorizing the strings responsible for the effects and then presenting this information in a useful way.
When we talk about the ``effects'' of representational composition, we are talking about extremely different micro-experiences for the reader. Consider the different effects of the word ``smeared'' in the following sentences:
The catalog behind DocuScope uses language strings to disambiguate such rhetorical functions, isolating recognized strings and assigning each to a classification within a hierarchical classification scheme. (This large-scale effort is beyond the scope of this paper and will be detailed elsewhere [1].)
The hierarchies of effects bear some rough and eclectic overlaps with the work of rhetorician I.A. Richards and linguist Michael Halliday [4,5]. From Richards, we took the important, but little noticed, idea that much of English separates into concepts of inner thought and outer sense: an important determiner for a rhetorical action is whether it is mental action spawning from a mind or outward action, projecting a descriptive reality. Based on Richards' work, we constructed two major divisions (which we call ``clusters'') at the top of our hierarchy, Thought and Description. These two clusters provide the major source of language at the word and phrase level for disclosing thoughts and for implementing the reader effects of immersion in spatial and temporal situations. Thought and description, together, create the combined inner and outer depictions required for building acquaintance effects with readers.
From Halliday's systemic-functional grammar, we took the idea that a fundamental function of language involves what Halliday calls the ``interpersonal metafunction,'' the ability of language to structure interactive relationships with implied or addressed audiences. From Halliday's work, we created a third major cluster in our hierarchy, called ``Linear Cues.'' These cues are the main workhorse for information effects on readers. They are central to decision-making insofar as getting the reader to decide depends upon getting and sustaining the reader's attention through a chain of reasoning, an immersive example, or an exhortation. Combined with (spatial) description, linear cues are central as micro-level actions to navigate readers through the spatial task of reading the text.
Each of the three clusters of effects is further divided into two subcategories, as indicated in Figure 2. DocuScope uses these subcategories to choose the color for each of the matched strings. This use of color is designed to benefit users by drawing attention to specific consistencies and variabilities within and among texts (more on this text visualization is found in Section 3.3).
We have not yet formally evaluated how well these clusters and subcategories correspond to a broad range of readers and their reading experiences with texts. Our anecdotal experience using the software in the writing classroom indicates these clusters and subcategories are adequate for our aim: drawing student attention to the particular representational effects of their writing. The hierarchy (and concomitant color scheme) helps users of the software compare their use of representational effects in their writing to the writings of others.
Each description below is followed by phrases containing a few highlighted strings that are assigned to the representational effect.
This section outlines the effort that went into software development and describes some key aspects of the software's functionality.
While we can draw on a long history of rhetorical interest in patterns of language effect for the above understandings, neither the ancient nor the contemporary writers on rhetoric have had a theory of rhetorical patterning that a modern computer scientist would call specific enough to implement. Thus, the first six years of this development project were spent refining a detailed theory of rhetorical patterning. During this development, we were more concerned with the quality of the theory than with whether a computer program could ever result.
Drawing broadly from the rhetorical tradition, but mainly from rhetorical practice, we started to build a theory of rhetorical patterns within texts. From 1992 to 1995, Kaufer and Butler studied tens of thousands of phrases from the Lincoln-Douglas debates, coding and classifying them for the different functions they served. The book that resulted from their work [3] was a systematic exploration of the large variety of patterned experiences the debates make available to a listener or a reader. The texts of the debates were threads of heterogeneous patterns creating a wide variety of interactive experiences. Before the 1996 book project, they thought that what was known about genres (in this case the genre of public debate) could help them understand the patterns available in the debates. After that project, they were convinced of the need for a more empirical, bottom-up theory of English patterns and variation to understand not only the debates, but also textual genres in general.
The challenge became to put various textual genres under the microscope for their distinctive rhetorical patterning. Kaufer and Butler pursued this challenge, beginning in 1996. They designed a writing curriculum in comparative genres and, from the student papers, began indexing characteristic patterns associated with different genres. They also began to keep external archives of texts, cross-checking the patterns they were finding from their students against published writers. After several years of iterations of the course, the patterns for the different genres began to crystallize and stabilize. Kaufer and Butler published a second book [2] to describe these patterns at a high level of generality.
While Kaufer and Butler had a large inventory of English genres and patterns and a theory of their contribution to rhetorical effect, they encountered two lingering problems that technology seemed well equipped to solve.
First, although they had kept massive journals from their work on English genres, most of the notes on rhetorical patterns were less precise than literal catalogs of English phrases. English phrasal patterns are extremely difficult to store or manage on paper or even in standard databases. Different English patterns can share multiple words. What we needed was an information system to manage and differentiate rhetorical knowledge from naturally occurring English patterns. This required a system with the flexibility to categorize English patterns differently, even when they overlapped substantially. We knew of no existing information system equipped to do this.
Second, teaching the course in comparative genres was becoming both frustrating and exhausting. Each semester, Kaufer had to summon for students his tacit knowledge of English genres and phrases. This seemed to be re-inventing the wheel, semester after semester, without building on what we already knew. In discussion with Suguru Ishizaki and Kerry Ishizaki, the idea was hatched to design a computer program that would parse and visualize this knowledge automatically. Students could have some of a reader's tacit knowledge built into the system and could explore their own texts for projected rhetorical effects. They could learn at their own pace as part of discovery learning. They would not have to try to read the teacher's mind about tens and hundreds of thousands of rhetorical patterns the teacher could not articulate.
Creating a program that could visualize rhetorical patterns in a simple interface posed a new and daunting challenge. This required enormous talent and skill in text parsing and visual interface design. (See Section 3.3 for a description.)
As our system began working well in our classrooms, our categories of reader-centered rhetorical function at the micro-level began to stabilize and become more concrete. We were able to enlist student help to fill in strings that constituted micro-rhetorical action. We created a public process by which students could make ``bug'' reports about omissions (a string should be added), ambiguities (a classified string has multiple classifications and can be lengthened to disambiguate it), and misclassifications (a string classified in category A should be in category B). Through in-class discussion and final logs, students made a game of trying to predict and improve upon the string-matcher's performance at capturing rhetorical actions in texts they wrote and read within the class.
To get a sense of this process, consider the motion word ``jump.'' This word seems to change rhetorical roles abruptly when embedded in the longer string ``jump to the conclusion'' or ``jump at the chance.'' When embedded in these longer forms, does the motion sense still survive or is it superseded by rhetorical actions of other types? Shall we, for example, classify the string ``jump to the conclusion'' as negative information based on one datum a student can find or think up (e.g., ``he jumped to the conclusion too quickly'') or is it better regarded as a syntax of anticipation based on another conceivable string (e.g., ``jumped to the conclusion that...'')? Should ``jump at'' retain its spatial action categorization based on a datum like ``jumped at Fred'' or should it be abstracted out of space into a general marker of intensity (``I'd jump at the chance'')? By inquiry of this sort, we were able to assign ``jump'' to multiple functions, depending upon the disambiguated strings in which it was embedded.
DocuScope could be accurately described as flexible string-matching software: it can match strings of any length. We designed the string-matcher to support human coding schemes. Thus, the string-matcher does no automatic analysis of its own, but rather implements across a text, or corpus of texts, whatever coding of categories a human has supplied. In our case, the coding categories have been based on representational composition, as discussed in Section 2.
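The idea of a matcher that applies a human-supplied catalog can be sketched as follows. This is an illustration only, not DocuScope's implementation: the greedy longest-match strategy, the whitespace tokenization, the function name `tag`, and the category labels in `CATALOG` are all assumptions. The example strings reuse the ``jump'' cases discussed in Section 3.1.

```python
from typing import Dict, List, Tuple

# Hypothetical catalog: token sequences mapped to human-assigned categories.
# The labels are illustrative and are not DocuScope's actual variable names.
CATALOG: Dict[Tuple[str, ...], str] = {
    ("jump",): "Motion",
    ("jump", "at", "the", "chance"): "Intensity",
    ("jump", "to", "the", "conclusion"): "NegativeInformation",
}


def tag(text: str, catalog: Dict[Tuple[str, ...], str]) -> List[Tuple[str, str]]:
    """Tag contiguous strings, preferring the longest catalog entry at each position.

    Assumes a non-empty catalog and simple whitespace tokenization
    (punctuation is not stripped in this sketch).
    """
    tokens = text.lower().split()
    longest = max(len(key) for key in catalog)
    tags: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer strings win over shorter ones.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in catalog:
                tags.append((" ".join(candidate), catalog[candidate]))
                i += n  # skip past the matched string
                break
        else:
            i += 1  # no catalog entry starts here; leave the token untagged
    return tags


if __name__ == "__main__":
    print(tag("I would jump at the chance", CATALOG))
    print(tag("They jump over puddles", CATALOG))
```

Because the longest entry is tried first, the bare word ``jump'' is tagged as motion only when no longer cataloged string begins at the same position. The matcher itself makes no judgment about categories; changing the catalog changes the tagging.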
The idea of a string-matcher that can match strings of different lengths is very important, because a series of words may not disambiguate themselves with respect to rhetorical action until deep into the series. Consider, for example, the two strings:
The software we have built tags and recognizes contiguous strings that appear to do rhetorical work. In relation to theorists of written communication, our software captures an account of language that overlaps with much of the language phenomena associated with writer-reader involvement, tying writers to readers through textual experience that can stand in for face-to-face experience.
For several years now, the catalogs of strings have continued to grow incrementally, as a cooperative process with students in our writing classrooms. While the rhetorical tagging program we designed has done the job we needed it to do, it has limitations. For example, it tags only contiguous language parts. It thus knows nothing about logical dependencies or about shifting speaking roles within textual dialog.
Differences in rhetorical function at the string-level are often hard for the human eye to detect. Therefore, we attached our string-matcher to a visual interface that allowed human coders to see its performance on actual texts.
Visualization systems attempt to tap people's natural strengths in rapid visual pattern recognition to support performance in activities that involve information processing [6,7,8]. As has been demonstrated in many areas, graphics are useful for enhancing human performance because complex cognitive tasks (e.g., comparing numbers from a table) may be replaced by perceptual inferences (e.g., perceiving the relative height of two adjacent figures).
One of the key considerations in designing visualization systems, as Lohse [9] suggests, is to ``facilitate and direct attention to visual feature(s) that communicate the requisite information when it is needed during the task'' (p. 385, emphasis added). While visualization technologies have been shown to support thinking and decision-making in many non-technical, multivariate areas such as management and legal studies [10,11,12,13], they have not yet been widely employed in language study, outside of concept mapping and annotation or collaboration support [14,15,16,17]. There are likely several reasons for this, including the necessity of reading texts closely and serially [18,19]; the difficulty aggregating textual understandings visually [20,21,22,23]; and the difficulty in providing closely-coordinated and appropriate task representations and graphical support for reading/writing/revising performance [24,25].
The extraordinary modularity of texts requires people to manage their attention in support of their current writing and reading task(s) [26,27]. While empirical observation of these processes is difficult, researchers have outlined several of the constructive reading and writing processes [28,29,30]. Underlying these processes are four interdependent cognitive tasks that we hypothesize could be supported with visualization technology: (1) fixation and (2) spatial/temporal attention that lead toward (3) explicit and implicit ``seeing'' and (4) conscious perception [9,31].
To support our writing students and to begin to test our theories, we've associated the different representational effects with specific colors (see Sections 2.2 and 2.3) and, using colors, weighting, and positioning, have developed several visualization schemes within DocuScope to support some of the cognitive processes that underlie reading and writing. We have not formally evaluated the effectiveness of these schemes. In this paper we provide four screen captures to illustrate some of the visualization schemes that our students have found most beneficial (Figures 3-6 in Appendix A).
Over the four years of its classroom use, DocuScope has evolved toward becoming a mature text tagging and visualization system. It has two integrated applications. One application is for single text visualization (STV) and another for multiple-text visualization (MTV). In writing classrooms, students tend to start in MTV to locate significant rhetorical trends across the classroom of texts. They then drill down into single texts to understand how individual student writers (and texts) produce these trends. It is beyond the scope of this document to describe the program's use in the writing classroom. A short tutorial is being tested and will (eventually) be disseminated to students, educators and researchers learning to use the software.
We have made DocuScope available free of charge to academic researchers interested in evaluating it. Since it is not a commercial product, we cannot guarantee support but we help as we can. DocuScope is a Java application, running under Windows, Mac OS X, and Unix/Linux. If you are interested, contact David Kaufer (Kaufer@andrew.cmu.edu). DocuScope is owned by the authors and Carnegie Mellon University. You will need to sign a release from Carnegie Mellon's technology transfer office before we can send you a copy of the software.
We have given several workshops with DocuScope for writing teachers in the Pittsburgh public and independent schools. The response has been enthusiastic. DocuScope is very intuitive to use. It takes about an hour of face-to-face training to get teachers and students up to speed with it. We have not had the resources to provide true training and multi-institutional testing needed for large-scale phased dissemination.

Figure 3: The STV dimensional view shows a single text that has been tagged by the software. The text's score on each of the 18 representational variables is listed at the left of the screen. Each of the 18 matrices at the right of the screen is a view of the same portion of the text.

Figure 4: The STV text view shows a single text that has been tagged by the software. The text's score on each of the 18 representational variables is listed at the left of the screen. The text itself, with tagged strings underlined, appears at the right of the screen. The tags may be toggled off or on by clicking on the variable name in the variable list at left.

Figure 5: The MTV range view summarizes the scores of all the tagged texts in a collection of texts. The 18 representational variables are listed at the left of the screen with a modified boxplot* indicating the range of scores for the texts in the collection. The texts in the collection are listed at the right of the screen, sorted by the currently-highlighted variable (``ThinkNegative'' in this shot, indicated by the green color). Different variables are highlighted by clicking on them. The texts may be sorted into groups and colored for easy separation, as indicated in the lower right-hand corner of the screen shot.
* The endpoints of the boxplots represent the high and low values, including any outliers. Therefore, we call these modified boxplots. The ``fences'' are marked by the yellow indications above and below the boxplots. Outlying text scores are also indicated with asterisks beside the text name at the right of the screen.
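The quantities behind such a modified boxplot can be sketched as below. This is an illustration under stated assumptions, not the program's actual computation: the function name `modified_boxplot` is hypothetical, the fences are assumed to follow the common convention of 1.5 times the interquartile range beyond the quartiles, and the quartiles use the inclusive method of Python's `statistics.quantiles`. The note above does not say which conventions DocuScope uses.

```python
from statistics import quantiles
from typing import Dict


def modified_boxplot(scores: Dict[str, float]) -> dict:
    """Summarize one variable's scores across a collection of texts.

    Endpoints are the true low and high values (outliers included), and the
    fences are reported separately so that outlying texts can be flagged.
    Requires at least two texts.
    """
    values = sorted(scores.values())
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # assumed 1.5 * IQR convention
    upper_fence = q3 + 1.5 * iqr
    outliers = [name for name, value in scores.items()
                if value < lower_fence or value > upper_fence]
    return {
        "low": values[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "high": values[-1],
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,  # texts that would be marked with an asterisk
    }


if __name__ == "__main__":
    sample = {"a.txt": 1.0, "b.txt": 2.0, "c.txt": 3.0, "d.txt": 4.0, "e.txt": 20.0}
    print(modified_boxplot(sample))
```

With the sample scores, the high endpoint is the outlying value itself, while the upper fence falls well below it. This mirrors the distinction the note draws between endpoints and fences.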

Figure 6: The MTV map view shows all the texts in a collection of texts. The 18 representational variables are listed along the X and Y axes. Clicking on a variable on an axis plots the texts' scores on that variable along that axis. The texts may be sorted into groups and colored for easy separation, as indicated in the lower right-hand corner of the screen shot. The texts are listed at the right of the screen. Clicking on an individual text highlights that text within the plot and generates the bar graph shown just to the left of the text names. The plot may be scaled, as necessary.