ELECTRONIC INFORMATION MANAGEMENT: A Perspective

Daniel Berleant, Hal Berghel, and Karthikeyan Viswanathan
University of Arkansas, Fayetteville

ABSTRACT

This paper describes a new paradigm for information processing in general, and for research on text analysis software in particular. This paradigm not only helps motivate our own specific research plan, but also provides a way of thinking about information processing in tomorrow's information-rich society. This paper does not emphasize original research, although the current status of our project is reviewed briefly. It does emphasize issues of more general interest, in the hope of inviting comment.

ELECTRONIC INFORMATION MANAGEMENT

Recent years have seen increasing realization that the usefulness of information in the information age rests not only on the existence of large quantities of varied information, but also on effective access to that information. Advances in effective access are increasingly important: the sheer volume of information in existence is rising to such a degree that the usefulness of individual information items is in jeopardy, as those items risk being lost in a swelling sea of data.

We present next a brief overview of the science and technology of electronic information management (EIM). We draw an analogy with tangible products: just as society produces, stores, and distributes manufactured goods, society produces, stores, and distributes information. And just as society often customizes goods, perhaps altering a suit or improving an auto sound system after distribution, society has an even greater need to customize information.

ELECTRONIC INFORMATION PRODUCTION

Modern society has produced vast quantities of information. The rate of production is accelerating, and will in all likelihood continue to accelerate into the indefinite future. Computer technology has played the pivotal role in this explosion of information, and its role will continue to expand as it reaches into new areas and technologies, particularly those related to communications. Increasingly, our faxes have electronic rather than hardcopy sources. Our telexes are giving way to more efficient e-mail communication. Even our cable systems are flooding us with digitized data as we move into interactive systems for home shopping, home study, A/V remote conferencing, and so forth. We are digitizing what we choose to hear, read, and see, and with virtual reality technology we are even digitizing what we feel. We are truly immersed in digital information, and the modern high-speed computer is the enabling tool in the production of that information.

However, there is a steep downside to all of these technological advances. The information flow has increased to the point where we require additional computer technology to cope with the volume. This very success of information production focuses our attention on the critical need to manage our information resources effectively.

ELECTRONIC INFORMATION STORAGE

Information that has been produced must be stored if it is to have lasting value. Because of the increasing volume of information and the importance of real-time access, information storage in the future will become almost exclusively digital. All but the most ephemeral information will be accessed electronically.

Much research is taking place in the electronic information storage field. Recent decades have witnessed several revolutions in storage technology. Physical media (paper tape, punched cards) gave way to magnetic media.
Within the "magnetic revolution," sequential media (tapes, data cells) evolved into random access technologies (disks, drums). We are now undergoing the "post-magnetic" revolution: as we slowly complete the process begun by the passing of magnetic core memory, we are moving away from magnetic media to CDs, floptical disks, and portable silicon (flash cards). Each stage of this revolution is characterized by increases in storage density, more efficient data access, and increased convenience of use. These three elements contribute to the expanding role of storage technology.

Society's databases are being augmented by knowledge bases, rule bases, and case bases. Larger and larger amounts of information may be input, output, organized, classified, sorted, duplicated, merged, purged, and finally stored. However, the forward-looking technologist should keep in mind that these data depositories are closer to the beginning of the information application process than to its end. Appropriately stored information is really only a raw resource in the modern information management scheme of things. The true value of this resource is unrealized until information is delivered to the right person in a timely fashion.

ELECTRONIC INFORMATION DISTRIBUTION

The timely, accurate delivery of information to information consumers is a pressing problem in today's information age. Information delivery research aims to deliver the right information efficiently to the right end user.

The delivery of information electronically may take many forms. The oldest class of digital technology, still in wide use, involves the distribution of information on physical storage media (disks, optical disks, tapes). In this case, the information is transported like other goods, but the package is electronic. Although the technology is 20th century, the concept of physically moving information is nothing new. A more sophisticated approach uses an electronic conveyor to transport the digital information. Computer communications networks fall into this category: the electronic message is delivered to end users with no physical medium involved. In principle, this is the ultimate in convenience: computer telephony.

As stored information becomes more plentiful, facilities for distributing it are becoming more varied. Software like ftp, email, telnet, archie, and xmosaic, net news readers, telephone-related services from answering machines to voice mail, and on-line database services such as the library catalogs and literature search services found at virtually all research libraries are examples of society's recent successes in improving electronic information distribution. Research relevant to electronic information distribution includes networking protocols and standards, high-bandwidth information transmission, access control and security, and other aspects of distributed computing networks.

Regrettably, this increased sophistication is creating a whole new set of problems for information specialists. The convenience is the culprit: our ability to control and regulate information delivery pales in comparison to the raw volume deliverable. For example, the Internet is currently expanding at a rate of 20% a month (Schaller and Carlson, 1993). Information overload is a real and present danger for computer and information technology professionals.

ADVANCED INFORMATION DELIVERY SYSTEMS

Traditional information delivery makes information available.
Advanced information delivery helps deliver the right information to the user at the right time. Convenient, accurate mechanisms are needed to separate what is really needed from what is not. Research activity exists which may eventually provide the needed control over the technology. As a whole, this research attempts to develop better techniques by means of which users attract the information they need and repel the information they don't. Larger bodies of data, such as a stream of arriving email or bulletin board postings, must be sifted if information relevant to a particular person or purpose is to be delivered effectively.

Routing and distribution lists are one such technique. In the simplest case, a static distribution list of email addresses is created by a list manager so that all communications addressed to the list are sent to each address in the list. Increasingly, distribution lists are managed dynamically: individuals send subscribe, unsubscribe, and similar commands to the list, which are obeyed automatically by the list management software. The listserv systems exemplify this technology, which is successful in delivering information to groups of information consumers. That is, it is effective at attracting information, but largely ineffective at repelling associated unwanted information, which is often plentiful, at least in the typical case of bulletin boards (Schaller and Carlson, 1993).

While routing selectively delivers on the basis of membership in a list, categorization systems selectively deliver on the basis of keywords. The content of the information is classified at the preparation stage, and these category tags are then used to deliver information to users whose interest profiles agree. Such systems abound and are an essential characteristic of modern electronic publishing projects. Categorization systems, it should be understood, deliver on the basis of keywords, not actual content.

Another advanced information delivery method is document clustering, which automatically finds groups of especially similar articles. This aids users who are doing literature searches. One approach to document clustering is described by Bhatia and Deogun (1993). Document clustering exemplifies passive delivery, in which information is automatically structured in ways that aid users who invoke the system.

In contrast to such passive electronic information delivery is active electronic information delivery, in which a system takes the initiative in locating information desired by the user and brings this information to the user's attention. Active delivery is exemplified by the information filtering field; recent work on information filtering was the subject of a group of articles in Communications of the ACM (1992). An information filtering system might develop a profile of an individual's interests, extracted from the internet bulletin board articles that individual has read. Then, large numbers of incoming articles can be checked against that profile automatically and, when a good match is found, the matched article brought to the individual's attention as especially likely to be of interest; a minimal sketch of this profile-matching idea appears below. Information filtering has usually operated on textual information, but multimedia technology is increasingly having an impact. For example, Story et al.'s (1992) RightPages system uses actual images of journal covers and pages and even has a prototype voice output module.
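To make the profile-matching idea concrete, the following is a minimal Python sketch. It assumes a simple bag-of-words interest profile and a crude frequency-weighted match score; the function names, scoring rule, and threshold are our illustrative choices, not a description of any of the cited systems.

    # A minimal sketch of profile-based filtering: build a bag-of-words
    # interest profile from articles a user has read, then score incoming
    # articles against it. Names, scoring rule, and threshold are
    # illustrative assumptions, not any cited system's design.
    from collections import Counter
    import re

    def terms(text):
        """Tokenize into lowercase word terms."""
        return re.findall(r"[a-z']+", text.lower())

    def build_profile(read_articles):
        """Aggregate term frequencies over the articles the user has read."""
        profile = Counter()
        for article in read_articles:
            profile.update(terms(article))
        return profile

    def match_score(profile, article):
        """Average profile relative frequency of the article's terms: a crude
        match measure (cosine similarity on weighted term vectors is a
        common refinement)."""
        article_terms = terms(article)
        total = sum(profile.values()) or 1
        return sum(profile[t] / total for t in article_terms) / (len(article_terms) or 1)

    def filter_incoming(profile, incoming, threshold=0.005):
        """Bring to the user's attention only articles matching the profile."""
        return [a for a in incoming if match_score(profile, a) >= threshold]

A real filter would also discount very common words, which otherwise dominate both the profile and the score; the background-frequency techniques discussed later address exactly this.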
INFORMATION CUSTOMIZATION

While information delivery technology is an important area, we believe that a significant further improvement in the value and usability of information will be needed in the information-rich environment of the present and future. This improvement will be possible through electronic information customization. Customized information is processed specifically to the needs of the user, beyond the raw information in the form in which it is produced, stored, and delivered. Much as a fine suit needed for giving a paper at this conference might require altering to fit its particular wearer, information that is produced, stored, and distributed in the same form for all users needs to be further customized to the particular needs of a given user. Customizing electronic information means changing its form to be better suited to one-time needs.

Hypertext is one popular approach to customizing information. Hypertext allows the user to access textual passages in the particular order that is best at the time of access, in contrast to simply reading a document, which is severely constrained by the structure imposed upon it by the writer.

An active research area in information customization is information extraction. Information extraction systems provide an alternative to the inherent weaknesses in routing and categorization systems (Sundheim 1991). Extraction systems attempt to draw the content out of the data "on the fly" and according to the particular interests and inclinations of the user. This is no longer just information distribution: while information is indeed being delivered, it is also customized before presentation to the user. Information extraction has the advantage over categorization systems that the user can discriminate on the basis of the match between his interests and the actual content of the document, without the disadvantage of being simplistically based on keywords, which tend to be ambiguous and independent of context. Another advantage of extraction systems is that they may operate on documents which have not been profiled by keyword. Routing, categorization, and extraction systems are all subsumed under the general category of information filtering systems. However, it is important to keep in mind that routing and categorizing are methods of delivery, while extraction is a method of customization.

One variation on extraction systems we call document "gisting." A gist of a document is a set of several sentences automatically selected by gisting software to reflect the actual content of the document. A condensed summary of our approach to document gisting is given next.

GISTING: OUR METHOD

Our approach to computer-based gisting is in the tradition of earlier work on the automatic generation of abstracts from documents, which dates back as far as 1958 (Baxendale 1958; Luhn 1958). Subsequent work includes Edmundson and Wyllys (1961), Oswald et al. (1959), Edmundson (1964; 1969), Rush et al. (1971), and Paice (1981). More recently, Story et al. (1992) extract outlines of journal pages from section headings and first and last sentences of paragraphs as part of a comprehensive information filtering system.

A frequently useful source of information for systems that seek to extract relatively important material from documents is the background frequencies of words (Francis and Kučera 1982; Kučera and Francis 1967; Thorndike and Lorge 1944; Eldridge 1911; other references provided by Zipf 1935). Such information can be used to facilitate extraction by looking for words which appear with disproportionate frequency in the document of interest.
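As an illustration of this idea, the sketch below scores each word by the difference between its relative frequency in the document and its background relative frequency, in the spirit of the difference methods discussed by Edmundson and Wyllys (1961). The background table here is a stand-in with invented values; a real system would load counts such as those of Francis and Kučera (1982), which we do not reproduce.

    # Illustrative background relative frequencies; a real system would load
    # word counts from a reference corpus such as Francis and Kucera (1982).
    from collections import Counter
    import re

    BACKGROUND = {"the": 0.070, "of": 0.036, "tire": 0.00002}  # stand-in values

    def disproportion_scores(text):
        """Score each word by its relative frequency in the document minus
        its background relative frequency (a difference method in the spirit
        of Edmundson and Wyllys, 1961). Words missing from the background
        table get a small default frequency."""
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(words)
        n = len(words) or 1
        return {w: c / n - BACKGROUND.get(w, 1e-6) for w, c in counts.items()}

Words with large positive scores, such as "tire" in a document on tire recycling, stand out against the background and become candidates for keyword status, while common function words score near or below zero.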
Our document gisting project is complementary to earlier work in image gisting, which uses a forward-chaining expert system to extract semantic information from a digitally stored picture based upon definable properties of the image (Berghel et al., 1992).

We start by tabulating the number of occurrences of each word appearing in the document. Each word is also associated with a list of pointers to the sentence(s) in the document containing the word. Then, we attempt to consolidate inflected word forms under a single base form. Our procedure is currently rudimentary, seeking only to subsume the plural form under the singular. This may be made more sophisticated as needed by the gisting algorithms as they develop; currently, plural subsumption appears satisfactory. Data is retained on the fifty most common base terms, with frequency-of-occurrence counts for singular, plural, and both together. Of these, the top twenty terms are assigned a background frequency count, which is the frequency of occurrence of that term in the standard reference, Francis and Kučera's Frequency Analysis of English Usage. From the combination of absolute frequency in the document and background frequency derived from Francis and Kučera, we derive a tentative "importance" value, which will ultimately be used in the creation of gists. The use of such importance values is suggested by Edmundson and Wyllys (1961). Currently, the software displays a matrix showing which derived keywords occur in which sentences of a document. An outline of the algorithm appears in Figure 1. Sample output appears in Figure 2.

Computational complexity. The computational task is to process a document to produce an index containing each word appearing in the document and, for each word, the number of times it appeared and pointers to the sentences containing it. This task can be accomplished in a single pass through the document. The time complexity of a single pass is at least O(n), where n is the length of the document, but can be (and in this case is) higher due to the need for accessing and updating the index. Every time a word is processed, the count of the number of times that word has appeared in the document must be incremented, and a pointer added to the index pointing to the sentence containing it. If a word is not present in the index it must be added, so the index grows as the document is processed and therefore becomes harder to search. If binary search is used, the complexity of each search is O(log m), where m is the length of the index. The index lengthens at a slower and slower rate as processing proceeds, because words are added less frequently later than at first: an increasing proportion of the words appearing later in the document have already been added. Since the index length increases at a slower and slower rate, and since the difficulty of searching an index increases only slowly with its length, the time complexity of the entire process is only slightly greater than O(n). Therefore, processing large documents this way should be feasible. Further details on our approach may be found in Berghel and Berleant (1993), Berghel et al. (1993), and Stanley et al. (1993).
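The following Python sketch condenses the single-pass indexing and importance assignment just described and outlined in Figure 1 below, under some simplifying assumptions of our own: a plain dictionary stands in for the binary-searched index discussed above, plural subsumption is reduced to folding a trailing "s" into an existing singular form, and only the top twenty base terms are kept rather than the fifty mentioned earlier.

    # A condensed sketch of the single-pass indexing and importance
    # assignment outlined in Figure 1. Simplifications of our own: a plain
    # dictionary replaces the binary-searched index, plural subsumption is
    # a naive trailing-"s" rule, and only the top twenty terms are kept.
    from collections import defaultdict
    import re

    BACKGROUND = {}  # word -> background relative frequency, e.g. from Francis and Kucera (1982)

    def gist_index(document, top=20):
        # Split into sentences (crudely) and number them from 1.
        sentences = re.split(r"[.!?]+\s+", document)
        count = defaultdict(int)   # word -> number of occurrences in the document
        where = defaultdict(set)   # word -> numbers of the sentences containing it
        for i, sentence in enumerate(sentences, start=1):
            for word in re.findall(r"[a-z']+", sentence.lower()):
                count[word] += 1
                where[word].add(i)
        # Naive plural subsumption: fold "tires" into "tire" when both occur.
        for plural in [w for w in count if w.endswith("s") and w[:-1] in count]:
            base = plural[:-1]
            count[base] += count.pop(plural)
            where[base] |= where.pop(plural)
        total = sum(count.values()) or 1
        # Tentative importance: document relative frequency minus background
        # relative frequency (Edmundson and Wyllys' difference method).
        top_terms = sorted(count, key=count.get, reverse=True)[:top]
        return {w: (count[w],
                    count[w] / total - BACKGROUND.get(w, 0.0),
                    sorted(where[w]))
                for w in top_terms}

The returned sentence-number sets are what the keyword-by-sentence matrix of Figure 2 displays, and they are the raw material for the sentence selection that gisting will ultimately perform.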
--------------------------------------------
From the document, generate an index.
        |
        V
From the index, extract occurrence frequencies of the words in the document.
        |
        V
Augment the occurrence frequency list with additional entries for the sums of singular and plural forms.
        |
        V
Extract the twenty entries with the highest frequencies.
        |
        V
Assign to each entry an "importance" value based on its frequency.

Figure 1. Outline of our current algorithm.
--------------------------------------------

--------------------------------------------
[Screen shot]

Figure 2. Sample output. The document subject was recycling of scrap vehicular tires. The second of the four windows shows the twenty most common base terms, numbered 01 to 20, listed in descending order of frequency of occurrence in the document, along with their "importance" ratings based on their frequency in the document and their background frequencies from Francis and Kučera (1982), using Edmundson and Wyllys' (1961) difference method. A cross-reference of which words (numbered 01-20) occurred in which sentences of the document (numbered from 0001, with the first nine shown) appears in the larger window.
--------------------------------------------

CONCLUSIONS

The technique of using frequency analysis of words to determine their significance in a document is certainly not new. Early work, such as that of Edmundson and Wyllys and of Luhn, first developed this class of techniques, and both of those articles demonstrated its application to automatic indexing and abstracting. The work covered in this report is not substantially different from theirs at the level of the frequency analysis. It does differ significantly, however, in that it is a preliminary step toward gisting algorithms that cluster sentences similar in meaning, for customized abstract generation. Our present effort should be considered a foundation for a study of alternative strategies for document abstracting. As such, we rely on the traditional keyword-selection and weighting schemes used over the past several decades. Our ultimate objective is to contribute to the new and important field of information customization, by extracting special-purpose abstracts from documents to suit the particular, perhaps one-time, needs of users.

CONTRIBUTORS AND ACKNOWLEDGEMENTS

Daniel Berleant and Hal Berghel wrote the paper and designed the program whose output is shown in Figure 2. Karthikeyan Viswanathan implemented the software that produced the output of Figure 2. Michael Stanley implemented the algorithm of Figure 1. Prasad Sunkara contributed to other parts of the software. We wish to thank the anonymous reviewers, especially #80, for useful suggestions.

REFERENCES

1. Baxendale, P. B., Machine-Made Index for Technical Literature --- an Experiment, IBM Journal of Research and Development 2 (4) (1958) 354-361. Reviewed by Edmundson and Wyllys (1961).
2. Belkin, N. and W. Croft, Information Filtering and Information Retrieval: Two Sides of the Same Coin, Communications of the ACM 35 (12) (December 1992) 29-38.
3. Berghel, H. and D. Berleant, Word-Based Gisting for Electronic Information Management, Technical Report CSCI-TR-93-06, Department of Computer Science, University of Arkansas, Fayetteville, AR 72701, 1993.
4. Berghel, H., D. Berleant, K. Viswanathan, and V. Sunkara, KEYCHAIN: A Research Support Tool for the Study of Keyword Chains in Electronic Documents, Technical Report CSCI-TR-93-07, Department of Computer Science, University of Arkansas, Fayetteville, AR 72701, 1993.
5. Berghel, H., D. Roach, and Y. Cheng, An Expert System Approach to Image Analysis, Expert Systems 3 (2) (1992) 45-52.
6. Bhatia, S. K. and J. S. Deogun, Cluster Characterization in Information Retrieval, in Proceedings of the 1993 ACM/SIGAPP Symposium on Applied Computing, ACM Press, 1993, 721-728.
7. Communications of the ACM, Special Section on Information Filtering 35 (12) (December 1992) 26-81.
8. Edmundson, H. P., Problems in Automatic Extracting, Communications of the ACM 7 (4) (April 1964) 259-263.
9. Edmundson, H. P., New Methods in Automatic Extracting, Journal of the Association for Computing Machinery 16 (2) (April 1969) 264-285.
10. Edmundson, H. P. and R. E. Wyllys, Automatic Abstracting and Indexing --- Survey and Recommendations, Communications of the ACM 4 (5) (May 1961) 226-234.
11. Eldridge, R. C., Six Thousand Common English Words, The Clement Press, Buffalo, 1911. Summary in Zipf (1935).
12. Francis, W. N. and H. Kučera, Frequency Analysis of English Usage: Lexicon and Grammar, Houghton Mifflin Company, Boston, 1982.
13. Kučera, H. and W. N. Francis, Computational Analysis of Present-Day American English, Brown University Press, 1967.
14. Luhn, H. P., The Automatic Creation of Literature Abstracts, IBM Journal of Research and Development 2 (2) (April 1958) 159-165.
15. Oswald, V. A., et al., Automatic Indexing and Abstracting of the Contents of Documents, Report RADC-TR-59-208, Air Research and Development Command, US Air Force, Rome Air Development Center, 1959, pp. 5-34, 59-133. Reviewed by Edmundson and Wyllys (1961).
16. Paice, C. D., The Automatic Generation of Literature Abstracts: An Approach Based on the Identification of Self-Indicating Phrases, in R. N. Oddy, S. E. Robertson, C. J. van Rijsbergen, and P. W. Williams, eds., Information Retrieval Research, Butterworths, 1981, pp. 172-191.
17. Rush, J., R. Salvador, and A. Zamora, Automatic Abstracting and Indexing, Journal of the American Society for Information Science (July-August 1971) 260-273.
18. Schaller, N. C. and B. Carlson, Proliferation of Electronic Networks Brings Good News, Bad News, IEEE Computer 26 (9) (September 1993) 94.
19. Stanley, M., H. Berghel, and D. Berleant, A Preliminary Word Frequency Analysis for Technical Computing Literature, Technical Report CSCI-TR-93-02, Department of Computer Science, University of Arkansas, Fayetteville, 1993.
20. Story, G. A., L. O'Gorman, D. Fox, L. L. Schaper, and H. V. Jagadish, The RightPages Image-Based Electronic Library for Alerting and Browsing, IEEE Computer 25 (9) (September 1992) 17-26.
21. Sundheim, B., ed., Proceedings of the Third Message Understanding Conference (MUC-3), Morgan Kaufmann, Los Altos, 1991.
22. Thorndike, E. L. and I. Lorge, The Teacher's Word Book of 30,000 Words, Bureau of Publications, Teachers College, Columbia University, New York, 1944.
23. Zipf, G. K., The Psycho-Biology of Language, Houghton Mifflin Co., Boston, 1935.