User:KYPark/006

A DIRECT APPROACH TO INFORMATION RETRIEVAL

Table of Contents

   WHAT
   WHY
   HOW
1. INTRODUCTION
2. THE LINE OF ATTACK
3. SYSTEMS VS. USERS
   3.1 Discrimination
   3.2 Prediction
4. DOCUMENTS VS. SURROGATES
5. THE THEORY OF INTERPRETATION
   5.1 Denotation and Connotation
   5.2 The Theory of Ogden and Richards
   5.3 Implications for Information Retrieval
6. PROPOSAL FOR FILE ORGANIZATION
   6.1 Incentives
   6.2 Extracts as Indexing Sources
   6.3 Extracts as Review Sources
7. CONCLUSION
8. REFERENCES


6. PROPOSAL FOR FILE ORGANIZATION

6.1 Incentives

The idea proposed in this chapter is to use, in information retrieval, those extracts in which the source document cites, describes, criticizes, and/or collates other documents (see Figure 9). Within the scope of this study it is only exploratory. It can be justified on the grounds that the citing and the cited documents are coherent with each other, that extracts provide concise clues for discriminating these documents, and that even concise clues are interpreted meaningfully in the given contexts. Although widely practiced among information users, the idea has not yet, as far as I know, been formally studied with a view to efficient file organization. Its implications may therefore reach farther in the future than can now be expected, and will require more exploration. What is immediately required is some rationale behind the idea. While all the preceding discussions are relevant to this rationale, the following are intended to support the idea specifically.

Now it is almost certain that subject coverage or specialization can hardly be defined consistently and objectively. At best we can say that two documents are similar with respect to something, based on the evidence that we recognize in the documents. Still, even the totality of evidence would not guarantee similarity; it gives us no more than a degree of belief.

In most cases, two documents that are similar in some respect are indexed or abstracted individually. In this sense they are related to each other only indirectly, or with some uncertainty. Indexing inconsistency, caused mainly by variation among individual indexers even when index terms are assigned fairly adequately, is now well known. It significantly degrades retrieval as a process of grouping similar documents. It is therefore desirable to use direct evidence of similarity established between two or more documents.

We can quite reasonably say that the citations, by which I mean both the citing and the cited articles inclusively, are similar at a certain level of abstraction, especially in highly specialized fields of science. Therefore we can trace back and forth between the citations in order to find similar articles. This is the principle of citation indexing applied by Garfield¹⁶. However, the serious objection to citation indexing is that it carries too much risk, relying heavily on the mere fact that X cites Y. Tracing back and forth tends to diverge tremendously. The solution required for this technique would be to exclude noise sources and to provide all the citations with subject indicators more powerful than titles. Lipetz¹⁷ attempted to improve the selectivity of citations by providing "context indicators" rather than "subject indicators." His approach seems plausible, but demands much intellectual effort. So far, then, citation indexing has not significantly demonstrated the usefulness of direct evidence.
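The tendency of back-and-forth tracing to diverge can be illustrated with a minimal sketch. The article labels and the `cites` table below are hypothetical, and the procedure is only one plausible reading of "tracing", not Garfield's actual indexing method.

```python
from collections import defaultdict

def build_cited_by(cites):
    """Invert a citing -> cited table into a cited -> citing table."""
    cited_by = defaultdict(set)
    for citing, refs in cites.items():
        for ref in refs:
            cited_by[ref].add(citing)
    return cited_by

def trace(start, cites, max_steps):
    """Collect every article reachable from `start` within `max_steps`
    citation links, following links both backward (to cited articles)
    and forward (to citing articles)."""
    cited_by = build_cited_by(cites)
    seen = {start}
    frontier = {start}
    for _ in range(max_steps):
        reached = set()
        for article in frontier:
            reached.update(cites.get(article, ()))     # backward tracing
            reached.update(cited_by.get(article, ()))  # forward tracing
        frontier = reached - seen
        seen |= frontier
    return seen - {start}

# Hypothetical citation table: X cites Y and Z, W cites Y, V cites W.
cites = {"X": ["Y", "Z"], "W": ["Y"], "V": ["W"]}

one_step = sorted(trace("X", cites, 1))
two_steps = sorted(trace("X", cites, 2))
three_steps = sorted(trace("X", cites, 3))
print(one_step, two_steps, three_steps)
```

Each additional step admits articles whose only connection to X is the mere fact of a citation link, which is the risk described above.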

In this respect, bibliographic coupling, the far more elaborate method developed by Kessler¹⁸, shares the same fate as citation indexing. It has been noted¹⁹ that "citation tracing is pervasive information-seeking mode." What should be further noted is that backward tracing is much more pervasive, and that any intellectual tracing is initiated by discerning some meaningful evidence rather than the "mere fact."
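For comparison, bibliographic coupling is commonly described as counting the references that two documents have in common. The sketch below follows that common description; the reference lists are hypothetical, and the function name `coupling_strength` is an assumption introduced here.

```python
def coupling_strength(refs_a, refs_b):
    """Number of references shared by two documents."""
    return len(set(refs_a) & set(refs_b))

# Hypothetical reference lists of two articles.
article_a = ["R1", "R2", "R3", "R4"]
article_b = ["R3", "R4", "R5"]

strength = coupling_strength(article_a, article_b)
print(strength)
```

Like citation indexing, the measure rests on the mere fact of citation: it records that two articles share a reference, not why either of them cited it.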

On the other hand, it is questionable whether indexes and abstracts are the only means of retrieval as an extension of information-seeking facilities. Books, reviews, monographs, and journal articles: all these are likely to lead us from our information needs to other sources of information. Almost all scientific articles cite, describe, analyze, and group a number of other articles. Thus, the reader of the citing articles can, perhaps very easily, discriminate the cited articles as to their subjects, crucial points, logical relationships, and so on. By doing so, the reader is in effect retrieving relevant articles with the aid of expertise.

Vickery²⁰ emphasizes the importance of review articles and the like as an efficient, selective "means to discover what they must read amid the vast mass of available documents," pointing out that "the traditional means of discovery of the pertinent literature are inadequate." Nevertheless, the traditional means may better give access to more selective means. That is to say, the strategy of discovery may best be divided into two different means.

A similar strategy was considered by Goffman, et al.¹³ by introducing meta-linguistic terms to indexing. However, their approach appears passive in that it is simply intended to divide a file in order to economize searches. A more active approach is therefore desired for selective discovery in terms of quality rather than quantity.

On the other hand, Goffman, et al.¹³ regret that many abstracts written in "trivial" meta-language are much closer to object-linguistic "extract," and that many reviewers write abstracts instead of the state of the art. They seem to favor meta-linguistic abstracts over object-linguistic "extracts." Ironically, one of the authors has recently shown that extracts are better than abstracts in terms of calculated entropies as well as intellectual effort.¹⁴ The power of meta-language, which they properly recognized, remains inconclusive and awaits further observation.

On the whole, most of the traditional means, such as subject indexes, abstracts, and extracts, seem to be paralyzed when faced with the demands of efficient file organization. The obsolescence of scientific literature²¹ is now widely known. Brookes²² was interested in the obsolescence involved in a cumulative file. Unfortunately, his interest has not yet been worked out. A large, cumulative file certainly accumulates archival value, but at the cost of devalued retrieval. Thus systematic file organization AND maintenance should be taken as most essential in view of information retrieval.

Recently, Blaxter and Blaxter²³ report an interesting observation on the needs and habits of scientific authors and readers in three research institutes. They show that the information needs of individual working scientists are met by a very small number of primary journals, and that the cited references appended to primary articles or review articles are used in most literature searches. More precisely:

Trace back from a paper : about 40% on average
Trace back from a review : more than 20% each.

If this is the general pattern of literature searches by working scientists, and if information retrieval is ultimately to meet the information needs of individual scientists, file organization should be considered in the light of the above observation.

6.2 Extracts as Indexing Sources

Figure 9 shows the first paragraph extracted from an article* (hereafter called the sample article) in a recent issue of Physical Review. The extract has eight references (Refs. 1-8) not merely cited and described, but also criticized and collated. With respect to the cited references, the extract is meta-linguistic and of a review kind. Similar extracts can be made from other parts of the sample article wherever each cites one or more references (Figure 13). By extract is meant hereafter an extract of this kind, as opposed to a common, object-linguistic extract.

* G. J. Kutcher, P. P. Szydlik, and A. E. S. Green. "Independent-particle-model study of electrons elastically scattering from oxygen." Physical Review A, vol. 10 no. 3 (September 1974) pp. 842-850.

From the extract in Figure 9, a subject index to Refs. 1-8 may be derived as illustrated in Figure 10. The complete subject index to Refs. 1-37 of the sample article is shown in Figure 11. The actual indexing is done on the work sheet as illustrated in Figure 12, with the indexer selecting index terms while scanning the source document. There is therefore no need to actually write out the extracts for indexing.
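The derivation of a subject index from a completed work sheet might be sketched as follows. The work sheet layout (pairs of an index term and the reference numbers it applies to) is an assumption, since Figure 12 is not reproduced here. Only the term INDEPENDENT PARTICLE MODEL for Refs 1-8 comes from the sample article; the other term is a hypothetical placeholder.

```python
from collections import defaultdict

def derive_subject_index(work_sheet):
    """Merge (term, reference numbers) entries into an alphabetically
    ordered index mapping each term to its sorted reference numbers."""
    index = defaultdict(set)
    for term, refs in work_sheet:
        index[term.upper()].update(refs)
    return {term: sorted(refs) for term, refs in sorted(index.items())}

# Assumed work sheet entries, noted while scanning the source document.
work_sheet = [
    ("independent particle model", range(1, 9)),  # Refs 1-8, from Ext a
    ("hypothetical term A", [2, 5]),              # placeholder entry
    ("hypothetical term A", [7]),                 # same term, later extract
]

subject_index = derive_subject_index(work_sheet)
for term, refs in subject_index.items():
    print(term, refs)
```

Because the entries are taken directly from the work sheet, the sketch never needs the text of the extracts themselves.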

Figure 9. An Extract from the Sample Article.

Figure 10. The Subject Index Derived from the Extract.

Figure 11. Index Terms Derived from the Completed Work Sheet.

Figure 12. Indexing Work Sheet Completed for the Sample Article.

Figure 13. A Provisional Compilation of Extracts for the Sample Article.

6.3 Extracts as Review Sources

Figure 13 illustrates a provisional compilation for the sample article where:

ScD : Source of the sample article, i.e., location, title and author;
Abs : Abstract of the sample article;
Ext : Extracts (Exts a-aa);
Ref : References (Refs 1-37).

Similar compilations for a number of source documents may be serially accumulated into a file. Combined with the subject index and the author index, this file may be used as a personal or other means of information retrieval. The convenience of the file will remain a technical problem.
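One way to picture such a compilation is as a record with the four parts listed above, accumulated serially into a list. This is a minimal sketch: the class and field names are assumptions, and the abstract, extract, and reference texts are placeholders rather than the contents of Figure 13.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Extract:
    label: str        # extract label, e.g. "a"
    text: str         # the extracted sentences
    refs: List[int]   # reference numbers cited in the extract

@dataclass
class Compilation:
    source: str                                               # ScD
    abstract: str                                             # Abs
    extracts: List[Extract] = field(default_factory=list)     # Ext
    references: Dict[int, str] = field(default_factory=dict)  # Ref

    def refs_without_extract(self):
        """Reference numbers not yet covered by any extract."""
        covered = {n for e in self.extracts for n in e.refs}
        return sorted(set(self.references) - covered)

record = Compilation(
    source="Kutcher, Szydlik, and Green, Physical Review A 10 (1974) 842-850",
    abstract="(placeholder for the abstract)",
    extracts=[Extract("a", "(placeholder for Ext a)", list(range(1, 9)))],
    references={n: f"(placeholder for Ref {n})" for n in range(1, 38)},
)

file = []            # compilations are accumulated serially
file.append(record)
print(record.refs_without_extract())
```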

The use of the file in retrieval is much the same as that of reviews and textbooks, which can lead the reader to various sources of information. As mentioned previously, the extracts under consideration are in fact of a review kind. External and psychological contexts are involved in reading reviews. In extending to other sources of information, the reader can benefit from the expertise provided by reviews of source documents. Certainly he would not make instantaneous, mechanistic YES-NO decisions based on simple criteria. On the contrary, his decisions will be carefully thought out.

Selection of one source document by using the subject index is relatively less important, since it is mainly intended to lead to retrieval of as many cited references as possible. The usefulness of the file will therefore depend on the coherence of citations, i.e., the coherence of the cited references with each other as well as with the source document. Extracts should be made as short as possible, provided that they do not significantly degrade the maximum coherence obtainable from the full text. Here, coherence may be defined:

$$\text{coherence} = \frac{\text{number of citations retrieved as relevant}}{\text{number of citations examined}}$$
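As a worked illustration of this ratio, the following sketch computes coherence from two counts. The function name and the example counts are hypothetical and are not measurements from the sample article.

```python
def coherence(retrieved_as_relevant, examined):
    """Ratio of citations retrieved as relevant to citations examined."""
    if examined <= 0:
        raise ValueError("at least one citation must be examined")
    if not 0 <= retrieved_as_relevant <= examined:
        raise ValueError("relevant citations cannot exceed those examined")
    return retrieved_as_relevant / examined

# Hypothetical case: a reader examines 8 cited references
# and retrieves 6 of them as relevant.
value = coherence(6, 8)
print(value)
```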

From the extract Ext a in Figure 13, the reader may notice that all the cited references (Refs 1-8) are about INDEPENDENT PARTICLE MODEL, which presumably represents the significant aspect in common. Much subject content behind this representation may be covered by the abstract of the source document. Thus, given the context by the abstract, the reader can to some extent do without the individual abstracts of the cited articles. Similarly, the reader can benefit from other contexts which are exchanged between the cited references. How much he can benefit from these external contexts will depend on his psychological context.

Extracts should be made primarily in one or more sentences. Description in sentences is one of the advantages of extracts over description in keywords. However, some extracts are nonsensical, mostly redundant, or in need of modification. In these cases it would be better either:

to abandon an extract (See Exts h, n) or
to reinforce an extract (See Ext d) or
to select only keywords or phrases (See Exts g, q).

In short, the length and the coherence of extracts should be balanced. Extraction of keywords or phrases, similar to subject indexing, may suffice in many cases.

Perhaps the simplest file organization would be to mark extracts directly on the source document and to derive the subject index from them. In a sophisticated environment, e.g., visual display and keyboard manipulation of constituent files, the following organization may be convenient.

Subject index - in alphabetic order.
Citation index - including unduplicated citations.
Extract file - including abstracts and extracts.

Figure 14 illustrates an entry to the subject index, and Figure 15 illustrates ways of access from the subject index to the citation index and to the extract file.
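The three constituent files and one possible access sequence can be sketched as plain dictionaries. The key formats, the field names, and all entries below are assumptions made for illustration; they do not reproduce Figures 14 and 15.

```python
# Subject index: index term -> identifiers of cited documents.
subject_index = {
    "INDEPENDENT PARTICLE MODEL": ["C1", "C2"],
    "HYPOTHETICAL TERM A": ["C2"],
}

# Citation index: each cited document appears once, with the
# extracts (source document, extract label) in which it is cited.
citation_index = {
    "C1": {"text": "(placeholder citation 1)", "extracts": [("S1", "a")]},
    "C2": {"text": "(placeholder citation 2)",
           "extracts": [("S1", "a"), ("S2", "c")]},
}

# Extract file: (source document, label) -> extract text;
# the label "abs" holds the abstract of the source document.
extract_file = {
    ("S1", "abs"): "(placeholder abstract of S1)",
    ("S1", "a"): "(placeholder extract a of S1)",
    ("S2", "c"): "(placeholder extract c of S2)",
}

def lookup(term):
    """Follow subject index -> citation index -> extract file."""
    results = []
    for cit_id in subject_index.get(term, []):
        citation = citation_index[cit_id]
        for key in citation["extracts"]:
            results.append((citation["text"], extract_file[key]))
    return results

terms = sorted(subject_index)  # subject index in alphabetic order
hits = lookup("HYPOTHETICAL TERM A")
print(terms)
print(hits)
```

Keeping each citation only once in the citation index means that a document cited by several source documents gathers all of its extracts under a single entry.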

Figure 14. An Entry of the Subject Index.

Figure 15. Access Sequences.

AFTERTHOUGHTS
