Library culture, computer culture, and the Internet haystack

Prentiss Riddle, Rice University

(I wrote this piece in the spring of 1994 as a position statement for Digital Libraries '94, but it wasn't published there. It now seems rather dated, although I mostly still agree with it. Yahoo has proven the superiority of human-built Internet indexes over robot-built ones in many applications. A number of institutions from the library world have undertaken projects to "catalog" the Internet, although I fear that the high cost of traditional "cataloging" and its emphasis on describing a fixed bibliographic object may prove to be severe limitations. Ironically, one of the strongest forces behind the development of collaborative bibliographic tools may not be the desire to provide access but the desire to restrict it, as systems are developed to censor the Internet and limit the portions of it available to children (see the Internet Parental Control FAQ). - PR 1996.06.17)

I come to this conference as a common laborer in the online fields. I am a systems programmer who runs a campus-wide information system for a small university (RiceInfo). Like many CWISes, ours began as an inward-looking tool for internal communication, but soon became a starting point for users trying to find things on the Internet as we all discovered the power of tools like gopher, WWW and Mosaic. My prejudices, to put them up front, are those of someone trying to make practical use of available resources to keep my users happy in the here and now.

The explosion of the Internet has delivered to our desktops the virtual equivalent of a vast warehouse of unsorted paper. The question which interests me is: how does one convert a mountain of digital pulp into something resembling a library?

To oversimplify, the difference between a library and a pile of paper is how easy it is to find the particular items which interest you, especially when you don't know exactly what you're looking for in the first place. The tools currently available for sorting through the Internet haystack -- archie, veronica, jughead and their WWW equivalents -- are much better than no tools at all, but they are still extremely crude when compared to the subtlety and precision of the average library catalog or journal index. It's worth asking why this is the case and what to do about it.

I find that when librarians and computer professionals look at this problem, they do so from two sides of a wide cultural gap. Librarians are at a disadvantage because, to generalize shamelessly about the profession as a whole, they can never quite understand the technology as well as the people who invent it. While many librarians are struggling valiantly to keep up with runaway technology, they are continually in a position of reacting to rather than originating change. Add to that the budgetary constraints facing many libraries today, and the result is that creative insights from library science are woefully underrepresented in the evolution of Internet tools and resources.

Computer people, for their part, have some handicaps as well. One handicap is a lack of understanding of library science. Computer people are continually trying to reinvent concepts which librarians have been honing for decades. Librarians who attend Internet developers' conferences such as GopherCon or the IETF sometimes refer to the discussions of cataloging which take place there as "library school kindergarten".

But an even larger handicap, I believe, is the assumption in the computer field that human time and judgement are extremely scarce and costly resources, much too expensive to be wasted on these problems. Consequently, computer people try to solve every problem by throwing robots at it. Opinions vary widely as to the promise of smarter robots; I confess to a large dose of skepticism about the potential usefulness of AI and natural language processing techniques in this area. However, I don't think it's a stretch to suggest that even very smart robots will share a flaw with the clunky but serviceable robots (archie, veronica, etc.) available on the net today: they all suffer from the garbage-in-garbage-out problem. The vast bulk of the information on the Internet is junk, just like the vast bulk of what gets printed on paper. I doubt that robots capable of making reliable value judgements about the quality of online resources will be readily available any time soon.

So I am interested in using computers to apply to Internet resources the sort of judgement about selection and classification of materials which is librarians' stock in trade. At Rice we've had some success with a gopher tree of "Information by Subject Area". Early in the growth of gopher, many sites independently attempted to build simple menus organizing links to resources by subject. Most of these classification schemes were ad hoc and designed by computer personnel without any deep knowledge of subject classification, but fortunately many of them happened to be more or less compatible with one another. I wrote a simple program called linkmerge to automatically merge hand-selected sets of compatible gopher subject menus maintained elsewhere, resulting in one of the most comprehensive gopher subject trees available.
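The merging step itself is mechanical. Here is a minimal sketch of the idea in Python; it is a hypothetical reconstruction, not the original linkmerge. It assumes the standard gopher menu format (a one-character item type, then tab-separated display string, selector, host and port, with a lone "." terminating the listing), and the source hosts and selectors are placeholders.

```python
# Hypothetical sketch of a linkmerge-style merge: fetch several gopher
# subject menus, pool their entries, drop duplicates, and print one
# merged menu. Hosts and selectors below are placeholders.

import socket

def fetch_menu(host, selector, port=70):
    """Fetch one gopher menu and return its entries as 5-tuples."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(selector.encode("ascii") + b"\r\n")
        data = b""
        while chunk := sock.recv(4096):
            data += chunk
    entries = []
    for line in data.decode("latin-1").splitlines():
        if not line or line == ".":
            continue                       # "." terminates the listing
        itemtype, fields = line[0], line[1:].split("\t")
        if len(fields) >= 4:               # display, selector, host, port
            entries.append((itemtype, *fields[:4]))
    return entries

def merge_menus(sources):
    """Merge menus from several sites, deduplicating by location."""
    seen, merged = set(), []
    for host, selector in sources:
        for entry in fetch_menu(host, selector):
            key = entry[2:5]               # (selector, host, port): same
            if key not in seen:            # resource even if renamed
                seen.add(key)
                merged.append(entry)
    return sorted(merged, key=lambda e: e[1].lower())

if __name__ == "__main__":
    sources = [("gopher.example.edu", "1/subject"),        # placeholders
               ("gopher.example.org", "1/Library/bysubject")]
    for itemtype, display, sel, host, port in merge_menus(sources):
        print(f"{itemtype}{display}\t{sel}\t{host}\t{port}")
```

In practice the interesting work lies in the hand-selected source list and in reconciling near-duplicate subject headings, not in the merge itself.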

This approach has several problems. It, too, suffers from noise. It currently works primarily with gopher and would be difficult to apply to World Wide Web because WWW resources are not organized in straightforward lists like gopher menus. Worst of all, it is not scalable: sorting Internet resources into a handful of categories is doomed by the rapid growth of the Internet.

So the next step I have been thinking about is a set of tools to allow librarians to collaborate on crafting a more elaborate index of selected Internet resources. Participating librarians would interact with the system using some relatively platform-independent mechanism; at the moment I am leaning toward Mosaic forms. Suggested resources would be delivered to them from a variety of sources -- user suggestions, newsgroups and mailing lists where new resources are regularly announced, perhaps even carefully tuned harvesting robots -- but the librarians would exercise judgement about which resources merit inclusion in the index, just as they do in traditional collection development. Each entry would include several pieces of information: a descriptive title; a URL (or perhaps a URN) identifying the resource; an abstract; contact information for a responsible party; and -- here is the crucial contribution of library science -- one or more subject classifications using a controlled vocabulary developed specifically for Internet resources. Users would be able to navigate through the index in multiple dimensions: by searching the titles and abstracts, by searching or browsing the subject index directly, or by weaving among these choices. The underlying database would be flexible enough to be translatable into several formats: HTML for use with a WWW-compatible search engine, IAFA templates for compatibility with other Internet indexing projects, and pseudo-MARC records suitable for serving out via a Z39.50 bib-1 server.
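To make the shape of such an entry concrete, here is a minimal sketch, again in Python and purely illustrative: a record with the fields listed above and two of the proposed renderings. The IAFA field names are approximations of the template drafts rather than a verified mapping, and the sample entry is invented.

```python
# Illustrative sketch of one index entry and two export formats.
# IAFA field names are approximate; the sample data is invented.

from dataclasses import dataclass, field
from html import escape

@dataclass
class IndexEntry:
    title: str                    # descriptive title
    url: str                      # URL (or eventually a URN)
    abstract: str
    contact: str                  # responsible party
    subjects: list = field(default_factory=list)  # controlled vocabulary

    def to_iafa(self):
        """Render as an IAFA-style template (field names approximate)."""
        lines = ["Template-Type: DOCUMENT",
                 f"Title: {self.title}",
                 f"URI: {self.url}",
                 f"Description: {self.abstract}",
                 f"Admin-Email: {self.contact}"]
        lines += [f"Keywords: {s}" for s in self.subjects]
        return "\n".join(lines)

    def to_html(self):
        """Render as an HTML fragment for a WWW search engine to index."""
        subjects = ", ".join(escape(s) for s in self.subjects)
        return (f'<dt><a href="{escape(self.url)}">'
                f"{escape(self.title)}</a></dt>\n"
                f"<dd>{escape(self.abstract)}<br>"
                f"Subjects: {subjects}</dd>")

entry = IndexEntry(
    title="Subject guides to network resources",       # invented example
    url="gopher://gopher.example.edu/11/subject",
    abstract="A collection of subject-organized gopher menus.",
    contact="librarian@example.edu",
    subjects=["Internet resources -- Directories"])
print(entry.to_iafa())
print(entry.to_html())
```

The pseudo-MARC export would be one more rendering of the same record; the point is that a single underlying database can serve all three audiences.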

This is a brief and somewhat fevered sketch, which may never leave the idea stage. Nevertheless, the Internet offers us the possibility of building not only a distributed web of information resources but also a distributed web of people skilled in the art of selecting and classifying those resources. I hope that we will find ways to tap both.

Prentiss Riddle
RiceInfo Administrator, Information Technology
Rice University
2002-A Guadalupe St. #285
Austin, TX 78705
512-323-0708
riddle@rice.edu
Prentiss Riddle (riddle@rice.edu) 1994.04.04; rev. 2001.02.15