The Promise and Perils of WorldCat and Online Library Catalogues as Search Engines for Copy-Specific Information and Images of Early Books (Rev. 5/11/2007 8:56 AM)

Paul Needham’s review of the Illustrated Incunabula Short Title Catalog when it first appeared on CD-ROM in 2000 observed that the database produced predictable search errors because it transmitted errors from paper sources from which electronic records were built, and the images, most of which were taken from microfilm, were of “often middling and sometimes less than middling quality, but nonetheless they are filled with usable information” (500). Phantom books appear as a result of data entry errors, and actual volumes are hidden from searches because of limitations in the database’s design. Needham’s comments about the search engine’s impact on readers’ access to these valuable resources were much on my mind as Gail McCormick, Goucher College’s Special Collections Librarian, and I planned how to reconstruct the book collection of James Wilson Bright (1852-1926) to make it useful to scholars and to our students. Over five thousand volumes from this pioneering Johns Hopkins philologist’s collection were purchased by Goucher in May, 1925, shortly before his death. The huge trove of books was accessioned in batches between May 1926 and March 1938. In the intervening five decades, much of the collection has been in general circulation with the Goucher/Bright bookplate inside the front cover as the only sign of origin. Also, we did not know when we started the project that many early printed books in the collection were entirely absent from the online catalogue, and other pre-1700 editions were not identified by the catalogue as belonging to Bright’s collection.

In January 2006, a series of happy coincidences enabled us to begin recuperating the Bright Collection and rediscovering its true value to the college. We began collecting copy-specific cataloging information about each volume and capturing high-quality, non-lossy digital images of the volumes that would be of interest to researchers. The images record original bindings, bookplates, title pages and colophons, and other evidence of readers’ interaction with the books. In the process, we have discovered nearly 200 pre-1700 printed books in the Bright Collection that were not identified in the library catalogue. [Pre-1700 SPREADSHEET] Among our discoveries were two Sammelbände of five editions, represented in our catalog only by the first edition in the binding. Other solitary editions also surprised us. One of the Bright volumes appears not to have been described before, and a second is the only complete copy of a pamphlet otherwise known only in a damaged copy at the National Library of Scotland. While searching for those volumes, we found an additional 35 non-Bright pre-1700 books, including two incunabula. Except for the previously undescribed or demonstrably unique books, we are not imaging whole books; rather, we attempt to record bindings, pastedowns, title pages, marginalia, and other copy-specific evidence that would enable scholars to tell whether the books are worth traveling to Baltimore to view in person. Our next problem was how to make the collection’s value known to researchers outside our community.

Based on Needham’s experience with the IISTC-CD-ROM, we knew that much of the collection’s ultimate value to scholars and students would depend on our choice of a search engine or user interface through which the images would be accessed. Even the greatest image gallery may remain unseen if accessed only through a single-purpose user interface. As of January 2006, we seemed to have two choices. We could buy a pre-existing image database product to develop our own stand-alone website to house the images, or we could incorporate the images and copy-specific descriptions of these volumes within the college’s online catalogue. We also had to contend with the constraints of limited in-house funding: we got the project underway for a total of $13,000. OCLC already had begun pitching its digital content management system, “CONTENTdm,” to Goucher, and appeared ready to pounce on our project as a demonstration of the product, but as of this moment we have not entered into an agreement with them. For now, sticking to a system we can control on our own budget, we have begun integrating the image bank into our online catalog. This has helped us stay within our limited range of institutional funding, and it suits Goucher’s fundamental commitments to undergraduate teaching and original research.

This decision also has thrust us into the unexpected consequences of a titanic merger between OCLC (the “Online Computer Library Center”) and RLG (the “Research Libraries Group”), which was completed even as I was confidently developing this paper topic and which had already begun to take effect by the time Martha Driver told me the paper had been accepted. Needless to say, I have a much different talk to deliver today than I thought I would be giving in September. Nonetheless, we still may see some extraordinary results, but for now, I will be showing you a work very much “in progress,” and one which you may be able to encourage toward completion by means of your own collegial relationships with your libraries’ special collections and cataloging librarians.

Our decision to use our online catalogue’s record system to organize and display the images was inspired by Goucher’s commitment to undergraduate teaching, including instruction in research methods. Students searching the online catalogue directly or via WorldCat can use standard and copy-specific bibliographic description codes to search images by content, text author, year, place of publication, and printer, as well as by other specialized fields like “previous owners,” “engravings,” “maps,” “marginalia,” “bookplates,” “bindings,” etc. (SHOW OLLI HIT FOR LIBER SCINTILLARUM AND MARC RECORDS BEHIND OLLI HIT HERE!). This search screen from our online catalog shows searchable copy-specific information about the 1560 Venetian edition of Liber Scintillarum, Antonius Gangutia Siculus’ collection of famous quotations from the Venerable Bede. In addition to the enhanced catalog description, we have included in the copy-specific information the phrase “Digital images available,” a searchable “Note” field that enables users to do an Advanced Keyword Search on the “Notes” field for that phrase to retrieve records for all the books for which we have digital images. The RAW, TIFF, and JPEG files of our image gallery are stored with their metadata on the same server as the catalog so that a hyperlink from any catalog search return screen can point directly to an image file on the server. That also will enable their eventual linkage to a stand-alone Bright Collection web site, which will present the gallery as a coherent Internet destination that will help us create finding aids and assist our promotion of the resource to grant funding institutions. But by linking the images first to our standard catalog records, we have established access using two search engines that already exist: our own online catalog interface, and OCLC’s WorldCat, which could be used to connect our collection’s images with those of other libraries around the world. [Start “Kalamazoo 2007” Web Page]
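The kind of field-by-field searching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `copy_specific` key, the function name, and the sample values are all hypothetical and do not reproduce our catalog's actual schema or data.

```python
# Hypothetical catalog records; every field name and value here is invented
# for illustration and does not reflect the real catalog's structure.
records = [
    {
        "title": "Liber Scintillarum",
        "year": 1560,
        "copy_specific": {
            "bookplates": "Goucher/Bright bookplate",
            "notes": "Digital images available",
        },
    },
    {
        "title": "Hypothetical second volume",
        "year": 1612,
        "copy_specific": {"marginalia": "Early ink annotations"},
    },
]


def search_copy_field(recs, field, term):
    """Return titles of records whose copy-specific `field` mentions `term`."""
    needle = term.casefold()
    return [
        rec["title"]
        for rec in recs
        if needle in rec.get("copy_specific", {}).get(field, "").casefold()
    ]


# Example: which volumes have a note saying that images exist?
print(search_copy_field(records, "notes", "digital images available"))
```

The same function serves any of the specialized fields, so a student could ask for “marginalia” or “bookplates” without a separate search routine for each.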

When the collection is accessed via WorldCat, however, researchers will see our collection’s early printed books along with those at other libraries. If those records are enhanced with images, as ours have been, researchers can move directly from written bibliographic descriptions to images that improve upon the “desbib” evidence. This will enable them to decide whether Goucher’s copy of a given text might be worth on-site study. [DEMO WorldCat SEARCH FOR de Word 1527 Golden Legend] This winter, WorldCat.org was launched as a free public access version of the search engine, and its search returns are prioritized by geographical proximity. [ACTUAL WorldCat.ORG SEARCH FOR Liber Scintillarum.] WorldCat.org’s geographically ranked search returns help researchers to learn about nearby copies of research materials as well as those stored in the predictable but often distant archives of major collections. In addition to revealing the existence of large numbers of previously little-used copies of early printed books, these two versions of WorldCat enable scholars to plan research travel economically to maximize time spent with the resources themselves and to minimize time spent in travel.

Online image banks linked to catalog records create an interesting alternative, or preliminary, to what Needham called “autoptic” examination of the physical book. A searchable international early printed book image bank would enhance the chances that we all could have at least “virtual autoptic” access to early book images in the world’s libraries. High quality digital images can yield instant information about actual page size and other data that varies from copy to copy, such as owners’ inscriptions on paste-downs and title pages. The digital simulacrum of a book page will never replace the actual artifact for scholarly study, but consulting the image can confirm a scholar’s decision to invest time and treasure to visit the copy itself (McKitterick 18-21 and Tanselle xvii).

So much for the promise—now come the perils.

As most of you probably know, the Research Libraries Group’s “Union Catalog” contains copy-specific information on over 50 million pre-1900 printed books. When RLG merged with OCLC, it was absorbed into a much larger organization that specializes in online university and public library catalogs. Like most corporate mergers, this one will have far-reaching consequences for OCLC’s and RLG’s “customers.”

OCLC records originally were designed to collect and represent what Fredson Bowers called an “ideal copy” bibliographic description for each edition known to the collection. A search for any specific edition might return information that it is held in one or more of the 57 thousand member libraries which claim to own it. Copy-specific information, including links to digital images, was never immediately available in OCLC records unless by accident. RLG’s “Union Catalog,” by contrast, was designed to discriminate among individual copies of early printed editions, which they define as before 1900. In late September of 2006, when I learned that these two potentially incompatible cataloging systems were being merged, an icy dread fell upon me. Would the larger, more opulently funded OCLC transform RLG’s records to its more “user friendly,” commercially oriented system?

My confidence was not improved by a conversation in March with Bob Schulz, the WorldCat product manager, who told me that they had recently concluded marketing agreements with Yahoo and Google to include WorldCat.org in Yahoo and Google searches. Everything seemed to point toward what we might call the “Googlization” of this resource, especially his observation that the WorldCat.org product seemed successful because it was receiving “twelve million hits per month with a ten-percent click-through rate to library catalogs.” I don’t want to paint Mr. Schulz as a capitalist brigand, however. He has listened to my pleading to keep WorldCat relevant to scholars, but we have all heard the quantitative discourse of evaluation from administrators and corporate raiders, and it constitutes a dangerous challenge to the integrity of RLG’s database.

James Michalko, RLG’s Vice President for Programs and OCLC Programs and Research, has told me that “OCLC will implement an ‘institutional record’ capability in WorldCat,” which means, in layman’s terms, that a link from WorldCat’s search returns will connect researchers to individual libraries’ records containing copy-specific information (Email 3/23/07). Rather than a catastrophe, this could foretell an immense improvement in our ability to access information about specific copies. As of last April 13, RLG’s “Frequently Asked Questions” web page reported that they had compared forty million copy-specific Union Catalog records matching records for editions already in WorldCat. In addition, they had added about 8 million new editions, and they were still working on about 2 million editions to determine whether they were new to OCLC or duplicated OCLC records for editions already recorded. In addition to the Union Catalog’s three million non-Latin editions already in WorldCat, they have added 300,000 more, and they have enriched copy-specific information on another 600,000 copies. The very idea of hundreds of thousands or even millions of copy-specific descriptions that might be enhanced with digital images is something worth fighting for. At the moment, however, I must confess that the process of using the WorldCat search engine to locate image galleries of early printed books resembles using an orbiting space telescope to find a lost contact lens.

The power of the concept lies in the immense size of the collective catalogue and the robust capacity and speed of its search engine. In addition to the eight million copy-specific RLG records now integrated into the fifty million books already in the OCLC system, we also would be able to combine records of small collections of early books in 57,000 libraries located in 112 countries, if we can only make WorldCat ready to perform the kinds of accurate inquiries scholars need. We need a tool that reliably will locate all and only what we are looking for, but that robustness and exclusivity are devilishly difficult to achieve. [SWITCH TO deWorde GL 1527 SEARCH RETURN PAGE]

WorldCat’s first major problem is that it does not completely eliminate “electronic copies,” surrogates called up even though the “Books” field has been checked and the “No microforms” exclusionary field has been selected. For instance, in this search for the 1527 de Worde Golden Legend, WorldCat lists 12 copies, of which two are false hits. The University of Connecticut and University of North Carolina, Greensboro libraries “hold” only EEBO “electronic editions” of Caxton’s 1483 and de Worde’s 1493. The second problem originates with the member libraries, some of which do not allow access to their collections from off-site, or do not allow direct access from WorldCat to individual records via “deep coding.” On this search, you can see that hyperlinks are not available to Harvard’s Houghton Library, U. San Francisco’s Gleeson Library, and the National Art Library at the Victoria and Albert. That left users access to records for six detectable, complete paper copies of the 1527 de Worde Golden Legend, plus two two-leaf bifolia from the edition which were listed by catalogers as "copies."

Pennsylvania State University's Special Collections Library, University of Hawaii at Manoa and Claremont College's catalogs presented the standard OCLC record for bona fide paper copies but with no copy-specific information and no images available. The Newberry Library (Ill.) had an imperfect copy containing autographs by William Morris (1890) and Sir Sidney Cockerell. UNC-Chapel Hill had only two leaves (clxii-clxiii, “St. Swithun”), as did the U. San Francisco’s Gleeson Library, which had to be searched directly rather than by WorldCat (cciii-ccxiii, "The Assumpcyon of Our Lady"). Princeton’s Scheide Library shows us copy-specific entries for the 1483 Caxton, the 1521 de Worde (29 cm.), and the 1527 de Worde (30 cm.). Cambridge University tells us that they have a “very imperfect copy” of the 1512 edition (26 cm) and a complete copy of the 1527 edition (30 cm). [BACK DEMO TO WORLDCAT SEARCH SCREEN]

Valuable as the copy-specific descriptions might be, none of these catalog entries had links to images, but all could have presented them if such images were available. Nevertheless, at this time, WorldCat could not take us directly to those images because it can only search exclusively for “Types” of objects as “Books” or “Visual Materials.” It cannot comprehend an object which exists both as a book and as digitized images within the same collection. The solution to these problems is relatively easy if we can just motivate the OCLC administration to see it as desirable.

In my March conversation with the OCLC product manager for WorldCat, I urged him to consider two changes to enable us to find all and only physical copies of early printed books whose online entries linked to digital images of those books. First, the “Book” Type field has to be programmed to exclude items clearly labeled elsewhere as “electronic resources,” usually subscription-only links to low-quality digitized microfilm images from Early English Books Online. Excluding those false hits would eliminate most of the clutter we find in searches for real books. Mr. Schulz said it was entirely possible to do that, and he has put that “on his list” of features to consider adding to the system. This is a task he might find more attractive if he thought more member libraries and scholars desired it.
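This first change amounts to a filter on search returns. The Python sketch below is only an illustration under stated assumptions: the hit layout, the `item_type` and `form` labels, and the function name are hypothetical stand-ins, not WorldCat's actual data model or interface, and the sample hits are invented.

```python
# Hypothetical search hits: each is a dict with an item type and a form label.
# These field names and values are assumptions made for illustration only.
EXCLUDED_FORMS = {"electronic", "microform"}


def physical_books(hits):
    """Keep only hits typed as books whose form is not an excluded surrogate."""
    return [
        hit
        for hit in hits
        if hit.get("item_type") == "book"
        and hit.get("form") not in EXCLUDED_FORMS
    ]


sample_hits = [
    {"library": "Library A", "item_type": "book", "form": "print"},
    {"library": "Library B", "item_type": "book", "form": "electronic"},
    {"library": "Library C", "item_type": "book", "form": "microform"},
    {"library": "Library D", "item_type": "visual material", "form": "print"},
]

# Only the print book should survive the filter.
for hit in physical_books(sample_hits):
    print(hit["library"])
```

The point of the sketch is that the exclusion depends on a label the records already carry, so no new cataloging work would be required of member libraries.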

A second change to WorldCat would enable it to search “500 MARC record fields” that are reserved for institutional-copy-specific descriptive notes, and to establish a consistent terminology for indicating that images are available. I suggested simply the searchable phrase “Digital images available,” as in the record for our copy of the Liber Scintillarum. Mr. Schulz said that WorldCat certainly could establish that as the standard code for any libraries wishing to incorporate imaging into their item records. The hyperlinks to the images themselves can be stored in the 856 MARC record fields. Once again, this is a task he might find more attractive if he thought more member libraries and scholars desired it.
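To make the proposal concrete, here is a minimal Python sketch of how a note phrase in a 500 field and a link in an 856 field could work together. It assumes a record reduced to simple (tag, subfields) pairs; the function names are hypothetical, the URL is a placeholder, and nothing here represents WorldCat's or our catalog's actual software.

```python
# A MARC-like record reduced to (tag, subfields) pairs. Tag 500 carries a
# note in subfield a; tag 856 carries a link in subfield u. The URL below
# is a placeholder, not a real image address.
IMAGE_PHRASE = "Digital images available"

record = [
    ("245", {"a": "Liber Scintillarum"}),
    ("500", {"a": "Digital images available."}),
    ("856", {"u": "https://example.org/images/liber-scintillarum-title-page.jpg"}),
]


def has_image_note(fields, phrase=IMAGE_PHRASE):
    """True if any 500 note contains the agreed phrase (case-insensitive)."""
    needle = phrase.casefold()
    return any(
        tag == "500" and needle in subfields.get("a", "").casefold()
        for tag, subfields in fields
    )


def image_links(fields):
    """Return 856 $u links, but only for records flagged by the 500 note."""
    if not has_image_note(fields):
        return []
    return [
        subfields["u"]
        for tag, subfields in fields
        if tag == "856" and "u" in subfields
    ]


print(image_links(record))
```

Requiring the note before returning any link is a design choice of this sketch: it keeps 856 links to other electronic resources, such as subscription databases, from being mistaken for image links.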

WorldCat’s problems with individual libraries’ online catalogues would still need to be repaired even if the search engine’s shortcomings were overcome, but these affect only single collection interfaces, not the entire search apparatus.

If more library collections integrate images of early books into catalog records accessible to WorldCat, this enormous resource could become a search engine connecting all of the world’s early book images for scholarly research. This need not preclude construction of specialized image galleries around grant- and program-specific projects, as long as they are stored in a format and on a server which can be accessed by the library’s online catalogue. From the scholar’s point of view, however, searching each of these stand-alone galleries is a slow process, and we must learn of their existence by diligent scouring of the Internet and scholars’ newsgroups and newsletters, always taking the risk that we are missing a major collection that has recently come online. By contrast, using WorldCat to search for early printed book images could help us find numerous available examples of a given early printed edition.

We are far from the ideal super-search-engine that could make it possible to compare title page images of widely scattered copies of a single edition, or to compare images from the same page of successive editions of a single work. Nevertheless, the WorldCat interface is inherently capable of producing such a tool. Other libraries with small rare book collections would be encouraged to image and descriptively catalogue their own books when they learned how economically and easily their collections could be enhanced. These kinds of projects benefit them in two ways librarians are sure to appreciate, bringing new teaching opportunities to their students and encouraging new local and regional scholarly activity using these underutilized treasures. If tiny Goucher College can mount a significant online image bank of early books via WorldCat, many other institutions also can do so. Together we could create a virtual image bank of enormous power and accessibility.

Medieval Institute International Congress, May 9-13, 2007

Early Book Society Session on “Searching Digital Image Archives”

Arnold Sanders, English Department, Goucher College, Baltimore, MD 21204

410-337-6515 (o) / 410-337-6272 (h)

Works Cited

Bawcutt, P. “The Mystery of The spyte of Spaine (Heirs of Andro Hart, 1628).” The Bibliotheck: A Scottish Journal of Bibliography and Allied Topics 19 (1994) 5-22.

Furrie, Betty. Understanding MARC: Bibliographic Machine-Readable Cataloging. 7th Edition. Washington, D.C.: Library of Congress, 2003. Available online at http://www.loc.gov/marc/umb/. (Viewed 9/11/06.)

McKitterick, David. Print, Manuscript and the Search for Order 1450-1830. Cambridge: Cambridge UP, 2003.

Needham, Paul. “Counting Incunables: The IISTC CD-ROM,” [Review Essay] in Huntington Library Quarterly 61: 3-4 (2000) 456-529.

The Spyte of Spaine, or, a thankfull remembrance of Gods mercie in Britanes dileuerie from the Spanish Armado. 1588. Edinburgh: Heirs of Andro Hart, 1628. (STC 22998.5, James Wilson Bright Collection, 46005).

“Status: RLG Union Catalog Integration into WorldCat.” OCLC / RLG. Available at http://www.rlg.org/en/page.php?Page_ID=21014 Last updated 2 May 2007. Viewed 2 May 2007

Tanselle, G. Thomas. Literature and Artifacts. Charlottesville: Bibliographical Society of the University of Virginia, 1998.

Ziegler, Bernardus. Themata ordinariae disputationis de discrimine Veteris et Novi Testamenti, ex XXXI. / capite hieremiæ, à Bernhardo Cieglero, Sacræ Theologiæ Doctore, proposita V. Decembris. Lipsiae (Leipzig), Germany : [s.n.], 1545 (James Wilson Bright Collection, 46,128).