Googling the Library Catalog


Breaking Through the Invisible Web

“More and more web content, including library content, never sees the light of day. It is hidden behind proprietary database interfaces where it can't be found by popular Internet search engines. Librarians must bring the deep, invisible web to the surface so that our public can discover our resources through a query on a popular search engine….

Our students would be better served by library content on persistent pages that continue to exist beyond the end of a database search. This would help them find information where they look for it—online. We can accomplish this by using web site development products that allow for publishing web sites and database packages that enable data export. We can update, change, and maintain content in background databases and still offer a persistent presence for our content on the web server….

There are many initiatives today to bring library content to the surface web. Projects such as the Open Archives Initiative and products such as SFX seek to harvest and mine information from the depths of myriad databases. At the University at Buffalo (UB), State University of New York, we have concluded experiments that show millions of web pages can be extracted from databases—such as the catalog—and stored persistently on the web. This allows libraries to have the best of both worlds: database control and bookmarkable persistent documents. It will also allow students to use their interface of choice—that of a web search engine—to find authoritative materials….

At UB, we wondered if a web server could handle an unlimited number of persistent web pages. Many web sites contain tens of thousands of pages, but could a web site handle millions of pages? Modern disks certainly have the required capacity….
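For rough scale: assuming an average MARC bibliographic record of about 2 KB, plus a similarly small HTML page per record, 2.2 million records works out to roughly 2,200,000 × 2 KB ≈ 4.4 GB per copy, which is indeed modest compared with ordinary disk capacity.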

As a proof of concept test, and to meet the practical needs of conversion testing, we made a file for each of our 2.2 million MARC records. We extracted exactly one MARC record in each file and named the files in a consistent way that included the unchanging, unique record number from the mainframe system. Not only did the web server handle this without a problem, but with the help of a free program called MarcEdit, we could download any record at will and display or edit it on a desktop computer….
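The one-record-per-file extraction step can be sketched in Python with the pymarc library; the input file name, the 001-based file naming, and the output directory below are illustrative assumptions rather than UB's actual setup.

```python
# Split a MARC dump into one file per record, named by the record's
# unchanging 001 control number (illustrative sketch, not UB's actual code).
import os
from pymarc import MARCReader

os.makedirs("marc", exist_ok=True)

with open("catalog.mrc", "rb") as dump:
    for record in MARCReader(dump):
        control_fields = record.get_fields("001")
        if not control_fields:
            continue  # skip records without a control number
        record_id = control_fields[0].value().strip()
        # One MARC record per file gives each record a persistent URL.
        with open(os.path.join("marc", f"{record_id}.mrc"), "wb") as out:
            out.write(record.as_marc())
```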

In short order we had 4.4 million files, 2.2 million MARC records, and a corresponding HTML page for each record that contained a link to the MARC record. Then we decided to index the HTML pages. Would Google crawl through the web site and make us a free catalog?
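Generating the companion HTML pages can be sketched the same way; the page layout, directory paths, and use of the 245 title field here are assumptions for illustration, not the pages UB actually produced.

```python
# For each extracted MARC file, write a small HTML page that displays the
# title and links back to the raw record (illustrative sketch only).
import glob
import html
import os
from pymarc import MARCReader

os.makedirs("html", exist_ok=True)

for path in glob.glob("marc/*.mrc"):
    record_id = os.path.splitext(os.path.basename(path))[0]
    with open(path, "rb") as fh:
        record = next(MARCReader(fh))
    title_fields = record.get_fields("245")
    title = html.escape(title_fields[0].value() if title_fields else "Untitled")
    page = (
        f"<html><head><title>{title}</title></head><body>\n"
        f"<h1>{title}</h1>\n"
        f'<p><a href="../marc/{record_id}.mrc">View the MARC record</a></p>\n'
        f"</body></html>\n"
    )
    with open(os.path.join("html", f"{record_id}.html"), "w", encoding="utf-8") as out:
        out.write(page)
```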

After several weeks, a Google spider found our site and crawled away. Unfortunately, Google only picked up our 80,000 directory entries, crawled about 20,000 of our HTML MARC pages, and then stopped crawling. Apparently, too much content from one site is to be feared. We did get to see what a Google library catalog might look like. More importantly, we proved that it is possible to create many catalog pages, index them, and search them with an Internet search engine. This can be done without an ILS and without a relational database….

Further experiments with web spidering engines confirmed that it is possible to index two million XML catalog pages. There are scalability problems, however. We found one free engine that handled XML beautifully but was memory bound while building the index and thus limited to about 100,000 of our pages. Another free engine successfully indexed the whole two million–record site but exhibited slow response time when searching….

Librarians working closely on the design of the NetCatalog have already pointed out some benefits. These stem from the fundamental design model. Because all of the metadata is tagged and included on a results document, any and all information can be combined in a search. Novel combinations of search criteria allow slicing and dicing that was formerly impossible in NOTIS. For example, location-based searching and call number searching can be combined with author, title, and other bibliographic selection. Limits by language, material type, and even physical attributes like book cover color are now possible. Searches and results pages can be easily saved and rerun. Web servers run 24/7, unlike mainframe systems, so for the first time we have round-the-clock operation of the catalog.” [Library Journal, via Catalogablog]
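The combined, fielded searching described above amounts to filtering on tagged metadata; a rough sketch follows, where the field names and sample records are illustrative assumptions, not the actual NetCatalog schema.

```python
# Combine arbitrary metadata fields in a single query over tagged records
# (illustrative sketch only; field names and data are made up).
records = [
    {"title": "Introduction to Cataloging", "author": "Smith, J.",
     "location": "Lockwood", "call_number": "Z693 .S55", "language": "eng"},
    {"title": "Metadaten im Bibliothekswesen", "author": "Meyer, K.",
     "location": "Science", "call_number": "Z666.5 .M49", "language": "ger"},
]

def search(records, **criteria):
    """Return records whose fields contain every given criterion."""
    return [
        r for r in records
        if all(value.lower() in r.get(field, "").lower()
               for field, value in criteria.items())
    ]

# Location, call number, and language limits combined in one query.
for hit in search(records, location="Lockwood", call_number="Z693", language="eng"):
    print(hit["title"])
```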

Emphasis above is mine. This idea is really blowing me away, although I can't see every library doing this. It could certainly be an interesting route for VIC, though.  [The Shifted Librarian]
