CMU's digital archive puts the papers of prominent people online

Procedure preserves originals

Monday, December 03, 2001

By Byron Spice, Science Editor, Post-Gazette

As an archivist and historian for the Heinz family, Diane Martz is accustomed to digging through musty files and dusty boxes to find old photographs, speeches or first-hand accounts.

 
For more information

Sen. H. John Heinz III Archive: http://heinz1.library.cmu.edu/HELIOS/

Allen Newell Collection: http://heinz1.library.cmu.edu/Newell/

Herbert A. Simon Collection: http://heinz1.library.cmu.edu/Simon/

The National Library of Medicine's Profiles in Science, including the new collection of geneticist Barbara McClintock's papers: http://www.profiles.nlm.nih.gov

Digital Library Federation: www.diglib.org

 
 

Not surprisingly, she wouldn't mind handling the same task with a few clicks of her computer mouse instead. And in the case of the late Sen. John Heinz's personal papers, that's just what she does.

The senator's papers, all 850,000 pages of them, are digitally stored by the Carnegie Mellon University Libraries. So a simple computer search can quickly locate images of a handwritten note or a piece of Senate correspondence. No dust. No sweat. No time at all.

"Digital is much easier," Martz said. Unfortunately for her, most of the requests she fields are for older members of the illustrious Heinz family, whose papers are stored the old-fashioned way.

It's not surprising that libraries are moving aggressively to digitize their archives, as well as their special collections, said Dan Greenstein, executive director of the Washington, D.C.-based Digital Library Federation. Although the cost is high, the benefits are considerable.

Storing a digital image of each page allows users to view a collection from any computer with an Internet connection. And translating each document and image into text that can be read by a computer means entire collections can be quickly searched for particular references or documents.

Historical artifacts are transformed into databases.

"By putting these collections online, we've really changed the way people do research," said Gabrielle Michalek, head of digital library initiatives at Carnegie Mellon. Scholars who once spent months or years poring through box after box of material, and then rushed to write an article or book, can now rapidly find many documents via computer, giving them more time to think and analyze before publishing, she suggested.

An appealing proposal

Carnegie Mellon dove into digitization in a big way 10 years ago, following the April 1991 death of Heinz. As Gloriana St. Clair, university librarian, recalls it, then-president Robert Mehrabian first proposed digitizing the senator's papers.

"That was a cutting edge idea at the time," when the Internet was still in its infancy, St. Clair said.

But the idea appealed to the Heinz family. "It combined a lot of things that the family thought were important to the senator," said Grant Oliphant, spokesman for the Heinz Endowments. Technology had been one of his interests and the family was interested in showing the depth and breadth of his public life, as reflected in his papers. And there was also the democratic appeal of making it all easily available.

"Often, archives like this are the province of scholars, working in dusty rooms," Oliphant said. But when the collection is online, it is open to anyone.

Staff members at the Heinz offices often use the digitized collection. So do legislative aides, journalists and political science students.

"It appears to us it's used about 100 times more frequently than other congressional collections," St. Clair said.

The Heinz Endowments and the Heinz Family Foundation donated more than $1 million that Carnegie Mellon used to develop its technology and procedures and to digitize the senator's papers.

All 850,000 pages of the Heinz collection went online two years ago. Carnegie Mellon also has digitized the papers of artificial intelligence pioneer Allen Newell, who died in 1992, and has begun to publish those of his close colleague, Herbert A. Simon, who died early this year.

Last month, the library passed the milestone of 1 million archival pages published -- the millionth page being a handwritten note by Simon detailing his 1978 trip to Sweden to accept the Nobel Prize for economics.

"This is an enormous archival collection," St. Clair said.

Other digital collections are even larger, Greenstein said. The Library of Congress boasts 7.5 million digitized pages thus far. In October, the National Library of Medicine announced its digitized archive of geneticist and Nobel laureate Barbara McClintock, the eighth in a series of "Profiles in Science" it is publishing online.

For a broader audience

The digitization process is a multi-stage effort. First, each page is digitally scanned, creating an electronic photocopy. It's this image that the collection users see when they open the document online. But for a computer to sort and search through the archive, the images must also be scanned using a process called optical character recognition to translate the document into text that the computer can read.

Human readers -- at CMU, graduate students -- then must compare the machine-produced text with the original document to ensure accuracy. Handwritten material that cannot be scanned must be punched into the computer database manually.
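For readers curious how searchable text sits alongside page images, the following is a minimal sketch in Python, not a description of Carnegie Mellon's actual system. It assumes each page record holds an image file name, the text produced by optical character recognition or manual keying, and a flag set once a human has checked the text. The names Page, build_index and search, along with the sample pages, are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    page_id: str      # hypothetical identifier for one scanned page
    image_file: str   # the scanned image that readers see
    text: str         # OCR output, or manually keyed text for handwriting
    verified: bool    # True once a human has checked text against the original


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose verified text contains it."""
    index = defaultdict(set)
    for page in pages:
        if not page.verified:
            continue  # skip text nobody has proofread yet
        for word in tokenize(page.text):
            index[word].add(page.page_id)
    return index


def search(index, query):
    """Return the sorted ids of pages containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Made-up sample pages, for illustration only.
pages = [
    Page("p1", "p1.tif", "Senate correspondence on trade policy", True),
    Page("p2", "p2.tif", "Handwritten note about trade hearings", True),
    Page("p3", "p3.tif", "Unchecked OCR text on trade", False),
]
index = build_index(pages)
print(search(index, "trade"))         # ['p1', 'p2']
print(search(index, "senate trade"))  # ['p1']
```

In this sketch the search runs only over the text, while the page id points back to the image the reader actually views. Leaving unverified pages out of the index is one possible design choice; a real archive might index them anyway and mark them as unchecked.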

The original documents are then stored, safe, cool and dry, at Iron Mountain National Underground Storage, a former limestone mine in Butler County.

The costs are coming down -- digitizing the Newell and Simon collections has cost about $100,000, compared to the $1 million spent on the Heinz collection -- but Michalek said it still isn't cheap. Money must still be spent to preserve the original papers, while also investing in the management of large computer databases.

Some archival material cannot be digitized -- copyrighted material, personnel or medical records, for instance. Simon's "Tower of Hanoi," a peg-and-block game that he used to teach problem-solving to his students, remains a hands-on artifact. But the vast majority of people who use the archives, Michalek said, come through the computer, not through the door.

Making historic materials available to a broad audience is an important contribution, said Kiron Skinner, a CMU political scientist and historian who co-edited a best-selling book, "Reagan in His Own Hand."

"I think we know now that it's valuable," Skinner said. "But is it a replacement? I don't think so." Serious scholars, she maintained, will continue to examine the actual documents. That's not just because so few archives have been digitized, but because even the best digital archives will be incomplete.

"Part of doing archival research is that sometimes you stumble across a goldmine," explained Skinner, who came across one when she found Ronald Reagan's handwritten scripts for his radio commentaries that aired in the late 1970s, before his successful run for president. Those scripts became the basis for her book. "I don't think I'd ever feel confident ... that everything had been catalogued."

Skinner routinely used photocopies of Reagan's writing so she could study them outside the archive, yet often found new information not seen in the photocopy when she later went back to the original document.

There also are intangible benefits to the tangible artifact.

"It's a matter of holding the paper that the person once held," she said. "It affects the creative process."

Author Nicholson Baker made an impassioned argument in favor of paper earlier this year in his book, "Double Fold: Libraries and the Assault on Paper." Baker castigated librarians for tossing out newspapers and books in favor of microfilm records. Microfilm, he argued, is incomplete and subject to fading. Digitization only exacerbates this trend, he said, and often is of poor quality.

Though scanning technology is improving, any digitized image is going to be what librarians call "lossy," missing some level of detail of the original, Michalek acknowledged. But she emphasized that the archival material is not destroyed; in fact, preservation is improved because fewer hands touch the documents and because storage conditions in the old limestone mine are superior to those in the library.

"I'm a historian," Greenstein said, "and I know that managing the artifact is important." In the case of archives or a library's special, one-of-a-kind collections, it obviously is important to preserve the original.

But selective digitization, Greenstein added, "could lead to a whole new way of thinking about managing collections." Not every library would need to invest in the same expensive books, for instance. Digitization would allow more sharing of resources, stretching often-limited library budgets.

In many cases, digitization will make more books and materials available to more people. For instance, St. Clair and Raj Reddy, CMU's former chairman of computer science, this fall received a $500,000 grant from the National Science Foundation to begin the Million Book Project, an international effort that will digitize at least a million books published before 1920 and no longer under copyright protection. That's more books than can be found in all the libraries in some states, St. Clair noted.

"It will bring a large library to everyone who can get online," she added.
