Building the Digital Collection

Introduction. The books in Pioneering the Upper Midwest are reproduced as searchable texts and complete sets of page images. For most users, the value of searchable texts is clear: the researcher can find passages that refer to specific people, places, or topics. There are three reasons for accompanying the texts with facsimile page images. First, the searchable texts are transcribed to a standard of 99.95 percent accuracy rather than perfection, and the inclusion of page images permits a researcher to verify or perfect the Library's transcriptions. Second, the facsimile page images offer the researcher a reasonable sense of the look of the printed original. Third, they provide reproductions of the books' illustrations.

Although these books are not as valuable as illuminated medieval manuscripts or Thoreau first editions, the Library will retain these nineteenth- and early twentieth-century American imprints as bound originals. Thus the books were scanned in their bindings in a manner intended to prevent damage. The Library hopes that the availability of digital copies will mean that the original books receive less handling in future years, but the project was not intended to serve as digital reformatting for preservation.

The books were digitized in two phases under the terms of two contracts with Systems Integration Group of Lanham, Maryland. In both phases, the facsimile images were made first, followed by the transcription and encoding of the texts. The image capture occurred at the Library, using three different scanners, while the preparation of the texts was carried out offsite, with the contractor rekeying from the page images.

The works in this collection are presented both as searchable full texts and as complete sets of facsimile page images. In the page image display, users can "turn" from one page to the next, previous, or any numbered page. If a researcher identifies an item by searching the bibliographic records, a pair of links at the top of each catalog record display offers access to the full text and the image set. If a researcher identifies an item by searching the full text, the system brings forth a list of relevant items that offers links to each item's full text, image set, table of contents, and bibliographic information. Users can move back and forth between the full texts and the image sets at any time.

When the full text is selected, the initial display is in the HTML (HyperText Markup Language) format. The user is also offered the option of retrieving and displaying the text in SGML (Standard Generalized Markup Language) format.

Image production during the first phase. During the project's first phase (1995-96), images were captured using a Xerox (Kurzweil) K5200 scanner. Like certain specialized photocopiers, this scanner has a book-edge design that requires the book to be inverted and positioned along a beveled edge for scanning. Only one page is scanned at a time. From a book conservation perspective, this design is preferable to a flatbed scanner that requires the volume to be pressed flat, like a book on a typical photocopier. The book edge is shaped like a wedge, however, and with excessive pressure, it can damage a book's binding. As this project proceeded, the Library's conservators encouraged the digital program staff to seek a less damaging scanner.

The Xerox K5200 scanner, however, has a number of valuable features. Among these is the ability to produce several image types, two of which were produced for this collection. The master or archival version of the images for pages that consist of typography and line art is a 300 dpi bitonal image in the TIFF format, using ITU Group IV compression. Higher resolutions were not considered because there were no plans to produce new paper copies of these books, and the exigencies of scanning these bound volumes precluded the creation of the types of high resolution images associated with some digital preservation reformatting projects being carried out in university libraries.
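As a rough illustration of what the 300 dpi bitonal specification implies, the following sketch computes the pixel dimensions and uncompressed size of a single page image. The 6 x 9 inch page size and the function name are assumptions chosen for illustration, not figures from the project. ITU Group IV compression would reduce the stored size by an amount that depends on the content of each page.

```python
def bitonal_page_size(width_in, height_in, dpi=300):
    """Return (width_px, height_px, uncompressed_bytes) for a 1-bit image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row of 1-bit pixels is padded up to a whole number of bytes.
    row_bytes = (width_px + 7) // 8
    return width_px, height_px, row_bytes * height_px


# Hypothetical 6 x 9 inch page (an assumed size, not from the source).
width_px, height_px, size_bytes = bitonal_page_size(6, 9)
print(width_px, height_px, size_bytes)
```

At one bit per pixel, such a page occupies roughly 0.6 megabytes before compression, which helps explain why bitonal capture was practical for pages of typography and line art.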

In addition to the master images of typographic or line-art pages, the Library created a second image type to reproduce printed halftone illustrations. Printed halftones present special problems that result from interference between the spatial frequency of the halftone dot pattern and the spatial frequency applied by scanning and/or output devices. When the two frequencies combine, the interference between them manifests itself as moiré patterns that degrade the image. The Xerox K5200 offers a diffuse dithering algorithm that randomizes the scanner's pattern of dots to produce bitonal images in which moiré is suppressed or reduced. The algorithm adds speckles to white areas surrounding an illustration, however, adversely affecting captions or other typography included in the same scan as the illustration. When the Xerox K5200's diffuse dithering treatment is applied, the software creates files in the PCX format. The Library's diffuse-dithered random-dot-pattern images can be printed on a laser printer with good results but do not rescale well for screen display.
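The K5200's diffuse dithering algorithm is part of Xerox's software and is not described here. To give a general sense of how a gray image can be reduced to an irregular pattern of black and white dots, the sketch below uses Floyd-Steinberg error diffusion, a well-known published technique that is substituted here purely for illustration. It should not be read as the scanner's actual method, and the function name and sample data are assumptions.

```python
def floyd_steinberg(gray):
    """Dither a 2-D list of 0-255 gray values to pure 0/255 by error diffusion.

    Illustrative stand-in only; not the Xerox K5200 algorithm.
    """
    height, width = len(gray), len(gray[0])
    buf = [[float(v) for v in row] for row in gray]  # working copy
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = buf[y][x]
            new = 255 if old >= 128 else 0
            out[y][x] = new
            err = old - new
            # Spread the rounding error onto pixels not yet visited.
            if x + 1 < width:
                buf[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    buf[y + 1][x - 1] += err * 3 / 16
                buf[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    buf[y + 1][x + 1] += err * 1 / 16
    return out


# A flat dark-gray patch (hypothetical data) becomes a scatter of white dots.
flat = [[64] * 16 for _ in range(16)]
dithered = floyd_steinberg(flat)
white_fraction = sum(v == 255 for row in dithered for v in row) / 256
print(white_fraction)
```

Because the dots fall in an irregular arrangement rather than on a fixed grid, there is no regular scanner-side pattern to beat against the halftone screen. Resampling such an image for screen display disturbs the dot arrangement, which is consistent with the rescaling limitation noted above.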

During this project's first phase, pages with illustrations were captured twice. First, the full pages were captured as bitonal images with no treatment of printed halftones. Second, cropped versions of the illustrations (not entire pages) were captured and these images were treated with the diffuse dithering algorithm.

Image production during the second phase. During the project's second phase (1997-98), most page images were captured using a Minolta PS3000 scanner. This device has an overhead-scanner design that permits books to lie open on a cradle or flat surface, without the need to press pages flat. Thus the Minolta scanner does not subject books to the same stress as a book-edge device. In order to maximize the image quality produced by the scanner, the Library's contractor developed custom cradles, lighting, and procedures. The Minolta produces bitonal images in the TIFF format, with ITU Group IV compression, and at different levels of resolution. The Library's 300 dpi specification was continued in the second phase.

A number of factors ruled out the continued use of the Xerox K5200 scanner for illustrations. In addition to the worries about damage produced by the book-edge design, the K5200 has been discontinued by Xerox, and the scanner software limits its use to relatively slow 386- or 486-chip IBM-compatible computers. Since the diffuse dithering algorithm that the Library preferred for illustrations is only available as a part of the Xerox K5200 software, the Library decided to produce the illustration images during the project's second phase in grayscale and color. For increased efficiency in production during the second phase, the Library also decided to capture one version of each full page: either a bitonal image for pages with typography and line art or a grayscale or color image for pages with printed halftones.

The contractor produced the 300 dpi 8-bit grayscale and 24-bit color illustration-page images using a Phase One digital camera back on a 4x5-inch view camera. The camera was mounted on a stand above the books, which were supported by a cradle as they were scanned. Since this is not a preservation reformatting project, no uncompressed versions of the illustration page images were archived and the master images have been compressed with the JPEG algorithm. After capture, the contractor mitigated the effect of moiré patterns by processing the images with a combination of high- and low-pass filters and blurring and sharpening.
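The contractor's exact filter settings are not recorded here. As a minimal sketch of the low-pass idea alone, the example below applies a 3 x 3 mean filter to a fine periodic pattern. The filter choice, function name, and sample data are assumptions made for illustration and do not reproduce the contractor's processing.

```python
def box_blur(gray):
    """Apply a 3x3 mean filter to a 2-D list of gray values.

    Pixels at the edges are averaged over whichever neighbors exist.
    """
    height, width = len(gray), len(gray[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbors = [
                gray[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            out[y][x] = sum(neighbors) / len(neighbors)
    return out


# A one-pixel checkerboard stands in for a fine periodic dot pattern.
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(6)] for y in range(6)]
smoothed = box_blur(checker)
print(smoothed[2][2], smoothed[2][3])
```

Neighboring pixels that alternated between 0 and 255 end up much closer to a common mid-gray, which is the sense in which a low-pass filter suppresses a fine periodic pattern. A sharpening step would then be needed to recover some of the edge detail lost to the blur.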

Converting the texts. After capture during both phases, the contractor sent the images to a subcontractor for rekeying. The Library's transcription requirement for text is 99.95 percent accuracy compared to the original. The texts are marked up with Standard Generalized Markup Language (SGML; ISO 8879), using the Library's American Memory document type definition (DTD) for Historical Documents. The American Memory DTD conforms to the international guidelines for humanities texts developed by the Text Encoding Initiative (TEI). The SGML-encoded version of the text serves as an archival file and is also made available online. In addition, online access is provided to HTML texts derived from the SGML archival files.
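To put the 99.95 percent requirement in concrete terms, the sketch below converts it to a permitted error count. It assumes that accuracy is measured per character, which the requirement as stated here does not specify, and the text lengths are hypothetical.

```python
def max_errors(char_count, errors_per_10k=5):
    """Largest whole number of character errors allowed at the given rate.

    99.95 percent accuracy corresponds to 5 errors per 10,000 characters.
    """
    return char_count * errors_per_10k // 10_000


# Hypothetical lengths: one page of about 2,000 characters and
# one book of about 500,000 characters.
print(max_errors(2_000))
print(max_errors(500_000))
```

Even a text that meets the requirement may therefore contain a small number of errors on any given page, which is one reason the page images accompany the transcriptions.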

The HTML versions of the Library's texts result from a two-step process. First, the SGML texts are transformed to a format that is suitable for the indexing routines used by the InQuery search engine. This step also segments longer texts into "chunks" (generally chapters) to make them easier for users to access. Then, after a user's search brings forth a chapter to read, the text is formatted as HTML on the fly for display in the browser.
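The two-step process can be sketched roughly as follows. The element names (div1, head, p), the sample text, and both function names are hypothetical stand-ins rather than the American Memory DTD or the Library's software. Regular expressions are used only to keep the example short; a production system would rely on a real SGML parser, and the InQuery indexing format is not shown.

```python
import html
import re

# Hypothetical, simplified markup; not the American Memory DTD.
SAMPLE = (
    "<text>"
    "<div1><head>Chapter I</head><p>We left St. Paul in May.</p></div1>"
    "<div1><head>Chapter II</head><p>Wheat & corn were planted.</p></div1>"
    "</text>"
)


def split_into_chunks(sgml):
    """Step 1: segment a long text into chapter-sized chunks for indexing."""
    chunks = []
    for body in re.findall(r"<div1>(.*?)</div1>", sgml, flags=re.S):
        head = re.search(r"<head>(.*?)</head>", body, flags=re.S)
        paras = re.findall(r"<p>(.*?)</p>", body, flags=re.S)
        chunks.append({"head": head.group(1) if head else "", "paras": paras})
    return chunks


def render_html(chunk):
    """Step 2: format one retrieved chunk as HTML at request time."""
    parts = ["<h2>" + html.escape(chunk["head"]) + "</h2>"]
    parts += ["<p>" + html.escape(p) + "</p>" for p in chunk["paras"]]
    return "\n".join(parts)


chunks = split_into_chunks(SAMPLE)
print(render_html(chunks[1]))
```

Keeping the chunked form separate from the HTML rendering means the display markup can change without re-indexing the texts, since only the second step produces what the browser sees.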

