James Linden


Digitizing Texts: The Next Generation

What is it about the eBook world that blinds people to quality and makes them wholly quantity-centric? Do numbers mean more than usefulness? The last twenty years of converting paper books to eTexts and eBooks - particularly works in the public domain - have been an evolutionary "how to" learning curve. Although the quantity of texts converted is impressive, the quality has greatly suffered, and this lack of quality is beginning to catch up with us.

It is time that we take a brief look at the issue of quality digital conversion of paper texts: to study what has been done, and to offer a revolutionary new path integrating what we now know is best, and what we predict will be important in the future. After all, the purpose of digitizing the "Paper Domain" is not only preservation, but to make the content wholly useful for future generations and their needs, and not just meet present-day needs.

Thus, in the first part of this article, I will present a summary overview of the preferred, next generation process that anyone, whether a private individual or a major organized effort, should follow when digitizing paper works.

In the second part of this article I will review several of the larger organized efforts to digitize public domain books and similar types of works. This review will focus on the deficiencies they have, as well as mention the innovative approaches they have implemented.

  1. Locating appropriate printed material

    The first hurdle is to find and obtain usable copies of the print books. This is often difficult because few copies of many older books survive. Even when a book is found, one must consider the impact destroying the book will have on the literary world. For example, it would not be wise to destroy an original copy of a book if only half a dozen are known to exist. For those of you who may be less knowledgeable about working with printed material, I will explain that it is often necessary to cut the binding of a book in order to get the loose pages to scan accurately. Once cut, the book generally cannot be re-bound. Scanning a rare book can be done without cutting, but it requires more manual labor. In my example of a book with only six known copies, this would be acceptable "extra" work.

  2. Copyright clearance

    Getting authoritative copyright clearance isn't generally too difficult, because of basic copyright laws in various countries. Still, each book needs to be cleared individually to ensure that no copyright extensions or special case rules apply. For books where copyright may apply, it is necessary to get permission from the copyright holder to digitize their work. Once this copyright information is obtained, it should be saved in a database for future reference and inclusion in an authoritative catalog (described later). Once cleared and prepared, a barcode should be assigned to the printed book for tracking.

  3. Scanning the printed pages

    Scanning the loose pages of the print book isn't a particularly difficult task. However, it is very important to save the original scan images. Many eBooks no longer have these images available, making it extremely difficult to track typographical errors. Also, these books often don't include illustrations because they were done only in plain text, and without the original scans (and no long-term storage for the printed pages, as noted above), it is very difficult to get these illustrations from the specific edition of the book that the text came from. These scans should be saved in an economical format such as PNG or compressed TIFF, which maintains scalability and quality without wasting storage space. Original scan images are particularly important for tracking the pedigree of the book. Scan images should also be saved to some distributable medium, such as CD, for archival purposes, bar-coded to match the printed book's barcode and stored in a safe location.
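As a rough illustration of tying archived scans to the printed book's barcode, the sketch below builds a manifest with a checksum for each scan image. The function name, the manifest fields, and the barcode format are assumptions made for this example; the process described above does not prescribe any of them.

```python
import hashlib
from pathlib import Path

# File extensions treated as scan images in this sketch (an assumption).
SCAN_SUFFIXES = {".png", ".tif", ".tiff"}


def build_scan_manifest(scan_dir, barcode):
    """Return one manifest entry per scan image, sorted by file name."""
    entries = []
    for path in sorted(Path(scan_dir).iterdir()):
        if path.suffix.lower() not in SCAN_SUFFIXES:
            continue  # skip anything that is not a scan image
        data = path.read_bytes()
        entries.append({
            "barcode": barcode,    # same barcode as the printed book
            "file": path.name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return entries
```

A manifest like this could be written to the archival disc alongside the images, so that a later check can confirm that no scan has been altered or lost.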

  4. OCR (Optical Character Recognition) of the scanned images

    The OCRing process is relatively straightforward with various OCR software packages. The resulting raw text is then used to put together the actual full text. The raw text files need to be archived in their original form. These original text files are important for version tracking and error correction, among other things. Again, these should be burned to CD and bar-coded.

  5. Proofing

    This step is actually a multipart process. Humans should proofread the raw OCR'ed text at least twice, and preferably three times. This ensures that the OCR accurately deciphered all the characters from the scans. It is crucial that at least two different people proofread each page of text. This keeps the human brain's natural "auto-adjust" feature from becoming too much of an issue. The proofing step's main complication comes at the end, with the pre-assembly task. Because the raw OCR text files are straight text, they lose all of the original formatting of the print book. As the pages are being proofed, it is imperative to rebuild all the original formatting.
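One way to make use of two independent proofreaders is to compare their results and send only the disagreements to a third person. The sketch below does this with Python's standard difflib module; the function name and the shape of the return value are assumptions for illustration, not part of any existing proofing system.

```python
import difflib


def proofing_disagreements(first_pass, second_pass):
    """Compare two proofread versions of one page.

    Returns a list of (line_number, first_lines, second_lines) tuples,
    where line_number is the 1-based position in the first version.
    """
    a = first_pass.splitlines()
    b = second_pass.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # both proofreaders agree on these lines
        issues.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return issues
```

An empty result means the two passes agree on every line. That is not proof of correctness, but it is a reasonable signal that the page is ready for assembly.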

  6. Assembly

    Assembly of the master document is the single most complex step in the entire process. If the master document is straight text, it cannot maintain the formatting and character encoding which is necessary to properly reproduce the content. Flat text also does not offer a consistent metadata and content structure. While no file format is going to be the absolute best solution for every case, we can certainly do better than plain text and arbitrary XML vocabularies. Personally, I recommend a strict subset of the Open eBook Publication Structure specification, which can implement Dublin Core metadata as well as MathML and SVG as needed. My only point of opposition with the OEBPS format is that it is a subset of XHTML, and is therefore prone to the many problems of such a loosely specified vocabulary. This issue, however, can be resolved easily enough. The assembly process is crucial to the proper production of eBooks. Well marked-up master documents make it very simple to convert to other formats, both text and other media. This master format is not for normal distribution uses, although it can be distributed along with the "user" formats such as HTML, RTF, PDF, etc. The master document needs to include page numbers, full metadata, references to embedded illustrations, etc. This information is needed so that it can automatically reference the scanned images and raw OCR'ed text files for each page, among other uses. This master document needs to be accessibility-friendly as well and should contain only structural and semantic markup, and exclude presentational markup. This enables format conversion systems to create versions of the document supporting all the features of a particular format, without being stuck with too many complex special cases that would have to be addressed with custom code for each case.
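To make the idea of a structural master document more concrete, here is a minimal sketch that builds one with Python's standard library. The element names (book, metadata, body, page, p) and the file-naming pattern are hypothetical and are not taken from the OEBPS specification; only the Dublin Core namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set
ET.register_namespace("dc", DC_NS)


def build_master(title, creator, pages):
    """Build a master document as an XML string.

    pages is a list of (page_number, [paragraph, ...]) tuples.
    """
    book = ET.Element("book")  # hypothetical root element
    meta = ET.SubElement(book, "metadata")
    ET.SubElement(meta, f"{{{DC_NS}}}title").text = title
    ET.SubElement(meta, f"{{{DC_NS}}}creator").text = creator
    body = ET.SubElement(book, "body")
    for number, paragraphs in pages:
        # Each page points back at its scan image and raw OCR file.
        page = ET.SubElement(body, "page", {
            "n": str(number),
            "scan": f"page_{number:04d}.png",  # assumed naming pattern
            "ocr": f"page_{number:04d}.txt",   # assumed naming pattern
        })
        for text in paragraphs:
            ET.SubElement(page, "p").text = text  # structural markup only
    return ET.tostring(book, encoding="unicode")
```

Note that nothing in this markup says how a paragraph should look. Fonts, sizes, and colors are left to the conversion step.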

  7. Format conversion

    Creating these "user" formats is very simple once a rich master document is available. By using a single master document, many other formats can be generated without having to redo massive amounts of work, as is the case currently. These "user" formats can even be generated on the fly, based on the user's own preferences for things such as font, font size, colors, etc.
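The sketch below shows, under simplifying assumptions, how two "user" formats might be generated from one master. The tiny master vocabulary (book, body, chapter, title, p) is invented for this example, and the font_size parameter stands in for the user preferences mentioned above.

```python
import html
import xml.etree.ElementTree as ET

# A toy master document; the vocabulary is hypothetical.
MASTER = """<book>
  <body>
    <chapter><title>Chapter I</title>
      <p>It was a dark &amp; stormy night.</p>
      <p>The rain fell in torrents.</p>
    </chapter>
  </body>
</book>"""


def to_plain_text(master_xml):
    """Render chapters as upper-case titles followed by paragraphs."""
    blocks = []
    for chapter in ET.fromstring(master_xml).iter("chapter"):
        blocks.append(chapter.findtext("title", default="").upper())
        blocks.extend(p.text or "" for p in chapter.findall("p"))
    return "\n\n".join(blocks)


def to_html(master_xml, font_size="1em"):
    """Render the same master as an HTML fragment at the requested size."""
    size = html.escape(font_size, quote=True)
    parts = [f'<div style="font-size: {size}">']
    for chapter in ET.fromstring(master_xml).iter("chapter"):
        title = chapter.findtext("title", default="")
        parts.append(f"<h2>{html.escape(title)}</h2>")
        for p in chapter.findall("p"):
            parts.append(f"<p>{html.escape(p.text or '')}</p>")
    parts.append("</div>")
    return "\n".join(parts)
```

Both functions read the same master and differ only in presentation, so a correction made once in the master reaches every generated format.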

  8. Alternative formats

    Another crucial piece of the puzzle addresses accessibility. Voice synthesis technology has advanced far enough to make synthesized books quite enjoyable. While human "readings" will always be better, synthesized audio is a very large step forward. (For those of you whose only exposure to synthesis is from Microsoft Reader, check out Rhetorical Systems' text-to-speech demo.) The voice synthesis system could use Shorten audio format for masters (or another lossless format), and convert to various other audio formats as needed, including MP3, Ogg Vorbis, WAV, AIFF, and even streamed formats, again based on user preferences. Once these masters are made, they should be archived and placed on CDs, with the barcode.

  9. Language translation

    Another large hole in the picture is the lack of texts in multiple languages. Generally, only the more popular books are ever translated to other languages, leaving tens of thousands of books out of foreign libraries. Language translation software can be 90% accurate, according to one expert: my grandfather, Eldred Linden, who has been translating texts in half a dozen languages for over a decade. With a proofing system similar to the OCR proofing step, language translation could be quality-controlled well enough to make it a very real possibility in the immediate future.

  10. Cataloging

    Implementation of an authoritative catalog is relatively simple in the scheme of things. This authoritative catalog should index scanned images, raw OCR'ed text, proofread text, rich master documents, format options, alternative media, and language translations - all in one database. Of course, this database should include WorldCat and Library of Congress compliant MARC data, and Dublin Core metadata, along with copyright clearance information, processing history, version history, and pointers to each format at various locations, etc. Graphics for CD label artwork could be built on the fly from catalog data, or graphic artists could create CD label images that can be indexed within the catalog. Similarly, graphics for other uses, such as printable covers, etc. can be indexed as well. The catalog itself should be widely distributed and made available in multiple languages. Maintaining the catalog data could then be done by small groups of people interested in particular subsets of the catalog. For example, the Edgar Allan Poe Society of Baltimore might be interested in maintaining the catalog entries for Poe's works.
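A catalog of this kind could start from a very small relational schema. The sketch below uses SQLite from Python's standard library; the table names, column names, and artifact kinds are assumptions for illustration, and a real catalog would also need MARC records, version history, and processing history.

```python
import sqlite3

# Hypothetical two-table schema: one row per printed work, and one row
# per derived artifact (scan set, raw OCR, master, HTML, audio, ...).
SCHEMA = """
CREATE TABLE work (
    barcode          TEXT PRIMARY KEY,
    title            TEXT NOT NULL,
    creator          TEXT NOT NULL,
    year             INTEGER,
    copyright_status TEXT NOT NULL
);
CREATE TABLE artifact (
    id       INTEGER PRIMARY KEY,
    barcode  TEXT NOT NULL REFERENCES work(barcode),
    kind     TEXT NOT NULL,
    language TEXT NOT NULL DEFAULT 'en',
    location TEXT NOT NULL
);
"""


def open_catalog(path=":memory:"):
    """Create a catalog database and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def artifacts_for(conn, creator, kind):
    """List (title, location) for one creator and one artifact kind."""
    rows = conn.execute(
        "SELECT w.title, a.location FROM work w "
        "JOIN artifact a ON a.barcode = w.barcode "
        "WHERE w.creator = ? AND a.kind = ? ORDER BY w.title",
        (creator, kind),
    )
    return rows.fetchall()
```

Keying every artifact to the work's barcode is what lets one query return the scans, the master, and each translation of the same printed book.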

  11. Distribution

    Of course, users must be able to access these texts for them to be truly useful. The most obvious interface is the aforementioned authoritative catalog. Not only should users be able to download the format of their choice in the language of their choice, but they should also have the option of downloading CD images or receiving CDs via snail mail. With all the data available to the catalog, CD masters can be created on-the-fly based on user preferences, giving users the ability to select what they want included on their CD (either image or disc). For example, I might want a CD image for all of Edgar Allan Poe's works for research purposes. For this, I would want the scanned images and a couple of different format flavors. On the other hand, I may want every work done in a particular year range, and I could simply select this year range via a web form and have the system build my CD image for me. When the image is done, I could receive an email with a link to download my image so I can burn it, along with a link to the appropriate CD label graphic so I can print a nice label. Building CD case sleeve graphics could be done on the fly, with a complete list of everything on the CD. Again, I could print this on sleeve paper for my CD case. Naturally, once a custom collection is made (similar to a shopping cart system), the files can be downloaded in native format or compressed into a package, skipping the CD image step entirely.
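The "custom collection" idea can be sketched as two small steps: select works by year range, then compress the chosen files into one package with a contents listing. The record layout (barcode, title, year, files) and the CONTENTS.txt name are assumptions for this example; a real system would pull this data from the catalog.

```python
import io
import zipfile


def select_by_year(works, first_year, last_year):
    """Keep the works whose year falls inside the inclusive range."""
    return [w for w in works if first_year <= w["year"] <= last_year]


def build_package(works):
    """Return the bytes of a zip archive holding every file of each work."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        listing = []
        for work in works:
            for name, data in work["files"].items():
                # One folder per printed book, named after its barcode.
                archive.writestr(f"{work['barcode']}/{name}", data)
            listing.append(f"{work['barcode']}\t{work['year']}\t{work['title']}")
        # A complete list of the package contents, as for a CD sleeve.
        archive.writestr("CONTENTS.txt", "\n".join(listing) + "\n")
    return buffer.getvalue()
```

The same listing that goes into the package could feed the label and sleeve graphics, since both need only the titles and barcodes of the selected works.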

While this system may sound daunting to many people, it is nothing to be afraid of. When broken down properly, the entire puzzle can be put together piece by piece. Most of the technology needed to make it all happen is either open source, or can be purchased for moderate sums using grant funds. The equipment to run such a system is also quite simple when broken down into pieces. The benefit of the "puzzle" approach is that no two steps would have to be done using the same equipment, or even in the same physical location.

Yes, I have left out many smaller items of interest, but rest assured, they all have their place, and can all be taken care of. This entire vision is scalable and sustainable - to enhance our lives now, and for our children and grandchildren to use and expand to suit their own needs.

There are currently four major eBook production projects, each of which has its own way of doing things, some positive and some negative.

Project Gutenberg

Project Gutenberg is the original eBook producer on the Internet, starting back in July 1971 with the "Declaration of Independence". In the following 32 years, not a whole lot has changed for some aspects of the work, but recently, several important things have changed. PG has done well by allowing other people to mirror their archives and by actively maintaining a good volunteer group. The biggest problem with the system is the core file format in use. Gutenberg relies on a decades-old vanilla text format that does not handle character encoding, embedded illustrations, or other much-needed data. The inconsistent data structure also makes it hard to work with the files. Another large issue is the current state of PG's catalog. With very little metadata available, the catalog has little use except to decipher the extremely vague file-naming scheme that PG employs. Gutenberg currently has no real version tracking in place, so when content is updated (usually typographical fixes), it is hard to find these things out. Again, I must commend PG on their volunteer base. Doing a production project on any scale requires volunteers, and here, Gutenberg has succeeded. In recent history, PG's biggest improvement came with the introduction of the Distributed Proofreaders website. The DP system allows volunteers to proofread the raw OCR texts against the original scanned images, one page at a time. By implementing a two-step proofing process, DP has done well on quality. However, this system does not create rich master documents, so most of the original formatting is still lost in the final product. Any eBook production system would be wise to take advantage of DP's successful system.

Million Book Project at the Internet Archive

The Million Book Project is a very ambitious project to - you guessed it - digitize one million books. While this goal is indeed admirable, so far, the project has done little but make existing problems even worse. They have, however, maintained archives of the original scan images, an improvement on PG's work. The biggest problem I have found with MBP is their catalog - and I use the term "catalog" very loosely because their "catalog" is basically just a very rough listing that can be viewed using various criteria. An example of one of the issues with their catalog is the highly inaccurate topic headings. For instance, there are six topic heads for Shaktism, not one of which is correctly spelled. Another issue is that author names and titles are not consistently normalized using standard notation. This, of course, makes alphabetical browsing very difficult, not to mention forcing the user to constantly make mental adjustments when reading the list. While MBP has attempted to implement Dublin Core metadata, it lacks anything beyond the very basic fields, so it is useless in any practical sense. Even while pointing out these issues, I must commend them on their scanned images archive.

Electronic Book Center at the University of Virginia Library

This project is probably the most complete academic-centric library of eBooks and eTexts. They have carefully created master documents in SGML, therefore maintaining more of the original formatting than other projects have. The Electronic Book Center doesn't really have a catalog, per se, although they have very neatly laid-out listing pages. This is their biggest problem, as it makes it hard to find different eBooks. Another, smaller, issue is that a lot of the library is only accessible from the University, and inaccessible to the general public. EBC did this for various reasons, primarily to control access to copyrighted work, and while unfortunate, it is necessary to keep some publishers and providers happy. While I would like to see EBC implement an XML vocabulary for their master documents, their use of SGML puts them well ahead of the pack in regard to complete masters.

Making of America

The Making of America project deals mostly with American content from the late 18th century to the late 19th century, but they have done an exceptional job with the content they have. The most impressive part of MOA is their catalog and search, which far exceeds that of any other project. They have carefully indexed a lot of the important metadata, using normalized notation that makes the catalog very searchable. While their particular search engine technology is a bit slow, that should not detract too much from the good work on the data behind it. MOA has also preserved the original scanned images, one more point that puts them ahead of most projects. So far, I have not located any text versions of the content, which is a bit odd, considering the generally good system they seem to have. Unfortunately, scanned images alone aren't enough for eBook work - there must be text flavors available. The project's website can be a little confusing, but it's understandable with the search engine behind it and all the metadata they try to display.

In combining all four of these projects, we don't even have one full "next generation" project, a disheartening thought. The good news is that with a little work, all these projects (and others) can support a full complement of features, proper archives, and invaluable format options for their entire content holdings.

Not one of these projects deals with accessibility issues; in particular, none of them has any system, rudimentary or not, for speech synthesis. All of the projects have major issues with their catalogs, either missing crucial metadata, or lacking a good, understandable interface to display the metadata that is available. Some of the projects have multiple formats available, but they are ad-hoc and inconsistent at best. It is hard to find out some information regarding the internal workings of the projects, with the exception of Project Gutenberg, where I have volunteered a good deal in the past few years. Some of the projects may indeed have long-term storage systems for the print books after they have been processed.

EBooks are an invaluable technology in this era, and will continue to be so for at least a few decades. Even in the future, should the current concept of eBooks be usurped by newer technology, the work done now will be important in any migration effort. In order to ensure full reusability, current projects need to keep the larger picture in mind.

For those of you who may be directly involved with any of these projects, I welcome feedback and corrections. I have endeavored to make this article as accurate as possible, but with limited access to parts of the projects, it is difficult to clearly address each point.

Special thanks go to Dr. Jon Noring for graciously allowing me to use his "Paper Domain" concept.

James Linden
Founder / Head Geek
Digital Dock, LLC
aka kodekrash & N6NRD
Collegeville, PA USA
