James Linden

Making Digital Resources From Print Material

Overview

Making digital resources from print material isn't easy, nor should it be done without a plan. Many philosophies exist on how and why certain things should be done. This article is simply one man's views for the future. Hopefully, these ideas can be implemented, in full or in part, to the benefit of mankind.

Global Premises of the System

The system described here assumes a web-based implementation - that is to say, all the processing is done via the web, and no special software is required by the end user, except a compatible browser.

Several terms are used in this article, as defined here:

  1. System: The whole system of scripts, applications, etc.
  2. User(s): Any person doing the human labor for any part of the processing.
  3. Database: The core system database, or series of databases.
  4. Project: Each item being processed, whether that item is a series of illustrations, a book, a magazine, a newspaper, etc.
  5. Scan(s): The scanned page images.
  6. Image(s): Any portion of a scan which would be included in the final content, such as illustrations, mathematical formulas, or other non-text data.
  7. History: The complete record of any actions taken by a user or a system for each project.

The article assumes that the system operates in a certain way, namely with the following features:

  1. Multiple users, whether volunteers or employees, do any manual labor needed.
  2. Every action taken by a user or the system should be recorded in the database (a minimal schema sketch follows this list).
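
A minimal schema sketch for these two premises, using SQLite; the table and column names here are my own assumptions for illustration, not part of any existing system:

    import sqlite3

    # Hypothetical core tables: one row per project, one row per recorded action.
    conn = sqlite3.connect("system.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS project (
        id         INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        author     TEXT,
        status     TEXT DEFAULT 'selected'   -- e.g. selected, scanning, ocr, done
    );
    CREATE TABLE IF NOT EXISTS history (
        id         INTEGER PRIMARY KEY,
        project_id INTEGER REFERENCES project(id),
        actor      TEXT NOT NULL,            -- user name, or 'system'
        action     TEXT NOT NULL,            -- description of the action taken
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    """)
    conn.commit()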

Basic Processing Steps

Step 1: Finding Desirable Print Material

In my opinion, not everything should be digitized. A thoughtful selection process is needed to decide what to work on and what to ignore. Along with this comes the prioritization of the material. There is not a whole lot to say about this, except to note that it really comes down to personal preference and interest, and possibly a defined project scope. Once the material is selected, a processing project record should be created in the system database.

Step 2: Copyright Clearance

If the target content is believed to be in the public domain, a copyright clearance check is required to verify the status of copyright on the selected material. This clearance should require very little human interaction, consisting solely of a person entering selected metadata - title, author, publication date, etc. - into a form. The system should query various sources, as needed, to derive the copyright status. This discovered data should be verified by a qualified person. The verified information should then be added to the database and associated with the proper project.

Any copyright information which is retrieved by the system should be stored in the database for possible future use, whether it applies to the current project or not. This data could be added to an existing local copyright clearance database, or used to build one from scratch over time.

If the target content is not public domain, then it is assumed that the system has implicit permission to process the copyrighted material, as the case might be for a company providing conversion services to clients who provide the content.

In the case of copyrighted material, the copyright info should be added to the project record, and put into the copyright clearance database as well.

Step 3: Scanning

At this point, there is no good way to automate the scanning - it takes manpower. Even automated scanning machines require a human's physical interaction. This article assumes that the user who originally selected the text for processing has access to the physical material. This user would then scan each and every page of the print material, naming the files using numeric identifiers according to their sequence in the physical item, ignoring printed page numbers, etc. The scans should then be submitted to the system, individually or in batch(es). In theory, this submission would be done via some sort of upload system, either direct FTP or via a webpage. The webpage method is generally impractical, since it isn't designed to handle the size of files required for page scans, which can often exceed several megabytes each. These scans should then be associated with the current project.
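
A small sketch of the sequential naming described above, assuming the raw scanner output sorts into scan order by filename; the directory names and the zero-padded pattern (which matches the repository layout shown later) are illustrative choices, not fixed requirements:

    import os

    # Rename raw scanner output to zero-padded sequence numbers for project 100.
    project_id = 100
    raw_dir = "incoming"   # hypothetical directory of files straight off the scanner
    out_dir = "scans"

    os.makedirs(out_dir, exist_ok=True)
    for seq, name in enumerate(sorted(os.listdir(raw_dir)), start=1):
        ext = os.path.splitext(name)[1].lower()
        os.rename(os.path.join(raw_dir, name),
                  os.path.join(out_dir, f"{project_id}-{seq:04d}{ext}"))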

Step 4: Scan Verification

Once the scans are fully uploaded, a second user (one who did not do the scanning and uploading) should verify that the sequence of scans makes sense. Any questions or possible errors should be brought to the attention of the person who scanned the material. Each issue should be documented and resolved. If any issues cannot be resolved (torn pages, blotted ink, etc), they should be noted as such in the history record. Once this process is completed, the scans should be made available for the next step of processing.

Step 5: Scan Processing

Next, each scan image should be processed on an individual basis. This step would include fixing skew angle, cropping multi-column content, resizing, resampling, or other cleanup of the image which might be required to ensure usability of the image. Multi-column images should be renamed by appending a letter (a-z) to the image filename. After scan cleanup, images which should be part of the final text should be created from the original page scans and saved in a separate directory, using filenames which easily correlate to the scan images they are derived from. Any captions associated with the images should be saved to the database or a text file and associated with the image record. This step is complete only after every scan and image is fully processed for the entire project.
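
One possible sketch of this cleanup using the Pillow imaging library, assuming the skew angle and crop box have already been determined (by eye or by a separate detection step); the specific numbers and filenames are examples only:

    import os
    from PIL import Image

    # Clean up one scan: correct a known skew angle in place, then crop an
    # illustration out into the separate images/ directory.
    os.makedirs("images", exist_ok=True)
    scan = Image.open("scans/100-0005.png")
    deskewed = scan.rotate(-1.4, expand=True, fillcolor="white")  # negative angle = clockwise
    deskewed.save("scans/100-0005.png")

    illustration = deskewed.crop((120, 300, 980, 1450))           # (left, upper, right, lower) in pixels
    illustration.save("images/100-0005a.png")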

Step 6: OCR Processing

The OCR processing should be as automatic as possible, and a simple interface should allow the users to control OCR software located on the server. OCR for many projects could be highly automated, with users simply activating the OCR process and verifying the output to make sure the OCR didn't encounter issues. More complicated material may require special OCR configuration, and if that is not possible on the server, the image should be downloaded by users with OCR capability, with the resulting text uploaded back to the system. Text output should be saved to files using the same filenames as the images it was derived from, but in a separate directory. Once all OCR is complete, the project can be released for the next step. The history log should reflect how each page was OCRed, whether by the server or manually by a user.
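
A minimal sketch of the server-side automation, assuming the open-source Tesseract OCR engine is installed on the server; the directory names follow the earlier sketches but are otherwise my own assumptions:

    import os
    import subprocess

    # Run Tesseract over every cleaned scan, writing 100-0001.txt etc. into ocr/.
    # "tesseract input output_base" writes output_base.txt by default.
    os.makedirs("ocr", exist_ok=True)
    for name in sorted(os.listdir("scans")):
        if not name.endswith(".png"):
            continue
        base = os.path.splitext(name)[0]
        subprocess.run(["tesseract", os.path.join("scans", name),
                        os.path.join("ocr", base)], check=True)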

Step 7: Proofreading Round 1

The first round of proofreading should concentrate only on correcting errors where the text output does not match the scan. It should not be concerned with formatting (italics, etc) or spelling errors, except where the spelling in the text does not match the scan. Any corrections should be logged to the history, recording the correcting user, the page the error was found on, and the exact correction. This data is important for proactively teaching the OCR system to eliminate errors in later projects.
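
A sketch of how one such correction might be logged, reusing the hypothetical history table from the earlier schema sketch; the user name, page, and correction text are examples only:

    import sqlite3

    # Record a single proofreading correction against project 100, scan 0042.
    conn = sqlite3.connect("system.db")
    conn.execute(
        "INSERT INTO history (project_id, actor, action) VALUES (?, ?, ?)",
        (100, "jdoe", "proof round 1, scan 100-0042: 'tbe' corrected to 'the'"),
    )
    conn.commit()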

Step 8: Proofreading Round 2

Round two of the proofreading should concentrate on formatting, and on correcting spelling mistakes which do appear in the original material, should such spelling correction be desired. For example, spell-correcting a Mark Twain novel is not desired, as the errors are the author's intended style. Footnote locations (and other forms thereof) should be marked in the text and the actual footnote text saved in a separate file. Again, each change should be logged. Formatting changes don't really need to be logged, but could be for consistency.

Step 9: Preassembly

After proofing, the system should automatically assemble the text files into a single merged text file, saving the file using the project's ID in the project's directory. Markers denoting each file's content should be placed where each scan's text starts. Once the system has completed this task, the text should be made available to a user to verify the assembly is complete and accurate.
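
A minimal sketch of the automatic merge, assuming the proofread text files use the same base names as the scans; the [[scan ...]] marker format is an assumption, not a fixed convention:

    import os

    # Merge the per-page text files into proj/100.txt, writing a marker at the
    # start of each page's text so the original scan can always be located.
    project_id = 100
    os.makedirs("proj", exist_ok=True)
    with open(f"proj/{project_id}.txt", "w", encoding="utf-8") as merged:
        for name in sorted(os.listdir("ocr")):
            if not name.endswith(".txt"):
                continue
            merged.write(f"[[scan {os.path.splitext(name)[0]}]]\n")
            with open(os.path.join("ocr", name), encoding="utf-8") as page:
                merged.write(page.read().rstrip() + "\n\n")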

Step 10: Assembly

This step's purpose is to introduce markers for images (created in step 5), including image captions, into the merged text file. During this step, structure should be defined, such as chapter breaks and other such presentational features (blockquotes, stanzas of verse, etc). Footnote references should also be verified. Once this step is complete, the updated file should be saved under a different filename than the preassembly output file, in the project directory.

Step 11: Markup

After assembly, the system should automatically convert scan image number markers, structure markers, footnote references, and formatting markers to full markup. During the markup step, a user should verify this automatic markup and add any other markup required to the text. The final output should be saved as an XML file, named using the project's ID number, in the project directory.
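
A hedged sketch of the automatic conversion, assuming the [[scan ...]] and [[image ...]] markers from the earlier sketches and an assembly output file named 100-assembled.txt; the element names (<pb/>, <illustration/>) are illustrative, not a defined schema:

    import re

    # Convert assembly markers into XML elements:
    #   [[scan 100-0042]]   -> <pb scan="100-0042"/>
    #   [[image 100-0042a]] -> <illustration src="100-0042a"/>
    project_id = 100
    with open(f"proj/{project_id}-assembled.txt", encoding="utf-8") as f:
        text = f.read()

    text = re.sub(r"\[\[scan ([\w-]+)\]\]", r'<pb scan="\1"/>', text)
    text = re.sub(r"\[\[image ([\w-]+)\]\]", r'<illustration src="\1"/>', text)

    with open(f"proj/{project_id}.xml", "w", encoding="utf-8") as f:
        f.write(text)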

Step 12: Cataloging

Metadata from the project should be verified by a qualified cataloger. The cataloger may update the metadata, in which event the system should add the changes to the history record. At this point, the release ID should be assigned to the project.

Step 13: Master Creation

During final assembly, the system should prepend the metadata, in full XML format, to the marked-up file. It should also add any other markup, such as license data, processing notes and history, etc., to the file. This XML master file should be verified by a qualified user, either making changes as needed, or adding notes of any changes needed and sending the file back to the proper step for reprocessing.
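
A simple sketch of the prepending, wrapping the metadata record and the marked-up content into one master document; the <resource>, <metadata>, and <content> element names are placeholders, as is the metadata filename:

    # Combine metadata and marked-up content into a single master XML file.
    project_id = 100
    with open(f"proj/{project_id}-metadata.xml", encoding="utf-8") as f:
        metadata = f.read()
    with open(f"proj/{project_id}.xml", encoding="utf-8") as f:
        content = f.read()

    with open(f"proj/{project_id}-master.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="utf-8"?>\n')
        f.write(f"<resource>\n<metadata>\n{metadata}\n</metadata>\n")
        f.write(f"<content>\n{content}\n</content>\n</resource>\n")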

Step 14: Format Generation

Any formats which will be static files should be generated by the system and verified by users.

Step 15: Archiving

Where necessary, the system should package up the master XML document, along with the images, scans, full history, OCR text output, and output files from other steps, including the generated static format output. A manifest file should be included, citing each file and its description. This archive should be saved to various locations, and preferably burned to CD for safekeeping.
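
A compressed-archive sketch of this step, using a zip file and file sizes in place of real per-file descriptions; the directory names follow the earlier sketches and are assumed to be flat:

    import os
    import zipfile

    # Build 100.zip containing the project files plus a manifest listing each one.
    project_id = 100
    members = []
    for directory in ("proj", "scans", "images", "ocr"):
        for name in sorted(os.listdir(directory)):
            members.append(os.path.join(directory, name))

    with open("manifest.txt", "w", encoding="utf-8") as manifest:
        for path in members:
            manifest.write(f"{path}\t{os.path.getsize(path)} bytes\n")

    with zipfile.ZipFile(f"{project_id}.zip", "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write("manifest.txt")
        for path in members:
            archive.write(path)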

Step 16: Packaging

Where needed, the system should package up static formats, such as HTML+images, etc. Again, a manifest file should be generated and included. The packages should be saved using the release ID as part of or the whole filename. These packages, and any single-file versions of the project, should be saved to the public repository. The master document, scan images, images, etc should be saved to the master repository under the release ID. Derivative versions, such as might be needed for multi-volume projects, should be packaged under the same release ID, with added information to denote them as individual volumes of a single project. These multiple volumes should not be given new release IDs.

Step 17: Delivery

Once the master files are saved to the private repository, and the packages saved to the public repository, the catalog data should be added to the public catalog and the files made officially available to the public. The catalog record interface should provide links to download mirrors, along with access to on-the-fly format conversion for formats which are not stored statically.

Optional Sub Processes

Sub 1: Format Conversion

On-demand format conversion should parse the master document as required by the particular output format. If required, a package should be created as described in step 15. This process can be accomplished by the same system used in step 14. These files should be named like 100.html or 100.txt, etc.

Sub 2: Language Translation

Automatic language translation of texts should be made available. Once the system has automatically translated the content, the new content could be processed as described in step 7 through step 16. Sub 1 may also be implemented. Again, these derivative packages should be given the same release ID as the parent document, but marked as a different language derivative, using ISO 639 three-letter language codes. For example: 100-eng.xml and 100-spa.xml.

Sub 3: Voice Synthesis

Voice synthesis of the texts can be greatly automated using a series of processing steps. The voice synthesis system should implement multiple voices for different characters, narration, etc. Consistent rules should be created for denoting footnotes and other "peripheral" data, since it must be synthesized inline, not as an appendix. All synthesis media masters should be saved in Shorten Audio format, which provides lossless compression.

  • Step 1: Voice Markup

    Each character in the content should be assigned a voice, and content marked to identify where each person is speaking. In the case of theatrical content (Hamlet, for example), this may already exist and only need to be verified and associated with the selected voice profile.

  • Step 2: Test Synthesis

    A few passages for each voice profile should be sampled and synthesized for verification by a user. Once verified, the system should synthesize the full content.

  • Step 3: Synthesis

    The system should synthesize the entire content in blocks, possibly as small as a few paragraphs at a time. This makes it possible to correct mistakes and re-synthesize only small portions, replacing just that output file.

  • Step 4: Proofing Round 1

    Users should listen to the blocks of synthesis output and read along with the text. Any errors should be corrected by instructing the synthesis engine to re-synthesize that block containing the error. All such corrections should be added to the project history.

  • Step 5: Block Merging

    The system should automatically merge the blocks into a seamless file or series of files (for large content). This process should be verified by comparing a log file generated during the process to existing synthesis block counts.

  • Step 6: Proofing Round 2

    A user should listen to the full synthesis, reading along in the text and checking for continuity, noting any issues, and sending them back to step 3 for correction.

  • Step 7: Approval

    Final approval of the synthesis output should result in the output being cataloged as defined in step 12, then archived, packaged, and delivered as defined in step 15 through step 17. A master synthesis output copy should be burned to CD, again, for safekeeping.

  • Step 8: Format Conversion

    Shorten Audio files aren't common in the real world, but the format is the best choice for masters. Once the masters are created, they should be automatically converted to more popular formats, such as patent-encumbered MP3 or open-source Ogg Vorbis, for delivery to the users, as sketched below. These copies should be created and saved as static files, as on-the-fly conversion is simply too hardware intensive for a repository server to handle.
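
A conversion sketch, assuming ffmpeg is installed on the server (it can decode Shorten and encode Ogg Vorbis); the filenames follow the repository layout shown in the next section:

    import subprocess

    # Convert one Shorten master block to Ogg Vorbis for public delivery.
    src = "masters/audio/eng/100-eng-001.shn"
    dst = "audio/eng/100-eng-001.ogg"
    subprocess.run(["ffmpeg", "-i", src, "-c:a", "libvorbis", "-q:a", "5", dst],
                   check=True)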

The Repository

After processing, the work is not complete. Repository management and use are just as important as the processing itself. The following section details the construction and management of the repository. The repository contains the original scans, the master XML document, proofread OCR files, images, and possibly Shorten Audio masters, alternative-language master XML documents, etc. Along with the masters, any static versions of the content (audio and text) should be stored in the repository.

Part 1: Directory Structure

The repository is a large filesystem, and careful consideration should go into creating it. The following directory structure is my suggestion for a good filesystem (assuming the release ID of the content is 100):

  • repo
    • 1
      • 0
        • 100
          • 100-man.xml
          • 100-man.txt
          • 100-log.xml
          • 100-log.txt
          • masters
            • text
              • 100-eng.xml
              • ...
              • 100-spa.xml
            • audio
              • eng
                • 100-eng-001.shn
                • ...
                • 100-eng-999.shn
              • spa
                • 100-spa-001.shn
                • ...
                • 100-spa-999.shn
          • text
            • eng
              • 100-eng.txt
              • 100-eng.xml
              • 100-eng.html
              • 100-eng.txt.zip
              • 100-eng.xml.zip
              • 100-eng.html.zip
              • ...
            • spa
              • 100-spa.txt
              • 100-spa.xml
              • 100-spa.html
              • 100-spa.txt.zip
              • 100-spa.xml.zip
              • 100-spa.html.zip
              • ...
          • scans
            • 100-0001.png
            • ...
            • 100-9999.png
          • images
            • 100-0001a.png
            • ...
            • 100-9999z.png
          • audio
            • eng
              • 100-eng-001.ogg
              • ...
              • 100-eng-999.ogg
            • spa
              • 100-spa-001.ogg
              • ...
              • 100-spa-999.ogg
          • iso
            • 100.iso

Most of the filesystem is self-explanatory, with a couple of possible exceptions:

  1. 100-man.* files in the root are the manifest files which should contain all metadata associated with the content, along with a list of all files available for the content. This file should be automatically updated whenever the system changes any of the data, adds a new format or language translation, etc.
  2. 100-log.* files in the root are the change log files which should contain a log of all processing steps and updates to the content files. This file should be automatically updated whenever changes are made to the content or the repository for the content.
  3. 100.iso in the iso directory is a CD image of the entire contents of the 100 root directory, except the ISO itself, of course. This image should contain all the manifest, change log, scans, images, text, audio, and translations of the content which are available. This iso file should be rebuilt whenever changes are made to the content, as sketched below.
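
One possible way to rebuild that image, assuming the genisoimage (mkisofs) tool is available on the server; the volume label and the exclusion pattern are my own choices:

    import subprocess

    # Rebuild iso/100.iso from the release's root directory, excluding the iso
    # directory itself so the image is not nested inside the next build.
    subprocess.run(["genisoimage", "-o", "repo/1/0/100/iso/100.iso", "-R", "-J",
                    "-V", "RELEASE_100", "-m", "iso", "repo/1/0/100"],
                   check=True)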

The masters subdirectory should not be writable except by the system itself, nor should it be available via FTP or rsync, etc. Everything else should be world-readable and available via FTP, etc.

None of the files in the repository should ever be manually edited for any reason!


Although I never finished writing this concept, I thought I'd put it up for those who may be interested in the part that is done. One day, I may revisit this.
