Date   : Wed, 02 Jul 2008 22:41:42 +0100
From   : mu.list@... (Mark Usher)
Subject: The Micro User

Coming in a bit late on this thread...

I have a complete set of Acorn Users, apart from Issue 1, that have been
donated to the doc project for the specific purpose of being scanned.

The best way to get good images of the magazines is to pull them apart and
use a scanner with a sheet feeder, ideally one that lays each page flat on
the scanner glass rather than scanning it as the page moves past. These
scanners are increasingly common as part of Minolta/Canon et al.
photocopiers, which now have these functions built in. They can usually be
connected to the LAN to transfer the scanned images.

This batch scanning is the best way to do this amount of work. Someone must
have access to a machine of this sort?

I experimented with pulling the magazines apart and found the best way was
to use a plane, holding the magazine between two pieces of square edged
timber to get a clean cut and edge.

Another way would be to heat and melt the binding glue, but I didn't find a
method to do this with satisfactory results. Heat also damaged the paper of
the magazine and thus the subsequent image.

Regarding the OCRing and images: the main purpose is to preserve the image
as published, and thus the information it contains. Modern techniques can be
used to do this, but the object is not to recreate a file that can be used
to reproduce or reprint the original. 

Once we have this image, it is extremely beneficial if it can be indexed and
searchable in some form, so that users may find information more readily.
The combined use of PDF and OCR, where the OCR text is layered behind the
image, is the perfect answer to this. The original information in the form
of the image is preserved, and the need to search and index content is
met. Because the original image is still what the user sees, it is not
of paramount importance for the OCR to be 100% accurate. 95% accuracy would
be perfectly acceptable and far less time-consuming to achieve.
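
To make a figure like "95% accuracy" checkable, one option is to proofread a
sample page by hand and compare it against the raw OCR output. The sketch
below is only an illustration: the function name char_accuracy and the
character-level measure are my assumptions, not something specified above.

```python
from difflib import SequenceMatcher


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Fraction of the proofread reference characters that the OCR text matched.

    Assumption: accuracy is measured per character against a hand-checked
    reference. Other definitions (per word, per line) are equally valid.
    """
    if not reference:
        return 1.0  # nothing to get wrong
    matcher = SequenceMatcher(None, reference, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


if __name__ == "__main__":
    reference = "The BBC Micro was launched in 1981."
    ocr_text = "The BBC Micro was 1aunched in l981."  # typical l/1 confusion
    print(f"{char_accuracy(reference, ocr_text):.1%}")
```

Running this over a handful of sample pages per issue would give a rough idea
of whether a batch is good enough to index as it stands.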

In summary, I would suggest the following:
1) Preparation of magazines to single pages for duplex batch scanning
2) Collation of images
3) Batch preparation into PDF files
4) Batch OCR of PDF files to provide text search and indexing capability
5) Indexing
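
As a rough illustration of step 2, the sketch below groups scanned page files
by issue and checks for gaps before anything is turned into a PDF. The file
naming scheme (for example AU-1985-03_p012.tif) and the function names are
hypothetical; adjust the pattern to whatever the scanner actually produces.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: <issue>_p<page>.<ext>, e.g. "AU-1985-03_p012.tif"
SCAN_NAME = re.compile(r"^(?P<issue>.+)_p(?P<page>\d+)\.\w+$")


def collate(filenames):
    """Return {issue: [filenames in page order]} for names matching SCAN_NAME."""
    issues = defaultdict(dict)
    for name in filenames:
        match = SCAN_NAME.match(name)
        if match is None:
            continue  # not a page scan under the assumed scheme; skip it
        issues[match["issue"]][int(match["page"])] = name
    return {
        issue: [pages[number] for number in sorted(pages)]
        for issue, pages in issues.items()
    }


def missing_pages(filenames, issue, page_count):
    """List page numbers in 1..page_count with no scan for the given issue."""
    present = set()
    for name in filenames:
        match = SCAN_NAME.match(name)
        if match is not None and match["issue"] == issue:
            present.add(int(match["page"]))
    return [n for n in range(1, page_count + 1) if n not in present]


if __name__ == "__main__":
    scans = ["AU-1985-03_p002.tif", "AU-1985-03_p001.tif", "AU-1985-03_p004.tif"]
    print(collate(scans))
    print(missing_pages(scans, "AU-1985-03", 4))
```

Catching a missing or double-fed page at this stage is much cheaper than
discovering it after the PDFs have been built and OCRed.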

Annotations can always be added to the PDF files where articles are found
to be inaccurate, without changing the original image, as and when people
feel inclined to do so. Readers can turn these on or off.

Program listings, or other critical pieces of text, that would suffer from
inaccurate OCR results could always be corrected by manual proofreading.
In the case of program listings, disc/tape images could easily be linked to.
I believe 8BS had most of the Acorn User programs available.
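
To cut down the proofreading of listings, a small check could point the
reader at the lines most likely to be wrong. The sketch below assumes a
numbered BASIC listing with strictly ascending line numbers, one statement
line per text line; that is my assumption, and wrapped magazine lines would
be flagged too. The function name suspicious_lines is made up for this
example.

```python
import re

LINE_START = re.compile(r"^\s*(\d+)")


def suspicious_lines(listing: str):
    """Return (row, text) pairs whose line number is missing or not ascending.

    Assumes a numbered BASIC listing. A flagged line is only a hint for the
    proofreader, not proof of an OCR error.
    """
    flagged = []
    previous = -1
    for row, text in enumerate(listing.splitlines(), start=1):
        if not text.strip():
            continue  # ignore blank lines
        match = LINE_START.match(text)
        if match is None or int(match.group(1)) <= previous:
            flagged.append((row, text))
            continue
        previous = int(match.group(1))
    return flagged


if __name__ == "__main__":
    ocr_listing = '10 PRINT "HELLO"\n2O GOTO 10\n30 END'
    for row, text in suspicious_lines(ocr_listing):
        print(f"check row {row}: {text}")
```

This only catches errors in the line numbers themselves, so it complements
rather than replaces a manual read-through or a link to the disc/tape image.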

I believe that the above would give us what we desire: ready, easy access
to Acorn User (Micro User et al.) with the ability to search for text within
articles, while preserving the images as they were originally published. It
would also require substantially less manual effort, as much of the process
is automated.


-Mark
www.bbcdocs.com
 