Date : Tue, 24 Jun 2008 11:59:05 +0100
From : christopher.whytehead@... (chris whytehead)
Subject: The Micro User
>
> You don't have to do all these stages. OCR is really advanced these days
> making this all very simple to do without all this tedious messing around.
> Scanning, Layout and Proof reading built in. Preserving the layout or
> selected different types of final layout. OCR has moved on from what it
> was in the past.
>
I have to disagree at least in part with Chris Thornley.
I have scanned a lot of documents (somewhere around 600 separate documents
which are available from Chris's Acorns) over the last 5 years using
Scansoft/Nuance OmniPage 11, 15 & 16 and Windows 2000. I use OmniPage with a
Canon A4 scanner to scan the document at 300dpi. My objective is to
reproduce the content of the document accurately with as close an
approximation to the layout as I can reasonably achieve.
My procedure is:
1. Scan 10 pages of the document. I choose 10 because it takes me about 1
hour to scan, OCR and proof read 10 A4 pages (about 30 mins for 10 A5
pages), after which my eyes need a rest.
2. Save the scanned 10 pages as an Omnipage file (.opd).
3. OCR and proof read the pages 1 at a time.
4. When the 10 are complete resave the Omnipage (.opd) file. This is both to
save progress (some manuals may take a week to complete) and crash
prevention as Omnipage crashes are not unknown.
5. Repeat steps 1 to 4 until document is complete, then save as PDF (edited)
file and proof read the PDF particularly checking the layout.
6. Correct the inevitable errors, resave the PDF and check that the errors
are resolved. Repeat until all OK or as good as I can get it.
7. Final save of the OmniPage file after all corrections, and archive it so I
can re-edit and recreate the PDF if needed.
8. Upload PDF to Chris's Acorns and index.
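The batching in steps 1 to 5 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not anything OmniPage provides: the function names are made up for the example, and the timings are the rough figures from step 1 (about 1 hour per 10 A4 pages, about 30 minutes per 10 A5 pages), applied pro rata to a short final batch.

```python
from typing import Iterator, Tuple

# Rough rates from step 1: minutes to scan, OCR and proof read one batch.
MINUTES_PER_BATCH = {"A4": 60, "A5": 30}
BATCH_SIZE = 10  # pages per sitting


def batches(total_pages: int, size: int = BATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) for each batch, 1-based and inclusive."""
    for first in range(1, total_pages + 1, size):
        yield first, min(first + size - 1, total_pages)


def estimated_minutes(total_pages: int, paper: str = "A4") -> float:
    """Estimate total working time, pro rata for a short last batch."""
    return total_pages * MINUTES_PER_BATCH[paper] / BATCH_SIZE


if __name__ == "__main__":
    pages = 34  # hypothetical manual length
    for first, last in batches(pages):
        print(f"scan, OCR and proof read pages {first}-{last}, then resave the .opd")
    print(f"roughly {estimated_minutes(pages) / 60:.1f} hours before PDF proof reading")
```

The estimate covers only steps 1 to 4; the PDF proof reading and correction in steps 5 and 6 come on top of it.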
Some observations on this procedure:
1. Omnipage scanning quality is good. Even with bound documents (e.g.
magazines) Omnipage copes reasonably well with the curves, although I think
the Canon scanner also contributes to this.
2. I scan colour pages in colour, pages with greyscale images (e.g.
photographs) using greyscale and other pages in black and white.
Occasionally with very poor quality originals I use greyscale instead of
black and white and then enhance the image (in Omnipage) to get the best OCR
I can.
3. In general I find the word recognition good and OmniPage "learns" to
recognise words by asking you about unrecognised ones. Omnipage has a
training file which I save after each document is scanned. I find that
recognition definitely improves with the number of pages OCRed. I also have
a user dictionary which must have loads of Acorn/computer specific words by
now.
4. The main issue I have with OCR is layout. I find that Omnipage will use
fonts, font sizes, justification and spacings indiscriminately. I now
restrict scanned documents to 3 fonts (Times New Roman, Arial and Courier
New) unless there is a very good reason to use another font (e.g. example
text in that font). So I start by converting all text to the main font and
font size (usually Times New Roman) and change spacing to single space,
standardise line length and justification and set 0 point space after a
line. I then apply any changes needed to restore the page's appearance (e.g.
Headings in Arial bold) and spacing. I find particular care is needed with
page headers and footers, and with page numbers.
5. I have particular difficulty with text that flows around or over
irregular objects/pictures (e.g. in the Acorn promotional brochures).
Sometimes I have no option but to treat the whole lot as an image and not OCR
it.
6. When I save as PDF files, I save as PDF version 1.2 (Acrobat 3.x) to
ensure support from most RISC OS PDF readers. I test using the latest
Acrobat Reader (currently 8) and then check the final version using the
latest !PDF version.
7. I recently found that the default setting for saving images in the PDF
was 150dpi and that the quality is not as good as it should be. Therefore I
have changed this to 300dpi, which results in much better quality PDFs but
at the price of larger files. The size increase is directly related to the
number and size of images in the document.
8. I have gradually improved the quality of final document as I have learnt
more about OmniPage's features, and recognise some of my earlier scans
really need to be revisited. But the key to good scans is careful
proofreading and correction which remains very time consuming.
9. I upgraded from Omnipage 15 to 16 last year but rapidly discovered a
problem with the output. Nuance's technical support could reproduce the
problem which was not present in Omnipage 15, but could/would not fix it. So
I had to downgrade to OmniPage 15 again.
10. I can reload the Omnipage files (.opd) at any time and generate TIFF,
JPEG, PNG etc page images. I have recently regenerated PDFs going back to
October 2007 to change the embedded images from 150dpi to 300 dpi. I will
eventually get around to the rest.
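To put a number on the 150dpi versus 300dpi trade-off in observation 7: doubling the resolution doubles the pixel count in each direction, so the raw image data is four times larger. The Python sketch below only works out uncompressed sizes for a full A4 page. The function names are invented for the example, and the real growth of a PDF depends on the compression used and on how much of each page is image, so treat the figures as an upper-bound illustration rather than what OmniPage will actually write.

```python
MM_PER_INCH = 25.4

# Bits per pixel for the three scan modes mentioned in observation 2.
BITS_PER_PIXEL = {"black and white": 1, "greyscale": 8, "colour": 24}


def page_pixels(width_mm: float, height_mm: float, dpi: int) -> int:
    """Number of pixels in an image of the given physical size at this dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px


def raw_bytes(width_mm: float, height_mm: float, dpi: int, mode: str) -> int:
    """Uncompressed size in bytes (before any compression inside the PDF)."""
    return page_pixels(width_mm, height_mm, dpi) * BITS_PER_PIXEL[mode] // 8


if __name__ == "__main__":
    a4 = (210, 297)  # A4 in millimetres
    for mode in BITS_PER_PIXEL:
        for dpi in (150, 300):
            size_mb = raw_bytes(*a4, dpi, mode) / 1_000_000
            print(f"A4 {mode:15s} at {dpi}dpi: {size_mb:6.2f} MB uncompressed")
```

The same sum also shows why scanning plain text pages in black and white keeps files small: a greyscale page carries 8 times, and a colour page 24 times, the raw data of a black and white one.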
Finally I have found that I cannot reasonably scan pages greater than the
size of my A4 scanner, so I have had to have A0, A1, A2 and A3 technical
diagrams commercially scanned.
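For anyone wondering which sheets need sending out, here is a small Python check against the standard ISO 216 A-series sizes. It assumes the scanner bed is exactly A4 (real flatbeds are often a few millimetres larger), and the function name is just for this example.

```python
# ISO 216 A-series sheet sizes in millimetres (short side, long side).
ISO_A_MM = {
    "A0": (841, 1189),
    "A1": (594, 841),
    "A2": (420, 594),
    "A3": (297, 420),
    "A4": (210, 297),
    "A5": (148, 210),
}


def fits_on_bed(sheet: str, bed: str = "A4") -> bool:
    """True if the sheet fits on the scanner bed in one pass.

    Assumes the bed is exactly the named paper size; orientation does not
    matter because both sizes are compared short side to short side.
    """
    sheet_short, sheet_long = sorted(ISO_A_MM[sheet])
    bed_short, bed_long = sorted(ISO_A_MM[bed])
    return sheet_short <= bed_short and sheet_long <= bed_long


if __name__ == "__main__":
    for name in ISO_A_MM:
        verdict = "scan at home" if fits_on_bed(name) else "send out for scanning"
        print(f"{name}: {verdict}")
```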
Hope this is useful.
Chris
--
How does a project get to be a year late?... One day at a time.
Chris's Acorns http://acorn.chriswhy.co.uk