Date : Mon, 23 Jun 2008 22:55:30 +0100
From : bfoley@... (Brian Foley)
Subject: The Micro User
On Mon, Jun 23, 2008 at 09:11:10PM +0100, Chris Thornley wrote:
> You don't have to do all these stages. OCR is really advanced these days,
> making this all very simple to do without all this tedious messing around.
> Scanning, layout and proofreading are built in, preserving the layout or
> letting you select different types of final layout. OCR has moved on from
> what it was in the past.
I might be a little behind the times, but I do a lot of work with
Distributed Proofreaders (http://pgdp.net -- the guys who've proofed
and formatted 13,000 of the texts available on Project Gutenberg), and
the consensus there is that OCR packages still can't be trusted on
their own. Even on clean scans of English texts, they're still prone
to 'scannos' such as misinterpreting 'and' as 'arid', and you can only
imagine what they'd do with computer jargon, acronyms, and hexadecimal
dumps in the yellow pages!
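
To illustrate the kind of check a human proofer ends up doing, here is a minimal sketch in Python that flags possible scannos for review. The word list and the function name flag_scannos are made up for this example, not taken from any Distributed Proofreaders tool. Note that it only flags words and never corrects them, because 'arid' and 'modem' are perfectly good words in the right context.

```python
import re

# Illustrative (hypothetical) table: OCR misreading -> word it often stands for.
SCANNOS = {"arid": "and", "tbe": "the", "modem": "modern"}

def flag_scannos(text, scannos=SCANNOS):
    """Return (line_number, word, suggestion) for each suspect word.

    Nothing is changed in the text; a human still has to decide
    whether each hit is a real error.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            word = match.group(0)
            suggestion = scannos.get(word.lower())
            if suggestion is not None:
                hits.append((lineno, word, suggestion))
    return hits

if __name__ == "__main__":
    sample = "Type in the listing arid run it.\nA modem is required."
    for lineno, word, suggestion in flag_scannos(sample):
        print(f"line {lineno}: '{word}' (did the page say '{suggestion}'?)")
```

The second hit in the sample is a false alarm, which is exactly why the output is a list for a person to check rather than an automatic fix.
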
It also appears that while, for example, the most recent versions of
ABBYY FineReader have gotten much better about recognising and
preserving layout, they've done so at the cost of being much slower
than earlier versions, and having a higher error rate when recognising
the text itself.
So while I'd be all in favour of people *optionally* postprocessing
the raw scans in whatever way they like, especially if it involves
making stuff searchable, and adding hyperlinks and metadata, I still
think it's important to preserve only lightly tweaked scans to fall
back on.
Cheers,
Brian.