Date : Wed, 25 Jun 2008 09:30:40 -0500
From : jules.richardson99@... (Jules Richardson)
Subject: The Micro User
Jonathan Graham Harston wrote:
> "Chris Thornley" wrote:
>> You don't have to do all these stages. OCR is really advanced these days
>> making this all very simple to do without all this tedious messing around.
>> Scanning, Layout and Proof reading built in. Preserving the layout or
>> selecting different types of final layout. OCR has moved on from what it
>> was in the past.
>
> For OCRing it's just three stages: Scan, OCR, proofread. A
> computer can't proofread, it doesn't know what it's supposed to
> say.
Hmm, I'd suggest more like: scan, pre-process, OCR, proofread (and as you
mention, there's a bit of optional hand-wavy layout stuff around the OCR* step).
* I was messing with some code last year where I could roughly draw the page
layout on top of a scan (making a note of how columns interacted, where
colour, greyscale and line-art images were etc.) so that I could just feed the
data into a back-end for colour reduction, OCR and PDF generation. I suspect
the big expensive packages do all this already, but I was looking for a free
Linux-based option. Maybe I'll get around to finishing the code at some
point :-)
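
A rough sketch of what such a page-layout description might look like, purely
as an illustration. The region kinds, the Region class, plan_page() and the
action names are all hypothetical, not taken from the unfinished code mentioned
above; the pixel boxes are made-up sample values.

```python
from dataclasses import dataclass

# Hypothetical region kinds; the post mentions text columns plus colour,
# greyscale and line-art images.
KINDS = {"text", "colour", "greyscale", "lineart"}


@dataclass
class Region:
    kind: str    # one of KINDS
    box: tuple   # (left, top, right, bottom) in scan pixels
    order: int   # reading order, so columns are handled in sequence


def plan_page(regions):
    """Return (action, region) pairs in reading order.

    The action names are placeholders for whatever back-end does the
    OCR, colour reduction and so on.
    """
    actions = {
        "text": "ocr",
        "colour": "reduce-colours",
        "greyscale": "reduce-greys",
        "lineart": "threshold",
    }
    plan = []
    for region in sorted(regions, key=lambda r: r.order):
        if region.kind not in KINDS:
            raise ValueError(f"unknown region kind: {region.kind}")
        plan.append((actions[region.kind], region))
    return plan


# Made-up two-column page with a greyscale picture underneath.
page = [
    Region("text", (620, 100, 1150, 1500), order=2),
    Region("text", (60, 100, 590, 1500), order=1),
    Region("greyscale", (60, 1520, 1150, 1900), order=3),
]

for action, region in plan_page(page):
    print(action, region.box)
```

Keeping the reading order explicit is one way to record how columns interact,
so the OCR'd text comes out in the right sequence.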
> blemishes OCR'd into the letter O....
That's why I do the pre-process step; sometimes after I've scanned a page I'll
do some manual cleanup and/or remove any unwanted artifacts. Sometimes I
won't bother, though, and just wait to see how the OCR process copes. The
important thing is to save at a high quality (I've lost count of how many
bi-level scans I've seen out there that would need a *huge* amount of editing
before any OCR stage could happen).
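
As an illustration of the sort of automatic cleanup a pre-process step could
do, here is a minimal despeckle sketch. It assumes the page has already been
reduced to a bi-level grid (a list of rows, 1 = ink, 0 = paper); the function
name and the min_size threshold are made up for the example, and it is no
substitute for manual cleanup.

```python
def despeckle(img, min_size=3):
    """Return a copy of img with ink blobs smaller than min_size erased.

    img is a list of rows of 0/1 values (1 = ink). Blobs are groups of
    ink pixels joined horizontally or vertically (4-connectivity).
    """
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Collect one connected blob of ink using an explicit stack.
            blob, stack = [], [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and img[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Erase the blob if it is too small to be part of a glyph.
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 0
    return out


scan = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
cleaned = despeckle(scan, min_size=3)
```

The threshold needs care: set it too high and full stops and the dots on i's
disappear along with the blemishes.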
I've really not got much faith in the current generation of OCR progs though -
they're a lot better than they were, but not quite 'there' yet.
> Listings are the hardest part by far. I can recreate the text of
> an article in less than 20 minutes, but getting a working listing
> is a day's work.
Yep, and *everything* needs proof-reading. My motivation fails at that point
:) I think ideally I'd like to make the scans and OCR copy available online,
and have some system to let users compare and send in corrections - there's no
point me wasting effort proof-reading something that nobody's going to
download anyway :-)
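
A minimal sketch of how submitted corrections could be compared against the
OCR copy, using Python's standard difflib module. The function name and the
sample listing lines are made up for illustration; a real system would also
need some way to review and accept the changes.

```python
import difflib


def correction_report(ocr_text, corrected_text):
    """List the line ranges where a submitted correction differs.

    Returns tuples of (first OCR line number, OCR lines, corrected lines),
    with line numbers counted from 1.
    """
    ocr_lines = ocr_text.splitlines()
    fixed_lines = corrected_text.splitlines()
    matcher = difflib.SequenceMatcher(None, ocr_lines, fixed_lines)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((i1 + 1, ocr_lines[i1:i2], fixed_lines[j1:j2]))
    return changes


# Made-up listing fragment with a typical zero-for-O OCR slip.
ocr_copy = '10 PRINT "HELLO"\n20 G0TO 10\n'
submitted = '10 PRINT "HELLO"\n20 GOTO 10\n'

for line_no, before, after in correction_report(ocr_copy, submitted):
    print(line_no, before, "->", after)
```

Only the changed lines need a human to look at them, which cuts down the
proof-reading effort for each submission.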
cheers
Jules