Date   : Wed, 25 Jun 2008 09:30:40 -0500
From   : jules.richardson99@... (Jules Richardson)
Subject: The Micro User

Jonathan Graham Harston wrote:
> "Chris Thornley" wrote:
>> You don't have to do all these stages. OCR is really advanced these days
>> making this all very simple to do without all this tedious messing around.
>> Scanning, Layout and Proof reading built in. Preserving the layout or
>> selecting different types of final layout. OCR has moved on from what it
>> was in the past.
>  
> For OCRing it's just three stages: Scan, OCR, proofread. A
> computer can't proofread, it doesn't know what it's supposed to
> say.

Hmm, I'd suggest more like: scan, pre-process, OCR, proofread (and as you 
mention, there's a bit of optional hand-wavy layout stuff around the OCR* step).

* I was messing with some code last year where I could roughly draw the page 
layout on top of a scan (making a note of how columns interacted, where 
colour, greyscale and line-art images were etc.) so that I could just feed the 
data into a back-end for colour reduction, OCR and PDF generation.  I suspect 
the big expensive packages do all this already, but I was looking for a free 
Linux-based option.  Maybe I'll get around to finishing the code at some
point :-)
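
To give a rough idea of the sort of page description I mean, here's a minimal 
sketch in Python. It is not the code I was actually messing with; the Region 
class, the ACTIONS table and the plan() function are all names made up for 
illustration, and the "actions" are just labels standing in for whatever the 
real back-end would do.

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str       # "text", "colour", "grey" or "lineart"
    box: tuple      # (left, top, right, bottom) in scan pixels
    order: int = 0  # reading order, only meaningful for text columns

# Hypothetical mapping from region kind to a back-end job label.
ACTIONS = {
    "text": "ocr",
    "colour": "keep-colour",
    "grey": "reduce-to-greyscale",
    "lineart": "reduce-to-bilevel",
}

def plan(regions):
    """Return (action, box) jobs: text columns in reading order, then images."""
    text = sorted((r for r in regions if r.kind == "text"), key=lambda r: r.order)
    images = [r for r in regions if r.kind != "text"]
    return [(ACTIONS[r.kind], r.box) for r in text + images]

# A made-up page: one colour picture above the left column, two text columns.
page = [
    Region("colour", (50, 40, 400, 300)),
    Region("text", (420, 40, 780, 1100), order=2),
    Region("text", (50, 320, 400, 1100), order=1),
]
jobs = plan(page)
```

The point is only that once the regions are drawn by hand, the rest can be 
driven from a plain list of jobs.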


> blemishes OCR'd into the letter O....

That's why I do the pre-process step: sometimes after I've scanned a page I'll 
do some manual cleanup and/or remove any unwanted artifacts, though sometimes I 
won't bother and will wait to see how the OCR process copes. The important 
thing is to save at a high quality, though (I've lost count of how many bi-level 
scans I've seen out there that would need a *huge* amount of editing before 
any OCR stage could happen).
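
As an illustration of the kind of automatic cleanup that can stand in for some 
of the manual work, here's a minimal despeckle sketch in Python. It assumes a 
bi-level page held as a list of rows of 0 (paper) and 1 (ink), which is a 
simplification of my own and not how any particular OCR package stores scans; 
despeckle() is a made-up name. It only removes single isolated pixels, so 
bigger blemishes would still need attention by hand.

```python
def despeckle(img):
    """Clear ink pixels (1) that have no ink among their 8 neighbours."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # work on a copy, leave the scan untouched
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out

scan = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # lone speck
    [0, 0, 0, 1, 1],  # part of a stroke
    [0, 0, 0, 1, 0],
]
clean = despeckle(scan)
```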

I've really not got much faith in the current generation of OCR progs though - 
they're a lot better than they were, but not quite 'there' yet.

> Listings are the hardest part by far. I can recreate the text of
> an article in less than 20 minutes, but getting a working listing
> is a day's work.

Yep, and *everything* needs proof-reading. My motivation fails at that point 
:) I think ideally I'd like to make the scans and OCR copy available online, 
and have some system to let users compare and send in corrections - there's no 
point me wasting effort proof-reading something that nobody's going to 
download anyway :-)
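
For what it's worth, the "compare and send in corrections" part could start 
out very simply. Here's a rough sketch using Python's standard difflib module 
to pull out the lines a user changed in the OCR copy; the corrections() 
function and the sample listing are invented for illustration, and a real 
system would also have to cope with inserted and deleted lines, which this 
deliberately ignores.

```python
import difflib

def corrections(ocr_text, fixed_text):
    """List (line_no, old, new) for lines changed between the two copies."""
    old, new = ocr_text.splitlines(), fixed_text.splitlines()
    changes = []
    matcher = difflib.SequenceMatcher(None, old, new)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only handle like-for-like replacements; skip inserts and deletes.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for k in range(i2 - i1):
                changes.append((i1 + k + 1, old[i1 + k], new[j1 + k]))
    return changes

# Made-up listing with typical OCR confusions of O and 0.
ocr = '10 MODE 7\n2O PRINT "HELL0"\n30 END'
fixed = '10 MODE 7\n20 PRINT "HELLO"\n30 END'
found = corrections(ocr, fixed)
```

Each entry gives the line number in the OCR copy plus the before and after 
text, which is about the minimum needed to review a submitted fix.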

cheers

Jules