Date : Sat, 30 Jun 2001 01:05:58 +0200
From : Isabel Cisternas & Robert Schmidt <rschmidt@...>
Subject: Re: Wouter's vital doc scans 'n stuff
> Regarding magazine scans: doc is certainly not acceptable (bad bad
> Robert, for suggesting this!), rtf is also a monstrosity that should be
> nuked (html conversions of the rtf stuff in the bbc doc project
> please!). I can't read rtf on unix, haven't seen any decent conversion
> programs and I'm not going to start windoze to read rtf!
:-) I didn't suggest those formats because they were necessarily any
good - I plucked them from the top of my head as the most widespread
formats used for representing OCRed text. I should have made it more
obvious that I thought PDF would be the best choice of the lot.
> PDF also seems pretty much useless (separating out the text and putting
> back in the graphics is going to be a hell of a job). Just use the
> filing system and use small jpg's for navigating, big ones for detail,
> and make text files of all articles/listings etc.
Part of the charm of having magazine scans available electronically,
IMHO, is to have them rendered on the screen as close to the original
as possible. Being able to extract ASCII text (as in a properly OCRed
PDF) is very desirable, but secondary. Next come metadata and
hyperlinks: some time in the future, we may have a relatively complete
set of, say, vintage Acorn Users available as PDFs. Imagine the
coolness of being able to instantly jump to referenced articles in other
issues, or to click a keyword and be presented with a cross reference to
occurrences of the keyword in other issues, or even in other magazines.
Several shortcomings of PDF have now been pointed out, but to me it
still seems like the best (ultimate, almost) option.
> And JPEG 2000 is not yet a standard, so not an option either. I'm
> probably just going to use JPEG. Still doing some testing on what's best
> and the format etc.
JPEG2000 is very close to becoming a standard, and will, over time,
replace JPEG as we know it. LeadTools and Pegasus (major suppliers of
imaging code libraries)
recently announced full JPEG2000 support in their tools and libraries.
The samples I've seen are certainly impressive, but again, that's
photographic content. Using any kind of lossy compression on scanned
text worries me.
Cheers,
Robert