Date : Tue, 05 Nov 2002 20:04:26 GMT
From : pete@... (Pete Turnbull)
Subject: Re: Which format do you want BBC manuals in? RTF/HTML
On Nov 5, 10:51, Paul Wheatley wrote:
> Pete Turnbull wrote:
> > On Nov 4, 14:38, Paul Wheatley wrote:
>
> >>Have you considered LaTeX? That's a better archival format than PDF
> > That wouldn't be my first choice, as it's a pain to obtain all the
> > software and fonts and set them up, and PDF exists for more platforms
> > than LaTeX does. PDF is also better at handling graphics, more likely
> > to deal correctly with fonts (unless you go to a lot of trouble to get
> > decent LaTeX fonts), is inherently compressed and therefore more space
> > efficient, and usually easier to view.
> The crucial thing about a good archive format is that it's easy to get the
> data back out. With PDF that isn't the case.
Not necessarily true. PDF isn't "one format", it's a set of rules which
allow lots of things to be encoded into one file. It's possible to scan a
document, OCR the image, and include both the image (normally with TIFF 6
compression, which is pretty good) and the text in the same PDF file. Thus
you have a document which is both accurate in appearance, and searchable;
and from which the text can be extracted (at least, so I'm told -- I've
never had to try).
There have been a lot of discussions about this on the ClassicCmp mailing
list over the last few years, where there are lots of people doing a lot of
archive stuff (much more than for the BBC), and the consensus there seems
to be for PDF or flat ASCII (the latter because it makes very small files
that are easy to read on *anything*, including Beebs and old CP/M etc
systems).
As for the software, there are PDF readers for most platforms with graphics
capability (including RISC OS, Amiga OS, Mac OS, AIX, VMS, etc, not just
Unix and relations), and free software for a variety of systems that can
turn TIFF or PostScript or text and other sources into PDF. Ghostscript
can do it, if I remember correctly, and there are also some libraries like
pdflib that do it (admittedly most of the ones I know of run on Windows or
some flavour of Unix, but then so does most of the scanning and OCR
software).
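
For anyone who wants to try the Ghostscript route, here is a minimal sketch in Python of driving it to turn a PostScript file into PDF. The flags are standard Ghostscript options; the executable name "gs" and the file names are assumptions (on Windows the binary is usually called something else), and I haven't tuned any of the output settings.

```python
import subprocess


def ghostscript_pdf_command(source, target, gs="gs"):
    """Build a Ghostscript command line that writes `source` out as PDF.

    `gs` is the assumed name of the Ghostscript executable on this system.
    """
    return [
        gs,
        "-dBATCH",            # exit once the input has been processed
        "-dNOPAUSE",          # don't wait for a keypress between pages
        "-sDEVICE=pdfwrite",  # Ghostscript's PDF output device
        "-sOutputFile=" + target,
        source,
    ]


def convert_to_pdf(source, target):
    # Raises CalledProcessError if Ghostscript reports a failure.
    subprocess.run(ghostscript_pdf_command(source, target), check=True)
```

Calling convert_to_pdf("manual.ps", "manual.pdf") would then run the conversion, assuming Ghostscript is installed and on the path.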
PostScript is about as widely supported as PDF, but it's hard to
incorporate both the text and a matching TIFF (or whatever) image -- you
tend to get one or the other -- and it isn't always searchable (PostScript
sometimes splits up words where you might not expect it).
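
To illustrate the word-splitting problem, here is a small Python sketch. The PostScript fragment is made up for the example (it is the sort of thing a driver might emit when it kerns inside a word), and the helper names are mine, not part of any tool.

```python
import re

# Hypothetical fragment: the word "searchable" is split into two strings
# so that the gap between them can be adjusted with rmoveto.
PS_FRAGMENT = "(PDF is sea) show -15 0 rmoveto (rchable) show"


def naive_contains(ps_text, word):
    """What a grep-style search does: look for the word in the raw file."""
    return word in ps_text


def joined_strings(ps_text):
    """Concatenate the contents of every simple (...) string.

    This ignores escapes, nested parentheses, and any word spacing that
    is expressed by moving the current point rather than by a space.
    """
    return "".join(re.findall(r"\(([^()]*)\)", ps_text))
```

The naive search misses the word, while searching the joined strings finds it. A real PostScript file needs a proper interpreter to recover the text reliably, which is rather the point.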
LaTeX is less well supported, harder to search reliably (at least with
simple tools, again because of the embedded commands), and poor at images.
It's also hard to read the common Computer Modern fonts on most displays
-- they're meant for high-res output devices, not monitors. It's a good
choice for preserving the source of documents which are to be printed
(providing you remember to also preserve all the separate files that
contain macros, etc.), but not for browsing or on-screen display. A complete LaTeX
document of any complexity is hardly ever a single file. It's also worth
noting that it's losing ground to Word even for academic documents these
days (a seriously retrograde step, IMNSHO).
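
On the point about embedded commands getting in the way of simple searches, here is a rough Python sketch. The sample sentence and the helper name are invented for illustration, and the stripping is deliberately crude: it handles discretionary hyphens, simple control words and braces, and nothing else (no macros, comments, math or environments).

```python
import re

# Hypothetical LaTeX source line with a font command and
# discretionary hyphens (\-) inside two of the words.
LATEX_SOURCE = r"The \textbf{Mode} 7 dis\-play uses tele\-text characters."


def strip_simple_markup(tex):
    """Remove the most common markup so that a plain search can match."""
    tex = tex.replace(r"\-", "")                 # discretionary hyphens
    tex = re.sub(r"\\[A-Za-z]+\*?\s*", "", tex)  # control words such as \textbf
    return tex.replace("{", "").replace("}", "")
```

A plain search for "display" fails on the raw source but succeeds on the stripped text. Anything beyond this trivial case really needs the macro files as well, which is why they have to be preserved with the document.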
Of course, there's nothing to stop an archive from keeping things in two
formats, and some do. I've seen places where the flat-ASCII content is
stored alongside the PDF.
--
Pete
Peter Turnbull
Network Manager
University of York