past & future
formats: hypertext & images
This page considers some electronic publishing formats.
It covers -
- some questions about online publishing formats
- from SGML to HTML and XML
- a mechanism for 'print faithfulness'
- a scholarly publishing tool
- bugbear or bright idea
- GIF, JPEG, BMP, TIFF, PICT, MNG
As we note in our network
guide, the internet is a network of networks: in practice
a broad set of standards that allows information to be
exchanged by machines across the globe.
The web is one part of the net, based on a subset of standards
that allow users to readily identify and navigate within/from
electronic documents and allow publishers to incorporate
a mix of text, graphics and audiovisual content such as
video or sound recordings.
Those standards are broad, as they have to be to accommodate
significant variations in the 'pipelines' that transfer
information, the performance characteristics of different
machines and the behaviour of different software. Different
browsers display the same information differently (and in some cases don't
display it at all well).
The web is based on derivatives of Standard Generalized
Markup Language (SGML), a text annotation protocol originally
developed for offline publishing. In essence, SGML describes
the relationship between a document's content and its
structure. It has been adopted as a standard (ISO 8879) by the
International Organization for Standardization. Tony Hicks's article
Should We Be Using ISO 12083? offers a concise
introduction. There is more detail in SGML on
the Web: Small Steps Beyond HTML (Upper Saddle River:
Prentice-Hall 1997) by Yuri Rubinsky & Murray Maloney.
Many of the documents currently available on the web have
been prepared using Hypertext Markup Language (HTML),
based on SGML. There is a plain English guide
at the World Wide Web Consortium (W3C) site; other information
is found in our Design guide.
HTML allows publishers to broadly specify how information
is displayed online and how it is structured, for example
enabling users to navigate within individual documents
or to other sites using hyperlinks.
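As a rough illustration of how markup exposes both structure and navigation, the sketch below uses Python's standard html.parser module to pull the headings and hyperlinks out of a small page. The sample page, the OutlineParser class and the outline function are our own inventions for this example, not part of any standard.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collect heading text (structure) and link targets (navigation)."""

    def __init__(self):
        super().__init__()
        self.headings = []        # text of each heading element
        self.links = []           # href value of each <a> element
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


# Hypothetical sample page: one link within the document, one to another site.
SAMPLE = """
<html><body>
<h1 id="top">Formats</h1>
<p>See the <a href="#pdf">PDF section</a> or the
<a href="https://www.w3.org/">W3C site</a>.</p>
<h2 id="pdf">PDF</h2>
</body></html>
"""


def outline(html_text):
    """Return (headings, links) found in html_text."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.headings, parser.links
```

Calling outline(SAMPLE) returns the two headings and the two link targets: the "#pdf" link points within the same document, the other to a separate site.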
Many commercial/professional sites now use Extensible
Markup Language (XML), which can be generated from specialist
databases and accommodates specialist dialects used for
example in data exchange between manufacturers. As with
any dialect, there are concerns about compatibility ...
something that may impede growth of B2B (or merely encourage
the involvement of government competition watchdogs).
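A minimal sketch of generating XML from database-style records follows, using Python's standard xml.etree.ElementTree module. The records, the element names (catalogue, item, name, price) and the rows_to_xml function are hypothetical; a real data-exchange dialect would define its own vocabulary, which is exactly where the compatibility concerns arise.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, standing in for the result of a database query.
ROWS = [
    {"sku": "A-100", "name": "Widget", "price": "4.50"},
    {"sku": "B-200", "name": "Gadget & Co", "price": "12.00"},
]


def rows_to_xml(rows):
    """Serialise each row as an <item> element inside a <catalogue> root."""
    catalogue = ET.Element("catalogue")
    for row in rows:
        item = ET.SubElement(catalogue, "item", sku=row["sku"])
        ET.SubElement(item, "name").text = row["name"]
        ET.SubElement(item, "price").text = row["price"]
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(catalogue, encoding="unicode")
```

The library escapes reserved characters on output, so the ampersand in "Gadget & Co" is written as an entity and the result remains well-formed for whichever system receives it.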
HTML - currently the lingua franca of the web - accommodates
a variety of graphic formats: TIFF, GIF, JPEG, PNG and
so forth. The W3C and specialist bodies are actively exploring
particular enhancements, eg for identifying and describing
documents using metadata,
and are considering proposals that XHTML
(a reformulation of HTML 4 in XML) become the new global standard.
Adoption of XHTML appears likely, given consumer/commercial
interest in a more powerful generation of HTML and ongoing
enhancement of the networks (eg bigger pipes, more powerful
machines), but is likely to take some time. Thom Lieb
provided a concise introduction in his article
on The X(HTML) Files.
The outstanding bibliography about XML for publishing
is Robin Cover's SGML/XML Bibliography (Cover)
as part of his XML Cover Pages
project. It is a worthy complement to Bailey's Scholarly
Electronic Publishing Bibliography, noted elsewhere
in this guide.
Conformance with HTML varies. Dagfinn Parnas' 2001 Master's
thesis on the parsing of incorrect HTML (PDF)
examined a sample of 2.4 million URIs and suggests that
a large number were formally invalid. As with a spoken
language, much information can be conveyed despite breaches
of formal rules: most of those pages displayed adequately.
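The difference between tolerant and strict handling of markup can be sketched with two parsers from Python's standard library. This is only an illustration under our own assumptions: the broken fragment is invented, and html.parser is far simpler than the error recovery built into a real browser.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextCollector(HTMLParser):
    """Tolerant parser: gather whatever text it can find."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


# Invented fragment: unclosed <p> and <html>, and <b> closed by </i>.
BROKEN = "<html><body><p>First paragraph<p>Second <b>bold</i> text</body>"


def lenient_text(markup):
    """Return the readable text despite breaches of the formal rules."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.chunks)


def is_well_formed(markup):
    """Strict check: does the markup parse as well-formed XML?"""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

The tolerant parser still recovers the text of the broken fragment, while the strict XML parser rejects the same input outright.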
SGML cannot provide a facsimile of a printed page
- ie the same fonts, layout and colours of a corporate
brochure, annual report or technical publication - and
can only feature charts if they're converted into an image
(eg a GIF).
Much of the time that faithfulness to print is unnecessary:
variations in font and proportions are of less importance
than immediate access to the information.
For those instances where a facsimile is required, the
Portable Document Format (PDF)
- a proprietary standard created by software developer
Adobe Systems - allows publishers to present a publication
(generally prepared using PostScript-based desktop publishing
software) onscreen or as a printout with the same appearance
as the original.
PostScript is a 'page description language' for personal
computers, desktop printers and imagesetters (high-resolution
output devices used for professional output) that describes
a page in terms of all the elements that appear on it,
including the placement of graphics and spacing of fonts.
PDF essentially freezes the original publication by fixing
the position of the text, fonts and graphics on every page
(a document may run from one page to thousands of pages).
Because fonts and graphics are typically embedded, a PDF is
often significantly slower than HTML to download - as a result
we recommend that long documents be broken into bite-sized
chunks - and cannot be automatically generated from a database
in the same way as XML.
However, it is attractive to many publishers, particularly
if used as an adjunct to an SGML version. It is now possible
to include hypertext links within PDFs and for search
engines to index constituent text. PDF is not an appropriate
tool for all markets and the devices used by many visually
disabled people unfortunately will not translate PDF documents.
PDF Reference: Adobe Portable Document Format Version
1.3 (Boston: Addison-Wesley 2000) is the definitive
guide to PDF. Apart from Adobe's site, we recommend the
which features an interview
with PDF creator John Warnock and his 1991 essay
on "Camelot", the technology that became PDF. John Ockerbloom's
DigiNews paper Archiving & Preserving PDF Files
offers a lucid introduction to long-term access questions.
Adobe's Online PDF-to-HTML tool is here.
The Text Encoding Initiative (TEI)
is an international project to develop guidelines for
the encoding of textual material in electronic form for
research purposes. The TEI Guidelines are online.
Text encoding is not centred on making electronic texts
findable or readable by people, although that is the major
byproduct. Instead, it seeks to ensure they're readable
by machines, so that a text can be searched, configured
and analysed in different ways.
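To suggest what machine-readable encoding makes possible, the sketch below parses a small encoded passage and analyses it in two ways. It is heavily simplified: the passage and the people in it are invented, and although persName and placeName are TEI element names, a conforming TEI document would also need a header and the TEI namespace, both omitted here.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented, simplified passage using TEI-style element names.
SAMPLE = """
<text>
  <body>
    <p><persName>Alice Example</persName> wrote from
       <placeName>Sydney</placeName> to <persName>Bob Sample</persName>.</p>
    <p><persName>Bob Sample</persName> replied from
       <placeName>Canberra</placeName>.</p>
  </body>
</text>
"""


def count_names(xml_text, tag):
    """Count how often each value of the given element occurs."""
    root = ET.fromstring(xml_text)
    return Counter(element.text for element in root.iter(tag))


def plain_text(xml_text):
    """Strip the markup and normalise whitespace for human reading."""
    root = ET.fromstring(xml_text)
    return " ".join("".join(root.itertext()).split())
```

The same encoded text yields a tally of the people mentioned (count_names(SAMPLE, "persName")), a tally of places, or a plain reading copy (plain_text(SAMPLE)), without anyone re-keying it.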
Flash is a proprietary software product (from Macromedia)
for the web that supports animated vector graphics.
Unlike HTML, Flash is essentially independent of browsers:
most users with a version 3+ browser can see a nearly
identical version to that seen by other users. Unlike
HTML it is 'closed source' - to create a Flash animation
you need Macromedia's authoring software and to view Flash
you need a browser that's bundled with the viewing software
(the 'rendering engine' or 'plugin') or to download that
software separately.
The Design and Accessibility
guides elsewhere on this site highlight usability concerns:
animations and video have a place on the web but enthusiasts
often overlook questions about navigation or patience
(a particular issue with the elaborate splash pages found
on many corporate sites).
Different formats for still images, animations and moving
images have had varying degrees of acceptance. A supplementary
note considers the nature
and adoption of the GIF, JPEG, PICT, TIFF, MNG, PNG and