Caslon Analytics: Publishing guide




formats: hypertext & images

This page considers some electronic publishing formats.

It covers -

  • introduction - some questions about online publishing formats
  • HTML - from SGML to HTML and XML
  • PDF - a mechanism for 'print faithfulness'
  • TEI - a scholarly publishing tool
  • Flash - bugbear or bright idea
  • images - GIF, JPEG, BMP, TIFF, PICT, MNG

introduction

As we note in our network guide, the internet is a network of networks: in practice a broad set of standards that allows information to be exchanged by machines across the globe. 

The web is one part of the net, based on a subset of standards that allow users to readily identify and navigate within/from electronic documents and allow publishers to incorporate a mix of text, graphics and audiovisual content such as video or sound recordings.

Those standards are broad, as they have to be to accommodate significant variations in the 'pipelines' that transfer information, the performance characteristics of different machines and the behaviour of different software. Different machines/software display the same information differently (and in some cases don't display it at all well).

The web is based on derivatives of Standard Generalized Markup Language (SGML), a text annotation protocol originally developed for offline publishing. In essence, SGML describes the relationship between a document's content and its structure. It is an international standard (ISO 8879) of the International Organization for Standardization. Tony Hicks's article Should We Be Using ISO 12083? offers a concise introduction. There is more detail in SGML on the Web: Small Steps Beyond HTML (Upper Saddle River: Prentice-Hall 1997) by Yuri Rubinsky & Murray Maloney.

HTML

Many of the documents currently available on the web have been prepared using Hypertext Markup Language (HTML), which is based on SGML. There is a plain English guide at the World Wide Web Consortium (W3C) site; further information is in our Design guide.

HTML allows publishers to broadly specify how information is displayed online and how it is structured, for example enabling users to navigate within individual documents or to other sites using hyperlinks. 
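
As an illustration of that structure, here is a minimal sketch (in Python, using only the standard library) that reads a small page and lists its hyperlinks, separating links within the document from links to other sites. The sample page, its URL and the class name LinkCollector are our own inventions for the example, not part of any standard.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page: one link within the document, one to another site.
SAMPLE_PAGE = """
<html><body>
<h1 id="top">Formats</h1>
<p>Jump to the <a href="#pdf">PDF section</a> or visit the
<a href="http://www.example.org/">example site</a>.</p>
<h2 id="pdf">PDF</h2>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)

# Links starting with '#' point within the same document.
internal = [link for link in collector.links if link[0].startswith("#")]
external = [link for link in collector.links if not link[0].startswith("#")]
```

The markup says what each piece of content is (a heading, a paragraph, a link) rather than exactly how it should look, which is why software such as this can navigate it.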

Many commercial/professional sites now use Extensible Markup Language (XML), which can be generated from specialist databases and accommodates specialist dialects used, for example, in data exchange between manufacturers. As with any dialect, there are concerns about compatibility ... something that may impede growth of business-to-business (B2B) commerce (or merely encourage the involvement of government competition watchdogs). 
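
Generating XML from a database can be sketched in a few lines. The example below is an assumption-laden illustration rather than any publisher's actual system: the table, its columns, the sample records and the element names (catalogue, title, name, price) are all hypothetical, and it uses an in-memory SQLite database from the Python standard library.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical catalogue held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (isbn TEXT, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("0000000001", "First Title", 29.95),
     ("0000000002", "Second & Final", 45.0)],
)


def catalogue_to_xml(connection):
    """Turn every row of the titles table into a <title> element."""
    root = ET.Element("catalogue")
    rows = connection.execute(
        "SELECT isbn, title, price FROM titles ORDER BY isbn"
    )
    for isbn, title, price in rows:
        item = ET.SubElement(root, "title", isbn=isbn)
        ET.SubElement(item, "name").text = title
        ET.SubElement(item, "price", currency="AUD").text = f"{price:.2f}"
    # Special characters such as '&' are escaped by the serialiser.
    return ET.tostring(root, encoding="unicode")


xml_text = catalogue_to_xml(conn)
```

Two organisations exchanging such data would need to agree on the element names and their meaning, which is the compatibility question noted above.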

HTML - currently the lingua franca of the web - accommodates a variety of graphic formats: TIFF, GIF, JPEG, PNG and so forth. The W3C and specialist bodies are actively exploring particular enhancements, eg for identifying and describing documents using metadata, and are considering proposals that XHTML (HTML 4 reformulated as XML) become the new global standard. 

Adoption of XHTML appears likely, given consumer/commercial interest in a more powerful generation of HTML and ongoing enhancement of the networks (eg bigger pipes, more powerful machines), but is likely to take some time. Thom Lieb provided a concise introduction in his article on The X(HTML) Files.

The outstanding bibliography about XML for publishing is Robin Cover's SGML/XML Bibliography, part of his XML Cover Pages project. It is a worthy complement to Bailey's Scholarly Electronic Publishing Bibliography, noted elsewhere in this guide.

Conformance with HTML varies. Dagfinn Parnas' 2001 Master's thesis on the parsing of incorrect HTML (PDF) examined a sample of 2.4 million URIs and suggested that a large number were formally invalid. As with a spoken language, much information can be conveyed despite breaches of formal rules: most of those pages displayed adequately.
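
The point about tolerance can be shown with a rough sketch. The Python below is not a validator and is not the method used in the thesis; it is a crude nesting check of our own (the class name, the sample markup and the wording of the messages are hypothetical) that is stricter than HTML itself, which permits some end tags to be omitted. It recovers the text of a badly nested fragment while noting the breaches.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML 4.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img",
             "input", "link", "meta", "param"}


class LenientChecker(HTMLParser):
    """Recover text from markup while noting badly nested tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # notes on breaches of strict nesting
        self.text = []        # text recovered regardless of problems

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Anything opened after `tag` was never closed.
            while self.open_tags[-1] != tag:
                self.problems.append(f"<{self.open_tags.pop()}> not closed")
            self.open_tags.pop()
        else:
            self.problems.append(f"</{tag}> has no matching start tag")

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

    def close(self):
        super().close()
        # Report whatever is still open at the end of the input.
        while self.open_tags:
            self.problems.append(f"<{self.open_tags.pop()}> not closed")


# Hypothetical fragment: <i> and <b> overlap instead of nesting.
SAMPLE = "<p>Some <b>bold <i>text</b> here</i></p>"

checker = LenientChecker()
checker.feed(SAMPLE)
checker.close()
recovered = " ".join(checker.text)
```

The fragment breaks the nesting rules twice, yet all of its words are recovered, much as a browser would still display them.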

Online authoring tutorials include -

PDF

HTML cannot provide a facsimile of a printed page - ie the same fonts, layout and colours of a corporate brochure, annual report or technical publication - and can only feature charts if they're converted into an image (eg a GIF).

Much of the time that faithfulness to print is unnecessary: variations in font and proportions are of less importance than immediate access to the information. 

For those instances where a facsimile is required, the Portable Document Format (PDF) - a proprietary standard created by software developer Adobe Systems - allows publishers to present a publication (generally prepared using PostScript-based desktop publishing software) onscreen or as a printout with the same appearance as print.

PostScript is a 'page description language' for personal computers, desktop printers and imagesetters (high-resolution output devices used for professional output) that describes a page in terms of all the elements that appear on it, including the placement of graphics and spacing of fonts.
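
To give a feel for what describing a page means, the sketch below uses Python to write out a very small PostScript program that places lines of text at fixed positions. The function name, default font, margins and sample text are our assumptions for the example; the PostScript operators it emits (findfont, scalefont, setfont, moveto, show, showpage) are standard ones, and positions are measured in points (1/72 inch) from the bottom left corner of the page.

```python
def postscript_page(lines, font="Times-Roman", size=12,
                    left=72, top=720, leading=14):
    """Return a one-page PostScript program that places each line of text."""

    def escape(text):
        # Backslashes and parentheses are special inside PostScript strings.
        return (text.replace("\\", "\\\\")
                    .replace("(", "\\(")
                    .replace(")", "\\)"))

    out = ["%!PS", f"/{font} findfont {size} scalefont setfont"]
    y = top
    for line in lines:
        # Move to an exact position on the page, then paint the text there.
        out.append(f"{left} {y} moveto ({escape(line)}) show")
        y -= leading
    out.append("showpage")
    return "\n".join(out) + "\n"


# Hypothetical content for illustration.
program = postscript_page(["Annual Report", "Section 1 (draft)"])
```

Every element is given an explicit font and position, so the output looks the same on any device that interprets the program, which is the 'print faithfulness' that HTML does not attempt.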

PDF essentially freezes the original publication as a fixed-layout file (which depending on the length of the document ranges from one page to thousands of pages). Unlike a simple scanned image, that file can retain the text, fonts and graphics of each page.

Because it carries those fonts and graphics, a PDF is typically larger and slower to download than the equivalent HTML - as a result we recommend that long documents be broken into bite-sized chunks - and it is less readily generated automatically from a database than XML. 

However, it is attractive to many publishers, particularly if used as an adjunct to an HTML version. It is now possible to include hypertext links within PDFs and for search engines to index constituent text. PDF is not an appropriate tool for all markets: the devices used by many visually disabled people unfortunately often cannot translate PDF documents into speech.

PDF Reference: Adobe Portable Document Format Version 1.3 (Boston: Addison-Wesley 2000) is the definitive guide to PDF. Apart from Adobe's site, we recommend the independent PlanetPDF, which features an interview with PDF creator John Warnock and his 1991 essay on "Camelot", the technology that became PDF. John Ockerbloom's DigiNews paper Archiving & Preserving PDF Files offers a lucid introduction to long-term access questions.

Adobe offers an online PDF-to-HTML conversion tool.

TEI

The Text Encoding Initiative (TEI) is an international project to develop guidelines for the encoding of textual material in electronic form for research purposes. The TEI Guidelines are online.

Text encoding is not centred on making electronic texts findable or readable by people, although that is the major byproduct. Instead, it seeks to ensure they're readable by machines, so that a text can be searched, configured and analysed in different ways. 
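
A small sketch may make that concrete. The fragment below is loosely modelled on TEI-style markup (a name element with a type attribute); the sample text is invented, and the Guidelines should be consulted for the actual element set. Once names are marked up, a few lines of Python can count people and places or strip the markup to give plain reading text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample text, encoded so that names are identifiable by machine.
SAMPLE = """<text><body>
<p><name type="person">A. Writer</name> posted the draft from
<name type="place">Sydney</name>.</p>
<p><name type="person">A. Writer</name> revised it in
<name type="place">Canberra</name>.</p>
</body></text>"""

root = ET.fromstring(SAMPLE)


def names_of(kind):
    """Count the names of one kind ('person' or 'place') in the text."""
    return Counter(n.text for n in root.iter("name") if n.get("type") == kind)


people = names_of("person")
places = names_of("place")

# The same encoded text can also be configured as plain reading text.
plain_paragraphs = [" ".join("".join(p.itertext()).split())
                    for p in root.iter("p")]
```

The same encoded source supports both uses: analysis (who and where, and how often) and presentation (the paragraphs as a person would read them).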

Flash

Flash is a proprietary software product (from Macromedia) for the web that supports animated vector graphics.

Unlike HTML, Flash is essentially independent of browsers: most users with a version 3+ browser can see a nearly identical version to that seen by other users. Unlike HTML it is 'closed source' - to create a Flash animation you need Macromedia's authoring software, and to view Flash you need a browser that's bundled with the viewing software (the 'rendering engine' or 'plugin') or must separately download the plugin.

The Design and Accessibility guides elsewhere on this site highlight usability concerns: animations and video have a place on the web but enthusiasts often overlook questions about navigation or patience (a particular issue with the elaborate splash pages found on many corporate sites).

images

Different formats for still images, animations and moving images have had varying degrees of acceptance. A supplementary note considers the nature and adoption of the GIF, JPEG, PICT, TIFF, MNG, PNG and BMP formats.




version of September 2004
© Bruce Arnold