Metadata profile

online metadata

This page looks at use of metadata on the web.

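It covers the standards question, who is using metadata, where it comes from, how it is associated with resources, whether it matters, and the future.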

the standards question

Internet engineering and standards bodies have not mandated detailed standards for metadata. That means, for example, that there's no standardized terminology and thesaurus (one reason why many librarians look at the web askance). 

Essentially, in developing the web, provision was made for the inclusion of metadata within pages/sites, allowing descriptive and other information to be embedded in each page among the 'invisible' code.
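As a rough illustration of embedded metadata, the sketch below reads the name/content pairs that a page carries in its META elements. It uses only Python's standard html.parser module. The sample page, the class name MetaTagExtractor and the function name extract_metadata are invented for illustration and are not taken from any site or tool discussed here.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect name/content pairs from <meta> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


# Invented page, for illustration only.
SAMPLE_PAGE = """
<html>
<head>
<title>Online metadata</title>
<meta name="description" content="Use of metadata on the web">
<meta name="keywords" content="metadata, Dublin Core, search engines">
</head>
<body><p>Visible text of the page.</p></body>
</html>
"""


def extract_metadata(html_text):
    """Return a dict of the META names and contents found in html_text."""
    parser = MetaTagExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


if __name__ == "__main__":
    for name, content in extract_metadata(SAMPLE_PAGE).items():
        print(f"{name}: {content}")
```

None of this information is shown to a visitor by the browser, which is why the text above calls it 'invisible'.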

Provision was also made for construction of search engines and other tools to point to web pages, drawing on the embedded metadata or using their own metadata about those pages (eg information derived from parsing the text on each page rather than metadata supplied by the page's author). 

That has had several results -

There is disagreement among specialist users about development of specific standards for the structuring and expression of embedded metadata. (Competing and complementary standards from librarians, museum curators, informatics specialists and others include the Dublin Core, AAT, CSDGM, GIS, CGIS-SAIF, Resource Description Framework and Warwick Framework.) 

There is similar disagreement about content rating metadata (such as PICS, used in censorship or content management schemes). As Charles Thomas & Linda Griffin note in their First Monday article on Who Will Create The Metadata For The Internet?, although there are commercial incentives for effective metadata, the various schemes have to break out of the silicon ghetto.

The wide range of search engines and directories produces different results. There are now at least 2,000 search engines, although most traffic goes to the top 11 such as Yahoo! and Google.

who is using metadata?

Most web pages (and probably most sites) don't have descriptive metadata. Some studies suggest that only 34% have any 'meaningful' metadata and that much metadata is not relevant to the particular site.

Fewer than 0.3% of sites (and thus a much smaller fraction of the 'deep web' described in the metrics guide elsewhere on this site) use Dublin Core (DC) metadata, described in the following page of this profile. As Jane Greenberg & associates note in a paper on Author-Generated Dublin Core Metadata for Web Resources: A Baseline Study in an Organization, there are questions about the application of DC by non-specialists. Use of DC and other metadata sets is concentrated in

  • sites owned by government agencies, academic and cultural institutions
  • higher-level pages, particularly index pages and subdirectory pages
  • major corporate intranets (especially those involving specialist audiences and created using advanced content management systems) rather than SME or 'domestic' internet sites

Few major search engines rely on metadata supplied by the owners of sites. One industry figure quoted in Search Engine Watch comments

search engines do not trust metadata. It's fine to talk about how nice it would be if all web pages were categorized, but the search engines know from experience that people will lie, mislead or do whatever they can to get on top.

The 2003 paper Trends in the Evolution of the Public Web 1998-2002 by Edward O'Neill, Brian Lavoie & Rick Bennett commented on trends in metadata usage on the 'surface web' (ie publicly-accessible sites) from 1998 to 2002, suggesting that

it seems clear that metadata usage is on the rise: steady increases in the percent of public Web sites containing metadata on the home page (where metadata usage is most common) are observed throughout the five-year period. Similar increases were observed in the percentage of all Web pages harvested from public sites that contained some form of metadata. One caveat should be mentioned, however: with the advent of more sophisticated HTML editors, some META tags are created and populated automatically as part of the document template. It is likely that this accounts for at least part of the perceived increase in META tag usage on public Web sites.

A second interesting feature about metadata usage on the Web is that, apparently, it is not becoming more detailed. If it is assumed that one META tag is equivalent to one metadata element, or piece of descriptive information about the Web resource, then it is clear that, on average, Web pages that include metadata contain about two or three elements. Clearly, there is no widespread movement to include detailed description of Web resources on the public Web.

A discouraging aspect of metadata usage trends on the public Web over the last five years is the seeming reluctance of content creators to adopt formal metadata schemes with which to describe their documents. For example, Dublin Core metadata appeared on only 0.5 percent of public Web site home pages in 1998; that figure increased almost imperceptibly to 0.7 percent in 2002. The vast majority of metadata provided on the public Web is ad hoc in its creation, unstructured by any formal metadata scheme.
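The measures quoted above (the share of pages with any metadata, the average number of elements on pages that have some, and the share using Dublin Core) can be sketched as a small calculation. The input is assumed to be a list of META names already harvested from each page. The sample data are invented, the function name summarise_meta_usage is hypothetical, and treating a name that begins with "DC." as Dublin Core is a common convention that we assume here; the paper's own detection method is not described on this page.

```python
def summarise_meta_usage(pages):
    """Summarise META usage.

    pages: list of lists, one inner list of META names per page.
    Following the quoted paper, one META tag is counted as one element.
    """
    total = len(pages)
    with_metadata = [names for names in pages if names]
    # Assumption: Dublin Core elements are named with a "DC." prefix.
    with_dublin_core = [
        names for names in pages
        if any(name.lower().startswith("dc.") for name in names)
    ]
    mean_elements = (
        sum(len(names) for names in with_metadata) / len(with_metadata)
        if with_metadata else 0.0
    )
    return {
        "share_with_metadata": len(with_metadata) / total if total else 0.0,
        "mean_elements_per_described_page": mean_elements,
        "share_with_dublin_core": len(with_dublin_core) / total if total else 0.0,
    }


# Invented sample of four pages, for illustration only.
SAMPLE_PAGES = [
    ["description", "keywords"],
    [],
    ["generator"],
    ["DC.title", "DC.creator", "DC.date"],
]

if __name__ == "__main__":
    for measure, value in summarise_meta_usage(SAMPLE_PAGES).items():
        print(f"{measure}: {value:.2f}")
```

Note that a page whose only element is an automatically inserted 'generator' tag still counts as having metadata, which is the caveat the authors raise about HTML editors.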

where does it come from?

In practice metadata about a page originates in two ways.

The creator of the page can embed metadata when constructing (or amending) the page. 

Some software used in building sites will automatically generate such metadata, albeit crudely. By contrast, we have manually developed the metadata for most pages on this site.

Manual development and application poses two problems.

The first is that many authors are simply unaware of the existence of metadata. Others are uncertain about its nature - what is it, where does it go, what terms to use - or see it as an afterthought rather than integral to electronic publishing. Some see no benefits justifying inclusion of metadata. Others are repelled by the technical literature on particular metadata sets or even by the 'true religion' approach of some metadata zealots.

A second problem is that effective metadata requires understanding of the particular metadata set and of the content of a page, document or major structural component within a site. Some authors, through lack of training, time or interest, create and apply junk metadata.

Metadata need not be created by a page/document's author. An alternative way is the creation of metadata about the page by an unrelated entity, ie by something/someone that visits the page rather than by the page's owner. 

Many search engines use 'robots' or 'spiders' to visit pages, look for significant terms within the text and either incorporate that information in the database that fuels the search engine or flag that the page has objectionable content. Other engines and directories use humans to examine the pages and create the metadata, which may be held as a separate but related resource.
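A very simplified sketch of how a robot might derive its own metadata from the text of a page, rather than trusting the author's, is given below. It strips the markup, ignores script and style content, drops a few common words and counts what remains. The stopword list, the sample page and the names TextCollector and significant_terms are all invented for illustration; real engines use far more elaborate methods, which are not documented here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A deliberately tiny, illustrative stopword list.
STOPWORDS = {"the", "and", "for", "from", "with", "that", "this", "are"}


class TextCollector(HTMLParser):
    """Gather text outside <script> and <style> (title text is included)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def significant_terms(html_text, limit=5):
    """Return the most frequent non-trivial words as (term, count) pairs."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()
    words = re.findall(r"[a-z]+", " ".join(collector.chunks).lower())
    counts = Counter(
        word for word in words if len(word) > 2 and word not in STOPWORDS
    )
    return counts.most_common(limit)


# Invented page, for illustration only.
SAMPLE_PAGE = (
    "<html><head><title>Owls</title><style>p {color: red}</style></head>"
    "<body><p>Owls hunt at night. Owls eat mice.</p>"
    "<p>Mice hide from owls.</p></body></html>"
)

if __name__ == "__main__":
    print(significant_terms(SAMPLE_PAGE, limit=2))
```

The resulting list of terms is metadata about the page in the sense used above: it is held by the engine, not by the page's owner.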

and how is it associated?

Metadata Principles & Practicalities, a 2002 paper by Erik Duval, Wayne Hodgins, Stuart Sutton & Stuart Weibel, notes that there are several ways to associate metadata with resources:

Embedded metadata resides within the markup of the resource and is created at the time that the resource is created, often by the author. "Experts differ concerning whether author-created metadata is best or whether it is better to have trained practitioners evaluate and describe resources. As a practical matter, resource description expertise is a scarce and costly commodity, and thus any investment by authors in the description of their intellectual products is likely to be of value. Embedded metadata can also be harvested ... the presumptive increase in visibility that might result is an incentive for creators to assign metadata".

Associated metadata is maintained in files tightly coupled to the resources they describe and may or may not be harvestable (eg using SOIF). The advantage of associated metadata derives from the relative ease of managing the metadata without altering the content of the resource itself, but this benefit is purchased at the cost of simplicity, necessitating the co-management of resource files and metadata files.

Third-Party metadata is maintained in a separate repository by an organization that may or may not have direct control over or access to the content of the resource, typically in a database that is not accessible to harvesters, eg through an MCF scheme.

Weibel notes the confusion of syntax issues and association models, with a tendency to assume HTML-based metadata is equivalent to embedded metadata.

His paper also notes that a given information resource will sometimes have multiple metadata records that reflect different purposes and the roles of the organizations that create/manage those records.

A resource may be created with embedded metadata supplied by the author. A separate record might be created by the issuing organization (an academic department or publisher, for example) and stored in a separate database. A third party (perhaps a library) might create yet another version of metadata, either from scratch or derived from a previous record. In most cases these records will not be managed in a coordinated way, and differences may arise among them that may cause ambiguity or confusion. This may be less than ideal, but must be expected in an environment where various organizations may choose to manage resource descriptions with different objectives.
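The situation described in that passage, where several uncoordinated records exist for one resource, can be illustrated with a small comparison. The three records, their field names and the function name find_conflicts are invented for illustration; they do not follow any particular metadata scheme.

```python
def find_conflicts(records):
    """Return the fields on which the records disagree.

    records: mapping of source name -> dict of field -> value.
    A field held by only one source, or with the same value everywhere,
    is not reported.
    """
    values_by_field = {}
    for source, record in records.items():
        for field, value in record.items():
            values_by_field.setdefault(field, {})[source] = value
    return {
        field: by_source
        for field, by_source in values_by_field.items()
        if len(set(by_source.values())) > 1
    }


# Invented records for a single resource, for illustration only.
RECORDS = {
    "author": {"title": "Owls of Tasmania", "creator": "J. Smith", "date": "2003"},
    "publisher": {"title": "Owls of Tasmania", "creator": "Smith, Jane", "date": "2003"},
    "library": {"title": "Owls of Tasmania", "creator": "Smith, Jane", "subject": "Strigiformes"},
}

if __name__ == "__main__":
    for field, by_source in find_conflicts(RECORDS).items():
        print(field, by_source)
```

Here only the creator field is reported, because the author and the two organisations have written the same name in different forms. That is the kind of ambiguity the paper expects when records are managed separately.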

does it matter?

As you might expect, there's disagreement about what matters. 

It is clear that most search engines ignore metadata embedded by creators. 

More broadly, many sites will never rank highly on search engines. Their owners should concentrate on driving traffic to them in other ways.

On the other hand, in parts of the web and intranets - such as libraries, image archives and bodies dealing with geospatial information - there is agreement about use of metadata and about specific standards, for example Dublin Core. Comprehensive application of such standards reflects institutional priorities (eg government-wide mandates or user demands for sophisticated resource identification), resource stability (is it worth applying state-of-the-art descriptors to a page that will be offline within a week?) and the ability of organisations to make a major commitment to data application and associated retrieval tools.

Consistent use of metadata schemes, often as a consequence of the management of information within each body's databases, facilitates information exchange outside the web and, for example, the operation of 'gateways' or sectoral search engines that provide seamless access to the holdings of a group of museums.


Preservation Metadata for Digital Objects: A Review of the State of the Art (PDF) is a concise overview by the US Research Libraries Group of competing preservation metadata initiatives such as the Open Archival Information System (OAIS) and CURL Exemplars in Digital Archives (CEDARS).

and the future?

The idea of a standard set of terms and phrases as the basis for online resource identification has been seductive to librarians and information scientists but has not found significant acceptance among most site creators and search engine/directory developers. Two assumptions have impeded past online metadata initiatives.

What one observer characterised as the "technological legacy of knowledge representation" assumes the existence of "a class of disinterested information workers (i.e., librarians)" responsible for comprehensive and systematic subject cataloguing.

However, that class has little clout online. Businesses, organisations and individuals can mark up their pages as they please. There are few legal constraints or community norms to prevent the use of 'false' metadata, with the result that few search engines rely on metadata because the unscrupulous will 'spoof' ('spamdex') the search results.
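One crude way to see why engines distrust author-supplied keywords is to compare them with the text a visitor actually sees. The sketch below flags declared keywords whose words never appear in the body text. It is an illustrative heuristic of our own, not a description of how any search engine detects spoofing, and the sample values and the function name unsupported_keywords are invented.

```python
import re


def unsupported_keywords(keywords_content, body_text):
    """Return declared keywords with at least one word absent from the body.

    keywords_content: the comma-separated value of a 'keywords' META element.
    body_text: the visible text of the page.
    """
    body_words = set(re.findall(r"[a-z0-9]+", body_text.lower()))
    unsupported = []
    for keyword in keywords_content.split(","):
        keyword = keyword.strip().lower()
        if not keyword:
            continue
        keyword_words = re.findall(r"[a-z0-9]+", keyword)
        if not all(word in body_words for word in keyword_words):
            unsupported.append(keyword)
    return unsupported


# Invented values, for illustration only.
DECLARED = "owls, cheap flights, tasmania, free music"
BODY = "A field guide to the owls of Tasmania."

if __name__ == "__main__":
    print(unsupported_keywords(DECLARED, BODY))
```

A legitimate keyword can of course be a synonym that never appears in the text, so a check like this would produce false alarms as well as catching 'spamdexing'.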

Current metadata strategies are designed for "high-level document properties", with inclusion of topical descriptors and phrases in the 'head' element of a page assuming that the content will be stable and that the page accurately identifies subsidiary pages that may not have full metadata.

The Metrics & Statistics guide elsewhere on this site points to research into the volatility of online content that undermines that assumption. Koehler's paper on Digital Libraries & WWW Persistence for example estimates that the 'half life' of a web page is less than two years (with the half life of a site a bit more than two years), while the 1997 Rate of Change & other Metrics: a Live Study of the World Wide Web paper by Fred Douglis, Anja Feldmann & Balachander Krishnamurthy and the 2000 paper How dynamic is the web? by Brian Brewington & George Cybenko estimate that 20% of pages are less than twelve days old, with only 25% older than one year.
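To give a feel for what a half life of about two years implies, the sketch below assumes that pages disappear at a constant proportional rate (simple exponential decay). That model is our assumption for illustration, and the two-year figure is used as a round upper bound; neither is a finding taken from the papers cited above.

```python
def surviving_fraction(age_years, half_life_years):
    """Fraction of pages expected to survive to a given age.

    Assumes exponential decay: half are lost in every half-life period.
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (age_years / half_life_years)


if __name__ == "__main__":
    HALF_LIFE = 2.0  # years; illustrative round figure
    for age in (1, 2, 4, 6):
        share = surviving_fraction(age, HALF_LIFE)
        print(f"after {age} years: {share:.1%} of pages remain")
```

Under that assumption only a quarter of pages would remain after four years, which suggests how quickly carefully written descriptors can end up describing content that has gone.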

More broadly, despite broad initiatives such as Dublin Core and LOM the 'metadata community' has been - and presumably will continue to be - riven by disagreements about particular applications of those standards and about questions of interoperability. Some of the more conceptually interesting work has indeed centred on schemes - such as RDF - that encompass an "agreement to disagree", ie to accommodate different 'dialects' and 'languages'.

One writer on DC thus diplomatically comments that

Deployment of local qualifiers and extensions is an appropriate action, but designers should do so with the understanding that interoperability with other applications may suffer.

Moves by some enthusiasts for a single metadata schema appear as misplaced as claims that universal brotherhood and happiness would be achieved through large-scale adoption of Esperanto or Volapuk.









version of November 2003
© Caslon Analytics