online metadata
This
page looks at use of metadata on the web.
the standards question
Internet engineering and standards bodies have not
mandated detailed standards for metadata. That means,
for example, that there's no standardized terminology
and thesaurus (one reason why many librarians look at
the web askance).
Essentially, in developing
the web provision was made for the inclusion of metadata
within pages/sites, allowing descriptive and other information
to be embedded in each page among the 'invisible' code.
Provision was also made for construction of search engines
and other tools to point to web pages, drawing on the
embedded metadata or using their own metadata about those
pages (eg information derived from parsing the text on
each page rather than metadata supplied by the page's
author).
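As an illustration of how a tool might read metadata embedded among a page's 'invisible' code, here is a minimal sketch using Python's standard html.parser module. The sample page and the names MetaTagCollector and extract_metadata are invented for this example; real harvesters are considerably more elaborate.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


# Hypothetical page, invented for illustration.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Example page</title>
    <meta name="description" content="A short note on metadata">
    <meta name="keywords" content="metadata, web, search">
  </head>
  <body><p>Visible text goes here.</p></body>
</html>
"""


def extract_metadata(html_text):
    """Return the author-supplied metadata found in html_text as a dict."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.metadata


page_metadata = extract_metadata(SAMPLE_PAGE)
print(page_metadata)
```

Nothing in the sketch checks whether the author's description is accurate, which is the weakness discussed later on this page.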
That has had several results -
- There is disagreement among specialist users about development
of specific standards for the structuring and expression
of embedded metadata. (Competing and complementary standards
from librarians, museum curators, informatics specialists
and others include the Dublin Core, AAT, CSDGM,
GIS, CGIS-SAIF, Resource Description Framework and Warwick
Framework.)
- There is similar disagreement about content rating metadata
(such as PICS, used in censorship
or content management schemes). As Charles Thomas
& Linda Griffin note in their First Monday
article
on Who Will Create The Metadata For The Internet?,
although there are commercial incentives for effective
metadata, the various schemes have to break out of the
silicon ghetto.
- The wide range of search engines and directories produces
different results. There are now at least 2,000 search
engines although most traffic goes to the top 11 such
as Yahoo! and Google.
who is using metadata?
Most web pages (and probably most sites) don't have descriptive
metadata. Some studies
suggest that only 34% have any 'meaningful' metadata and
that much metadata is not relevant to the particular site.
Fewer than 0.3% of sites (and thus a much smaller fraction
of the 'deep web' described in the metrics
guide elsewhere on this site) use Dublin Core (DC) metadata,
described in the following page of this profile. As
Jane Greenberg & associates note in a paper
on Author-Generated Dublin Core Metadata for Web Resources:
A Baseline Study in an Organization, there are questions
about the application of DC by non-specialists. Use of
DC and other metadata sets is concentrated in
- sites owned by government agencies, academic and cultural
institutions
- on higher-level pages, particularly index pages and subdirectory
pages
- major corporate intranets (especially those involving specialist
audiences and created using advanced content management
systems) rather than SME or 'domestic' internet sites
Few major search engines rely on metadata supplied by
the owners of sites. One industry figure quoted in
Search Engine Watch comments: "Search
engines do not trust metadata. It's fine to talk about
how nice it would be if all web pages were categorized,
but the search engines know from experience that people
will lie, mislead or do whatever they can to get on
top."
The 2003 paper
Trends in the Evolution of the Public Web 1998-2002
by Edward O'Neill, Brian Lavoie & Rick Bennett commented
on trends in metadata usage on the 'surface web' (ie publicly-accessible
sites) from 1998 to 2002, suggesting that
it
seems clear that metadata usage is on the rise: steady
increases in the percent of public Web sites containing
metadata on the home page (where metadata usage is most
common) are observed throughout the five-year period.
Similar increases were observed in the percentage of
all Web pages harvested from public sites that contained
some form of metadata. One caveat should be mentioned,
however: with the advent of more sophisticated HTML
editors, some META tags are created and populated automatically
as part of the document template. It is likely that
this accounts for at least part of the perceived increase
in META tag usage on public Web sites.
A second interesting feature about metadata usage on
the Web is that, apparently, it is not becoming more
detailed. If it is assumed that one META tag is equivalent
to one metadata element, or piece of descriptive information
about the Web resource, then it is clear that, on average,
Web pages that include metadata contain about two or
three elements. Clearly, there is no widespread movement
to include detailed description of Web resources on
the public Web.
A discouraging aspect of metadata usage trends on the
public Web over the last five years is the seeming reluctance
of content creators to adopt formal metadata schemes
with which to describe their documents. For example,
Dublin Core metadata appeared on only 0.5 percent of
public Web site home pages in 1998; that figure increased
almost imperceptibly to 0.7 percent in 2002. The vast
majority of metadata provided on the public Web is ad
hoc in its creation, unstructured by any formal metadata
scheme.
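The measures discussed in that paper (the share of pages with any metadata, the average number of elements per tagged page and the share using Dublin Core) can be sketched in a few lines. This is an illustrative sketch only, not the method used by O'Neill and colleagues: the sample pages are invented, each named META tag is counted as one element, and Dublin Core is detected by the common convention of prefixing element names with "DC.".

```python
from html.parser import HTMLParser


class MetaCounter(HTMLParser):
    """Record the name of every named <meta> tag on a page."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:  # tags without a name (eg http-equiv) are not counted
                self.names.append(name)


def summarise(pages):
    """Return (share with metadata, mean elements per tagged page, share with DC)."""
    if not pages:
        return 0.0, 0.0, 0.0
    counts = []
    dc_pages = 0
    for html_text in pages:
        counter = MetaCounter()
        counter.feed(html_text)
        counts.append(len(counter.names))
        # Assumption: Dublin Core elements are named "DC.title", "DC.creator" etc.
        if any(n.lower().startswith("dc.") for n in counter.names):
            dc_pages += 1
    tagged = [c for c in counts if c > 0]
    share_tagged = len(tagged) / len(pages)
    mean_elements = sum(tagged) / len(tagged) if tagged else 0.0
    share_dc = dc_pages / len(pages)
    return share_tagged, mean_elements, share_dc


# Hypothetical home pages, invented for illustration.
SAMPLE_PAGES = [
    "<html><head><title>No metadata</title></head></html>",
    '<html><head><meta name="description" content="x">'
    '<meta name="keywords" content="y"></head></html>',
    '<html><head><meta name="generator" content="Editor">'
    '<meta name="description" content="x">'
    '<meta name="keywords" content="y"></head></html>',
    '<html><head><meta name="DC.title" content="Report">'
    '<meta name="DC.creator" content="Agency"></head></html>',
]

share_tagged, mean_elements, share_dc = summarise(SAMPLE_PAGES)
print(share_tagged, mean_elements, share_dc)
```

The third sample page includes a "generator" tag of the kind an HTML editor might add automatically, which is the caveat raised in the paper: a count of this sort cannot tell deliberate description from template output.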
where does it come from?
In practice metadata about a page originates in two
ways.
The creator of the page can embed metadata when constructing
(or amending) the page.
Some software used in building sites will automatically
generate such metadata, albeit crudely. Metadata can also be
written by hand: we have manually developed the metadata for
most pages on this site, for example.
Manual development and application pose two problems.
The first is that many authors are simply unaware of the
existence of metadata. Others are uncertain about its
nature - what is it, where does it go, what terms to use
- or see it as an afterthought rather than integral to
electronic publishing. Some see no benefits justifying
inclusion of metadata. Others are repelled by the technical
literature on particular metadata sets or even by the
'true religion' approach of some metadata zealots.
A second problem is that effective metadata requires understanding
of the particular metadata set and of the content of a
page, document or major structural component within a
site. Some authors, through lack of training, time or
interest, create and apply junk metadata.
Metadata need not be created by a page/document's author.
An alternative way is the creation of metadata about the
page by an unrelated entity, ie by something/someone that
visits the page rather than by the page's owner.
Many search engines use 'robots' or 'spiders' to visit
pages, look for significant terms within the text and
incorporate that information within the database that
fuels the search engine or that flags a page as having objectionable
content. Other engines and directories use humans to examine
the pages and create the metadata, which may be held as
a separate but related resource.
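A much-simplified sketch of what a robot might do when it derives its own metadata from the visible text of a page follows. It is an illustration under stated assumptions rather than a description of any real engine: the stopword list, the sample page and the idea of treating the most frequent words as 'significant' are all choices made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A tiny, hypothetical stopword list; real engines use far larger ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "by", "are"}


class TextExtractor(HTMLParser):
    """Gather the text of a page, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def significant_terms(html_text, limit=3):
    """Return the most frequent non-stopword terms found in the page text."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    words = re.findall(r"[a-z]+", " ".join(extractor.chunks).lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(limit)]


# Hypothetical page, invented for illustration.
SAMPLE_PAGE = (
    "<html><head><title>Museum catalogue</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>The museum catalogue lists paintings. "
    "Paintings in the museum are described by curators.</p></body></html>"
)

terms = significant_terms(SAMPLE_PAGE)
print(terms)
```

The result is metadata about the page that the page's owner never wrote, held in the visitor's database rather than in the page itself.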
and how is it associated?
Metadata Principles & Practicalities, a 2002
paper
by Erik Duval, Wayne Hodgins, Stuart Sutton & Stuart
Weibel, notes that there are several ways to associate
metadata with resources:
Embedded metadata resides within the markup of
the resource and is created at the time that the resource
is created, often by the author. "Experts differ
concerning whether author-created metadata is best or
whether it is better to have trained practitioners evaluate
and describe resources. As a practical matter, resource
description expertise is a scarce and costly commodity,
and thus any investment by authors in the description
of their intellectual products is likely to be of value.
Embedded metadata can also be harvested ... the presumptive
increase in visibility that might result is an incentive
for creators to assign metadata".
Associated metadata is maintained in files tightly
coupled to the resources they describe and may or may
not be harvestable (eg using SOIF). The advantage of
associated metadata derives from the relative ease of
managing the metadata without altering the content of
the resource itself, but this benefit is purchased at
the cost of simplicity, necessitating the co-management
of resource files and metadata files.
Third-Party metadata is maintained in a separate
repository by an organization that may or may not have
direct control over or access to the content of the
resource, typically in a database that is not accessible
to harvesters, eg through an MCF
scheme.
Weibel
notes the confusion of syntax issues and association models,
with a tendency to assume HTML-based metadata is equivalent
to embedded metadata.
His paper also notes that a given information resource
will sometimes have multiple metadata records that reflect
different purposes and the roles of the organizations
that create/manage those records.
A
resource may be created with embedded metadata supplied
by the author. A separate record might be created by
the issuing organization (an academic department or
publisher, for example) and stored in a separate database.
A third party (perhaps a library) might create yet another
version of metadata, either from scratch or derived
from a previous record. In most cases these records
will not be managed in a coordinated way, and differences
may arise among them that may cause ambiguity or confusion.
This may be less than ideal, but must be expected in
an environment where various organizations may choose
to manage resource descriptions with different objectives.
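The situation described above, in which several uncoordinated records describe one resource, can be sketched as a simple data structure. The class MetadataRecord, the function conflicting_elements and the three sample records are hypothetical, invented to illustrate how differences between embedded, associated and third-party records might be detected.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """One description of a resource, however it is associated with it."""
    association: str      # "embedded", "associated" or "third-party"
    maintained_by: str    # who created or manages the record
    elements: dict = field(default_factory=dict)


def conflicting_elements(records):
    """Map each element name to its set of values where the records disagree."""
    seen = {}
    for record in records:
        for name, value in record.elements.items():
            seen.setdefault(name, set()).add(value)
    return {name: values for name, values in seen.items() if len(values) > 1}


# Hypothetical records for a single document, invented for illustration.
records = [
    MetadataRecord("embedded", "author",
                   {"title": "Field notes", "creator": "J. Smith"}),
    MetadataRecord("associated", "publisher",
                   {"title": "Field Notes (2nd ed.)", "creator": "J. Smith"}),
    MetadataRecord("third-party", "library",
                   {"title": "Field notes", "subject": "Ornithology"}),
]

conflicts = conflicting_elements(records)
print(conflicts)
```

Detecting a difference is the easy part. Deciding which record to trust depends on the objectives of the organisation using it, which is the point the paper makes.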
does it matter?
As you might expect, there's disagreement about what
matters.
It is clear that most search engines ignore metadata embedded
by creators.
More broadly, many sites will never rank highly on search
engines. Their owners should concentrate on driving traffic
to them in other ways.
On the other hand, in parts of the web and intranets -
such as libraries, image archives and bodies dealing with
geospatial information - there is agreement about use
of metadata and about specific standards, for example
Dublin Core. Comprehensive application of such standards
reflects institutional priorities (eg government-wide
mandates or user demands for sophisticated resource identification),
resource stability (is it worth applying state-of-the-art
descriptors to a page that will be offline within a week)
and the ability of organisations to make a major commitment
to data application and associated retrieval tools.
Consistent use of metadata schemes, often as a consequence
of the management of information within each body's databases,
facilitates information exchange outside the web and, for
example, the operation of 'gateways' or sectoral search
engines that provide seamless access to the holdings of
a group of museums.
Preservation Metadata for Digital Objects: A Review of
the State of the Art (PDF)
is a concise overview by the US Research Libraries Group
of competing preservation metadata initiatives such as
the Open Archival Information System (OAIS) and CURL Exemplars
in Digital Archives (CEDARS).
and the future?
The idea of a standard set of terms and phrases as the
basis for online resource identification has been seductive
to librarians and information scientists but has not found
significant acceptance among most site creators and search
engine/directory developers. Two assumptions have impeded
past online metadata initiatives.
What one observer characterised
as the "technological legacy of knowledge representation"
assumes the existence of "a class of disinterested
information workers (i.e., librarians)" responsible
for comprehensive and systematic subject cataloguing.
However, that class has little clout online. Businesses,
organisations and individuals can mark up their pages
as they please. There are few legal constraints or community
norms to prevent the use of 'false' metadata, with the
result that few search engines rely on metadata because
the unscrupulous will 'spoof' ('spamdex') the search results.
Current metadata strategies are designed for "high-level
document properties", with inclusion of topical descriptors
and phrases in the 'head' element of a page assuming that
the content will be stable and that the page accurately
identifies subsidiary pages that may not have full metadata.
The Metrics & Statistics
guide elsewhere on this site points to research into the
volatility of online content that undermines that assumption.
Koehler's paper
on Digital Libraries & WWW Persistence for
example estimates that the 'half life' of a web page is
less than two years (with the half life of a site a bit
more than two years), while the 1997 Rate of Change
& other Metrics: a Live Study of the World Wide Web
paper
by Fred Douglis, Anja Feldmann & Balachander Krishnamurthy
and the 2000 paper
How dynamic is the web? by Brian Brewington &
George Cybenko estimate that 20% of pages are less than
twelve days old, with only 25% older than one year.
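To give a feel for what a half life implies, here is a rough piece of arithmetic. It assumes (our assumption, not a claim made in the papers cited) that pages disappear at a constant rate, so that survival follows simple exponential decay, and it uses two years as a round upper bound on the page half life quoted above.

```python
def surviving_fraction(age_years, half_life_years=2.0):
    """Fraction of pages still online after age_years, assuming exponential decay.

    half_life_years=2.0 is an illustrative upper bound, not a measured value.
    """
    return 0.5 ** (age_years / half_life_years)


for years in (1, 2, 4, 6):
    print(years, "years:", round(surviving_fraction(years), 3))
```

On those assumptions only a quarter of the pages described today would still be online after four years, whatever care went into their metadata.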
More broadly, despite broad initiatives such as Dublin
Core and LOM the 'metadata community' has been - and presumably
will continue to be - riven by disagreements about particular
applications of those standards and about questions of
interoperability. Some of the more conceptually interesting
work has indeed centred on schemes - such as RDF - that
encompass an "agreement to disagree", ie to
accommodate different 'dialects' and 'languages'.
One writer on DC thus diplomatically comments: "Deployment
of local qualifiers and extensions is an appropriate
action, but designers should do so with the understanding
that interoperability with other applications may suffer."
Moves
by some enthusiasts for a single metadata schema
appear as misplaced as claims that universal brotherhood
and happiness would be achieved through large-scale adoption
of Esperanto or Volapük.