lies & spin
This page examines the content of the web, including estimates
of the number of pages and images, the volatility of content
and the severity of link-rot.
It covers -
size of the web
The specific number of pages or documents on the
internet perhaps matters less than a broad sense of the web's
size (and growth) and of the disagreement about those figures.
As of 2001 the latest academic estimate from the US was
that the web had some 800 million pages - this page is
800 million plus one - with Northern Light, supposedly
the most inclusive search
engine at that time, covering less than 16% of that
figure. The O'Neill, Lavoie & Bennett Trends
paper suggested that by 2002 the figure had grown to 1.4
billion publicly-accessible pages.
Google reported that by December 2002 it had indexed almost
2.5 billion individual pages, increasing to 3.1 billion
by February 2003. As of January 2004 Google claimed to
cover 3,307,998,701 pages. In February 2004 it announced
that it covered "6 billion items": 4.28 billion
web pages, 880 million images and 845 million Usenet messages.
Three of the seminal papers - often referred to as the
'NEC studies' - are How Big is the Web (HBW),
Accessibility & Distribution of Information on
the Web (ADIW)
and the 1998 and 1999 Search Engine Coverage Update
by Steve Lawrence & C Lee Giles. The most recent paper
suggests that the web is growing faster than coverage
by the search engines and that dead links are more common.
In early 2001 Inktomi and NEC Research estimated
that there were more than a billion "unique pages". IDC's
The Global Market Forecast for Internet Usage &
Commerce report forecast that the global online population
would grow from 240 million in 1999 to 602 million in 2003,
with the number of web pages climbing from 2.1 billion
in 1999 to 16.5 billion in 2003.
US metrics company Cyveillance estimated
that there were over 2.1 billion pages on the web (heading
towards 4 billion by the end of 2001) with the "average
page" having 23 internal links, 6 external links
and 14 images. The US Federal Library & Information
Center estimated that the federal government alone had
over 27 million publicly accessible pages online.
BrightPlanet, a new entrant to the search engine market,
claimed that "the deep Web" contains "550
billion individual documents", with only a small
fraction indexed by its competitors.
That figure, like many web statistics, is problematical.
More importantly, unlike the 'surface web', deep
web information is generally not publicly accessible,
eg it involves a subscription or per-item fee or resides on a
corporate intranet. That is one reason for concern
about digital divides. It is also a reason why academic and public
libraries have an ongoing role.
The major 2001 study
and 2003 study
by Hal Varian & Peter Lyman on scoping the 'information
universe' - quantifying what is produced, transmitted,
consumed and archived - are of relevance. The 1997
paper A Methodology for Sampling the World Wide
Web by Edward O'Neill, Patrick McClain & Brian
Lavoie will interest statistics buffs.
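For those statistics buffs, the flavour of that sampling approach can be sketched in a few lines: probe a random sample of the IPv4 address space for web servers, then scale the hit rate up to the whole space. This is a minimal sketch of the general idea only; the figures and the function name are hypothetical, not drawn from the paper.

```python
# Sketch of random-IP sampling for estimating web size:
# scale the hit rate observed in a random sample of addresses
# up to the full address space (2**32 for IPv4 by default).
def estimate_hosts(sample_size: int, hits: int,
                   address_space: int = 2**32) -> float:
    """Extrapolate the number of web hosts from a random sample."""
    return hits / sample_size * address_space
```

For example, finding one responding host per four sampled addresses in a toy space of 100 addresses implies an estimate of 25 hosts.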
In 2005 the National Library of Australia (PDF)
published an initial report on what was claimed as "the
first whole Australian domain harvest", identifying
some "185 million unique documents" from 811,523
hosts. 67% of the documents were text/html, 17% were JPEG
images, 11% were GIF
images and 1.6% were PDFs. The harvested content came
to 6.69 terabytes.
the non-text net
There is no consensus about
the number of still images (eg photographs), video recordings,
animations and sound recordings on the net
rates at which that content is growing
nations from which most of that content is originating
In February 2005 Google announced that its cache of the web
had reached over a billion images, up from some 880 million
in February 2004. Some questions about audio and image
searching are here.
number of personal sites
Figures about the number of 'personal sites' (homepages
and blogs) are problematical.
Sonia Livingstone & Magdalena Bober's 2004 UK
Children Go Online: Surveying the experiences of young
people & their parents (PDF)
has been interpreted as suggesting that "34% of UK
kids" (the 9 to 19 cohort) have personal pages. Research
such as The Construction of Identity in the Personal
Homepages of Adolescents, a 1998 paper
by Daniel Chandler & Dilwyn Roberts-Young, indicates
that most homepages are created by older adolescents.
One might infer from figures highlighted in our discussion
of blogging that few personal
homepages are actively maintained (or indeed visited).
volatility of content
Wallace Koehler's paper
on Digital Libraries & WWW Persistence estimates
that the 'half life' of a web page is less than two years
and the half life of a site is a bit more than two years.
That is in line with more restricted research such as
the 1997 Rate of Change & other Metrics: a Live
Study of the World Wide Web paper
by Fred Douglis, Anja Feldmann & Balachander Krishnamurthy
and the 2000 paper
How dynamic is the web? by Brian Brewington &
George Cybenko. The latter estimated that 20% of pages
are less than twelve days old, with only 25% older than
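Koehler's 'half life' framing is simply exponential decay. As a rough illustration (assuming, per the estimate above, a half life of about two years per page; everything else is illustrative):

```python
# Exponential-decay sketch of page survival using the
# 'half life' framing; the two-year default reflects the
# Koehler estimate cited in the text.
def surviving_fraction(age_years: float,
                       half_life_years: float = 2.0) -> float:
    """Fraction of pages expected to still resolve after age_years."""
    return 0.5 ** (age_years / half_life_years)
```

On that model only half of pages survive two years, and only a quarter survive four.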
The O'Neill, Lavoie & Bennett Trends paper notes
that while the public Web, in terms of number of sites, is getting
smaller, public Web sites themselves are getting larger.
In 2001, the average number of pages per public site
was 413; in 2002, that number had increased to 441.
Halavais' 'Social Weather' On The Web (PDF)
suggests that blogs are
the most dynamic web content.
There have been few large-scale studies of link-rot, ie
broken links that result in the 404 'not found' message
in your browser.
A 2003 paper in Science by Robert Dellavalle,
Eric Hester, Lauren Heilig, Amanda Drake, Jeff Kuntzman,
Marla Graber & Lisa Schilling on 'Going, Going, Gone:
Lost Internet References' examined internet citations
in The New England Journal of Medicine, The
Journal of the American Medical Association and Nature.
Web content was cited in over 1,000 items published between
2000 and 2003. At three, 15 and 27 months after publication
the proportion of inactive references grew from 3.8% to 10%
and then 13%.
Other studies suggest that link rot and dead print-format
citations to online content outside the sciences may be
as high as 40% after three years. One 2002 US study suggested
that up to 50% of URLs
cited in articles in two IT journals were inaccessible
within four years.
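Studies such as these reduce to tallying which cited URLs are now dead. A minimal sketch, assuming the HTTP status for each citation has already been fetched by a crawler (with None standing for a connection failure; the sample data is hypothetical):

```python
# Compute a link-rot rate from (url, http_status) pairs.
# A 404 response or a connection failure (status None) counts
# as a dead citation; other statuses count as live.
def link_rot_rate(results):
    dead = sum(1 for _url, status in results
               if status is None or status == 404)
    return dead / len(results)
```

Two dead citations out of four checked would give a 50% rot rate.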
It has similarly been claimed that of around 2,500 UK government
sites, around 27% of URLs become invalid each year as
sites are restructured or cease to operate after administrative
reorganisations, or as documents are taken offline.
and of domains
The O'Neill Trends paper also comments that
in addition to a slower rate of new site creation, the
rate at which existing sites disappear may have increased.
Analysis of the 2001 and 2002 Web sample data suggests
that as much as 17 percent of the public Web sites that
existed in 2001 had ceased to exist by 2002. Many of
those who created Web sites in the past have apparently
determined that continuing to maintain the sites is
no longer worthwhile. Economics is one motivating factor
for this: the "dot-com bust" resulted in many
Internet-related firms going out of business; other
companies scaled back or even eliminated their Web-based
operations .... Other analysts note a decline in Web
sites maintained by private individuals — the
so-called "personal" Web sites. Some attribute
this decline to the fact that many free-of-charge Web
hosting agreements are now expiring, and individuals
are unwilling to pay fees in order to maintain their sites.
A State of the Domain study (PDF)
highlighted volatility in domain registrations from August
2001 to August 2002 -
renewed by current registrant
by a new registrant
not previously registered (ie wholly new)
Paul Clemente's The State of the Net (New York:
McGraw-Hill 1998) is now dated but offers a snapshot of
figures before the dot-com crash.