 dark web 
 This page looks at what has variously been labelled the 
                        dark web (or dark internet), invisible web, deep web or 
hidden web - online resources that are inaccessible (for example because they are protected by firewalls) or that are not
                        identified by search engines because they feature a robot 
                        exclusion tag in their metadata.
 
It covers - an introduction, the question of how big the hidden web is, the basis of that invisibility, and strategies for retrieval.
  introduction 
Much of the content on the web - and more broadly on the net - is not readily identifiable through directories or public search engines such as Google, and is not necessarily accessible even when it has been identified.
 
 Those navigation tools may broadly identify online resources 
that are not publicly available (for example by providing an abstract of a report or journal article). Some content
                        may simply be undetected by search engines. 'Google Still 
                        Not Indexing Hidden Web URLs' by
 Kat Hagedorn & Joshua Santelli in 14(7) D-Lib 
                        (2008) for example comments 
                        that the leading search engine - and its peers - misses 
                        substantial OAI content.
 
Unavailability may reflect the content owner's or publisher's desire to wholly exclude access by outsiders to an intranet
                        or database. It may instead reflect the exigencies of 
                        online publishing, with 
                        some publishers providing wide access to subscribers or 
                        on an item by item (or sessional) basis after payment 
of an access fee. Some commercial content appears online for a short term (eg a day or a week) before moving behind a firewall. Other content is detectable
                        through abstracts but is distributed by email 
                        rather than on the web.
 
 The elusive nature of such content has resulted in characterisations 
                        of the dark web, the deep (as distinct from surface) web, 
                        the invisible web or the hidden web. Those characterisations 
can be misleading, as some content resides on the internet rather than on the part of the net that we label the web.
 
 
  how big 
 As noted in discussion elsewhere on this site regarding 
                        internet metrics, the size 
                        and composition of the 'invisible web' is contentious.
 
 That is partly because of definitional disagreements.
 
 Is 'invisibility' attributable to deficiencies in search 
                        engine technology, given that there are substantial numbers 
                        of web pages that can be accessed by ordinary people (ie 
                        without a password or payment) but are not indexed by 
                        the major search engines?
 
 Elsewhere we have noted claims that the largest public 
                        search engines regularly visit and index less than 20% 
                        of all static web pages, arguably not a tragedy given 
                        the ephemeral nature of much blogging 
                        and the prevalence of domain name 
                        tasting.
 
 Does 'deepness' include corporate databases and intranets 
                        that are not publicly accessible but have some connection 
to the net and from which, for example, an employee with appropriate authorisation might download a document while
                        away from the office?
 
 Many - perhaps most - corporate networks have some connection 
                        to the net, on a permanent or ad hoc basis. Should those 
                        networks, and the millions of files they hold, be included 
                        in the dark net? Major cultural institutions now provide 
                        access to large bibliographic and image databases, with 
                        data being displayed 'on the fly'. Is that content part 
                        of the deep web?
 
Contention also reflects uncertainty about data, with disagreement about how to systematically count static websites and pages, the indeterminate number of sites that dynamically display content, and the number of pages so displayed.
 
 It is common to encounter claims that the overall number 
                        of 'pages' in the surface and submerged web is around 
4 trillion. It is less common to see a detailed methodology for deriving that number, or figures from major search
                        engines such as Google about both the number of sites/pages 
                        they have spidered and their estimates of what has not 
                        been spidered. There has been no authoritative inventory 
                        of commercial publishing sites and cultural institution 
                        sites.
 
 
  basis 
 Why can't resources be readily found and accessed? Reasons 
                        for invisibility vary widely.
 
 Some content is in fact not meant to be invisible. It 
                        may be undetected because it is new: search engines do 
                        not purport to provide instant identification of all content 
                        on an ongoing basis and lags in discovery are common. 
                        The content may have been found by a search engine but 
                        is not displayed because the site/page has been 'sandboxed' 
                        (a mechanism used by some engines to inhibit problematical 
                        publishing by adult content vendors and similar businesses).
 
 Some content is invisible because of search engine priorities: 
                        the search engine is programmed not to bother with the 
                        less frequented parts of cyberspace, in particular those 
                        pages that get no traffic and that are not acknowledged 
                        by other sites.
 
A rule of thumb is that a search engine showing no results for particular content does not exclude that content's existence.
 
 Some publicly-accessible content is displayed 'on the 
                        fly', ie only appears when there is a request from a user. 
                        Such a request might involve entering a search query on 
                        a whole-of-web or site-specific search engine. It might 
                        instead involve clicking a link or using an online form 
                        to access information otherwise held behind a firewall, 
                        often information aggregated in response to the specific 
                        request.
 
 Examples include job listings, financial data services 
                        (with currency or share prices being updated on an ongoing 
                        basis in real time), online white pages and 'colour 
                        pages' directories, output from a range of government 
                        databases (eg patent registers) 
                        and travel sites (with pricing of some airline seats and 
                        hotel rooms for example reflecting the demand expressed 
                        by queries from users of the site).
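
As a rough sketch of how such 'on the fly' content works, the Python snippet below builds the kind of query URL a directory site might answer; the host and parameter names are hypothetical, and the result page only comes into existence once the request is made.

    from urllib.parse import urlencode

    # Hypothetical 'white pages' lookup: the result page is generated only in
    # response to a specific query and has no static URL for a spider to stumble on.
    params = urlencode({"surname": "Smith", "state": "NSW"})
    url = "https://directory.example.com/search?" + params

    print(url)   # https://directory.example.com/search?surname=Smith&state=NSW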
 
 Some engines have traditionally ignored dynamically generated 
                        web pages whose URLs feature a long string of parameters 
                        (eg a date plus search terms plus region). The rationale 
                        often provided is that such pages are likely to duplicate 
                        cached content; some specialists have occasionally fretted 
                        that the spider will be induced to go around in circles.
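
That rule of thumb might be sketched as follows; the sample URL and the parameter threshold are illustrative assumptions rather than any particular engine's actual policy.

    from urllib.parse import urlparse, parse_qs

    # Illustrative crawler heuristic: skip URLs whose query strings carry many
    # parameters, on the assumption that they duplicate cached content or trap
    # the spider in an endless series of near-identical variants.
    url = "http://example.com/results?date=2008-07-01&q=patents&region=au&page=3"
    params = parse_qs(urlparse(url).query)

    if len(params) > 2:   # threshold chosen arbitrarily for the sketch
        print("skipping parameter-heavy URL:", url)
    else:
        print("queueing for indexing:", url)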
 
 Some content is meant to be invisible, with access 
                        being provided only to authorised users (a class that 
                        usually does not include whole of web search engines). 
                        The restriction may involve use of a password. Access 
                        may involve an ongoing subscription (often to a whole 
                        database). It may instead involve sessional, non-subscription 
                        use of a whole database or merely delivery of an individual 
                        document (with much online scholarly journal publishing 
                        for example selling a PDF of a single article at a price 
equivalent to that of a hardcover academic book).
 
 Such sites are proliferating, serving specialist markets 
                        (often corporate/institutional users rather than individuals 
                        without a business/academic affiliation). They include 
                        academic library subscriptions, reports by some major 
                        technical and financial publishers, and some newspapers.
 
 Many newspapers have adopted a slightly different strategy, 
                        providing free access to excerpts (with engines such as 
                        Google thus being able to spider a 'teaser' rather than 
                        the full content of a particular item or even set of multimedia 
                        files) or pulling 'archival' content behind a firewall 
                        after a certain period of time.
 
 That 'time-limited access' often allows ongoing access 
                        by subscribers, who may have paid for the privilege or 
                        may instead merely have supplied information that allows 
                        the publisher to build a fuzzy picture of their demographics. 
After such removal, search engines often still list the URL, with future visits to that page being met by a sign-up form.
                        The boundaries of the dark web are porous: some content 
                        is cached by an engine and can be discerned through a 
                        diligent search.
 
 Some content is in fact online, without a password or 
                        other restriction, but is invisible to search engines 
                        (rather than to anyone who knows the URL for the particular 
                        file).
 
That invisibility may be based on the site operator's use
                        of the 'robot exclusion' or robots.txt 
                        file or tag, which signals to a search engine - when spidered 
                        - that the particular file or part of the site is not 
                        to be indexed. Invisibility may be even more low-tech, 
                        based on the absence of a link pointing to a particular 
                        page (ie it does not form part of the hierarchical relationship 
                        in a static web site, with the homepage/index page pointing 
                        to subsidiary pages).
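
As a minimal illustration, Python's standard urllib.robotparser module applies the robot exclusion convention described above; the robots.txt rules here are a made-up example of an operator excluding one directory from indexing.

    from urllib import robotparser

    # Made-up robots.txt: the operator asks all spiders to skip /reports/.
    rules = [
        "User-agent: *",
        "Disallow: /reports/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # Well-behaved engines honour the exclusion; anyone who knows the URL can still fetch it.
    print(rp.can_fetch("*", "https://example.com/reports/annual.pdf"))   # False
    print(rp.can_fetch("*", "https://example.com/index.html"))           # True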
 
 Geocoding (aka geo-tagging) 
                        and other filtering 
                        of content means that for some people particular parts 
                        of the web - for example those that deal with human rights, 
                        criticism of their ruler or adult content - are dark. 
                        Filtering may be intended to restrict access by a nation's 
                        population to offensive or subversive content. It may 
                        instead form part of a business strategy, with online 
                        broadcasters for example trialling systems that seek to 
                        restrict access outside specific locations. Such restriction 
                        may assume that there is a close and reliable correlation 
                        between a geographic location and the IP address of the 
                        user's computer.
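
A crude sketch of that assumption is set out below; the address-prefix table is entirely hypothetical, standing in for the commercial IP-to-location databases such filters actually rely on.

    # Hypothetical geo-filter: map the requesting IP address to a country and
    # serve the content only inside the licensed region.
    GEO_PREFIXES = {
        "203.0.113.": "AU",    # documentation address ranges used as examples
        "198.51.100.": "US",
    }

    def country_for(ip):
        for prefix, country in GEO_PREFIXES.items():
            if ip.startswith(prefix):
                return country
        return "UNKNOWN"

    def may_view(ip, licensed=("AU",)):
        return country_for(ip) in licensed

    print(may_view("203.0.113.7"))     # True  - treated as Australian
    print(may_view("198.51.100.20"))   # False - outside the licensed region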
 
 
  retrieval 
 Savvy researchers, of course, do not rely solely on Google, 
                        MSN or Yahoo! (and certainly not only on results from 
                        the first page of a search list). Much information in 
                        the dark web can be identified and even accessed through 
                        diligence or social networks, given that the best search 
                        engine is often a contact who knows what information is 
                        available and has the key needed to unlock the door.
 
One response is to use specialist search engines such as
                        JSTOR and Medline to identify rather than access documents. 
                        Some documents may be published in print formats, eg in 
                        journals that can be consulted in major libraries or accessed 
                        on a document by document basis through inter-library 
                        copying arrangements.
 
 Another response, as noted above, is to use acquaintances 
                        - or even generous librarians, officials and corporate 
                        employees - to gain access via an institutional or corporate 
                        subscription to a commercial database such as LexisNexis 
                        or Factiva.
 
 A third response is to pay for the content, whether on 
                        an item by item basis (offered by major journal publishers 
                        such as Elsevier and Blackwell), on a sessional basis 
                        or on a subscription basis.
 
 A further response is to make use of site-specific search 
                        engines and directories, ie navigate through corporate 
                        or other sites in search of documents or dynamically generated 
                        information that is readily available to a visitor but 
                        does not appear on a whole of web engine such as MSN. 
That response can be important given the lags in
                        data collection by major engines, with delays of weeks 
                        or months being common before new information appears 
                        in their search results.
 
 
 
 
 next page  (image searching) 
 
 