This page considers privacy aspects of online searching: who knows what you are doing online and what is the status of that information?

It is complemented by a discussion of consumer anxieties regarding search engines and comments on 'social search' (aka 'people mining').

introduction

It is common to encounter the meme that use of the net is necessarily anonymous or pseudonymous - "on the net no one knows you are a dog". We have suggested elsewhere on this site that the scope for such anonymity is often overstated: in practice government agencies and marketers can often 'triangulate' identity and, particularly by accessing data maintained by internet service providers and other intermediaries, can sometimes track online activity in great detail.

That electronic reader peering over the user's shoulder may, for example, be able to identify -

  • when a user was online
  • what email was sent, and even where it was sent to
  • how much data was downloaded or uploaded, often with a close indication of the nature of that data and its point of origin
  • what sites were visited
  • what searches were conducted.

Google for example is reported to have the capacity to -

  • produce a list of people who searched for a specific term, identified by IP address and/or Google cookie value
  • produce a list of the terms searched by the user of a specified IP address or Google cookie value.
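
Both lookups amount to simple filters over a query log. The following is a minimal sketch in Python, assuming a hypothetical log of (IP address, cookie value, search term) records. The record layout, function names and sample data are invented for illustration and do not reflect Google's actual log format.

```python
from collections import namedtuple

# Hypothetical log record; real search engine logs are not public.
Entry = namedtuple("Entry", ["ip", "cookie", "term"])

# Invented sample data using documentation-only IP addresses.
LOG = [
    Entry("203.0.113.5", "c-001", "depression symptoms"),
    Entry("203.0.113.5", "c-001", "new job sydney"),
    Entry("198.51.100.7", "c-002", "depression symptoms"),
]


def searchers_of(term, log):
    """Return the sorted (ip, cookie) pairs that searched for a term."""
    return sorted({(e.ip, e.cookie) for e in log if e.term == term})


def terms_searched_by(log, ip=None, cookie=None):
    """Return the terms searched from a given IP address or cookie value."""
    return [
        e.term
        for e in log
        if (ip is not None and e.ip == ip)
        or (cookie is not None and e.cookie == cookie)
    ]
```

Calling searchers_of("depression symptoms", LOG) lists both users in the sample, while terms_searched_by(LOG, cookie="c-001") returns everything the first user searched for.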

Others have worried about server logs, scattered across cyberspace and marking visits to a particular site, what pages were viewed and the visitor's IP address.
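
To illustrate what such a log holds, the sketch below parses one line in the widely used Common Log Format. The sample line is invented, and the regular expression is a simplified assumption that covers only well-formed entries.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Invented example entry (documentation-only IP address).
SAMPLE = (
    '192.0.2.44 - - [10/Oct/2006:13:55:36 -0700] '
    '"GET /privacy/search.htm HTTP/1.0" 200 2326'
)


def parse_line(line):
    """Return the visitor's IP, visit time and page requested, or None."""
    match = CLF.match(line)
    return match.groupdict() if match else None
```

Each parsed entry ties an IP address to a time and a page, which is the information the concern above is about.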

Those concerns gained attention in 2006, when it was revealed that the US government had subpoenaed Google to supply information regarding a million random web addresses and records of all Google searches over a one week period. That data would supposedly enable the government to determine how often pornography shows up in online searches, thereby substantiating a defence of the Child Online Protection Act (enforcement of which, as noted earlier in this guide, was blocked by an injunction that the Supreme Court upheld in 2004). The government has argued that COPA is the only viable way to shield minors from online pornography.

Tim Wu of Columbia Law School, co-author with Jack Goldsmith of Who Controls The Internet: Illusions Of A Borderless World (New York: Oxford University Press 2006), commented -

the big news for most Americans shouldn't be that the administration wants yet more confidential records. It should be the revelation that every single search you've ever conducted—ever—is stored on a database, somewhere. Forget e-mail and wiretaps—for many of us, there's probably nothing more embarrassing than the searches we've made over the last decade ...

Americans today feel great freedom to tell their deepest secrets; secrets they won't share with their spouses or priests, to their computers. The Luddites were right—our closest confidants today are robots. People have a place to find basic anonymous information on things like sexually transmitted diseases, depression, or drug addiction. The ability to look in secret for another job is not merely liberating, it's economically efficient. But all this depends on our feeling free to search without being watched.

The other alternative is that we all just accept this limitation on our freedom and learn to be more careful. If you go around googling "gay cowboy," perhaps you're just asking for trouble. Perhaps one should live, as they say, as if everything you do will soon show up on Page A1 of the New York Observer. But living like that—as if everything you do will be publicly aired one day—is wretched, and the exact opposite of what it means to be living in a free society ...

Today's search engines are close to an "always on" wiretap. Even for someone like myself who's hardly a privacy activist, that's a bit too scary. Google, and the rest of the search engine industry need to learn how to better keep our secrets.

Later in the same year AOL inadvertently released search log data covering searches by 658,000 AOL subscribers from March to May. A spokesperson confessed -

This was a screw up, and we're angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant. Although there was no personally-identifiable data linked to these accounts, we're absolutely not defending this. It was a mistake, and we apologize. We've launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

The information featured identification numbers rather than names or user IDs, although critics sniffed that people could be readily identified on the basis of their searches. Rebecca Jeschke of the EFF claimed that "It's reasonably easy for people to see what their neighbors are searching for, since most people usually google themselves".
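
The critics' point can be shown with a minimal sketch: replacing names with numbers leaves every query by the same person linked together. The rows, ID numbers and queries below are invented for illustration and are not taken from the AOL release.

```python
from collections import defaultdict

# Hypothetical released rows: (identification number, search query).
RELEASED = [
    (1001, "landscapers in springfield"),
    (1001, "jane citizen springfield"),
    (2002, "cheap flights"),
    (1001, "numb fingers"),
]


def profiles(rows):
    """Group queries by identification number, preserving their order."""
    by_id = defaultdict(list)
    for uid, query in rows:
        by_id[uid].append(query)
    return dict(by_id)
```

Here profiles(RELEASED)[1001] gathers a locality, a personal name and a health concern into one profile, which may be enough to suggest who the searcher is.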

legal frameworks

Legislation in Europe and elsewhere over the past five years has sought to mandate long-term retention by ISPs and telecommunication providers of traffic records, including information about messaging (SMS and email) and server logs. Governments have sought access to that data under cybercrime or national security legislation, which also encompasses privileged access to personal and corporate computers and networks.

Legal regimes typically provide some protection for email (on the model of written correspondence) but little or no protection for search information.

Both government and civil litigators are increasingly finding logs a valuable target for subpoenas. That is of interest because of the growing capacity to "wring every ounce of useful information out of such logs", eg establishing a user's identity from an IP address by correlating data from different sources.
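
One form of such correlation can be sketched as follows, assuming a search log supplies an IP address and a time, and an ISP's records show which subscriber held that address at that time. The record layout, subscriber labels and function name are hypothetical.

```python
from datetime import datetime

# Hypothetical ISP records: (ip, lease start, lease end, subscriber).
LEASES = [
    ("203.0.113.5", datetime(2006, 3, 1, 8, 0),
     datetime(2006, 3, 1, 20, 0), "subscriber-A"),
    ("203.0.113.5", datetime(2006, 3, 1, 20, 0),
     datetime(2006, 3, 2, 8, 0), "subscriber-B"),
]


def subscriber_for(ip, when, leases):
    """Return the subscriber who held the IP address at the given time."""
    for lease_ip, start, end, subscriber in leases:
        if lease_ip == ip and start <= when < end:
            return subscriber
    return None  # no matching record
```

The same IP address maps to different subscribers at different times, which is why the timestamp in a log matters as much as the address.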

responses

Some advocates have suggested that anonymisation services are the most effective response to concerns, although the effectiveness of those services is uncertain and they are arguably beyond the capacity of most users.

Critics have suggested a number of potential responses by search engines such as Google and Yahoo! in order to minimise privacy-related risks while not significantly inhibiting research and development.

Wu asked Google to

please stop keeping quite so much information attached to our IP addresses; please modify logging practices so that all identifying information is stripped. And please run history's greatest "search and delete," right now, and take out the IP addresses from every file that contains everyone's last five years of searches.

Lauren Weinstein similarly suggests that search services should -

1) Minimize the length of time that full log records are maintained for users not using enhanced services. For instance, full records might be maintained for 30 days (an arbitrary figure for this example). These would be available to law enforcement queries and the like for ongoing investigations. However, after the expiration period, records would be anonymized (stripped of IP, cookie, and other connection-related data identifying the user). Logged search query strings (though they also can contain personal information, as we know) would not be affected at this stage and would continue to be available for R&D and other purposes, but now with a significantly lower outside abuse potential.

2) After some longer period of time (a year? - again, an arbitrary period for the sake of this example) the remaining portion of the records for non-enhanced service users would be deleted. I of course cannot address the non-trivial issues of system and related data backups in this regard, since I have no idea how Google has structured backup activities across their enterprise, but this aspect in particular might make for an interesting technical challenge.
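
Weinstein's two stages can be sketched as a periodic pass over the log. This is an illustrative sketch only: the record fields are hypothetical, the 30 day and one year thresholds are the arbitrary figures from the example above, and the backup question he raises is not addressed.

```python
from datetime import datetime, timedelta

FULL_RETENTION = timedelta(days=30)    # arbitrary, as in the example above
TOTAL_RETENTION = timedelta(days=365)  # arbitrary, as in the example above


def apply_retention(records, now):
    """Anonymise records older than 30 days; drop those older than a year."""
    kept = []
    for rec in records:
        age = now - rec["time"]
        if age > TOTAL_RETENTION:
            continue  # stage 2: delete the remaining portion of the record
        if age > FULL_RETENTION:
            # Stage 1: strip connection-related identifiers, keep the query.
            rec = {"time": rec["time"], "query": rec["query"],
                   "ip": None, "cookie": None}
        kept.append(rec)
    return kept
```

Keeping the timestamp after anonymisation is an assumption of this sketch; a stricter policy might coarsen or remove it as well.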

Weinstein notes that protecting users of enhanced search-history-based services poses other problems. In order for those services to work some form of detailed data must be maintained for the users. It has been suggested that the potential for abuse could be greatly reduced through various cryptographic mechanisms.

In March 2007 Google announced that all identifying data would be erased after 18-24 months and that

privacy is one of the cornerstones of trust. We will be retroactively going back into our log database and anonymising all the information there.

In June 2007 it informed the Article 29 Data Protection Working Party, the European Union's privacy watchdog group, that it would cut retention of identifiable web search histories from 24 months to 18, with data being anonymised after that period. Google noted that it faced a great lack of legal clarity, with some of its services potentially falling under EU data retention rules that require organisations to keep some electronic communication records for up to two years, and that "We looked at what other companies in the industry do, and we were not able to find explicit and clear privacy policies".

version of June 2007
© Bruce Arnold | caslon analytics