What You Should Know About “Anonymous” Aggregated Data About You
Originally posted as part of the Choose Privacy Week 2015 Blog, sponsored by the American Library Association Office of Intellectual Freedom. Choose Privacy Week is held the first week of May each year to invite library users (that’s pretty much all of us) into a national conversation about privacy rights in a digital age. The campaign gives libraries the tools they need to educate and engage users, and gives citizens the resources to think critically and make more informed choices about their privacy.
Today more than ever, we appreciate that raw data has great financial value. The owners of websites, social media tools, and cell phone applications make billions of dollars annually on “targeted,” or “behavioral,” advertising. They assure us that although they collect, share, and use data about us in countless ways, our privacy is safe, because they only use “anonymous” aggregate data. But it turns out that it may not even be possible for aggregate data to be anonymous, no matter how hard one tries to make it so.
Underlying the belief that such privacy protection is possible is the assumption that if we “de-identify,” or “anonymize,” data, it is impossible to identify an individual person from that data. A set of data is “anonymized” when it is stripped of “personally identifiable/identifying information,” or “PII.” Obvious examples of PII include name, social security number, and driver’s license number. Although the concept of PII forms the basis for much privacy law and regulation, it turns out that determining which data are capable of identifying an individual, and thus constitute PII that should be subject to regulation, is not a simple task.
And that’s a pretty serious problem in a world in which leaving a trail of digital footprints has become as natural as exhaling a trail of carbon dioxide.
In my book What You Need to Know About Privacy Law: A Guide for Librarians and Educators, I ask the question “When is ‘anonymous’ not really anonymous?” In the context of data collection, this is kind of a trick question, because the answer is a resounding “Always!”
The key is understanding what makes data valuable. A single data point is worthless; the more connectable data points a set contains, the more valuable it becomes. For example, a useful set of data for a retailer might include gender, age, geographical location, and purchasing habits. It would be useful to this retailer to know that women between the ages of thirty and forty who live in Austin, Texas, buy an average of three pairs of dress shoes annually. Adding more data points makes this set of data more valuable: Do they buy online, at boutiques, or at malls? How much do they spend? When do they shop? Conversely, a single data point – women – is utterly useless.
Recent experiments have shown (some unintentionally) the surprising ease with which apparently anonymous data can be “reidentified,” that is, combined in a manner that results in identifying individuals to a great degree of certainty.
The 2000 study by Latanya Sweeney that is credited with blowing open the door on reidentification showed the results of aggregating various combinations of only three of the many elements contained in publicly available census data. Using data from the 1990 census, the study found that 87% of U.S. residents could be identified by the combination of ZIP code, birth date, and gender; 53% by the combination of city, birth date, and gender; and even 18% by the combination of the much larger area of county with birth date and gender.
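The mechanics behind findings like these are simple enough to sketch in a few lines of code. Here is a toy illustration (all names and records are invented, not census data) of counting how many people share each combination of "quasi-identifiers" — the fewer who share a combination, the closer it comes to identifying someone outright:

```python
from collections import Counter

# A tiny invented "census" for illustration only.
population = [
    {"name": "Alice", "zip": "78701", "birth_date": "1985-03-12", "gender": "F"},
    {"name": "Beth",  "zip": "78701", "birth_date": "1985-03-12", "gender": "F"},
    {"name": "Carol", "zip": "78704", "birth_date": "1990-07-01", "gender": "F"},
    {"name": "Dan",   "zip": "78704", "birth_date": "1982-11-30", "gender": "M"},
]

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` is unique in the set."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Gender alone distinguishes almost no one...
print(unique_fraction(population, ["gender"]))                       # 0.25
# ...but ZIP + birth date + gender pins down half of this tiny set.
print(unique_fraction(population, ["zip", "birth_date", "gender"]))  # 0.5
```

Scale that calculation up to a real census, and you get results like the 87% figure above: the combination of fields, none of which is PII on its own, does the identifying.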
In 2006, AOL responded to the growing interest in open research by releasing an “anonymized” set of 20,000,000 search queries input by 650,000 users. AOL replaced PII such as names and IP addresses with unique identifier numbers, which were needed, of course, to be able to connect search queries made by the same user for the purpose of researching online behavior.
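The substitution AOL performed — often called pseudonymization — can be sketched as follows (a simplified illustration with invented log entries, not AOL's actual process): each user's identifying details are replaced with a stable opaque number, so queries from the same user remain linkable even though the name and IP address are gone.

```python
import itertools

_counter = itertools.count(1)
_pseudonyms = {}

def pseudonymize(identifier):
    """Map a real identifier (e.g., an IP address) to a stable opaque number."""
    if identifier not in _pseudonyms:
        _pseudonyms[identifier] = next(_counter)
    return _pseudonyms[identifier]

# Invented query log for illustration.
log = [
    ("192.0.2.10", "landscapers in Lilburn, Ga"),
    ("192.0.2.99", "weather austin"),
    ("192.0.2.10", "homes sold in shadow lake subdivision"),
]

released = [(pseudonymize(ip), query) for ip, query in log]
print(released)
# The same user keeps the same number, so their queries can still be
# combined -- the very linkability that made reidentification possible.
```

The design tension is visible right in the sketch: the stable number is what makes the data useful for research, and it is also what lets an outsider stitch one person's queries back together.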
As researchers combed through the data, stories began to develop, one of the most startling being the user who searched for the phrases “how to kill your wife,” “pictures of dead people,” and “car crash photo.” Eventually, reporters from The New York Times identified an individual, Thelma Arnold of Lilburn, Georgia, based on her combined searches: “homes sold in shadow lake subdivision gwinnett county georgia,” “landscapers in Lilburn, Ga,” and several searches for people with the last name “Arnold.”
Later in 2006, Netflix released 100,000,000 records, with PII replaced by unique identifier numbers, showing how 500,000 users had rated movies over a six-year period. The records included the movie being rated, the rating given, and the date of the rating. Netflix offered a prize to the first team that found a way to “significantly improve” the algorithm Netflix uses to recommend movies based on user ratings.
In only two weeks, one team published results showing that if you know when, within a two-week range, a friend has rated six movies in the database, you will be able to identify that friend on Netflix 99% of the time, thereby allowing you access to every other review posted by that person (even though Netflix reviews are posted “anonymously”).
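The attack the team described is a linkage attack: match a handful of outside facts about someone against the “anonymous” records until only one record fits. A toy sketch of the idea (heavily simplified, with invented movies and users — not the actual algorithm or dataset):

```python
from datetime import date, timedelta

# Invented anonymized rating records: user number -> (movie, rating, date).
anonymized = {
    101: [("Movie A", 5, date(2005, 3, 1)), ("Movie B", 2, date(2005, 3, 9))],
    102: [("Movie A", 5, date(2005, 6, 1)), ("Movie C", 4, date(2005, 6, 3))],
}

def matches(record, clues, window=timedelta(days=14)):
    """True if every clue matches some rating in the record within the window."""
    return all(
        any(m == cm and r == cr and abs(d - cd) <= window for m, r, d in record)
        for cm, cr, cd in clues
    )

# What you happen to know about a friend: two movies, their ratings,
# and roughly when the ratings were posted.
clues = [("Movie A", 5, date(2005, 3, 5)), ("Movie B", 2, date(2005, 3, 2))]
candidates = [uid for uid, rec in anonymized.items() if matches(rec, clues)]
print(candidates)  # [101] -- a single match identifies the "anonymous" user
```

With millions of real records, a few ratings plus approximate dates are distinctive enough that this narrowing almost always ends at exactly one person — which is the 99% result.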
This study, in particular, exposes how false our sense of security in our online privacy can be, even when our names are not directly or publicly linked with our online activities.
The failure of anonymization to protect privacy, combined with the inherent conflict between privacy of data and the value or utility of data, provides yet more evidence that determining how to balance individual privacy rights and the rights of information users is a moving target.