Privacy, Data Combination, and Why PII Can Be A Red Herring


Last week, Snapchat had a security breach. Their response was pretty typical, and unconvincing: they blamed the data leak - over 100,000 pictures that were supposed to be "temporary" - on end users for using third party apps. But Snapchat's record here is spotty at best. In late 2013/early 2014, Snapchat leaked data of about 4.6 million users. Both of these breaches were preceded by the revelation that the "vanishing" images sent via Snapchat were not actually deleted - just hidden on your phone by Snapchat's app. Predictably, Snapchat dismissed this basic flaw by saying it's not a big deal because retrieving the data is hard.

A few days after the most recent Snapchat breach, the Guardian reported that Whisper - an app used to anonymously spill secrets - was sharing data with the US Department of Defense, tracking the location of people who opted out of location tracking, and highlighting specific users for increased scrutiny.

Whisper provides this description of their service:

With Whisper, you can anonymously share your thoughts and emotions with the world, and form lasting and meaningful relationships in a community built around trust and honesty. If you have ever had something too intimate to share on traditional social networks, simply share it on Whisper!

Whisper responded that the Guardian article was filled with lies and inaccuracies, and the Guardian debunked these claims. Whisper also rewrote its privacy policy four days before the story broke to allow the behavior the article describes - and Whisper had every right to do this because, like most tech and EdTech companies, it can change the terms at any point, with no notice to end users.

Before the Whisper debacle, we had a blow-up over Secret - the app of choice for anonymous harassment and bullying - which had received $36 million in VC funding. Apparently, the founder's apathy toward teen suicide was not a significant enough liability to discourage VC funding (although, in fairness, it does seem that some VCs stayed away from the company).

Jonathan Zdziarski has a solid writeup on Whisper's iOS app, and his piece is worth reading for many reasons, but especially for this universally relevant gem:

Tracking a unique identifier across the lifetime of an application could trivially be used by a company to build a history and profile for the subject, associating all of their former posts, photos, searches, and other stored data with a single identity. Any single message, then, containing identifying correspondence – or multiple messages containing different minor details that could be correlated to form an identity, will positively identify not only the user, but also correlate it to their entire history within the app. Further associating a GPS location to this data would, over the long term, easily provide enough information to determine the user’s identity, simply by analyzing the overlaps of geo-coordinates over a time period.

It also can't be emphasized enough that the data collected by Whisper - a unique device ID, IP addresses, browser, operating system, browsing habits, etc., all tied to a consistent, unique user ID - are pretty comparable to what most education technology companies collect from their users. As the Guardian article reports, this information allows tech folks to make the following claim about an individual, and mean it:

"He’s a guy that we’ll track for the rest of his life and he’ll have no idea we’ll be watching him," (a) Whisper executive said.

And please understand: the functionality of Whisper is different from that of EdTech apps. But the underlying data collected by Whisper is very similar to what many EdTech apps collect. Zdziarski's writeup of the Whisper app (quoted above) contains many general truths that are equally applicable to EdTech applications. We need to start thinking about apps in two ways: first, the functionality they deliver, and second, the data they collect in the process of our use (and the potential uses of that data, both on its own and when combined with other datasets). We'll revisit this idea later in this post.
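To make Zdziarski's point concrete, here is a minimal sketch - the log format, device ID, and coordinates are entirely made up - of how a handful of "blurred" location pings tied to one consistent device ID can point straight at a household:

```python
# Minimal sketch: how "blurred" GPS pings tied to one device ID can point at a home.
# The log format, device ID, and coordinates are all hypothetical.
from collections import Counter, defaultdict

# Hypothetical app log: (device_id, lat, lon, hour_of_day)
pings = [
    ("device-42", 44.0511, -123.0862, 23),  # late-night pings cluster at "home"
    ("device-42", 44.0524, -123.0858, 0),
    ("device-42", 44.0508, -123.0871, 1),
    ("device-42", 44.0450, -123.0700, 14),  # a daytime ping somewhere else
]

def guess_home_cells(pings):
    """For each device, find the most common coarse grid cell among
    overnight pings - the 'asleep at home' heuristic."""
    cells = defaultdict(Counter)
    for device, lat, lon, hour in pings:
        if hour >= 22 or hour <= 5:
            cell = (round(lat, 2), round(lon, 2))  # ~1 km cell, i.e. already "blurred"
            cells[device][cell] += 1
    return {device: counts.most_common(1)[0][0] for device, counts in cells.items()}

print(guess_home_cells(pings))
# {'device-42': (44.05, -123.09)} - a city block, which a reverse geocode or a
# voter file narrows to a handful of households.
```

That's a dozen lines of code and a few days of logs - no name, no email, no "PII" in the traditional sense required.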

The backdrop to the slipshod privacy and security practices, paired with apathy toward user safety, buttressed by terms of service and privacy policies that are hostile to end users, is the violence levelled at people who have traditionally been underrepresented in technology. Rape threats, death threats, threats of violence against family members, and harassment are routinely used to silence people. The experiences of Kathy Sierra, Anita Sarkeesian, and Brianna Wu all speak to the larger barriers - and aggressions and microaggressions - routinely experienced yet rarely acknowledged by the "experts" - or the people with access to money - in the tech world. The silence of the EdTech world - and in particular, the silence of the "games will save education" and the "let's get more women in STEM" crowd - has been deafening, to the point where it borders on complicity.

To recap: we have an ecosystem of apps that routinely collect more data than they need to deliver the functionality they claim to deliver. We have companies that repeatedly show they care more about organizational growth than user privacy. We have a VC funding system that appears apathetic at best to user needs. We have all of these elements within a system that regularly turns a blind eye to the various facets of racism, sexism, misogyny, and violence.

And, all of these practices are supported by terms of service and privacy policies that empower the company over the end user.

A little while back, I wrote about Remind, and about how their "safe" application provides no way for teachers to actually verify the identity of anyone who has joined a class. It's a "feature" of their "privacy" approach. Remind has received $60 million in VC funding. In fairness, the founder of Remind reached out to me after that post, made the time to talk over these - and other - concerns, and the company is in the process of reviewing these practices. But Remind gathers data comparable to Whisper's, and then some: unlike Whisper, Remind harvests phone numbers from parents, students over 13, and teachers who install the mobile app, whether or not it's needed.

When users download and use our mobile application, we automatically collect IP address, device ID, device type, user agent browser, what OS they are running, whether or not you signed up on the web, and phone number. If a user is under 13 and using our Student Application we do not collect the device ID or IP address, we will only collect the device type and OS they are running.

Remind's privacy policy, like that of just about every EdTech company out there, contains clauses that define how the data they collect can leak out to other parties. The first way is through partner agreements, and the second is in case of acquisition or bankruptcy. Edmodo's phrasing is comparable to Remind's, and both are comparable to just about every VC-funded EdTech company currently in operation:

If Edmodo, or some or all of its assets were acquired or otherwise transferred, or in the unlikely event that Edmodo goes out of business or enters bankruptcy, user information may be transferred to or acquired by a third party.

Given how some segments of the tech world fetishize failure, and how difficult it is for startups to achieve longevity, it is a certainty that learner data from some of the apps currently in use will be transferred via bankruptcy or acquisition.

In the education space - and arguably, outside it as well - we need to expand how we think about data collection and data sharing. Personally Identifiable Information (or PII) is becoming increasingly meaningless. A device ID - a unique identifier on a piece of hardware - can serve as a proxy for an individual. Location data - even blurred location data - can be used to derive a person's identity with a high level of accuracy. From a privacy standpoint, while PII matters, the true power of data - and the potential damage caused by abuses of that power - comes when data from multiple sources is combined. If a dataset from an EdTech vendor is transferred as part of an affiliate agreement, or as part of a sale or bankruptcy, we lose control over the fate of our information.
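To see why combination - not any single field - is the real issue, here's a minimal sketch, using entirely made-up records and field names, of how a "de-identified" EdTech export and a broker file that share nothing but a device ID resolve to a named household the moment they are joined:

```python
# Minimal sketch of data combination. Neither dataset alone names a student;
# the shared device ID links them. All records and field names are made up.

# A "de-identified" EdTech export: no name, no email, just behavior and a device ID.
edtech_records = [
    {"device_id": "a1b2c3", "reading_level": 2.1, "behavior_flags": 7},
    {"device_id": "d4e5f6", "reading_level": 4.8, "behavior_flags": 0},
]

# A data broker file keyed on the same device ID: household-level identity.
broker_records = [
    {"device_id": "a1b2c3", "household": "Smith", "zip": "97401", "income_band": "low"},
]

broker_by_device = {r["device_id"]: r for r in broker_records}

for record in edtech_records:
    match = broker_by_device.get(record["device_id"])
    if match:
        # The join is the whole trick: "anonymous" learning data becomes a named
        # household's learning data, available to whoever ends up holding either set.
        print({**record, **match})
```

Neither file, on its own, would trip most definitions of PII. Together, they describe a specific child in a specific household.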

The business of combining data is lucrative; currently, around 4,000 data brokers participate in a $156 billion a year market. Data brokers collect data on a range of subjects:

They have created lists of victims of sexual assault, and lists of people with sexually transmitted diseases. Lists of people who have Alzheimer’s, dementia and AIDS. Lists of the impotent and the depressed.
There are lists of “impulse buyers.” Lists of suckers: gullible consumers who have shown that they are susceptible to “vulnerability-based marketing.” And lists of those deemed commercially undesirable because they live in or near trailer parks or nursing homes.

The fact that these lists often contain inaccuracies doesn't prevent their use in a range of industries:

Typically sold at a few cents per name, the lists don’t have to be particularly reliable to attract eager buyers — mostly marketers, but also, increasingly, financial institutions vetting customers to guard against fraud, and employers screening potential hires.

More recently, in the build-up to the housing crisis, we saw big data from data brokers used to shape racist policies in the housing market:

Specifically targeted for subprime loans among the minority demographic were black women. Women of color are the most likely to receive subprime loans while white men are the least likely; the disparity grows with income levels. Compared to white men earning the same level of income, black women earning less than the area median income are two and a half times more likely to receive subprime. Upper-income black women were nearly five times more likely to receive subprime purchase mortgages than upper-income white men.
The services of data collection agencies made it easy for lenders who were able to buy information about a potential borrower’s age, race and income. Armed with that information, it was easy for lenders to target moderate-to-high income women of color.

When we think about privacy, we need to expand our view to include more than the data that gets collected, and focus on where that data can end up. Privacy policies, and user ownership of and control over their data, are central to this.

Think about teachers feeding interaction data into Edmodo. Think about Class Dojo collecting a detailed data store that could be used to identify (how teachers perceive) impulsivity or obedience. Think about student intellectual property taken without compensation by TurnItIn. Think about test data and college planning data collected and stored by ETS. Think about students in EAA schools in Detroit, Recovery District schools in New Orleans, KIPP schools in Los Angeles, having their learning habits observed and stored by Agilex. Think about the data sets on kids stored by Rocketship charters. Think about the data flowing into Schoolzilla.

We need to start looking at privacy, data collection, the lack of understanding of abusive dynamics, and the trends of tech and EdTech funding by VCs as related issues. The issues discussed in this post are exacerbated by teachers who partner with companies - in the form of Brand Ambassador or Advisory programs - and then form a social media phalanx to insulate the companies from criticism. While most of these relationships are benign, they are generally not disclosed, which creates some immediate conflicts of interest.

A complaint I hear frequently - often multiple times a day - is that addressing privacy issues feels overwhelming, and that there is no place to start. And yes, it definitely can feel overwhelming, but we all have the opportunity to start in on this every time we interact with technology. We start improving privacy when we call out abusive dynamics online. We start improving privacy when we let a vendor know we will not be using their app with students because of their privacy policies. We start improving privacy when we talk to our colleagues about how privacy - and respect for student data - informs our tech choices. We start improving privacy when we talk to our schools, our school boards, and our elected officials about the ways that current practice needs to improve. Vendors won't listen as long as people use products with bad policies, and VC funders will continue to fund these bad products as long as there is a chance for profit.