Hiding Behind Metadata


In reading through the coverage of the ongoing data collection by the government, one of the ways that people obfuscate the depth of the privacy intrusion is by hiding behind the term "metadata."

David Brooks, on NPR's Week In Politics, provides an example of how jargon is used to obscure reality:

I'm somewhat bothered by the secrecy, but I don't feel it's intrusive. Basically, they're running huge amounts of megadata through an algorithm. That feels less intrusive to me than the average TSA search at the airport.

A more accurate description is that a TSA search is more immediate, physical, and obvious; it is intrusive in a palpable way. By describing the government data grab as passing data through an "algorithm," Brooks attempts to create distance between life (real, physical, and immediate) and what the government is doing (just some geeks with pocket protectors and lab coats). The phrase "running megadata through an algorithm" is, from a technical standpoint, meaningless. If you're using any computer, from a smartphone to a mainframe, you are running data through an algorithm. It's what computers do. It's how repetitive tasks get automated.

Returning to the specifics of what metadata can show, the Electronic Frontier Foundation has a great post on why metadata matters. They highlight some scenarios showing how, just by examining information about a call, you can easily infer the details of what was discussed:

They know you spoke with an HIV testing service, then your doctor, then your health insurance company in the same hour. But they don't know what was discussed.

They know you received a call from the local NRA office while it was having a campaign against gun legislation, and then called your senators and congressional representatives immediately after. But the content of those calls remains safe from government intrusion.

They know you called a gynecologist, spoke for a half hour, and then called the local Planned Parenthood's number later that day. But nobody knows what you spoke about.

The EFF examples show how basic information about a call can reveal key details that suggest its contents and nature. However, the dataset the government collects is more multifaceted than what the EFF discusses. Because the government is collecting multiple datasets from different sources, it can cross-reference these datasets in more sophisticated ways to draw more specific inferences.
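To make "cross-referencing" concrete, here is a minimal sketch, in Python, of joining two metadata sets: call records and location records, matched on person and hour. The field names, data shapes, and the join itself are hypothetical illustrations, not a description of any actual system; the point is only that a simple join turns two bland datasets into a much more specific picture (who called whom, and from where).

```python
def cross_reference(calls, locations):
    """Join call metadata with location metadata on (person, hour).

    Hypothetical record shapes, for illustration only:
    calls:     [{"caller": "555-0101", "callee": "555-0199", "hour": 14}, ...]
    locations: [{"person": "555-0101", "hour": 14, "cell_tower": "downtown-3"}, ...]
    """
    # Index locations by (person, hour) so each call can be enriched in one lookup.
    where = {(loc["person"], loc["hour"]): loc["cell_tower"] for loc in locations}
    enriched = []
    for call in calls:
        tower = where.get((call["caller"], call["hour"]))
        if tower is not None:
            enriched.append({**call, "caller_location": tower})
    return enriched
```

Neither dataset says much on its own; joined, they say where you were standing when you made a particular call.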

From phone records, the government can set a baseline of "normal" call activity. From simple online activity, such as Facebook likes, people can be profiled. From location records (available from cell phone data), a pattern of movement can be predicted.
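To make the idea of a "baseline" concrete, here is a minimal sketch of how an analyst might flag deviations in calling frequency. The record format, the weekly bucketing, and the z-score-style threshold are all assumptions chosen for illustration; this is not a claim about how any real surveillance system works.

```python
from collections import defaultdict
from statistics import mean, stdev

def weekly_call_counts(call_records):
    """Count calls per (caller, week) from simple metadata rows.

    Each record is assumed (hypothetically) to look like:
    {"caller": "555-0101", "callee": "555-0199", "week": 23}
    """
    counts = defaultdict(lambda: defaultdict(int))
    for rec in call_records:
        counts[rec["caller"]][rec["week"]] += 1
    return counts

def flag_deviations(call_records, threshold=3.0):
    """Flag callers whose most recent week deviates sharply from their own baseline."""
    flagged = []
    for caller, by_week in weekly_call_counts(call_records).items():
        weeks = sorted(by_week)
        if len(weeks) < 4:  # not enough history to form a baseline
            continue
        history = [by_week[w] for w in weeks[:-1]]
        latest = by_week[weeks[-1]]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(latest - mu) / sigma > threshold:
            flagged.append(caller)
    return flagged
```

Nothing in that sketch requires knowing what anyone said on the phone; the pattern of calls alone is enough to single someone out.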

To use one example, let's say data analysts set a flag on people who have identified with Tea Party groups to look for deviations in calling frequency. Want to get a sense of what the person making these calls is thinking about? Look at their search history (likely accessible from data provided by Google, Microsoft, and Facebook). Look at any videos they watched (available via data from YouTube). Based on their past movement history, see if they went anywhere out of the ordinary. Then, because the government has a list of this person's contacts, run the same analysis on those contacts, going out two degrees (friends, and friends of friends).
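To show how quickly that "two degrees out" step widens the net, here is a small sketch of expanding a contact graph from a single flagged person. The contact_graph structure and the numbers in the comment are hypothetical; the point is that the same analysis, applied to friends and friends of friends, sweeps in a large set of people who were never individually suspected of anything.

```python
def expand_contacts(contact_graph, seed, degrees=2):
    """Collect everyone within `degrees` hops of a flagged person.

    contact_graph maps a person to the set of people they have called,
    e.g. {"alice": {"bob", "carol"}, "bob": {"dave"}} (hypothetical data).
    """
    swept_up = {seed}
    frontier = {seed}
    for _ in range(degrees):
        next_frontier = set()
        for person in frontier:
            # Add this person's contacts that we haven't already seen.
            next_frontier |= contact_graph.get(person, set()) - swept_up
        swept_up |= next_frontier
        frontier = next_frontier
    return swept_up

# With even modest contact lists, two hops covers a lot of ground:
# someone with 100 contacts, each of whom has 100 contacts, can pull
# on the order of 10,000 people into the analysis.
```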

If you're tracking a terrorist, this is incredibly useful information. However, what happens when a whistleblower gets redefined as a terrorist? Would David Brooks be comfortable with his "metadata" - and that of his contacts, and of his contacts' contacts - getting run through an "algorithm?"

Metadata - on its own, as a single data point - can provide a fair amount of information about what a person is doing. Metadata from multiple sources, cross-referenced, moves us to a far greater level of precision. When you hear someone discussing this issue describe the data grab as "just metadata," you have witnessed an act of obfuscation.
