Privacy Policies and Machine Learning


Today, Google announced the release of their v2 machine learning system under an open source license. This is a big deal for a few reasons. First, to understate things, Google understands machine learning. The opportunity to see how Google approaches machine learning will save huge numbers of people huge amounts of time. Second, this lets us take a look inside what is generally a black box. We don't often get to see how ratings, reviews, recommendations, etc. are made at scale. This release peels back one piece of one curtain and lets us look inside.

Before we go any further, it's worth highlighting that machine learning - even with a solid codebase - is incredibly complex. Doing it well involves a range of work on infrastructure, data structure, training the algorithm, and constant monitoring for accuracy - and even then, there is still a lot of confusion and misconception about what machine learning does and what it should do. At the very least, doing machine learning well requires clearly defined goals, a reliable dataset, and months of dedicated, focused work training the algorithm with representative data. The codebase can jumpstart the process, but it is only the beginning.

As part of the work we're doing at Common Sense Media, Jeff Graham and I are working with a large number of school districts on a system that streamlines the process of evaluating the legal policies and terms of a range of education technology applications.

The first part of this work involves tracking policies and terms so that (among other things) we are alerted when a policy changes and an evaluation needs to be updated. This tracking will allow a range of other observations - and we have started talking about some of them already.
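To make the change-tracking idea concrete, here is a minimal sketch of one way it could work. This is not the system we are building; the function names `fingerprint` and `changed_lines` are hypothetical, and the sketch assumes each policy has already been fetched and saved as plain text.

```python
import difflib
import hashlib


def fingerprint(text: str) -> str:
    """Hash the policy text after collapsing whitespace, so that
    pure formatting changes do not register as a policy change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_lines(old: str, new: str) -> list[str]:
    """Return the lines that were removed ("- ") or added ("+ ")
    between two saved versions of a policy."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? ");
    # keep only the actual removals and additions.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical example text, not taken from any real policy.
old_policy = "We collect your email address.\nWe do not sell data."
new_policy = "We collect your email address.\nWe may share data with partners."

if fingerprint(old_policy) != fingerprint(new_policy):
    for line in changed_lines(old_policy, new_policy):
        print(line)
```

Comparing fingerprints is a cheap way to decide whether anything changed at all; the line-level diff is only needed when it has, and it points a reviewer at the part of the evaluation that may need updating.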

The second part of this work involves mapping specific paragraphs in privacy policies to specific privacy concerns. When it comes to evaluating policies, this analysis is the most time-consuming step. Doing it well requires reading and re-reading the policies to pull relevant sections together. While there are ways to simplify this, these methods are more useful for a general triage than a comprehensive review.
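As an illustration of what a general triage might look like, here is a rough sketch that uses plain keyword matching, not machine learning. The concern names and keyword lists in `CONCERN_KEYWORDS` are made up for this example, and the `triage` function is hypothetical; a real mapping would need far more care.

```python
import re

# Assumed, illustrative mapping of privacy concerns to trigger phrases.
CONCERN_KEYWORDS = {
    "data sharing": ["third party", "third parties", "share", "disclose"],
    "advertising": ["advertis", "marketing"],
    "data retention": ["retain", "retention", "delete"],
}


def triage(policy_text: str) -> dict[str, list[int]]:
    """Map each concern to the indexes of paragraphs that mention
    at least one of its keywords (case-insensitive substring match)."""
    # Treat blank lines as paragraph breaks.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", policy_text) if p.strip()]
    matches = {concern: [] for concern in CONCERN_KEYWORDS}
    for index, paragraph in enumerate(paragraphs):
        lowered = paragraph.lower()
        for concern, keywords in CONCERN_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                matches[concern].append(index)
    return matches


# Hypothetical example text, not taken from any real policy.
sample_policy = (
    "We share information with third parties.\n\n"
    "We retain records for one year.\n\n"
    "Contact us with questions."
)
print(triage(sample_policy))
```

A pass like this can point a reviewer at the paragraphs worth reading first, but it will miss anything phrased in unexpected language, which is why it works as triage and not as a comprehensive review.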

However, Jeff has been looking for a while at machine learning as a way to simplify the initial triage. To understate things, it's complicated. Doing it right - and training and adjusting the algorithm - is no small feat. Implementing machine learning as part of the privacy work is a distant speck on a very crowded roadmap, and we have a lot of work to do before it makes sense to begin using it for the initial categorization. But announcements like the one from Google today get us closer.