BuzzFeed and Methods for Tracking the Trackers; or This Is Hard, Chapter 9674

For the last several months, Kris Shaffer and I have been working together on tracking news sites, partisan sites, and hate sites, and their relative popularity on social media. We have also been looking at the advertising and tracking technology used on these sites in an effort to understand how these sites generate revenue. Based on our research, with initial summaries published between February and late March 2017, we concluded that ad tech and tracking allow misleading news and hate speech to generate revenue.

Kris has three posts on the subject:

I published this piece:

We have been continuing this work because, while our early research showed some significant and interesting patterns, these issues are complex, and we want to be thorough.

Fortunately, there are other people doing similar work. This BuzzFeed article published in early April looks at very similar details to what Kris and I have been researching, and reaches some similar conclusions. However, when reviewing the data behind the BuzzFeed work, I noticed some anomalies that appear to be related to the methodology used to collect the data supporting the BuzzFeed piece.

At the outset, I want to highlight that this conversation wouldn't be possible if all of us weren't describing our methods. While the methodology of the BuzzFeed piece omits some essential details, the overall conclusions still hold up. The need to counter misinformation and the business models that make misinformation profitable are universally recognized, and the more people we have looking at these details, the better.

The more we credit the range of work happening in this space, the better. One paper I hadn't seen until yesterday was this study from Mezzobit. I will definitely be reaching out to look at this service. I have also benefited from being able to talk with and learn from David Carroll, Chris Gilliard, Jeff Graham, and Girard Kelly, among others.

But, returning to the BuzzFeed story, this post will look at three main concerns: the methodology, the focus on display ads versus the larger ecosystem, and how BuzzFeed's adtech practices compare to the companies they study. I have additional questions on the use of archive.org as a tool to track adtech, but a detailed discussion of that topic is outside the scope of this post.

Methodology (Ghostery-based versus intercepting proxy)

Our methodology in studying trackers is pretty straightforward. We use OWASP ZAP (an intercepting proxy) to capture activity when we visit a site. Next, we export all URLs from the session, which is core functionality in the proxy. Finally, we use tldextract to break these URLs down into their component pieces to make them easier to study. This gives us a precise (albeit labor-intensive) view into which trackers are placed on which sites.
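
As a rough illustration of that last step, here is a minimal sketch of breaking exported URLs into host-level pieces and tallying requests per domain. It is not our actual script. To keep it dependency-free it uses only the Python standard library and a naive "last two labels" rule as a stand-in for tldextract, which consults the Public Suffix List and correctly handles suffixes such as co.uk. The function names and the sample URLs are assumptions made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable
from urllib.parse import urlsplit


def split_url(url: str) -> Dict[str, str]:
    """Break a URL into host-level pieces using a naive two-label heuristic."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Naive assumption: the registered domain is the last two labels.
    # This is wrong for suffixes such as "co.uk"; tldextract handles those.
    registered = ".".join(labels[-2:]) if len(labels) >= 2 else host
    subdomain = ".".join(labels[:-2])
    return {"host": host, "registered_domain": registered, "subdomain": subdomain}


def count_registered_domains(urls: Iterable[str]) -> Counter:
    """Tally how many requests went to each registered domain."""
    return Counter(split_url(u)["registered_domain"] for u in urls)


# Illustrative URLs standing in for a proxy export (not real session data).
exported_urls = [
    "https://pixel.quantserve.com/pixel?a=1",
    "https://secure.quantserve.com/quant.js",
    "https://b.scorecardresearch.com/beacon.js",
]
domain_counts = count_registered_domains(exported_urls)
```

Grouping by registered domain rather than by full hostname is what makes the export readable: dozens of subdomains collapse into the handful of companies behind them.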

There are multiple ways to get this view, each with its own advantages and drawbacks. The BuzzFeed methodology uses a web-based tool:

Liliana Bounegru, a co-investigator on the upcoming A Field Guide To Fake News, used the Tracker Tracker tool to extract ad trackers currently present on the homepage and one article page of each of these sites. Some sites on the list are no longer active, so those were discarded in the analysis. Bounegru then used the Wayback Machine to look for archived versions of the homepage and an article page for each of these sites prior to November 2016. In the end, we identified 51 sites that had trackers on their archived pages and were still online in March 2017.

The tracker tool used to drive the BuzzFeed article is a web-based tool that appears to be based on Ghostery. While its output is informative, it's not precise enough to be considered complete. It's still a useful tool because it's going to be imprecise in consistent ways, but the imprecision can lead to a lack of necessary detail.

As an example, the BuzzFeed article mentions multiple sites where they were unable to identify the source of some ads and their associated ad networks.

The networks serving ads on the pages were collected into a spreadsheet. In some cases, we were not able to identify the provider responsible for pop-unders that were present on several sites. We noted that in the spreadsheet.

Using an intercepting proxy, identifying the source of the pop-under ads is pretty straightforward. We ran a test on TMZWorldStarNews, one of the sites identified as having unidentified pop-under ads. The full archive of the BuzzFeed data set is available on Github.

In our review, the URL of the pop-under was http://www.sike.tv/topvideos/?utm_source=advertisecom&utm_campaign=STV-01-RON&utm_term=72822-iy

When we look at the URL, it contains the string "?utm_source=advertisecom" - and advertise.com is a known ad network. When we take a deeper look into the proxy logs, we can see the full set of pop-unders that will be triggered by this provider, along with the affiliated URLs used to deliver content. In tracking ads, affiliating domains with specific providers is both important and difficult to do. Using an intercepting proxy helps give a clearer view of the actual traffic, which helps make these connections.
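
Pulling the campaign parameters out of a URL like this one is easy to automate. The sketch below uses only the Python standard library; the function name campaign_params is an assumption for illustration, and the URL is the pop-under URL quoted above.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def campaign_params(url: str) -> Dict[str, str]:
    """Return the utm_* query parameters of a URL as a flat dict."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each utm_* key.
    return {key: values[0] for key, values in query.items() if key.startswith("utm_")}


popunder_url = (
    "http://www.sike.tv/topvideos/"
    "?utm_source=advertisecom&utm_campaign=STV-01-RON&utm_term=72822-iy"
)
params = campaign_params(popunder_url)
print(params["utm_source"])  # advertisecom
```

A utm_source value is a hint, not proof: it is set by whoever builds the link, so it still needs to be checked against the rest of the proxy log before attributing an ad to a network.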

Display ads versus trackers/advertising ecosystem

The BuzzFeed article appears to direct attention onto what ads get displayed, rather than the larger tracking ecosystem.

In order to determine the ads currently running on fake news sites, Silverman visited 76 active fake news sites without an ad blocker, and with the Ghostery browser plugin enabled. (Ghostery identifies which ad trackers are active on a given webpage, and is also used in the Tracker Tracker tool.) For each site he visited the homepage and at least one article page to examine the ads.

However, focusing solely on the display of ads omits the larger ecosystem of vendors that track users. Using the example of TMZWorldStarNews, the BuzzFeed dataset doesn't identify any trackers.

Using an intercepting proxy, we observe nearly 800 different calls to several hundred distinct URLs while visiting the homepage and a single article on TMZWorldStarNews. Scores of these distinct URLs belong to ad trackers. Each of these ad trackers gets data on users, and many of these ad trackers appear affiliated because they pass cookie IDs to one another. These affiliations are visible via an intercepting proxy, although spotting them requires some detailed searches through the proxy logs.
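
One way to narrow down those searches is to look for long query-string values that show up in requests to more than one host, which is a common signature of an ID being passed between trackers. The following is a minimal sketch under stated assumptions, not the procedure we followed: the length threshold, the function name, and the example.com-style hosts and IDs are all invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set
from urllib.parse import parse_qsl, urlsplit

# Assumption: values shorter than this are unlikely to be user identifiers.
MIN_ID_LENGTH = 12


def shared_values(urls: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each long query-string value to the hosts that received it,
    keeping only values seen on more than one host."""
    hosts_by_value = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        for _name, value in parse_qsl(parts.query):
            if len(value) >= MIN_ID_LENGTH:
                hosts_by_value[value].add(host)
    return {value: hosts for value, hosts in hosts_by_value.items() if len(hosts) > 1}


# Hypothetical requests standing in for a proxy export.
logged_urls = [
    "https://sync.tracker-a.example/match?uid=abc123def456ghi789",
    "https://px.tracker-b.example/sync?partner_uid=abc123def456ghi789&r=1",
    "https://cdn.tracker-c.example/lib.js?v=3",
]
candidates = shared_values(logged_urls)
```

Anything this surfaces is only a candidate. Long values can also be redirect targets or cache-busting strings, and IDs passed in cookies or request bodies will not appear in the URL at all, so each match still needs a manual look in the proxy.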

What does BuzzFeed do?

Another interesting question that we encounter in our study of ad tracking centers on how more mainstream sites track their visitors and deliver ads. It's one thing to say that ad networks will indiscriminately sell to misinformation sites, but it's still another thing when mainstream sites continue to work with ad tech vendors who will sell to anyone. If we look at the web through the lens of ad tech, many web sites with very different content have significant overlaps via the ad tech they use.

From a quick glance at the ad tech used on BuzzFeed, we see some overlaps with what we observed on TMZWorldStarNews. Both sites make calls to the third-party sites/ad trackers listed below:

  • agkn.com
  • crwdcntrl.net
  • demdex.net
  • doubleclick.com
  • facebook.com
  • moatads.com
  • nexac.com
  • quantserve.com
  • quantcount.com
  • scorecardresearch.com
  • twitter.com
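
A comparison like this reduces to a set intersection over the third-party domains seen on each site. The sketch below shows the idea; the function name is an assumption, and the two input sets are small illustrative samples (a few domains from the list above plus made-up .example entries), not the full sets we observed.

```python
from typing import Iterable, List


def shared_third_parties(site_a: Iterable[str], site_b: Iterable[str]) -> List[str]:
    """Return the domains contacted by both sites, sorted alphabetically."""
    return sorted(set(site_a) & set(site_b))


# Illustrative samples only; the .example domains are placeholders.
site_a_domains = {"demdex.net", "moatads.com", "only-on-a.example"}
site_b_domains = {"demdex.net", "moatads.com", "only-on-b.example"}

overlap = shared_third_parties(site_a_domains, site_b_domains)
print(overlap)  # ['demdex.net', 'moatads.com']
```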

BuzzFeed is not alone here. As we observed earlier, other mainstream sites use the same adtech as highly partisan or misinformation sites.

How can we expect ad trackers to heed calls for increased responsibility when mainstream news organizations continue to give money and user data to companies that support misinformation?

Conclusion

Tracking ad trackers is far more complex than it should be - and getting the details right is essential in mapping the terrain. Ad tracking - and the profiling it requires - is central to making misinformation profitable. It also lays the foundation for increased information asymmetry, which is a key element in maintaining existing power structures. We need to make the entire ad tracking system easier to understand. It's difficult, complicated work - and that's another reason why people who care about getting this right need to work together.