Back in 2012, Google thought I was a man.
Let me back up. In January of that year, the search giant released a new privacy policy that, for the first time, sought to aggregate your usage data from across its array of products—including Google Search, Gmail, Google Calendar, YouTube, and others—into a single profile. This change caused quite a stir, both inside and outside of tech circles, and as a result, users flocked to the “ad preferences” section of their profiles, where Google had listed the categories that a user seemed to be interested in, as inferred from their web usage patterns—like “Computers & Electronics,” or “Parenting.” But in addition to those categories, Google listed the age range and gender it thought you were. It thought I was a man, and somewhere between 35 and 44. I was 28.
Pretty soon, I realized it wasn’t just me: Tons of the women in my professional circle were buzzing about it on Twitter—all labeled as men. So were female writers at Mashable, a tech media site; The Mary Sue, which covers geek pop culture from a feminist perspective; and Forbes, the business magazine. So, what did all of us have in common? Our search histories were littered with topics like web development, finance, and sci-fi. In other words, we searched like men. Or at least, that’s what Google thought.
What Google was doing is something that’s now commonplace for tech products: It was using proxies. A proxy is a stand-in for real knowledge—similar to the personas that designers use as a stand-in for their real audience. But in this case, we’re talking about proxy data: When you don’t have a piece of information about a user that you want, you use data you do have to infer that information. Here, Google wanted to track my age and gender, because advertisers place a high value on this information. But since Google didn’t have demographic data at the time, it tried to infer those facts from something it had lots of: my behavioral data.
The problem with this kind of proxy, though, is that it relies on assumptions—and those assumptions get embedded more deeply over time. So if your model assumes, from what it has seen and heard in the past, that most people interested in technology are men, it will learn to code users who visit tech websites as more likely to be male. Once that assumption is baked in, it skews the results: The more often women are incorrectly labeled as men, the more it looks like men dominate tech websites—and the more strongly the system starts to correlate tech website usage with men.
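To make that feedback loop concrete, here is a minimal sketch in Python of how a proxy can feed on itself. Everything in it is invented for illustration—the category weights, the tiny set of users, and the blending step are assumptions, not Google's actual model or data.

```python
# A toy illustration of proxy inference and the feedback loop it can create.
# The category weights and user histories below are invented for the sketch;
# they are not Google's actual model or data.

from collections import Counter

# Hypothetical starting belief: the share of visitors to each category
# that the model assumes are men, learned from "what it has seen before."
assumed_male_share = {
    "web development": 0.80,
    "finance": 0.70,
    "sci-fi": 0.75,
    "parenting": 0.30,
}

def infer_gender(history):
    """Guess a user's gender from browsing history alone (the proxy)."""
    scores = [assumed_male_share.get(topic, 0.5) for topic in history]
    return "male" if sum(scores) / len(scores) > 0.5 else "female"

# Invented users: two women whose search histories look like the author's.
users = [
    {"actual": "female", "history": ["web development", "sci-fi"]},
    {"actual": "female", "history": ["finance", "web development"]},
    {"actual": "male",   "history": ["finance", "sci-fi"]},
]

# Round 1: the proxy labels everyone who browses tech topics as male.
labels = [infer_gender(u["history"]) for u in users]
print(labels)  # ['male', 'male', 'male'] -- two of the three are wrong

# The feedback loop: if those inferred labels are fed back in as if they
# were ground truth, the assumed male share of tech categories only grows.
visits = Counter()
male_visits = Counter()
for user, label in zip(users, labels):
    for topic in user["history"]:
        visits[topic] += 1
        if label == "male":
            male_visits[topic] += 1

for topic in assumed_male_share:
    if visits[topic]:
        observed = male_visits[topic] / visits[topic]
        # Blend the old assumption with the "observed" (mislabeled) share.
        assumed_male_share[topic] = 0.5 * assumed_male_share[topic] + 0.5 * observed

print(assumed_male_share)
# Tech categories now look even more male than before, so the next woman
# with the same search history is even more likely to be mislabeled.
```

Run once, the sketch mislabels both women; run the labels back through as training data, and the very categories they visited become "more male," hardening the original assumption.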
In short, proxy data can actually make a system less accurate over time, not more, without you even realizing it. Yet much of the data stored about us is proxy data, from ZIP codes being used to predict creditworthiness, to SAT scores being used to predict teens’ driving habits.
It’s easy to say it doesn’t really matter that Google often gets gender wrong; after all, it’s just going to use that information to serve up more “relevant” advertising. If most of us would rather ignore advertising anyway, who cares? But consider the potential ramifications: If, for example, Google frequently coded women who worked in technology in 2012 as men, then it could have skewed data about the readership of tech publications to look more male than it actually was. People who run media sites pay close attention to their audience data, and use it to make decisions. If they believed their audiences were more male than they were, they might think, “Well, maybe women do just care less about technology”—an argument they’ve no doubt heard before. That might skew publications’ reporting on the gender gap in tech companies to focus more on the “pipeline,” and less on structural and cultural problems that keep women out. After all, if women interested in technology don’t exist, how could employers hire them?
This is theoretical, sure: I don’t know how often Google got gender wrong back then, and I don’t know how much that affected the way the tech industry continued to be perceived. But that’s the problem: Neither does Google. Proxies are naturally inexact, writes data scientist Cathy O’Neil in Weapons of Math Destruction. Even worse, they’re self-perpetuating: They “define their own reality and use it to justify their results.”
Now, Google doesn’t think I’m a man anymore. Sometime in the last five years, it sorted that out (not surprising, since Google now knows a lot more about me, including how often I shop for dresses and search for haircut ideas). But that doesn’t stop other tech companies from relying on proxies—including Facebook. In the fall of 2016, journalists at ProPublica found that Facebook was allowing advertisers to target customers according to their race, even when they were advertising housing—something that’s been blatantly illegal since the federal Fair Housing Act of 1968. To test the system, ProPublica posted an ad with a $50 budget, and chose to target users who were tagged as “likely to move” or as having an interest in topics like “buying a house” (some of those zillions of attributes we talked about earlier), while excluding users who were African American, Asian American, and Hispanic. The ad was approved right away. Then they showed the result to civil rights lawyer John Relman. He gasped. “This is horrifying,” he told them. “This is massively illegal.”
But hold up: Facebook doesn’t actually let us put our race on our profile. So how can it allow advertisers to segment that way? By proxies, of course. See, what Facebook offers advertisers isn’t really the ability to target by race and ethnicity. It targets by ethnic affinity. In other words, if you’ve liked posts or pages that, according to Facebook’s algorithm, suggest you’re interested in content about a particular racial or ethnic group, then you might be included. Except Facebook didn’t really position it that way for advertisers: When ProPublica created its ad, Facebook had placed the ethnic-affinity menu in the “demographics” section—a crystal-clear sign that this selection wasn’t just about interests, but about identity.
There are legitimate reasons for Facebook to offer ethnicity-based targeting—for example, so that a hair product designed for black women is actually targeted at black women, or so that a Hispanic community group reaches Hispanic people. That makes sense. And since ProPublica’s report, Facebook has started excluding certain types of ads, such as those for housing, credit, and employment, from using ethnic-affinity targeting. But by using proxy data, Facebook didn’t just open the door for discriminatory ads; it also opened a potential legal loophole: it can deny that it was operating illegally, because it wasn’t filtering users by race, but only by interest in race-related content. Sure.
There’s also something deeply worrisome about Facebook assigning users an identity on the back end, while not allowing those same users to select their own identity in the front end of the system, says Safiya Noble, an information studies scholar. “We are being racially profiled by a platform that doesn’t allow us to even declare our own race and ethnicity,” she told me. “What does that mean to not allow culture and ethnicity to be visible in the platform?”
What it means is that Facebook controls how its users represent themselves online—preventing people from choosing to identify themselves the way they’d like, while enabling advertisers to make assumptions. And because all this is happening via proxy data, it’s obscured from view—so most of us never even realize it’s happening.
Sara Wachter-Boettcher is a web consultant and author of the forthcoming book Technically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech.
Excerpted from Technically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech by Sara Wachter-Boettcher. © 2017 by Sara Wachter-Boettcher. Used with permission of the publisher, W.W. Norton & Company, Inc. All rights reserved.