It’s an Empirical Life
Let me begin by retelling a story told by Nikesh Arora, Google’s Senior Vice President and Chief Business Officer, in the pages of Fast Company. Speaking of his company’s cofounder and CEO Larry Page, Arora recalled, “I was talking to Larry on Saturday. I told him that I’d gotten back from nine cities
in 12 days—Munich, Copenhagen, Davos, Zurich, New Delhi, Bombay, London, San Francisco. There’s a silence for five seconds. And then he’s like, ‘That’s only eight.’”1
I share this anecdote because the article that follows is ostensibly about Big Data, that vague, slightly villainous, suddenly unavoidable term for the type of operation Google runs, and I want to stress that this story has a human side. Big Data’s advocates and enemies tend to downplay this dimension, portraying what is essentially number crunching as an almost supernatural phenomenon, a kind of hyperempirical magic conjured by “clever” algorithms running on ever-more powerful computers processing an ever-widening array of the informational stuff that we knowingly and unknowingly generate. It’s obvious, but still seems worth mentioning, that everything this article discusses is human-made: the technology, the analysis, and the actions that result are essentially human. They may allow for superhuman abilities, but their origins are in us.
Back to Google. But first a disclaimer: Google is amazing. I am utterly dependent on Google and struggle to think of a single component of my life, business or pleasure, that isn’t aided by one Google product or another. This essay certainly wouldn’t exist without it.
And it is precisely this brilliance that makes the scope of Google’s ambitions so unsettling: based on what the company has achieved in a decade and a half, it is hard to locate the limits of what it can do. With each experiment or acquisition (driverless cars, smart thermostats, lie-detecting tattoos, mountain-climbing military robots …), the company’s dream of a data-driven, always-on augmented reality—expressed by Page as a search function “included in people’s brains” so that “when you think about something and don’t really know much about it, you will automatically get information”2—seems closer to reality.
Google’s mission is “to organize the world’s information and make it universally accessible and useful.”3 It’s an ancient, admirable ambition, and no pharaoh, monk, monarch, or modern state has come as close to achieving it as Google has. To understand how Google does it, it helps to know that for the computer scientist, information is abstract; it exists independent of the meaning it expresses or the language used to express it. The mathematician and codebreaker Claude Shannon, who founded information theory with his 1948 paper “A Mathematical Theory of Communication,” insisted on this dissociation. “‘Information’ here,” he wrote in a later exposition, “Communication Theory—Exposition of Fundamentals,” “although related to the everyday meaning of the word, should not be confused with it.”4 The meaning of a message is generally irrelevant, Shannon argued, and though the notion may sound like heresy to the semantically oriented, it remains the central dogma of information theory and ultimately of Big Data.
Google takes this theory as fact, and it forms the basis of many of the company’s greatest accomplishments, including Google Search and Google Translate, two Shannonite triumphs in fields where prevailing opinion assumed that some sort of semantic intelligence allowing computers to “think” and “understand” natural language was necessary to achieve satisfying results. Google’s founding vision, developed when Page and his partner Sergey Brin were still computer science students, was based on the opposite principle: to search in explicit ignorance of semantics—to ignore the content of webpages and organize the Internet via the topology of the Web itself, which links pages to other pages. By focusing on connections, the search engine they developed could estimate the most useful, authoritative, or even interesting results without any understanding of the words they contained. In 2012, writer and software engineer David Auerbach explained Google’s meaning-free method:
The importance of a page about Sergei Prokofiev could be determined, in part, from the number of pages that linked to it with the link text “Sergei Prokofiev.” And in part from the importance of those other pages vis-à-vis Prokofiev. And in part from how often Prokofiev is mentioned on the page. And in part from how much other stuff was mentioned on the page. These signals of a page’s standing are determined from the topological layout of the web and from lexical analysis of the text, but not from semantic or ontological understanding of what the page is about. Sergei Prokofiev may as well be selenium mining.5
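To make the link-topology idea concrete, here is a minimal sketch, in Python, of the kind of iterative link analysis Auerbach describes. The four-page toy web, the damping factor of 0.85, and the fifty iterations are illustrative assumptions rather than Google’s actual index or parameters; the point is only that the scores emerge from who links to whom, never from what any page says.

```python
# A toy, link-only ranking in the spirit of PageRank: every score is computed
# from the link graph alone. The pages and links below are invented.
links = {
    "prokofiev.html": ["composers.html"],
    "composers.html": ["prokofiev.html", "selenium.html"],
    "selenium.html": ["composers.html"],
    "blog.html": ["prokofiev.html", "composers.html", "selenium.html"],
}

damping = 0.85                                   # chance the "random surfer" follows a link
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}      # start every page with an equal score

for _ in range(50):                              # iterate until the scores settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share            # pages inherit standing from their linkers
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page:16} {score:.3f}")
```

Nothing in the calculation knows that prokofiev.html concerns a composer; as Auerbach says, it may as well be about selenium mining.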
Once information is liberated from meaning, it becomes raw material for forms of analysis that determine quality through quantity. The nature of the material matters less than the amount, and the goal becomes to collect as much information as possible regardless of the relative value of each piece. This impulse has driven almost all of Google’s efforts—not to mention those of that other Big Data behemoth, the National Security Agency (NSA). Out of this undifferentiated, seemingly worthless stuff, Google has crafted tools of enormous value, including a “Google corpus,” a trillion-word collection of texts that provides a foundation for the company’s experiments in translation, speech recognition, spelling correction, and other seemingly meaning-dependent processes. In a paper entitled “The Unreasonable Effectiveness of Data,” three Google researchers described the creation of the corpus and explained why an indifference to quality makes the set so valuable. “One of us,” they wrote, “as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words. […] In 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long.” Then the crucial part:
In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks.6
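As a rough illustration of why sheer volume can outweigh dirtiness, here is a sketch of the frequency-count logic the researchers describe. The three error-ridden snippets below are invented stand-ins for a trillion words of unfiltered web text.

```python
# Count word sequences in raw, unfiltered text and use nothing but those
# counts to guess the most likely next word. The "pages" below are invented,
# typos included, to mimic unfiltered web text.
from collections import Counter

raw_pages = [
    "sergei prokofiev wrote peter and the wolf",
    "sergei prokofiev wrote seven symphonies",
    "sergie prokofiev wrote peter and teh wolf",   # errors go in uncorrected
]

words = " ".join(raw_pages).split()
bigram_counts = Counter(zip(words, words[1:]))     # counts for two-word sequences

def most_likely_next(word):
    """Return the continuation of `word` seen most often in the corpus."""
    followers = {b: n for (a, b), n in bigram_counts.items() if a == word}
    return max(followers, key=followers.get) if followers else None

print(most_likely_next("prokofiev"))   # "wrote" -- the signal survives the noise
print(most_likely_next("teh"))         # even the typo has a most likely continuation
```

With three sentences the guesses are fragile; with a trillion words, the same arithmetic underwrites spelling correction, speech recognition, and translation, which is essentially the researchers’ point.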
The belief that even the rawest, most seemingly prosaic scraps of information contain something useful—a physicist might call it potential energy; a financier, option value—and the idea that the enterprising analyst can repeatedly tap the same data for new insights, have triggered a gold rush within certain segments of the free market. Data, according to the World Economic Forum, is “the new ‘oil’—a valuable resource of the 21st century […] a new asset class touching all aspects of society.”7
Not exactly new. For years, companies in finance, insurance, entertainment, and marketing have included client and consumer information within their portfolios of intangible assets; telecommunications companies and governments have always collected personal data as part of their standard operations. Suddenly, this stuff has a purpose. It can expose inefficiencies and trigger insights; it grants data-rich organizations new means to evaluate past efforts and spot revenue-generating opportunities.
Improvements in data recording and storage capacity, and the accuracy of the algorithms used to collect, analyze, and crossbreed the datasets they produce, have made something once considered ancillary seem like the whole point.
Then there are the Facebooks and Twitters, companies that basically deal only in data, or more specifically, only in our data. Social media companies show us the full economic implications of our personal information, especially when they go public. Take Twitter: a currently unprofitable company, with an estimated $1.2 billion in advertising revenue8 and no major physical assets, that was nonetheless valued at $24.9 billion when it went public in November 2013. Here is Twitter cofounder and Chairman Jack Dorsey in his introduction to the company’s filing with the Securities and Exchange Commission:
“The mission we serve as Twitter, Inc. is to give everyone the power to create and share ideas and information instantly without barriers. Our business and revenue will always follow that mission in ways that improve—and do not detract from—a free and global conversation.”9
How will “business and revenue” do that? The document goes on to define a few ways, all of which are utterly dependent on Twitter’s huge and growing pool of free labor—241 million monthly active users as of January 2014, up 30 percent from the previous year. The primary revenue stream is advertising, purchased in the form of “promoting” a tweet, an account, or a trend, which then appears in users’ timelines. Currently, 85 percent of Twitter’s revenue is generated this way, but its future depends on Big Data.
For a fee, Twitter offers access to something called the “firehose,” its stream of public data—about 500 million tweets each day. Twitter has agreements with two firms, DataSift and Gnip, to sell access to the information. Businesses then parse the tweets, sometimes using a technique called sentiment analysis, to aggregate customer feedback or judge the impact of marketing campaigns. One such company, MarketPsych, analyzes tweets as signals for stock-market investments. Describing itself as “a leader in behavioral finance research and consulting,” it has essentially turned Twitter into a real-time monitor of global moods.10 Together with Thomson Reuters, MarketPsych has developed over 18,000 separate indices across 119 countries, updated each minute, on emotional states like optimism, gloom, joy, fear, trust, anger, stress, urgency, and uncertainty. The information is drawn from human expression, and would seem to describe the human condition, but it is collected for computers to interpret. Wall Street’s quantitative analysts feed the data into algorithmic models to search for correlations that can be used to make money. To quote the indices’ marketing materials:
Where there are markets, there are emotions.
Where there are emotions, there are cycles.
Where you understand cycles, you profit.11
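To give a sense of the mechanics, here is a minimal sketch of the kind of tweet-level sentiment aggregation described above. The word lists, sample tweets, and scoring rule are invented for illustration; the commercial models built by MarketPsych and Thomson Reuters are far richer and proprietary.

```python
# A minimal sketch of tweet-level sentiment aggregation into a rolling
# "mood index". Word lists, tweets, and scoring are invented for illustration.
POSITIVE = {"gain", "rally", "optimistic", "confident", "growth"}
NEGATIVE = {"fear", "crash", "losses", "gloom", "uncertain"}

def tweet_score(text):
    """Score one tweet: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def mood_index(tweets):
    """Average sentiment across a batch of tweets, e.g. one minute of the firehose."""
    scores = [tweet_score(t) for t in tweets]
    return sum(scores) / len(scores) if scores else 0.0

sample = [
    "Markets rally as investors grow confident about growth",
    "Fear of a crash spreads, traders brace for losses",
    "Earnings season begins, outlook uncertain",
]
print(round(mood_index(sample), 2))
```

A real pipeline would compute something like this minute by minute, across hundreds of millions of tweets and thousands of separate indices, and feed the resulting series into trading models hunting for correlations.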
Twitter may aspire to “improve—and not detract from—a free and global conversation,” but what its shareholders want is a constantly flowing pipeline of personal data that advertisers and the many middlemen reading our tweets will pay to be part of. Wall Street is willing to temporarily overlook the company’s lack of profits because it knows that the goods flow in essentially for free. Twitter established a platform—based largely on the unpaid efforts of others (from the filing: “Many of our products and services contain open-source software, and we license some of our software through open- source projects, which may pose particular risks to our proprietary software, products, and services in a manner that could have a negative effect on our business.”)—and millions of people gave it life. Now that platform is worth 30 billion speculative dollars and the users can expect to be compensated with, at best, ads of unprecedented precision and, at worst, a mainline into the NSA and any number of credit agencies, insurance companies, and police departments, all of whom are developing means to analyze us via publicly available personal info—credit reports and consumer-marketing data as proxies for blood and urine samples, for instance; vital signs, body language, and location as proxies for possible terrorist intent.
“Price is the money-name of the labour realised in a commodity,” wrote Karl Marx.12 Marx identified the gap between the price at which the worker sells his labor and the amount the employer receives for the commodity, and called it surplus value. He considered this gap the basis of capitalism, and it is hard to imagine a purer expression of his theory than Twitter, Facebook, or Instagram, the last a company that, when it was sold to Facebook for about one billion dollars in 2012, had 30 million users and only 13 employees. To quote a comparison by MIT’s Erik Brynjolfsson and Andrew McAfee: Kodak, which filed for bankruptcy a few months before the Instagram sale, employed 145,000 people at its peak.
The distinction speaks to why Big Data is so powerful, lucrative, and potentially destructive. In America, wage-labor compensation has barely changed in over four decades. Research by Lawrence Summers, among others, suggests that the substitution of machines for human labor explains both this stagnation and the increase in economic inequality that it produces.13 Because humans are decreasingly necessary to perform industrial tasks, the owners of the machines that replace them gain ever more of the world’s income, while the workers’ share shrinks. This is exactly what Marx predicted for an economic system that “cannot exist without constantly revolutionizing the instruments of production, and thereby the relations of production, and with them the whole relations of society.”14
As patterns of mechanization that first appeared on factory floors infiltrate the office, Marx’s words ring true in the age of Big Data. The Economist assures us that “computers can already detect intruders in a closed-circuit camera picture more reliably than a human can. By comparing reams of financial or biometric data, they can often diagnose fraud or illness more accurately than any number of accountants or doctors.”15 Watson, the IBM supercomputer that defeated two Jeopardy! grand champions in 2011, is currently doing exactly that sort of diagnostic work at Memorial Sloan Kettering Cancer Center in New York.
The argument for automation is simple: machines can perform a small set of tasks more reliably because that is all they have to do. Humans must cope with distractions, such as sleeping and eating, which make us inferior at any routinized job. Until recently, we assumed that more cognitively challenging, less clearly defined tasks, such as diagnostics and driving, were beyond the abilities of a computer and thus safely within the purview of people. Ten years ago, economists felt confident that undertakings like negotiating a turn against traffic or deciphering scrawled handwriting couldn’t be done by machines. We now know that they can—and that this ability derives from Big Data.
To routinize a task, all that’s required is an accurate algorithm. This has been the case since the Industrial Revolution, when the personalized efforts of artisans were reduced to smaller, highly specialized sequences requiring less skill but more workers. Today a job is “routine” if it can be translated into code that machines can read. Cognitive tasks are harder to automate: an enormous amount of information is needed to specify the many contingencies a technology must manage if it is to adequately substitute for human labor. The success of Google Translate, for example, would be hard to quantify without an (ideally) unlimited corpus of human-translated digitized texts to test its algorithms against. It’s logical, then, that the company collects “the world’s information” and devises new products and services that encourage the inputting of more information, voluntarily or not. Every piece improves an algorithm—confirming what it’s doing right or red-flagging what it’s doing wrong—and together the data allow its programmers to computerize more and more of what we do.
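As a cartoon of that feedback loop, consider a spelling corrector whose entire skill is a table of word counts: the more text it has seen, the better its guesses, and every accepted suggestion adjusts the table. The seed counts and the typo below are invented; this sketches the principle, not Google’s systems.

```python
# A trivial corrector: generate candidate words one edit away from the typo,
# keep the candidates we have actually seen, and pick the most frequent one.
# User feedback then updates the counts, so the next guess is a little better.
from collections import Counter

counts = Counter({"translate": 120, "translation": 80, "translated": 40})  # invented seed data

def edits1(word):
    """All strings one deletion, replacement, or insertion away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(word):
    """Return the most frequently seen word within one edit, or the word itself."""
    candidates = [w for w in edits1(word) if w in counts] or [word]
    return max(candidates, key=lambda w: counts[w])

print(correct("translte"))        # "translate"
counts["translate"] += 1          # the user accepted the suggestion: reinforce it
print(correct("xlate"))           # too far from anything seen, so it comes back unchanged
```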
Having already dominated manufacturing, machines are on the march. A recent study by researchers at Oxford argues that 47 percent of jobs in the United States could be automated in the next two decades.16 Depending on your background, it may make you feel better to know that the authors offered this qualifier: “Generalist occupations requiring knowledge of human heuristics, and specialist occupations involving the development of novel ideas and artifacts, are the least susceptible to computerisation.”17 Human heuristics? Novel ideas? One might conclude that the urban designer, with his or her peculiar mix of creative and social intelligence, is safe from the encroaching algorithms. I sort of doubt it.
On the one hand, there is the low-hanging fruit: any architectural studio employs a massive amount of highly un-cognitive labor to make models, drawings, diagrams, and the like. This work doesn’t seem to me any less computerizable than that of paralegals and telemarketers, two occupations that will apparently soon require no human involvement other than management. On the other hand, there is the often highly uncreative nature of creativity itself. Accepting the universal tendency of architecture offices to repeat themselves as a product of exigencies and market expectations, we should probably also admit that even seemingly original designs are usually just unfamiliar combinations of established ideas. Assemble a superhuman store of references, give it the right parameters, and, Big Data advocates would argue, creativity can be computerized too.
Google is already blazing a trail in this direction. A few years ago, one of its top executives ordered staff to test click-response to 41 gradations of blue in order to determine the design of a toolbar. Google’s top designer ended up quitting over the approach. In an open resignation letter with recognizable John Henry undertones, he wrote, “When a company is filled with engineers, it turns to engineering to solve problems. Reduce each decision to a simple logic problem. […] All that data eventually becomes a crutch for every decision, paralyzing the company and preventing it from making any daring design decisions.”18
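For a sense of how such a test reduces a design decision to arithmetic, here is a toy version: show each visitor one shade, count the clicks, ship the winner. The shade labels and click counts are invented, and only four of the forty-one candidates are shown.

```python
# Pick the toolbar blue with the highest click-through rate. Each entry is
# clicks observed per 10,000 impressions for one candidate shade (invented numbers).
clicks_per_10k = {
    "blue_01": 1040,
    "blue_17": 1175,
    "blue_23": 1122,
    "blue_41": 998,
}

winner = max(clicks_per_10k, key=clicks_per_10k.get)
print(f"Ship {winner}: {clicks_per_10k[winner]} clicks per 10,000 impressions")
```

The triviality is precisely the designer’s complaint: the decision becomes a lookup for the largest number in a table.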
Which brings me back, in a possibly illogical way, to Larry Page and his CBO’s “only eight”-city tour. For the interlocutor interested in understanding the state of the world, or maybe hearing some travel horror stories, Arora’s statement seems like the start of a great conversation. Not for Page. This is a person who can cut through the humanistic bullshit and go straight to what is empirically true. His is a pure and enormously monetizable mentality, and it will define our times. But it is nonetheless a mentality shaped by subjective experience and some very particular assumptions about the point of human life. For those of us who assess our efforts through different metrics, the age of Big Data offers good and bad: it possesses the potential to liberate us from drudgery and provide new forms of inspiration and ammunition in defense of our ideas. But if we’re not mindful, data collectors, aggregators, and analysts will dictate an ever-wider range of our output, remaking some of our professions in the process. I can’t prove that, but it feels true.
Brendan McGetrick is a writer, editor, and designer; he is currently the Director of Strelka Knowledge at Strelka Institute, Moscow, and curator of the Russian Pavilion at the 2014 Venice Architecture Biennale.