Fishing in a Sea of Numbers

Online data is a partial reflection of the way humans act and think. Human behavior is too complex to be measured only by Facebook statuses and “Like-buttons.”

The rise of “big data” and computational methods that can draw out patterns in large datasets has generated a good deal of attention. Some have compared the potential impact of “big data” to the invention of the microscope: a new technology that allows us to see things that were not visible to the naked eye. While the comparison has some merit, it also misses a crucial point, especially with regard to Internet data.

The notable transformation isn’t that we have any old big datasets but that more and more of our social, commercial, and civic lives take place on the Internet, which makes it easy to record so much. However, examining online imprints is not like looking through a microscope. Quite the opposite: this kind of data is often more akin to footprints made by people walking in the sand. What we have with “big data” is a telescope that lets us sit on the moon and look down at all those footprints at once. Understanding this is key to understanding both the promise and the peril of big data analytics.

Big data is indeed powerful for probing certain questions, and that power itself raises social and ethical questions. On the other hand, this data is not as powerful or objective as people hope it is for answering other questions. There is a real danger that treating it as a broadly accurate and complete representation of reality in all instances will create a whole host of other thorny social issues. The current buzzword-laden hype and the sudden gush of money toward this kind of research threaten our ability to discuss the ethics of the power of big data as well as the ethics of expecting it to do more than it can and, more crucially, to differentiate the former from the latter.

An Ambiguous Button

To begin with, there is no such thing as objective data. All data bears the marks of how it was created and collected. Facebook data? It’s generated in a social network that is used by a particular subset of the world’s population in a semi-rigid format that rewards only certain kinds of behavior. The algorithms and norms that govern that platform shape the site’s data. The same goes for every other Internet platform and application.

Online imprints are not perfect or complete mirrors of the world. They are just that—imprints of our behavior through a particular medium. It’s true that the Internet made visible many things that used to happen without being captured so extensively. However, just looking at online data can be a bit like looking only at the shadows in Plato’s cave—online imprints do not represent the full and fickle complexity of our species. I just witnessed thousands of people clicking the “Like-button” on Facebook for an announcement by a teenager with terminal cancer writing that her cancer had spread and that she had run out of options. But it surely doesn’t mean that they “like” this development. It can mean many things: “Look at this idiotic post,” “I wish that hadn’t happened to you,” “I hate this” and, yes, sometimes, “I like this.” Data aggregators and machine semantic analysis would have a hard time telling the difference.

Studying human behavior has all sorts of intricacies that don’t occur when researching inanimate objects, gases in a chamber, or the spread of infectious diseases. Humans are fickle and change their behavior. Remember the spectacular excitement about how Google would be able to identify flu trends by looking at people searching for flu-related terms such as “aches,” “chills,” and “fever”? It turns out that the Google predictions greatly overestimated the actual incidence of flu this year. Why? The spectacular excitement about Google Flu Trends caused more people to “google” the flu, which caused more spikes for Google Flu Trends, thereby causing more news stories and hence more spikes. That is a telling example because it is one in which we know what the actual trend was, so we can tell that our big data model failed. For many current applications of big data, all we have is the “big data” result. How will we know when we are wrong and by how much? If the map becomes detailed enough, will we forget that “the map is not the territory”?

Still, big data can also be powerful and provide unique insights. Recently, researchers started teasing out drug interactions by combing through Google searches that pair multiple drugs with their respective side effects. This is a positive, powerful use of big data, as it is impossible to test every potential drug interaction. But if we know what people are experiencing, we can start working backward and deducing from there.

Greetings from Orwell and Huxley

Researchers have also found that analyzing a person’s Facebook likes in the aggregate can give us a pretty good estimate of that person’s “sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender.” Note that we’d be acquiring this information by “modeling” these attributes rather than asking the person about their sexual orientation or asking them to take a personality test. In other words, we would know this information about them, but they would not know that we know.

Such asymmetries raise many questions about the ethical use of such data. Imagine what a political campaign can do with such information. Each message can be uniquely tailored to appeal to a person’s specific personality, weaknesses, fears, and dreams. While each citizen remains under the impression that it’s a match made in heaven, it would be a match made by big data analytics. This is not just science fiction. Big data analytics played a big role in the 2012 US presidential election, allowing the Obama campaign to run better-targeted voter turnout operations. While big data cannot make up for bad candidates or unappealing arguments, it can help you win elections through optimized voter turnout, which can deemphasize the role of policy debates, especially in close elections. Rather than convincing the public with a broad argument, a campaign can concentrate on finding potential supporters and getting them to the polls.

Twentieth-century literature was dominated by two separate dystopias. George Orwell envisioned that governments would use their surveillance power to limit their citizens’ power. Aldous Huxley feared being drugged into a stupor of pleasure. What if the true dystopia is massive surveillance in service of seducing us into acceptance and passivity?
