22 some ethical concerns for the data scientist
The Latin word scientia is commonly translated as both “science” and “knowledge.” Modern science or modern knowledge has led not only to dramatic improvements in public health and human life expectancy, but also to weapons of war that allow the slaughter of millions at a distance. The idea that science or knowledge is ethically fraught is not new: Adam and Eve are said to have been cast out of the Garden of Eden after tasting the forbidden fruit from the tree of knowledge (Shattuck 1997). And, as we have seen in our discussion of the “most dangerous equation,” the lack of knowledge can also be dangerous - perhaps particularly today, as we are in the grips of a pandemic with leaders who govern by intuition rather than science, by hunch rather than data (Howard Wainer 2007).
In this final chapter, I consider a few case studies which exemplify some of the ethical issues and concerns that frame data science. Note the word “frame” - this is little more than the beginning of a scaffolding. But I hope that it is sufficient to bolster the claim that ethical concerns are or should be at the foundation of data science, and that individuals with training in the liberal arts (and not just math and engineering) should play an important role.
22.1 ethics and personality harvesting
In personality psychology, concerns about how personality tests might invade privacy have been raised for over fifty years. Today, these concerns are newly relevant. One issue is measurement and experimentation without consent (Gleibs 2014). The most dramatic and arguably most consequential example occurred when Cambridge Analytica scraped the online activity of individuals whose friends had participated in a study of personality, then leveraged these data to assess the personalities of all or nearly all American adults. Cambridge Analytica, Facebook, and other data-rich firms such as Experian engage in practices such as “shadow profiling,” the nonconsensual profiling of individuals through their social connections (Garcia et al. 2018).
This profiling leads to the selective tailoring of messages (ads, political appeals), which, at the very least, attenuates freedom of choice. At a societal level, it can lead to the manipulation of consumer and political behavior, including voting decisions (Bond et al. 2012; Matz, Gladstone, and Stillwell 2016). Tailored messaging can not only harm our social fabric, deepening the partisan divides among us, but can also have consequences for social institutions.
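To make the idea of shadow profiling concrete, here is a toy sketch. All names, friendships, and scores below are invented, and the single-trait average it computes is far cruder than the models real firms use; the point is only that someone who never takes a personality test can still be profiled through the friends who did.

```python
# A toy illustration of shadow profiling (all names and scores are invented).
import statistics

# Self-reported extraversion scores (1-5) from people who opted in to a survey
observed_scores = {"alice": 4.2, "bob": 2.8, "carol": 3.9}

# Friendship ties scraped from a social platform; "dana" never consented
friendships = {"dana": ["alice", "bob", "carol"]}

def impute_score(person):
    """Estimate a non-participant's score as the mean of their friends' scores."""
    friend_scores = [observed_scores[f]
                     for f in friendships.get(person, [])
                     if f in observed_scores]
    return statistics.mean(friend_scores) if friend_scores else None

print(impute_score("dana"))  # roughly 3.6, though dana never took the test
```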
In a recent paper, several of us expressed our concern that the harvesting of personality information is likely to become more common in the years to come (Boyd, Pasca, and Lanning 2020). This argument derives from a series of claims, which are listed below:
- It takes very little data to identify someone. We can often discover the identity of an individual from a very small number of data points (Sweeney 2005).
- There is a great deal of data available. Credit card purchases, Uber trips, exercise routes tracked on platforms such as Strava, turnpike itineraries, and medical records are all increasingly digital, and so the potential for linking data grows.
- Combining data leads to new value. As we have seen this term, joining different datasets can produce meaningful value - new knowledge (the sketch following this list illustrates how such a join can also re-identify individuals).
- There are few constraints on data collection and synthesis. The collection of digital data is typically undertaken by private companies motivated by profit, which are often unconstrained by the ethical oversight that would be provided by, for example, human subjects review boards.
- Information is valuable to a range of parties - from potential employers to insurance companies to potential romantic partners. For these and other reasons, we anticipate that there will be a growing market for data about our personalities (Wu 2019).
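To make the first and third of these claims concrete, here is a minimal, hypothetical sketch in the spirit of Sweeney (2005): a handful of quasi-identifiers (ZIP code, birth date, sex) is enough to link a “de-identified” health record back to a name in a public roster such as a voter file. All values below are invented.

```python
# A hypothetical linkage attack: joining on quasi-identifiers re-identifies
# records that contain no names at all. All values are invented.
import pandas as pd

# A "de-identified" health dataset: no names, but quasi-identifiers remain
health = pd.DataFrame({
    "zip":       ["33458", "33458", "33401"],
    "birthdate": ["1988-02-14", "1990-07-01", "1988-02-14"],
    "sex":       ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})

# A public roster (e.g., a voter file) that does include names
roster = pd.DataFrame({
    "name":      ["J. Smith", "R. Jones"],
    "zip":       ["33458", "33401"],
    "birthdate": ["1988-02-14", "1988-02-14"],
    "sex":       ["F", "F"],
})

# The join recovers names for the "anonymous" diagnoses
linked = health.merge(roster, on=["zip", "birthdate", "sex"])
print(linked[["name", "diagnosis"]])
```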
22.2 the law of unintended consequences
One of the recurring issues in digital ethics has been the so-called “law of unintended consequences” (Merton 1936). When we initiate a new policy, collect a new dataset, give permission to a social media application to share our information, or investigate our family tree using a service such as Ancestry.com or 23andMe.com, we do not know what will happen with the information that we share or create. For example, assume that Fred signs up to learn about his family and health on 23andMe. He finds that he has a health vulnerability with an already known, or even a soon-to-be-discovered, genetic predisposition. That information has consequences not just for him but, potentially, for his offspring. If that genetic information is shared, it takes little in the way of a dystopian imagination to consider that Fred's grandchildren might have to pay higher health insurance premiums, be prohibited from migrating to certain countries, or be seen as less desirable partners. These examples might seem extreme, but unintended consequences are, almost by definition, difficult to foresee. When the good folks at Netflix shared some of their movie-preference data and offered a million-dollar prize to anyone who could substantially improve on their recommendation engine, they did not intend to “out” individuals, including a closeted lesbian mom, whose identities could be recovered from the “anonymized” data (Narayanan and Shmatikov 2008).
22.3 your privacy is my concern
It is not paradoxical that I should be concerned with your (right to) privacy. I’m invested in your ability to choose what to reveal about yourself, when, and to whom, in part because I want you to be able to live in and contribute to the world. Even if I am not that concerned about having details of my own life revealed, I must be sympathetic to your concerns. The concern for privacy is not a fetish, but is critical if we are to live and work in a just and decent society.
22.4 who should hold the digital keys?
Let’s shift gears somewhat, and consider some of the issues surrounding autonomous vehicles.
In a data-dependent world, who should be the guardians of the code that connects us? To consider just one example, the computer systems in modern cars typically run millions of lines of code. As cars become increasingly autonomous, this complexity will only increase. (Incidentally, the Society of Automotive Engineers, or SAE, describes six levels of ‘auto autonomy,’ from 0, no automation, to 5, full automation. At this writing, the most sophisticated systems available to consumers, such as Tesla Autopilot, are at level 2. What lies ahead are cars that are self-driving on carefully selected, geo-fenced roads, and ultimately cars “which can operate on any road… a human driver could negotiate”). Our roads and highways will become an Internet of Vehicles (IOV), which will include not just connections between cars and an intelligent cloud ‘above us’ but also direct links among a distributed system of intelligent cars, stoplights, and road sensors in a fog ‘around us’ (Bonomi et al. 2012). Fog computing and the IOV promise to reduce travel times and increase both fuel efficiency and automotive safety.
Obviously, there are cybersecurity concerns. While the prospects for a chaotic, choreographed hack of hundreds of vehicles on the streets of Manhattan, such as that in the 2017 movie “The Fate of the Furious,” are remote at best (or worst), there have been examples of “white-hat hackers” who have successfully infiltrated (and thereby helped secure) car information systems.
As the IOV develops, there will be threats to privacy as well as to safety, and the security of the system will be paramount. Different car manufacturers are taking different approaches to developing secure information systems, with many using a closed or proprietary approach. But the scope of the problem is so large that there is a movement toward pooling resources and encouraging collaboration among industry partners, academics, and citizen scientists in the development of an open-source autonomous driving platform such as Apollo. Perhaps counterintuitively, there may be significant security advantages to using source code that is open to all (Clarke, Dorwin, and Nash 2009; FitzGerald, Levin, and Parziale 2016).
22.5 contact-tracing and COVID-19
On April 10, 2020, Google and Apple announced that they would collaborate on a contact-tracing system to try to slow the COVID-19 pandemic. Keeping in mind the ethical issues of (a) the law of unintended consequences, (b) prior failures to maintain anonymity, and (c) arguments about whether data are best secured by industry, government, or crowdsourcing, as well as the technical issues of (d) the potential methods for contact tracing and (e) the costs of the pandemic to public health and to world economies, to what extent should digital tracking be used to assess and limit the spread of the novel coronavirus?
You can learn more about the Google-Apple collaboration at their FAQ and this piece at TechCrunch. Consider some of the ethical issues raised in this essay at fivethirtyeight.com, and in this more recent piece at The Verge.
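For readers who want a sense of how contact tracing can work without a central registry of who met whom, here is a highly simplified, hypothetical sketch of the decentralized approach. The actual Exposure Notification system adds cryptographic key derivation, identifier rotation, and Bluetooth signal-strength thresholds that are omitted here; the code is meant only to convey the basic idea that matching happens on the phone itself.

```python
# A highly simplified sketch of decentralized exposure notification.
# Phones broadcast short-lived random tokens and remember tokens they hear;
# matching against tokens later shared by diagnosed users happens on-device.
import secrets

class Phone:
    def __init__(self):
        self.my_tokens = []        # tokens this phone has broadcast
        self.heard_tokens = set()  # tokens overheard from nearby phones

    def broadcast(self):
        token = secrets.token_hex(8)   # a fresh random identifier
        self.my_tokens.append(token)
        return token

    def listen(self, token):
        self.heard_tokens.add(token)

    def check_exposure(self, published_tokens):
        """Compare locally stored tokens against those shared by diagnosed users."""
        return bool(self.heard_tokens & set(published_tokens))

alice_phone, bob_phone = Phone(), Phone()
bob_phone.listen(alice_phone.broadcast())   # the two phones were near each other

# Later, Alice tests positive and consents to publishing her tokens
print(bob_phone.check_exposure(alice_phone.my_tokens))  # True: Bob is notified
```

Even in this toy version, the ethical questions above remain: who publishes the tokens, who can observe them, and what happens if the published list is linked with other data.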
22.6 the digital divide
Not all are benefiting equally from the birth of the digital age. Unfortunately, the “Matthew effect,” by which resources flow to those who already have the most and need them least, is a fundamental property of networked, complex systems. As we become more interconnected, the gap between rich and poor grows ever faster. In such a world, one path to material success is to find a “scalable” occupation - that is, one in which you can serve many people with little additional effort. But that means that fewer will be needed to “serve.”
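The claim that the Matthew effect is a property of networked systems can be illustrated with a short, hypothetical simulation of preferential attachment: when each new connection goes to an actor in proportion to the connections that actor already holds, early random luck compounds, and a few actors end up with most of the ties.

```python
# A minimal simulation of preferential attachment ("rich get richer").
import random

random.seed(1)                 # for reproducibility
connections = [1] * 5          # five actors start out exactly equal

for _ in range(1000):
    # each new connection favors whoever is already well connected
    winner = random.choices(range(5), weights=connections)[0]
    connections[winner] += 1

# Typically one or two actors now hold the lion's share of the connections
print(sorted(connections, reverse=True))
```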
So yes, spend your summer trying to become a social media influencer. Better still, commit to working to reduce the growing digital divide, and more broadly to addressing issues of equality and social justice in your own applications of data science. Google, which once had the mantra “don’t be evil” in its code of conduct, has since moved to a set of positive (do’s) as well as negative (don’ts) guidelines for its work in artificial intelligence. Information can be empowering (for those who have it).
22.7 still more case studies
As our world becomes an Internet of Things and we face ubiquitous observation, and as decisions are increasingly made for us by applications of machine learning and artificial intelligence, new questions will be raised about the social impact of our science. Some of these are illustrated in these six hypothetical case studies proposed by an interdisciplinary team at Princeton. They warrant your consideration.
22.8 some potential remedies
In the European Union, the General Data Protection Regulation (GDPR) has been in effect since 2018, and provides guidelines for protecting people and personal data in the digital age. In a related piece, Loukides, Mason, and Patil (2018) provided a briefer set of guidelines. They argued that it is not enough that people provide consent; there must also be clarity: That is, it’s not enough that people agree to share their data, they must also understand what they are agreeing to. Other issues that they highlight include the need for data security (as stolen data often include, for example, SSNs and passwords) and the protection of vulnerable populations such as children. Finally, they provide a checklist for those developing data products:
❏ Have we listed how this technology can be attacked or abused?
❏ Have we tested our training data to ensure it is fair and representative?
❏ Have we studied and understood possible sources of bias in our data?
❏ Does our team reflect diversity of opinions, backgrounds, and kinds of thought?
❏ What kind of user consent do we need to collect to use the data?
❏ Do we have a mechanism for gathering consent from users?
❏ Have we explained clearly what users are consenting to?
❏ Do we have a mechanism for redress if people are harmed by the results?
❏ Can we shut down this software in production if it is behaving badly?
❏ Have we tested for fairness with respect to different user groups?
❏ Have we tested for disparate error rates among different user groups?
❏ Do we test and monitor for model drift to ensure our software remains fair over time?
❏ Do we have a plan to protect and secure user data?