preface

This work-in-progress will ultimately serve as a textbook for introductory undergraduate courses in data science, and is particularly aimed at students in the liberal arts. No prior knowledge of computer programming is presumed, though, ideally, students will have had college algebra (or its equivalent) and an introductory course in statistics, methods, or data analysis.

the role of the liberal arts in data science

Data science is still a relatively new field of study, and there are multiple approaches to teaching it and to placing it in the college curriculum. This book is intended to serve courses such as Introduction to Data Science at the Wilkes Honors College of Florida Atlantic University which, in turn, was initially based on data science classes at North Carolina, British Columbia, Duke, Maryland, Wisconsin, Stanford, BYU, Harvard, and UC Berkeley. At each of these schools, the introductory data science course is, to my eyes at least, closer to Statistics than to Computer Science.

Statistics is itself a broad field, and our approach is aligned with its most applied and pragmatic form. From this perspective, the choice of statistical methods should follow from the data and problem at hand; in other words, statistics should serve the needs of the user rather than dictate them [@loevinger1957].

Pragmatism, in turn, can serve various goals, ranging from maximizing the revenues generated by an online ad to minimizing the carbon footprint of a travel itinerary. Data science for the liberal arts may be seen as a fusion of the pragmatism of data science with social and humanistic concerns; we stand beside programs in Computational Social Science as it has been taught at schools including Chicago, Georgia Tech, UC Santa Barbara, Princeton, and UC Berkeley, at Berlin’s Hertie School of Governance, and in Columbia’s School of Journalism.

Data science for the liberal arts begins with the person and society rather than with the algorithm and network. In its concern with the liberal arts, it is intended to provide a modest counterbalance to the inherently centripetal, or inequality-accelerating, force of modern information technology.¹

some features of the text

There are a number of different approaches to teaching data science for the liberal arts. The present text includes several distinguishing features.

In a recent informal survey of introductory data science courses, I saw a pretty even split between those which begin with Python and those which begin with the statistical programming language R. This difference corresponds, very loosely, to the split noted above: computer-science-based approaches to data science are frequently grounded in Python, while statistics-based approaches are generally grounded in R. Our course, like most of the syllabi and courses linked above, will be based in R.

Reproducible science

The course will provide an introduction to some of the methods and tools of reproducible science. We will consider the replication crisis in the natural and social sciences, and then consider three distinct approaches which serve as partial solutions to the crisis. The first is training in a notebook-based approach to writing analyses, reports, and projects (using R Markdown documents). The second is using public repositories (such as the Open Science Framework and, to a limited extent, GitHub) to provide snapshots of projects over time. The third is reconsidering the place of significance testing in the age of Big Data, together with training in the use of descriptive, exploratory techniques of data analysis.

Good visualizations

As I note in the first chapter, communication is a distinguishing concern of data science for the liberal arts. “Communication” includes not just writing up results, but also designing data displays that incisively convey the key ideas or features in a flood of data. We’ll examine and develop data visualizations such as plots, networks and text clouds. More advanced topics may include maps, interactive displays, and animations.

A little data

There are plenty of data sources for us to examine, and we’ll consider existing datasets from disciplines ranging from literature to economics to public health, with sizes ranging from a few dozen to millions of data points. We will also clean and create new datasets.

A few tools

One feature of Data Science is that it is changing rapidly. The tools, methods, data sources, and ethical concerns that face us in 2025 are different from those which shaped the field just one or two years ago.

In fields that are undergoing rapid change, there is some trade-off between building expertise with existing (older) tools and trying the newer approaches. Partly because I want to equip you with skills which will not be obsolete, partly because some of these new approaches promise more accessibility, elegance, and/or power, and partly because of my own interest in staying current, we’ll be using some of the latest packages and programs.

In the last few years, I’ve shifted the class from the standard R dialect (as I learned it from the Johns Hopkins-Coursera Data Science Specialization) to the Tidyverse, a dialect of R that I find to be relatively clear and concise. A few years ago, I shifted our primary platform from individual laptops to a cloud-based R platform; while this approach has its advantages, I found that these did not outweigh the costs of the approach, so we will go back to the standalone method.

We’ll explore different approaches to learning R syntax, including the learnr package, swirl, and DataCamp.

In the past, I’ve recommended using dedicated markdown editors such as Typora and Obsidian. While I still think that these are worth considering for some text-editing and note-taking applications, we’ll do our work instead with the editor in the latest variant of RStudio on our laptops, as this allows WYSIWYG (what you see is what you get) formatting of documents (such as this one) that are intended as “publication-ready” texts.

We’ll also use, and explore the advantages and disadvantages of, spreadsheets such as Excel or Google Sheets.

the book is for you

It’s my intention that this text should serve every college student, regardless of concentration or college major. The skills and insights that you will gain in this course will help you in graduate and professional schools, will help you in your careers, and will help you in your goal of making a better world. They will also help you train the next generation of data scientists.

¹ I hope to return to this in a later chapter, but in the meantime consider the discussion of the “Matthew Effect” in sociology and network science [@watts2004].