preface

This work-in-progress will ultimately serve as a textbook for introductory undergraduate courses in data sciences. No prior knowledge of computer programming is presumed, though, ideally, students will have had college algebra (or its equivalent) and an introductory course in statistics, methods, or data analysis.

Data science is still a new field of study, and there are multiple approaches to teaching it and to its place in the college curriculum. This book is intended to serve courses such as the Introduction to Data Science at the Wilkes Honors College of Florida Atlantic University which, in turn, has drawn from data science classes at the universities of North Carolina, British Columbia, Duke, Maryland, Wisconsin, Stanford, BYU, Harvard, and UC Berkeley At each of these schools, the Introduction to Data Science appears, to my eyes at least, closer to Statistics than to Computer Science.

But if our approach is closer to statistics than to programming, it is particularly close to statistics in its most applied and pragmatic form. The choice of statistical methods should follow from the data and problem at hand - that is, statistics should serve the needs of the user rather than dictate them (Loevinger 1957). This pragmatic focus is driving the growth of data science in industry, and it is reflected in the way data science has been taught at still other schools including Chicago, Georgia Tech, UC Santa Barbara, Princeton, UC Berkeley, at Berlin’s Hertie School of Governance, and in Columbia’s School of Journalism.

some features of the text

There are a number of different approaches to teaching data science. The present text includes several distinguishing features.

R

In a recent informal survey of introductory data science courses, I saw a pretty even split between those which begin with Python and those which begin with the statistical programming language R. This difference corresponds, very loosely, to the split noted above: Computer science based approaches to data science are frequently grounded in Python, while statistics-based approaches are generally grounded in R. Our course, like those for most of the syllabi and courses linked above, will be based in R.

Reproducible science

The course will provide an introduction to some of the methods and tools of reproducible science. We will consider the replication crisis in the natural and social sciences, and then consider three distinct approaches which serve as partial solutions to the crisis. The first of these is training in a notebook-based approach to writing analyses, reports and projects (using R markdown documents). The second is using public repositories (such as the Open Science Framework and, to a limited extent, GitHub) to provide snapshots of projects over time. Finally, the third is to consider the place of significance testing in the age of Big Data, and to provide training in the use of descriptive, exploratory techniques of data analysis.

Good visualizations

Part of Type C data science is communication, and this includes not just writing up results, but also designing data displays that incisively convey the key ideas or features in a flood of data. We’ll examine and develop data visualizations such as plots, networks and text clouds. More advanced topics may include maps, interactive displays, and animations.

All Some of the data

It’s been claimed that in the last fifteen years, humans have produced more than 60 times as much information as existed in the entire previous history of humankind. (It sounds like hyperbole, but even if it’s off by an order of magnitude it’s still amazing). There are plenty of data sources for us to examine, and we’ll consider existing datasets from disciplines ranging from literature to economics to public health, with sizes ranging from a few dozen to millions of data points. We will also clean and create new datasets.

All A few of the latest tools

One feature of Data Science is that it is changing rapidly. The tools, methods, data sources, and ethical concerns that face us in 2021 are different from those which shaped the field just one or two years ago.

In fields that are undergoing rapid change, there is some trade-off between building expertise with existing (older) tools and trying the newer approaches. Partly because I want to equip you with skills which will not be obsolete, partly because some of these new approaches promise more accessibility, elegance, and/or power, and partly because of my own interest in staying current, we’ll be using some of the latest packages and programs.

In the last few years, I’ve shifted the class from the standard R dialect (as I learned it from the Johns Hopkins-Coursera Data Science Specialization) to the Tidyverse. This year, for the first time, much of our work with the R language will take place online, using RStudio cloud, for this should help overcome some hiccups that arise when students are using different machines, operating systems, etc., and should facilitate collaboration among us as well. We’ll also be experimenting for the first time with the learnr package, and emphasizing other approaches to learning R, including Swirl, less.

In the past, I’ve used the Slack platform for messaging, communication and collaboration; this year, I’ve somewhat reluctantly moved to the Canvas LMS. And, in the past, I’ve recommended using dedicated markdown editors such as Typora. While I still think that these are great for some text-editing and note-taking, we’ll do our work instead with the editor in the latest variant of RStudio on our laptops, as this allows WYSIWIG (what you see is what you get) formatting of documents - such as this one - that are intended as “publication-ready” texts.

We’ll continue to use spreadsheets such as Excel or Google Sheets as well.

the book is for you

It’s my intention that this text should serve every college student, regardless of concentration or college major. The skills and insights that you will gain in this course will help you in graduate and professional schools, will help you in your careers, and will help you in your goal of making a better world. And it will help you train the next generation of data scientists as well.

References

Loevinger, Jane. 1957. “Objective Tests as Instruments of Psychological Theory.” Psychological Reports 3 (3): 635–94. https://doi.org/b27jpk.