preface
some features of the text
the book is for you
I Introduction
1 data science for the liberal arts
1.1 type C data science = data science for the liberal arts
1.2 the incompleteness of the data science Venn diagram
1.3 a dimension of depth
1.4 Google and the liberal arts
1.5 data science and TMI
1.6 discussion: what will you do with data science?
2 getting started
2.1 are you already a programmer and statistician?
2.2 some best practices for spreadsheets
2.3 setting up your machine: some basic tools
2.4 a modified 15-minute rule
2.5 discussion: who deserves a good grade?
3 welcome to R world
3.1 using RStudio Cloud
3.2 using RStudio on your laptop
4 R stands for …
4.1 a few characteristics of R
4.2 finding help
4.3 Wickham and R for Data Science
5 now draw the rest of the owl
5.1 Review the RStudio Cloud primers
5.2 Play with RStudio on your laptop
5.2.1 Create a new R Markdown document and knit it to a PDF or a Word doc
5.2.2 Play with and explore the movies data
5.3 Take a DataCamp class
5.4 Swirl (Swirlstats)
5.5 Read Peng’s text and/or watch the associated videos
5.6 Code along with a tidy webinar, or attend a virtual conference
5.7 Something else
II Towards data literacy
6 principles of data visualization
6.1 some opening thoughts
6.2 some early graphs
6.3 Tukey and EDA
6.4 approaches to graphs
6.5 Tufte: first principles
6.6 the politics of data visualization
6.6.1 poor design leads to an uninformed or misinformed world
6.6.2 poor design can be a tool to deceive
6.7 the psychology of data visualization
6.7.1 the power of animation
6.7.2 telling the truth when the truth is unclear
6.7.3 visualizing uncertainty
6.8 further reading and resources
7 visualization in R with ggplot
7.1 a picture > (words, numbers)?
7.2 Read Hadley’s ggplot2
7.3 exploring more data
8 examining local COVID data in R
8.1 tracking the Novel Coronavirus (from Feb 2020)
8.1.1 reading the data (Feb 2020)
8.1.2 cleaning (wrangling, munging) the data (Feb 2020)
8.1.3 eleven months later: the code still runs!
8.1.4 shall we graph it? (Feb 2021)
8.1.5 too bad
8.1.6 what now?
8.1.7 an initial plot (Feb 2020)
8.1.8 five weeks later (Mar 2020)
8.1.9 adding recovered cases (code from Feb, data through Mar 2020)
8.2 plotting confirmed cases (Feb-Mar 2020)
8.3 status (Feb 2021)
8.3.1 an assignment
8.4 how to create new knowledge
9 on probability and statistics
9.1 on probability
9.2 the rules of probability
9.2.1 keeping conditional probabilities straight
9.3 continuous probability distributions
9.4 the most dangerous equation
10 reproducibility and the replication crisis
10.1 answers to the reproducibility crisis
10.1.1 tweak or abandon NHST
10.1.2 keep a log of every step of every analysis in R Markdown or Jupyter notebooks
10.2 answers to the reproducibility crisis III: pre-registration
10.3 further readings
III Towards data proficiency
11 literate programming with R Markdown
11.1 scripts are files of code
11.1.1 some elements of coding style
11.2 projects are directories containing related scripts
11.3 R Markdown documents integrate rationale, script, and results
11.4 what to do when you are stuck
12 the tidyverse
12.1 some simple principles
13 finding, exploring, and cleaning data
13.1 data in R libraries
13.2 other prepared datasets
13.2.1 keep it manageable
13.3 make/extract/combine your own data
13.4 exploring data
13.5 messy data: cleaning and curation
14 transforming and joining data
14.1 from data on the web to data in R
14.2 working with geodata: a function to get US states from latitude/longitude
14.2.1 applying the function to the music data
14.3 drowning in the sea of songs (with apologies to Artist #ARIVOIM1187B990643)
14.3.1 combining the song titles with our US artists
14.3.2 exercises
14.4 review of munging tools
14.5 more about joining
14.5.1 more about munging
15 strings, factors, dates, and times
15.1 strings
15.2 factors
15.2.1 types of babies
15.2.2 types of grown-ups
15.3 dates
15.4 times
16 lists
17 loops, functions, and beyond
17.1 loops
17.2 from loop to apply to purrr::map
17.3 some examples of functions
17.3.1 preliminaries
17.3.2 the function
17.3.3 applying the function
17.4 how many bottles of what?
18 from correlation to regression
18.1 correlation
18.2 correlations based on small samples are unstable: a Monte Carlo demonstration
18.2.1 the regression line
18.2.2 warning: there are two regression lines
18.3 multiple regression
18.4 Swiss fertility data
18.5 marital affairs data
19 cross-validation
19.1 revisiting the affairs data
19.2 avoiding capitalizing on chance
19.2.1 splitting the data into training and test subsamples
19.3 an example of cross-validated linear regression
19.3.1 applying logistic regression analysis to the training data
20 prediction and classification
20.1 from regression to classification: selection of a threshold
20.1.1 applying the model to the test data
20.1.2 changing our decision threshold
20.1.3 more confusion
20.1.4 ROCs and AUC
20.2 another approach to classification: k-nearest neighbor
20.2.1 application: the affairs data
20.2.2 from one doppelganger to many
20.2.3 the Bayesian classifier
20.2.4 back to the affairs data
20.2.5 avoiding capitalization on chance (again)
20.2.6 the multinomial case
21 machine learning: chihuahuas vs muffins, and other distinctions and ideas
21.1 supervised versus unsupervised
21.2 prediction versus classification
21.3 understanding versus prediction
21.4 bias versus variability
21.4.1 resampling: beyond test, training, and validation samples
21.5 compensatory versus non-compensatory problems
21.6 a postscript: the tidymodels packages
22 some ethical concerns for the data scientist
22.1 ethics and personality harvesting
22.2 the law of unintended consequences
22.3 your privacy is my concern
22.4 who should hold the digital keys?
22.5 contact-tracing and COVID-19
22.6 the digital divide
22.7 still more case studies
22.8 some potential remedies
References
Data science for the liberal arts
Kevin Lanning
2021-09-12