13 Finding, exploring, and cleaning data

Data science, oddly enough, begins not with R… but with data. There is no shortage of datasets available to analyze, and each can give rise to a host of interesting analyses and insights.

What do you want to study?

13.1 Data in R libraries

A few weeks back, we explored five different R packages (libraries) for looking at COVID data. There are a number of other R packages which provide ‘easy’ access to data - for example, consider the babynames package. Install it on your machine, load the library, and View it.
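For example, a minimal sketch (the install line only needs to be run once):

# install.packages("babynames")   # run once
library(babynames)
View(babynames)    # opens the spreadsheet-style data viewer in RStudio
head(babynames)    # or just peek at the first few rows in the console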

Exercise 13.1 Come up with an interesting question about the babynames dataset. How would you go about examining it?

Libraries in R with multiple datasets include the built-in R datasets library, which contains, at this writing, 101 different datasets. Another is the R dslabs library, with 42 additional datasets. And the R fivethirtyeight library provides access to 198 clever, clean, and largely manageable datasets, each of which underlies the empirical analyses and reports of Nate Silver and his team (you can learn more at https://data.fivethirtyeight.com/).
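If you want to browse what any of these packages contain, the data() function will list a package's datasets (this assumes the package is installed):

data(package = "datasets")          # the built-in datasets
data(package = "dslabs")            # datasets in the dslabs package
data(package = "fivethirtyeight")   # datasets in the fivethirtyeight package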

Less tidy, somewhat more ambitious, and more far-ranging datasets include these nineteen datasets and sources, ranging from Census data to Yelp reviews.

13.2 Other prepared datasets

The datasets in the libraries described above should be relatively easy to work with, requiring minimal munging prior to analysis. If you are up for something a little more ambitious, read on.

Kaggle is a noun (a community, a website, a challenge), and a verb (to kaggle is to participate in a data challenge) which describes a crowdsourced competition to improve on a problem in prediction. Perhaps the first and best-known example of this was the Netflix prize (Jackson 2017), which, in 2006, promised one million dollars to the first team to improve the algorithm by which that company recommended movies to its customer base. The competition took several years, and inspired substantial improvements in machine learning as well as in crowdsourced science. At this writing, Kaggle hosts many active competitions - including a $1,500,000 award offered by the United States Department of Homeland Security known as the “passenger screening algorithm challenge.” (Good luck!) Kaggle also hosts hundreds if not thousands of datasets. A good place to start is with their datasets stored in comma separated value format (.csv); you can find them here. Kaggling is an important feature of data science culture.

If you are into psychology and behavioral science, the Open Science Framework (OSF) provides a system for hosting and sharing code and data from research articles. One OSF page is a compilation of many datasets from prominent papers in psychology and psychiatry. Incidentally, almost all of the data and code from papers I have published are on the OSF as well. Beyond OSF, there is some structured personality test data available, too.

Outside of psychology, repositories of data from many disciplines may be found at Re3data (https://www.re3data.org/).

There are many datasets about music - songs, artists, lyrics, etc. - at millionsongdataset. Note that many of these files are quite large, but there are “smaller” subsets of about 10,000 songs available.

GitHub is the primary site for coders to share and improve upon their work. Git is a system in which one can upload (push) one’s work from a local computer to the cloud in a repository (repo), share this with collaborators who copy (fork) the repo, pull it down to their computers, and possibly make changes which will appear as a separate branch of the repo. Each change is time-stamped, and efficiently stored as only its difference from the prior edit (or commit). There are, in all of these pushes and pulls, opportunities for collisions and problems, but learning Git remains a critical part of the data scientist’s toolkit. You can set up an account on GitHub if you like, but even without this you can access some of the datasets that are stored there, including a set of curated datasets on topics such as economics, demographics, air quality, flights, and house prices. Perhaps the easiest way to access these is to click through repos until you find a data directory, open the files up as ‘raw’ files, and paste them into a spreadsheet or notepad program of your choice. GitHub also hosts the ‘awesome public datasets’ list (many of which probably are). You can work with R repositories straight from RStudio.
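Alternatively, a ‘raw’ GitHub file can be read directly into R. Here is a minimal sketch; the address below is only a placeholder for whichever file you find:

library(readr)
# replace this placeholder with the 'raw' URL of the file you found on GitHub
url <- "https://raw.githubusercontent.com/SOME-USER/SOME-REPO/main/data/example.csv"
dat <- read_csv(url)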

Or just Google datasets.

13.2.1 Keep it manageable

Proceed with caution - many of these datasets are likely to be quite large (for example, analyses of images) and/or in formats that for now are too challenging (such as JSON). I encourage you to stick with data that are available in a .csv format and that don’t have more than, say, a million data points (e.g., 50,000 observations * 20 variables). And probably avoid, for the time being, datasets consisting primarily of natural language samples or networks - for those who are interested, we will look at these in the Computational Social Science class in the Fall.
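A quick size check before you commit to a dataset might look like this (your_data here is just a stand-in for whatever data frame you are considering):

dim(your_data)                        # number of rows and columns
nrow(your_data) * ncol(your_data)     # total number of data points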

13.3 Make/extract/combine your own data

Despite the petabytes (exabytes? zettabytes? yottabytes?) of data in the datasets described above, it’s possible that the dataset that you want to examine does not yet exist. But you may be able to create it, for example, by scraping data from the Web. Typically, you would use an Application Programming Interface (API) to pull data down from platforms such as Twitter or Reddit. For these and other major social media and news platforms, there are R packages which will walk you through the process of getting the data from webpage to tidy dataset. (Be aware, though, that the methods for data access on these platforms frequently change, so that code that worked a year ago might not work today.)
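As a very rough sketch - hedged because every platform’s API differs and most require registration and a key - the jsonlite package can pull a JSON response into R; the endpoint below is made up:

library(jsonlite)
# hypothetical endpoint; real platforms typically require authentication
raw <- fromJSON("https://api.example.com/v1/posts?limit=100")
str(raw, max.level = 1)   # inspect the structure before trying to tidy it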

Another source of data is … your own life. If you wear a pedometer or sleep tracker, are a calorie counter or keep other logs as a member of the quantified self movement, consider how such data might relate to aspects of the physical environment (such as temperature, or the time between sunrise and sunset) and/or the broader social and cultural context (a measure, perhaps of the sentiment, or mood, of news articles from papers like the NY Times).

Finally, you might want to combine multiple datasets, such as county-level home pricing data from Zillow (https://www.zillow.com/research/data/), county-level elections data from, for example, here: https://github.com/tonmcg/US_County_Level_Election_Results_08-16, and the boundaries of Woodard’s 11 American Nations (see Lanning). In joining different datasets, or data from different sources, we can go beyond a pedagogical exercise (learning about learning) and contribute new and meaningful knowledge.

13.4 Exploring data

To look at the datasets described in the prior sections, remember that there are a few key commands, here applied to the gapminder dataset:

library(gapminder)   # provides the gapminder data frame

str(gapminder)       # structure: variable names, types, and a preview of values
head(gapminder)      # the first six rows
summary(gapminder)   # descriptive statistics for each variable

To simplify your data, you’ll want to select certain columns or rows, or possibly create new variables based on existing scores:

library(dplyr)

gapminder %>% 
    filter(gdpPercap < 100 & year > 2000)      # keep only rows meeting both conditions
gapminder %>%
    select(-(c(continent, pop)))               # drop the continent and pop columns
gapminder %>% 
    mutate(size = ifelse(pop < 10e06, "small", "large"))   # new variable: pop under 10 million is "small"
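These verbs can also be chained into a single pipeline and the result saved as a new data frame, for example:

gapminder_recent <- gapminder %>% 
    filter(year > 2000) %>% 
    select(country, year, pop, gdpPercap) %>% 
    mutate(size = ifelse(pop < 10e06, "small", "large"))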

13.5 Messy data: cleaning and curation


Between 50 and 80% of the work of the data scientist consists of compiling, cleaning, and curating data - what is often called data munging or wrangling.

One part of data wrangling is looking for and dealing with encoding inconsistencies, missing values, and errors. Consider the following:

Exercise 13.2

Run the following code in an R Markdown document. You’ll need to add a library beforehand.

car2019 <- tibble("model" = c("Corolla", "Prius", "Camry", "Avalon"), "price" = c(22.5, "about 25K", 24762, "33000-34000"))

Inspect the data frame. Add and annotate code to fix any problems that you believe exist. Summarize the results.

Another part of data wrangling is about data rectangling (Bryan 2017), that is, getting diverse types of data into a data frame (specifically, a tibble). This is likely to be particularly challenging when you are combining data from different sources, including Web APIs. We’ll consider this further down the road when we talk about lists.
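As a small, hedged illustration of rectangling (the nested list here is made up, standing in for what an API might return), tidyr::unnest_wider() can spread a list-column into ordinary columns:

library(tidyr)
library(tibble)
# a toy nested list standing in for an API response
records <- list(
    list(model = "Corolla", price = 22500),
    list(model = "Prius",   price = 25000)
)
tibble(rec = records) %>% 
    unnest_wider(rec)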

A third part of data wrangling occurs when we join data from different sources. There are many ways to do this, but attention must be paid to ensure that the observations line up correctly, that the same metrics are used for different datasets (for example, inflation-adjusted dollars vs. raw), that dates are interpreted as dates, that missing values are recognized as missing and not scored as zero, and so forth. We’ll talk about this in the weeks ahead, particularly when we consider relational data.
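A minimal sketch of such a join, with made-up values and a shared county identifier (fips) as the key:

library(dplyr)
library(tibble)
# toy county-level tables; the values are invented for illustration
homes <- tibble(fips = c("12099", "12011"), median_price = c(415000, 380000))
votes <- tibble(fips = c("12099", "12011"), dem_share    = c(0.56, 0.62))
combined <- left_join(homes, votes, by = "fips")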


References

Bryan, Jennifer. 2017. “Excuse Me, Do You Have a Moment to Talk about Version Control?” Preprint. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3159v2.
Jackson, Dan. 2017. “The Netflix Prize: How a $1 Million Contest Changed Binge-Watching Forever.” Thrillist.com.