14 transforming and joining data

Once you have your data, you will almost invariably need to transform it - to sort it, to select observations or variables from it, to create new variables, to partition it into groups, or to summarize it. In R, there is a general-purpose tool (ok, package) for this called dplyr (pronounced “d-plier”). The dplyr package is a core part of the tidyverse and hence is loaded automatically when you load the tidyverse ensemble of libraries. The versatility of dplyr is demonstrated in Chapter 5 of R4DS, which shows how to do many basic, and some not so basic, operations on your data. Read it closely if you haven’t already done so.

14.1 from data on the web to data in R

Let’s consider a dataset consisting of 10,000 songs (observations) and 35 measures (variables, including artist). The first step here (and for your class project and the data science challenge) will be to get the data into R. How do you do this?

Method 1: Download the file to your computer as a csv file. Some advantages of this include (a) it will allow the code to run even if the website changes or disappears, (b) in the likely event that you run the code several times, it is faster to read data from your own machine than from the cloud, and (c) if there are any anomalies in the data, you can look at them locally (e.g., in Notepad++). If you use this approach, make sure that you know where your data is on your computer. The easiest way to do this is to work with an R project. Within RStudio, create a new R project (or use an existing one). Move the csv file into this directory, so that data, code, and results will all be in the same place.

Method 2: Import the data from the cloud directly into R. You can do this either by specifying the musicURL separately (Method 2a) or by passing the URL directly to the read_csv statement (Method 2b). Here, too, I would encourage you to set up and work within a project.

library(tidyverse)  # read_csv() comes from readr, part of the tidyverse

### Method 1
# music1 <- read_csv("music.csv")
### Method 2a
musicURL <-
  "https://think.cs.vt.edu/corgis/datasets/csv/music/music.csv?forcedownload=1"
music1 <- read_csv(musicURL)
### Method 2b
# music1 <- read_csv(
#   "https://think.cs.vt.edu/corgis/datasets/csv/music/music.csv?forcedownload=1"
# )

What can you do with the data in its existing form? What questions do you wish you could ask about these songs, artists, and the places they come from?

14.2 working with geodata: a function to get US states from latitude/longitude

When I first ran this code in 2018, the variables included states and countries. The current dataset doesn’t have these, but it does have longitude and latitude. To recover states from these coordinates, we need reverse geocoding. A Google search turned up the following function. Note that it requires two additional libraries, ‘maps’ and ‘maptools.’

# source is https://github.com/abresler
library(maps)      # map(): state outlines
library(maptools)  # map2SpatialPolygons(); loading it also attaches sp,
                   # which provides SpatialPoints(), CRS(), and over()

# The single argument to this function, pointsDF, is a data.frame in which:
#   - column 1 contains the longitude in degrees (negative in the US)
#   - column 2 contains the latitude in degrees
latlong2state <- function(pointsDF) {
    # Prepare SpatialPolygons object with one SpatialPolygon
    # per state (plus DC, minus HI & AK)
    states <- map('state', fill=TRUE, col="transparent", plot=FALSE)
    IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
    states_sp <- map2SpatialPolygons(states, IDs=IDs,
                     proj4string=CRS("+proj=longlat +datum=WGS84"))
    # Convert pointsDF to a SpatialPoints object 
    pointsSP <- SpatialPoints(pointsDF, 
                    proj4string=CRS("+proj=longlat +datum=WGS84"))
    # Use 'over' to get _indices_ of the Polygons object containing each point 
    indices <- over(pointsSP, states_sp)
    # Return the state names of the Polygons object containing each point
    stateNames <- sapply(states_sp@polygons, function(x) x@ID)
    stateNames[indices]
}
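
An aside: maptools was retired from CRAN in 2023, so the function above may be hard to run on a current installation. Here is a minimal sketch of the same idea using the sf package instead. This is my own substitution, not part of the original code, so treat the details (in particular the ID column returned by st_as_sf) as assumptions to verify on your setup:

# a sketch, not the original method: point-in-polygon lookup via sf
library(sf)
library(maps)

latlong2state_sf <- function(pointsDF) {
    # convert the maps object to an sf data frame, one row per state
    states_sf <- st_as_sf(map("state", fill = TRUE, plot = FALSE))
    # convert the points; longitude first, WGS84 as in the sp version
    pointsSF <- st_as_sf(pointsDF, coords = c(1, 2), crs = 4326)
    # spatial join: attach the containing state (NA outside the US)
    st_join(pointsSF, states_sf)$ID
}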

14.2.1 applying the function to the music data

If you have run the chunk above, the latlong2state function will now exist in your environment. You still need to apply the function to your data.

We do this using three of the essential tools in the munger’s toolbox: select (to choose several columns from the data), mutate (to create a new variable), and finally drop_na (to drop cases with missing data - here, artists from outside of the US).

# create a temporary dataset with just longitude and latitude
# (in that order - latlong2state expects longitude in column 1)
tempLatLong <- music1 %>% 
  select(artist.longitude, 
         artist.latitude)
# base-R way: add the state names directly to music1
music1$state <- latlong2state(tempLatLong)
# tidyverse way: mutate adds the state column, then drop_na keeps US artists
USartists <- music1 %>% 
  mutate(state = latlong2state(tempLatLong)) %>% 
  drop_na(state)
rm(tempLatLong)

14.3 drowning in the sea of songs (with apologies to Artist #ARIVOIM1187B990643)

At this point, we have a list of songs and artists and some measures of “hotttness,” restricted to US artists and now including states as well as latitude and longitude. There is a field called song.id, but the data don’t include song titles.

(several hours later)

I found the titles on the ’net. They were part of a file “millionsongsubset,” which is an archive of many files, and which is altogether about 2 GB in size. Within this, there is a file that looks right - let’s try it.

The file is delimited by the characters “<SEP>,” but read_delim requires that data fields be separated by a single character (like a comma or a tab). The trick is to read the file with a single-character delimiter (“~”) that never appears in it, so that each line arrives as a single variable, which R assigns the default name “X1.” I then use mutate to replace each “<SEP>” by a tab, and separate to split X1 into its constituent variables - the first is a mysterious ID, the second looks like the song.id in the original file, the third is an artist name, and the fourth is a song title.

14.3.1 combining the song titles with our US artists

Finally, we can join this to the USartists data, as the rows of the two files share a common key (song.id). Joins are discussed in more detail below. Here, we use a right_join - implicitly, “maybeTitles,” which is at the head of the pipe, is on the left. The right_join will link the left file (maybeTitles) with the right (USartists) by their common key:

# again, several different ways to load the data: from a local copy
# maybeTitles <- read_delim("data/subset_unique_tracks.txt", "~",
#                           col_names = FALSE) %>% ...
# or from the web
maybeTitles <- read_delim("https://raw.githubusercontent.com/xbucchiotty/xpua/master/src/main/resources/AdditionalFiles/subset_unique_tracks.txt",
                          "~",
                          col_names = FALSE) %>% 
  # "~" never appears in the file, so each line is read as one column, X1
  mutate(X1 = str_replace_all(X1, "<SEP>", "\t")) %>% 
  separate(X1, sep = "\t", into = c("ID1",
                                    "song.id",
                                    "artistFromMaybeTitles",
                                    "SongTitle")) %>%
  right_join(USartists, by = "song.id")
## Rows: 10000 Columns: 1
## -- Column specification --------------------------------------------------------
## Delimiter: "~"
## chr (1): X1
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

So far, so good. Do things line up correctly? If the match is perfect, the two artist fields should be identical. We can use the aptly named logical function identical() to check this:

identical(maybeTitles$artistFromMaybeTitles,
          maybeTitles$artist.name)
## [1] FALSE

Whoops. The two variables aren’t the same - or are they? Run the following commands, which include two more of the essential munging tools - filter and arrange. Here, note that the filter function is used to keep the cases where the two variables differ.

maybeTitles %>% 
  filter(artistFromMaybeTitles != artist.name) %>% 
  select(artistFromMaybeTitles, artist.name) %>% 
  arrange(artistFromMaybeTitles) %>% 
  head()
## # A tibble: 6 x 2
##   artistFromMaybeTitles        artist.name                    
##   <chr>                        <chr>                          
## 1 Alvin Youngblood Hart        Alvin YoungbloodA Hart         
## 2 Andraé Crouch & Solid Gospel AndraA(c) Crouch & Solid Gospel
## 3 Angélica María               AngA(c)lica MarAa              
## 4 Fabián                       FabiA!n                        
## 5 Ill Niño                     Ill NiA+-o                     
## 6 Los Muñequitos De Matanzas   Los MuA+-equitos De Matanzas

14.3.2 exercises

  1. What happened with the artist names? How would you move forward with these data?

  2. Do something interesting with the data - use filter, arrange, and select in your piped code.

  3. Work with the babynames data. Use mutate to create an interesting new variable.

  4. Most importantly, make progress on your class project. You will be presenting this two weeks from today.

14.4 review of munging tools

The primary tools, at least in my experience, are these (a quick sketch follows the list):

  • If you want to pull out only certain columns, use select.

  • If you want to pull out only certain rows, use filter.

  • If you want to sort, you arrange.

  • If you want to create, you mutate.

  • If you want to combine data, you join.
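
Here is a quick sketch chaining the first four verbs on the music data. The hotttnesss column names are my guesses at the CORGIS naming pattern (check names(music1) before running); the spelling, with three t’s, is the dataset’s own.

# hypothetical pipeline: verify the column names against names(music1)
music1 %>% 
  select(artist.name, song.hotttnesss, artist.hotttnesss) %>%  # columns
  filter(artist.hotttnesss > 0.5) %>%                          # rows
  mutate(gap = song.hotttnesss - artist.hotttnesss) %>%        # new variable
  arrange(desc(gap))                                           # sort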

14.5 more about joining

There are many different approaches to joining datasets. We might want to retain only cases that are present in both datasets: A political consultant might want to target calls to people who were on both ‘a list of Democrats’ and ‘a list of people who voted in 2020.’ We might want to keep all members of one dataset while adding whatever information a second provides: An investigative journalist might start from ‘people who worked for Amazon in November 2020’ and attach their January 2021 records; rows with no January match identify those who left. We might even want to join two datasets so that people appear if they are on either of two lists - a private investigator might look for any individuals who rented cars from Hertz or Avis.

These three approaches - an inner join, a left join, and a full join - begin to illustrate the richness and challenges of combining datasets. Joins are at the heart of Structured Query Language (or SQL, which is pronounced ess-cue-ell or sequel). Joins of multiple databases can become very complex - think about links in a company between sales data, salary, hours worked, seniority, and a client base, or, better, links in a university between lists of students, classes, faculty, grades, and parking violations. Relational Database Management Systems (RDBMS) are linked networks of datasets such as these, and a rudimentary understanding of the ideas behind them is an essential part of proficiency in data science.
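
A minimal sketch of these joins, using two made-up lists (the names and columns here are hypothetical, invented just for illustration):

library(tidyverse)

# hypothetical data for the consultant example above
democrats <- tibble(name = c("Ann", "Ben", "Cal"),
                    district = c(1, 2, 3))
voted2020 <- tibble(name = c("Ben", "Cal", "Dee"),
                    method = c("mail", "in person", "mail"))

inner_join(democrats, voted2020, by = "name") # on both lists: the targets
left_join(democrats, voted2020, by = "name")  # all Democrats; NA if no vote
full_join(democrats, voted2020, by = "name")  # anyone on either list
anti_join(democrats, voted2020, by = "name")  # Democrats who did not vote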

There are two sources to which I will refer you for now. The first is chapter 13 of R4DS. The second is what Jenny Bryan describes as a cheatsheet, but which is, I think, more than that (https://stat545.com/bit001_dplyr-cheatsheet.html). It shows how the various “joins” of dplyr work in combining two tibbles (superheroes and publishers). If you want to get a feel for how combining data works, it is a great start.

14.5.1 more about munging

We’ve also used separate to split a column into several; the complement of this is unite.
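
For example, here is a small sketch of the two functions on a made-up column (the data are hypothetical):

library(tidyverse)

tracks <- tibble(track = c("Prince - Kiss", "Beck - Loser"))
# separate splits one column into two...
split_tracks <- tracks %>% 
  separate(track, sep = " - ", into = c("artist", "title"))
# ...and unite puts them back together
split_tracks %>% 
  unite("track", artist, title, sep = " - ")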

You may want to add new measures (columns) for all cases when there is no common key; here, you can use the bind_cols function, which relies on the rows of the two data frames being in the same order. If you instead want to add observations (rows) to an existing dataframe, you may bind_rows.
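
A sketch with made-up fragments (again, the data are hypothetical):

library(tidyverse)

old <- tibble(artist = c("A", "B"), title = c("s1", "s2"))

# bind_cols() adds a column; the rows must already line up
playcounts <- tibble(playcount = c(12, 7))
bind_cols(old, playcounts)

# bind_rows() adds observations; columns are matched by name
extra <- tibble(artist = "C", title = "s3")
bind_rows(old, extra)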

We haven’t yet considered the tools used to reshape data frames from short-and-wide to long-and-narrow, and vice-versa. In our music data, we might, for example, want to think about the data by artist (with possibly multiple songs on each line), or by song title (with covers by different artists possibly on each line).

More typically, consider repeated-measures data in which each score reflects a measure taken on one of ten subjects on one of three occasions - for example, scores on the Beck Depression Inventory (BDI) before, during, and after treatment. Here, we may wish to structure the data in a “long” format with thirty rows (one for each subject x occasion) and three columns (person, occasion, and BDI), or in a “wide” format with just ten rows (one for each subject) and four columns (person, BDI-before, etc.). To move from “long” to “wide,” use pivot_wider (which replaced spread), and to move from “wide” to “long,” use pivot_longer (which replaced gather). (The pivot functions were introduced in tidyr 1.0.0, in 2019.)
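
A sketch of the BDI example with made-up scores for three subjects (the real design described above would have ten):

library(tidyverse)

# hypothetical repeated-measures data in "long" form
bdi_long <- tibble(
  person   = rep(c("S1", "S2", "S3"), each = 3),
  occasion = rep(c("before", "during", "after"), times = 3),
  BDI      = c(28, 17, 9, 31, 22, 15, 25, 18, 11)
)

# long -> wide: one row per person, one column per occasion
bdi_wide <- bdi_long %>% 
  pivot_wider(names_from = occasion, values_from = BDI)

# wide -> long: back to one row per person x occasion
bdi_wide %>% 
  pivot_longer(cols = -person, names_to = "occasion", values_to = "BDI")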