15 strings, factors, dates, and times

This chapter discusses some of the types of data other than numeric and logical, in particular strings, factors, and dates/times.

In this chapter, as in the last few, I refer primarily to three of the chapters in R4DS. Consider these notes supplementary.

15.1 strings

Strings are sequences of characters, which may include “123” as well as “why *DID* the chicken cross the road?” Samples of text, from names to novels, are the most interesting type of string.

Among the tools used in examining texts are searches (do these tweets include language associated with hate speech?), validity checks (does the string correspond to a valid zip code?), and reformatting (to lower case, so that BOB, Bob, and bob are all coded as identical). These ideas are simple, but they quickly become challenging when, for example, the strings in which we are interested include characters that R usually interprets as code, such as quotes and backslashes. See the section on string basics (14.2) for how to “escape” these characters, that is, how to have them treated as ordinary characters rather than as syntax. Patterns for searching and matching text are written as regular expressions (regex, sometimes regexp). Regex are not unique to R; they are shared with many other languages.

In R, particularly in the tidyverse, regex typically appear as arguments to functions in the stringr package, most of which begin with str_. For example, str_detect returns a set of logical values:

library(tidyverse)  # loads stringr and the %>% pipe
donuts <- c("glazed", "cakes", "Pink sprinkled",
            "cream filled",
            "day-old frosted", "chocolates")
donuts %>% 
    str_detect(" ")

## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE
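The validity checks and reformatting mentioned earlier follow the same pattern. What follows is only a rough sketch: the zips and people vectors are invented for illustration, and "exactly five digits" is a simplification of real zip codes. The ^ and $ markers anchor the pattern to the start and end of the string.

```r
library(stringr)

# hypothetical inputs, invented for illustration
zips <- c("90210", "1234", "12345-6789", "ABCDE")

# TRUE only when the whole string is exactly five digits
valid_zip <- str_detect(zips, "^[0-9]{5}$")
valid_zip

# reformat to lower case so that BOB, Bob, and bob are coded identically
people <- c("BOB", "Bob", "bob")
people_lower <- str_to_lower(people)
unique(people_lower)
```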

Most of the str_ functions are straightforward, but remember that str_sub provides a subset, not a substitution; to change a string, use str_replace. As Hadley points out in R4DS, the autocomplete feature in RStudio is very handy for helping you explore the different functions: in your console, type str_ … then scroll through the possibilities.

donuts %>% str_sub(1,5)

## [1] "glaze" "cakes" "Pink " "cream" "day-o" "choco"

donuts %>% str_replace(" ","_")

## [1] "glazed"          "cakes"           "Pink_sprinkled"  "cream_filled"   
## [5] "day-old_frosted" "chocolates"
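One detail is easy to miss in the output above: str_replace changes only the first match in each string. When a string may contain several matches, str_replace_all is the companion function. A small sketch, using a made-up string with more than one space:

```r
library(stringr)

# hypothetical string containing several spaces
flavor <- "day old frosted donut"

# only the first space is replaced
first_only <- str_replace(flavor, " ", "_")
first_only

# every space is replaced
every_one <- str_replace_all(flavor, " ", "_")
every_one
```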

As you work with texts, simple problems sometimes require sophisticated code. The regex that are used to solve these problems quickly become dense and challenging.

One tool that can help you is the str_view command, which returns (when used in RStudio) highlighted text showing the matching passages. For example:

dumbSlashMovieTitles <- c("Craslash",
                          "Revenge o' \'McSlashy'",
                          "Star Wars: Episode \\slash")
dumbSlashMovieTitles %>% str_view("sh")

Note the effects of the backslash: For example, the Star Wars movie has two backslashes in the input, but only one in the output, as the backslash is the escape character that tells R to take the next character literally (as in the ‘Revenge’ movie).
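If you want to check what a string actually contains, as opposed to how you typed it, base R's writeLines and nchar can help. This is just a small sketch using one of the titles from above; it does not use stringr at all.

```r
title <- "Star Wars: Episode \\slash"

# print() shows the string in its escaped form, as you would type it
print(title)

# writeLines() shows the characters that are actually stored
writeLines(title)

# the two typed characters "\\" are stored as a single backslash
nchar("\\")
```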

So. How would you use str_view to find strings which include the backslash character in our dumbSlashMovieTitles? [Experiment. Google. Ask your classmates.]

The density of regex statements can make them challenging. The backslash case is an unusual one, but other special characters are more common, and can be useful, including (for example) indicators for the beginning (^) or end ($) of a string:

# get rid of suspected plural 
donuts %>% str_replace("s$", "")

## [1] "glazed"          "cake"            "Pink sprinkled"  "cream filled"   
## [5] "day-old frosted" "chocolate"
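The same anchors work with the other str_ functions. As a quick sketch (the donuts vector is repeated so that the block runs on its own), str_subset keeps the strings that match, and str_detect flags them:

```r
library(stringr)

donuts <- c("glazed", "cakes", "Pink sprinkled",
            "cream filled",
            "day-old frosted", "chocolates")

# keep only the donuts whose names begin with a lower-case c
starts_c <- str_subset(donuts, "^c")
starts_c

# flag the donuts whose names end in "ed"
ends_ed <- str_detect(donuts, "ed$")
ends_ed
```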

A particularly useful function in stringr is str_split, which can be used to quickly break a text into discrete words. Note that using a space and the explicit “word” boundary give different results.

str_split(donuts, " ", simplify = TRUE)

##      [,1]         [,2]       
## [1,] "glazed"     ""         
## [2,] "cakes"      ""         
## [3,] "Pink"       "sprinkled"
## [4,] "cream"      "filled"   
## [5,] "day-old"    "frosted"  
## [6,] "chocolates" ""

str_split(donuts, boundary("word"), simplify = TRUE)

##      [,1]         [,2]        [,3]     
## [1,] "glazed"     ""          ""       
## [2,] "cakes"      ""          ""       
## [3,] "Pink"       "sprinkled" ""       
## [4,] "cream"      "filled"    ""       
## [5,] "day"        "old"       "frosted"
## [6,] "chocolates" ""          ""

The output of str_split is generally a list (more on that soon), but here, because simplify = TRUE, the lists are simplified into character matrices. In the tidyverse, str_split is typically one of the first steps in preparing text. The tidytext package (https://www.tidytextmining.com/), which is discussed at length in the computational social science course, builds on this foundation and is a powerful set of tools for all sorts of problems in formal text analysis.
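To see the difference between the two kinds of output, compare the default result with the simplified one. This is a minimal sketch on a shortened, three-element version of the donuts vector:

```r
library(stringr)

# shortened version of the donuts vector, for illustration
some_donuts <- c("glazed", "Pink sprinkled", "day-old frosted")

# default: a list with one character vector per input string
words_list <- str_split(some_donuts, " ")
is.list(words_list)
lengths(words_list)      # number of pieces in each string

# simplify = TRUE: a character matrix, padded with "" where needed
words_matrix <- str_split(some_donuts, " ", simplify = TRUE)
is.matrix(words_matrix)
dim(words_matrix)
```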

15.2 factors

Conditions (experimental vs control), categories (male or female), types (scorpio, “hates astrology”) and other nominal measures are categorical variables, which R calls factors. The R package for dealing with this type of measure is forcats, one of the core parts of the tidyverse.

Here’s an example of a categorical variable. Why is it set up like this, and what does it do?

# Example of a factor
eyes <- factor(x = c("blue", "green",
                     "green", "zombieRed"), 
               levels = c("blue", "brown",
                          "green"))
eyes

## [1] blue  green green <NA> 
## Levels: blue brown green

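In versions of base R before 4.0.0, string variables (“donut,” “anti-Brexit,” and “yellow”) were converted to factors by default when data frames were created or read in; since R 4.0.0 they are left as strings. In the tidyverse, string variables have always been treated as strings until they are explicitly declared as factors.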

The syntax for working with factors-as-categories is given in Chapter 15 of R4DS. I will not duplicate that here, but I will point out that factors are represented internally in R as numbers, and converting (coercing) factors to other data types can occasionally lead to nasty surprises. Sections 15.4 and 15.5 describe how factors can be cleanly reordered and modified.
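One of those nasty surprises can be sketched in a few lines of base R. The scores vector here is invented for illustration: numbers that were stored as text and then turned into a factor.

```r
# hypothetical data: numbers stored as text, then made into a factor
scores <- factor(c("10", "5", "20"))

# levels are sorted as text, not as numbers
levels(scores)

# coercing directly returns the internal integer codes, not the values
codes <- as.numeric(scores)
codes

# going through character first recovers the original numbers
values <- as.numeric(as.character(scores))
values
```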

15.2.1 types of babies

In the babynames data, baby’s gender is a categorical variable, which is treated (because tidyverse) as a character or string. Here, we make it into a factor. We create two other factors as well.

library(tidyverse)
library(babynames)  # provides the babynames data
# adding third level for non-binary babies
sexlevels <- c("M", "F", "O")
babynames2 <- babynames %>% 
    mutate(sex = factor(sex,
               levels =  sexlevels)) %>% 
    mutate(beginVowel = case_when(
        substr(name,1,1) %in%
            c("A","E","I","O","U") ~ "Vowel",
        TRUE ~ "Consonant")) %>% 
    mutate(beginVowel = factor(beginVowel)) %>% 
    mutate (century = case_when(
        year < 1900 ~ "19th",
        year < 2000 ~ "20th",
        year > 1999 ~ "21st")) %>% 
    mutate(century = factor(century))

Use the syntax above to create types of names for different generations (Boomers, Gen X, Millennials, Gen Z). Use https://www.kasasa.com/articles/generations/gen-x-gen-y-gen-z to determine your groupings.

Say something interesting about the data - names, genders, etc. Plot this.

15.2.2 types of grown-ups

If you would instead like to examine survey data, the forcats package includes a set of categorical variables.

Using the discussion in Chapter 15 of R4DS as your guide, examine the relationship between two or more of these categorical variables. Again, plot these.

gss_cat

## # A tibble: 21,483 x 9
##     year marital         age race  rincome        partyid  relig  denom  tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>    <fct>  <fct>    <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,nea~ Prote~ South~      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str~ Prote~ Bapti~      NA
##  3  2000 Widowed          67 White Not applicable Indepen~ Prote~ No de~       2
##  4  2000 Never married    39 White Not applicable Ind,nea~ Ortho~ Not a~       4
##  5  2000 Divorced         25 White Not applicable Not str~ None   Not a~       1
##  6  2000 Married          25 White $20000 - 24999 Strong ~ Prote~ South~      NA
##  7  2000 Never married    36 White $25000 or more Not str~ Chris~ Not a~       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,nea~ Prote~ Luthe~      NA
##  9  2000 Married          44 White $25000 or more Not str~ Prote~ Other        0
## 10  2000 Married          47 White $25000 or more Strong ~ Prote~ South~       3
## # ... with 21,473 more rows
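Before relating two variables to each other, it can help to look at one on its own. The sketch below is only a starting point, not a solution to the exercise: it uses fct_infreq from forcats to reorder the levels of marital by how often they occur, then counts them.

```r
library(dplyr)
library(forcats)

# reorder the levels of marital from most to least frequent, then count
marital_counts <- gss_cat %>%
    mutate(marital = fct_infreq(marital)) %>%
    count(marital)
marital_counts
```

Because count follows the order of the factor levels, the most common category appears in the first row.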

15.3 dates

The challenges of combining time-demarcated data (Chapter 16) are significant. For dates, a variety of different formats (3-April, October 23, 1943, 10/12/92) must be made sense of. Sometimes we are concerned with durations (how many days, etc.); on other occasions, we are concerned with characteristics of particular dates (as in figuring out the day of the week on which you were born). And don’t forget about leap years.
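A few of these issues can be sketched with lubridate. The dates are taken from the examples above or chosen for illustration; note that the same string, 10/12/92, yields two different dates depending on which convention you assume.

```r
library(lubridate)

# the same string read as month/day/year and as day/month/year
as_mdy <- mdy("10/12/92")
as_dmy <- dmy("10/12/92")
as_mdy
as_dmy

# a duration: days from February 1 to March 1 in a leap year
feb_2020 <- as.numeric(ymd("2020-03-01") - ymd("2020-02-01"))
feb_2020

# leap years: 2020 is one, 1900 is not
leap_year(2020)
leap_year(1900)

# a characteristic of a date: October 23, 1943 fell on a Saturday
d <- ymd("1943-10-23")
wday(d, label = TRUE)
```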

In R, the lubridate package helps to handle dates and times smoothly. (Before tidyverse 2.0.0 it was a non-core part of the tidyverse, i.e., one that you had to load separately; since then it is loaded with the rest.) It anticipates many of the problems we might encounter in extracting date and time information from strings. Lubridate generally works well to simplify files with dates and times, and can be used to help in data munging. For example, in my analyses of the Corona data, dates and times were reported in four different ways. The code below decodes these transparently and combines them into a common date/time format.

2/3/20 6 PM

2/3/20 18:00

2/3/20 18:00:00

2020-02-03 18:00:00

# not run
coronaData2 <- coronaData %>% 
    mutate(FixedDate = 
               parse_date_time(`Last Update`,
                               c('mdy hp','mdy HM',
                                 'mdy HMS','ymd HMS')))
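To see what a common date/time format means in practice, here is a small sketch that parses three of the variants listed above (the 24-hour ones) one at a time. It uses lubridate's dedicated helper functions rather than the single parse_date_time call, so it is an alternative illustration, not the code used in the analysis.

```r
library(lubridate)

# each helper names the order of the components it expects
from_hm  <- mdy_hm("2/3/20 18:00")
from_hms <- mdy_hms("2/3/20 18:00:00")
from_iso <- ymd_hms("2020-02-03 18:00:00")

# all three describe the same instant (UTC unless a time zone is given)
from_hm
from_hm == from_hms
from_hms == from_iso
```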

15.4 times

Working with temporal data is often challenging. The existence of, for example, 12 versus 24 hour clocks, time zones, and daylight saving time can turn a simple question about duration into a hard one.
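Daylight saving time gives a concrete case. In the sketch below, the date is chosen because clocks in New York moved forward on March 14, 2021; adding "one day" as a period (days) and as a duration (ddays) then gives different answers.

```r
library(lubridate)

# noon on the day before clocks moved forward in New York
before <- ymd_hms("2021-03-13 12:00:00", tz = "America/New_York")

# a period of one day: same clock time on the next calendar day
plus_period <- before + days(1)
hour(plus_period)

# a duration of one day: exactly 24 hours (86400 seconds) later
plus_duration <- before + ddays(1)
hour(plus_duration)

# the period version covers only 23 elapsed hours
elapsed <- as.numeric(difftime(plus_period, before, units = "hours"))
elapsed
```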

Imagine that Fred was born in Singapore at the exact moment of Y2K. He now lives in NYC. How many hours has he been alive as of right now? How would you solve this?

# find timezones for Singapore and NYC
# a = get datetime for Y2K in Singapore in UTC
# b = get datetime for now in NYC in UTC
# compute difference and express in sensible metric
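
One possible way to fill in that outline with lubridate is sketched below. It rests on assumptions that are mine, not part of the problem: "the exact moment of Y2K" is taken to be midnight local time on January 1, 2000, and the time zone names are the standard ones for Singapore and New York. Try it yourself before reading on.

```r
library(lubridate)

# a = Fred's birth: midnight on 2000-01-01, Singapore time (assumption)
birth <- ymd_hms("2000-01-01 00:00:00", tz = "Asia/Singapore")

# the same instant expressed in UTC
birth_utc <- with_tz(birth, "UTC")
birth_utc

# b = the current moment, displayed in New York time
now_nyc <- now(tzone = "America/New_York")

# both objects are instants, so the difference does not depend on
# the time zone in which each happens to be displayed
hours_alive <- as.numeric(difftime(now_nyc, birth, units = "hours"))
hours_alive
```

Converting to UTC is useful for checking your reasoning, but it is not strictly needed for the subtraction.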