17 loops, functions, and beyond

In one of the most important contemporary theoretical models of intelligence, (Sternberg 1999) has argued that the ability to automatize, that is, to work efficiently on repeated or habitual tasks, is a key component of intelligent behavior. Solving habitual problems efficiently - whether it is making a cup of coffee, finding the shortest path to complete a shopping list in a supermarket or a series of errands across town - allows us to focus our limited resources on other challenging tasks that, in turn, may determine whether we survive, or at least prosper.

In programming, loops and functions are essential tools for making repetitive tasks simple. Simplifying your code is one of the more intellectually satisfying aspects of working in R or in any programming language.

In R, loops are supplemented by additional tools for simplifying and avoiding repetition in code, including the ‘apply’ family and the map function in purrr. Functions (and, beyond this, custom libraries) can further streamline your work.

17.1 loops

Consider the task of printing out a series of numbers. Here’s a simple example of how this could be done in a loop in R. It prints numbers between 1995 and 1998 (inclusive).

for (i in 1995:1998) { # i is an index code
    print(i) # print the ith value in the sequence 
} # go to the next one until the range is complete
## [1] 1995
## [1] 1996
## [1] 1997
## [1] 1998

Let’s expand on this a little, connecting back to the babynames data. This will print counts of the number of babies for each year in GenerationZ, which includes years from 1995-2015.

genZ <- (1995:2015)
# this reduces the (big) babynames to a simple file of years and counts
babyCounts <-
    babynames %>% 
    group_by(year) %>% 
    filter(year %in% genZ) %>% 
    summarize(nbabies = sum(n))
    
# and this uses a for loop to print each row in turn
for (i in (1:nrow(babyCounts))) { 
    # filename[i,j] == ith row and jth column
    print(paste((babyCounts[i,1]), babyCounts[i,2]))
}
## [1] "1995 3661351"
## [1] "1996 3646362"
## [1] "1997 3624799"
## [1] "1998 3677107"
## [1] "1999 3692537"
## [1] "2000 3778079"
## [1] "2001 3741451"
## [1] "2002 3736042"
## [1] "2003 3799971"
## [1] "2004 3818361"
## [1] "2005 3842004"
## [1] "2006 3952932"
## [1] "2007 3994007"
## [1] "2008 3926358"
## [1] "2009 3815638"
## [1] "2010 3690700"
## [1] "2011 3651914"
## [1] "2012 3650462"
## [1] "2013 3637310"
## [1] "2014 3696311"
## [1] "2015 3688687"

The syntax of loops, including where to put parentheses in index statements, can be tricky. Expect to refer to Google and StackExchange often in order to get your code running.

Another good source is Chapter 21 of R4DS. There, Wickham goes in to additional detail, including a description of seq_along(df) as a tool for creating an index corresponding to 1:ncol(df). Here’s an example with the diamonds dataset:

diamonds2 <- diamonds %>% 
    select_if(., is.numeric)
# for (i in (1:ncol(diamonds2))) {
for (i in seq_along(diamonds2)) {
    print(paste(names(diamonds2[i]),
                round(mean(diamonds2[[i]]),2)))
}
## [1] "carat 0.8"
## [1] "depth 61.75"
## [1] "table 57.46"
## [1] "price 3932.8"
## [1] "x 5.73"
## [1] "y 5.73"
## [1] "z 3.54"

Chapter 21 of R4DS also considers some extensions to related problems such as loops of indefinite length, which can often be addressed using the ‘while’ command.

17.2 from loop to apply to purrr::map

Understanding for loops is fundamental in programming, but in R they should often be sidestepped. If the order of iteration isn’t important (if, for example, it doesn’t matter which of the diamonds variables we take the mean of first), then using one of the measures from the apply family can generally be used to make your code simpler and more efficient.

The logic is that one takes a dataframe (df or tibble), then applies a function to its rows or columns:

# apply = apply the function (mean) to the columns(2) of the df 
diamonds2  %>% apply(2,mean) 
##        carat        depth        table        price            x            y 
##    0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
##            z 
##    3.5387338
# sapply = simply apply - guesses that you are looking for col. means
diamonds2  %>% sapply(mean) 
##        carat        depth        table        price            x            y 
##    0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
##            z 
##    3.5387338

The many variants of the apply family, including lapply [list apply] and tapply [table apply] as well as sapply, each have their own uses and can be quite efficient but, again, can be syntactically challenging.

In the evolving tidyverse, the map family of commands is supplementing if not supplanting apply; these commands (part of the purrr package in the core tidyverse) may prove to be more convenient and clear. For example, the map_df function will apply a function and return a dataframe (tibble), which can be handy for further analysis.

diamonds2 %>% 
  map_df(mean)
## # A tibble: 1 x 7
##   carat depth table price     x     y     z
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.798  61.7  57.5 3933.  5.73  5.73  3.54

17.3 some examples of functions

If you repeat a series of lines of code several times in your program, it is often best to wrap this into a function.

The first example is from a preregistered study I recently started of language and politics. For the preregistration, I ran analyses using simulated data, both to increase the likelihood that the code will run without error on real data and to help anticipate the analyses which are to be run on ‘real’ data. I began by getting a real body of text from the net and scrambling it, then constructing fake ‘Republican’ and ‘Democratic’ texts from this. Here, I illustrate this by constructing 50 sample documents, each consisting of between 5 and 20 words.

The project is given in four steps:

17.3.1 preliminaries

Here is the preliminary stuff, where I pull the data off the net and initialize the variables

sampledata <- read_csv("data/sentiment-words-DFE-785960.csv")#url(
#     "https://www.crowdflower.com/wp-content/uploads/2016/03/sentiment-words-DFE-785960.csv"))
#"https://raw.githubusercontent.com/totalgood/hope/master/data/corpora/sentiment-words-DFE-785960.csv"))
  # pulls off four words
sampledata <- sampledata[22:25] %>% na.omit()
ndocs <- 50
minDocLength <-  5
maxDocLength <- 20
doc <- vector(mode = "character", length = ndocs)

17.3.2 the function

Here’s the simple function which pulls a random word out of the matrix of sampledata.

set.seed(33458) # a random seed is used to allow reproducible results
getword <- function() {
      rowid <- sample(1:nrow(sampledata), 1)
      colid <- sample(1:ncol(sampledata), 1)
      sampledata[rowid,colid]
}

17.3.3 applying the function

The function is applied, first, to extract one word, then, in successive loops, to build up one phrase and then many.

# combine words into docs
# establish length of first phrase
docLength <- sample(minDocLength:maxDocLength,1)
# initialize with one word
sampleCorpus <- getword()
# loop to build up first phrase
for (i in 1:docLength) {
      addWord <- getword()
      sampleCorpus <- paste(sampleCorpus, addWord)
}
#add additional simulated documents
for (j in 2:ndocs) {
      docLength <- sample(minDocLength:maxDocLength,1)
      newdoc <- getword()
      for (i in 1:docLength - 1) {
            addWord <- getword()
            newdoc <- paste(newdoc, addWord)
      }
      sampleCorpus <- rbind(newdoc,sampleCorpus)
}

Finally, the results are combined with a vector of alternating labels of ‘Dem’ and ‘Rep’:

row.names(sampleCorpus) <- NULL
evenOdd <- rep(c("Dem","Rep"),length.out = nrow(sampleCorpus))
workingCorpus <- as_tibble(cbind(evenOdd,sampleCorpus))
head(workingCorpus,5)
## # A tibble: 5 x 2
##   evenOdd V2                                                                    
##   <chr>   <chr>                                                                 
## 1 Dem     wild at heart nobody loves me lost art miss the sun need a doctor kin~
## 2 Rep     great player awfully nice back hurts not paying pretty rubbish found ~
## 3 Dem     yahooo make no promises horrible daughter ew crazy evening fit        
## 4 Rep     can't sleep lazy afternoon sucks #fedup great shame positive break fa~
## 5 Dem     problems nice funniest hurt not necessarily sad nose eat no problem d~

17.4 how many bottles of what?

To put the fun back into function, here’s a solution to the “99 bottles of beer” function described in r4DS 21.2.1. Study the code. Ask or answer a question about it in class or on Slack.

beerSong <- function(liquid = "beer", count = 99, surface = "wall") {
    songtext <- "" 
    for (i in (count:1)) { 
        thisLine = (paste0(i, " bottles of ", liquid, 
            " on the ", surface,
            ", you take one down and pass it around,\n"))
        songtext = c(songtext, thisLine)
    }
    songtext = c(songtext, 
                 (paste0("no more bottles of ",
                         liquid," on the ", surface,"...")))
    cat(songtext) # cat prints without line numbers
}
#beerSong() 

And here are solutions proposed by some of your classmates over the last few years. How do they differ from each other, and from the solution given above?

beersng <- function(n) {
  if (n == 1) {
    cat("\n",n," bottle of beer on the wall, ",n,
        " bottles of beer.\nTake one down and pass it around,",
        " no more bottles of beer on the wall.\n", sep = "")
    cat("\nNo more bottles of beer on the wall, 
        no more bottles of beer.\n",
        "Go to the store and buy some more, 
        99 bottles of beer on the wall.\n", sep="")
  } else {
    cat("\n",n," bottles of beer on the wall, ",n,
        " bottles of beer.\nTake one down and pass it around, ",
        n-1, " more bottles of beer on the wall.\n", sep="")
    return(beersng(n-1))
  }
}
#beersng(99)
moreBeer <- function () {
  for (i in 0:100){
  starting_number <- 100
  if (starting_number - i == 0) {
    print("No more bottles of beer on the wall, 
          no more bottles of beer, 
          Go to the store and buy some more, 
          99 bottles of beer on the wall.")
    break
  }
  print(paste(starting_number - i,
              "bottles of beer on the wall",
              "Take one down and pass it around,",
              starting_number - i - 1,
              "bottles of beer on the wall."))
  }
}
# moreBeer()
song <- function(bottlesofbeer){
  for(i in bottlesofbeer:1){ 
     cat(bottlesofbeer," bottles of beer on the wall \n",
        bottlesofbeer," bottles of beer \nTake one down, 
        pass it around \n",
        bottlesofbeer-1, " bottles of beer on the wall \n"," \n")       
        bottlesofbeer = bottlesofbeer - 1 
  }
}
#song(99)

Which code is ‘best?’ Good code is clear, but it is also efficient. We probably shouldn’t expect much in the way of differences between these functions in terms of speed (as each includes 99 iterations of a simple print command), but here’s a simple way to compare. Note that I’ve used the “sink” command to write my output to files rather than consoles:

start_time <- Sys.time()
sink (file = "f1.txt")
beerSong("beer",99)
sink()
end_time <- Sys.time()
beerSongtime <- end_time - start_time

start_time <- Sys.time()
sink (file = "f2.txt")
beersng(99)
sink()
end_time <- Sys.time()
beersngtime <- end_time - start_time

start_time <- Sys.time()
sink (file = "f3.txt")
song(99)
sink()
end_time <- Sys.time()
songtime <- end_time - start_time

start_time <- Sys.time()
sink (file = "f4.txt")
moreBeer()
sink()
end_time <- Sys.time()
moreBeertime <- end_time - start_time

Here is the first line of output from each function, together with table showing the elapsed times:

readLines("f1.txt",1)
## character(0)
readLines("f2.txt",2)
## character(0)
readLines("f3.txt",4)
## character(0)
readLines("f4.txt",1)
## character(0)
tibble(functionName = c("beerSongtime",
           "beersngtime",
           "songtime",
           "moreBeertime"), time =
             c(beerSongtime,beersngtime,
           songtime,
           moreBeertime)) %>% 
          kable(digits = 2)
functionName time
beerSongtime 0.02 secs
beersngtime 0.02 secs
songtime 0.02 secs
moreBeertime 0.03 secs

References

Sternberg, Robert J. 1999. “The Theory of Successful Intelligence.” Review of General Psychology 3 (4): 292–316. https://doi.org/cqrkxh.