2 getting started


We begin with a brief self-assessment, asking you to reflect on your own knowledge of data science, including the necessary-but-not-sufficient areas of computer programming and statistics. We then move to a description of some rudimentary tools that we will be using.

2.1 are you already a programmer and statistician?

Regarding programming, you may know more than you think you do. Here’s a simple program - a set of instructions - for producing a cup of coffee:

add water to the kettle and turn it on

if it’s morning, put regular coffee in the French press, otherwise use decaf

if the water has boiled, add it to the French press, else keep waiting

if the coffee has steeped for four minutes, depress (smash) piston/plunger, else keep waiting

pour coffee into cup

enjoy
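The same instructions can be written in a programming language. What follows is a minimal Python sketch, not a real appliance controller: the function name `make_coffee`, its parameters, and their default values are all hypothetical, and the "keep waiting" steps are simulated by counting minutes rather than by watching a kettle.

```python
def make_coffee(hour, minutes_to_boil=3, steep_minutes=4):
    """Return the list of steps taken to make one cup of coffee.

    All names and default values here are illustrative assumptions.
    """
    steps = ["add water to the kettle and turn it on"]

    # if it's morning, use regular coffee, otherwise use decaf
    coffee = "regular" if hour < 12 else "decaf"
    steps.append(f"put {coffee} coffee in the French press")

    # keep waiting until the water has boiled
    waited = 0
    while waited < minutes_to_boil:
        waited += 1
    steps.append(f"add boiled water to the French press after {waited} minutes")

    # keep waiting until the coffee has steeped
    steeped = 0
    while steeped < steep_minutes:
        steeped += 1
    steps.append(f"press the plunger after {steeped} minutes")

    steps.append("pour coffee into cup")
    steps.append("enjoy")
    return steps


for step in make_coffee(hour=8):
    print(step)
```

Note that the sketch uses the same two building blocks as the recipe: a choice between alternatives (`if ... else`) and repetition until a condition is met (`while`).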

As a post-millennial student from a WEIRD (Western, Educated, Industrialized, Rich, and Democratic) culture (Henrich, Heine, and Norenzayan 2010), you’ve ‘programmed’ computers, too, if only to enter a password, open an app, and upload a photo on your cell phone.

Statistics is of fundamental importance, not just for understanding abstract trends, but for making decisions about everyday life. Consider the case of Susie, a college senior:

Exercise 2_1 Susie is applying to two med schools. At School A, 25% of students are accepted, and at School B, 25% are accepted as well. You are Susie. Are you going to get into at least one of these programs? What is the probability? Does your estimate depend upon any assumptions?

Questions such as these are important for us. If the combined probability is low, it will likely (another probability concept) make sense for Susie to spend time, money, and energy to apply to additional programs. If the probability is higher, it may not. But problems like this are hard - our estimates of probability are frequently poorly calibrated, and combining probability estimates is challenging (see, e.g., Tversky and Kahneman 1974; consider taking a course in Behavioral Economics or Thinking and Decision Making to learn more).
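One way to approach Exercise 2_1 is sketched below in Python. It rests on two assumptions that are not given in the exercise: that each school's overall acceptance rate applies to Susie personally, and that the two decisions are independent (which is doubtful, since a weak application would hurt her at both schools). Under those assumptions, the chance of at least one acceptance is one minus the chance of being rejected everywhere. The function name `p_at_least_one` is hypothetical.

```python
def p_at_least_one(acceptance_rates):
    """Probability of at least one acceptance, assuming independent decisions."""
    p_all_rejected = 1.0
    for p in acceptance_rates:
        # multiply together the chances of being rejected by each school
        p_all_rejected *= (1 - p)
    return 1 - p_all_rejected


print(p_at_least_one([0.25, 0.25]))  # 1 - 0.75 * 0.75 = 0.4375
```

Under the same assumptions, applying to four such schools would raise the probability to about 0.68, which illustrates why adding programs can be worthwhile.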

You may have worked with data in spreadsheets such as Excel or Google Sheets.

Exercise 2_2 Open the Google Sheet at http://bit.ly/dslaX2_1. Save a copy and edit it, entering the following in cell B7:

=SUM(B2:B6)

What is the result?

Now copy cell B7 to C7.

What happens? Is this the result you expected? Would another approach be more useful?

2.2 some best practices for spreadsheets

Spreadsheets are great tools - the first one, VisiCalc, was the “killer app” that helped usher in the personal computer revolution. But they have limitations as well. Best practices have been proposed for using spreadsheets in data science (Broman and Woo 2018): include only data (and not calculations) in spreadsheets; use what we will recognize as a ‘tidy’ format, in which data are in a simple rectangle (avoid combining cells and using multi-line headers); and save spreadsheets as simple text files, typically in comma-delimited (CSV) format. Typically, this means that each worksheet will be in a separate file, rather than in workbooks with multiple tabs.

There are good reasons for these recommendations. When we sort data in spreadsheets, we risk chaos: if only certain columns are sorted, the values in each row no longer belong together. When we manipulate data in spreadsheets, we typically will not have a record of what was (and wasn’t) changed, and this compromises the reproducibility of our work.
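By contrast, sorting in code moves whole rows together and leaves a written record of what was done. The sketch below uses Python's standard `csv` module on a small, made-up table; the column names and values are purely illustrative and do not come from any data set used in this chapter.

```python
import csv
import io

# Made-up CSV text standing in for a file saved from a spreadsheet.
raw = """name,score
Ana,82
Ben,95
Cy,71
"""

# Each row becomes a dictionary, so a name always travels with its score.
rows = list(csv.DictReader(io.StringIO(raw)))

# Sort whole rows by score (highest first); the original data are untouched.
ranked = sorted(rows, key=lambda row: int(row["score"]), reverse=True)

for row in ranked:
    print(row["name"], row["score"])
```

Because the raw data are never overwritten, rerunning the script reproduces the same result, and the script itself documents the change.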

The bottom line is that spreadsheets should generally be used to store data rather than to analyze it. But don’t be a slave to this - if you are making some quick calculations that are inconsequential and/or to be used on just one occasion, working in Excel or Google Sheets is often the way to go.

2.3 setting up your machine: some basic tools

Collaboration and communication are integral to data science. In the world beyond universities, one of the most widely used messaging and collaboration platforms is Slack. Slack is a commercial app, but it has a free tier. Slack is handy for group work of all forms, but most students in full-time programs will be expected to use a Learning Management System such as Canvas so that all of their classes are on the same platform.

Documents produced in word processors typically include invisible characters used for formatting, security, etc. These characters, when fed into a computer program, lead to unpredictable results. Markdown is a simple syntax for producing documents that are easily read by both humans and machines. Most of the documents you will produce and work with this term, including the present chapter, are and will be written in some version of Markdown.

You can find an introduction to Markdown syntax in Chapter 3 of Freeman and Ross (2017). I use Typora (currently free for both Windows and Mac), but there are many alternatives. Install this or another Markdown editor on your laptop and play with it.

Google Docs is free and is convenient for collaborative work. One other important feature of Google Docs is that it provides a framework for version control, a critical skill in information management. You can learn more about how to see and revert to prior versions of a project in Google Docs here. Version control can help you avoid the chaos and confusion of having a computer (or several computers) full of files that look like Cham’s (2012) comic:

Fig 2.1: Never call anything ‘final.doc.’

Version control is an important concept in data science. Collaboratively built programs and platforms, including most of the add-ons (libraries, packages) which make R so powerful, are open-source projects built by many individuals over time. For projects such as these, both the current build and its associated history are typically maintained on GitHub, a website for hosting code. When we contribute to these projects, we first use Git, a program that runs on our own Macs or Windows PCs, to make a local copy of the project hosted on GitHub, then upload our proposed changes. Keeping remote and local copies of files in sync can be challenging, however, and you will not be expected to use this technology in this class. But if you are curious, or want to learn more, an introduction to using Git and R together may be found here.

2.4 a modified 15-minute rule

You will run into problems, if not here, then elsewhere. An important determinant of your success will be the balance you maintain between persistence and help-seeking.

The 15-minute rule is one guideline for this balance. It has been cleverly summarized as “You must try, and then you must ask.” That is, if you get stuck, keep trying for 15 minutes, then reach out to others. I think that this rule is basically sound, particularly if it is applied with cognitive flexibility, social sensitivity, and reciprocity. So when you get stuck, make a note of the problem, then move to another part of your project (that’s the cognitive flexibility part): this allows your problem to percolate while you still make progress. When you ask others for help, ask in a way that shows an awareness of the demands on their time (social sensitivity): part of this means that you should explain your problem in as detailed a fashion as possible - in technical terms, with a “reprex” or reproducible example. Finally, you should be willing to give as well as receive help (reciprocity).

2.5 discussion: who deserves a good grade?

In an introductory class in data science, students invariably come to class with different backgrounds. Should this be taken into account in assigning grades? That is, would it be possible (and desirable) to assign grades in a class based not just on what students know at the end of the term, but also on how much they have learned?

A formal, statistical approach to this could use regression analysis. That is, one could predict final exam scores from pretest scores, and use the residuals - the extent to which students did better or worse than expected - as a contributor to final exam grades. Interestingly, this would give students an unusual and seemingly perverse incentive to do as poorly as possible on the ‘pretest.’ How could this be addressed?
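The residual idea can be made concrete with a short Python sketch. It fits an ordinary least-squares line predicting final scores from pretest scores and reports how far each student falls above or below that line. The function name `residuals` is hypothetical, and the scores are made up for five imaginary students; they are not results from any real class.

```python
def residuals(pretest, final):
    """Residuals from a least-squares line predicting final from pretest."""
    n = len(pretest)
    mean_x = sum(pretest) / n
    mean_y = sum(final) / n
    # slope = covariance(pretest, final) / variance(pretest)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pretest, final))
    sxx = sum((x - mean_x) ** 2 for x in pretest)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # residual = actual final score minus the score predicted from the pretest
    return [y - (intercept + slope * x) for x, y in zip(pretest, final)]


# Made-up scores for five hypothetical students.
pretest = [40, 55, 60, 75, 90]
final = [65, 70, 85, 80, 95]

for x, r in zip(pretest, residuals(pretest, final)):
    print(x, round(r, 1))
```

A positive residual marks a student who did better than their pretest predicted, and a negative one marks a student who did worse. The sketch also makes the incentive problem visible: lowering a pretest score changes that student's predicted final score, and so their residual.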

Another problem with this approach is that there may be ‘ceiling effects’ - students who are the strongest coming into the class can’t improve as much as those who have more room to grow. Again, how might this be addressed? Should it?

References

Broman, Karl W., and Kara H. Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.
Freeman, Michael, and Joel Ross. 2017. Technical Foundations of Informatics. University of Washington, INFO 201.
Henrich, Joseph, Steven J. Heine, and Ara Norenzayan. 2010. “The Weirdest People in the World?” Behavioral and Brain Sciences 33 (2-3): 61–83. https://doi.org/c9j35b.
Tversky, Amos, and Daniel Kahneman. 1974. “Judgment Under Uncertainty: Heuristics and Biases.” Science 185 (4157): 1124–31. https://doi.org/gwh.