9 on probability and statistics
Last week, we considered Anscombe (1973) and his quartet, and how visualizing data is valuable. This week, we move to a brief discussion of principles of statistics.
9.1 on probability
Discrete probability is used to understand the likelihood of categorical events. We can think of initial estimates of probability as subjective or personal. For some events (what is the probability this plane will crash?), an estimate of probability can be drawn from a base rate or relative frequency (e.g., p(this plane will crash) = (number of flights with crashes / number of flights)). For other events (what is the probability that a US President will resign or be impeached before completing his term of office?), it may be hard to arrive at a suitable base rate. Here, a number of subjective beliefs or principles may be combined to arrive at a subjective or personal probability. In a sense, all probability estimates begin with a personal belief such as this, in part because the choice of the most informative base rate is often not self-evident - in the plane crash example, perhaps we should consider rates for a narrower reference group, such as 'for this airline' (Lanning 1987).

The personal origins of probability estimates should become less important as we are exposed to data and revise our estimates in accordance with Bayes' theorem. But over the last 45 years, a substantial body of evidence has demonstrated that, under at least some circumstances, we don't revise our estimates of probability in this way. (This material is discussed in the Thinking and Decision Making/Behavioral Economics class.)
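As a minimal sketch of what revising an estimate 'in accordance with Bayes' theorem' looks like in R (all of the numbers below are invented for illustration):

```r
# A minimal, invented illustration of revising a probability with Bayes' theorem.
# prior: initial personal probability that some event A is true
# p_B_given_A / p_B_given_notA: how likely new evidence B is, with and without A
prior          <- 0.01
p_B_given_A    <- 0.90
p_B_given_notA <- 0.05

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_B       <- p_B_given_A * prior + p_B_given_notA * (1 - prior)
posterior <- p_B_given_A * prior / p_B
posterior   # about .15 -- the estimate rises substantially, but stays well below .90
```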
9.2 the rules of probability
Here’s an introduction to the principles of probability. These are presented, with examples and code, in this R markdown document at Harvard’s datasciencelabs repository:
I. For any event A, 0 <= P (A) <= 1
II. Let S be the sample space, or set of all possible outcomes. Then P(S) = 1, and P (not S) = 0.
III. If P (A and B) = 0, then P (A or B) = P (A) + P (B).
IV. P (A|B) = P (A and B)/ P (B)
Principle III applies for mutually exclusive events, such as A = you are here this morning, B = you are at the beach this morning. For mutually exclusive (disjoint) events, the probability of the union is the sum of the probabilities of the two events. This is called the addition rule for disjoint events.
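A minimal R sketch of the addition rule, using invented probabilities for the two disjoint events above:

```r
# Addition rule for disjoint events: P(A or B) = P(A) + P(B).
# Invented probabilities for two events that cannot co-occur.
p_here  <- 0.80   # P(you are here this morning)
p_beach <- 0.05   # P(you are at the beach this morning)

p_here + p_beach  # P(here or beach) = 0.85
```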
A different rule applies for events that are mutually independent, such as (A = I toss a coin and it lands on ‘Heads’) and (B = it will rain tomorrow). What we mean by independent is that our estimates of the probability of one don’t change based on the state of the other - your estimate of the likelihood of rain shouldn’t depend on my coin flip. Here, you multiply rather than add:
If P (A|B) = P (A), then P (A and B) = P (A) P (B).
In words - if the probability of A given B equals the probability of A, then the probability of both A and B equals the probability of A times the probability of B.
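Here is a short simulation sketch in R (with an invented probability of rain) showing that, for independent events, the observed joint frequency is approximately the product of the marginal probabilities:

```r
# Simulate two independent events: a fair coin landing heads (p = .5)
# and rain tomorrow (given an invented probability of .3).
set.seed(1)
n     <- 100000
heads <- rbinom(n, size = 1, prob = 0.5) == 1
rain  <- rbinom(n, size = 1, prob = 0.3) == 1

mean(heads & rain)   # observed joint frequency, close to...
0.5 * 0.3            # ...the product of the marginals, 0.15
```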
Ask yourself - are mutually exclusive events independent? Come up with your own examples.
This multiplication rule is handy for estimating the probability of an outcome that happens following a chain of independent events, such as the probability that the next eight times I toss a coin it will land on “tails” every time:
P (TTTTTTTT) = P (T) P (T) P (T) P (T) P (T) P (T) P (T) P (T) = .5^8 = 1/256.
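You can check this arithmetic, and compare it with a simulation, in a few lines of R:

```r
# Probability of eight tails in a row with a fair coin.
0.5^8        # 0.00390625
1/256        # the same value

# Simulation check: fraction of 8-flip sequences that are all tails.
set.seed(1)
flips <- replicate(100000, sample(c("H", "T"), size = 8, replace = TRUE))
mean(colSums(flips == "T") == 8)  # close to 1/256
```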
Many sets of events are neither disjoint nor independent, so we need more general ways of thinking about pairs of events. For most of us, Venn diagrams are useful to think about combining probabilities. The union or P (A U B) describes the probability that A, B, or both of these will occur. Here, you will use the general addition rule:
P (A or B) = P (A) + P (B) - P (A and B)
(the probability of A or B is the probability of A plus the probability of B minus the probability of both A and B).
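A small numeric sketch of the general addition rule, with invented probabilities:

```r
# General addition rule: P(A or B) = P(A) + P(B) - P(A and B).
# Invented example: A = a randomly chosen student takes statistics,
# B = that student takes programming, and some students take both.
p_A       <- 0.40
p_B       <- 0.30
p_A_and_B <- 0.10

p_A + p_B - p_A_and_B  # 0.60 -- adding p_A and p_B alone would double-count the overlap
```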
For the intersection or P (A ∩ B), we need to consider conditional probabilities. Think of the probability of two events sequentially: First, what’s the probability of A? Second, what’s the probability of B, given that A has occurred? Multiply these to get the likelihood of A and B:
P (A and B) = P (A) P (B|A).
Example: The probability of you and your roommate both getting mononucleosis equals the probability of your getting mono times the probability that your roommate gets it, given that you have it also.
This is the general multiplication rule. In this abstract example, the order is irrelevant. To estimate the likelihood of A and B, we could as easily take the probability of B and multiply it by the conditional probability of A given B:
P (A and B) = P (B) P (A|B).
Use the mono example again. What are A and B here? Does it still make sense? When might P (B|A) make more sense than P (A|B)?
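As one numeric sketch (the mono probabilities below are invented), the two orderings give the same joint probability:

```r
# General multiplication rule, both orderings, with invented probabilities.
# A = you get mono, B = your roommate gets mono.
p_A         <- 0.10   # P(you get mono)
p_B         <- 0.10   # P(roommate gets mono)
p_B_given_A <- 0.40   # contagion raises the conditional probability

p_A_and_B   <- p_A * p_B_given_A   # joint probability, 0.04
p_A_given_B <- p_A_and_B / p_B     # implied conditional the other way, 0.40

p_A * p_B_given_A   # 0.04
p_B * p_A_given_B   # 0.04 -- the same joint probability, in either order
```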
We are often interested in estimating conditional probabilities, in which case we’ll use the same equation, but solve instead for P (A|B). This leads us back to principle IV:
IV. P (A|B) = P (A and B)/ P (B)
What is the likelihood of getting in an accident (A), given that one is driving on I-95 (B)? How would you estimate this? If there are fewer accidents on Military Trail, does this mean that you would be safer there?
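One way to explore this is with invented counts of trips and accidents; the road names come from the example above, but the numbers are made up:

```r
# Estimating P(accident | road) from (invented) counts of trips and accidents.
trips_I95      <- 50000
accidents_I95  <- 25
trips_military <- 10000
accidents_mil  <- 10

accidents_I95 / trips_I95        # P(accident | I-95)          = 0.0005
accidents_mil / trips_military   # P(accident | Military Trail) = 0.0010
# Fewer accidents in total on Military Trail, but a higher accident rate per trip:
# raw counts and conditional probabilities can point in different directions.
```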
9.2.1 keeping conditional probabilities straight
In general, P (B|A) and P (A|B) are not equivalent. Moore’s (2000) Basic Practice of Statistics gives the example of
P (Police car | Crown Victoria) = .7, and P (Crown Vic | Police car) = .85.
Here, one could use a Venn diagram to model this asymmetry. There are several R packages for doing this, including Venn and VennDiagram. Consider exploring these, or at the very least make a rough sketch that can help you answer the following question:

> In general, if P (A|B) < P (B|A), what must be true of the relationship of P (A) to P (B)?
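As a hedged sketch, the VennDiagram package can draw the overlap directly; the counts below are invented so that they roughly reproduce Moore's two conditional probabilities:

```r
# install.packages("VennDiagram")  # if not already installed
library(VennDiagram)

# Invented counts chosen to roughly match the conditional probabilities:
# 850 / 1000 = .85 of police cars are Crown Victorias, and
# 850 / 1214 is about .70 of Crown Victorias being police cars.
grid.newpage()
draw.pairwise.venn(area1      = 1214,   # Crown Victorias
                   area2      = 1000,   # police cars
                   cross.area = 850,    # Crown Victorias that are police cars
                   category   = c("Crown Victoria", "Police car"))
```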
9.3 continuous probability distributions
We can also use probability with continuous variables such as systolic blood pressure (the first of the two numbers in a blood pressure reading), which has a mean of approximately 120 and a standard deviation of 15. Continuous probability distributions are handy tools for thinking about the meaning of scores, particularly when we express scores in standard deviations from the mean (z scores). More to the point, this way of thinking about probability is widely used in questions of scientific inference, as, for example, in testing hypotheses such as "the average systolic blood pressure among a group of people studying at Crux (hence caffeinated) will be significantly greater than that of the population as a whole."
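A short R sketch, assuming blood pressure is roughly normally distributed with the mean and standard deviation given above:

```r
# Assume systolic blood pressure is roughly normal with mean 120 and sd 15.
mean_sbp <- 120
sd_sbp   <- 15

# z score for a reading of 140:
z <- (140 - mean_sbp) / sd_sbp
z                                              # about 1.33

# Probability of a reading at or above 140:
1 - pnorm(140, mean = mean_sbp, sd = sd_sbp)   # about 0.09
pnorm(z, lower.tail = FALSE)                   # same answer, using the z score
```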
This is part of the logic of Null Hypothesis Significance Testing (NHST) - if the result in my Crux sample is sufficiently high, then I say that I have rejected the null hypothesis, and found data which support the hypothesis of interest.
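Here is a hedged sketch of that logic with simulated (entirely invented) data and a one-sample t test:

```r
# Simulated (invented) sample of 25 Crux patrons' systolic blood pressures.
set.seed(1)
crux_sample <- rnorm(25, mean = 126, sd = 15)

# One-sided test of H0: mean = 120 against the directional alternative mean > 120.
t.test(crux_sample, mu = 120, alternative = "greater")
# If the p-value is below the conventional .05 cutoff, we reject the null
# hypothesis that Crux patrons have the same average systolic blood pressure
# as the population.
```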
9.4 the most dangerous equation
Just as Tufte (2001) demonstrated that poor data visualizations can be dangerous, leading, for example, to the loss of life in the Challenger disaster, Wainer (2007) shows that a lack of statistical literacy is also "dangerous."
He argues that de Moivre's equation - the equation for the standard error of the mean, sigma / sqrt(n) - is the most dangerous equation: it shows that the variability of a sample mean decreases with the square root of the sample size. Other nominees include the linear regression equation and regression to the mean. Regarding linear regression, Simpson's paradox shows how the direction of regression coefficients may change, or even reverse, when additional variables are added to a model.
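A brief simulation of de Moivre's equation, reusing the blood pressure mean and standard deviation from above:

```r
# The standard error of the mean, sigma / sqrt(n), shrinks with sample size.
set.seed(1)
se_of_means <- function(n, reps = 10000, mu = 120, sigma = 15) {
  sd(replicate(reps, mean(rnorm(n, mean = mu, sd = sigma))))
}

se_of_means(4)     # close to 15 / sqrt(4)   = 7.5
se_of_means(25)    # close to 15 / sqrt(25)  = 3.0
se_of_means(100)   # close to 15 / sqrt(100) = 1.5
```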
I would argue that, from the standpoint of psychology, ignorance of regression to the mean is more 'dangerous' than ignorance of the central limit theorem and the standard error, in particular because regression effects contribute to an overestimate of the effectiveness of punishment and an under-appreciation of the effectiveness of positive reinforcement as tools for behavior change (Hastie and Dawes 2010).
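A simulation sketch of regression to the mean (all parameters invented): when performance reflects both stable ability and luck, the people who do worst on one occasion tend to improve on the next, even with no intervention at all:

```r
# Regression to the mean with two imperfectly correlated measurements.
set.seed(1)
n       <- 10000
ability <- rnorm(n)                     # stable individual differences
time1   <- ability + rnorm(n)           # performance = ability + luck
time2   <- ability + rnorm(n)           # new luck on the second occasion

worst <- time1 < quantile(time1, 0.10)  # the 10% who did worst at time 1
mean(time1[worst])                      # far below average
mean(time2[worst])                      # closer to average, with no 'punishment' at all
```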