What is risk?
1. The setup
Imagine that you go to your doctor. You’d like to know your (individual!) risk for some cancer. Let’s talk about what this actually means. Your risk is defined as the probability that you will actually get that disease. The concept of “true risk” is debated because you either will develop cancer or you won’t1 (see the “problem of the single-case” below). For now, we assume that this is not pre-determined. An estimate of your true risk is practical because 1) neither you, the best researchers, nor the smartest clinicians can be sure that you definitely will or will not develop cancer, and 2) it may prompt you to take preventive action, such as increasing screening or changing your modifiable risk factors (e.g. diet and exercise), that could help reduce your risk.
Assuming that a true risk does exist, let’s talk about your hypothetical probability of developing cancer. We’ll let this little blue circle represent a healthy version of you:
My goal as a researcher is to estimate your (individual) probability of getting cancer. In a statistically perfect, fictitious universe, I would study many (let’s say 100) counterfactual versions of you. These counterfactuals would be identical to you in every way. Let me repeat: in each world, all known and unknown characteristics that make you you would be replicated perfectly.
Then, in each world, I would monitor you to see if you develop cancer. For this demonstration, let’s say you develop cancer in 12 out of 100 worlds. We would thus estimate that your probability of developing cancer is \(\frac{12}{100} \times 100\% = 12\%\).
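To make this concrete, here is a minimal Python sketch (assuming, purely for illustration, a true risk of 12%) that simulates the counterfactual worlds and recovers the probability as an observed proportion:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed for reproducibility

true_p = 0.12   # hypothetical "true" individual risk (assumed for illustration)
n_worlds = 100  # counterfactual worlds we get to observe

# Each world yields 1 if that version of you develops cancer, 0 otherwise
outcomes = rng.binomial(n=1, p=true_p, size=n_worlds)

p_hat = outcomes.mean()  # estimate = observed proportion of "diseased" worlds
print(f"Estimated risk: {p_hat:.0%}")
```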
Let’s stop right there. First of all - why do we get different results in different worlds? As an empiricist, I find this apparent lack of reproducibility deeply unintuitive, even troubling. Secondly, did you notice that I estimated a probability as an observed proportion - why did I do that? Thirdly, once we have an estimate for your probability of disease, what does it even mean? Finally, in the real world, we can’t study 100 counterfactual versions of you, so what do we do? This discussion aims to address each of these points (at somewhat inconsistent levels of depth). Peruse various sections depending on your interests.
2. What is randomness and where does it come from?
True randomness is the absolute lack of pattern or predictability of events, and it is arguably an intrinsic aspect of natural systems2:
- Quantum mechanics states that we cannot precisely know both the position and momentum of subatomic particles, and by measuring either, we increase the uncertainty around the other quantity. Practically, this means that we cannot predict radioactive decay times for individual atoms, Brownian motion, heat transfer patterns, or cosmic radiation, to name a few. Thus, quantum mechanics serves as an inherent source of randomness in the universe.
Alternatively, apparent randomness can be debunked by taking a closer look at the system, revealing order, pattern, and/or predictability:
- Chaos is the unpredictability of a system arising from high sensitivity to initial conditions. Given the slightest change in subatomic or atomic factors, we can see wide variations in outcomes (i.e. the butterfly effect).
- Stochasticity is the idea that a very large number (practically uncountable) of external agents interact, creating an extremely complex web of factors that affect an outcome.
Apparently random processes are extremely complex. Sometimes we approximate them with true randomness to simplify our model and minimize required computational power. For example, genetic mutations and environmental exposures (e.g. air pollution, smoking, diet, exercise, UV exposure, etc.) may not be truly random3, but we can simplify our calculations dramatically by selecting only the most important risk factors and grouping the others into a random error term.
3. Why do we use a proportion to estimate a probability?
The field of statistics gives us several tools to estimate a parameter, which is exactly what our probability is here. Let’s begin by re-stating our problem mathematically. We will consider your health status (yes, yours!) as a random variable \(X\), since we don’t know your outcome yet. Your possible outcomes are healthy (denoted as a \(0\)) and diseased (denoted as a \(1\)). We aim to estimate your probability of being diseased, \(p\), defined as \(p = P(X = 1)\), where \(P\) is a probability function. Since we have two possible outcomes and one probability parameter, we will assume that \(X\) follows a Bernoulli distribution with probability \(p\):
\[X \sim Ber(p)\]
A. Method of moments estimation
The method of moments estimation technique is based on the Weak Law of Large Numbers (WLLN), which says that the empirical mean \(\bar{X}_n\) approaches the true mean \(\mathbb{E}[X]\) of a random variable \(X\) as the number of observations \(n\) gets large. More formally, we state the WLLN as follows:
\[ \bar{X}_n \overset{p}{\to} \mathbb{E}[X] \] where \(\overset{p}{\to}\) indicates convergence in probability.
Thus, we can substitute the observed empirical mean \(\bar{X}\) for its expectation, and solve for our parameter \(p\) as follows:
\[ \begin{align} \mathbb{E}[X] &= p \\ \bar{X} &\approx p \\ \hat{p} &= \frac{1}{n}\sum_{i=1}^{n}X_i \end{align} \]
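A quick simulation illustrates the WLLN at work (a sketch, again assuming a true \(p = 0.12\)): the sample mean wanders for small \(n\) but settles near the true parameter as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
p = 0.12  # assumed true parameter, as in the example above

# As n grows, the sample mean (our method-of-moments estimate) settles near p
for n in [10, 100, 1_000, 100_000]:
    x = rng.binomial(1, p, size=n)  # n Bernoulli(p) draws
    print(f"n = {n:>6}: p_hat = {x.mean():.4f}")
```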
B. Maximum likelihood estimation
An alternative approach, which has many appealing properties, is known as maximum likelihood estimation, or MLE. This method reframes the problem in such a way that we can think about finding the parameter \(p\) that makes our observations \(x\) most likely. For a \(Ber(p)\) random variable, the likelihood function, which is the same as the joint probability mass function, is
\[ \begin{align} {\cal{L}}(p \mid X) &= \prod_{i=1}^{n} P(X_i \mid p) \\ &= \prod_{i=1}^{n} p^{X_i}(1-p)^{1 - X_i} \end{align} \]
We will maximize the log-likelihood (since it is easier to work with) by taking its derivative, setting it equal to zero, and solving for \(p\):
\[ \begin{align} \log{\cal{L}} &= \sum_{i=1}^{n} \left[ X_i \log(p) + (1-X_i)\log(1-p) \right] \\ \frac{d\log{\cal{L}}}{dp} &= \frac{\sum_{i=1}^{n}{X_i}}{p} - \frac{n-\sum_{i=1}^{n}{X_i}}{1-p} \\ 0 &= \frac{\sum_{i=1}^{n}{X_i}}{p} - \frac{n-\sum_{i=1}^{n}{X_i}}{1-p} \\ \hat{p} &= \frac{1}{n}\sum_{i=1}^{n}{X_i} \end{align} \]
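We can sanity-check the closed-form answer numerically. A minimal sketch (with simulated data, using scipy): maximize the Bernoulli log-likelihood directly and confirm the maximizer matches the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=3)
x = rng.binomial(1, 0.12, size=100)  # simulated 0/1 outcomes (assumed data)

def neg_log_lik(p):
    # Negative Bernoulli log-likelihood of the data x at parameter p
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Minimizing the negative log-likelihood maximizes the likelihood
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"numerical MLE: {res.x:.4f}  vs  sample mean: {x.mean():.4f}")
```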
In summary, both methods give the same estimate for \(p\), which is equal to the total number of successes (in this case, the number of times you develop the disease: 12) divided by the total number of trials (in this case, the total number of fictitious worlds where we followed a counterfactual version of you: \(n = 100\)). Thus, we arrive at our previous estimate that you have a 12% probability of getting cancer.
4. What is a probability anyway?
Now that I’ve told you that your estimated risk of developing cancer is 12% - so what? What are some ways you can interpret that statement?
Intuitively, many of us think about probability as the chance or likelihood of something happening. More rigorously, recall that a probability measure must satisfy the three axioms stated by Kolmogorov (1933). Simply put, they say that 1) the total probability of all possible outcomes must equal 1, 2) for any possible outcome, its probability must be non-negative (and hence between 0 and 1), and 3) the probability that any one of several mutually exclusive outcomes occurs is equal to the sum of the probabilities of those outcomes. Mathematically, we write:
For probability triple, \((\Omega, \cal{F}, P)\), where \(\Omega\) = sample space, \(\cal{F}\) = \(\sigma\)-algebra of events, and \(P\) = probability measure, \[P(\Omega) = 1, \] \[P(A) \ge 0 \text{ for any } A \in \cal{F},\] \[P(\cup_{i} A_i) = \sum_{i} P(A_i) \text{ for countable disjoint } A_i \in \cal{F}.\]
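As a quick worked example (using the 12% figure from above and assuming only the two outcomes healthy and diseased), the axioms immediately determine the probability of the complementary event: \[ P(\{\text{healthy}\}) = P(\Omega) - P(\{\text{diseased}\}) = 1 - 0.12 = 0.88, \] since the two outcomes are disjoint and their union is all of \(\Omega\).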
There are at least six distinct ways to interpret probabilities4, but I will highlight the three most common. I won’t go into each thoroughly, but I want to give a sense of the range of what a probability might mean. Perhaps the similarities and differences between the interpretations will help us understand the concept of probability more deeply. I think it’s important for the person communicating a probability, as well as the one receiving it, to be aware of the potential ways that the information may be interpreted.
A. Classical probability
Core idea: Probability is determined a priori by identifying all possible outcomes and assigning each one an equal chance.
Example: Assume a coin is fair. Then since there are two sides (heads and tails), we attribute equal probability to each: \(P(heads) = P(tails) = 0.5\).
Breakdown: The classical interpretation fails for large (uncountable) probability spaces, spaces with unknown or unaccounted-for outcomes, and unequal chances for different events (e.g. weighted dice, an unfair coin).
B. Frequency interpretations
Core idea: Frequentists assume that the long-run frequency of events is fixed, objective, and intrinsically tied to the probability of that event. Thus, instead of considering all of the possible outcomes (as in classical probability), we use the observed outcomes and compare these to the expected results assuming a null hypothesis4.
Example: Let’s assess whether a coin is fair by estimating the probability of the coin landing on heads. We would flip the coin many times and estimate the probability by counting the number of heads and dividing it by the total number of trials. If this is (sufficiently) close to 0.5, we would assume this is a fair coin. Experiments with a large number of repeated trials or large sample sizes of independent events are good applications for frequentist statistics.
Breakdown: Suppose we only observed a few of the trials above. If we observed up to 10 trials, we would be tempted to conclude that your cancer risk is 0/10 = 0%. However, if we observed 11 trials, we would conclude that your risk is 1/11 = 9.1%. With 50 trials, we would estimate that your cancer risk is 4/50 = 8%. But none of these is correct. We know theoretically that a sequence of unlikely events is possible, but frequentists assume that the probability that corresponds most closely with the observed data, regardless of prior knowledge, is the best estimate. As the number of trials increases, this becomes less of a problem, but few real-life events are repeatable an arbitrary number of times, and many are downright unrepeatable (e.g. the Big Bang, a volcanic eruption/earthquake/natural disaster, the 2024 presidential election, etc.), leaving a gap between the observed frequency and the true probability of the event. This is known as the “problem of the single-case.” Matters become even more complex in the presence of dependent events.
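A sketch of this instability (assuming a true risk of 12%; the particular estimates depend on the simulated sequence, so they will not exactly match the 0%, 9.1%, and 8% figures above):

```python
import numpy as np

rng = np.random.default_rng(seed=4)
true_p = 0.12  # assumed true risk, as before
outcomes = rng.binomial(1, true_p, size=100)  # one simulated sequence of trials

# The running frequentist estimate after n trials is just the proportion so far;
# with few trials it can sit far from 12%
for n in [10, 11, 50, 100]:
    print(f"after {n:>3} trials: estimated risk = {outcomes[:n].mean():.1%}")
```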
C. Subjective interpretation (aka Bayesian)
Core idea: A probability represents someone’s degree of belief, confidence, or credence. This “someone” is a rational and logically consistent “suitable agent” who follows the axioms of probability. However, these credences are inherently subjective (hence the name). Your level of credence about an event can be estimated by what you would consider a “fair bet” (i.e. putting your money where your mouth is): “Your degree of belief in E is p iff p units of utility is the price at which you would buy or sell a bet that pays 1 unit of utility if E, 0 if not E”4. In other words, you are indifferent between paying \(p\) for a bet on \(E\) and paying \(1-p\) for a bet on \(\neg E\). In the statistical framework, someone’s belief is based on prior knowledge - summarized as a prior probability with a selected distribution and parameter(s) - which is combined with observed data to obtain a posterior, from which conclusions are drawn5.
\(\text{Bayes Rule:}\) \[ P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)} \] where \(H\) is the hypothesis, \(P(H)\) is the estimate of the probability of the hypothesis (i.e. “prior” probability), \(E\) is the evidence (i.e. data), \(P(E)\) is the marginal likelihood of the data, \(P(E|H)\) is the “likelihood” of the data given the fixed hypothesis, and \(P(H|E)\) is the probability of the hypothesis given the fixed data (i.e. “posterior” probability).
Example: We could evaluate the hypothesis that a coin is fair against the hypothesis that it is not by calculating the posterior under each prior hypothesis. First we might assume that the coin flip is fair (i.e. the flip is distributed Bernoulli(0.5)). Then we flip the coin many times and count the number of heads. We then calculate the likelihood of this event given the fixed hypothesis that the coin is fair. Finally, we multiply the likelihood and the prior (and normalize by the marginal likelihood of the data) to generate a posterior probability that the coin is fair given the data. It is common to compare two hypotheses, so we may also repeat the process assuming another parameter for the prior. Then we could compare the posterior probabilities for these two hypotheses and see which is more probable.
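A minimal sketch of that comparison, assuming 62 heads in 100 flips and equal prior credence in two point hypotheses (\(p = 0.5\) vs. \(p = 0.6\)); both the data and the competing hypotheses here are made up for illustration:

```python
from scipy.stats import binom

heads, flips = 62, 100  # assumed observed data
hypotheses = {"fair (p = 0.5)": 0.5, "biased (p = 0.6)": 0.6}
prior = {h: 0.5 for h in hypotheses}  # equal prior credence in each hypothesis

# Likelihood of the data under each fixed hypothesis
lik = {h: binom.pmf(heads, flips, p) for h, p in hypotheses.items()}

# Bayes' rule: posterior = likelihood * prior / marginal likelihood
marginal = sum(lik[h] * prior[h] for h in hypotheses)
for h in hypotheses:
    print(f"P({h} | data) = {lik[h] * prior[h] / marginal:.3f}")
```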
Breakdown: Selecting a prior is subjective and there are few well-defined methods for choosing one6. Sometimes there is minimal prior knowledge, in which case a uniform (“flat”) prior may be used; the posterior mode then coincides with the frequentist maximum likelihood estimate.
Going back to our cancer example, we employed a frequentist approach to estimate your risk of cancer (in some fictitious universe). If we had a priori knowledge about your risk of cancer, e.g. given your family history, age, or other factors, we could update our probability estimate by combining the observed data and this prior knowledge in a Bayesian approach.
5. How do we actually estimate your probability of disease?
In reality, of course, we can’t study 100 counterfactual versions of you. We have to make an estimate with what’s available to us now. Since you are a human, let’s estimate the risk within a group of 100 randomly sampled humans:
After following these 100 individuals, we find that 11 people get cancer. Risk (or cumulative incidence) is estimated as the number of incident (i.e. new) cases within a given time divided by the number of individuals in the study population (sounds familiar, eh?). Thus, we estimate an 11% risk within this study population.
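In code, the cumulative incidence calculation is the same proportion we have been computing all along (a sketch using this example’s numbers):

```python
n_cases = 11        # incident (new) cases observed during follow-up
n_population = 100  # individuals followed in the study population

risk = n_cases / n_population  # cumulative incidence = new cases / population
print(f"Estimated risk in the study population: {risk:.0%}")  # 11%
```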
Then, assuming that all individuals in this study population are independent and identically distributed*, using the math above, we assign each individual within the target population (which includes you!) the same probability of disease.

*The assumption that the participants are identically distributed is critical, and impossible to test. Practically, researchers and clinicians attempt to group individuals with similar risk factors together to estimate risk, but defining which population is most relevant to you is challenging if not impossible.
6. The reference class problem
For demonstration purposes, let’s assume the same overall risk of 11% in the total population. Now consider the subset of this population who are men (i.e. the left half of the population): 8/50 men get sick, so their risk is 16%. Women (on the right), however, get sick at a rate of 3/50 = 6%. We can subdivide our population further by age, or by any other risk factor (or any combination of these characteristics), and we estimate different risks for each subpopulation. Taking this to an extreme, we see that a young man has a 20% risk, and an older man has a 12% risk; a young woman has an 8% risk, and an older woman has a 4% risk. So which population - and consequently which risk estimate - is most relevant for a given individual?
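A sketch of this subdivision (assuming 25 people per age-sex cell, which reproduces the percentages quoted above) makes the problem stark: the same individual inherits a different risk estimate from every class they belong to.

```python
# (cases, n) per reference class, using the counts assumed in this example
classes = {
    "everyone":    (11, 100),
    "men":         (8, 50),
    "women":       (3, 50),
    "young men":   (5, 25),
    "older men":   (3, 25),
    "young women": (2, 25),
    "older women": (1, 25),
}

# One person can belong to several of these classes at once
for name, (cases, n) in classes.items():
    print(f"{name:>12}: risk = {cases / n:.0%}")
```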
This predicament of identifying the appropriate population in which to estimate risk is known as the “reference class problem.” The problem was identified by John Venn in 1876: “every single thing or event has an indefinite number of properties or attributes observable in it, and might therefore be considered as belonging to an indefinite number of different classes of things”7. Thus, Stern (2012)1 argues that there is no such thing as unconditional or individual risk. The risk factors included in the model that we select determine the population classes, and thus the risk estimates for each class. Even when two models are both well-calibrated, the risk estimates they assign to any one individual can vary substantially, depending on the reference class to which that individual is assigned8.
7. Limitations
The goal of precision medicine is to make the reference class as small as possible so that there is less variability within that group, and better estimates of risk are assigned to each individual9. This is a tricky problem and is discussed thoroughly by Kent et al. (2018). Most importantly, it should be clearly communicated to doctors and patients that risk is estimated using a specific model: “people with X, Y, and Z risk factors have a W% risk for disease Q.”
In my research, however, we skirt the individual-risk issue by making predictions at the population level. We can feel confident making these predictions when the model is shown to perform well on a given population (i.e. it is “well-calibrated”). Thus, our models can help inform how cancer screening guidelines should be set to screen the top X percent of the population at highest risk, or how other resources should be allocated.
Additional limitations include the problem of adjusting someone’s risk if their covariates change over time - it is currently unknown how such changes affect risk, since most models only include covariates measured at baseline. Issues surrounding generalizability and transportability to other populations, particularly for groups that are under-represented in study populations, are also common.