diff --git a/lectures/prob_meaning.md b/lectures/prob_meaning.md index a77cb139b..0914ad45b 100644 --- a/lectures/prob_meaning.md +++ b/lectures/prob_meaning.md @@ -4,7 +4,7 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.10.3 + jupytext_version: 1.17.2 kernelspec: display_name: Python 3 (ipykernel) language: python @@ -23,21 +23,15 @@ This lecture illustrates two distinct interpretations of a **probability distr * A Bayesian interpretation as a **personal opinion** (about a parameter or list of parameters) after seeing a collection of observations -We recommend watching this video about **hypothesis testing** within the frequentist approach +We recommend watching the following two videos before proceeding: -```{youtube} 8JIe_cz6qGA -``` - -After you watch that video, please watch the following video on the Bayesian approach to constructing **coverage intervals** - -```{youtube} Pahyv9i_X2k -``` +* [Hypothesis testing within the frequentist approach](https://www.youtube.com/watch?v=8JIe_cz6qGA) -After you are familiar with the material in these videos, this lecture uses the Socratic method to help consolidate your understanding of the different questions that are answered by +* [The Bayesian approach to constructing coverage intervals](https://www.youtube.com/watch?v=Pahyv9i_X2k) - * a frequentist confidence interval +After you are familiar with the material in these videos, this lecture uses the Socratic method to help consolidate your understanding of these two meanings of probability. - * a Bayesian coverage interval +Along the way we construct a **Bayesian coverage interval** and contrast the question it answers with the frequentist relative-frequency reasoning that opens the lecture. We do this by inviting you to write some Python code. @@ -46,7 +40,7 @@ proceeding to read the rest of the lecture. We provide our own answers as the lecture unfolds, but you'll learn more if you try writing your own code before reading and running ours. -**Code for answering questions:** +### Code for answering questions To answer our coding questions, we’ll start with some imports @@ -69,12 +63,12 @@ The random variable $X $ takes on possible values $k = 0, 1, 2, \ldots, n$ wit $$ p(k \mid \theta) := \mathbb{P}\{X = k \mid \theta\} = -\left(\frac{n!}{k! (n-k)!} \right) \theta^k (1-\theta)^{n-k} +\binom{n}{k} \theta^k (1-\theta)^{n-k} $$ where the fixed parameter $\theta \in (0,1)$. -This is called the **binomial distribution**. +This is called the [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution). Here @@ -102,8 +96,7 @@ Let $\sum_{h=1}^n y_h^i$ denote the total number of times heads come up during Let $f_k$ record the fraction of samples of length $n$ for which $\sum_{h=1}^n y_h^i = k$: $$ -f_k^I = \frac{\textrm{number of samples of length n for which } \sum_{h=1}^n y_h^i = k}{ - I} +f_k^I = \frac{1}{I} \sum_{i=1}^I \mathbb{1}\left\{ \sum_{h=1}^n y_h^i = k \right\} $$ The probability $p(k \mid \theta)$ answers the following question: @@ -115,9 +108,9 @@ As usual, a law of large numbers justifies this answer. ```{exercise} :label: pm_ex1 -1. Please write a Python class to compute $f_k^I$ +1. Write Python code to compute $f_k^I$ -2. Please use your code to compute $f_k^I, k = 0, \ldots , n$ and compare them to +2. Use your code to compute $f_k^I, k = 0, \ldots , n$ and compare them to $p(k \mid \theta)$ for various values of $\theta, n$ and $I$ 3. With the Law of Large Numbers in mind, use your code to describe the relationship between $f_k^I$ and $p(k \mid \theta)$ as $I$ grows @@ -127,50 +120,36 @@ As usual, a law of large numbers justifies this answer. :class: dropdown ``` -Here is one solution: +Here is one solution. + +We simulate the coin flips with one function and assemble the comparison table +with another. ```{code-cell} ipython3 -class Frequentist: - - def __init__(self, θ, n, I, rng=None): - self.θ, self.n, self.I = θ, n, I - self.rng = rng or np.random.default_rng() - - def binomial(self, k): - '''Compute the theoretical probability.''' - self.P = binom.pmf(k, self.n, self.θ) - - def draw(self): - '''Draw n independent flips for I sequences.''' - θ, n, I = self.θ, self.n, self.I - sample = self.rng.random((I, n)) - self.Y = (sample <= θ).astype(int) - - def compute_fk(self, k): - '''Compute f_k^I for a given k.''' - head_counts = np.sum(self.Y, axis=1) - self.f_kI = np.sum(head_counts == k) / self.I - - def compare(self): - '''Compute and print the comparison.''' - self.draw() - rows = [] - for k in range(self.n + 1): - self.binomial(k) - self.compute_fk(k) - rows.append([k, self.P, self.f_kI]) - return pd.DataFrame( - rows, columns=['k', 'Theoretical', 'Frequentist'] - ).set_index('k') +def simulate_head_counts(θ, n, I, seed=1234): + "Simulate I sequences of n coin flips; return the heads count of each sequence." + rng = np.random.default_rng(seed) + Y = (rng.random((I, n)) <= θ).astype(int) + return Y.sum(axis=1) ``` ```{code-cell} ipython3 -rng = np.random.default_rng(123) -θ, n, k, I = 0.7, 20, 10, 1_000_000 +def compare_frequencies(θ, n, I, seed=1234): + "Tabulate theoretical binomial probabilities against simulated frequencies." + head_counts = simulate_head_counts(θ, n, I, seed) + rows = [ + (k, binom.pmf(k, n, θ), np.mean(head_counts == k)) + for k in range(n + 1) + ] + return pd.DataFrame( + rows, columns=['k', 'Theoretical', 'Frequentist'] + ).set_index('k') +``` -freq = Frequentist(θ, n, I, rng=rng) +```{code-cell} ipython3 +θ, n, k, I = 0.7, 20, 10, 1_000_000 -freq.compare() +compare_frequencies(θ, n, I) ``` From the table above, can you see the law of large numbers at work? @@ -180,7 +159,7 @@ From the table above, can you see the law of large numbers at work? Let's do some more calculations. -**Comparison with different $\theta$** +### Comparison with different $\theta$ Now we fix @@ -191,18 +170,14 @@ $$ We'll vary $\theta$ from $0.01$ to $0.99$ and plot outcomes against $\theta$. ```{code-cell} ipython3 -rng = np.random.default_rng(234) θ_low, θ_high, n_thetas = 0.01, 0.99, 50 thetas = np.linspace(θ_low, θ_high, n_thetas) P = [] f_kI = [] for i in range(n_thetas): - freq = Frequentist(thetas[i], n, I, rng=rng) - freq.binomial(k) - freq.draw() - freq.compute_fk(k) - P.append(freq.P) - f_kI.append(freq.f_kI) + P.append(binom.pmf(k, n, thetas[i])) + head_counts = simulate_head_counts(thetas[i], n, I, seed=i) + f_kI.append(np.mean(head_counts == k)) ``` ```{code-cell} ipython3 @@ -219,25 +194,21 @@ ax.legend() plt.show() ``` -**Comparison with different $n$** +### Comparison with different $n$ Now we fix $\theta=0.7, k=10, I=1,000,000$ and vary $n$ from $1$ to $100$. Then we'll plot outcomes. ```{code-cell} ipython3 -rng = np.random.default_rng(345) n_low, n_high, n_ns = 1, 100, 50 ns = np.linspace(n_low, n_high, n_ns, dtype='int') P = [] f_kI = [] for i in range(n_ns): - freq = Frequentist(θ, ns[i], I, rng=rng) - freq.binomial(k) - freq.draw() - freq.compute_fk(k) - P.append(freq.P) - f_kI.append(freq.f_kI) + P.append(binom.pmf(k, ns[i], θ)) + head_counts = simulate_head_counts(θ, ns[i], I, seed=i) + f_kI.append(np.mean(head_counts == k)) ``` ```{code-cell} ipython3 @@ -254,24 +225,22 @@ ax.legend() plt.show() ``` -**Comparison with different $I$** +Because $k=10$ is held fixed, $p(k \mid \theta)$ is zero whenever $n < 10$ — we cannot obtain $10$ heads in fewer than $10$ flips — and it is largest near $n \approx 14$, where the expected number of heads $n\theta$ is closest to $k$. The simulated fraction tracks this shape. + +### Comparison with different $I$ Now we fix $\theta=0.7, n=20, k=10$ and vary $\log(I)$ from $2$ to $6$. ```{code-cell} ipython3 -rng = np.random.default_rng(456) I_log_low, I_log_high, n_Is = 2, 6, 200 log_Is = np.linspace(I_log_low, I_log_high, n_Is) Is = np.power(10, log_Is).astype(int) P = [] f_kI = [] for i in range(n_Is): - freq = Frequentist(θ, n, Is[i], rng=rng) - freq.binomial(k) - freq.draw() - freq.compute_fk(k) - P.append(freq.P) - f_kI.append(freq.f_kI) + P.append(binom.pmf(k, n, θ)) + head_counts = simulate_head_counts(θ, n, Is[i], seed=i) + f_kI.append(np.mean(head_counts == k)) ``` ```{code-cell} ipython3 @@ -308,7 +277,7 @@ $$ So, by the LLN, the average of $\rho_{k,i}$ converges to: $$ -\mathbb{E}[\rho_{k,i}] = p(k \mid \theta) = \left(\frac{n!}{k! (n-k)!} \right) \theta^k (1-\theta)^{n-k} +\mathbb{E}[\rho_{k,i}] = p(k \mid \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k} $$ as $I$ goes to infinity. @@ -327,6 +296,8 @@ Instead, the probability distribution of $\theta$ is now a summary of our views * **before** we have seen **any** data at all, or * **before** we have seen **more** data, after we have seen **some** data +### The beta prior and its posterior + Thus, suppose that, before seeing any data, you have a personal prior probability distribution with density $$ @@ -346,26 +317,22 @@ $$ p(\theta \mid k) = \frac{p(k \mid \theta) \cdot p(\theta)}{\int_0^1 p(k \mid \theta) \cdot p(\theta) \, d\theta} $$ -The exercise below derives a closed form for the posterior. - -```{exercise} -:label: pm_ex2 - -**a)** Please write down the **likelihood function** for a single coin flip with outcome $Y \in \{0, 1\}$. +Because the beta prior is conjugate to the binomial likelihood, this integral evaluates to (the kernel of) another beta density, so that -**b)** Please write down the **posterior** distribution for $\theta$ after observing that single flip. - -**c)** Now pretend that the true value of $\theta = .4$ and that someone who doesn't know this has a beta prior distribution with parameters $\beta = \alpha = .5$. Please write a Python class to simulate this person's personal posterior distribution for $\theta$ for a _single_ sequence of $n$ draws. +$$ +\theta \mid k \sim \textrm{Beta}(\alpha + k, \, \beta + n - k) +$$ (eq:beta_posterior) -**d)** Please plot the posterior distribution for $\theta$ as a function of $\theta$ as $n$ grows as $1, 2, \ldots$. +The first exercise below asks you to derive this closed form. -**e)** For various $n$'s, please describe and compute a Bayesian coverage interval for the interval $[.45, .55]$. +```{exercise} +:label: pm_ex2 -**f)** Please tell what question a Bayesian coverage interval answers. +**a)** Write down the **likelihood function** for a single coin flip with outcome $Y \in \{0, 1\}$. -**g)** Please compute the posterior probability that $\theta \in [.45, .55]$ for various values of sample size $n$. +**b)** Write down the **posterior** distribution for $\theta$ after observing that single flip. -**h)** Please use your Python class to study what happens to the posterior distribution as $n \rightarrow + \infty$, again assuming that the true value of $\theta = .4$, though it is unknown to the person doing the updating via Bayes' Law. +**c)** Derive the closed-form posterior {eq}`eq:beta_posterior` for a sample of $n$ flips that yields $k$ heads. ``` @@ -403,76 +370,90 @@ $$ \theta \mid Y \sim \textrm{Beta}(\alpha + Y, \, \beta + (1-Y)) $$ -The same calculation with the binomial likelihood in place of the Bernoulli likelihood generalizes this result to $n$ flips with $k$ heads: +**c)** The same calculation, with the binomial likelihood in place of the Bernoulli likelihood, generalizes the result to a sample of $n$ flips that yields $k$ heads. + +The beta prior contributes the factor $\theta^{\alpha-1}(1-\theta)^{\beta-1}$, and the binomial likelihood contributes $\theta^{k}(1-\theta)^{n-k}$, so the posterior is proportional to $$ -\theta \mid k \sim \textrm{Beta}(\alpha + k, \, \beta + n - k) +\theta^{\alpha + k - 1} (1-\theta)^{\beta + n - k - 1}, $$ -This is the formula we use in the remaining parts of the exercise. +which is the kernel of a beta density. Hence -**c)** +$$ +\theta \mid k \sim \textrm{Beta}(\alpha + k, \, \beta + n - k), +$$ + +as stated in {eq}`eq:beta_posterior`. + +```{solution-end} +``` + +### Exploring the posterior numerically + +The next exercise puts this posterior to work. + +```{exercise} +:label: pm_ex3 + +**a)** Now pretend that the true value of $\theta = 0.4$ and that someone who doesn't know this has a beta prior distribution with parameters $\beta = \alpha = 0.5$. Write Python code to simulate this person's personal posterior distribution for $\theta$ for a _single_ sequence of $n$ draws. + +**b)** Plot the posterior distribution for $\theta$ as a function of $\theta$ as $n$ grows as $1, 2, \ldots$. + +**c)** For various $n$'s, describe and compute a Bayesian coverage interval for the interval $[0.45, 0.55]$. + +**d)** Tell what question a Bayesian coverage interval answers. + +**e)** Compute the posterior probability that $\theta \in [0.45, 0.55]$ for various values of sample size $n$. + +**f)** Use your Python code to study what happens to the posterior distribution as $n \rightarrow + \infty$, again assuming that the true value of $\theta = 0.4$, though it is unknown to the person doing the updating via Bayes' Law. +``` + +```{solution-start} pm_ex3 +:class: dropdown +``` + +**a)** + +We use one function to simulate a sequence of coin flips and another to form the +Beta posterior from the first `n_obs` of those flips. ```{code-cell} ipython3 -class Bayesian: - - def __init__(self, θ=0.4, n=1_000_000, α=0.5, β=0.5, - rng=None): - ''' - Parameters - ---------- - θ : Probability of heads on each flip. - n : Number of flips in the sequence. - α, β : Parameters of the beta prior on θ. - rng : NumPy random generator. - ''' - self.θ, self.n, self.α, self.β = θ, n, α, β - self.rng = rng or np.random.default_rng() - self.prior = st.beta(α, β) - - def draw(self): - '''Simulate a sequence of n coin flips.''' - array = self.rng.random(self.n) - self.draws = (array < self.θ).astype(int) - - def form_single_posterior(self, n_obs): - '''Return the posterior after the first n_obs flips.''' - heads = self.draws[:n_obs].sum() - tails = n_obs - heads - return st.beta(self.α + heads, self.β + tails) - - def form_posterior_series(self, n_obs_list): - '''Form posteriors for each sample size in n_obs_list.''' - self.posterior_list = [] - for n_obs in n_obs_list: - self.posterior_list.append( - self.form_single_posterior(n_obs) - ) +def simulate_flips(θ=0.4, n=1_000_000, seed=1234): + "Simulate n coin flips, each landing heads (1) with probability θ." + rng = np.random.default_rng(seed) + return (rng.random(n) < θ).astype(int) ``` -**d)** +```{code-cell} ipython3 +def form_posterior(draws, n_obs, α=0.5, β=0.5): + "Beta posterior for θ from the first n_obs flips, given a Beta(α, β) prior." + heads = draws[:n_obs].sum() + return st.beta(α + heads, β + n_obs - heads) +``` + +**b)** ```{code-cell} ipython3 -rng = np.random.default_rng(567) -bayes = Bayesian(rng=rng) -bayes.draw() +draws = simulate_flips() n_obs_list = [1, 2, 3, 4, 5, 10, 20, 50, 100, 1000, 5000, 10_000, 50_000, 100_000, 200_000, 300_000] -bayes.form_posterior_series(n_obs_list) +posterior_list = [form_posterior(draws, n_obs) for n_obs in n_obs_list] θ_values = np.linspace(0.01, 1, 1000) fig, ax = plt.subplots(figsize=(10, 6)) -ax.plot(θ_values, bayes.prior.pdf(θ_values), +prior = st.beta(0.5, 0.5) +ax.plot(θ_values, prior.pdf(θ_values), label='n = 0 (prior)', linestyle='--') for i, n_obs in enumerate(n_obs_list[:10]): - posterior = bayes.posterior_list[i] + posterior = posterior_list[i] ax.plot(θ_values, posterior.pdf(θ_values), label=f'n = {n_obs}') @@ -484,11 +465,11 @@ ax.legend(fontsize=11) plt.show() ``` -**e)** +**c)** ```{code-cell} ipython3 -lower_bound = [post.ppf(0.05) for post in bayes.posterior_list[:10]] -upper_bound = [post.ppf(0.95) for post in bayes.posterior_list[:10]] +lower_bound = [post.ppf(0.05) for post in posterior_list[:10]] +upper_bound = [post.ppf(0.95) for post in posterior_list[:10]] interval_df = pd.DataFrame() interval_df['upper'] = upper_bound @@ -500,7 +481,7 @@ interval_df As $n$ increases, we can see that Bayesian coverage intervals narrow and move toward $0.4$. -**f)** The Bayesian coverage interval tells the range of $\theta$ that corresponds to the [$q_1$, $q_2$] quantiles of the cumulative distribution function (CDF) of the posterior distribution. +**d)** The Bayesian coverage interval tells the range of $\theta$ that corresponds to the [$q_1$, $q_2$] quantiles of the cumulative distribution function (CDF) of the posterior distribution. To construct the coverage interval we first compute a posterior distribution of the unknown parameter $\theta$. @@ -510,14 +491,14 @@ $$ F(a)=q_1,F(b)=q_2 $$ -**g)** +**e)** ```{code-cell} ipython3 left_value, right_value = 0.45, 0.55 posterior_prob_list = [ post.cdf(right_value) - post.cdf(left_value) - for post in bayes.posterior_list + for post in posterior_list ] fig, ax = plt.subplots(figsize=(8, 5)) @@ -534,29 +515,29 @@ ax.set_xlabel('number of observations', fontsize=11) plt.show() ``` -Notice that in the graph above the posterior probability that $\theta \in [.45, .55]$ exhibits a hump shape as $n$ increases. +Notice that in the graph above the posterior probability that $\theta \in [0.45, 0.55]$ exhibits a hump shape as $n$ increases. Two opposing forces are at work. The first force is that the individual adjusts his belief as he observes new outcomes, so his posterior probability distribution becomes more and more realistic, which explains the rise of the posterior probability. -However, $[.45, .55]$ actually excludes the true $\theta =.4 $ that generates the data. +However, $[0.45, 0.55]$ actually excludes the true $\theta = 0.4$ that generates the data. As a result, the posterior probability drops as larger and larger samples refine his posterior probability distribution of $\theta$. The descent seems precipitous only because of the scale of the graph that has the number of observations increasing disproportionately. -When the number of observations becomes large enough, our Bayesian becomes so confident about $\theta$ that he considers $\theta \in [.45, .55]$ very unlikely. +When the number of observations becomes large enough, our Bayesian becomes so confident about $\theta$ that he considers $\theta \in [0.45, 0.55]$ very unlikely. That is why we see a nearly horizontal line when the number of observations exceeds 1000. -**h)** Using the Python class we made above, we can see the evolution of posterior distributions as $n$ approaches infinity. +**f)** Using the functions we wrote above, we can see the evolution of posterior distributions as $n$ approaches infinity. ```{code-cell} ipython3 fig, ax = plt.subplots(figsize=(10, 6)) for i, n_obs in enumerate(n_obs_list[10:]): - posterior = bayes.posterior_list[i + 10] + posterior = posterior_list[i + 10] ax.plot(θ_values, posterior.pdf(θ_values), label=f'n = {n_obs:,}') @@ -575,8 +556,8 @@ Here the posterior mean converges to $0.4$ while the posterior standard deviat To show this, we compute the mean and standard deviation of the posterior distributions. ```{code-cell} ipython3 -mean_list = [post.mean() for post in bayes.posterior_list] -std_list = [post.std() for post in bayes.posterior_list] +mean_list = [post.mean() for post in posterior_list] +std_list = [post.std() for post in posterior_list] fig, ax = plt.subplots(1, 2, figsize=(14, 5)) @@ -600,6 +581,8 @@ plt.show() ```{solution-end} ``` +### Why the posterior concentrates + How shall we interpret the patterns above? The answer is encoded in the Bayesian updating formula derived above. @@ -619,8 +602,8 @@ Since the data are generated with $\theta = 0.4$, the Law of Large Numbers tells Consequently, the posterior mean converges to $0.4$ and the posterior variance shrinks to zero. ```{code-cell} ipython3 -upper_bound = [post.ppf(0.95) for post in bayes.posterior_list] -lower_bound = [post.ppf(0.05) for post in bayes.posterior_list] +upper_bound = [post.ppf(0.95) for post in posterior_list] +lower_bound = [post.ppf(0.05) for post in posterior_list] fig, ax = plt.subplots(figsize=(10, 6)) ax.scatter(np.arange(len(upper_bound)), @@ -640,13 +623,23 @@ plt.show() After observing a large number of outcomes, the posterior distribution collapses around $0.4$. -Thus, the Bayesian statistician comes to believe that $\theta$ is near $.4$. +Thus, the Bayesian statistician comes to believe that $\theta$ is near $0.4$. As shown in the figure above, as the number of observations grows, the Bayesian coverage intervals (BCIs) become narrower and narrower around $0.4$. However, if you take a closer look, you will find that the centers of the BCIs are not exactly $0.4$, due to the persistent influence of the prior distribution and the randomness of the simulation path. +## Comparing the two interpretations + +We can now place the two meanings of probability side by side. + +In the frequentist calculations, $\theta$ was a *fixed* number and a probability was a *relative frequency*: $p(k \mid \theta)$ is the fraction of a large number of independent length-$n$ sequences in which exactly $k$ heads occur, and we confirmed by simulation that $f_k^I \to p(k \mid \theta)$ as $I \to \infty$. + +In the Bayesian calculations, $\theta$ was itself a *random variable* describing our beliefs, and a probability was a *degree of belief*. The Bayesian coverage interval summarizes the posterior $p(\theta \mid k)$: it reports the range of $\theta$ lying between two posterior quantiles, and so answers the question "given the data I have seen and my prior, where do I now believe $\theta$ lies?" + +These are genuinely different questions, even though both rest on the same binomial likelihood. The frequentist statement describes the data-generating mechanism for a fixed $\theta$; the Bayesian statement describes our uncertainty about $\theta$ given the particular data we happened to observe. + ## Role of a Conjugate Prior We have made assumptions that link functional forms of our likelihood function and our prior in a way that has eased our calculations considerably.