11. Risk, Ambiguity, and Misspecification[1]#
\(\newcommand{\eqdef}{\stackrel{\text{def}}{=}}\)
“Uncertainty must be taken in a sense radically distinct from the familiar notion of risk, from which it has never been properly separated…. and there are far-reaching and crucial differences in the bearings of the phenomena depending on which of the two is really present and operating.” [Knight, 1921]
11.1. Introduction#
Likelihood functions are probability distributions conditioned on parameters; prior probability distributions describe a decision maker’s subjective belief about those parameters.[2] By distinguishing roles played by likelihood functions and subjective priors over their parameters, this chapter brings some recent contributions to decision theory into contact with statistics and econometrics in ways that can address practical econometric concerns about model misspecifications and choices of prior probabilities.
We combine ideas from control theories that construct decision rules that are robust to a class of model misspecifications with axiomatic decision theories invented by economic theorists. Such decision theories originated with axiomatic formulations by von Neumann and Morgenstern, Savage, and Wald ([Wald, 1947, Wald, 1949, Wald, 1950], [Savage, 1954]). Ellsberg ([Ellsberg, 1961]) pointed out that Savage’s framework seems to include nothing that could be called “uncertainty” as distinct from “risk”. Theorists after Ellsberg constructed coherent systems of axioms that embrace a notion of ambiguity aversion. However, most recent axiomatic formulations of decision making under uncertainty in economics are not cast explicitly in terms of likelihood functions and prior distributions over parameters.
This chapter reinterprets objects that appear in some of those axiomatic foundations of decision theories in ways useful to an econometrician. We do this by showing how to use an axiomatic structure to express ambiguity about a prior over a family of statistical models, on the one hand, along with concerns about misspecifications of those models, on the other hand.
Although they proceeded differently than we do here, [Chamberlain, 2020], [Cerreia-Vioglio et al., 2013], and [Denti and Pomatto, 2022] studied related issues. [Chamberlain, 2020] emphasized that likelihoods and priors are both vulnerable to potential misspecifications. He focused on uncertainty about predictive distributions constructed by integrating likelihoods with respect to priors. In contrast to Chamberlain, we formulate a decision theory that distinguishes uncertainties about priors from uncertainties about likelihoods. [Cerreia-Vioglio et al., 2013] (section 4.2) provided a rationalization of the smooth ambiguity preferences proposed by [Klibanoff et al., 2005] that includes likelihoods and priors as components. [Denti and Pomatto, 2022] extended this approach by using an axiomatic revealed preference approach to deduce a parameterization of a likelihood function. However, neither [Cerreia-Vioglio et al., 2013] nor [Denti and Pomatto, 2022] sharply distinguished prior uncertainty from concerns about misspecifications of likelihood functions. We want to do that. We formulate concerns about statistical model misspecifications as uncertainty about likelihoods.
More specifically, we align definitions of statistical models, uncertainty, and ambiguity with ideas from decision theories that build on [Anscombe and Aumann, 1963]’s way of representing subjective and objective uncertainties. In particular, we connect our analysis to econometrics and robust control theory by using Anscombe-Aumann states as parameters that index alternative statistical models of random variables that affect outcomes that a decision maker cares about. By modifying aspects of [Gilboa et al., 2010], [Cerreia-Vioglio et al., 2013], and [Denti and Pomatto, 2022], we show how to use variational preferences to represent uncertainty about priors and also concerns about statistical model misspecifications.
Discrepancies between two probability distributions occur throughout our analysis. This fact opens possible connections between our framework and some models in “behavioral” economics and finance that assume that decision makers inside their models have expected utility preferences in which an agent’s subjective probability – typically a predictive density – differs systematically from the predictive density that the model user assumes governs the data.[3] Other “behavioral” models focus on putative differences among agents’ degrees of confidence in their views of the world. Our framework implies that the form taken by a “lack of confidence” should depend on the probabilistic concept about which a decision maker is uncertain. Preference structures that we describe in this chapter allow us to formalize different amounts of “confidence” about details of specifications of particular statistical models, on one hand, and about subjective probabilities to attach to alternative statistical models, on the other hand. Our representations of preferences provide ways to characterize degrees of confidence in terms of statistical discrepancies between alternative probability distributions.[4]
11.2. Background motivation#
While we are continually reminded that we live in a data-rich environment, the data are not “rich” along all dimensions relevant for decision making. Moreover, data do not simply “speak”; they require a conceptual framework for interpretation. Limited knowledge along some dimensions has given rise to “deep uncertainties” in many policy-making settings. An example of current interest emerges in the study of prudent economic policy to address adverse implications of climate change. This perspective is expressed in a recent paper:
“The economic consequences of many of the complex risks associated with climate change cannot, however, currently be quantified. … these unquantified, poorly understood and often deeply uncertain risks can and should be included in economic evaluations and decision-making processes.” [Rising et al., 2022]
We will capture deep uncertainty as a lack of confidence in expressing probabilities over events or outcomes pertinent to the design of policy. While we will entertain ambiguity in probabilities, we necessarily impose some bounds on this ambiguity.
It is unfortunate that aggregate or macroeconomic uncertainty is often treated as second-order in nature. In policy-making settings, there is sometimes an expressed concern that acknowledging uncertainty will reinforce passivity in policy making, but this turns out not to be an across-the-board implication of uncertainty aversion. In fact, there is a danger in the pretense of knowledge, as Hayek emphasized in his haunting forewarning:
“Even if true scientists should recognize the limits of studying human behavior, as long as the public has expectations, there will be people who pretend or believe that they can do more to meet popular demand than what is really in their power.” [Hayek, 1989]
As economists, we often like to explore tradeoffs. The explicit incorporation of a broad notion of uncertainty allows us to explore two tradeoffs pertinent to decision making. One such tradeoff emerges when we consider implications from multiple models or alternative parameter configurations: how much weight do we assign to:
best guesses
potentially bad outcomes
when making decisions? Focusing exclusively on best guesses may naively ignore adverse possibilities worthy of consideration. Focusing exclusively on potentially bad outcomes may lead to extreme policies that perform poorly in more normal circumstances. These extremes open the door to a tradeoff reflected in aversion to uncertainty.
A second tradeoff of interest is dynamic in nature: do we act now, or do we wait until we learn more when dealing with future uncertainty? While waiting may seem to be particularly attractive, it can sometimes be costly to wait and more prudent to at least take some action now even if our current knowledge is incomplete.
11.2.1. Aims#
In this chapter we allow for a broad perspective on uncertainty, encompassing:
risk: unknown outcomes with known probabilities;
ambiguity: unknown weights to assign to alternative probability models;
misspecification: unknown ways in which a model might provide flawed probabilistic predictions.
We place particular emphasis on formulations that are tractable and revealing.
11.3. Decision theory overview#
Decision theory under uncertainty provides alternative axiomatic formulations of “rationality.” As there are multiple axiomatic formulations of decision making under uncertainty, it is perhaps best to replace the term “rational” with “prudent.” While these axiomatic formulations are of intellectual and substantive interest, in this chapter we will focus on the implied representations. This approach remains interesting because we have sympathy for Savage’s own perspective on his elegant axiomatic formulation:
“Indeed the axioms have served their only function in the existence arguments of Theorems 2 and 3; in further exploitation of the theory, …, the axioms themselves can and, in my experience, should be forgotten.” [Savage, 1952]
11.3.1. Approach#
In this chapter we will exploit modifications of Savage-style axiomatic formulations from decision theory under uncertainty to investigate notions of uncertainty beyond risk. The overall aim is to make contact with applied challenges in economics and other disciplines. We will start with the basics of statistical decision theory and then proceed to explore extensions that distinguish concerns about potential misspecifications of likelihoods from concerns about the misspecification of priors. This opens the door to better ways of conducting uncertainty quantification for dynamic, stochastic economic models used for private-sector planning and governmental policy assessment. It is achieved by providing tractable and revealing methods for exploring sensitivity to subjective uncertainties, including potential model misspecification and ambiguity across models. This will allow us to systematically:
assess the impact of uncertainty on prudent decision or policy outcomes;
isolate the forms of uncertainty that are most consequential for these outcomes.
To make the methods tractable and revealing we will utilize tools from probability and statistics to limit the type and amount of uncertainty that is entertained. As inputs, the resulting representations of objectives for decision making will require a specification of aversion to or dislike of uncertainty about probabilities over future events.
11.3.2. Anscombe-Aumann (AA)#
[Anscombe and Aumann, 1963] provided a different way to justify Savage’s representation of decision making in the presence of subjective uncertainty. They feature prominently the distinction between a “horse race” and a “roulette wheel”. They rationalize preferences over acts, where an act maps states into lotteries over prizes. The latter is the counterpart to a roulette wheel. Probability assignments over states then become the subjective input and the counterpart to the “horse race.”
[Anscombe and Aumann, 1963] used this formulation to extend von Neumann-Morgenstern expected utility with known probabilities to decision problems in which subjective probabilities also play a central role, as in Savage’s approach. While [Anscombe and Aumann, 1963] provides an alternative derivation of subjective expected utility, many subsequent contributions used the Anscombe-Aumann framework to extend the analysis to incorporate forms of ambiguity aversion. Prominent examples include [Gilboa and Schmeidler, 1989] and [Maccheroni et al., 2006]. In what follows we provide a statistical slant to such analyses.
11.3.3. Basic setup#
Consider a parameterized model of a random vector with realization \(x\):
where
\(\theta \in \Theta\) and \(\Theta\) is a parameter space, and \(\mathcal X\) is the space of possible realizations of \(x\). We refer to \(\ell\) as a likelihood and the probability implied by each \(\theta\) as a “structured” probability model.
Denote by \(\pi\) a prior distribution over \(\Theta.\) We will sometimes have reason to focus on a particular prior, \(\pi,\) that we will call a baseline prior. We denote a prize rule by \(\gamma,\) which maps \(\mathcal X\) into prizes. We define a decision rule, \(\delta,\) that can condition on observations or signals. To elaborate further, partition \(x = (w, z),\) where the decision rule can depend on \(z\) and the prize rule on the entire \(x\) vector. A probability distribution over the \(w\)’s reflects random outcomes realized after a decision has been made. For instance, in an intertemporal context, \(w\) may reflect future shocks. Thus decisions may have further repercussions for prizes beyond \(z\):
Since the decision rule can depend on \(z,\) ex post learning is entertained. Decision rules are restricted to be in a set \(\Delta\) and imply restrictions for prize rules:
The preferences over prize rules imply a ranking over decision rules, the \(\delta\)’s, in \(\Delta\). In what follows, we define:
While we are featuring the impact of a decision rule on a prize rule, we may extend the analysis to allow \(\delta\) to influence \(\ell\) as happens when we entertain experimentation.
Risk is assessed using expected utility with a utility function \({U}\). We compute the expected utility for prize rules as a function of \(\theta\) as:
Following the language of statistical decision theory, we call \({\overline {U}}(\gamma \mid \theta)\) the risk function for a given prize rule when viewed as a function of \(\theta.\)[5]
We allow the utility function to depend on the unknown parameter \(\theta,\) as is common in statistical decision problems. Arguably, such a formulation is shorthand for a more primitive specification in which the parameter has ramifications for a future prize.
11.3.4. A simple statistical application#
As an illustration, we consider a model selection problem. Suppose \(\Theta = \{\theta_1, \theta_2\},\) and that the decision maker can use the entire underlying random vector. Thus, we make no distinction between \(\gamma\) and \(\delta\) and none between \(x\) and \(z\).
The decision rule, \(\delta = \gamma\) is a mapping from \(\mathcal Z\) to \([0,1]\) where \(\delta(z) = 1\) means that the decision maker selects model \(\theta_1\) and \(\delta(z) = 0\) means that the decision maker selects model \(\theta_2.\)
We allow for intermediate values between zero and one, which can be interpreted as randomization. These intermediate choices will end up not being of particular interest for this example.
The utility function for assessing risk is:
where \(\upsilon_1\) and \(\upsilon_2\) are positive utility parameters.
A class of decision rules, called threshold rules, will be of particular interest. Partition \(\mathcal Z\) into two sets:
where the intersection of \(\mathcal Z_1\) and \(\mathcal Z_2\) is empty. A threshold rule has the form:
For a threshold rule, the conditional expected utility is
Suppose that the utility weights \(\upsilon_1\) and \(\upsilon_2\) are both one. Under this threshold decision rule, \(1 - {\overline {U}}(\gamma \mid \theta_1)\) is the probability of making a mistake when model \(\theta_1\) generates the data, and \(1 - {\overline {U}}(\gamma \mid \theta_2)\) is the probability of making a mistake when model \(\theta_2\) generates the data. In statistics, the first of these is called a type I error and the second a type II error, provided that we treat \(\theta_1\) as the “null model” and \(\theta_2\) as the alternative. The utility weights determine the relative importance to the decision maker of correctly identifying each model.
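To make these risk computations concrete, here is a minimal sketch in which the two structured models are taken to be unit-variance normal distributions. The Gaussian specification, the parameter values, and the function names are illustrative assumptions rather than part of the formal development.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-model example: z ~ N(mu_k, 1) under model theta_k.
mu1, mu2 = 1.0, -1.0
v1, v2 = 1.0, 1.0          # utility weights upsilon_1 and upsilon_2

def risks(r, n_draws=100_000, seed=0):
    """Risk pair for the threshold rule that selects model theta_1 when
    log l(z | theta_1) - log l(z | theta_2) >= r."""
    rng = np.random.default_rng(seed)
    llr = lambda z: norm.logpdf(z, mu1, 1.0) - norm.logpdf(z, mu2, 1.0)
    z1 = rng.normal(mu1, 1.0, n_draws)   # data generated by model theta_1
    z2 = rng.normal(mu2, 1.0, n_draws)   # data generated by model theta_2
    u1 = v1 * np.mean(llr(z1) >= r)      # Ubar(gamma | theta_1)
    u2 = v2 * np.mean(llr(z2) < r)       # Ubar(gamma | theta_2)
    return u1, u2

u1, u2 = risks(0.0)
print(f"type I error:  {1 - u1:.3f}")    # mistake probability under theta_1
print(f"type II error: {1 - u2:.3f}")    # mistake probability under theta_2
```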
11.4. Subjective expected utility#
Order preferences over \(\gamma\), and hence over \(\delta\), by:
\[
\int_\Theta {\overline U}(\gamma \mid \theta)\, d\pi(\theta)
\]
for a specific \(\pi.\) This representation is supported by Savage and Anscombe-Aumann axioms, but it imposes full confidence in both the prior and the likelihood, with no allowance for potential misspecification of either.
We use these preferences for a decision problem where prize rules are restricted to be in the set \(\Gamma(\Delta)\):
Problem 11.1
Recall the partitioning \(x = (w, z),\) where the decision rule can depend only on \(z\) and the prize rule on the entire \(x\) vector. Factor \(\ell(\cdot \mid \theta)\) and \(\tau\) as:
These factorizations in (11.2) allow us to write the objective as:
To solve problem (11.1), it is convenient to exchange the orders of integration in the objective:
Notice that even if the utility function \({U}\) does not depend on \(\theta\), this dependence may emerge after we integrate over \({\mathcal W}\) because of the dependence of \(\ell_2(\cdot \mid \theta)\) on the unknown parameter.
As \(\delta\) only depends on \(z\) and the objective is additively separable in \(z\), we may solve a conditional problem using the objective:
for each value of \(z\) provided that the restrictions imposed on \(\delta\) by the construction of the set of decision rules \(\Delta\) are separable in \(z\). That is, provided that we can write:
for given constraint sets \(\Delta(z),\) we may solve
Problem 11.2
Finally, notice that
is the Bayesian posterior distribution for \(\theta.\) Equivalently, we may solve the conditional Bayesian problem:
Problem 11.3
since in forming the objective of the conditional Bayesian problem (11.4), we divided the objective of the conditional problem (11.3) by a function of \(z\) alone.
For illustration purposes, consider the example given in Section A simple statistical application. In this example, \(x = z\) and there is no \(w\) contribution. Impose prior probabilities \( \pi(\theta_1)\) and \( \pi(\theta_2) = 1 - \pi(\theta_1)\) on the two models. Compute the Bayesian posterior probabilities for each value of \(\theta\):
Consider the conditional problem. If the decision maker chooses model one, then the conditional expected utility is \( \upsilon_1 \bar \pi (\theta_1 \mid z) \) and similarly for choosing model two. Thus the Bayesian decision maker computes:
and chooses a model in accordance with this maximization. This maximization is equivalent to
expressed in terms of the prior, likelihood, and utility contributions. Taking logarithms and rearranging, we see that model \(\theta_1\) is selected if
If the right side of this inequality is zero, say because prior probabilities are the same across models and utility weights are also the same, then the decision rule says to maximize the log likelihood. More generally both prior weights and utility weights come into play.
Notice that this decision rule is a threshold rule where we use the posterior probabilities to partition the \(\mathcal Z\) space. The subset \(\mathcal Z_1\) contains all \(z\) such that inequality (11.5) is satisfied as a weak inequality. We arbitrarily include the indifference values in \(\mathcal Z_1\).
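A minimal sketch of this selection rule, reusing the hypothetical Gaussian pair from the earlier sketch; the cutoff below is our arrangement of inequality (11.5) in terms of log prior odds and log utility odds, and all names and parameter values are again illustrative choices.

```python
import numpy as np
from scipy.stats import norm

mu1, mu2 = 1.0, -1.0   # hypothetical models: z ~ N(mu_k, 1) under theta_k

def select_model(z, prior1=0.5, v1=1.0, v2=1.0):
    """Select theta_1 when the log-likelihood ratio clears the cutoff
    implied by the prior probabilities and the utility weights."""
    llr = norm.logpdf(z, mu1, 1.0) - norm.logpdf(z, mu2, 1.0)
    cutoff = np.log(v2 * (1.0 - prior1)) - np.log(v1 * prior1)
    return 1 if llr >= cutoff else 2     # indifference assigned to theta_1

print(select_model(0.3))                      # equal priors and weights: maximize log likelihood
print(select_model(0.3, prior1=0.2, v2=2.0))  # priors and utility weights shift the cutoff
```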
The Bayesian solution to the decision problem is posed assuming full confidence in a subjective prior distribution. In many problems, including ones with multiple sources of uncertainty, such confidence may well not be warranted. Such a concern might well have been the motivation behind Savage’s remark:
… if I knew of any good way to make a mathematical model of these phenomena [vagueness and indecision], I would adopt it, but I despair of finding one. One of the consequences of vagueness is that we are able to elicit precise probabilities by self-interrogation in some situations but not in others.
Personal communication from L. J. Savage to Karl Popper in 1957
11.5. An extreme response#
Suppose we go to the other extreme and avoid imposing a prior altogether. Compare two prize rules, \(\gamma_1\) and \(\gamma_2\), by computing the conditional (on \(\theta\)) expected utilities, \({\overline {U}} (\gamma_1 \mid \theta)\) and \({\overline {U}} (\gamma_2 \mid \theta)\), for each \(\theta \in \Theta\). Then \(\gamma_2\) is preferred to \(\gamma_1\) if the conditional expected utility of \(\gamma_2\) exceeds that of \(\gamma_1\) for all \(\theta \in \Theta\). This, however, only implies a partial ordering among prize rules. Many such rules cannot be ranked. This partial ordering gives rise to a construct called admissibility, where an admissible \(\delta \in \Delta\) cannot be dominated in the sense of this partial order.
11.5.1. Constructing admissible decision rules#
One way to construct an admissible decision rule is to impose a prior and solve the resulting Bayesian decision problem. We give two situations in which this result necessarily applies, but there are other settings where this result is known to hold.
Proposition 11.1
If an ex ante Bayesian decision problem, (11.1), has a unique solution (except possibly on a set that has measure zero under \(\tau_2\)), then this Bayesian solution is admissible.
Proof. Let \({\tilde \delta}\) be a decision rule that weakly dominates a Bayesian decision rule, \(\delta\), in the sense that
for \(\theta \in \Theta.\) Averaging with respect to the prior implies that \({\tilde \delta}\) must also solve the ex ante Bayesian decision problem. Since the solution to the ex ante decision problem is unique, \({\tilde \delta} = \delta\).
Proposition 11.2
Suppose \(\Theta\) has a finite number of elements. If a prior distribution \(d\pi\) assigns positive probability to each element of \(\Theta,\) then a decision rule that solves the Bayesian decision problem (11.1) is admissible.
Proof. Let \({\tilde \delta}\) be a decision rule that weakly dominates a Bayesian decision rule, \(\delta\), in the sense that
for \(\theta \in \Theta.\) Suppose that the prior, \(d\pi,\) used in constructing the decision rule, \(\delta,\) assigns strictly positive probability to each value of \(\theta \in \Theta\). Use this prior to form expectations of both sides of inequality (11.6),
But since \(\delta\) solves the Bayesian decision problem, this latter inequality must hold with equality. Since each element of \(\Theta\) has strictly positive prior probability, inequality (11.6) must also hold with equality. Therefore, \(\delta\) must be admissible.
Remark 11.1
While we are primarily interested in the use of alternative subjective priors as a way to construct admissible decision rules, sufficient conditions have been derived under which we can find priors that give Bayesian justifications for all admissible decision rules. Such results come under the heading of Complete class theorems. See, for instance, [LeCam, 1955], [Ferguson, 1967], and [Brown, 1981].
11.5.2. A simple statistical application reconsidered#
For illustration purposes, we again consider the model selection example. Consider a threshold decision rule of the form:
From formula (11.5), provided that we choose the prior probabilities to satisfy:
threshold rule (11.7) solves a Bayesian decision problem. Thus the implicit prior for the threshold rule is:
To provide a complementary analysis, form:
and use the probability measure for \(z\) under model \(\theta_1\) to induce a corresponding probability measure for the scalar \(y\). Suppose this probability measure has a density \(f( \cdot \mid \theta_1) \) relative to Lebesgue measure. Observe that the counterpart density, \(f(\cdot \mid \theta_2)\) satisfies:
For a decision rule, \(\delta_{\sf r}, \) with threshold \({\sf r},\) compute the two risks:
where we include the multiplication by \(\exp(-y)\) in the second row to convert the computation to one using the model \(\theta_2\) probabilities.
Consider the two-dimensional curve of model risks, \(({\overline u}_1({\sf r}), {\overline u}_2({\sf r})),\) parameterized by the threshold \({\sf r}\). The slope of this curve at the point corresponding to \({\sf r}\) is the ratio of the two derivatives with respect to \({\sf r}\):
The second-order derivative is
and hence the curve is concave.
Using prior probabilities to weight the two risks gives:
Maximizing this objective by choice of a threshold \({\sf r}\) gives the first-order condition:
implying that
As expected, this agrees with (11.8). Thus the negative of the slope of the curve reveals the ratio of probabilities that would justify a Bayesian solution given a threshold \({\sf r}\).
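This tangency calculation can be carried out directly in the Gaussian sketch introduced earlier, where both risks have closed forms; the first-order condition \(\pi(\theta_1) {\overline u}_1'({\sf r}) + \pi(\theta_2) {\overline u}_2'({\sf r}) = 0\) recovers the implied prior at any threshold. The specification remains an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm

mu1, mu2, v1, v2 = 1.0, -1.0, 1.0, 1.0   # hypothetical Gaussian pair

# With z ~ N(mu_k, 1) and these means, the log-likelihood ratio is 2z, so
# the rule "llr >= r" is "z >= r/2" and both risks have closed forms:
u1 = lambda r: v1 * norm.sf(r / 2, mu1, 1.0)    # v1 * P(select 1 | theta_1)
u2 = lambda r: v2 * norm.cdf(r / 2, mu2, 1.0)   # v2 * P(select 2 | theta_2)

def implied_prior(r):
    """pi(theta_1) that makes threshold r a Bayesian solution, from the
    first-order condition pi1 * u1'(r) + pi2 * u2'(r) = 0."""
    du1 = -v1 * norm.pdf(r / 2, mu1, 1.0) / 2   # d u1 / d r
    du2 = v2 * norm.pdf(r / 2, mu2, 1.0) / 2    # d u2 / d r
    odds = -du2 / du1                            # pi(theta_1) / pi(theta_2)
    return odds / (1.0 + odds)

for r in (-1.0, 0.0, 1.0):
    print(f"r = {r:5.2f}:  implied pi(theta_1) = {implied_prior(r):.3f}")
```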
We illustrate this computation in Figures Fig. 11.1 and Fig. 11.2. Both figures report the upper boundary of the set of feasible risks for alternative decision rules. The risks along the boundary are attainable with admissible decision rules. The utility weights \(\upsilon_1\) and \(\upsilon_2\) are both set to one in Figure Fig. 11.1, and \(\upsilon_2\) is set to \(.5\) in Figure Fig. 11.2. Thus, the upper envelope of risks is flatter in Figure Fig. 11.2 than in Figure Fig. 11.1. The flatter curve implies prior probabilities that are close to equal.

Fig. 11.1 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are given by \(\upsilon_1=\upsilon_2=1.\) When \({\overline u}_{\sf r}(\theta_1) = .9\), the implied prior is given by \(\pi(\theta_1) =.68\) and \(\pi(\theta_2)=.32,\) as implied by the slope of the tangent line.#

Fig. 11.2 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are given by \(\upsilon_1= 1, \, \upsilon_2=.5.\) When \({\overline u}_{\sf r}(\theta_1) = .9\), the implied prior is given by \(\pi(\theta_1) =.51\) and \(\pi(\theta_2)=.49,\) as implied by the slope of the tangent line.#
11.6. Divergences#
To investigate prior sensitivity, we seek a convenient way to represent a family of alternative priors. We start with a baseline prior \(d \pi(\theta).\) Consider alternative priors of the form \(d {\tilde \pi}(\theta) = n(\theta) d\pi(\theta)\) for \(n \ge 0\) satisfying:
\[
\int_\Theta n(\theta)\, d\pi(\theta) = 1.
\]
Call this collection \({\mathcal N}\). Thus the \(n\)’s in \({\mathcal N}\) are expressed as densities relative to a baseline prior distribution.
Introduce a convex function \(\phi\) to construct a divergence between a probability represented with \(n \ne 1\) and the baseline probability. Restrict \(\phi\) to satisfy \(\phi(1) = 0\) and \(\phi''(1) = 1\) (a normalization). As a measure of divergence, form
\[
\int_\Theta \phi[n(\theta)]\, d\pi(\theta) \ge 0,
\]
where nonnegativity follows from Jensen’s inequality.
Of course, many such divergences could be built. Three interesting ones use the convex functions:
\(\phi(n) =- \log n\)
\(\phi(n) =\frac {n^2 -1}{2}\)
\(\phi(n) = n \log n\)
The divergence implied from the third choice is commonly used in applied probability theory and information theory. It is called Kullback-Leibler divergence or relative entropy.
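Here is a minimal sketch of these three divergences for a discrete parameter space; the baseline prior and the alternative prior are hypothetical choices of ours.

```python
import numpy as np

# The three convex functions from the text; each has phi(1) = 0, phi''(1) = 1.
phis = {
    "-log n":      lambda n: -np.log(n),
    "(n^2 - 1)/2": lambda n: (n**2 - 1) / 2,
    "n log n":     lambda n: n * np.log(n),   # relative entropy
}

def divergence(phi, n, pi):
    """int phi(n) dpi for a discrete baseline prior pi and density ratio n."""
    return np.sum(phi(n) * pi)

# Hypothetical baseline prior over three models and an alternative prior,
# expressed as a density ratio n (so that the sum of n * pi equals one).
pi = np.array([0.5, 0.3, 0.2])
alt = np.array([0.6, 0.3, 0.1])
n = alt / pi

for name, phi in phis.items():
    print(f"phi(n) = {name:12s} divergence = {divergence(phi, n, pi):.4f}")
```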
11.7. Robust Bayesian preferences and ambiguity aversion#
Since the Bayesian ranking of prize rules depends on the prior distribution, we now explore how to proceed if the decision maker does not have full confidence in a specific prior. This leads naturally to an investigation of prior sensitivity. A decision or policy problem provides us with an answer to the question: sensitive to what? One way to investigate prior sensitivity is to approach it from the perspective of robustness. A robust decision rule then becomes one that performs well under alternative priors of interest. To obtain robustness guarantees, we are naturally led to minimization, providing us with a lower bound on performance. As we will see, prior robustness has very close ties to preferences that display ambiguity aversion. Just as risk aversion induces a form of caution in the presence of uncertain outcomes, ambiguity aversion induces caution because of the lack of confidence in a single prior.
We explore prior robustness by using a version of variational preferences ([Maccheroni et al., 2006]):
for \(\xi_1 > 0\) and a convex function \(\phi_1\) such that \(\phi_1(1)= 0\) and \({\phi_1}''(1) = 1.\) The penalty parameter \(\xi_1\) reflects the degree of ambiguity aversion. An arbitrarily large value of \(\xi_1\) approximates subjective expected utility. Relatively small values of \(\xi_1\) induce relatively large degrees of ambiguity aversion.
Remark 11.2
Axiomatic developments of decision theory in the presence of risk typically do not produce the functional form for the utility function. That requires additional considerations. An analogous observation applies to the axiomatic development of variational preferences by [Maccheroni et al., 2006]. Their axioms do not inform us how to specify the cost associated with the search over alternative priors.
Remark 11.3
The variational preferences of [Maccheroni et al., 2006] also include preferences with a constraint on priors:
The more restrictive axiomatic formulation of [Gilboa and Schmeidler, 1989] supports a representation with a constraint on the set of priors. In this case we use standard Karush-Kuhn-Tucker multipliers to model the preference relation:
11.7.1. Relative entropy divergence#
Suppose we use \(n \log n\) to construct our divergence measure. Recall the construction of the risk function
Solve the Lagrangian:
This problem separates in terms of the choice of \(n(\theta)\), and can be solved \(\theta\) by \(\theta\). The first-order conditions are:
Solving for \(\log n(\theta)\):
Thus
Imposing the integral constraint on \(n\) gives the solution:
\[
n^*(\theta) = \frac{\exp\left[- {\overline U}(\gamma \mid \theta)/\xi_1 \right]}{\int_\Theta \exp\left[- {\overline U}(\gamma \mid {\tilde \theta})/\xi_1 \right] d\pi({\tilde \theta})}
\]
provided that the denominator is finite. This solution induces what is known as exponential tilting. The baseline probabilities are tilted towards lower values of \({\overline { U}}(\gamma \mid \theta)\). Plugging back into the minimization problem gives:
\[
- \xi_1 \log \int_\Theta \exp\left[- \frac{{\overline U}(\gamma \mid \theta)}{\xi_1} \right] d\pi(\theta).
\]
This minimized objective is a special case of the smooth ambiguity preferences initially proposed by [Klibanoff et al., 2005], although these authors provide a different motivation for their ambiguity adjustment. The connection we articulate opens the door to a more direct link to challenges familiar to statisticians and econometricians wrestling with how to analyze and interpret data. Indeed, [Cerreia-Vioglio et al., 2013] also adopt a robust statistics perspective when exploring smooth ambiguity preferences. They use constructs and distinctions of the type we explored in Chapter 1 in characterizing what is and is not learnable from the Law of Large Numbers.
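The exponential-tilting solution and the resulting smooth ambiguity objective are easy to compute when \(\Theta\) is finite. The sketch below, with a hypothetical risk function and baseline prior of our choosing, shows how the tilted prior shifts toward low expected-utility models as \(\xi_1\) falls.

```python
import numpy as np

def worst_case_prior(U_bar, pi, xi1):
    """Exponentially tilted worst-case prior and the minimized objective
    -xi1 * log( sum over theta of exp(-U_bar/xi1) * pi ) for finite Theta."""
    log_w = -U_bar / xi1 + np.log(pi)
    log_norm = np.logaddexp.reduce(log_w)        # log-sum-exp for stability
    tilted = np.exp(log_w - log_norm)            # n*(theta) * pi(theta)
    return tilted, -xi1 * log_norm

U_bar = np.array([1.0, 0.6, 0.2])   # hypothetical risk function over three models
pi = np.array([0.5, 0.3, 0.2])      # hypothetical baseline prior

for xi1 in (10.0, 1.0, 0.1):
    tilted, obj = worst_case_prior(U_bar, pi, xi1)
    print(f"xi1 = {xi1:5.1f}:  tilted prior = {np.round(tilted, 3)},  objective = {obj:.3f}")
```

As \(\xi_1 \rightarrow \infty\) the objective approaches the expected risk under the baseline prior, while as \(\xi_1 \rightarrow 0\) it approaches the worst (lowest) entry of the risk function.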
11.7.2. Robust Bayesian decision problem#
We extend Decision Problem (11.1) to include prior robustness by introducing a special case of a two-player, zero-sum game:
Game 11.4
Notice that in this formulation, the minimization depends on the choice of the decision rule \(\delta\). This is to be expected as the prior with the most adverse consequences for the expected utility should plausibly depend on the potential course of action under consideration.
For a variety of reasons, it is of interest to investigate a related problem in which the order of extremization is exchanged:
Game 11.5
Notice that for a given \(n\), the inner problem is essentially just a version of the Bayesian problem (11.1). The penalty term
is additively separable and does not depend on \(\delta\). In this formulation, we solve a Bayesian problem for each possible prior, and then minimize over the priors taking account of the penalty term. Provided that the outer minimization problem over \(n\) has a solution, \(n^*,\) the implied decision rule, \(\delta_{n^*},\) solves a Bayesian decision problem. As we know from section Constructing admissible decision rules, this opens the door to verifying admissibility.
The two decision games, (11.10) and (11.11), essentially have the same solution under a Minimax Theorem. That is, the implied value functions are the same, and \(\delta_{n^*}\) from Game (11.11) solves Game (11.10) and gives the robust decision rule under prior ambiguity. This result holds under a variety of sufficient conditions. Notice that the objective for Game (11.10) is convex in \(n\). A well-known result due to [Fan, 1952] verifies the Minimax Theorem when the objective satisfies a generalization of concavity with a convex constraint set \(\Delta\). While we cannot always justify this exchange, there are other sufficient conditions that are also applicable.
A robust Bayesian advocate along the lines of [Good, 1952] views the solution, say \(n^*,\) from Game (11.11) as a choice of a prior to be evaluated subjectively. It is often referred to as a “worst-case prior,” but it is an object of interest in its own right. For an application of this idea in economics, see [Chamberlain, 2000].[6] Typically, \(n^*(\theta) d\pi(\theta)\) can be computed as part of an algorithm for finding a robust decision rule, and it is arguably worth serious inspection. While we could just view robustness considerations as a way to select a prior, the (penalized) worst-case solutions can instead be viewed as a device to implement robustness. While they are worthy of inspection, just as with the baseline prior probabilities, the worst-case probabilities are not intended to be a specification of a fully confident input of subjective probabilities; they depend on the utility function and the constraints imposed on the decision problem.
11.7.2.1. A simple statistical application reconsidered#
We again use the model selection example to illustrate ambiguity aversion in the presence of a relative entropy cost of a prior deviating from the baseline. Since the Minimax Theorem applies, we focus our attention on admissible decision rules parameterized as threshold rules of form (11.7). With this simplification, we use formula (11.9) and solve the scalar maximization problem:
Two limiting cases are of interest. When \(\xi_1 \rightarrow \infty\), the objective collapses to the subjective expected utility:
for \({\sf r}\) chosen so that the tangency condition
is satisfied.
When \(\xi_1 \rightarrow 0\), the cost of deviating from the baseline prior is zero. As long as the baseline prior assigns positive probability to both values of \(\theta\), the minimization for a given threshold \({\sf r}\) assigns probability one to the lowest risk, with the objective:
Graphically, the objective for any point to the left of the 45 degree line from the origin equals the outcome of a vertically downward movement to that same line. Analogously, the objective for any point to the right of the 45 degree line from the origin equals the outcome of a horizontally leftward movement to that same line. Thus the maximizing threshold \({\sf r}\) is obtained at the intersection of the 45 degree line and the boundary of the risk set. Along the 45 degree line, the choice of prior is inconsequential because the two risks are the same. Nevertheless, the “worst-case” prior is determined by the slope of the risk curve at the intersection point. Recall that we defined this prior after exchanging orders of extremization. Figures Fig. 11.3 and Fig. 11.4 illustrate this outcome for the two risk curves on display in Figures Fig. 11.1 and Fig. 11.2.

Fig. 11.3 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are given by \(\upsilon_1=\upsilon_2=1.\) The implied worst-case prior is given by \(\pi(\theta_1) =.5\) and \(\pi(\theta_2)=.5,\) as implied by the slope of the tangent line at the intersection with the 45 degree line from the origin.#

Fig. 11.4 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are given by \(\upsilon_1= 1, \, \upsilon_2=.5.\) The implied worst-case prior is given by \(\pi(\theta_1) =.22\) and \(\pi(\theta_2)=.78,\) as implied by the slope of the tangent line at the intersection with the 45 degree line from the origin.#
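In the Gaussian sketch used earlier, the \(\xi_1 \rightarrow 0\) threshold can be computed by locating the intersection of the risk curve with the 45 degree line. The parameter values below mirror the spirit of Figure Fig. 11.4 but remain illustrative assumptions of ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

mu1, mu2, v1, v2 = 1.0, -1.0, 1.0, 0.5   # hypothetical Gaussian pair

u1 = lambda r: v1 * norm.sf(r / 2, mu1, 1.0)    # decreasing in r
u2 = lambda r: v2 * norm.cdf(r / 2, mu2, 1.0)   # increasing in r

# xi_1 -> 0 limit: maximize min(u1, u2). Given the monotonicity of the two
# risks, the maximum is attained where they are equal, that is, at the
# intersection with the 45 degree line.
r_star = brentq(lambda r: u1(r) - u2(r), -20.0, 20.0)
print(f"maximin threshold r* = {r_star:.3f}, common risk = {u1(r_star):.3f}")
```

Combining this with the implied-prior calculation from the earlier sketch (after resetting the utility weights there) recovers the worst-case prior from the slope of the risk curve at the intersection point.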
For positive values of \(\xi_1,\) the implied worst-case priors are between the baseline \(\pi\) (\(\xi_1 = \infty\)) and the worst-case without restricting the prior probabilities (\(\xi_1 = 0\)). Observe that the worst-case priors depend on the utility weights, \(\upsilon_1\) and \(\upsilon_2\). See Figures Fig. 11.5 and Fig. 11.6 for illustrations.

Fig. 11.5 Minimizing prior probabilities for \(\theta_1\) as a function of \(1/\xi_1\) when the baseline prior probabilities are \(\pi(\theta_1) = .68\) and \(\pi(\theta_2) = .32\) and the utility parameters are \(\upsilon_1=\upsilon_2=1.\)#

Fig. 11.6 Minimizing prior probabilities for \(\theta_1\) as a function of \(1/\xi_1\) when the baseline prior probabilities are \(\pi(\theta_1) = .51\) and \(\pi(\theta_2) = .49\) and the utility parameters are \(\upsilon_1= 1\) and \(\upsilon_2=.5.\)#
The two-model example dramatically understates the potential value of ambiguity aversion as a way to study prior sensitivity. In typical applied problems, the subjective probabilities are imposed on a much richer collection of alternative models, including families of models indexed by unknown parameters. In such problems the outcome is more subtle, since the minimization isolates the dimensions along which prior sensitivity has the most adverse impact on the decision problem and which are perhaps most worthy of further consideration. This can be especially important in problems where baseline priors are imposed “as a matter of convenience.”
11.8. Using ambiguity aversion to represent concerns about model misspecification#
Two prominent statisticians remarked on how model misspecification is pervasive:
“Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.” - Box (1976).
“… it does not seem helpful just to say that all models are wrong. The very word ‘model’ implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd. The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis and statistical models, especially substantive ones …” - Cox (1995).
Other scholars have made similar remarks. Robust control theorists have suggested one way to address this challenge, an approach that we build on in the discussion that follows. Motivated by such sentiments, [Cerreia-Vioglio et al., 2025] extend decision theory axioms to accommodate misspecification concerns.
To focus on the misspecification of a specific model, we temporarily fix \(\theta\) but vary the likelihood function as a way to investigate likelihood sensitivity. We then replace \(\ell(x \mid \theta) d \tau(x) \) with \(m(x \mid \theta) \ell(x \mid \theta) d \tau(x) \) for \(m \ge 0\) satisfying:
\[
\int_{\mathcal X} m(x \mid \theta) \ell(x \mid \theta)\, d\tau(x) = 1,
\]
and denote the set of all such \(m\)’s as \(\mathcal M\).
Observe that \(m(x \mid \theta)\) can be viewed as the ratio of two densities. Consider two alternative probability densities (with respect to \(d \tau(x)\)) for shock/date probabilities: \(\ell(x \mid \theta)\) and \({\tilde \ell}(x \mid \theta)\), and let:
\[
m(x \mid \theta) = \frac{{\tilde \ell}(x \mid \theta)}{\ell(x \mid \theta)},
\]
where we assume \(\ell(x \mid \theta) > 0\) for \(x \in {\mathcal X}\). Then by construction:
\[
m(x \mid \theta) \ge 0
\]
and
\[
\int_{\mathcal X} m(x \mid \theta) \ell(x \mid \theta)\, d\tau(x) = 1.
\]
We use density ratios to capture alternative models as inputs into divergence measures. Let \(\phi_2\) be a convex function such that \({\phi_2}(1) = 0\) and \({\phi_2}''(1) = 1\) (a normalization). Instead of imposing the divergence over probabilities on the parameter space, to explore misspecification we impose it over the \({\mathcal X}\) space:
\[
\int_{\mathcal X} \phi_2[m(x \mid \theta)]\, \ell(x \mid \theta)\, d\tau(x).
\]
We use this divergence to limit or constrain our search over alternative probability models. In this approach we deliberately avoid imposing a prior distribution over the space of densities (with respect to \(d \tau(x)\)).
Preferences for model robustness rank alternative prize rules, \(\gamma\), by solving:
for a penalty parameter \(\xi_2 > 0.\) The parameter, \(\xi_2\), dictates the strength of the restraint in an exploration of possible model misspecifications.
This approach has direct links to robust control theory in the case of a relative entropy divergence. Suppose that \(\phi_2(m) = m \log m\) (relative entropy). Then by imitating previous computations, we find that the outcome of the minimization in (11.12) is
\[
- \xi_2 \log \int_{\mathcal X} \exp\left[- \frac{U[\gamma(x)]}{\xi_2} \right] \ell(x \mid \theta)\, d\tau(x).
\]
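A Monte Carlo sketch of this likelihood tilting, with a hypothetical quadratic utility and standard normal baseline shocks (all choices are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-period setting: utility U = -x^2 with x drawn from the
# baseline model ell(. | theta), here a standard normal distribution.
x = rng.normal(0.0, 1.0, 100_000)
U = -x**2

def robust_objective(U, xi2):
    """-xi2 * log E[exp(-U/xi2)] and the worst-case density ratio m*,
    approximated by Monte Carlo under the baseline model."""
    log_w = -U / xi2
    log_mean = np.logaddexp.reduce(log_w) - np.log(U.size)
    m_star = np.exp(log_w - log_mean)   # exponential tilt toward low utility
    return -xi2 * log_mean, m_star

for xi2 in (10.0, 5.0):
    obj, m_star = robust_objective(U, xi2)
    # The tilted second moment exceeds one: the adversary inflates the shock variance.
    print(f"xi2 = {xi2:4.1f}:  objective = {obj:.3f},  tilted E[x^2] = {np.mean(m_star * x**2):.3f}")
```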
Remark 11.4
Robust control emerged from the study of optimal control of dynamical systems. The use of the relative entropy divergence showed up prominently in [Jacobson, 1973] and later in [Whittle, 1981], [Petersen et al., 2000], and many other related papers as a response to the excessive simplicity of assuming that shocks to dynamical systems are iid, mean zero, and normally distributed. [Hansen and Sargent, 1995] and [Hansen and Sargent, 2001] showed how to reformulate the insights from robust control theory to apply to dynamic economic systems with recursive formulations, and [Hansen et al., 1999] used these ideas in an initial empirical investigation.
11.9. Robust Bayes with model misspecification#
To relate to decision theory, think of a statistical model as implying a compound lottery. Use \(\ell(x \mid \theta) d\tau(x) \) as a lottery conditioned on \(\theta\) and think of \(d\pi(\theta)\) as a lottery over \(\theta\). Initially we thought of the first as a source of “risk,” and it gave rise to what statisticians call a risk function: the expected utility (or loss) conditioned on the unknown parameter. In the Anscombe-Aumann metaphor, this is the “roulette wheel.” The distribution, \(d \pi(\theta)\), is the subjective probability input and the “horse race” in the Anscombe-Aumann metaphor. The potential misspecification of likelihoods adds skepticism about the risk contribution to the compound lottery. As statisticians like Box and Cox observed, potential model misspecification arguably should be a pervasive concern.
11.9.1. Approach one#
Form a convex, compact constraint set of prior probabilities, \({\mathcal N}_o \subset {\mathcal N}\). Represent preferences over \(\gamma\) using:
11.9.2. Approach two#
Represent preferences over \(\gamma\) with:
Note that this approach uses a scaled version of a joint divergence over \((m,n),\) as reflected by
Consider the special case in which \(\xi_1 = \xi_2 = \xi\). Then this joint divergence contribution to preferences is \(\xi\) multiplying the relative entropy of the joint distribution \(m(x \mid \theta) n(\theta) \ell(x \mid \theta) d\tau(x) d\pi(\theta) \) relative to the baseline \(\ell(x \mid \theta) d\tau(x) d\pi(\theta)\) given by:
Joint densities can be factored in alternative ways. In solving the robust decision problem, a different factorization is more convenient. Focus on the case in which \(x = (w,z)\) and the decision rule, \(\delta,\) depends on \(z\). Form three contributions under the baseline so that
where the posterior \( d{\bar \pi}(\theta \mid z)\) depends on \(z\) through conditioning. Notice that the last term in the factorization depends only on \(z,\) whereas the decision rule conditions on \(z.\) We now explore likelihood misspecification using \(m_2(w \mid \theta, z)\) in a set \({\mathcal M}_2\) with elements that satisfy:
and posterior misspecification using \({\bar n}(\theta)\) in a set \({\overline {\mathcal N}}\) with elements that satisfy:
We may additionally alter the density
in an analogous way. This latter exploration will have a nondegenerate outcome, but it will have no impact on the robustly optimal choice of \(\delta\). The logic behind this is entirely analogous to the argument we provided for the conditional version of the Bayesian decision problem. In the case of a common penalty parameter and a relative entropy divergence, the objective for a conditional robust problem with the same solution as the ex ante problem is:
Thus the decision maker may proceed to construct a robustly optimal decision rule taking as input the posterior distribution defined on the parameter space \(\Theta\), as built by a statistician.
Finally, suppose that \(U\) does not depend on \(\theta\). This provides a further simplification. Applying our previous computations with a relative entropy divergence, the minimized objective is:
where
The density \({\bar \ell_2}(w \mid z)\) is commonly referred to in Bayesian statistical analyses as a predictive density. Thus, when \(\xi_1 = \xi_2,\) relative entropy is used as the divergence, and the utility function does not depend on \(\theta,\) our formulation becomes equivalent to one in which the decision maker explores the potential misspecification of the predictive density, \({\bar \ell_2},\) directly. [Chamberlain, 2020] proposed this alternative way to formulate preferences with uncertainty aversion.
11.10. A dynamic decision problem under commitment#
So far, we have studied static decision problems. This formulation can accommodate dynamic problems by allowing for decision rules that depend on histories of data available up until the date of the decision. While there is a “commitment” to these rules at the initial date, the rules themselves can depend on pertinent information revealed only in the future. Recall from Chapter 1 that we use an ergodic decomposition to identify a family of statistical models that are dynamic in nature, along with probabilities across models that are necessarily subjective because they are not revealed by data.
We illustrate how we can use the ideas in this “static” chapter to study a macro-investment problem with parameter uncertainty.
Consider an example of a real investment problem with a single stochastic option for transferring goods from one period to another. This problem could be a planner’s problem supporting a competitive equilibrium outcome associated with a stochastic growth model with a single capital good. Introduce an exogenous stochastic technology process that has an impact on the growth rate of capital as an example of what we call a structured model. This stochastic technology process captures what a previous literature in macro-finance has referred to as “long-run risk.” For instance, see [Bansal and Yaron, 2004].[7]
We extend this formulation by introducing an unknown parameter \(\theta\) used to index members of a parameterized family of stochastic technology processes. The investor’s exploration of the entire family of these processes reflects uncertainty among possible structured models. We also allow the investor to entertain misspecification concerns over the parameterized models of the stochastic technology.
The exogenous (system) state vector \(Z_{t}\) used to capture fluctuations in the technological opportunities has realizations in \({\mathcal Z}\) and the shock vector \(W_{t}\) has realizations in \({\mathcal W}\). We build the exogenous technology process from the shocks in a parameter dependent way:
for a given initial condition \(Z_{0}\).
For instance, in long-run risk modeling one component of \(Z_{t}\) evolves as a first-order autoregression:
and another component is given by:
At each time \(t\) the investor observes past and current values \(\mathbf{Z} ^{t}=\left \{ Z_{0},Z_{1},...,Z_{t}\right \} \) of the technology process, but does not know \(\theta\) and does not directly observe the random shock vector \(W_{t}\).
Similarly, we consider a recursive representation of capital evolution given by:
where consumption \(C_{t}\geq0\) and investment \(I_{t}\geq0\) are constrained by an output relation:
for a pre-specified initial condition \(K_{0}\). The parameter \(\alpha\) captures the productivity of capital. By design this technology is homogeneous of degree one, which opens the door to stochastic growth as assumed in long-run risk models.
Both \(I_{t}\) and \(C_{t}\) are constrained to be functions of \(\mathbf{Z}^{t}\) at each date \(t\) reflecting the observational constraint that \(\theta\) is unknown to the investor in contrast to the history \(\mathbf{Z}^{t}\) of the technology process.[8] Preferences are defined over consumption processes.
In this intertemporal setting, we consider an investor who solves a date \(0\) commitment problem. We pose this as a static problem with consumptions and investments that depend on information as it gets realized.[9] Form the risk function
While the initial conditions \(K_0\) and \(Z_0\) are known, the parameter vector \(\theta\) is not.
We include two divergences, one for the parameter \(\theta\) and the other for the potential misspecification of the dynamics as reflected in the shock distributions. The latter divergence will necessarily be dynamic in nature. We will use positive random variables with unit expectations as a way to introduce changes in probabilities. Introduce a positive martingale \(\{ M_t : t \ge 0 \}\) for an altered probability. Let \(M_0=1\) and let \(M_t\) depend on state variables and shocks up to period \(t\), along with \(\theta\). We use the random variable \(M_t\) to alter date \(t\) probabilities conditioned on \(K_0, Z_0,\) and \(\theta\). The martingale property ensures that the altered probabilities implied by \(M_{t+1}\) agree with the altered probabilities implied by \(M_t\), as an implication of the Law of Iterated Expectations. Let the intertemporal divergence be:
and scale this divergence by a penalty parameter, \(\xi_2\). We purposefully discount the relative entropies in the same way that we discount the utility function, and the computations condition on \(\theta\). We then use a second divergence over the parameter vector \(\theta\) with a penalty parameter \(\xi_1.\)
In this model, the investor or planner will actively learn about \(\theta\). The potential model misspecifications are not linked over time and are presumed not to be learnable. This formulation embodies a preference for prior and likelihood robustness. Unfortunately, it does not have a convenient recursive formulation, making it challenging to solve.
11.11. Recursive counterparts#
We comment briefly on recursive counterparts. We have seen in the previous chapter how to perform recursive learning and filtering. Positive martingales also have a convenient recursive structure. Write:
By the Law of Iterated Expectations:
Using this calculation and applying “summation-by-parts” (implemented by changing the order of summation) gives:
In this formula,
is the relative entropy pertinent to the transition probabilities between dates \(t\) and \(t+1\). With this formula, we form a discounted objective whose date \(t\) contribution confronts potential likelihood misspecification:
where the date \(t\) minimizing choice variable is \(M_{t+1}/M_{t} \ge 0\) subject to \(E\left(M_{t+1}/M_{t} \mid {\mathfrak A}_t \right) = 1.\) The ratio \(M_{t+1}/M_{t}\) is used when computing the conditional expectation of next period’s continuation value needed to rank current period actions.
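The recursive structure is easy to exhibit in a finite-state sketch. Below, a two-state Markov chain plays the role of the baseline transition probabilities, an altered transition matrix defines the ratios \(M_{t+1}/M_t\), and the discounted relative entropy is accumulated with the state distribution evolving under the altered probabilities (which is what multiplication by \(M_t\) accomplishes). The transition matrices, discount factor, and the \((1-\beta)\) scaling are illustrative assumptions of ours.

```python
import numpy as np

# Hypothetical two-state Markov sketch. The ratio M_{t+1}/M_t in state i is
# P_tilt[i, j] / P[i, j] when the chain moves from state i to state j.
P      = np.array([[0.9, 0.1], [0.2, 0.8]])   # baseline transition probabilities
P_tilt = np.array([[0.8, 0.2], [0.3, 0.7]])   # altered transition probabilities
beta   = 0.95

# Conditional relative entropy of the date t+1 transition, state by state:
# E[(M_{t+1}/M_t) log(M_{t+1}/M_t) | state i] = sum_j P_tilt[i,j] log(P_tilt[i,j]/P[i,j]).
cond_re = np.sum(P_tilt * np.log(P_tilt / P), axis=1)

# Discounted divergence: multiplying by M_t converts baseline expectations of
# the date t contribution into expectations under the altered probabilities.
dist = np.array([1.0, 0.0])   # initial state distribution
total = 0.0
for t in range(2000):
    total += beta ** (t + 1) * dist @ cond_re
    dist = dist @ P_tilt      # state distribution under the altered model
print(f"discounted relative entropy = {(1 - beta) * total:.4f}")
```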
Remark 11.5
Our discounting of relative entropy has important consequences for exploration of potential misspecification. From (11.13), it follows that the sequence
is increasing in \(t\). Observe that
gives an upper bound on
for each \(s \ge 0\). This follows from monotonicity since
Taking limits as \(\beta \uparrow 1\) gives the bound of interest.
If the discounted limit in (11.15) is finite, then the increasing sequence (11.14) has a finite limit. It follows from a version of the Martingale Convergence Theorem (see [Barron, 1990]) that there is a limiting nonnegative random variable \(M_\infty\) such that \(E \left( M_\infty \mid {\mathfrak A}_0 \right) =1\) and
Observe that in the limiting case when \(\beta \uparrow 1\) and the resulting relative entropy measure is finite, the altered probability must imply a Law of Large Numbers that agrees with the baseline probability. In this sense, only transient departures from the baseline probability are part of the misspecification exploration. By including discounting in the manner described, we expand the family of alternative probabilities in a substantively important way.
As an alternative calculation, consider a different discount factor scaling:
The limiting version of this measure allows for a substantially larger set of alternative probabilities and results in a limiting characterization that is used in Large Deviation Theory as applied in dynamic settings.
Remark 11.6
To explore potential misspecification, [Chen et al., 2020] suggest other divergences with convenient recursive structures. A discounted version of their proposal is
for a convex function \(\phi_2\), where this function equals zero when evaluated at one and its second derivative at one is normalized to be one.