11. Risk, Ambiguity, and Misspecification[1]


../_images/knightwaldsavage.jpg

Pioneers in Uncertainty and Decision Theory. Frank Knight, Abraham Wald, and Jimmie Savage.

“Uncertainty must be taken in a sense radically distinct from the familiar notion of risk, from which it has never been properly separated…. and there are far-reaching and crucial differences in the bearings of the phenomena depending on which of the two is really present and operating.” [Knight, 1921]

11.1. Introduction

Likelihood functions are probability distributions conditioned on parameters; prior probability distributions describe a decision maker’s subjective belief about those parameters.[2] By distinguishing roles played by likelihood functions and subjective priors over their parameters, this chapter brings some recent contributions to decision theory into contact with statistics and econometrics in ways that can address practical econometric concerns about model misspecifications and choices of prior probabilities.

We combine ideas from control theories that construct decision rules that are robust to a class of model misspecifications with axiomatic decision theories invented by economic theorists. Such decision theories originated with axiomatic formulations by von Neumann and Morgenstern, Savage, and Wald ([Wald, 1947, Wald, 1949, Wald, 1950], [Savage, 1954]). Ellsberg ([Ellsberg, 1961]) pointed out that Savage’s framework seems to include nothing that could be called “uncertainty” as distinct from “risk”. Theorists after Ellsberg constructed coherent systems of axioms that embrace a notion of ambiguity aversion. However, most recent axiomatic formulations of decision making under uncertainty in economics are not cast explicitly in terms of likelihood functions and prior distributions over parameters.

This chapter reinterprets objects that appear in some of those axiomatic foundations of decision theories in ways useful to an econometrician. We do this by showing how to use an axiomatic structure to express ambiguity about a prior over a family of statistical models, on the one hand, along with concerns about misspecifications of those models, on the other hand.

Although they proceeded differently than we do here, [Chamberlain, 2020], [Cerreia-Vioglio et al., 2013], and [Denti and Pomatto, 2022] studied related issues. [Chamberlain, 2020] emphasized that likelihoods and priors are both vulnerable to potential misspecifications. He focused on uncertainty about predictive distributions constructed by integrating likelihoods with respect to priors. In contrast to Chamberlain, we formulate a decision theory that distinguishes uncertainties about priors from uncertainties about likelihoods. [Cerreia-Vioglio et al., 2013] (section 4.2) provided a rationalization of the smooth ambiguity preferences proposed by [Klibanoff et al., 2005] that includes likelihoods and priors as components. [Denti and Pomatto, 2022] extended this approach by using an axiomatic revealed preference approach to deduce a parameterization of a likelihood function. However, neither [Cerreia-Vioglio et al., 2013] nor [Denti and Pomatto, 2022] sharply distinguished prior uncertainty from concerns about misspecifications of likelihood functions. We want to do that. We formulate concerns about statistical model misspecifications as uncertainty about likelihoods.

More specifically, we align definitions of statistical models, uncertainty, and ambiguity with ideas from decision theories that build on [Anscombe and Aumann, 1963]’s way of representing subjective and objective uncertainties. In particular, we connect our analysis to econometrics and robust control theory by using Anscombe-Aumann states as parameters that index alternative statistical models of random variables that affect outcomes that a decision maker cares about. By modifying aspects of [Gilboa et al., 2010], [Cerreia-Vioglio et al., 2013], and [Denti and Pomatto, 2022], we show how to use variational preferences to represent uncertainty about priors and also concerns about statistical model misspecifications.

Discrepancies between two probability distributions occur throughout our analysis. This fact opens possible connections between our framework and some models in “behavioral” economics and finance that assume that decision makers inside their models have expected utility preferences in which an agent’s subjective probability – typically a predictive density – differs systematically from the predictive density that the model user assumes governs the data.[3] Other “behavioral” models focus on putative differences among agents’ degrees of confidence in their views of the world. Our framework implies that the form taken by a “lack of confidence” should depend on the probabilistic concept about which a decision maker is uncertain. Preference structures that we describe in this chapter allow us to formalize different amounts of “confidence” about details of specifications of particular statistical models, on one hand, and about subjective probabilities to attach to alternative statistical models, on the other hand. Our representations of preferences provide ways to characterize degrees of confidence in terms of statistical discrepancies between alternative probability distributions.[4]

11.2. Background motivation

We are sometimes told that we live in a “data rich” environment. Nevertheless, data are often not “rich” along all of the dimensions that we care about for decision making. Furthermore, data don’t “speak for themselves”. To get them to say something, we have to posit a statistical model. For all the hype, the type of statistical learning we actually do is to infer parameters of a family of statistical models. Doubts about what existing evidence can teach us along some important dimensions have led some scientists to think about what they call “deep uncertainties.” For example, in a recent paper we read:

“The economic consequences of many of the complex risks associated with climate change cannot, however, currently be quantified. … these unquantified, poorly understood and often deeply uncertain risks can and should be included in economic evaluations and decision-making processes.” [Rising et al., 2022]

In this chapter, we formulate “deep uncertainties” as lack of confidence in how we represent probabilities of events and outcomes that are pertinent for designing tax, benefit, and regulatory policies. We do this by confessing ambiguities about probabilities, though necessarily in a restrained way.

In our experience as macroeconomists, model uncertainties are not taken seriously enough; too often they are dismissed as being of “second order”, whatever that means in a particular context. In policy-making settings, there is a sometimes misplaced wisdom that acknowledging uncertainty should tilt decisions toward passivity. But at other times and places, one senses that model uncertainty emboldens pretense:

“Even if true scientists should recognize the limits of studying human behavior, as long as the public has expectations, there will be people who pretend or believe that they can do more to meet popular demand than what is really in their power.” [Hayek, 1989]

As economists, part of our job is to delineate tradeoffs. Explicit incorporation of precise notions of uncertainty allows us to explore two tradeoffs pertinent to decision making. Difficult tradeoffs emerge when we consider implications from multiple statistical models and alternative parameter configurations. When making decisions, how much weight should we assign to best “guesses” in the face of our model specification doubts, versus the possibly bad outcomes that our doubts unleash? Focusing exclusively on best guesses can lead us naively to ignore adverse possibilities worth considering. Focusing exclusively on worrisome bad outcomes can lead to extreme policies that perform poorly in more normal circumstances. Such considerations induce us to formalize tradeoffs in terms of explicit expressions of aversion to uncertainty.

There are also intertemporal tradeoffs: should we act now, or should we wait until we have learned more? While waiting is tempting, it can also be so much more costly that it becomes prudent to take at least some actions now even though we anticipate knowing more later.

11.2.1. Aims

In this chapter we allow for uncertainties that include

  • risks: unknown outcomes with known probabilities;

  • ambiguities: unknown weights to assign to alternative probability models;

  • misspecifications: unknown ways in which a model provides flawed probabilities.

We will focus on formulations that are tractable and enlightening.

11.3. Decision theory overview

Decision theory under uncertainty provides alternative axiomatic formulations of “rationality.” As there are multiple axiomatic formulations of decision making under uncertainty, it is perhaps best to replace the term “rational” with “prudent.” While these axiomatic formulations are of intellectual and substantive interest, in this chapter we will focus on the implied representations. This approach remains interesting because we have sympathy for Savage’s own perspective on his elegant axiomatic formulation:

“Indeed the axioms have served their only function in justifying the existential parts of Theorems 1 and 3; in further exploitation of the theory, …, the axioms themselves can and, in my experience, should be forgotten.” [Savage, 1952]

11.3.1. Approach

In this chapter we exploit modifications of Savage-style axiomatic formulations from decision theory under uncertainty to investigate notions of uncertainty beyond risk. The overall aim is to make contact with applied challenges in economics and other disciplines. We start with the basics of statistical decision theory and then explore extensions that distinguish concerns about potential misspecifications of likelihoods from concerns about the misspecification of priors. This opens the door to better ways of conducting uncertainty quantification for the dynamic, stochastic economic models used for private sector planning and governmental policy assessment. We achieve this by providing tractable and revealing methods for exploring sensitivity to subjective uncertainties, including potential model misspecification and ambiguity across models. This will allow us to systematically:

  • assess the impact of uncertainty on prudent decision or policy outcomes;

  • isolate the forms of uncertainty that are most consequential for these outcomes.

To make the methods tractable and revealing we will utilize tools from probability and statistics to limit the type and amount of uncertainty that is entertained. As inputs, the resulting representations of objectives for decision making will require a specification of aversion to or dislike of uncertainty about probabilities over future events.

11.3.2. Anscombe-Aumann (AA)

[Anscombe and Aumann, 1963] provided a different way to justify Savage’s representation of decision making in the presence of subjective uncertainty. They feature prominently the distinction between a “horse race” and a “roulette wheel”. They rationalize preferences over acts, where an act maps states into lotteries over prizes; the lotteries are the counterpart to a roulette wheel. Probability assignments over states then become the subjective input and the counterpart to the “horse race.”

[Anscombe and Aumann, 1963] used this formulation to extend von Neumann-Morgenstern expected utility with known probabilities to decision problems in which subjective probabilities also play a central role, as in Savage’s approach. While [Anscombe and Aumann, 1963] provided an alternative derivation of subjective expected utility, many subsequent contributions used the Anscombe-Aumann framework to extend the analysis to incorporate forms of ambiguity aversion. Prominent examples include [Gilboa and Schmeidler, 1989] and [Maccheroni et al., 2006]. In what follows we provide a statistical slant to such analyses.

11.3.3. Basic setup

Consider a parameterized model of a random vector with realization $x$:

$$\ell(x \mid \theta)\, d\tau(x)$$

where

$$\int_X \ell(x \mid \theta)\, d\tau(x) = 1,$$

$\theta \in \Theta$ where $\Theta$ is a parameter space, and $X$ is the space of possible realizations of $x$. We refer to $\ell$ as a likelihood and to the probability implied by each $\theta$ as a “structured” probability model.

Denote by $\pi$ a prior distribution over $\Theta$. We will sometimes have reason to focus on a particular prior, $\pi$, that we will call a baseline prior. We denote a prize rule by $\gamma$, which maps $X$ into prizes. We define a decision rule, $\delta$, that can condition on observations or signals. To elaborate further, partition $x = (w, z)$, where the decision rule can depend on $z$ and the prize rule on the entire $x$ vector. A probability distribution over the $w$’s reflects random outcomes realized after a decision has been made. For instance, in an intertemporal context, $w$ may reflect future shocks. Thus decisions may have further repercussions for prizes beyond $z$:

$$\gamma_\delta(x) = \Psi[\delta(z), x].$$

Since the decision rule can depend on $z$, ex post learning is entertained. Decision rules are restricted to be in a set $\Delta$ and imply restrictions for prize rules:

$$\Gamma(\Delta) \overset{\text{def}}{=} \left\{ \gamma_\delta : \delta \in \Delta \right\}.$$

Preferences over prize rules imply a ranking over decision rules, the $\delta$’s, in $\Delta$. While we are featuring the impact of a decision rule on a prize rule, we may extend the analysis to allow $\delta$ to influence $\ell$, as happens when we entertain experimentation.

Risk is assessed using expected utility with a utility function $U$. We compute the expected utility for prize rules as a function of $\theta$:

$$U(\gamma \mid \theta) = \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x).$$

Following language from statistical decision theory, we call $U(\gamma \mid \theta)$, viewed as a function of $\theta$, the risk function for a given prize rule.[5]

We allow the utility function to depend on the unknown parameter $\theta$, as is common in statistical decision problems. Arguably, such a formulation is shorthand for a more primitive specification in which, in a dynamic setting, the parameter has ramifications for future prizes and hence shows up in a value function.

11.3.4. A simple statistical application

As an illustration, we consider a model selection problem. Suppose $\Theta = \{\theta_1, \theta_2\}$ and that the decision maker can use the entire underlying random vector. Thus we make no distinction between $\gamma$ and $\delta$, and none between $x$ and $z$.

The decision rule, $\delta = \gamma$, is a mapping from $Z$ to $[0, 1]$, where $\delta(z) = 1$ means that the decision maker selects model $\theta_1$ and $\delta(z) = 0$ means that the decision maker selects model $\theta_2$. We allow for intermediate values between zero and one, which can be interpreted as randomization. These intermediate choices will end up not being of particular interest for this example.

The utility function for assessing risk is:

$$U(\delta, \theta_1) = \upsilon_1\, \delta, \qquad U(\delta, \theta_2) = \upsilon_2\, (1 - \delta),$$

where $\upsilon_1$ and $\upsilon_2$ are positive utility parameters.

A class of decision rules, called threshold rules, will be of particular interest. Partition $Z$ into two sets:

$$Z = Z_1 \cup Z_2,$$

where the intersection of $Z_1$ and $Z_2$ is empty. A threshold rule has the form:

$$\delta(z) = \begin{cases} 1 & \text{if } z \in Z_1 \\ 0 & \text{if } z \in Z_2. \end{cases}$$

For a threshold rule, the conditional expected utility is

$$U(\gamma \mid \theta_1) = \upsilon_1 \int_{Z_1} \ell(z \mid \theta_1)\, d\tau(z), \qquad U(\gamma \mid \theta_2) = \upsilon_2 \int_{Z_2} \ell(z \mid \theta_2)\, d\tau(z).$$

Suppose that the utility weights $\upsilon_1$ and $\upsilon_2$ are both one. Under this threshold decision rule, $1 - U(\gamma \mid \theta_1)$ is the probability of making a mistake when model $\theta_1$ generates the data, and $1 - U(\gamma \mid \theta_2)$ is the probability of making a mistake if model $\theta_2$ generates the data. In statistics, the first of these is called a Type I error, and the second a Type II error, assuming we consider $\theta_1$ to be the “null model” and $\theta_2$ to be the “alternative model.” The utility weights determine the relative importance, to the decision maker, of making a correct identification of the model.
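To make the threshold-rule calculations concrete, here is a minimal numerical sketch. It assumes two scalar Gaussian likelihoods, $N(0,1)$ for $\theta_1$ and $N(1,1)$ for $\theta_2$ (an illustrative choice, not something specified in the text), and computes the two error probabilities for a threshold rule based on the log-likelihood ratio.

```python
import numpy as np
from scipy.stats import norm

# Two candidate models for a scalar signal z: theta_1 says N(0, 1) and
# theta_2 says N(1, 1). (These likelihoods are illustrative assumptions.)
mu1, mu2, sigma = 0.0, 1.0, 1.0

def error_probabilities(r):
    """Type I and Type II error probabilities for the threshold rule that
    selects theta_1 when the log-likelihood ratio is at least r."""
    # For equal-variance Gaussians the log-likelihood ratio is linear and
    # decreasing in z here, so Z_1 = {z <= z_star} with:
    z_star = (mu1 + mu2) / 2.0 - sigma**2 * r / (mu2 - mu1)
    type_I = 1.0 - norm.cdf(z_star, loc=mu1, scale=sigma)   # reject theta_1 when true
    type_II = norm.cdf(z_star, loc=mu2, scale=sigma)        # select theta_1 when theta_2 true
    return type_I, type_II

for r in [-1.0, 0.0, 1.0]:
    tI, tII = error_probabilities(r)
    print(f"r = {r:+.1f}: Type I = {tI:.3f}, Type II = {tII:.3f}")
```

Raising the threshold $r$ makes it harder to select $\theta_1$, lowering the Type I error at the cost of a higher Type II error, which is the tradeoff the utility weights adjudicate.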

11.4. Subjective expected utility

Order preferences over $\gamma$, and hence over $\delta$, by

$$\int_\Theta \left[ \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x) \right] d\pi(\theta) = \int_\Theta U(\gamma \mid \theta)\, d\pi(\theta)$$

for a specific $\pi$. This representation is supported by the Savage and Anscombe-Aumann axioms, but it imposes full confidence in both the prior and the likelihood, with no allowance for potential misspecification of either.

We use these preferences for a decision problem in which prize rules are restricted to be in the set $\Gamma(\Delta)$:

Problem 11.1

$$\max_{\gamma \in \Gamma(\Delta)} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x) \right) d\pi(\theta). \tag{11.1}$$

Recall the partitioning $x = (w, z)$, where the decision rule can depend only on $z$ and the prize rule on the entire $x$ vector. Factor $\ell(\cdot \mid \theta)$ and $\tau$ as:

$$d\tau(x) = d\tau_2(w)\, d\tau_1(z), \qquad \ell(x \mid \theta) = \ell_2(w \mid z, \theta)\, \ell_1(z \mid \theta). \tag{11.2}$$

These factorizations in (11.2) allow us to write the objective as:

$$\int_\Theta \left[ \int_Z \left( \int_W U[\gamma(x), \theta]\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right) \ell_1(z \mid \theta)\, d\tau_1(z) \right] d\pi(\theta).$$

To solve problem (11.1), it is convenient to exchange the orders of integration in the objective:

$$\int_Z \left[ \int_\Theta \left( \int_W U[\gamma(x), \theta]\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right) \ell_1(z \mid \theta)\, d\pi(\theta) \right] d\tau_1(z).$$

Notice that even if the utility function $U$ does not depend on $\theta$, this dependence may emerge after we integrate over $W$ because of the dependence of $\ell_2(\cdot \mid \cdot, \theta)$ on the unknown parameter.

As $\delta$ depends only on $z$ and the objective is additively separable in $z$, we may equivalently solve a conditional problem with the objective:

$$\widetilde U[\delta(z)] \overset{\text{def}}{=} \int_\Theta \left( \int_W U(\Psi[\delta(z), w, z], \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right) \ell_1(z \mid \theta)\, d\pi(\theta)$$

for each value of $z$, provided that the restrictions imposed on $\delta$ by the construction of the set of decision rules $\Delta$ are separable in $z$. That is, provided that we can write:

$$\Delta = \left\{ \delta : \delta(z) \in \Delta(z) \text{ for all } z \in Z \right\} \tag{11.3}$$

for given constraint sets $\Delta(z)$, we may solve

Problem 11.2

$$\max_{\delta(z) \in \Delta(z)} \widetilde U[\delta(z)]. \tag{11.4}$$

Finally, notice that

$$d\bar\pi(\theta \mid z) \overset{\text{def}}{=} \left[ \frac{\ell_1(z \mid \theta)}{\int_\Theta \ell_1(z \mid \theta)\, d\pi(\theta)} \right] d\pi(\theta)$$

is the Bayesian posterior distribution for $\theta$. Equivalently, we may solve the conditional Bayesian problem:

Problem 11.3

$$\max_{\delta(z) \in \Delta(z)} \int_\Theta \left( \int_W U[\Psi(\delta(z), w, z), \theta]\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right) d\bar\pi(\theta \mid z), \tag{11.5}$$

since in forming the objective of the conditional Bayesian problem (11.5), we divided the objective of problem (11.4) by a function of $z$ alone.

For illustration purposes, consider the example given in Section A simple statistical application. In this example, $x = z$ and there is no $w$ contribution. Impose prior probabilities $\pi(\theta_1)$ and $\pi(\theta_2) = 1 - \pi(\theta_1)$ on the two models. Compute the Bayesian posterior probabilities for each value of $\theta$:

$$\bar\pi(\theta_1 \mid z) = \frac{\ell(z \mid \theta_1)\, \pi(\theta_1)}{\ell(z \mid \theta_1)\, \pi(\theta_1) + \ell(z \mid \theta_2)\, \pi(\theta_2)}, \qquad \bar\pi(\theta_2 \mid z) = \frac{\ell(z \mid \theta_2)\, \pi(\theta_2)}{\ell(z \mid \theta_1)\, \pi(\theta_1) + \ell(z \mid \theta_2)\, \pi(\theta_2)}.$$

Consider the conditional problem. If the decision maker chooses model one, the conditional expected utility is $\upsilon_1\, \bar\pi(\theta_1 \mid z)$, and similarly for choosing model two. Thus the Bayesian decision maker computes:

$$\max\left\{ \upsilon_1\, \bar\pi(\theta_1 \mid z),\; \upsilon_2\, \bar\pi(\theta_2 \mid z) \right\}$$

and chooses a model in accordance with this maximization. This maximization is equivalent to

$$\max\left\{ \upsilon_1\, \pi(\theta_1)\, \ell(z \mid \theta_1),\; \upsilon_2\, \pi(\theta_2)\, \ell(z \mid \theta_2) \right\}$$

expressed in terms of the prior, likelihood, and utility contributions. Taking logarithms and rearranging, we see that model $\theta_1$ is selected if

$$\log \ell(z \mid \theta_1) - \log \ell(z \mid \theta_2) \ge \log \upsilon_2 - \log \upsilon_1 + \log \pi(\theta_2) - \log \pi(\theta_1). \tag{11.6}$$

If the right side of this inequality is zero, say because the prior probabilities are the same across models and the utility weights are also the same, then the decision rule selects the model with the larger log likelihood. More generally, both prior weights and utility weights come into play.

Notice that this decision rule is a threshold rule in which we use the posterior probabilities to partition the $Z$ space. The subset $Z_1$ contains all $z$ such that inequality (11.6) is satisfied; we arbitrarily include the indifference values in $Z_1$.
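Once likelihoods, priors, and utility weights are specified, the selection rule (11.6) is a one-line comparison. The sketch below reuses the illustrative Gaussian models from the earlier snippet; the particular numbers are assumptions of the example, not part of the text.

```python
import numpy as np
from scipy.stats import norm

# Illustrative Gaussian likelihoods, as in the earlier sketch.
model_1, model_2 = norm(0.0, 1.0), norm(1.0, 1.0)

def select_model(z, prior_1=0.5, v1=1.0, v2=1.0):
    """Bayesian selection rule (11.6): choose theta_1 if and only if the
    log-likelihood ratio clears a bar built from utility and prior odds."""
    log_lr = model_1.logpdf(z) - model_2.logpdf(z)
    bar = np.log(v2) - np.log(v1) + np.log(1.0 - prior_1) - np.log(prior_1)
    return "theta_1" if log_lr >= bar else "theta_2"

print(select_model(z=0.2))                        # equal priors and weights
print(select_model(z=0.2, prior_1=0.2, v2=2.0))   # odds tilted toward theta_2
```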

The Bayesian solution to the decision problem is posed assuming full confidence in a subjective prior distribution. In many problems, including ones with multiple sources of uncertainty, such confidence may well not be warranted. Such a concern might well have been the motivation behind Savage’s remark:

… if I knew of any good way to make a mathematical model of these phenomena [vagueness and indecision], I would adopt it, but I despair of finding one. One of the consequences of vagueness is that we are able to elicit precise probabilities by self-interrogation in some situations but not in others.

Personal communication from L. J. Savage to Karl Popper in 1957

11.5. An extreme response

Suppose we go to the other extreme and avoid imposing a prior altogether. Compare two prize rules, $\gamma_1$ and $\gamma_2$, by computing the conditional (on $\theta$) expected utilities, $U(\gamma_1 \mid \theta)$ and $U(\gamma_2 \mid \theta)$, for each $\theta \in \Theta$. Then $\gamma_2$ is preferred to $\gamma_1$ if the conditional expected utility of $\gamma_2$ exceeds that of $\gamma_1$ for all $\theta \in \Theta$. This, however, implies only a partial ordering among prize rules; many such rules cannot be ranked. This partial ordering gives rise to a construct called admissibility, where an admissible $\delta \in \Delta$ cannot be dominated in the sense of this partial order.

11.5.1. Constructing admissible decision rules

One way to construct an admissible decision rule is to impose a prior and solve the resulting Bayesian decision problem. We give two situations in which this result necessarily applies, but there are other settings where this result is known to hold.

Proposition 11.1

If an ex ante Bayesian decision problem, (11.1), has a unique solution (except possibly on a set that has measure zero under $\tau_2$), then this Bayesian solution is admissible.

Proof. Let $\tilde\delta$ be a decision rule that weakly dominates a Bayesian decision rule, $\delta$, in the sense that

$$U(\delta \mid \theta) \le U(\tilde\delta \mid \theta)$$

for all $\theta \in \Theta$. Then $\tilde\delta$ must also solve the ex ante Bayesian decision problem. Since the solution to the ex ante decision problem is unique, $\tilde\delta = \delta$.

Proposition 11.2

Suppose $\Theta$ has a finite number of elements. If a prior distribution $d\pi$ assigns positive probability to each element of $\Theta$, then a decision rule that solves the Bayesian decision problem (11.1) is admissible.

Proof. Let $\tilde\delta$ be a decision rule that weakly dominates a Bayesian decision rule, $\delta$, in the sense that

$$U(\delta \mid \theta) \le U(\tilde\delta \mid \theta) \tag{11.7}$$

for all $\theta \in \Theta$. Suppose that the prior, $d\pi$, used in constructing the decision rule $\delta$ assigns strictly positive probability to each value of $\theta \in \Theta$. Use this prior to form expectations of both sides of inequality (11.7):

$$\int_\Theta U(\delta \mid \theta)\, d\pi(\theta) \le \int_\Theta U(\tilde\delta \mid \theta)\, d\pi(\theta).$$

But this latter inequality must hold with equality, because $\delta$ maximizes the ex ante objective. Since each element of $\Theta$ has strictly positive prior probability, inequality (11.7) must then also hold with equality for every $\theta$. Therefore, $\delta$ must be admissible.

Remark 11.1

While we are primarily interested in the use of alternative subjective priors as a way to construct admissible decision rules, sufficient conditions have been derived under which we can find priors that give Bayesian justifications for all admissible decision rules. Such results come under the heading of Complete class theorems. See, for instance, [LeCam, 1955], [Ferguson, 1967], and [Brown, 1981].

11.5.2. A simple statistical application reconsidered

For illustration purposes, we again consider the model selection example. Consider a threshold decision rule of the form:

$$\delta_r(z) = \begin{cases} 1, & \log \ell(z \mid \theta_1) - \log \ell(z \mid \theta_2) \ge r \\ 0, & \log \ell(z \mid \theta_1) - \log \ell(z \mid \theta_2) < r. \end{cases} \tag{11.8}$$

From formula (11.6), provided that we choose the prior probabilities to satisfy:

$$\log \pi(\theta_2) - \log \pi(\theta_1) = r + \log \upsilon_1 - \log \upsilon_2, \tag{11.9}$$

threshold rule (11.8) solves a Bayesian decision problem. Thus the implicit prior for the threshold rule is:

$$\pi(\theta_1) = \frac{1}{1 + \exp(r) \left( \frac{\upsilon_1}{\upsilon_2} \right)}, \qquad \pi(\theta_2) = \frac{\exp(r) \left( \frac{\upsilon_1}{\upsilon_2} \right)}{1 + \exp(r) \left( \frac{\upsilon_1}{\upsilon_2} \right)}.$$

To provide a complementary analysis, form:

$$y = \log \ell(z \mid \theta_1) - \log \ell(z \mid \theta_2),$$

and use the probability measure for $z$ under model $\theta_1$ to induce a corresponding probability measure for the scalar $y$. Suppose this induced probability measure has a density $f(\cdot \mid \theta_1)$ relative to Lebesgue measure. Observe that the counterpart density $f(\cdot \mid \theta_2)$ satisfies:

$$f(y \mid \theta_2) = \exp(-y)\, f(y \mid \theta_1).$$

This follows from the construction of $y$, because the ratio of the densities, $\ell(z \mid \theta_2) / \ell(z \mid \theta_1)$, equals $\exp(-y)$. For a decision rule $\delta_r$ with threshold $r$, compute the two risks:

$$u_1(r) \overset{\text{def}}{=} U(\delta_r \mid \theta_1) = \upsilon_1 \int_r^{+\infty} f(y \mid \theta_1)\, dy, \qquad u_2(r) \overset{\text{def}}{=} U(\delta_r \mid \theta_2) = \upsilon_2 \int_{-\infty}^{r} \exp(-y)\, f(y \mid \theta_1)\, dy,$$

where we include the multiplication by $\exp(-y)$ in the second expression to convert the computation into one that uses the model $\theta_2$ probabilities.

Consider the two-dimensional curve of model risks, $(u_1(r), u_2(r))$, parameterized by the threshold $r$. The slope of this curve at the point corresponding to $r$ is the ratio of the two derivatives with respect to $r$:

$$\frac{du_2(r)/dr}{du_1(r)/dr} = -\left( \frac{\upsilon_2}{\upsilon_1} \right) \exp(-r).$$

We compute the second derivative of the curve as

$$\frac{\frac{d}{dr}\left[ \frac{du_2(r)/dr}{du_1(r)/dr} \right]}{\frac{d}{dr} u_1(r)} = -\frac{\upsilon_2 \exp(-r)}{(\upsilon_1)^2\, f(r \mid \theta_1)} < 0,$$

and hence the curve is concave.

Using prior probabilities to weight the two risks gives the objective:

$$\pi(\theta_1)\, u_1(r) + \pi(\theta_2)\, u_2(r).$$

Maximizing this objective by choice of a threshold $r$ gives the first-order condition:

$$-\pi(\theta_1)\, \upsilon_1\, f(r \mid \theta_1) + \pi(\theta_2)\, \upsilon_2 \exp(-r)\, f(r \mid \theta_1) = 0,$$

implying that

$$\left( \frac{\upsilon_2}{\upsilon_1} \right) \exp(-r) = \frac{\pi(\theta_1)}{\pi(\theta_2)}.$$

As expected, this agrees with (11.9). Thus the negative of the slope of the curve reveals the ratio of prior probabilities that would justify a given threshold $r$ as a Bayesian solution.

We illustrate this computation in Fig. 11.1 and Fig. 11.2. Both figures report the upper boundary of the set of feasible risks for alternative decision rules. The risks along the boundary are attainable with admissible decision rules. The utility weights $\upsilon_1$ and $\upsilon_2$ are both set to one in Fig. 11.1, while $\upsilon_2$ is set to $.5$ in Fig. 11.2. Thus, the upper envelope of risks is flatter in Fig. 11.2 than in Fig. 11.1. The flatter curve implies prior probabilities that are closer to being the same.

../_images/risk_nu2_1.0.png

Fig. 11.1 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are $\upsilon_1 = \upsilon_2 = 1$. When $u_r(\theta_1) = .9$, the implied prior is $\pi(\theta_1) = .68$ and $\pi(\theta_2) = .32$, as implied by the slope of the tangent line.

../_images/risk_nu2_0.5.png

Fig. 11.2 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are $\upsilon_1 = 1$, $\upsilon_2 = .5$. When $u_r(\theta_1) = .9$, the implied prior is $\pi(\theta_1) = .51$ and $\pi(\theta_2) = .49$, as implied by the slope of the tangent line.
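The sketch below traces points on the risk curve $(u_1(r), u_2(r))$ and recovers the implied prior from the slope condition, in the spirit of Fig. 11.1. It assumes, as in the earlier snippets, that the log-likelihood ratio $y$ is distributed $N(.5, 1)$ under model $\theta_1$, in which case $f(y \mid \theta_2) = \exp(-y) f(y \mid \theta_1)$ is the $N(-.5, 1)$ density.

```python
import numpy as np
from scipy.stats import norm

v1, v2 = 1.0, 1.0   # utility weights

def risks(r):
    """Model risks (u_1(r), u_2(r)) when y ~ N(.5, 1) under theta_1."""
    u1 = v1 * (1.0 - norm.cdf(r, loc=0.5, scale=1.0))
    u2 = v2 * norm.cdf(r, loc=-0.5, scale=1.0)   # density exp(-y) f(y | theta_1)
    return u1, u2

def implied_prior_1(r):
    """pi(theta_1) recovered from the slope condition (11.9)."""
    odds = (v2 / v1) * np.exp(-r)                # equals pi(theta_1) / pi(theta_2)
    return odds / (1.0 + odds)

for r in [-0.5, 0.0, 0.75]:
    u1, u2 = risks(r)
    print(f"r = {r:+.2f}: u1 = {u1:.3f}, u2 = {u2:.3f}, "
          f"implied pi(theta_1) = {implied_prior_1(r):.2f}")
```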

11.6. Divergences

To investigate prior sensitivity, we seek a convenient way to represent a family of alternative priors. We start with a baseline prior $d\pi(\theta)$ and consider alternative priors of the form $n(\theta)\, d\pi(\theta)$ for $n \ge 0$ satisfying:

$$\int_\Theta n(\theta)\, d\pi(\theta) = 1.$$

Call this collection $N$. Thus the $n$’s in $N$ are expressed as densities relative to the baseline prior distribution.

Introduce a convex function $\phi$ to construct a divergence between a probability represented by an $n \in N$ and the baseline probability. Restrict $\phi$ to be a convex function with $\phi(1) = 0$ and $\phi''(1) = 1$ (a normalization). As a measure of divergence, form

$$\int_\Theta \phi[n(\theta)]\, d\pi(\theta) \ge 0.$$

Of course, many such divergences could be built. Three interesting ones use the convex functions:

  • $\phi(n) = -\log n$

  • $\phi(n) = \frac{n^2 - 1}{2}$

  • $\phi(n) = n \log n$

The divergence implied by the third choice is commonly used in applied probability theory and information theory. It is called the Kullback-Leibler divergence, or relative entropy.
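On a finite parameter space, each of these divergences is a one-line computation. The following sketch evaluates all three for a pair of priors; the probability vectors are placeholders.

```python
import numpy as np

def divergence(pi_alt, pi_base, phi):
    """phi-divergence of an alternative prior from a baseline prior on a
    finite parameter space; n is the density of the alternative relative
    to the baseline."""
    n = pi_alt / pi_base
    return np.sum(phi(n) * pi_base)

phis = {
    "-log n":       lambda n: -np.log(n),
    "(n^2 - 1)/2":  lambda n: (n**2 - 1.0) / 2.0,
    "n log n (KL)": lambda n: n * np.log(n),
}

pi_base = np.array([0.5, 0.3, 0.2])   # placeholder baseline prior
pi_alt = np.array([0.6, 0.3, 0.1])    # placeholder alternative prior
for name, phi in phis.items():
    print(f"{name:13s}: {divergence(pi_alt, pi_base, phi):.4f}")
```

Each divergence is zero when the two priors coincide and positive otherwise, as implied by Jensen’s inequality and the normalization $\phi(1) = 0$.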

11.7. Robust Bayesian preferences and ambiguity aversion

Since the Bayesian ranking of prize rules depends on the prior distribution, we now explore how to proceed if the decision maker does not have full confidence in a specific prior. This leads naturally to an investigation of prior sensitivity. A decision or policy problem provides us with an answer to the question: sensitive to what? One way to investigate prior sensitivity is to approach it from the perspective of robustness. A robust decision rule is one that performs well under the alternative priors of interest. To obtain robustness guarantees, we are naturally led to minimization, which provides a lower bound on performance. As we will see, prior robustness has very close ties to preferences that display ambiguity aversion. Just as risk aversion induces a form of caution in the presence of uncertain outcomes, ambiguity aversion induces caution because of the lack of confidence in a single prior.

We explore prior robustness by using a version of the variational preferences of [Maccheroni et al., 2006]:

$$\min_{n \in N} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta)$$

for $\xi_1 > 0$ and a convex function $\phi_1$ such that $\phi_1(1) = 0$ and $\phi_1''(1) = 1$. The penalty parameter $\xi_1$ reflects the degree of ambiguity aversion: an arbitrarily large value of $\xi_1$ approximates subjective expected utility, while relatively small values of $\xi_1$ induce relatively large degrees of ambiguity aversion.

Remark 11.2

Axiomatic developments of decision theory in the presence of risk typically do not produce the functional form of the utility function; that requires additional considerations. An analogous observation applies to the axiomatic development of variational preferences by [Maccheroni et al., 2006]. Their axioms do not inform us about how to capture the cost associated with the search over alternative priors.

Remark 11.3

The variational preferences of [Maccheroni et al., 2006] also include preferences with a constraint on priors:

$$\int_\Theta \phi_1[n(\theta)]\, d\pi(\theta) \le \kappa.$$

The more restrictive axiomatic formulation of [Gilboa and Schmeidler, 1989] supports a representation with a constraint on the set of priors. In this case we use standard Karush-Kuhn-Tucker multipliers to model the preference relation:

$$\max_{\xi_1 \ge 0}\, \min_{n \in N} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_1 \left[ \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta) - \kappa \right].$$

11.7.1. Relative entropy divergence

Suppose we use $n \log n$ to construct our divergence measure. Recall the construction of the risk function:

$$U(\gamma \mid \theta) = \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x).$$

Solve the Lagrangian problem:

$$\min_n \int_\Theta U(\gamma \mid \theta)\, n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \log[n(\theta)]\, n(\theta)\, d\pi(\theta) + \lambda \int_\Theta [n(\theta) - 1]\, d\pi(\theta).$$

This problem separates in terms of the choice of $n(\theta)$ and can be solved $\theta$ by $\theta$. The first-order conditions are:

$$U(\gamma \mid \theta) + \xi_1 + \xi_1 \log n(\theta) + \lambda = 0.$$

Solving for $\log n(\theta)$:

$$\log n(\theta) = -\frac{1}{\xi_1} U(\gamma \mid \theta) - 1 - \frac{\lambda}{\xi_1}.$$

Thus

$$n(\theta) \propto \exp\left[ -\frac{1}{\xi_1} U(\gamma \mid \theta) \right].$$

Imposing the integral constraint on $n$ gives the solution:

$$n^*(\theta) = \frac{\exp\left[ -\frac{1}{\xi_1} U(\gamma \mid \theta) \right]}{\int_\Theta \exp\left[ -\frac{1}{\xi_1} U(\gamma \mid \theta) \right] d\pi(\theta)},$$

provided that the denominator is finite. This solution induces what is known as exponential tilting: the baseline probabilities are tilted toward lower values of $U(\gamma \mid \theta)$. Plugging back into the minimization problem gives:

$$-\xi_1 \log \int_\Theta \exp\left[ -\frac{1}{\xi_1} U(\gamma \mid \theta) \right] d\pi(\theta). \tag{11.10}$$

This minimized objective is a special case of the smooth ambiguity preferences initially proposed by [Klibanoff et al., 2005], although those authors provide a different motivation for their ambiguity adjustment. The connection we articulate opens the door to a more direct link to challenges familiar to statisticians and econometricians wrestling with how to analyze and interpret data. Indeed, [Cerreia-Vioglio et al., 2013] also adopt a robust statistics perspective when exploring smooth ambiguity aversion preferences. They use constructs and distinctions of the type we explored in Chapter 1: Laws of Large Numbers and Stochastic Processes in characterizing what is and is not learnable from the Law of Large Numbers.
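The worst-case prior and the minimized objective (11.10) are straightforward to compute on a finite parameter space. In the sketch below, the risk values and baseline prior are illustrative placeholders; as $\xi_1 \to \infty$ the objective approaches the baseline expected utility, while as $\xi_1 \downarrow 0$ it approaches the lowest risk.

```python
import numpy as np

# risks[i] stores U(gamma | theta_i); the numbers are placeholders.
risks = np.array([0.9, 0.6, 0.3])
pi_base = np.array([0.5, 0.3, 0.2])

def worst_case_prior(xi1):
    """Exponential tilting: n*(theta) pi(theta), tilted toward low risks."""
    weights = pi_base * np.exp(-risks / xi1)
    return weights / weights.sum()

def minimized_objective(xi1):
    """Formula (11.10)."""
    return -xi1 * np.log(np.sum(pi_base * np.exp(-risks / xi1)))

for xi1 in [10.0, 1.0, 0.1]:
    print(f"xi1 = {xi1:5.1f}: worst-case prior = {np.round(worst_case_prior(xi1), 3)}, "
          f"objective = {minimized_objective(xi1):.3f}")
print(f"baseline expected utility = {pi_base @ risks:.3f}, lowest risk = {risks.min():.1f}")
```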

11.7.2. Robust Bayesian decision problem

We extend Decision Problem (11.1) to include prior robustness by introducing a special case of a two-player, zero-sum game:

Game 11.4

$$\max_{\gamma \in \Gamma(\Delta)}\, \min_{n \in N} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta). \tag{11.11}$$

Notice that in this formulation the minimization depends on the choice of the decision rule $\delta$. This is to be expected, as the prior with the most adverse consequences for expected utility should plausibly depend on the potential course of action under consideration.

For a variety of reasons, it is of interest to investigate a related problem in which the order of extremization is exchanged:

Game 11.5

$$\min_{n \in N}\, \max_{\gamma \in \Gamma(\Delta)} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta). \tag{11.12}$$

Notice that for a given $n$, the inner problem is essentially just a version of the Bayesian problem (11.1). The penalty term

$$\xi_1 \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta)$$

is additively separable and does not depend on $\delta$. In this formulation, we solve a Bayesian problem for each possible prior and then minimize over priors, taking account of the penalty term. Provided that the outer minimization problem over $n$ has a solution, $n^*$, the implied decision rule, $\delta^{n^*}$, solves a Bayesian decision problem. As we know from Section Constructing admissible decision rules, this opens the door to verifying admissibility.

The two decision games, (11.11) and (11.12), essentially have the same solution under a Minimax Theorem: the implied value functions are the same, and $\delta^{n^*}$ from Game (11.12) solves Game (11.11) and gives the robust decision rule under prior ambiguity. This result holds under a variety of sufficient conditions. Notice that the objective for Game (11.11) is convex in $n$. A well-known result due to [Fan, 1952] verifies the Minimax Theorem when the objective also satisfies a generalization of concavity in $\gamma$ with a convex constraint set $\Delta$. While we cannot always justify this exchange, there are other sufficient conditions that are also applicable.

A robust Bayesian advocate along the lines of [Good, 1952] would view the solution, say $n^*$, from Game (11.12) as a choice of a prior to be evaluated subjectively. It is often referred to as a “worst-case prior,” but it is an object of interest in its own right. For an application of this idea in economics, see [Chamberlain, 2000].[6] Typically, $n^*(\theta)\, d\pi(\theta)$ can be computed as part of an algorithm for finding a robust decision rule and is arguably worth serious inspection. While we could just view robustness considerations as a way to select a prior, the (penalized) worst-case solutions can instead be viewed as a device to implement robustness. Although they are worthy of inspection, just as with the baseline prior probabilities, the worst-case probabilities are not intended to be a fully confident specification of subjective probabilities, and they depend on the utility function and the constraints imposed on the decision problem.

11.7.2.1. A simple statistical application reconsidered

We again use the model selection example to illustrate ambiguity aversion in the presence of a relative entropy cost of a prior’s deviating from the baseline. Since the Minimax Theorem applies, we focus our attention on admissible decision rules parameterized by thresholds of the form (11.8). With this simplification, we use formula (11.10) and solve the scalar maximization problem:

$$\max_r\; -\xi_1 \log \left( \exp\left[ -\frac{1}{\xi_1} u_1(r) \right] \pi(\theta_1) + \exp\left[ -\frac{1}{\xi_1} u_2(r) \right] \pi(\theta_2) \right).$$

Two limiting cases are of interest. When $\xi_1 \to \infty$, the objective collapses to the subjective expected utility:

$$u_1(r)\, \pi(\theta_1) + u_2(r)\, \pi(\theta_2)$$

with $r$ chosen so that the tangency condition

$$\left( \frac{\upsilon_2}{\upsilon_1} \right) \exp(-r) = \frac{\pi(\theta_1)}{\pi(\theta_2)}$$

is satisfied.

When $\xi_1 \downarrow 0$, the cost of deviating from the baseline prior is zero. As long as the baseline prior assigns positive probability to both values of $\theta$, the minimization for a given threshold $r$ assigns probability one to the lowest risk, giving the objective:

$$\min\left\{ u_1(r), u_2(r) \right\}.$$

Graphically, the objective for any point to the left of the 45 degree line from the origin equals the outcome of a vertically downward movement to that same line. Analogously, the objective for any point to the right of the 45 degree line equals the outcome of a horizontally leftward movement to that same line. Thus the maximizing choice of the threshold $r$ is obtained at the intersection of the 45 degree line and the boundary of the risk set. Along the 45 degree line, the choice of prior is inconsequential because the two risks are the same. Nevertheless, the “worst-case” prior is determined by the slope of the risk curve at the intersection point. Recall that we defined this prior after exchanging the orders of extremization. Fig. 11.3 and Fig. 11.4 illustrate this outcome for the two risk curves on display in Fig. 11.1 and Fig. 11.2.
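The following sketch solves this scalar maximization for several values of $\xi_1$, again under the illustrative Gaussian assumption used in the earlier snippets. As $\xi_1$ shrinks, the maximizing threshold migrates toward the point where the two risks are equal.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

v1, v2 = 1.0, 1.0
pi1 = 0.68   # baseline prior on theta_1

def risks(r):
    """Risks under the assumed Gaussian setup: y ~ N(.5, 1) under theta_1."""
    u1 = v1 * (1.0 - norm.cdf(r, loc=0.5, scale=1.0))
    u2 = v2 * norm.cdf(r, loc=-0.5, scale=1.0)
    return u1, u2

def objective(r, xi1):
    """Smooth-ambiguity objective built from (11.10), to be maximized in r."""
    u1, u2 = risks(r)
    return -xi1 * np.log(pi1 * np.exp(-u1 / xi1) + (1.0 - pi1) * np.exp(-u2 / xi1))

for xi1 in [100.0, 1.0, 0.05]:
    res = minimize_scalar(lambda r: -objective(r, xi1), bounds=(-5.0, 5.0), method="bounded")
    u1, u2 = risks(res.x)
    print(f"xi1 = {xi1:6.2f}: r* = {res.x:+.3f}, risks = ({u1:.3f}, {u2:.3f})")
```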

../_images/risk_equal_corrected_arrows.png

Fig. 11.3 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are $\upsilon_1 = \upsilon_2 = 1$. The implied worst-case prior is $\pi(\theta_1) = .5$ and $\pi(\theta_2) = .5$, as implied by the slope of the tangent line at the intersection with the 45 degree line from the origin.

../_images/risk_unequal_corrected_arrows.png

Fig. 11.4 The blue curve gives the upper boundary of the feasible set of risks. The utility function parameters are $\upsilon_1 = 1$ and $\upsilon_2 = .5$. The implied worst-case prior is $\pi(\theta_1) = .22$ and $\pi(\theta_2) = .78$, as implied by the slope of the tangent line at the intersection with the 45 degree line from the origin.

For positive values of $\xi_1$, the implied worst-case priors lie between the baseline $\pi$ ($\xi_1 = \infty$) and the worst case without restricting the prior probabilities ($\xi_1 = 0$). Observe that the worst-case priors depend on the utility weights $\upsilon_1$ and $\upsilon_2$. See Fig. 11.5 and Fig. 11.6 for illustrations.

../_images/probabilities_nu2_1.0_pi1_0.68.png

Fig. 11.5 Minimizing prior probabilities for $\theta_1$ as a function of $1/\xi_1$ when the baseline prior probabilities are $\pi(\theta_1) = .68$ and $\pi(\theta_2) = .32$ and the utility parameters are $\upsilon_1 = \upsilon_2 = 1$.

../_images/probabilities_nu2_1.0_pi1_0.51.png

Fig. 11.6 Minimizing prior probabilities for $\theta_1$ as a function of $1/\xi_1$ when the baseline prior probabilities are $\pi(\theta_1) = .51$ and $\pi(\theta_2) = .49$ and the utility parameters are $\upsilon_1 = 1$ and $\upsilon_2 = .5$.

The two-model example dramatically understates the potential value of ambiguity aversion as a way to study prior sensitivity. In typical applied problems, subjective probabilities are imposed on a much richer collection of alternative models, including families of models indexed by unknown parameters. In such problems the outcome is more subtle, since the minimization isolates the dimensions along which prior sensitivity has the most adverse impact on the decision problem, dimensions that are perhaps most worthy of further consideration. This can be especially important in problems where baseline priors are imposed as a matter of convenience.

11.8. Using ambiguity aversion to represent concerns about model misspecification

Two prominent statisticians remarked on how model misspecification is pervasive:

“Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.” - Box (1976).

“… it does not seem helpful just to say that all models are wrong. The very word ‘model’ implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd. The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis and statistical models, especially substantive ones …” - Cox (1995).

Other scholars have made similar remarks. Robust control theorists have suggested one way to address this challenge, an approach that we build on in the discussion that follows. Motivated by such sentiments, [Cerreia-Vioglio et al., 2025] extend decision theory axioms to accommodate misspecification concerns.

11.8.1. Basic approach

To focus on the misspecification of a specific model, we fix $\theta$ but vary the likelihood function as a way to investigate likelihood sensitivity. We replace $\ell$ with $m \ell$, where $m$ satisfies:

$$\int_X m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) = 1,$$

and we denote the set of all such $m$’s as $M$.

Observe that $m(x \mid \theta)$ can be viewed as the ratio of two densities. Consider two alternative probability densities (with respect to $d\tau$) for shock/signal probabilities, $\ell(x \mid \theta)$ and $\tilde\ell(x \mid \theta)$, and let:

$$m(x \mid \theta) = \frac{\tilde\ell(x \mid \theta)}{\ell(x \mid \theta)},$$

where we assume $\ell(x \mid \theta) > 0$ for $x \in X$. Then by construction:

$$m(x \mid \theta)\, \ell(x \mid \theta) = \tilde\ell(x \mid \theta),$$

and

$$\int_X m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) = \int_X \tilde\ell(x \mid \theta)\, d\tau(x) = 1.$$

We use density ratios to capture alternative models as inputs into divergence measures. Let $\phi_2$ be a convex function such that $\phi_2(1) = 0$ and $\phi_2''(1) = 1$ (a normalization). Instead of imposing the divergence on probabilities over the parameter space, to explore misspecification we impose it over the $X$ space:

$$\int_X \phi_2[m(x \mid \theta)]\, \ell(x \mid \theta)\, d\tau(x).$$

We use this divergence to limit or constrain our search over alternative probability models. In this approach we deliberately avoid imposing a prior distribution over the space of densities (with respect to $d\tau$).

Preferences for model robustness rank alternative prize rules, $\gamma$, by solving:

$$\min_{m \in M} \int_X \left( U[\gamma(x), \theta]\, m(x \mid \theta) + \xi_2\, \phi_2[m(x \mid \theta)] \right) \ell(x \mid \theta)\, d\tau(x) \tag{11.13}$$

for a penalty parameter $\xi_2 > 0$. The parameter $\xi_2$ dictates the strength of the restraint in an exploration of possible model misspecifications.

11.8.2. Relative entropy divergence

This approach to model misspecification has direct links to robust control theory in the case of the relative entropy divergence. Suppose that $\phi_2(m) = m \log m$ (relative entropy). Then, by imitating previous computations, we find that the outcome of the minimization in (11.13) is

$$-\xi_2 \log \int_X \exp\left[ -\frac{1}{\xi_2}\, U[\gamma(x), \theta] \right] \ell(x \mid \theta)\, d\tau(x).$$
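As with prior tilting, the minimizer in (11.13) under the relative entropy divergence is an exponential tilt, now applied to the baseline density over outcomes: $m^*(x \mid \theta) \propto \exp\{-U[\gamma(x), \theta]/\xi_2\}$. Here is a minimal sketch on a discrete outcome grid; the utilities and baseline probabilities are made-up placeholders.

```python
import numpy as np

# Discrete outcome grid with a baseline density and made-up utilities.
x_grid = np.linspace(-3.0, 3.0, 7)
ell = np.exp(-0.5 * x_grid**2); ell /= ell.sum()    # baseline probabilities
utility = -0.5 * x_grid**2                          # placeholder U[gamma(x), theta]
xi2 = 1.0

# Worst-case density ratio m*, normalized so that E[m] = 1 under the baseline.
m = np.exp(-utility / xi2)
m /= np.sum(m * ell)
tilted = m * ell                                    # worst-case probabilities
objective = -xi2 * np.log(np.sum(np.exp(-utility / xi2) * ell))

print("tilted probabilities:", np.round(tilted, 3))
print(f"minimized objective = {objective:.3f}, baseline E[U] = {ell @ utility:.3f}")
```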

Remark 11.4

Robust control emerged from the study of optimization of dynamical systems. The use of the relative entropy divergence showed up prominently in [Jacobson, 1973] and later in [Whittle, 1981], [Petersen et al., 2000], and many other related papers as a response to the excessive simplicity of assuming shocks to dynamical systems that are iid, mean zero, and normally distributed. [Hansen and Sargent, 1995] and [Hansen and Sargent, 2001] showed how to reformulate the insights from robust control theory to apply to dynamic economic systems with recursive formulations, and [Hansen et al., 1999] used these ideas in an initial empirical investigation.

When we use relative entropy as a measure of divergence, we can factor likelihoods in convenient ways. Recall the partitioning $x = (w, z)$, where the decision rule can depend only on $z$ and the prize rule on the entire $x$ vector. As in (11.2), factor $\ell(\cdot \mid \theta)$ and $\tau$ as:

$$d\tau(x) = d\tau_2(w)\, d\tau_1(z), \qquad \ell(x \mid \theta) = \ell_2(w \mid z, \theta)\, \ell_1(z \mid \theta).$$

Add to this a factorization of $m(\cdot \mid \theta)$:

$$m(x \mid \theta) = m_2(w \mid z, \theta)\, m_1(z \mid \theta), \qquad \int_W m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) = 1, \qquad \int_Z m_1(z \mid \theta)\, \ell_1(z \mid \theta)\, d\tau_1(z) = 1. \tag{11.14}$$

Let $M_1$ denote the set of $m_1(\cdot \mid \theta) \ge 0$ that satisfy the relevant integral constraint in (11.14), and similarly let $M_2$ denote the set of $m_2(\cdot \mid \cdot, \theta) \ge 0$ that satisfy the relevant integral constraint.

Using these factorizations, the relative entropy may be written as:

$$\int_Z \int_W \log[m_2(w \mid z, \theta)]\, m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) + \int_Z \log[m_1(z \mid \theta)]\, m_1(z \mid \theta)\, \ell_1(z \mid \theta)\, d\tau_1(z),$$

where the second term features integration only over $z$ because $\log[m_1]$ does not depend on $w$ and $m_2 \ell_2$ integrates to one with respect to $\tau_2$ for all $z$. In particular,

$$\int_Z \int_W \log[m_2(w \mid z, \theta)]\, m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) = \int_Z \left[ \int_W \log[m_2(w \mid z, \theta)]\, m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right] m_1(z \mid \theta)\, \ell_1(z \mid \theta)\, d\tau_1(z). \tag{11.15}$$

Rewrite the expected utility in (11.13) with an inner integral:

$$\int_Z \left[ \int_W U(\Psi[\delta(z), w, z], \theta)\, m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right] m_1(z \mid \theta)\, \ell_1(z \mid \theta)\, d\tau_1(z). \tag{11.16}$$

Notice that both formulas (11.15) and (11.16) scale linearly in $m_1(z \mid \theta)$ and that the additional relative entropy term depends only on $z$. Thus we can use a conditional objective when solving for a robust decision rule $\delta$:

Game 11.6

$$\max_{\delta(z) \in \Delta(z)}\, \min_{m_2 \in M_2} \int_W U(\Psi[\delta(z), w, z], \theta)\, m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) + \xi_2 \int_W \log[m_2(w \mid z, \theta)]\, m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w), \tag{11.17}$$

where the constraint set $\Delta$ satisfies the separability restriction (11.3).
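The factorization of relative entropy used above is an instance of the chain rule for relative entropy: the divergence between joint distributions splits into a marginal piece and an averaged conditional piece. A quick numerical check on a small discrete example (the arrays are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.random((3, 4)); base /= base.sum()   # baseline joint over (w, z)
alt = rng.random((3, 4)); alt /= alt.sum()      # altered joint over (w, z)

def kl(p, q):
    """Relative entropy of p with respect to q (elementwise positive)."""
    return np.sum(p * np.log(p / q))

# Direct joint relative entropy.
total = kl(alt, base)

# Factored version: marginal over z plus expected conditional divergence over w.
base_z, alt_z = base.sum(axis=0), alt.sum(axis=0)
marginal_term = kl(alt_z, base_z)
conditional_term = sum(
    alt_z[j] * kl(alt[:, j] / alt_z[j], base[:, j] / base_z[j])
    for j in range(base.shape[1])
)
print(np.isclose(total, marginal_term + conditional_term))  # True
```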

11.8.3. Robust prediction under misspecification

A decision rule is chosen to forecast

$$f(w, z) = f_1(z) + f_2(z)\, w,$$

where the probability distribution of $w$ is standard normal. The admissible forecast rules express $\delta$ as a function of the data $z$. A prize rule gives the implied forecast error, $\gamma(x) = f(x) - \delta(z)$. Take the utility function to be:

$$-\frac{1}{2} \gamma(x)^2 = -\frac{1}{2} \left[ f(x) - \delta(z) \right]^2.$$

We find the robust forecasting rule by solving Game (11.17). We first solve the inner minimization problem, whose outcome is:

$$-\xi_2 \log E\left( \exp\left[ \frac{1}{2 \xi_2} \left[ f_1(z) + f_2(z) w - \delta(z) \right]^2 \right] \Big| z, \theta \right).$$

To compute this objective, note that two exponentials contribute: one from the normal density for $w$ and the other from the decision maker’s objective scaled by $1/\xi_2$. Adding together the logarithms of these two components gives:

$$\begin{aligned} \frac{1}{2 \xi_2} \left[ f_1(z) + f_2(z) w - \delta(z) \right]^2 - \frac{1}{2} w^2 &= -\frac{1}{2} \left[ 1 - \frac{1}{\xi_2} [f_2(z)]^2 \right] w^2 + \frac{1}{\xi_2} f_2(z) \left[ f_1(z) - \delta(z) \right] w + \frac{1}{2 \xi_2} \left[ f_1(z) - \delta(z) \right]^2 \\ &= -\frac{\mathsf{pr}}{2} (w - \mathsf{m})^2 + \frac{\mathsf{pr}}{2} \mathsf{m}^2 + \frac{1}{2 \xi_2} \left[ f_1(z) - \delta(z) \right]^2 \\ &= \left[ -\frac{\mathsf{pr}}{2} (w - \mathsf{m})^2 + \frac{1}{2} \log \mathsf{pr} \right] + \frac{1}{2} \frac{\left[ f_1(z) - \delta(z) \right]^2}{\xi_2 - [f_2(z)]^2} - \frac{1}{2} \log \mathsf{pr}, \end{aligned}$$

where

$$\mathsf{pr} \overset{\text{def}}{=} 1 - \frac{1}{\xi_2} [f_2(z)]^2, \qquad \mathsf{m} \overset{\text{def}}{=} \frac{f_2(z) \left[ f_1(z) - \delta(z) \right]}{\xi_2 - [f_2(z)]^2}.$$

The term in square brackets is the logarithm of a normal density with mean $\mathsf{m}$ and precision $\mathsf{pr}$, except for a constant term contributed by the standard normal density. This same normal distribution is the “worst-case” distribution for the forecasting rule $\delta(z)$. For the objective to be finite, we need

$$\xi_2 > [f_2(z)]^2.$$

Given this calculation, the outcome of the minimization problem can be rewritten as

$$-\left( \frac{\xi_2}{2} \right) \frac{\left[ f_1(z) - \delta(z) \right]^2}{\xi_2 - [f_2(z)]^2} + \frac{\xi_2}{2} \log \left[ 1 - \frac{1}{\xi_2} [f_2(z)]^2 \right].$$

Maximizing with respect to $\delta(z)$ implies that $\delta(z) = f_1(z)$, with the resulting objective given by:

$$\frac{\xi_2}{2} \log \left[ 1 - \frac{1}{\xi_2} [f_2(z)]^2 \right] < -\frac{1}{2} [f_2(z)]^2,$$

where the inequality follows from the concavity of the log function. Thus, for this example, the robust prediction sets $\delta(z)$ equal to the conditional mean under the baseline distribution. Robustness considerations alter the final objective but not the decision rule. This calculation, however, relies heavily on the normal baseline distribution.

11.9. Robust Bayes with model misspecification

To relate to decision theory, think of a statistical model as implying a compound lottery. Use $\ell(x \mid \theta)\, d\tau(x)$ as a lottery conditioned on $\theta$, and think of $d\pi(\theta)$ as a lottery over $\theta$. Initially we thought of the first as a source of “risk”; it gave rise to what statisticians call a risk function: the expected utility (or loss) conditioned on the unknown parameter. In the Anscombe-Aumann metaphor, this is the “roulette wheel”. The distribution $d\pi(\theta)$ is the subjective probability input and the “horse race” in the Anscombe-Aumann metaphor. The potential misspecification of likelihoods adds skepticism about the risk contribution to the compound lottery. As statisticians like Box and Cox observed, potential model misspecification arguably should be a pervasive concern.

11.9.1. Approach one

Form a convex, compact constraint set of prior probabilities, $N^o \subset N$. Represent preferences over $\gamma$ using:

$$\min_{n \in N^o}\, \min_{m \in M} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_2 \int_\Theta \left( \int_X \phi_2[m(x \mid \theta)]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta).$$

11.9.2. Approach two

Represent preferences over $\gamma$ with:

$$\min_{n \in N}\, \min_{m \in M} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_2 \int_\Theta \left( \int_X \phi_2[m(x \mid \theta)]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta).$$

Note that this approach uses a scaled version of a joint divergence over $(m, n)$, as reflected in the term

$$\xi_2 \int_\Theta \left( \int_X \phi_2[m(x \mid \theta)]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta).$$

The associated decision problem is:

Game 11.6

$$\max_{\gamma \in \Gamma}\, \min_{n \in N}\, \min_{m \in M} \int_\Theta \left( \int_X U[\gamma(x), \theta]\, m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_2 \int_\Theta \left( \int_X \phi_2[m(x \mid \theta)]\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \phi_1[n(\theta)]\, d\pi(\theta). \tag{11.18}$$

Remark 11.5

In Section Robust prediction under misspecification we studied a prediction problem under misspecification and established that the conditional expectation under the baseline model is “robust.” Now suppose there is parameter uncertainty in the sense that we have multiple specifications of the pair $(f_1(z \mid \theta), f_2(z \mid \theta))$ for $\theta \in \Theta$. For each $\theta$, the ex ante contribution to the decision maker’s objective conditioned on a model is:

$$-\xi_2 \log E\left( \exp\left[ \frac{1}{2 \xi_2} \left[ f_1(z \mid \theta) + f_2(z \mid \theta) w - \delta(z) \right]^2 \right] \Big| \theta \right).$$

This objective adjusts for likelihood (or model) uncertainty but not prior uncertainty. The conditional expectation uses the probability measure $\ell(x \mid \theta)\, d\tau(x)$. Since $\theta$ is unknown, the decision rule $\delta(z) = f_1(z \mid \theta)$ is infeasible to implement. Game (11.18) provides an algorithm for deducing the robust decision rule with

$$\max_{\delta \in \Delta}\, \min_{n \in N}\; -\xi_2 \int_\Theta \log E\left( \exp\left[ \frac{1}{2 \xi_2} \left[ f_1(z \mid \theta) + f_2(z \mid \theta) w - \delta(z) \right]^2 \right] \Big| \theta \right) n(\theta)\, d\pi(\theta) + \xi_1 \int_\Theta \log[n(\theta)]\, n(\theta)\, d\pi(\theta).$$

Consider the special case in which $\xi_1 = \xi_2 = \xi$, and use this common parameter to scale a relative entropy divergence. Then the combined robustness cost to preferences is measured by $\xi$ multiplying the relative entropy of the joint distribution $m(x \mid \theta)\, n(\theta)\, \ell(x \mid \theta)\, d\tau(x)\, d\pi(\theta)$ relative to the baseline $\ell(x \mid \theta)\, d\tau(x)\, d\pi(\theta)$, given by:

$$\int_\Theta \left( \int_X \log[m(x \mid \theta)]\, m(x \mid \theta)\, \ell(x \mid \theta)\, d\tau(x) \right) n(\theta)\, d\pi(\theta) + \int_\Theta \log[n(\theta)]\, n(\theta)\, d\pi(\theta).$$

Joint densities can be factored in alternative ways, and in solving the robust decision problem a different factorization is more convenient. We focus on the case in which $x = (w, z)$ and the decision rule, $\delta$, depends only on $z$. Form three contributions to the joint density under the baseline:

$$\ell(x \mid \theta)\, d\tau(x)\, d\pi(\theta) = \ell_2(w \mid z, \theta)\, \ell_1(z \mid \theta)\, d\tau_2(w)\, d\tau_1(z)\, d\pi(\theta) = \ell_2(w \mid z, \theta)\, d\tau_2(w)\; d\bar\pi(\theta \mid z)\; \left[ \int_\Theta \ell_1(z \mid \theta)\, d\pi(\theta) \right] d\tau_1(z).$$

Notice that the last term in the factorization depends only on $z$, whereas the decision rule conditions on $z$. We now explore likelihood misspecification using $m_2(w \mid z, \theta)$ satisfying:

$$\int_W m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) = 1,$$

and posterior misspecification using $\bar n(\theta \mid z)$, where

$$\int_\Theta \bar n(\theta \mid z)\, d\bar\pi(\theta \mid z) = 1;$$

let $\bar N$ denote the set of such $\bar n$’s. Additionally, we may alter the density

$$\int_\Theta \ell_1(z \mid \theta)\, d\pi(\theta)$$

in an analogous way. While this latter exploration will have a nondegenerate outcome, it will have no impact on the robustly optimal choice of $\delta$. We may instead focus on a conditional counterpart to Game (11.18). The logic behind this is entirely analogous to the argument we provided for the conditional version of the Bayesian decision problem. The objective for the conditional robust game with the same solution as the ex ante game is:

Game 11.7

$$\begin{aligned} \max_{\delta(z) \in \Delta(z)}\, \min_{\bar n \in \bar N}\, \min_{m_2 \in M_2}\; & \int_\Theta \left( \int_W U(\Psi[\delta(z), w, z], \theta)\, m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right) \bar n(\theta \mid z)\, d\bar\pi(\theta \mid z) \\ & + \xi \int_\Theta \left( \int_W \log[m_2(w \mid z, \theta)]\, m_2(w \mid z, \theta)\, \ell_2(w \mid z, \theta)\, d\tau_2(w) \right) \bar n(\theta \mid z)\, d\bar\pi(\theta \mid z) + \xi \int_\Theta \log[\bar n(\theta \mid z)]\, \bar n(\theta \mid z)\, d\bar\pi(\theta \mid z), \end{aligned}$$

where the constraint set $\Delta$ satisfies the separability restriction (11.3).

Thus the decision maker may proceed to construct a robustly optimal decision rule taking as an input the posterior distribution, defined on the parameter space $\Theta$, as computed by a statistician. The decision maker explores the robustness of the posterior and of the density for the shocks conditioned on $(z, \theta)$.

Finally, suppose that $U$ does not depend on $\theta$. This provides a further simplification. In this case, instead of working with the factorization:

$$\ell_2(w \mid z, \theta)\, d\tau_2(w)\, d\bar\pi(\theta \mid z),$$

we use

$$d\tilde\pi(\theta \mid w, z)\, \bar\ell_2(w \mid z)\, d\tau_2(w),$$

where

$$\bar\ell_2(w \mid z) \overset{\text{def}}{=} \int_\Theta \ell_2(w \mid z, \theta)\, d\bar\pi(\theta \mid z)$$

and $d\tilde\pi(\theta \mid w, z)$ is the posterior distribution formed using data on both $z$ and $w$. Statisticians call $\bar\ell_2(\cdot \mid z)$ a predictive density, in this case defined on the space $W$ of shocks. With this alternative factorization, the minimization step has no incentive to explore the misspecification of $d\tilde\pi(\theta \mid w, z)$ and instead focuses exclusively on the potential misspecification of the predictive density. This leads to the following construction of a robust decision rule:

Game 11.8

$$\max_{\delta(z) \in \Delta(z)}\, \min_{\bar m_2 \in M_2} \int_W U(\Psi[\delta(z), w, z])\, \bar m_2(w \mid z)\, \bar\ell_2(w \mid z)\, d\tau_2(w) + \xi \int_W \log[\bar m_2(w \mid z)]\, \bar m_2(w \mid z)\, \bar\ell_2(w \mid z)\, d\tau_2(w),$$

where the constraint set $\Delta$ satisfies the separability restriction (11.3).

[Chamberlain, 2020] features this as a way to formulate preferences with uncertainty aversion.

In many applications it will be of considerable interest to allow for $\xi_1 \ne \xi_2$, in which case the simplifications implied by some of these alternative factorizations will not be applicable. Indeed, we find it valuable and revealing to differentiate prior robustness from likelihood robustness.[7]

11.10. A dynamic decision problem under commitment

So far, we have studied static decision problems. This formulation can accommodate dynamic problems by allowing for decision rules that depend on histories of data available up until the date of the decision. While there is a “commitment” to these rules at the initial date, the rules themselves can depend on pertinent information that is revealed only in the future. Recall from Chapter 1 that we use an ergodic decomposition to identify a family of statistical models that are dynamic in nature, along with probabilities across models that are necessarily subjective because they are not revealed by data.

We illustrate how we can use the ideas in this “static” chapter to study a macro-investment problem with parameter uncertainty.

Consider an example of a real investment problem with a single stochastic option for transferring goods from one period to another. This problem could be a planner’s problem supporting a competitive equilibrium outcome associated with a stochastic growth model with a single capital good. Introduce an exogenous stochastic technology process that has an impact on the growth rate of capital as an example of what we call a structured model. This stochastic technology process captures what a previous literature in macro-finance has referred to as “long-run risk.” For instance, see [Bansal and Yaron, 2004].[8]

We extend this formulation by introducing an unknown parameter θ used to index members of a parameterized family of stochastic technology processes. The investor’s exploration of the entire family of these processes reflects uncertainty among possible structured models. We also allow the investor to entertain misspecification concerns over the parameterized models of the stochastic technology.

The exogenous (system) state vector $Z_t$, used to capture fluctuations in technological opportunities, has realizations in $Z$, and the shock vector $W_t$ has realizations in $W$. We build the exogenous technology process from the shocks in a parameter-dependent way:

$$Z_{t+1} = \psi(Z_t, W_{t+1}, \theta)$$

for a given initial condition $Z_0$.

For instance, in long-run risk modeling one component of $Z_t$ evolves as a first-order autoregression:

$$Z_{t+1}^1 = a_\theta Z_t^1 + b_\theta^1 \cdot W_{t+1},$$

and another component is given by:

$$Z_{t+1}^2 = d_\theta + c_\theta^2 \cdot W_{t+1}.$$

At each time $t$ the investor observes past and current values $Z^t = \{Z_0, Z_1, \ldots, Z_t\}$ of the technology process, but does not know $\theta$ and does not directly observe the random shock vector $W_t$.

Similarly, we consider a recursive representation of capital evolution:

$$K_{t+1} = K_t\, \varphi(I_t / K_t, Z_{t+1}),$$

where consumption $C_t \ge 0$ and investment $I_t \ge 0$ are constrained by an output relation:

$$C_t + I_t = \alpha K_t$$

for a pre-specified initial condition $K_0$. The parameter $\alpha$ captures the productivity of capital. By design this technology is homogeneous of degree one, which opens the door to stochastic growth, as assumed in long-run risk models.

Both $I_t$ and $C_t$ are constrained to be functions of $Z^t$ at each date $t$, reflecting the observational constraint that $\theta$ is unknown to the investor, in contrast to the history $Z^t$ of the technology process.[9] Preferences are defined over consumption processes.

In this intertemporal setting, we consider an investor who solves a date 0 commitment problem. We pose this as a static problem with consumptions and investments that depend on information as it gets realized.[10] Form the risk function

$$(1 - \beta)\, E\left[ \sum_{t=0}^\infty \beta^t\, \upsilon(C_t) \,\Big|\, K_0, Z_0, \theta \right].$$

While the initial conditions $K_0$ and $Z_0$ are known, the parameter vector $\theta$ is not.

We include two divergences: one for the parameter $\theta$ and the other for the potential misspecification of the dynamics as reflected in the shock distributions. The latter divergence will necessarily be dynamic in nature. We will use positive random variables with unit expectations as a way to introduce changes in probabilities. Introduce a positive martingale $\{M_t : t \ge 0\}$ for an altered probability. Let $M_0 = 1$ and let $M_t$ depend on state variables and shocks up to period $t$, along with $\theta$. We use the random variable $M_t$ to alter date $t$ probabilities conditioned on $(K_0, Z_0, \theta)$. The martingale property ensures that the altered probabilities implied by $M_{t+1}$ agree with the altered probabilities implied by $M_t$, as an implication of the Law of Iterated Expectations. Let the intertemporal divergence be:

$$(1 - \beta)\, E\left( \sum_{t=0}^\infty \beta^t\, M_{t+1} \log M_{t+1} \,\Big|\, K_0, Z_0, \theta \right),$$

scaled by a penalty parameter $\xi_2$. We purposefully discount the relative entropies in the same way that we discount the utility function, and the computations condition on $\theta$. We then use a second divergence over the parameter vector $\theta$, with a penalty parameter $\xi_1$.

In this model, the investor or planner will actively learn about $\theta$. The potential model misspecifications are not linked over time and are presumed not to be learnable. This formulation embeds a preference for both prior and likelihood robustness. Unfortunately, it does not have a convenient recursive formulation, which makes it challenging to solve.

11.11. Recursive counterparts

We comment briefly on recursive counterparts. We have seen in the previous chapter how to perform recursive learning and filtering. Positive martingales also have a convenient recursive structure. Write:

$$\log M_{t+1} = \sum_{j=0}^{t} \left( \log M_{j+1} - \log M_j \right),$$

where $M_0$ is initialized to be one. By the Law of Iterated Expectations:

$$\begin{aligned} E\left( M_{t+1} \log M_{t+1} \mid \mathcal A_0 \right) &= E\left[ M_{t+1} \sum_{j=0}^{t} \left( \log M_{j+1} - \log M_j \right) \Big| \mathcal A_0 \right] \\ &= E\left[ \sum_{j=0}^{t} M_{j+1} \left( \log M_{j+1} - \log M_j \right) \Big| \mathcal A_0 \right] \\ &= E\left[ \sum_{j=0}^{t} M_j \left( \frac{M_{j+1}}{M_j} \right) \left( \log M_{j+1} - \log M_j \right) \Big| \mathcal A_0 \right] \\ &= E\left( \sum_{j=0}^{t} M_j\, E\left[ \left( \frac{M_{j+1}}{M_j} \right) \left( \log M_{j+1} - \log M_j \right) \Big| \mathcal A_j \right] \Big| \mathcal A_0 \right). \end{aligned} \tag{11.19}$$

Using this calculation and applying “summation by parts” (implemented by changing the order of summation) gives:

$$(1 - \beta)\, E\left( \sum_{t=0}^\infty \beta^t M_{t+1} \log M_{t+1} \mid \mathcal A_0 \right) = (1 - \beta) \sum_{t=0}^\infty \beta^t\, E\left( \sum_{j=0}^{t} M_j\, E\left[ \left( \frac{M_{j+1}}{M_j} \right) \left( \log M_{j+1} - \log M_j \right) \Big| \mathcal A_j \right] \Big| \mathcal A_0 \right) = \sum_{t=0}^\infty \beta^t\, E\left( M_t\, E\left[ \left( \frac{M_{t+1}}{M_t} \right) \left( \log M_{t+1} - \log M_t \right) \Big| \mathcal A_t \right] \Big| \mathcal A_0 \right).$$

In this formula,

$$E\left[ \left( \frac{M_{t+1}}{M_t} \right) \left( \log M_{t+1} - \log M_t \right) \Big| \mathcal A_t \right]$$

is the relative entropy pertinent to the transition probabilities between dates $t$ and $t+1$. With this formula, we form a discounted objective whose date $t$ contribution confronts potential likelihood misspecification:

$$\upsilon(C_t) + \beta\, \xi_2\, E\left[ \left( \frac{M_{t+1}}{M_t} \right) \left( \log M_{t+1} - \log M_t \right) \Big| \mathcal A_t \right],$$

where the date $t$ minimizing choice variable is $M_{t+1}/M_t \ge 0$ subject to $E(M_{t+1}/M_t \mid \mathcal A_t) = 1$. The ratio $M_{t+1}/M_t$ is used when computing the conditional expectation of next period’s continuation value needed to rank current-period actions.
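At a single date, the minimization over $M_{t+1}/M_t$ is the same exponential-tilting calculation encountered in the static settings, applied to the one-step transition distribution. A minimal discrete-state sketch, with placeholder continuation values and transition probabilities:

```python
import numpy as np

# Next-period states, placeholder continuation values, and baseline
# transition probabilities for a single date-t information set.
V_next = np.array([1.0, 0.4, -0.2])
p = np.array([0.3, 0.4, 0.3])
xi2 = 0.5

# Minimizing ratio M_{t+1}/M_t: exponential tilting of transition probabilities.
ratio = np.exp(-V_next / xi2)
ratio /= p @ ratio                 # enforce E[M_{t+1}/M_t | A_t] = 1
tilted = ratio * p                 # worst-case transition probabilities

robust_value = -xi2 * np.log(p @ np.exp(-V_next / xi2))
print("worst-case transition probabilities:", np.round(tilted, 3))
print(f"robust continuation value = {robust_value:.3f}, "
      f"baseline expectation = {p @ V_next:.3f}")
```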

Remark 11.6

Our discounting of relative entropy has important consequences for the exploration of potential misspecification. From (11.19), it follows that the sequence

$$\left\{ E\left( M_t \log M_t \mid \mathcal A_0 \right) : t \ge 0 \right\} \tag{11.20}$$

is increasing in $t$. Observe that

$$\lim_{\beta \uparrow 1}\, (1 - \beta)\, E\left( \sum_{t=0}^\infty \beta^t M_{t+1} \log M_{t+1} \mid \mathcal A_0 \right) \tag{11.21}$$

gives an upper bound on

$$E\left( M_{s+1} \log M_{s+1} \mid \mathcal A_0 \right)$$

for each $s \ge 0$. This follows from monotonicity, since

$$(1 - \beta)\, E\left( \sum_{t=0}^\infty \beta^t M_{t+1} \log M_{t+1} \mid \mathcal A_0 \right) \ge (1 - \beta) \sum_{t=s}^\infty \beta^t\, E\left( M_{s+1} \log M_{s+1} \mid \mathcal A_0 \right) = \beta^s\, E\left( M_{s+1} \log M_{s+1} \mid \mathcal A_0 \right).$$

Taking limits as $\beta \uparrow 1$ gives the bound of interest.

If the discounted limit in (11.21) is finite, then the increasing sequence (11.20) has a finite limit. It follows from a version of the Martingale Convergence Theorem (see [Barron, 1990]) that there is a limiting nonnegative random variable $M_\infty$ such that $E(M_\infty \mid \mathcal A_0) = 1$ and

$$M_t = E\left( M_\infty \mid \mathcal A_t \right).$$

Observe that in the limiting case when $\beta \uparrow 1$ and the resulting relative entropy measure is finite, the altered probability must imply a Law of Large Numbers that agrees with the baseline probability. In this sense, only transient departures from the baseline probability are part of the misspecification exploration. By including discounting in the manner described, we expand the family of alternative probabilities in a substantively important way.

As an alternative calculation, consider a different discount factor scaling:

$$(1 - \beta) \sum_{t=0}^\infty \beta^t\, E\left( M_t\, E\left[ \left( \frac{M_{t+1}}{M_t} \right) \left( \log M_{t+1} - \log M_t \right) \Big| \mathcal A_t \right] \Big| \mathcal A_0 \right).$$

The limiting version of this measure allows for a substantially larger set of alternative probabilities and results in a limiting characterization that is used in Large Deviation Theory as applied to dynamic settings.

Remark 11.7

To explore potential misspecification, [Chen et al., 2020] suggest other divergences with convenient recursive structures. A discounted version of their proposal is

$$(1 - \beta)\, E\left[ \sum_{t=0}^\infty \beta^t\, E\left[ M_t\, \phi_2\!\left( M_{t+1} / M_t \right) \mid \mathcal A_t \right] \Big| \mathcal A_0 \right]$$

for a convex function $\phi_2$ that equals zero when evaluated at one and whose second derivative at one is normalized to be one.

11.12. Implications for uncertainty quantification

Uncertainty quantification is a challenge that pervades many scientific disciplines. The methods we describe here open the door to answering the “so what” aspect of uncertainty measurement. So far, we have deliberately explored examples that are low-dimensional in order to illustrate results. While these are pedagogically revealing, the methods we described have all the more potency in problems with high-dimensional uncertainty. By including minimization as part of the decision problem, we isolate the uncertainties that are of most relevance to the decision or policy problem. This may open the door to incorporating sharper prior inputs or to guiding future efforts aimed at providing additional evidence relevant to the decision-making challenge. Furthermore, there may be multiple channels by which uncertainty can impact the decision problem. As an example, consider an economic analysis of climate change. There is uncertainty in i) the global warming implications of increases in carbon emissions, ii) the impact of global warming on economic opportunities, and iii) the prospects for the discovery of new, clean technologies that are economically viable. A direct extension of the methods developed in this chapter provides a (non-additive) decomposition of these channels of uncertainty: by modifying the penalization, uncertainty in each channel can be activated separately and compared with activating uncertainty in all channels simultaneously. Comparing the outcomes of such computations reveals which channel of uncertainty is most consequential to the structuring of a prudent decision rule.[11]