1. Laws of Large Numbers and Stochastic Processes

1.1. Introduction

We shall interpret economic time series as temporally organized data whose source is a single random draw from a parameterized joint probability distribution. That joint distribution determines intertemporal probabilistic relationships among components of the data. We imagine a statistician who doesn’t know values of the parameters that characterize the joint probability distribution and wants to use those data to infer them. A key ingredient in doing this successfully is somehow to form averages over time of some functions of the data and hope that they converge to something informative about model parameters. Some laws of large numbers (LLNs) can help us with this, but others can’t. This chapter describes one that can.

Classic LLNs that are typically studied in entry level probability classes adopt a narrow perspective that is not applicable to economic time series. To help us appreciate why a plain vanilla LLN won’t help us study time series, we’ll start with a setting in which such a plain vanilla LLN is all that we need. Here the statistician views a data set as a random draw from a particular joint probability distribution that is one among a family of models that he knows. Once again, from those data, our statistician wants to infer which member of the family is most plausible. The family could be finite or it could be represented more generally with an unknown parameter vector indexing alternative models. The parameter vector could be finite or infinite-dimensional. We call each joint probability distribution associated with a particular parameter vector a statistical model.

A plain vanilla LLN assumes that the data are a sequence of independent draws from the same (i.e., identical) probability distribution. This assumption immediately leads to a logical problem when it comes to thinking about learning which model has generated the data. If the parameter vector that pins down the probability distribution is known and draws are independent and identically distributed (IID), then past draws indicate nothing about future draws, meaning that there is nothing to learn. But if the parameters that pin down data generation are not known, then there is something to learn: a sequence of draws from that unknown distribution is not IID because past draws indicate something about probabilities of future draws.

Fig. 1.1 Top: a sequence of draws from a fair coin. Bottom: a sequence of draws from a possibly unfair brass tack.

To illustrate the issue, Fig. 1.1 shows IID draws from two Bernoulli distributions. The top sequence consists of flips of fair coins. Here the probability of a heads is one half for every coin. After observing the outcome of N coin flips, a statistician would predict that the probability that the next flip comes up heads is one half, regardless of earlier outcomes. In this example, the statistician has no reason to use data to learn the “parameter vector of interest”, i.e., the probability of a heads on a single draw of one coin, because he knows it. The bottom sequence is formed from successive tosses of brass tacks. We assume that the statistician does not know the probability that each brass tack will land on its head. The statistician wants to use the observed outcome of N tosses to predict outcomes of future tack tosses. A presumption that observations of past tosses of the tack contain information about future tosses is tantamount to assuming that the sequence is not IID. This is a context in which it is natural to assume that the way brass tacks are constructed is the same (i.e., identical) for all of the brass tacks, and thus that the probabilities of tacks landing on their head are the same for each toss. If we condition on the common probability that a tack lands on its head, it makes sense to view the hypothetical observations as conditionally IID.

To achieve a theory of statistical learning, we can relax the IID assumption and instead assume a property called exchangeability. For a sequence of random draws to be exchangeable, it must be true that the joint probability distribution of any finite segment does not change if we rearrange the positions of individual draws in the sequence. In other words, the order of draws does not matter when forming joint probabilities. If prospective outcomes of tack tosses can plausibly be viewed as being exchangeable, this opens the door to learning about the probability of a heads as we accumulate more observations. Exchangeable sequences are conditionally IID, where we interpret conditioning as being on a statistical model.[1] Exchangeable sequences obey a conditional version of a Law of Large Numbers that we describe later in this chapter.
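The brass-tack logic can be sketched numerically. The example below assumes, purely for illustration, a uniform prior over the unknown probability that a tack lands on its head; that prior and the function names `joint_prob` and `predictive_prob_head` are not from the text. It checks that reordering draws leaves joint probabilities unchanged (exchangeability) while the predictive probability of a head still depends on past draws (so the sequence is not IID).

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def joint_prob(draws):
    """Probability of an exact 0/1 sequence when theta ~ Uniform(0, 1).

    Integrating theta^k (1 - theta)^(n - k) over [0, 1] gives
    k! (n - k)! / (n + 1)!, which depends on the draws only through k.
    """
    n, k = len(draws), sum(draws)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))


def predictive_prob_head(past):
    """P(next draw = 1 | past draws), a ratio of joint probabilities."""
    past = tuple(past)
    return joint_prob(past + (1,)) / joint_prob(past)


seq = (1, 1, 0)
# Exchangeability: every reordering of the draws has the same probability.
exchangeable = len({joint_prob(p) for p in permutations(seq)}) == 1

# Not IID: the predictive probability of a head moves with past outcomes.
p_first = predictive_prob_head(())          # no data yet
p_after_head = predictive_prob_head((1,))   # after observing one head
```

Exact rational arithmetic is used so that the equality checks are not blurred by floating-point rounding. Conditional on a particular value of the head probability, the same draws would be IID.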

LLNs based on exchangeability are too restrictive for economic time series because we are interested in probability models in which the temporal order in which observations arise matters. This motivates us to replace exchangeability with a notion of stationarity and to use an ergodic decomposition theorem for stationary processes. This LLN for stationary processes is conditional on what we call the statistical model. The resulting LLN allows us to study a statistician who considers several alternative statistical models simultaneously and who understands how long-run averages of some salient statistics depend on unknown parameters on which the LLN is conditioned.

A LLN can teach us about statistical challenges that we would face even if we were to have time series of infinite length. While this is a good starting point, we’ll have to do more. In later chapters, we will describe additional approaches to statistical inference that can help us to understand how much we can learn about model parameters from finite histories of data.

1.2. Stochastic Processes

We start with a probability space, namely, a triple $(\Omega, \mathfrak{F}, Pr)$, where $\Omega$ is a set of sample points, $\mathfrak{F}$ is a collection of subsets of $\Omega$ called events (formally a sigma algebra), and $Pr$ assigns probabilities to events. We refer to $Pr$ as a probability measure. The following definition makes reference to Borel sets. Borel sets include open sets, closed sets, finite intersections, and countable unions of such sets.

Definition 1.1

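$X$ is an $n$-dimensional random vector if $X : \Omega \rightarrow \mathbb{R}^n$ has the property that for any Borel set $b$ in $\mathbb{R}^n$,

$$\{ \omega \in \Omega : X(\omega) \in b \}$$

is in $\mathfrak{F}$.

A result from measure theory states that if $\{ X \in o \} \overset{\text{def}}{=} \{ \omega \in \Omega : X(\omega) \in o \}$ is an event in $\mathfrak{F}$ whenever $o$ is an open set in $\mathbb{R}^n$, then $X$ is an $n$-dimensional random vector. In what follows, we will often omit the explicit reference to $\omega$ when it is self-evident.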

This formal structure facilitates using mathematical analysis to formulate problems in probability theory. A random vector induces a probability distribution over the collection of Borel sets in which the probability assigned to set $b$ is given by

$$Pr\{ X \in b \}.$$

By changing the set $b$, we trace out a probability distribution implied by the random vector $X$ that is called the induced distribution. An induced distribution is what typically interests an applied worker. In practice, an induced distribution is just specified directly without constructing the foundations under study here. However, proceeding at a deeper level, as we have, by defining a random vector to be a function that satisfies particular measurability properties and imposing the probability measure $Pr$ over the domain of that function has mathematical payoffs. We will exploit this mathematical formalism in various ways, among them the construction of stochastic processes.

Definition 1.2

An $n$-dimensional stochastic process is an infinite sequence of $n$-dimensional random vectors $\{X_t : t = 0, 1, \ldots\}$.

The measure $Pr$ assigns probabilities to a rich and interesting collection of events. For example, consider a stacked random vector

$$X^{[\ell]}(\omega) \overset{\text{def}}{=} \begin{bmatrix} X_0(\omega) \\ X_1(\omega) \\ \vdots \\ X_\ell(\omega) \end{bmatrix}$$

and Borel sets $b$ in $\mathbb{R}^{n(\ell+1)}$. The joint distribution of $X^{[\ell]}$ induced by $Pr$ over such Borel sets is

$$Pr\{ X^{[\ell]} \in b \}.$$

Since the choice of $\ell$ is arbitrary, $Pr$ implies a distribution over an entire sequence of random vectors $\{X_t(\omega) : t = 0, 1, \ldots\}$.[2]

We may also go the other way. Given a probability distribution over infinite sequences of vectors, we can construct a probability space and a stochastic process that induce this distribution. Thus, the following way to construct a probability space is particularly enlightening.

Construction 1.1:

Let $\Omega$ be a collection of infinite sequences in $\mathbb{R}^n$ with a sample point $\omega \in \Omega$ being a sequence of vectors $\omega = (r_0, r_1, \ldots)$, where $r_t \in \mathbb{R}^n$. Let $\mathfrak{B}_t$ be the collection of Borel sets of $\mathbb{R}^{n(t+1)}$, and let $\mathfrak{F}$ be the smallest sigma-algebra that contains the Borel sets of $\mathbb{R}^{n(\ell+1)}$ for $\ell = 0, 1, 2, \ldots$.

For each integer $\ell \ge 0$, let $Pr_\ell$ assign probabilities to the Borel sets of $\mathbb{R}^{n(\ell+1)}$. A Borel set in $\mathbb{R}^{n(\ell+1)}$ can also be viewed as a Borel set in $\mathbb{R}^{n(\ell+2)}$ with $r_{\ell+1}$ left unrestricted. Specifically, let $b_\ell$ be a Borel set in $\mathbb{R}^{n(\ell+1)}$. Then

$$b_{\ell+1} = \{ (r_0, r_1, \ldots, r_\ell, r_{\ell+1}) : (r_0, r_1, \ldots, r_\ell) \in b_\ell \}$$

is a Borel set in $\mathbb{R}^{n(\ell+2)}$. For probability measures $\{Pr_\ell : \ell = 0, 1, \ldots\}$ to be consistent, we require that the probability assigned by $Pr_{\ell+1}$ satisfy

$$Pr_\ell(b_\ell) = Pr_{\ell+1}(b_{\ell+1}).$$

With this restriction, we can extend the probabilities $\{Pr_\ell\}$ to a probability $Pr$ on the space $(\Omega, \mathfrak{F})$ that is itself consistent with the probability assigned by $Pr_\ell$ for all nonnegative integers $\ell$.[3]

Finally, we construct the stochastic process $\{X_t : t = 0, 1, \ldots\}$ by letting

$$X_t(\omega) = r_t$$

for $t = 0, 1, 2, \ldots$. A convenient feature of this construction is that $Pr_\ell$ is the probability induced by the random vector $[X_0, X_1, \ldots, X_\ell]$.

We refer to this construction as canonical. While this is only one among other possible constructions of probability spaces, it illustrates the flexibility in building sequences of random vectors that induce alternative probabilities of interest.

The remainder of this chapter is devoted to studying Laws of Large Numbers. What is perhaps the most familiar Law of Large Numbers presumes that the stochastic process $\{X_t : t = 0, 1, \ldots\}$ is IID. Then

$$\frac{1}{N} \sum_{t=1}^N \phi(X_t) \rightarrow E\phi(X_0)$$

for any (Borel measurable) function ϕ for which the expectation is well defined. Convergence holds in several senses that we state later. Notice that as we vary the function ϕ we can infer the (induced) probability distribution for X0. In this sense, the outcome of the Law of Large Numbers under an IID sequence determines what we will call a statistical model.

For our purposes, an IID version of the Law of Large Numbers is too restrictive. First, we are interested in economic dynamics in which model outcomes are temporally dependent. Second, we want to put ourselves in the situation of a statistician who does not know a priori what the underlying data generating process is and therefore entertains multiple models. We will present a Law of Large Numbers that covers both settings.

1.3. Representing a Stochastic Process

We now generalize the canonical construction 1.1 of a stochastic process in a way that facilitates stating the Law of Large Numbers that interests us.

Fig. 1.2 The evolution of a sample point $\omega$ induced by successive applications of the transformation $\mathbb{S}$. The oval shaped region is the collection $\Omega$ of all sample points.

Fig. 1.3 An inverse image $\mathbb{S}^{-1}(\Lambda)$ of an event $\Lambda$ is itself an event; $\omega \in \mathbb{S}^{-1}(\Lambda)$ implies that $\mathbb{S}(\omega) \in \Lambda$.

We use two objects.[4]

The first is a (measurable) transformation $\mathbb{S} : \Omega \rightarrow \Omega$ that describes the evolution of a sample point $\omega$. See Fig. 1.2. Transformation $\mathbb{S}$ has the property that for any event $\Lambda \in \mathfrak{F}$,

$$\mathbb{S}^{-1}(\Lambda) = \{ \omega \in \Omega : \mathbb{S}(\omega) \in \Lambda \}$$

is an event in $\mathfrak{F}$, as depicted in Fig. 1.3. The second object is an $n$-dimensional vector $X(\omega)$ that describes how observations depend on sample point $\omega$.

We construct a stochastic process $\{X_t : t = 0, 1, \ldots\}$ via the formula:

$$X_t(\omega) = X[\mathbb{S}^t(\omega)]$$

or

$$X_t = X \circ \mathbb{S}^t,$$

where we interpret $\mathbb{S}^0$ as the identity mapping asserting that $\omega_0 = \omega$.

Because a known function $\mathbb{S}$ maps a sample point $\omega \in \Omega$ today into a sample point $\mathbb{S}(\omega) \in \Omega$ tomorrow, the evolution of sample points is deterministic. For instance, $\omega_{t+j} = \mathbb{S}^{t+j}(\omega)$ for all $j \ge 1$ can be predicted perfectly if we know $\mathbb{S}$ and $\omega_t$. But we typically do not observe $\omega_t$ at any $t$. Instead, we observe an $(n \times 1)$ vector $X(\omega)$ that contains incomplete information about $\omega$. We assign probabilities $Pr$ to collections of sample points $\omega$ called events, then use the functions $\mathbb{S}$ and $X$ to induce a joint probability distribution over sequences of $X$’s. The resulting stochastic process $\{X_t : t = 0, 1, 2, \ldots\}$ is a sequence of $n$-dimensional random vectors.

This way of constructing a stochastic process might seem restrictive; but actually, it is more general than the canonical construction presented above.

Example 1.1

Consider again our canonical construction 1.1. Recall that the set of sample points $\Omega$ is the collection of infinite sequences of elements $r_t \in \mathbb{R}^n$ so that $\omega = (r_0, r_1, \ldots)$. For this example, $\mathbb{S}(\omega) = (r_1, r_2, \ldots)$. This choice of $\mathbb{S}$ is called the shift transformation. Notice that the time $t$ iterate is

$$\mathbb{S}^t(\omega) = (r_t, r_{t+1}, \ldots).$$

Let the measurement function be $X(\omega) = r_0$ so that

$$X_t(\omega) = X[\mathbb{S}^t(\omega)] = r_t$$

as posited in construction 1.1.
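The shift construction in Example 1.1 can be sketched in a few lines. Here a sample point is represented, as an illustrative assumption, by a function that returns the element r_t for each index t, so an infinite sequence never has to be stored; the names `shift`, `measure`, and `x_t` are hypothetical.

```python
def shift(omega):
    """Shift transformation S: drop the first element of the sequence."""
    return lambda t: omega(t + 1)


def measure(omega):
    """Measurement function X: report the first element of the sequence."""
    return omega(0)


def x_t(omega, t):
    """X_t(omega) = X[S^t(omega)], obtained by applying the shift t times."""
    for _ in range(t):
        omega = shift(omega)
    return measure(omega)


# Hypothetical sample point: the sequence r_t = t^2 (scalar case, n = 1).
omega = lambda t: t * t
first_values = [x_t(omega, t) for t in range(5)]
```

Applying the shift t times and then measuring returns r_t, which is the content of the formula for X_t in the example.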

1.4. Stationary Stochastic Processes

We start with a probabilistic notion of invariance. We call a stochastic process stationary if for any finite integer $\ell$, the joint probability distribution induced by the composite random vector $[X_t, X_{t+1}, \ldots, X_{t+\ell}]$ is the same for all $t \ge 0$.[5] This notion of stationarity can be thought of as a stochastic version of a steady state of a dynamical system.

We now use the objects $(\mathbb{S}, X)$ to build a stationary stochastic process by restricting construction 1.1.
Consider the set $\{ \omega \in \Omega : X(\omega) \in b \} \overset{\text{def}}{=} \Lambda$ and its successors

$$\begin{aligned} \{ \omega \in \Omega : X_1(\omega) \in b \} &= \{ \omega \in \Omega : X[\mathbb{S}(\omega)] \in b \} = \mathbb{S}^{-1}(\Lambda) \\ \{ \omega \in \Omega : X_t(\omega) \in b \} &= \{ \omega \in \Omega : X[\mathbb{S}^t(\omega)] \in b \} = \mathbb{S}^{-t}(\Lambda). \end{aligned}$$

Evidently, if $Pr(\Lambda) = Pr[\mathbb{S}^{-1}(\Lambda)]$ for all $\Lambda \in \mathfrak{F}$, then the probability distribution induced by $X_t$ equals the probability distribution of $X$ for all $t$. This fact motivates the following definition and proposition.

Definition 1.3

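The pair $(\mathbb{S}, Pr)$ is said to be measure-preserving if

$$Pr(\Lambda) = Pr\{ \mathbb{S}^{-1}(\Lambda) \}$$

for all $\Lambda \in \mathfrak{F}$.

Theorem 1.1

When $(\mathbb{S}, Pr)$ is measure-preserving, probability distributions induced by the random vectors $X_t \overset{\text{def}}{=} X[\mathbb{S}^t(\omega)]$ are identical for all $t \ge 0$.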

The measure-preserving property restricts the probability measure Pr for a given transformation S. Some probability measures Pr used in conjunction with S will be measure-preserving and others not, a fact that will play an important role at several places below.

Suppose that $\mathbb{S}$ is measure-preserving relative to probability measure $Pr$. Given $X$ and an integer $\ell > 1$, form a vector

$$X^{[\ell]}(\omega) \overset{\text{def}}{=} \begin{bmatrix} X_0(\omega) \\ X_1(\omega) \\ \vdots \\ X_\ell(\omega) \end{bmatrix}.$$

We can apply Theorem 1.1 to $X^{[\ell]}$ to conclude that the joint distribution function of $(X_t, X_{t+1}, \ldots, X_{t+\ell})$ is independent of $t$ for $t = 0, 1, \ldots$. That this property holds for any choice of $\ell$ implies that the stochastic process $\{X_t : t = 0, 1, \ldots\}$ is stationary. Moreover, $f(X^{[\ell]})$, where $f$ is a Borel measurable function from $\mathbb{R}^{n(\ell+1)}$ into $\mathbb{R}$, is also a valid measurement function. Such $f$’s include indicator functions of interesting events defined in terms of $X^{[\ell]}$.

For a given S, we now present examples that illustrate how to construct a probability measure Pr that makes S measure-preserving and thereby brings stationarity.

Example 1.2

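Suppose that $\Omega$ contains two points, $\Omega = \{\omega_1, \omega_2\}$. Consider a transformation $\mathbb{S}$ that maps $\omega_1$ into $\omega_2$ and $\omega_2$ into $\omega_1$: $\mathbb{S}(\omega_1) = \omega_2$ and $\mathbb{S}(\omega_2) = \omega_1$. Since $\mathbb{S}^{-1}(\{\omega_2\}) = \{\omega_1\}$ and $\mathbb{S}^{-1}(\{\omega_1\}) = \{\omega_2\}$, for $\mathbb{S}$ to be measure-preserving, we must have $Pr(\{\omega_1\}) = Pr(\{\omega_2\}) = 1/2$.

Example 1.3

Suppose that $\Omega$ contains two points, $\Omega = \{\omega_1, \omega_2\}$, and that $\mathbb{S}(\omega_1) = \omega_1$ and $\mathbb{S}(\omega_2) = \omega_2$. Since $\mathbb{S}^{-1}(\{\omega_2\}) = \{\omega_2\}$ and $\mathbb{S}^{-1}(\{\omega_1\}) = \{\omega_1\}$, $\mathbb{S}$ is measure-preserving for any $Pr$ that satisfies $Pr(\{\omega_1\}) \ge 0$ and $Pr(\{\omega_2\}) = 1 - Pr(\{\omega_1\})$.

Example 1.4

Suppose that $\Omega$ contains four points, $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$. Moreover, $\mathbb{S}(\omega_1) = \omega_2$, $\mathbb{S}(\omega_2) = \omega_1$, $\mathbb{S}(\omega_3) = \omega_1$, and $\mathbb{S}(\omega_4) = \omega_2$. Notice that the sample points $\omega_3$ and $\omega_4$ are transient. Applying the $\mathbb{S}$ transformation does not allow for access to these points as they are not in the image of $\mathbb{S}$. As a consequence, the only measure-preserving probability is the same one described in Example 1.2.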
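Examples 1.2 through 1.4 can be checked by brute force on a finite sample space. The sketch below stores a transformation as a dictionary and tests the measure-preserving condition on every subset; the helper names and the specific probability vectors tried are illustrative assumptions rather than part of the text.

```python
from fractions import Fraction
from itertools import chain, combinations


def preimage(S, event):
    """S^{-1}(event) for a transformation S stored as a dict on a finite set."""
    return frozenset(w for w in S if S[w] in event)


def is_measure_preserving(S, pr):
    """Check Pr(event) == Pr(S^{-1}(event)) for every subset of a finite Omega."""
    points = list(S)
    events = chain.from_iterable(
        combinations(points, k) for k in range(len(points) + 1)
    )
    return all(
        sum(pr[w] for w in ev) == sum(pr[w] for w in preimage(S, ev))
        for ev in events
    )


half, quarter = Fraction(1, 2), Fraction(1, 4)
swap = {"w1": "w2", "w2": "w1"}                                # Example 1.2
identity = {"w1": "w1", "w2": "w2"}                            # Example 1.3
transient = {"w1": "w2", "w2": "w1", "w3": "w1", "w4": "w2"}   # Example 1.4

ex12_uniform = is_measure_preserving(swap, {"w1": half, "w2": half})
ex12_skewed = is_measure_preserving(swap, {"w1": quarter, "w2": 3 * quarter})
ex13_skewed = is_measure_preserving(identity, {"w1": quarter, "w2": 3 * quarter})
ex14_uniform = is_measure_preserving(
    transient, {"w1": quarter, "w2": quarter, "w3": quarter, "w4": quarter}
)
ex14_no_transient_mass = is_measure_preserving(
    transient, {"w1": half, "w2": half, "w3": Fraction(0), "w4": Fraction(0)}
)
```

Only the equal-weight measure passes for the swap, any measure passes for the identity, and in the four-point case a measure passes only when it puts no mass on the transient points.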

The next example illustrates how to represent an i.i.d. sequence of zeros and ones in terms of an Ω,Pr and an S.

Example 1.5

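Suppose that $\Omega = [0, 1)$ and that $Pr$ is the uniform measure on $[0, 1)$. Let

$$\mathbb{S}(\omega) = \begin{cases} 2\omega & \text{if } \omega \in [0, 1/2) \\ 2\omega - 1 & \text{if } \omega \in [1/2, 1), \end{cases} \qquad X(\omega) = \begin{cases} 1 & \text{if } \omega \in [0, 1/2) \\ 0 & \text{if } \omega \in [1/2, 1). \end{cases}$$

Calculate $Pr\{X_1 = 1 \mid X_0 = 1\} = Pr\{X_1 = 1 \mid X_0 = 0\} = Pr\{X_1 = 1\} = 1/2$ and $Pr\{X_1 = 0 \mid X_0 = 1\} = Pr\{X_1 = 0 \mid X_0 = 0\} = Pr\{X_1 = 0\} = 1/2$. So $X_1$ is statistically independent of $X_0$. By extending these calculations, it can be verified that $\{X_t : t = 0, 1, \ldots\}$ is a sequence of independent random variables.[6] We can alter $Pr$ to obtain other stationary distributions. For instance, suppose that $Pr\{\tfrac{1}{3}\} = Pr\{\tfrac{2}{3}\} = .5$. Then the process $\{X_t : t = 0, 1, \ldots\}$ alternates in a deterministic fashion between zero and one. This provides a version of Example 1.2 in which $\omega_1 = \tfrac{1}{3}$ and $\omega_2 = \tfrac{2}{3}$.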
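A small sketch of Example 1.5 follows. It uses exact rational arithmetic (an implementation choice made here, not something the text prescribes) because repeated doubling in binary floating point would eventually discard all the digits of the starting point. The function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def S(omega):
    """Doubling map on [0, 1): omega -> 2 * omega mod 1."""
    return 2 * omega if omega < HALF else 2 * omega - 1


def X(omega):
    """Measurement: 1 on [0, 1/2), 0 on [1/2, 1)."""
    return 1 if omega < HALF else 0


def path(omega, T):
    """Return [X_0, ..., X_{T-1}] where X_t = X[S^t(omega)]."""
    out = []
    for _ in range(T):
        out.append(X(omega))
        omega = S(omega)
    return out


# Starting from 1/3 the sample point cycles 1/3 -> 2/3 -> 1/3 -> ...
alternating = path(Fraction(1, 3), 6)

# 5/8 is 0.101 in binary; X_t equals one minus the t-th binary digit.
from_binary = path(Fraction(5, 8), 4)
```

The second path illustrates why the uniform measure delivers independence: each application of the map reveals the next binary digit of the starting point, and under the uniform measure those digits behave like fair coin flips.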

1.5. Invariant Events and Conditional Expectations

In this section, we present a Law of Large Numbers that asserts that time series averages converge when S is measure-preserving relative to Pr.

1.5.1. Invariant events

We use the concept of an invariant event to understand how limit points of time series averages relate to a conditional mathematical expectation.

Fig. 1.4 Two invariant events $\Lambda_1$ and $\Lambda_2$ and an event $\Lambda_3$ that is not invariant.

Definition 1.4

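An event $\Lambda$ is invariant if $\Lambda = \mathbb{S}^{-1}(\Lambda)$.

Fig. 1.4 illustrates two invariant events in a space $\Omega$. Notice that if $\Lambda$ is an invariant event and $\omega \in \Lambda$, then $\mathbb{S}^t(\omega) \in \Lambda$ for $t = 0, 1, \ldots$. Thus under the transformation $\mathbb{S}$, sample points that are in $\Lambda$ remain there. Furthermore, if $\omega \in \Lambda$ and $\omega = \mathbb{S}(\omega')$ for some sample point $\omega'$, then $\omega' \in \Lambda$ as well.

Let $\mathfrak{I}$ denote the collection of invariant events. The entire space $\Omega$ and the null set $\emptyset$ are both invariant events. Like $\mathfrak{F}$, the collection of invariant events $\mathfrak{I}$ is a sigma algebra.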

1.5.2. Conditional expectation

Fig. 1.5 A conditional expectation $E(X \mid \mathfrak{I})$ is constant for $\omega \in \Lambda_j = \mathbb{S}^{-1}(\Lambda_j)$.

We want to construct a random vector E(X|I) called the “mathematical expectation of X conditional on the collection I of invariant events”. We begin with a situation in which a conditional expectation is a discrete random vector as occurs when invariant events are unions of sets Λj belonging to a countable partition of Ω (together with the empty set). Later we’ll extend the definition beyond this special setting.

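A countable partition consists of a countable collection of nonempty events $\Lambda_j$ such that $\Lambda_j \cap \Lambda_k = \emptyset$ for $j \ne k$ and such that the union of all $\Lambda_j$ is $\Omega$. Assume that each set $\Lambda_j$ in the partition is itself an invariant event and has positive probability. Define the mathematical expectation conditioned on event $\Lambda_j$ as

$$\frac{\int_{\Lambda_j} X \, dPr}{Pr(\Lambda_j)}$$

when $\omega \in \Lambda_j$. To extend the definition of conditional expectation to all of $\mathfrak{I}$, take

$$E(X \mid \mathfrak{I})(\omega) = \frac{\int_{\Lambda_j} X \, dPr}{Pr(\Lambda_j)} \quad \text{if } \omega \in \Lambda_j.$$

Thus, the conditional expectation $E(X \mid \mathfrak{I})$ is constant for $\omega \in \Lambda_j$ but varies across $\Lambda_j$’s. Fig. 1.5 illustrates this characterization for a finite partition.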
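For a finite sample space the formula above reduces to probability-weighted cell averages. The following sketch assumes a hypothetical four-point space with two invariant cells; the numbers and the function name `conditional_expectation` are made up for illustration.

```python
from fractions import Fraction


def conditional_expectation(X, pr, partition):
    """E(X | I) when invariant events are unions of the cells in `partition`.

    X and pr are dicts keyed by sample point; each cell must have positive
    probability. Returns a dict giving the value of E(X | I) at each point.
    """
    out = {}
    for cell in partition:
        mass = sum(pr[w] for w in cell)
        cell_mean = sum(X[w] * pr[w] for w in cell) / mass
        for w in cell:
            out[w] = cell_mean      # constant within the cell
    return out


# Hypothetical example: cells {a, b} and {c, d} are the invariant events.
pr = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 3), "d": Fraction(1, 6)}
X = {"a": 0, "b": 2, "c": 3, "d": 6}
partition = [("a", "b"), ("c", "d")]

cond = conditional_expectation(X, pr, partition)
mean_of_cond = sum(cond[w] * pr[w] for w in pr)   # E[E(X | I)]
mean_of_X = sum(X[w] * pr[w] for w in pr)         # E[X]
```

The last two lines check the law of iterated expectations, which holds here because averaging the cell means with the cell probabilities reproduces the unconditional mean.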

1.5.3. Least Squares

Now let $X$ be a random vector with finite second moments $E XX' = \int X(\omega) X(\omega)' \, dPr(\omega)$. When a random vector $X$ has finite second moments, a conditional expectation is a least squares projection. Let $Z$ be an $n$-dimensional measurement function that is time-invariant and so satisfies

$$Z_t(\omega) = Z[\mathbb{S}^t(\omega)] = Z(\omega).$$

Let $\mathcal{Z}$ denote the collection of all such time-invariant random vectors. In the special case in which the invariant events can be constructed from a finite partition, $Z$ can vary across sets $\Lambda_j$ but must remain constant within $\Lambda_j$.[7] Consider the least squares problem

$$\min_{Z \in \mathcal{Z}} E\left[ |X - Z|^2 \right]. \tag{1.1}$$

Denote the minimizer in problem (1.1) by $\widetilde{X} = E(X \mid \mathfrak{I})$. Necessary conditions for the least squares minimizer $\widetilde{X} \in \mathcal{Z}$ imply that

$$E\left[ (X - \widetilde{X}) Z' \right] = 0$$

for $Z$ in $\mathcal{Z}$ so that each entry of the vector $X - \widetilde{X}$ of regression errors is orthogonal to every vector $Z$ in $\mathcal{Z}$.

A measure-theoretic approach constructs a conditional expectation by extending the orthogonality property of least squares. Provided that $E|X| < \infty$, $E(X \mid \mathfrak{I})(\omega)$ is the essentially unique random vector that, for any invariant event $\Lambda$, satisfies

$$E\left( [X - E(X \mid \mathfrak{I})] \mathbf{1}_\Lambda \right) = 0,$$

where $\mathbf{1}_\Lambda$ is the indicator function that is equal to one on the set $\Lambda$ and zero otherwise.

1.6. Law of Large Numbers

An elementary Law of Large Numbers asserts that the limit of an average over time of a sequence of independent and identically distributed random vectors equals the unconditional expectation of the random vector. We want a more general Law of Large Numbers that applies to averages over time of sequences of observations that are intertemporally dependent. To do this, we use a notion of probabilistic invariance that is expressed in terms of the measure-preserving restriction and that implies a Law of Large Numbers applicable to stochastic processes.

The following theorem asserts two senses in which averages of intertemporally dependent processes converge to mathematical expectations conditioned on invariant events.

Theorem 1.2

(Birkhoff) Suppose that S is measure-preserving relative to the probability space (Ω,F,Pr).[8]

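  1. For any $X$ such that $E|X| < \infty$,

$$\frac{1}{N} \sum_{t=1}^N X_t(\omega) \rightarrow E(X \mid \mathfrak{I})(\omega)$$

with probability one;

  2. For any $X$ such that $E|X|^2 < \infty$,

$$E\left[ \left| \frac{1}{N} \sum_{t=1}^N X_t - E(X \mid \mathfrak{I}) \right|^2 \right] \rightarrow 0.$$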

Part 1) asserts almost-sure convergence; part 2) asserts mean-square convergence.

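We have ample flexibility to specify a measurement function $\phi(X^{[\ell]})$, where $\phi$ is a Borel measurable function from $\mathbb{R}^{n(\ell+1)}$ into $\mathbb{R}$. In particular, an indicator function for event $\Lambda = \{X^{[\ell]} \in b\}$ can be used as a measurement function where

$$\phi(x) = \mathbf{1}_b(x) = \begin{cases} 1 & \text{if } x \in b \\ 0 & \text{if } x \notin b, \end{cases}$$

and $x$ is a hypothetical realization of the random vector $X^{[\ell]}$. The Law of Large Numbers applies to limits of

$$\frac{1}{N} \sum_{t=1}^N \phi\left[ X_t^{[\ell]} \right],$$

where $X_t^{[\ell]} = X^{[\ell]} \circ \mathbb{S}^t$ stacks $X_t, X_{t+1}, \ldots, X_{t+\ell}$, for alternative $\phi$’s, so choosing $\phi$’s to be indicator functions shows how the Law of Large Numbers uncovers event probabilities of interest.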

Definition 1.5

A transformation S that is measure-preserving relative to Pr is said to be ergodic under probability measure Pr if all invariant events have probability zero or one.

Thus, when a transformation S is ergodic under measure Pr, the invariant events have either the same probability measure as the entire sample space Ω (whose probability measure is one), or the same probability measure as the empty set (whose probability measure is zero).

Proposition 1.1

Suppose that the measure-preserving transformation S is ergodic under measure Pr. Then E(X|I)=E(X).

Theorem 1.2 describes conditions for convergence in the general case that S is measure-preserving under Pr, but in which S is not necessarily ergodic under Pr. Proposition 1.1 describes a situation in which probabilities assigned to invariant events are degenerate in the sense that all invariant events have the same probability as either Ω (probability one) or the null set (probability zero). When S is ergodic under measure Pr, limit points of time series averages equal corresponding unconditional expectations, an outcome we can call a standard Law of Large Numbers. When S is not ergodic under Pr, limit points of time series averages equal expectations conditioned on invariant events.

The following examples remind us how ergodicity restricts S and Pr.

Example 1.6

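Consider Example 1.2 again. $\Omega$ contains two points and $\mathbb{S}$ maps $\omega_1$ into $\omega_2$ and $\omega_2$ into $\omega_1$: $\mathbb{S}(\omega_1) = \omega_2$ and $\mathbb{S}(\omega_2) = \omega_1$. Suppose that the measurement vector is

$$X(\omega) = \begin{cases} 1 & \text{if } \omega = \omega_1 \\ 0 & \text{if } \omega = \omega_2. \end{cases}$$

Then it follows directly from the specification of $\mathbb{S}$ that

$$\frac{1}{N} \sum_{t=1}^N X_t(\omega) \rightarrow \frac{1}{2}$$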

for both values of ω. The limit point is the average across sample points.

Example 1.7

Return to Example 1.3. $\Omega$ contains two points, $\Omega = \{\omega_1, \omega_2\}$, with $\mathbb{S}(\omega_1) = \omega_1$ and $\mathbb{S}(\omega_2) = \omega_2$. Here $X_t(\omega) = X(\omega)$, so the sequence is time invariant and equal to its time-series average. A time-series average of $X_t(\omega)$ equals the average across sample points only when $Pr$ assigns probability 1 to either $\omega_1$ or $\omega_2$.
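The contrast between Examples 1.6 and 1.7 can be seen by computing time averages directly. The sketch below stores each transformation as a dictionary and uses the measurement from Example 1.6 in both cases (an assumption for Example 1.7, where the text leaves the measurement unspecified); the function name `time_average` and the sample length are illustrative.

```python
from fractions import Fraction


def time_average(S, X, omega, N):
    """(1/N) * sum_{t=1}^{N} X[S^t(omega)] for a finite-state transformation S."""
    total = 0
    for _ in range(N):
        omega = S[omega]      # advance to S^t(omega) before measuring
        total += X[omega]
    return Fraction(total, N)


X = {"w1": 1, "w2": 0}
swap = {"w1": "w2", "w2": "w1"}        # Example 1.6
identity = {"w1": "w1", "w2": "w2"}    # Example 1.7

avg_swap = {w: time_average(swap, X, w, 1000) for w in X}
avg_identity = {w: time_average(identity, X, w, 1000) for w in X}
```

Under the swap, both starting points give the same time average, which matches the unconditional mean under the equal-weight measure. Under the identity, the time average depends on the starting point, so it equals an expectation conditioned on the invariant event that contains that point.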

1.7. Limiting Empirical Measures

Given a triple $(\Omega, \mathfrak{F}, Pr)$ and a measure-preserving transformation $\mathbb{S}$, we can use Theorem 1.2 to construct limiting empirical measures on $\mathfrak{F}$. To start, we will analyze a setting with a countable partition of $\Omega$ consisting of invariant events $\{\Lambda_j : j = 1, 2, \ldots\}$, each of which has strictly positive probability under $Pr$. With the exception of the null set, we assume that all invariant events are unions of the members of this partition. We consider a more general setting later. Given an event $\Lambda$ in $\mathfrak{F}$ and for almost all $\omega \in \Lambda_j$, define the limiting empirical measure $Qr_j$ as

$$Qr_j(\Lambda)(\omega) = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{t=1}^N \mathbf{1}_\Lambda\left[ \mathbb{S}^t(\omega) \right] = \frac{Pr(\Lambda \cap \Lambda_j)}{Pr(\Lambda_j)}. \tag{1.2}$$

Thus, when $\omega \in \Lambda_j$, $Qr_j(\Lambda)$ is the fraction of time $\mathbb{S}^t(\omega) \in \Lambda$ in very long samples. If we hold $\Lambda_j$ fixed and let $\Lambda$ be an arbitrary event in $\mathfrak{F}$, we can treat $Qr_j$ as a probability measure on $(\Omega, \mathfrak{F})$. By doing this for each $\Lambda_j$, $j = 1, 2, \ldots$, we can construct a countable set of probability measures $\{Qr_j\}_{j=1}^\infty$. These comprise the set of all measures that can be recovered by applying the Law of Large Numbers. If nature draws an $\omega \in \Lambda_j$, then measure $Qr_j$ describes outcomes.

So far, we started with a probability measure $Pr$ and then constructed the set of possible limiting empirical measures $Qr_j$’s. We now reverse the direction of the logic by starting with probability measures $Qr_j$ and then finding measures $Pr$ that are consistent with them. We do this because $Qr_j$’s are the only measures that long time series can disclose through the Law of Large Numbers: each $Qr_j$ defined by (1.2) uses the Law of Large Numbers to assign probabilities to events $\Lambda \in \mathfrak{F}$. However, because

$$Qr_j(\Lambda) = Pr(\Lambda \mid \Lambda_j) = \frac{Pr(\Lambda \cap \Lambda_j)}{Pr(\Lambda_j)} \quad \text{for } j = 1, 2, \ldots,$$

are conditional probabilities, such Qrj’s are silent about the probabilities Pr(Λj) of the underlying invariant events Λj. There are multiple ways to assign probabilities Pr that imply identical probabilities conditioned on invariant events.

Because Qrj is all that can ever be learned by “letting the data speak”, we regard each probability measure Qrj as a statistical model.[9]

Proposition 1.2

A statistical model is a probability measure that a Law of Large Numbers can disclose.

Probability measure Qrj describes a statistical model associated with invariant set Λj.

Remark 1.1

For each j, S is measure-preserving and ergodic on (Ω,F,Qrj).
The second equality of definition (1.2) assures ergodicity by assigning probability one to the event Λj.

Relation (1.2) implies that probability $Pr$ connects to probabilities $Qr_j$ by

$$Pr(\Lambda) = \sum_j Qr_j(\Lambda) Pr(\Lambda_j). \tag{1.3}$$

While decomposition (1.3) follows from definitions of the elementary objects that comprise a stochastic process and is “just mathematics”, it is interesting because it tells how to construct alternative probability measures Pr for which S is measure-preserving. Because long data series disclose probabilities conditioned on invariant events to be Qrj, to respect evidence from long time series we must hold the Qrj’s fixed, but we can freely assign probabilities Pr to invariant events Λj. In this way, we can create a family of probability measures for which S is measure-preserving.
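Decomposition (1.3) can be illustrated on a small finite example. The sketch below assumes a hypothetical four-point space in which the transformation swaps points within two invariant cells; the cell models `Qr`, the weights, and the helper names are all made up for illustration. It builds two different mixtures and checks that both are measure-preserving and that both imply the same probabilities conditioned on an invariant cell.

```python
from fractions import Fraction
from itertools import chain, combinations

S = {"w1": "w2", "w2": "w1", "w3": "w4", "w4": "w3"}
half = Fraction(1, 2)
Qr = [
    {"w1": half, "w2": half, "w3": 0, "w4": 0},   # model attached to cell {w1, w2}
    {"w1": 0, "w2": 0, "w3": half, "w4": half},   # model attached to cell {w3, w4}
]
cells = [{"w1", "w2"}, {"w3", "w4"}]


def mixture(weights):
    """Pr(point) = sum_j Qr_j(point) * Pr(Lambda_j), as in decomposition (1.3)."""
    return {w: sum(q[w] * p for q, p in zip(Qr, weights)) for w in S}


def prob(pr, event):
    return sum(pr[w] for w in event)


def is_measure_preserving(pr):
    """Check Pr(event) == Pr(S^{-1}(event)) for every subset of Omega."""
    events = chain.from_iterable(combinations(S, k) for k in range(len(S) + 1))
    return all(
        prob(pr, ev) == prob(pr, [w for w in S if S[w] in ev]) for ev in events
    )


def conditional(pr, event, cell):
    """Pr(event | cell), the right side of (1.2)."""
    return prob(pr, set(event) & cell) / prob(pr, cell)


pr_a = mixture([Fraction(1, 4), Fraction(3, 4)])
pr_b = mixture([Fraction(9, 10), Fraction(1, 10)])
cond_a = conditional(pr_a, {"w1"}, cells[0])
cond_b = conditional(pr_b, {"w1"}, cells[0])
```

The two mixtures disagree about how likely each invariant cell is, yet long time series could not tell them apart because the conditional probabilities coincide.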

1.8. Ergodic Decomposition

Up to now, we have represented invariant events with a countable partition. Dynkin [1978] deduced a more general version of decomposition (1.3) without assuming a countable partition. Thus, start with a pair (Ω,F). Also, assume that there is a metric on Ω and that Ω is separable. We also assume that F is the collection of Borel sets (the smallest sigma algebra containing the open sets). Given (Ω,F), take a (measurable) transformation S and consider the set P of probability measures Pr for which S is measure-preserving. For some of these probability measures, S is ergodic, but for others, it is not. Let Q denote the set of probability measures for which S is ergodic. Under a nondegenerate convex combination of two probability measures in Q, S is measure-preserving but not ergodic. Dynkin [1978] constructed limiting empirical measures Qr on Q and justified the following representation of the set P of probability measures Pr.

Proposition 1.3

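For each probability measure $\widetilde{Pr}$ in $\mathcal{P}$, there is a unique probability measure $\pi$ over $\mathcal{Q}$ such that

$$\widetilde{Pr}(\Lambda) = \int_{\mathcal{Q}} Qr(\Lambda) \, \pi(dQr) \tag{1.4}$$

for all $\Lambda \in \mathfrak{F}$.[10]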

Proposition 1.3 generalizes representation (1.3). It asserts a sense in which the set P of probabilities for which S is measure-preserving is convex. Extremal points of this set are in the smaller set Q of probability measures for which the transformation S is ergodic. Representation (1.4) shows that by forming “mixtures” (i.e., weighted averages or convex combinations) of probability measures under which S is ergodic, we can represent all probability specifications for which S is measure-preserving.

To add another perspective, a collection of invariant events $\mathfrak{I}$ is associated with a transformation $\mathbb{S}$. There exists a common conditional expectation operator $\mathbb{J} \equiv E(\cdot \mid \mathfrak{I})$ that assigns mathematical expectations to bounded measurable functions (mapping $\Omega$ into $\mathbb{R}$) conditioned on the set of invariant events $\mathfrak{I}$. The conditional expectation operator $\mathbb{J}$ characterizes limit points of time series averages of indicator functions of events of interest as well as other random vectors. Alternative probability measures $Pr$ assign different probabilities to the invariant events.

1.9. Risk and Uncertainty

An applied researcher typically does not know which statistical model generated the data. This situation leads us to specifications of S that are consistent with a family P of probability models under which S is measure-preserving and a stochastic process is stationary. Representation (1.4) describes uncertainty about statistical models with a probability distribution π over the set of statistical models Q.

For a Bayesian, π is a subjective prior probability distribution that pins down a convex combination of “statistical models.”[11] A Bayesian expresses trust in that convex combination of statistical models used to construct a complete probability measure over outcomes[12] and uses it to compute expected utility. A Bayesian decision theory axiomatized by Savage makes no distinction between how a decision maker responds to the probabilities described by the component statistical models and how he responds to the π probabilities that he uses to mix them. All that matters to a Bayesian decision maker is the complete probability distribution over outcomes, not how it is attained as a π-mixture of component statistical models.

Some decision and control theorists challenge the complete confidence in a single prior probability assumed in a Bayesian approach.[13] They want to distinguish ‘ambiguity’, meaning not being able confidently to assign π, from ‘risk’, meaning prospective outcomes with probabilities reliably described by a statistical model. They imagine decision makers who want to evaluate decisions under alternative π’s.[14] We explore these ideas in later chapters.

An important implication of the Law of Large Numbers is that for a given initial π, using Bayes’ rule to update the π probabilities as data arrive will eventually concentrate posterior probability on the statistical model that generates the data. Even when a decision maker entertains a family of π’s, the updated probabilities conditioned on the data may still concentrate on the statistical model that generates the data.
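The concentration of posterior probabilities can be sketched with a simulation. The example assumes a hypothetical family of three IID Bernoulli models, a uniform prior π over them, and a fixed random seed; none of these choices comes from the text.

```python
import math
import random


def posterior_path(data, thetas, prior):
    """Posterior probabilities over Bernoulli models after each observation.

    Works in logs to avoid underflow on long samples.
    """
    log_post = [math.log(p) for p in prior]
    path = []
    for x in data:
        log_post = [
            lp + math.log(th if x == 1 else 1.0 - th)
            for lp, th in zip(log_post, thetas)
        ]
        top = max(log_post)
        weights = [math.exp(lp - top) for lp in log_post]
        total = sum(weights)
        path.append([w / total for w in weights])
    return path


random.seed(0)                       # fixed seed so the run is reproducible
thetas = [0.3, 0.5, 0.7]             # hypothetical family of statistical models
true_theta = 0.7                     # the model that generates the data
data = [1 if random.random() < true_theta else 0 for _ in range(2000)]

path = posterior_path(data, thetas, prior=[1 / 3, 1 / 3, 1 / 3])
final_posterior = path[-1]
```

With a long enough sample, the entry of `final_posterior` attached to the data-generating model should be very close to one. How quickly that happens in finite samples depends on how far apart the competing models are.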

1.10. Estimating Vector Autoregressions#

We now apply the Law of Large Numbers to the estimation of the equations in a vector autoregression

Let Yt+1 be one of the entries of Xt+1, and consider the regression equation:

Yt+1=βXt+Ut+1,

where Ut+1 is a least squares residual. By choosing Yt+1 to be alternative entries of Xt+1, we obtain the different equations in a VAR system. Our perspective in this discussion is that of an econometrician who fits such a regression system without taking a stand on the actual dynamic stochastic evolution of the {Xt:t=0,1,....}. To express subjective uncertainty about β, we allow it to be random but measurable in terms of the collection of invariant events J. As implied by least squares, we impose that the regression error, Ut+1 is orthogonal to the vector Xt of regressors conditioned on J:

E(XtUt+1|J)=0.

Then

(1.5)#E(XtYt+1|J)=E[Xt(Xt)|J]β,

which uniquely pins down the regression coefficient β provided that the matrix E[Xt(Xt)|J] is nonsingular with probability one. Notice that

1Nt=1NXtYt+1E(XtYt+1|J)
1Nt=1NXt(Xt)E(Xt(Xt)|J),

where convergence is with probability one. Thus, from equation (1.5) it follows that a consistent estimator of $\beta$ is a $b_N$ that satisfies

$$
\frac{1}{N} \sum_{t=1}^{N} X_t Y_{t+1} = \frac{1}{N} \sum_{t=1}^{N} X_t (X_t)' \, b_N.
$$

Solving for $b_N$ gives the familiar least squares formula:

$$
b_N = \left[ \sum_{t=1}^{N} X_t (X_t)' \right]^{-1} \sum_{t=1}^{N} X_t Y_{t+1}.
$$

Note how statements about the consistency of $b_N$ are conditioned on $\mathfrak{J}$. This conditioning is necessary when we do not know ex ante which among a family of vector autoregressions generates the data.
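The least squares formula and the role of conditioning on invariant events can be checked numerically. The following is a minimal sketch under assumptions that go beyond the text: the data come from a first-order VAR with standard normal shocks, the coefficient matrix is drawn once from two hypothetical stable candidates, and the process starts at zero rather than from its stationary distribution. The function names are invented for this illustration.

```python
import numpy as np

def least_squares_bN(X, Y_next):
    """b_N = [sum_t X_t X_t']^{-1} sum_t X_t Y_{t+1}; rows of X are X_t'."""
    Sxx = X.T @ X          # sum over t of X_t X_t'
    Sxy = X.T @ Y_next     # sum over t of X_t Y_{t+1}
    return np.linalg.solve(Sxx, Sxy)

def simulate_var1(A, N, rng):
    """Simulate X_{t+1} = A X_t + W_{t+1} with standard normal shocks."""
    n = A.shape[0]
    X = np.zeros((N + 1, n))
    for t in range(N):
        X[t + 1] = A @ X[t] + rng.standard_normal(n)
    return X

rng = np.random.default_rng(1)
# Two hypothetical stable coefficient matrices; one is selected once and for all,
# which mimics uncertainty about which statistical model generates the data.
candidates = [np.array([[0.5, 0.1], [0.0, 0.3]]),
              np.array([[0.8, -0.2], [0.1, 0.4]])]
A_true = candidates[rng.integers(2)]

X = simulate_var1(A_true, 50_000, rng)
regressors = X[:-1]        # X_t for t = 0, ..., N-1
Y_next = X[1:, 0]          # Y_{t+1} is the first entry of X_{t+1}
b_N = least_squares_bN(regressors, Y_next)
print(b_N, A_true[0])
```

In this sketch `b_N` approaches the first row of whichever matrix was selected, not an average across the two candidates, which is the sense in which consistency is conditional on the invariant events.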

1.11. Inventing an Infinite Past#

When $Pr$ is measure-preserving and the process $\{X_t : t = 0, 1, \ldots\}$ is stationary, it can be useful to invent an infinite past. To accomplish this, we reason in terms of the (measurable) transformation $S : \Omega \rightarrow \Omega$ that describes the evolution of a sample point $\omega$. Until now we have assumed that $S$ has the property that for any event $\Lambda \in \mathfrak{F}$,

$$
S^{-1}(\Lambda) = \{\omega \in \Omega : S(\omega) \in \Lambda\}
$$

is an event in $\mathfrak{F}$. In the chapter Stationary Increments, we want more. To prepare the way for that chapter, in this section we shall also assume that $S$ is one-to-one and has the property that for any event $\Lambda \in \mathfrak{F}$,

$$
S(\Lambda) = \{\omega \in \Omega : S^{-1}(\omega) \in \Lambda\} \in \mathfrak{F}. \tag{1.6}
$$

Because

$$
X_t(\omega) = X[S^t(\omega)], \qquad \text{that is,} \quad X_t = X \circ S^t,
$$

is well defined for negative values of $t$, restrictions (1.6) allow us to construct a “two-sided” process that has both an infinite past and an infinite future.

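Let $\mathfrak{A}$ be a subsigma algebra of $\mathfrak{F}$, and let

$$
\mathfrak{A}_t = \left\{ \Lambda_t \in \mathfrak{F} : \Lambda_t = \{\omega \in \Omega : S^t(\omega) \in \Lambda\} \text{ for some } \Lambda \in \mathfrak{A} \right\}. \tag{1.7}
$$

We assume that $\{\mathfrak{A}_t : -\infty < t < +\infty\}$ is a nondecreasing sequence of subsigma algebras of $\mathfrak{F}$. The nondecreasing structure captures the accumulation of information over time. If the original measurement function $X$ is $\mathfrak{A}$-measurable, then $X_t$ is $\mathfrak{A}_t$-measurable. Furthermore, $X_{t-j}$ is $\mathfrak{A}_t$-measurable for all $j \ge 0$. The sigma algebra $\mathfrak{A}_t$ depicts information available at date $t$, including past information. Invariant events in $\mathfrak{I}$ are contained in $\mathfrak{A}_t$ for all $t$.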

We construct the following moving-average representation of a scalar process $\{X_t\}$ in terms of an infinite history of shocks.

Example 1.8

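(Moving average) Suppose that $\{W_t : -\infty < t < \infty\}$ is a vector stationary process for which[15]

$$
E(W_{t+1} \mid \mathfrak{A}_t) = 0
$$

and that

$$
E(W_t W_t' \mid \mathfrak{I}) = I
$$

for all $-\infty < t < +\infty$, where $I$ on the right side denotes an identity matrix.
Use a sequence of vectors $\{\alpha_j\}_{j=0}^{\infty}$ to construct

$$
X_t = \sum_{j=0}^{\infty} \alpha_j \cdot W_{t-j} \tag{1.8}
$$

where

$$
\sum_{j=0}^{\infty} |\alpha_j|^2 < \infty. \tag{1.9}
$$

Restriction (1.9) implies that $X_t$ is well defined as a mean square limit. $X_t$ is constructed from the infinite past $\{W_{t-j} : 0 \le j < \infty\}$. The process $\{X_t : -\infty < t < \infty\}$ is stationary and is often called an infinite-order moving average process. The sequence $\{\alpha_j : j = 0, 1, \ldots\}$ can depend on the invariant events.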
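A truncated version of the moving average in Example 1.8 can be simulated to see what the square summability restriction delivers. The sketch below rests on assumptions that are not part of the example: the infinite sum is cut off after a finite number of lags, the shocks are IID standard normal (one special case that satisfies the two moment conditions), and the coefficients are the hypothetical geometric sequence α_j = ρ^j c. Under these assumptions the variance of X_t equals the sum of the squared norms of the α_j, which the simulation compares with a sample variance.

```python
import numpy as np

def moving_average(W, alphas):
    """X_t = sum_{j=0}^{J-1} alphas[j] . W[t-j] for t = J-1, ..., T-1."""
    # Convolve each shock component with its own coefficient sequence and add up
    return sum(np.convolve(W[:, i], alphas[:, i], mode="valid")
               for i in range(W.shape[1]))

rng = np.random.default_rng(2)
J, T, k = 200, 200_000, 2            # truncation lag, sample length, shock dimension
rho, c = 0.8, np.array([1.0, 0.5])   # hypothetical choices for illustration
alphas = rho ** np.arange(J)[:, None] * c    # alpha_j = rho^j c, square summable

W = rng.standard_normal((T, k))      # IID shocks with identity covariance
X = moving_average(W, alphas)

theoretical_var = np.sum(alphas ** 2)    # sum over j of |alpha_j|^2
sample_var = X.var()
print(theoretical_var, sample_var)
```

Because the tail of a square summable sequence contributes little to the variance, increasing the truncation lag `J` beyond a moderate value changes `X` only slightly in mean square.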

Remark 1.2

Slutsky [1927] and Yule [1927] used probability models to analyze economic time series. Their models implied moving-average representations like the one in Example 1.8. Their idea was to view economic time series as responding linearly to current and past independent and identically distributed impulses or shocks. In distinct contributions, they showed how such models generate recurrent but aperiodic fluctuations that resemble business cycles and longer-term cycles as well. Yule [1927] and Slutsky [1927] came from different backgrounds and brought different perspectives. Yule [1927] was an eminent statistician who, among other important contributions, managed “effectively to invent modern time series analysis” in the words of Stigler [1986]. Yule constructed and estimated what we would now call a second-order autoregression and applied it to study sunspots. The $\alpha_j$ coefficients implied by Yule’s estimates showed damped oscillations at the same periodicity as sunspots. In Russia in the 1920s, Slutsky [1927] wrote a seminal paper in Russian motivated by his interest in business cycles. Later an English version of his paper was published in Econometrica. Even before that, it influenced economists including Ragnar Frisch. Indeed, Frisch was keenly aware of both Slutsky [1927] and Yule [1927] and generously acknowledged both of them in his seminal paper Frisch [1933] on the impulse and propagation problem. Building on insights of Slutsky [1927] and Yule [1927], Frisch [1933] pioneered impulse response functions. He aspired to provide explicit economic interpretations for how shocks alter economic time series intertemporally.[16]

1.12. Summary#

For a fixed S there are often many possible probabilities Pr that are measure-preserving. A subset of these are ergodic. These ergodic probabilities can serve as building blocks for the other measure-preserving probabilities. Thus, each measure-preserving Pr can be expressed as a weighted average of the ergodic probabilities. We call the ergodic probabilities statistical models. The Law of Large Numbers applies to each of the ergodic building blocks with limit points that are unconditional expectations. As embodied in (1.3) and its generalization (1.4), this decomposition interests both frequentist and Bayesian statisticians.