We shall interpret economic time series as temporally organized data whose source is a single random draw from a parameterized joint probability distribution. That joint distribution determines intertemporal probabilistic relationships among components of the data. We imagine a statistician who doesn’t know values of the parameters that characterize the joint probability distribution and wants to use those data to infer them.
A key ingredient in doing this successfully is somehow to form averages over time of some functions of the data and hope that they converge to something informative about model parameters. Some laws of large numbers (LLNs) can help us with this, but others can’t.
This chapter describes one that can.
Classic LLNs that are typically studied in entry level probability classes adopt a narrow perspective that is not applicable to economic time series. To help us appreciate why a plain vanilla LLN won’t help us study time series, we’ll start with a setting in which such a plain vanilla LLN is all that we need. Here the statistician views a data set as a random draw from a particular joint probability distribution that is one among a family of models that he knows. Once again, from those data, our statistician wants to infer which member of the family is most plausible. The family could be finite or it could be represented more generally with an unknown parameter vector indexing alternative models. The parameter vector could be finite or infinite-dimensional. We call each joint probability distribution associated with a particular parameter vector a statistical model.
A plain vanilla LLN assumes that the data are a sequence of independent draws from the same (i.e., identical) probability distribution. This assumption immediately leads to a logical problem when it comes to thinking about learning which model has generated the data. If the parameter vector that pins down the probability distribution is known and draws are independent and identically distributed (IID), then past draws indicate nothing about future draws, meaning that there is nothing to learn. But if the parameters that pin down data generation are not known, then there is something to learn: a sequence of draws from that unknown distribution is not IID because past draws indicate something about probabilities of future draws.
Fig. 1.1 Top: a sequence of draws from a fair coin. Bottom: a sequence of draws from a possibly unfair brass tack.
To illustrate the issue, Fig. 1.1 shows IID draws from two Bernoulli distributions. The top sequence consists of flips of fair coins. Here the probability of a heads is one half for every coin. After observing the outcome of coin flips, a statistician would predict that the probability that the next flip is a heads is one half, regardless of earlier outcomes. In this example, the statistician has no reason to use data to learn the “parameter vector of interest”, i.e., the probability of a heads on a single draw of one coin, because he knows it. The bottom sequence is formed from successive tosses of brass tacks. We assume that the statistician does not know the probability that each brass tack will land on its head. The statistician wants to use the observed outcomes of tosses to predict outcomes of future tack tosses. A presumption that observations of past tosses of the tack contain information about future tosses is tantamount to assuming that the sequence is not IID. This is a context in which it is natural to assume that the way brass tacks are constructed is the same (i.e., identical) for all of the brass tacks, and thus that the probabilities of tacks landing on their head are the same for each toss. If we condition on the common probability that a tack lands on its head, it makes sense to view the hypothetical observations as conditionally IID.
To achieve a theory of statistical learning, we can relax the IID assumption and instead assume a property called exchangeability. For a sequence of random draws to be exchangeable, it must be true that the joint probability distribution of any finite segment does not change if we rearrange individuals’ positions in the sequence. In other words the order of draws does not matter when forming joint probabilities. If prospective outcomes of tack tosses can plausibly be viewed as being exchangeable, this opens the door to learning about the probability of a heads as we accumulate more observations. Exchangeable sequences are conditionally IID, where we interpret conditioning as being on a statistical model.[1] Exchangeable sequences obey a conditional version of a Law of Large Numbers that we describe later in this chapter.
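The distinction between IID and conditionally IID draws can be checked numerically. The following is a minimal sketch, assuming purely for illustration that the unknown head probability of the tack is drawn once per sequence from a uniform distribution; the names `theta` and `tosses` are hypothetical and not part of the text. Under that assumption the first toss is informative about the second even though the two tosses are exchangeable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sequences = 200_000

# Assumption for illustration: nature draws the head probability theta
# once per sequence from a uniform distribution on [0, 1].
theta = rng.uniform(size=n_sequences)

# Two tosses per sequence, IID conditional on theta.
tosses = rng.uniform(size=(n_sequences, 2)) < theta[:, None]

# Unconditional probability of a head on the second toss.
p_second = tosses[:, 1].mean()

# Probability of a head on the second toss given a head on the first.
p_second_given_first = tosses[tosses[:, 0], 1].mean()

# Exchangeability: the orderings (head, tail) and (tail, head) are equally likely.
p_head_tail = (tosses[:, 0] & ~tosses[:, 1]).mean()
p_tail_head = (~tosses[:, 0] & tosses[:, 1]).mean()
```

With a uniform mixing distribution, the conditional probability is close to two thirds while the unconditional probability is close to one half, so the draws are exchangeable but not IID.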
LLNs based on exchangeability are too restrictive for economic time series because we are interested in probability models in which the temporal order in which observations arise matters. This motivates us to replace exchangeability with a notion of stationarity and to use an ergodic decomposition theorem for stationary processes. This LLN for stationary processes is conditional on what we call the statistical model. The resulting LLN allows us to study a statistician who considers several alternative statistical models simultaneously and who understands how long-run averages of some salient statistics depend on unknown parameters on which the LLN is conditioned.
A LLN can teach us about statistical challenges that we would face even if we were to have time series of infinite length. While this is a good starting point, we’ll have to do more. In later chapters, we will describe additional approaches to statistical inference that can help us to understand how much we can learn about model parameters from finite histories of data.
We start with a probability space, namely, a triple $(\Omega, \mathfrak{F}, Pr)$, where $\Omega$ is a set of sample points, $\mathfrak{F}$ is a collection of subsets of $\Omega$ called events (formally a sigma algebra) and $Pr$ assigns probabilities to events. We refer to $Pr$ as a probability measure. The following definition makes reference to Borel sets. Borel sets include open sets, closed sets, finite intersections, and countable unions of such sets.
Definition 1.1
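$X$ is an $n$-dimensional random vector if $X : \Omega \rightarrow \mathbb{R}^n$ has the property that for any Borel set $\mathfrak{b}$ in $\mathbb{R}^n$,
$$\{X \in \mathfrak{b}\} = \{\omega \in \Omega : X(\omega) \in \mathfrak{b}\}$$
is in $\mathfrak{F}$.
A result from measure theory states that if $\{X \in \mathfrak{o}\}$ is an event in $\mathfrak{F}$ whenever $\mathfrak{o}$ is an open set in $\mathbb{R}^n$, then $X$ is an $n$-dimensional random vector. In what follows, we will often omit the explicit reference to $\omega$ when it is self-evident.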
This formal structure facilitates using mathematical analysis to formulate problems in probability theory. A random vector $X$ induces a probability distribution over the collection of Borel sets in which the probability assigned to set $\mathfrak{b}$ is given by
$$Pr\{X \in \mathfrak{b}\}.$$
By changing the set $\mathfrak{b}$, we trace out a probability distribution implied by the random vector $X$ that is called the induced distribution. An induced distribution is what typically interests an applied worker. In practice, an induced distribution is just specified directly without constructing the foundations under study here. However, proceeding at a deeper level, as we have, by defining a random vector to be a function that satisfies particular measurability properties and imposing the probability measure $Pr$ over the domain of that function has mathematical payoffs. We will exploit this mathematical formalism in various ways, among them the construction of stochastic processes.
Definition 1.2
An $n$-dimensional stochastic process is an infinite sequence of $n$-dimensional random vectors $\{X_t : t = 0, 1, \ldots\}$.
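The measure $Pr$ assigns probabilities to a rich and interesting collection of events. For example, consider a stacked random vector
$$X^{[\ell]}(\omega) \doteq \begin{bmatrix} X_0(\omega) \\ X_1(\omega) \\ \vdots \\ X_\ell(\omega) \end{bmatrix}$$
and Borel sets $\mathfrak{b}$ in $\mathbb{R}^{n(\ell+1)}$. The joint distribution of $X^{[\ell]}$ induced by $Pr$ over such Borel sets is
$$Pr\{X^{[\ell]} \in \mathfrak{b}\}.$$
Since the choice of $\ell$ is arbitrary, $Pr$ implies a distribution over an entire sequence of random vectors $\{X_t(\omega) : t = 0, 1, \ldots\}$.[2]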
We may also go the other way. Given a probability distribution over infinite sequences of vectors, we can construct a probability space and a stochastic process that induce this distribution. Thus, the following way to construct a probability space is particularly enlightening.
Construction 1.1:
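Let $\Omega$ be a collection of infinite sequences in $\mathbb{R}^n$ with a sample point $\omega \in \Omega$ being a sequence of vectors $\omega = (\mathbf{r}_0, \mathbf{r}_1, \ldots)$, where $\mathbf{r}_t \in \mathbb{R}^n$.
Let $\mathfrak{B}_t$ be the collection of Borel sets of $\mathbb{R}^{n(t+1)}$, and let $\mathfrak{F}$ be the smallest sigma-algebra that contains the Borel sets of
$$\mathbb{R}^{n(t+1)}$$
for $t = 0, 1, 2, \ldots$.
For each integer $\ell \geq 0$, let $Pr_\ell$ assign probabilities to the Borel sets of $\mathbb{R}^{n(\ell+1)}$. A Borel set in $\mathbb{R}^{n(\ell+1)}$ can also be viewed as a Borel set in $\mathbb{R}^{n(\ell+2)}$ with $\mathbf{r}_{\ell+1}$ left unrestricted. Specifically, let $\mathfrak{b}_\ell$ be a Borel set in $\mathbb{R}^{n(\ell+1)}$.
Then
$$\mathfrak{b}_\ell^{\ell+1} = \{(\mathbf{r}_0, \mathbf{r}_1, \ldots, \mathbf{r}_{\ell+1}) : (\mathbf{r}_0, \mathbf{r}_1, \ldots, \mathbf{r}_\ell) \in \mathfrak{b}_\ell\}$$
is a Borel set in $\mathbb{R}^{n(\ell+2)}$.
For probability measures $\{Pr_\ell : \ell = 0, 1, \ldots\}$ to be consistent, we require that the probability assigned by $Pr_{\ell+1}$ satisfy
$$Pr_\ell(\mathfrak{b}_\ell) = Pr_{\ell+1}(\mathfrak{b}_\ell^{\ell+1}).$$
With this restriction, we can extend the probability $Pr$ to the space $(\Omega, \mathfrak{F})$ in a way that is itself consistent with the probability assigned by $Pr_\ell$ for all nonnegative integers $\ell$.[3]
Finally, we construct the stochastic process $\{X_t : t = 0, 1, \ldots\}$ by letting
$$X_t(\omega) = \mathbf{r}_t$$
for $t = 0, 1, 2, \ldots$. A convenient feature of this construction is that $Pr_\ell$ is the probability induced by the random vector $[X_0', X_1', \ldots, X_\ell']'$.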
We refer to this construction as canonical. While this is only one among other possible constructions of probability spaces, it illustrates the flexibility in building sequences of random vectors that induce alternative probabilities of interest.
The remainder of this chapter is devoted to studying Laws of Large Numbers.
What is perhaps the most familiar Law of Large Numbers presumes that the stochastic process $\{X_t : t = 0, 1, \ldots\}$ is IID. Then
$$\frac{1}{N} \sum_{t=1}^N \phi(X_t) \rightarrow E\,\phi(X_0)$$
for any (Borel measurable) function $\phi$ for which the expectation is well defined. Convergence holds in several senses that we state later. Notice that as we vary the function $\phi$ we can infer the (induced) probability distribution for $X_0$. In this sense, the outcome of the Law of Large Numbers under an IID sequence determines what we will call a statistical model.
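As a rough numerical illustration of the IID Law of Large Numbers, the sketch below averages two choices of a Borel measurable function over simulated draws. The standard normal distribution and the helper name `time_average` are assumptions made only for this example. Choosing an indicator function recovers an event probability, which is how varying the function traces out the induced distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption for illustration: the IID draws are standard normal.
draws = rng.standard_normal(100_000)

def time_average(phi, x):
    """Average of phi(X_t) over the sample."""
    return np.mean(phi(x))

# Indicator of the event {X <= 0}: the average approximates Pr{X_0 <= 0} = 1/2.
avg_indicator = time_average(lambda v: (v <= 0.0).astype(float), draws)

# Second moment: the average approximates E[X_0^2] = 1.
avg_square = time_average(lambda v: v ** 2, draws)
```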
For our purposes, an IID version of the Law of Large Numbers is too restrictive. First, we are interested in economic dynamics in which model outcomes are temporally dependent. Second, we want to put ourselves in the situation of a statistician who does not know a priori what the underlying data generating process is and therefore entertains multiple models. We will present a Law of Large Numbers that covers both settings.
We now generalize the canonical construction 1.1 of a stochastic process in a way that facilitates stating the Law of Large Numbers that interests us. The generalization uses two objects.
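Fig. 1.2 The evolution of a sample point $\omega$ induced by successive applications of the transformation $\mathbb{S}$. The oval shaped region is the collection $\Omega$ of all sample points.
Fig. 1.3 An inverse image $\mathbb{S}^{-1}(\Lambda)$ of an event $\Lambda$ is itself an event; $\omega \in \mathbb{S}^{-1}(\Lambda)$ implies that $\mathbb{S}(\omega) \in \Lambda$.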
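The first is a (measurable) transformation $\mathbb{S} : \Omega \rightarrow \Omega$ that describes the evolution of a sample point $\omega$. See Fig. 1.2. Transformation $\mathbb{S}$ has the property that for any event $\Lambda \in \mathfrak{F}$,
$$\mathbb{S}^{-1}(\Lambda) = \{\omega \in \Omega : \mathbb{S}(\omega) \in \Lambda\}$$
is an event in $\mathfrak{F}$, as depicted in Fig. 1.3. The second object is an $n$-dimensional vector $X(\omega)$ that describes how observations depend on sample point $\omega$.
We construct a stochastic process $\{X_t : t = 0, 1, \ldots\}$ via the formula:
$$X_t(\omega) = X[\mathbb{S}^t(\omega)]$$
or
$$X_t = X \circ \mathbb{S}^t,$$
where we interpret $\mathbb{S}^0$ as the identity mapping asserting that $\omega_0 = \omega$.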
Because a known function $\mathbb{S}$ maps a sample point $\omega \in \Omega$ today into a sample point $\mathbb{S}(\omega) \in \Omega$ tomorrow, the evolution of sample points is deterministic. For instance, $\omega_{t+j} = \mathbb{S}^{t+j}(\omega)$ for all $j \geq 1$ can be predicted perfectly if we know $\mathbb{S}$ and $\omega_t$. But we typically do not observe $\omega_t$ at any $t$. Instead, we observe an $(n \times 1)$ vector $X(\omega)$ that contains incomplete information about $\omega$. We assign probabilities $Pr$ to collections of sample points $\omega$ called events, then use the functions $\mathbb{S}$ and $X$ to induce a joint probability distribution over sequences of $X$’s. The resulting stochastic process $\{X_t : t = 0, 1, 2, \ldots\}$ is a sequence of $n$-dimensional random vectors.
This way of constructing a stochastic process might seem restrictive; but actually, it is more general than the canonical construction presented above.
Example 1.1
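Consider again our canonical construction 1.1. Recall that the set of sample points $\Omega$ is the collection of infinite sequences of elements $\mathbf{r}_t \in \mathbb{R}^n$ so that $\omega = (\mathbf{r}_0, \mathbf{r}_1, \ldots)$. For this example, $\mathbb{S}(\omega) = (\mathbf{r}_1, \mathbf{r}_2, \ldots)$. This choice of $\mathbb{S}$ is called the shift transformation. Notice that the time $t$ iterate is
$$\mathbb{S}^t(\omega) = (\mathbf{r}_t, \mathbf{r}_{t+1}, \ldots).$$
With the measurement function $X(\omega) = \mathbf{r}_0$, we obtain $X_t(\omega) = X[\mathbb{S}^t(\omega)] = \mathbf{r}_t$, as in construction 1.1.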
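The shift transformation can be sketched in code by representing a sample point as a function from dates to values, which stands in for an infinite sequence. This is only an illustration: the names `shift`, `measurement`, and `iterate`, the scalar case, and the particular sample point are hypothetical choices, and the measurement function is taken to read off the first element of the sequence.

```python
def shift(omega):
    """Shift transformation S: drop the first element of the sequence."""
    return lambda t: omega(t + 1)

def measurement(omega):
    """Measurement function X: read off the first element r_0."""
    return omega(0)

def iterate(transform, omega, t):
    """Apply the transformation t times, giving S^t(omega)."""
    for _ in range(t):
        omega = transform(omega)
    return omega

# Hypothetical sample point: the scalar sequence r_t = t**2.
omega = lambda t: t ** 2

# X_t(omega) = X[S^t(omega)] recovers r_t for each date t.
path = [measurement(iterate(shift, omega, t)) for t in range(5)]
```

Here `path` reproduces the first five elements of the sequence, which is the sense in which the pair of a shift and a first-element measurement recovers the canonical construction.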
We start with a probabilistic notion of invariance. We call a stochastic process stationary if for any finite integer $\ell$, the joint probability distribution induced by the composite random vector $[X_t', X_{t+1}', \ldots, X_{t+\ell}']'$ is the same for all $t \geq 0$.[5] This notion of stationarity can be thought of as a stochastic version of a steady state of a dynamical system.
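We now use the objects $(\mathbb{S}, X)$ to build a stationary stochastic process by restricting construction 1.1.
Consider the set $\{\omega \in \Omega : X(\omega) \in \mathfrak{b}\} \doteq \Lambda$ and its successors
$$\{\omega \in \Omega : X_1(\omega) \in \mathfrak{b}\} = \{\omega \in \Omega : X[\mathbb{S}(\omega)] \in \mathfrak{b}\} = \mathbb{S}^{-1}(\Lambda),$$
$$\{\omega \in \Omega : X_t(\omega) \in \mathfrak{b}\} = \{\omega \in \Omega : X[\mathbb{S}^t(\omega)] \in \mathfrak{b}\} = \mathbb{S}^{-t}(\Lambda).$$
Evidently, if $Pr(\Lambda) = Pr[\mathbb{S}^{-1}(\Lambda)]$ for all $\Lambda \in \mathfrak{F}$, then the probability distribution induced by $X_t$ equals the probability distribution of $X$ for all $t$. This fact motivates the following definition and proposition.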
Definition 1.3
The pair $(\mathbb{S}, Pr)$ is said to be measure-preserving if
$$Pr(\Lambda) = Pr\{\mathbb{S}^{-1}(\Lambda)\}$$
for all $\Lambda \in \mathfrak{F}$.
Theorem 1.1
When $(\mathbb{S}, Pr)$ is measure-preserving, probability distributions induced by the random vectors $X_t = X \circ \mathbb{S}^t$ are identical for all $t \geq 0$.
The measure-preserving property restricts the probability measure $Pr$ for a given transformation $\mathbb{S}$. Some probability measures $Pr$ used in conjunction with $\mathbb{S}$ will be measure-preserving and others not, a fact that will play an important role at several places below.
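Suppose that $\mathbb{S}$ is measure-preserving relative to probability measure $Pr$. Given $X$ and an integer $\ell \geq 1$, form a vector
$$X^{[\ell]}(\omega) \doteq \begin{bmatrix} X_0(\omega) \\ X_1(\omega) \\ \vdots \\ X_\ell(\omega) \end{bmatrix}.$$
We can apply Theorem 1.1 to $X^{[\ell]}$ to conclude that the joint distribution function of $(X_t, X_{t+1}, \ldots, X_{t+\ell})$ is independent of $t$ for $t = 0, 1, \ldots$. That this property holds for any choice of $\ell$ implies that the stochastic process $\{X_t : t = 0, 1, \ldots\}$ is stationary. Moreover, $\phi(X^{[\ell]})$ where $\phi$ is a Borel measurable function from $\mathbb{R}^{n(\ell+1)}$ into $\mathbb{R}$ is also a valid measurement function. Such $\phi$’s include indicator functions of interesting events defined in terms of $X^{[\ell]}$.
For a given $\mathbb{S}$, we now present examples that illustrate how to construct a probability measure $Pr$ that makes $\mathbb{S}$ measure-preserving and thereby brings stationarity.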
Example 1.2
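Suppose that $\Omega$ contains two points, $\Omega = \{\omega_1, \omega_2\}$. Consider a transformation $\mathbb{S}$ that maps $\omega_1$ into $\omega_2$ and $\omega_2$ into $\omega_1$: $\mathbb{S}(\omega_1) = \omega_2$ and $\mathbb{S}(\omega_2) = \omega_1$. Since $\mathbb{S}^{-1}(\{\omega_2\}) = \{\omega_1\}$ and $\mathbb{S}^{-1}(\{\omega_1\}) = \{\omega_2\}$, for $\mathbb{S}$ to be measure-preserving, we must have $Pr(\{\omega_1\}) = Pr(\{\omega_2\}) = 1/2$.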
Example 1.3
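Suppose that $\Omega$ contains two points, $\Omega = \{\omega_1, \omega_2\}$, and that $\mathbb{S}(\omega_1) = \omega_1$ and $\mathbb{S}(\omega_2) = \omega_2$.
Since
$\mathbb{S}^{-1}(\{\omega_2\}) = \{\omega_2\}$
and $\mathbb{S}^{-1}(\{\omega_1\}) = \{\omega_1\}$, $\mathbb{S}$ is measure-preserving for any $Pr$ that satisfies
$Pr(\{\omega_1\}) \geq 0$
and $Pr(\{\omega_2\}) = 1 - Pr(\{\omega_1\})$.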
Example 1.4
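Suppose that $\Omega$ contains four points, $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$. Moreover, $\mathbb{S}(\omega_1) = \omega_2$ and $\mathbb{S}(\omega_2) = \omega_1$, while $\mathbb{S}$ maps each of $\omega_3$ and $\omega_4$ into $\{\omega_1, \omega_2\}$. Notice that the sample points $\omega_3$ and $\omega_4$ are transient. Applying the transformation $\mathbb{S}$ does not allow for access to these points as they are not in the image of $\mathbb{S}$. As a consequence, the only measure-preserving probability is the same one described in Example 1.2.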
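The measure-preserving restriction in these finite examples can be verified mechanically. The sketch below is illustrative only: the function names and the string labels for sample points are hypothetical, and for the four-point case the images assigned to the two transient points are an assumed choice that sends each of them into the two recurrent points.

```python
from fractions import Fraction

def preimage(transform, event):
    """S^{-1}(event) for a transformation stored as a dict on a finite set."""
    return frozenset(w for w, image in transform.items() if image in event)

def is_measure_preserving(transform, prob):
    # On a finite set it suffices to check singletons: every event is a
    # disjoint union of singletons and preimages preserve disjoint unions.
    return all(
        sum(prob[w] for w in preimage(transform, {point})) == prob[point]
        for point in transform
    )

half, quarter = Fraction(1, 2), Fraction(1, 4)
unequal = {"w1": Fraction(1, 3), "w2": Fraction(2, 3)}

swap = {"w1": "w2", "w2": "w1"}        # two-point swap
identity = {"w1": "w1", "w2": "w2"}    # two-point identity
# Assumed images for the transient points w3 and w4 (illustration only).
transient = {"w1": "w2", "w2": "w1", "w3": "w1", "w4": "w2"}

swap_equal = is_measure_preserving(swap, {"w1": half, "w2": half})
swap_unequal = is_measure_preserving(swap, unequal)
identity_unequal = is_measure_preserving(identity, unequal)
transient_concentrated = is_measure_preserving(
    transient, {"w1": half, "w2": half, "w3": 0, "w4": 0}
)
transient_uniform = is_measure_preserving(
    transient, {w: quarter for w in transient}
)
```

Exact rational arithmetic via `Fraction` avoids spurious failures of the equality check that floating-point probabilities could cause.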
The next example illustrates how to represent an i.i.d. sequence of zeros and ones in terms of an $\Omega, Pr$ and an $\mathbb{S}$.
Example 1.5
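Suppose that $\Omega = [0, 1)$ and that $Pr$ is the uniform measure on $[0, 1)$. Let
$$\mathbb{S}(\omega) = \begin{cases} 2\omega & \text{if } \omega \in [0, 1/2) \\ 2\omega - 1 & \text{if } \omega \in [1/2, 1), \end{cases} \qquad X(\omega) = \begin{cases} 1 & \text{if } \omega \in [0, 1/2) \\ 0 & \text{if } \omega \in [1/2, 1). \end{cases}$$
Calculate $Pr\{X_1 = 1 \mid X_0 = 1\} = Pr\{X_1 = 1 \mid X_0 = 0\} = Pr\{X_1 = 1\} = 1/2$ and $Pr\{X_1 = 0 \mid X_0 = 1\} = Pr\{X_1 = 0 \mid X_0 = 0\} = Pr\{X_1 = 0\} = 1/2$.
So $X_1$ is statistically independent of $X_0$.
By extending these calculations, it can be verified that $\{X_t : t = 0, 1, \ldots\}$ is a sequence of independent random variables.[6]
We can alter $Pr$ to obtain other stationary distributions.
For instance, suppose that $Pr\{1/3\} = Pr\{2/3\} = 1/2$.
Then the process $\{X_t : t = 0, 1, \ldots\}$ alternates in a deterministic fashion between zero and one.
This provides a version of Example 1.2 in which $\omega_1 = 1/3$ and $\omega_2 = 2/3$.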
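The calculations in this example can be reproduced exactly. The sketch below assumes that the transformation is the doubling map on the unit interval and that the measurement equals one on the lower half of the interval and zero on the upper half; the function names and the dyadic grid are hypothetical devices for the illustration. Exact rational arithmetic is used because iterating the doubling map in floating point loses one bit of precision per step.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def doubling(w):
    """Assumed S: 2w on [0, 1/2) and 2w - 1 on [1/2, 1)."""
    return 2 * w if w < HALF else 2 * w - 1

def measure(w):
    """Assumed X: 1 on [0, 1/2) and 0 on [1/2, 1)."""
    return 1 if w < HALF else 0

# Midpoints of the 2**k dyadic intervals of length 2**-k. Events that depend
# only on (X_0, X_1) are unions of such intervals when k >= 2, so equal weights
# on the midpoints reproduce their uniform-measure probabilities exactly.
k = 4
grid = [Fraction(2 * i + 1, 2 ** (k + 1)) for i in range(2 ** k)]

def prob(event):
    return Fraction(sum(1 for w in grid if event(w)), len(grid))

p_x1 = prob(lambda w: measure(doubling(w)) == 1)
p_x1_given_x0 = (
    prob(lambda w: measure(w) == 1 and measure(doubling(w)) == 1)
    / prob(lambda w: measure(w) == 1)
)

# Alternative measure with mass 1/2 on each of 1/3 and 2/3: follow one orbit.
orbit = [Fraction(1, 3)]
for _ in range(5):
    orbit.append(doubling(orbit[-1]))
alternating = [measure(w) for w in orbit]
```

The conditional and unconditional probabilities coincide at one half under the uniform measure, while the orbit started at one third alternates between the two mass points.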
1.5. Invariant Events and Conditional Expectations
In this section, we present a Law of Large Numbers that asserts that time series averages converge when $\mathbb{S}$ is measure-preserving relative to $Pr$.
We use the concept of an invariant event to understand how limit points of time series averages relate to a conditional mathematical expectation.
Fig. 1.4 Two invariant events $\Lambda_1$ and $\Lambda_2$ and an event $\Lambda_3$ that is not invariant.
Definition 1.4
An event $\Lambda$ is invariant if $\Lambda = \mathbb{S}^{-1}(\Lambda)$.
Fig. 1.4 illustrates two invariant events in a space $\Omega$.
Notice that if $\Lambda$ is an invariant event and $\omega \in \Lambda$, then $\mathbb{S}^t(\omega) \in \Lambda$ for $t = 0, 1, \ldots$. Thus under the transformation $\mathbb{S}$ sample points that are in $\Lambda$ remain there. Furthermore, for each there exists such that
Let $\mathfrak{I}$ denote the collection of invariant events. The entire space $\Omega$ and the null set $\varnothing$ are both invariant events. Like $\mathfrak{F}$, the collection of invariant events $\mathfrak{I}$ is a sigma algebra.
Fig. 1.5 A conditional expectation $E(X \mid \mathfrak{I})$ is constant for $\omega \in \Lambda_j$.
We want to construct a random vector $E(X \mid \mathfrak{I})$ called the “mathematical expectation of $X$ conditional on the collection $\mathfrak{I}$ of invariant events”. We begin with a situation in which a conditional expectation is a discrete random vector as occurs when invariant events are unions of sets belonging to a countable partition of $\Omega$ (together with the empty set). Later we’ll extend the definition beyond this special setting.
A countable partition consists of a countable collection of nonempty events $\Lambda_j$ such that $\Lambda_j \cap \Lambda_k = \varnothing$ for $j \neq k$ and such that the union of all $\Lambda_j$ is $\Omega$. Assume that each set $\Lambda_j$ in the partition is itself an invariant event and has positive probability. Define the mathematical expectation conditioned on event $\Lambda_j$ as
$$\frac{\int_{\Lambda_j} X \, dPr}{Pr(\Lambda_j)}$$
when $\omega \in \Lambda_j$. To extend the definition of conditional expectation to all of $\mathfrak{I}$, take
$$E(X \mid \mathfrak{I})(\omega) = \frac{\int_{\Lambda_j} X \, dPr}{Pr(\Lambda_j)} \quad \text{if } \omega \in \Lambda_j.$$
Thus, the conditional expectation $E(X \mid \mathfrak{I})$ is constant for $\omega \in \Lambda_j$ but varies across $\Lambda_j$’s. Fig. 1.5 illustrates this characterization for a finite partition.
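Now let $X$ be a random vector with finite second moments $E|X|^2 = \int |X(\omega)|^2 \, dPr(\omega) < \infty$. When a random vector $X$ has finite second moments, a conditional expectation is a least squares projection. Let $Z$ be an $n$-dimensional measurement function that is time-invariant and so satisfies
$$Z_t(\omega) = Z[\mathbb{S}^t(\omega)] = Z(\omega).$$
Let $\mathcal{Z}$ denote the collection of all such time-invariant random vectors. In the special case in which the invariant events can be constructed from a finite partition, $Z$ can vary across sets $\Lambda_j$ but must remain constant within $\Lambda_j$.[7] Consider the least squares problem
$$\min_{Z \in \mathcal{Z}} E\left(|X - Z|^2\right). \tag{1.1}$$
Denote the minimizer in problem (1.1) by $\widetilde{X} = E(X \mid \mathfrak{I})$. Necessary conditions for the least squares minimizer $\widetilde{X} \in \mathcal{Z}$ imply that
$$E\left[(X - \widetilde{X}) Z'\right] = 0$$
for $Z$ in $\mathcal{Z}$ so that each entry of the vector $X - \widetilde{X}$ of regression errors is orthogonal to every vector $Z$ in $\mathcal{Z}$.
A measure-theoretic approach constructs a conditional expectation by extending the orthogonality property of least squares. Provided that $E|X| < \infty$, $E(X \mid \mathfrak{I})(\omega)$ is the essentially unique random vector that, for any invariant event $\Lambda$, satisfies
$$E\left(\left[X - E(X \mid \mathfrak{I})\right] \mathbf{1}_\Lambda\right) = 0,$$
where $\mathbf{1}_\Lambda$ is the indicator function that is equal to one on the set $\Lambda$ and zero otherwise.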
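For a finite sample space, the conditional expectation on a partition of invariant events and its orthogonality property can be computed directly. The following sketch uses made-up probabilities, a made-up scalar measurement, and a two-cell partition; all of these numbers and names are hypothetical.

```python
import numpy as np

prob = np.array([0.1, 0.3, 0.2, 0.4])         # hypothetical Pr on four sample points
x = np.array([1.0, 5.0, -2.0, 4.0])           # hypothetical scalar measurement X
cells = [np.array([0, 1]), np.array([2, 3])]  # cells of an assumed invariant partition

# Conditional expectation: probability-weighted average of X within each cell.
cond_exp = np.empty_like(x)
for cell in cells:
    cond_exp[cell] = np.dot(prob[cell], x[cell]) / prob[cell].sum()

# Orthogonality: E[(X - E(X|I)) 1_cell] = 0 for each invariant cell.
errors = [np.dot(prob[cell], (x - cond_exp)[cell]) for cell in cells]

# Least squares check: perturbing the value on one cell raises the criterion.
def criterion(z):
    return np.dot(prob, (x - z) ** 2)

perturbed = cond_exp.copy()
perturbed[cells[0]] += 0.5
```

The perturbation check is a spot check of the least squares characterization, not a proof: any time-invariant candidate other than the cell averages gives a larger mean squared error.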
An elementary Law of Large Numbers asserts that the limit of an average over time of a sequence of independent and identically distributed random vectors equals the unconditional expectation of the random vector. We want a more general Law of Large Numbers that applies to averages over time of sequences of observations that are intertemporally dependent. To do this, we use a notion of probabilistic invariance that is expressed in terms of the measure-preserving restriction and that implies a Law of Large Numbers applicable to stochastic processes.
The following theorem asserts two senses in which averages of intertemporally dependent processes converge to mathematical expectations conditioned on invariant events.
Theorem 1.2
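(Birkhoff) Suppose that $\mathbb{S}$ is measure-preserving relative to the probability space $(\Omega, \mathfrak{F}, Pr)$.[8]
1) For any $X$ such that $E|X| < \infty$,
$$\frac{1}{N} \sum_{t=1}^N X_t(\omega) \rightarrow E(X \mid \mathfrak{I})(\omega)$$
with probability one;
2) For any $X$ such that $E|X|^2 < \infty$,
$$E\left[\left|\frac{1}{N} \sum_{t=1}^N X_t - E(X \mid \mathfrak{I})\right|^2\right] \rightarrow 0.$$
Part 1) asserts almost-sure convergence; part 2) asserts mean-square convergence.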
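We have ample flexibility to specify a measurement function $\phi(X^{[\ell]})$ where $\phi$ is a Borel measurable function from $\mathbb{R}^{n(\ell+1)}$ into $\mathbb{R}$. In particular, an indicator function for the event $\Lambda = \{X^{[\ell]} \in \mathfrak{b}\}$ can be used as a measurement function where:
$$\phi(x) = \begin{cases} 1 & \text{if } x \in \mathfrak{b} \\ 0 & \text{if } x \notin \mathfrak{b}, \end{cases}$$
where $x$ is a hypothetical realization of the random vector $X^{[\ell]}$. The Law of Large Numbers applies to limits of
$$\frac{1}{N} \sum_{t=1}^N \phi\left(X_t^{[\ell]}\right)$$
for alternative $\phi$’s, so choosing $\phi$’s to be indicator functions shows how the Law of Large Numbers uncovers event probabilities of interest.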
Definition 1.5
A transformation $\mathbb{S}$ that is measure-preserving relative to $Pr$ is said to be ergodic under probability measure $Pr$ if all invariant events have probability zero or one.
Thus, when a transformation $\mathbb{S}$ is ergodic under measure $Pr$, the invariant events have either the same probability measure as the entire sample space $\Omega$ (whose probability measure is one), or the same probability measure as the empty set $\varnothing$ (whose probability measure is zero).
Proposition 1.1
Suppose that the measure-preserving transformation $\mathbb{S}$ is ergodic under measure $Pr$. Then $E(X \mid \mathfrak{I}) = E(X)$.
Theorem 1.2 describes conditions for convergence in the general case that $\mathbb{S}$ is measure-preserving under $Pr$, but in which $\mathbb{S}$ is not necessarily ergodic under $Pr$. Proposition 1.1 describes a situation in which probabilities assigned to invariant events are degenerate in the sense that all invariant events have the same probability as either $\Omega$ (probability one) or the null set (probability zero).
When $\mathbb{S}$ is ergodic under measure $Pr$, limit points of time series averages equal corresponding unconditional expectations, an outcome we can call a standard Law of Large Numbers.
When $\mathbb{S}$ is not ergodic under $Pr$, limit points of time series averages equal expectations conditioned on invariant events.
The following examples remind us how ergodicity restricts $\mathbb{S}$ and $Pr$.
Example 1.6
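Consider Example 1.2 again. $\Omega$ contains two points and $\mathbb{S}$ maps $\omega_1$ into $\omega_2$ and $\omega_2$ into $\omega_1$: $\mathbb{S}(\omega_1) = \omega_2$ and $\mathbb{S}(\omega_2) = \omega_1$.
Suppose that the measurement vector is
$$X(\omega) = \begin{cases} 1 & \text{if } \omega = \omega_1 \\ 0 & \text{if } \omega = \omega_2. \end{cases}$$
Then it follows directly from the specification of $\mathbb{S}$ that
$$\frac{1}{N} \sum_{t=1}^N X_t(\omega) \rightarrow \frac{1}{2}$$
for both values of $\omega$. The limit point is the average across sample points.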
Example 1.7
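Return to Example 1.3. $\Omega$ contains two points, $\Omega = \{\omega_1, \omega_2\}$, and $\mathbb{S}(\omega_1) = \omega_1$ and $\mathbb{S}(\omega_2) = \omega_2$. Then $X_t(\omega) = X(\omega)$, so that the sequence is time invariant and equal to its time-series average. A time-series average of $X_t(\omega)$ equals the average across sample points only when $Pr$ assigns probability $1$ to either $\omega_1$ or $\omega_2$.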
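The contrast between the ergodic and nonergodic two-point examples can be seen by computing time averages along orbits. In this sketch the measurement is assumed to equal one at the first sample point and zero at the second; the dictionary representation and the function name `time_average` are hypothetical.

```python
def time_average(transform, measurement, omega, n_periods):
    """Average of X[S^t(omega)] over t = 1, ..., n_periods."""
    total = 0.0
    for _ in range(n_periods):
        omega = transform[omega]      # apply S once more
        total += measurement[omega]
    return total / n_periods

# Assumed measurement: X = 1 at w1 and X = 0 at w2.
measurement = {"w1": 1.0, "w2": 0.0}
swap = {"w1": "w2", "w2": "w1"}       # the transformation that swaps the points
identity = {"w1": "w1", "w2": "w2"}   # the transformation that fixes each point

swap_averages = {w: time_average(swap, measurement, w, 10_000) for w in swap}
identity_averages = {
    w: time_average(identity, measurement, w, 10_000) for w in identity
}
```

Under the swap, the time average is one half from either starting point. Under the identity, the time average depends on the starting point, which is the conditional expectation given the invariant event that contains it.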
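Given a triple $(\Omega, \mathfrak{F}, Pr)$ and a measure-preserving transformation $\mathbb{S}$, we can use Theorem 1.2 to construct limiting empirical measures on $\mathfrak{F}$. To start, we will analyze a setting with a countable partition of $\Omega$ consisting of invariant events $\{\Lambda_j : j = 1, 2, \ldots\}$, each of which has strictly positive probability under $Pr$. With the exception of the null set, we assume that all invariant events are unions of the members of this partition. We consider a more general setting later. Given an event $\Lambda$ in $\mathfrak{F}$ and for almost all $\omega \in \Lambda_j$, define the limiting empirical measure $Qr_j$ as
$$Qr_j(\Lambda)(\omega) = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{t=1}^N \mathbf{1}_\Lambda\left[\mathbb{S}^t(\omega)\right] = \frac{Pr(\Lambda \cap \Lambda_j)}{Pr(\Lambda_j)}. \tag{1.2}$$
Thus, when $\omega \in \Lambda_j$, $Qr_j(\Lambda)$ is the fraction of time that $\mathbb{S}^t(\omega) \in \Lambda$ in very long samples. If we hold $\Lambda_j$ fixed and let $\Lambda$ be an arbitrary event in $\mathfrak{F}$, we can treat $Qr_j$ as a probability measure on $(\Omega, \mathfrak{F})$. By doing this for each $\Lambda_j$, $j = 1, 2, \ldots$, we can construct a countable set of probability measures $\{Qr_j\}_{j=1}^\infty$. These comprise the set of all measures that can be recovered by applying the Law of Large Numbers. If nature draws an $\omega \in \Lambda_j$, then measure $Qr_j$ describes outcomes.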
So far, we started with a probability measure $Pr$ and then constructed the set of possible limiting empirical measures $Qr_j$’s. We now reverse the direction of the logic by starting with probability measures $Qr_j$ and then finding measures $Pr$ that are consistent with them. We do this because $Qr_j$’s are the only measures that long time series can disclose through the Law of Large Numbers: each $Qr_j$ defined by (1.2) uses the Law of Large Numbers to assign probabilities to events $\Lambda \in \mathfrak{F}$. However, because
$$Qr_j(\Lambda) = Pr(\Lambda \mid \Lambda_j) \quad \text{for } j = 1, 2, \ldots$$
are conditional probabilities, such $Qr_j$’s are silent about the probabilities $Pr(\Lambda_j)$ of the underlying invariant events $\Lambda_j$. There are multiple ways to assign probabilities $Pr$ that imply identical probabilities conditioned on invariant events.
Because $Qr_j$ is all that can ever be learned by “letting the data speak”, we regard each probability measure $Qr_j$ as a statistical model.[9]
Proposition 1.2
A statistical model is a probability measure that a Law of Large Numbers can disclose.
Probability measure $Qr_j$ describes a statistical model associated with invariant set $\Lambda_j$.
Remark 1.1
For each $j$, $\mathbb{S}$ is measure-preserving and ergodic on $(\Omega, \mathfrak{F}, Qr_j)$. The second equality of definition (1.2) assures ergodicity by assigning probability one to the event $\Lambda_j$.
Relation (1.2) implies that probability $Pr$ connects to probabilities $Qr_j$ by
$$Pr(\Lambda) = \sum_j Qr_j(\Lambda) \, Pr(\Lambda_j). \tag{1.3}$$
While decomposition (1.3) follows from definitions of the elementary objects that comprise a stochastic process and is “just mathematics”, it is interesting because it tells how to construct alternative probability measures $Pr$ for which $\mathbb{S}$ is measure-preserving. Because long data series disclose probabilities conditioned on invariant events to be $Qr_j$, to respect evidence from long time series we must hold the $Qr_j$’s fixed, but we can freely assign probabilities $Pr$ to invariant events $\Lambda_j$. In this way, we can create a family of probability measures for which $\mathbb{S}$ is measure-preserving.
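A small finite example may help make the decomposition concrete. The sketch below assumes a hypothetical four-point sample space on which the transformation swaps points within each of two invariant cells; it computes the limiting empirical measure attached to each cell by following an orbit and then checks that mixtures of those measures with arbitrary weights on the cells are measure-preserving. All names and numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical four-point example: S swaps points within each invariant cell.
transform = np.array([1, 0, 3, 2])
cells = [[0, 1], [2, 3]]

def empirical_measure(start, n_periods=1_000):
    """Fraction of time the orbit of `start` spends at each sample point."""
    counts = np.zeros(len(transform))
    omega = start
    for _ in range(n_periods):
        omega = transform[omega]
        counts[omega] += 1
    return counts / n_periods

# One limiting empirical measure (statistical model) per invariant cell.
models = [empirical_measure(cell[0]) for cell in cells]

def mixture(weights):
    """Weighted average of the models, with weights on the invariant cells."""
    return sum(w * q for w, q in zip(weights, models))

def is_measure_preserving(prob):
    pushed = np.zeros_like(prob)
    np.add.at(pushed, transform, prob)   # Pr(S^{-1}({point})) for each point
    return np.allclose(pushed, prob)

mixtures_ok = all(
    is_measure_preserving(mixture([p, 1.0 - p])) for p in (0.0, 0.3, 1.0)
)
# A measure that is not a mixture of the two models fails the check.
non_mixture_ok = is_measure_preserving(np.array([0.4, 0.1, 0.25, 0.25]))
```

The weights on the cells can be chosen freely, while the within-cell measures are pinned down by the orbits, mirroring the roles of the two factors in the decomposition.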
Up to now, we have represented invariant events with a countable partition. Dynkin [1978] deduced a more general version of decomposition (1.3) without assuming a countable partition. Thus, start with a pair $(\Omega, \mathfrak{F})$. Also, assume that there is a metric on $\Omega$ and that $\Omega$ is separable. We also assume that $\mathfrak{F}$ is the collection of Borel sets (the smallest sigma algebra containing the open sets). Given $(\Omega, \mathfrak{F})$, take a (measurable) transformation $\mathbb{S}$ and consider the set $\mathcal{P}$ of probability measures $Pr$ for which $\mathbb{S}$ is measure-preserving. For some of these probability measures, $\mathbb{S}$ is ergodic, but for others, it is not. Let $\mathcal{Q}$ denote the set of probability measures for which $\mathbb{S}$ is ergodic. Under a nondegenerate convex combination of two probability measures in $\mathcal{Q}$, $\mathbb{S}$ is measure-preserving but not ergodic. Dynkin [1978] constructed limiting empirical measures $Qr$ on $\mathcal{Q}$ and justified the following representation of the set $\mathcal{P}$ of probability measures $Pr$.
Proposition 1.3
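For each probability measure $\widetilde{Pr}$ in $\mathcal{P}$, there is a unique probability measure $\pi$ over $\mathcal{Q}$ such that
$$\widetilde{Pr}(\Lambda) = \int_{\mathcal{Q}} Qr(\Lambda) \, \pi(dQr) \tag{1.4}$$
for all $\Lambda \in \mathfrak{F}$.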
Proposition 1.3 generalizes representation (1.3). It asserts a sense in which the set $\mathcal{P}$ of probabilities for which $\mathbb{S}$ is measure-preserving is convex. Extremal points of this set are in the smaller set $\mathcal{Q}$ of probability measures for which the transformation $\mathbb{S}$ is ergodic. Representation (1.4) shows that by forming “mixtures” (i.e., weighted averages or convex combinations) of probability measures under which $\mathbb{S}$ is ergodic, we can represent all probability specifications for which $\mathbb{S}$ is measure-preserving.
To add another perspective, a collection of invariant events $\mathfrak{I}$ is associated with a transformation $\mathbb{S}$. There exists a common conditional expectation operator $E(\cdot \mid \mathfrak{I})$ that assigns mathematical expectations to bounded measurable functions (mapping $\Omega$ into $\mathbb{R}$) conditioned on the set of invariant events $\mathfrak{I}$. The conditional expectation operator $E(\cdot \mid \mathfrak{I})$ characterizes limit points of time series averages of indicator functions of events of interest as well as other random vectors. Alternative probability measures $Pr$ assign different probabilities to the invariant events.
An applied researcher typically does not know which statistical model generated the data. This situation leads us to specifications of $\mathbb{S}$ that are consistent with a family $\mathcal{P}$ of probability models under which $\mathbb{S}$ is measure-preserving and a stochastic process is stationary. Representation (1.4) describes uncertainty about statistical models with a probability distribution $\pi$ over the set of statistical models $\mathcal{Q}$.
For a Bayesian, $\pi$ is a subjective prior probability distribution that pins down a convex combination of “statistical models.”[11] A Bayesian expresses trust in that convex combination of statistical models used to construct a complete probability measure over outcomes[12] and uses it to compute expected utility. A Bayesian decision theory axiomatized by Savage makes no distinction between how decision makers respond to the probabilities described by the component statistical models and the $\pi$ probabilities that he uses to mix them. All that matters to a Bayesian decision maker is the complete probability distribution over outcomes, not how it is attained as a $\pi$-mixture of component statistical models.
Some decision and control theorists challenge the complete confidence in a single prior probability assumed in a Bayesian approach.[13] They want to distinguish ‘ambiguity’, meaning not being able confidently to assign $\pi$, from ‘risk’, meaning prospective outcomes with probabilities reliably described by a statistical model. They imagine decision makers who want to evaluate decisions under alternative $\pi$’s.[14] We explore these ideas in later chapters.
An important implication of the Law of Large Numbers is that for a given initial $\pi$, using Bayes’ rule to update the $\pi$ probabilities as data arrive will eventually concentrate posterior probability on the statistical model that generates the data. Even when a decision maker entertains a family of $\pi$’s, the updated probabilities conditioned on the data may still concentrate on the statistical model that generates the data.
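Posterior concentration can be illustrated with a deliberately simple setting. The sketch below assumes just two candidate statistical models, each an IID Bernoulli process with a different head probability, and a prior that weights them equally; these choices and all names are hypothetical and serve only to show Bayes’ rule at work on a long sample.

```python
import numpy as np

rng = np.random.default_rng(2)

head_probs = np.array([0.3, 0.6])   # two hypothetical statistical models
prior = np.array([0.5, 0.5])        # assumed initial weights on the two models
true_model = 1                      # the model that generates the data

draws = rng.uniform(size=500) < head_probs[true_model]
n_heads = draws.sum()
n_tails = draws.size - n_heads

# Bayes' rule in logs to avoid numerical underflow in long samples.
log_post = (
    np.log(prior)
    + n_heads * np.log(head_probs)
    + n_tails * np.log1p(-head_probs)
)
log_post -= log_post.max()
posterior = np.exp(log_post) / np.exp(log_post).sum()
```

After a few hundred observations, almost all posterior probability sits on the model that generated the data, regardless of the (nondegenerate) prior weights chosen here.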
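We now apply the Law of Large Numbers to the estimation of the equations in a vector autoregression.
Let $Y_{t+1}$ be one of the entries of $X_{t+1}$, and consider the regression equation:
$$Y_{t+1} = \beta \cdot X_t + U_{t+1},$$
where $U_{t+1}$ is a least squares residual. By choosing $Y_{t+1}$ to be alternative entries of $X_{t+1}$ we obtain the different equations in a VAR system. Our perspective in this discussion is that of an econometrician who fits such a regression system without taking a stand on the actual dynamic stochastic evolution of the $X_t$’s. To express subjective uncertainty about $\beta$ we allow it to be random but measurable in terms of the collection of invariant events $\mathfrak{I}$. As implied by least squares, we impose that the regression error, $U_{t+1}$, is orthogonal to the vector of regressors $X_t$ conditioned on $\mathfrak{I}$:
$$E\left(X_t U_{t+1} \mid \mathfrak{I}\right) = E\left[X_t \left(Y_{t+1} - X_t' \beta\right) \mid \mathfrak{I}\right] = 0, \tag{1.5}$$
which uniquely pins down the regression coefficient $\beta$ provided that the matrix $E(X_t X_t' \mid \mathfrak{I})$ is nonsingular with probability one. Notice that
$$\frac{1}{N} \sum_{t=1}^N X_t X_t' \rightarrow E\left(X_t X_t' \mid \mathfrak{I}\right), \qquad \frac{1}{N} \sum_{t=1}^N X_t Y_{t+1} \rightarrow E\left(X_t Y_{t+1} \mid \mathfrak{I}\right),$$
where convergence is with probability one. Thus, from equation (1.5) it follows that a consistent estimator of $\beta$ is a $b_N$ that satisfies
$$\frac{1}{N} \sum_{t=1}^N X_t \left(Y_{t+1} - X_t' b_N\right) = 0.$$
Solving for $b_N$ gives the familiar least squares formula:
$$b_N = \left(\sum_{t=1}^N X_t X_t'\right)^{-1} \sum_{t=1}^N X_t Y_{t+1}.$$
Note how statements about the consistency of $b_N$ are conditioned on $\mathfrak{I}$. This conditioning is necessary when we do not know ex ante which among a family of vector autoregressions generates the data.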
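The role of conditioning on invariant events in this estimation problem can be illustrated by simulation. The sketch below assumes a scalar first-order autoregression with Gaussian shocks whose coefficient is drawn once, before the sample begins, from two hypothetical values; the coefficient is therefore random from the econometrician's perspective but constant along any realized path. These modeling choices and names are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# The coefficient is drawn once and then held fixed along the sample path,
# so it is measurable with respect to the invariant events.
beta = rng.choice([0.2, 0.8])
n_obs = 50_000

x = np.empty(n_obs + 1)
x[0] = rng.standard_normal() / np.sqrt(1.0 - beta ** 2)  # stationary initial draw
shocks = rng.standard_normal(n_obs)
for t in range(n_obs):
    x[t + 1] = beta * x[t] + shocks[t]

regressors = x[:-1].reshape(-1, 1)   # X_t, an n x 1 vector with n = 1 here
targets = x[1:]                      # Y_{t+1}

# Least squares: solve (sum X_t X_t') b_N = sum X_t Y_{t+1}.
b_n = np.linalg.solve(regressors.T @ regressors, regressors.T @ targets)
```

The estimate approaches whichever coefficient was drawn, not an average over the two candidate values, which is the sense in which consistency is conditional on the invariant events.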
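When $\mathbb{S}$ is measure-preserving and the process $\{X_t : t = 0, 1, \ldots\}$ is stationary, it can be useful to invent an infinite past. To accomplish this, we reason in terms of the (measurable) transformation $\mathbb{S} : \Omega \rightarrow \Omega$ that describes the evolution of a sample point $\omega$. Until now we have assumed that $\mathbb{S}$ has the property that for any event $\Lambda \in \mathfrak{F}$,
$$\mathbb{S}^{-1}(\Lambda) = \{\omega \in \Omega : \mathbb{S}(\omega) \in \Lambda\}$$
is an event in $\mathfrak{F}$. In Section Stationary Increments, we want more. To prepare the way for that chapter, in this section we shall also assume that $\mathbb{S}$ is one-to-one and has the property that for any event $\Lambda \in \mathfrak{F}$,
$$\mathbb{S}(\Lambda) = \{\omega \in \Omega : \mathbb{S}^{-1}(\omega) \in \Lambda\} \in \mathfrak{F}. \tag{1.6}$$
Because $X_t(\omega) = X[\mathbb{S}^t(\omega)]$ is well defined for negative values of $t$, restrictions (1.6) allow us to construct a “two-sided” process that has both an infinite past and an infinite future.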
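We assume that $\{\mathfrak{A}_t : -\infty < t < +\infty\}$ is a nondecreasing sequence of subsigma algebras of $\mathfrak{F}$, where $\mathfrak{A}_t$ consists of the events $\{\omega \in \Omega : \mathbb{S}^t(\omega) \in \Lambda\}$ for $\Lambda$ in a subsigma algebra $\mathfrak{A}$ of $\mathfrak{F}$. The nondecreasing structure captures the accumulation of information over time. If the original measurement function $X$ is $\mathfrak{A}$-measurable, then $X_t$ is $\mathfrak{A}_t$-measurable. Furthermore, $X_{t-j}$ is $\mathfrak{A}_t$-measurable for all $j \geq 0$. The set $\mathfrak{A}_t$ depicts information available at date $t$, including past information. Invariant events in $\mathfrak{I}$ are contained in $\mathfrak{A}_t$ for all $t$.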
We construct the following moving-average representation of a scalar process in terms of an infinite history of shocks.
Example 1.8
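(Moving average) Suppose that $\{W_t : -\infty < t < \infty\}$ is a vector stationary process for which[15]
$$E\left(W_{t+1} \mid \mathfrak{A}_t\right) = 0 \quad \text{and} \quad E\left(W_{t+1} W_{t+1}' \mid \mathfrak{A}_t\right) = I$$
for all $-\infty < t < +\infty$. Use a sequence of vectors $\{\alpha_j\}_{j=0}^\infty$ to construct
$$X_t = \sum_{j=0}^\infty \alpha_j \cdot W_{t-j}, \quad \text{where} \quad \sum_{j=0}^\infty |\alpha_j|^2 < \infty. \tag{1.9}$$
Restriction (1.9) implies that $X_t$ is well defined as a mean square limit. $X_t$ is constructed from the infinite past $\{W_{t-j} : 0 \leq j < \infty\}$.
The process $\{X_t : -\infty < t < \infty\}$ is stationary and is often called an infinite-order moving average process.
The sequence $\{\alpha_j : j = 0, 1, \ldots\}$ can depend on the invariant events.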
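A truncated version of such a moving average is easy to simulate. The sketch below assumes scalar IID standard normal shocks and geometrically declining coefficients, truncated at a finite number of lags; these are illustrative choices, and the truncation is an approximation to the infinite sum. It compares the sample variance of the simulated process with the sum of squared coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
n_lags, n_obs = 200, 200_000

# Hypothetical square-summable coefficients, truncated at n_lags terms.
alpha = 0.9 ** np.arange(n_lags)

# Assumed scalar shocks: IID standard normal (mean zero, unit variance).
shocks = rng.standard_normal(n_obs + n_lags - 1)

# mode="valid" keeps only dates with a full window of n_lags current and past shocks.
x = np.convolve(shocks, alpha, mode="valid")

theoretical_variance = np.sum(alpha ** 2)
sample_variance = x.var()
```

With unit-variance, serially uncorrelated shocks, the variance of the moving average equals the sum of squared coefficients, which is finite exactly when the coefficients are square summable.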
Remark 1.2
Slutsky [1927] and Yule [1927] used probability models to analyze economic time series. Their models implied moving-average representations like the one in Example 1.8. Their idea was to view economic time series as responding linearly to current and past independent and identically distributed impulses or shocks. In distinct contributions, they showed how such models generate recurrent but aperiodic fluctuations that resemble business cycles and longer-term cycles as well. Yule [1927] and Slutsky [1927] came from different backgrounds and brought different perspectives. Yule [1927] was an eminent statistician who, among other important contributions, managed “effectively to invent modern time series analysis” in the words of Stigler [1986]. Yule constructed and estimated what we would now call a second-order autoregression and applied it to study sunspots. Yule’s estimated coefficients implied damped oscillations at the same periodicity as sunspots. In Russia in the 1920s, Slutsky [1927] wrote a seminal paper in Russian motivated by his interest in business cycles. Later an English version of his paper was published in Econometrica. Even before that, it influenced economists including Ragnar Frisch. Indeed, Frisch was keenly aware of both Slutsky [1927] and Yule [1927] and generously acknowledged both of them in his seminal paper Frisch [1933] on the impulse and propagation problem. Building on insights of Slutsky [1927] and Yule [1927], Frisch [1933] pioneered impulse response functions. He aspired to provide explicit economic interpretations for how shocks alter economic time series intertemporally.[16]
For a fixed $\mathbb{S}$ there are often many possible probabilities $Pr$ that are measure-preserving. A subset of these are ergodic. These ergodic probabilities can serve as building blocks for the other measure-preserving probabilities. Thus, each measure-preserving $Pr$ can be expressed as a weighted average of the ergodic probabilities. We call the ergodic probabilities statistical models. The Law of Large Numbers applies to each of the ergodic building blocks with limit points that are unconditional expectations. As embodied in (1.3) and its generalization (1.4), this decomposition interests both frequentist and Bayesian statisticians.