Contents

About STAMP In detail Order STAMP Documentation Examples Literature

External links

Timberlake UK Office Timberlake US Office OxMetrics SsfPack Econometric Links

Unobserved Components Time Series Methodology

Click on one of the following topics
14 Statistical Treatment of Models
  14.1 Model definitions
  14.1.1 Univariate time series models
  14.1.2 Explanatory variables and interventions
  14.1.3 Multivariate time series models
  14.1.4 Common factors
  14.1.4.1 Common levels
  14.1.4.2 Smooth trends with common slopes
  14.1.4.3 Common trends: level and slopes
  14.1.4.4 Common seasonals
  14.1.4.5 Common cycles
  14.1.5 Explanatory variables
  14.2 State space form
  14.2.1 Structural time series models in SSF
  14.2.1.1 Univariate models in SSF
  14.2.1.2 Multivariate model in SSF
  14.2.1.3 Deterministic trend and seasonal components
  14.3 Kalman filter
  14.3.1 The augmented Kalman filter
  14.3.2 The likelihood function
  14.3.2.1 Univariate models: the concentrated likelihood
  14.3.2.2 Homogeneous models
  14.3.2.3 Multivariate models
  14.3.3 Prediction error variance
  14.3.4 The final state and regression estimates
  14.3.5 Filtered components
  14.3.6 Residuals
  14.3.6.1 Generalised least squares residuals
  14.3.6.2 Generalised recursive residuals
  14.4 Disturbance smoother
  14.4.1 The augmented disturbance smoother
  14.4.2 The EM algorithm and exact score
  14.4.3 Smoothed components
  14.4.4 Auxiliary residuals
  14.5 Forecasting
  14.5.1 Forecasts of series and components
  14.5.2 Extrapolative residuals
  14.6 Parameter estimation
  14.6.1 The parameters of STAMP
  14.6.1.1 Variance matrices in a multivariate model
  14.6.1.2 Vector autoregressive components
  14.6.2 BFGS Estimation procedure*
  14.6.3 Estimation of univariate models*
  14.6.3.1 Initial values
  14.6.3.2 Strategy for setting parameters fixed
  14.6.3.3 Strategy for determining the concentrated parameter
  14.6.4 Estimation of multivariate models*
  14.6.4.1 Initial values
  14.6.4.2 Strategy for fixing parameters
  14.7 Appendix 1: Diffuse distributions*
  14.7.1 Collapse of the augmented KF
  14.7.2 Likelihood calculation
  14.7.3 Recursive regressions
  14.8 Appendix 2: Numerical optimisation*
  14.8.1 Newton type methods
  14.8.2 Numerical score vector and diagonal Hessian matrix
15 Model Output
  15.1 Output from STAMP
  15.1.1 Model estimation
  15.1.2 Selected model and estimation output
  15.1.3 Summary statistics
  15.1.4 The sample period
  15.1.5 Test
  15.2 Parameters
  15.2.1 Variances and standard deviations
  15.2.2 Cycle and AR(1)
  15.2.3 Variance matrices (multivariate models)
  15.2.4 Factor loading matrices (multivariate models)
  15.2.5 Transformed parameters and standard errors
  15.3 Final state
  15.3.1 Analysis of state
  15.3.2 Regression analysis
  15.3.3 Seasonal tests
  15.3.4 Cycle tests
  15.3.5 Data in logs
  15.4 Goodness of fit
  15.4.1 Prediction error variance
  15.4.2 Prediction error mean deviation
  15.4.3 Coefficients of determination
  15.4.4 Information criteria : AIC and BIC
  15.5 Components
  15.5.1 Series with components
  15.5.2 Detrended
  15.5.3 Seasonally adjusted
  15.5.4 Individual seasonals
  15.5.5 Data in logs
  15.6 Residuals
  15.6.1 Correlogram
  15.6.2 Periodogram and spectrum
  15.6.3 Cumulative statistics and graphs
  15.6.4 Distribution statistics
  15.6.5 Heteroskedasticity
  15.7 Auxiliary residuals
  15.8 Predictive testing
  15.9 Forecast
  Part:4 Statistical Output
13 Descriptive Statistics
  13.1 Transformations
  13.2 Correlogram
  13.3 Periodogram
  13.4 Spectrum
  13.5 CUSUM
  13.6 CUSUM of squares
  13.7 Histogram
  13.8 Density estimation
  13.9 Distribution function
  13.10 Write report
  13.10.1 Serial correlation
  13.10.2 Normality test
List of tables
  Table:13.1 Empirical size of normality tests
  Table:14.1 Some special level and trend specifications
  Table:14.2 Parameters and transformations

Part:4 Statistical Output

Chapter 13 Descriptive Statistics

The Transform entry in the Data menu involves transformation of series and calculation of statistics. This chapter provides the formulae underlying the computations of descriptive statistics for individual time series. Model-related diagnostic statistics are presented in Chapter 15.

The Describe dialog, to be accessed from the Transform dialog, produces two types of output. The first, and most useful, is graphical output. Each of the options we discuss below produces graphs of one sort or another. However, the last entry in the Describe dialog, `Write report', generates output to be written to the Results window. By default this is turned off, but it is often useful to mark it so that precise summary statistics can be recorded for later use in a research report.

13.1 Transformations

This option provides two different kinds of transformations: a transform function for the actual values and a time series transform function. These different transformations are described below.

The very common and basic transformations are the logarithm, the exponent and the absolute values. The `Box--Cox' and the `Stochastic volatility' transformations require some explanation. The transformation of Box and Cox (1964) is given by

(ytλ -1)/(λÿ) if λ≠0
log yt if λ=0
,  where ÿ= exp ( 1/T ∑t=1T log yt) ,  t=1,...,T,
(eq:13.1)

where λ usually ranges between -2 and 2 (STAMP does not accept values outside this range to ensure numerical stability). Note that the Box--Cox transformation with the default λ= 1/2 is very similar to the `square root' transformation. The following stochastic volatility transformation (see also Chapter ??) is suggested by Professor Wayne Fuller, see Fuller (1996, p.494--497); that is,

log yt2= log (yt2+λsy2)-λsy2/(yt2+λsy2),  t=1,...,T
(eq:13.2)

where sy2 is the unconditional variance of the yt, and λ is a small constant value, the default value 0.02.

The second type of transformations focuses on differencing or integrating time series. If differencing is required, then the following is optional:

None yt
First Δyt=(1-L)yt
Second Δ2yt=(1-L)2yt
Seasonal sum S(L)yt=(1+L+...+Ls-1)yt
Seasonal Δsyt=(1-Ls)yt
First &seasonal ΔΔsyt=(1-L)(1-Ls)yt
(eq:13.3)

where L is the lag operator (so Lyt=yt-1) and s is the number of seasonals in a year. When the differencing involves missing data, such as y0 and y-1, then the resulting differenced series will record a missing value. An example of this is that Δy1 is regarded as a missing value. If integration is required, then the resulting series, called zt, are defined as

None zt=yt
First Δzt=yt
Second Δ2zt=yt
Seasonal sum S(L)zt=yt
Seasonal Δszt=yt
First &seasonal ΔΔszt=yt
(eq:13.4)

where zt is set to zero for any t≤0. This operation does not generate extra missing values.

Finally, the series can be detrended. This is carried out using the following trend model:

yt = μtt, εt~NID(0,σ2),
μt = μt-1t-1
βt = βt-1t, σt~NID(0,qσ2),
(eq:13.5)

which is discussed in §14.1. A forwards and backwards recursive algorithm is used to estimate μt, written by \mu t, allowing the `detrended' series to be computed as yt-\mu t. The default parameter for q is 0.000625=1/1600, as suggested by Hodrick and Prescott. By increasing this number, a more responsive trend line is fitted, while if it is set to zero, a deterministic time trend is obtained.

13.2 Correlogram

The correlogram is defined as the series rτ , for 1,...,p, where rτ is the sample correlation between yt and yt-τ. The length of the correlogram p is pre-specified, leading to a figure which shows r1,r2,...,rp plotted against 1,2,...,p. The autocorrelation for lag τ is defined by

rτ =
t=τ+1T( yt-y) ( yt-τ-y)

t=τ+1T( yt-y) 2
.
(eq:13.6)

where y= 1/T ∑t=1Tyt is the sample mean of yt. The approximate standard error for rτ is 1/(T)½.

13.3 Periodogram

The information in the correlogram can also be expressed in the frequency domain. The ordinates of the periodogram are

pj)=aj2+bj2,    for λj=2πj/T,  j=0,...,n,  and  n=[T/2],
(eq:13.7)

where aj and bj are given by

a0=(T)½y, aj=( 2/T )½t=1Tytcj,t, an= ( 1/T )½t=1Tyt( -1) t,
b0=0, bj=( 2/T )½t=1Tytsj,t, bn=0,
(eq:13.8)

with cj,t= cos λjt and sj,t= sin λjt, for j=1,...,n-1. The sine and cosine terms are evaluated recursively, using the trigonometric addition rule; that is,

(
sj,t+1
cj,t+1
) =[
cos λj sin λj
- sin λj cos λj
] (
sj,t
cj,t
) ,  with  (
sj,0
cj,0
) =(
0
1
) ,
(eq:13.9)

for t=0,...,T-1. Note that the terms cos λj and sin λj, for j=0,...,n, can also be evaluated recursively using a similar scheme. These recursions speed up the calculations by a significant amount. The periodogram is a plot of the ordinates pj) against λj/π, for j=1,...,n. Thus the X-axis ranges between 0 and 1.

The periodogram can also be written in terms of the sample autocorrelations; that is,

pj)=
1

[ c(0)+2∑τ=1T-1c(τ) cos (λτ)] ,
(eq:13.10)

where c(τ)= 1/T ∑t=1T(yt-y)(yt-τ-y).

In the literature on the spectral analysis of stationary time series, the plot of pj) against λj is sometimes called the spectrogram, but the term periodogram is more widely used. For more information on this topic see Priestley (1981).

13.4 Spectrum

The periodogram is an inconsistent estimator of the spectrum; a consistent estimator of the spectrum is given by

( λi) =
1

j=-mmwjp( λi+j) ,
(eq:13.11)

where m is called the window length and the wj's are normalised weights which are constrained by w-j=wj, for j=0,...,m. The weights in STAMP are directly associated with the odd rows of the Pascal's triangular scheme, for example,

m c0 c1 c2 c3 c4
2 2 1
3 6 4 1
4 20 15 6 1
5 70 56 28 8 1
(eq:13.12)

and wj=cj41-m. Note that, for this Pascal's scheme, the normalisation factor is based on ∑j=-mmcj=4m-1. At the beginning and at the end of the estimated spectrum the weighting scheme is simply truncated and the normalisation factor is adjusted appropriately so that ∑j wj=1.

13.5 CUSUM

The CUSUM graph is designed to assess whether the mean of a process yt changes over time; see Brown, Durbin and Evans (1975) and Harvey (1990, p. 153--5). In STAMP the CUSUMs are defined with respect to the standardised series

CUSUM(t)=σ̂ y-1j=1t(yj-y),  t=1,...,T
(eq:13.13)

where y= 1/T ∑j=1Tyj and σ̂ y=( 1/T ∑j=1T( yj-y) )½. The motivation for the CUSUM is that if the mean changes through the standardised series, it will have runs of positive or negative observations. Thus the CUSUM will drift up (or down), before going back to zero at CUSUM(T). A formal check on the CUSUM is based on assessing whether it crosses two `boundary' lines as given by

CUSUM=±[0.85(T)½+1.7t/(T)½].
(eq:13.14)

These lines are based on a significance level of 10%.

13.6 CUSUM of squares

A similar device to the CUSUM is available for detecting a change in variance. Attention is now focused on the squares of the data and so we work with,

CUSUMSQ(t)=∑j=1t(yj-y)2/∑j=1T(yj-y)2,  t=1,...,T.
(eq:13.15)

As t increases, the CUSUMSQ increases. If the model is homoskedastic, the increase should be linear. Departures from the 45° line in the CUSUMSQ graph suggest a breakdown of this assumption.

13.7 Histogram

Let zt denote the standardised observations (yt-y)/σ̂ y, for t=1,...,T, where y and σ̂ y are defined in §13.5. The range of zt is divided into N intervals of length h with h defined in §13.8. Then the proportion of zt in each interval constitutes the histogram.

13.8 Density estimation

The density can be estimated as a smoothed function of the histogram using a normal or Gaussian kernel. This can then be summed (`integrated') to obtain the estimated cumulative distribution function (CDF).

Denote the actual density of Z at z by fz( z) . A non-parametric estimate of the density is obtained from the sample by:

fz( z) ̂= 1/Th ∑t=1TK(
z-zt

h
) ,
(eq:13.16)

where h is the window width or smoothing parameter, and K( .) is a kernel such that:

-∞ K( x) dx=1.
(eq:13.17)

STAMP sets h=1.06σ̂ z/T0.2 (=1.06/T0.2 as σ̂ z=1) as a default and uses the standard normal density for K( .) :

K(
z-zt

h
) =
1

(2π)½
exp [ -
1/2
( z-zt/h ) 2] .
(eq:13.18)

We thank Professor Bernard Silverman for permission to use his algorithm for calculating fz( z) ̂. A general treatment of density function estimation can be found in Silverman (1986).

13.9 Distribution function

The estimated CDF of Z can be derived from fz( z) ̂; this is shown with a standard normal CDF for comparison. When selecting one graph, a non-parametric density estimation menu allows the choices determining h to be set. The menu does not appear for multiple graphs on one screen.

13.10 Write report

When the `Write report' option is active in the Describe dialog, some descriptive statistics will be sent to the Results window. Which are selected to be reported will depend on the other options which are activated in the Describe dialog. This can be broken down into two types of output: measures of serial dependence and normality.

13.10.1 Serial correlation

The correlogram values rτ , for τ=1,...,p, are defined in §??. Their standard errors, under the null hypothesis of white noise, are equal to 1/(T)½. The Box--Ljung portmanteau statistic is given by

Q( p,q) =T(T+2)∑τ=1p
rτ 2

T-p
(eq:13.19)

see Ljung and Box (1978). This is tested against a χq2 distribution. When no model is fitted, the degrees of freedom, q, is set equal to p. However, when the Q-statistic is applied to the residuals of an estimated structural time series model, q is set equal to p+1 minus the number of parameters. Chapter 15 gives the details on how the Q -statististic is used in the context of fitting structural time series models. The Box--Ljung Q-statistic is a modification of the earlier Box and Pierce (1970) statistic, corrected to have better size behaviour in moderate samples. It is reported for a variety of values of p along with its Prob. value computed using a χq2 distribution.

13.10.2 Normality test

Let μ, σ2 denote the mean and variance of y, and write μi=E[ y-μ] i, so that σ22. The skewness and kurtosis are defined as

√β1=
μ3

μ23/2
and β2=
μ4

μ22
.
(eq:13.20)

Sample counterparts are defined by

y= 1/T ∑t=1Tyt,  mi= 1/T ∑t=1T( yt-y) i,  √b1=
m3

m23/2
and b2=
m4

m22
.
(eq:13.21)

A normal variate will have √β1=0 and β2=3. STAMP reports

s=
T( √b1) 2

6
  and  k=
T( b2-3) 2

24
,
(eq:13.22)

which are both tested against a χ12 distribution. The Bowman--Shenton test for normality is given by

NBS=s+k,
(eq:13.23)

which is tested against a χ22 distribution. As Bowman and Shenton (1975) note, the above tests are unsuitable unless used in very large samples. The statistics √b1 and b2 are not independently distributed in small samples, and the sample kurtosis especially approaches normality very slowly. Another normality test reported by STAMP is described in Doornik and Hansen (1994). It derives from Shenton and Bowman (1977), who give b2 (conditional on b2>1+b1) a gamma distribution, and D'Agostino (1970), who approximates the distribution of √b1 by the Johnson Su system. Let s* and k* denote the transformed skewness and kurtosis, respectively, where the transformation creates statistics which are much closer to standard normal. The alternative normality test is

NDH=s*+k*,
(eq:13.24)

which is also tested against a χ22 distribution.Table Table:13.1 compares (eq:13.24) with its asymptotic form (eq:13.23). It gives the rejection frequencies under the null of normality, using χ22 critical values. The experiments are based on 10,000 replications and common random numbers.

Table:13.1 Empirical size of normality tests

nominal probabilities of NDH nominal probabilities of NBS
T 20%10%5%1%20%10%5%1%
50 0.1734 0.0869 0.0450 0.0113 0.0939 0.0547 0.0346 0.0175
100 0.1771 0.0922 0.0484 0.0111 0.1258 0.0637 0.0391 0.0183
150 0.1845 0.0937 0.0495 0.0131 0.1456 0.0703 0.0449 0.0188
250 0.1889 0.0948 0.0498 0.0133 0.1583 0.0788 0.0460 0.0180

STAMP reports the following statistics under this option:

mean y maximum
standard deviation σy= (m2)½ skewness χ1 2 s
skewness √b1 kurtosis χ1 2 k
excess kurtosis b2-3 normality χ2 2 NBS
minimum normality χ2 2 NDH

Chapter 14 Statistical Treatment of Models

To understand fully the output of STAMP it is useful to understand the way in which the models are handled statistically. The algorithms used to carry out the computations are based on the state space form. The interested reader may find explanations of the underlying ideas and proofs of the algorithms in Anderson and Moore (1979), Harvey (1989, Chs. 3 and 4), de Jong (1991) and Koopman (1993). This chapter simply sets out the algorithms and gives the exact technical details of implementation. We first formally define the models which were used in Chapters ??--??.

14.1 Model definitions

14.1.1 Univariate time series models

A univariate model may be written as

ytttttt,  εt~NID(0,σε 2),  t=1,...,T,
(eq:14.1)

where μt is the trend, γt is the seasonal, ψt is the cycle, νt is a first-order autoregressive component and εt is the irregular. The model can be extended with two similar cycle components; see below.

The stochastic trend component is specified as

μt = μt-1t-1t, ηt~NID(0,ση 2),
βt = βt-1t, ζt~NID(0,σζ 2),

where βt is the slope or gradient of the trend μt. The irregular εt, the level disturbance ηt and the slope disturbance ζt are mutually uncorrelated. The slope component βt can be excluded from the trend specification when this is appropriate. Some special trend specifications are listed in Table Table:14.1.

Table:14.1 Some special level and trend specifications

Level σε ση
constant term * 0
local level (LL) * *
random walk (RW) 0 *
Trend σε ση σζ
deterministic * 0 0
LL with fixed slope * * 0
RW with fixed drift 0 * 0
local linear (LLT) * * *
smooth trend * 0 *
second differencing 0 0 *
Hodrick-Prescott * 0 .025 σε
10.5cm Asterix * indicates any positive value.

The seasonal component may have the dummy variable form or the trigonometric form. The number of seasonal frequencies in a period (e.g. a year) is given by integer s. When s is even, [s/2]=s/2, and when s is odd, [s/2]=(s-1)/2. The seasonal dummy is given by

γt=-γt-1-...-γt-s+1t,  ωt~NID(0,σω 2).

The trigonometric seasonal form is

γt=∑j=1[s/2]γj,t

where each γj,t is generated by

[
γj,t
γj,t*
] =[
cos λj sin λj
- sin λj cos λj
] [
γj,t-1
γj,t-1*
] +[
ωj,t
ωj,t*
] ,  
j=1,...,[s/2],
t=1,...,T,

where λj=2πj/s is the frequency, in radians, and the seasonal disturbances ωt and ωt* are two mutually uncorrelated NID disturbances with zero mean and common variance σω 2. For s even, the component at j=s/2 collapses to

γj,tj,t-1 cos λjj,t.

The statistical specification of a cycle, ψt, is given by

[
ψt
ψt*
] =ρψ [
cos λc sin λc
- sin λc cos λc
] [
ψt-1
ψt-1*
] +[
κt
κt*
] ,  t=1,...,T,

where ρψ , in the range 0<ρψ ≤1, is a damping factor; λc is the frequency, in radians, in the range 0 ≤λc ≤π; κt and κt* are two mutually uncorrelated NID disturbances with zero mean and common variance σκ 2. Note that the period of the cycle is equal to 2π/λc. There may be two additional cycles of this form incorporated in the model.

A first-order autoregressive, AR(1), process is given by

νtν νt-1t,  ξt~NID(0,σξ 2),

with ρν in the range 0<ρν <1. The AR(1) component is actually a limiting case of the stochastic cycle when λc is equal to 0 or π, though it is specified separately in STAMP, partly to avoid confounding and partly because it is not a limiting case in multivariate models.

Finally, the disturbances driving each of the components in the model are mutually uncorrelated.

14.1.2 Explanatory variables and interventions

A single equation model may include exogenous explanatory variables, lagged values of the dependent variable and intervention variables, as well as unobserved components such as trend, seasonal and cycle. Thus (eq:14.1) can be extended as

ytttt+rt+∑τ=1pφτ yt-τ+∑i=1kτ=0qΔxi,t-τ+∑j=1hλjwj,tt
(eq:14.2)

where xit is an exogenous variable, wjt is an intervention (dummy) variable and φτ and λj are unknown parameters.

14.1.3 Multivariate time series models

Multivariate models have a similar form to univariate models, except that yt is now an N×1 vector of observations, which depends on unobserved components which are also vectors. Thus in the special case of a multivariate local level model

yt = μt+εt, εt~NID(0,Σ ε ),
μt = μt-1+ηt, ηt~NID(0,Σ η ),
(eq:14.3)

where Σε and Ση are both N×N variance matrices, and ηt and εt are mutually uncorrelated in all time periods. The other disturbances in the general model (eq:14.1) similarly become vectors which have N×N variance matrices. Models of this kind are called seemingly unrelated time series equations (SUTSE).

In a homogeneous model the variance matrices are proportional. In the case of a multivariate local level model this implies Σ η =qΣ ε , where q is a non-negative scalar.

The other components are incorporated in a multivariate model in a similar manner by letting the components and disturbances become N×1 vectors and letting the variances become N×N variance matrices. In the case of the cycle, the parameters ρ and λ are the same for all series. As regards the disturbances,

E( κtκt') =E( κt*κt*') =Σκ and E( κtκt*') =0.
(eq:14.4)

The specification of the cycle model can also be written as

[
ψt
ψt*
] ={ρψ [
cos λc sin λc
- sin λc cos λc
] ⊗IN}[
ψt-1
ψt-1*
] +[
κt
κt*
] ,
(eq:14.5)

where

Var[
κt
κt*
] =I2Σκ ,   ψt and ψt-1*are N ×1 vectors.

A stationary first-order vector autoregressive may be included in a multivariate model as an alternative to, or even as well as, a cycle. Thus

νt=Φνt-1+ξt,   and  Var(ξt)=Σξ ,
(eq:14.6)

where Φ is a N×N matrix of coefficients. The condition for νt to be stationary is that the roots of the matrix polynomial I-ΦL should all lie outside the unit circle.

14.1.4 Common factors

In a common factor model, some or all of the components are driven by disturbance vectors with less than N elements. In terms of a SUTSE model, the presence of common factors means that the variance matrices of the relevant disturbances are less than full rank.

Common factors may appear in all components, including the irregular. Common trends arise through common levels, common slopes or a combination of the two.

14.1.4.1 Common levels

Consider the local level model, (eq:14.3), but suppose that the rank of Ση is K<N. The model then contains K common levels or common trends and may be written as

yt = Θμt+μθ +εt, εt~NID(0,Σε ),
μt = μt-1+ηt, ηt~NID(0,Dη ),
(eq:14.7)

where ηt is a K×1 vector, Θ is a (N×K) matrix of standardised factor loadings, Dη is a diagonal matrix and μθ is a (N×1) vector in which the first N-K elements are zeros and the last K elements are contained in a vector {\bf \mu } . The standardised factor loading matrix contains ones in the `diagonal' positions, i.e. θii=1,i=1,...,K, while θij=0 for j>i.

The model may be recast in the original SUTSE form (eq:14.3) by writing μt=Θμtθ and noting that Ση =ΘDη Θ' is a singular matrix of rank K. When there are no common trends so N=K, the factor loading matrix is the Cholesky decomposition of Ση . As regards {\bf \mu }, partition Θ=(Θ1'2')' where Θ1 consists of the first K rows, and partition μt similarly. Then

\mu =-Θ2Θ1-1μ1t+μ2t,  t=1,...,T.
(eq:14.8)

The vector {\bf \mu } is estimated from the above set of equations using the estimated states at t=T. The computations are carried out by back-substitution since the load matrix is lower triangular.

14.1.4.2 Smooth trends with common slopes

Common slopes may be formulated along similar lines. We first consider the case where the variance matrix of the slope disturbances is of rank Kβ but the variance matrix of levels is null so that the estimated trends are relatively smooth. The model is therefore

yt = μt+εt, εt~NID(0,Σε ),
μt = μt-1+βt-1,
βt = βt-1+ζt, ζt~NID(0,Σζ ),

but, following (eq:14.7) above , it may be re-formulated as

yt = Θμt+μθt+εt, εt~NID(0,Σε ),
μt = μt-1+βt-1,
βt = βt-1+t, ζt~NID(0,Dζ ),

where Σζ =ΘDΘ' and the first Kβ elements in μθt are zeros and the remainder are contained in a vector {\bf \mu }+{\bf \beta }% t. The (N-K)×1 vectors, {\bf \mu } and % {\bf \beta }, may both be calculated by expressions like (eq:14.8).

14.1.4.3 Common trends: level and slopes

A general multivariate local linear trend model in which the level variance matrix is of rank Kη while the slope variance matrix is of rank Kβ may be written

yt = μt+εt, εt~NID(0,Σε ),
μt = μt-1+Θβ βt-1+βθ +ηt, ηt~NID(0,Ση ),
βt = βt-1+ζt, ζt~NID(0,Dζ ),

where the N×Kβ matrix Θβ is such that Σζ β Dζ Θβ ' ,and βθ =(0',{\bf % \beta }')' with {\bf \beta } a vector of length N-Kβ .

If Kβ =1, setting Θβ equal to a vector of ones and letting {\bf \beta }=0 would imply that all series had the same underlying growth rate (when modelling in logs). This might be plausible even if there are no common levels. The implication is that the trends in the forecast function remain parallel, in other words the long-run growth paths are the same. However, unless there are similar restrictions on the levels, the growth paths within the sample will not necessarily have kept together.

The following interpretation can be made if Kβ ≤Kμ and the common stochastic slopes only affect the observations via the common stochastic trends:

yt = Θμ μt+μθt+εt, εt~NID(0,Σε ),
μt = μt-1+Θ-1Θμββt-1+ηt, ηt~NID(0,Dη ),
βt = βt-1+ζt, ζt~NID(0,Dζ ),

where Θ is the first Kμ rows of Θμ ,Θμβ is the first Kμ rows of Θβ ,μθt=(0',β1't,{\bf \mu }'+β2't)' with {\bf \beta }1 is a vector of length Kμ -Kβ , {\bf \beta }2 is a vector of length N-Kμ and % {\bf \beta }=({\bf \beta }1', {\bf \beta }2')' . The fixed components satisfy, for any 1 ≤t ≤T,

\mu =-ΘΘ-1μ1t2t-{\bf \beta }2t   and   \beta =-ΘΘ-1β1t2t

where Θ is the first Kβ rows of Θβ and Θ is the last N-Kμ rows.

At present STAMP does not offer the option of making the restrictions implied by the above formulation. However, if Kμ =Kβ =1, it must be the case that the above restriction implies Θ=Θ. More generally for Kμ =Kβ we have

μt=μt-1+Θ-1Θβt-1+ηt,  ηt~NID(0,Dη )

and setting Θ=Θ implies the further restriction that the slopes in the common trends are mutually independent. Thus the common trends are fully independent in both level and slope disturbances. However, for this to be true for all rotations the further condition that Dζ =qζ Dη is needed. This means that Σζ =qζ Ση so the model is what we call trend homogeneous .

To summarise trend homogeneity means that Kμ =Kβ and the slope and level disturbances enter through the same common trends, and when there is more than one common trend these trends are mutually independent. When the trend homogeneous constraint is applied, any restrictions made to Θ will automatically be carried over to Θ.

14.1.4.4 Common seasonals

Common factors in seasonality implies a reduction in the number of disturbances driving changes in the seasonal patterns. It does not imply any similarity in seasonal patterns unless the deterministic seasonal components outside the common seasonals are set to zero. Thus suppose, for simplicity, that the series contain only seasonals and irregular. Then

yt=Θγt+γθt+εt,

where the last N-K elements of γθt contain fixed seasonal effects.

14.1.4.5 Common cycles

A model with common trends and common cycles could be written

yt=Θμ μt+μθ +Θψ ψt+εt,

where Θψ is N×Kψ . Since the expectation of a cycle is zero, it is not necessary to include a vector of constant terms corresponding to the vector μθ which is needed for common levels.

As with other components, the common cycle can be re-formulated as a SUTSE model in which the disturbance variance matrices are Σκ = Θψ Dκ Θψ '. This is only possible because ρψ and λ are the same in all the common cycles.

14.1.5 Explanatory variables

Explanatory variables and interventions may be included in multivariate models. Thus (eq:14.2) generalises to

yt=μt+γt+ψt+rt+∑τ=1rΦτ yt-τ+∑τ=0sδτ xt-τ+Λwt+εt,  t=1,...,T,

where xt is a K×1 vector of explanatory variables and wt is a K*×1 vector of interventions. Elements in the parameters matrices, Φ, δτ and Λ may be specified to be zero, thereby excluding certain variables from particular equations. In addition, the unobserved components may be subject to common factor restrictions as described in the previous subsection.

14.2 State space form

The statistical treatment of structural time series models is based on the state space form (SSF). We use a particular SSF which is close to the `state space model' of de Jong (1991). It joins a measurement equation

yt=Ztαt+Xtb+Gtut,  t=1,...,T,
(eq:14.9)

with a transition equation, which allows the states αt to evolve according to a first-order vector autoregressive process. We write this as

αt+1=Ttαt+Wtb+Htut.
(eq:14.10)

The transition equation is initialised at the first time point by setting

α1=W0b+H0u0.
(eq:14.11)

The model is completed by making assumptions about the error process ut and the vector of regressors b. We assume they are independently and normally distributed with

ut~NID(02I),  b=c+  and  δ~N(μ2Λ).
(eq:14.12)

This formulation of the SSF is slightly different from that exploited in books such as Harvey (1989) and West and Harrison (1989). It is used to produce rather more elegant treatments of diffuse initial conditions and fixed effects; see de Jong (1991).

The state system matrices Zt and Tt are fixed matrices which merely contain known values. The regression system matrices Xt and Wt are always known. The error system matrices Gt and Ht are also sparse but most non-zero values are unknown and are regarded as "hyper" parameters. The disturbance ut is transformed into noise for the SSF by the error system matrices Gt and Ht. All system matrices can be regarded as selection type matrices.

At first sight, the role for b in the SSF looks quite complicated. However, the SSF is written in this way to allow a unified treatment of a variety of features for time series models. There are three particularly important ones on which to focus. First, the measurement equation allows b to influence the observations directly through the regressors Xt. In general Xt, like Gt and Ht, may be a sparse selection type matrix. Second, the states are directly affected by b through Wt. This feature is particularly useful for handling interventions like structural changes (changes in trend) and growth changes (changes in slope). Third, the prior distribution of the initial state vector is (partly) defined via b.

The random vector b allows the analysis of models where diffuse distributions are placed on the regressors and on the non-stationary components of the initial state. The vector b is a linear combination of the known constant vector c and the random vector δ. The distribution of δ is regarded as diffuse, i.e. the variance matrix Λ converges to infinity. Basically, this means that a proper prior distribution for the state cannot be given when components are non-stationary. For a technical discussion on these issues, we refer to de Jong and Chu-Chun-Lin (1994). The connection between diffuse initial conditions and marginal or restricted likelihoods is discussed in Tunnicliffe-Wilson (1989) and Shephard (1993b). The unconditional distributions for stationary components, such as cycles, are handled via the second part of (eq:14.11); that is, H0u0.

The generality of the SSF can be simplified for structural time series models in three different ways.

  1. The constant vector c of the specification for b can be set to a zero vector such that b=. The matrix B is a square selection matrix of zeros and ones. The vector δ can be partitioned such that the regression effects and the initial effects are separated and that δ enter the equations of the SSF directly. This is achieved by

    δ=(
    δx
    δi
    ) ,  and   B=(Bx,Bi),

    where the (k×1) vector δx contains the regression effects, the (d×1) vector δi contains the initial effects and the dimensions of the matrices B, Bx and Bi are (k+d)×(k+d), (k+d)×k and (k+d)×d, respectively. This brings the specification for b to

    b=Bxδx+Biδi

    and

    Xtb = Xt*δx where Xt* = XtBx and XtBi = 0,
    Wtb = Wt*δx where Wt* = WtBx and WtBi = 0,
    W0b = W0*δi where W0* = W0Bi and W0Bx = 0.

  2. For all models in STAMP, Gt and Ht will be orthogonal; that is, Gt'Ht=0, so the noise terms in the two equations, Gtut and Htut, are independently distributed. This feature of a model simplifies the statistical analysis of the SSF.

  3. All models in STAMP can be handled by a time-invariant SSF. This implies that the t subscripts of the state and error system matrices can be dropped, i.e. Zt=Z, Tt=T, Gt=G and Ht=H. A time-invariant SSF also simplifies the statistical analysis of the SSF.

14.2.1 Structural time series models in SSF

Structural time series models fit nicely in the state space form. However, some care should be taken with the initial state vector specification.

14.2.1.1 Univariate models in SSF

Some examples are given to get a flavour of it.

14.2.1.2 Multivariate model in SSF

Any multivariate model in STAMP with common factors can be written in the SUTSE form with lower rank variance matrices; see §14.1.4. The SUTSE form retains the sparse structure of the system matrices Z and T which is important for computational issues related to the estimation of these models. Therefore, this general formulation is adopted to specify a multivariate model in the SSF. Since the SUTSE form is a minor generalisation to the univariate model, only one example will be given. The multivariate model is a smooth trend plus cycle model which can be represented by the SSF as

yt = (
I 0 I 0
) αt+(
Γε 0 0 0
) ut,
αt = (
I I 0 0
0 I 0 0
0 0 ρ cos λcI ρ sin λc I
0 0 -ρ sin λcI ρ cos λcI
) αt-1+(
0 0 0 0
0 Γζ 0 0
0 0 Γκ 0
0 0 0 Γκ
) ut

where state vector αt=( μt',βt',ψt',ψt*') , disturbance vector ut=(εt,ζt,κt,κt*)' and each variance matrix is decomposed as Σ=ΓΓ' with the appropriate subscript. The unknown values in transition matrix T rely on the damping factor ρ and the frequency λc. The initial state is given by

α1 = (
I 0
0 I
0 0
0 0
) δi+ 1/(1-ρ2)½ (
0 0
0 0
Γκ 0
0 Γκ
) u0

with vector u0=(κ0,κ0*)' and δi contains the initial effects for μt and βt.

14.2.1.3 Deterministic trend and seasonal components

The deterministic seasonal component is the special case of the stochastic seasonal model with σω =0. Alternatively, the fixed seasonal can be incorporated within Xt as it is usually done in regression models. The latter one is adopted in STAMP. In a similar way (partly) deterministic trends can be incorporated in Xt or Wt. However, for practical reasons such as the implementation of trend interventions within Wt, the trend components μt and βt are always part of the state vector. The inclusion of a fixed trend in the state vector has some consequences for calculation and interpretation of residuals and the prediction error variance.

14.3 Kalman filter

The Kalman filter (KF) plays the same role for time series models in SSF as least squares computations for a regression model. The KF is primarily a set of vector and matrix recursions. The importance of the KF is based on

  1. computation of one-step ahead predictions of observation and state vectors, and the corresponding mean square errors;

  2. diagnostic checking by means of one-step ahead prediction errors;

  3. computation of the likelihood function via the one-step ahead prediction error decomposition;

  4. smoothing which uses the output of the KF.

The KF has a variety of forms, but the one exploited in STAMP computes

at|t-1 = E( αt|Yt-1,δ=0),
σ2Pt|t-1 = E{(αt-at|t-1)(αt-at|t-1)'|Yt-1,δ=0},
  

which are referred to as the mean and mean square error (MSE) of the state, respectively, given the past information and setting δ to zero. Generally, we need to perform estimation with δ not zero, so these terms will be corrected by the use of an augmented Kalman filter. This will be discussed in the next subsection.

The recursive equations of the Kalman filter are given by

vt = yt-Xtb-Zat|t-1, Ft = ZPt|t-1Z'+GG',
qt = qt-1+vt'Ft-1vt, Kt = TPt|t-1Z'Ft-1,
at+1|t = Tat|t-1+Wtb+Ktvt,
Pt+1|t = TPt|t-1T'-KtFtKt'+HH',
(eq:14.13)

starting with a1|0=W0b, P1|0=H0H0' and q0=0. Here Kt is called the Kalman gain, while vt and σ2Ft are the one-step ahead prediction error (or innovation) and its mean square error, respectively. The scaled innovations Ft- 1/2 vt (or generalised least squares residuals) are approximately NID with zero mean and scale identity matrix as its variance matrix in a correctly specified model. Note that the estimate of σ2 is given by σ̂2=qT/NT. The proof of the KF is simple and it relies on standard results of linear estimation. Although the KF is presented in terms of time-invariant system and error matrices (Z, T, G and H), the time indices can be added to these matrices without any problem.

In the special case of a univariate model, N=1, the innovation vector and the corresponding mean square error matrix reduce to the scalars vt and σ2ft, respectively. Also, the Kalman gain matrix reduces to the vector kt. This notation is only used for univariate models.

The KF requires a positive (for univariate models) or a non-singular positive definite matrix (for multivariate models) Ft which may not occur at the start of the KF when, for example, the column rank of G1 is less N and P1|0=H0H0'=0. An example is when no irregular is present in a structural time series model. In these circumstances, the KF is warmed up by a number of prediction updates; that is, aj+1|0=Taj|0 and Pj+1|0=TPj|0T'+HH'. The initialisation process continues until Fj=ZPjZ'+GG' is positive definite. The KF is then re-initialised by a1|0=ai|0, P1|0=Pi|0 and q0=0 where i is the number of required prediction updates.

When Ft becomes non-positive definite during the KF updating, it is generally assumed that a numerical problem has occurred. Available remedies like a square-root version of the KF are not considered. However, the implementation of the KF in STAMP ensures for each time-point t that the MSE matrices Pt+1|t and Ft are symmetric, which is crucial in avoiding numerical problems. Also, avoiding unnecessary computations, such as adding zeros and multiplying zeros or unity values, helps in this respect.

In case of a time-invariant SSF, the KF may reach a steady state solution at some time s; that is, when Ps+1|s=Ps|s-1 for p<s≤T where p is the dimension of the state vector. This also implies that Fj and Kj are time-invariant for j=s,...,T. The KF reduces in these circumstances to the equations for at+1|t, vt and qt which are computationally very cheap. Finally, the steady state can be interpreted as the solution of the Riccati equation; see Anderson and Moore (1979).

14.3.1 The augmented Kalman filter

The KF evaluates the one-step ahead prediction of the state vector conditional on δ=0. When non-stationary components and/or fixed regression effects are present in the model, the unknown random vector δ is treated as diffuse; see [14.2]. In these circumstances an augmented Kalman filter (AKF) is applied:

Vt = - ZAt|t-1-XtB
At+1|t = TAt|t-1+WtB+KtVt
(st,St) = (st-1,St-1)+Vt'Ft-1(vt,Vt)
(eq:14.14)

for t=1,...,T and with A1|0=W0B. The AKF must be regarded as a supplement to the KF equations; the expressions for vt, Ft, qt, Kt, at+1|t and Pt+1|t remain as they are. The number of columns for Vt and At+1|t is k+d, the same number of columns as in matrix B of the SSF. The one-step ahead prediction of the state vector and the corresponding MSE matrix are given by

E(αt|Yt-1) = ât|t-1 = at|t-1+At|t-1St-1-1st-1,
MSE(αt|Yt-1) = σ2P̂t|t-1 = σ2(Pt|t-1+At|t-1St-1-1At|t-1'),
(eq:14.15)

respectively. The one-step ahead prediction error and the corresponding MSE matrix (or variance matrix) are given by

yt-E(yt|Yt-1) = v̂t = vt+VtSt-1-1st-1
MSE(yt|Yt-1) = σ2F̂t = σ2(Ft+VtSt-1-1Vt'),
(eq:14.16)

respectively. The matrix inversions for St, t=1,...,T, can be evaluated recursively using similar methods as recursive regressions; see Appendix 14.7 of this chapter. The estimate of the scalar variance σ2 is

σ̂2=
1

NT-d-k
T  where  q̂T=qT-sT'ST-1sT.
(eq:14.17)

The full sample generalised least squares estimate of δ and its MSE matrix are given by

δ̂=ST-1sT  and  MSE(δ̂)=σ2ST-1
(eq:14.18)

respectively. The MSE matrix is used to obtain standard errors and t-statistics for δ̂. In a similar way as vector δ, δ̂ is partitioned as

δ̂=(
δ̂x
δ̂i
) .

For a proof and a full exposition of the augmented Kalman filter; see de Jong (1991). When the SSF does not contain values for Xt and Wt, the KF can evaluate ât+1|t from t=d+1 onwards. This implies that the AKF is only used for the first d updates. Then, the AKF is collapsed to the KF, mainly using equation (eq:14.15) for t=d. On the other hand, when Xt and Wt are present in the SSF, a partial collapse with respect to δi is possible. The exact details of the full and the partial collapse are given in Appendix 14.7 of this chapter. Formal proofs on collapsing are given by de Jong and Chu-Chun-Lin (1994).

14.3.2 The likelihood function

The likelihood function can be obtained from the KF by using the innovations and their mean square errors. This is known as the prediction error decomposition. For example, the likelihood function, conditional on δ=0, is defined as

log L(y|θ*,δ=0)=- NT/2 log σ2- 1/2 ∑t=1T log |Ft|-
qT

2
(eq:14.19)

apart from some constant. To limit the number of logarithmic operations in computing the likelihood, it is important to check whether a steady-state has been reached such that log |Ft| is constant for all t.

The unknown parameters in the system matrices of the SSF are stacked in vector θ. For a given vector θ=θ*, the KF calculates the likelihood and, therefore, it provides means to obtain maximum likelihood estimates of the parameters. These matters are discussed in §14.6.

14.3.2.1 Univariate models: the concentrated likelihood

The diffuse likelihood function (DL) is defined as the likelihood when δ is supposed to be diffuse; see §14.2. For a given vector θ=θ*, the DL is given by

log L(y|θ*)=-
T*

2
log σ2- 1/2 ∑t=1T log ft- 1/2 log |ST|-
1

2
T
(eq:14.20)

with T*=T-k-d. When a (partial) collapse of the AKF has taken place at some time t=m, the DL is calculated by

log L(y|θ*)=-
T*

2
log σ2- 1/2 ∑t=1T log ft- 1/2 log |S*|-
1

2
T.
(eq:14.21)

Appendix 14.7 of this chapter defines S* and it gives more details of likelihood calculation when a full or a partial collapse has taken place. The scalar variance σ2 can be concentrated out of the likelihood DL. This concentrated diffuse likelihood function (CDL) is given by

log Lc(y|θ*)=-
T*

2
log σ̂2- 1/2 ∑t=1T log ft- 1/2 log |S*|.
(eq:14.22)

To calculate the CDL, one variance corresponding to a particular unobserved component is set to a unity value such that this variance is set equal to σ2 of (eq:14.12). The maximum likelihood estimate of this `concentrated out' variance is σ̂2. The other variances of the components in the model are now measured as a ratio to the concentrated variance. For example, in a local level model, where σε 2 is concentrated out, we have ση 22qη . These variance ratios are referred to as q-ratios. Parameter estimation is carried out with respect to the q-ratios; see §14.6. The CDL is used for parameter estimation of univariate models in STAMP. For a full discussion on diffuse likelihoods, we refer to de Jong (1988).

14.3.2.2 Homogeneous models

The homogeneous model is introduced in §14.1 and its restriction for a multivariate local level model reduces to

Ση =qη Σε ,
(eq:14.23)

where the q-ratio qη plays the same role as set out in the previous subsection for univariate models. The homogeneity restrictions imposed on multivariate models allow the decoupling of the KF; see Fernandez (1990). The N different time series in yt are subjected to the same KF which corresponds to a univariate SSF where the matrices G and H contain the q-ratio values of the homogeneous model. So, the covariance equations of the (univariate) KF, i.e. ft, kt and Pt+1|t, are the same for all N time series. The set of N innovation series, stacked in vt, are used to construct the diffuse likelihood. The full variance matrix of a particular component, say the irregular εt; that is, Σ=Σε , can be concentrated out of the likelihood function leading to

log Lc(y|θ*)=-
T*

2
log |{\bf % \Sigma }|- N/2 ∑t=1T log ft- N/2 log |ST|
(eq:14.24)

where

{\bf \Sigma }=
1

T*
t=k+d+1T
vtvt'

ft
.

The univariate AKF is used for homogeneous models and, if possible, a (partial) collapse is made at some time t=m. Therefore, the same technical details apply as mentioned in the previous subsection which are set out in Appendix 14.7 of this chapter.

14.3.2.3 Multivariate models

The diffuse likelihood is used for parameter estimation of multivariate models. Thus, no parameter is concentrated out from the likelihood. The AKF is used for calculation but it can be (partially) collapsed at some time t=m. The multivariate diffuse likelihood is defined as

log L(y|θ*)=- 1/2 ∑t=1T log |Ft|- 1/2 log |S*|-
T

2
,
(eq:14.25)

where S* is defined in Appendix 14.7 of this chapter.

14.3.3 Prediction error variance

The prediction error variance (PEV) of a univariate structural time series model is the variance of the one-step ahead prediction errors in the steady state. As discussed at the start of this section, the steady state may require more updates of the KF than the number of available observations. The formal definition of the estimated PEV is

PEV=σ̂2f,

where f is the steady state value of ft; that is, f= lim t→∞ft. For a multivariate model, the prediction error variance matrix can be evaluated. When a steady state is not obtained, it may be useful to report a `finite' PEV; that is,

PEVn=σ̂2n,

where n is some large finite number. The `finite' PEV at the end of the sample; that is, at n=T, might also be of interest.

14.3.4 The final state and regression estimates

The final state vector âT|T contains, together with δ̂x, all the information from the estimated SSF required to forecast future observations. Therefore, the final state is of special interest. The estimate of the state vector and its MSE matrix at the end of the sample are given by

âT|T = âT|T-1+P̂T|T-1ZT'F̂T-1v̂T,
σ̂2P̂T|T = σ̂2(P̂T|T-1-P̂T|T-1ZT'F̂T-1ZTP̂T|T-1),

respectively. The diagonal part of the MSE matrix is used to form standard errors and t-statistics of the final state âT|T. Under certain circumstances, some elements of this diagonal part may be equal to zero and calculation of the t-statistic is not possible then. Also, for stationary parts of the state vector, it is not useful to calculate the t-statistic since stationary components are not persistent throughout the series.

The regression estimates are obtained from

δ̂x={\bf S}T-1{\bf s}T   and   MSE(δ̂x)=σ̂2{\bf S}T-1,

where {\bf S}T and {\bf s}T are defined in Appendix 14.7 of this chapter. The diagonal part of the MSE matrix is used to form standard errors and t-statistics.

14.3.5 Filtered components

The filtered components are based on the one-step ahead predictions of the state vector; see equation (eq:14.15). When k=0, the AKF is applied until t=d. Then, the AKF collapses to the KF which evaluates the filtered states directly. However, when k>0, a partial collapse takes place and filtering is based on (eq:14.15) with matrix % {\bf S}T-1 calculated recursively; see Appendix 14.7 of this chapter. Note that the filtered states are only defined for t=d+k+1,...,T.

14.3.6 Residuals

The residuals in STAMP for univariate models are defined as the standardised innovations

vt=v̂t/σ̂f̂t 1/2 ,  t=d+1,...,T.
(eq:14.26)

where v̂t and σ̂ 2t are defined as in (eq:14.16). When k=0, the residuals are obtained from the KF directly after the collapse of the AKF at t=d. In a multivariate model the residuals of the j-th equation are given by

vj,t=v̂j,t/σ̂F̂jj,t 1/2

for t=d+1,...,T where ai,t is the i-th element of vector at and Aij,t is the (i,j) element of matrix At.

When a SSF contains fixed effects; that is, when k>0, two sets of residuals are distinguished inside STAMP. These residuals are defined in the next two subsections. Note that the two sets are the same when k=0.

14.3.6.1 Generalised least squares residuals

The generalised least squares (GLS) residuals are standardised innovations computed using (eq:14.26), t=d+1,...,T, but where v̂t and f̂t are obtained from the AKF (eq:14.15) applied to a SSF with bx being replaced by Bxδ̂x. The calculation of these residuals requires two steps. Firstly, the estimates δ̂x and σ̂ are obtained from the AKF with d+k columns for At+1|t. Then, a second AKF is applied, with d columns for At+1|t, from which v̂t and f̂t are obtained for t=d+1,...,T. Of course, the first AKF may be partially collapsed after t=d and the second AKF may be fully collapsed to a KF after t=d. The GLS residuals are not independently distributed with constant variance in small samples. Tests for normality, heteroskedasticity and serial correlation are applied to the GLS residuals.

14.3.6.2 Generalised recursive residuals

The generalised recursive residuals are the standardised residuals (eq:14.26), but now obtained from an AKF, with d+k columns for At+1|t, accompanied with a set of recursive regressions for the inversion calculations of St; see Appendix 14.7 of this chapter. A partial collapse of this AKF may take place at t=d. These residuals are denoted by wt, for t=d+k+1,...,T, and they are commonly used in predictive tests and diagnostics.

14.4 Disturbance smoother

Smoothing refers to the estimation of the state vector αt and the disturbance vector ut using information in the whole sample rather than just past data. Smoothing is an important feature because it is the basis for

  1. signal extraction, detrending and seasonal adjustment;

  2. diagnostic checking for detecting and distinguishing between outliers and structural changes using auxiliary residuals;

  3. EM algorithm for initial estimation of parameters;

  4. calculation of the score, defined as the first derivative of the likelihood with respect to the vector of parameters.

We consider the disturbance smoother (DS) of Koopman (1993) which directly estimates the disturbance vector of the SSF, ut, and the MSE matrix; that is,

E(ut|YT,δ=0) = ũt    and   MSE(ũt) = σ2 Ct,

respectively, with t=1,...,T. Note that in the context of structural time series models, the interest in ut is limited to elements of Gut and Hut and their corresponding MSE matrices.

The DS is based on the same backwards recursions as found by Bryson and Ho (1969), de Jong (1989) and Kohn and Ansley (1989) which are given by

et = Ft-1vt-Kt'rt, rt-1 = Z'et+T'rt,
Dt = Ft-1+Kt'NtKt, Nt-1 = Zt'Ft-1Zt+Lt'NtLt,

where Lt= T-KtZ and t=T,...,1. These backwards recursions are initialised with rT=0 and NT=0. The storage requirements for the KF are limited to the quantities vt, Ft and Kt for t=1,...,T. The smooth estimates for the disturbances of the measurement equation and the MSE matrices are

Gũt=GG'et   and   σ2GCtG'=σ2(GG'-GG'DtGG'),
(eq:14.27)

respectively. The smooth estimates for the disturbances of the transition equation and the mean square errors are

Hũt=HH'rt   and   σ2HCtH'=σ2(HH'-HH'NtHH'),
(eq:14.28)

respectively. The proof of the disturbance smoother is given by Koopman (1993). The DS is the most computationally efficient smoother available yet.

14.4.1 The augmented disturbance smoother

In a similar way as the KF, the DS must be augmented to allow for initial effects with diffuse distributions and fixed components. The supplement recursions for the DS are

Et=Ft-1Vt-Kt'Rt   and   Rt-1=Z'Et+T'Rt,

and they will be referred to as the augmented disturbance smoother (ADS). The number of columns for Et and Rt is equal to d+k. When the AKF is not collapsed at all, the full sample estimates of the disturbance vectors E(Gut|YT), and the corresponding MSE matrices MSE(Gut|YT), are given by (eq:14.27) with et and Dt replaced by

êt=et+Etδ̃   and   D̂t=Dt-EtS-1Et',
(eq:14.29)

respectively, for t=1,...,T. In a similar way, E(Hut|YT) and MSE(Hut|YT) are given by (eq:14.28) with rt and Nt replaced by

r̂t=rt+Rtδ̃   and   N̂t=Nt-RtS-1Rt',
(eq:14.30)

respectively. for the transition disturbances with Note that δ̃=ST-1sT and S=ST. For a proof of the ADS, the reader is referred to Koopman (1993). When the AKF is fully collapsed to the KF at t=m, the DS can be used for t=T,...,m+1. The ADS should be applied for t=m,...,1 but when a collapse has taken place, the matrices ST and Rm are not available. Therefore, smoothing is not straightforward when a collapse has occurred during the AKF. However, a suggestion of Chu-Chun-Lin and de Jong (1993) can be used to recover these missing matrices.

14.4.2 The EM algorithm and exact score

The EM algorithm is a recursive method to obtain maximum likelihood estimates for unknown elements in the state and error system matrices of the SSF; see Shumway and Stoffer (1982) and Watson and Engle (1983). This method requires a huge amount of computations and is not easy to implement. Koopman (1993) proposes a simple and computationally efficient EM algorithm for unknown values inside the variance matrices ΩG=GG' and ΩH=HH' where the error system matrices are supposed to be time-invariant. Let θ be the stack of these unknown values. Given a set of values for θ, say θ*, the EM step calculates new variance matrices via

ΩG)=ΩG*)+ΩG*)ΛeΩG*),
ΩH)=ΩH*)+ΩH*)ΛrΩH*),

where

Λe=∑t=1Têtêt'-D̂t   and   Λr=∑t=1Tr̂tr̂t'-N̂t.

The quantities êt and D̂t are defined in (eq:14.29) and the quantities r̂t and N̂t are defined in (eq:14.30). The log-likelihood value of the SSF with these new variance matrices is always larger compared to the log-likelihood value of the SSF with the old variance matrices. This EM algorithm only requires one pass through the KF and one pass back through the disturbance smoother. Note that the EM maximises the diffuse log-likelihood (eq:14.21). For a proof and a discussion of this EM algorithm, the reader is referred to Koopman (1993).

The score vector is defined as the first derivative of the log-likelihood with respect to the vector of parameters θ. The vector of first derivatives is usually evaluated numerically. Koopman and Shephard (1992) present a method to calculate the exact score for any parameter in the state and error system matrices. However, the computational gain is only achieved for exact score calculations with respect to parameters in the error system matrices Gt=G and Ht=H. Again, let θ be the stack of unknown values in the variance matrices of the SSF and let vector θ* contain specific values for θ, then the exact score is given by

q( θ*) =
∂ log L(y|θ*)

∂θ
= 1/2 tr{Λe
ΩG*)

∂θ
}+ 1/2 tr{Λr
ΩH*)

∂θ
},
(eq:14.31)

and the definitions of the EM algorithm apply here as well.

14.4.3 Smoothed components

The smoothed state vector is defined as the estimated state vector using the full data-set; that is,

E(αt|YT)=α̃t.
(eq:14.32)

The smoothed state vector forms the basis for detrending and seasonal adjustment procedures for the SSF. Graphical inspection of a sub-set or a linear combination of the smoothed state vector is an important feature in STAMP. The actual values for (eq:14.32) are obtained by

α̃t+1=Tα̃t+Wt*δ̃x+Hũt,    t=1,...,T,
(eq:14.33)

with α̃1=â1|0+P̂1|0(r0+R0δ̃). Hence, the smoothed state vector is calculated using the forward recursion (eq:14.33) after the DS is applied and the vector Hũt=HH'rt is stored for t=1,...,T. The MSE matrix of the smoothed state vector is not considered; see de Jong (1989) and Kohn and Ansley (1989). It is shown by Koopman (1993) that this method of state smoothing is computationally very fast. Also note that no extra storage space is required since the storage space of the Kalman gain matrix Kt can be used to store the vector Hũt for t=1,...,T.

14.4.4 Auxiliary residuals

The standardised smoothed estimates of the disturbances are referred to as auxiliary residuals. Graphs of these residuals, in conjunction with normality tests, are a valuable tool for detecting data irregularities such as outliers, level changes and slope changes. Note that the auxiliary residuals are serially correlated even when the model is correctly specified. Therefore, the normality test should be adjusted for the implied serial correlation. The technical details and some interesting applications can be found in Harvey and Koopman (1992).

The auxiliary residuals are obtained directly from (eq:14.27) and (eq:14.28). The residuals are standardised using the diagonal elements of the appropriate MSE matrix. The variance σ2 must be replaced by its estimate σ̂2.

14.5 Forecasting

An important feature of STAMP is its extensive forecast facility. The model-based extrapolations can be calculated for the actual series but also for any element of the state vector.

14.5.1 Forecasts of series and components

The forecast of the state vector l multi-steps ahead, for l=1,...,L, and its MSE matrix are given by

âT+l|T=E(αT+l|YT),    σ2P̂T+l|T=MSE(αT+l|YT),

respectively. The forecasts and MSE matrices are generated recursively by

âT+l|T = TâT+l-1|T+WT+l*δ̃x,
σ̂2P̂T+l|T = σ̂2(TP̂T+l-1|TT'+HH'),
(eq:14.34)

respectively. The initialisation is directly obtained from the AKF at time T . The forecast for yT+l and the corresponding MSE matrices are given by

ŷT+l|T = ZâT+l|T+XT+l*δ̃x,
σ̂2F̂T+l|T = σ̂2(ZP̂T+l|TZ'+GG'),
(eq:14.35)

respectively. Graphical inspection of the forecasted values can be valuable for checking model validity.

14.5.2 Extrapolative residuals

The extrapolative residuals can be calculated when future values of yt become available. They are defined as

v̂t+l=yT+l-ŷT+l|T,
(eq:14.36)

where ŷT+l|T is defined in (eq:14.35). The variance matrices of the extrapolative residuals are equal to the MSE matrices of ŷT+l|T; that is, σ̂2F̂T+l|T as in (eq:14.35). The extrapolative residuals can be standardised by using the corresponding diagonal elements of the variance matrix.

14.6 Parameter estimation

The model definitions are given in §14.1. It is shown in §14.2 how the models of STAMP can be put in SSF. The filtering and smoothing algorithms of §14.3 and §14.4, respectively, are associated with the SSF and can be applied conditional on known state and error system matrices. The unknown values inside these matrices are treated as parameters which need to be estimated. The main task of STAMP is to estimate these parameters using maximum likelihood estimation (MLE) methods. This subsection sets out the strategy for solving this problem.

14.6.1 The parameters of STAMP

The state and error system matrices of the SSF are time-invariant and contain many zero and unity values. Therefore, they can be regarded as selection matrices. For a univariate model, the unknown values in the error system matrices are the standard deviations of the unobserved components. The time-invariant transition matrix T of the SSF contains the unknown first-order autoregressive parameter of stationary components such as cycle and (optional) slope. In the case of a cycle, the state system matrix T also contains cosine and sine terms which rely on the unknown frequency λ. Table 14.2 overviews the set of possible parameters for a univariate model.

The variances are always positive, the cycle damping factors must be in the range 0<ρ<1, and the frequencies can only take values in the range 0<λ<π. Since the parameters are to be estimated via a numerical method, some special transformations are used to enforce these restrictions on the parameters; see Table 14.2. The set of transformed parameters, relevant for a specific model, are stacked in the vector θ.

As observed from Table 14.2, the variance of the cycle, σψ 2, is taken as the parameter instead of the variance of its disturbance σκ 2. In this case, σκ 2ψ 2(1-ρψ 2) 1/2 and σκ 2→0 as ρψ →1. This allows the estimation of deterministic, but stationary, cycles which is the extreme case of ρψ =1.

Table:14.2 Parameters and transformations

Variances Parameter Transformation
Irregular σε 2 exp (2θε)
Level ση 2 exp (2θη)
Slope σζ 2 exp (2θζ)
Seasonal σω 2 exp (2θω)
Cycle σψ 2 exp (2θψ)
AR(1) σξ 2 exp (2θξ)
Autoregressive Parameter Transformation
Slope ρβ β|(1 + θβ2)- 1/2
Cycle ρψ ρ| (1 + θρ2)- 1/2
AR(1) ρν θν (1 + θν2)- 1/2
Frequency Parameter Transformation
Cycle λ 2π/ 2 + exp (θλ)
10.5cm

14.6.1.1 Variance matrices in a multivariate model

The same parameters exist for a multivariate model as for a univariate model, except that the variances are replaced by variance matrices; see §14.1. A special case is the vector autoregressive component which is discussed in §14.5.2. Since all variance matrices are treated the same, we consider a N×N matrix Σ, i.e. a non-negative symmetric matrix but not necessarily full rank. The Cholesky decomposition of Σ enforces these restrictions. The variance matrix is decomposed as Σ=ΘDΘ' where N×p matrix Θ is a lower-triangular matrix with unity values on the leading diagonal, 1≤p≤N, and p×p matrix D is a non-negative diagonal matrix. The lower-triangular elements of Θ; that is, θij with i=2,...,N and j=1,...,min(i-1,p), may take any value, including zero values. The diagonal elements of D are transformed as

di= exp (2θii)

where di is the i-th diagonal element of D and i=1,...,p. Indeed, matrix Θ is the same factor loadings matrix as in the model specifications for common components; see §14.1.4. Since matrix Θ has p columns, this would suggest that variance matrix Σ has rank p. In most cases this will be true. However, as any diagonal element of D can be zero, the rank of Σ can be less than p. Note that when di=0, the i-th column of Θ does not affect variance matrix Σ and can be set to zero.

The variance matrices are placed into the SSF via the error system matrices. For example, the variance matrix of the irregular enters the SSF via G=(ΘD 1/2 ,0,...,0) where Θ and D correspond to the decomposition of the irregular variance matrix Σε and 0 denotes a zero matrix.

14.6.1.2 Vector autoregressive components

The multivariate model may include a vector autoregression of order 1. The parameters inside the matrix Φ appear in the SSF transition matrix T. All roots of the matrix polynomial I-ΦL must lie outside the unit circle. Within a numerical optimisation procedure, the constraints on Φ needed to keep the process stationary are imposed by means of an algorithm given by Ansley and Kohn (1986). In the special case of a VAR(1) model, the transformations are:

This method ensure a stationary VAR component. However, imposing restrictions on the VAR matrices may be difficult.

14.6.2 BFGS Estimation procedure*

The optimisation method is based on the BFGS quasi-Newton method and it aims to maximise the likelihood criterion l( θ) . At every call to this updating scheme, it delivers a new estimate for θ, denoted by θi+1, given a current estimate, θi, which brings the object function l( θ) closer to its maximum. The updating scheme can be written as

θi+1=θi+siδi
(eq:14.37)

where si is a step size scalar and δi is the search direction vector. Appendix 14.8 of this chapter gives some basic principles on numerical optimisation and in particular the BFGS updating scheme. Some practical details on initialisation, updating, linear search and termination, are given below.

More specific details of implementation for univariate models and multivariate models are given below.

14.6.3 Estimation of univariate models*

For univariate models, STAMP maximises the concentrated diffuse log-likelihood log Lc(y|θ) which is treated as an unconstrained non-linear function of θ, denoted by l(θ). The likelihood kernel l(θ) is appropriately scaled by a function of T, the number of observations.

14.6.3.1 Initial values

The elements of the vector of transformed parameters θ are given in Table 14.2. The elements related to standard deviations, θσ , are set equal to -0.5, except for θη (level) which initial value is -1 whereas θζ (slope) and θω (seasonal) are started off with -1.5 and -2, respectively.

One element of θσ needs to be concentrated out, which effectively means it must be restricted to zero (that is, a unity standard deviation). The choice of the concentrated parameter must be chosen appropriately. At the start of the optimisation routine, the element corresponding to the irregular standard deviation is concentrated out. When no irregular is present in the model we choose from level, slope, cycles, seasonal and autoregressive component, consecutively.

The initial values for ρψ and λ, corresponding to a particular cycle, are provided by the user and STAMP transforms them automatically to θ0. The initial values for θβ and θυ are set to 2 (which corresponds to ρ values of approximately 0.9).

Before the quasi-Newton optimisation is started, a simple method is constructed to get θσ to the neighborhood of the optimal solution in a stable but quickly way. The initialisation method is similar to the quasi-Newton updating scheme (eq:14.37). The j-th element of the direction vector δi is given by

δi,j=Hj,j(θi)qj(θi)

where qj( θi) denotes the j-th element of the score vector q( θi) and Hj,j(θi) denotes the (j,j) element of the Hessian matrix H( θi) . The score and the diagonal elements of the Hessian matrix are numerically evaluated at θi. The initialisation updating steps are repeated five times but it is stopped earlier if the stopping criterion

j∑δi,jj(θi)<0.01

holds.

During this simple estimation procedure, a choice is made of which parameter should be concentrated out of the log-likelihood. At every step i, the parameter related to the largest value in vector θi, is concentrated out when its corresponding value in δi is larger than 4. This rule reflects the desire to concentrate out the largest standard deviation of the likelihood. When the choice of concentrated parameter is changed, the other standard deviations should be adjusted, so they have the correct ratios with respect to the new concentrated parameter. This might create unstabilities in the search process. Therefore, the difference between the old and the new adjusted value is enforced to lie between -1 and 2. This strategy is a result of `fine tuning' after it has been applied to a wide range of different models and data-sets.

14.6.3.2 Strategy for setting parameters fixed

During the estimation procedure, certain parameters might take values close to their boundaries. When a parameter moves closer and closer to its boundary, it may slow down the optimisation procedure as a whole. In these circumstances, it is advised to fix parameters at their boundary values but to allow estimation to be continued for the remaining parameters. The strategy for setting parameters fixed is as follows. In the case of standard deviations, when the transformed parameter value is smaller that -5 and its gradient value is smaller than 0.0001 (in absolute terms), it is set to a constant implying a zero standard deviation.

Similarly, the damping factor ρ of a cycle is set to unity, when the element corresponding to θρ in vector θi is larger than 25. If the value related to θλ (corresponding to the period of a cycle) is larger than 7 or less than -7, the program reduces the cycle component to an autoregressive component. In both cases, it is assumed that the corresponding gradient value is smaller than 0.0001 (in absolute terms). The transformed parameters related to ρβ (slope) and ρυ (autoregressive) don't apply to this strategy.

14.6.3.3 Strategy for determining the concentrated parameter

The initial estimation procedure has determined a concentrated standard deviation from θσ . However, during the main estimation process, it might be more appropriate to concentrate out another parameter. When the largest transformed parameter value (corresponding to θσ ) exceeds the value 1.5, this particular parameter will be the new concentrated parameter. Then, the other parameters corresponding to θσ obtain values as ratios to the new concentrated parameter, the gradient vector q( θi) is re-calculated and the Hessian matrix is replaced by the diagonal matrix Hd( θi) .

14.6.4 Estimation of multivariate models*

The estimation procedure for multivariate models follows mainly the same route as for univariate models but there are some differences. The likelihood kernel l( θ) is the diffuse log-likelihood log L(y|θ) but properly scaled by a function of T. So there is no need to find an appropriate concentrated parameter. The EM algorithm plays a prominent role in the procedure to obtain initial variance matrices for the disturbances; see §14.4.2. The strategy of setting elements to zero in the Cholesky decompositions of the variance matrices is set out below. The parameters for the VAR component cannot be set fixed during the estimation procedure.

14.6.4.1 Initial values

The initial settings for the multivariate model are discussed in terms of the variance matrix, the Cholesky decomposition and the VAR coefficient matrix, rather than directly in terms of the parameter vector θ. The initialisation procedure automatically transforms the elements of these matrices into the vector θ0 .

The initial parameter settings of the variance matrices is similar to the univariate settings of the q-ratios, except that now a specific q-ratio applies to all N diagonal entries of the D matrix. The factor loading matrix Θ is set to the identity matrix. At this stage, the rank conditions of the variance matrices are ignored. The VAR matrix is set equal to a diagonal matrix with all entries equal to 0.9. The extra initial parameters for the cycle component are set manually by the user.

The Kalman filter is applied to the multivariate model and the sample variance matrix of the innovations is calculated. The variance matrices of the disturbances are set equal to this matrix but properly scaled by their appropriate q-ratios.

Then, the EM algorithm is applied; see §14.4.2. The number of EM steps is equal to 5 which has proved to be appropriate in most cases. The VAR matrix and the other stationary parameters (ρ and λ) remain fixed during the EM. After the EM steps the parameters are transformed as indicated in §14.6.1. For example, the Cholesky decomposition is used to get the diagonal matrix D and the factor loading matrix Θ from the variance matrix. At this stage, the rank conditions are taken into account. Finally, the parameters are stacked into θ0.

14.6.4.2 Strategy for fixing parameters

The need to restrict parameters to their boundary values is important to speed up the process of estimation. This is even more relevant for multivariate models since the estimation process is likely to take much more time than univariate models because filtering and smoothing involve matrices with larger dimensions and more parameters need to be estimated.

The diagonal entries of D are treated in the same way as the standard deviations in a univariate model. The restrictions for the damping factor and the frequency parameter are also done in a similar fashion as for univariate models. No restrictions can be made with respect to the parameters related to the VAR component. The factor loading matrix elements enter the parameter vector without transformation.

When the j-th diagonal element of D is restricted to zero, the rank of the variance matrix Σ is reduced by one. All elements of the corresponding j-th column of the factor loading matrix can also be restricted to zero since these elements don't have any effect on the variance matrix Σ. Any specific element of the factor loading matrix Θ is set to zero when its value is smaller than 0.0001 and its corresponding gradient is also smaller than 0.0001 (both in absolute terms).

14.7 Appendix 1: Diffuse distributions*

This appendix discusses technical details about the way we treat initial and regression effects for the SSF. Most components in STAMP are initialised using diffuse priors and so we must treat these priors carefully in our calculations to ensure stable and exact results. This is carried out using the idea of augmented filtering and smoothing. To ensure computational efficiency the Kalman filter (KF) needs to be collapsed at some point.

14.7.1 Collapse of the augmented KF

For a SSF with k=0; that is, Xt=0 and Wt=0, t=1,...,T , is treated by the AKF for t=1,...,m and the AKF is collapsed to the KF using (eq:14.15) at t=m. Then, the KF evaluates ât+1|t and P̂t+1|t directly, t=1,...,T . The collapse also contains the operation

m=qm-sm'Sm-1sm.

Indeed, the KF evaluates the residual vector v̂t and its (unscaled) variance matrix F̂t directly as well. The value for m is restricted by T≥m≥d and |Sm|>0 but usually it is set to m=d.

For a SSF with k>0, a full collapse cannot be made since Xt and Wt are time-varying. However, a partial collapse can take place with respect to δi. This means that the AKF is still applied but with a reduced number of columns. Therefore, we partition the AKF matrices

At|t-1 = (
Ax,t|t-1, Ai,t|t-1
) , Vt = (
Vx,t, Vi,t
)
st = (
sx,t
si,t
) , St = (
Sxx,t Six,t'
Six,t Sii,t
)
(eq:14.39)

When Sii,t is invertible at t=m, the AKF can be partially collapsed to

{\bf A}t+1|t = Ax,t+1|t + Ai,t+1|tSii,t-1Six,t,
{\bf P}t+1|t = Pt+1|t + Ai,t+1|tSii,t-1Ai,t+1|t',

and

qt = qt- si,t'Sii,t-1si,t,
{\bf s}t = sx,t-Six,t'Sii,t-1si,t,
% \TeXButton{bg}{\displaystyle }{\bf S}t = Sxx,t-Six,t'Sii,t-1Six,t.

Note that matrix B of the (partially collapsed) AKF is replaced by {\bf B}=Bx. The matrices {\bf A}t+1|t, {\bf V}t=-Zt{\bf A}t+1|t-Xt% {\bf B} and {\bf B} have k columns.

14.7.2 Likelihood calculation

At the end of a fully collapsed AKF, the estimate of σ2 is given by σ̂ 2=q̂T/( NT-d-k) where k=0 and q̂T is obtained from the collapsed KF. When k>0, q̂T is given by

T=qT-sT'ST-1sT.

Likelihood calculation also requires the log value of the term |S*| where S*=ST. If k=0 and a full collapse has taken place at t=m, then |S*|=|Sm|. If k>0 and a partial collapse has taken place at t=m, then |S*|=|Sii,m||% {\bf S}T|. Note that the determinants are obtained without any extra computational costs since the inverses of these matrices are calculated earlier.

14.7.3 Recursive regressions

The recursive evaluation of least squares estimates is common practice in standard regression analysis since it is a way to get the recursive residuals. The same technique can be applied to the recursive evaluation of St which consists of the equations

ât|t-1 = at|t-1+At|t-1ct-1, P̂t|t-1 = Pt|t-1+At|t-1Ct-1At|t-1',
v̂t = vt+Vtct-1, F̂t = Ft+VtCt-1Vt',
Kt* = F̂t-1VtCt-1,
ct = ct-1+Kt*'v̂t, Ct = Ct-1-Kt'F̂tKt*.
(eq:14.40)

where ct=St-1st and Ct=St-1.

14.8 Appendix 2: Numerical optimisation*

Numerical optimisation methods are used to obtain the maximum likelihood estimates for the parameter vector θ. Here we outline some of the basic principles behind the estimation procedure of STAMP.

There is a vast literature on non-linear optimisation techniques (see, among many others, Gill, Murray and Wright (1981), Cramer (1986) and Thisted (1988)). Note that many texts on optimisation focus on minimisation, rather than maximisation, but of course: max l( θ) =- min -l( θ) .

A first approach to obtaining the MLE θ from l( θ) is to consider solving the score equations, assuming the relevant partial derivatives exist:

l( θ) =
l( θ)

∂θ
=q( θ) .
(eq:14.41)

Then q( θ) =0 defines the necessary conditions for a local maximum of l( θ) at θ. A sufficient condition is that

2l( θ) =
2l( θ)

∂θ∂θ'
=
q( θ)

∂θ
'=-Q( θ)
(eq:14.42)

also exists and is negative definite at θ. If Q( .) is positive definite for all parameter values, the likelihood is concave, and hence has a unique maximum; if not, there could be local optima or singularities.

14.8.1 Newton type methods

When q(θ) is a set of equations that are linear in θ, q(θ)=0 can be solved explicitly for θ. More generally, q(θ) is non-linear, yielding a problem of locating θ which is no more tractable than maximising l(θ). Thus, we consider iterative approaches in which a sequence of values of θ (denoted θi at the ith iteration) is obtained approximating q(θi)=0, and corresponding to non-decreasing values of the criterion function l(.) : l(θi+1)≥l(θi):

θi+1=h( θi) for i=1,2,...,I≤M
(eq:14.43)

where M is a terminal maximum number of steps, from an initial value θ0. A convergence criterion is used to terminate the iteration, such as q(θi+1)~=0 or |l(θi+1)-l(θi)|≤ε. These implementation-specific aspects are discussed in §14.6.

Expand q( θ) =0 in a first-order Taylor's series:

q( θ) ~=q( θ1) -Q( θ1) ( θ-θ1) =0,
(eq:14.44)

written as the iterative rule:

θi+1=θi+H( θi) q( θi) ,
(eq:14.45)

where H( .) =Q-1( .) is the Hessian matrix. The gradient, q(θi), determines the direction of the step to be taken, and H(θi) modifies the size of the step which determines the metric: this algorithm is the Newton-Raphson technique, or Newton's method. Even when the direction is uphill, it is possible to overstep the maximum in that direction. In that case it is essential to add a line search to determine a step length si∈[ 0,1] :

θi+1=θi+siδi,
(eq:14.46)

where δi=H( θi) q( θi) and si is chosen to ensure that l( θi+1) ≥l( θi) .

STAMP uses the so-called variable metric or Broyden-Fletcher-Goldfarb-Shanno (BFGS) update quasi-Newton methods. These approximate the Hessian matrix H( θi) by a symmetric positive definite matrix Ki which is updated at every iteration but converges on H( .) .

Let d=siδi=θi-θi-1 and g=q( θi) -q( θi-1) ; then the BFGS update is:

Ki+1=Ki+( 1+
g'Kig

d'g
)
dd'

d'g
-
dg'Ki+Kigd'

d'g
.
(eq:14.47)

This satisfies the quasi-Newton condition Ki+1g=d, and possesses the properties of hereditary symmetry (Ki+1 is symmetric if Ki is), hereditary positive definiteness, and super-linear convergence.

14.8.2 Numerical score vector and diagonal Hessian matrix

When the analytic formula for q( θ) cannot be obtained easily, STAMP uses BFGS with a numerical approximation to q( .) based on finite difference approximations. The numerical derivatives are calculated using:

qj( θ) =
l( θ)

∂( ιj'θ)
~=
l( θιj) -l( θ)

ε
(eq:14.48)

where ιj is a unit vector (for example, ( 0  1  0  ...  0) ' for j=2), ε is a suitably chosen step length. Thus, ε represents a compromise between round-off error (cancellation of leading digits when subtracting nearly equal numbers) and truncation error (ignoring terms of higher order than ε in the approximation).

The diagonal entries of the Hessian matrix are numerically evaluated as

Hj,jd( θ) =-
1

Qj,j( θ)

where

Qj,j( θ) =
2l( θ)

∂( ιj'θ) ∂( ιj'θ)
=
∂qj( θ)

∂( ιj'θ)
~=
qj( θιj) -qj( θ)

ε

with the same step-length ε.

Chapter 15 Model Output

This chapter defines the output of STAMP. The first section explains when and where the various parts of the output appear. The remaining sections explain the output is presented for every option of the Test menu.

15.1 Output from STAMP

This section details the output which occurs when a model is fitted to data. There are three different forms of output. Firstly, the iterations of the numerical optimisation algorithm are displayed in the Message window. Secondly, once the estimation algorithm has stopped (hopefully having converged!), the Results window reports details of the model specification and the results of the optimisation algorithm, and gives a summary of diagnostic statistics. Finally, the Test menu can be used to prompt STAMP to output specific details of the parameter estimates, state variables, model fit and diagnostics.

15.1.1 Model estimation

In order to compute the maximum likelihood estimates, STAMP first computes initial estimates and then it moves on to maximise the likelihood function directly by numerical optimisation; see Chapter 14. During this second phase various statistics are reported in the Message window:

15.1.2 Selected model and estimation output

The initial output, written to the Results window, gives details of the selected model and various results from the numerical optimisation procedure; an example is reproduced in §??. In particular, the selected components are listed and the number of parameters is given. The sample period used for parameter estimation is also given.

A variety of statistics emerge from the numerical optimisation. The likelihood kernel evaluated at θ̂ is reported together with the number of iterations required for convergence. Three different convergence measures are reported and these are categorised by some term like STRONG CONVERGENCE; see §14.6.

15.1.3 Summary statistics

An example of the diagnostic summary report, which is printed to the Results window after fitting, is given in §??. Broadly this summary gives the basic diagnostics and goodness-of-fit statistics. The precise definitions are given in §15.4 and §15.7.

Notice that the loss in the degrees of freedom d of the Box--Ljung Q-statistic takes account of the number of relative parameters in θ. It does not take account of any lagged dependent variables and their presence necessitates further adjustment.

15.1.4 The sample period

The parameters are estimated using the model sample as recorded from the Estimation dialog in the Model menu. The calculation of diagnostics and goodness-of-fit measures are calculated using the state sample. The default is that the model sample and the state sample are the same. However, the state sample might be changed to check sub-samples isolated from the rest of the sample. This can be done in the option Final state. Of course, both samples must be within the boundaries of the data sample.

15.1.5 Test

The nine items of the Test menu are self-explanatory. The headings of the following sections correspond to them. Note that the output for Goodness of Fit is requested by the Final state option and that many of the conventional time series diagnostics (built out of the innovations, or one-step ahead prediction errors) will come up in the Residual option.

15.2 Parameters

The variance parameters (and variance matrix parameters for multivariate models), together with frequencies and damping factors for cycles and coefficients for autoregressive components are referred to as parameters. The maximum likelihood estimates of the parameters can be requested from the Further output dialog. In a multivariate model, the same output is produced for each equation in turn. In addition, the full variance matrices and their decomposition in factor loading matrices and diagonal scaling matrices can be produced. This option also outputs the transformed parameters with their scores and approximate standard errors.

15.2.1 Variances and standard deviations

This gives the variances of the disturbances driving the various components. Thus, in terms of §14.1.1, the level variance is ση 2, the slope variance is σζ 2 and so on. The standard deviations are the square roots of the variances. The q-ratios are the ratios of each variance or standard deviation to the largest.

For multivariate models, the variances and standard deviations are obtained from the diagonal entries of the variance matrices. In order to verify the q-ratios for multivariate models, the largest variance across all series is identified, and this component is set to one for all series, irrespective of whether or not it is the largest in a particular series. However, when this variance in a particular series is zero, all associated q-ratios are set to zero. This also applies to q-ratios for standard deviations.

15.2.2 Cycle and AR(1)

For a stochastic cycle component, this option outputs the damping factor ρ, the frequency λ, the period 2π/λ, and the period in terms of `years'; that is, 2π/sλ, where s is the number of `seasons'. In addition it gives the variance of the cycle, σψ 2κ 2/(1-ρ2). For a multivariate model, the variance matrix of the cycle is given. This is

Σψ =(1-ρ2)-1Σκ ,

where Σκ is the variance matrix of the cycle disturbances.

For the AR(1) component in a univariate model, the autoregressive coefficient is reported. For a multivariate model the N×N matrix of coefficients is presented. The variance matrix of the disturbance vector ξt is produced.

15.2.3 Variance matrices (multivariate models)

The N×N variance matrices of the disturbances correspond to the variance parameters in a single equation model; see §14.1.3. In the STAMP output, the variance matrix is given by the lower triangular part of an N×N matrix, with the variances corresponding to the disturbance terms in each series given on the main diagonal (these are also produced by `Variances' above). The upper triangular part of the matrix gives the off-diagonal entries in the corresponding correlation matrix.

In a common factor model, some, or all, variance matrices will be of reduced rank; see §14.1.4. This will be reflected by values of unity in specific entries of the correlation matrix.

15.2.4 Factor loading matrices (multivariate models)

The N×K factor loading matrix Θ is defined in §14.1.4. When no common factor restrictions are imposed, so that N=K, the factor loading matrix is the lower triangular matrix associated with the Cholesky decomposition of variance matrix Σ; that is, Σ=ΘDΘ'. This option outputs the lower triangular factor matrix Θ and the diagonal scaling matrix D.

The vectors of constants, such as {\bf \mu } and % {\bf \beta }, are defined in §14.1.4 and only exist when common factors are specified in the model. The dimension of these vectors is N-K. The constant vector does not apply to the irregular component.

15.2.5 Transformed parameters and standard errors

The vector of estimated parameters, which are transformed as in §14.6, can be written to the Results window using this option. It also reports the score vector, associated with the estimated vector of parameters, and the numerical standard errors of the transformed parameters.

The scores are calculated as described in §14.6. The standard errors are obtained from the diagonal entries of the numerically evaluated Hessian matrix Q. Formally, the standard error of the i-th estimated parameter is defined as

sei=-1/√Qi

where Qi is the i-th diagonal entry of matrix Q.

15.3 Final state

The estimate of the final state vector contains all the latest information on the components in the model. Together with the regression estimates of the explanatory variables and the interventions, the final state contains all the information needed to make forecasts.

The Final state dialog also provides useful information about the estimated seasonal pattern and the amplitude of the cycle. Some goodness-of-fit statistics can be generated, but these are discussed in §15.4.

15.3.1 Analysis of state

In terms of §14.3.4, the final state is the filtered estimate at time T; that is, âT|T. The square roots of the diagonal elements of the corresponding MSE matrix, σ̂2P̂T|T, are the root mean square errors (RMSEs) of the corresponding state elements. The t-value is the estimate of the state divided by the RMSE. Applied to the i-th element of the state, this is

t-valuei=ai~/~RMSEi.

where ai is the i-th element of vector âT|T.

The Prob.values shows the probability of the absolute value of a standard normal variable exceeding the absolute value of the t-value. In a hypothesis testing situation, a Prob.value of less than 0.05 corresponds to rejection of the hypothesis of a zero value on a two-sided test at the 5%level of significance. A Prob.value of less than 0.05 is indicated by *, while a value less than 0.01 is denoted **.

The various unobserved components which make up the structural time series model all appear in the final state .

The same output appears for multivariate models, but it is repeated for each individual series.

15.3.2 Regression analysis

The estimates of the regression parameters, the parameters of the intervention variables and the fixed seasonal dummies are based on the matrix (sT,σ̂2ST); see §14.3.4. These estimates are reported upon request (if part of the specified model) and they may be interpreted in much the same way as in a standard regression model. Because they are deterministic (not time-varying), their RMSEs are standard deviations. The t-values would have t-distributions if the (relative) parameters were known. Although they are asymptotically normal, a t-distribution may provide a better approximation to small sample properties. As was noted earlier, the Prob. values are based on the normal distribution.

15.3.3 Seasonal tests

If the seasonal component is deterministic, either because it is specified to be `fixed' at the outset or its disturbance variance is estimated to be zero, a joint test of significance can be carried out on the s-1 seasonal effects. The test is essentially the same as a test for the joint significance of a set of explanatory variables in regression. Under the null hypothesis of no seasonality, the large sample distribution of the test statistic, denoted by `Seasonal Chi2(s-1)' in the output, is χs-12. The Prob. value is the probability of a χs-12 variable exceeding the value of the test statistic. In the case of a stochastic seasonal, the joint seasonal test is also produced although a formal joint test of significance of the seasonal effect is inappropriate. However, the seasonal pattern is persistent throughout the series and when the seasonal pattern changes relatively slowly, which is usually the case, the test statistic can provide a useful guide to the relative importance of the seasonal.

The formal definition of the test statistic is

a'P-1a

where a contains the estimates of the s-1 seasonal effects at time T and P is the corresponding MSE matrix. In the case of a fixed seasonal, a and P are obtained from the matrix (sT,σ̂2ST). For a stochastic seasonal, a and P come from the matrix (âT|T,σ̂2P̂T|T); see §14.3.4.

The final estimate of the seasonal effect for each season; that is, the estimated seasonal pattern at time T, is given here as well.

15.3.4 Cycle tests

The amplitude of the cycle is

(a2+a*2)½

where a is the element of âT|T corresponding to ψT and a* corresponds to ψT*. The reported ratio is the comparison of the amplitude of the cycle with the level of the trend, and is also reported. This gives an indication of its relative importance. When the cycle is deterministic, but stationary, a joint significance χ2 test (the same as the seasonal test) is produced as well.

15.3.5 Data in logs

When it is indicated that a model has been estimated in logarithms, some additional output for specific components appears in the Results window.

15.4 Goodness of fit

The goodness-of-fit statistics in STAMP all relate to the residuals, which are the standardised innovations; see §14.3.6. When explanatory variables are specified in the model, the generalised least squares residuals are used; see §14.3.6.1. The diagnostic checking facility for these residuals are provided in the Residuals option; see §15.7.

15.4.1 Prediction error variance

The basic measure of goodness-of-fit is the PEV as defined in §14.3.3. The PEV is the variance of the residuals in the steady state and so it corresponds to the variance of the disturbance of the reduced form ARIMA model. When the Kalman filter does not converge to the steady state, the finite PEV is used. For a univariate model, the PEV is denoted by σ̃2, irrespective of whether or not it is the finite PEV. For a multivariate model, σ̃2 denotes the diagonal entry of the prediction error variance matrix Σ̃ corresponding to a specific series in yt. The output includes a message if the Kalman filter is not in the steady state at t=T. The Final state option includes the facility to re-calculate the PEV using a pre-set number of updates.

15.4.2 Prediction error mean deviation

Another measure of fit is the mean deviation of the residuals, vt,t=d+1,...,T, defined by

md=
σ̃

T-d
t=d+1T|vt|

where the PEV σ̃2 is defined in §14.3.3. In a correctly specified model the reported ratio

2σ̃2

π~md2

should be approximately equal to unity.

When it is indicated that the the series are modelled in logs, the relative measure of error is also given. This measure is easily computed as 100md.

15.4.3 Coefficients of determination

In econometric modelling the traditional measure of goodness-of-fit is the coefficient of determination R2. For a structural time series model this measure is defined by

R2=1-
(T-d)σ̃2

t=1T(yt-y)2
  ,

where yt is the original series and y is its sample mean. Note that for a multivariate model, yt is a specific series of the vector yt and σ̃2 is the diagonal element of Σ̃ corresponding to that series. This coefficient of determination is most useful when the series appears to be stationary with no trend or seasonal.

When the series yt shows trend movements, it is better to compare the PEV with the variance of first differences, Δyt=yt-yt-1. This leads to

RD2=1-
(T-d)σ̃2

t=2T(Δyt-\Delta y)2
.

where \Delta y is the sample mean of Δyt. It should be noted that RD2 may be negative, indicating a worse fit than a simple random walk plus drift model.

For seasonal data with a trend, it is more appropriate to measure the fit against a random walk plus drift and fixed seasonals. This requires the sum of squares, SSDSM, obtained by subtracting the seasonal mean from Δyt. The coefficient of determination is then

RS2=1-
(T-d)σ̃2

SSDSM
,

which can again be negative.

15.4.4 Information criteria : AIC and BIC

The fit of different models can be compared on the basis of the PEV. However when models have different number of parameters, it is more appropriate to compare them using the Akaike information criterion (AIC) or the Bayes information criterion (BIC). These criteria are based on the formula

log PEV+cm/T

where c is a constant and m is the number of parameters in θ plus the number of non-stationary components in the state vector. The AIC takes c as 2 and the BIC takes c as log T. For multivariate models, m is defined as for a univariate model, since the criteria are reported for each series in yt separately.

15.5 Components

The `Components' option facilitates the computation of filtered or smoothed estimates of the specified components in the model. The filtered estimates are based on the past observation and can be calculated as described in §14.3.5. The smoothed estimates are based on all the observations, see §14.4.2. In what follows, the estimates are denoted by x̃, where x is some component which can be estimated either way. When the model is multivariate, the component is associated with a specific series in yt.

15.5.1 Series with components

This option plots the original series, yt, with one of the following (as requested):

15.5.2 Detrended

The detrended series is yt-μ̃t for t=1,...,T. The smoothed estimates are usually most appropriate for detrending. Note that level and slope interventions are included in the trend.

15.5.3 Seasonally adjusted

The seasonally adjusted series is yt-γ̃t for t=1,...,T. The smoothed estimate of the seasonal component is normally used for seasonal adjustment.

15.5.4 Individual seasonals

The estimated seasonal component γ̃t is re-arranged into s different series, each corresponding to one of the seasons. The graph can be interpreted as a set of annual plots for the individual seasonal effects.

15.5.5 Data in logs

If the observations are in logarithms, the output of the various components may be amended in two ways: (1) `Anti-logs' can be calculated; or (2) `Modified anti-logs' can be calculated. These amendments do not apply in a similar way for all components. The following list gives the formal amendments:

15.6 Residuals

The residuals are the standardised one-step-ahead prediction errors or innovations and they are defined in §14.3.6. The residual series are simply denoted by vt for t=d+1,...,T. For multivariate models, the series vt are associated with a specific series in yt. For a correctly specified model with known variance parameters, the residuals are assumed to be NID(0,1). Diagnostic statistics and graphs are the tools to validate this proposition.

15.6.1 Correlogram

The residual autocorrelation at lag τ is defined by

rτ=
t=τ+d+1T( vt-v) ( vt-τ-v)

t=d+1T( vt-v) 2
,  where v=
1

T-d
t=d+1Tvt.
(eq:15.1)

The approximate standard error for rτ is 1/(T-d)½. The Durbin and Watson (1950) statistic, is given by

DW=
t=2+dT(vt-vt-1)2

t=1+dTvt2
~~=~2{1-r(1)}.

In a correctly specified model, DW is approximately N(2,4/T).

A general test for serial dependence is the portmanteau Box--Ljung Q-statistic, which is based on the first s residual autocorrelations; see Ljung and Box (1978). It is given by

Q( P,d) =T(T+2)∑j=1P
rj2

T-j
  .

and distributed approximately as χd2 under the null, where d is equal to P-n+1 and n is the number of parameters. For a multivariate model, n is determined as in a univariate model, so a variance matrix counts as one parameter. Notice that the loss in the degrees of freedom d of Q(P,d) takes account of the number of relative variance parameters. It does not take account of any lagged dependent variables and their presence necessitates further adjustment.

15.6.2 Periodogram and spectrum

These diagnostic graphs are defined in §13.3 and §??. They are applied to vt in exactly the same way. Note that the theoretical spectrum is a horizontal straight line for a white-noise series.

15.6.3 Cumulative statistics and graphs

The `Cusum' and `Cusum of squares' are as given in §13.5 and §13.6 respectively, but are now calculated using the residuals vt for t=d+1,...,T.

The `Cumulative periodogram' is another cumulative graph which is useful since it smooths out the periodogram. The cumulative periodogram cpj is defined as

cpj=
i=1jpi2

i=1npi2

where pi is the periodogram ordinate at i for i=1,...,n and n=[(T-d)/2]. The cumulative periodogram cpj is plotted against j=0,...,n. A useful diagnostic statistic is to measure the maximum deviation of the cpj from the diagonal line connecting cp0=0 and cpn=1; that is, max |cpj-j/n| over j=1,...,n-1; see Durbin (1969). This measure is outputted on request in the Results window. Some typical critical values for this two-sided test are .44 for T ~=80, .40 for T ~=100 and .30 for T ~=200.

15.6.4 Distribution statistics

The Bowman and Shenton (1975) statistic for normality is calculated using the residuals vt for t=d+1,...,T. As for the other statistics, and the graphs on the density, histogram and distribution, the method of calculation is as presented in §13.10.

15.6.5 Heteroskedasticity

A basic non-parametric test of heteroskedasticity is the two-sided Fh,h.-test,

H(h)=∑t=T-h+1Tvt2/∑t=d+1d+1+hvt2~

where h is the closest integer to (T-d)/3. An increase of the variance in the last third of the sample will cause H(h) to be large, while the reverse will cause it to be small. The heteroskedasticity test appears as part of the Summary statistics.

15.7 Auxiliary residuals

Auxiliary residuals are designed to detect unusual movements in a fitted time series model. These residuals are the smoothed estimates of the disturbances associated with the components irregular, level and slope. Section 14.4.4 sets out how to calculate the standardised auxiliary residuals. It is very useful to plot them and look for any irregularities. If requested, all residuals which have absolute value exceeding two are written to the Results window.

As well as graphical representations of the auxiliary residuals, STAMP reports the standard distribution statistics for these residuals; see §13.10. Of course, these statistics cannot be used as standard moment tests because of the serial correlation in the auxiliary residuals. However these statistics can be corrected for serial correlation as in Harvey and Koopman (1992). In STAMP, the correction factors are calculated from the first max ((T)½, 20) sample autocorrelations of the auxiliary residuals.

15.8 Predictive testing

When the final date of the state sample is smaller than the final date of the data sample; that is, there are L observations, yt, t=T+1,...,T+L `outside the sample', post-sample predictive testing may be carried out using all or some of these observations. However, when these final dates are the same, predictive testing within the sample is the only option; that is, predicting yt for t=T-l+1,...,T. Of course, in the first situation, predictive testing within the sample is also possible.

When explanatory variables and/or interventions are specified in the model, the residuals used in predictive testing within the sample are the generalised recursive residuals defined in §14.3.6.2. These residuals are denoted by wt for t=d*+1,...,T where d*=d+k and k is the number of explanatory variables in the model (excluding any variables which do not appear after time T-L). Note that wt is zero when an intervention enters into the model. The error bands for the residuals are based on twice the RMSE. The CUSUM is defined in §13.5, and it is calculated for these residuals if requested.

The Chow test within the sample is constructed as

Chow=
T-l-d*

l
t=T-l+1Twt2~/~∑t=d*+1T-lwt2.

which is approximately distributed as F(l,T-l-d*).

When post-sample predictive testing is carried out, the estimates of the coefficients of the explanatory variables and/or interventions remain the same as they were at the end of the state sample. The corresponding standardised residuals are denoted by vt for t=T+1,...,L . The post-sample predictive `failure' test statistic is then

pft=∑j=1LvT+j2,

which is approximately distributed as χL2.

The CUSUM t-test is also outputted in the form

cusumt=L-1/2j=1LvT+j,

which is approximately distributed as a t distribution with T-L-d* degrees of freedom.

The extrapolative residuals, which are defined in §14.5.2, can be used to assess the predictive performance of the model and make comparisons with rival models. If v̂T+j|T, j=1,...,L, denotes the j-step ahead extrapolative residual in the post-sample period, the extrapolative sum of squares is given by

ess(T,L)=∑j=1LT+j|T2.

This statistic can also be generated using residuals within the sample. A similar statistic based on the sum of absolute values is defined as

esa(T,L)=∑j=1L|v̂T+j|T|.

15.9 Forecast

The forecasts of the series yt and the components are generated as described in §14.5. The forecast error bounds are based on one RMSE. For example, the forecasts of a specific series ŷT+j are obtained from a particular element of ŷT+j|T, for j=1,...,L, and the RMSE r̂T+j is the square root of the corresponding diagonal element of the matrix σ̂2F̂T+j|T. When the data are in logs, the anti-log transformation ( exp ) can be calculated for the forecasts. Their corresponding RMSEs are then based on

T+j*= exp (ŷT+j+r̂T+j)- exp (ŷT+j),    j=1,...,L.

The anti-log forecasts adjusted for bias are calculated by

T+j*= exp (ŷT+j+0.5r̂T+j2),    j=1,...,L.