Unobserved Components Time Series Methodology
- 14 Statistical Treatment of Models
- 14.1 Model definitions
- 14.1.1 Univariate time series models
- 14.1.2 Explanatory variables and interventions
- 14.1.3 Multivariate time series models
- 14.1.4 Common factors
- 14.1.4.1 Common levels
- 14.1.4.2 Smooth trends with common slopes
- 14.1.4.3 Common trends: level and slopes
- 14.1.4.4 Common seasonals
- 14.1.4.5 Common cycles
- 14.1.5 Explanatory variables
- 14.2 State space form
- 14.2.1 Structural time series models in SSF
- 14.2.1.1 Univariate models in SSF
- 14.2.1.2 Multivariate model in SSF
- 14.2.1.3 Deterministic trend and seasonal components
- 14.3 Kalman filter
- 14.3.1 The augmented Kalman filter
- 14.3.2 The likelihood function
- 14.3.2.1 Univariate models: the concentrated likelihood
- 14.3.2.2 Homogeneous models
- 14.3.2.3 Multivariate models
- 14.3.3 Prediction error variance
- 14.3.4 The final state and regression estimates
- 14.3.5 Filtered components
- 14.3.6 Residuals
- 14.3.6.1 Generalised least squares residuals
- 14.3.6.2 Generalised recursive residuals
- 14.4 Disturbance smoother
- 14.4.1 The augmented disturbance smoother
- 14.4.2 The EM algorithm and exact score
- 14.4.3 Smoothed components
- 14.4.4 Auxiliary residuals
- 14.5 Forecasting
- 14.5.1 Forecasts of series and components
- 14.5.2 Extrapolative residuals
- 14.6 Parameter estimation
- 14.6.1 The parameters of STAMP
- 14.6.1.1 Variance matrices in a multivariate model
- 14.6.1.2 Vector autoregressive components
- 14.6.2 BFGS Estimation procedure*
- 14.6.3 Estimation of univariate models*
- 14.6.3.1 Initial values
- 14.6.3.2 Strategy for setting parameters fixed
- 14.6.3.3 Strategy for determining the concentrated parameter
- 14.6.4 Estimation of multivariate models*
- 14.6.4.1 Initial values
- 14.6.4.2 Strategy for fixing parameters
- 14.7 Appendix 1: Diffuse distributions*
- 14.7.1 Collapse of the augmented KF
- 14.7.2 Likelihood calculation
- 14.7.3 Recursive regressions
- 14.8 Appendix 2: Numerical optimisation*
- 14.8.1 Newton type methods
- 14.8.2 Numerical score vector and diagonal Hessian matrix
- 15 Model Output
- 15.1 Output from STAMP
- 15.1.1 Model estimation
- 15.1.2 Selected model and estimation output
- 15.1.3 Summary statistics
- 15.1.4 The sample period
- 15.1.5 Test
- 15.2 Parameters
- 15.2.1 Variances and standard deviations
- 15.2.2 Cycle and AR(1)
- 15.2.3 Variance matrices (multivariate models)
- 15.2.4 Factor loading matrices (multivariate models)
- 15.2.5 Transformed parameters and standard errors
- 15.3 Final state
- 15.3.1 Analysis of state
- 15.3.2 Regression analysis
- 15.3.3 Seasonal tests
- 15.3.4 Cycle tests
- 15.3.5 Data in logs
- 15.4 Goodness of fit
- 15.4.1 Prediction error variance
- 15.4.2 Prediction error mean deviation
- 15.4.3 Coefficients of determination
- 15.4.4 Information criteria : AIC and BIC
- 15.5 Components
- 15.5.1 Series with components
- 15.5.2 Detrended
- 15.5.3 Seasonally adjusted
- 15.5.4 Individual seasonals
- 15.5.5 Data in logs
- 15.6 Residuals
- 15.6.1 Correlogram
- 15.6.2 Periodogram and spectrum
- 15.6.3 Cumulative statistics and graphs
- 15.6.4 Distribution statistics
- 15.6.5 Heteroskedasticity
- 15.7 Auxiliary residuals
- 15.8 Predictive testing
- 15.9 Forecast
- Part:4 Statistical Output
- 13 Descriptive Statistics
- 13.1 Transformations
- 13.2 Correlogram
- 13.3 Periodogram
- 13.4 Spectrum
- 13.5 CUSUM
- 13.6 CUSUM of squares
- 13.7 Histogram
- 13.8 Density estimation
- 13.9 Distribution function
- 13.10 Write report
- 13.10.1 Serial correlation
- 13.10.2 Normality test
- List of tables
- Table:13.1 Empirical size of normality tests
- Table:14.1 Some special level and trend specifications
- Table:14.2 Parameters and transformations
Part:4 Statistical Output
Chapter 13 Descriptive Statistics
The Transform entry in the Data menu involves transformation of series and calculation of statistics. This chapter provides the formulae underlying the computations of descriptive statistics for individual time series. Model-related diagnostic statistics are presented in Chapter 15.
The Describe dialog, to be accessed from the Transform dialog, produces two types of output. The first, and most useful, is graphical output. Each of the options we discuss below produces graphs of one sort or another. However, the last entry in the Describe dialog, `Write report', generates output to be written to the Results window. By default this is turned off, but it is often useful to mark it so that precise summary statistics can be recorded for later use in a research report.
13.1 Transformations
This option provides two different kinds of transformations: a transform function for the actual values and a time series transform function. These different transformations are described below.
The very common and basic transformations are the logarithm, the exponential and the absolute value. The `Box--Cox' and the `Stochastic volatility' transformations require some explanation. The transformation of Box and Cox (1964) is given by

$$y_t^{(\lambda)} = \begin{cases} \left(y_t^{\lambda} - 1\right)/\lambda, & \lambda \neq 0, \\ \log y_t, & \lambda = 0, \end{cases}$$

where λ usually ranges between −2 and 2 (STAMP does not accept values outside this range, to ensure numerical stability). Note that the Box--Cox transformation with the default λ = 1/2 is very similar to the `square root' transformation. The following stochastic volatility transformation (see also Chapter ??) is suggested by Professor Wayne Fuller; see Fuller (1996, pp. 494--497); that is,

$$x_t = \log\left(y_t^2 + \lambda s_y^2\right) - \frac{\lambda s_y^2}{y_t^2 + \lambda s_y^2},$$

where $s_y^2$ is the unconditional variance of yt, and λ is a small constant, with default value 0.02.
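As a worked illustration, the following Python sketch applies both transformations to a series. It is not part of STAMP; in particular, the exact form of the Fuller transformation coded here is an assumption based on the description above.

```python
import numpy as np

def box_cox(y, lam=0.5):
    """Box--Cox transform; lam = 0 gives the log transform."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y**lam - 1.0) / lam

def sv_transform(y, lam=0.02):
    """Stochastic volatility (Fuller) transform -- a sketch assuming the
    form log(y^2 + lam*s^2) - lam*s^2/(y^2 + lam*s^2)."""
    y = np.asarray(y, dtype=float)
    c = lam * y.var()                 # lam times unconditional variance
    return np.log(y**2 + c) - c / (y**2 + c)
```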
The second type of transformation focuses on differencing or integrating time series. If differencing is required, the following operators are available:

$$\Delta y_t = (1 - L)\,y_t, \qquad \Delta_s y_t = (1 - L^s)\,y_t,$$

where L is the lag operator (so L yt = yt−1) and s is the number of seasons in a year. When the differencing involves missing data, such as y0 and y−1, the resulting differenced series records a missing value; for example, Δy1 is regarded as missing. If integration is required, the resulting series, called zt, is defined as

$$z_t = y_t + z_{t-1}, \qquad t = 1, \dots, T,$$

where zt is set to zero for any t ≤ 0. This operation does not generate extra missing values.
Finally, the series can be detrended. This is carried out using the following trend model:

$$y_t = \mu_t + \varepsilon_t, \qquad \mu_t = \mu_{t-1} + \beta_{t-1}, \qquad \beta_t = \beta_{t-1} + \zeta_t, \qquad \zeta_t \sim NID(0, q\,\sigma_\varepsilon^2),$$

which is discussed in §14.1. A forwards and backwards recursive algorithm is used to estimate μt, written μ̂t, allowing the `detrended' series to be computed as yt − μ̂t. The default value of q is 0.000625 = 1/1600, as suggested by Hodrick and Prescott. Increasing this number fits a more responsive trend line, while setting it to zero gives a deterministic time trend.
13.2 Correlogram
The correlogram is the series rτ, for τ = 1, ..., p, where rτ is the sample autocorrelation between yt and yt−τ. The length of the correlogram p is pre-specified, leading to a figure which shows r1, r2, ..., rp plotted against 1, 2, ..., p. The autocorrelation for lag τ is defined by

$$r_\tau = \frac{\sum_{t=\tau+1}^{T} (y_t - \bar y)(y_{t-\tau} - \bar y)}{\sum_{t=1}^{T} (y_t - \bar y)^2},$$

where $\bar y = \frac{1}{T}\sum_{t=1}^{T} y_t$ is the sample mean of yt. The approximate standard error of rτ is $1/\sqrt{T}$.
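For concreteness, a minimal Python sketch of this calculation (not the STAMP implementation) is:

```python
import numpy as np

def correlogram(y, p):
    """Sample autocorrelations r_1, ..., r_p about the sample mean."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.sum(d**2)
    return np.array([np.sum(d[tau:] * d[:-tau]) / denom
                     for tau in range(1, p + 1)])
# Approximate standard error under white noise: 1/sqrt(T).
```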
13.3 Periodogram
The information in the correlogram can also be expressed in the frequency domain. The ordinates of the periodogram are

$$p(\lambda_j) = \frac{2}{T}\left\{\left(\sum_{t=1}^{T} y_t\, c_{j,t}\right)^2 + \left(\sum_{t=1}^{T} y_t\, s_{j,t}\right)^2\right\}, \qquad \lambda_j = \frac{2\pi j}{T},$$

with cj,t = cos λj t and sj,t = sin λj t, for j = 1, ..., n−1, where n = [T/2]. The sine and cosine terms are evaluated recursively, using the trigonometric addition rule; that is,

$$c_{j,t+1} = c_{j,t}\cos\lambda_j - s_{j,t}\sin\lambda_j, \qquad s_{j,t+1} = c_{j,t}\sin\lambda_j + s_{j,t}\cos\lambda_j,$$

for t = 0, ..., T−1, starting from cj,0 = 1 and sj,0 = 0. Note that the terms cos λj and sin λj, for j = 0, ..., n, can also be evaluated recursively using a similar scheme. These recursions speed up the calculations by a significant amount. The periodogram is a plot of the ordinates p(λj) against λj/π, for j = 1, ..., n. Thus the X-axis ranges between 0 and 1.
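The recursive evaluation can be sketched in Python as follows; this is an illustration only, and the 2/T scaling follows the reconstruction above.

```python
import numpy as np

def periodogram(y):
    """Periodogram ordinates p(lambda_j) at lambda_j = 2*pi*j/T, with the
    cos/sin terms built recursively via the trigonometric addition rule."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    n = T // 2
    ords = []
    for j in range(1, n):
        cl, sl = np.cos(2 * np.pi * j / T), np.sin(2 * np.pi * j / T)
        c, s = np.empty(T), np.empty(T)
        c[0], s[0] = 1.0, 0.0                 # cos(0), sin(0)
        for t in range(T - 1):                # addition rule
            c[t + 1] = c[t] * cl - s[t] * sl
            s[t + 1] = c[t] * sl + s[t] * cl
        ords.append((2.0 / T) * (np.dot(y, c)**2 + np.dot(y, s)**2))
    return np.array(ords)
```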
The periodogram can also be written in terms of the sample autocovariances; that is,

$$p(\lambda_j) = 2\left\{c(0) + 2\sum_{\tau=1}^{T-1} c(\tau)\cos(\lambda_j\tau)\right\},$$

where $c(\tau) = \frac{1}{T}\sum_{t=\tau+1}^{T}(y_t - \bar y)(y_{t-\tau} - \bar y)$.
In the literature on the spectral analysis of stationary time series, the plot of p(λj) against λj is sometimes called the spectrogram, but the term periodogram is more widely used. For more information on this topic see Priestley (1981).
13.4 Spectrum
The periodogram is an inconsistent estimator of the spectrum; a consistent estimator of the spectrum is given by

$$\hat f(\lambda_j) = \sum_{i=-m}^{m} w_i\, p(\lambda_{j+i}),$$

where m is called the window length and the wj's are normalised weights which are constrained by w−j = wj, for j = 0, ..., m. The weights in STAMP are directly associated with the odd rows of the Pascal triangular scheme, for example,

$$1, \qquad 1\;\;2\;\;1, \qquad 1\;\;4\;\;6\;\;4\;\;1, \qquad \dots$$

and $w_j = c_j\,4^{1-m}$. Note that, for this Pascal scheme, the normalisation factor is based on $\sum_{j=-m}^{m} c_j = 4^{m-1}$, so that the weights sum to one. At the beginning and at the end of the estimated spectrum the weighting scheme is simply truncated and the normalisation factor is adjusted appropriately so that ∑j wj = 1.
13.5 CUSUM
The CUSUM graph is designed to assess whether the mean of a process yt changes over time; see Brown, Durbin and Evans (1975) and Harvey (1990, pp. 153--5). In STAMP the CUSUMs are defined with respect to the standardised series

$$z_t = \frac{y_t - \bar y}{\hat\sigma_y}, \qquad \mathrm{CUSUM}(t) = \sum_{j=1}^{t} z_j,$$

where $\bar y = \frac{1}{T}\sum_{j=1}^{T} y_j$ and $\hat\sigma_y = \left\{\frac{1}{T}\sum_{j=1}^{T}(y_j - \bar y)^2\right\}^{1/2}$. The motivation for the CUSUM is that if the mean changes through the standardised series, the series will have runs of positive or negative observations. Thus the CUSUM will drift up (or down), before going back to zero at CUSUM(T). A formal check on the CUSUM is based on assessing whether it crosses two `boundary' lines, given by
$$\pm\left(a\sqrt{T} + 2a\,\frac{t}{\sqrt{T}}\right), \qquad a = 0.850.$$

These lines are based on a significance level of 10%.
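A Python sketch of the CUSUM and its boundary lines follows; the value a = 0.850 for the 10% level is an assumption, following Brown, Durbin and Evans.

```python
import numpy as np

def cusum(y, a=0.850):
    """CUSUM of the standardised series with straight-line boundaries
    +/- a*(sqrt(T) + 2*t/sqrt(T))."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    z = (y - y.mean()) / y.std()
    cs = np.cumsum(z)
    t = np.arange(1, T + 1)
    upper = a * (np.sqrt(T) + 2.0 * t / np.sqrt(T))
    return cs, upper, -upper
```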
13.6 CUSUM of squares
A similar device to the CUSUM is available for detecting a change in variance. Attention is now focused on the squares of the data, so we work with

$$\mathrm{CUSUMSQ}(t) = \frac{\sum_{j=1}^{t} z_j^2}{\sum_{j=1}^{T} z_j^2}, \qquad t = 1, \dots, T.$$

As t increases, the CUSUMSQ increases. If the model is homoskedastic, the increase should be linear when plotted against t/T. Departures from the 45° line in the CUSUMSQ graph suggest a breakdown of this assumption.
13.7 Histogram
Let zt denote the standardised observations (yt-y)/σ̂ y, for t=1,...,T, where y and σ̂ y are defined in §13.5. The range of zt is divided into N intervals of length h with h defined in §13.8. Then the proportion of zt in each interval constitutes the histogram.
13.8 Density estimation
The density can be estimated as a smoothed function of the histogram using a normal or Gaussian kernel. This can then be summed (`integrated') to obtain the estimated cumulative distribution function (CDF).
Denote the actual density of Z at z by fz(z). A non-parametric estimate of the density is obtained from the sample by

$$\hat f_z(z) = \frac{1}{Th}\sum_{t=1}^{T} K\!\left(\frac{z - z_t}{h}\right),$$

where h is the window width or smoothing parameter, and K(·) is a kernel such that

$$K(x) \ge 0, \qquad \int_{-\infty}^{\infty} K(x)\,dx = 1.$$

STAMP sets $h = 1.06\,\hat\sigma_z/T^{0.2}$ ($= 1.06/T^{0.2}$, as σ̂z = 1) as a default and uses the standard normal density for K(·):

$$K(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\tfrac{1}{2}x^2\right).$$
We thank Professor Bernard Silverman for permission to use his algorithm for calculating f̂z(z). A general treatment of density function estimation can be found in Silverman (1986).
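A minimal Python version of this estimator follows (a direct O(T²) evaluation, not Silverman's algorithm as used in STAMP):

```python
import numpy as np

def kde(z, grid, h=None):
    """Gaussian-kernel density estimate at the points in `grid`.
    Default bandwidth h = 1.06*sigma/T^0.2."""
    z = np.asarray(z, dtype=float)
    grid = np.asarray(grid, dtype=float)
    T = len(z)
    if h is None:
        h = 1.06 * z.std() / T**0.2
    u = (grid[:, None] - z[None, :]) / h          # (z - z_t)/h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard normal kernel
    return K.sum(axis=1) / (T * h)
```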
13.9 Distribution function
The estimated CDF of Z can be derived from f̂z(z); this is shown with a standard normal CDF for comparison. When a single graph is selected, a non-parametric density estimation menu allows the choices determining h to be set; the menu does not appear when multiple graphs are shown on one screen.
13.10 Write report
When the `Write report' option is active in the Describe dialog, some descriptive statistics will be sent to the Results window. Which are selected to be reported will depend on the other options which are activated in the Describe dialog. This can be broken down into two types of output: measures of serial dependence and normality.
13.10.1 Serial correlation
The correlogram values rτ, for τ = 1, ..., p, are defined in §13.2. Their standard errors, under the null hypothesis of white noise, are equal to $1/\sqrt{T}$. The Box--Ljung portmanteau statistic is given by

$$Q(p) = T(T+2)\sum_{\tau=1}^{p} \frac{r_\tau^2}{T - \tau};$$

see Ljung and Box (1978). This is tested against a χq² distribution. When no model is fitted, the degrees of freedom q is set equal to p. However, when the Q-statistic is applied to the residuals of an estimated structural time series model, q is set equal to p+1 minus the number of parameters. Chapter 15 gives the details of how the Q-statistic is used in the context of fitting structural time series models. The Box--Ljung Q-statistic is a modification of the earlier Box and Pierce (1970) statistic, corrected to have better size behaviour in moderate samples. It is reported for a variety of values of p, along with its Prob. value computed using a χq² distribution.
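A Python sketch of the statistic, given precomputed autocorrelations:

```python
import numpy as np

def box_ljung(r, T, n_par=0):
    """Box--Ljung Q(p) = T(T+2) * sum_{tau=1..p} r_tau^2/(T-tau).
    Degrees of freedom: p when no model is fitted, else p+1-n_par."""
    r = np.asarray(r, dtype=float)
    p = len(r)
    tau = np.arange(1, p + 1)
    Q = T * (T + 2) * np.sum(r**2 / (T - tau))
    dof = p if n_par == 0 else p + 1 - n_par
    return Q, dof
```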
13.10.2 Normality test
Let μ and σ² denote the mean and variance of y, and write μi = E[(y − μ)^i], so that σ² = μ2. The skewness and kurtosis are defined as

$$\sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}.$$
Sample counterparts are defined by

$$\sqrt{b_1} = \frac{m_3}{m_2^{3/2}}, \qquad b_2 = \frac{m_4}{m_2^2}, \qquad m_i = \frac{1}{T}\sum_{t=1}^{T}(y_t - \bar y)^i.$$
A normal variate will have √β1 = 0 and β2 = 3. STAMP reports

$$s = \frac{T\left(\sqrt{b_1}\right)^2}{6}, \qquad k = \frac{T\left(b_2 - 3\right)^2}{24},$$
which are both tested against a χ1² distribution. The Bowman--Shenton test for normality is given by

$$N_{BS} = s + k = \frac{T\left(\sqrt{b_1}\right)^2}{6} + \frac{T\left(b_2 - 3\right)^2}{24}, \qquad (eq:13.23)$$
which is tested against a χ2² distribution. As Bowman and Shenton (1975) note, these tests are unsuitable unless used in very large samples. The statistics √b1 and b2 are not independently distributed in small samples, and the sample kurtosis especially approaches normality very slowly. Another normality test reported by STAMP is described in Doornik and Hansen (1994). It derives from Shenton and Bowman (1977), who give b2 (conditional on b2 > 1 + b1) a gamma distribution, and D'Agostino (1970), who approximates the distribution of √b1 by the Johnson Su system. Let s* and k* denote the transformed skewness and kurtosis, respectively, where the transformations create statistics which are much closer to standard normal. The alternative normality test is

$$N_{DH} = s^{*2} + k^{*2}, \qquad (eq:13.24)$$
which is also tested against a χ2² distribution. Table 13.1 compares (eq:13.24) with its asymptotic form (eq:13.23). It gives the rejection frequencies under the null of normality, using χ2² critical values. The experiments are based on 10,000 replications and common random numbers.
Table:13.1 Empirical size of normality tests
| nominal probabilities of NDH | nominal probabilities of NBS
T | 20% | 10% | 5% | 1% | 20% | 10% | 5% | 1%
50 | 0.1734 | 0.0869 | 0.0450 | 0.0113 | 0.0939 | 0.0547 | 0.0346 | 0.0175 |
100 | 0.1771 | 0.0922 | 0.0484 | 0.0111 | 0.1258 | 0.0637 | 0.0391 | 0.0183 |
150 | 0.1845 | 0.0937 | 0.0495 | 0.0131 | 0.1456 | 0.0703 | 0.0449 | 0.0188 |
250 | 0.1889 | 0.0948 | 0.0498 | 0.0133 | 0.1583 | 0.0788 | 0.0460 | 0.0180 |
STAMP reports the following statistics under this option:
mean | ȳ | maximum |
standard deviation | σ̂y = (m2)½ | skewness χ1² | s
skewness | √b1 | kurtosis χ1² | k
excess kurtosis | b2 − 3 | normality χ2² | NBS
minimum | | normality χ2² | NDH
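The moment-based statistics are easily computed; the sketch below covers √b1, b2 and the asymptotic tests (the transformed Doornik--Hansen statistics s* and k* are omitted, as their transformations are not reproduced here):

```python
import numpy as np

def normality_stats(y):
    """Sample skewness/kurtosis and the chi-squared statistics s, k
    and the Bowman--Shenton test N_BS."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    d = y - y.mean()
    m2, m3, m4 = (np.mean(d**i) for i in (2, 3, 4))
    sqrt_b1 = m3 / m2**1.5
    b2 = m4 / m2**2
    s = T * sqrt_b1**2 / 6.0          # skewness test, chi^2(1)
    k = T * (b2 - 3.0)**2 / 24.0      # kurtosis test, chi^2(1)
    return sqrt_b1, b2, s, k, s + k   # last entry is N_BS, chi^2(2)
```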
Chapter 14 Statistical Treatment of Models
To understand fully the output of STAMP it is useful to understand the way in which the models are handled statistically. The algorithms used to carry out the computations are based on the state space form. The interested reader may find explanations of the underlying ideas and proofs of the algorithms in Anderson and Moore (1979), Harvey (1989, Chs. 3 and 4), de Jong (1991) and Koopman (1993). This chapter simply sets out the algorithms and gives the exact technical details of implementation. We first formally define the models which were used in Chapters ??--??.
14.1 Model definitions
14.1.1 Univariate time series models
A univariate model may be written as

$$y_t = \mu_t + \gamma_t + \psi_t + \nu_t + \varepsilon_t, \qquad \varepsilon_t \sim NID(0, \sigma_\varepsilon^2), \qquad t = 1, \dots, T, \qquad (eq:14.1)$$

where μt is the trend, γt is the seasonal, ψt is the cycle, νt is a first-order autoregressive component and εt is the irregular. The model can be extended with two similar cycle components; see below.
The stochastic trend component is specified as

$$\mu_t = \mu_{t-1} + \beta_{t-1} + \eta_t, \qquad \eta_t \sim NID(0, \sigma_\eta^2),$$
$$\beta_t = \beta_{t-1} + \zeta_t, \qquad \zeta_t \sim NID(0, \sigma_\zeta^2),$$

where βt is the slope or gradient of the trend μt. The irregular εt, the level disturbance ηt and the slope disturbance ζt are mutually uncorrelated. The slope component βt can be excluded from the trend specification when appropriate. Some special trend specifications are listed in Table 14.1.
Table:14.1 Some special level and trend specifications
Level | σε | ση | |
constant term | * | 0 | |
local level (LL) | * | * | |
random walk (RW) | 0 | * | |
Trend | σε | ση | σζ |
deterministic | * | 0 | 0 |
LL with fixed slope | * | * | 0 |
RW with fixed drift | 0 | * | 0 |
local linear (LLT) | * | * | * |
smooth trend | * | 0 | * |
second differencing | 0 | 0 | * |
Hodrick-Prescott | * | 0 | .025 σε |
The seasonal component may have the dummy variable form or the trigonometric form. The number of seasons in a period (e.g. a year) is given by the integer s; the number of seasonal frequencies is [s/2], where [s/2] = s/2 when s is even and [s/2] = (s−1)/2 when s is odd. The seasonal dummy form is given by

$$\gamma_t = -\gamma_{t-1} - \dots - \gamma_{t-s+1} + \omega_t, \qquad \omega_t \sim NID(0, \sigma_\omega^2).$$
The trigonometric seasonal form is

$$\gamma_t = \sum_{j=1}^{[s/2]}\gamma_{j,t},$$

where each γj,t is generated by

$$\begin{bmatrix}\gamma_{j,t}\\ \gamma_{j,t}^{*}\end{bmatrix} = \begin{bmatrix}\cos\lambda_j & \sin\lambda_j\\ -\sin\lambda_j & \cos\lambda_j\end{bmatrix}\begin{bmatrix}\gamma_{j,t-1}\\ \gamma_{j,t-1}^{*}\end{bmatrix} + \begin{bmatrix}\omega_{j,t}\\ \omega_{j,t}^{*}\end{bmatrix}, \qquad j = 1, \dots, [s/2],$$

where λj = 2πj/s is the frequency, in radians, and the seasonal disturbances ωj,t and ωj,t* are two mutually uncorrelated NID disturbances with zero mean and common variance σω². For s even, the component at j = s/2 collapses to

$$\gamma_{j,t} = \gamma_{j,t-1}\cos\lambda_j + \omega_{j,t}.$$
The statistical specification of a cycle, ψt, is given by

$$\begin{bmatrix}\psi_t\\ \psi_t^{*}\end{bmatrix} = \rho_\psi\begin{bmatrix}\cos\lambda_c & \sin\lambda_c\\ -\sin\lambda_c & \cos\lambda_c\end{bmatrix}\begin{bmatrix}\psi_{t-1}\\ \psi_{t-1}^{*}\end{bmatrix} + \begin{bmatrix}\kappa_t\\ \kappa_t^{*}\end{bmatrix}, \qquad t = 1, \dots, T,$$

where ρψ, in the range 0 < ρψ ≤ 1, is a damping factor; λc is the frequency, in radians, in the range 0 ≤ λc ≤ π; and κt and κt* are two mutually uncorrelated NID disturbances with zero mean and common variance σκ². Note that the period of the cycle is equal to 2π/λc. Up to two additional cycles of this form may be incorporated in the model.
A first-order autoregressive, AR(1), process is given by
νt=ρν νt-1+ξt, ξt~NID(0,σξ 2), |
with ρν in the range 0<ρν <1. The AR(1) component is actually a limiting case of the stochastic cycle when λc is equal to 0 or π, though it is specified separately in STAMP, partly to avoid confounding and partly because it is not a limiting case in multivariate models.
Finally, the disturbances driving each of the components in the model are mutually uncorrelated.
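To make the component definitions concrete, the following Python sketch simulates a series from (eq:14.1) with a local linear trend, dummy seasonal and a single cycle. The parameter values are illustrative only.

```python
import numpy as np

def simulate_sts(T, s=4, sig_e=1.0, sig_eta=0.1, sig_zeta=0.01,
                 sig_omega=0.05, sig_kappa=0.1, rho=0.9,
                 lam_c=np.pi / 8, rng=None):
    """Simulate y_t = mu_t + gamma_t + psi_t + eps_t with a local linear
    trend, dummy seasonal and stochastic cycle."""
    rng = rng or np.random.default_rng(0)
    mu, beta = 0.0, 0.0
    gamma = np.zeros(s - 1)           # (gamma_t, ..., gamma_{t-s+2})
    psi = psis = 0.0
    c, sn = np.cos(lam_c), np.sin(lam_c)
    y = np.empty(T)
    for t in range(T):
        y[t] = mu + gamma[0] + psi + sig_e * rng.standard_normal()
        # trend: mu_{t+1} = mu_t + beta_t + eta, beta_{t+1} = beta_t + zeta
        mu, beta = (mu + beta + sig_eta * rng.standard_normal(),
                    beta + sig_zeta * rng.standard_normal())
        # dummy seasonal: gamma_{t+1} = -(sum of last s-1 seasonals) + omega
        g_new = -gamma.sum() + sig_omega * rng.standard_normal()
        gamma = np.concatenate(([g_new], gamma[:-1]))
        # cycle with damping rho and frequency lam_c
        psi, psis = (rho * (c * psi + sn * psis)
                     + sig_kappa * rng.standard_normal(),
                     rho * (-sn * psi + c * psis)
                     + sig_kappa * rng.standard_normal())
    return y
```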
14.1.2 Explanatory variables and interventions
A single equation model may include exogenous explanatory variables, lagged values of the dependent variable and intervention variables, as well as unobserved components such as trend, seasonal and cycle. Thus (eq:14.1) can be extended as
$$y_t = \mu_t + \gamma_t + \psi_t + \nu_t + \sum_{\tau=1}^{r}\varphi_\tau\, y_{t-\tau} + \sum_{i}\sum_{\tau=0}^{s}\delta_{i\tau}\, x_{i,t-\tau} + \sum_{j}\lambda_j\, w_{jt} + \varepsilon_t, \qquad t = 1, \dots, T, \qquad (eq:14.2)$$

where xit is an exogenous variable, wjt is an intervention (dummy) variable, and φτ, δiτ and λj are unknown parameters.
14.1.3 Multivariate time series models
Multivariate models have a similar form to univariate models, except that yt is now an N×1 vector of observations, which depends on unobserved components which are also vectors. Thus, in the special case of a multivariate local level model,

$$y_t = \mu_t + \varepsilon_t, \qquad \varepsilon_t \sim NID(0, \Sigma_\varepsilon),$$
$$\mu_t = \mu_{t-1} + \eta_t, \qquad \eta_t \sim NID(0, \Sigma_\eta), \qquad (eq:14.3)$$

where Σε and Ση are both N×N variance matrices, and ηt and εt are mutually uncorrelated in all time periods. The other disturbances in the general model (eq:14.1) similarly become vectors which have N×N variance matrices. Models of this kind are called seemingly unrelated time series equations (SUTSE).
In a homogeneous model the variance matrices are proportional. In the case of a multivariate local level model this implies Σ η =qΣ ε , where q is a non-negative scalar.
The other components are incorporated in a multivariate model in a similar manner, by letting the components and disturbances become N×1 vectors and the variances become N×N variance matrices. In the case of the cycle, the parameters ρψ and λc are the same for all series. As regards the disturbances,

$$\mathrm{Var}\begin{bmatrix}\kappa_t\\ \kappa_t^{*}\end{bmatrix} = I_2\otimes\Sigma_\kappa.$$

The specification of the cycle model can also be written as

$$\begin{bmatrix}\psi_t\\ \psi_t^{*}\end{bmatrix} = \left(\rho_\psi\begin{bmatrix}\cos\lambda_c & \sin\lambda_c\\ -\sin\lambda_c & \cos\lambda_c\end{bmatrix}\otimes I_N\right)\begin{bmatrix}\psi_{t-1}\\ \psi_{t-1}^{*}\end{bmatrix} + \begin{bmatrix}\kappa_t\\ \kappa_t^{*}\end{bmatrix},$$

where ψt and ψt* are N×1 vectors.
A stationary first-order vector autoregressive process may be included in a multivariate model as an alternative to, or even as well as, a cycle. Thus

$$\nu_t = \Phi\nu_{t-1} + \xi_t, \qquad \xi_t \sim NID(0, \Sigma_\xi),$$

where Φ is an N×N matrix of coefficients. The condition for νt to be stationary is that the roots of the matrix polynomial I − ΦL should all lie outside the unit circle.
14.1.4 Common factors
In a common factor model, some or all of the components are driven by disturbance vectors with less than N elements. In terms of a SUTSE model, the presence of common factors means that the variance matrices of the relevant disturbances are less than full rank.
Common factors may appear in all components, including the irregular. Common trends arise through common levels, common slopes or a combination of the two.
14.1.4.1 Common levels
Consider the local level model, (eq:14.3), but suppose that the rank of Ση is K < N. The model then contains K common levels or common trends and may be written as

$$y_t = \Theta\mu_t^{\dagger} + \mu_\theta + \varepsilon_t, \qquad \mu_t^{\dagger} = \mu_{t-1}^{\dagger} + \eta_t^{\dagger}, \qquad \eta_t^{\dagger} \sim NID(0, D_\eta), \qquad (eq:14.7)$$

where ηt† is a K×1 vector, Θ is an N×K matrix of standardised factor loadings, Dη is a diagonal matrix and μθ is an N×1 vector in which the first K elements are zeros and the last N−K elements are contained in a vector μ̄. The standardised factor loading matrix contains ones in the `diagonal' positions, i.e. θii = 1, i = 1, ..., K, while θij = 0 for j > i.

The model may be recast in the original SUTSE form (eq:14.3) by writing μt = Θμt† + μθ and noting that Ση = ΘDηΘ' is a singular matrix of rank K. When there are no common trends, so N = K, the factor loading matrix is the Cholesky decomposition of Ση. As regards μ̄, partition Θ = (Θ1', Θ2')', where Θ1 consists of the first K rows, and partition μt similarly. Then

$$\bar\mu = \mu_{2t} - \Theta_2\Theta_1^{-1}\mu_{1t}. \qquad (eq:14.8)$$

The vector μ̄ is estimated from the above equation using the estimated states at t = T. The computations are carried out by back-substitution, since the loading matrix is lower triangular.
14.1.4.2 Smooth trends with common slopes
Common slopes may be formulated along similar lines. We first consider the case where the variance matrix of the slope disturbances is of rank Kβ, but the variance matrix of the levels is null, so that the estimated trends are relatively smooth. The model is therefore

$$y_t = \mu_t + \varepsilon_t, \qquad \mu_t = \mu_{t-1} + \beta_{t-1}, \qquad \beta_t = \beta_{t-1} + \zeta_t, \qquad \zeta_t \sim NID(0, \Sigma_\zeta),$$

but, following (eq:14.7) above, it may be re-formulated as

$$y_t = \Theta\mu_t^{\dagger} + \mu_{\theta t} + \varepsilon_t, \qquad \mu_t^{\dagger} = \mu_{t-1}^{\dagger} + \beta_{t-1}^{\dagger}, \qquad \beta_t^{\dagger} = \beta_{t-1}^{\dagger} + \zeta_t^{\dagger}, \qquad \zeta_t^{\dagger} \sim NID(0, D_\zeta),$$

where Σζ = ΘDζΘ', and the first Kβ elements in μθt are zeros while the remainder are contained in a vector μ̄ + β̄t. The (N−Kβ)×1 vectors μ̄ and β̄ may both be calculated by expressions like (eq:14.8).
14.1.4.3 Common trends: level and slopes
A general multivariate local linear trend model, in which the level variance matrix is of rank Kμ while the slope variance matrix is of rank Kβ, may be written as

$$y_t = \mu_t + \varepsilon_t, \qquad \mu_t = \mu_{t-1} + \beta_{t-1} + \Theta_\mu\eta_t^{\dagger}, \qquad \beta_t = \Theta_\beta\beta_t^{\dagger} + \beta_\theta, \qquad \beta_t^{\dagger} = \beta_{t-1}^{\dagger} + \zeta_t^{\dagger},$$

with ηt† ~ NID(0, Dη) and ζt† ~ NID(0, Dζ), where the N×Kβ matrix Θβ is such that Σζ = ΘβDζΘβ', and βθ = (0', β̄')' with β̄ a vector of length N−Kβ.

If Kβ = 1, setting Θβ equal to a vector of ones and letting β̄ = 0 would imply that all series had the same underlying growth rate (when modelling in logs). This might be plausible even if there are no common levels. The implication is that the trends in the forecast function remain parallel; in other words, the long-run growth paths are the same. However, unless there are similar restrictions on the levels, the growth paths within the sample will not necessarily have kept together.
The following interpretation can be made if Kβ ≤ Kμ and the common stochastic slopes only affect the observations via the common stochastic trends:

$$y_t = \Theta_\mu\mu_t^{\dagger} + \mu_{\theta t} + \varepsilon_t, \qquad \mu_t^{\dagger} = \mu_{t-1}^{\dagger} + \Theta_{1\mu}^{-1}\Theta_{\mu\beta}\,\beta_{t-1}^{\dagger} + \eta_t^{\dagger}, \qquad \beta_t^{\dagger} = \beta_{t-1}^{\dagger} + \zeta_t^{\dagger},$$

where Θ1μ is the first Kμ rows of Θμ, Θμβ is the first Kμ rows of Θβ, and μθt = (0', tβ̄1', μ̄' + tβ̄2')', with β̄1 a vector of length Kμ−Kβ, β̄2 a vector of length N−Kμ and β̄ = (β̄1', β̄2')'. The fixed components satisfy, for any 1 ≤ t ≤ T,

$$\bar\mu = -\Theta_{2\mu}\Theta_{1\mu}^{-1}\mu_{1t} + \mu_{2t} - t\bar\beta_2 \qquad \text{and} \qquad \bar\beta = -\Theta_{2\beta}\Theta_{1\beta}^{-1}\beta_{1t} + \beta_{2t},$$

where Θ1β is the first Kβ rows of Θβ and Θ2β is the remaining rows.
At present STAMP does not offer the option of imposing the restrictions implied by the above formulation. However, if Kμ = Kβ = 1, the above restriction implies Θ2β = Θ2μ. More generally, for Kμ = Kβ we have

$$\mu_t^{\dagger} = \mu_{t-1}^{\dagger} + \Theta_{1\mu}^{-1}\Theta_{1\beta}\,\beta_{t-1}^{\dagger} + \eta_t^{\dagger}, \qquad \eta_t^{\dagger} \sim NID(0, D_\eta),$$

and setting Θ1β = Θ1μ implies the further restriction that the slopes in the common trends are mutually independent. Thus the common trends are fully independent in both level and slope disturbances. However, for this to be true for all rotations, the further condition Dζ = qζDη is needed. This means that Σζ = qζΣη, so the model is what we call trend homogeneous.

To summarise: trend homogeneity means that Kμ = Kβ, the slope and level disturbances enter through the same common trends, and, when there is more than one common trend, these trends are mutually independent. When the trend homogeneity constraint is applied, any restrictions made to Θ1μ are automatically carried over to Θ1β.
14.1.4.4 Common seasonals
Common factors in seasonality imply a reduction in the number of disturbances driving changes in the seasonal patterns. They do not imply any similarity in seasonal patterns unless the deterministic seasonal components outside the common seasonals are set to zero. Thus suppose, for simplicity, that the series contain only seasonals and irregular. Then

$$y_t = \Theta\gamma_t^{\dagger} + \gamma_{\theta t} + \varepsilon_t,$$

where the last N−K elements of γθt contain fixed seasonal effects.
14.1.4.5 Common cycles
A model with common trends and common cycles may be written as

$$y_t = \Theta_\mu\mu_t^{\dagger} + \mu_\theta + \Theta_\psi\psi_t^{\dagger} + \varepsilon_t,$$

where Θψ is N×Kψ. Since the expectation of a cycle is zero, it is not necessary to include a vector of constant terms corresponding to the vector μθ which is needed for common levels.

As with other components, the common cycle can be re-formulated as a SUTSE model in which the disturbance variance matrix is Σκ = ΘψDκΘψ'. This is only possible because ρψ and λc are the same in all the common cycles.
14.1.5 Explanatory variables
Explanatory variables and interventions may be included in multivariate models. Thus (eq:14.2) generalises to

$$y_t = \mu_t + \gamma_t + \psi_t + \nu_t + \sum_{\tau=1}^{r}\Phi_\tau\, y_{t-\tau} + \sum_{\tau=0}^{s}\delta_\tau\, x_{t-\tau} + \Lambda w_t + \varepsilon_t, \qquad t = 1, \dots, T,$$

where xt is a K×1 vector of explanatory variables and wt is a K*×1 vector of interventions. Elements in the parameter matrices Φτ, δτ and Λ may be specified to be zero, thereby excluding certain variables from particular equations. In addition, the unobserved components may be subject to common factor restrictions, as described in the previous subsection.
14.2 State space form
The statistical treatment of structural time series models is based on the state space form (SSF). We use a particular SSF which is close to the `state space model' of de Jong (1991). It joins a measurement equation

$$y_t = Z_t\alpha_t + X_t b + G_t u_t, \qquad t = 1, \dots, T, \qquad (eq:14.9)$$

with a transition equation, which allows the states αt to evolve according to a first-order vector autoregressive process. We write this as

$$\alpha_{t+1} = T_t\alpha_t + W_t b + H_t u_t. \qquad (eq:14.10)$$

The transition equation is initialised at the first time point by setting

$$\alpha_1 = W_0 b + H_0 u_0. \qquad (eq:14.11)$$

The model is completed by making assumptions about the error process ut and the vector of regressors b. We assume they are independently and normally distributed with

$$u_t \sim NID(0, \sigma^2 I), \qquad b = c + B\delta, \qquad \delta \sim N(0, \sigma^2\Lambda). \qquad (eq:14.12)$$
This formulation of the SSF is slightly different from that exploited in books such as Harvey (1989) and West and Harrison (1989). It is used to produce rather more elegant treatments of diffuse initial conditions and fixed effects; see de Jong (1991).
The state system matrices Zt and Tt are fixed matrices which contain known values. The regression system matrices Xt and Wt are always known. The error system matrices Gt and Ht are also sparse, but most of their non-zero values are unknown and are regarded as hyperparameters. The disturbance ut is transformed into noise for the SSF by the error system matrices Gt and Ht. All system matrices can be regarded as selection-type matrices.
At first sight, the role for b in the SSF looks quite complicated. However, the SSF is written in this way to allow a unified treatment of a variety of features for time series models. There are three particularly important ones on which to focus. First, the measurement equation allows b to influence the observations directly through the regressors Xt. In general Xt, like Gt and Ht, may be a sparse selection type matrix. Second, the states are directly affected by b through Wt. This feature is particularly useful for handling interventions like structural changes (changes in trend) and growth changes (changes in slope). Third, the prior distribution of the initial state vector is (partly) defined via b.
The random vector b allows the analysis of models where diffuse distributions are placed on the regressors and on the non-stationary components of the initial state. The vector b is a linear combination of the known constant vector c and the random vector δ. The distribution of δ is regarded as diffuse, i.e. the variance matrix Λ converges to infinity. Basically, this means that a proper prior distribution for the state cannot be given when components are non-stationary. For a technical discussion on these issues, we refer to de Jong and Chu-Chun-Lin (1994). The connection between diffuse initial conditions and marginal or restricted likelihoods is discussed in Tunnicliffe-Wilson (1989) and Shephard (1993b). The unconditional distributions for stationary components, such as cycles, are handled via the second part of (eq:14.11); that is, H0u0.
The general SSF can be simplified for structural time series models in three different ways.
- The constant vector c of the specification for b can be set to a zero vector, such that b = Bδ. The matrix B is a square selection matrix of zeros and ones. The vector δ can be partitioned so that the regression effects and the initial effects are separated and δ enters the equations of the SSF directly. This is achieved by δ = (δx', δi')' and B = (Bx, Bi), where the k×1 vector δx contains the regression effects, the d×1 vector δi contains the initial effects, and the dimensions of the matrices B, Bx and Bi are (k+d)×(k+d), (k+d)×k and (k+d)×d, respectively. This brings the specification for b to b = Bxδx + Biδi, and
$$X_t b = X_t^{*}\delta_x, \qquad X_t^{*} = X_tB_x, \qquad X_tB_i = 0,$$
$$W_t b = W_t^{*}\delta_x, \qquad W_t^{*} = W_tB_x, \qquad W_tB_i = 0,$$
$$W_0 b = W_0^{*}\delta_i, \qquad W_0^{*} = W_0B_i, \qquad W_0B_x = 0.$$
- For all models in STAMP, Gt and Ht are orthogonal; that is, Gt'Ht = 0, so the noise terms in the two equations, Gtut and Htut, are independently distributed. This feature of a model simplifies the statistical analysis of the SSF.
- All models in STAMP can be handled by a time-invariant SSF. This implies that the t subscripts of the state and error system matrices can be dropped, i.e. Zt = Z, Tt = T, Gt = G and Ht = H. A time-invariant SSF also simplifies the statistical analysis of the SSF.
14.2.1 Structural time series models in SSF
Structural time series models fit nicely in the state space form. However, some care should be taken with the initial state vector specification.
14.2.1.1 Univariate models in SSF
Some examples are given to give a flavour of how structural time series models are put in the SSF.
- Seasonal model. The first example of a structural time series model in SSF is a model with smooth trend and dummy stochastic seasonals with s = 4 (quarterly observations). The SSF representation is
$$y_t = (1\;\;0\;\;1\;\;0\;\;0)\,\alpha_t + (\sigma_\varepsilon\;\;0\;\;0)\,u_t,$$
$$\alpha_t = \begin{bmatrix}\mu_t\\ \beta_t\\ \gamma_{1,t}\\ \gamma_{2,t}\\ \gamma_{3,t}\end{bmatrix} = \begin{bmatrix}1&1&0&0&0\\ 0&1&0&0&0\\ 0&0&-1&-1&-1\\ 0&0&1&0&0\\ 0&0&0&1&0\end{bmatrix}\alpha_{t-1} + \begin{bmatrix}0&0&0\\ 0&\sigma_\zeta&0\\ 0&0&\sigma_\omega\\ 0&0&0\\ 0&0&0\end{bmatrix}u_t,$$
with vector ut = (εt, ζt, ωt)' and Bx = 0. This illustrates the fact that the matrices G and H play the role of selectors and scaling factors for the model, often being very sparse. The initial state vector is given by α1 = δi, such that B = Bi = I and H0 = 0.
- Cycle model. Another example is a univariate model which consists of a level and a cycle component. The SSF representation of this model is
$$y_t = (1\;\;1\;\;0)\,\alpha_t + (\sigma_\varepsilon\;\;0\;\;0\;\;0)\,u_t, \qquad \alpha_t = (\mu_t, \psi_t, \psi_t^{*})',$$
$$\alpha_t = \begin{bmatrix}1&0&0\\ 0&\rho\cos\lambda_c&\rho\sin\lambda_c\\ 0&-\rho\sin\lambda_c&\rho\cos\lambda_c\end{bmatrix}\alpha_{t-1} + \begin{bmatrix}0&\sigma_\eta&0&0\\ 0&0&\sigma_\kappa&0\\ 0&0&0&\sigma_\kappa\end{bmatrix}u_t,$$
with vector ut = (εt, ηt, κt, κt*)'. The unknown values in the transition matrix T are the damping factor ρ and the frequency λc. The initial state is given by
$$\alpha_1 = \begin{bmatrix}1\\ 0\\ 0\end{bmatrix}\delta_i + \begin{bmatrix}0&0\\ \sigma_\kappa(1-\rho^2)^{-1/2}&0\\ 0&\sigma_\kappa(1-\rho^2)^{-1/2}\end{bmatrix}u_0,$$
with vector u0 = (κ0, κ0*)'. Notice that δi is a scalar here.
- Explanatory variables. A univariate structural time series model may include explanatory variables. Consider a local linear trend model with two explanatory variables and a level intervention at some time point t = j. The SSF representation of this model is given by
$$y_t = (1\;\;0)\,\alpha_t + (x_{1,t}\;\;x_{2,t}\;\;0)\,\delta_x + (\sigma_\varepsilon\;\;0\;\;0)\,u_t,$$
$$\alpha_t = \begin{bmatrix}\mu_t\\ \beta_t\end{bmatrix} = \begin{bmatrix}1&1\\ 0&1\end{bmatrix}\alpha_{t-1} + \begin{bmatrix}0&0&w_t\\ 0&0&0\end{bmatrix}\delta_x + \begin{bmatrix}0&\sigma_\eta&0\\ 0&0&\sigma_\zeta\end{bmatrix}u_t,$$
with vector ut = (εt, ηt, ζt)'. The initial state vector is given by α1 = δi, such that B = I and H0 = 0. The series x1,t and x2,t are explanatory variables, and wt is a series of zeros except at time point t = j, where its value is unity.
14.2.1.2 Multivariate model in SSF
Any multivariate model in STAMP with common factors can be written in the SUTSE form with lower-rank variance matrices; see §14.1.4. The SUTSE form retains the sparse structure of the system matrices Z and T, which is important for computational issues related to the estimation of these models. Therefore, this general formulation is adopted to specify a multivariate model in the SSF. Since the SUTSE form is a minor generalisation of the univariate model, only one example is given. The multivariate model is a smooth trend plus cycle model, which can be represented in the SSF as

$$y_t = \begin{bmatrix}I_N&0&I_N&0\end{bmatrix}\alpha_t + \begin{bmatrix}\Gamma_\varepsilon&0&0&0\end{bmatrix}u_t,$$
$$\alpha_t = \begin{bmatrix}I_N&I_N&0&0\\ 0&I_N&0&0\\ 0&0&\rho\cos\lambda_c\,I_N&\rho\sin\lambda_c\,I_N\\ 0&0&-\rho\sin\lambda_c\,I_N&\rho\cos\lambda_c\,I_N\end{bmatrix}\alpha_{t-1} + \begin{bmatrix}0&0&0&0\\ 0&\Gamma_\zeta&0&0\\ 0&0&\Gamma_\kappa&0\\ 0&0&0&\Gamma_\kappa\end{bmatrix}u_t,$$

where the state vector αt = (μt', βt', ψt', ψt*')', the disturbance vector ut = (εt', ζt', κt', κt*')', and each variance matrix is decomposed as Σ = ΓΓ' with the appropriate subscript. The unknown values in the transition matrix T rely on the damping factor ρ and the frequency λc. The initial state is given by

$$\alpha_1 = \begin{bmatrix}I_{2N}\\ 0\\ 0\end{bmatrix}\delta_i + \begin{bmatrix}0&0\\ 0&0\\ (1-\rho^2)^{-1/2}\,\Gamma_\kappa&0\\ 0&(1-\rho^2)^{-1/2}\,\Gamma_\kappa\end{bmatrix}u_0,$$

with vector u0 = (κ0', κ0*')' and δi containing the initial effects for μt and βt.
14.2.1.3 Deterministic trend and seasonal components
The deterministic seasonal component is the special case of the stochastic seasonal model with σω =0. Alternatively, the fixed seasonal can be incorporated within Xt as it is usually done in regression models. The latter one is adopted in STAMP. In a similar way (partly) deterministic trends can be incorporated in Xt or Wt. However, for practical reasons such as the implementation of trend interventions within Wt, the trend components μt and βt are always part of the state vector. The inclusion of a fixed trend in the state vector has some consequences for calculation and interpretation of residuals and the prediction error variance.
14.3 Kalman filter
The Kalman filter (KF) plays the same role for time series models in SSF as least squares computations for a regression model. The KF is primarily a set of vector and matrix recursions. The importance of the KF is based on
- computation of one-step ahead predictions of observation and state
vectors, and the corresponding mean square errors;
- diagnostic checking by means of one-step
ahead prediction errors;
- computation of the likelihood function via the one-step ahead
prediction error decomposition;
- smoothing which uses the output of the KF.
The KF has a variety of forms, but the one exploited in STAMP computes

$$a_{t+1|t} = E(\alpha_{t+1}\,|\,Y_t, \delta = 0), \qquad \sigma^2 P_{t+1|t} = MSE(a_{t+1|t}),$$

which are referred to as the mean and mean square error (MSE) of the state, respectively, given the past information and setting δ to zero. Generally, we need to perform estimation with δ not zero, so these terms will be corrected by the use of an augmented Kalman filter. This is discussed in the next subsection.
The recursive equations of the Kalman filter are given by

$$v_t = y_t - Za_{t|t-1}, \qquad F_t = ZP_{t|t-1}Z' + GG', \qquad K_t = \left(TP_{t|t-1}Z' + HG'\right)F_t^{-1},$$
$$a_{t+1|t} = Ta_{t|t-1} + K_tv_t, \qquad P_{t+1|t} = TP_{t|t-1}(T - K_tZ)' + H(H - K_tG)', \qquad q_t = q_{t-1} + v_t'F_t^{-1}v_t,$$

for t = 1, ..., T, starting with a1|0 = W0b, P1|0 = H0H0' and q0 = 0. Here Kt is called the Kalman gain, while vt and σ²Ft are the one-step ahead prediction error (or innovation) and its mean square error, respectively. The scaled innovations $F_t^{-1/2}v_t$ (or generalised least squares residuals) are approximately NID with zero mean and identity variance matrix in a correctly specified model. Note that the estimate of σ² is given by σ̂² = qT/(NT). The proof of the KF is simple and relies on standard results of linear estimation. Although the KF is presented in terms of time-invariant system and error matrices (Z, T, G and H), time indices can be added to these matrices without any problem.
In the special case of a univariate model, N=1, the innovation vector and the corresponding mean square error matrix reduce to the scalars vt and σ2ft, respectively. Also, the Kalman gain matrix reduces to the vector kt. This notation is only used for univariate models.
The KF requires a positive Ft (for univariate models) or a non-singular positive definite matrix Ft (for multivariate models), which may not occur at the start of the KF when, for example, the column rank of G1 is less than N and P1|0 = H0H0' = 0. An example is when no irregular is present in a structural time series model. In these circumstances, the KF is warmed up by a number of prediction updates; that is, aj+1|0 = Taj|0 and Pj+1|0 = TPj|0T' + HH'. The initialisation process continues until Fi = ZPi|0Z' + GG' is positive definite. The KF is then re-initialised with a1|0 = ai|0, P1|0 = Pi|0 and q0 = 0, where i is the number of required prediction updates.
When Ft becomes non-positive definite during the KF updating, it is generally assumed that a numerical problem has occurred. Available remedies like a square-root version of the KF are not considered. However, the implementation of the KF in STAMP ensures for each time-point t that the MSE matrices Pt+1|t and Ft are symmetric, which is crucial in avoiding numerical problems. Also, avoiding unnecessary computations, such as adding zeros and multiplying zeros or unity values, helps in this respect.
In the case of a time-invariant SSF, the KF may reach a steady-state solution at some time s; that is, when Ps+1|s = Ps|s-1, for p < s ≤ T, where p is the dimension of the state vector. This also implies that Fj and Kj are time-invariant for j = s, ..., T. The KF then reduces to the equations for at+1|t, vt and qt, which are computationally very cheap. Finally, the steady state can be interpreted as the solution of the Riccati equation; see Anderson and Moore (1979).
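The following Python sketch implements the KF recursions above for a time-invariant SSF with δ = 0; it is an illustration only (no augmentation, and no warm-up for a singular Ft), not the STAMP implementation.

```python
import numpy as np

def kalman_filter(y, Z, Tm, G, H, a1, P1):
    """Time-invariant Kalman filter with delta = 0: returns innovations
    v_t, their MSEs F_t, q_T and sum log|F_t| for the prediction error
    decomposition."""
    n = len(y)
    a, P = a1.copy(), P1.copy()
    GG, HH, HG = G @ G.T, H @ H.T, H @ G.T
    v_out, F_out, q, logdetF = [], [], 0.0, 0.0
    for t in range(n):
        v = y[t] - Z @ a                       # one-step prediction error
        F = Z @ P @ Z.T + GG                   # its MSE (times sigma^2)
        Fi = np.linalg.inv(F)
        K = (Tm @ P @ Z.T + HG) @ Fi           # Kalman gain
        L = Tm - K @ Z
        a = Tm @ a + K @ v                     # a_{t+1|t}
        P = Tm @ P @ L.T + HH - HG @ K.T       # P_{t+1|t}
        P = 0.5 * (P + P.T)                    # keep the MSE symmetric
        q += float(v @ Fi @ v)
        logdetF += np.linalg.slogdet(F)[1]
        v_out.append(v); F_out.append(F)
    return np.array(v_out), np.array(F_out), q, logdetF
```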
14.3.1 The augmented Kalman filter
The KF evaluates the one-step ahead prediction of the state vector conditional on δ = 0. When non-stationary components and/or fixed regression effects are present in the model, the unknown random vector δ is treated as diffuse; see §14.2. In these circumstances an augmented Kalman filter (AKF) is applied:

$$V_t = X_tB + ZA_{t|t-1}, \qquad A_{t+1|t} = TA_{t|t-1} + W_tB - K_tV_t,$$
$$s_t = s_{t-1} + V_t'F_t^{-1}v_t, \qquad S_t = S_{t-1} + V_t'F_t^{-1}V_t,$$

for t = 1, ..., T, with A1|0 = W0B, s0 = 0 and S0 = 0. The AKF must be regarded as a supplement to the KF equations; the expressions for vt, Ft, qt, Kt, at+1|t and Pt+1|t remain as they are. The number of columns of Vt and At+1|t is k+d, the same number of columns as in matrix B of the SSF. The one-step ahead prediction of the state vector and the corresponding MSE matrix are given by

$$\hat a_{t+1|t} = a_{t+1|t} + A_{t+1|t}S_t^{-1}s_t, \qquad \hat P_{t+1|t} = P_{t+1|t} + A_{t+1|t}S_t^{-1}A_{t+1|t}', \qquad (eq:14.15)$$

respectively. The one-step ahead prediction error and the corresponding MSE matrix (or variance matrix) are given by

$$\hat v_t = v_t - V_tS_{t-1}^{-1}s_{t-1}, \qquad \hat F_t = F_t + V_tS_{t-1}^{-1}V_t', \qquad (eq:14.16)$$

respectively. The matrix inversions for St, t = 1, ..., T, can be evaluated recursively, using methods similar to recursive regressions; see Appendix 14.7 of this chapter. The estimate of the scalar variance σ² is

$$\hat\sigma^2 = \frac{q_T - s_T'S_T^{-1}s_T}{NT - k - d}. \qquad (eq:14.17)$$

The full-sample generalised least squares estimate of δ and its MSE matrix are given by

$$\hat\delta = S_T^{-1}s_T, \qquad MSE(\hat\delta) = \hat\sigma^2 S_T^{-1}, \qquad (eq:14.18)$$

respectively. The MSE matrix is used to obtain standard errors and t-statistics for δ̂. In the same way as the vector δ, δ̂ is partitioned as

$$\hat\delta = (\hat\delta_x', \hat\delta_i')'.$$
For a proof and a full exposition of the augmented Kalman filter, see de Jong (1991). When the SSF does not contain values for Xt and Wt, the KF can evaluate ât+1|t from t = d+1 onwards. This implies that the AKF is only used for the first d updates. Then the AKF is collapsed to the KF, mainly using equation (eq:14.15) for t = d. On the other hand, when Xt and Wt are present in the SSF, a partial collapse with respect to δi is possible. The exact details of the full and the partial collapse are given in Appendix 14.7 of this chapter. Formal proofs on collapsing are given by de Jong and Chu-Chun-Lin (1994).
14.3.2 The likelihood function
The likelihood function can be obtained from the KF by using the innovations and their mean square errors. This is known as the prediction error decomposition. For example, the likelihood function conditional on δ = 0 is defined, apart from a constant, as

$$\log L(\delta = 0) = -\frac{1}{2}\left\{NT\log\sigma^2 + \sum_{t=1}^{T}\log|F_t| + \frac{q_T}{\sigma^2}\right\}.$$

To limit the number of logarithmic operations in computing the likelihood, it is important to check whether a steady state has been reached, such that log|Ft| is constant for the remaining t.
The unknown parameters in the system matrices of the SSF are stacked in vector θ. For a given vector θ=θ*, the KF calculates the likelihood and, therefore, it provides means to obtain maximum likelihood estimates of the parameters. These matters are discussed in §14.6.
14.3.2.1 Univariate models: the concentrated likelihood
The diffuse likelihood function (DL) is defined as the likelihood when δ is taken to be diffuse; see §14.2. For a given vector θ = θ*, the DL is given by

$$\log DL = -\frac{1}{2}\left\{T^{*}\log\sigma^2 + \sum_{t=1}^{T}\log f_t + \log|S_T| + \frac{q_T - s_T'S_T^{-1}s_T}{\sigma^2}\right\},$$
with T* = T − k − d. When a (partial) collapse of the AKF has taken place at some time t = m, the DL is calculated as

$$\log DL = -\frac{1}{2}\left\{T^{*}\log\sigma^2 + \sum_{t=1}^{T}\log f_t + \log|S^{*}| + \frac{q_T}{\sigma^2}\right\}.$$
Appendix 14.7 of this chapter defines S* and gives more details of the likelihood calculation when a full or a partial collapse has taken place. The scalar variance σ² can be concentrated out of the DL. The resulting concentrated diffuse likelihood function (CDL) is given by

$$\log CDL = -\frac{1}{2}\left\{T^{*}\left(\log\hat\sigma^2 + 1\right) + \sum_{t=1}^{T}\log f_t + \log|S^{*}|\right\}.$$

To calculate the CDL, the variance corresponding to one particular unobserved component is set to unity, so that this variance equals σ² of (eq:14.12). The maximum likelihood estimate of this `concentrated out' variance is σ̂². The other variances of the components in the model are then measured as ratios to the concentrated variance. For example, in a local level model where σε² is concentrated out, we have ση² = σ²qη. These variance ratios are referred to as q-ratios. Parameter estimation is carried out with respect to the q-ratios; see §14.6. The CDL is used for parameter estimation of univariate models in STAMP. For a full discussion of diffuse likelihoods, we refer to de Jong (1988).
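As an illustration of concentrating σ² out of the likelihood, the sketch below evaluates a concentrated log-likelihood for the local level model as a function of the q-ratio qη and maximises it. It uses a crude large-variance initialisation instead of the exact augmented treatment, so it only approximates the CDL above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ll_concentrated(q, y):
    """Approximate concentrated log-likelihood of the local level model
    with q = sig_eta^2/sig_eps^2 (sig_eps^2 concentrated out)."""
    Tn = len(y)
    a, p = y[0], 1.0e7                  # crude diffuse initialisation
    ssq, logf = 0.0, 0.0
    for t in range(1, Tn):
        v, f = y[t] - a, p + 1.0
        a += (p / f) * v
        p += -p * p / f + q
        ssq += v * v / f
        logf += np.log(f)
    s2 = ssq / (Tn - 1)                 # MLE of sig_eps^2
    return -0.5 * ((Tn - 1) * (np.log(s2) + 1.0) + logf)

# maximise over q, parameterised as q = exp(2*theta) as in Section 14.6:
y = np.cumsum(np.random.default_rng(1).standard_normal(200))
res = minimize_scalar(lambda th: -ll_concentrated(np.exp(2 * th), y),
                      bounds=(-10.0, 5.0), method="bounded")
q_hat = np.exp(2 * res.x)
```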
14.3.2.2 Homogeneous models
The homogeneous model is introduced in §14.1; for a multivariate local level model the restriction reduces to

$$\Sigma_\eta = q_\eta\,\Sigma_\varepsilon,$$

where the q-ratio qη plays the same role as set out in the previous subsection for univariate models. The homogeneity restrictions imposed on multivariate models allow the decoupling of the KF; see Fernandez (1990). The N different time series in yt are subjected to the same KF, which corresponds to a univariate SSF where the matrices G and H contain the q-ratio values of the homogeneous model. So the covariance equations of the (univariate) KF, i.e. ft, kt and Pt+1|t, are the same for all N time series. The set of N innovation series, stacked in vt, is used to construct the diffuse likelihood. The full variance matrix of a particular component, say the irregular εt, that is Σ = Σε, can be concentrated out of the likelihood function, leading to

$$\log CDL = -\frac{1}{2}\left\{T^{*}\log|\hat{\Sigma}| + N\sum_{t=1}^{T}\log f_t + N\log|S^{*}| + NT^{*}\right\},$$

where

$$\hat{\Sigma} = \frac{1}{T^{*}}\sum_{t=k+d+1}^{T}\frac{\hat v_t\hat v_t'}{\hat f_t}.$$
The univariate AKF is used for homogeneous models and, if possible, a (partial) collapse is made at some time t = m. Therefore, the same technical details as mentioned in the previous subsection apply; these are set out in Appendix 14.7 of this chapter.
14.3.2.3 Multivariate models
The diffuse likelihood is used for parameter estimation of multivariate models. Thus, no parameter is concentrated out of the likelihood, and σ² is fixed at unity. The AKF is used for the calculation, but it can be (partially) collapsed at some time t = m. The multivariate diffuse likelihood is defined as

$$\log DL = -\frac{1}{2}\left\{\sum_{t=1}^{T}\log|F_t| + \log|S^{*}| + q_T\right\},$$

where S* is defined in Appendix 14.7 of this chapter.
14.3.3 Prediction error variance
The prediction error variance (PEV) of a univariate structural time series model is the variance of the one-step ahead prediction errors in the steady state. As discussed at the start of this section, the steady state may require more updates of the KF than the number of available observations. The formal definition of the estimated PEV is
PEV=σ̂2f, |
where f is the steady state value of ft; that is, f= lim t→∞ft. For a multivariate model, the prediction error variance matrix can be evaluated. When a steady state is not obtained, it may be useful to report a `finite' PEV; that is,
PEVn=σ̂2f̂n, |
where n is some large finite number. The `finite' PEV at the end of the sample; that is, at n=T, might also be of interest.
14.3.4 The final state and regression estimates
The final state vector âT|T contains, together with δ̂x, all the information from the estimated SSF required to forecast future observations. Therefore, the final state is of special interest. The estimate of the state vector and its MSE matrix at the end of the sample are given by

$$\hat a_{T|T} = E(\alpha_T\,|\,Y_T), \qquad \hat\sigma^2\hat P_{T|T} = MSE(\hat a_{T|T}),$$

respectively. The diagonal of the MSE matrix is used to form standard errors and t-statistics for the final state âT|T. Under certain circumstances, some of these diagonal elements may be equal to zero, in which case the t-statistic cannot be calculated. Also, for stationary parts of the state vector it is not useful to calculate the t-statistic, since stationary components are not persistent throughout the series.
The regression estimates are obtained from

$$\hat\delta_x = \mathbf{S}_T^{-1}\mathbf{s}_T, \qquad MSE(\hat\delta_x) = \hat\sigma^2\,\mathbf{S}_T^{-1},$$

where $\mathbf{S}_T$ and $\mathbf{s}_T$ are defined in Appendix 14.7 of this chapter. The diagonal of the MSE matrix is used to form standard errors and t-statistics.
14.3.5 Filtered components
The filtered components are based on the one-step ahead predictions of the state vector; see equation (eq:14.15). When k = 0, the AKF is applied until t = d. Then the AKF collapses to the KF, which evaluates the filtered states directly. However, when k > 0, a partial collapse takes place and filtering is based on (eq:14.15), with the matrix $\mathbf{S}_t^{-1}$ calculated recursively; see Appendix 14.7 of this chapter. Note that the filtered states are only defined for t = d+k+1, ..., T.
14.3.6 Residuals
The residuals in STAMP for univariate models are defined as the standardised innovations

$$\tilde v_t = \frac{\hat v_t}{\hat\sigma\,\hat f_t^{1/2}}, \qquad t = d+1, \dots, T, \qquad (eq:14.26)$$

where v̂t and σ̂²f̂t are defined as in (eq:14.16). When k = 0, the residuals are obtained from the KF directly after the collapse of the AKF at t = d. In a multivariate model the residuals of the j-th equation are given by

$$\tilde v_{j,t} = \frac{\hat v_{j,t}}{\hat\sigma\,\hat F_{jj,t}^{1/2}}, \qquad t = d+1, \dots, T,$$

where v̂j,t is the j-th element of the vector v̂t and F̂jj,t is the (j,j) element of the matrix F̂t.
When an SSF contains fixed effects, that is, when k > 0, two sets of residuals are distinguished inside STAMP. These residuals are defined in the next two subsections. Note that the two sets are the same when k = 0.
14.3.6.1 Generalised least squares residuals
The generalised least squares (GLS) residuals are standardised innovations computed using (eq:14.26), t = d+1, ..., T, but where v̂t and f̂t are obtained from the AKF (eq:14.15) applied to a SSF with Bxδx replaced by Bxδ̂x. The calculation of these residuals requires two steps. First, the estimates δ̂x and σ̂ are obtained from the AKF with d+k columns for At+1|t. Then a second AKF is applied, with d columns for At+1|t, from which v̂t and f̂t are obtained for t = d+1, ..., T. Of course, the first AKF may be partially collapsed after t = d, and the second AKF may be fully collapsed to a KF after t = d. The GLS residuals are not independently distributed with constant variance in small samples. Tests for normality, heteroskedasticity and serial correlation are applied to the GLS residuals.
14.3.6.2 Generalised recursive residuals
The generalised recursive residuals are the standardised residuals (eq:14.26), but now obtained from an AKF, with d+k columns for At+1|t, accompanied by a set of recursive regressions for the inversion calculations of St; see Appendix 14.7 of this chapter. A partial collapse of this AKF may take place at t = d. These residuals are denoted by wt, for t = d+k+1, ..., T, and they are commonly used in predictive tests and diagnostics.
14.4 Disturbance smoother
Smoothing refers to the estimation of the state vector αt and the disturbance vector ut using information in the whole sample rather than just past data. Smoothing is an important feature because it is the basis for
- signal extraction, detrending and seasonal adjustment;
- diagnostic checking for detecting and
distinguishing between outliers and structural changes using
auxiliary residuals;
- EM algorithm for initial estimation of parameters;
- calculation of the score, defined as the first derivative of the likelihood with respect to the vector of parameters.
We consider the disturbance smoother (DS) of Koopman (1993) which directly estimates the disturbance vector of the SSF, ut, and the MSE matrix; that is,
E(ut|YT,δ=0) = ũt and MSE(ũt) = σ2 Ct, |
respectively, with t=1,...,T. Note that in the context of structural time series models, the interest in ut is limited to elements of Gut and Hut and their corresponding MSE matrices.
The DS is based on the same backwards recursions as found by Bryson and Ho (1969), de Jong (1989) and Kohn and Ansley (1989), which are given by

$$e_t = F_t^{-1}v_t - K_t'r_t, \qquad D_t = F_t^{-1} + K_t'N_tK_t,$$
$$r_{t-1} = Z'e_t + T'r_t, \qquad N_{t-1} = Z'F_t^{-1}Z + L_t'N_tL_t,$$

where Lt = T − KtZ and t = T, ..., 1. These backwards recursions are initialised with rT = 0 and NT = 0. The storage requirements for the KF are limited to the quantities vt, Ft and Kt for t = 1, ..., T. The smoothed estimates of the disturbances of the measurement equation and the corresponding MSE matrices are

$$G\tilde u_t = GG'e_t, \qquad MSE(G\tilde u_t) = \sigma^2\left(GG' - GG'D_tGG'\right), \qquad (eq:14.27)$$

respectively. The smoothed estimates of the disturbances of the transition equation and the corresponding MSE matrices are

$$H\tilde u_t = HH'r_t, \qquad MSE(H\tilde u_t) = \sigma^2\left(HH' - HH'N_tHH'\right), \qquad (eq:14.28)$$

respectively. The proof of the disturbance smoother is given by Koopman (1993). The DS is the computationally most efficient smoother available.
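A Python sketch of the backward recursions follows (the δ = 0 case, with the MSE quantities Dt and Nt omitted for brevity); v, F and K are the stored Kalman filter outputs, for example from the filter sketch in §14.3.

```python
import numpy as np

def disturbance_smoother(v, F, K, Z, Tm, G, H):
    """Backward disturbance smoothing with delta = 0: returns the
    smoothed disturbances G*u_t, H*u_t and r_0 (for state smoothing)."""
    n = len(v)
    p = Tm.shape[0]
    GG, HH = G @ G.T, H @ H.T
    r = np.zeros(p)                     # r_T = 0
    Gu = np.empty((n, GG.shape[0]))
    Hu = np.empty((n, p))
    for t in range(n - 1, -1, -1):
        Fi = np.linalg.inv(F[t])
        e = Fi @ v[t] - K[t].T @ r      # e_t
        Gu[t] = GG @ e                  # smoothed measurement disturbance
        Hu[t] = HH @ r                  # smoothed transition disturbance
        r = Z.T @ e + Tm.T @ r          # r_{t-1} = Z'e_t + T'r_t
    return Gu, Hu, r
```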
14.4.1 The augmented disturbance smoother
In the same way as the KF, the DS must be augmented to allow for initial effects with diffuse distributions and for fixed components. The supplementary recursions for the DS are

$$E_t = F_t^{-1}V_t - K_t'R_t, \qquad R_{t-1} = Z'E_t + T'R_t,$$

and they are referred to as the augmented disturbance smoother (ADS). The number of columns of Et and Rt is equal to d+k. When the AKF is not collapsed at all, the full-sample estimates of the disturbance vectors E(Gut|YT), and the corresponding MSE matrices MSE(Gut|YT), are given by (eq:14.27) with et and Dt replaced by

$$\hat e_t = e_t - E_t\tilde\delta, \qquad \hat D_t = D_t - E_tS^{-1}E_t', \qquad (eq:14.29)$$

respectively, for t = 1, ..., T. In the same way, E(Hut|YT) and MSE(Hut|YT) are given by (eq:14.28), with rt and Nt replaced by

$$\hat r_t = r_t - R_t\tilde\delta, \qquad \hat N_t = N_t - R_tS^{-1}R_t', \qquad (eq:14.30)$$

respectively, for the transition disturbances. Note that δ̃ = S_T^{-1}s_T and S = S_T. For a proof of the ADS, the reader is referred to Koopman (1993). When the AKF is fully collapsed to the KF at t = m, the DS can be used for t = T, ..., m+1. The ADS should be applied for t = m, ..., 1, but when a collapse has taken place, the matrices S_T and R_m are not available. Therefore, smoothing is not straightforward when a collapse has occurred during the AKF. However, a suggestion of Chu-Chun-Lin and de Jong (1993) can be used to recover these missing matrices.
14.4.2 The EM algorithm and exact score
The EM algorithm is a recursive method to obtain maximum likelihood estimates for unknown elements in the state and error system matrices of the SSF; see Shumway and Stoffer (1982) and Watson and Engle (1983). This method requires a huge amount of computation and is not easy to implement. Koopman (1993) proposes a simple and computationally efficient EM algorithm for unknown values inside the variance matrices ΩG = GG' and ΩH = HH', where the error system matrices are assumed to be time-invariant. Let θ be the stack of these unknown values. Given a set of values for θ, say θ*, the EM step calculates new variance matrices via

$$\Omega_G^{new} = \Omega_G + \frac{1}{T}\,\Omega_G\Lambda_e\Omega_G, \qquad \Omega_H^{new} = \Omega_H + \frac{1}{T}\,\Omega_H\Lambda_r\Omega_H,$$

where

$$\Lambda_e = \sum_{t=1}^{T}\left(\hat e_t\hat e_t' - \hat D_t\right) \qquad \text{and} \qquad \Lambda_r = \sum_{t=1}^{T}\left(\hat r_t\hat r_t' - \hat N_t\right).$$

The quantities êt and D̂t are defined in (eq:14.29), and the quantities r̂t and N̂t are defined in (eq:14.30). The log-likelihood value of the SSF with the new variance matrices is always larger than the log-likelihood value of the SSF with the old variance matrices. This EM algorithm only requires one pass through the KF and one pass back through the disturbance smoother. Note that the EM maximises the diffuse log-likelihood (eq:14.21). For a proof and a discussion of this EM algorithm, the reader is referred to Koopman (1993).
The score vector is defined as the first derivative of the log-likelihood with respect to the vector of parameters θ. The vector of first derivatives is usually evaluated numerically. Koopman and Shephard (1992) present a method to calculate the exact score for any parameter in the state and error system matrices. However, the computational gain is only achieved for exact score calculations with respect to parameters in the error system matrices Gt = G and Ht = H. Again, let θ be the stack of unknown values in the variance matrices of the SSF, and let the vector θ* contain specific values for θ; then the exact score for the i-th element of θ is given by

$$\left.\frac{\partial\log L}{\partial\theta_i}\right|_{\theta=\theta^*} = \frac{1}{2}\,\mathrm{tr}\left\{\Lambda_e\frac{\partial\Omega_G}{\partial\theta_i} + \Lambda_r\frac{\partial\Omega_H}{\partial\theta_i}\right\},$$

and the definitions of the EM algorithm apply here as well.
14.4.3 Smoothed components
The smoothed state vector is defined as the estimate of the state vector using the full data-set; that is,

$$\tilde\alpha_t = E(\alpha_t\,|\,Y_T), \qquad t = 1, \dots, T. \qquad (eq:14.32)$$

The smoothed state vector forms the basis for detrending and seasonal adjustment procedures in the SSF. Graphical inspection of a subset or a linear combination of the smoothed state vector is an important feature of STAMP. The actual values for (eq:14.32) are obtained by the forward recursion

$$\tilde\alpha_{t+1} = T\tilde\alpha_t + H\tilde u_t, \qquad t = 1, \dots, T-1, \qquad (eq:14.33)$$

with α̃1 = â1|0 + P̂1|0(r0 + R0δ̃). Hence, the smoothed state vector is calculated using the forward recursion (eq:14.33) after the DS is applied and the vector Hũt = HH'rt is stored for t = 1, ..., T. The MSE matrix of the smoothed state vector is not considered; see de Jong (1989) and Kohn and Ansley (1989). It is shown by Koopman (1993) that this method of state smoothing is computationally very fast. Also note that no extra storage space is required, since the storage space for the Kalman gain matrices Kt can be used to store the vectors Hũt, t = 1, ..., T.
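The forward recursion (eq:14.33) then takes one line per time point. This sketch again treats the δ = 0 case, with r0 and Hu returned by the disturbance smoother sketch above:

```python
import numpy as np

def smooth_states(a1, P1, Tm, Hu, r0):
    """Fast state smoothing: alpha_1 = a_{1|0} + P_{1|0} r_0, then
    alpha_{t+1} = T alpha_t + (H u~)_t."""
    n = len(Hu)
    alpha = np.empty((n, len(a1)))
    alpha[0] = a1 + P1 @ r0
    for t in range(n - 1):
        alpha[t + 1] = Tm @ alpha[t] + Hu[t]
    return alpha
```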
14.4.4 Auxiliary residuals
The standardised smoothed estimates of the disturbances are referred to as auxiliary residuals. Graphs of these residuals, in conjunction with normality tests, are a valuable tool for detecting data irregularities such as outliers, level changes and slope changes. Note that the auxiliary residuals are serially correlated even when the model is correctly specified. Therefore, the normality test should be adjusted for the implied serial correlation. The technical details and some interesting applications can be found in Harvey and Koopman (1992).
The auxiliary residuals are obtained directly from (eq:14.27) and (eq:14.28). The residuals are standardised using the diagonal elements of the appropriate MSE matrix. The variance σ2 must be replaced by its estimate σ̂2.
14.5 Forecasting
An important feature of STAMP is its extensive forecast facility. The model-based extrapolations can be calculated for the actual series but also for any element of the state vector.
14.5.1 Forecasts of series and components
The forecast of the state vector l steps ahead, for l = 1, ..., L, and its MSE matrix are given by

$$\hat a_{T+l|T} = E(\alpha_{T+l}\,|\,Y_T), \qquad \sigma^2\hat P_{T+l|T} = MSE(\hat a_{T+l|T}),$$

respectively. The forecasts and MSE matrices are generated recursively by

$$\hat a_{T+l|T} = T\,\hat a_{T+l-1|T}, \qquad \hat P_{T+l|T} = T\hat P_{T+l-1|T}T' + HH',$$

respectively. The initialisation is obtained directly from the AKF at time T. The forecast of yT+l and the corresponding MSE matrix are given by

$$\hat y_{T+l|T} = Z\hat a_{T+l|T} + X_{T+l}B_x\hat\delta_x, \qquad \hat F_{T+l|T} = Z\hat P_{T+l|T}Z' + GG', \qquad (eq:14.35)$$

respectively. Graphical inspection of the forecast values can be valuable for checking model validity.
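A sketch of the forecast recursions (with the regression term XT+l Bx δ̂x omitted for simplicity):

```python
import numpy as np

def forecast(aT, PT, Z, Tm, G, H, L_steps):
    """l-step-ahead forecasts of state and series with their MSE
    matrices (times sigma^2), started from the filtered state at T."""
    a, P = aT.copy(), PT.copy()
    GG, HH = G @ G.T, H @ H.T
    yhat, Fhat = [], []
    for _ in range(L_steps):
        a = Tm @ a                      # a_{T+l|T}
        P = Tm @ P @ Tm.T + HH          # P_{T+l|T}
        yhat.append(Z @ a)              # y_{T+l|T}
        Fhat.append(Z @ P @ Z.T + GG)   # its MSE
    return np.array(yhat), np.array(Fhat)
```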
14.5.2 Extrapolative residuals
The extrapolative residuals can be calculated when future values of yt become available. They are defined as

$$\hat v_{T+l} = y_{T+l} - \hat y_{T+l|T}, \qquad l = 1, \dots, L,$$

where ŷT+l|T is defined in (eq:14.35). The variance matrices of the extrapolative residuals are equal to the MSE matrices of ŷT+l|T; that is, σ̂²F̂T+l|T as in (eq:14.35). The extrapolative residuals can be standardised using the corresponding diagonal elements of the variance matrix.
14.6 Parameter estimation
The model definitions are given in §14.1. It is shown in §14.2 how the models of STAMP can be put in SSF. The filtering and smoothing algorithms of §14.3 and §14.4, respectively, are associated with the SSF and can be applied conditional on known state and error system matrices. The unknown values inside these matrices are treated as parameters which need to be estimated. The main task of STAMP is to estimate these parameters using maximum likelihood estimation (MLE) methods. This subsection sets out the strategy for solving this problem.
14.6.1 The parameters of STAMP
The state and error system matrices of the SSF are time-invariant and contain many zero and unity values. Therefore, they can be regarded as selection matrices. For a univariate model, the unknown values in the error system matrices are the standard deviations of the unobserved components. The time-invariant transition matrix T of the SSF contains the unknown first-order autoregressive parameter of stationary components such as the cycle and (optionally) the slope. In the case of a cycle, the state system matrix T also contains cosine and sine terms which rely on the unknown frequency λ. Table 14.2 gives an overview of the set of possible parameters for a univariate model.
The variances are always positive, the cycle damping factors must be in the range 0<ρ<1, and the frequencies can only take values in the range 0<λ<π. Since the parameters are to be estimated via a numerical method, some special transformations are used to enforce these restrictions on the parameters; see Table 14.2. The set of transformed parameters, relevant for a specific model, are stacked in the vector θ.
As can be seen from Table 14.2, the variance of the cycle, σψ², is taken as the parameter instead of the variance of its disturbance, σκ². In this case, σκ² = σψ²(1 − ρψ²), so that σκ² → 0 as ρψ → 1. This allows the estimation of deterministic, but stationary, cycles, which arise in the extreme case ρψ = 1.
Table:14.2 Parameters and transformations

Variances | Parameter | Transformation |
Irregular | σε2 | exp(2θε) |
Level | ση2 | exp(2θη) |
Slope | σζ2 | exp(2θζ) |
Seasonal | σω2 | exp(2θω) |
Cycle | σψ2 | exp(2θψ) |
AR(1) | σξ2 | exp(2θξ) |

Autoregressive | Parameter | Transformation |
Slope | ρβ | |θβ|(1+θβ2)−1/2 |
Cycle | ρψ | |θρ|(1+θρ2)−1/2 |
AR(1) | ρν | θν(1+θν2)−1/2 |

Frequency | Parameter | Transformation |
Cycle | λ | 2π/{2+exp(θλ)} |
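The transformations in Table 14.2 map an unrestricted real θ onto the admissible parameter ranges. A minimal sketch (function names are illustrative, not STAMP's):

```python
import numpy as np

# Transformations of Table 14.2: each maps an unrestricted real theta
# onto the admissible range of the underlying parameter.
def variance(theta):
    return np.exp(2.0 * theta)                    # sigma^2 > 0

def damping(theta):
    return abs(theta) / np.sqrt(1.0 + theta**2)   # 0 < rho < 1

def frequency(theta):
    return 2.0 * np.pi / (2.0 + np.exp(theta))    # 0 < lambda < pi
```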
14.6.1.1 Variance matrices in a multivariate model
The same parameters exist for a multivariate model as for a univariate model, except that the variances are replaced by variance matrices; see §14.1. A special case is the vector autoregressive component, which is discussed in §14.6.1.2. Since all variance matrices are treated in the same way, we consider an N×N matrix Σ; that is, a non-negative definite symmetric matrix, not necessarily of full rank. A Cholesky-type decomposition of Σ enforces these restrictions. The variance matrix is decomposed as Σ=ΘDΘ', where the N×p matrix Θ is lower triangular with unity values on the leading diagonal, 1≤p≤N, and the p×p matrix D is a non-negative diagonal matrix. The lower-triangular elements of Θ, that is, θij with i=2,...,N and j=1,...,min(i−1,p), may take any value, including zero. The diagonal elements of D are transformed as
di = exp(2θii),
where di is the i-th diagonal element of D and i=1,...,p. Indeed, matrix Θ is the same factor loadings matrix as in the model specifications for common components; see §14.1.4. Since matrix Θ has p columns, this would suggest that variance matrix Σ has rank p. In most cases this will be true. However, as any diagonal element of D can be zero, the rank of Σ can be less than p. Note that when di=0, the i-th column of Θ does not affect variance matrix Σ and can be set to zero.
The variance matrices are placed into the SSF via the error system matrices. For example, the variance matrix of the irregular enters the SSF via G=(ΘD1/2, 0, ..., 0), where Θ and D correspond to the decomposition of the irregular variance matrix Σε and 0 denotes a zero matrix.
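For a full-rank Σ, the decomposition Σ=ΘDΘ' can be recovered from the standard Cholesky factor by rescaling its columns. A minimal sketch, assuming NumPy and ignoring the reduced-rank case (which requires a pivoted variant):

```python
import numpy as np

def theta_d_decomposition(Sigma):
    """Decompose a full-rank variance matrix as Sigma = Theta D Theta'."""
    C = np.linalg.cholesky(Sigma)   # Sigma = C C', C lower triangular
    d = np.diag(C)                  # positive diagonal of C
    Theta = C / d                   # divide column j by d[j]: unit diagonal
    D = np.diag(d**2)               # non-negative diagonal scaling matrix
    return Theta, D                 # Sigma == Theta @ D @ Theta.T
```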
14.6.1.2 Vector autoregressive components
The multivariate model may include a vector autoregression of order 1. The parameters inside the matrix Φ appear in the SSF transition matrix T. All roots of the matrix polynomial I−ΦL must lie outside the unit circle. Within a numerical optimisation procedure, the constraints on Φ needed to keep the process stationary are imposed by means of an algorithm given by Ansley and Kohn (1986). In the special case of a VAR(1) model, the transformations are:

- The parameters of the VAR component are put into the N×N unrestricted matrix A and the N×N positive definite symmetric matrix Γ.
- Cholesky decompositions are computed for the matrices I+AA'=BB' and Γ=LL'.
- The VAR matrices are defined by Φ=LA'(B')−1L−1 and Συ=Γ−ΦΓΦ'.

This method ensures a stationary VAR component. However, imposing further restrictions on the VAR matrices may be difficult.
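The mapping in the list above is easy to express in code. A minimal sketch, assuming NumPy; explicit inverses are kept for readability:

```python
import numpy as np

def var1_from_unrestricted(A, Gamma):
    """Map an unrestricted A and a positive definite Gamma to a
    stationary VAR(1) coefficient Phi and disturbance variance Sigma_v."""
    N = A.shape[0]
    B = np.linalg.cholesky(np.eye(N) + A @ A.T)   # I + AA' = BB'
    L = np.linalg.cholesky(Gamma)                 # Gamma = LL'
    Phi = L @ A.T @ np.linalg.inv(B.T) @ np.linalg.inv(L)
    Sigma_v = Gamma - Phi @ Gamma @ Phi.T         # stationary variance condition
    return Phi, Sigma_v
```

By construction the eigenvalues of Φ lie inside the unit circle, so Γ is the stationary variance of the component and Συ is positive definite.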
14.6.2 BFGS Estimation procedure*
The optimisation method is based on the BFGS quasi-Newton method and aims to maximise the likelihood criterion l(θ). At every call to the updating scheme, it delivers a new estimate θi+1, given a current estimate θi, which brings the objective function l(θ) closer to its maximum. The updating scheme can be written as

θi+1 = θi + siδi, (eq:14.37)
where si is a step size scalar and δi is the search direction vector. Appendix 14.8 of this chapter gives some basic principles on numerical optimisation and in particular the BFGS updating scheme. Some practical details on initialisation, updating, linear search and termination, are given below.
- At the start of the estimation process an initial parameter vector θ0 and an initial search direction δ0 must be given. The procedure for setting the initial parameter vector is different for univariate and multivariate models; see §14.6.3 and §14.6.4, respectively. The initial direction vector is calculated as δ0=Hd(θ0)q(θ0), (eq:14.38) where q(θ0) is the score vector and Hd(θ0) is the Hessian matrix restricted to its diagonal entries; the off-diagonal entries are zero. These values are evaluated numerically at the parameter vector θ0. The initial score vector is evaluated numerically. For multivariate models, the score vector is evaluated analytically in every subsequent update; see §14.4.2.
- The maximisation process is terminated when all three convergence criteria below hold, based on a small tolerance ε, or when the number of BFGS updates reaches the integer M, or when irregularities occur. In the latter case, the estimation procedure is stopped with an error code attached to it. The three convergence criteria are

1. Likelihood: crit1 = |l(θi)−l(θi+1)| / |l(θi)| < ε;
2. Gradient: crit2 = (1/m)∑j=1m |qj(θi)| < 10ε;
3. Parameter: crit3 = (1/m)∑j=1m |θi+1,j−θi,j| / |θi,j| < 100ε;

where θi,j denotes the j-th element of the parameter vector θi and qj(θi) denotes the j-th element of the score vector q(θi). When all three convergence criteria hold, the estimation procedure terminates and returns one of the following messages:

Very strong | crit1<ε | crit2<ε | crit3<ε |
Strong | crit1<ε | crit2<ε | crit3<10Nε |
Weak | crit1<ε | crit2<10Nε | crit3<10Nε |
Very weak | crit1<10Nε | crit2<10Nε | crit3<10Nε |

where N is the number of elements in the vector yt of the state space form; see §14.2. For univariate models, N is equal to unity. The default values are ε=10−7 and M=100.
- The direction vector δi is obtained via the BFGS scheme; see Appendix 14.8 of this chapter. To ensure stability in the optimisation routine, the BFGS updating of (eq:14.47) does not take place when d'g in (eq:14.47) is smaller than ε.
- The linear search strategy is as follows. The step length si is given by

si = minj {(c−|θi,j|) / |δi,j|},

where δi,j denotes the j-th element of the direction vector δi. If si<0.0001, si is set to 0.0001, and if si>0.9, si is set to 1. The constant c is set to 7 for parameters related to standard deviations, but c is set to 28 and 8 for parameters related to ρ and λ, respectively. The constants determine the parameter range for the optimisation procedure; that is, θi,j ∈ [−c,c]. Also, when the maximum value in the direction vector δi, δmax, exceeds the value 3, si is overwritten by min(1/δmax, si). Subsequently, if l(θi+1) is not larger than l(θi), the scalar si is divided by 2. This is repeated until si is smaller than ε. When si<ε, the Hessian matrix H(θi) is set equal to its diagonal matrix Hd(θi). This re-setting of the Hessian matrix is only allowed twice; at the third occasion, the estimation procedure is terminated with an error message. In most cases si equals unity. A sketch of this step-length rule is given after this list.
- When the likelihood function, the score or the diagonal elements of the Hessian matrix cannot be evaluated for some reason, the estimation procedure terminates with an error message.
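A minimal sketch of the step-length rule just described, assuming NumPy; c_j holds the range constant (7, 28 or 8) for each parameter, the halving loop against the likelihood is omitted, and a small guard avoids division by zero:

```python
import numpy as np

def initial_step(theta, delta, c_j, eps=1e-12):
    """Step length keeping each theta_j inside [-c_j, c_j] (a sketch)."""
    ratios = (c_j - np.abs(theta)) / (np.abs(delta) + eps)
    s = float(np.min(ratios))
    s = max(s, 0.0001)          # lower bound on the step
    if s > 0.9:
        s = 1.0                 # snap near-unit steps to unity
    d_max = float(np.max(delta))
    if d_max > 3.0:
        s = min(1.0 / d_max, s) # damp steps with very large direction elements
    return s
```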
More specific details of implementation for univariate models and multivariate models are given below.
14.6.3 Estimation of univariate models*
For univariate models, STAMP maximises the concentrated diffuse log-likelihood log Lc(y|θ) which is treated as an unconstrained non-linear function of θ, denoted by l(θ). The likelihood kernel l(θ) is appropriately scaled by a function of T, the number of observations.
14.6.3.1 Initial values
The elements of the vector of transformed parameters θ are given in Table 14.2. The elements related to standard deviations, θσ, are set equal to −0.5, except for θη (level), whose initial value is −1, while θζ (slope) and θω (seasonal) are started off at −1.5 and −2, respectively.
One element of θσ needs to be concentrated out, which effectively means it must be restricted to zero (that is, a unity standard deviation). The concentrated parameter must be chosen appropriately. At the start of the optimisation routine, the element corresponding to the irregular standard deviation is concentrated out. When no irregular is present in the model, we choose from the level, slope, cycle, seasonal and autoregressive components, consecutively.
The initial values for ρψ and λ, corresponding to a particular cycle, are provided by the user and STAMP transforms them automatically to θ0. The initial values for θβ and θυ are set to 2 (which corresponds to ρ values of approximately 0.9).
Before the quasi-Newton optimisation is started, a simple method is used to get θσ into the neighbourhood of the optimal solution in a stable but quick way. The initialisation method is similar to the quasi-Newton updating scheme (eq:14.37). The j-th element of the direction vector δi is given by

δi,j = Hj,j(θi)qj(θi),

where qj(θi) denotes the j-th element of the score vector q(θi) and Hj,j(θi) denotes the (j,j) element of the Hessian matrix H(θi). The score and the diagonal elements of the Hessian matrix are evaluated numerically at θi. The initialisation updating steps are repeated five times, but stopped earlier if the criterion

∑j δi,j qj(θi) < 0.01

holds.
During this simple estimation procedure, a choice is made of which parameter should be concentrated out of the log-likelihood. At every step i, the parameter related to the largest value in vector θi is concentrated out when its corresponding value in δi is larger than 4. This rule reflects the desire to concentrate the largest standard deviation out of the likelihood. When the choice of concentrated parameter is changed, the other standard deviations must be adjusted so that they have the correct ratios with respect to the new concentrated parameter. This might create instabilities in the search process. Therefore, the difference between the old and the new adjusted value is constrained to lie between −1 and 2. This strategy is the result of `fine tuning' after applying the method to a wide range of different models and data-sets.
14.6.3.2 Strategy for setting parameters fixed
During the estimation procedure, certain parameters might take values close to their boundaries. When a parameter moves closer and closer to its boundary, it may slow down the optimisation procedure as a whole. In these circumstances, it is advisable to fix such parameters at their boundary values but to allow estimation to continue for the remaining parameters. The strategy for setting parameters fixed is as follows. In the case of standard deviations, when the transformed parameter value is smaller than −5 and its gradient value is smaller than 0.0001 (in absolute terms), it is set to a constant implying a zero standard deviation.

Similarly, the damping factor ρ of a cycle is set to unity when the element corresponding to θρ in vector θi is larger than 25. If the value related to θλ (corresponding to the period of a cycle) is larger than 7 or less than −7, the program reduces the cycle component to an autoregressive component. In both cases, the corresponding gradient value must also be smaller than 0.0001 (in absolute terms). This strategy does not apply to the transformed parameters related to ρβ (slope) and ρυ (autoregressive).
14.6.3.3 Strategy for determining the concentrated parameter
The initial estimation procedure has determined a concentrated standard deviation from θσ . However, during the main estimation process, it might be more appropriate to concentrate out another parameter. When the largest transformed parameter value (corresponding to θσ ) exceeds the value 1.5, this particular parameter will be the new concentrated parameter. Then, the other parameters corresponding to θσ obtain values as ratios to the new concentrated parameter, the gradient vector q( θi) is re-calculated and the Hessian matrix is replaced by the diagonal matrix Hd( θi) .
14.6.4 Estimation of multivariate models*
The estimation procedure for multivariate models largely follows the same route as for univariate models, but there are some differences. The likelihood kernel l(θ) is the diffuse log-likelihood log L(y|θ), properly scaled by a function of T, so there is no need to find an appropriate concentrated parameter. The EM algorithm plays a prominent role in the procedure to obtain initial variance matrices for the disturbances; see §14.4.2. The strategy of setting elements to zero in the Cholesky decompositions of the variance matrices is set out below. The parameters of the VAR component cannot be fixed during the estimation procedure.
14.6.4.1 Initial values
The initial settings for the multivariate model are discussed in terms of the variance matrix, the Cholesky decomposition and the VAR coefficient matrix, rather than directly in terms of the parameter vector θ. The initialisation procedure automatically transforms the elements of these matrices into the vector θ0 .
The initial parameter settings of the variance matrices are similar to the univariate settings of the q-ratios, except that a given q-ratio now applies to all N diagonal entries of the corresponding D matrix. The factor loading matrix Θ is set to the identity matrix. At this stage, the rank conditions of the variance matrices are ignored. The VAR matrix is set equal to a diagonal matrix with all entries equal to 0.9. The extra initial parameters for the cycle component are set manually by the user.
The Kalman filter is applied to the multivariate model and the sample variance matrix of the innovations is calculated. The variance matrices of the disturbances are set equal to this matrix but properly scaled by their appropriate q-ratios.
Then, the EM algorithm is applied; see §14.4.2. The number of EM steps is set to five, which has proved appropriate in most cases. The VAR matrix and the other stationary parameters (ρ and λ) remain fixed during the EM steps. After the EM steps the parameters are transformed as indicated in §14.6.1. For example, the Cholesky decomposition is used to obtain the diagonal matrix D and the factor loading matrix Θ from each variance matrix. At this stage, the rank conditions are taken into account. Finally, the parameters are stacked into θ0.
14.6.4.2 Strategy for fixing parameters
Restricting parameters to their boundary values is important to speed up the estimation process. This is even more relevant for multivariate models, since estimation is likely to take much more time than for univariate models: filtering and smoothing involve matrices with larger dimensions and more parameters need to be estimated.

The diagonal entries of D are treated in the same way as the standard deviations in a univariate model. The restrictions for the damping factor and the frequency parameter are imposed in a similar fashion to univariate models. No restrictions can be placed on the parameters related to the VAR component. The factor loading matrix elements enter the parameter vector without transformation.
When the j-th diagonal element of D is restricted to zero, the rank of the variance matrix Σ is reduced by one. All elements of the corresponding j-th column of the factor loading matrix can also be restricted to zero since these elements don't have any effect on the variance matrix Σ. Any specific element of the factor loading matrix Θ is set to zero when its value is smaller than 0.0001 and its corresponding gradient is also smaller than 0.0001 (both in absolute terms).
14.7 Appendix 1: Diffuse distributions*
This appendix discusses technical details about the way we treat initial and regression effects for the SSF. Most components in STAMP are initialised using diffuse priors and so we must treat these priors carefully in our calculations to ensure stable and exact results. This is carried out using the idea of augmented filtering and smoothing. To ensure computational efficiency the Kalman filter (KF) needs to be collapsed at some point.
14.7.1 Collapse of the augmented KF
An SSF with k=0, that is, with Xt=0 and Wt=0 for t=1,...,T, is treated by the AKF for t=1,...,m, and the AKF is collapsed to the KF using (eq:14.15) at t=m. Thereafter, the KF evaluates ât+1|t and P̂t+1|t directly for t=m,...,T. The collapse also contains the operation

q̂m = qm − sm'Sm−1sm.

Indeed, the KF also evaluates the residual vector v̂t and its (unscaled) variance matrix F̂t directly. The value for m is restricted by T≥m≥d and |Sm|>0, but usually it is set to m=d.
For a SSF with k>0, a full collapse cannot be made since Xt and Wt are time-varying. However, a partial collapse can take place with respect to δi. This means that the AKF is still applied but with a reduced number of columns. Therefore, we partition the AKF matrices
|
|
When Sii,t is invertible at t=m, the AKF can be partially collapsed to
|
and
|
Note that matrix B of the (partially collapsed) AKF is replaced by B=Bx. The matrices At+1|t, Vt=−ZtAt+1|t−XtB and B of the partially collapsed AKF have k columns.
14.7.2 Likelihood calculation
At the end of a fully collapsed AKF, the estimate of σ2 is given by σ̂2=q̂T/(NT−d−k) where k=0 and q̂T is obtained from the collapsed KF. When k>0, q̂T is given by

q̂T = qT − sT'ST−1sT.

Likelihood calculation also requires the log value of the term |S*| where S*=ST. If k=0 and a full collapse has taken place at t=m, then |S*|=|Sm|. If k>0 and a partial collapse has taken place at t=m, then |S*|=|Sii,m||ST|, where ST is taken from the partially collapsed AKF. Note that the determinants are obtained without any extra computational cost since the inverses of these matrices are calculated earlier.
14.7.3 Recursive regressions
The recursive evaluation of least squares estimates is common practice in standard regression analysis since it is a way to get the recursive residuals. The same technique can be applied to the recursive evaluation of St which consists of the equations
|
|
where ct=St−1st and Ct=St−1, with St−1 denoting the inverse of St.
14.8 Appendix 2: Numerical optimisation*
Numerical optimisation methods are used to obtain the maximum likelihood estimates for the parameter vector θ. Here we outline some of the basic principles behind the estimation procedure of STAMP.
There is a vast literature on non-linear optimisation techniques; see, among many others, Gill, Murray and Wright (1981), Cramer (1986) and Thisted (1988). Note that many texts on optimisation focus on minimisation rather than maximisation, but of course max l(θ) = −min{−l(θ)}.
A first approach to obtaining the MLE θ from l(θ) is to consider solving the score equations, assuming the relevant partial derivatives exist:

q(θ) = ∂l(θ) / ∂θ.

Then q(θ)=0 defines the necessary conditions for a local maximum of l(θ) at θ. A sufficient condition is that the Hessian matrix

Q(θ) = ∂2l(θ) / ∂θ∂θ'

also exists and is negative definite at θ. If Q(.) is negative definite for all parameter values, the likelihood is concave, and hence has a unique maximum; if not, there could be local optima or singularities.
14.8.1 Newton type methods
When q(θ) is a set of equations that are linear in θ, q(θ)=0 can be solved explicitly for θ. More generally, q(θ) is non-linear, and locating a root is no more tractable than maximising l(θ) directly. Thus, we consider iterative approaches in which a sequence of values θi, i=0,1,...,M, is obtained from an initial value θ0, approximately solving q(θi)=0 and corresponding to non-decreasing values of the criterion function: l(θi+1)≥l(θi). Here M is a terminal maximum number of steps. A convergence criterion is used to terminate the iteration, such as q(θi+1)≈0 or |l(θi+1)−l(θi)|≤ε. These implementation-specific aspects are discussed in §14.6.
Expand q(θ)=0 in a first-order Taylor series around θi:

0 = q(θ) ≈ q(θi) + Q(θi)(θ−θi),

written as the iterative rule:

θi+1 = θi + H(θi)q(θi),

where H(.)=−Q−1(.) is minus the inverse Hessian matrix. The gradient, q(θi), determines the direction of the step to be taken, and H(θi) modifies the size of the step, which determines the metric: this algorithm is the Newton-Raphson technique, or Newton's method. Even when the direction is uphill, it is possible to overstep the maximum in that direction. In that case it is essential to add a line search to determine a step length si∈[0,1]:
θi+1 = θi + siδi,

where δi=H(θi)q(θi) and si is chosen to ensure that l(θi+1) ≥ l(θi).
STAMP uses the so-called variable metric, or Broyden-Fletcher-Goldfarb-Shanno (BFGS), quasi-Newton updating method. This approximates H(θi) by a symmetric positive definite matrix Ki which is updated at every iteration and converges on H(.).
Let d=siδi=θi−θi−1 and g=q(θi)−q(θi−1); then the BFGS update is:

Ki+1 = Ki + (1 + g'Kig / d'g)(dd' / d'g) − (dg'Ki + Kigd') / d'g. (eq:14.47)
This satisfies the quasi-Newton condition Ki+1g=d, and possesses the properties of hereditary symmetry (Ki+1 is symmetric if Ki is), hereditary positive definiteness, and super-linear convergence.
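The update is a rank-two correction and is straightforward to code. A minimal sketch, assuming NumPy:

```python
import numpy as np

def bfgs_update(K, d, g):
    """Rank-two BFGS update of K_i, with d = theta_i - theta_{i-1} and
    g = q(theta_i) - q(theta_{i-1}); satisfies K_{i+1} @ g == d."""
    dg = d @ g                        # the scalar d'g
    Kg = K @ g
    return (K + (1.0 + (g @ Kg) / dg) * np.outer(d, d) / dg
              - (np.outer(d, Kg) + np.outer(Kg, d)) / dg)
```

As noted in §14.6.2, the update is skipped when d'g is too small, which also protects the divisions by d'g here.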
14.8.2 Numerical score vector and diagonal Hessian matrix
When the analytic formula for q(θ) cannot be obtained easily, STAMP uses BFGS with a numerical approximation to q(.) based on finite difference approximations. The numerical derivatives are calculated using:

qj(θ) ≈ { l(θ+ειj) − l(θ−ειj) } / 2ε,

where ιj is a unit vector (for example, (0 1 0 ... 0)' for j=2) and ε is a suitably chosen step length. Thus, ε represents a compromise between round-off error (cancellation of leading digits when subtracting nearly equal numbers) and truncation error (ignoring terms of higher order than ε in the approximation).
The diagonal entries of the Hessian matrix are numerically evaluated as

Hj,jd(θ) = −1 / Qj,j(θ),

where

Qj,j(θ) = ∂2l(θ)/∂θj2 ≈ { l(θ+ειj) − 2l(θ) + l(θ−ειj) } / ε2,

with the same step-length ε.
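A minimal sketch of these finite-difference evaluations, assuming the likelihood kernel is available as a Python callable ell (an illustrative name) and theta is a NumPy vector:

```python
import numpy as np

def numerical_score(ell, theta, eps=1e-5):
    """Central-difference approximation to the score vector q(theta)."""
    q = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = eps
        q[j] = (ell(theta + e) - ell(theta - e)) / (2.0 * eps)
    return q

def diagonal_hessian(ell, theta, eps=1e-5):
    """Second-difference approximation to the diagonal entries Q_jj."""
    Q = np.zeros_like(theta)
    f0 = ell(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = eps
        Q[j] = (ell(theta + e) - 2.0 * f0 + ell(theta - e)) / eps**2
    return Q   # the diagonal matrix H_d has entries -1/Q[j]
```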
Chapter 15 Model Output
This chapter defines the output of STAMP. The first section explains when and where the various parts of the output appear. The remaining sections explain how the output is presented for every option of the Test menu.
15.1 Output from STAMP
This section details the output which occurs when a model is fitted to data. There are three different forms of output. Firstly, the iterations of the numerical optimisation algorithm are displayed in the Message window. Secondly, once the estimation algorithm has stopped (hopefully having converged!), the Results window reports details of the model specification and the results of the optimisation algorithm, and gives a summary of diagnostic statistics. Finally, the Test menu can be used to prompt STAMP to output specific details of the parameter estimates, state variables, model fit and diagnostics.
15.1.1 Model estimation
In order to compute the maximum likelihood estimates, STAMP first computes initial estimates and then it moves on to maximise the likelihood function directly by numerical optimisation; see Chapter 14. During this second phase various statistics are reported in the Message window:
- Parameters - the current values of the transformed parameters;
- Gradients - the derivatives of the likelihood function with respect to the transformed parameters;
- Function value - the kernel of the log-likelihood function;
- Initial value - the value of the function after the initial phase;
- %change - the percentage increase of the current likelihood kernel over the initial value.
15.1.2 Selected model and estimation output
The initial output, written to the Results window, gives details of the selected model and various results from the numerical optimisation procedure; an example is reproduced in §??. In particular, the selected components are listed and the number of parameters is given. The sample period used for parameter estimation is also given.
A variety of statistics emerge from the numerical optimisation. The likelihood kernel evaluated at θ̂ is reported together with the number of iterations required for convergence. Three different convergence measures are reported and these are categorised by some term like STRONG CONVERGENCE; see §14.6.
15.1.3 Summary statistics
An example of the diagnostic summary report, which is printed to the Results window after fitting, is given in §??. Broadly this summary gives the basic diagnostics and goodness-of-fit statistics. The precise definitions are given in §15.4 and §15.7.
- Log-likelihood - the actual log-likelihood function at its maximum;
- Prediction error variance (PEV) - the variance, or variance matrix, of the one-step ahead prediction errors in the steady state, or the finite PEV if accompanied by a message of `No steady state';
- Std. Error - the equation standard error, the square root of the PEV;
- Normality - the Doornik--Hansen statistic, which is the Bowman--Shenton statistic with the correction of Doornik and Hansen (1994), distributed approximately as χ22 under the null hypothesis;
- H(h) - a test for heteroskedasticity, distributed approximately as F(h,h);
- r(τ) - the residual autocorrelation at lag τ, distributed approximately as N(0,1/T);
- DW - the Durbin-Watson statistic, distributed approximately as N(2,4/T);
- Q(P,d) - the Box--Ljung Q-statistic based on the first P residual autocorrelations, distributed approximately as χd2;
- R2, RD2 or RS2 - the most suitable coefficient of determination.
Notice that the loss in the degrees of freedom d of the Box--Ljung Q-statistic takes account of the number of relative parameters in θ. It does not take account of any lagged dependent variables and their presence necessitates further adjustment.
15.1.4 The sample period
The parameters are estimated using the model sample as recorded from the Estimation dialog in the Model menu. Diagnostics and goodness-of-fit measures are calculated using the state sample. By default the model sample and the state sample are the same. However, the state sample may be changed to check sub-samples in isolation from the rest of the sample. This can be done in the Final state option. Of course, both samples must lie within the boundaries of the data sample.
15.1.5 Test
The nine items of the Test menu are self-explanatory. The headings of the following sections correspond to them. Note that the output for Goodness of Fit is requested by the Final state option and that many of the conventional time series diagnostics (built out of the innovations, or one-step ahead prediction errors) will come up in the Residual option.
15.2 Parameters
The variance parameters (and variance matrix parameters for multivariate models), together with frequencies and damping factors for cycles and coefficients for autoregressive components are referred to as parameters. The maximum likelihood estimates of the parameters can be requested from the Further output dialog. In a multivariate model, the same output is produced for each equation in turn. In addition, the full variance matrices and their decomposition in factor loading matrices and diagonal scaling matrices can be produced. This option also outputs the transformed parameters with their scores and approximate standard errors.
15.2.1 Variances and standard deviations
This gives the variances of the disturbances driving the various components. Thus, in terms of §14.1.1, the level variance is ση 2, the slope variance is σζ 2 and so on. The standard deviations are the square roots of the variances. The q-ratios are the ratios of each variance or standard deviation to the largest.
For multivariate models, the variances and standard deviations are obtained from the diagonal entries of the variance matrices. To make the q-ratios comparable across equations, the largest variance across all series is identified, and this component is set to one for all series, irrespective of whether or not it is the largest in a particular series. However, when this variance is zero in a particular series, all associated q-ratios are set to zero. This also applies to q-ratios for standard deviations.
15.2.2 Cycle and AR(1)
For a stochastic cycle component, this option outputs the damping factor ρ, the frequency λ, the period 2π/λ, and the period in terms of `years'; that is, 2π/sλ, where s is the number of `seasons'. In addition it gives the variance of the cycle, σψ 2=σκ 2/(1-ρ2). For a multivariate model, the variance matrix of the cycle is given. This is
Σψ = (1−ρ2)−1Σκ,
where Σκ is the variance matrix of the cycle disturbances.
For the AR(1) component in a univariate model, the autoregressive coefficient is reported. For a multivariate model the N×N matrix of coefficients is presented. The variance matrix of the disturbance vector ξt is produced.
15.2.3 Variance matrices (multivariate models)
The N×N variance matrices of the disturbances correspond to the variance parameters in a single equation model; see §14.1.3. In the STAMP output, the variance matrix is given by the lower triangular part of an N×N matrix, with the variances corresponding to the disturbance terms in each series given on the main diagonal (these are also produced by `Variances' above). The upper triangular part of the matrix gives the off-diagonal entries in the corresponding correlation matrix.
In a common factor model, some, or all, variance matrices will be of reduced rank; see §14.1.4. This will be reflected by values of unity in specific entries of the correlation matrix.
15.2.4 Factor loading matrices (multivariate models)
The N×K factor loading matrix Θ is defined in §14.1.4. When no common factor restrictions are imposed, so that N=K, the factor loading matrix is the lower triangular matrix associated with the Cholesky decomposition of variance matrix Σ; that is, Σ=ΘDΘ'. This option outputs the lower triangular factor matrix Θ and the diagonal scaling matrix D.
The vectors of constants, such as those for the level and slope components, are defined in §14.1.4 and only exist when common factors are specified in the model. The dimension of these vectors is N−K. The constant vector does not apply to the irregular component.
15.2.5 Transformed parameters and standard errors
The vector of estimated parameters, which are transformed as in §14.6, can be written to the Results window using this option. It also reports the score vector, associated with the estimated vector of parameters, and the numerical standard errors of the transformed parameters.
The scores are calculated as described in §14.6. The standard errors are obtained from the diagonal entries of the numerically evaluated Hessian matrix Q. Formally, the standard error of the i-th estimated parameter is defined as

sei = 1 / √(−Qii),

where Qii is the i-th diagonal entry of matrix Q.
15.3 Final state
The estimate of the final state vector contains all the latest information on the components in the model. Together with the regression estimates of the explanatory variables and the interventions, the final state contains all the information needed to make forecasts.
The Final state dialog also provides useful information about the estimated seasonal pattern and the amplitude of the cycle. Some goodness-of-fit statistics can be generated, but these are discussed in §15.4.
15.3.1 Analysis of state
In terms of §14.3.4, the final state is the filtered estimate at time T; that is, âT|T. The square roots of the diagonal elements of the corresponding MSE matrix, σ̂2P̂T|T, are the root mean square errors (RMSEs) of the corresponding state elements. The t-value is the estimate of the state divided by the RMSE. Applied to the i-th element of the state, this is
t-valuei = ai / RMSEi,

where ai is the i-th element of vector âT|T.
The Prob.value shows the probability of the absolute value of a standard normal variable exceeding the absolute value of the t-value. In a hypothesis testing situation, a Prob.value of less than 0.05 corresponds to rejection of the hypothesis of a zero value on a two-sided test at the 5% level of significance. A Prob.value of less than 0.05 is indicated by *, while a value less than 0.01 is denoted by **.

The various unobserved components which make up the structural time series model all appear in the final state.
- Trend - `Lvl' is the estimate of the level of the trend μT, while `Slp' is the estimate of the slope βT. The t-value might not appear for the level when no irregular is specified in the model or when the irregular variance is estimated as zero.
- Dummy variable seasonal - the estimate of the seasonal, γT, is part of the final state, together with the previous s−2 seasonals. These estimated elements of the state are denoted in the output by `Sea_1',...,`Sea_s-1'. Since the summing-to-zero constraint is used, the latest seasonal pattern can be obtained. It is given explicitly by the program if `Seasonal tests' is active.
- Trigonometric seasonal - the final state contains the estimates for γ1,T, γ1,T*, γ2,T, γ2,T*, etc., denoted in the output by `Sea_1', `Sea_2' and so on. Note that the meaning of `Sea_1', `Sea_2', etc., is different from the dummy variable seasonal case, so the final estimated seasonal pattern cannot be identified directly. It must be requested in `Seasonal tests'.
- Cycle - the terms `Cyk_1' and `Cyk_2' give the estimates of ψT and ψT* for cycle k, where k is 1, 2 and/or 3. Since a cycle component is not persistent throughout the series, a t-value is not appropriate here.
- Autoregressive - the estimate of υT is denoted as `Ar1'.
The same output appears for multivariate models, but it is repeated for each individual series.
15.3.2 Regression analysis
The estimates of the regression parameters, the parameters of the intervention variables and the fixed seasonal dummies are based on the matrix (sT,σ̂2ST); see §14.3.4. These estimates are reported upon request (if part of the specified model) and they may be interpreted in much the same way as in a standard regression model. Because they are deterministic (not time-varying), their RMSEs are standard deviations. The t-values would have t-distributions if the (relative) parameters were known. Although they are asymptotically normal, a t-distribution may provide a better approximation to small sample properties. As was noted earlier, the Prob. values are based on the normal distribution.
15.3.3 Seasonal tests
If the seasonal component is deterministic, either because it is specified to be `fixed' at the outset or its disturbance variance is estimated to be zero, a joint test of significance can be carried out on the s-1 seasonal effects. The test is essentially the same as a test for the joint significance of a set of explanatory variables in regression. Under the null hypothesis of no seasonality, the large sample distribution of the test statistic, denoted by `Seasonal Chi2(s-1)' in the output, is χs-12. The Prob. value is the probability of a χs-12 variable exceeding the value of the test statistic. In the case of a stochastic seasonal, the joint seasonal test is also produced although a formal joint test of significance of the seasonal effect is inappropriate. However, the seasonal pattern is persistent throughout the series and when the seasonal pattern changes relatively slowly, which is usually the case, the test statistic can provide a useful guide to the relative importance of the seasonal.
The formal definition of the test statistic is
a'P−1a,
where a contains the estimates of the s-1 seasonal effects at time T and P is the corresponding MSE matrix. In the case of a fixed seasonal, a and P are obtained from the matrix (sT,σ̂2ST). For a stochastic seasonal, a and P come from the matrix (âT|T,σ̂2P̂T|T); see §14.3.4.
The final estimate of the seasonal effect for each season; that is, the estimated seasonal pattern at time T, is given here as well.
15.3.4 Cycle tests
The amplitude of the cycle is

(a2 + a*2)1/2,

where a is the element of âT|T corresponding to ψT and a* corresponds to ψT*. The ratio of the amplitude of the cycle to the level of the trend is also reported; this gives an indication of the relative importance of the cycle. When the cycle is deterministic, but stationary, a joint significance χ2 test (of the same form as the seasonal test) is produced as well.
15.3.5 Data in logs
When it is indicated that a model has been estimated in logarithms, some additional output for specific components appears in the Results window.
- Trend - the value exp(a) is reported, where a is the element of âT|T corresponding to μT. When a slope component is specified, the estimated annual growth rate is reported as well; it is the estimate of βT multiplied by 100s.
- Seasonal - the multiplicative effect of the seasonal pattern on the level of the series is given by exp(aj), where aj is the estimate of the seasonal effect in season j. The percentage effect, 100{exp(aj)−1}, is also calculated. This output is added to the output generated for the option `Seasonal tests'.
- Cycle - in addition to Cycle tests, the amplitude as a percentage of the level is computed; this is simply the estimated amplitude multiplied by 100.
15.4 Goodness of fit
The goodness-of-fit statistics in STAMP all relate to the residuals, which are the standardised innovations; see §14.3.6. When explanatory variables are specified in the model, the generalised least squares residuals are used; see §14.3.6.1. The diagnostic checking facilities for these residuals are provided in the Residuals option; see §15.6.
15.4.1 Prediction error variance
The basic measure of goodness-of-fit is the PEV as defined in §14.3.3. The PEV is the variance of the residuals in the steady state and so it corresponds to the variance of the disturbance of the reduced form ARIMA model. When the Kalman filter does not converge to the steady state, the finite PEV is used. For a univariate model, the PEV is denoted by σ̃2, irrespective of whether or not it is the finite PEV. For a multivariate model, σ̃2 denotes the diagonal entry of the prediction error variance matrix Σ̃ corresponding to a specific series in yt. The output includes a message if the Kalman filter is not in the steady state at t=T. The Final state option includes the facility to re-calculate the PEV using a pre-set number of updates.
15.4.2 Prediction error mean deviation
Another measure of fit is the mean deviation of the residuals, vt, t=d+1,...,T, defined by

md = ( σ̃ / (T−d) ) ∑t=d+1T |vt|,

where the PEV σ̃2 is defined in §14.3.3. In a correctly specified model the reported ratio

(π/2) md2 / σ̃2

should be approximately equal to unity.

When it is indicated that the series are modelled in logs, the relative measure of error is also given. This measure is easily computed as 100md.
15.4.3 Coefficients of determination
In econometric modelling the traditional measure of goodness-of-fit is the coefficient of determination, R2. For a structural time series model this measure is defined by

R2 = 1 − (T−d)σ̃2 / ∑t=d+1T (yt−ȳ)2,

where yt is the original series and ȳ is its sample mean. Note that for a multivariate model, yt is a specific series of the vector yt and σ̃2 is the diagonal element of Σ̃ corresponding to that series. This coefficient of determination is most useful when the series appears to be stationary, with no trend or seasonal.
When the series yt shows trend movements, it is better to compare the PEV with the variance of first differences, Δyt=yt−yt−1. This leads to

RD2 = 1 − (T−d)σ̃2 / ∑t=d+1T (Δyt−Δȳ)2,

where Δȳ is the sample mean of Δyt. It should be noted that RD2 may be negative, indicating a worse fit than a simple random walk plus drift model.
For seasonal data with a trend, it is more appropriate to measure the fit against a random walk plus drift and fixed seasonals. This requires the sum of squares SSDSM, obtained by subtracting the seasonal mean from Δyt. The coefficient of determination is then

RS2 = 1 − (T−d)σ̃2 / SSDSM,

which can again be negative.
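A minimal sketch of the three measures, assuming NumPy; pev is the estimated prediction error variance σ̃2, d the number of diffuse initial elements and s the seasonal length, and the exact summation ranges are simplified relative to the text:

```python
import numpy as np

def r2_measures(y, pev, d, s):
    """Return (R2, RD2, RS2) for a univariate series y (a sketch)."""
    n = len(y) - d
    ssy = np.sum((y - y.mean())**2)            # deviations from the mean
    dy = np.diff(y)
    ssd = np.sum((dy - dy.mean())**2)          # first differences about their mean
    season = np.arange(len(dy)) % s
    ssdsm = sum(np.sum((dy[season == j] - dy[season == j].mean())**2)
                for j in range(s))             # differences about seasonal means
    return 1 - n*pev/ssy, 1 - n*pev/ssd, 1 - n*pev/ssdsm
```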
15.4.4 Information criteria : AIC and BIC
The fit of different models can be compared on the basis of the PEV. However, when models have different numbers of parameters, it is more appropriate to compare them using the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). These criteria are based on the formula

log PEV + cm/T,

where c is a constant and m is the number of parameters in θ plus the number of non-stationary components in the state vector. The AIC takes c as 2 and the BIC takes c as log T. For multivariate models, m is defined as for a univariate model, since the criteria are reported for each series in yt separately.
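The corresponding computation, as a small sketch:

```python
import numpy as np

def information_criteria(pev, m, T):
    """AIC and BIC on the log-PEV scale defined above."""
    return np.log(pev) + 2.0 * m / T, np.log(pev) + np.log(T) * m / T
```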
15.5 Components
The `Components' option facilitates the computation of filtered or smoothed estimates of the components specified in the model. The filtered estimates are based on past observations and can be calculated as described in §14.3.5. The smoothed estimates are based on all the observations; see §14.4.3. In what follows, the estimates are denoted by x̃, where x is some component which can be estimated either way. When the model is multivariate, the component is associated with a specific series in yt.
15.5.1 Series with components
This option plots the original series, yt, with one of the following (as requested):
- the estimated trend, μ̃t (which includes interventions on the level and slope);
- the estimated trend plus cycle(s) plus AR(1), μ̃t+ψ̃t+υ̃t;
- the estimated trend plus cycle(s) plus AR(1) plus explanatory variables, μ̃t+ψ̃t+υ̃t+xt'δ̃x, where the last part includes all the explanatory variables in the equation.
15.5.2 Detrended
The detrended series is yt-μ̃t for t=1,...,T. The smoothed estimates are usually most appropriate for detrending. Note that level and slope interventions are included in the trend.
15.5.3 Seasonally adjusted
The seasonally adjusted series is yt-γ̃t for t=1,...,T. The smoothed estimate of the seasonal component is normally used for seasonal adjustment.
15.5.4 Individual seasonals
The estimated seasonal component γ̃t is re-arranged into s different series, each corresponding to one of the seasons. The graph can be interpreted as a set of annual plots for the individual seasonal effects.
15.5.5 Data in logs
If the observations are in logarithms, the output of the various components may be amended in two ways: (1) `Anti-logs' can be calculated; or (2) `Modified anti-logs' can be calculated. These amendments do not apply in a similar way for all components. The following list gives the formal amendments:
- Trend - (1) exp(μ̃t) and (2) exp(μ̃t);
- Growth rate (Slope) - (1) 100s β̃t and (2) 100s β̃t;
- Seasonal, Cycle, AR(1) and Irregular - (1) exp(ã) and (2) exp(μ̃t+ã) − exp(μ̃t), where ã is the specific component;
- Detrended - (1) exp(yt−μ̃t) and (2) exp(yt) − exp(μ̃t);
- Seasonally adjusted - (1) exp(yt−γ̃t) and (2) exp(yt−γ̃t).
15.6 Residuals
The residuals are the standardised one-step-ahead prediction errors or innovations and they are defined in §14.3.6. The residual series are simply denoted by vt for t=d+1,...,T. For multivariate models, the series vt are associated with a specific series in yt. For a correctly specified model with known variance parameters, the residuals are assumed to be NID(0,1). Diagnostic statistics and graphs are the tools to validate this proposition.
15.6.1 Correlogram
The residual autocorrelation at lag τ is defined by

rτ = ∑t=d+1+τT (vt−v̄)(vt−τ−v̄) / ∑t=d+1T (vt−v̄)2,

where v̄ is the sample mean of the residuals. The approximate standard error for rτ is 1/(T−d)1/2. The Durbin and Watson (1950) statistic is given by

DW = ∑t=d+2T (vt−vt−1)2 / ∑t=d+1T vt2 ≈ 2{1−r(1)}.

In a correctly specified model, DW is approximately N(2,4/T).

A general test for serial dependence is the portmanteau Box--Ljung Q-statistic, which is based on the first P residual autocorrelations; see Ljung and Box (1978). It is given by

Q(P,d) = T(T+2) ∑j=1P rj2 / (T−j),

and is distributed approximately as χd2 under the null, where d is equal to P−n+1 and n is the number of parameters. For a multivariate model, n is determined as in a univariate model, so a variance matrix counts as one parameter. Notice that the loss in the degrees of freedom d of Q(P,d) takes account of the number of relative variance parameters. It does not take account of any lagged dependent variables, and their presence necessitates further adjustment.
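A minimal sketch of these three diagnostics, assuming NumPy; v holds the residuals with the d diffuse initial values already dropped, so T below counts the residuals rather than the full sample:

```python
import numpy as np

def residual_diagnostics(v, P, n_par):
    """Return autocorrelations r(1..P), DW, the Box-Ljung Q and its df."""
    T = len(v)
    c = v - v.mean()
    r = np.array([np.sum(c[tau:] * c[:-tau]) for tau in range(1, P + 1)])
    r /= np.sum(c**2)                           # residual autocorrelations
    dw = np.sum(np.diff(v)**2) / np.sum(v**2)   # approx. 2 * (1 - r(1))
    q = T * (T + 2) * np.sum(r**2 / (T - np.arange(1, P + 1)))
    return r, dw, q, P - n_par + 1              # df of the Q-statistic
```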
15.6.2 Periodogram and spectrum
These diagnostic graphs are defined in §13.3 and §??. They are applied to vt in exactly the same way. Note that the theoretical spectrum is a horizontal straight line for a white-noise series.
15.6.3 Cumulative statistics and graphs
The `Cusum' and `Cusum of squares' are as given in §13.5 and §13.6 respectively, but are now calculated using the residuals vt for t=d+1,...,T.
The `Cumulative periodogram' is another cumulative graph which is useful since it smooths out the periodogram. The cumulative periodogram cpj is defined as

cpj = ∑i=1j pi / ∑i=1n pi,   j=1,...,n,

where pi is the periodogram ordinate at i for i=1,...,n and n=[(T−d)/2]. The cumulative periodogram cpj is plotted against j=0,...,n, with cp0=0. A useful diagnostic statistic is the maximum deviation of cpj from the diagonal line connecting cp0=0 and cpn=1; that is, max|cpj−j/n| over j=1,...,n−1; see Durbin (1969). This measure is output on request in the Results window. Some typical critical values for this two-sided test are .44 for T≈80, .40 for T≈100 and .30 for T≈200.
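A minimal sketch of the cumulative periodogram and the maximum-deviation statistic, assuming NumPy; the periodogram ordinates are taken here from the FFT of the mean-corrected residuals, which is an assumption about the ordinate convention:

```python
import numpy as np

def cumulative_periodogram(v):
    """Return cp_0..cp_n and max |cp_j - j/n| over j=1..n-1."""
    n = len(v) // 2
    p = np.abs(np.fft.fft(v - v.mean())[1:n + 1])**2     # periodogram ordinates
    cp = np.concatenate(([0.0], np.cumsum(p) / p.sum())) # cp_0=0, ..., cp_n=1
    j = np.arange(n + 1)
    return cp, np.max(np.abs(cp[1:-1] - j[1:-1] / n))
```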
15.6.4 Distribution statistics
The Bowman and Shenton (1975) statistic for normality is calculated using the residuals vt for t=d+1,...,T. As for the other statistics, and the graphs on the density, histogram and distribution, the method of calculation is as presented in §13.10.
15.6.5 Heteroskedasticity
A basic non-parametric test for heteroskedasticity is the two-sided F(h,h) test,

H(h) = ∑t=T−h+1T vt2 / ∑t=d+1d+h vt2,

where h is the closest integer to (T−d)/3. An increase of the variance in the last third of the sample will cause H(h) to be large, while the reverse will cause it to be small. The heteroskedasticity test appears as part of the Summary statistics.
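A minimal sketch, assuming NumPy and that v holds the residuals vd+1,...,vT:

```python
import numpy as np

def h_test(v):
    """Ratio of the last h squared residuals to the first h; F(h,h) under the null."""
    h = round(len(v) / 3)                         # closest integer to (T-d)/3
    return np.sum(v[-h:]**2) / np.sum(v[:h]**2)
```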
15.7 Auxiliary residuals
Auxiliary residuals are designed to detect unusual movements in a fitted time series model. These residuals are the smoothed estimates of the disturbances associated with the components irregular, level and slope. Section 14.4.4 sets out how to calculate the standardised auxiliary residuals. It is very useful to plot them and look for any irregularities. If requested, all residuals which have absolute value exceeding two are written to the Results window.
As well as graphical representations of the auxiliary residuals, STAMP reports the standard distribution statistics for these residuals; see §13.10. Of course, these statistics cannot be used as standard moment tests because of the serial correlation in the auxiliary residuals. However, these statistics can be corrected for serial correlation as in Harvey and Koopman (1992). In STAMP, the correction factors are calculated from the first max(T1/2, 20) sample autocorrelations of the auxiliary residuals.
15.8 Predictive testing
When the final date of the state sample is smaller than the final date of the data sample; that is, there are L observations, yt, t=T+1,...,T+L `outside the sample', post-sample predictive testing may be carried out using all or some of these observations. However, when these final dates are the same, predictive testing within the sample is the only option; that is, predicting yt for t=T-l+1,...,T. Of course, in the first situation, predictive testing within the sample is also possible.
When explanatory variables and/or interventions are specified in the model, the residuals used in predictive testing within the sample are the generalised recursive residuals defined in §14.3.6.2. These residuals are denoted by wt for t=d*+1,...,T where d*=d+k and k is the number of explanatory variables in the model (excluding any variables which do not appear after time T-L). Note that wt is zero when an intervention enters into the model. The error bands for the residuals are based on twice the RMSE. The CUSUM is defined in §13.5, and it is calculated for these residuals if requested.
The Chow test within the sample is constructed as
Chow = ( (T−l−d*) / l ) ∑t=T−l+1T wt2 / ∑t=d*+1T−l wt2,
which is approximately distributed as F(l,T-l-d*).
When post-sample predictive testing is carried out, the estimates of the coefficients of the explanatory variables and/or interventions remain the same as they were at the end of the state sample. The corresponding standardised residuals are denoted by vt for t=T+1,...,T+L. The post-sample predictive `failure' test statistic is then

pft = ∑j=1L vT+j2,

which is approximately distributed as χL2.
The CUSUM t-test is also output in the form

cusumt = L−1/2 ∑j=1L vT+j,

which is approximately distributed as Student's t with T−L−d* degrees of freedom.
The extrapolative residuals, which are defined in §14.5.2, can be used to assess the predictive performance of the model and to make comparisons with rival models. If v̂T+j|T, j=1,...,L, denotes the j-step ahead extrapolative residual in the post-sample period, the extrapolative sum of squares is given by

ess(T,L) = ∑j=1L v̂T+j|T2.

This statistic can also be generated using residuals within the sample. A similar statistic based on the sum of absolute values is defined as

esa(T,L) = ∑j=1L |v̂T+j|T|.
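A minimal sketch of the post-sample statistics above, assuming NumPy; note that the failure and CUSUM tests use the standardised post-sample residuals while ess and esa use the extrapolative residuals, so the two inputs are different residual series:

```python
import numpy as np

def post_sample_tests(v_std, v_extrap):
    """v_std: standardised post-sample residuals; v_extrap: extrapolative residuals."""
    L = len(v_std)
    pft = np.sum(v_std**2)                 # failure test, approx. chi-squared(L)
    cusum_t = np.sum(v_std) / np.sqrt(L)   # CUSUM t-test
    ess = np.sum(v_extrap**2)              # extrapolative sum of squares
    esa = np.sum(np.abs(v_extrap))         # extrapolative sum of absolute values
    return pft, cusum_t, ess, esa
```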
15.9 Forecast
The forecasts of the series yt and the components are generated as described in §14.5. The forecast error bounds are based on one RMSE. For example, the forecasts of a specific series ŷT+j are obtained from a particular element of ŷT+j|T, for j=1,...,L, and the RMSE r̂T+j is the square root of the corresponding diagonal element of the matrix σ̂2F̂T+j|T. When the data are in logs, the anti-log transformation ( exp ) can be calculated for the forecasts. Their corresponding RMSEs are then based on
r̂T+j* = exp(ŷT+j + r̂T+j) − exp(ŷT+j),   j=1,...,L.
The anti-log forecasts adjusted for bias are calculated by
ŷT+j* = exp(ŷT+j + 0.5r̂T+j2),   j=1,...,L.
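A minimal sketch of these anti-log adjustments, assuming NumPy; y_hat and rmse hold the log-scale forecasts ŷT+j and their RMSEs r̂T+j:

```python
import numpy as np

def antilog_forecasts(y_hat, rmse):
    """Bias-adjusted anti-log forecasts and the corresponding RMSE measure."""
    rmse_star = np.exp(y_hat + rmse) - np.exp(y_hat)   # anti-log RMSE
    y_star = np.exp(y_hat + 0.5 * rmse**2)             # bias-adjusted anti-log
    return y_star, rmse_star
```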