Model Validation and Prediction Errors

Model Validation

R² (Coefficient of Determination)

  • R² represents the percentage of variance in the dependent variable (y) that can be predicted from the independent variable(s) (x).
  • It assesses how well a model explains the variability of the data.
  • Formula: $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
  • Where:
    • $y_i$: Actual values.
    • $\hat{y}_i$: Predicted values.
    • $\bar{y}$: Mean of the actual values.
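The formula above can be sketched directly in NumPy; the toy data here is purely illustrative:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # actual values
y_hat = np.array([1.1, 1.9, 3.2, 3.9, 5.1])  # predictions close to y
print(r_squared(y, y_hat))                   # close to 1 for a good fit
```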

Conditions Affecting R²

  • Linear Regression (ARX):
    • Under linear regression, particularly Autoregressive with Exogenous inputs (ARX) models, specific conditions can skew the R² value.
  • No Cross-Validation (C.V.): When the training data equals the test data, R² overestimates how well the model generalizes.

Pearson Correlation Coefficient

  • R² is the square of the Pearson Correlation Coefficient.
  • $R^2 = (\text{Pearson correlation coefficient})^2$
  • It indicates the proportion of variance in $y(t)$ explained by the model.

Correlation Coefficient

Definition

  • For two random variables, X and Y, the correlation coefficient (r) measures the strength and direction of a linear relationship between them.
  • Formula:
    • $r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}$
    • Where:
      • $\mathrm{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N-1}$
      • $\mathrm{Var}(X) = \frac{\sum (x_i - \bar{x})^2}{N-1}$
      • $\mathrm{Var}(Y) = \frac{\sum (y_i - \bar{y})^2}{N-1}$
  • The correlation coefficient lies between -1 and 1 (inclusive): $r \in [-1, 1]$.
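A minimal sketch of these definitions, using the sample (N-1) normalization from the formulas above; the result can be cross-checked against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

def pearson_r(x, y):
    """r = Cov(X, Y) / sqrt(Var(X) * Var(Y)), with (N-1) normalization."""
    xm, ym = x - x.mean(), y - y.mean()
    n = len(x)
    cov = np.sum(xm * ym) / (n - 1)
    var_x = np.sum(xm ** 2) / (n - 1)
    var_y = np.sum(ym ** 2) / (n - 1)
    return cov / np.sqrt(var_x * var_y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])   # nearly linear in x
r = pearson_r(x, y)
print(r, np.corrcoef(x, y)[0, 1])    # the two values agree
```

Note that the (N-1) factors cancel in the ratio, so the same `r` results from either the sample or population normalization.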

Application

  • For a time series $y(t)$, the prediction $\hat{y}(t+1|t)$ is made based on past values.
  • Decomposition of Variance:
    • $\mathrm{Var}(y) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(\epsilon)$
    • Where $\epsilon$ represents the error term (unexplained variance); it is uncorrelated with the predicted values.
  • The percentage of variance explained by the model equals R².
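The decomposition can be verified numerically. For ordinary least squares with an intercept, the residuals are uncorrelated with the fitted values, so the variances add exactly; the synthetic data below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # true slope 2, unit-variance noise

a, b = np.polyfit(x, y, 1)           # least-squares line fit
y_hat = a * x + b
eps = y - y_hat

# OLS with intercept: eps is uncorrelated with y_hat, so
# Var(y) = Var(y_hat) + Var(eps), and Var(y_hat)/Var(y) = R^2.
var_y, var_hat, var_eps = np.var(y), np.var(y_hat), np.var(eps)
print(var_y, var_hat + var_eps)      # approximately equal
print(var_hat / var_y)               # fraction of variance explained = R^2
```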

Smoothness and Over-sampling

Over-sampled Data

  • Over-sampled data can lead to an inflated R² value, even for poorly performing models.

Prediction

  • One-step-ahead prediction based on past values:
    • $\hat{y}(t+1|t) = E[y(t+1) \mid y(t), y(t-1), \ldots, u(t), u(t-1), \ldots]$
  • Even for a random walk, where the best predictor is simply $\hat{y}(t+H|t) = y(t)$, the model can appear better than it is due to smoothness.
    • Random Walk Example: $y(t) = y(t-1) + e(t)$

Solutions for Model Comparison

  1. Model Comparison: Compare the model with a random walk model.
  2. k-step Ahead Prediction: Evaluate the model's performance using k-step ahead predictions.
  3. Model Differencing: Apply differencing to the time series: $\Delta y(t) = y(t) - y(t-k)$
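A short simulation illustrates the smoothness trap and the differencing fix: the naive predictor $\hat{y}(t) = y(t-1)$ scores a near-perfect R² on the levels of a random walk, yet explains nothing once the series is differenced.

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.normal(size=1000)
y = np.cumsum(e)                      # random walk: y(t) = y(t-1) + e(t)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Naive predictor y_hat(t) = y(t-1): looks excellent on the raw levels...
print(r2(y[1:], y[:-1]))              # very close to 1

# ...but on the differenced series it predicts dy_hat = 0 and explains nothing.
dy = np.diff(y)                       # dy(t) = y(t) - y(t-1)
print(r2(dy, np.zeros_like(dy)))      # near 0 (or slightly negative)
```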

Prediction Errors

Assumptions

  • Assume the true system is represented as:
    • $y(t) = G_0(q)u(t) + H_0(q)e(t)$
    • Where:
      • $G_0(q)$: Transfer function of the true system.
      • $H_0(q)$: Noise model of the true system.
      • $u(t)$: Input.
      • $e(t)$: White noise.
  • If the estimated model perfectly matches the true system:
    • $G(q; \theta) = G_0(q)$
    • $H(q; \theta) = H_0(q)$

Error Analysis

  • Error: $\epsilon = y - \hat{y}$
  • If $G(q; \theta) = G_0(q)$ and $H(q; \theta) = H_0(q)$, then: $\epsilon = H_0 e$
  • If $H(q; \theta) = 1$, the error $\epsilon = H_0 e$ is not necessarily white noise, even when it is small.

Checking for Whiteness

Methods

  1. Direct Inspection: Examine the error sequence $\epsilon(t)$ directly.
  2. MATLAB Demo
  3. Autocorrelation Function (ACF): Compute and analyze the ACF of the error sequence.
    • $R_\epsilon(\tau) = \frac{\sum \epsilon(t)\,\epsilon(t-\tau)}{N_{test}}$
    • If $\epsilon$ is white noise, the ACF should be close to zero for all non-zero lags.
  4. Statistical Hypothesis Testing
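The ACF check can be sketched in a few lines of NumPy. For a white-noise residual of length $N$, the normalized autocorrelations should mostly fall inside roughly $\pm 2/\sqrt{N}$ (the 95% band); the simulated residuals here are an illustrative assumption:

```python
import numpy as np

def acf(eps, max_lag):
    """Normalized sample autocorrelations R_eps(tau) / R_eps(0)."""
    n = len(eps)
    eps = eps - eps.mean()
    r0 = np.sum(eps * eps) / n                    # R_eps(0)
    return np.array([np.sum(eps[tau:] * eps[:n - tau]) / n / r0
                     for tau in range(1, max_lag + 1)])

rng = np.random.default_rng(2)
white = rng.normal(size=2000)                     # stand-in for residuals
rho = acf(white, 10)

# For white noise, |rho(tau)| should stay inside about 2/sqrt(N).
print(np.abs(rho).max(), 2 / np.sqrt(2000))
```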

Statistical Test

  • Under the null hypothesis that $\epsilon$ is white noise, the following statistic can be used:
    • $\frac{\sqrt{N_{test}}\, R_\epsilon(\tau)}{R_\epsilon(0)} \sim N(0, 1)$
    • This test assumes that the errors are independent and identically distributed (i.i.d.).

Central Limit Theorem (CLT)

  • Given i.i.d. random variables $X_1, X_2, \ldots, X_n$:
    • If $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$,
    • then $\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}} \rightarrow N(0, 1)$
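A quick simulation illustrates the CLT statement; uniform variables (with $\mu = 1/2$, $\sigma^2 = 1/12$) are an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 5000

# Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
x = rng.uniform(size=(trials, n))
z = (x.sum(axis=1) - n * 0.5) / (np.sqrt(1 / 12) * np.sqrt(n))

# The standardized sums should look approximately standard normal.
print(z.mean(), z.std())   # near 0 and near 1
```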

Chi-Squared Test

Test Statistic

  • Test statistic for whiteness:
    • $\sum_{\tau=1}^{T_{max}} \frac{N_{test}\, R_\epsilon^2(\tau)}{R_\epsilon^2(0)} \sim \chi^2(T_{max})$
    • This statistic follows a chi-squared distribution with $T_{max}$ degrees of freedom.
  • If the errors are white, the test statistic will be small.
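The test statistic above can be sketched as follows; the white and MA(1)-colored residual sequences are illustrative assumptions, and the 95% cutoff for $\chi^2(10)$ is about 18.3:

```python
import numpy as np

def whiteness_stat(eps, max_lag):
    """Sum over lags of N * (R_eps(tau) / R_eps(0))^2; ~ chi^2(max_lag) if white."""
    n = len(eps)
    eps = eps - eps.mean()
    r0 = np.sum(eps * eps) / n
    stat = 0.0
    for tau in range(1, max_lag + 1):
        r = np.sum(eps[tau:] * eps[:n - tau]) / n
        stat += n * (r / r0) ** 2
    return stat

rng = np.random.default_rng(4)
white = rng.normal(size=2000)
colored = np.convolve(white, [1.0, 0.8], mode="valid")  # MA(1): correlated errors

T = 10
print(whiteness_stat(white, T))    # small: do not reject whiteness
print(whiteness_stat(colored, T))  # large: reject whiteness
```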