Model Validation and Prediction Errors

Model Validation

R² (Coefficient of Determination)

  • R² represents the percentage of variance in the dependent variable (y) that can be predicted from the independent variable(s) (x).
  • It assesses how well a model explains the variability of the data.
  • Formula: $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
  • Where:
    • $y_i$: Actual values.
    • $\hat{y}_i$: Predicted values.
    • $\bar{y}$: Mean of the actual values.
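The formula above can be sketched directly in NumPy; the toy data here is purely illustrative:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # actual values
y_hat = np.array([1.1, 1.9, 3.2, 3.9, 5.1])  # predictions close to y
print(r_squared(y, y_hat))                   # close to 1 for a good fit
```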

Conditions Affecting R²

  • Linear Regression (ARX):
    • Under linear regression, particularly Autoregressive with Exogenous inputs (ARX) models, specific conditions can skew the R² value.
  • No Cross-Validation (C.V.): When the training data equals the test data, R² overestimates how well the model generalizes.

Pearson Correlation Coefficient

  • R² is the square of the Pearson Correlation Coefficient.
  • $R^2 = (\text{Pearson correlation coefficient})^2$
  • It indicates the proportion of variance in $y(t)$ explained by the model.

Correlation Coefficient

Definition

  • For two random variables, X and Y, the correlation coefficient (r) measures the strength and direction of a linear relationship between them.
  • Formula:
    • $r = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}$
    • Where:
      • $\mathrm{Cov}(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N-1}$
      • $\mathrm{Var}(X) = \frac{\sum (x_i - \bar{x})^2}{N-1}$
      • $\mathrm{Var}(Y) = \frac{\sum (y_i - \bar{y})^2}{N-1}$
  • The correlation coefficient lies between -1 and 1 (inclusive): $r \in [-1, 1]$.
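A minimal sketch of these definitions, using the sample (N-1) normalization from the formulas above; the result can be cross-checked against NumPy's built-in `np.corrcoef`:

```python
import numpy as np

def pearson_r(x, y):
    """r = Cov(X, Y) / sqrt(Var(X) * Var(Y)), with (N-1) normalization."""
    xm, ym = x - x.mean(), y - y.mean()
    n = len(x)
    cov = np.sum(xm * ym) / (n - 1)
    var_x = np.sum(xm ** 2) / (n - 1)
    var_y = np.sum(ym ** 2) / (n - 1)
    return cov / np.sqrt(var_x * var_y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])   # nearly linear in x
r = pearson_r(x, y)
print(r, np.corrcoef(x, y)[0, 1])    # the two values agree
```

Note that the (N-1) factors cancel in the ratio, so the same `r` results from either the sample or population normalization.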

Application

  • For a time series $y(t)$, the prediction $\hat{y}(t+1|t)$ is made based on past values.
  • Decomposition of Variance:
    • $\mathrm{Var}(y) = \mathrm{Var}(\hat{y}) + \mathrm{Var}(\epsilon)$
    • Where $\epsilon$ represents the error term (unexplained variance); it is uncorrelated with the predicted values.
  • The percentage of variance explained by the model equals R².
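The decomposition can be verified numerically. For ordinary least squares with an intercept, the residuals are uncorrelated with the fitted values, so the variances add exactly; the synthetic data below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # true slope 2, unit-variance noise

a, b = np.polyfit(x, y, 1)           # least-squares line fit
y_hat = a * x + b
eps = y - y_hat

# OLS with intercept: eps is uncorrelated with y_hat, so
# Var(y) = Var(y_hat) + Var(eps), and Var(y_hat)/Var(y) = R^2.
var_y, var_hat, var_eps = np.var(y), np.var(y_hat), np.var(eps)
print(var_y, var_hat + var_eps)      # approximately equal
print(var_hat / var_y)               # fraction of variance explained = R^2
```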

Smoothness and Over-sampling

Over-sampled Data

  • Over-sampled data can lead to an inflated R² value, even for poorly performing models.

Prediction

  • One-step-ahead prediction based on past values:
    • $\hat{y}(t+1|t) = E[y(t+1) \mid y(t), y(t-1), \ldots, u(t), u(t-1), \ldots]$
  • Even for a random walk, where the best predictor is simply $\hat{y}(t+H|t) = y(t)$, the model can appear better than it is due to smoothness.
    • Random Walk Example: $y(t) = y(t-1) + e(t)$

Solutions for Model Comparison

  1. Model Comparison: Compare the model with a random walk model.
  2. k-step Ahead Prediction: Evaluate the model's performance using k-step ahead predictions.
  3. Model Differencing: Apply differencing to the time series: $\Delta y(t) = y(t) - y(t-k)$
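A short simulation illustrates the smoothness trap and the differencing fix: the naive predictor $\hat{y}(t) = y(t-1)$ scores a near-perfect R² on the levels of a random walk, yet explains nothing once the series is differenced.

```python
import numpy as np

rng = np.random.default_rng(1)
e = rng.normal(size=1000)
y = np.cumsum(e)                      # random walk: y(t) = y(t-1) + e(t)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Naive predictor y_hat(t) = y(t-1): looks excellent on the raw levels...
print(r2(y[1:], y[:-1]))              # very close to 1

# ...but on the differenced series it predicts dy_hat = 0 and explains nothing.
dy = np.diff(y)                       # dy(t) = y(t) - y(t-1)
print(r2(dy, np.zeros_like(dy)))      # near 0 (or slightly negative)
```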

Prediction Errors

Assumptions

  • Assume the true system is represented as:
    • $y(t) = G_0(q)u(t) + H_0(q)e(t)$
    • Where:
      • $G_0(q)$: Transfer function of the true system.
      • $H_0(q)$: Noise model of the true system.
      • $u(t)$: Input.
      • $e(t)$: White noise.
  • If the estimated model perfectly matches the true system:
    • $G(q; \theta) = G_0(q)$
    • $H(q; \theta) = H_0(q)$

Error Analysis

  • Error: $\epsilon = y - \hat{y}$
  • If $G(q; \theta) = G_0(q)$ and $H(q; \theta) = H_0(q)$, then: $\epsilon = H_0 e$
  • If $H(q; \theta) = 1$, the error $\epsilon = H_0 e$ is not necessarily white noise, even when it is small.

Checking for Whiteness

Methods

  1. Direct Inspection: Examine the error sequence $\epsilon(t)$ directly.
  2. MATLAB Demo
  3. Autocorrelation Function (ACF): Compute and analyze the ACF of the error sequence.
    • $R_\epsilon(\tau) = \frac{\sum \epsilon(t)\,\epsilon(t-\tau)}{N_{test}}$
    • If $\epsilon$ is white noise, the ACF should be close to zero for all non-zero lags.
  4. Statistical Hypothesis Testing
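The ACF check can be sketched in a few lines of NumPy. For a white-noise residual of length $N$, the normalized autocorrelations should mostly fall inside roughly $\pm 2/\sqrt{N}$ (the 95% band); the simulated residuals here are an illustrative assumption:

```python
import numpy as np

def acf(eps, max_lag):
    """Normalized sample autocorrelations R_eps(tau) / R_eps(0)."""
    n = len(eps)
    eps = eps - eps.mean()
    r0 = np.sum(eps * eps) / n                    # R_eps(0)
    return np.array([np.sum(eps[tau:] * eps[:n - tau]) / n / r0
                     for tau in range(1, max_lag + 1)])

rng = np.random.default_rng(2)
white = rng.normal(size=2000)                     # stand-in for residuals
rho = acf(white, 10)

# For white noise, |rho(tau)| should stay inside about 2/sqrt(N).
print(np.abs(rho).max(), 2 / np.sqrt(2000))
```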

Statistical Test

  • Under the null hypothesis that $\epsilon$ is white noise, the following statistic can be used:
    • $\frac{\sqrt{N_{test}}\, R_\epsilon(\tau)}{R_\epsilon(0)} \sim N(0, 1)$
    • This test assumes that the errors are independent and identically distributed (i.i.d.).

Central Limit Theorem (CLT)

  • Given i.i.d. random variables $X_1, X_2, \ldots, X_n$:
    • If $E[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$,
    • then $\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma \sqrt{n}} \rightarrow N(0, 1)$
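A quick simulation illustrates the CLT statement; uniform variables (with $\mu = 1/2$, $\sigma^2 = 1/12$) are an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 5000

# Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
x = rng.uniform(size=(trials, n))
z = (x.sum(axis=1) - n * 0.5) / (np.sqrt(1 / 12) * np.sqrt(n))

# The standardized sums should look approximately standard normal.
print(z.mean(), z.std())   # near 0 and near 1
```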

Chi-Squared Test

Test Statistic

  • Test statistic for whiteness:
    • $\sum_{\tau=1}^{T_{max}} \frac{N_{test}\, R_\epsilon^2(\tau)}{R_\epsilon^2(0)} \sim \chi^2(T_{max})$
    • This statistic follows a chi-squared distribution with $T_{max}$ degrees of freedom.
  • If the errors are white, the test statistic will be small.
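The test statistic above can be sketched as follows; the white and MA(1)-colored residual sequences are illustrative assumptions, and the 95% cutoff for $\chi^2(10)$ is about 18.3:

```python
import numpy as np

def whiteness_stat(eps, max_lag):
    """Sum over lags of N * (R_eps(tau) / R_eps(0))^2; ~ chi^2(max_lag) if white."""
    n = len(eps)
    eps = eps - eps.mean()
    r0 = np.sum(eps * eps) / n
    stat = 0.0
    for tau in range(1, max_lag + 1):
        r = np.sum(eps[tau:] * eps[:n - tau]) / n
        stat += n * (r / r0) ** 2
    return stat

rng = np.random.default_rng(4)
white = rng.normal(size=2000)
colored = np.convolve(white, [1.0, 0.8], mode="valid")  # MA(1): correlated errors

T = 10
print(whiteness_stat(white, T))    # small: do not reject whiteness
print(whiteness_stat(colored, T))  # large: reject whiteness
```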