Influential observation

Various methods have been proposed for measuring influence.^[3]^[4] Assume an estimated regression $\mathbf {y} =\mathbf {X} \mathbf {b} +\mathbf {e}$, where $\mathbf {y}$ is an n×1 column vector of the response variable, $\mathbf {X}$ is the n×k design matrix of explanatory variables (including a constant), $\mathbf {e}$ is the n×1 residual vector, and $\mathbf {b}$ is a k×1 vector of estimates of the population parameter $\mathbf {\beta } \in \mathbb {R} ^{k}$. Also define $\mathbf {H} \equiv \mathbf {X} \left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {X} ^{\mathsf {T}}$, the projection matrix of $\mathbf {X}$. Then we have the following measures of influence:

${\text{DFBETA}}_{i}\equiv \mathbf {b} -\mathbf {b} _{(-i)}={\frac {\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {x} _{i}^{\mathsf {T}}e_{i}}{1-h_{ii}}}$, where $\mathbf {b} _{(-i)}$ denotes the coefficients estimated with the i-th row $\mathbf {x} _{i}$ of $\mathbf {X}$ deleted, $e_{i}$ is the i-th residual, and $h_{ii}=\mathbf {x} _{i}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {x} _{i}^{\mathsf {T}}$ is the i-th element of the main diagonal of $\mathbf {H}$. Thus DFBETA measures the difference in each parameter estimate with and without the influential point. There is a DFBETA for each variable and each observation (if there are n observations and k variables, there are n·k DFBETAs).^[5] The following table shows DFBETAs for the third dataset from Anscombe's quartet (bottom left chart in the figure):

x	y	DFBETA (intercept)	DFBETA (slope)
10.0	7.46	-0.005	-0.044
8.0	6.77	-0.037	0.019
13.0	12.74	-357.910	525.268
9.0	7.11	-0.033	0
11.0	7.81	0.049	-0.117
14.0	8.84	0.490	-0.667
6.0	6.08	0.027	-0.021
4.0	5.39	0.241	-0.209
12.0	8.15	0.137	-0.231
7.0	6.42	-0.020	0.013
5.0	5.73	0.105	-0.087
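
The closed-form expression for DFBETA can be checked numerically against a brute-force refit. The following is a minimal sketch, assuming NumPy; the function names `dfbeta` and `dfbeta_by_refit` are illustrative and not part of any library. It uses the x and y columns of the table and computes the raw differences $\mathbf {b} -\mathbf {b} _{(-i)}$ exactly as defined above. Those raw differences may be scaled differently from the tabulated values, so the sketch is not claimed to reproduce the table digit for digit.

```python
import numpy as np

# Third dataset of Anscombe's quartet, in the row order of the table.
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# Design matrix with a constant column (intercept) and x (slope).
X = np.column_stack([np.ones_like(x), x])


def dfbeta(X, y):
    """Closed form: row i is (X'X)^{-1} x_i' e_i / (1 - h_ii) = b - b_(-i)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b                                  # residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # diagonal of H (leverages)
    # Row i of X @ XtX_inv is x_i (X'X)^{-1}, i.e. the transpose of
    # (X'X)^{-1} x_i', because (X'X)^{-1} is symmetric.
    return (X @ XtX_inv) * (e / (1.0 - h))[:, None]


def dfbeta_by_refit(X, y):
    """Brute force: refit with each observation deleted in turn."""
    n = len(y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    rows = []
    for i in range(n):
        keep = np.arange(n) != i
        b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        rows.append(b - b_minus_i)
    return np.array(rows)


closed_form = dfbeta(X, y)        # shape (n, k): one row per observation
refit = dfbeta_by_refit(X, y)

# Index of the observation whose deletion changes the slope the most.
most_influential = int(np.argmax(np.abs(closed_form[:, 1])))
print(closed_form)
print("largest slope change at row", most_influential)
```

The closed form needs only one fit and the leverages $h_{ii}$, whereas the refit loop solves n separate least-squares problems; the two should agree up to floating-point error.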

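DFFITS (difference in fits) measures the scaled change in the fitted value for an observation when that observation is deleted.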
Cook's D measures the effect of removing a data point on all the parameters combined.^[2]
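
Both measures can be computed from a single fit using the residuals and leverages. The sketch below is illustrative only and assumes NumPy. The formulas it uses are the common textbook forms, which are not given in the text above and are therefore an assumption here: $D_{i}=e_{i}^{2}h_{ii}/\left(k\,s^{2}(1-h_{ii})^{2}\right)$ with $s^{2}=\mathbf {e} ^{\mathsf {T}}\mathbf {e} /(n-k)$, and ${\text{DFFITS}}_{i}=t_{i}{\sqrt {h_{ii}/(1-h_{ii})}}$, where $t_{i}$ is the residual studentized with the variance estimate that excludes observation i. All variable names are hypothetical.

```python
import numpy as np

# Third dataset of Anscombe's quartet (same data as the DFBETA table).
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

X = np.column_stack([np.ones_like(x), x])   # constant plus x
n, k = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                        # least-squares estimates
e = y - X @ b                                # residuals
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages h_ii (diagonal of H)

# Cook's D (assumed textbook form): effect of deleting observation i
# on all k coefficients combined.
s2 = e @ e / (n - k)
cooks_d = e**2 * h / (k * s2 * (1.0 - h) ** 2)

# Leave-one-out variance estimate, obtained without refitting.
s2_loo = ((n - k) * s2 - e**2 / (1.0 - h)) / (n - k - 1)

# DFFITS (assumed textbook form): scaled change in the i-th fitted value.
t = e / np.sqrt(s2_loo * (1.0 - h))          # externally studentized residuals
dffits = t * np.sqrt(h / (1.0 - h))

print("Cook's D:", np.round(cooks_d, 3))
print("largest Cook's D at row", int(np.argmax(cooks_d)))
```

In this dataset the remaining ten points lie almost exactly on a line, so the leave-one-out variance for the outlying row is close to zero and its DFFITS value is very large; Cook's D uses the full-sample $s^{2}$ and stays on a more moderate scale.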
