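Linear regression

In statistics, linear regression is a form of regression analysis that models the relationship between one or more independent variables and a dependent variable using a least-squares function called the linear regression equation. This function is a linear combination of one or more model parameters called regression coefficients. The case of a single independent variable is called simple regression; the case of more than one independent variable is called multiple regression (multivariable linear regression).^[1]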

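In linear regression, the data are modeled using linear predictor functions, and the unknown model parameters are estimated from the data. Such models are called linear models.^[2] Most commonly, the conditional mean of y given the value of X is assumed to be an affine function of X. Less commonly, the median or some other quantile of the conditional distribution of y given X is expressed as a linear function of X. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of y given X, rather than on the joint probability distribution of X and y, which is the domain of multivariate analysis.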

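Linear regression was the first type of regression analysis to be studied rigorously and to be used extensively in practical applications.^[3] This is because models that depend linearly on their unknown parameters are easier to fit than models that depend nonlinearly on them, and because the statistical properties of the resulting estimators are easier to determine.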

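Linear regression has many practical uses. Most fall into one of the following two broad categories:

If the goal is prediction or mapping, linear regression can be used to fit a predictive model to an observed data set of y and X values. Once such a model has been fitted, if a new value of X is given without its accompanying value of y, the fitted model can be used to predict the value of y.
Given a variable y and a number of variables $X_{1}$, ..., $X_{p}$ that may be related to y, linear regression analysis can be used to quantify the strength of the relationship between y and each $X_{j}$, to assess which $X_{j}$ are unrelated to y, and to identify which subsets of the $X_{j}$ contain redundant information about y.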

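Linear regression models are often fitted using the least-squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as in least absolute deviations regression), or by minimizing a penalized version of the least-squares loss function, as in ridge regression. Conversely, the least-squares approach can be used to fit models that are not linear. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.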

The "regression" in linear regression refers to regression toward the mean.

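Introduction

Theoretical model

Given a random sample $(Y_{i},X_{i1},\ldots ,X_{ip}),\,i=1,\ldots ,n$, a linear regression model assumes that the relationship between the regressand $Y_{i}$ and the regressors $X_{i1},\ldots ,X_{ip}$ is not determined by the regressors alone: other variables also influence $Y_{i}$. An error term $\varepsilon _{i}$ (itself a random variable) is therefore added to capture every influence on $Y_{i}$ other than $X_{i1},\ldots ,X_{ip}$. A multiple linear regression model thus takes the following form: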

Y_{i}=\beta _{0}+\beta _{1}X_{i1}+\beta _{2}X_{i2}+\ldots +\beta _{p}X_{ip}+\varepsilon _{i},\qquad i=1,\ldots ,n

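Other models are regarded as nonlinear. A linear regression model need not be a linear function of the independent variables: linear here means that the conditional mean of $Y_{i}$ is linear in the parameters $\beta$. For example, the model $Y_{i}=\beta _{1}X_{i}+\beta _{2}X_{i}^{2}+\varepsilon _{i}$ is linear in $\beta _{1}$ and $\beta _{2}$ but nonlinear in $X_{i}$, because it involves $X_{i}^{2}$, a nonlinear function of $X_{i}$.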
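The point that such a model is still fitted by linear least squares can be illustrated with a minimal sketch. The data below are hypothetical and noise-free, generated from $\beta _{1}=2$ and $\beta _{2}=0.5$; the variable names are illustrative only. The columns $X_{i}$ and $X_{i}^{2}$ are simply treated as two separate regressors.

```python
# Hypothetical noise-free data generated from y = 2*x + 0.5*x**2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 0.5 * x**2 for x in xs]

# Treat x and x**2 as two regressor columns; the model is linear in (b1, b2).
s11 = sum(x * x for x in xs)            # sum of x^2
s12 = sum(x * x**2 for x in xs)         # sum of x^3
s22 = sum(x**2 * x**2 for x in xs)      # sum of x^4
t1 = sum(x * y for x, y in zip(xs, ys))     # sum of x * y
t2 = sum(x**2 * y for x, y in zip(xs, ys))  # sum of x^2 * y

# Solve the 2x2 normal equations
#   s11*b1 + s12*b2 = t1
#   s12*b1 + s22*b2 = t2
# by Cramer's rule.
det = s11 * s22 - s12 * s12
b1 = (t1 * s22 - s12 * t2) / det
b2 = (s11 * t2 - s12 * t1) / det

print(b1, b2)  # recovers the generating coefficients up to rounding
```

Because the data contain no noise, the fit recovers the generating coefficients; with real data the same computation gives the least-squares estimates.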

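Data and estimation

It is important to distinguish between random variables and the observed values of those variables. Typically, the observations or data (denoted by lowercase letters) consist of n values $(y_{i},x_{i1},\ldots ,x_{ip}),\,i=1,\ldots ,n$.

There are $p+1$ parameters $\beta _{0},\ldots ,\beta _{p}$ to be determined. To estimate them, it is useful to use matrix notation: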

Y=X\beta +\varepsilon \,

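where Y is a column vector containing the observations $Y_{1},\ldots ,Y_{n}$, $\varepsilon$ contains the unobserved random components $\varepsilon _{1},\ldots ,\varepsilon _{n}$, and $X$ is the matrix of observed regressor values: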

X={\begin{pmatrix}1&x_{11}&\cdots &x_{1p}\\1&x_{21}&\cdots &x_{2p}\\\vdots &\vdots &\ddots &\vdots \\1&x_{n1}&\cdots &x_{np}\end{pmatrix}}

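X usually includes a constant term (the column of ones).

If the columns of X are linearly dependent, the parameter vector $\beta$ cannot be estimated by least squares unless $\beta$ is constrained, for example by requiring the sum of some of its elements to be 0.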

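Classical assumptions

The sample is drawn at random from the population.
The dependent variable Y is continuous on the real line.
The residual terms are independent and identically distributed (iid); that is, the residuals are independent random variables that follow a Gaussian distribution.

These assumptions imply that the residual term does not depend on the values of the independent variables, so $\varepsilon _{i}$ and the independent variables X (the predictors) are mutually independent.

Under these assumptions, simple linear regression can be written explicitly as a model of the conditional expectation: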

{\mbox{E}}(Y_{i}\mid X_{i}=x_{i})=\alpha +\beta x_{i}\,

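Least-squares analysis

Least-squares estimation

The primary purpose of regression analysis is to estimate the model parameters so as to obtain the best fit to the data. Among the different criteria that can be used to define the best fit, the least-squares criterion is a very powerful one. The estimate can be written as: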

{\hat {\beta }}=(X^{T}X)^{-1}X^{T}y\,
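The estimate above can be sketched in plain Python by forming $X^{T}X$ and $X^{T}y$ and solving the resulting linear system. This is a minimal illustration under stated assumptions, not a production routine: the helper names `solve` and `ols`, the tolerance `1e-12`, and the data are hypothetical, and numerical libraries typically use a QR or similar decomposition instead of the normal equations.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            # Linearly dependent columns of X make X^T X singular.
            raise ValueError("X^T X is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ols(rows, y):
    """Return beta_hat = (X^T X)^{-1} X^T y; `rows` exclude the constant."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend constant term
    k = len(X[0])
    XtX = [[sum(xr[i] * xr[j] for xr in X) for j in range(k)] for i in range(k)]
    Xty = [sum(xr[i] * yi for xr, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * a - 3 * b for a, b in rows]
beta_hat = ols(rows, y)
print(beta_hat)  # approximately [1, 2, -3]
```

If one regressor column is a multiple of another, `solve` raises `ValueError`, which corresponds to the case in which $\beta$ cannot be estimated without additional constraints.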

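Regression inference

For every $i=1,\ldots ,n$, let $\sigma ^{2}$ denote the variance of the error term $\varepsilon _{i}$. In the formulas of this section, $p$ denotes the number of estimated parameters, that is, the number of columns of $X$; for a model with an intercept and $p$ regressors, replace $n-p$ with $n-p-1$ throughout. An unbiased estimate of $\sigma ^{2}$ is: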

{\hat {\sigma }}^{2}={\frac {S}{n-p}},

where $S:=\sum _{i=1}^{n}{\hat {\varepsilon }}_{i}^{2}$ is the sum of squared errors (the residual sum of squares). The estimate is related to the true value by:

{\hat {\sigma }}^{2}\cdot {\frac {n-p}{\sigma ^{2}}}\sim \chi _{n-p}^{2}

where $\chi _{n-p}^{2}$ denotes the chi-squared distribution with $n-p$ degrees of freedom.

The solution to the normal equations can be written as:

{\hat {\boldsymbol {\beta }}}=(\mathbf {X^{T}X)^{-1}X^{T}y} .

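This shows that the estimator is a linear combination of the dependent variable. Furthermore, if the observational errors are normally distributed, the parameter estimates follow a joint normal distribution. Under the present assumptions, this distribution of the estimated parameter vector is exact: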

{\hat {\beta }}\sim N(\beta ,\sigma ^{2}(X^{T}X)^{-1})

where $N(\cdot )$ denotes the multivariate normal distribution.

The standard error of each parameter estimate is:

{\hat {\sigma }}_{j}={\sqrt {{\frac {S}{n-p}}\left[\mathbf {(X^{T}X)} ^{-1}\right]_{jj}}}.

The $100(1-\alpha )\%$ confidence interval for the parameter $\beta _{j}$ can be computed as:

{\hat {\beta }}_{j}\pm t_{{\frac {\alpha }{2}},n-p}{\hat {\sigma }}_{j}.

The residuals can be expressed as:

\mathbf {{\hat {r}}=y-X{\hat {\boldsymbol {\beta }}}=y-X(X^{T}X)^{-1}X^{T}y} .\,

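Single-variable linear regression

Single-variable linear regression, also called simple linear regression (SLR), is the simplest regression model and yet a widely used one. Its regression equation is: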

Y=\alpha +\beta X+\varepsilon

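To estimate the best-fitting (minimum-error) $\alpha$ and $\beta$ from a sample $(y_{i},x_{i})$, $i=1,\ 2,\ldots ,n$, the method of least squares is usually used. Its objective is to minimize the residual sum of squares: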

\sum _{i=1}^{n}\varepsilon _{i}^{2}=\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}

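To find the minimum by differentiation, take the first-order partial derivatives of the expression above with respect to $\alpha$ and $\beta$ and set them equal to 0: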

\left\{{\begin{array}{lcl}n\ \alpha +\sum \limits _{i=1}^{n}x_{i}\ \beta =\sum \limits _{i=1}^{n}y_{i}\\\sum \limits _{i=1}^{n}x_{i}\ \alpha +\sum \limits _{i=1}^{n}x_{i}^{2}\ \beta =\sum \limits _{i=1}^{n}x_{i}y_{i}\end{array}}\right.
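
As a short derivation sketch using only the symbols already defined, the two first-order conditions behind this system are:

{\frac {\partial }{\partial \alpha }}\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}=-2\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})=0

{\frac {\partial }{\partial \beta }}\sum _{i=1}^{n}(y_{i}-\alpha -\beta x_{i})^{2}=-2\sum _{i=1}^{n}x_{i}(y_{i}-\alpha -\beta x_{i})=0

Dividing each condition by $-2$ and collecting the terms in $\alpha$ and $\beta$ gives the two equations of the system.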

This system of two linear equations in two unknowns can be solved by Cramer's rule, giving the solution ${\hat {\alpha }},\ {\hat {\beta }}$:

{\hat {\beta }}={\frac {n\sum \limits _{i=1}^{n}x_{i}y_{i}-\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}={\frac {\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}={\frac {{\text{cov}}(X,Y)}{{\text{var}}(X)}}\,

{\hat {\alpha }}={\frac {\sum \limits _{i=1}^{n}x_{i}^{2}\sum \limits _{i=1}^{n}y_{i}-\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}x_{i}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}={\bar {y}}-{\bar {x}}{\hat {\beta }}

S=\sum \limits _{i=1}^{n}(y_{i}-{\hat {y}}_{i})^{2}=\sum \limits _{i=1}^{n}y_{i}^{2}-{\frac {n(\sum \limits _{i=1}^{n}x_{i}y_{i})^{2}+(\sum \limits _{i=1}^{n}y_{i})^{2}\sum \limits _{i=1}^{n}x_{i}^{2}-2\sum \limits _{i=1}^{n}x_{i}\sum \limits _{i=1}^{n}y_{i}\sum \limits _{i=1}^{n}x_{i}y_{i}}{n\sum \limits _{i=1}^{n}x_{i}^{2}-\left(\sum \limits _{i=1}^{n}x_{i}\right)^{2}}}

{\hat {\sigma }}^{2}={\frac {S}{n-2}}.
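
The closed-form expressions for ${\hat {\beta }}$, ${\hat {\alpha }}$, $S$ and ${\hat {\sigma }}^{2}$ can be checked numerically with a minimal sketch. The function name `simple_linear_regression` and the five data points are hypothetical, chosen only for illustration.

```python
def simple_linear_regression(xs, ys):
    """Return (alpha_hat, beta_hat, S, sigma2_hat) from the closed-form sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx           # common denominator of both estimates
    beta = (n * sxy - sx * sy) / denom
    alpha = (sxx * sy - sx * sxy) / denom
    # Residual sum of squares and the unbiased variance estimate S / (n - 2).
    S = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    sigma2 = S / (n - 2)
    return alpha, beta, S, sigma2


# Hypothetical sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
alpha_hat, beta_hat, S, sigma2_hat = simple_linear_regression(xs, ys)
print(alpha_hat, beta_hat, S, sigma2_hat)  # about 2.2, 0.6, 2.4, 0.8
```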

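The covariance matrix of $({\hat {\alpha }},{\hat {\beta }})$ is $\sigma ^{2}$ times the following matrix, which is $(X^{T}X)^{-1}$ for this model: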

{\frac {1}{n\sum _{i=1}^{n}x_{i}^{2}-\left(\sum _{i=1}^{n}x_{i}\right)^{2}}}{\begin{pmatrix}\sum x_{i}^{2}&-\sum x_{i}\\-\sum x_{i}&n\end{pmatrix}}

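The confidence interval for the mean response at $x_{d}$ is (the $\alpha$ in the subscript of $t$ is the significance level, not the intercept):

y_{d}=({\hat {\alpha }}+{\hat {\beta }}x_{d})\pm t_{{\frac {\alpha }{2}},n-2}{\hat {\sigma }}{\sqrt {{\frac {1}{n}}+{\frac {(x_{d}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}

The prediction interval for a new response at $x_{d}$ is:

y_{d}=({\hat {\alpha }}+{\hat {\beta }}x_{d})\pm t_{{\frac {\alpha }{2}},n-2}{\hat {\sigma }}{\sqrt {1+{\frac {1}{n}}+{\frac {(x_{d}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}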
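
The two intervals differ only by the extra 1 under the square root, which a small sketch makes concrete. The function name `regression_intervals` and the data are hypothetical. The Python standard library has no Student t quantile function, so the critical value is passed in as an argument; the value 3.182 used here is the usual table value for a 95% interval with 3 degrees of freedom and is an assumption of this example.

```python
import math


def regression_intervals(xs, ys, x_d, t_crit):
    """Return (y_hat, mean_half_width, prediction_half_width) at x_d."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha = y_bar - beta * x_bar
    S = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(S / (n - 2))
    core = 1.0 / n + (x_d - x_bar) ** 2 / sxx
    y_hat = alpha + beta * x_d
    mean_hw = t_crit * sigma_hat * math.sqrt(core)        # mean response
    pred_hw = t_crit * sigma_hat * math.sqrt(1.0 + core)  # new observation
    return y_hat, mean_hw, pred_hw


# Hypothetical sample; t_crit is taken from a t table (n - 2 = 3 dof, 95%).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat, mean_hw, pred_hw = regression_intervals(xs, ys, x_d=3.0, t_crit=3.182)
print(y_hat, mean_hw, pred_hw)
```

The prediction interval is always wider than the confidence interval for the mean response, because it also accounts for the variance of the new error term.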

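Analysis of variance

In analysis of variance (ANOVA), the total sum of squares is partitioned into two or more components.

The total sum of squares SST (sum of squares for total) is:

{\text{SST}}=\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}

where:

{\bar {y}}={\frac {1}{n}}\sum _{i}y_{i}

Equivalently: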

{\text{SST}}=\sum _{i=1}^{n}y_{i}^{2}-{\frac {1}{n}}\left(\sum _{i}y_{i}\right)^{2}

The regression sum of squares SSReg (sum of squares for regression; also called the model sum of squares, SSM), where $\mathbf {u}$ denotes the $n\times 1$ vector of ones, is:

{\text{SSReg}}=\sum \left({\hat {y}}_{i}-{\bar {y}}\right)^{2}={\hat {\boldsymbol {\beta }}}^{T}\mathbf {X} ^{T}\mathbf {y} -{\frac {1}{n}}\left(\mathbf {y^{T}uu^{T}y} \right),

The residual sum of squares SSE (sum of squares for error) is:

{\text{SSE}}=\sum _{i}{\left({y_{i}-{\hat {y}}_{i}}\right)^{2}}=\mathbf {y^{T}y-{\hat {\boldsymbol {\beta }}}^{T}X^{T}y} .

The total sum of squares SST can also be written as the sum of SSReg and SSE:

{\text{SST}}=\sum _{i}\left(y_{i}-{\bar {y}}\right)^{2}=\mathbf {y^{T}y} -{\frac {1}{n}}\left(\mathbf {y^{T}uu^{T}y} \right)={\text{SSReg}}+{\text{SSE}}.

The coefficient of determination R² is:

R^{2}={\frac {\text{SSReg}}{\text{SST}}}=1-{\frac {\text{SSE}}{\text{SST}}}.
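
The decomposition and the two equivalent expressions for R² can be verified on a small example. This is a minimal sketch for the single-regressor case with an intercept; the data and variable names are hypothetical.

```python
# Hypothetical sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares fit of y = alpha + beta * x.
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs))
alpha = y_bar - beta * x_bar
fitted = [alpha + beta * x for x in xs]

# Sums of squares.
sst = sum((y - y_bar) ** 2 for y in ys)               # total
ssreg = sum((f - y_bar) ** 2 for f in fitted)         # regression
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # error

r2 = ssreg / sst
print(sst, ssreg, sse, r2)  # about 6.0, 3.6, 2.4, 0.6
```

The identity SST = SSReg + SSE relies on the model containing a constant term and being fitted by least squares.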

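Other methods

Generalized least squares

Generalized least squares can be used when the observational errors are heteroscedastic or autocorrelated.

Total least squares

Total least squares is used when the independent variables are subject to error.

Generalized linear models

Generalized linear models are used when the error distribution is not normal, for example the exponential, gamma, inverse Gaussian, Poisson, or binomial distribution.

Robust regression

Robust regression, in the form of least absolute deviations, minimizes the mean absolute error, whereas ordinary least-squares linear regression minimizes the mean squared error.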
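
The difference between the two criteria is easiest to see in the simplest possible model, one with a constant term only. In that special case the value minimizing the squared error is the sample mean and a value minimizing the absolute error is the sample median. The sketch below uses hypothetical data with one outlier; it illustrates the two loss functions and is not a general robust-regression routine.

```python
from statistics import mean, median

# Hypothetical data with a single large outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]


def sum_squared_error(c):
    """Squared-error loss of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys)


def sum_absolute_error(c):
    """Absolute-error loss of the constant prediction c."""
    return sum(abs(y - c) for y in ys)


c_ls = mean(ys)     # least-squares fit of the constant-only model
c_lad = median(ys)  # least-absolute-deviations fit of the same model

print(c_ls, c_lad)  # the outlier pulls the mean far more than the median
print(sum_squared_error(c_ls), sum_squared_error(c_lad))
print(sum_absolute_error(c_ls), sum_absolute_error(c_lad))
```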

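Applications of linear regression

Trend line

A trend line represents the long-term movement of time-series data. It tells whether a particular data set (such as GDP, oil prices, or stock prices) has increased or decreased over a period of time. Although a trend line can be drawn roughly by eye through the data points on a plot, a more appropriate method is to compute its position and slope by linear regression.

Epidemiology

Early evidence relating smoking to mortality and morbidity came from observational studies that used regression analysis. To reduce spurious correlations when analyzing observational data, researchers usually include several additional variables in their regression models besides the variable of primary interest. For example, suppose a regression model in which smoking is the independent variable of primary interest and the dependent variable is the smoker's lifespan, observed over a number of years. Researchers might include socioeconomic status as an additional independent variable, to ensure that any observed effect of smoking on lifespan is not due to differences in education or income. However, it is never possible to include every possible confounding variable in an empirical analysis. For example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials can often produce more compelling evidence of causal relationships than regression analysis of observational data. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables regression may be used to attempt to estimate causal relationships from observational data.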

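Finance

The capital asset pricing model uses linear regression, together with the concept of the beta coefficient, to analyze and quantify the systematic risk of an investment. This comes directly from the beta coefficient of the model that relates the return on the investment to the return on all risky assets.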
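
In this setting the beta coefficient is the slope obtained by regressing the investment's returns on the market's returns, which is the ratio of their covariance to the variance of the market returns. The sketch below uses hypothetical return series in which the asset is constructed to have a beta of 1.5; the names `market` and `asset` are illustrative only.

```python
# Hypothetical periodic returns of the market portfolio.
market = [0.01, -0.02, 0.03, 0.00, 0.02]
# Hypothetical asset returns constructed with slope 1.5 and intercept 0.001.
asset = [1.5 * m + 0.001 for m in market]

n = len(market)
m_bar = sum(market) / n
a_bar = sum(asset) / n

# Sample covariance and variance; the common factor 1/(n-1) cancels in beta.
cov_am = sum((a - a_bar) * (m - m_bar) for a, m in zip(asset, market)) / (n - 1)
var_m = sum((m - m_bar) ** 2 for m in market) / (n - 1)

beta = cov_am / var_m            # slope of the regression of asset on market
intercept = a_bar - beta * m_bar

print(beta, intercept)  # about 1.5 and 0.001 by construction
```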

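Economics

Linear regression is the predominant empirical tool in economics. For example, it is used to predict consumption spending,^[4] fixed investment spending, inventory investment, purchases of a country's exports,^[5] spending on imports,^[5] the demand to hold liquid assets,^[6] labor demand,^[7] and labor supply.^[7]

References

Citations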

[1]
Rencher, Alvin C.; Christensen, William F., Chapter 10, Multivariate regression – Section 10.1, Introduction, Methods of Multivariate Analysis, Wiley Series in Probability and Statistics 709 3rd, John Wiley & Sons: 19, 2012 [2019-05-14], ISBN 9781118391679, (archived from the original on 2019-06-15).
[2]
Hilary L. Seal. The historical development of the Gauss linear model. Biometrika. 1967, 54 (1/2): 1–24. JSTOR 2333849. doi:10.1093/biomet/54.1-2.1.
[3]
Yan, Xin, Linear Regression Analysis: Theory and Computing, World Scientific: 1–2, 2009 [2019-05-14], ISBN 9789812834119, (archived from the original on 2019-06-08), Regression analysis ... is probably one of the oldest topics in mathematical statistics dating back to about two hundred years ago. The earliest form of the linear regression was the least squares method, which was published by Legendre in 1805, and by Gauss in 1809 ... Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun.
[4]
Deaton, Angus. Understanding Consumption. Oxford University Press. 1992. ISBN 978-0-19-828824-4.
[5]
Krugman, Paul R.; Obstfeld, M.; Melitz, Marc J. International Economics: Theory and Policy 9th global. Harlow: Pearson. 2012. ISBN 9780273754091.
[6]
Laidler, David E. W. The Demand for Money: Theories, Evidence, and Problems 4th. New York: Harper Collins. 1993. ISBN 978-0065010985.
[7]
Ehrenberg; Smith. Modern Labor Economics 10th international. London: Addison-Wesley. 2008. ISBN 9780321538963.

Sources

Books

Cohen, J., Cohen P., West, S.G., & Aiken, L.S. Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates. 2003.
Draper, N.R. and Smith, H. Applied Regression Analysis. Wiley Series in Probability and Statistics. 1998.
Robert S. Pindyck and Daniel L. Rubinfeld. Chapter One. Econometric Models and Economic Forecasts. 1998.
Charles Darwin. The Variation of Animals and Plants under Domestication. (1868) (Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)

Journal articles

Galton, Francis. Regression Towards Mediocrity in Hereditary Stature (PDF). Journal of the Anthropological Institute. 1886, 15: 246–263 [2008-12-30]. (archived (PDF) from the original on 2016-03-10).

Further reading

Pedhazur, Elazar J. Multiple regression in behavioral research: Explanation and prediction 2nd. New York: Holt, Rinehart and Winston. 1982. ISBN 0-03-041760-0.
Barlow, Jesse L. Chapter 9: Numerical aspects of Solving Linear Least Squares Problems. Rao, C.R. (ed.). Computational Statistics. Handbook of Statistics 9. North-Holland. 1993. ISBN 0-444-88096-8.
Björck, Åke. Numerical methods for least squares problems. Philadelphia: SIAM. 1996. ISBN 0-89871-360-9.
Goodall, Colin R. Chapter 13: Computation using the QR decomposition. Rao, C.R. (ed.). Computational Statistics. Handbook of Statistics 9. North-Holland. 1993. ISBN 0-444-88096-8.
National Physical Laboratory. Chapter 1: Linear Equations and Matrices: Direct Methods. Modern Computing Methods. Notes on Applied Science 16 2nd. Her Majesty's Stationery Office. 1961.

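See also

Analysis of variance
Anscombe's quartet
Cross-sectional regression
Curve fitting
Empirical Bayes methods
Logistic regression
M-estimation
Nonlinear regression
Nonparametric regression
Multivariate adaptive regression splines
Lack-of-fit sum of squares
Truncated regression model
Censored regression model
Simple linear regression
Segmented linear regression

External links

Least-Squares Regression (archived copy at the Internet Archive), PhET Interactive simulations, University of Colorado at Boulder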
