A set of statistical processes for estimating the relationships between variables.
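Regression analysis (Cantonese: 迴歸分析; Jyutping: wui4 gwai1 fan1 sik1) is a family of statistical modelling techniques used to predict the relationship between two or more variables. In statistics, researchers often want to use the value of one variable to predict the values of other variables. In the simplest case the statistical model involves two continuous variables[1]: one is the independent variable (IV), the other is the dependent variable (DV), and the researcher uses the value of the IV to predict the value of the DV. One possible approach is to collect data, run a regression analysis on them, and build a model that can predict "when the IV takes this value, and all other factors are held constant, what value the DV will tend to take"[2][3].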
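As a minimal sketch of the workflow just described, the following Python snippet fits a straight line to a handful of observations by ordinary least squares and then uses the fitted line to predict the DV. The data values, the variable names (`food_intake`, `body_weight`) and the two helper functions are illustrative assumptions, not taken from any of the cited sources.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of the IV.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Value the DV tends to take when the IV equals x, according to the fitted line."""
    return intercept + slope * x


# Made-up sample: IV = food intake, DV = body weight.
food_intake = [1.0, 2.0, 3.0, 4.0, 5.0]
body_weight = [52.0, 54.1, 55.9, 58.2, 59.8]

intercept, slope = fit_simple_linear_regression(food_intake, body_weight)
print("intercept:", intercept, "slope:", slope)
print("predicted DV when IV = 6:", predict(intercept, slope, 6.0))
```

The closed-form formulas used here apply only to the one-IV linear case; models with more IVs are normally fitted with matrix methods or a statistics library.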
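Regression methods are common in fields that study complex phenomena, such as biology, psychology and the social sciences. In a complex phenomenon, the value of one factor is influenced by the values of hundreds or thousands of other factors. By contrast, in physics (for example) the value of a DV can often be predicted from the values of 3 to 4 IVs; the equations of Newton's laws of motion are an example[2]. In fields such as biology, researchers cannot know and control all of those factors, so they can only estimate the value of the DV from the IVs they have at hand. If an IV has strong predictive power, predictions of the DV based on it will be fairly accurate, but the accuracy will still be clearly below 100%[10][11]. In statistical jargon, the values of the DV follow a probability distribution around the regression function[2][8][12].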
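The idea that the DV is spread around the regression function can be illustrated with a small simulation. The sketch below assumes a made-up linear regression function and normally distributed scatter with a standard deviation of 1; both choices, and all names in the code, are hypothetical and serve only to show why the regression function cannot account for all of the variation in the DV.

```python
import random


def regression_function(x):
    # Assumed "true" relationship between the IV and the expected DV (hypothetical).
    return 2.0 + 0.5 * x


rng = random.Random(0)  # fixed seed so that the run is repeatable
xs = [i / 20 for i in range(200)]  # IV values from 0 to 9.95

# Each DV value is drawn from a normal distribution centred on the regression function.
ys = [rng.gauss(regression_function(x), 1.0) for x in xs]

mean_y = sum(ys) / len(ys)
residual_ss = sum((y - regression_function(x)) ** 2 for x, y in zip(xs, ys))
total_ss = sum((y - mean_y) ** 2 for y in ys)

# Share of the DV's variation accounted for by the regression function.
r_squared = 1.0 - residual_ss / total_ss
print(f"share of DV variance explained: {r_squared:.2f}")
```

Because the unobserved factors are modelled as random scatter, the explained share stays below 1 however well the regression function is known.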
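When carrying out a regression analysis, the different observations must be independent. Imagine a regression model with three unknown parameters and three IVs, and a researcher whose sample contains 10 individuals. Suppose the sample contains only one distinct combination of IV values (that is, the 10 individuals have exactly the same values on the IVs). In that case the regression analysis cannot estimate a unique value for each of the three unknown parameters, and the researcher can only estimate the mean and standard deviation of Y. If the sample has 10 individuals but only two distinct combinations of IV values (some individuals share exactly the same IV values, while the rest differ from that group but are identical to one another), the researcher can only estimate the values of two parameters, and so on[13].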
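One way to see this is through the rank of the design matrix (the table of IV values, one row per individual): the number of parameter values that can be pinned down uniquely cannot exceed that rank. The sketch below assumes a linear model with three IVs and no separate intercept; the helper `matrix_rank` and all the numbers are made up for illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix, computed by Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# 10 individuals, 3 IVs each (hypothetical values).
one_value = [[2, 5, 1]] * 10
two_values = [[2, 5, 1]] * 6 + [[3, 1, 4]] * 4
three_values = [[2, 5, 1]] * 4 + [[3, 1, 4]] * 3 + [[1, 0, 2]] * 3

for name, design in [("one", one_value), ("two", two_values), ("three", three_values)]:
    print(name, "distinct IV combination(s): rank =", matrix_rank(design))
```

Only the third sample, with three linearly independent IV combinations, has a design matrix of full rank, so only there can all three parameters be estimated separately.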
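    import random

    # For now, treat the parameters and the IV as all being 0.
    c = 0
    x = 0  # The IV; this could be (for example) food intake.
    a = 0
    y = c + a * x  # The regression model; y is the program's output, e.g. body weight.

    # Set c and a at random; this produces an inaccurate regression model.
    c = random.uniform(1, 100)
    a = random.uniform(0, 1)

    # The next part of the code has to do the following:
    # (1) work out whether the current line is a good (low-error) regression model;
    # (2) if not, change c and a so that the line gets closer and closer to one that predicts accurately;
    # (3) work out by how much c and a should change.

    def read_data():
        # A subroutine is needed to tell the computer how to read the data supplied by the designer.
        patient_value = ...  # omitted
        return patient_value

    def step_cost_function_for(patient_value, constant, slope):
        # An omitted subroutine. Put simply, it repeatedly:
        #   takes the next individual in the data and uses its x value and the current model to predict y,
        #   computes the error, then
        #   uses a formula to work out how to adjust c and a according to that error,
        # until the whole data set has been read.
        new_constant, new_slope = ..., ...  # omitted
        return new_constant, new_slope  # The subroutine returns the new values of c and a.

    # Rest omitted ...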
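The listing above leaves the update rule unspecified. One common way to fill it in is stochastic gradient descent on the squared error, sketched below. This is an assumption, not necessarily the method the listing has in mind: the `learning_rate` parameter, the `patient_values` list of `(x, y)` pairs, the number of passes and the noise-free data are all hypothetical.

```python
import random


def step_cost_function_for(patient_values, constant, slope, learning_rate=0.01):
    """One pass over the data, adjusting constant and slope after each individual."""
    for x, y in patient_values:
        error = (constant + slope * x) - y    # prediction minus observed value
        constant -= learning_rate * error     # gradient of 0.5 * error**2 w.r.t. constant
        slope -= learning_rate * error * x    # gradient of 0.5 * error**2 w.r.t. slope
    return constant, slope


# Hypothetical noise-free data generated from y = 50 + 2x.
patient_values = [(x, 50.0 + 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]

rng = random.Random(0)
c = rng.uniform(1, 100)  # random starting values, as in the listing above
a = rng.uniform(0, 1)

for _ in range(2000):  # repeat the pass until c and a settle
    c, a = step_cost_function_for(patient_values, c, a)

print("constant:", c, "slope:", a)
```

With real, noisy data the estimates would hover near the least-squares solution rather than reach it exactly, and the learning rate would usually need tuning.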
Draper, N.R.; Smith, H. (1998). Applied Regression Analysis (3rd ed.). John Wiley. ISBN 978-0-471-17082-2.
Elfhag, K., Tholin, S., & Rasmussen, F. (2008). Consumption of fruit, vegetables, sweets and soft drinks are associated with psychological dimensions of eating behaviour in parents and their 12-year-old children. Public health nutrition, 11(9), 914-923. Uses stepwise regression to analyse whether a group of children eat a balanced diet.
Hardle, W. (1990). Applied Nonparametric Regression. ISBN 0-521-42950-1.
Lindley, D.V. (1987). "Regression and correlation analysis," New Palgrave: A Dictionary of Economics, v. 4, pp. 120-23.
Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer. p. 3. Cases [...] in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous dependent variables, then the task is called regression.
YangJing Long (2009). "Human age estimation by metric learning for regression problems". Proc. International Conference on Computer Analysis of Images and Patterns: 74-82.
Pearcey, S. M., & De Castro, J. M. (2002). Food intake and meal patterns of weight-stable and weight-gaining persons. The American journal of clinical nutrition, 76(1), 107-112.
Jarque, C. M., & Bera, A. K. (1980). Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics letters, 6(3), 255-259.
Rokhlin, V., & Tygert, M. (2008). A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36), 13212-13217.
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association, 83(403), 596-610.
Steel, R.G.D, and Torrie, J. H., Principles and Procedures of Statistics with Special Reference to the Biological Sciences., McGraw Hill, 1960, page 288.
The Shortcomings of Standardized Regression Coefficients. University of Virginia, which states: "Standardized regression coefficients express the average change in standard deviations of an outcome variable associated with a one-standard-deviation change in a predictor variable."
Winkler, A. M., Ridgway, G. R., Webster, M. A., Smith, S. M., & Nichols, T. E. (2014). Permutation inference for the general linear model. Neuroimage, 92, 381-397.
Hosmer, D. W., & Lemeshow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in statistics-Theory and Methods, 9(10), 1043-1069.
Cameron, A. C., & Windmeijer, F. A. (1997). An R-squared measure of goodness of fit for some common nonlinear regression models. Journal of econometrics, 77(2), 329-342.
Lemeshow, S., & Hosmer Jr, D. W. (1982). A review of goodness of fit statistics for use in the development of logistic regression models. American journal of epidemiology, 115(1), 92-106.
Adamowski, J., Fung Chan, H., Prasher, S. O., Ozga‐Zielinski, B., & Sliusarieva, A. (2012). Comparison of multiple linear and nonlinear regression, autoregressive integrated moving average, artificial neural network, and wavelet artificial neural network methods for urban water demand forecasting in Montreal, Canada. Water Resources Research, 48(1).
Chen, K. Y., & Wang, C. H. (2007). Support vector regression with genetic algorithms in forecasting tourism demand. Tourism Management, 28(1), 215-226.
Koufaris, M. (2002). Applying the technology acceptance model and flow theory to online consumer behavior. Information systems research, 13(2), 205-223.