Article type: Review
Authors
1 Crop and Horticultural Research Department, Kermanshah Agricultural and Natural Resources Research and Education Center, Agricultural Research, Education and Extension Organization, Kermanshah,
2 Department of Plant Protection, Faculty of Agriculture, Razi University, Kermanshah, Iran
3 Department of Natural Resources Engineering, Faculty of Agriculture, Shiraz University, Shiraz, Iran
4 Crop and Horticultural Science Research Department, Kermanshah Agricultural and Natural Resources Research and Education Center, Agricultural Research, Education and Extension Center, Kermanshah, Iran
5 Department of Plant Protection, Faculty of Agriculture, University of Kurdistan, Sanandaj, Iran
Abstract
Keywords
Subjects
Article Title [English]
Authors [English]
Powerful and practical statistical packages have simplified data analysis and thereby broadened the application of data science in all fields of research. Accordingly, regression has been applied to almost every aspect of the life sciences. However, misuse of this model has been reported over the past decades. The purpose of this article is to examine modeling with this important statistical method and to introduce readers to its correct use. Among the assumptions of the regression model, the residuals must be normally distributed, but testing the normality of the observed values of the response variable or of any explanatory variable is not required. Researchers therefore need not be overly concerned about whether the raw data are normally distributed. Moreover, almost all normality tests, such as Kolmogorov-Smirnov, are designed for large data sets, typically more than a thousand samples. Using such tests to check the normality of model residuals estimated from a small data set, mostly fewer than a hundred cases, would therefore not be very accurate. Another issue in applying the regression model concerns collinearity among the explanatory variables. Even in a data set where all variables are generated separately and randomly in a statistical package, there are still signs of correlation. In other words, it is very hard to find a correlation coefficient exactly equal to zero (r = 0), even between a pair of separate, random variables. Therefore, every regression model contains some degree of correlation between explanatory variables; the important point is that only high correlation causes severe problems in the model. To test for collinearity, it is better to use specialized methods such as the Variance Inflation Factor (VIF) or Principal Component Analysis (PCA). Linearity is another assumption of the regression model.
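The point about collinearity can be illustrated with a minimal sketch, which is not taken from the article. It assumes standard-normal random predictors, an arbitrary seed and sample size, and hypothetical helper names (`pearson_r`, `invert`, `vif`). It computes each VIF as the corresponding diagonal element of the inverse correlation matrix, which equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right-hand side.
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    corr = [[pearson_r(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[i][i] for i in range(len(columns))]


rng = random.Random(0)  # arbitrary fixed seed (assumption, for repeatability)
n = 50                  # small sample size, chosen only for illustration
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
x4 = [a + 0.1 * rng.gauss(0, 1) for a in x1]  # deliberately a near-copy of x1

r12 = pearson_r(x1, x2)               # independent variables, yet r is not exactly 0
vif_independent = vif([x1, x2, x3])   # values close to 1
vif_collinear = vif([x1, x2, x4])     # x1 and x4 inflate each other's VIF

print("r(x1, x2) =", round(r12, 3))
print("VIF, independent predictors:", [round(v, 2) for v in vif_independent])
print("VIF, with a near-duplicate :", [round(v, 2) for v in vif_collinear])
```

In this sketch the independently generated predictors show a small non-zero correlation and VIF values near 1, while the near-duplicate predictor produces a very large VIF. A commonly quoted rule of thumb treats VIF values above about 10 as a sign of severe collinearity, although that threshold is a convention rather than something stated in the abstract.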
When the model is non-linear, data transformation may help. However, transformation changes the units of the variables and thereby alters the direction of the data array in geometric space. Researchers should also be careful when modeling a large number of data points, because this affects the probability values in the analysis of variance by increasing the degrees of freedom. As the number of data points increases, the degrees of freedom of the error term increase rapidly, so the error mean square decreases markedly, which can make the model appear significant. Nevertheless, the scatter of data points around the regression line may still be very wide. For this reason, the coefficient of determination (R-squared) is a suitable criterion for assessing the fit of the model. High values of this coefficient indicate that the model suits the data set used. Note that in a multiple regression model, the more explanatory variables are used, the higher the value of this coefficient becomes. For such conditions, when the number of explanatory variables is large, another form of this coefficient, the adjusted coefficient of determination (adjusted R2), has been introduced. Using this coefficient effectively places a limit on the number of variables in the regression model. Accordingly, the number of explanatory variables in the model should not exceed the number of samples (or the number of tens) in a data set.
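The contrast between significance and fit in large data sets, and the penalty applied by the adjusted coefficient, can be sketched as follows. This is an illustrative example under stated assumptions and not the authors' analysis: the data are simulated with a weak true slope of 0.1, the seed and sample size are arbitrary, and the function names `simple_ols` and `adjusted_r2` are hypothetical. The adjusted coefficient uses the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1), with n samples and p explanatory variables.

```python
import random


def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (R-squared, F statistic)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    r2 = 1 - ss_res / ss_tot
    # Regression mean square (1 df) divided by error mean square (n - 2 df).
    f_stat = (ss_tot - ss_res) / (ss_res / (n - 2))
    return r2, f_stat


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n samples and p explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


rng = random.Random(1)  # arbitrary fixed seed (assumption)
n = 10_000              # deliberately large sample
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.1 * a + rng.gauss(0, 1) for a in x]  # weak signal, large scatter

r2, f_stat = simple_ols(x, y)
print("R-squared  :", round(r2, 4))      # small: the line explains little variance
print("F statistic:", round(f_stat, 1))  # large: the slope still tests as significant

# Same R-squared and sample size, more explanatory variables: a larger penalty.
print("adjusted R2, p = 2 :", round(adjusted_r2(0.5, 20, 2), 3))
print("adjusted R2, p = 10:", round(adjusted_r2(0.5, 20, 10), 3))
```

With many data points the F statistic becomes large even though R-squared stays small, which is why the fit should not be judged from the probability value alone. The last two lines show how the adjusted coefficient falls as explanatory variables are added without any gain in R-squared.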
Keywords [English]