Effect of outliers on the coefficient of determination in multiple regression analysis with an application to student GPA

This study addresses the contradiction that can arise between the statistical significance and the real-world significance of regression parameters in multiple linear regression analysis. To this end, an algorithm is presented, based on the simple and multiple coefficients of determination and the sum of averages, for estimating multiple outliers when the outliers are genuine. Regression analysis was applied to a phenomenon whose results are known in advance (the relationship between semester average and cumulative average). The results were misleading, so the analysis could not be relied upon, and the regression model did not improve much even when the sample size was more than doubled; the study therefore presents an algorithm to resolve this contradiction. After checking the Ordinary Least Squares (OLS) assumptions, outliers were identified based on Cook's distance, which performed best among the diagnostics considered. The proposed algorithm was compared with several robust regression methods [Weighted Least Squares, Fully Modified Least Squares, and Least Median of Squares]. The results show that the proposed method is a robust solution for outlier estimation. It is therefore recommended to use the proposed algorithm to estimate multiple outliers in other similar phenomena (e.g., a credit card transaction monitoring system in a bank), and to develop statistical software packages implementing it. The novelty of this study lies in testing the significance of outliers, whereas most previous researchers were interested in diagnosing outliers without checking their significance.


Introduction
The Ordinary Least Squares (OLS) method is the most common way to fit a regression model, but it fails when the data contain outliers; one therefore cannot rely firmly on regression analysis results, because OLS is not robust to violations of its assumptions. All major software packages (SAS, SPSS, R, MINITAB, and STATA) provide both model estimates and diagnostics of model fit. However, the wide popularity and routine use of linear regression create some problems, which arise when there are outliers in the data. Identifying and estimating outliers is an important step in building a regression model: if outliers are identified and estimated, they will lead to a different model (Rahman et al., 2012). Sometimes, when natural phenomena are studied, the effect of one or more independent variables appears insignificant even though it is known that only these variables affect the dependent variable. For example, a person's bank balance depends on only two variables (deposits and withdrawals), so a strong and significant relationship between them is expected; any behavior other than this expectation is due to one or more outliers. Outliers are worrying because they can distort estimates of regression coefficients, produce misleading results, and cast doubt on the interpretation of the results. Another researcher could analyze the same data, question the results, and present an improved analysis that contradicts and undermines the original conclusions (Gad and Qura, 2016). In this regard, a new algorithm is presented, based on the partial and multiple correlation coefficients, the coefficient of determination, and the sum of averages of the predictors, to estimate multiple outliers in the multiple linear regression model.
One condition for estimating multiple outliers is that the outliers be genuine, i.e., not attributable to errors. The novelty of this study lies in testing the significance of outliers, whereas most previous researchers were interested in detecting and addressing outliers without checking their significance. The importance of the research is in presenting a new idea for estimating outliers in both the independent and dependent variables, using a simple algorithm, in order to obtain a reliable prediction model when only these variables affect the dependent variable.

Multiple linear regression models
Multiple linear regression helps to predict the values of a dependent variable from the values of statistically significant independent variables. The model can be expressed in the following form (Salleh et al., 2015; Park et al., 2012; Neter et al., 1996):

y_i = β₀ + β₁x_i1 + β₂x_i2 + … + β_(p−1)x_i(p−1) + e_i

where ŷ is the fitted response, the x_ik are the independent variables, n is the number of observations, p is the number of model parameters, β_k is a regression coefficient, and e_i is the i-th residual. The parameters are estimated with Ordinary Least Squares (OLS) (Freund et al., 2006):

b = (X′X)⁻¹X′y
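As a minimal sketch of the OLS estimation step, the normal-equations solution b = (X′X)⁻¹X′y can be computed directly with NumPy. The data below are made-up toy values for illustration only, not the paper's GPA data:

```python
import numpy as np

# Toy data: two predictors, five observations (hypothetical values)
X = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0], [5.0, 5.0]])
y = np.array([2.1, 2.4, 4.0, 4.1, 5.9])

# Prepend an intercept column, then solve the normal equations b = (X'X)^-1 X'y
Xd = np.column_stack([np.ones(len(y)), X])
b = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

y_hat = Xd @ b   # fitted responses ŷ
e = y - y_hat    # residuals e_i = y_i - ŷ_i
print(b)
```

With an intercept in the model, the residuals sum to zero, which is a quick sanity check on the fit.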

Goodness of fit of the OLS regression
R-Sq → R² is known as the coefficient of determination, a commonly used measure of the goodness of fit of a linear model (Altland, 1999):

R² = 1 − SSE/SST

where SSE is the error (residual) sum of squares and SST is the total sum of squares.
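A short illustration of the R² computation from observed and fitted values (the numbers are hypothetical, not taken from the paper's tables):

```python
import numpy as np

# Hypothetical observed responses and fitted values from some linear model
y = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y_hat = np.array([2.2, 2.8, 4.1, 4.9, 6.0])

sse = np.sum((y - y_hat) ** 2)     # error (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_sq = 1.0 - sse / sst             # coefficient of determination R²
print(round(r_sq, 4))              # → 0.99
```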

Outliers
Outliers are extreme values in the y-direction relative to the fitted regression line, or observations that have large residuals (Kim, 2000; Adikaram et al., 2014; Weisberg, 2013). Rousseeuw (1987) showed how a single outlier can change the direction of the least squares fit, and Huber and Ronchetti (1981) explained how outliers can cause the OLS estimates to break down.

Leverage
Leverage points are extreme values in the x-direction; they pull the regression line towards them and can have a large effect on the regression coefficients (Cerioli et al., 2013).

Influential observations
Influential observations can change the slope of the line and have a large influence on the fit of the model. In other words, an observation is said to be influential if removing it substantially changes the estimates of the regression coefficients (Alguraibawi et al., 2015).

Identification of unusual observations
To identify unusual observations, the study used diagnostic measures, including residuals, standardized residuals, studentized deleted residuals, leverage values, and Cook's D. The formulas for these diagnostic measures are as follows (Cook, 1977; Turkan et al., 2012):

Residuals
Residuals are the distances between the observed values and the predicted values (Rahmatullah Imon and Ali, 2005; Richard et al., 2019):

e_i = y_i − ŷ_i

The studentized deleted residual is defined as (Greenwell et al., 2018; Cook and Weisberg, 1982):

t_i = e_i √[(n − p − 1) / (SSE(1 − h_ii) − e_i²)]

where h_ii is the leverage of the i-th observation.
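As a sketch of the studentized deleted residual t_i = e_i √[(n − p − 1)/(SSE(1 − h_ii) − e_i²)], using a toy simple-regression data set with one deliberately planted outlier (these are illustrative values, not the paper's data):

```python
import numpy as np

# Toy simple-regression data; the last response is a planted outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])

X = np.column_stack([np.ones_like(x), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage values h_ii
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b                                  # ordinary residuals

n, p = X.shape
sse = e @ e
# Studentized deleted residual: e_i * sqrt((n-p-1) / (SSE*(1-h_i) - e_i^2))
t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
print(np.round(t, 3))
```

The planted outlier produces by far the largest |t_i|, which is exactly what this diagnostic is designed to expose.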

Cook's distance
Cook's distance combines information on the residual and the leverage (Judd et al., 2017; Belsley et al., 1980). It identifies influential cases because it considers the changes in all residuals when a case is omitted. It is calculated from the following relationship:

D_i = (b − b(i))′X′X(b − b(i)) / (p·MSE)

where b(i) is the coefficient vector calculated after deleting the i-th observation. DFFITS is as follows:

DFFITS_i = (ŷ_i − ŷ_i(i)) / √(MSE(i)·h_ii)

where ŷ_i(i) is the fitted value calculated without the i-th observation (Srivastava and Lee, 1984). COVRATIO is as follows:

COVRATIO_i = det[s(i)²(X(i)′X(i))⁻¹] / det[s²(X′X)⁻¹]

where the denominator is the determinant of the covariance matrix for the full model (Valliant, 2012).
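The three influence measures can be computed without refitting the model n times, using the standard closed-form identities in terms of the leverages h_ii and residuals e_i. A sketch on the same toy data with a planted outlier (illustrative values, not the paper's data):

```python
import numpy as np

# Toy data: the last case is a planted outlier with high leverage
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage values
mse = (e @ e) / (n - p)

# Cook's distance via the closed form: e_i^2 h_i / (p * MSE * (1-h_i)^2)
cooks_d = e**2 * h / (p * mse * (1 - h) ** 2)

# Deleted variance s_(i)^2 and studentized deleted residual t_i
s2_del = ((e @ e) - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1 - h))

# DFFITS: scaled change in the fitted value when case i is deleted
dffits = t * np.sqrt(h / (1 - h))

# COVRATIO: ratio of covariance-matrix determinants with/without case i
covratio = (s2_del / mse) ** p / (1 - h)

print(int(np.argmax(cooks_d)))  # the planted outlier dominates
```

Points with large Cook's D, |DFFITS| well above 1, or COVRATIO far from 1 are the ones flagged as influential.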

Proposed work
Influential observations should be examined carefully in both the dependent variable and the independent variables before applying the proposed algorithm to x and y.

Estimating the outliers in the independent variables
If x_i is an independent variable and its regression coefficient is not statistically significant, then x_i contains one or more outliers, and the algorithm is as follows. The coefficient of determination (R²) is calculated in the simple linear regression, and the sum of the averages of the independent variables is calculated for the same observation (∑x̄_j), using the following formula: where x* is the outlier estimate; x̄_j is the average of the independent variables for the outlier (j); R² is the coefficient of determination in the simple linear regression.

Estimating the outliers in the dependent variable
If y_j is the dependent variable and it contains one or more outliers, then the algorithm is as follows: where y_j* is the outlier estimate; R² is the multiple coefficient of determination; x̄_j is the average of the independent variables for the outlier (j).

Overview of data
The data were obtained from the student academic records on the Prince Sattam Bin Abdulaziz University website. The independent variables used in this study, denoted x_i, are the semester averages from the first level to the sixth level. The dependent variable (y) is the Cumulative Grade Point Average (GPA). Table 1 shows that the parameter for the third level has a probability value of less than 0.05, indicating that this variable has a statistically significant effect on the cumulative average. However, the probability values for the other parameters indicate no statistically significant effect on the cumulative average, which contradicts reality: the cumulative average of a student is affected only by the semester averages. These results are misleading, so the study works to find a solution to this contradiction.

Assumptions of the OLS estimator
Many graphical methods and numerical tests have been developed over the years for regression diagnostics (Abuzaid et al., 2011), and statistical software makes many of them easy to access and use. Consider the following assumptions.

Linearity and multicollinearity
Checking the linearity assumption is not so straightforward in the case of multiple regression.
The study fitted a Loess curve through the scatterplot to see whether any nonlinear relationship could be detected, and the Variance Inflation Factor (VIF) was used to verify the absence of multicollinearity between the predictors; VIF values should be less than 10 (Müller, 1992; Ibrahim and Yahya, 2017). The Loess curve shows that the relationship between the fitted values and the residuals is roughly linear, indicating that the linearity assumption is satisfied. The VIF values are all less than 10, which is evidence of the absence of multicollinearity between the predictors; this is confirmed by the matrix plot (Fig. 1).

Fig. 2 shows in the probability plot that the points do not cluster around the line, indicating that the residuals are not normally distributed; this is confirmed by the Kolmogorov-Smirnov test. Levene's test, however, indicates that the residuals have constant variance. Table 2 shows that the Durbin-Watson statistic (D.W. = 2.89955) is far from the tabulated value (D.W. = 1.639). The Breusch-Godfrey Serial Correlation LM test was used to check for autocorrelation, and it indicates that there is autocorrelation between the residuals and that the model is not stable.

From Tables 1-2 and Figs. 1-2, we conclude that four assumptions are not satisfied (statistical significance of five regression coefficients, normality, independence of the residuals, and stability of the model). Table 3 shows that the regression model did not improve much when the sample size was doubled: parameters 4 and 6 are still not statistically significant, and there is still autocorrelation between the residuals. The probability plot in Fig. 3 shows that the points do not cluster around the line, indicating that the residuals are not normally distributed; this is confirmed by the Jarque-Bera and Kolmogorov-Smirnov tests, and the regression model is also not stable after refitting with the doubled sample size.
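The VIF check described above regresses each predictor on all the others and reports VIF_j = 1/(1 − R_j²). A self-contained sketch with randomly generated stand-in predictors (not the paper's semester averages):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the predictor matrix X."""
    n, k = X.shape
    out = []
    for j in range(k):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(n), others])  # regress x_j on the rest
        coef = np.linalg.lstsq(Z, xj, rcond=None)[0]
        resid = xj - Z @ coef
        r_sq = 1.0 - (resid @ resid) / np.sum((xj - xj.mean()) ** 2)
        out.append(1.0 / (1.0 - r_sq))             # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)

# Stand-in predictors: independent random columns, so VIFs should sit near 1
rng = np.random.default_rng(0)
X = rng.normal(3.5, 0.5, size=(40, 3))
print(vif(X))  # values near 1 indicate no multicollinearity
```

Values above the usual cutoff of 10 would flag a predictor as collinear with the others.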

Diagnosis of outliers
To begin, one should become familiar with the data file and look for errors in data collection and entry, using the Moses test (Nussbaum, 2014). Table 4 shows that there are no errors in data collection.

Identifying outliers using the residuals
The goal is to detect cases that have large residuals (outliers) and cases whose removal leads to a different result. The distinction between these two kinds of cases is not always obvious, and both types are of great concern (Choonpradub and McNeil, 2005); not all outliers are necessarily influential. In this regard, a box plot of the overall measures of influence (DFFITS, COVRATIO, and Cook's D) is used to discover influential cases. Fig. 4 shows that the star-shaped cases are influential, while the circle cases are not. For example, Cook's distance indicates that case number 32 has a large residual, which suggests that it may be influential; the observations (37, 34, 33, 36, 6, 41, 29) are outliers, but only cases (32, 37, 36, 34, 33, 6) are influential, and these cases require more attention as they stand out from all other points. Fig. 5 shows that the cases diagnosed as outliers by Grubbs' test had a significant effect on the regression coefficients, whereas the cases diagnosed as outliers by Dixon's test had no effect.
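For reference, Grubbs' test compares G = max|x_i − x̄|/s against a critical value built from a Student-t quantile. A hedged sketch with made-up data containing one planted outlier (not the study's cases):

```python
import numpy as np
from scipy import stats

def grubbs(x, alpha=0.05):
    """Grubbs' test for a single outlier; returns (G, critical value, index)."""
    x = np.asarray(x, float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    G = z.max()
    # Critical value: ((n-1)/sqrt(n)) * sqrt(t^2 / (n-2+t^2)), t at alpha/(2n)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G, crit, int(np.argmax(z))

# Made-up GPA-like values with one planted outlier at the end
data = [3.1, 3.3, 2.9, 3.0, 3.2, 3.1, 2.8, 3.0, 9.5]
G, crit, idx = grubbs(data)
print(G > crit, idx)  # the planted value is flagged
```

G exceeding the critical value rejects the no-outlier hypothesis, mirroring the significance check the paper emphasizes.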

Application of the proposed algorithm
The proposed algorithm was applied according to the overall measures of influence. Cook's distance was relied on to identify influential outliers because it performed best (Table 5).

Fitting the regression model using the proposed algorithm
The probability plot in Fig. 6 shows that the points cluster around the line, indicating that the residuals are normally distributed; this is confirmed by the Kolmogorov-Smirnov and Jarque-Bera tests. There is no autocorrelation between the residuals, and the regression model is stable. Formulas (10) and (11) were used to estimate the regression parameters. The results in Table 6 indicate that all the variables have a statistically significant effect on the cumulative average, which is consistent with reality.

Comparison of the proposed algorithm with some Robust Regression methods
The proposed algorithm is compared with several robust regression methods [Weighted Least Squares (WLS), Fully Modified Least Squares (FMOLS), and Least Median of Squares (LMS)]. From Table 7, statistical significance was achieved for all regression parameters, and the normality assumption for the residuals was satisfied only by the proposed method. In addition, the proposed method has the highest coefficient of determination (Adj. R² = 0.9402) and the lowest standard error (S.E. = 0.198).

Conclusion
The study used MINITAB, SPSS, and EVIEWS to perform the computations. All estimation methods were compared using several criteria [significance of the regression parameters, adjusted coefficient of determination (Adj. R²), standard error (S.E.) of the regression, and satisfaction of the OLS assumptions]. Regression analysis was applied to a phenomenon whose results are known in advance (the relationship between semester average and cumulative average), since no method can treat outliers 100% correctly.
The results of this study show that the proposed method is a robust solution for outlier estimation; most importantly, it is a solution for estimating multiple significant outliers in a data set. The study found that the proposed algorithm performs well and yields highly efficient estimates of the regression coefficients. The proposed method can be applied to other similar phenomena. For example, it can be applied in a credit card transaction monitoring system in a bank that aims to detect fraud, where an unusual purchase appears as an outlier compared with the cardholder's normal behavior. Another example is delays in home order delivery: if 20 orders are delayed in one day, restaurant management can use the algorithm to address the problem. The novelty of this study lies in testing the significance of outliers, whereas most previous researchers were interested in detecting and addressing outliers without checking their significance. The research also found that the cause of the outliers in the data was errors on the university website, and that the data are strongly skewed to the right. It is therefore recommended to use the proposed algorithm to estimate multiple outliers in any phenomenon whose results are known in advance, and to design statistical software packages implementing it.