Regression modeling and correlation analysis spread of COVID-19 data for Pakistan

This study presents a mathematical analysis of the spread of coronavirus (COVID-19) in Pakistan by examining the situation individually in seven zones: the provinces together with Gilgit-Baltistan, Azad Jammu and Kashmir, and the federal capital. The influence of each province and the federal capital territory on the other territories is then observed. By subdividing the associated data into confirmed cases, death cases, and recovery cases, the dependence of the COVID-19 situation in one province on that in the other provinces is investigated. Since the worsening circumstances in the neighboring countries were considered a catalyst for the outbreak in Pakistan, it seemed necessary to understand the situation in those countries, particularly Iran, India, and Bangladesh. Exploratory data analysis is used to understand the behavior of the confirmed, death, and recovery case data for Pakistan, and to understand the spread of the pandemic during its different waves. Because the situation differs from province to province, a different ARIMA model is expected in each case, and the search for the most suitable ARIMA models is an essential part of this study. The time-series data are forecast with the most suitable ARIMA models to observe the influence of one territory on another. Moreover, a forecast for August 2021 is produced and its correlation with the actual data is determined. Linear, multiple regression, and exponential models are applied and the best-fitting model is identified. The information obtained from such analysis can be employed to vary possible parameters and variables in the system to achieve optimal performance.


Introduction
The novel coronavirus has proven to be one of the most fatal pandemics of this century. Globally, the earliest cases of coronavirus were noticed in December 2019 in the city of Wuhan, the capital of Hubei province, China (WHO, 2020), and the situation was declared a pandemic on March 11, 2020. The virus typically attacks a person's respiratory system and causes the disease commonly known as COVID-19. It transmits at an alarmingly quick pace, so its dispersal across the globe became a matter of a few months (Tomar and Gupta, 2020). Countering the spread of the disease through protective measures, the enhancement of medical facilities, and the research and production of a possible vaccine overburdened financial budgets worldwide. Strict lockdowns caused huge losses to world economies. Some of the challenges posed by the situation are summarized as follows:
1. An unceasing rise in the expenses needed to support medical necessities.
2. A continued effort to keep the infection restricted to a specific domain.
3. To assure unfailing precautionary measures.
4. To keep the general public aware of the severity of the situation while avoiding panic.
5. To overcome the indirect consequences, namely the cessation of business activities and a steep rise in the unemployment rate (Kawohl and Nordt, 2020).
6. To extend testing and vaccination services to the general public.
The pandemic has also been a protracted and exceptional test of public mental health, with implications foremost for suicide (Gunnell et al., 2020).
The key corrective measure for COVID-19 is the maintenance of social distancing. This strategy was found successful in slowing the outbreak in a few countries (Anderson et al., 2020). Major research studies estimate the transmission rate of COVID-19 (the basic reproduction number, denoted by $R_0$) to lie between 1.5 and 3.5 (Imai et al., 2020; Read et al., 2021), which is quite high.
Mathematical models help in understanding the transmission, spread, recovery, and death due to the virus, and hence play a vital part in policy making and in reaching important public health decisions through evidence-derived statistics (Khajanchi et al., 2018). Across the globe, scientists have been attempting to produce an effective model to depict the behavior of COVID-19. Jung et al. (2020) offered a model forecasting the rate of mortality from COVID-19: one estimate put the mortality rate at 5.1% with a reproduction number of 2.1, while the other put it at 8.4% with a reproduction number of 3.2. These estimates hinted at the danger of an upcoming pandemic. The Susceptible-Infected-Removed (SIR) model was one of the earliest models established for forecasting the coronavirus outbreak in China (Zhong et al., 2020). The SIR model was also used to predict the COVID-19 outbreak in Iran, with the parameters estimated via a Generalized Additive Model (GAM) (Zareie et al., 2020). To deliver up-to-date information on COVID-19 together with a statistical analysis of the figures, Susceptible-Exposed-Infectious-Recovered (SEIR) modeling was used for daily estimation, with microservices created to extract data from various sources (Hamzah et al., 2020). SEIR modeling was also applied to data from South Korea, Italy, and Iran to forecast the COVID-19 dispersion profile (Zhan et al., 2020). NNETAR, ARIMA, Hybrid, Holt-Winter, BSTS, TBATS, Prophet, MLP, and ELM network models were compared on Iranian data to determine the model with the lowest forecasting error (Talkhi et al., 2021). Variants of long short-term memory (LSTM) based on recurrent neural networks (RNN) were applied to construct models forecasting future timelines for India and the USA (Shastri et al., 2020).
An LSTM method along with classical curve fitting was implemented to predict the number of new patients in India (WHO, 2021). An Adaptive Neuro-Fuzzy Inference System (ANFIS) estimated the number of forthcoming cases for the next ten days in China with the help of data provided by WHO. Time-series analysis (Li, 2020; Guo et al., 2020) is an effective means of future estimation. It aids in generating a mathematical model with respect to the consistency and trend of previously observed statistics against time. Multiple time-series models have been tested and have provided fruitful results for monitoring virus control. The Auto-Regressive Integrated Moving Average (ARIMA) model is implemented in our research. This technique is commonly used for time-series prediction of infectious diseases. The model, originally designed for economic applications, has proven its worth in the medical field too. Its principle consists of filtering the high-frequency noise in the statistics, detecting local tendencies based on linear dependence, and predicting future trends (Kane et al., 2014). Wang et al. (2018) used the ARIMA model and the grey model GM(1,1) to forecast the number of hepatitis B cases in China from data acquired over seven years; ARIMA provided better estimation performance than the grey model.
The coronavirus disease (COVID-19) broke out in Wuhan in 2019, and by early 2020 it was declared a pandemic. Pakistan reported its first case from the capital city, Islamabad, in February 2020. Pilgrims from Iran were considered the primary source carrying this contagious disease into Pakistan at that time. This study analyzes the spread of COVID-19 in Pakistan from February 2020 to August 2021, by which time the third wave was well advanced and the fourth wave was about to take off.
Following (WHO, 2021), ARIMA models are used to study the available data for each province. Time-series analysis of the data is performed with different regression models, and exploratory analysis is used to understand the spread of the virus, together with data correlation and visualization. The paper is organized in five sections: introduction, Pakistan COVID-19 data, research methodology, results and discussion, and conclusion.

Pakistan COVID-19 data
COVID-19 data for Pakistan were taken from the Ministry of National Health Services, Regulations and Coordination. The data are divided into 7 provinces, namely Sindh, Punjab, Balochistan, Khyber Pakhtunkhwa (KPK), Islamabad, Azad Jammu and Kashmir (AJK), and Gilgit-Baltistan (GB). Islamabad, AJK, and GB are treated as provinces of Pakistan for the purposes of COVID-19 propagation. The data set is further segregated into confirmed cases (CC; 1,159,427), death cases (DC; 25,599), and recovery cases (RC; 1,042,734), observed from March 2020 to date. The data have been processed and analyzed for March 2020 to August 2021; a forecast is presented for the month of August 2021 and correlated with the actual observed data.

Research methodology
ARIMA: Exploratory data analysis is used to understand the frequency distribution of the data over the entire region. The data can be treated as a discrete time series (a signal for processing and analysis). The total case counts have been processed to obtain the mean, standard deviation, skewness, and kurtosis. The Box-Jenkins approach, known as the Autoregressive Integrated Moving Average (ARIMA) model, is then applied. It is considered a powerful method for analyzing a time series and forecasting its future behavior. The model expresses the series in terms of its own past values (lags) and its lagged forecast errors, and it is applied to stationary data. ARIMA is characterized by three order terms (p, d, q), with AR(p) and MA(q) components. In standard notation, the general ARIMA model is

$$\hat{y}_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j},$$

where $\hat{y}_t$ is the predicted value of the series after differencing $d$ times, $c$ is a constant, $\phi_i$ are the AR coefficients, $\theta_j$ are the MA coefficients, and $\varepsilon_{t-j}$ are the lagged forecast errors. The model is estimated for different orders of the autoregressive (AR), integrated (I), and moving average (MA) terms. The COVID-19 data are non-stationary; they are converted to stationary data through difference operations, with stationarity checked by unit-root tests such as the augmented Dickey-Fuller test (Sharma et al., 2020). Model identification is performed through correlation: the partial autocorrelation function (PACF) and the autocorrelation function (ACF) of the stationary data are used to pick the AR(p) and MA(q) orders. The models are estimated for different orders (p, d, q), and the best-fit model is selected for each province individually by Akaike's information criterion (AIC). Diagnostic checking of the best models was carried out with the Q-statistic, and all roots of the best-fit models were found to lie inside the unit circle. The selected models are used to forecast the future values of the process.
Moreover, the correlation between the real and forecast data for July-August 2021 has been computed to test the performance of the selected models.
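The identification step described above can be illustrated with a minimal sketch in plain Python. The cumulative counts are hypothetical (not the study's data), and the helper names `difference` and `acf` are illustrative choices rather than anything taken from the study; the sketch only shows how differencing and the sample autocorrelation function are computed before the AR and MA orders are read off.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to remove trend."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def acf(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    denom = sum(v * v for v in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Hypothetical cumulative confirmed cases for one province (illustration only).
cumulative = [2, 5, 9, 15, 24, 36, 50, 67, 85, 106, 128, 153]

daily = difference(cumulative, d=1)    # new cases per period
second = difference(cumulative, d=2)   # change in new cases per period
rho = acf(second, max_lag=3)           # rho[0] is 1 by construction

print("daily:", daily)
print("ACF of twice-differenced series:", [round(r, 3) for r in rho])
```

In practice the differencing order d would be chosen with a unit-root test, and the PACF would be inspected alongside the ACF; both are omitted here to keep the sketch short.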

Regression model
Regression analysis has been performed because it provides great flexibility in different circumstances. It is applied here to understand the relationship between the independent and dependent variables on the basis of their coefficients. The confirmed, recovered, and death cases are the dependent variables, whereas the month, the average monthly temperature, and the humidity are the independent variables. The following analyses have been performed: linear regression for the total COVID-19 cases; multiple regression of the confirmed, recovered, and death cases on monthly growth, average temperature, and humidity; and an exponential growth model (EGM) with respect to monthly growth.

Linear regression model (LRM)
The model is described by

$$y = \beta_1 x + \beta_0 + \varepsilon,$$

where the parameters $\beta_1$ and $\beta_0$ are the slope (gradient) and intercept, respectively. These parameters characterize the growth of the spread of COVID-19 on a daily basis; $y$ is the daily growth of confirmed, recovered, or death cases, and $x$ represents the day or month under consideration.
The error term can be calculated with the following equation:

$$\varepsilon = y - \hat{y}, \qquad \hat{y} = \beta_1 x + \beta_0.$$
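A linear fit of this kind can be sketched with ordinary least squares in plain Python. The monthly counts below are hypothetical, and the function name `fit_line` is an illustrative choice; the sketch assumes a single predictor (the month index) and returns the slope, intercept, R², and residuals (the error terms).

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are the observed values minus the fitted values.
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared, residuals


# Hypothetical total confirmed cases (in thousands) for six consecutive months.
months = [1, 2, 3, 4, 5, 6]
cases = [12.0, 25.0, 41.0, 52.0, 70.0, 81.0]

slope, intercept, r_squared, residuals = fit_line(months, cases)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.4f}")
```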

Multiple regressions (MR)
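It models the relationship between a dependent variable and two or more independent variables with corresponding coefficients, such as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon,$$

where $y$ is the number of confirmed, recovered, or death cases, $x_1$ is the month, $x_2$ is the average temperature, $x_3$ is the average humidity, $\beta_0, \dots, \beta_3$ are the regression coefficients, and $\varepsilon$ is the error term.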
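As a hedged illustration, a multiple regression on month, temperature, and humidity can be fitted by solving the normal equations. The sketch below uses plain Python with a small Gaussian-elimination solver; the data rows are hypothetical, and `solve` and `fit_multiple` are illustrative names, not functions from the study or from any library.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_multiple(rows, y):
    """Least squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical rows: (month index, average temperature in C, average humidity in %).
predictors = [(1, 20, 50), (2, 25, 40), (3, 30, 60), (4, 28, 55), (5, 22, 45), (6, 18, 65)]
confirmed = [120.0, 180.0, 310.0, 390.0, 420.0, 560.0]  # hypothetical case counts

coefficients = fit_multiple(predictors, confirmed)
print("intercept, month, temperature, humidity:", [round(c, 3) for c in coefficients])
```

The normal equations are adequate for a handful of predictors; with strongly correlated predictors a QR-based solver would be the more stable choice.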

Exponential growth model (EGM)
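It models the relationship between the month and the increasing number of affected cases, taken as the cumulative sum on a monthly basis. The EGM equation is presented here:

$$y = a\, e^{b x},$$

where $y$ is the cumulative number of cases in month $x$, and $a$ and $b$ are the fitted constants, with $b$ the monthly growth rate.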
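One simple way to fit an exponential growth curve of the form y = a * exp(b * x) is to regress the logarithm of the cumulative counts on the month index. The sketch below takes that route in plain Python; this log-linear shortcut is an assumption made for illustration (a nonlinear least squares fit would weight the data differently), the counts are hypothetical, and `fit_exponential` is an illustrative name.

```python
import math


def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by least squares on log(y); requires every y > 0."""
    logs = [math.log(v) for v in y]
    n = len(x)
    mean_x = sum(x) / n
    mean_log = sum(logs) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxl = sum((xi - mean_x) * (li - mean_log) for xi, li in zip(x, logs))
    b = sxl / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


# Hypothetical cumulative confirmed cases at the end of six consecutive months.
months = [1, 2, 3, 4, 5, 6]
cumulative_cases = [2000.0, 16000.0, 70000.0, 210000.0, 280000.0, 296000.0]

a, b = fit_exponential(months, cumulative_cases)
predicted = [a * math.exp(b * m) for m in months]
print(f"a={a:.1f}, b={b:.3f}")
```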

Results and discussion
COVID-19 data of Pakistan is selected and divided into 7 provinces as discussed in section "Pakistan COVID-19 data."

Tail analysis
Exploratory data analysis has been applied to the CC, DC, and RC data to understand the behavior of the COVID-19 statistics. Table 1 shows the total number of cases, mean, median, standard deviation, skewness, and kurtosis of the time series, treated as a random process. The shape of the frequency distribution is assessed through the skewness and kurtosis. The statistics of all provinces are positively skewed for CC, DC, and RC, and the others show a lower frequency distribution than normal. Figs. 1-3 show the CC, DC, and RC data of COVID-19 in all provinces from March 2020 to August 2021.
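The descriptive statistics of the kind reported in Table 1 can be sketched in plain Python as follows. The daily counts are hypothetical, and the conventions are assumptions made for illustration: sample standard deviation, moment-based skewness, and non-excess kurtosis (equal to 3 for a normal distribution). Statistical packages differ on these conventions, so values may not match Table 1 exactly.

```python
import statistics


def describe(data):
    """Return total, mean, median, sample std, moment skewness, and non-excess kurtosis."""
    n = len(data)
    mean = sum(data) / n
    # Central moments (population form) used for skewness and kurtosis.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "total": sum(data),
        "mean": mean,
        "median": statistics.median(data),
        "std": statistics.stdev(data),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical daily confirmed cases for one province (illustration only).
daily_cases = [3, 5, 4, 8, 12, 9, 15, 40, 22, 18, 11, 7]

summary = describe(daily_cases)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A positive skewness, as in this hypothetical series, indicates a long right tail: most days have few cases and a few days have many.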
The correlation of the COVID-19 spread among all provinces has been calculated and is presented in Tables 2-4 for CC, DC, and RC, respectively. The correlation among Sindh, Punjab, Balochistan, KPK, and Islamabad is higher, as these are population-dense areas. The correlation of AJK and GB with the other provinces is comparatively weaker, as these are less densely populated areas. Similarly, the RC correlation among Sindh, Punjab, Balochistan, KPK, and Islamabad is higher than that for AJK and GB, because the former have better medical facilities than AJK and GB.
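A pairwise correlation table of this kind can be sketched with the Pearson coefficient in plain Python. The three short series are hypothetical and only carry province names for readability; `pearson` and `correlation_matrix` are illustrative names, and the Pearson coefficient is assumed here because the text does not state which correlation measure was used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(series_by_zone):
    """Nested dict with the Pearson correlation for every pair of zones."""
    zones = list(series_by_zone)
    return {
        a: {b: pearson(series_by_zone[a], series_by_zone[b]) for b in zones}
        for a in zones
    }


# Hypothetical daily confirmed cases (illustration only, not the study's data).
confirmed = {
    "Sindh": [10, 14, 20, 31, 45, 60, 58, 72],
    "Punjab": [8, 15, 22, 30, 41, 66, 61, 70],
    "GB": [1, 0, 2, 1, 3, 1, 4, 2],
}

matrix = correlation_matrix(confirmed)
for zone, row in matrix.items():
    print(zone, {other: round(value, 3) for other, value in row.items()})
```

The same `pearson` helper applies unchanged to a forecast series and the corresponding recorded series.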

COVID-19 forecasts for Pakistan
Further, the Autoregressive Integrated Moving Average (ARIMA) model has been constructed for the CC, DC, and RC of all provinces from the recorded data. The first two orders have been considered in searching for the ARIMA(p, d, q) model. The best models, selected by Akaike's information criterion (AIC), are shown in Table 5. The equations of the best ARIMA models for CC, DC, and RC are shown in Tables 6-8, respectively. The correlation between the recorded and forecast data for August is shown in Table 9 and is also presented in Fig. 4. During the second wave, the number of cases increased in highly populated areas such as Sindh, Punjab, Balochistan, Islamabad, and KPK. The death rate remained low owing to the COVID-19 smart lockdown in Pakistan. The number of deaths is lowest in the least populated areas, such as AJK and GB.
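Order selection by AIC can be illustrated with a deliberately simplified sketch. It fits pure autoregressive models AR(p) by least squares to a simulated stationary series and scores each with the Gaussian least-squares form of the criterion, AIC = n ln(RSS/n) + 2k. Everything here is an assumption made for illustration: the series is simulated, the MA terms of a full ARIMA fit are left out, and `solve` and `ar_aic` are illustrative names; a full ARIMA estimation would normally rely on a statistics package.

```python
import math
import random


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ar_aic(series, p, max_p):
    """AIC of a least squares AR(p) fit, scored on the observations from index max_p on."""
    # Every order is evaluated on the same targets so that the AIC values are comparable.
    rows = [[1.0] + [series[t - i] for i in range(1, p + 1)]
            for t in range(max_p, len(series))]
    targets = series[max_p:]
    k = p + 1  # intercept plus p lag coefficients
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    coef = solve(xtx, xty)
    rss = sum((y - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, y in zip(rows, targets))
    n = len(targets)
    return n * math.log(rss / n) + 2 * k


# Simulated stationary AR(1) series standing in for a differenced case series.
rng = random.Random(0)
series = [0.0]
for _ in range(199):
    series.append(0.6 * series[-1] + rng.gauss(0.0, 1.0))

max_p = 3
scores = {p: ar_aic(series, p, max_p) for p in range(1, max_p + 1)}
best_p = min(scores, key=scores.get)
print("AIC by order:", {p: round(v, 2) for p, v in scores.items()}, "best:", best_p)
```

The penalty term 2k is what keeps the criterion from always preferring the largest order, since the residual sum of squares alone can only decrease as lags are added.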

Neighboring country comparison
Some of the neighboring countries, namely India, Bangladesh, and Iran, are compared with respect to confirmed and death cases in Figs. 5 and 6. India has a higher rate of confirmed and death cases than Iran, while Pakistan and Bangladesh have almost the same rate of confirmed and death cases. Similar studies are also worth mentioning (Ghosh, 2020; Fargana et al., 2021). According to WHO, the DC cases reported in India, Bangladesh, and Iran are 2893589, 146020, and 843140, respectively. The graphs clearly show that confirmed and death cases are on the higher side in India, whereas Bangladesh maintains a uniformly low rate in both. A better economy and a better literacy rate have perhaps played their role in better control. In India, the dense population may be a possible cause of the high spread of the disease. Table 10 shows the LRM variable results for the total C, R, and D cases.

Regression analysis
The R², adjusted R², standard error, and coefficients of the best-fit model are presented in Table 10 for the LRM.
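For reference, the adjusted R² and the standard error of a regression follow from the usual textbook formulas, sketched below in plain Python. The inputs are hypothetical, the function names are illustrative, and k denotes the number of predictors excluding the intercept.

```python
import math


def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def standard_error(residuals, k):
    """Standard error of the regression: sqrt(RSS / (n - k - 1))."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return math.sqrt(rss / (n - k - 1))


# Hypothetical example: 18 monthly observations, one predictor, R^2 = 0.90.
print(round(adjusted_r2(0.90, n=18, k=1), 5))
print(round(standard_error([1.0, -1.0, 1.0, -1.0], k=1), 5))
```

Because the adjustment penalizes each added predictor, the adjusted R² is the fairer figure when a single-predictor model is compared with a multiple regression.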
Figs. 7 to 9 present the linear regression for the total confirmed (Fig. 7), total recovered (Fig. 8), and total death cases (Fig. 9), showing the predicted statistics with respect to the monthly data of increasing cases (Figs. 7-9 a and d), the effect of average temperature (Figs. 7-9 b and e), and the effect of average humidity (Figs. 7-9 c and f), using a 95% confidence interval. The R² and adjusted R² values for temperature and humidity are smaller than those for monthly growth. Affected cases increase as humidity increases, whereas temperature does not have a significant effect on the number of cases. Fig. 10 presents the multiple regression modeling, with similar coefficients in Table 10. The numbers of confirmed and recovered cases increase on a monthly basis in Fig. 10a. Fig. 10c confirms the humidity analysis: confirmed cases increase with the rise in humidity. Fig. 10b indicates that temperature does not directly affect the cases, as the regression line decreases when the temperature increases. Fig. 11 presents the cumulative exponential growth function as a nonlinear regression analysis. The graphs are presented with trend equations on the basis of monthly growth in Fig. 11 (a, b, and c). The EGM is applied to temperature and humidity in the same way as the linear regression, as shown in Figs. 7, 8, and 9 (d, e, and f). The EGM performs better than the LRM and MR, whose errors are larger: it shows fewer fluctuations of the data about the regression line and has a smaller standard error.

Conclusion
The present study applied exploratory data analysis to understand the behavior of confirmed cases, death cases, and recovery cases in Pakistan and to investigate the spread during different waves. Moreover, the autoregressive integrated moving average model has been applied, which includes model identification through the correlation function, followed by model estimation with diagnostic checking and prediction of the future trend of COVID-19. The best models for CC, RC, and DC have been identified. Populated areas are more affected, and the death rate increases there as well. Also, the number of recoveries improved during the later waves.
Moreover, exploratory data analysis has been performed on the basis of monthly increasing cases, average temperature, and humidity. The descriptive and graphical statistics have been analyzed with the LRM, MR, and EGM models. Fluctuations about the regression line were observed for the LRM and MR models; cases were found to increase as humidity increases, and the EGM was found more adequate for predicting COVID-19 growth.

Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.