A study of principal components analysis for mixed data

Analyzing data requires statistical tools that help interpret the information the data contain, including the qualitative and quantitative components of mixed data. The objective of this paper is to study the implementation of principal component analysis on mixed data, explain how to handle this type of database, and show how to extract statistical information about a population under study. The effectiveness of principal component analysis on mixed data is studied using data sets available in R packages and simulated data.


Introduction
Most data applications involve dealing with large data sets, which contain several measures (variables) that can be either numerical or categorical. Thus, researchers in fields such as business and medicine increasingly require powerful visual and analytical tools to visualize and analyze data.
Processing large data becomes more and more difficult as the number of dimensions increases. Dimension reduction is a collection of statistical methods used to analyze such data. This is done in two different ways: by selecting the most significant features from all features for model building (a technique called feature selection), or by transforming the high-dimensional data into a low-dimensional representation that retains the most important information while still accurately describing the original data set (a technique called feature extraction). Principal component analysis (PCA), invented by Pearson (1901), is one of the most commonly used dimension reduction methods; it is a feature extraction method, and in this paper it is applied to mixed data.
The central idea is to find a new coordinate system in which the input data can be expressed while minimizing information loss. PCA reduces the dimension of the original data by computing a small number of orthogonal linear combinations of the original variables, the principal components (PCs), chosen to have the largest variance. PCA is used in many applications, for example, image compression, bioinformatics, data mining, psychology, and pattern recognition, among others (Kalantan et al., 2017; Kalantan, 2019).
In practice, principal component analysis (PCA) handles numerical variables, while multiple correspondence analysis (MCA) handles categorical variables. PCA on mixed data, one of several methods proposed to handle such data, can be seen as a mixture of PCA and MCA; it was proposed by De Leeuw and van Rijckevorsel (1980). This paper illustrates the method in detail and discusses its effectiveness by implementing it on a real dataset.
The paper is organized as follows. Section 2 presents a brief review of PCA. MCA is discussed in Section 3. Section 4 demonstrates how PCA is obtained for mixed data. Finally, the interpretation of a case study and associated graphics is discussed in Section 5.

Principal component analysis (PCA)
Principal components are uncorrelated linear combinations whose variances are as large as possible. The first principal component is therefore the linear combination a1ᵗx with the largest variance, Var(a1ᵗx) = a1ᵗΣa1, where Σ is the covariance matrix of x. Since Var(a1ᵗx) can be increased by multiplying a1 by any constant, we impose the restriction that Var(a1ᵗx) is maximized subject to a1ᵗa1 = 1. Thus, the principal components are such that:
1st principal component = the linear combination a1ᵗx that maximizes Var(a1ᵗx) subject to a1ᵗa1 = 1;
2nd principal component = the linear combination a2ᵗx that maximizes Var(a2ᵗx) subject to a2ᵗa2 = 1 and Cov(a1ᵗx, a2ᵗx) = 0;
kth principal component = the linear combination akᵗx that maximizes Var(akᵗx) subject to akᵗak = 1 and Cov(aiᵗx, akᵗx) = 0 for i < k.
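The constrained maximization above is solved by the eigendecomposition of the covariance matrix Σ: the kth principal component direction is the eigenvector associated with the kth largest eigenvalue. A minimal sketch in Python with numpy (the function name and the random data are illustrative, not part of the method description):

```python
import numpy as np

def pca(X, n_components=2):
    """Principal components via eigendecomposition of the covariance matrix.

    Returns the component scores and the fraction of variance explained."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix (Sigma)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort descending by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :n_components]  # project onto the leading PCs
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores, explained = pca(X, n_components=2)
```

Because the eigenvectors are orthogonal, the resulting component scores are uncorrelated, exactly as required by the constraints above.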

Multiple correspondence analysis (MCA)
Multiple correspondence analysis is a statistical technique. It is an extension of simple correspondence analysis (CA) which allows one to study the association between two or more qualitative variables and to visualize the data table. It can be seen as an analogue of principal component analysis (PCA) when the variables to be analyzed are categorical instead of quantitative (Abdi and Valentin, 2007).
There are Q categorical variables, and each categorical variable q has Jq levels, where J = ∑q Jq. There are n observations. Let X be an indicator matrix with n × J dimensions. MCA is performed by applying CA to the indicator matrix. Two sets of factor scores, one for the rows and one for the columns, are then obtained. These factor scores are standardized such that their variance equals the corresponding eigenvalue.
Firstly, we compute the probability matrix Z = N⁻¹X, where N is the grand total of X. Let Dr = diag(r) and Dc = diag(c), where r and c denote the vectors of the row totals and the column totals of Z, respectively. We obtain the factor scores by applying the following SVD:
Dr^(−1/2)(Z − rcᵗ)Dc^(−1/2) = PΔQᵗ,
where Δ is the diagonal matrix of the singular values and Λ = Δ² is the matrix of the eigenvalues.
Then, we obtain the row factor scores, denoted by F, and the column factor scores, denoted by G, as follows (Abdi and Valentin, 2007):
F = Dr^(−1/2)PΔ and G = Dc^(−1/2)QΔ.
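The MCA computation above can be sketched directly in numpy. This is a minimal illustration of the SVD-based formulas, assuming the indicator matrix X has already been built (the function name and toy data are illustrative):

```python
import numpy as np

def mca(X):
    """Multiple correspondence analysis: CA applied to an indicator matrix X.

    Returns row factor scores F, column factor scores G, and the eigenvalues."""
    Z = X / X.sum()                              # probability matrix
    r = Z.sum(axis=1)                            # row totals
    c = Z.sum(axis=0)                            # column totals
    Dr_isq = np.diag(1.0 / np.sqrt(r))           # Dr^(-1/2)
    Dc_isq = np.diag(1.0 / np.sqrt(c))           # Dc^(-1/2)
    S = Dr_isq @ (Z - np.outer(r, c)) @ Dc_isq   # standardized residuals
    P, delta, Qh = np.linalg.svd(S, full_matrices=False)
    F = Dr_isq @ P * delta                       # row factor scores
    G = Dc_isq @ Qh.T * delta                    # column factor scores
    return F, G, delta ** 2                      # Lambda = Delta^2

# Toy indicator matrix: 6 observations, 2 categorical variables
# with 2 levels each, one-hot coded (illustrative data only).
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
F, G, eigvals = mca(X)
```

A quick check of the standardization property: the variance of each column of F, weighted by the row masses r, equals the corresponding eigenvalue.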

Principal component analysis for mixed data
In this paper, we implemented PCA on mixed data following the approach proposed by Chavent et al. (2014). The dataset to be analyzed by PCAmix consists of n observations described by p1 numerical variables and p2 categorical variables. Let X1 be an n × p1 matrix which represents the numerical variables and X2 be an n × p2 matrix that represents the categorical variables, and let m denote the total number of levels of the categorical variables. An indicator matrix G with n × m dimensions contains the binary coding of each level of the categorical variables.
The matrix to be analyzed is Z = [Y1, Y2], where Y1 is the standardized matrix constructed by centering and normalizing the columns of X1, and Y2 denotes the centered indicator matrix G. Now, let N be the diagonal matrix of the weights of the rows of Z, with each row weighted by 1/n, and let D be the diagonal matrix of the weights of the columns of Z, where a numerical column has weight 1 and the column of level s has weight n/ns, with ns (s = 1, …, m) the number of observations appearing at the sth level. Then, the generalized singular value decomposition (GSVD) of Z is:
Z = UΛVᵗ,
where Λ = diag(√λ1, √λ2, …, √λr) is the r × r diagonal matrix such that λ1, λ2, …, λr are the eigenvalues of ZDZᵗN and r denotes the rank of Z. U is a matrix with n × r dimensions whose columns are the first r eigenvectors of ZDZᵗN, such that UᵗNU = Ir. V is the (p1 + m) × r matrix of the first r eigenvectors of ZᵗNZD, such that VᵗDV = Ir. Therefore, the principal components of PCAmix are computed as the row scores R = UΛ, with dimensions n × r. The column scores are C = DVΛ, which in standard PCA reduce to C = VΛ.
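The GSVD with metrics N and D can be computed with an ordinary SVD of the row- and column-weighted matrix N^(1/2) Z D^(1/2). The following numpy sketch follows the definitions above; it is an illustrative implementation under those assumptions, not the PCAmixdata package itself, and the function name and simulated inputs are hypothetical:

```python
import numpy as np

def pcamix(X1, levels):
    """Sketch of PCAmix via the GSVD: X1 holds the numerical variables,
    `levels` holds the categorical variables as integer codes."""
    n = X1.shape[0]
    # Y1: centered and normalized numerical block
    Y1 = (X1 - X1.mean(axis=0)) / X1.std(axis=0)
    # Build the indicator matrix G and collect column weights
    Gs = []
    col_w = [1.0] * X1.shape[1]                  # weight 1 for numeric columns
    for col in levels.T:
        cats = np.unique(col)
        Gq = (col[:, None] == cats[None, :]).astype(float)
        Gs.append(Gq)
        col_w.extend(n / Gq.sum(axis=0))         # weight n/ns for level s
    G = np.hstack(Gs)
    Y2 = G - G.mean(axis=0)                      # centered indicator block
    Z = np.hstack([Y1, Y2])
    N = np.full(n, 1.0 / n)                      # row weights (diagonal of N)
    D = np.array(col_w)                          # column weights (diagonal of D)
    # GSVD via ordinary SVD of the weighted matrix
    Zt = np.sqrt(N)[:, None] * Z * np.sqrt(D)[None, :]
    Ut, delta, Vh = np.linalg.svd(Zt, full_matrices=False)
    U = Ut / np.sqrt(N)[:, None]                 # satisfies U^t N U = I
    V = Vh.T / np.sqrt(D)[:, None]               # satisfies V^t D V = I
    R = U * delta                                # row scores R = U Lambda
    C = (D[:, None] * V) * delta                 # column scores C = D V Lambda
    return R, C, delta ** 2                      # eigenvalues lambda_1..lambda_r

rng = np.random.default_rng(1)
X1 = rng.normal(size=(100, 3))
levels = rng.integers(0, 2, size=(100, 2))       # two binary categorical variables
R, C, eigvals = pcamix(X1, levels)
```

With p1 = 3 numerical variables and two binary categorical variables, the total inertia is p1 + (m − p2) = 3 + 2 = 5, and the weighted covariance of the row scores is diagonal with the eigenvalues on the diagonal.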

Experimental results
In this section, we discuss the effectiveness of PCA on mixed data that contain both numerical and categorical data. This is illustrated with a simulation case and real data available in R packages.

Simulation case
A generated sample of size 500 consists of seven variables. The first four, age, IQ, grade, and height, are quantitative variables, while the variables race, sex, and smoker are considered qualitative variables; the data are available in the 'Wakefield' package (Rinker, 2018). As a pre-processing step, we split the data into two data matrices: a 500 × 4 numerical data matrix named data A, and data B, representing the categorical variables as a 500 × 3 matrix. We performed the PCAmix analysis, and the results are summarized in Table 1, which shows that 80.84% of the total variance is explained by 10 principal components. Fig. 1a displays the graphical output of the factor coordinates, absolute contributions, and squared cosines for all variables. Table 2 presents the contributions of all variables; the squared correlation of each quantitative variable and the correlation ratio of each qualitative variable are shown graphically in Fig. 1b. More graphical outputs are presented in Fig. 2a and Fig. 2b: Fig. 2a shows the factor coordinates, absolute contributions, and squared cosines of the qualitative variables, and the results for the quantitative variables are presented in Fig. 2b.

Application case
We implemented the PCAmix method on the "SAheart" dataset from the R package "ElemStatLearn". It is a sample of males in a heart-disease high-risk region of the Western Cape, South Africa. The dataset consists of 462 observations on 10 variables, two of which are qualitative and the rest quantitative, as shown in Table 3.
As a pre-processing step, we split the data into two data matrices: a 462 × 8 numerical data matrix named data A, and data B, representing the categorical variables as a 462 × 2 matrix. We performed the PCAmix analysis, and the results are summarized in Table 4, which shows that 81.23% of the total variance is explained by 6 principal components. Fig. 3a displays the graphical output of the factor coordinates, absolute contributions, and squared cosines for all variables. Table 5 presents the contributions of all variables; the squared correlation of each quantitative variable and the correlation ratio of each qualitative variable are shown graphically in Fig. 3b.
More graphical outputs are presented in Fig. 4a and Fig. 4b: Fig. 4a shows the factor coordinates, absolute contributions, and squared cosines of the qualitative variables, and the results for the quantitative variables are presented in Fig. 4b.

Conclusion
PCA for mixed data is a powerful technique for interpreting variables of different data types. The objective of the process is to reduce the number of dimensions by selecting the components that explain 80% of the variance of the data. We found that with this method we can analyze a mixture of numerical and categorical variables and extract relevant information without having to handle each type separately.