## Overview of PPCA, an extension of classical PCA, and its application to incomplete data via the EM Algorithm

Probabilistic Principal Components Analysis (PPCA) is a dimensionality reduction technique that leverages a latent variable framework to recover the directions of maximal variance in the data. When the noise follows an isotropic Gaussian distribution, the probabilistic principal components are closely related to the classical principal components, equivalent up to a scaling factor and an orthogonal rotation. PPCA can thus be used for many of the same applications as classical PCA, such as data visualization and feature extraction. The latent variable framework behind PPCA also offers functionality which classical PCA does not. For instance, PPCA can easily be extended to accommodate data with missing values, whereas classical PCA is undefined on incomplete data.

PPCA can be implemented using a number of different methods. Tipping and Bishop provided an implementation of PPCA via the EM algorithm in their original 1999 article; however, they did not explicitly show how the EM algorithm for PPCA extends to incomplete data. A previous article on Towards Data Science discussed an alternative approach to PPCA, which uses variational inference rather than the EM algorithm to impute missing values and derive the probabilistic principal components. This approach starts from the simplifying assumption that the standard deviation of the noise is known upfront, an assumption which makes it easier to optimize the variational distribution but is not representative of most applications. For this post, I'll focus on the EM algorithm, expanding upon previous discussions by illustrating all of the steps needed to extend Tipping and Bishop's EM algorithm for PPCA to incomplete data.

**Overview of PPCA and its Relationship to Classical PCA:**

Classical PCA is a deterministic method which does not model the data in terms of distinct signal and noise components. In contrast, PPCA is derived from a probabilistic model of the form

x_i = W z_i + mu + epsilon_i,

where z_i is a vector of q unobserved latent variables, W is a loading matrix that maps the q latent variables into the p observed variables, mu is the mean vector of the p observed variables, and epsilon_i is a noise term which makes the procedure probabilistic rather than deterministic. It's typically assumed that z_i follows a standard normal distribution. The noise term, epsilon_i, must follow an isotropic Gaussian distribution with mean zero and a covariance matrix of the form sigma^2 I for the relationship between PPCA and classical PCA to hold.

Under this latent variable model, Tipping and Bishop (1999) showed that the directions of maximal variance in the data can be recovered through maximum likelihood estimation. They prove that the MLEs for W and sigma are given by

W_MLE = U_q (Lambda_q - sigma^2 I)^(1/2) R,
sigma^2_MLE = (1 / (p - q)) * (lambda_{q+1} + … + lambda_p),

where lambda_{q+1}, …, lambda_p are the eigenvalues of the sample covariance matrix left out of Lambda_q.

Here, U_q is the matrix whose columns are the q principal eigenvectors of the sample covariance matrix, Lambda_q is a diagonal matrix of the eigenvalues corresponding to the q principal eigenvectors, and R is an arbitrary orthogonal rotation matrix. Note that the classical principal components are given by the matrix U_q. As a result of the other terms in the expression for W_MLE, the probabilistic principal components may have different scalings and different orientations than the classical components, but both sets of components will span the same subspace and can be used interchangeably for most applications requiring dimensionality reduction.

In practice, the identity matrix can be substituted for the arbitrary orthogonal matrix R to calculate W_MLE. Using the identity matrix not only reduces the computational complexity but also ensures that there will be a perfect correlation or anti-correlation between the probabilistic principal components and their classical counterparts. These closed-form expressions are very convenient for complete data, but they cannot be applied directly to incomplete data. An alternative option for finding W_MLE, which can easily be extended to accommodate incomplete data, is to use the EM algorithm to iteratively arrive at the maximum likelihood estimators, in which case R will be arbitrary at convergence. I'll briefly review the EM algorithm below before illustrating how it can be used to apply PPCA to incomplete data.
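To make the closed-form route concrete, here is a minimal sketch of the estimators with R set to the identity matrix. The function name `ppca_closed_form` and its structure are my own, not from Tipping and Bishop:

```python
import numpy as np

def ppca_closed_form(X, q):
    """Closed-form PPCA MLEs (Tipping & Bishop, 1999) with R = I.

    X: complete data matrix of shape (n, p); q: latent dimension.
    Returns W_mle (p, q), sigma2_mle, and the mean estimate mu.
    """
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)        # MLE sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U_q = eigvecs[:, :q]                          # q principal eigenvectors
    sigma2_mle = eigvals[q:].mean()               # average discarded eigenvalue
    W_mle = U_q * np.sqrt(eigvals[:q] - sigma2_mle)  # (Lambda_q - sigma2 I)^(1/2), R = I
    return W_mle, sigma2_mle, mu
```

Because R = I here, each column of W_mle is a rescaled classical principal component.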

**EM Algorithm for PPCA with Missing Data:**

The EM algorithm is an iterative optimization method which alternates between estimating the latent variables and updating the parameters. Initial values must be specified for all parameters at the start of the EM algorithm. In the E-step, the expected value of the log-likelihood is computed with respect to the current parameter estimates. The parameter estimates are then re-calculated to maximize the expected log-likelihood function in the M-step. This process repeats until the change in the parameter estimates is small and the algorithm has thus converged.
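For complete data, the E- and M-steps for PPCA combine into compact matrix updates, which Tipping and Bishop give in closed form. A minimal sketch of that complete-data EM loop (the function name and convergence check are my own choices) looks like this:

```python
import numpy as np

def ppca_em_complete(X, q, n_iter=500, tol=1e-8, seed=0):
    """EM algorithm for PPCA on complete data (Tipping & Bishop, 1999)."""
    n, p = X.shape
    mu = X.mean(axis=0)                        # sample mean is the MLE of mu here
    S = np.cov(X, rowvar=False, bias=True)     # sample covariance
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, q))            # random initialization
    sigma2 = 1.0
    for _ in range(n_iter):
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))  # q x q, cheap
        SW = S @ W
        # combined E/M updates for W and sigma^2
        W_new = SW @ np.linalg.inv(sigma2 * np.eye(q) + M_inv @ (W.T @ SW))
        sigma2_new = np.trace(S - SW @ M_inv @ W_new.T) / p
        converged = abs(sigma2_new - sigma2) < tol
        W, sigma2 = W_new, sigma2_new
        if converged:
            break
    return W, sigma2, mu
```

At convergence, W equals W_MLE up to an arbitrary orthogonal rotation R, as noted above.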

Before illustrating how the EM algorithm for PPCA extends to incomplete data, I'll first introduce some notation. Suppose that we observe D different combinations of observed and unobserved predictors in the data. The set of D combinations will include the pattern where all predictors are observed, assuming the data contains at least some complete observations. For each distinct combination d=1,…,D, let x_1,…,x_{n_d} denote the set of observations which share the dth pattern of missing predictors. Each data point in this set can be decomposed as

x_i = (x_{i,obs_d}, x_{i,mis_d}),

where the subscripts obs_d and mis_d denote which predictors are observed and unobserved in the dth combination.

The algorithm in the image above shows all of the steps needed to apply PPCA to incomplete data, using my notation for observed and unobserved values. To initialize the parameters for the EM algorithm, I impute any missing values with the predictor means and then use the closed-form estimators given by Tipping and Bishop (1999). The mean-imputed estimates may be biased, but they provide a better starting point than a random initialization, reducing the likelihood of the algorithm converging to a local minimum. Note that the imputed data is not used beyond the initialization.
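This initialization step can be sketched as follows, using NaNs to mark missing entries (the helper name `initialize_ppca` is mine, not from the original code):

```python
import numpy as np

def initialize_ppca(X, q):
    """Initialize EM for PPCA: mean-impute the NaNs, then apply the
    closed-form estimators of Tipping & Bishop (1999) with R = I."""
    mu0 = np.nanmean(X, axis=0)               # per-predictor means over observed values
    X_imp = np.where(np.isnan(X), mu0, X)     # mean imputation (used only here)
    S = np.cov(X_imp, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2_0 = eigvals[q:].mean()
    # clip at zero in case imputation distorts the top eigenvalues
    W0 = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2_0, 0.0))
    return W0, sigma2_0, mu0
```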

In the E-step, I first compute the expectation of each z_i with respect to the observed values and the current parameter estimates. Then I treat the missing values as additional latent variables and compute their expectation with respect to the current parameters and z_i. In the M-step, I update the estimates for W, mu, and sigma based on the expected log-likelihood that was computed in the E-step. This differs from Tipping and Bishop's EM algorithm for complete data, where mu is estimated by the sample mean and only W and sigma are iteratively updated in the EM algorithm. It is not necessary to iteratively estimate mu when X is complete or when the unobserved values are missing completely at random, since the sample mean is the MLE in those cases. For other patterns of missingness, however, the mean of the observed values is generally not the MLE, and the EM algorithm will yield more accurate estimates of mu by accounting for the likely values of the missing data points. Thus, I've included the update for mu in the M-step together with the other parameter updates.
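To illustrate the E-step expectations for a single observation, here is a small sketch. Only the rows of W corresponding to observed predictors enter the posterior for z_i, and the missing entries are then filled with their conditional expectation given z_i. The helper `e_step_one` is illustrative, not the author's code:

```python
import numpy as np

def e_step_one(x, W, mu, sigma2):
    """E-step for one observation x (NaNs mark missing entries).

    Returns E[z | x_obs] and a copy of x with the missing entries
    replaced by their conditional expectation mu_mis + W_mis E[z]."""
    obs = ~np.isnan(x)
    W_o = W[obs]                                   # rows of W for observed predictors
    q = W.shape[1]
    M = W_o.T @ W_o + sigma2 * np.eye(q)           # posterior precision factor
    Ez = np.linalg.solve(M, W_o.T @ (x[obs] - mu[obs]))  # E[z | x_obs]
    x_filled = x.copy()
    x_filled[~obs] = mu[~obs] + W[~obs] @ Ez       # expected missing values
    return Ez, x_filled
```

Looping this over all observations (grouped by missingness pattern d for efficiency) gives the quantities needed for the M-step updates of W, mu, and sigma.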

**Python Implementation:**

I have provided an implementation of the EM algorithm for PPCA here, following the steps of the algorithm above to accommodate missing data. My function requires the user to specify the data matrix and the number of components in PPCA (i.e., q, the latent variable dimension). Common techniques for choosing the number of components in classical PCA can be applied to PPCA when the latent variable dimension is not known. For instance, one option for choosing q would be to create a scree plot and apply the so-called 'elbow method.' Alternatively, q could be chosen through cross-validation.
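As a quick illustration of the scree-plot heuristic, the eigenvalue spectrum of the sample covariance matrix can be computed as below; one would then plot the eigenvalues against their rank and look for the 'elbow' where they level off (the helper name `scree_values` is my own):

```python
import numpy as np

def scree_values(X):
    """Eigenvalues of the sample covariance matrix, in descending order,
    together with the cumulative proportion of variance explained."""
    S = np.cov(X, rowvar=False, bias=True)
    eigvals = np.linalg.eigvalsh(S)[::-1]          # descending order
    explained = np.cumsum(eigvals) / eigvals.sum() # cumulative variance ratio
    return eigvals, explained
```

A sharp drop after the qth eigenvalue suggests q latent components; the cumulative ratio offers a complementary view of how much variance the first q components capture.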

I'll now consider three different simulations to test my implementation of the EM algorithm for PPCA: one without any missing values, one with missing values that are chosen completely at random, and one with missing values that are chosen not-at-random. To simulate data that is missing completely at random, I assume that each of the n×p values has an equal 10% probability of being unobserved. Meanwhile, to simulate data that is missing not-at-random, I assume that data points with a higher z-score have a greater probability of being unobserved, with an expected proportion of 10% missing overall.
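The two missingness mechanisms can be sketched as masks over the data matrix. For the not-at-random case, the article does not spell out the exact mechanism, so the construction below (probabilities proportional to absolute column-wise z-scores, rescaled to a 10% expected rate) is one plausible assumption of mine:

```python
import numpy as np

def mcar_mask(X, prop=0.10, seed=0):
    """Each entry is independently missing with probability `prop` (MCAR)."""
    rng = np.random.default_rng(seed)
    return rng.random(X.shape) < prop

def mnar_mask(X, prop=0.10, seed=0):
    """Entries with larger absolute z-scores are more likely to be
    missing (MNAR); probabilities are rescaled so the expected overall
    proportion missing is `prop`. This mechanism is an assumption, not
    the article's exact specification."""
    rng = np.random.default_rng(seed)
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))  # column-wise z-scores
    probs = np.minimum(prop * z / z.mean(), 1.0)      # rescale, cap at 1
    return rng.random(X.shape) < probs
```

Applying a mask is then just `X_missing = np.where(mask, np.nan, X)`.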

I use the same synthetic data for all simulations, simply altering the pattern of missingness to evaluate the performance on incomplete data. To generate the data, I let n=500, p=30, and q=3. I sample both W and mu from a Uniform[-1, 1] distribution. Then I draw the latent variables z_i, i=1,…,n, from a standard normal distribution and the noise terms epsilon_i, i=1,…,n, from an isotropic Gaussian distribution with sigma=0.7. I assume that q is correctly specified in all simulations. For additional details, see the simulation code here.
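In code, this data-generating process looks roughly as follows (a sketch of the described setup, not the actual simulation script):

```python
import numpy as np

def simulate_ppca_data(n=500, p=30, q=3, sigma=0.7, seed=42):
    """Generate synthetic data from the PPCA model x_i = W z_i + mu + epsilon_i."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(p, q))          # loading matrix
    mu = rng.uniform(-1, 1, size=p)              # mean vector
    Z = rng.standard_normal((n, q))              # latent variables
    noise = sigma * rng.standard_normal((n, p))  # isotropic Gaussian noise
    X = Z @ W.T + mu + noise
    return X, W, mu
```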

To evaluate the accuracy of the EM algorithm, I report the relative error of the parameter estimates. The relative error for mu is reported with respect to the l2 norm, and the relative error for W is reported with respect to the Frobenius norm. Since W can only be recovered up to an orthogonal rotation, I have used SciPy's orthogonal Procrustes function to rotate the estimated matrix W in the direction of the true W before computing the relative error.
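This alignment-then-error computation can be sketched as below, using `scipy.linalg.orthogonal_procrustes` (the wrapper function `relative_errors` is my own):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def relative_errors(W_hat, W_true, mu_hat, mu_true):
    """Relative errors after rotating W_hat toward W_true with the
    best orthogonal matrix (Procrustes alignment)."""
    R, _ = orthogonal_procrustes(W_hat, W_true)   # minimizes ||W_hat R - W_true||_F
    w_err = np.linalg.norm(W_hat @ R - W_true) / np.linalg.norm(W_true)  # Frobenius
    mu_err = np.linalg.norm(mu_hat - mu_true) / np.linalg.norm(mu_true)  # l2
    return w_err, mu_err
```

Without the Procrustes step, even a perfectly recovered subspace could show a large Frobenius error, since the EM algorithm leaves the rotation R arbitrary.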

My results confirm that the EM algorithm can accurately estimate the parameters in all three of these setups. It is not surprising that the EM algorithm performs well in the complete data simulation, since the initialization should already be optimal. However, it is more noteworthy that the accuracy remains high when missing values are introduced. The parameter estimates even show a relatively high degree of robustness to non-random patterns of missingness, at least under the assumed setup for this simulation. For real datasets, the performance of the EM algorithm will depend on a number of different factors, including the accuracy of the initialization, the pattern of missingness, the signal-to-noise ratio, and the sample size.

**Conclusion:**

Probabilistic principal components analysis (PPCA) can recover much of the same structure as classical PCA while also extending its functionality, for instance, by making it possible to accommodate data with missing values. In this article, I have introduced PPCA, explored the relationship between PPCA and classical PCA, and illustrated how the EM algorithm for PPCA can be extended to accommodate missing values. My simulations indicate that the EM algorithm for PPCA yields accurate parameter estimates when there are missing values, even demonstrating some robustness when values are missing not-at-random.

**References:**

M. Tipping, C. Bishop, *Probabilistic principal component analysis* (1999). Journal of the Royal Statistical Society Series B: Statistical Methodology.