function [U, cost] = robust_pca(X, d)
% Computes a robust version of PCA (principal component analysis) on data.
%
% function [U, cost] = robust_pca(X, d)
%
% Given a matrix X of size p by n, such that each column represents a
% point in R^p, this computes U: an orthonormal basis of size p by d such
% that the column space of U captures the points X as well as possible.
% More precisely, the function attempts to compute U as the minimizer
% over the Grassmann manifold (the set of linear subspaces) of:
%
%    f(U) = (1/n) Sum_{i = 1:n} dist(X(:, i), the space spanned by U)
%         = (1/n) Sum_{i = 1:n} || U*U'*X(:, i) - X(:, i) ||
%
% The output cost represents the average distance achieved with the
% returned U. Notice that the norms are not squared, for robustness.
%
% In practice, because this function is nonsmooth, it is smoothed with a
% pseudo-Huber loss function of parameter epsilon (noted e for short), and
% the smoothing parameter is iteratively reduced (with warm starts):
%
%    f_e(U) = (1/n) Sum_{i = 1:n} l_e(|| U*U'*X(:, i) - X(:, i) ||)
%
% with l_e(x) = sqrt(x^2 + e^2) - e (for e = 0, this is the absolute value).
%
% The intermediate optimization of the smooth cost over the Grassmann
% manifold is performed using the Manopt toolbox.
%
% Ideally, the non-outlier data should be centered. If it is not, this
% pre-processing centers all the data, but bear in mind that outliers will
% shift the center of mass too:
%
%    X = X - repmat(mean(X, 2), [1, size(X, 2)]);
%
% There are no guarantees that this code will return the optimal U.
% This code is distributed to illustrate one possible way of optimizing
% a nonsmooth cost function over a manifold, using Manopt with smoothing.
% For practical use, the constants in the code would need to be tuned.

% This file is part of Manopt and is copyrighted. See the license file.
%
% Main authors: Nicolas Boumal and Teng Zhang, May 2, 2014
% Contributors:
%
% Change log:
%
%   March 4, 2015 (NB):
%       Uses a pseudo-Huber loss rather than a Huber loss: this has the
%       nice advantage of being smooth and simpler to code (no if's).
%
%   April 8, 2015 (NB):
%       Built-in test data for quick tests; added comment about centering.
%
%   Aug 20, 2021 (XJ):
%       Added AD to compute the egrad.


    % If no inputs are given, generate random data for illustration.
    if nargin == 0
        % Generate some data points aligned on a subspace
        X = rand(2, 1)*(1:30) + .05*randn(2, 30).*[(1:30); (1:30)];
        % And add some random outliers to the mix
        P = randperm(size(X, 2));
        outliers = 10;
        X(:, P(1:outliers)) = 30*randn(2, outliers);
        % Center the data
        % X = X - repmat(mean(X, 2), [1, size(X, 2)]);
        d = 1;
    end


    % Prepare a Manopt problem structure for optimization of the given
    % cost (defined below) over the Grassmann manifold.
    [p, n] = size(X);
    manifold = grassmannfactory(p, d);
    problem.M = manifold;
    problem.cost = @robustpca_cost;
    problem.egrad = @robustpca_gradient;

    % Do classical PCA for the initial guess.
    % This is just one idea: it is not necessarily useful or ideal.
    % Using a random initial guess, and starting over for a few different
    % ones, is probably much better. For this example, we keep it simple.
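    % A minimal sketch of that random-restart alternative (not run here):
    % Manopt manifold factories expose a .rand() sampler, so a random
    % initial point on the Grassmann manifold could be drawn with
    %
    %   U = manifold.rand();
    %
    % and the optimization below repeated for a few such draws, keeping
    % the U that achieves the lowest final cost.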
    [U, ~, ~] = svds(X, d);


    % Iteratively reduce the smoothing constant epsilon and optimize
    % the cost function over Grassmann.
    epsilon = 1;
    n_iterations = 6;
    reduction = .5;
    options.verbosity = 2; % Change this number for more or less output
    warning('off', 'manopt:getHessian:approx');

    % An alternative way to compute the egrad is to use automatic
    % differentiation provided in the Deep Learning Toolbox (slower).
    % Call manoptAD to automatically obtain the egrad:
    % problem = manoptAD(problem, 'egrad');

    for iter = 1 : n_iterations
        U = trustregions(problem, U, options);
        epsilon = epsilon * reduction;
    end
    warning('on', 'manopt:getHessian:approx');


    % Return the cost as the actual average of distances, not smoothed.
    epsilon = 0;
    cost = robustpca_cost(U);


    % If working with the auto-generated input, plot the results.
    if nargin == 0
        figure;
        scatter(X(1, :), X(2, :));
        hold on;
        plot(U(1)*[-1, 1]*100, U(2)*[-1, 1]*100, 'r');
        % Compare to a standard PCA
        [Upca, ~, ~] = svds(X, 1);
        plot(Upca(1)*[-1, 1]*100, Upca(2)*[-1, 1]*100, 'k');
        hold off;
        xlim(1.1*[min(X(1, :)), max(X(1, :))]);
        ylim(1.1*[min(X(2, :)), max(X(2, :))]);
        legend('data points', 'Robust PCA fit', 'Standard PCA fit');
    end


    % Smoothed cost
    function value = robustpca_cost(U)

        vecs = U*(U'*X) - X;
        sqnrms = sum(vecs.^2, 1);
        vals = sqrt(sqnrms + epsilon^2) - epsilon;
        value = mean(vals);

    end

    % Euclidean gradient of the smoothed cost (it will be transformed into
    % the Riemannian gradient automatically by Manopt).
    function G = robustpca_gradient(U)

        % Note that the computation of vecs and sqnrms is redundant
        % with their computation in the cost function. To speed
        % up the code, it would be wise to use the caching capabilities
        % of Manopt (the store structure). See the online documentation.
        % It is not done here to keep the code a bit simpler.
        UtX = U'*X;
        vecs = U*UtX - X;
        sqnrms = sum(vecs.^2, 1);
        % This explicit loop is a bit slow; the vectorized code below it
        % computes the same gradient faster.
        % G = zeros(p, d);
        % for i = 1 : n
        %     G = G + (1/sqrt(sqnrms(i) + epsilon^2)) * vecs(:, i) * UtX(:, i)';
        % end
        % G = G/n;
        G = mean(multiscale(1./sqrt(sqnrms + epsilon^2), ...
                 multiprod(reshape(vecs, [p, 1, n]), ...
                           multitransp(reshape(UtX, [d, 1, n])))), 3);
    end

end
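
% Example usage (a minimal sketch; the data-generation lines are
% illustrative, not part of the function). Calling robust_pca() with no
% arguments runs the built-in 2D demo above. On user data, one might write:
%
%   p = 10; n = 200; d = 2;
%   X = randn(p, d) * randn(d, n);       % points on a d-dim subspace
%   X(:, 1:10) = 20*randn(p, 10);        % corrupt a few columns (outliers)
%   X = X - repmat(mean(X, 2), [1, n]);  % center (outliers shift the mean)
%   [U, cost] = robust_pca(X, d);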