Sparse principal component analysis based on optimization over Stiefel. [Z, P, X] = sparse_pca(A, m, gamma) We consider sparse PCA applied to a data matrix A of size pxn, where p is the number of samples (observations) and n is the number of variables (features). We attempt to extract m different components. The parameter gamma, which must lie between 0 and the largest 2-norm of a column of A, tunes the balance between best explanation of the variance of the data (gamma = 0, mostly corresponds to standard PCA) and best sparsity of the principal components Z (gamma maximal, Z is zero). The variables contained in the columns of A are assumed centered (zero-mean). The output Z of size nxm represents the principal components. There are m columns, each one of unit norm and capturing a prefered direction of the data, while trying to be sparse. P has the same size as Z and represents the sparsity pattern of Z. X is an orthonormal matrix of size pxm produced internally by the algorithm. With classical PCA, the variability captured by m components is sum(svds(A, m)) With the outputted Z, which should be sparser than normal PCA, it is sum(svd(A*Z)) The method is based on the maximization of a differentiable function over the Stiefel manifold of dimension pxm. Notice that this dimension is independent of n, making this method particularly suitable for problems with many variables but few samples (n much larger than p). The complexity of each iteration of the algorithm is linear in n as a result. The theory behind this code is available in the paper http://jmlr.org/papers/volume11/journee10a/journee10a.pdf Generalized Power Method for Sparse Principal Component Analysis, by Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010. This implementation is not equivalent to the one described in that paper (and is independent from their authors) but is close in spirit nonetheless. It is provided with Manopt as an example file but was not optimized: please do not judge the quality of the algorithm described by the authors of the paper based on this implementation.
0001 function [Z, P, X, A] = sparse_pca(A, m, gamma) 0002 % Sparse principal component analysis based on optimization over Stiefel. 0003 % 0004 % [Z, P, X] = sparse_pca(A, m, gamma) 0005 % 0006 % We consider sparse PCA applied to a data matrix A of size pxn, where p is 0007 % the number of samples (observations) and n is the number of variables 0008 % (features). We attempt to extract m different components. The parameter 0009 % gamma, which must lie between 0 and the largest 2-norm of a column of 0010 % A, tunes the balance between best explanation of the variance of the data 0011 % (gamma = 0, mostly corresponds to standard PCA) and best sparsity of the 0012 % principal components Z (gamma maximal, Z is zero). The variables 0013 % contained in the columns of A are assumed centered (zero-mean). 0014 % 0015 % The output Z of size nxm represents the principal components. There are m 0016 % columns, each one of unit norm and capturing a prefered direction of the 0017 % data, while trying to be sparse. P has the same size as Z and represents 0018 % the sparsity pattern of Z. X is an orthonormal matrix of size pxm 0019 % produced internally by the algorithm. 0020 % 0021 % With classical PCA, the variability captured by m components is 0022 % sum(svds(A, m)) 0023 % With the outputted Z, which should be sparser than normal PCA, it is 0024 % sum(svd(A*Z)) 0025 % 0026 % The method is based on the maximization of a differentiable function over 0027 % the Stiefel manifold of dimension pxm. Notice that this dimension is 0028 % independent of n, making this method particularly suitable for problems 0029 % with many variables but few samples (n much larger than p). The 0030 % complexity of each iteration of the algorithm is linear in n as a result. 0031 % 0032 % The theory behind this code is available in the paper 0033 % http://jmlr.org/papers/volume11/journee10a/journee10a.pdf 0034 % Generalized Power Method for Sparse Principal Component Analysis, by 0035 % Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010. 0036 % This implementation is not equivalent to the one described in that paper 0037 % (and is independent from their authors) but is close in spirit 0038 % nonetheless. It is provided with Manopt as an example file but was not 0039 % optimized: please do not judge the quality of the algorithm described by 0040 % the authors of the paper based on this implementation. 0041 0042 % This file is part of Manopt and is copyrighted. See the license file. 0043 % 0044 % Main author: Nicolas Boumal, Dec. 24, 2013 0045 % Contributors: 0046 % 0047 % Change log: 0048 % 0049 % Aug. 23, 2021 (XJ): 0050 % Added AD to compute the egrad 0051 0052 % If no input is provided, generate random data for a quick demo 0053 if nargin == 0 0054 n = 100; 0055 p = 10; 0056 m = 2; 0057 0058 % Data matrix 0059 A = randn(p, n); 0060 0061 % Regularization parameter. This should be between 0 and the largest 0062 % 2-norm of a column of A. 0063 gamma = 1; 0064 0065 elseif nargin ~= 3 0066 error('Please provide 3 inputs (or none for a demo).'); 0067 end 0068 0069 % Execute the main algorithm: it will compute a sparsity pattern P. 0070 [P, X] = sparse_pca_stiefel_l1(A, m, gamma); 0071 0072 % Compute the principal components in accordance with the sparsity. 0073 Z = postprocess(A, P, X); 0074 0075 end 0076 0077 0078 % Sparse PCA based on the block sparse PCA algorithm with l1-penalty as 0079 % featured in the reference paper by Journee et al. This is not the same 0080 % algorithm but it is the same cost function optimized over the same search 0081 % space. We force N = eye(m). 0082 function [P, X] = sparse_pca_stiefel_l1(A, m, gamma) 0083 0084 [p, n] = size(A); %#ok<NASGU> 0085 0086 % The optimization takes place over a Stiefel manifold whose dimension 0087 % is independent of n. This is especially useful when there are many 0088 % more variables than samples. 0089 St = stiefelfactory(p, m); 0090 problem.M = St; 0091 0092 % In this helper function, given a point 'X' on the manifold we check 0093 % whether the caching structure 'store' has been populated with 0094 % quantities that are useful to compute at X or not. If they were not, 0095 % then we compute and store them now. 0096 function store = prepare(X, store) 0097 if ~isfield(store, 'ready') || ~store.ready 0098 store.AtX = A'*X; 0099 store.absAtX = abs(store.AtX); 0100 store.pos = max(0, store.absAtX - gamma); 0101 store.ready = true; 0102 end 0103 end 0104 0105 % Define the cost function here and set it in the problem structure. 0106 problem.cost = @cost; 0107 function [f, store] = cost(X, store) 0108 store = prepare(X, store); 0109 pos = store.pos; 0110 f = -.5*norm(pos, 'fro')^2; 0111 end 0112 0113 % Here, we chose to define the Euclidean gradient (egrad instead of 0114 % grad) : Manopt will take care of converting it to the Riemannian 0115 % gradient. 0116 problem.egrad = @egrad; 0117 function [G, store] = egrad(X, store) 0118 if ~isfield(store, 'G') 0119 store = prepare(X, store); 0120 pos = store.pos; 0121 AtX = store.AtX; 0122 sgAtX = sign(AtX); 0123 factor = pos.*sgAtX; 0124 store.G = -A*factor; 0125 end 0126 G = store.G; 0127 end 0128 0129 % An alternative way to compute the egrad is to use automatic 0130 % differentiation provided in the deep learning toolbox (slower). 0131 % Notice that the function norm is not supported for AD so far. 0132 % Replace norm(...,'fro')^2 with cnormsqfro described in the file 0133 % manoptADhelp.m. Notice that the abs function is not differentiable 0134 % mathematically. Instead the subgradient is computed automatically. 0135 % problem.cost = @cost_AD; 0136 % function f = cost_AD(X) 0137 % AtX = A'*X; 0138 % absAtX = abs(AtX); 0139 % pos = max(0, absAtX - gamma); 0140 % f = -.5*cnormsqfro(pos); 0141 % end 0142 % problem = manoptAD(problem,'egrad'); 0143 0144 % checkgradient(problem); 0145 % pause; 0146 0147 % The optimization happens here. To improve the method, it may be 0148 % interesting to investigate better-than-random initial iterates and, 0149 % possibly, to fine tune the parameters of the solver. 0150 X = trustregions(problem); 0151 0152 % Compute the sparsity pattern by thresholding 0153 P = abs(A'*X) > gamma; 0154 0155 end 0156 0157 0158 % This post-processing algorithm produces a matrix Z of size nxm matching 0159 % the sparsity pattern P and representing sparse principal components for 0160 % A. This is to be called with the output of the main algorithm. This 0161 % algorithm is described in the reference paper by Journee et al. 0162 function Z = postprocess(A, P, X) 0163 fprintf('Post-processing... '); 0164 counter = 0; 0165 maxiter = 1000; 0166 tolerance = 1e-8; 0167 while counter < maxiter 0168 Z = A'*X; 0169 Z(~P) = 0; 0170 Z = Z*diag(1./sqrt(diag(Z'*Z))); 0171 X = ufactor(A*Z); 0172 counter = counter + 1; 0173 if counter > 1 && norm(Z0-Z, 'fro') < tolerance*norm(Z0, 'fro') 0174 break; 0175 end 0176 Z0 = Z; 0177 end 0178 fprintf('done, in %d iterations (max = %d).\n', counter, maxiter); 0179 end 0180 0181 % Returns the U-factor of the polar decomposition of X 0182 function U = ufactor(X) 0183 [W, S, V] = svd(X, 0); %#ok<ASGLU> 0184 U = W*V'; 0185 end