Sparse principal component analysis based on optimization over Stiefel. [Z, P, X] = sparse_pca(A, m, gamma) We consider sparse PCA applied to a data matrix A of size pxn, where p is the number of samples (observations) and n is the number of variables (features). We attempt to extract m different components. The parameter gamma, which must lie between 0 and the largest 2-norm of a column of A, tunes the balance between best explanation of the variance of the data (gamma = 0, mostly corresponds to standard PCA) and best sparsity of the principal components Z (gamma maximal, Z is zero). The variables contained in the columns of A are assumed centered (zero-mean). The output Z of size nxm represents the principal components. There are m columns, each one of unit norm and capturing a prefered direction of the data, while trying to be sparse. P has the same size as Z and represents the sparsity pattern of Z. X is an orthonormal matrix of size pxm produced internally by the algorithm. With classical PCA, the variability captured by m components is sum(svds(A, m)) With the outputted Z, which should be sparser than normal PCA, it is sum(svd(A*Z)) The method is based on the maximization of a differentiable function over the Stiefel manifold of dimension pxm. Notice that this dimension is independent of n, making this method particularly suitable for problems with many variables but few samples (n much larger than p). The complexity of each iteration of the algorithm is linear in n as a result. The theory behind this code is available in the paper http://jmlr.org/papers/volume11/journee10a/journee10a.pdf Generalized Power Method for Sparse Principal Component Analysis, by Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010. This implementation is not equivalent to the one described in that paper (and is independent from their authors) but is close in spirit nonetheless. It is provided with Manopt as an example file but was not optimized: please do not judge the quality of the algorithm described by the authors of the paper based on this implementation.

- stiefelfactory Returns a manifold structure to optimize over orthonormal matrices.
- trustregions Riemannian trust-regions solver for optimization on manifolds.

- function [P, X] = sparse_pca_stiefel_l1(A, m, gamma)
- function store = prepare(X, store)
- function [f, store] = cost(X, store)
- function [G, store] = egrad(X, store)
- function Z = postprocess(A, P, X)
- function U = ufactor(X)

0001 function [Z, P, X, A] = sparse_pca(A, m, gamma) 0002 % Sparse principal component analysis based on optimization over Stiefel. 0003 % 0004 % [Z, P, X] = sparse_pca(A, m, gamma) 0005 % 0006 % We consider sparse PCA applied to a data matrix A of size pxn, where p is 0007 % the number of samples (observations) and n is the number of variables 0008 % (features). We attempt to extract m different components. The parameter 0009 % gamma, which must lie between 0 and the largest 2-norm of a column of 0010 % A, tunes the balance between best explanation of the variance of the data 0011 % (gamma = 0, mostly corresponds to standard PCA) and best sparsity of the 0012 % principal components Z (gamma maximal, Z is zero). The variables 0013 % contained in the columns of A are assumed centered (zero-mean). 0014 % 0015 % The output Z of size nxm represents the principal components. There are m 0016 % columns, each one of unit norm and capturing a prefered direction of the 0017 % data, while trying to be sparse. P has the same size as Z and represents 0018 % the sparsity pattern of Z. X is an orthonormal matrix of size pxm 0019 % produced internally by the algorithm. 0020 % 0021 % With classical PCA, the variability captured by m components is 0022 % sum(svds(A, m)) 0023 % With the outputted Z, which should be sparser than normal PCA, it is 0024 % sum(svd(A*Z)) 0025 % 0026 % The method is based on the maximization of a differentiable function over 0027 % the Stiefel manifold of dimension pxm. Notice that this dimension is 0028 % independent of n, making this method particularly suitable for problems 0029 % with many variables but few samples (n much larger than p). The 0030 % complexity of each iteration of the algorithm is linear in n as a result. 0031 % 0032 % The theory behind this code is available in the paper 0033 % http://jmlr.org/papers/volume11/journee10a/journee10a.pdf 0034 % Generalized Power Method for Sparse Principal Component Analysis, by 0035 % Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010. 0036 % This implementation is not equivalent to the one described in that paper 0037 % (and is independent from their authors) but is close in spirit 0038 % nonetheless. It is provided with Manopt as an example file but was not 0039 % optimized: please do not judge the quality of the algorithm described by 0040 % the authors of the paper based on this implementation. 0041 0042 % This file is part of Manopt and is copyrighted. See the license file. 0043 % 0044 % Main author: Nicolas Boumal, Dec. 24, 2013 0045 % Contributors: 0046 % 0047 % Change log: 0048 % 0049 0050 % If no input is provided, generate random data for a quick demo 0051 if nargin == 0 0052 n = 100; 0053 p = 10; 0054 m = 2; 0055 0056 % Data matrix 0057 A = randn(p, n); 0058 0059 % Regularization parameter. This should be between 0 and the largest 0060 % 2-norm of a column of A. 0061 gamma = 1; 0062 0063 elseif nargin ~= 3 0064 error('Please provide 3 inputs (or none for a demo).'); 0065 end 0066 0067 % Execute the main algorithm: it will compute a sparsity pattern P. 0068 [P, X] = sparse_pca_stiefel_l1(A, m, gamma); 0069 0070 % Compute the principal components in accordance with the sparsity. 0071 Z = postprocess(A, P, X); 0072 0073 end 0074 0075 0076 % Sparse PCA based on the block sparse PCA algorithm with l1-penalty as 0077 % featured in the reference paper by Journee et al. This is not the same 0078 % algorithm but it is the same cost function optimized over the same search 0079 % space. We force N = eye(m). 0080 function [P, X] = sparse_pca_stiefel_l1(A, m, gamma) 0081 0082 [p, n] = size(A); %#ok<NASGU> 0083 0084 % The optimization takes place over a Stiefel manifold whose dimension 0085 % is independent of n. This is especially useful when there are many 0086 % more variables than samples. 0087 St = stiefelfactory(p, m); 0088 problem.M = St; 0089 0090 % In this helper function, given a point 'X' on the manifold we check 0091 % whether the caching structure 'store' has been populated with 0092 % quantities that are useful to compute at X or not. If they were not, 0093 % then we compute and store them now. 0094 function store = prepare(X, store) 0095 if ~isfield(store, 'ready') || ~store.ready 0096 store.AtX = A'*X; 0097 store.absAtX = abs(store.AtX); 0098 store.pos = max(0, store.absAtX - gamma); 0099 store.ready = true; 0100 end 0101 end 0102 0103 % Define the cost function here and set it in the problem structure. 0104 problem.cost = @cost; 0105 function [f, store] = cost(X, store) 0106 store = prepare(X, store); 0107 pos = store.pos; 0108 f = -.5*norm(pos, 'fro')^2; 0109 end 0110 0111 % Here, we chose to define the Euclidean gradient (egrad instead of 0112 % grad) : Manopt will take care of converting it to the Riemannian 0113 % gradient. 0114 problem.egrad = @egrad; 0115 function [G, store] = egrad(X, store) 0116 if ~isfield(store, 'G') 0117 store = prepare(X, store); 0118 pos = store.pos; 0119 AtX = store.AtX; 0120 sgAtX = sign(AtX); 0121 factor = pos.*sgAtX; 0122 store.G = -A*factor; 0123 end 0124 G = store.G; 0125 end 0126 0127 % checkgradient(problem); 0128 % pause; 0129 0130 % The optimization happens here. To improve the method, it may be 0131 % interesting to investigate better-than-random initial iterates and, 0132 % possibly, to fine tune the parameters of the solver. 0133 X = trustregions(problem); 0134 0135 % Compute the sparsity pattern by thresholding 0136 P = abs(A'*X) > gamma; 0137 0138 end 0139 0140 0141 % This post-processing algorithm produces a matrix Z of size nxm matching 0142 % the sparsity pattern P and representing sparse principal components for 0143 % A. This is to be called with the output of the main algorithm. This 0144 % algorithm is described in the reference paper by Journee et al. 0145 function Z = postprocess(A, P, X) 0146 fprintf('Post-processing... '); 0147 counter = 0; 0148 maxiter = 1000; 0149 tolerance = 1e-8; 0150 while counter < maxiter 0151 Z = A'*X; 0152 Z(~P) = 0; 0153 Z = Z*diag(1./sqrt(diag(Z'*Z))); 0154 X = ufactor(A*Z); 0155 counter = counter + 1; 0156 if counter > 1 && norm(Z0-Z, 'fro') < tolerance*norm(Z0, 'fro') 0157 break; 0158 end 0159 Z0 = Z; 0160 end 0161 fprintf('done, in %d iterations (max = %d).\n', counter, maxiter); 0162 end 0163 0164 % Returns the U-factor of the polar decomposition of X 0165 function U = ufactor(X) 0166 [W, S, V] = svd(X, 0); %#ok<ASGLU> 0167 U = W*V'; 0168 end

Generated on Fri 08-Sep-2017 12:43:19 by