Description of sparse

 Sparse principal component analysis based on optimization over Stiefel.

 [Z, P, X] = sparse_pca(A, m, gamma)

 We consider sparse PCA applied to a data matrix A of size pxn, where p is
 the number of samples (observations) and n is the number of variables
 (features). We attempt to extract m different components. The parameter
 gamma, which must lie between 0 and the largest 2-norm of a column of
 A, tunes the balance between best explanation of the variance of the data
 (gamma = 0, mostly corresponds to standard PCA) and best sparsity of the
 principal components Z (gamma maximal, Z is zero). The variables
 contained in the columns of A are assumed centered (zero-mean).

 The output Z of size nxm represents the principal components. There are m
 columns, each one of unit norm and capturing a prefered direction of the
 data, while trying to be sparse. P has the same size as Z and represents
 the sparsity pattern of Z. X is an orthonormal matrix of size pxm
 produced internally by the algorithm.

 With classical PCA, the variability captured by m components is
 sum(svds(A, m))
 With the outputted Z, which should be sparser than normal PCA, it is
 sum(svd(A*Z))

 The method is based on the maximization of a differentiable function over
 the Stiefel manifold of dimension pxm. Notice that this dimension is
 independent of n, making this method particularly suitable for problems
 with many variables but few samples (n much larger than p). The
 complexity of each iteration of the algorithm is linear in n as a result.

 The theory behind this code is available in the paper
 http://jmlr.org/papers/volume11/journee10a/journee10a.pdf
 Generalized Power Method for Sparse Principal Component Analysis, by
 Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010.
 This implementation is not equivalent to the one described in that paper
 (and is independent from their authors) but is close in spirit
 nonetheless. It is provided with Manopt as an example file but was not
 optimized: please do not judge the quality of the algorithm described by
 the authors of the paper based on this implementation.

0001 function [Z, P, X, A] = sparse_pca(A, m, gamma)
0002 % Sparse principal component analysis based on optimization over Stiefel.
0003 %
0004 % [Z, P, X] = sparse_pca(A, m, gamma)
0005 %
0006 % We consider sparse PCA applied to a data matrix A of size pxn, where p is
0007 % the number of samples (observations) and n is the number of variables
0008 % (features). We attempt to extract m different components. The parameter
0009 % gamma, which must lie between 0 and the largest 2-norm of a column of
0010 % A, tunes the balance between best explanation of the variance of the data
0011 % (gamma = 0, mostly corresponds to standard PCA) and best sparsity of the
0012 % principal components Z (gamma maximal, Z is zero). The variables
0013 % contained in the columns of A are assumed centered (zero-mean).
0014 %
0015 % The output Z of size nxm represents the principal components. There are m
0016 % columns, each one of unit norm and capturing a prefered direction of the
0017 % data, while trying to be sparse. P has the same size as Z and represents
0018 % the sparsity pattern of Z. X is an orthonormal matrix of size pxm
0019 % produced internally by the algorithm.
0020 %
0021 % With classical PCA, the variability captured by m components is
0022 % sum(svds(A, m))
0023 % With the outputted Z, which should be sparser than normal PCA, it is
0024 % sum(svd(A*Z))
0025 %
0026 % The method is based on the maximization of a differentiable function over
0027 % the Stiefel manifold of dimension pxm. Notice that this dimension is
0028 % independent of n, making this method particularly suitable for problems
0029 % with many variables but few samples (n much larger than p). The
0030 % complexity of each iteration of the algorithm is linear in n as a result.
0031 %
0032 % The theory behind this code is available in the paper
0033 % http://jmlr.org/papers/volume11/journee10a/journee10a.pdf
0034 % Generalized Power Method for Sparse Principal Component Analysis, by
0035 % Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010.
0036 % This implementation is not equivalent to the one described in that paper
0037 % (and is independent from their authors) but is close in spirit
0038 % nonetheless. It is provided with Manopt as an example file but was not
0039 % optimized: please do not judge the quality of the algorithm described by
0040 % the authors of the paper based on this implementation.
0041 
0042 % This file is part of Manopt and is copyrighted. See the license file.
0043 %
0044 % Main author: Nicolas Boumal, Dec. 24, 2013
0045 % Contributors:
0046 %
0047 % Change log:
0048 %
0049 %   Aug. 23, 2021 (XJ):
0050 %       Added AD to compute the egrad
0051 
0052     % If no input is provided, generate random data for a quick demo
0053     if nargin == 0
0054         n = 100;
0055         p = 10;
0056         m = 2;
0057 
0058         % Data matrix
0059         A = randn(p, n);
0060 
0061         % Regularization parameter. This should be between 0 and the largest
0062         % 2-norm of a column of A.
0063         gamma = 1;
0064         
0065     elseif nargin ~= 3
0066         error('Please provide 3 inputs (or none for a demo).');
0067     end
0068     
0069     % Execute the main algorithm: it will compute a sparsity pattern P.
0070     [P, X] = sparse_pca_stiefel_l1(A, m, gamma);
0071     
0072     % Compute the principal components in accordance with the sparsity.
0073     Z = postprocess(A, P, X);
0074 
0075 end
0076 
0077 
0078 % Sparse PCA based on the block sparse PCA algorithm with l1-penalty as
0079 % featured in the reference paper by Journee et al. This is not the same
0080 % algorithm but it is the same cost function optimized over the same search
0081 % space. We force N = eye(m).
0082 function [P, X] = sparse_pca_stiefel_l1(A, m, gamma)
0083     
0084     [p, n] = size(A); %#ok<NASGU>
0085 
0086     % The optimization takes place over a Stiefel manifold whose dimension
0087     % is independent of n. This is especially useful when there are many
0088     % more variables than samples.
0089     St = stiefelfactory(p, m);
0090     problem.M = St;
0091 
0092     % In this helper function, given a point 'X' on the manifold we check
0093     % whether the caching structure 'store' has been populated with
0094     % quantities that are useful to compute at X or not. If they were not,
0095     % then we compute and store them now.
0096     function store = prepare(X, store)
0097         if ~isfield(store, 'ready') || ~store.ready
0098             store.AtX = A'*X;
0099             store.absAtX = abs(store.AtX);
0100             store.pos = max(0, store.absAtX - gamma);
0101             store.ready = true;
0102         end
0103     end
0104 
0105     % Define the cost function here and set it in the problem structure.
0106     problem.cost = @cost;
0107     function [f, store] = cost(X, store)
0108         store = prepare(X, store);
0109         pos = store.pos;
0110         f = -.5*norm(pos, 'fro')^2;
0111     end
0112 
0113     % Here, we chose to define the Euclidean gradient (egrad instead of
0114     % grad) : Manopt will take care of converting it to the Riemannian
0115     % gradient.
0116     problem.egrad = @egrad;
0117     function [G, store] = egrad(X, store)
0118         if ~isfield(store, 'G')
0119             store = prepare(X, store);
0120             pos = store.pos;
0121             AtX = store.AtX;
0122             sgAtX = sign(AtX);
0123             factor = pos.*sgAtX;
0124             store.G = -A*factor;
0125         end
0126         G = store.G;
0127     end
0128 
0129     % An alternative way to compute the egrad is to use automatic
0130     % differentiation provided in the deep learning toolbox (slower).
0131     % Notice that the function norm is not supported for AD so far.
0132     % Replace norm(...,'fro')^2 with cnormsqfro described in the file
0133     % manoptADhelp.m. Notice that the abs function is not differentiable
0134     % mathematically. Instead the subgradient is computed automatically.
0135     % problem.cost = @cost_AD;
0136     % function f = cost_AD(X)
0137     %    AtX = A'*X;
0138     %    absAtX = abs(AtX);
0139     %    pos = max(0, absAtX - gamma);
0140     %    f = -.5*cnormsqfro(pos);
0141     % end
0142     % problem = manoptAD(problem,'egrad');
0143 
0144     % checkgradient(problem);
0145     % pause;
0146 
0147     % The optimization happens here. To improve the method, it may be
0148     % interesting to investigate better-than-random initial iterates and,
0149     % possibly, to fine tune the parameters of the solver.
0150     X = trustregions(problem);
0151 
0152     % Compute the sparsity pattern by thresholding
0153     P = abs(A'*X) > gamma;
0154     
0155 end
0156 
0157 
0158 % This post-processing algorithm produces a matrix Z of size nxm matching
0159 % the sparsity pattern P and representing sparse principal components for
0160 % A. This is to be called with the output of the main algorithm. This
0161 % algorithm is described in the reference paper by Journee et al.
0162 function Z = postprocess(A, P, X)
0163     fprintf('Post-processing... ');
0164     counter = 0;
0165     maxiter = 1000;
0166     tolerance = 1e-8;
0167     while counter < maxiter
0168         Z = A'*X;
0169         Z(~P) = 0;
0170         Z = Z*diag(1./sqrt(diag(Z'*Z)));
0171         X = ufactor(A*Z);
0172         counter = counter + 1;
0173         if counter > 1 && norm(Z0-Z, 'fro') < tolerance*norm(Z0, 'fro')
0174             break;
0175         end
0176         Z0 = Z;
0177     end
0178     fprintf('done, in %d iterations (max = %d).\n', counter, maxiter);
0179 end
0180 
0181 % Returns the U-factor of the polar decomposition of X
0182 function U = ufactor(X)
0183     [W, S, V] = svd(X, 0); %#ok<ASGLU>
0184     U = W*V';
0185 end

sparse_pca

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SUBFUNCTIONS

SOURCE CODE