
sparse_pca

PURPOSE

Sparse principal component analysis based on optimization over Stiefel.

SYNOPSIS

function [Z, P, X, A] = sparse_pca(A, m, gamma)

DESCRIPTION

 Sparse principal component analysis based on optimization over Stiefel.

 [Z, P, X] = sparse_pca(A, m, gamma)

 We consider sparse PCA applied to a data matrix A of size pxn, where p is
 the number of samples (observations) and n is the number of variables
 (features). We attempt to extract m different components. The parameter
 gamma, which must lie between 0 and the largest 2-norm of a column of
 A, tunes the balance between best explanation of the variance of the data
 (gamma = 0 essentially corresponds to standard PCA) and best sparsity of the
 principal components Z (when gamma is maximal, Z is zero). The variables
 contained in the columns of A are assumed centered (zero-mean).
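
 As an aside, the upper end of that admissible range for gamma (the largest
 2-norm of a column of A) can be computed with a one-line expression. This
 snippet is only an illustration and the name gamma_max is hypothetical, not
 something used by the code below:

 gamma_max = max(sqrt(sum(A.^2, 1)));   % largest column 2-norm of A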

 The output Z of size nxm represents the principal components. There are m
 columns, each one of unit norm and capturing a preferred direction of the
 data, while trying to be sparse. P has the same size as Z and represents
 the sparsity pattern of Z. X is an orthonormal matrix of size pxm
 produced internally by the algorithm.

 With classical PCA, the variability captured by m components is
 sum(svds(A, m))
 With the returned Z, which should be sparser than the standard PCA components, it is
 sum(svd(A*Z))
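
 For instance, here is a minimal usage sketch along the lines of the built-in
 demo (calling sparse_pca() with no inputs runs a similar demo on random
 data); the value gamma = 1 and the display variable names are merely
 illustrative:

 p = 10; n = 100; m = 2;
 A = randn(p, n);
 A = bsxfun(@minus, A, mean(A, 1));      % center the variables (columns)
 gamma = 1;
 [Z, P, X] = sparse_pca(A, m, gamma);
 classical_variability = sum(svds(A, m)) % variability captured by classical PCA
 sparse_variability    = sum(svd(A*Z))   % variability captured by the sparse Z
 nnz(P)                                  % number of nonzeros in the pattern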

 The method is based on the maximization of a differentiable function over
 the Stiefel manifold of orthonormal pxm matrices. Notice that this size is
 independent of n, making this method particularly suitable for problems
 with many variables but few samples (n much larger than p). The
 complexity of each iteration of the algorithm is linear in n as a result.
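
 Concretely, reading off the cost and gradient functions in the source code
 below, the smooth function maximized over matrices X (of size pxm) with
 orthonormal columns can be written as follows (the solver actually minimizes
 its negative). This is a restatement of the source for reference, with
 illustrative handle names f and g, not extra functionality:

 f = @(X) .5*norm(max(0, abs(A'*X) - gamma), 'fro')^2;    % cost
 g = @(X) A*(max(0, abs(A'*X) - gamma) .* sign(A'*X));    % Euclidean gradient of f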

 The theory behind this code is available in the paper
 http://jmlr.org/papers/volume11/journee10a/journee10a.pdf
 Generalized Power Method for Sparse Principal Component Analysis, by
 Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010.
 This implementation is not equivalent to the one described in that paper
 (and is independent of its authors) but is close in spirit
 nonetheless. It is provided with Manopt as an example file but was not
 optimized: please do not judge the quality of the algorithm described by
 the authors of the paper based on this implementation.

CROSS-REFERENCE INFORMATION

This function calls:
This function is called by:

SUBFUNCTIONS

function [P, X] = sparse_pca_stiefel_l1(A, m, gamma)
function Z = postprocess(A, P, X)
function U = ufactor(X)

SOURCE CODE

0001 function [Z, P, X, A] = sparse_pca(A, m, gamma)
0002 % Sparse principal component analysis based on optimization over Stiefel.
0003 %
0004 % [Z, P, X] = sparse_pca(A, m, gamma)
0005 %
0006 % We consider sparse PCA applied to a data matrix A of size pxn, where p is
0007 % the number of samples (observations) and n is the number of variables
0008 % (features). We attempt to extract m different components. The parameter
0009 % gamma, which must lie between 0 and the largest 2-norm of a column of
0010 % A, tunes the balance between best explanation of the variance of the data
0011 % (gamma = 0 essentially corresponds to standard PCA) and best sparsity of the
0012 % principal components Z (when gamma is maximal, Z is zero). The variables
0013 % contained in the columns of A are assumed centered (zero-mean).
0014 %
0015 % The output Z of size nxm represents the principal components. There are m
0016 % columns, each one of unit norm and capturing a preferred direction of the
0017 % data, while trying to be sparse. P has the same size as Z and represents
0018 % the sparsity pattern of Z. X is an orthonormal matrix of size pxm
0019 % produced internally by the algorithm.
0020 %
0021 % With classical PCA, the variability captured by m components is
0022 % sum(svds(A, m))
0023 % With the returned Z, which should be sparser than the standard PCA components, it is
0024 % sum(svd(A*Z))
0025 %
0026 % The method is based on the maximization of a differentiable function over
0027 % the Stiefel manifold of orthonormal pxm matrices. Notice that this size is
0028 % independent of n, making this method particularly suitable for problems
0029 % with many variables but few samples (n much larger than p). The
0030 % complexity of each iteration of the algorithm is linear in n as a result.
0031 %
0032 % The theory behind this code is available in the paper
0033 % http://jmlr.org/papers/volume11/journee10a/journee10a.pdf
0034 % Generalized Power Method for Sparse Principal Component Analysis, by
0035 % Journee, Nesterov, Richtarik and Sepulchre, JMLR, 2010.
0036 % This implementation is not equivalent to the one described in that paper
0037 % (and is independent of its authors) but is close in spirit
0038 % nonetheless. It is provided with Manopt as an example file but was not
0039 % optimized: please do not judge the quality of the algorithm described by
0040 % the authors of the paper based on this implementation.
0041 
0042 % This file is part of Manopt and is copyrighted. See the license file.
0043 %
0044 % Main author: Nicolas Boumal, Dec. 24, 2013
0045 % Contributors:
0046 %
0047 % Change log:
0048 %
0049 
0050     % If no input is provided, generate random data for a quick demo
0051     if nargin == 0
0052         n = 100;
0053         p = 10;
0054         m = 2;
0055 
0056         % Data matrix
0057         A = randn(p, n);
0058 
0059         % Regularization parameter. This should be between 0 and the largest
0060         % 2-norm of a column of A.
0061         gamma = 1;
0062         
0063     elseif nargin ~= 3
0064         error('Please provide 3 inputs (or none for a demo).');
0065     end
0066     
0067     % Execute the main algorithm: it will compute a sparsity pattern P.
0068     [P, X] = sparse_pca_stiefel_l1(A, m, gamma);
0069     
0070     % Compute the principal components in accordance with the sparsity.
0071     Z = postprocess(A, P, X);
0072 
0073 end
0074 
0075 
0076 % Sparse PCA based on the block sparse PCA algorithm with l1-penalty as
0077 % featured in the reference paper by Journee et al. This is not the same
0078 % algorithm but it is the same cost function optimized over the same search
0079 % space. We force N = eye(m).
0080 function [P, X] = sparse_pca_stiefel_l1(A, m, gamma)
0081     
0082     [p, n] = size(A); %#ok<NASGU>
0083 
0084     % The optimization takes place over a Stiefel manifold whose dimension
0085     % is independent of n. This is especially useful when there are many
0086     % more variables than samples.
0087     St = stiefelfactory(p, m);
0088     problem.M = St;
0089 
0090     % In this helper function, given a point 'X' on the manifold, we
0091     % check whether the caching structure 'store' already holds the
0092     % quantities needed at X. If it does not, we compute and store
0093     % them now.
0094     function store = prepare(X, store)
0095         if ~isfield(store, 'ready') || ~store.ready
0096             store.AtX = A'*X;
0097             store.absAtX = abs(store.AtX);
0098             store.pos = max(0, store.absAtX - gamma);
0099             store.ready = true;
0100         end
0101     end
0102 
0103     % Define the cost function here and set it in the problem structure.
0104     problem.cost = @cost;
0105     function [f, store] = cost(X, store)
0106         store = prepare(X, store);
0107         pos = store.pos;
0108         f = -.5*norm(pos, 'fro')^2;
0109     end
0110 
0111     % Here, we chose to define the Euclidean gradient (egrad instead of
0112     % grad): Manopt will take care of converting it to the Riemannian
0113     % gradient.
0114     problem.egrad = @egrad;
0115     function [G, store] = egrad(X, store)
0116         if ~isfield(store, 'G')
0117             store = prepare(X, store);
0118             pos = store.pos;
0119             AtX = store.AtX;
0120             sgAtX = sign(AtX);
0121             factor = pos.*sgAtX;
0122             store.G = -A*factor;
0123         end
0124         G = store.G;
0125     end
0126 
0127     % checkgradient(problem);
0128     % pause;
0129 
0130     % The optimization happens here. To improve the method, it may be
0131     % interesting to investigate better-than-random initial iterates and,
0132     % possibly, to fine-tune the parameters of the solver.
0133     X = trustregions(problem);
0134 
0135     % Compute the sparsity pattern by thresholding
0136     P = abs(A'*X) > gamma;
0137     
0138 end
0139 
0140 
0141 % This post-processing algorithm produces a matrix Z of size nxm matching
0142 % the sparsity pattern P and representing sparse principal components for
0143 % A. This is to be called with the output of the main algorithm. This
0144 % algorithm is described in the reference paper by Journee et al.
0145 function Z = postprocess(A, P, X)
0146     fprintf('Post-processing... ');
0147     counter = 0;
0148     maxiter = 1000;
0149     tolerance = 1e-8;
0150     while counter < maxiter
0151         Z = A'*X;
0152         Z(~P) = 0;
0153         Z = Z*diag(1./sqrt(diag(Z'*Z)));
0154         X = ufactor(A*Z);
0155         counter = counter + 1;
0156         if counter > 1 && norm(Z0-Z, 'fro') < tolerance*norm(Z0, 'fro')
0157             break;
0158         end
0159         Z0 = Z;
0160     end
0161     fprintf('done, in %d iterations (max = %d).\n', counter, maxiter);
0162 end
0163 
0164 % Returns the U-factor of the polar decomposition of X
0165 function U = ufactor(X)
0166     [W, S, V] = svd(X, 0); %#ok<ASGLU>
0167     U = W*V';
0168 end
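
A quick sanity check of the ufactor subfunction (this snippet is not part of
the original file): the polar factor of a matrix with full column rank has
orthonormal columns, so U'*U should equal the identity up to rounding error.

    Xtest = randn(10, 2);          % hypothetical full column-rank input
    U = ufactor(Xtest);
    disp(norm(U'*U - eye(2)))      % close to zero: columns are orthonormal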
