algorithms.utils.pca

Module: algorithms.utils.pca

This module provides a class for principal components analysis (PCA).

PCA is an orthonormal, linear transform (i.e., a rotation) that maps the data to a new coordinate system such that the maximal variability of the data lies on the first coordinate (or the first principal component), the second greatest variability is projected onto the second coordinate, and so on. The resulting data has unit covariance (i.e., it is decorrelated). This technique can be used to reduce the dimensionality of the data.

More specifically, the data is projected onto the eigenvectors of the covariance matrix.

Functions

nipy.algorithms.utils.pca.pca(data, axis=0, mask=None, ncomp=None, standardize=True, design_keep=None, design_resid='mean', tol_ratio=0.01)

Compute the SVD PCA of an array-like thing over axis.

Parameters:

data : ndarray-like (np.float)

The array on which to perform PCA over axis axis (below)

axis : int, optional

The axis over which to perform PCA (axis identifying observations). Default is 0 (first)

mask : ndarray-like (np.bool), optional

An optional mask, should have shape given by data axes, with axis removed, i.e.: s = data.shape; s.pop(axis); msk_shape=s

ncomp : {None, int}, optional

How many component basis projections to return. If ncomp is None (the default) then the number of components is given by the calculated rank of the data, after applying design_keep, design_resid and tol_ratio below. We always return all the basis vectors and percent variance for each component; ncomp refers only to the number of basis_projections returned.

standardize : bool, optional

If True, standardize so each time series (after application of design_keep and design_resid) has the same standard deviation, as calculated by the np.std function.

design_keep : None or ndarray, optional

Data is projected onto the column span of design_keep. None (default) equivalent to np.identity(data.shape[axis])

design_resid : str or None or ndarray, optional

After projecting onto the column span of design_keep, data is projected perpendicular to the column span of this matrix. If None, we do no such second projection. If a string ‘mean’, then the mean of the data is removed, equivalent to passing a column vector matrix of 1s.

tol_ratio : float, optional

If XZ is the vector of singular values of the projection matrix from design_keep and design_resid, and S are the singular values of XZ, then tol_ratio is the value used to calculate the effective rank of the projection of the design, as in rank = ((S / S.max) > tol_ratio).sum()

Returns:

results : dict

\(G\) is the number of non-trivial components found after applying

tol_ratio to the projections of design_keep and design_resid.

results has keys:

  • basis_vectors: series over axis, shape (data.shape[axis], G) -
    the eigenvectors of the PCA
  • pcnt_var: percent variance explained by component, shape
    (G,)
  • basis_projections: PCA components, with components varying
    over axis axis; thus shape given by: s = list(data.shape); s[axis] = ncomp
  • axis: axis over which PCA has been performed.

Notes

See pca_image.m from fmristat for Keith Worsley’s code on which some of this is based.

See: http://en.wikipedia.org/wiki/Principal_component_analysis for some inspiration for naming - particularly ‘basis_vectors’ and ‘basis_projections’

Examples

>>> arr = np.random.normal(size=(17, 10, 12, 14))
>>> msk = np.all(arr > -2, axis=0)
>>> res = pca(arr, mask=msk, ncomp=9)

Basis vectors are columns. There is one column for each component. The number of components is the calculated rank of the data matrix after applying the various projections listed in the parameters. In this case we are only removing the mean, so the number of components is one less than the axis over which we do the PCA (here axis=0 by default).

>>> res['basis_vectors'].shape
(17, 16)

Basis projections are arrays with components in the dimension over which we have done the PCA (axis=0 by default). Because we set ncomp above, we only retain ncomp components.

>>> res['basis_projections'].shape
(9, 10, 12, 14)
nipy.algorithms.utils.pca.pca_image(img, axis='t', mask=None, ncomp=None, standardize=True, design_keep=None, design_resid='mean', tol_ratio=0.01)

Compute the PCA of an image over a specified axis

Parameters:

img : Image

The image on which to perform PCA over the given axis

axis : str or int, optional

Axis over which to perform PCA. Default is ‘t’. If axis is an integer, gives the index of the input (domain) axis of img. If axis is a str, can be an input (domain) name, or an output (range) name, that maps to an input (domain) name.

mask : Image, optional

An optional mask, should have shape == image.shape[:3] and the same coordinate map as img but with axis dropped

ncomp : {None, int}, optional

How many component basis projections to return. If ncomp is None (the default) then the number of components is given by the calculated rank of the data, after applying design_keep, design_resid and tol_ratio below. We always return all the basis vectors and percent variance for each component; ncomp refers only to the number of basis_projections returned.

standardize : bool, optional

If True, standardize so each time series (after application of design_keep and design_resid) has the same standard deviation, as calculated by the np.std function.

design_keep : None or ndarray, optional

Data is projected onto the column span of design_keep. None (default) equivalent to np.identity(data.shape[axis])

design_resid : str or None or ndarray, optional

After projecting onto the column span of design_keep, data is projected perpendicular to the column span of this matrix. If None, we do no such second projection. If a string ‘mean’, then the mean of the data is removed, equivalent to passing a column vector matrix of 1s.

tol_ratio : float, optional

If XZ is the vector of singular values of the projection matrix from design_keep and design_resid, and S are the singular values of XZ, then tol_ratio is the value used to calculate the effective rank of the projection of the design, as in rank = ((S / S.max) > tol_ratio).sum()

Returns:

results : dict

\(L\) is the number of non-trivial components found after applying

tol_ratio to the projections of design_keep and design_resid.

results has keys: * basis_vectors: series over axis, shape (data.shape[axis], L) -

the eigenvectors of the PCA

  • pcnt_var: percent variance explained by component, shape
    (L,)
  • basis_projections: PCA components, with components varying
    over axis axis; thus shape given by: s = list(data.shape); s[axis] = ncomp
  • axis: axis over which PCA has been performed.

Examples

>>> from nipy.testing import funcfile
>>> from nipy import load_image
>>> func_img = load_image(funcfile)

Time is the fourth axis

>>> func_img.coordmap.function_range
CoordinateSystem(coord_names=('aligned-x=L->R', 'aligned-y=P->A', 'aligned-z=I->S', 't'), name='aligned', coord_dtype=float64)
>>> func_img.shape
(17, 21, 3, 20)

Calculate the PCA over time, by default

>>> res = pca_image(func_img)
>>> res['basis_projections'].coordmap.function_range
CoordinateSystem(coord_names=('aligned-x=L->R', 'aligned-y=P->A', 'aligned-z=I->S', 'PCA components'), name='aligned', coord_dtype=float64)

The number of components is one less than the number of time points

>>> res['basis_projections'].shape
(17, 21, 3, 19)