algorithms.statistics.formula.formulae

Module: algorithms.statistics.formula.formulae

Inheritance diagram for nipy.algorithms.statistics.formula.formulae:

Inheritance diagram of nipy.algorithms.statistics.formula.formulae

Formula objects

A formula is basically a sympy expression for the mean of something of the form:

mean = sum([Beta(e)*e for e in expr])

Or, a linear combination of sympy expressions, with each one multiplied by its own “Beta”. The elements of expr can be instances of Term (for a linear regression formula, they would all be instances of Term). But, in general, there might be some other parameters (i.e. sympy.Symbol instances) that are not Terms.

The design matrix is made up of columns that are the derivatives of mean with respect to everything that is not a Term, evaluted at a recarray that has field names given by [str(t) for t in self.terms].

For those familiar with R’s formula syntax, if we wanted a design matrix like the following:

> s.table = read.table("http://www-stat.stanford.edu/~jtaylo/courses/stats191/data/supervisor.table", header=T)
> d = model.matrix(lm(Y ~ X1*X3, s.table)
)
> d
   (Intercept) X1 X3 X1:X3
1            1 51 39  1989
2            1 64 54  3456
3            1 70 69  4830
4            1 63 47  2961
5            1 78 66  5148
6            1 55 44  2420
7            1 67 56  3752
8            1 75 55  4125
9            1 82 67  5494
10           1 61 47  2867
11           1 53 58  3074
12           1 60 39  2340
13           1 62 42  2604
14           1 83 45  3735
15           1 77 72  5544
16           1 90 72  6480
17           1 85 69  5865
18           1 60 75  4500
19           1 70 57  3990
20           1 58 54  3132
21           1 40 34  1360
22           1 61 62  3782
23           1 66 50  3300
24           1 37 58  2146
25           1 54 48  2592
26           1 77 63  4851
27           1 75 74  5550
28           1 57 45  2565
29           1 85 71  6035
30           1 82 59  4838
attr(,"assign")
[1] 0 1 2 3
>

With the Formula, it looks like this:

>>> r = np.rec.array([
...     (43, 51, 30, 39, 61, 92, 45), (63, 64, 51, 54, 63, 73, 47), 
...     (71, 70, 68, 69, 76, 86, 48), (61, 63, 45, 47, 54, 84, 35),
...     (81, 78, 56, 66, 71, 83, 47), (43, 55, 49, 44, 54, 49, 34),
...     (58, 67, 42, 56, 66, 68, 35), (71, 75, 50, 55, 70, 66, 41),
...     (72, 82, 72, 67, 71, 83, 31), (67, 61, 45, 47, 62, 80, 41),
...     (64, 53, 53, 58, 58, 67, 34), (67, 60, 47, 39, 59, 74, 41),
...     (69, 62, 57, 42, 55, 63, 25), (68, 83, 83, 45, 59, 77, 35),
...     (77, 77, 54, 72, 79, 77, 46), (81, 90, 50, 72, 60, 54, 36),
...     (74, 85, 64, 69, 79, 79, 63), (65, 60, 65, 75, 55, 80, 60),
...     (65, 70, 46, 57, 75, 85, 46), (50, 58, 68, 54, 64, 78, 52),
...     (50, 40, 33, 34, 43, 64, 33), (64, 61, 52, 62, 66, 80, 41),
...     (53, 66, 52, 50, 63, 80, 37), (40, 37, 42, 58, 50, 57, 49),
...     (63, 54, 42, 48, 66, 75, 33), (66, 77, 66, 63, 88, 76, 72),
...     (78, 75, 58, 74, 80, 78, 49), (48, 57, 44, 45, 51, 83, 38),
...     (85, 85, 71, 71, 77, 74, 55), (82, 82, 39, 59, 64, 78, 39)],
...              dtype=[('y', '<i8'), ('x1', '<i8'), ('x2', '<i8'),
...                     ('x3', '<i8'), ('x4', '<i8'), ('x5', '<i8'),
...                     ('x6', '<i8')])
>>> x1 = Term('x1'); x3 = Term('x3')
>>> f = Formula([x1, x3, x1*x3]) + I
>>> f.mean
_b0*x1 + _b1*x3 + _b2*x1*x3 + _b3

The I is the “intercept” term, I have explicity not used R’s default of adding it to everything.

>>> f.design(r)  
array([(51.0, 39.0, 1989.0, 1.0), (64.0, 54.0, 3456.0, 1.0),
       (70.0, 69.0, 4830.0, 1.0), (63.0, 47.0, 2961.0, 1.0),
       (78.0, 66.0, 5148.0, 1.0), (55.0, 44.0, 2420.0, 1.0),
       (67.0, 56.0, 3752.0, 1.0), (75.0, 55.0, 4125.0, 1.0),
       (82.0, 67.0, 5494.0, 1.0), (61.0, 47.0, 2867.0, 1.0),
       (53.0, 58.0, 3074.0, 1.0), (60.0, 39.0, 2340.0, 1.0),
       (62.0, 42.0, 2604.0, 1.0), (83.0, 45.0, 3735.0, 1.0),
       (77.0, 72.0, 5544.0, 1.0), (90.0, 72.0, 6480.0, 1.0),
       (85.0, 69.0, 5865.0, 1.0), (60.0, 75.0, 4500.0, 1.0),
       (70.0, 57.0, 3990.0, 1.0), (58.0, 54.0, 3132.0, 1.0),
       (40.0, 34.0, 1360.0, 1.0), (61.0, 62.0, 3782.0, 1.0),
       (66.0, 50.0, 3300.0, 1.0), (37.0, 58.0, 2146.0, 1.0),
       (54.0, 48.0, 2592.0, 1.0), (77.0, 63.0, 4851.0, 1.0),
       (75.0, 74.0, 5550.0, 1.0), (57.0, 45.0, 2565.0, 1.0),
       (85.0, 71.0, 6035.0, 1.0), (82.0, 59.0, 4838.0, 1.0)], 
      dtype=[('x1', '<f8'), ('x3', '<f8'), ('x1*x3', '<f8'), ('1', '<f8')])

Classes

Beta

class nipy.algorithms.statistics.formula.formulae.Beta

Bases: sympy.core.symbol.Dummy

A symbol tied to a Term term

__init__()

x.__init__(…) initializes x; see help(type(x)) for signature

default_assumptions = {}

Factor

class nipy.algorithms.statistics.formula.formulae.Factor(name, levels, char='b')

Bases: nipy.algorithms.statistics.formula.formulae.Formula

A qualitative variable in a regression model

A Factor is similar to R’s factor. The levels of the Factor can be either strings or ints.

__init__(name, levels, char='b')

Initialize Factor

Parameters:

name : str

levels : [str or int]

A sequence of strings or ints.

char : str, optional

prefix character for regression coefficients

static fromcol(col, name)

Create a Factor from a column array.

Parameters:

col : ndarray

an array with ndim==1

name : str

name of the Factor

Returns:

factor : Factor

Examples

>>> data = np.array([(3,'a'),(4,'a'),(5,'b'),(3,'b')], np.dtype([('x', np.float), ('y', 'S1')]))
>>> f1 = Factor.fromcol(data['y'], 'y')
>>> f2 = Factor.fromcol(data['x'], 'x')
>>> d = f1.design(data)
>>> print(d.dtype.descr)
[('y_a', '<f8'), ('y_b', '<f8')]
>>> d = f2.design(data)
>>> print(d.dtype.descr)
[('x_3', '<f8'), ('x_4', '<f8'), ('x_5', '<f8')]
get_term(level)

Retrieve a term of the Factor…

main_effect
stratify(variable)

Create a new variable, stratified by the levels of a Factor.

Parameters:

variable : str or simple sympy expression

If sympy expression, then string representation must be all lower or upper case letters, i.e. it can be interpreted as a name.

Returns:

formula : Formula

Formula whose mean has one parameter named variable%d, for each level in self.levels

Examples

>>> f = Factor('a', ['x','y'])
>>> sf = f.stratify('theta')
>>> sf.mean
_theta0*a_x + _theta1*a_y

FactorTerm

class nipy.algorithms.statistics.formula.formulae.FactorTerm

Bases: nipy.algorithms.statistics.formula.formulae.Term

Boolean Term derived from a Factor.

Its properties are the same as a Term except that its product with itself is itself.

__init__()

x.__init__(…) initializes x; see help(type(x)) for signature

default_assumptions = {}

Formula

class nipy.algorithms.statistics.formula.formulae.Formula(seq, char='b')

Bases: object

A Formula is a model for a mean in a regression model.

It is often given by a sequence of sympy expressions, with the mean model being the sum of each term multiplied by a linear regression coefficient.

The expressions may depend on additional Symbol instances, giving a non-linear regression model.

__init__(seq, char='b')
Parameters:

seq : sequence of sympy.Basic

char : str, optional

character for regression coefficient

coefs

Coefficients in the linear regression formula.

design(input, param=None, return_float=False, contrasts=None)

Construct the design matrix, and optional contrast matrices.

Parameters:

input : np.recarray

Recarray including fields needed to compute the Terms in getparams(self.design_expr).

param : None or np.recarray

Recarray including fields that are not Terms in getparams(self.design_expr)

return_float : bool, optional

If True, return a np.float array rather than a np.recarray

contrasts : None or dict, optional

Contrasts. The items in this dictionary should be (str, Formula) pairs where a contrast matrix is constructed for each Formula by evaluating its design at the same parameters as self.design. If not None, then the return_float is set to True.

Returns:

des : 2D array

design matrix

cmatrices : dict, optional

Dictionary with keys from contrasts input, and contrast matrices corresponding to des design matrix. Returned only if contrasts input is not None

design_expr
dtype

The dtype of the design matrix of the Formula.

static fromrec(rec, keep=[], drop=[])

Construct Formula from recarray

For fields with a string-dtype, it is assumed that these are qualtiatitve regressors, i.e. Factors.

Parameters:

rec: recarray

Recarray whose field names will be used to create a formula.

keep: []

Field names to explicitly keep, dropping all others.

drop: []

Field names to drop.

mean

Expression for the mean, expressed as a linear combination of terms, each with dummy variables in front.

params

The parameters in the Formula.

subs(old, new)

Perform a sympy substitution on all terms in the Formula

Returns a new instance of the same class

Parameters:

old : sympy.Basic

The expression to be changed

new : sympy.Basic

The value to change it to.

Returns:

newf : Formula

Examples

>>> s, t = [Term(l) for l in 'st']
>>> f, g = [sympy.Function(l) for l in 'fg']
>>> form = Formula([f(t),g(s)])
>>> newform = form.subs(g, sympy.Function('h'))
>>> newform.terms
array([f(t), h(s)], dtype=object)
>>> form.terms
array([f(t), g(s)], dtype=object)
terms

Terms in the linear regression formula.

RandomEffects

class nipy.algorithms.statistics.formula.formulae.RandomEffects(seq, sigma=None, char='e')

Bases: nipy.algorithms.statistics.formula.formulae.Formula

Covariance matrices for common random effects analyses.

Examples

Two subjects (here named 2 and 3):

>>> subj = make_recarray([2,2,2,3,3], 's')
>>> subj_factor = Factor('s', [2,3])

By default the covariance matrix is symbolic. The display differs a little between sympy versions (hence we don’t check it in the doctests):

>>> c = RandomEffects(subj_factor.terms)
>>> c.cov(subj) 
array([[_s2_0, _s2_0, _s2_0, 0, 0],
       [_s2_0, _s2_0, _s2_0, 0, 0],
       [_s2_0, _s2_0, _s2_0, 0, 0],
       [0, 0, 0, _s2_1, _s2_1],
       [0, 0, 0, _s2_1, _s2_1]], dtype=object)

With a numeric sigma, you get a numeric array:

>>> c = RandomEffects(subj_factor.terms, sigma=np.array([[4,1],[1,6]]))
>>> c.cov(subj)
array([[ 4.,  4.,  4.,  1.,  1.],
       [ 4.,  4.,  4.,  1.,  1.],
       [ 4.,  4.,  4.,  1.,  1.],
       [ 1.,  1.,  1.,  6.,  6.],
       [ 1.,  1.,  1.,  6.,  6.]])
__init__(seq, sigma=None, char='e')

Initialize random effects instance

Parameters:

seq : [sympy.Basic]

sigma : ndarray

Covariance of the random effects. Defaults to a diagonal with entries for each random effect.

char : character for regression coefficient

cov(term, param=None)

Compute the covariance matrix for some given data.

Parameters:

term : np.recarray

Recarray including fields corresponding to the Terms in getparams(self.design_expr).

param : np.recarray

Recarray including fields that are not Terms in getparams(self.design_expr)

Returns:

C : ndarray

Covariance matrix implied by design and self.sigma.

Term

class nipy.algorithms.statistics.formula.formulae.Term

Bases: sympy.core.symbol.Symbol

A sympy.Symbol type to represent a term an a regression model

Terms can be added to other sympy expressions with the single convention that a term plus itself returns itself.

It is meant to emulate something on the right hand side of a formula in R. In particular, its name can be the name of a field in a recarray used to create a design matrix.

>>> t = Term('x')
>>> xval = np.array([(3,),(4,),(5,)], np.dtype([('x', np.float)]))
>>> f = t.formula
>>> d = f.design(xval)
>>> print(d.dtype.descr)
[('x', '<f8')]
>>> f.design(xval, return_float=True)
array([ 3.,  4.,  5.])
__init__()

x.__init__(…) initializes x; see help(type(x)) for signature

default_assumptions = {}
formula

Return a Formula with only terms=[self].

Functions

nipy.algorithms.statistics.formula.formulae.contrast_from_cols_or_rows(L, D, pseudo=None)

Construct a contrast matrix from a design matrix D

(possibly with its pseudo inverse already computed) and a matrix L that either specifies something in the column space of D or the row space of D.

Parameters:

L : ndarray

Matrix used to try and construct a contrast.

D : ndarray

Design matrix used to create the contrast.

pseudo : None or array-like, optional

If not None, gives pseudo-inverse of D. Allows you to pass this if it is already calculated.

Returns:

C : ndarray

Matrix with C.shape[1] == D.shape[1] representing an estimable contrast.

Notes

From an n x p design matrix D and a matrix L, tries to determine a p x q contrast matrix C which determines a contrast of full rank, i.e. the n x q matrix

dot(transpose(C), pinv(D))

is full rank.

L must satisfy either L.shape[0] == n or L.shape[1] == p.

If L.shape[0] == n, then L is thought of as representing columns in the column space of D.

If L.shape[1] == p, then L is thought of as what is known as a contrast matrix. In this case, this function returns an estimable contrast corresponding to the dot(D, L.T)

This always produces a meaningful contrast, not always with the intended properties because q is always non-zero unless L is identically 0. That is, it produces a contrast that spans the column space of L (after projection onto the column space of D).

nipy.algorithms.statistics.formula.formulae.define(*args, **kwargs)
nipy.algorithms.statistics.formula.formulae.getparams(expression)

Return the parameters of an expression that are not Term instances but are instances of sympy.Symbol.

Examples

>>> x, y, z = [Term(l) for l in 'xyz']
>>> f = Formula([x,y,z])
>>> getparams(f)
[]
>>> f.mean
_b0*x + _b1*y + _b2*z
>>> getparams(f.mean)
[_b0, _b1, _b2]
>>> th = sympy.Symbol('theta')
>>> f.mean*sympy.exp(th)
(_b0*x + _b1*y + _b2*z)*exp(theta)
>>> getparams(f.mean*sympy.exp(th))
[_b0, _b1, _b2, theta]
nipy.algorithms.statistics.formula.formulae.getterms(expression)

Return the all instances of Term in an expression.

Examples

>>> x, y, z = [Term(l) for l in 'xyz']
>>> f = Formula([x,y,z])
>>> getterms(f)
[x, y, z]
>>> getterms(f.mean)
[x, y, z]
nipy.algorithms.statistics.formula.formulae.is_factor(obj)

Is obj a Factor?

nipy.algorithms.statistics.formula.formulae.is_factor_term(obj)

Is obj a FactorTerm?

nipy.algorithms.statistics.formula.formulae.is_formula(obj)

Is obj a Formula?

nipy.algorithms.statistics.formula.formulae.is_term(obj)

Is obj a Term?

nipy.algorithms.statistics.formula.formulae.make_dummy(*args, **kwds)

make_dummy is deprecated! Please use sympy.Dummy instead of this function

Make dummy variable of given name
Parameters:

name : str

name of dummy variable

Returns:

dum : Dummy instance

Notes

The interface to Dummy changed between 0.6.7 and 0.7.0, and we used this function to keep compatibility. Now we depend on sympy 0.7.0 and this function is obsolete.

nipy.algorithms.statistics.formula.formulae.make_recarray(rows, names, dtypes=None, drop_name_dim=<class nipy.utils._NoValue>)

Create recarray from rows with field names names

Create a recarray with named columns from a list or ndarray of rows and sequence of names for the columns. If rows is an ndarray, dtypes must be None, otherwise we raise a ValueError. Otherwise, if dtypes is None, we cast the data in all columns in rows as np.float. If dtypes is not None, the routine uses dtypes as a dtype specifier for the output structured array.

Parameters:

rows: list or array

Rows that will be turned into an recarray.

names: sequence

Sequence of strings - names for the columns.

dtypes: None or sequence of str or sequence of np.dtype, optional

Used to create a np.dtype, can be sequence of np.dtype or string.

drop_name_dim : {_NoValue, False, True}, optional

Flag for compatibility with future default behavior. Current default is False. If True, drops the length 1 dimension corresponding to the axis transformed into fields when converting into a recarray. If _NoValue specified, gives default. Default will change to True in the next version of Nipy.

Returns:

v : np.ndarray

Structured array with field names given by names.

Raises:

ValueError

dtypes not None when rows is array.

Examples

The following tests depend on machine byte order for their exact output.

>>> arr = np.array([[3, 4], [4, 6], [6, 8]])
>>> make_recarray(arr, ['x', 'y'],
...               drop_name_dim=True) 
array([(3, 4), (4, 6), (6, 8)],
      dtype=[('x', '<i8'), ('y', '<i8')])
>>> make_recarray(arr, ['x', 'y'],
...               drop_name_dim=False) 
array([[(3, 4)],
       [(4, 6)],
       [(6, 8)]],
      dtype=[('x', '<i8'), ('y', '<i8')])
>>> r = make_recarray(arr, ['w', 'u'], drop_name_dim=True)
>>> make_recarray(r, ['x', 'y'],
...               drop_name_dim=True) 
array([(3, 4), (4, 6), (6, 8)],
      dtype=[('x', '<i8'), ('y', '<i8')])
>>> make_recarray([[3, 4], [4, 6], [7, 9]], 'wv',
...               [np.float, np.int])  
array([(3.0, 4), (4.0, 6), (7.0, 9)],
      dtype=[('w', '<f8'), ('v', '<i8')])
nipy.algorithms.statistics.formula.formulae.natural_spline(t, knots=None, order=3, intercept=False)

Return a Formula containing a natural spline

Spline for a Term with specified knots and order.

Parameters:

t : Term

knots : None or sequence, optional

Sequence of float. Default None (same as empty list)

order : int, optional

Order of the spline. Defaults to a cubic (==3)

intercept : bool, optional

If True, include a constant function in the natural spline. Default is False

Returns:

formula : Formula

A Formula with (len(knots) + order) Terms (if intercept=False, otherwise includes one more Term), made up of the natural spline functions.

Examples

>>> x = Term('x')
>>> n = natural_spline(x, knots=[1,3,4], order=3)
>>> xval = np.array([3,5,7.]).view(np.dtype([('x', np.float)]))
>>> n.design(xval, return_float=True)
array([[   3.,    9.,   27.,    8.,    0.,   -0.],
       [   5.,   25.,  125.,   64.,    8.,    1.],
       [   7.,   49.,  343.,  216.,   64.,   27.]])
>>> d = n.design(xval)
>>> print(d.dtype.descr)
[('ns_1(x)', '<f8'), ('ns_2(x)', '<f8'), ('ns_3(x)', '<f8'), ('ns_4(x)', '<f8'), ('ns_5(x)', '<f8'), ('ns_6(x)', '<f8')]
nipy.algorithms.statistics.formula.formulae.terms(names, **kwargs)

Return list of terms with names given by names

This is just a convenience in defining a set of terms, and is the equivalent of sympy.symbols for defining symbols in sympy.

We enforce the sympy 0.7.0 behavior of returning symbol “abc” from input “abc”, rthan than 3 symbols “a”, “b”, “c”.

Parameters:

names : str or sequence of str

If a single str, can specify multiple ``Term``s with string containing space or ‘,’ as separator.

**kwargs : keyword arguments

keyword arguments as for sympy.symbols

Returns:

ts : Term or tuple

Term instance or list of Term instance objects named from names

Examples

>>> terms(('a', 'b', 'c'))
(a, b, c)
>>> terms('a, b, c')
(a, b, c)
>>> terms('abc')
abc