# algorithms.statistics.formula.formulae¶

## Module: algorithms.statistics.formula.formulae¶

Inheritance diagram for nipy.algorithms.statistics.formula.formulae:

### Formula objects¶

A formula is basically a sympy expression for the mean of something of the form:

mean = sum([Beta(e)*e for e in expr])


Or, a linear combination of sympy expressions, with each one multiplied by its own “Beta”. The elements of expr can be instances of Term (for a linear regression formula, they would all be instances of Term). But, in general, there might be some other parameters (i.e. sympy.Symbol instances) that are not Terms.

The design matrix is made up of columns that are the derivatives of mean with respect to everything that is not a Term, evaluted at a recarray that has field names given by [str(t) for t in self.terms].

For those familiar with R’s formula syntax, if we wanted a design matrix like the following:

> s.table = read.table("http://www-stat.stanford.edu/~jtaylo/courses/stats191/data/supervisor.table", header=T)
> d = model.matrix(lm(Y ~ X1*X3, s.table)
)
> d
(Intercept) X1 X3 X1:X3
1            1 51 39  1989
2            1 64 54  3456
3            1 70 69  4830
4            1 63 47  2961
5            1 78 66  5148
6            1 55 44  2420
7            1 67 56  3752
8            1 75 55  4125
9            1 82 67  5494
10           1 61 47  2867
11           1 53 58  3074
12           1 60 39  2340
13           1 62 42  2604
14           1 83 45  3735
15           1 77 72  5544
16           1 90 72  6480
17           1 85 69  5865
18           1 60 75  4500
19           1 70 57  3990
20           1 58 54  3132
21           1 40 34  1360
22           1 61 62  3782
23           1 66 50  3300
24           1 37 58  2146
25           1 54 48  2592
26           1 77 63  4851
27           1 75 74  5550
28           1 57 45  2565
29           1 85 71  6035
30           1 82 59  4838
attr(,"assign")
[1] 0 1 2 3
>


With the Formula, it looks like this:

>>> r = np.rec.array([
...     (43, 51, 30, 39, 61, 92, 45), (63, 64, 51, 54, 63, 73, 47),
...     (71, 70, 68, 69, 76, 86, 48), (61, 63, 45, 47, 54, 84, 35),
...     (81, 78, 56, 66, 71, 83, 47), (43, 55, 49, 44, 54, 49, 34),
...     (58, 67, 42, 56, 66, 68, 35), (71, 75, 50, 55, 70, 66, 41),
...     (72, 82, 72, 67, 71, 83, 31), (67, 61, 45, 47, 62, 80, 41),
...     (64, 53, 53, 58, 58, 67, 34), (67, 60, 47, 39, 59, 74, 41),
...     (69, 62, 57, 42, 55, 63, 25), (68, 83, 83, 45, 59, 77, 35),
...     (77, 77, 54, 72, 79, 77, 46), (81, 90, 50, 72, 60, 54, 36),
...     (74, 85, 64, 69, 79, 79, 63), (65, 60, 65, 75, 55, 80, 60),
...     (65, 70, 46, 57, 75, 85, 46), (50, 58, 68, 54, 64, 78, 52),
...     (50, 40, 33, 34, 43, 64, 33), (64, 61, 52, 62, 66, 80, 41),
...     (53, 66, 52, 50, 63, 80, 37), (40, 37, 42, 58, 50, 57, 49),
...     (63, 54, 42, 48, 66, 75, 33), (66, 77, 66, 63, 88, 76, 72),
...     (78, 75, 58, 74, 80, 78, 49), (48, 57, 44, 45, 51, 83, 38),
...     (85, 85, 71, 71, 77, 74, 55), (82, 82, 39, 59, 64, 78, 39)],
...              dtype=[('y', '<i8'), ('x1', '<i8'), ('x2', '<i8'),
...                     ('x3', '<i8'), ('x4', '<i8'), ('x5', '<i8'),
...                     ('x6', '<i8')])
>>> x1 = Term('x1'); x3 = Term('x3')
>>> f = Formula([x1, x3, x1*x3]) + I
>>> f.mean
_b0*x1 + _b1*x3 + _b2*x1*x3 + _b3


The I is the “intercept” term, I have explicity not used R’s default of adding it to everything.

>>> f.design(r)
array([(51.0, 39.0, 1989.0, 1.0), (64.0, 54.0, 3456.0, 1.0),
(70.0, 69.0, 4830.0, 1.0), (63.0, 47.0, 2961.0, 1.0),
(78.0, 66.0, 5148.0, 1.0), (55.0, 44.0, 2420.0, 1.0),
(67.0, 56.0, 3752.0, 1.0), (75.0, 55.0, 4125.0, 1.0),
(82.0, 67.0, 5494.0, 1.0), (61.0, 47.0, 2867.0, 1.0),
(53.0, 58.0, 3074.0, 1.0), (60.0, 39.0, 2340.0, 1.0),
(62.0, 42.0, 2604.0, 1.0), (83.0, 45.0, 3735.0, 1.0),
(77.0, 72.0, 5544.0, 1.0), (90.0, 72.0, 6480.0, 1.0),
(85.0, 69.0, 5865.0, 1.0), (60.0, 75.0, 4500.0, 1.0),
(70.0, 57.0, 3990.0, 1.0), (58.0, 54.0, 3132.0, 1.0),
(40.0, 34.0, 1360.0, 1.0), (61.0, 62.0, 3782.0, 1.0),
(66.0, 50.0, 3300.0, 1.0), (37.0, 58.0, 2146.0, 1.0),
(54.0, 48.0, 2592.0, 1.0), (77.0, 63.0, 4851.0, 1.0),
(75.0, 74.0, 5550.0, 1.0), (57.0, 45.0, 2565.0, 1.0),
(85.0, 71.0, 6035.0, 1.0), (82.0, 59.0, 4838.0, 1.0)],
dtype=[('x1', '<f8'), ('x3', '<f8'), ('x1*x3', '<f8'), ('1', '<f8')])


## Classes¶

### Beta¶

class nipy.algorithms.statistics.formula.formulae.Beta

Bases: sympy.core.symbol.Dummy

A symbol tied to a Term term

__init__()

x.__init__(…) initializes x; see help(type(x)) for signature

default_assumptions = {}

### Factor¶

class nipy.algorithms.statistics.formula.formulae.Factor(name, levels, char='b')

A qualitative variable in a regression model

A Factor is similar to R’s factor. The levels of the Factor can be either strings or ints.

__init__(name, levels, char='b')

Initialize Factor

Parameters: name : str levels : [str or int] A sequence of strings or ints. char : str, optional prefix character for regression coefficients
static fromcol(col, name)

Create a Factor from a column array.

Parameters: col : ndarray an array with ndim==1 name : str name of the Factor factor : Factor

Examples

>>> data = np.array([(3,'a'),(4,'a'),(5,'b'),(3,'b')], np.dtype([('x', np.float), ('y', 'S1')]))
>>> f1 = Factor.fromcol(data['y'], 'y')
>>> f2 = Factor.fromcol(data['x'], 'x')
>>> d = f1.design(data)
>>> print(d.dtype.descr)
[('y_a', '<f8'), ('y_b', '<f8')]
>>> d = f2.design(data)
>>> print(d.dtype.descr)
[('x_3', '<f8'), ('x_4', '<f8'), ('x_5', '<f8')]

get_term(level)

Retrieve a term of the Factor…

main_effect
stratify(variable)

Create a new variable, stratified by the levels of a Factor.

Parameters: variable : str or simple sympy expression If sympy expression, then string representation must be all lower or upper case letters, i.e. it can be interpreted as a name. formula : Formula Formula whose mean has one parameter named variable%d, for each level in self.levels

Examples

>>> f = Factor('a', ['x','y'])
>>> sf = f.stratify('theta')
>>> sf.mean
_theta0*a_x + _theta1*a_y


### FactorTerm¶

class nipy.algorithms.statistics.formula.formulae.FactorTerm

Boolean Term derived from a Factor.

Its properties are the same as a Term except that its product with itself is itself.

__init__()

x.__init__(…) initializes x; see help(type(x)) for signature

default_assumptions = {}

### Formula¶

class nipy.algorithms.statistics.formula.formulae.Formula(seq, char='b')

Bases: object

A Formula is a model for a mean in a regression model.

It is often given by a sequence of sympy expressions, with the mean model being the sum of each term multiplied by a linear regression coefficient.

The expressions may depend on additional Symbol instances, giving a non-linear regression model.

__init__(seq, char='b')
Parameters: seq : sequence of sympy.Basic char : str, optional character for regression coefficient
coefs

Coefficients in the linear regression formula.

design(input, param=None, return_float=False, contrasts=None)

Construct the design matrix, and optional contrast matrices.

Parameters: input : np.recarray Recarray including fields needed to compute the Terms in getparams(self.design_expr). param : None or np.recarray Recarray including fields that are not Terms in getparams(self.design_expr) return_float : bool, optional If True, return a np.float array rather than a np.recarray contrasts : None or dict, optional Contrasts. The items in this dictionary should be (str, Formula) pairs where a contrast matrix is constructed for each Formula by evaluating its design at the same parameters as self.design. If not None, then the return_float is set to True. des : 2D array design matrix cmatrices : dict, optional Dictionary with keys from contrasts input, and contrast matrices corresponding to des design matrix. Returned only if contrasts input is not None
design_expr
dtype

The dtype of the design matrix of the Formula.

static fromrec(rec, keep=[], drop=[])

Construct Formula from recarray

For fields with a string-dtype, it is assumed that these are qualtiatitve regressors, i.e. Factors.

Parameters: rec: recarray Recarray whose field names will be used to create a formula. keep: [] Field names to explicitly keep, dropping all others. drop: [] Field names to drop.
mean

Expression for the mean, expressed as a linear combination of terms, each with dummy variables in front.

params

The parameters in the Formula.

subs(old, new)

Perform a sympy substitution on all terms in the Formula

Returns a new instance of the same class

Parameters: old : sympy.Basic The expression to be changed new : sympy.Basic The value to change it to. newf : Formula

Examples

>>> s, t = [Term(l) for l in 'st']
>>> f, g = [sympy.Function(l) for l in 'fg']
>>> form = Formula([f(t),g(s)])
>>> newform = form.subs(g, sympy.Function('h'))
>>> newform.terms
array([f(t), h(s)], dtype=object)
>>> form.terms
array([f(t), g(s)], dtype=object)

terms

Terms in the linear regression formula.

### RandomEffects¶

class nipy.algorithms.statistics.formula.formulae.RandomEffects(seq, sigma=None, char='e')

Covariance matrices for common random effects analyses.

Examples

Two subjects (here named 2 and 3):

>>> subj = make_recarray([2,2,2,3,3], 's')
>>> subj_factor = Factor('s', [2,3])


By default the covariance matrix is symbolic. The display differs a little between sympy versions (hence we don’t check it in the doctests):

>>> c = RandomEffects(subj_factor.terms)
>>> c.cov(subj)
array([[_s2_0, _s2_0, _s2_0, 0, 0],
[_s2_0, _s2_0, _s2_0, 0, 0],
[_s2_0, _s2_0, _s2_0, 0, 0],
[0, 0, 0, _s2_1, _s2_1],
[0, 0, 0, _s2_1, _s2_1]], dtype=object)


With a numeric sigma, you get a numeric array:

>>> c = RandomEffects(subj_factor.terms, sigma=np.array([[4,1],[1,6]]))
>>> c.cov(subj)
array([[ 4.,  4.,  4.,  1.,  1.],
[ 4.,  4.,  4.,  1.,  1.],
[ 4.,  4.,  4.,  1.,  1.],
[ 1.,  1.,  1.,  6.,  6.],
[ 1.,  1.,  1.,  6.,  6.]])

__init__(seq, sigma=None, char='e')

Initialize random effects instance

Parameters: seq : [sympy.Basic] sigma : ndarray Covariance of the random effects. Defaults to a diagonal with entries for each random effect. char : character for regression coefficient
cov(term, param=None)

Compute the covariance matrix for some given data.

Parameters: term : np.recarray Recarray including fields corresponding to the Terms in getparams(self.design_expr). param : np.recarray Recarray including fields that are not Terms in getparams(self.design_expr) C : ndarray Covariance matrix implied by design and self.sigma.

### Term¶

class nipy.algorithms.statistics.formula.formulae.Term

Bases: sympy.core.symbol.Symbol

A sympy.Symbol type to represent a term an a regression model

Terms can be added to other sympy expressions with the single convention that a term plus itself returns itself.

It is meant to emulate something on the right hand side of a formula in R. In particular, its name can be the name of a field in a recarray used to create a design matrix.

>>> t = Term('x')
>>> xval = np.array([(3,),(4,),(5,)], np.dtype([('x', np.float)]))
>>> f = t.formula
>>> d = f.design(xval)
>>> print(d.dtype.descr)
[('x', '<f8')]
>>> f.design(xval, return_float=True)
array([ 3.,  4.,  5.])

__init__()

x.__init__(…) initializes x; see help(type(x)) for signature

default_assumptions = {}
formula

Return a Formula with only terms=[self].

## Functions¶

nipy.algorithms.statistics.formula.formulae.contrast_from_cols_or_rows(L, D, pseudo=None)

Construct a contrast matrix from a design matrix D

(possibly with its pseudo inverse already computed) and a matrix L that either specifies something in the column space of D or the row space of D.

Parameters: L : ndarray Matrix used to try and construct a contrast. D : ndarray Design matrix used to create the contrast. pseudo : None or array-like, optional If not None, gives pseudo-inverse of D. Allows you to pass this if it is already calculated. C : ndarray Matrix with C.shape[1] == D.shape[1] representing an estimable contrast.

Notes

From an n x p design matrix D and a matrix L, tries to determine a p x q contrast matrix C which determines a contrast of full rank, i.e. the n x q matrix

dot(transpose(C), pinv(D))

is full rank.

L must satisfy either L.shape[0] == n or L.shape[1] == p.

If L.shape[0] == n, then L is thought of as representing columns in the column space of D.

If L.shape[1] == p, then L is thought of as what is known as a contrast matrix. In this case, this function returns an estimable contrast corresponding to the dot(D, L.T)

This always produces a meaningful contrast, not always with the intended properties because q is always non-zero unless L is identically 0. That is, it produces a contrast that spans the column space of L (after projection onto the column space of D).

nipy.algorithms.statistics.formula.formulae.define(*args, **kwargs)
nipy.algorithms.statistics.formula.formulae.getparams(expression)

Return the parameters of an expression that are not Term instances but are instances of sympy.Symbol.

Examples

>>> x, y, z = [Term(l) for l in 'xyz']
>>> f = Formula([x,y,z])
>>> getparams(f)
[]
>>> f.mean
_b0*x + _b1*y + _b2*z
>>> getparams(f.mean)
[_b0, _b1, _b2]
>>> th = sympy.Symbol('theta')
>>> f.mean*sympy.exp(th)
(_b0*x + _b1*y + _b2*z)*exp(theta)
>>> getparams(f.mean*sympy.exp(th))
[_b0, _b1, _b2, theta]

nipy.algorithms.statistics.formula.formulae.getterms(expression)

Return the all instances of Term in an expression.

Examples

>>> x, y, z = [Term(l) for l in 'xyz']
>>> f = Formula([x,y,z])
>>> getterms(f)
[x, y, z]
>>> getterms(f.mean)
[x, y, z]

nipy.algorithms.statistics.formula.formulae.is_factor(obj)

Is obj a Factor?

nipy.algorithms.statistics.formula.formulae.is_factor_term(obj)

Is obj a FactorTerm?

nipy.algorithms.statistics.formula.formulae.is_formula(obj)

Is obj a Formula?

nipy.algorithms.statistics.formula.formulae.is_term(obj)

Is obj a Term?

nipy.algorithms.statistics.formula.formulae.make_dummy(*args, **kwds)

Make dummy variable of given name
Parameters: name : str name of dummy variable dum : Dummy instance

Notes

The interface to Dummy changed between 0.6.7 and 0.7.0, and we used this function to keep compatibility. Now we depend on sympy 0.7.0 and this function is obsolete.

nipy.algorithms.statistics.formula.formulae.make_recarray(rows, names, dtypes=None, drop_name_dim=<class nipy.utils._NoValue>)

Create recarray from rows with field names names

Create a recarray with named columns from a list or ndarray of rows and sequence of names for the columns. If rows is an ndarray, dtypes must be None, otherwise we raise a ValueError. Otherwise, if dtypes is None, we cast the data in all columns in rows as np.float. If dtypes is not None, the routine uses dtypes as a dtype specifier for the output structured array.

Parameters: rows: list or array Rows that will be turned into an recarray. names: sequence Sequence of strings - names for the columns. dtypes: None or sequence of str or sequence of np.dtype, optional Used to create a np.dtype, can be sequence of np.dtype or string. drop_name_dim : {_NoValue, False, True}, optional Flag for compatibility with future default behavior. Current default is False. If True, drops the length 1 dimension corresponding to the axis transformed into fields when converting into a recarray. If _NoValue specified, gives default. Default will change to True in the next version of Nipy. v : np.ndarray Structured array with field names given by names. ValueError dtypes not None when rows is array.

Examples

The following tests depend on machine byte order for their exact output.

>>> arr = np.array([[3, 4], [4, 6], [6, 8]])
>>> make_recarray(arr, ['x', 'y'],
...               drop_name_dim=True)
array([(3, 4), (4, 6), (6, 8)],
dtype=[('x', '<i8'), ('y', '<i8')])
>>> make_recarray(arr, ['x', 'y'],
...               drop_name_dim=False)
array([[(3, 4)],
[(4, 6)],
[(6, 8)]],
dtype=[('x', '<i8'), ('y', '<i8')])
>>> r = make_recarray(arr, ['w', 'u'], drop_name_dim=True)
>>> make_recarray(r, ['x', 'y'],
...               drop_name_dim=True)
array([(3, 4), (4, 6), (6, 8)],
dtype=[('x', '<i8'), ('y', '<i8')])
>>> make_recarray([[3, 4], [4, 6], [7, 9]], 'wv',
...               [np.float, np.int])
array([(3.0, 4), (4.0, 6), (7.0, 9)],
dtype=[('w', '<f8'), ('v', '<i8')])

nipy.algorithms.statistics.formula.formulae.natural_spline(t, knots=None, order=3, intercept=False)

Return a Formula containing a natural spline

Spline for a Term with specified knots and order.

Parameters: t : Term knots : None or sequence, optional Sequence of float. Default None (same as empty list) order : int, optional Order of the spline. Defaults to a cubic (==3) intercept : bool, optional If True, include a constant function in the natural spline. Default is False formula : Formula A Formula with (len(knots) + order) Terms (if intercept=False, otherwise includes one more Term), made up of the natural spline functions.

Examples

>>> x = Term('x')
>>> n = natural_spline(x, knots=[1,3,4], order=3)
>>> xval = np.array([3,5,7.]).view(np.dtype([('x', np.float)]))
>>> n.design(xval, return_float=True)
array([[   3.,    9.,   27.,    8.,    0.,   -0.],
[   5.,   25.,  125.,   64.,    8.,    1.],
[   7.,   49.,  343.,  216.,   64.,   27.]])
>>> d = n.design(xval)
>>> print(d.dtype.descr)
[('ns_1(x)', '<f8'), ('ns_2(x)', '<f8'), ('ns_3(x)', '<f8'), ('ns_4(x)', '<f8'), ('ns_5(x)', '<f8'), ('ns_6(x)', '<f8')]

nipy.algorithms.statistics.formula.formulae.terms(names, **kwargs)

Return list of terms with names given by names

This is just a convenience in defining a set of terms, and is the equivalent of sympy.symbols for defining symbols in sympy.

We enforce the sympy 0.7.0 behavior of returning symbol “abc” from input “abc”, rthan than 3 symbols “a”, “b”, “c”.

Parameters: names : str or sequence of str If a single str, can specify multiple Terms with string containing space or ‘,’ as separator. **kwargs : keyword arguments keyword arguments as for sympy.symbols ts : Term or tuple Term instance or list of Term instance objects named from names

Examples

>>> terms(('a', 'b', 'c'))
(a, b, c)
>>> terms('a, b, c')
(a, b, c)
>>> terms('abc')
abc