algorithms.clustering.hierarchical_clustering

Module: algorithms.clustering.hierarchical_clustering

Inheritance diagram of nipy.algorithms.clustering.hierarchical_clustering

These routines perform hierarchical agglomerative clustering of input data. The following alternatives are provided:

- distance-based average-link
- similarity-based average-link
- distance-based maximum-link
- Ward's algorithm under graph constraints
- Ward's algorithm without graph constraints

In this latest version, the results are returned in a 'WeightedForest' structure, which gives access to the clustering hierarchy and facilitates plotting of the result.

For backward compatibility, *_segment versions of the algorithms have been added, with the old API (except the qmax parameter, which now represents the desired number of clusters).

Author: Bertrand Thirion, Pamela Guevara, 2006-2009

Class

WeightedForest

class nipy.algorithms.clustering.hierarchical_clustering.WeightedForest(V, parents=None, height=None)

Bases: nipy.algorithms.graph.forest.Forest

This is a weighted Forest structure, i.e. a tree:

- each node has one parent and children (hierarchical structure)
- some of the nodes can be viewed as leaves, others as roots
- the edges within a tree are associated with a weight: +1 from child to parent, -1 from parent to child
- additionally, the nodes have a value, called 'height', which is especially useful for dendrograms

__init__(V, parents=None, height=None)
Parameters:

V: the number of vertices of the graph

parents=None: array of shape (V)

the parents of the nodes. By default, the parents are set to range(V), i.e. each node is its own parent, and each node is a tree

height=None: array of shape(V)

the height of the nodes
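
A minimal construction sketch (the node indices and heights below are illustrative, not taken from the library):

    import numpy as np
    from nipy.algorithms.clustering.hierarchical_clustering import WeightedForest

    # three nodes: 0 and 1 are leaves, node 2 is the root (its own parent)
    parents = np.array([2, 2, 2])
    height = np.array([0., 0., 1.])    # parent height >= child height
    forest = WeightedForest(3, parents, height)
    forest.check_compatible_height()   # checks height[parents[i]] >= height[i]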

check_compatible_height()

Check that height[parents[i]]>=height[i] for all nodes

get_height()

Get the height array

list_of_subtrees()

Returns the list of all non-trivial subtrees in the graph. Caveat: this function assumes that the vertices are sorted such that parent[i] > i for all i. Only the leaves are listed, not the subtrees themselves.

partition(threshold)

Partition the tree according to a cut criterion

plot(ax=None)

Plot the dendrogram associated with self. The rank of the data in the dendrogram is returned.

Parameters: ax: axis handle, optional
Returns: ax, the axis handle
plot_height()

Plot the height of the non-leaf nodes

set_height(height=None)

Set the height array

split(k)

Same as partition, but the number of components is supplied instead of a threshold
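
A sketch of how the cutting methods are typically used on a forest t returned by one of the clustering functions below (the threshold and the cluster count are illustrative, and the threshold is assumed to apply to node heights):

    # t is a WeightedForest, e.g. t = ward(G, feature)
    labels_by_height = t.partition(0.5)   # cut the tree at height 0.5
    labels_by_count = t.split(4)          # ask for 4 components instead
    ax = t.plot()                         # dendrogram plot (needs matplotlib)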

Functions

nipy.algorithms.clustering.hierarchical_clustering.average_link_graph(G)

Agglomerative function based on a (hopefully sparse) similarity graph

Parameters: G, the input graph
Returns: t, a WeightedForest structure that represents the dendrogram of the data

nipy.algorithms.clustering.hierarchical_clustering.average_link_graph_segment(G, stop=0, qmax=1, verbose=False)

Agglomerative function based on a (hopefully sparse) similarity graph

Parameters:

G: the input graph

stop: float

the stopping criterion

qmax: int, optional

the number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns:

u: array of shape (G.V)

a labelling of the graph vertices according to the criterion

cost: array of shape (G.V (?))

the cost of each merge step during the clustering procedure
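
A hedged sketch of similarity-based average-link segmentation. It assumes that knn from nipy.algorithms.graph builds a k-nearest-neighbour WeightedGraph and that get_weights/set_weights give access to its edge weights; neither is documented on this page.

    import numpy as np
    from nipy.algorithms.graph import knn
    from nipy.algorithms.clustering.hierarchical_clustering import \
        average_link_graph_segment

    X = np.random.randn(100, 3)               # 100 points in 3D
    G = knn(X, 5)                             # sparse neighbourhood graph
    # turn edge lengths into similarities (assumed API: get/set_weights)
    G.set_weights(np.exp(-G.get_weights() ** 2))
    u, cost = average_link_graph_segment(G, stop=0, qmax=10)
    print(np.unique(u))                       # cluster labels, one per vertex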

nipy.algorithms.clustering.hierarchical_clustering.fusion(K, pop, i, j, k)

Modifies the graph K to merge nodes i and j into node k

The similarity values are weight-averaged, where pop[i] and pop[j] yield the relative weights. This is used in average_link_slow (deprecated).

nipy.algorithms.clustering.hierarchical_clustering.ward(G, feature, verbose=False)

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters:

G : graph

the input graph (a topological graph essentially)

feature : array of shape (G.V,dim_feature)

vectorial information related to the graph vertices

verbose : bool, optional

If True, print diagnostic information

Returns:

t : WeightedForest instance

structure that represents the dendrogram

Notes

When G has more than one connected component, t is no longer a tree. This case is handled cleanly.
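
A minimal sketch, again assuming knn from nipy.algorithms.graph to build the topology-defining graph (any WeightedGraph over the same vertices would do):

    import numpy as np
    from nipy.algorithms.graph import knn
    from nipy.algorithms.clustering.hierarchical_clustering import ward

    X = np.random.randn(200, 2)     # feature matrix, shape (G.V, dim_feature)
    G = knn(X, 10)                  # topology-defining graph on the 200 points
    t = ward(G, X)                  # WeightedForest representing the dendrogram
    labels = t.split(3)             # cut the dendrogram into 3 clusters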

nipy.algorithms.clustering.hierarchical_clustering.ward_field_segment(F, stop=-1, qmax=-1, verbose=False)

Agglomerative function based on a field structure

Parameters:

F: the input field (graph + feature)

stop: float, optional

the stopping criterion. If stop == -1, then no stopping criterion is used

qmax: int, optional

the maximum number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns:

u: array of shape (F.V)

labelling of the graph vertices according to the criterion

cost: array of shape (F.V - 1)

the cost of each merge step during the clustering procedure

Notes

See ward_quick_segment for more information

Caveat : only approximate
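
A hedged sketch; it assumes that field_from_graph_and_data from nipy.algorithms.graph.field is available to bundle a graph and its vertex data into a Field (an assumption about that module, which is not documented here):

    import numpy as np
    from nipy.algorithms.graph import knn
    from nipy.algorithms.graph.field import field_from_graph_and_data
    from nipy.algorithms.clustering.hierarchical_clustering import \
        ward_field_segment

    X = np.random.randn(150, 3)
    F = field_from_graph_and_data(knn(X, 5), X)   # graph + feature = field
    u, cost = ward_field_segment(F, qmax=5)       # at most 5 clusters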

nipy.algorithms.clustering.hierarchical_clustering.ward_quick(G, feature, verbose=False)

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters:

G : graph instance

topology-defining graph

feature: array of shape (G.V,dim_feature)

some vectorial information related to the graph vertices

verbose : bool, optional

If True, print diagnostic information

Returns:

t: WeightedForest instance

that represents the dendrogram of the data

Notes

Hopefully a quicker version

A Euclidean distance is used in the feature space

Caveat : only approximate

nipy.algorithms.clustering.hierarchical_clustering.ward_quick_segment(G, feature, stop=-1, qmax=1, verbose=False)

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters:

G: nipy.algorithms.graph.WeightedGraph instance

the input graph (a topological graph essentially)

feature: array of shape (G.V, dim_feature)

vectorial information related to the graph vertices

stop : int or float, optional

the stopping criterion. If stop == -1, then no stopping criterion is used

qmax : int, optional

the maximum number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns:

u: array of shape (G.V)

labelling of the graph vertices according to the criterion

cost: array of shape (G.V - 1)

the cost of each merge step during the clustering procedure

Notes

Hopefully a quicker version

A Euclidean distance is used in the feature space

Caveat : only approximate
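
The quick variant is called the same way as ward_segment below; a hedged sketch, with the same knn assumption as in the earlier examples:

    import numpy as np
    from nipy.algorithms.graph import knn
    from nipy.algorithms.clustering.hierarchical_clustering import \
        ward_quick_segment

    X = np.random.randn(300, 4)
    G = knn(X, 6)                                         # topology graph
    u, cost = ward_quick_segment(G, feature=X, qmax=8)    # approximate Ward cut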

nipy.algorithms.clustering.hierarchical_clustering.ward_segment(G, feature, stop=-1, qmax=1, verbose=False)

Agglomerative function based on a topology-defining graph and a feature matrix.

Parameters:

G : graph object

the input graph (a topological graph essentially)

feature : array of shape (G.V,dim_feature)

some vectorial information related to the graph vertices

stop : int or float, optional

the stopping criterion. If stop == -1, then no stopping criterion is used

qmax : int, optional

the maximum number of desired clusters (in the limit of the stopping criterion)

verbose : bool, optional

If True, print diagnostic information

Returns:

u: array of shape (G.V)

a labelling of the graph vertices according to the criterion

cost: array of shape (G.V - 1)

the cost of each merge step during the clustering procedure

Notes

A Euclidean distance is used in the feature space

Caveat: when the number of connected components in G (nbcc) is greater than qmax, u contains nbcc values, not qmax!
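
A hedged end-to-end sketch under the same graph-building assumption as above; note the caveat about connected components:

    import numpy as np
    from nipy.algorithms.graph import knn
    from nipy.algorithms.clustering.hierarchical_clustering import ward_segment

    X = np.random.randn(250, 2)
    G = knn(X, 8)                            # topology-defining graph
    u, cost = ward_segment(G, X, qmax=4)     # Ward under graph constraints, at most 4 clusters
    # if G had more connected components than qmax, u would contain more labels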