algorithms.clustering.hierarchical_clustering¶
Module: algorithms.clustering.hierarchical_clustering
¶
Inheritance diagram for nipy.algorithms.clustering.hierarchical_clustering
:
These routines perform some hierrachical agglomerative clustering of some input data. The following alternatives are proposed: - Distance based average-link - Similarity-based average-link - Distance based maximum-link - Ward’s algorithm under graph constraints - Ward’s algorithm without graph constraints
In this latest version, the results are returned in a ‘WeightedForest’ structure, which gives access to the clustering hierarchy, facilitates the plot of the result etc.
For back-compatibility, *_segment versions of the algorithms have been appended, with the old API (except the qmax parameter, which now represents the number of wanted clusters)
Author : Bertrand Thirion,Pamela Guevara, 2006-2009
Class¶
WeightedForest
¶
- class nipy.algorithms.clustering.hierarchical_clustering.WeightedForest(V, parents=None, height=None)¶
Bases:
Forest
This is a weighted Forest structure, i.e. a tree - each node has one parent and children (hierarchical structure) - some of the nodes can be viewed as leaves, other as roots - the edges within a tree are associated with a weight: +1 from child to parent -1 from parent to child - additionally, the nodes have a value, which is called ‘height’, especially useful from dendrograms
- __init__(V, parents=None, height=None)¶
- Parameters:
- V: the number of edges of the graph
- parents=None: array of shape (V)
the parents of the graph by default, the parents are set to range(V), i.e. each node is its own parent, and each node is a tree
- height=None: array of shape(V)
the height of the nodes
- adjacency()¶
returns the adjacency matrix of the graph as a sparse coo matrix
- Returns:
- adj: scipy.sparse matrix instance,
that encodes the adjacency matrix of self
- all_distances(seed=None)¶
returns all the distances of the graph as a tree
- Parameters:
- seed=None array of shape(nbseed) with valuesin [0..self.V-1]
set of vertices from which tehe distances are computed
- Returns:
- dg: array of shape(nseed, self.V), the resulting distances
Notes
By convention infinite distances are given the distance np.inf
- anti_symmeterize()¶
anti-symmeterize self, i.e. produces the graph whose adjacency matrix would be the antisymmetric part of its current adjacency matrix
- cc()¶
Compte the different connected components of the graph.
- Returns:
- label: array of shape(self.V), labelling of the vertices
- check()¶
Check that self is indeed a forest, i.e. contains no loop
- Returns:
- a boolean b=0 iff there are loops, 1 otherwise
Notes
Slow implementation, might be rewritten in C or cython
- check_compatible_height()¶
Check that height[parents[i]]>=height[i] for all nodes
- cliques()¶
Extraction of the graphe cliques these are defined using replicator dynamics equations
- Returns:
- cliques: array of shape (self.V), type (np.int_)
labelling of the vertices according to the clique they belong to
- compact_neighb()¶
returns a compact representation of self
- Returns:
- idx: array of of shape(self.V + 1):
the positions where to find the neighbors of each node within neighb and weights
- neighb: array of shape(self.E), concatenated list of neighbors
- weights: array of shape(self.E), concatenated list of weights
- compute_children()¶
Define the children of each node (stored in self.children)
- copy()¶
returns a copy of self
- cut_redundancies()¶
Returns a graph with redundant edges removed: ecah edge (ab) is present only once in the edge matrix: the correspondng weights are added.
- Returns:
- the resulting WeightedGraph
- define_graph_attributes()¶
define the edge and weights array
- degrees()¶
Returns the degree of the graph vertices.
- Returns:
- rdegree: (array, type=int, shape=(self.V,)), the right degrees
- ldegree: (array, type=int, shape=(self.V,)), the left degrees
- depth_from_leaves()¶
compute an index for each node: 0 for the leaves, 1 for their parents etc. and maximal for the roots.
- Returns:
- depth: array of shape (self.V): the depth values of the vertices
- dijkstra(seed=0)¶
Returns all the [graph] geodesic distances starting from seed x
- seed (int, >-1, <self.V) or array of shape(p)
edge(s) from which the distances are computed
- Returns:
- dg: array of shape (self.V),
the graph distance dg from ant vertex to the nearest seed
Notes
It is mandatory that the graph weights are non-negative
- floyd(seed=None)¶
Compute all the geodesic distances starting from seeds
- Parameters:
- seed= None: array of shape (nbseed), type np.int_
vertex indexes from which the distances are computed if seed==None, then every edge is a seed point
- Returns:
- dg array of shape (nbseed, self.V)
the graph distance dg from each seed to any vertex
Notes
It is mandatory that the graph weights are non-negative. The algorithm proceeds by repeating Dijkstra’s algo for each seed. Floyd’s algo is not used (O(self.V)^3 complexity…)
- from_3d_grid(xyz, k=18)¶
Sets the graph to be the topological neighbours graph of the three-dimensional coordinates set xyz, in the k-connectivity scheme
- Parameters:
- xyz: array of shape (self.V, 3) and type np.int_,
- k = 18: the number of neighbours considered. (6, 18 or 26)
- Returns:
- E(int): the number of edges of self
- get_E()¶
To get the number of edges in the graph
- get_V()¶
To get the number of vertices in the graph
- get_children(v=-1)¶
Get the children of a node/each node
- Parameters:
- v: int, optional
a node index
- Returns:
- children: list of int the list of children of node v (if v is provided)
a list of lists of int, the children of all nodes otherwise
- get_descendants(v, exclude_self=False)¶
returns the nodes that are children of v as a list
- Parameters:
- v: int, a node index
- Returns:
- desc: list of int, the list of all descendant of the input node
- get_edges()¶
To get the graph’s edges
- get_height()¶
Get the height array
- get_vertices()¶
To get the graph’s vertices (as id)
- get_weights()¶
- is_connected()¶
States whether self is connected or not
- isleaf()¶
Identification of the leaves of the forest
- Returns:
- leaves: bool array of shape(self.V), indicator of the forest’s leaves
- isroot()¶
Returns an indicator of nodes being roots
- Returns:
- roots, array of shape(self.V, bool), indicator of the forest’s roots
- kruskal()¶
Creates the Minimum Spanning Tree of self using Kruskal’s algo. efficient is self is sparse
- Returns:
- K, WeightedGraph instance: the resulting MST
Notes
If self contains several connected components, will have the same number k of connected components
- leaves_of_a_subtree(ids, custom=False)¶
tests whether the given nodes are the leaves of a certain subtree
- Parameters:
- ids: array of shape (n) that takes values in [0..self.V-1]
- custom == False, boolean
if custom==true the behavior of the function is more specific - the different connected components are considered as being in a same greater tree - when a node has more than two subbranches, any subset of these children is considered as a subtree
- left_incidence()¶
Return left incidence matrix
- Returns:
- left_incid: list
the left incidence matrix of self as a list of lists: i.e. the list[[e.0.0, .., e.0.i(0)], .., [e.V.0, E.V.i(V)]] where e.i.j is the set of edge indexes so that e.i.j[0] = i
- list_of_neighbors()¶
returns the set of neighbors of self as a list of arrays
- list_of_subtrees()¶
returns the list of all non-trivial subtrees in the graph Caveat: this function assumes that the vertices are sorted in a way such that parent[i]>i for all i Only the leaves are listeed, not the subtrees themselves
- main_cc()¶
Returns the indexes of the vertices within the main cc
- Returns:
- idx: array of shape (sizeof main cc)
- merge_simple_branches()¶
Return a subforest, where chained branches are collapsed
- Returns:
- sf, Forest instance, same as self, without any chain
- normalize(c=0)¶
Normalize the graph according to the index c Normalization means that the sum of the edges values that go into or out each vertex must sum to 1
- Parameters:
- c=0 in {0, 1, 2}, optional: index that designates the way
according to which D is normalized c == 0 => for each vertex a, sum{edge[e, 0]=a} D[e]=1 c == 1 => for each vertex b, sum{edge[e, 1]=b} D[e]=1 c == 2 => symmetric (‘l2’) normalization
Notes
Note that when sum_{edge[e, .] == a } D[e] = 0, nothing is performed
- partition(threshold)¶
Partition the tree according to a cut criterion
- plot(ax=None)¶
Plot the dendrogram associated with self the rank of the data in the dendogram is returned
- Parameters:
- ax: axis handle, optional
- Returns:
- ax, the axis handle
- plot_height()¶
Plot the height of the non-leaves nodes
- propagate_upward(label)¶
Propagation of a certain labelling from leaves to roots Assuming that label is a certain positive integer field this propagates these labels to the parents whenever the children nodes have coherent properties otherwise the parent value is unchanged
- Parameters:
- label: array of shape(self.V)
- Returns:
- label: array of shape(self.V)
- propagate_upward_and(prop)¶
propagates from leaves to roots some binary property of the nodes so that prop[parents] = logical_and(prop[children])
- Parameters:
- prop, array of shape(self.V), the input property
- Returns:
- prop, array of shape(self.V), the output property field
- remove_edges(valid)¶
Removes all the edges for which valid==0
- Parameters:
- valid(self.E,) array
- remove_trivial_edges()¶
Removes trivial edges, i.e. edges that are (vv)-like self.weights and self.E are corrected accordingly
- Returns:
- self.E (int): The number of edges
- reorder_from_leaves_to_roots()¶
reorder the tree so that the leaves come first then their parents and so on, and the roots are last.
- Returns:
- order: array of shape(self.V)
the order of the old vertices in the reordered graph
- right_incidence()¶
Return right incidence matrix
- Returns:
- right_incid: list
the right incidence matrix of self as a list of lists: i.e. the list[[e.0.0, .., e.0.i(0)], .., [e.V.0, E.V.i(V)]] where e.i.j is the set of edge indexes so that e.i.j[1] = i
- set_edges(edges)¶
Sets the graph’s edges
Preconditions:
edges has a correct size
edges take values in [1..V]
- set_euclidian(X)¶
Compute the weights of the graph as the distances between the corresponding rows of X, which represents an embedding of self
- Parameters:
- X array of shape (self.V, edim),
the coordinate matrix of the embedding
- set_gaussian(X, sigma=0)¶
Compute the weights of the graph as a gaussian function of the distance between the corresponding rows of X, which represents an embedding of self
- Parameters:
- X array of shape (self.V, dim)
the coordinate matrix of the embedding
- sigma=0, float: the parameter of the gaussian function
Notes
When sigma == 0, the following value is used:
sigma = sqrt(mean(||X[self.edges[:, 0], :]-X[self.edges[:, 1], :]||^2))
- set_height(height=None)¶
Set the height array
- set_weights(weights)¶
Set edge weights
- Parameters:
- weights: array
array shape(self.V): edges weights
- show(X=None, ax=None)¶
Plots the current graph in 2D
- Parameters:
- XNone or array of shape (self.V, 2)
a set of coordinates that can be used to embed the vertices in 2D. If X.shape[1]>2, a svd reduces X for display. By default, the graph is presented on a circle
- ax: None or int, optional
ax handle
- Returns:
- ax: axis handle
Notes
This should be used only for small graphs.
- split(k)¶
idem as partition, but a number of components are supplied instead
- subforest(valid)¶
Creates a subforest with the vertices for which valid > 0
- Parameters:
- valid: array of shape (self.V): indicator of the selected nodes
- Returns:
- subforest: a new forest instance, with a reduced set of nodes
Notes
The children of deleted vertices become their own parent
- subgraph(valid)¶
Creates a subgraph with the vertices for which valid>0 and with the corresponding set of edges
- Parameters:
- valid, array of shape (self.V): nonzero for vertices to be retained
- Returns:
- G, WeightedGraph instance, the desired subgraph of self
Notes
The vertices are renumbered as [1..p] where p = sum(valid>0) when sum(valid==0) then None is returned
- symmeterize()¶
Symmeterize self, modify edges and weights so that self.adjacency becomes the symmetric part of the current self.adjacency.
- to_coo_matrix()¶
Return adjacency matrix as coo sparse
- Returns:
- sp: scipy.sparse matrix instance
that encodes the adjacency matrix of self
- tree_depth()¶
Returns the number of hierarchical levels in the tree
- voronoi_diagram(seeds, samples)¶
Defines the graph as the Voronoi diagram (VD) that links the seeds. The VD is defined using the sample points.
- Parameters:
- seeds: array of shape (self.V, dim)
- samples: array of shape (nsamples, dim)
Notes
By default, the weights are a Gaussian function of the distance The implementation is not optimal
- voronoi_labelling(seed)¶
Performs a voronoi labelling of the graph
- Parameters:
- seed: array of shape (nseeds), type (np.int_),
vertices from which the cells are built
- Returns:
- labels: array of shape (self.V) the labelling of the vertices
Functions¶
- nipy.algorithms.clustering.hierarchical_clustering.average_link_graph(G)¶
Agglomerative function based on a (hopefully sparse) similarity graph
- Parameters:
- G the input graph
- Returns:
- t a weightForest structure that represents the dendrogram of the data
- nipy.algorithms.clustering.hierarchical_clustering.average_link_graph_segment(G, stop=0, qmax=1, verbose=False)¶
Agglomerative function based on a (hopefully sparse) similarity graph
- Parameters:
- G the input graph
- stop: float
the stopping criterion
- qmax: int, optional
the number of desired clusters (in the limit of the stopping criterion)
- verbosebool, optional
If True, print diagnostic information
- Returns:
- u: array of shape (G.V)
a labelling of the graph vertices according to the criterion
- cost: array of shape (G.V (?))
the cost of each merge step during the clustering procedure
- nipy.algorithms.clustering.hierarchical_clustering.fusion(K, pop, i, j, k)¶
Modifies the graph K to merge nodes i and j into nodes k
The similarity values are weighted averaged, where pop[i] and pop[j] yield the relative weights. this is used in average_link_slow (deprecated)
- nipy.algorithms.clustering.hierarchical_clustering.ward(G, feature, verbose=False)¶
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters:
- Ggraph
the input graph (a topological graph essentially)
- featurearray of shape (G.V,dim_feature)
vectorial information related to the graph vertices
- verbosebool, optional
If True, print diagnostic information
- Returns:
- t
WeightedForest
instance structure that represents the dendrogram
- t
Notes
When G has more than 1 connected component, t is no longer a tree. This case is handled cleanly now
- nipy.algorithms.clustering.hierarchical_clustering.ward_field_segment(F, stop=-1, qmax=-1, verbose=False)¶
Agglomerative function based on a field structure
- Parameters:
- F the input field (graph+feature)
- stop: float, optional
the stopping crterion. if stop==-1, then no stopping criterion is used
- qmax: int, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
- verbosebool, optional
If True, print diagnostic information
- Returns:
- u: array of shape (F.V)
labelling of the graph vertices according to the criterion
- cost array of shape (F.V - 1)
the cost of each merge step during the clustering procedure
Notes
See ward_quick_segment for more information
Caveat : only approximate
- nipy.algorithms.clustering.hierarchical_clustering.ward_quick(G, feature, verbose=False)¶
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters:
- Ggraph instance
topology-defining graph
- feature: array of shape (G.V,dim_feature)
some vectorial information related to the graph vertices
- verbosebool, optional
If True, print diagnostic information
- Returns:
- t: weightForest instance,
that represents the dendrogram of the data
- Notes
- Hopefully a quicker version
- A euclidean distance is used in the feature space
- Caveatonly approximate
- nipy.algorithms.clustering.hierarchical_clustering.ward_quick_segment(G, feature, stop=-1, qmax=1, verbose=False)¶
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters:
- G: labs.graph.WeightedGraph instance
the input graph (a topological graph essentially)
- feature array of shape (G.V,dim_feature)
vectorial information related to the graph vertices
- stop1int or float, optional
the stopping crterion if stop==-1, then no stopping criterion is used
- qmaxint, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
- verbosebool, optional
If True, print diagnostic information
- Returns:
- u: array of shape (G.V)
labelling of the graph vertices according to the criterion
- cost: array of shape (G.V - 1)
the cost of each merge step during the clustering procedure
Notes
Hopefully a quicker version
A euclidean distance is used in the feature space
Caveat : only approximate
- nipy.algorithms.clustering.hierarchical_clustering.ward_segment(G, feature, stop=-1, qmax=1, verbose=False)¶
Agglomerative function based on a topology-defining graph and a feature matrix.
- Parameters:
- Ggraph object
the input graph (a topological graph essentially)
- featurearray of shape (G.V,dim_feature)
some vectorial information related to the graph vertices
- stopint or float, optional
the stopping crterion. if stop==-1, then no stopping criterion is used
- qmaxint, optional
the maximum number of desired clusters (in the limit of the stopping criterion)
- verbosebool, optional
If True, print diagnostic information
- Returns:
- u: array of shape (G.V):
a labelling of the graph vertices according to the criterion
- cost: array of shape (G.V - 1)
the cost of each merge step during the clustering procedure
Notes
A euclidean distance is used in the feature space
Caveat : when the number of cc in G (nbcc) is greter than qmax, u contains nbcc values, not qmax !