Reasonable Inheritance of Cluster Identities in Repetitive Clustering

… or Inferring Identity from Observations

Let’s assume the following application:

A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose hikers are encouraged to report the location of the plant whenever they encounter it. Because those hikers use GPS technology ranging from cheap smartphones to high-end GPS devices, and because of weather and environmental circumstances, the measurements are of varying accuracy. The goal of the conservation organisation is to build up a map locating all found plants, with an ID assigned to each of them. Now every time a new location measurement is entered into the system, a clustering is applied to identify related measurements – i.e. those belonging to the same plant.

“I am he as you are he as you are me …

(… And we are all together” – I am the Walrus / Beatles) So far so good – but where it gets a bit tricky is when it comes to deciding how to deal with the IDs of clusters / plants when a newly introduced location estimate does not just humbly join an established cluster but causes trouble by messing up previously identified clusters / plants. Take the picture to the right. So far we had two plants with separate IDs – in the good case they stay separate and the new point is assigned to the red cluster. In the bad case the new point causes red and blue to merge and poses the question of whether the merged cluster is red or blue or something new entirely. Here we are dealing with a clear draw and very few points and clusters – but it is easy to come up with more ambiguous cases, e.g. the one described above. To make reasonable decisions for those cases, well-chosen – and, if possible, at least mathematically plausibilized – heuristics are needed.

“Who Cares?”

Fair question, as one might argue that an ID only serves the purpose of differentiation and that there is no need for maintaining a family tree of clusters. In the above use case this argument is indeed not easily denied. But a stable inheritance of IDs might simplify understanding the dynamics of how the clustering evolves – a large number of representatives might render a cluster and its represented entity “important”, and it would be weird to have no stable way to refer to it. And some other possible motivations come to mind. Maybe the organisation will send researchers to selected plants to perform an examination on them and henceforth intends to refer to those plants specifically.

“Take arms!”

So how might we approach this almost philosophical problem? I guess what is needed first is a handy way to represent the relations. And for that purpose something one might be inclined to call a “set theoretic contingency table” might make sense. Rows represent the clusters identified so far, columns represent the result of the newly performed clustering, and the values are the numbers of elements the respective clusters have in common. Take the illustration on the right-hand side for an example – the new clustering leading to a temporary cluster with ID 2 has 1 element in common with cluster C. Now choosing A for clustered set 3 is an obvious choice, but choosing B for 2 and C for 1 is not so evident – though probably still an obvious choice for a human being.

\text{E(x), E(i) are the elements contained in cluster x and clustered set i.}
(\text{Contingency Table})_{(x,i)} = \#(E(x) \cap E(i))
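Such a table is easy to compute when clusters are represented as sets of element IDs. Here is a minimal sketch in Python (the cluster IDs and element sets are made up to roughly match the illustration; `contingency_table` is a hypothetical helper, not code from the project):

```python
def contingency_table(old_clusters, new_clusters):
    """old_clusters / new_clusters: dicts mapping cluster ID -> set of element IDs.

    Returns a dict mapping (old ID, new ID) to the number of shared elements,
    i.e. #(E(x) ∩ E(i)) from the formula above.
    """
    return {
        (old_id, new_id): len(old_elems & new_elems)
        for old_id, old_elems in old_clusters.items()
        for new_id, new_elems in new_clusters.items()
    }

# Example roughly matching the illustration: the temporary cluster 2
# of the new run shares one element with the previously identified cluster C.
old = {"A": {1, 2, 3}, "B": {4, 5}, "C": {6}}
new = {1: {4}, 2: {5, 6}, 3: {1, 2, 3}}

table = contingency_table(old, new)
print(table[("C", 2)])  # -> 1
print(table[("A", 3)])  # -> 3
```

Keeping the table as a sparse dict of pairs (rather than a dense matrix) keeps the lookup symmetric between old and new IDs, which is convenient for the heuristics discussed below.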

Choosing a Label for a Mixed Set



Continuing with the above example: clustered set 2 contains elements of types B and C. In this case one might say: “The choice of B is most reasonable, as there are two Bs, one C and one unsettled element”. Fair enough – but what if we face a draw? Or what if we had two Bs and five more elements of different types, like C, D, E, F, G? That might seem odd, but in a space of high dimensionality this is, I guess, a possibility.

Or take the situation illustrated to the right. For set 1 the label is a clear choice. But with the above democratic labeling heuristic we would have to choose the same label for set 2, and this would lead to a conflict. :/
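The democratic heuristic itself is a simple majority vote over the old labels of a new cluster's elements. A sketch (the function name is illustrative, and returning `None` on a draw is just one of several possible choices – it is exactly the ambiguous case discussed above):

```python
from collections import Counter

def majority_label(element_labels):
    """element_labels: the previous label of each element in the
    new cluster, with None marking unsettled (previously unclustered) elements.

    Returns the majority label, or None if the cluster is entirely new
    or the vote ends in a draw.
    """
    counts = Counter(lbl for lbl in element_labels if lbl is not None)
    if not counts:
        return None  # entirely new cluster
    ranked = counts.most_common()
    # a draw between the top labels leaves the choice ambiguous
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

print(majority_label(["B", "B", "C", None]))  # -> "B"
print(majority_label(["B", "C"]))             # -> None (draw)
```

The draw case is where the heuristic breaks down – and, as the illustration shows, even without a draw two new clusters can end up claiming the same winning label.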

A Conservative Approach to Restore Peace

To make a long story short, a possible way to go might be to take a very conservative stance and expect a cluster to properly tend its flock if it would like to keep its label. That is, if a cluster loses or gains even a single element, its new label is chosen randomly. Whether this condition holds can be read off the contingency table – it is met if one and only one field in a row is non-zero and the corresponding column is non-zero exclusively for that field.

And now in action:

Much Ado about Something

Congratulations for making it to this point – you are now part of a small, distinguished circle! Write me a mail and I will organize a session for you, so you will receive the fierce-looking joyofdata tattoo on your forehead, which will grant you bargains in bio supermarkets all over the world and will facilitate meeting people at night clubs. Okay, seriously, I’d be interested in input!


2 thoughts on “Reasonable Inheritance of Cluster Identities in Repetitive Clustering”

  1. Hi Raffael,

    This problem is very interesting. However, your solution doesn’t seem to get a specific label for a mixed set (if I’m not misunderstanding). Here is my idea for your reference.

    1. Train a classifier with a small labeled dataset formed from the plant species data found so far (provided by hikers in a limited area).

    2. Use this learnt classifier to roughly predict the plant category label for the unlabeled dataset in the whole area. You can easily get the probability of each unlabeled instance belonging to each of the identified classes. For instance, if an instance x is classified by a binary decision tree classifier and its probabilities of belonging to the positive and the negative class are 0.8 and 0.2 respectively, then the classifier assigns its label as positive.

    3. For the bad cases, when the probability of belonging to each class is equal (for instance, 0.5 for each class), we could use the Dirichlet process to identify whether it belongs to one of the existing classes or to a new class. (I could send you my note about the Dirichlet process for more details.)

    4. Use the newly labeled information to update the classifier.

    In this case, you could simply train a classifier on the classes found so far, then update the classifier with new class information (if a new instance belongs to a new class) or with more data on existing classes to make more accurate predictions.

    • Hi Stephanie,

      thanks for your very interesting approach to the described problem :)

      3. For the bad cases, when the probability of belonging to each class is equal (for instance, 0.5 for each class), we could use the Dirichlet process to identify whether it belongs to one of the existing classes or to a new class.

      I am not certain though whether this algorithm addresses the problem of a 50/50-case belonging to both classes (A and B). Which then gives rise to the question of which class label is the one that should be preserved (A or B). “Mathematically” that does not make a difference, b/c it’s just a label which is just supposed to be unique. But if the labels are associated with external data sets or maps, then a human user will get confused if A contains many cases, B very few cases and the merged class is labelled B – that would be an extreme and rather trivial example (the rule “use the label of the larger class” would already resolve the question).

      But there are more tricky scenarios conceivable, where an ambiguous 50/50-case causes a merge of a class A with just a subset of the cases of B (see fig. “merge-split”).

      PS: the commenting tool here is so annoying …
