Introduction to Clustering

 

Background & computational approach



In the previous chapter (DVD-maps) we saw an example of song development in which the distribution of syllable features is broad during early stages and then becomes clustered. Most clusters (or types) appear to persist, and one can often keep track of each cluster throughout the rest of the movie. This cluster analysis module performs the (difficult) task of automatically classifying syllables into types and tracking each type over time. We use the terms 'type' and 'cluster' interchangeably, but keep in mind that a cluster is not a generic type; it is a transient entity defined only in one animal, during a particular time. More often, the term 'syllable type' refers to a species-specific sound, sometimes in relation to perceptual categories or other behaviors. The narrow definition of type has the advantages, however, of being simple, robust and 'analytic'.

At best, the clustering method is only as good as the segmentation method. We have no doubt that both the cluster analysis and the segmentation methods presented here are sub-optimal. Aylin Cimenser, a physics PhD student at Columbia University, is now in the process of optimizing both methods, and we believe that it should be possible to properly segment and cluster vague sounds that the current technology cannot handle.

The cluster analysis methods used here were implemented by Aylin Cimenser at Columbia University under the supervision of Partha P Mitra and Tim Halperin Hilly. The algorithm is based on the Nearest Neighbor Hierarchical Clustering (NNC) method. This is a non-parametric, density-based method (in contrast to Gaussian mixture models and K-means, which are parametric). It works nicely when clusters are reasonably segregated, regardless of their shape, and the back-tracing method is fully automated and self-checking. We cannot claim, however, that this approach always works. Occasionally, clusters that can be identified visually are not identified automatically. Users should keep in mind that a perfect solution to the clustering problem should not be expected. It is no coincidence that there are many alternative cluster analysis methods on the 'market': there is no generic solution to this problem. Namely, one cannot cluster data without making some strong assumptions about either the shape or the density of clusters, and those assumptions are sometimes false. For example, NNC does not handle well cases where dense clusters lie too close to sparse clusters.

More significantly, in song development, clustering has to be performed in recursive steps, starting from the end of song development, where identifying clusters is very easy, and concluding at an early stage of song development, when clustering becomes increasingly difficult and finally impossible. The difference between approaches is not whether the procedure will fail, but when.

NNC is a very simple algorithm and we encourage you to understand it (a code sketch follows the steps below):

1. Obtain a sample of a few thousand (say 3,000) syllables produced at a given time and compute features.
2. For each feature (e.g., syllable duration), calculate the distance between all possible (3,000 x 3,000) pairs of syllables. Repeat for all features, scale them to comparable units, and add the contributions to obtain a Euclidean distance for each pair. This step involves about 50 million multiplications (but takes only about 1 s).
3. Sort the syllable pairs by Euclidean distance in ascending order and discard pairs with a distance greater than a certain threshold (say 0.01 MAD).
4. Now the best of best friends, the syllable pair with the shortest distance (the nearest-neighbor pair), establishes the first cluster.
5. Examine the next pair of nearest neighbors and consider the following scenarios:
- Case A. If both syllables are new (not yet clustered), define another cluster with this pair.
- Case B. If one of the syllables is already a member of a cluster (remember that any one syllable can appear in many pairs!) and its partner is not a member of any cluster, add the partner to that same cluster (that is, a friend of my friend is my friend!).
- Case C. If one syllable belongs to one cluster and the second one to a different cluster, merge the two clusters.
- Case D. If both syllables belong to the same cluster, do nothing.
6. Repeat step 5 for all the remaining pairs (those with distances below the threshold), in order of increasing distance.
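
To make the procedure concrete, here is a minimal sketch of the six steps above in Python. It is not SAP2011's implementation: the MAD scaling, the 0.01 default threshold, and the use of NumPy/SciPy are assumptions made for illustration only.

```python
# Minimal NNC sketch (steps 1-6 above). Not SAP2011's code: the MAD scaling,
# the threshold default, and all names here are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist

def nnc_cluster(features, threshold=0.01):
    """features: (n_syllables, n_features) array; threshold in MAD units."""
    n = features.shape[0]

    # Scale each feature by its median absolute deviation (MAD) so that
    # different units (ms, Hz, ...) become comparable before adding them.
    med = np.median(features, axis=0)
    mad = np.median(np.abs(features - med), axis=0)
    scaled = (features - med) / mad

    # Step 2: Euclidean distance between every pair of syllables.
    d = pdist(scaled)                       # condensed distance vector
    i, j = np.triu_indices(n, k=1)          # pair indices in the same order

    # Step 3: keep only pairs closer than the threshold, sorted by distance.
    keep = d <= threshold
    order = np.argsort(d[keep])
    pairs = zip(i[keep][order], j[keep][order])

    # Steps 4-6: walk through the pairs, applying cases A-D.
    labels = np.full(n, -1)                 # -1 = residual (not clustered)
    next_label = 0
    for a, b in pairs:
        la, lb = labels[a], labels[b]
        if la == -1 and lb == -1:           # Case A: start a new cluster
            labels[a] = labels[b] = next_label
            next_label += 1
        elif lb == -1:                      # Case B: b joins a's cluster
            labels[b] = la
        elif la == -1:                      # Case B: a joins b's cluster
            labels[a] = lb
        elif la != lb:                      # Case C: merge the two clusters
            labels[labels == lb] = la
        # Case D (la == lb): already in the same cluster, do nothing
    return labels
```

On a few thousand syllables this runs in a second or two; relabeling with `labels[labels == lb] = la` is the simplest (not the fastest) way to merge clusters in case C, and a union-find structure would scale better.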

In many cases you will find that the default threshold suffices to properly cluster the data, but be aware that the choice of threshold is critical: setting it too high will join all the data into a single cluster. The threshold is determined empirically, and if it is not optimal you will experience one of two problems. If the threshold is too conservative, you will see too many clusters and many residual (unclustered) syllables; if it is too liberal, you will see 'false joining' of distinct clusters. In some cases (the most difficult ones) you will see both false joining and many residuals. To address these cases, you can grant veto power to certain features. The meaning of veto is: 'do not merge two clusters if the difference between them is too big with respect to a specific feature'. Even if the pair of syllables (in case C) is really tight, SAP2011 will not join their clusters if the centers of those clusters are too far apart with respect to that feature. For example, you can tell SAP2011 to refuse to merge two clusters if the duration difference between them is more than 0.5 MAD (which is about 36 ms in the zebra finch). If this 'distance gap' remains empty of other clusters, those two clusters will never merge.
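
As an illustration of how such a veto could work, the sketch below adds a per-feature check before the case C merge in the `nnc_cluster` sketch above. The function name, the use of cluster means as centers, and the feature index are assumptions; only the idea of refusing a merge across a large single-feature gap (e.g., 0.5 MAD of duration) comes from the text.

```python
# Hypothetical veto check, called before the case C merge in the nnc_cluster
# sketch above. 'scaled' is already in MAD units, so a difference of 0.5
# corresponds to 0.5 MAD (about 36 ms for zebra finch syllable duration).
def can_merge(labels, scaled, la, lb, veto_feature, veto_mad=0.5):
    center_a = scaled[labels == la, veto_feature].mean()
    center_b = scaled[labels == lb, veto_feature].mean()
    return abs(center_a - center_b) <= veto_mad

# Case C in the loop then becomes (0 = assumed column index of duration):
#     elif la != lb and can_merge(labels, scaled, la, lb, veto_feature=0):
#         labels[labels == lb] = la
```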