Step by step clustering



Scaling syllable features

The scaling of syllable features, based on 'maximum likelihood estimates', is the same as that used for similarity measurements at the levels of frames and intervals.
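As a rough illustration of what expressing features in MAD units means, the Python sketch below centers each feature on its median and divides by its median absolute deviation. This is only an assumption about the form of the scaling (the function name mad_scale and the sample values are hypothetical); it is not SAP2011's actual implementation.

```python
import numpy as np

def mad_scale(features):
    """Scale each feature column to median absolute deviation (MAD) units.

    Assumption: each feature is centered on its median and divided by its MAD.
    """
    features = np.asarray(features, dtype=float)
    median = np.median(features, axis=0)
    mad = np.median(np.abs(features - median), axis=0)
    mad[mad == 0] = 1.0  # avoid division by zero for constant features
    return (features - median) / mad

# Hypothetical syllables: columns are duration (ms) and mean pitch (Hz).
raw = np.array([[50.0, 600.0],
                [60.0, 800.0],
                [70.0, 1000.0],
                [80.0, 1200.0],
                [90.0, 3000.0]])
scaled = mad_scale(raw)
print(scaled)
```

After scaling, duration and pitch are on comparable scales, so neither feature dominates a Euclidean distance simply because of its units.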

Example case: Syll_R109

Open the clustering module and open the table syll_R109. To prevent overwriting of existing clusters, uncheck 'write permit', then click 'Analyze'. The result should look like this:
Note that this representation is basically the same as a 2D DVD-map, with duration as the default X axis and mean FM as the default Y axis. Not all features are used for clustering: by default, SAP2011 uses syllable duration and the mean values of pitch, Wiener entropy, FM and goodness of pitch. Feature units are scaled to MAD in the display as well; as noted, proper scaling is essential for calculating meaningful Euclidean distances across features. For any one animal you might find biases, but overall, all clusters should lie in the neighborhood of 4 MADs and have a mean spread of 1-2 MADs (averaged across features).

The colors identify the clusters, but the initial color is determined by the population of the cluster (how many members it has). Therefore, a color does not yet identify a cluster in the long run, only at the current moment. Shortly, we will discuss how to mark (identify) a cluster for tracing.

The legend on the left allows you to pick a cluster or to view some of its properties. For example, the most abundant cluster is painted red, and you can see next to the red legend entry that this cluster has 606 member syllables. Once you have identified a cluster and clicked the appropriate legend entry, you should give the cluster an ID. The ID must be an integer; type it in the edit box on top of the legend. Once you ID a cluster and start tracing it back, it will keep its original color (that is, color is then uncoupled from abundance).

In addition to the threshold, SAP2011 presents the actual Euclidean distance cutoff, that is, the distance of the most distant pair of syllables included in the analysis. The 'data included' results show the number of syllable pairs that passed that threshold. The upper bound for 3000 syllables is 3000 x 3000 = 9 million pairs. The threshold reduces this number to about 50,000 pairs. You can reduce it even further by moving the 'data included' slider to the left; observe the gray 'data included' display changing as you go. Now click 'restart' and observe the consequence of this action on the cutoff. This technique allows you to quickly test the consequences of changing the cutoff without re-calculating Euclidean distances (which is the time-limiting step).

Note that the table of syllable pairs still contains a lot of redundancy: the 50,000 pairs are drawn from no more than 3000 different syllables (and often far fewer). In fact, looking below the legend will show you that only about 2500 different syllables passed the threshold. A syllable in a 'crowded area of feature space' will participate in many pairs, whereas in a sparse area, a syllable might have no neighbor close enough to form a pair.

Also, remember that SAP2011 only analyzes the 10 largest clusters. If you want to cluster more than 10 types, you can do so exhaustively as described later. You should be aware that filtering the table (removing clusters) is a non-linear operation with regard to clustering. That is, the results might change abruptly with filtering. In practice, this is more often a plus than a minus, since it can turn an unstable performance into a stable one.
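The relation between the threshold, the number of pairs that pass it, and the number of distinct syllables in those pairs can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not SAP2011's code: the function name pairs_within_threshold and the sample points are hypothetical, features are assumed to be already scaled, and the sketch counts each unordered pair once rather than the N x N upper bound quoted above.

```python
import numpy as np

def pairs_within_threshold(scaled, threshold):
    """Return unordered index pairs (i < j) closer than threshold, and their distances."""
    scaled = np.asarray(scaled, dtype=float)
    # Full pairwise Euclidean distance matrix (the expensive step).
    diff = scaled[:, None, :] - scaled[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(scaled), k=1)  # each pair once, no self-pairs
    keep = dist[i, j] < threshold
    return np.column_stack((i[keep], j[keep])), dist[i, j][keep]

# Hypothetical scaled features: three syllables in a crowded area, one isolated.
points = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [0.2, 0.0],
                   [5.0, 5.0]])
pairs, kept = pairs_within_threshold(points, threshold=0.15)
distinct = np.unique(pairs)  # syllables that take part in at least one pair

print("pairs that passed:", len(pairs))
print("distinct syllables:", len(distinct))
print("distance of most distant included pair:", kept.max())
```

Because the distances are computed once, tightening the cutoff afterwards only requires re-filtering the stored distances, which is why adjusting 'data included' is fast compared with a full re-analysis.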

Before we get into the tracing technique, let's explore the different displays that will help you judge how good the clustering is. Click on the 'all data' tab, then click the 'residuals' tab, and move back and forth between the cluster display and these two displays. As you can see, most but not all of the syllables were clustered.

A careful look at the outcome raises a few questions about the clustering performance.

First, why were the yellow and green clusters not joined into a single cluster? The answer becomes clear when looking at different projections of these clusters in feature space. Changing the Y axis to pitch shows that the two clusters overlap in their frequency modulation but not in their pitch.

Second, what sounds make up the 27% residuals? Looking at the residuals shows that some belong to 'sparse clouds' that have not been clustered. These 'clouds' are often long and short calls, which are less stereotyped and less abundant in our recording (the lower abundance of calls is, in fact, an artifact of our recording method, which is intended to preserve song data and eliminate isolated calls). Other residuals belong to the category of cage noise; these are often characterized by low pitch and a broad distribution of duration, as shown in the example above. Finally, some residuals are left-overs of clusters; these can be reduced by using a more liberal threshold. Similarly, one can cluster a 'sparse cloud' of data by using a more liberal threshold.

You might ask: how does one decide what the threshold value should be? The answer is that the ideal threshold value is 'cluster dependent'. If a cluster is very close to another cluster, too liberal a threshold will join them. The point is that you do not want to try to cluster all your data at once. Instead, the strategy we implemented is to cluster types one by one. This requires more work, but it gives you the freedom to set appropriate conditions that work well for the particular type of sound.

We will now start the process of tracing back syllables, but first, let us illustrate some of the problems involved in trace-back. As noted earlier, the major issue is that the nature of the task changes as we step backwards in song development, since we expect clusters to eventually fall apart when we reach the beginning of song development. What we are trying to trace is, in fact, the structure of this 'falling apart' process (actually a 'build up' process when tracking forward). During later stages of song development, we will typically observe movement of clusters in feature space. This process is easy to trace since we have a complete recording of ontogeny, and since most features change slowly compared to the size of the time slices we try to bridge across (typically, 3000 syllables occur on time scales of several minutes to a few hours). Even non-linear vocal changes, such as period doubling, will rarely cause problems, since other features of the syllable remain stable during such an event. During early stages of song development, we often see clusters merge, since in almost every bird, different syllable types emerge from a smaller number of prototype syllables in the process of 'syllable differentiation'. Detecting the point of transition is a challenging task.

Let's look at two clusters of bird 109, which are shown as fuchsia and green in the figure above. We noted that those clusters are close to each other. They have similar FM but different pitch, and there is also a slight duration difference between them. Move the 'Time control' slider 2/3 of the way to the left and click 'Analyze'. Note that since we are not back-tracing, SAP2011 will make no attempt to re-identify clusters, so the colors will change arbitrarily (according to the number of members in each cluster).

Note that although we stepped back several weeks, the two images are similar. We can see that the blue cluster is still there, but is now colored yellow, and the red one has turned blue (and is somewhat larger). The problem is that the yellow and green clusters have merged and are both red now. The question is: is this a false merging, or something that the bird did? Looking at the raw data (right panel) shows clearly that the clusters have indeed merged. This example demonstrates some of the difficulties you might encounter while back-tracing. Now let's try it.

Pull the Time control slider to the end of song development and click 'Analyze'. We will start with the easiest cluster - say the blue one. Now we need to tell SAP2011 that this is the cluster we want to trace, and we need to give it a permanent name. This name will appear in the appropriate records of the database table as we do the procedure, unless you uncheck the 'write permit' check box (please do uncheck it!). Since this cluster appeared blue we check the blue radio-button in the legend, and then on top of the legend we type the permanent cluster name. The cluster name must be an integer number, and we suggest starting with 1. Now you should see that the track-back button (top) became enabled. Click it.

Note that the 'Time control' did not take a step back yet - it only identified this cluster in the current slice and (if write permitted), registered it in the table of bird 109 so that each occurrence of this cluster is marked as 1 (the default value for a cluster is 0). Now click track-back once more. Note that the Time control has moved a tiny bit to the left.

Now click 'Repeat tracing back' and you will see that tracing back occurs automatically, step after step, until you click this button again - or - until something bad happens…

Let's try to understand more formally what is happening here. SAP2011 performed the clustering and you chose a cluster to trace; we will call it the reference cluster. When tracing back, SAP2011 performs a similar clustering on a slightly earlier time window. The algorithm then computes the centroid of each cluster (that is, the mean duration of syllables in the cluster, the mean of their mean pitch, and so forth). Then, the centroid of each of those new clusters is compared to the centroid of the reference cluster. The cluster whose centroid is most similar to that of the reference cluster is assumed to be an earlier version of that cluster, but only if it is similar enough to the reference. The default threshold for this comparison is 0.2 MADs (across all features chosen).
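The centroid comparison described above can be sketched in Python as follows. This is a minimal illustration, not SAP2011's implementation: the function name match_reference and the sample data are hypothetical, and the sketch assumes that 'similar enough' means a Euclidean distance between centroids (in MAD units) of at most 0.2, which is only one possible reading of the default threshold.

```python
import numpy as np

def match_reference(reference_centroid, scaled, labels, max_distance=0.2):
    """Find the cluster in an earlier time window that best matches the reference.

    Returns (label, distance); label is None if no centroid is close enough.
    Assumption: similarity is the Euclidean distance between centroids.
    """
    reference_centroid = np.asarray(reference_centroid, dtype=float)
    scaled = np.asarray(scaled, dtype=float)
    labels = np.asarray(labels)
    best_label, best_dist = None, np.inf
    for label in np.unique(labels):
        centroid = scaled[labels == label].mean(axis=0)  # mean of each feature
        d = float(np.linalg.norm(centroid - reference_centroid))
        if d < best_dist:
            best_label, best_dist = int(label), d
    if best_dist <= max_distance:
        return best_label, best_dist
    return None, best_dist  # tracing fails: nothing is similar enough

# Hypothetical earlier time window with two clusters (features in MAD units).
window = np.array([[1.0, 1.0], [1.2, 1.0], [3.0, 3.0], [3.2, 3.0]])
window_labels = np.array([1, 1, 2, 2])

match, dist = match_reference([1.0, 1.0], window, window_labels)
lost, lost_dist = match_reference([2.0, 2.0], window, window_labels)
print(match, dist)      # the reference is close to the centroid of cluster 1
print(lost, lost_dist)  # no centroid within 0.2, so no match is returned
```

When no cluster is returned, the situation corresponds to the point where automatic tracing stops and the cluster has to be re-identified by hand.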

Tracing this cluster should work very well for several weeks of song development, but eventually, it is doomed to fail.
When it fails, you will need to define the cluster again, based on its location (change the radio-choice in the legend to the appropriate color to re-activate the 'trace-back' button). Then keep tracing back. With some 'playing around' you should be able to trace it back until August 10 or so, which is 3 days after the onset of training.

Try to trace other clusters of bird 109. You will find some of them easy and others trickier. For example, this yellow cluster will frequently cause trouble by merging with the one below it:

To solve such problems, your first line of defense is decreasing the Euclidean distance threshold, e.g. to 0.01.
To check quickly whether reducing the threshold can solve a problem, click 'Analyze' and then click the 'data included' left-arrow followed by 'restart'.
This approach also fails from time to time, but do not give up: reduce the threshold to 0.008 to regain hold, back-trace once, try 0.01 again, and then auto-trace until the next failure. By the time you approach the beginning of September, tracing this cluster becomes really difficult.

This experience might (and should) have raised some concerns about the objectivity of this clustering method. Indeed, one would like to be able to set the parameters once, rather than keep playing with them. The reality of cluster analysis, however, is that one often needs to adjust parameters. We suggest that you document your adjustments and, in such difficult cases, try the procedure more than once. In this particular cluster, the problem is that the only good distinguishing feature is pitch; all other features of these two clusters overlap. Trying to distinguish between them can only work for some time, and as the pitch values approach each other, the mission becomes impossible. Furthermore, you pay the price of a high percentage of residuals.

The solution is therefore to also cluster the two clusters together, and to treat them, during the time when they are separated, as two descendant clusters of a main branch (as in a dendrogram).

To do this, move the Time control to the end of song development, return the threshold to 0.015 and, in the features included, uncheck 'mean pitch'. Check 'write permit' but uncheck 'overwrite clusters'. Now give the joined cluster a different name (say 2) and click 'Analyze'. You will see that the two clusters immediately join, since when pitch is not taken into account, the other features they have in common prevail.
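To see why excluding a feature joins the clusters, the Python sketch below compares the distance between two cluster centroids with and without mean pitch. All names and values here are hypothetical (two made-up centroids, in MAD units, that differ only in pitch), and the sketch only illustrates the geometry; it does not reproduce SAP2011's clustering.

```python
import numpy as np

# Hypothetical feature order, following the default features listed earlier.
FEATURES = ["duration", "mean_pitch", "mean_entropy", "mean_fm", "mean_goodness"]

def centroid_distance(a, b, included):
    """Euclidean distance between two centroids using only the included features."""
    mask = np.array([name in included for name in FEATURES])
    delta = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.linalg.norm(delta[mask]))

# Two made-up centroids that differ only in mean pitch.
cluster_a = [1.0, 2.0, 0.5, 1.5, 1.0]
cluster_b = [1.0, 3.5, 0.5, 1.5, 1.0]

with_pitch = centroid_distance(cluster_a, cluster_b, FEATURES)
without_pitch = centroid_distance(
    cluster_a, cluster_b, [f for f in FEATURES if f != "mean_pitch"]
)
print(with_pitch, without_pitch)
```

With pitch included the two centroids are well separated; with pitch excluded they coincide, so any distance threshold will treat their members as neighbors.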