Step by step clustering
| Scaling syllable features
Scaling of syllable features is based on 'maximum likelihood estimates' and is the same as that used for similarity measurements at the levels of frames and intervals.

Example case: Syll_R109
Open the clustering module and open the table syll_R109. To prevent overwriting of existing clusters, uncheck 'write permit', then click 'Analyze'. The result should look like this:

Note that this representation is basically the same as a 2D DVD-map, with a default of duration for the X axis and mean FM for the Y axis. Not all features are used for clustering: by default, SAP2011 uses syllable duration and the mean values of pitch, Wiener entropy, FM and goodness of pitch. Feature units are also scaled to MADs in the display; as noted, proper scaling is essential for calculating meaningful Euclidean distances across features. For any one animal you might find biases, but overall, all clusters should lie in the neighborhood of 4 MADs and have a mean spread of 1-2 MADs (averaged across features).

The colors identify the clusters, but the initial color of a cluster is determined by its population (how many members it has). Therefore, the color does not yet identify a cluster in the long run, only at the current moment. Shortly, we will discuss the techniques of marking (identifying) a cluster for tracing. The legend on the left allows you to pick a cluster or to view some of its properties. For example, the most abundant cluster is painted red, and you can see next to the red legend entry that this cluster has 606 member syllables. Once you have identified a cluster and clicked the appropriate legend entry, you should give the cluster an ID. The ID must be an integer; type it in the edit box on top of the legend. Once you ID a cluster and start tracing it back, it will keep its original color (that is, the color is uncoupled from the abundance).
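
To see why scaling matters before computing Euclidean distances, here is a minimal sketch. It assumes that MAD means the median absolute deviation from the median and that the statistics are taken from the sample itself; SAP2011's own scaling constants are not reproduced here. The feature values and the names `mad_scale` and `euclidean` are made up for illustration.

```python
from math import sqrt
from statistics import median


def mad_scale(values):
    """Express one feature in MAD units: (x - median) / MAD (assumed definition)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the feature has no spread")
    return [(v - med) / mad for v in values]


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical syllables: duration in ms and mean pitch in Hz
durations = [50.0, 60.0, 70.0, 80.0, 90.0]
pitches = [500.0, 1000.0, 1500.0, 2000.0, 2500.0]

raw = list(zip(durations, pitches))
scaled = list(zip(mad_scale(durations), mad_scale(pitches)))

# In raw units the pitch difference (in Hz) swamps the duration difference (in ms);
# after scaling, both features contribute equally to the distance.
raw_dist = euclidean(raw[0], raw[1])
scaled_dist = euclidean(scaled[0], scaled[1])
print(raw_dist, scaled_dist)
```

In the raw units the distance between the first two syllables is almost entirely a pitch difference; in MAD units, duration and pitch weigh the same.
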
SAP2011 presents, in addition to the threshold, the actual Euclidean distance cutoff of the most distant pair of syllables included in the analysis. The 'data included' results show the number of syllable pairs that passed that threshold. The upper bound for 3000 syllables is 3000 x 3000 = 9 million pairs. The threshold reduces this number to about 50,000 syllable pairs. You can reduce it even further by moving the 'data included' slider to the left and observing the gray 'data included' display change as you go. Now click 'restart' and observe the consequence of this action on the cutoff. This technique allows you to quickly test the consequences of changing the cutoff without re-calculating Euclidean distances (which is the time-limiting step).

Note that the table of syllable pairs still has a lot of redundancy, with 50,000 pairs drawn from no more than 3000 different syllables (and often far fewer). In fact, looking below the legend will show you that only about 2500 different syllables passed the threshold. A syllable in a 'crowded area of feature space' will participate in many pairs, whereas in a sparse area, a syllable might have no neighbor close enough to join a pair.

Also, remember that SAP2011 only analyzes the 10 largest clusters. If you want to cluster more than 10 types, you can do so exhaustively as described later. You should be aware that filtering the table (removing clusters) is a non-linear operation with regard to clustering. That is, the results might change abruptly with filtering. In practice, this is more often a plus than a minus, since it can turn an unstable performance into a stable one.

Before we get into the tracing technique, let's explore the different displays that will help you judge how good the clustering is. Click on the 'all data' tab, then click the 'residuals' tab, and move back and forth between the cluster display and both of those displays.
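
The relation between the threshold, the number of pairs and the number of distinct syllables can be illustrated with a small sketch. This is not SAP2011's implementation: the points are made up, the function names `pairs_within` and `distinct_members` are hypothetical, and each unordered pair is counted once, whereas the 9 million upper bound above counts every ordered combination.

```python
from itertools import combinations
from math import dist


def pairs_within(points, threshold):
    """Return index pairs (i, j), i < j, whose Euclidean distance is below threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(points), 2)
        if dist(a, b) < threshold
    ]


def distinct_members(pairs):
    """Return the set of syllable indices that take part in at least one pair."""
    return {index for pair in pairs for index in pair}


# Hypothetical syllables in scaled (duration, FM) space:
# three lie in a crowded area, two are isolated.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 9.0)]

pairs = pairs_within(points, threshold=0.2)
members = distinct_members(pairs)
print(len(pairs), "pairs from", len(members), "distinct syllables")
```

The three crowded syllables each appear in two pairs, while the two isolated ones appear in none, which is the redundancy described above.
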
As you can see, most but not all of the syllables were clustered. A careful look at the outcome raises a few questions about the clustering performance.

First, why were the yellow and green clusters not joined into a single cluster? The answer becomes clear when looking at different projections of these clusters in feature space. Changing the Y axis to pitch shows that the two clusters overlap in their frequency modulation but not in their pitch.

Second, what sounds compose the 27% residuals? Looking at the residuals shows that some belong to 'sparse clouds' that have not been clustered. These 'clouds' are often long and short calls, which are less stereotyped and less abundant in our recording (the lower abundance of calls is, in fact, an artifact of our recording method, which is intended to preserve song data and eliminate isolated calls). Other residuals belong to the category of cage noise; these are often characterized by low pitch and a broad distribution of duration, as shown in the example above. Finally, some residuals are left-overs of clusters; these residuals can be reduced by using a more liberal threshold. Similarly, one can cluster a 'sparse cloud' of data by using a more liberal threshold.

You might ask how one should decide on the threshold value. The answer is that the ideal threshold value is 'cluster dependent'. If a cluster is very close to another cluster, a threshold that is too liberal will join them. The point is that you do not want to try to cluster all your data at once. Instead, the strategy we implemented is to cluster types one by one. This requires more work, but it gives you the freedom to set appropriate conditions that work well for each particular type of sound.

We will now start the process of tracing back syllables, but first, let us illustrate some of the problems involved in trace-back.
As noted earlier, the major issue is that the nature of the task changes as we step backwards in song development, since we expect clusters to eventually fall apart when we reach the beginning of song development. What we are trying to trace is, in fact, the structure of this 'falling apart' process (or the 'build up' process, when tracking forward). During later stages of song development, we will typically observe movement of clusters in feature space. This process is easy to trace since we have a complete recording of ontogeny, and since most features change slowly compared to the size of the time slices we try to bridge (3000 syllables typically span several minutes to a few hours). Even non-linear vocal changes, such as period doubling, will rarely cause problems, since other features of the syllable remain stable during such an event. During early stages of song development, we often see clusters merge, since in almost every bird, different syllable types emerge from a smaller number of prototype syllables in the process of 'syllable differentiation'. Detecting the point of transition is a challenging task.

Let's look at two clusters of bird 109, which are shown as fuchsia and green in the figure above. We noted that those clusters are close to each other. They have similar FM but different pitch, and there is also a slight duration difference between them. Move the 'Time control' slider 2/3 of the way to the left and click 'Analyze'. Note that since we are not back-tracing, SAP2011 will make no attempt to re-identify clusters, so the colors will change arbitrarily (according to the number of members in each cluster). Note that although we stepped back several weeks, the two images are similar: we can see that the blue cluster is still there, but is now colored yellow, and the red one has turned blue (and is somewhat larger). The problem is that the yellow and green clusters have merged and are both red now.
The question is whether this is a false merging or something that the bird did. Looking at the raw data (right panel) shows clearly that the clusters are indeed merged. This example demonstrates some of the difficulties you might encounter while back-tracing. Now let's try it.

Pull the 'Time control' slider to the end of song development and click 'Analyze'. We will start with the easiest cluster, say the blue one. Now we need to tell SAP2011 that this is the cluster we want to trace, and we need to give it a permanent name. This name will appear in the appropriate records of the database table as we do the procedure, unless you uncheck the 'write permit' check box (please do uncheck it!). Since this cluster appeared blue, we check the blue radio button in the legend, and then type the permanent cluster name in the box on top of the legend. The cluster name must be an integer, and we suggest starting with 1.

Now you should see that the track-back button (top) has become enabled. Click it. Note that the 'Time control' has not taken a step back yet: SAP2011 only identified this cluster in the current slice and (if writing is permitted) registered it in the table of bird 109, so that each occurrence of this cluster is marked as 1 (the default value for a cluster is 0). Now click track-back once more. Note that the 'Time control' has moved a tiny bit to the left.

Now click 'Repeat tracing back' and you will see that tracing back occurs automatically, step after step, until you click this button again, or until something bad happens…

Let's try to understand more formally what is happening here. SAP2011 performed clustering and you chose a cluster to trace; we will call it the reference cluster. When tracing back, SAP2011 performs a similar clustering on a slightly earlier time window. The algorithm then computes the centroid of each cluster (that is, the mean duration of syllables in the cluster, the mean of the mean pitch of syllables in the cluster, and so forth).
Then, the centroid of each of those new clusters is compared to the centroid of the reference cluster. The cluster whose centroid is most similar to that of the reference cluster is assumed to be an earlier version of that cluster, but only if it is similar enough to the reference. The default threshold for this comparison is 0.2 MADs (across all features chosen).

Tracing this cluster should work very well for several weeks of song development, but eventually it is doomed to fail. You will then need to define the cluster again, based on its location (change the radio choice in the legend to the appropriate color to re-activate the 'trace-back' button). Then keep tracing back. With some 'playing around' you should be able to trace it back until August 10 or so, which is 3 days after the onset of training.

Try to trace other clusters of bird 109. You will find some of them easy and others trickier. For example, the yellow cluster will cause frequent trouble by merging with the one below it. To solve such problems, your first line of defense is decreasing the Euclidean distance threshold, e.g. to 0.01. To check quickly whether reducing the threshold can solve a problem, click 'Analyze' and then click the 'data included' left arrow followed by 'restart'.

This approach also fails from time to time, but do not give up: reduce the threshold to 0.008 to regain hold, back-trace once, try 0.01 again, and then auto-trace until the next failure. By the time you approach the beginning of September, tracing this cluster becomes really difficult.

This experience might (and should) have raised some concerns about the objectivity of this clustering method. Indeed, one would like to be able to set the parameters once, rather than keep playing with them. The reality of cluster analysis, however, is that one often needs to adjust parameters. We suggest that you document your adjustments, and also that you repeat the procedure more than once in such difficult cases.
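
The centroid comparison used during trace-back can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the 0.2 MAD threshold is combined across features, so a plain Euclidean distance between centroids is assumed, and the names `centroid` and `match_cluster` and all data values are hypothetical.

```python
from math import dist


def centroid(cluster):
    """Mean of each feature over all syllables in the cluster."""
    n = len(cluster)
    n_features = len(cluster[0])
    return tuple(sum(s[k] for s in cluster) / n for k in range(n_features))


def match_cluster(reference, candidates, max_distance=0.2):
    """Return the key of the candidate cluster whose centroid is closest to
    the reference centroid, or None if even the closest one is farther
    than max_distance (in MAD units, assumed Euclidean)."""
    ref_centroid = centroid(reference)
    best_key, best_distance = None, float("inf")
    for key, cluster in candidates.items():
        d = dist(ref_centroid, centroid(cluster))
        if d < best_distance:
            best_key, best_distance = key, d
    return best_key if best_distance <= max_distance else None


# Hypothetical syllables as (duration, mean pitch) in MAD units
reference = [(1.0, 2.0), (1.2, 2.2)]
earlier_slice = {
    "red": [(1.0, 2.1), (1.2, 2.1)],   # centroid close to the reference
    "blue": [(3.0, 3.0), (3.2, 3.4)],  # a different syllable type
}

print(match_cluster(reference, earlier_slice))                     # closest and similar enough
print(match_cluster(reference, {"blue": earlier_slice["blue"]}))   # nothing similar enough
```

When no cluster in the earlier slice is similar enough, the function returns `None`, which corresponds to the point where tracing stops and the cluster has to be defined again by hand.
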
In this particular cluster, the problem is that the only feature that distinguishes it well is pitch; all other features of these two clusters overlap. Trying to distinguish between them can only work for some time, and as the pitch values approach each other, the task becomes impossible. Furthermore, you pay the price of a high percentage of residuals.

The solution is therefore to also cluster the two clusters together, and to consider the time when they are separated as two descendant clusters of a main branch (as in a dendrogram).

To do this, move the time control to the end of song development, return the threshold to 0.015 and, in the features included, uncheck 'mean pitch'. Check 'write permit' but uncheck 'overwrite clusters'. Now give the joined cluster a different name (say 2) and click 'Analyze'. You will see that the two clusters immediately join, since when pitch is not taken into account, the other features they have in common prevail. | |
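
The effect of unchecking 'mean pitch' can be illustrated with a toy calculation. The feature names, the values and the helper `distance` are hypothetical; only the 0.015 threshold is taken from the text, and the distance is assumed to be Euclidean over the included features.

```python
from math import sqrt


def distance(a, b, included):
    """Euclidean distance between two syllables over the included features only."""
    return sqrt(sum((a[f] - b[f]) ** 2 for f in included))


ALL_FEATURES = ("duration", "mean_pitch", "mean_fm", "mean_entropy", "mean_goodness")
WITHOUT_PITCH = tuple(f for f in ALL_FEATURES if f != "mean_pitch")

# Two hypothetical syllables (MAD units) that differ almost only in pitch
syllable_a = {"duration": 1.00, "mean_pitch": 2.0, "mean_fm": 0.5,
              "mean_entropy": -1.0, "mean_goodness": 0.3}
syllable_b = {"duration": 1.01, "mean_pitch": 3.0, "mean_fm": 0.5,
              "mean_entropy": -1.0, "mean_goodness": 0.3}

THRESHOLD = 0.015

with_pitch = distance(syllable_a, syllable_b, ALL_FEATURES)
without_pitch = distance(syllable_a, syllable_b, WITHOUT_PITCH)

print(with_pitch < THRESHOLD)     # pair is rejected when pitch is included
print(without_pitch < THRESHOLD)  # pair is accepted once pitch is excluded
```

With pitch included the pair is far above the threshold; once pitch is dropped, the remaining features place the two syllables close enough to fall into the same cluster.
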
