
The similarity score



In asymmetric comparisons, the similarity score has two major components: the percentage of similarity, which is computed over intervals of sound, and the accuracy, which is the local, fine-grained similarity. The %similarity measure, being computed across relatively long intervals (typically including 50 FFT windows or so), is designed to capture patterns in feature values. Once similar patterns are detected, one may look at the details. This is like first scanning pictures for similar faces (as in %similarity) and then comparing the similar faces pixel by pixel (as in accuracy).

Scoring similarity between songs on the scale of a single window is hopeless, as is comparing pictures one pixel at a time. The solution is to compare intervals consisting of several windows. If such intervals are sufficiently long, they will contain enough information to identify a unique song segment. Yet, if the intervals are too long, similarities that are real at a smaller interval size may be rejected, and that would reduce the power of the analysis. We found empirically that comparisons using 50-70 ms intervals, centered on each 10-ms time window, were satisfactory. Perhaps not surprisingly, the duration of these song intervals is on the order of magnitude of a typical song note. Our final score of similarity combined the two scales: the ‘large scale’ (usually 50-70 ms) is used for reducing ambiguity with a measure we call %similarity, while the ‘small scale’ (usually 5-10 ms) is used to obtain a fine-grained quantification of similarity, which we call accuracy.


For each pair of time windows labelled as ‘similar’ for two songs being compared, SAP2011 calculated the probability that the goodness of the match would have occurred by chance as described above. We are left, then, with a series of P values, and the lower the P, the higher the similarity. For convenience we transform these P values to 1-P; therefore, a 99% similarity between a pair of windows means that the probability that the goodness of the match would have occurred by chance is less than 1%. Note that 99% similarity does not mean that the features in the two songs being compared are 99% similar to each other. In practice, however, because of how our thresholds were set, songs or sections of songs that get a score of 99% similarity tend, in fact, to be very similar.

The SAP2011 procedure requires that there be a unique relation between a time window in the model and a time window in the pupil. Yet, our technique allows that more than one window in the pupil song will meet the similarity threshold. The probability of finding one or more pairs of sounds that meet this threshold increases with the number of comparisons made and so, in some species at least, the duration of the pupil’s song will influence the outcome. When a window in a tutor’s song is similar to more than one window in the pupil’s song, the problem is how to retain only one pair of windows. Two types of observations helped us make this final selection: the first is the magnitude of similarity, the second is the length of the section that met the similarity criterion.

Windows with scores that meet the similarity threshold are often contiguous to each other and characterize discrete ‘sections’ of the song. In cases of good imitation, sections of similarity are interrupted only by silent intervals, where similarity is undefined. Depending on the species, a long section of sequentially similar windows (i.e. serial sounds similar in the two songs compared) is very unlikely to occur by chance, and thus the sequential similarity we observed in zebra finches was likely the result of imitation. Taken together, the longer the section of similarity and the higher the overall similarity score of its windows, the lower the likelihood of this having occurred by chance. Therefore, the overall similarity that a section captures has preeminence over the local similarity between time windows.


To calculate how much similarity each section captured, SAP2011 used the following procedure. Consider, for example, a tutor’s song of 1000 ms of sound (i.e. excluding silent intervals) that has a similarity section of 100 ms with the song of its pupil, and the average similarity score between windows of that section is 80%. The section covers 100/1000 = 10% of the tutor’s sounds, so the overall similarity that this section captures is 10% × 80% = 8%.
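The arithmetic of this example can be written out as a short Python sketch. The function name and its parameters are hypothetical and are not part of SAP2011; the sketch only reproduces the calculation described above.

```python
def section_similarity(section_ms, tutor_sound_ms, mean_window_similarity):
    """Overall similarity captured by one section, in percent of the tutor song.

    section_ms             -- duration of the similarity section (ms)
    tutor_sound_ms         -- duration of tutor sound, excluding silences (ms)
    mean_window_similarity -- average similarity of the section's windows (0-100)
    """
    if tutor_sound_ms <= 0:
        raise ValueError("tutor_sound_ms must be positive")
    # Fraction of the tutor song covered by the section, times its mean similarity.
    return section_ms * mean_window_similarity / tutor_sound_ms


# The example from the text: a 100 ms section, 1000 ms of tutor sound, 80% similarity.
print(section_similarity(100, 1000, 80))  # 8.0
```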

This procedure is repeated for all sections of similarity. Then, we discarded parts of sections that showed overlapping projections, either on the tutor or on the pupil’s song. Starting from the section that received the highest overall similarity score (the product of similarity × duration), we accepted its similarity score as final and removed overlapping parts in other sections. We based the latter decision on the overall similarity of each section and not on the relative similarity of their overlapping parts. We repeated this process down the scoring hierarchy until all redundancy was removed. The remainder was retained for our final score of %similarity.
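The redundancy-removal pass described above can be sketched in Python as a greedy selection. This is a minimal illustration under stated assumptions, not the SAP2011 implementation: the `Section` class and `remove_redundancy` function are hypothetical names, each section is represented as a list of (tutor window, pupil window, similarity) triples, and sections are ranked once by their original overall score without re-ranking after trimming.

```python
from dataclasses import dataclass


@dataclass
class Section:
    # Each pair is (tutor_window, pupil_window, similarity_percent).
    pairs: list

    def overall_score(self):
        # Sum of window similarities = mean similarity x duration (in windows).
        return sum(sim for _, _, sim in self.pairs)


def remove_redundancy(sections):
    """Greedy pass: accept the best section first, then trim overlaps in the rest."""
    used_tutor, used_pupil = set(), set()
    final = []
    # Assumption: rank by the original overall score of each whole section.
    for section in sorted(sections, key=Section.overall_score, reverse=True):
        # Drop window pairs whose tutor or pupil window is already claimed.
        kept = [
            (t, p, sim)
            for t, p, sim in section.pairs
            if t not in used_tutor and p not in used_pupil
        ]
        if kept:
            final.append(Section(kept))
            used_tutor.update(t for t, _, _ in kept)
            used_pupil.update(p for _, p, _ in kept)
    return final


# Two sections that overlap on tutor windows 5..9.
a = Section([(t, t, 90.0) for t in range(0, 10)])       # tutor 0..9
b = Section([(t, t + 20, 80.0) for t in range(5, 12)])  # tutor 5..11
final = remove_redundancy([b, a])
print([t for t, _, _ in final[1].pairs])  # [10, 11]
```

Section `a` has the higher overall score, so it is accepted whole, and section `b` keeps only the tutor windows that `a` did not claim. A trimmed section may end up with a gap in the middle; this sketch does not split such a section into two.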



% Similarity: is the percentage of tutor's sounds included in final sections. Note that the p-value used to detect sections is computed across intervals. This similarity estimate is asymmetric and it bears no relation to the local similarity score we discussed above.
Accuracy: is the average local similarity (frame by frame) across final sections.
Sequential match: is calculated by sorting the final sections according to their temporal order in reference to sound 1, and then examining their corresponding order in sound 2. We say that a section S and the section that follows it on sound 1 are sequential if, in sound 2, the following section begins between 0 and 80 ms after the end of S. This tolerance level accounts for the duration of stops and also for possible filter effect of very short sections that are not sequential. This procedure is repeated for all the consecutive pairs of sections on sound 1 and the overall sequential match is estimated as:
[equation not preserved in this text]
Note that multiplying by 2 offsets the effect of adding only one (the smaller) of the two sections in the numerator. This definition is used for asymmetric scoring, whereas for symmetric scoring the sequential match is simply the ratio between the two outlined intervals on sound 1 and sound 2.

Weighting the sequential match into the overall score: In the case of symmetric scoring, only the sequentially matching parts of the two sounds can be considered, so it makes sense to multiply the sequential match by the combined score. In the case of time-series comparison, it does not make sense to multiply the numbers, because this would give 100% weight to sections that are sequential and 0% weight to those that are not. Therefore, you have to decide what weight should be given to non-sequential sections. The problem is that sequential mismatches might have different meanings. For example, in extreme cases of 'low contrast' similarity matrices (with lots of gray areas), the sequence might be the only similarity measure that captures meaningful differences, but when several similar sounds are present in the compared intervals, it might be sheer luck whether SAP2011 sorts them out sequentially or not. In short, we cannot advise you what to do about it, and the default setting of 50% weight is arbitrary.
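The % similarity and accuracy definitions above, together with the pairwise 0-80 ms test used for the sequential match, can be illustrated with a short Python sketch. All names here are hypothetical and the window-level data layout is an assumption made for illustration; the overall sequential match and its weighting into the final score are not reproduced.

```python
def score_summary(final_pairs, tutor_sound_windows):
    """Return (% similarity, accuracy) from the windows of all final sections.

    final_pairs         -- (tutor_window, pupil_window, local_similarity_percent)
    tutor_sound_windows -- number of non-silent windows in the tutor song
    """
    if tutor_sound_windows <= 0:
        raise ValueError("tutor_sound_windows must be positive")
    if not final_pairs:
        return 0.0, 0.0
    # % similarity: share of tutor sound windows included in final sections.
    covered = {t for t, _, _ in final_pairs}
    percent_similarity = 100.0 * len(covered) / tutor_sound_windows
    # Accuracy: mean local (frame-by-frame) similarity across final sections.
    accuracy = sum(sim for _, _, sim in final_pairs) / len(final_pairs)
    return percent_similarity, accuracy


def is_sequential(end_of_s_ms, start_of_next_ms, tolerance_ms=80):
    """Pairwise test on sound 2: the next section starts 0-80 ms after S ends."""
    gap = start_of_next_ms - end_of_s_ms
    return 0 <= gap <= tolerance_ms


pairs = [(0, 5, 90.0), (1, 6, 80.0), (2, 7, 70.0)]
print(score_summary(pairs, tutor_sound_windows=10))  # (30.0, 80.0)
print(is_sequential(120, 150))  # True
print(is_sequential(120, 110))  # False
```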