An FFT data window, or frame, is a short (e.g., 10 ms) interval of raw sound data (a waveform, i.e., a curve of time-varying amplitude) and is the unit of spectral analysis. The spectral structure of each frame is summarized by measurements of song features (e.g., pitch, FM, AM, Wiener entropy, and goodness of pitch). Each of these features has different units and a different statistical distribution in the population of songs studied. To arrive at an overall similarity score, we transform the units of each feature into units of statistical distance. One can transform the units of pitch, for example, from Hz to units of standard deviation. Instead of the SD we use a similar (and sometimes better, because it is less sensitive to outliers) measure of deviation called MAD (median absolute deviation from the median). We can then compute Euclidean distances across all features. A similar procedure can be used to compare larger units of time, which we shall call intervals. SAP2011 uses two methods to estimate Euclidean distances across intervals.
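The unit transformation described above can be sketched in a few lines of Python. This is only an illustration, not SAP2011's own code: the function name mad_scale, the choice to center on the population median, and the pitch values are all assumptions made for the example.

```python
import numpy as np

def mad_scale(values, population):
    """Express raw feature values in MAD units.

    Assumption: values are centered on the population median and divided
    by the population MAD (median absolute deviation from the median).
    """
    population = np.asarray(population, dtype=float)
    median = np.median(population)
    mad = np.median(np.abs(population - median))
    return (np.asarray(values, dtype=float) - median) / mad

# Hypothetical pitch measurements (Hz) standing in for a population of songs.
population_pitch_hz = [500, 600, 700, 800, 900]   # median 700, MAD 100
scaled = mad_scale([700, 900], population_pitch_hz)  # 0 and 2 MADs

# Once every feature is in MAD units, two frames can be compared with a
# Euclidean distance across features (two hypothetical features here).
frame_a = np.array([0.0, 3.0])
frame_b = np.array([4.0, 0.0])
frame_distance = float(np.linalg.norm(frame_a - frame_b))  # 5 MADs
```

Because every feature ends up in the same unit, no single feature dominates the distance merely because of its measurement scale.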
Euclidean distances across mean values: given two intervals, A and B, we first calculate the mean value of each feature, and then compute the Euclidean distance across the mean features, just as we would have done for a single frame. For example, consider two intervals of 3 frames each and, for simplicity, only a single feature: A = [10, 20, 30]; B = [30, 20, 10]. We first average across frames, which gives 20 for both A and B, so the Euclidean distance between the means is zero. That is, this approach looks at the interval as a whole, allowing local differences to cancel each other.
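A minimal sketch of the mean-value method, reproducing the example above. The function name distance_across_means and the array layout (rows are frames, columns are features, values already in MAD units) are assumptions for illustration.

```python
import numpy as np

def distance_across_means(a, b):
    """Euclidean distance between the per-feature means of two intervals.

    a, b: arrays of shape (n_frames, n_features), assumed to be in MAD units.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

# The example from the text: 3 frames, a single feature.
A = np.array([[10.0], [20.0], [30.0]])
B = np.array([[30.0], [20.0], [10.0]])
d_mean = distance_across_means(A, B)  # both means are 20, so the distance is 0
```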
Euclidean distances across time courses: given two intervals, A and B, we compute Euclidean distances across pairs of corresponding frames, A1 against B1, A2 against B2, and so forth. We then calculate the mean Euclidean distance across all pairs. Now consider the same example: A = [10, 20, 30]; B = [30, 20, 10]. The Euclidean distance is
SQRT[(10-30)^2 + (20-20)^2 + (30-10)^2] = 28.3 MADs.
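The worked example can be checked with a short sketch. Note the assumption: the code follows the formula exactly as written in the example (square root of the summed squared frame-by-frame differences); any further averaging across pairs is not modeled here. The function name and array layout (rows are frames, columns are features) are illustrative.

```python
import numpy as np

def distance_across_time_course(a, b):
    """Euclidean distance between two equal-length intervals, frame by frame.

    a, b: arrays of shape (n_frames, n_features), assumed to be in MAD units.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("intervals must have the same number of frames and features")
    return float(np.sqrt(np.sum((a - b) ** 2)))

A = np.array([[10.0], [20.0], [30.0]])
B = np.array([[30.0], [20.0], [10.0]])
d_time = distance_across_time_course(A, B)  # sqrt(800), about 28.3 MADs
```

The same two intervals that are identical under the mean-value method are far apart here, because the frame-by-frame differences no longer cancel.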
As shown, when we compare single values (single frames or interval means), short or even zero distances are not unlikely, but when we compare time series, a distance of zero requires that the distances for all pairs of frames are zero. Hence, when we examine the cumulative distribution of Euclidean distances in a large sample of sounds, the two methods give different results.
This difference has a very practical implication when comparing songs: the time-course approach is good for detecting similarity between two sequences whose feature values follow similar curves. Note that moving an interval by even a single frame changes the entire frame of comparison. By comparing all possible pairs of intervals between two sounds, we can detect the rare pairs of intervals where the sequential match between all (or most) frames is high. Euclidean distance across mean values achieves exactly the opposite: dependency between neighboring intervals is high, and we are looking for high similarity between distributions regardless of short-scale differences.
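Comparing all possible pairs of intervals between two sounds can be sketched as a brute-force search. This is an illustration under stated assumptions, not SAP2011's implementation: the function name, the fixed interval length n_frames, the single-feature toy data, and the use of the summed-square time-course distance are all choices made for the example.

```python
import numpy as np

def interval_distance_matrix(sound_a, sound_b, n_frames):
    """Time-course distance for every pair of intervals of length n_frames.

    sound_a, sound_b: arrays of shape (total_frames, n_features), in MAD units.
    Entry [i, j] compares the interval starting at frame i of sound_a
    with the interval starting at frame j of sound_b.
    """
    a = np.asarray(sound_a, dtype=float)
    b = np.asarray(sound_b, dtype=float)
    n_a = a.shape[0] - n_frames + 1
    n_b = b.shape[0] - n_frames + 1
    dist = np.empty((n_a, n_b))
    for i in range(n_a):
        for j in range(n_b):
            diff = a[i:i + n_frames] - b[j:j + n_frames]
            dist[i, j] = np.sqrt(np.sum(diff ** 2))
    return dist

# Hypothetical single-feature sounds that share the sequence 1, 2, 3.
sound_a = np.array([[0.0], [1.0], [2.0], [3.0], [0.0]])
sound_b = np.array([[5.0], [1.0], [2.0], [3.0]])
dist = interval_distance_matrix(sound_a, sound_b, n_frames=3)

# Start frames of the best-matching pair of intervals.
best = np.unravel_index(np.argmin(dist), dist.shape)
```

In this toy case the minimum falls on the pair of intervals that both start at frame index 1, and shifting either interval by one frame already raises the distance well above zero.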
Note: The difference between these approaches also applies to other SAP modules. For example, the syllable table is based on the mean and variance of feature values calculated for each syllable, and hence all the table-based methods (DVD maps, cluster analysis) are based on Euclidean distances across mean values. Therefore, when we identify a nice cluster of syllables, we should not assume that similarity measurements based on Euclidean distances across time series will show high similarity across members of the cluster. In fact, current findings suggest that birds stabilize the overall (mean) values of syllable features at a time when the frame-to-frame feature values are still dissimilar across syllables.