Interpretation of scores



Scoring the similarity between two sounds as described above works well in some cases and less well in others, so it should be used with care. The outcome of similarity scoring depends heavily on appropriate scaling of the features to units of median absolute deviation from the average in the ‘population’. The next chapter explains how to scale features and when new feature scaling should be considered. A related factor is feature weight: by default, the five features are assumed to be equally important. This assumption has not been tested, but empirically, giving equal weight to the five features works well for scoring song similarity across adult zebra finches. Each feature has its own strengths and weaknesses, and together they complement each other. The feature most likely to cause trouble is pitch: it is sometimes difficult to calculate, and an unstable pitch estimate might bias the scoring procedure.
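The scaling and equal-weight ideas can be illustrated with a minimal Python sketch. It is not the program's actual scoring code. It assumes that "average" means the population mean, that the spread is the median absolute deviation around that mean, and that scaled features are combined by a weighted Euclidean distance. The function names and the feature name "placeholder" are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def mad_scale(value, population):
    """Express a feature value in units of the population's median
    absolute deviation (MAD) from the population mean (assumed center)."""
    center = mean(population)
    mad = median(abs(x - center) for x in population)
    if mad == 0:
        raise ValueError("population has zero spread; cannot scale")
    return (value - center) / mad


def scaled_distance(sound_a, sound_b, populations, weights=None):
    """Weighted Euclidean distance between two sounds after MAD scaling.

    sound_a, sound_b: dicts mapping feature name -> feature value.
    populations: dict mapping feature name -> list of reference values.
    weights: dict of feature weights; equal weights if omitted.
    """
    features = list(populations)
    if weights is None:
        weights = {name: 1.0 for name in features}  # equal-weight default
    total = sum(weights[name] for name in features)
    squared = 0.0
    for name in features:
        a = mad_scale(sound_a[name], populations[name])
        b = mad_scale(sound_b[name], populations[name])
        squared += (weights[name] / total) * (a - b) ** 2
    return sqrt(squared)


# Hypothetical reference populations for two features.
populations = {"pitch": [1, 2, 3, 4, 5], "placeholder": [8, 9, 10, 11, 12]}
sound_a = {"pitch": 5, "placeholder": 10}
sound_b = {"pitch": 3, "placeholder": 10}
print(scaled_distance(sound_a, sound_b, populations))
```

Because each feature is divided by the spread of its own reference population, a population taken from the wrong species changes every scaled difference, which is why the choice of feature scale matters so much. Passing a `weights` dict that lowers the weight of pitch is one way to explore how an unstable pitch estimate affects the result.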

Reading about the complexities involved in calculating similarity scores, you might wonder about the consequences of improper use of these methods. Compared to the human-observer scoring method, the automated approach has pros and cons. No doubt, the judgment of any human observer is preferred over automated methods. The main difference is that automated methods provide well-defined metrics for distances between sounds and can quantify subtle differences. Statistically, however, you should handle automated similarity scores just as you would handle human scores, except that you might consider using parametric methods if the distribution of scores appears to be normal; this is not a big issue. If at the end of the day all you care about is whether two groups of animals differ in their sounds, it does not matter how the scores were calculated, under what assumptions, and so on. For example, if you use the feature scale of zebra finches on monkey sounds and find strong differences in similarity scores across two groups of animals using some non-parametric estimate of the scores, the difference is real regardless of the strong biases injected by using the wrong normalization. However, you do not want to use the wrong normalization, since this might reduce the sensitivity and reliability of the scoring method, making it less likely that significant differences will be found. Overall, in most cases, you will want to use the scoring method that maximizes the difference between your groups. The actual p-value used as the threshold is just a yardstick, and it has nothing to do with statistical significance.
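As an illustration of a non-parametric comparison between two groups of similarity scores, here is a minimal Python sketch of a permutation test on the difference of group medians. The choice of test, the function name, and the example scores are assumptions made for illustration only. The text does not prescribe a particular test, and the numbers are not real data.

```python
import random
from statistics import median


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference of medians.

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if group labels were assigned at random.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(median(group_a) - median(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        diff = abs(median(pooled[:n_a]) - median(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up similarity scores for two groups of animals.
group_1 = [90, 92, 88, 95, 91, 93]
group_2 = [60, 65, 58, 62, 67, 61]
print(permutation_p_value(group_1, group_2))
```

The test makes no assumption about the distribution of the scores, so it remains valid even when the scores are biased by an unsuitable feature scale. A poor scale can still shrink the difference between groups and so reduce the power of the test.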