Learning to Search a Melodic Metric Space

Michael Skalak, Jinyu Han, and Anda Bereczky
Northwestern University
633 Clark Street, Evanston, Illinois 60208

Abstract

Music is a popular category of multimedia content, and is increasingly moving online. New melodic search engines let people find music in online databases by singing, whistling, or humming the desired melody. However, comparison of the query to every song in a database with millions of songs is prohibitively slow. Recently, several authors have proposed speeding up melodic database search by placing the melodies in a metric space and then implementing one of several known algorithms so the query can be compared to a limited number of targets. We build on their methods by employing a very general metric search technique using a vantage point tree and then applying a genetic algorithm to concurrently learn values for the parameters describing the metric, the tree, and the note segmentation preprocessor, while considering both speed and accuracy. We show on a standard melodic database that the search which uses the optimal parameters computed by our learner is comparably accurate to, but significantly faster than, previous systems.

Introduction and Related Work

Music is a popular category of multimedia content, and is increasingly moving online. Examples include the online repositories of Amazon.com and Apple's iTunes, each containing millions of songs. These collections are currently indexed with text-based metadata tags that describe identifying features of the music, such as title, composer, and performer. Finding the desired music through this indexing scheme is a problem for users who do not already know the metadata for the desired piece. New melodic search engines let people find music by singing, whistling, or humming the desired tune. Currently used searching methods compare the sample provided by the user (the query) to every melody in the database (Dannenberg et al. 2007; Pardo, Shifrin, and Birmingham 2004). This approach becomes prohibitively slow for a collection of millions of songs. Recently, several authors (Parker, Fern, and Tadepalli 2007; Typke, Veltkamp, and Wiering 2004; Vleugels and Veltkamp 1999) have proposed speeding up melodic

database search by placing the melodies in a metric space and comparing the query to a limited number of vantage points rather than to every item in the database. (See Section "The Search Algorithm" for a detailed description of vantage point trees.) Typke et al. (Typke, Veltkamp, and Wiering 2004) encode melodies as piecewise constant pitch functions of duration and then apply a variant of Earth Mover Distance (EMD) (Giannopoulous and Veltkamp 2002), which forms a pseudo-metric. In the 2005 MIREX melodic similarity evaluation, EMD performed roughly as well as string-matching techniques in retrieval quality but was substantially slower (Downie et al. 2005), motivating the application of a metric to organize melodies and speed search. We are interested in searching for exact matches of melodies. The approach taken in (Typke, Veltkamp, and Wiering 2004) needs O(n^(1−1/v)) operations to perform an exact match, where n is the size of the database and v is the number of vantage points used in the algorithm. This method results in very little speedup over linear search, because a typical number of vantage points is over 10. Parker et al. (Parker, Fern, and Tadepalli 2007) developed a method to approximate a metric using a string alignment approach that allows variable edit costs for string elements. This let them apply a simple vantage point tree (Yianilos 1993) to reduce the size of the search space in melodic search applications. Their approach adjusts the values of a measure, at the expense of search speed, to make it close to a metric. It does not, however, create an actual metric, potentially compromising the effectiveness of the system. It also does not allow the vantage point tree to vary. In this work, we introduce a string-edit based melodic distance measure that allows real-valued string elements, removing the quantization error. We show this measure to be an actual metric on the space of melodies. This metric uses a substitution-cost function that can be automatically learned from a small number of sung queries (on the order of 10). Since we have a guaranteed metric, we can apply a more sophisticated vantage point tree approach than was used by Parker et al. We use a genetic algorithm to learn values for the parameters describing the metric, the vantage point tree, and the note segmentation preprocessor, resulting in significantly improved database search times while sacrificing very little in terms of precision and recall.

The Metric

Encoding the Query

In a typical Query by Humming system, a query is first transcribed into a time-frequency representation in which the fundamental frequency and amplitude of the audio are estimated at very short fixed intervals (on the order of 10 milliseconds). We use the note segmenter described in (Little, Raffensperger, and Pardo 2007) to divide the fixed-frame queries into notes. Once a query is segmented into notes, we encode the query and all database melodies as sequences of note intervals. Each note interval is represented by a pair of values: the pitch interval between two adjacent note segments (encoded in units of musical semitones, or half-steps, the smallest musical intervals) and the log of the ratio between the length of a note segment and the length of the following segment (Pardo and Birmingham 2002). We use note intervals rather than notes because they are transposition invariant (identical melodies sung in different keys appear the same) and tempo invariant (identical melodies sung at different speeds appear the same). After we encode the melodies, we can view every query and target as a string whose letters are drawn from our alphabet of note intervals. We first define a metric on the individual elements of these strings. Because the translation of a metric on the letters to a metric on the strings is very flexible, we could choose any metric on the letters, but for concreteness we describe the particular measure we implemented.
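As a concrete illustration, the following minimal Python sketch performs this encoding. The input format, a list of (pitch in semitones, duration in seconds) pairs, is assumed for the example only and is not the segmenter's actual output format.

import math

def encode_note_intervals(notes):
    """Encode a segmented melody as a string of note intervals.

    `notes` is a list of (pitch, duration) pairs, with pitch in
    semitones and duration in seconds (an assumed input format).  Each
    note interval pairs the pitch difference between adjacent notes
    with the log of the ratio of their durations, which makes the
    encoding transposition and tempo invariant.
    """
    intervals = []
    for (p1, d1), (p2, d2) in zip(notes, notes[1:]):
        pitch_interval = p2 - p1               # semitones
        log_duration_ratio = math.log(d1 / d2)
        intervals.append((pitch_interval, log_duration_ratio))
    return intervals

# The same contour sung a fifth higher and twice as fast yields the
# same interval string.
melody = [(60, 0.5), (62, 0.5), (64, 1.0)]
shifted = [(67, 0.25), (69, 0.25), (71, 0.5)]
assert encode_note_intervals(melody) == encode_note_intervals(shifted)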

The Metric For Individual Note Intervals

We choose our distance measure to reflect the relative likelihood of substituting one note interval for another. For two note intervals x and y, with pitch differences x_P and y_P and duration ratios x_L and y_L, we let:

d(x, y) = min(α · min(|x_L − y_L|, L) + β · min(|x_P − y_P|, P), D)    (1)

where α and β are non-negative weights for the duration-ratio and pitch-interval differences, respectively, and L, P, and D are non-negative caps imposed on the duration-ratio differences, the pitch-interval differences, and the overall distance. By capping the duration-ratio difference at L, we limit the penalty that term can contribute to the note-interval distance. P plays the same role for the pitch difference. D acts as a "worst case" cap: it limits how large the distance between two note intervals can be, and thus how much a single element can dominate the overall string distance. A similar use of capped terms was shown to be effective in (Wu et al. 2006). A distance measure d(x, y) is a metric if it satisfies identity of indiscernibles, symmetry, and the triangle inequality. Our measure is a metric because metrics are preserved under non-negative scaling, summation, and composition with a concave, non-decreasing function that is zero at zero (here, capping with min). In our actual implementation, α, β, L, P, and D are chosen to optimize performance on a set of example queries for a given database. Note that this metric is real-valued, unlike

the metric used by (Parker, Fern, and Tadepalli 2007), and hence does not force quantization of the note values. Quantization to an alphabet of fixed note intervals has been shown to introduce errors in melodic search (Little, Raffensperger, and Pardo 2007).
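For concreteness, a minimal Python sketch of Equation 1 follows. The parameter values shown are placeholders; in our system, α, β, L, P, and D are learned as described later.

def note_interval_distance(x, y, alpha=1.0, beta=1.0, L=2.0, P=4.0, D=5.0):
    """Distance between two note intervals (Equation 1).

    x and y are (pitch_interval, log_duration_ratio) pairs.  alpha and
    beta weight the duration and pitch terms; L, P, and D cap the
    duration term, the pitch term, and the overall distance.  The
    default values are placeholders for illustration only.
    """
    x_pitch, x_dur = x
    y_pitch, y_dur = y
    duration_term = alpha * min(abs(x_dur - y_dur), L)
    pitch_term = beta * min(abs(x_pitch - y_pitch), P)
    return min(duration_term + pitch_term, D)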

The String Metric

To compare note interval strings, we use edit distance. The edit distance between two strings, also known as Generalized Levenshtein Distance, is intuitively the cost of the least expensive way of transforming one string into the other, where the cost of the transformation depends on a comparison function for the string elements. For two note interval strings X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_m}, we let:

D(X, Y) = ed(n, m)    (2)

ed(i, j) = min{ ed(i − 1, j) + 1,  ed(i, j − 1) + 1,  ed(i − 1, j − 1) + C(x_{i−1}, y_{j−1}) }    (3)

ed(0, j) = 0,  ed(i, 0) = 0

where D(X, Y) is the distance measure between strings X and Y, and ed(i, j) is the edit distance between the first i elements of X and the first j elements of Y. We use the distance measure described by Equation 1 as the substitution-cost function C in the string edit distance. For edit distance to be a metric, as required by the search algorithm, the underlying comparison function for the string elements must also be a metric. We outlined in Section "The Metric For Individual Note Intervals" why our comparison function is a metric; hence the string distance is also a metric. For a more in-depth description and proof that edit distance creates a metric, see (Levenshtein 1966; Wagner and Fischer 1974).
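A minimal Python sketch of the dynamic program in Equations 2 and 3 follows; the substitution-cost function is passed in as an argument (for example, the note-interval distance above), and the base cases match those given in the text.

def string_distance(X, Y, sub_cost):
    """Edit distance D(X, Y) between note-interval strings (Equations 2-3).

    sub_cost(x, y) is the substitution cost C for a pair of note
    intervals.  Insertions and deletions cost 1, and the base cases are
    ed(0, j) = ed(i, 0) = 0 as in the text.
    """
    n, m = len(X), len(Y)
    ed = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ed[i][j] = min(
                ed[i - 1][j] + 1,        # deletion
                ed[i][j - 1] + 1,        # insertion
                ed[i - 1][j - 1] + sub_cost(X[i - 1], Y[j - 1]),  # substitution
            )
    return ed[n][m]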

The Search Algorithm

We now give an overview of an algorithm (Yianilos 1993) that can shorten search time by minimizing the number of database elements that must be directly compared to the query. Rather than comparing each element of the database to the query, we precompute the distance of each element in the database to a small set of vantage points drawn from the database. When a query is made to the database, its distance to a number of these vantage points is computed. We can then disregard regions of database elements that, based on the elements' distances to the vantage points, are clearly too far away from the query to warrant a direct comparison. This greatly lessens the number of elements that must be compared to the query, speeding search. This process can be repeated recursively on the remaining set of database elements, further speeding search. We now describe this approach in detail. Our database consists of n elements {X_1, ..., X_n}, where X_i is a melody encoded as a string as discussed in Section "Encoding the Query". The distance measure between two

database elements X_i and X_j is given by Equation 2. We organize our database by choosing k elements at random from the set of database elements. These k elements are the vantage points, V_1, ..., V_k, which we index with i. Each element in the database has its distance to each vantage point measured using the string metric discussed in Section "The String Metric". We select these vantage points in the way described in (Vleugels and Veltkamp 1999), which effectively tries to maximize the minimum distance from each vantage point to every other vantage point without additional distance calculations. We then partition the space into l rings, where each ring is the region between two concentric spheres with radii m_{i,j} and m_{i,j+1} centered at the vantage point V_i. (Rings are indexed by j, 1 ≤ j ≤ l, and we take l ≥ 2.) For each vantage point, we wish to find a set of radii such that each ring contains roughly as many elements as every other ring. In other words, we find distances m_{i,j} (the radii of the boundary spheres between successive rings), with 1 ≤ i ≤ k and 1 ≤ j ≤ l − 1, such that j/l of the database is closer to the i-th vantage point than the distance m_{i,j}. Note that these radii are unrelated to the error radius introduced later.
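As an illustration, a minimal Python sketch of this boundary computation is given below; it assumes that the distances from every database element to the given vantage point have already been computed with the string metric.

def ring_boundaries(distances_to_vantage, l):
    """Boundary radii m_1, ..., m_{l-1} for one vantage point.

    `distances_to_vantage` holds the distance of every database element
    to this vantage point.  The j-th boundary is chosen so that roughly
    j/l of the elements lie closer to the vantage point than it, giving
    l rings of approximately equal population.
    """
    ordered = sorted(distances_to_vantage)
    n = len(ordered)
    return [ordered[(j * n) // l] for j in range(1, l)]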

Figure 1: A vantage point surrounded by 5 rings

In Figure 1, we see a vantage point surrounded by 5 rings. The rings are numbered from 1 for the innermost ring to 5 for the outermost ring. Each ring contains roughly 1/5 of the database elements (not pictured). We can classify each database element X with a k-tuple, (b_1, b_2, ..., b_k), called a branch ID, where b_i is the ring X falls in for vantage point V_i. Each branch ID defines a region of intersection between rings from the k vantage points. We call such a region a branch, and each branch ID uniquely indexes one branch. Two database elements are in the same branch if and only if they are in the same ring with respect to each vantage point and thus share the same branch ID. Given k vantage points and l rings per vantage point, there are l^k branches. In Figure 2, we see one level of building a tree with two vantage points and two rings per vantage point; we assume, for simplicity, that the distance between any two points is the straight-line distance.

Figure 2: One level of a vantage point tree with two vantage points

V_1 and V_2 are the vantage points and X_1, X_2, and X_3 are a sample subset of database elements (targets). Each vantage point has only two rings: a close inner region and a far outer region. m_{1,1} and m_{2,1} are the radii separating the inner rings from the outer rings. These radii are chosen as the median distance of all the database elements to the corresponding vantage point, though not all the elements are shown in this diagram. X_1 is placed into branch (2, 1), X_2 into (1, 2), and X_3 into (2, 2). Note that each branch ID is a duple of ring indices, each either 1 or 2. To search this structure, we first choose a radius r, which is the maximal distance we anticipate a melody could be from the query and still be considered close enough to be a possible match. Note that this radius is different from the radii around the vantage points: this radius is a measure of error, while the radii m_{i,j} were chosen to divide the space into equal-sized shells. The value of r depends strongly on the database and the string metric, and must be chosen empirically. The better the metric resembles the perceptual similarity a human would perceive, the smaller we can make r, allowing for elimination of more database elements without direct comparison to the query. For a query Q, we first compute its distance to each vantage point, D(Q, V_1), ..., D(Q, V_k). We then find which branches intersect the spherical region within distance r of the query. This set of branches contains the only points we need to consider when searching the database. We illustrate the following examples as they might occur in 2-D Euclidean space. In Figure 2, the first query, Q_1, is in the ring far from V_1 and the ring close to V_2, i.e., branch (2, 1). Furthermore, the area within radius r of Q_1 (shown as a halo) is entirely enclosed in branch (2, 1). Hence all database elements that are within distance r of Q_1, such as X_1, must also be within the same area, so we only need to search the elements in branch (2, 1). The second query, Q_2, is in the outer ring of both V_1 and V_2, i.e., branch (2, 2). However, a database element like X_2 is closer than r to Q_2 and is in the inner ring of V_1. Hence we need to search

both branches (1, 2) and (2, 2).

Even a single-layer tree, as described in Figure 2, can eliminate a large number of database elements from active consideration. The true power of this approach, however, lies in creating levels. Within any branch b at the current level, one can create a tree structure at the next level: the set of elements within branch b at the current level is partitioned with its own vantage points, so that each database element in branch b is assigned a branch ID at the next level. Given a maximum number of levels, the branches at the final level are known as leaves. The full structure is called a vantage point tree. Thus, we recursively apply this algorithm at each level until the following termination condition occurs: every branch at the current level lies fully within radius r. Upon termination, the set of points in each remaining branch may be directly compared to the query using the distance measure to find the best match. Note that even a very shallow tree can have very small leaf sets that must be compared directly to the query. Note also that these branches may contain very different numbers of database elements; while each ring for a particular vantage point has the same number of elements, their intersections may not.

The method used by Parker et al. (Parker, Fern, and Tadepalli 2007) is a special case with k = 1 (one vantage point per level) and l = 2 rings per vantage point, with variable depth. Similarly, the structure described in (Vleugels and Veltkamp 1999) can be realized by setting k to the desired number of vantage points, l equal to the number of elements, and a fixed tree depth of 1. This structure can be efficiently searched if only approximate matching is desired. Since we are interested in exact matching, the number of operations required is O(n^(1−1/v)), where n is the size of the database and v is the number of vantage points. Given a typical number of vantage points (e.g., 50), this is essentially O(n) search. Within the general tree framework, the least complex tree is the one with k = 1, l = 1, and r empirically fixed. In this case, there is only one ring and all points fall within the set that must be directly compared to the query. This is linear search. In Section "Empirical Results", we experimentally determine values for k, l, and r to build the optimal vantage point tree structure that reduces search time by eliminating database elements without direct comparisons to the query Q, while maintaining search quality.

We note that not all metrics search equally well, even if they return the same ranking of songs. Take, for example, a hypothetical metric D1 which perfectly ranks songs (i.e., the correct song is always first) and ranges between 0 and 1. Now suppose we create a new metric D2 defined as follows:

D2(x, y) = 0 ⇔ D1(x, y) = 0    (4)

D2(x, y) = D1(x, y) + 100 ⇔ D1(x, y) ≠ 0    (5)

It can be shown that D2 is also a metric. Additionally, D2 ranks songs exactly as well as D1. However, under D2 the triangle inequality can no longer be used to infer useful bounds on unmeasured distances. For example, suppose we had a vantage point V, a target T, and a query Q. The triangle inequality would state:

D2(T, Q) ≤ D2(T, V) + D2(V, Q)    (6)

However, assuming none of these points are identical, we can substitute D1 to obtain the following:

D1(T, Q) ≤ D1(T, V) + D1(V, Q) + 100    (7)

Since D1(T, Q) < 1, we can see that we have gained no information about D2(T, Q), and thus we will never be able to eliminate a target. While this construction is a little artificial, subtler distortions exist which cause similar, though less drastic, effects. A system which learns a metric strictly based on ranking, however, cannot distinguish between D1 and D2. Thus learning the metric in light of searching is critical. We also note that a similar transformation (for example, adding a large constant to all nonzero values of a bounded measure) can create a D2 which is a metric from a D1 which is not. In fact, the system developed in (Parker, Fern, and Tadepalli 2007) learns a measure which is not a metric and then transforms it in a related way so that it satisfies the triangle inequality in most cases. However, this process results not only in a comparison function which is only an approximation of a metric, but also in a dulling of the searching capability.
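To make the query-time pruning concrete, here is a minimal Python sketch of one level of this structure. The function and variable names are our own for illustration; dist stands for the string metric D, and boundaries[i] holds the sorted ring radii m_{i,1}, ..., m_{i,l-1} for vantage point V_i.

import bisect

def ring_index(distance, ring_radii):
    """1-based ring number for a given distance to one vantage point."""
    return bisect.bisect_right(ring_radii, distance) + 1

def branch_id(element, vantage_points, boundaries, dist):
    """The k-tuple of ring indices (b_1, ..., b_k) for a database element."""
    return tuple(
        ring_index(dist(element, v), boundaries[i])
        for i, v in enumerate(vantage_points)
    )

def candidate_rings(query, vantage_points, boundaries, dist, r):
    """For each vantage point, the rings that can intersect the ball of
    radius r around the query.  By the triangle inequality, a target T
    with dist(Q, T) <= r satisfies |dist(Q, V_i) - dist(V_i, T)| <= r,
    so only rings overlapping [dist(Q, V_i) - r, dist(Q, V_i) + r]
    matter.  A branch (b_1, ..., b_k) needs to be searched only if every
    b_i is in the i-th returned set."""
    rings = []
    for i, v in enumerate(vantage_points):
        d = dist(query, v)
        lo = ring_index(max(d - r, 0.0), boundaries[i])
        hi = ring_index(d + r, boundaries[i])
        rings.append(set(range(lo, hi + 1)))
    return rings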

The Learner

Parameters

As described by Equation 1, our metric for individual note intervals has five parameters which need to be optimized: α, β, L, P, and D, all real valued. We also need to optimize three parameters of the vantage point tree:

• k: the number of vantage points per level
• l: the number of rings for each vantage point
• r: the maximal allowable error radius

where k and l are integers, and r is a positive real number. We also want to find the optimal parameters for note segmentation. Our note segmenter has four tunable parameters; these are described in (Little, Raffensperger, and Pardo 2007). Therefore, to tune a search system for our task, we trained the parameters of the note segmenting preprocessor, the note interval metric, and the structure of the vantage point tree together using a genetic algorithm. In total, we have twelve parameters to tune.

The Genetic Algorithm

To find the optimal values for the twelve parameters mentioned above, we use a genetic algorithm, because the task is an optimization problem and because encoding a solution as a string (genotype) is fairly straightforward. Each individual in the genetic algorithm population is one set of parameter values for our system. See Section "The Fitness Function" for information on the fitness measure we used.

To determine which individuals reproduce in each generation, we use randomized fitness-proportional selection, which means that individuals with higher fitness are very likely to reproduce but less fit individuals still have some chance of reproduction. We allow crossover to occur between parameters but not within parameters. We restrict each real-valued parameter to a range with a minimum step size and then represent it as an integer corresponding to its fraction of the total range. We use a mutation rate of 0.001, which means that every bit in these integers has a 0.1% chance of flipping before becoming a member of the next generation.
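The following minimal Python sketch illustrates this encoding and the genetic operators described above. The 10-bit resolution and the parameter ranges it implies are illustrative assumptions, not values from our system.

import random

BITS = 10            # integer resolution per parameter (an assumed value)
MUTATION_RATE = 0.001

def encode(value, low, high):
    """Map a range-limited real parameter to an integer fraction of its range."""
    step = (high - low) / (2 ** BITS - 1)
    return round((value - low) / step)

def decode(gene, low, high):
    """Recover the real parameter value from its integer encoding."""
    step = (high - low) / (2 ** BITS - 1)
    return low + gene * step

def mutate(genotype):
    """Flip each bit of each encoded parameter with probability MUTATION_RATE."""
    mutated = []
    for gene in genotype:
        for b in range(BITS):
            if random.random() < MUTATION_RATE:
                gene ^= 1 << b
        mutated.append(gene)
    return mutated

def crossover(parent_a, parent_b):
    """Single-point crossover between parameters, never within a parameter."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def select(population, fitnesses):
    """Fitness-proportional (roulette-wheel) selection of one parent."""
    return random.choices(population, weights=fitnesses, k=1)[0]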

The Fitness Function

The two objectives of our system are to search correctly as well as quickly, but these two objectives often conflict with each other. We measure retrieval correctness using the mean reciprocal rank (MRR) of a set of n queries:

MRR = (1/n) Σ_{i=1}^{n} 1 / (rank of correct target for song i)    (8)

MRR emphasizes the importance of placing correct target songs near the top of the list while still rewarding improved rankings lower down on the returned list of songs (Dannenberg et al. 2007). Values for MRR range from 0 to 1; higher numbers indicate better performance. Thus, MRR = 0.25 roughly corresponds to the correct answer being in the top four songs returned by the search engine, MRR = 0.05 indicates the right answer is in the top twenty, and so on. Chance performance on a database of the size used in this study would result in an MRR of about 0.001, given a uniform distribution. We note that decreasing the portion of the database compared to the query makes our system find the target using fewer comparisons, and thus increases the search speed. Therefore, we measure the speed of our search using the negative log of the portion of the database compared to the query: −log(portion compared). If we simply used the portion compared as a measure of speed, the learner would consider the difference from 0.91 to 0.90 and from 0.02 to 0.01 to show the same improvement. However, the former is only a (0.91 − 0.90)/0.91 ≈ 1.1% improvement, while the latter is a 50% improvement. The negative log of the portion compared therefore gives us a relatively accurate, intuitive measure of quickness. To determine the "fitness" of individuals in each generation, we convert the two objectives mentioned above into a single objective using the fitness function described below:

MRR² · (−log(portion compared))    (9)
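As a minimal Python sketch (with illustrative names), the fitness of one individual can be computed from the rank of the correct target for each training query and the portion of the database compared:

import math

def mean_reciprocal_rank(ranks):
    """MRR over a set of queries, given the rank of the correct target
    for each query (1 means the correct song was returned first)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def fitness(ranks, portion_compared):
    """Fitness of one individual: MRR^2 * (-log(portion compared)).
    The natural logarithm is used here; the base is not critical."""
    return mean_reciprocal_rank(ranks) ** 2 * -math.log(portion_compared)

# Example: correct targets ranked 1, 4, and 20, comparing 5% of the database.
print(fitness([1, 4, 20], 0.05))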

Note that we used MRR² in order to favor search accuracy; empirical results from experiments which used MRR instead of MRR² showed that search accuracy was neglected and only speed was being optimized.

Empirical Results

For our datasets, we selected the QBSH-corpus queries augmented by the Essen Folksong database, giving us a large number of melodies (1131) with which to test the efficiency of the search. We train our system on 13 singers. For each singer, we had 15 queries, on which we performed 3-fold cross-validation. Thus, for each experiment, we divided the queries into 10 training queries and 5 testing queries. Each individual has its fitness measured on the training set after its tree has been constructed, as judged by Equation 9. We considered the result of each experiment to be the best individual in the 40th generation, based on its performance on the training set. Figure 3 shows the performance of the 39 vantage point trees we obtained.

Figure 3: MRR vs. portion of the database compared to the query

As Figure 3 illustrates, the best vantage point tree has an MRR of 0.35 and a portion compared of 0.0047. This means that the correct melody would be ranked in the top three by direct comparison to only 0.47% of the database. The figure also shows that around 40% of the vantage point trees have an MRR above 0.067 and a portion compared below 0.05, which means the correct melodies would appear in the top 15 while 95% of the database does not need to be compared directly. After 40 generations of training, the system achieved a mean MRR of 0.065 and a mean portion compared of 3.38% on the testing set over 3-fold cross-validation for 13 singers. The best mean MRR is 0.149, with a portion compared of 0.457%. While we might expect an inverse relation between MRR and portion compared, we must note that the singers vary in quality and consistency. For skilled singers, we are able to construct a metric which places the correct targets "near" the query (relative to other targets). Since the correct song is a short distance away, the system can quickly eliminate many false targets while preserving the correct one. For a poor singer, however, the system could not find such a metric. Thus, to ensure the correct target is among those compared, the system must keep many targets to be compared linearly.

We also examined the direct correlation between MRR and the parameters of the system. The results do not indicate any strong relationship between any single parameter and MRR. While more intricate multi-dimensional connections may exist, we expect that no such relations are common across different individuals, or even among separate queries of a particular singer. One implication of the results is that we may need to restrict our hypothesis space. In our current formulation we learn 12 different parameters; this hypothesis space is very large and difficult to explore effectively. Another avenue of improvement may be to increase the sizes of the testing and training sets by combining individuals and trying to learn a general metric for all singers.

Conclusion

Building on past successes, we create a melodic search system in which the comparison function is a metric, which allows us to take advantage of previous work in metric space searching. In a substantial step forward, our system achieves a significant reduction in search time (on the order of 1000 times faster than linear search). We obtain this result by learning several (often separately treated) parts of the system together, and judging the quality of these parts by both their search quality and search speed at each step. By learning the note segmentation and comparison parameters together in this framework, we shape the metric space of melodies so that it both distinguishes correct songs and allows many targets to be rapidly eliminated. By also learning the parameters of the metric access method itself, we create a personalized, fitted end-to-end system that considers both goals at every point. Future work will examine which particular algorithm works best for each of these steps. Perhaps another metric access method would be more effective when learned in this context; the effectiveness of these methods is widely known to be heavily domain dependent. Also, our measure of success was chosen from a few pilot tests; perhaps others would better guide the difficult task of balancing conflicting goals. Further research is needed to address these possibilities.

Acknowledgments

We would like to thank David Little for his assistance in adapting his implementation, and Bryan Pardo for his many suggestions for improvement.

References

Dannenberg, R. B.; Birmingham, W. P.; Pardo, B.; Hu, N.; Meek, C.; and Tzanetakis, G. 2007. A Comparative Evaluation of Search Techniques for Query-by-Humming Using the MUSART Testbed. Journal of the American Society for Information Science and Technology 58(3): 687-701.

Downie, J. S.; West, K.; Ehmann, A. F.; and Vincent, E. 2005. The 2005 Music Information Retrieval Evaluation Exchange (MIREX 2005): Preliminary Overview. In 6th International Conference on Music Information Retrieval, 320-323. London, UK.

Giannopoulous, P., and Veltkamp, R. C. 2002. A Pseudo-Metric for Weighted Point Sets. In Proceedings of the 7th European Conference on Computer Vision - Part III, 715-730. London, UK: Springer-Verlag.

Levenshtein, V. I. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10(8): 707-710.

Little, D.; Raffensperger, D.; and Pardo, B. 2007. A Query by Humming System that Learns from Experience. In 8th International Conference on Music Information Retrieval, 335-338. Vienna, Austria.

Pardo, B., and Birmingham, W. P. 2002. Encoding Timing Information for Musical Query Matching. In 3rd International Conference on Music Information Retrieval, 267-268. Paris, France.

Pardo, B.; Shifrin, J.; and Birmingham, W. P. 2004. Name That Tune: A Pilot Study in Finding a Melody from a Sung Query. Journal of the American Society for Information Science and Technology 55(4): 283-300.

Parker, C.; Fern, A.; and Tadepalli, P. 2007. Learning for Efficient Retrieval of Structured Data with Noisy Queries. In Proceedings of the 24th International Conference on Machine Learning, 729-736. New York, NY: ACM.

Typke, R.; Veltkamp, R. C.; and Wiering, F. 2004. Searching Notated Polyphonic Music Using Transportation Distances. In Proceedings of the 12th Annual ACM International Conference on Multimedia, 128-135. New York, NY: ACM.

Vleugels, J., and Veltkamp, R. C. 1999. Efficient Image Retrieval through Vantage Objects. In Proceedings of the Third International Conference on Visual Information and Information Systems, 575-584. London, UK: Springer-Verlag.

Wagner, R. A., and Fischer, M. J. 1974. The String-to-String Correction Problem. Journal of the ACM (JACM) 21(1): 168-173.

Wu, X.; Li, M.; Liu, J.; Yang, J.; and Yan, Y. 2006. A Top-Down Approach to Melody Matching and Pitch Contour for Query by Humming. In 2006 International Symposium on Chinese Spoken Language Processing.

Yianilos, P. N. 1993. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, 311-321. Philadelphia, PA: Society for Industrial and Applied Mathematics.
