You have probably occasionally found yourself humming an excerpt of music that has been buzzing around in your head for a while, without being able to name that tune. In fact, this particular phenomenon has frequently been used as a starting point in an area of research that combines computer science and musicology called music information retrieval (MIR, for short).
MIR is an example of a family of closely related problems (other problems in the family include music comparison and music pattern induction), all of them dealing with the automated use of musical patterns. Although the idea of automating the identification of musically related melodies, or fragments of them, is not new – as early as 1949, Bertrand H. Bronson suggested the use of machine-readable encodings of “incipits” (short excerpts of initial phrases of music) to help the investigation of variations in music – the idea passed nearly unnoticed for over four decades before it captured the interest of researchers all over the world.
Recently, the situation has become somewhat better with several on-going academic MIR related projects. There are such projects at the University of Waikato, New Zealand; the University of Massachusetts, US; King’s College and City University in London, UK; Université Pierre et Marie Curie, France; Università degli Studi di Milano, Italy; and the University of Helsinki, Finland, to name but a few of them. Moreover, the first international MIR symposium is to take a place at Plymouth, US, in October, 2000. The area also has a promising future; the relevance of MIR to the forthcoming MPEG-7 standard (which will be a description interface for contents of multimedia documents) will increase activity in the field, both academic and commercial.
A rough strategy to perform MIR
In MIR the task is to find occurrences of a musical query pattern (or pattern, for short) in a music database. The occurrences can be exact or approximate. When approximate, a user-predefined threshold value is used to separate the approximate occurrences from all the possible instances. Since a natural way to give such a query pattern is by humming, the pattern is usually assumed to be monophonic, while it is feasible to assume that the music database may contain polyphony, as well.
MIR techniques are based on the assumption that music is symbolically representable, which is actually the case for most music (one such representation is the score notation of western music). However, in this case the useful forms of such symbolic representations are those available in computer-readable, digital form. One such representation is the well-known MIDI (Musical Instrument Digital Interface) which encodes music accurately enough for the problem in hand.
However, MIDI representation is not exactly the ideal representation because there is a lot of irrelevant information within a MIDI stream and it is time-consuming to locate and extract the relevant parts while performing a query (naturally this can be done beforehand, in a preprocessing phase). Instead, music is represented by a sequence of symbols (called a string), where the symbols are taken from an alphabet corresponding to some particular attribute of music, such as the duration or the pitch of a note.
Thus, a piece of monophonic music can be seen as a discrete linear string over a finite alphabet (there are only a rather small number of pitches and durations traditionally used in western music), and therefore, the well established discipline of computer science dealing with general (approximate) string matching can be applied. In this way, for instance by using a conventional technique called dynamic programming, a rough MIR routine can be achieved.
Extracting the prominent attribute of the music
To achieve good retrieval results with such a simple encoding of music, it is important that the chosen attribute corresponds to a feature of music, which, when represented as a sequence, is easy to remember (and thus, easy to reproduce so that it can be given as a pattern) and has good discrimination ability. Apparently, the two most prominent attributes are duration and pitch. Sequences of both of them are remembered easily enough, but a sequence of pitches is more discriminative than a rhythmic sequence (sequence of durations), at least in Western music. In other words, the range of possibilities for a pitch sequence of a fixed length is greater than that for a sequence of durations of the same length. This claim has actually been ratified indirectly by the US federal court; it has decided that in music copyright infringement suits the copyright of music relates to its pitch sequence.
One may argue that by using only the pitch (or the duration) information, the characteristics of the music get lost. This is obviously true in a general sense, but there are many reasons for taking only one attribute (at a time) into account here. The main one is to facilitate the modeling of music (in particular polyphonic music), and the modeling of distortions in the pattern. However, there are more practical reasons, as well. Consider, for instance, a case where a famous piece of music has been adapted to another western music genre. In such a case, it is often only the rhythmic pattern of the melody that has been changed. Therefore, to retrieve pieces of music from a database, without any a priori knowledge of the style they have been performed in, it would be advantageous to use only the pitch information. Moreover, if the pattern is given by playing an instrument, the pitches (or at least the intervals between the consecutive pitches) are probably more reliably accurate than the temporal information, because a musical instrument often has a fixed tuning.
Towards a more sophisticated MIR system
Let us now assume that the music is represented by a pitch sequence. The simplistic MIR strategy described above does not often achieve the desired results because it ignores many important properties of music. The three most relevant properties of music in this case are transposition invariance, polyphony, and the musical context. Transposition invariance, which allows the pattern and its occurrences to be in different musical keys, can be obtained straightforwardly by encoding the music by using the intervals (between consecutive pitches) instead of absolute pitch values, and subsequently using the pattern matching techniques on such interval sequences. However, such an encoding is somewhat risky because if, for some reason, an extra interval has been inserted (or an original interval has been deleted) the resulting sequence no longer represents the original piece of music (because such a distortion forces a modulation to take place). Therefore, a better idea is to encode the music by using absolute pitch values and calculate the intervals when accessed.
When it comes to musical context, the general string matching machinery needs to go through a slight refinement process. Moreover, analysis of the context requires a preprocessing phase executing some intelligent context-analyzing routine. Of the three properties mentioned, however, the most challenging problem is to be able to deal with polyphonic databases. This is the property most often ignored by the MIR researchers. To deal with it by using the general string matching algorithms requires often somewhat ingenious adaptations of those algorithms. Therefore, it has been suggested that the polyphonic sample should be reduced to a simple monophonic phrase by selecting the most representative musical line. Based on music psychology findings, the highest notes of chords have often been chosen in such reductions. Obviously, this method is unsatisfactory. In many cases, the highest notes do not contain the melody. Furthermore, the user should be permitted to search for such occurrences of the pattern that are “distributed” among the voices within a musical work in the database. Recently a few algorithms to deal with genuinely polyphonic samples have been introduced.
In designing and developing an ideal MIR system, it is not sufficient to consider only string matching techniques. An ideal system requires components that provide solutions for other apparent problems. For instance, (the characterization of) approximate pattern matching usually assumes that the target that is searched (that is in this case the music database) is error-free, while the query pattern may be somewhat distorted. In other words, the music in the database should be accurately represented; it should not require any normalization, such as quantization or time shifting (which is certainly not the case when dealing with real MIDI files). In effect, a direct, error-free conversion from notated music to the digital representation could solve this problem (this conversion problem, which is called Optical Music Recognition, or OMR, is currently the subject of active, global research). Furthermore, the most natural way to put a query to such a database is obviously by humming. To actualize this, a Pitch Tracking algorithm converting the analog waveform signal to a digital sequence containing only the base frequency of the input signal, would be required.
Proximity in matching and error tolerance
In order to be able to talk about similarity or dissimilarity between strings, a distance measure has to be defined. Once this has been done, there are three ways to allow approximation when dealing with general string matching. The first is to use exact strings with approximate matching, the second uses “generalized strings” with exact matching, and the third generalized strings with approximate matching techniques. The idea can be illustrated by imagining two strings PADDINGTON and FARRINGDON and the following alignment where the matching symbols have been aligned and an extra symbol “-” has been introduced (note that in the example the strings are aligned and matched in their entirety. The same method, however, applies to the case where a pattern is matched against any substring of the target):
This alignment suggests that either of the two strings can be obtained starting from the other by using 4 insertions and 4 deletions (consider the aligned pairs with the “-” symbols above). Thus, if the predefined error threshold value has been set to 4 and a distance measure calculating the number of insertions and deletions has been used, the two strings are not to be considered similar. However, if the conventional edit distance function, also allowing a replacement, had been used these strings would have been considered similar within the allowed range (by using the following 4 replacements: P F; D R; D R and T D).
In particular in MIR it is advantageous to generalize the strings. Reconsidering the strings given above, in some cases it could be advantageous for a vowel or a consonant to be permitted to match any other vowel or consonant, respectively. It is easy to see that in such a case the two strings above would be considered similar. Why is this relevant to MIR? It is because in an MIR query, where the pattern is given by humming, the pattern probably contains faults. Therefore, the user should be allowed to make typical errors without being penalized too heavily. One frequently used technique is to consider only the contour of the music, i.e., to observe only the direction (downward, upward, or a repeat) of the interval. Music psychology experiments have shown that, in many cases, the contour of a melody is correctly remembered even when the exact intervals cannot be recalled. Nevertheless, using only such a degenerate alphabet to represent music is not discriminative enough. For instance, in a reported experiment using 12-element contour patterns and a database of 8,000 MIDI files, it was found that, although 60,000 (of 530,000 possible) contours identified only two MIDI files, some of the patterns occurred in more than 1,000 files. There are, however, more sophisticated alphabet reduction schemes allowing natural errors and having better discriminative ability than that of the contour method. For instance, a method called quantized partially overlapping intervals classifies intervals into seven classes (up or downward intervals that can be small, medium, or large; and a repeat), where the intersections of the adjacent classes are not empty. In this way, the intervals near the border of a class can be matched with an interval approaching the border of the neighbouring class.
Two approaches to pattern matching and their capacities
The two alternatives that exist to organize pattern matching, and thus MIR queries as well, are (1) algorithms with indexing structures and (2) algorithms without indexing structures (the so-called online algorithms, such as the dynamic programming routine, mentioned above). Both approaches have their pros and cons, which are now considered with one example of each. A particularly efficient realization of indexing structures is the suffix tree. When using such a structure, exact pattern matching can be performed in a sublinear time (which depends on the length of the pattern, not at all on the length of the music database!). However, if some proximity is needed (which is obviously the case, even if a feasible alphabet reduction scheme had been used) it quite soon becomes impractical. The main drawback of the approach, however, is the large amount of space that such a structure needs. In practice, even the most space-efficient implementation will take up 10 times more memory than the length of the database. Online algorithms, on the other hand, can usually be executed without any, or with only a small amount of extra space, but they cannot be performed in a sublinear time; their running times are dependent on the length of the database. This is the case also when applying a sophisticated realization of dynamic programming, based on so-called bit parallelism.
If the main drawbacks of these two distinct approaches are assessed in the light of some real cases, although it is slower, the online approach using bit parallelism outshines the indexing approach using the suffix tree. Let us first consider a moderate sized database used with the MELDEX system (described below) which comprises nearly 10,000 folksongs from Britain, Ireland, North America, Germany, and China, having an average length of just under 60 notes. Assuming that one note fits one computer byte, concatenating the whole database in a single sequence would result in a sequence of around 600,000 notes needing a space of 0.6 MB. In this case, the indexing structure would take at least 6 MB of the main memory, while the efficient online algorithm scans through such a database within 0.05 seconds on modern computers.
The superiority of the bit parallel implementation becomes obvious when the database contains almost all the MIDI files available on the Internet (the exact duplicates have been removed). Such a database, containing 99,000 files having a total of 528 million notes, would require a suffix tree taking around 5 GB of the main memory! The bit parallel implementation, however, would scan through the database in approximately 50 seconds. Although the online method is not particularly fast anymore, the approach could still be used. This would not be the case with the excessive amount of space needed by the indexing approach being considered here.
MIR systems – three implementations
There are a few demos of MIR systems that can be tried on the Internet. Probably the best known, and actually the first, is the New Zealand Digital Library’s Web-based melody retrieval system called MELDEX (http://www.nzdl.org/musiclib). Their demo is run on the monophonic database described above. It uses the dynamic programming algorithm, and allows query-by-humming in the form of a URL given by the user. The retrieval is based on pitch and/or duration. Pitch is matched either by chromatic intervals or by contour. Recently, they have added an optical music recognition facility and text-based queries to the system.
Themefinder (http://www.themefinder.org), another Web-based melodic search tool, is a close successor to the thematic catalogues: its database consists of handpicked monophonic incipits from classical and folksong repertoires. Queries are based on pitch information and can be given in many forms, e.g., as intervals, scale degrees, pitch classes or contour patterns. In addition, the search may be limited by entering the name of the composer and a time signature. Only exact matching is available.
SEMEX is an MIR prototype developed at the University of Helsinki, Department of Computer Science. It allows, e.g., retrieving based on approximate matching, query-by-humming (the pitch tracking module it uses was developed at the Helsinki University of Technology, in the Laboratory of Acoustics and Audio Signal Processing), different reductions of the alphabet to enable feasible error tolerance, and queries targeted at polyphonic databases. It does not currently support the consideration of context in music, but the concept has been modeled. The queries are implemented by using the fast bit-parallel implementation of the dynamic programming approach. The estimates of the performance of an online method given in the previous section are actually those of the SEMEX prototype.
It is not easy to foresee the variety of MIR related applications that will be available once MPEG-7 has been released. Assuming that it would be possible to store the actual music data within a CD (compact disc) together with some “metadata” (comprising some symbolic representation of the music data, and links from such strings to the music data), a possible application is CD players. The system would allow a CD player with a built-in microphone, holding tens or hundreds of discs, to play a particular piece of music on request by humming a short excerpt of its melody. Similarly, whilst sitting in a pub, a request to a future jukebox might in future be submitted by using a personal mobile phone, or MP3 files may be downloaded from the Internet.
Although such applications would certainly attract wide interest, from a musician’s or musicologist’s point of view they are less interesting. Unfortunately, there will always be a tradeoff between the efficiency of the applied pattern matching techniques and musical feasibility, at least to some extent. Although it is possible to embed tailored distance measures in the system, all the possible tailoring needs could never be known beforehand. Therefore, there is no question of these kinds of techniques ever replacing the work done by musicologists in analyzing, e.g., some lengthy opera. However, as the techniques become more sophisticated by incorporating more musicological know-how, they will provide a great help by efficiently finding those sections that require a closer look by the human expert.
Kjell Lemström is a researcher at the University of Helsinki, in the Department of Computer Science (currently on leave to the City University of London). In November, he defended his PhD thesis on “String Matching Techniques for Music Retrieval”.
From Finnish Music Quarterly magazine 3-4/2000
Please note that the texts are protected by the copyright laws. They are free for background use, but when referring to these texts or articles, please mention the author and FMQ magazine.